train package#

Submodules#

train.dataset module#

class train.dataset.Dataset(name: str, data_path: List[str], data: DataFrame | None = None, cast_dataset: Callable | None = None, max_rows: int = -1)[source]#

Bases: object

Single DNS dataset with loading and preprocessing capabilities.

Encapsulates dataset information including name, file paths, and data processing functions. Supports flexible data loading from various sources including CSV, Parquet, and text files with custom preprocessing functions.

__init__(name: str, data_path: List[str], data: DataFrame | None = None, cast_dataset: Callable | None = None, max_rows: int = -1) None[source]#

Loads the dataset either from the given file paths, optionally through a preprocessing function, or directly from a provided DataFrame. Supports various data formats and custom preprocessing callbacks for dataset-specific requirements.

Parameters:
  • name (str) – Unique identifier for the dataset.

  • data_path (List[str]) – File paths to dataset files.

  • data (pl.DataFrame) – Pre-loaded dataset (alternative to data_path).

  • cast_dataset (Callable) – Custom preprocessing function for data loading.

  • max_rows (int) – Maximum rows to load (default: -1 for unlimited).

Raises:

NotImplementedError – When neither data_path nor data is provided.

__len__() int[source]#

Returns number of rows in the dataset.

Returns:

int – Total number of records in the dataset.

class train.dataset.DatasetLoader(base_path: str = '', max_rows: int = -1)[source]#

Bases: object

Manages loading and access to multiple DNS datasets for training.

Provides convenient access to various DNS datasets including DGA detection benchmarks, legitimate traffic datasets, and combined multi-source datasets. Handles dataset-specific loading and preprocessing requirements.

__init__(base_path: str = '', max_rows: int = -1) None[source]#

Parameters:
  • base_path (str) – Base directory path containing all dataset folders.

  • max_rows (int) – Maximum rows to load per dataset (default: -1 for unlimited).

property bambenek_dataset: Dataset#

property cic_dataset: Dataset#

property dga_dataset: Dataset#

property dgarchive_dataset: list[Dataset]#

property dgta_dataset: Dataset#

property heicloud_dataset: Dataset#

train.dataset.cast_bambenek(data_path: str, max_rows: int) DataFrame[source]#

Loads and processes Bambenek DGA dataset from CSV file.

Reads Bambenek DGA domain dataset, renames columns to standard format, adds malicious class label, and applies preprocessing to structure domain components.

Parameters:
  • data_path (str) – Path to the Bambenek dataset CSV file.

  • max_rows (int) – Maximum number of rows to process.

Returns:

pl.DataFrame – Processed Bambenek dataset with structured domain information.

train.dataset.cast_cic(data_path: List[str], max_rows: int) DataFrame[source]#

Loads and processes CIC DNS dataset from multiple CSV files.

Reads CIC DNS datasets (benign, malware, phishing, spam), assigns appropriate class labels based on filename, and combines all datasets into a unified format.

Parameters:
  • data_path (List[str]) – List of paths to CIC dataset CSV files.

  • max_rows (int) – Maximum number of rows to process per file.

Returns:

pl.DataFrame – Combined CIC dataset with structured domain information.
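The filename-based labelling described for cast_cic can be illustrated with a small standalone sketch. It is not the project's implementation: the helper name `label_from_filename`, the `KNOWN_LABELS` tuple, and the example filenames are assumptions made for illustration, and the real function may map names to labels differently.

```python
from pathlib import Path

# Class names taken from the cast_cic description; the matching rule below
# (substring of the lower-cased file stem) is an assumption.
KNOWN_LABELS = ("benign", "malware", "phishing", "spam")


def label_from_filename(path: str) -> str:
    """Return the class label embedded in a dataset filename (hypothetical helper)."""
    stem = Path(path).stem.lower()
    for label in KNOWN_LABELS:
        if label in stem:
            return label
    raise ValueError(f"no known class label in filename: {path}")


# Hypothetical file names, used only to exercise the helper.
labels = [label_from_filename(p) for p in ("data/cic/CSV_benign.csv", "CSV_Malware.csv")]
```

Raising on an unknown filename, rather than silently assigning a default class, keeps mislabelled files from entering the combined dataset unnoticed.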

train.dataset.cast_dga(data_path: str, max_rows: int) DataFrame[source]#

Loads and processes DGA dataset from CSV file.

Reads DGA domain dataset, renames columns to standard format, adds malicious class label, and applies preprocessing to structure domain components.

Parameters:
  • data_path (str) – Path to the DGA dataset CSV file.

  • max_rows (int) – Maximum number of rows to process.

Returns:

pl.DataFrame – Processed DGA dataset with structured domain information.

train.dataset.cast_dgarchive(data_path: str, max_rows: int) DataFrame[source]#

Loads and processes DGArchive dataset from CSV file.

Reads DGArchive domain dataset, extracts class label from filename, renames columns to standard format, and applies preprocessing for domain analysis.

Parameters:
  • data_path (str) – Path to the DGArchive dataset CSV file.

  • max_rows (int) – Maximum number of rows to process.

Returns:

pl.DataFrame – Processed DGArchive dataset with structured domain information.

train.dataset.cast_dgta(data_path: str, max_rows: int) DataFrame[source]#

Loads and processes DGTA benchmark dataset from Parquet file.

Reads DGTA benchmark dataset, handles custom UTF-8 encoding, renames columns to standard format, and applies preprocessing for domain structure analysis.

Parameters:
  • data_path (str) – Path to the DGTA dataset Parquet file.

  • max_rows (int) – Maximum number of rows to process.

Returns:

pl.DataFrame – Processed DGTA dataset with structured domain information.

train.dataset.cast_heicloud(data_path: str, max_rows: int) DataFrame[source]#

Loads and processes heiCLOUD dataset from space-separated text file.

Reads heiCLOUD DNS log dataset, parses space-separated columns, extracts domain queries, and labels them as legitimate traffic for training.

Parameters:
  • data_path (str) – Path to the heiCLOUD dataset text file.

  • max_rows (int) – Maximum number of rows to process.

Returns:

pl.DataFrame – Processed heiCLOUD dataset with legitimate domain labels.

train.dataset.preprocess(x: DataFrame) DataFrame[source]#

Preprocesses DataFrame into structured dataset for feature extraction.

Filters out empty queries, removes duplicates, splits domain names into labels, and extracts top-level domain (TLD), second-level domain, and third-level domain components for further analysis.

Parameters:

x (pl.DataFrame) – Raw dataset containing DNS queries for preprocessing.

Returns:

pl.DataFrame – Preprocessed dataset with structured domain components.
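As a rough illustration of the steps listed for preprocess (filter empty queries, remove duplicates, split into labels, extract the top three levels), here is a pure-Python sketch. The project itself operates on a pl.DataFrame; the function names, the dictionary keys, and the naive dot-splitting (no public-suffix handling, so `co.uk` is treated as two levels) are assumptions of this sketch.

```python
def split_domain(query: str) -> dict:
    """Split a domain name on dots and pick out its top three levels."""
    labels = query.split(".")
    return {
        "query": query,
        "labels": labels,
        "tld": labels[-1],
        # Missing levels are represented by an empty string (assumption).
        "secondleveldomain": labels[-2] if len(labels) >= 2 else "",
        "thirdleveldomain": labels[-3] if len(labels) >= 3 else "",
    }


def preprocess_sketch(queries: list[str]) -> list[dict]:
    """Drop empty and duplicate queries (keeping the first occurrence), then split."""
    seen = set()
    rows = []
    for query in queries:
        if not query or not query.strip() or query in seen:
            continue
        seen.add(query)
        rows.append(split_domain(query))
    return rows


rows = preprocess_sketch(["www.example.com", "", "example.org", "www.example.com"])
```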

train.explainer module#

class train.explainer.Explainer(output_path: str = './results')[source]#

Bases: object

Interprets and explains trained machine learning models for DGA detection.

Provides model interpretation capabilities including rule extraction, feature importance analysis, and threshold rescaling for decision trees and ensemble models used in domain generation algorithm detection tasks.

__init__(output_path: str = './results') None[source]#

Parameters:

output_path (str) – Directory path to save interpretation results.

interpret_model(model, x_test: ndarray, y_test: ndarray, df_cols: list[str], scaler=None) list[str][source]#

Interpret a trained model by extracting decision rules and optionally rescaling them.

Parameters:
  • model (sklearn.ensemble.BaseEnsemble | XGBClassifier) – Trained ML model.

  • x_test (np.ndarray) – Test set features.

  • y_test (np.ndarray) – Test set labels.

  • df_cols (list[str]) – Column names of the features.

  • model_name (str) – Name used for saving output files. Listed in the docstring, but not present in the signature above.

  • scaler (optional) – Scaler used in preprocessing, e.g., StandardScaler. Defaults to None.

class train.explainer.Plotter(output_path: str = './results/data')[source]#

Bases: object

Creates visualizations and plots for dataset analysis and model interpretation.

Generates various plots including PCA visualizations, t-SNE projections, label distributions, and feature analysis plots to understand dataset characteristics and model behavior in DGA detection tasks.

__init__(output_path: str = './results/data') None[source]#

Parameters:

output_path (str) – Directory path to save generated visualization files.

create_plots_binary(ds_X: list[ndarray], ds_y: list[ndarray], data: list[Dataset]) None[source]#

Generates comprehensive visualization suite for binary classification datasets.

Creates PCA plots (2D/3D), t-SNE projections, principal component removal analysis, and label distribution charts for multiple datasets to understand data characteristics and class separability in binary DGA detection tasks.

Parameters:
  • ds_X (list[np.ndarray]) – List of feature matrices for each dataset.

  • ds_y (list[np.ndarray]) – List of label arrays for each dataset.

  • data (list[Dataset]) – List of dataset objects containing metadata.

create_plots_multiclass(ds_X: list[ndarray], ds_y: list[ndarray], data: list[Dataset]) None[source]#

Generates visualizations for multiclass DGA family classification datasets.

Creates specialized plots for datasets containing multiple DGA families, focusing on label distribution analysis to understand class imbalances and dataset composition for multiclass classification tasks.

Parameters:
  • ds_X (list[np.ndarray]) – List of feature matrices for each dataset.

  • ds_y (list[np.ndarray]) – List of label arrays for each dataset.

  • data (list[Dataset]) – List of dataset objects containing class information.

train.feature module#

class train.feature.Processor(features_to_drop: List)[source]#

Bases: object

Extracts statistical and linguistic features from domain name datasets.

Computes comprehensive feature sets including domain label statistics, character frequencies, entropy measures, and domain structure analysis for machine learning model training and DGA detection tasks.

__init__(features_to_drop: List) None[source]#

Parameters:

features_to_drop (List) – List of column names to exclude from final features.

transform(x: DataFrame) DataFrame[source]#

Extracts comprehensive feature set from domain name dataset.

Computes domain label statistics, character frequencies for all letters, character type ratios, and entropy measures for different domain levels. Handles missing values and removes specified columns from final output.

Parameters:

x (pl.DataFrame) – Input dataset with domain structure columns.

Returns:

pl.DataFrame – Feature-engineered dataset ready for ML model training.
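Two of the feature families mentioned for transform, entropy measures and character type ratios, can be sketched in plain Python. This is only an illustration of the underlying formulas, not the Processor code: the function names and the returned key names are hypothetical, and the real feature columns may be named and normalized differently.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Shannon entropy in bits of the character distribution of a label."""
    if not label:
        return 0.0
    total = len(label)
    # sum of p * log2(1/p) over the distinct characters
    return sum((c / total) * math.log2(total / c) for c in Counter(label).values())


def char_ratios(label: str) -> dict:
    """Share of alphabetic, numeric, and other characters in a label."""
    n = len(label) or 1  # avoid division by zero for empty labels
    return {
        "alpha_ratio": sum(ch.isalpha() for ch in label) / n,
        "numeric_ratio": sum(ch.isdigit() for ch in label) / n,
        "special_ratio": sum(not ch.isalnum() for ch in label) / n,
    }


entropy_uniform = shannon_entropy("abcd")  # four equally likely characters
ratios = char_ratios("abc123")
```

Algorithmically generated labels tend to have a flatter character distribution than dictionary words, which is why entropy is a common feature in DGA detection.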

train.model module#

class train.model.LightGBMModel[source]#

Bases: Model

objective(trial)[source]#

Optimizes the LightGBM model hyperparameters using cross-validation.

Parameters:

trial – A trial object from the optimization framework (e.g., Optuna).

Returns:

float – The best FDR value after cross-validation.

train(trial, X: ndarray, y: ndarray)[source]#

Trains the LightGBM model and saves the trained model to a file.

Parameters:
  • trial – A trial object from the optimization framework.

  • X (np.ndarray) – Training feature matrix.

  • y (np.ndarray) – Training labels.

class train.model.Model[source]#

Bases: object

fdr_metric(y_true: ndarray, y_pred: ndarray) float[source]#

Custom FDR metric to evaluate the performance of a model.

Parameters:
  • y_true (np.ndarray) – The true labels.

  • y_pred (np.ndarray) – The predicted labels.

Returns:

float – The False Discovery Rate (FDR).
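For reference, the False Discovery Rate is FP / (FP + TP): the share of positive predictions that are wrong. A minimal sketch follows, assuming that label 1 marks the positive (malicious) class and that the rate is reported as 0.0 when nothing is predicted positive. Both choices are assumptions; the project's method may handle them differently.

```python
def false_discovery_rate(y_true, y_pred, positive=1) -> float:
    """FDR = FP / (FP + TP), computed over paired true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    if tp + fp == 0:
        # No positive predictions at all; returning 0.0 is a convention chosen here.
        return 0.0
    return fp / (fp + tp)


# Three positive predictions, one of which is a false positive.
fdr = false_discovery_rate([1, 0, 1, 0], [1, 1, 1, 0])
```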

abstract objective(trial)[source]#

predict(X: ndarray)[source]#

Generates predictions for the input feature matrix X.

Parameters:

X (np.ndarray) – Feature matrix to predict on.

Returns:

np.ndarray – Model output.

abstract train(trial, X: ndarray, y: ndarray)[source]#

class train.model.Pipeline(model: str, datasets: list[Dataset], model_output_path: str, scaler=None)[source]#

Bases: object

Manages end-to-end machine learning pipeline for DGA detection model training.

Orchestrates data preprocessing, feature engineering, model training, and evaluation for domain generation algorithm detection. Supports multiple datasets, model types, and handles data scaling, splitting, and persistence operations.

__init__(model: str, datasets: list[Dataset], model_output_path: str, scaler=None) None[source]#

Initializes complete ML pipeline with datasets and model configuration.

Sets up feature processing, data loading, train/validation/test splitting, and model instantiation based on specified algorithm type. Handles data persistence and visualization setup.

Parameters:
  • model (str) – Model type identifier (‘rf’, ‘xg’, ‘gbm’).

  • datasets (list[Dataset]) – List of datasets for training and evaluation.

  • model_output_path (str) – Directory path for saving trained models.

  • scaler – Optional data scaler for feature normalization.

Raises:

NotImplementedError – If specified model type is not supported.

explain(x: ndarray, y: ndarray) list[str][source]#

Generates interpretable explanations for trained model decisions.

Creates human-readable decision rules and feature importance explanations for supported model types (XGBoost, Random Forest).

Parameters:
  • x (np.ndarray) – Feature matrix for explanation generation.

  • y (np.ndarray) – True labels for explanation context.

Returns:

list[str] – List of interpretable decision rules and explanations.

hyperparam_fit() None[source]#

Performs hyperparameter optimization and model training.

Uses Optuna to search optimal hyperparameters through Bayesian optimization, then trains the model with best parameters found during the search process.

predict(x: ndarray) ndarray[source]#

Generates predictions for input feature matrix.

Parameters:

x (np.ndarray) – Feature matrix for prediction.

Returns:

np.ndarray – Model predictions.

train_test_val_split(X: ndarray, Y: ndarray, train_frac: float = 0.8, random_state: int = 108) tuple[ndarray, ndarray, ndarray, ndarray, ndarray, ndarray][source]#

Splits dataset into training, validation, and test sets with stratification.

Creates stratified splits that maintain the class distribution across all subsets. The training set receives train_frac of the data; the remainder is split equally between the validation and test sets (10% each with the default train_frac of 0.8).

Parameters:
  • X (np.ndarray) – Feature matrix to split.

  • Y (np.ndarray) – Label array to split.

  • train_frac (float) – Proportion of data for the training set. Defaults to 0.8.

  • random_state (int) – Random seed for reproducible splits. Defaults to 108.

Returns:

tuple – X_train, X_val, X_test, Y_train, Y_val, Y_test arrays.
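The splitting scheme described above can be sketched with the standard library alone. This is a stand-in that returns index lists rather than arrays and is not the Pipeline implementation, which may rely on a library splitter; the function name and the rounding behaviour are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, random_state=108):
    """Return (train, val, test) index lists, preserving per-class proportions."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        # The remainder is halved; an odd leftover sample goes to the test set.
        n_val = (len(indices) - n_train) // 2
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


labels = [0] * 50 + [1] * 50
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting each class separately is what keeps the class ratio identical in all three subsets, which matters when malicious and benign samples are imbalanced.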

class train.model.RandomForestModel[source]#

Bases: Model

objective(trial)[source]#

Optimizes the Random Forest model hyperparameters using cross-validation.

Parameters:

trial – A trial object from the optimization framework (e.g., Optuna).

Returns:

float – The best FDR value after cross-validation.

train(trial, X: ndarray, y: ndarray)[source]#

Trains the Random Forest model and saves the trained model to a file.

Parameters:
  • trial – A trial object from the optimization framework.

  • X (np.ndarray) – Training feature matrix.

  • y (np.ndarray) – Training labels.

class train.model.XGBoostModel[source]#

Bases: Model

fdr_metric(preds: ndarray, dtrain: DMatrix) tuple[str, float][source]#

Custom FDR metric to evaluate model performance based on False Discovery Rate.

Parameters:
  • preds (np.ndarray) – The predicted values.

  • dtrain (xgb.DMatrix) – The training data matrix.

Returns:

tuple – A tuple containing the metric name (“fdr”) and its value.

objective(trial)[source]#

Optimizes the XGBoost model hyperparameters using cross-validation.

Parameters:

trial – A trial object from the optimization framework (e.g., Optuna).

Returns:

float – The best FDR value after cross-validation.

train(trial, X: ndarray, y: ndarray)[source]#

Trains the XGBoost model and saves the trained model to a file.

Parameters:
  • trial – A trial object from the optimization framework.

  • X (np.ndarray) – Training feature matrix.

  • y (np.ndarray) – Training labels.

train.train module#

class train.train.DatasetEnum(value)[source]#

Bases: str, Enum

Available dataset configurations for DGA detection model training.

CIC = 'cic'#
COMBINE = 'combine'#
DGARCHIVE = 'dgarchive'#
DGTA = 'dgta'#

__format__(format_spec)#

Returns format using actual value type unless __str__ has been overridden.

class train.train.DetectorTraining(model_name: ModelEnum, model_output_path: str = './results/model', dataset: DatasetEnum = DatasetEnum.COMBINE, data_base_path: str = './data', max_rows: int = -1)[source]#

Bases: object

Orchestrates end-to-end training of DGA detection models.

Manages dataset loading, model selection, training pipeline execution, and model persistence for domain generation algorithm detection. Supports multiple datasets, model types, and handles checksum-based model versioning.

__init__(model_name: ModelEnum, model_output_path: str = './results/model', dataset: DatasetEnum = DatasetEnum.COMBINE, data_base_path: str = './data', max_rows: int = -1) None[source]#

Initializes training configuration and dataset loading.

Sets up model training pipeline with specified algorithm, datasets, and output paths. Handles existing model detection and checksum validation for incremental training workflows.

Parameters:
  • model_name (ModelEnum) – ML algorithm type for training.

  • model_output_path (str) – Directory path for saving trained models.

  • dataset (DatasetEnum) – Dataset configuration for training.

  • data_base_path (str) – Base directory containing raw datasets.

  • max_rows (int) – Maximum rows per dataset (default: -1 for unlimited).

Raises:

NotImplementedError – If specified dataset configuration is not supported.

explain() None[source]#

Generates and saves interpretable explanations for the trained model.

Extracts decision rules and model interpretations from the trained classifier and saves them to text files for analysis and understanding of model behavior.

test() None[source]#

Evaluates trained model on all datasets and generates comprehensive reports.

Tests model performance across all loaded datasets, computes metrics including classification reports, FDR, and FTTAR. Saves detailed error analysis and misprediction information for model debugging and improvement.

train(seed: int = 108) None[source]#

Executes complete model training workflow with evaluation and persistence.

Performs hyperparameter optimization, model training, evaluation on test set, and generates comprehensive analysis including model interpretation and performance reports across all datasets.

Parameters:

seed (int) – Random seed for reproducible training results.

class train.train.ModelEnum(value)[source]#

Bases: str, Enum

Available machine learning algorithms for DGA detection.

GBM_CLASSIFIER = 'gbm'#
RANDOM_FOREST_CLASSIFIER = 'rf'#
XG_BOOST_CLASSIFIER = 'xg'#

__format__(format_spec)#

Returns format using actual value type unless __str__ has been overridden.
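Because ModelEnum and DatasetEnum both derive from str and Enum, their members can be built from plain strings (for example command-line arguments) and compare equal to those strings. The sketch below re-declares ModelEnum locally with the values documented above so that it runs without the train package installed; it illustrates standard Python enum behaviour, not code from the project.

```python
from enum import Enum


class ModelEnum(str, Enum):
    """Local copy of the documented members, for illustration only."""

    GBM_CLASSIFIER = "gbm"
    RANDOM_FOREST_CLASSIFIER = "rf"
    XG_BOOST_CLASSIFIER = "xg"


# Look a member up by its string value, as a CLI option parser might.
chosen = ModelEnum("rf")

# The str mixin makes members compare equal to their raw values.
is_random_forest = chosen == "rf"
```

Looking up a value that is not defined, such as `ModelEnum("svm")`, raises ValueError, which gives early validation of user input.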

train.train.add_options(options)[source]#