train package
Submodules
train.dataset module
- class train.dataset.Dataset(name: str, data_path: List[str], data: DataFrame | None = None, cast_dataset: Callable | None = None, max_rows: int = -1)
Bases: object
Single DNS dataset with loading and preprocessing capabilities.
Encapsulates dataset information including name, file paths, and data processing functions. Supports flexible data loading from various sources including CSV, Parquet, and text files with custom preprocessing functions.
- __init__(name: str, data_path: List[str], data: DataFrame | None = None, cast_dataset: Callable | None = None, max_rows: int = -1) → None
Loads the dataset either from the file paths, using an optional preprocessing function, or directly from a provided DataFrame. Supports various data formats and custom preprocessing callbacks for dataset-specific requirements.
- Parameters:
name (str) – Unique identifier for the dataset.
data_path (List[str]) – File paths to dataset files.
data (pl.DataFrame) – Pre-loaded dataset (alternative to data_path).
cast_dataset (Callable) – Custom preprocessing function for data loading.
max_rows (int) – Maximum rows to load (default: -1 for unlimited).
- Raises:
NotImplementedError – When neither data_path nor data is provided.
- class train.dataset.DatasetLoader(base_path: str = '', max_rows: int = -1)
Bases: object
Manages loading and access to multiple DNS datasets for training.
Provides convenient access to various DNS datasets including DGA detection benchmarks, legitimate traffic datasets, and combined multi-source datasets. Handles dataset-specific loading and preprocessing requirements.
- property bambenek_dataset: Dataset
- property cic_dataset: Dataset
- property dga_dataset: Dataset
- property dgta_dataset: Dataset
- property heicloud_dataset: Dataset
- train.dataset.cast_bambenek(data_path: str, max_rows: int) → DataFrame
Loads and processes Bambenek DGA dataset from CSV file.
Reads Bambenek DGA domain dataset, renames columns to standard format, adds malicious class label, and applies preprocessing to structure domain components.
- train.dataset.cast_cic(data_path: List[str], max_rows: int) → DataFrame
Loads and processes CIC DNS dataset from multiple CSV files.
Reads CIC DNS datasets (benign, malware, phishing, spam), assigns appropriate class labels based on filename, and combines all datasets into a unified format.
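The filename-based labelling described above could look roughly like the sketch below. It is an illustration only, not the module's code: the helper name `label_from_filename`, the keyword tuple, and the example file names are all assumptions.

```python
from pathlib import Path

# Hypothetical keywords; the real CIC file names may differ.
KNOWN_CLASSES = ("benign", "malware", "phishing", "spam")


def label_from_filename(path: str) -> str:
    """Return the first known class keyword found in the file name."""
    stem = Path(path).stem.lower()
    for cls in KNOWN_CLASSES:
        if cls in stem:
            return cls
    raise ValueError(f"cannot infer class label from {path!r}")


# Example paths are made up for illustration.
paths = ["data/cic/CSV_benign.csv", "data/cic/CSV_phishing.csv"]
labels = {p: label_from_filename(p) for p in paths}
```

Raising on an unrecognised file name, rather than defaulting to a class, keeps mislabelled rows out of the combined dataset.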
- train.dataset.cast_dga(data_path: str, max_rows: int) → DataFrame
Loads and processes DGA dataset from CSV file.
Reads DGA domain dataset, renames columns to standard format, adds malicious class label, and applies preprocessing to structure domain components.
- train.dataset.cast_dgarchive(data_path: str, max_rows: int) → DataFrame
Loads and processes DGArchive dataset from CSV file.
Reads DGArchive domain dataset, extracts class label from filename, renames columns to standard format, and applies preprocessing for domain analysis.
- train.dataset.cast_dgta(data_path: str, max_rows: int) → DataFrame
Loads and processes DGTA benchmark dataset from Parquet file.
Reads DGTA benchmark dataset, handles custom UTF-8 encoding, renames columns to standard format, and applies preprocessing for domain structure analysis.
- train.dataset.cast_heicloud(data_path: str, max_rows: int) → DataFrame
Loads and processes heiCLOUD dataset from space-separated text file.
Reads heiCLOUD DNS log dataset, parses space-separated columns, extracts domain queries, and labels them as legitimate traffic for training.
- train.dataset.preprocess(x: DataFrame) → DataFrame
Preprocesses DataFrame into structured dataset for feature extraction.
Filters out empty queries, removes duplicates, splits domain names into labels, and extracts top-level domain (TLD), second-level domain, and third-level domain components for further analysis.
- Parameters:
x (pl.DataFrame) – Raw dataset containing DNS queries for preprocessing.
- Returns:
pl.DataFrame – Preprocessed dataset with structured domain components.
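The preprocessing steps (drop empty queries, drop duplicates, split into labels, extract domain levels) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Polars implementation: the function names, the output keys, and the choice to treat everything left of the second-level domain as the third-level part are all hypothetical. It also ignores multi-label public suffixes such as `co.uk`.

```python
def split_domain(query: str) -> dict:
    """Split one domain name into labels and domain-level components."""
    labels = query.strip().rstrip(".").lower().split(".")
    return {
        "query": query,
        "labels": labels,
        "tld": labels[-1],
        "secondleveldomain": labels[-2] if len(labels) >= 2 else "",
        # Assumption: everything left of the second level is kept together.
        "thirdleveldomain": ".".join(labels[:-2]) if len(labels) > 2 else "",
    }


def preprocess_queries(queries: list[str]) -> list[dict]:
    """Filter empty queries, remove duplicates, and structure the rest."""
    seen = set()
    rows = []
    for query in queries:
        if not query or not query.strip():
            continue  # skip empty queries
        if query in seen:
            continue  # skip duplicates
        seen.add(query)
        rows.append(split_domain(query))
    return rows


rows = preprocess_queries(["www.example.com", "", "www.example.com", "example.org"])
```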
train.explainer module
- class train.explainer.Explainer(output_path: str = './results')
Bases: object
Interprets and explains trained machine learning models for DGA detection.
Provides model interpretation capabilities including rule extraction, feature importance analysis, and threshold rescaling for decision trees and ensemble models used in domain generation algorithm detection tasks.
- __init__(output_path: str = './results') → None
- Parameters:
output_path (str) – Directory path to save interpretation results.
- interpret_model(model, x_test: ndarray, y_test: ndarray, df_cols: list[str], scaler=None) → list[str]
Interpret a trained model by extracting decision rules and optionally rescaling them.
- Parameters:
model (sklearn.ensemble.BaseEnsemble | XGBClassifier) – Trained ML model.
x_test (np.ndarray) – Test set features.
y_test (np.ndarray) – Test set labels.
model_name (str) – Name used for saving output files. Note: the signature above takes df_cols (list[str]) and has no model_name parameter.
scaler (optional) – Scaler used in preprocessing, e.g., StandardScaler. Defaults to None.
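The optional rescaling step can be illustrated for the standardization case. A rule threshold learned on standardized features is mapped back to the original units by inverting z = (x - mean) / scale. The sketch below assumes a StandardScaler-style transform; the function name and the example numbers are hypothetical, and the actual Explainer may rescale differently.

```python
def rescale_threshold(threshold_scaled: float, mean: float, scale: float) -> float:
    """Map a threshold from standardized units back to original feature units."""
    return threshold_scaled * scale + mean


# Hypothetical rule "feature <= 0.5" on a feature with mean 12.0 and
# standard deviation 4.0 becomes "feature <= 14.0" in original units.
original_threshold = rescale_threshold(0.5, mean=12.0, scale=4.0)
```

For scikit-learn's StandardScaler, the per-feature values correspond to the fitted `mean_` and `scale_` attributes.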
- class train.explainer.Plotter(output_path: str = './results/data')
Bases: object
Creates visualizations and plots for dataset analysis and model interpretation.
Generates various plots including PCA visualizations, t-SNE projections, label distributions, and feature analysis plots to understand dataset characteristics and model behavior in DGA detection tasks.
- __init__(output_path: str = './results/data') → None
- Parameters:
output_path (str) – Directory path to save generated visualization files.
- create_plots_binary(ds_X: list[ndarray], ds_y: list[ndarray], data: list[Dataset]) → None
Generates comprehensive visualization suite for binary classification datasets.
Creates PCA plots (2D/3D), t-SNE projections, principal component removal analysis, and label distribution charts for multiple datasets to understand data characteristics and class separability in binary DGA detection tasks.
- create_plots_multiclass(ds_X: list[ndarray], ds_y: list[ndarray], data: list[Dataset]) → None
Generates visualizations for multiclass DGA family classification datasets.
Creates specialized plots for datasets containing multiple DGA families, focusing on label distribution analysis to understand class imbalances and dataset composition for multiclass classification tasks.
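The label distribution analysis behind these plots reduces to counting class frequencies. The sketch below shows one way to compute the shares that such a chart would display; the function name and the family names are made up for illustration, and the plotting itself is omitted.

```python
from collections import Counter


def label_distribution(labels: list[str]) -> dict[str, float]:
    """Return the share of samples per class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical family labels, chosen only to show an imbalanced dataset.
shares = label_distribution(["family_a", "family_a", "family_b", "family_c"])
```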
train.feature module
- class train.feature.Processor(features_to_drop: List)
Bases: object
Extracts statistical and linguistic features from domain name datasets.
Computes comprehensive feature sets including domain label statistics, character frequencies, entropy measures, and domain structure analysis for machine learning model training and DGA detection tasks.
- __init__(features_to_drop: List) → None
- Parameters:
features_to_drop (List) – List of column names to exclude from final features.
- transform(x: DataFrame) → DataFrame
Extracts comprehensive feature set from domain name dataset.
Computes domain label statistics, character frequencies for all letters, character type ratios, and entropy measures for different domain levels. Handles missing values and removes specified columns from final output.
- Parameters:
x (pl.DataFrame) – Input dataset with domain structure columns.
- Returns:
pl.DataFrame – Feature-engineered dataset ready for ML model training.
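To make the feature families concrete, the sketch below computes a Shannon entropy, character type ratios, and per-letter frequencies for a single domain label. It is a minimal plain-Python illustration: the function names and feature keys are assumptions, and the actual Processor works on Polars DataFrames and may define its features differently.

```python
import math
import string
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for an empty string."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def label_features(label: str) -> dict[str, float]:
    """Compute illustrative statistics for one domain label."""
    text = label.lower()
    n = len(text)

    def ratio(count: int) -> float:
        return count / n if n else 0.0  # guard against empty labels

    features = {
        "length": float(n),
        "entropy": shannon_entropy(text),
        "alpha_ratio": ratio(sum(ch.isalpha() for ch in text)),
        "digit_ratio": ratio(sum(ch.isdigit() for ch in text)),
        "special_ratio": ratio(sum(not ch.isalnum() for ch in text)),
    }
    # One frequency feature per letter a-z.
    for ch in string.ascii_lowercase:
        features[f"freq_{ch}"] = ratio(text.count(ch))
    return features


example = label_features("ab12")
```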
train.model module
- class train.model.LightGBMModel
Bases: Model
- class train.model.Model
Bases: object
- fdr_metric(y_true: ndarray, y_pred: ndarray) → float
Custom FDR metric to evaluate the performance of the Random Forest model.
- Parameters:
y_true (np.ndarray) – The true labels.
y_pred (np.ndarray) – The predicted labels.
- Returns:
float – The False Discovery Rate (FDR).
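The False Discovery Rate is the share of positive predictions that are wrong: FDR = FP / (FP + TP). The sketch below computes it in plain Python. It assumes class 1 is the positive (malicious) class and returns 0.0 when nothing is predicted positive; both choices are assumptions, not documented behavior of `fdr_metric`.

```python
def false_discovery_rate(y_true, y_pred, positive=1) -> float:
    """FDR = FP / (FP + TP); 0.0 when there are no positive predictions."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    predicted_positive = fp + tp
    return fp / predicted_positive if predicted_positive else 0.0


# Three positive predictions, one of them wrong: FDR = 1/3.
fdr = false_discovery_rate([1, 0, 1, 0], [1, 1, 1, 0])
```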
- class train.model.Pipeline(model: str, datasets: list[Dataset], model_output_path: str, scaler=None)
Bases: object
Manages end-to-end machine learning pipeline for DGA detection model training.
Orchestrates data preprocessing, feature engineering, model training, and evaluation for domain generation algorithm detection. Supports multiple datasets, model types, and handles data scaling, splitting, and persistence operations.
- __init__(model: str, datasets: list[Dataset], model_output_path: str, scaler=None) → None
Initializes complete ML pipeline with datasets and model configuration.
Sets up feature processing, data loading, train/validation/test splitting, and model instantiation based on specified algorithm type. Handles data persistence and visualization setup.
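- Parameters:
model (str) – Model type to instantiate.
datasets (list[Dataset]) – Datasets to load and process.
model_output_path (str) – Output path for the trained model.
scaler (optional) – Scaler applied to the features. Defaults to None.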
- Raises:
NotImplementedError – If specified model type is not supported.
- explain(x: ndarray, y: ndarray) → list[str]
Generates interpretable explanations for trained model decisions.
Creates human-readable decision rules and feature importance explanations for supported model types (XGBoost, Random Forest).
- Parameters:
x (np.ndarray) – Feature matrix for explanation generation.
y (np.ndarray) – True labels for explanation context.
- Returns:
list[str] – List of interpretable decision rules and explanations.
- hyperparam_fit() → None
Performs hyperparameter optimization and model training.
Uses Optuna to search optimal hyperparameters through Bayesian optimization, then trains the model with best parameters found during the search process.
- predict(x: ndarray) → ndarray
Generates predictions for input feature matrix.
- Parameters:
x (np.ndarray) – Feature matrix for prediction.
- Returns:
np.ndarray – Model predictions.
- train_test_val_split(X: ndarray, Y: ndarray, train_frac: float = 0.8, random_state: int = 108) → tuple[ndarray, ndarray, ndarray, ndarray, ndarray, ndarray]
Splits dataset into training, validation, and test sets with stratification.
Creates stratified splits maintaining class distribution across all subsets. Training set gets specified fraction, validation and test sets split remaining data equally.
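The splitting scheme (a stratified training fraction, with the remainder halved into validation and test) can be sketched with the standard library alone. This is an illustration under assumptions, not the method's implementation: the function name is hypothetical, it returns index lists instead of six arrays, and the rounding of per-class sizes is a choice made here.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, train_frac=0.8, random_state=108):
    """Return (train, val, test) index lists, stratified by class label."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        rest = indices[n_train:]
        half = len(rest) // 2  # remainder is split equally
        train += indices[:n_train]
        val += rest[:half]
        test += rest[half:]
    return train, val, test


labels = [0] * 80 + [1] * 20
train_idx, val_idx, test_idx = stratified_three_way_split(labels)
```

Splitting per class keeps the 80/20 class ratio of the example in all three subsets.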
- class train.model.RandomForestModel
Bases: Model
- class train.model.XGBoostModel
Bases: Model
- fdr_metric(preds: ndarray, dtrain: DMatrix) → tuple[str, float]
Custom FDR metric to evaluate model performance based on False Discovery Rate.
- Parameters:
preds (np.ndarray) – The predicted values.
dtrain (xgb.DMatrix) – The training data matrix.
- Returns:
tuple – A tuple containing the metric name (“fdr”) and its value.
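The `(preds, dtrain)` signature and the `(name, value)` return follow the shape of an XGBoost custom evaluation function, which reads the true labels through `DMatrix.get_label()`. The sketch below mimics that contract with a stand-in class so that it runs without xgboost installed. The 0.5 decision threshold, the stand-in class, and the function name are assumptions.

```python
class FakeDMatrix:
    """Stand-in exposing get_label(), as xgboost.DMatrix does."""

    def __init__(self, labels):
        self._labels = list(labels)

    def get_label(self):
        return self._labels


def fdr_eval(preds, dtrain, threshold=0.5):
    """Return ("fdr", value) for probability predictions."""
    labels = dtrain.get_label()
    predicted = [1 if p > threshold else 0 for p in preds]  # assumed cutoff
    fp = sum(1 for y, p in zip(labels, predicted) if p == 1 and y == 0)
    tp = sum(1 for y, p in zip(labels, predicted) if p == 1 and y == 1)
    value = fp / (fp + tp) if (fp + tp) else 0.0
    return "fdr", value


name, value = fdr_eval([0.9, 0.8, 0.2, 0.7], FakeDMatrix([1, 0, 0, 1]))
```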
train.train module
- class train.train.DatasetEnum(value)
Available dataset configurations for DGA detection model training.
- CIC = 'cic'
- COMBINE = 'combine'
- DGARCHIVE = 'dgarchive'
- DGTA = 'dgta'
- __format__(format_spec)
Returns format using actual value type unless __str__ has been overridden.
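Because each member maps to a string value, a configuration string can be turned into a member by value lookup. The sketch below uses a stand-in enum with the same four values; the class name `DatasetChoice` is hypothetical and the real `DatasetEnum` may have additional base classes.

```python
from enum import Enum


class DatasetChoice(Enum):
    """Stand-in mirroring the documented member values."""

    CIC = "cic"
    COMBINE = "combine"
    DGARCHIVE = "dgarchive"
    DGTA = "dgta"


# Lookup by value, e.g. when parsing a command-line argument.
choice = DatasetChoice("dgta")

# Unknown values raise ValueError instead of silently falling back.
try:
    DatasetChoice("unknown")
    is_valid = True
except ValueError:
    is_valid = False
```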
- class train.train.DetectorTraining(model_name: rf, model_output_path: str = './results/model', dataset: DatasetEnum = DatasetEnum.COMBINE, data_base_path: str = './data', max_rows: int = -1)
Bases: object
Orchestrates end-to-end training of DGA detection models.
Manages dataset loading, model selection, training pipeline execution, and model persistence for domain generation algorithm detection. Supports multiple datasets, model types, and handles checksum-based model versioning.
- __init__(model_name: rf, model_output_path: str = './results/model', dataset: DatasetEnum = DatasetEnum.COMBINE, data_base_path: str = './data', max_rows: int = -1) → None
Initializes training configuration and dataset loading.
Sets up model training pipeline with specified algorithm, datasets, and output paths. Handles existing model detection and checksum validation for incremental training workflows.
- Parameters:
model_name (ModelEnum) – ML algorithm type for training.
model_output_path (str) – Directory path for saving trained models.
dataset (DatasetEnum) – Dataset configuration for training.
data_base_path (str) – Base directory containing raw datasets.
max_rows (int) – Maximum rows per dataset (default: -1 for unlimited).
- Raises:
NotImplementedError – If specified dataset configuration is not supported.
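Checksum validation of an existing model file could work along the lines of the sketch below. It is a guess at the general technique, not the class's implementation: the SHA-256 algorithm, the function names, and the file name are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=65536) -> str:
    """Hash a file in chunks so large models need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_is_current(model_path, expected_checksum: str) -> bool:
    """True if the model file exists and matches the recorded checksum."""
    path = Path(model_path)
    return path.is_file() and file_sha256(path) == expected_checksum


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.bin"  # hypothetical file name
    model_file.write_bytes(b"model-bytes")
    recorded = file_sha256(model_file)
    unchanged = model_is_current(model_file, recorded)

    # Overwriting the file simulates a retrained model.
    model_file.write_bytes(b"retrained")
    changed = not model_is_current(model_file, recorded)
```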
- explain() → None
Generates and saves interpretable explanations for the trained model.
Extracts decision rules and model interpretations from the trained classifier and saves them to text files for analysis and understanding of model behavior.
- test() → None
Evaluates trained model on all datasets and generates comprehensive reports.
Tests model performance across all loaded datasets, computes metrics including classification reports, FDR, and FTTAR. Saves detailed error analysis and misprediction information for model debugging and improvement.
- train(seed: int = 108) → None
Executes complete model training workflow with evaluation and persistence.
Performs hyperparameter optimization, model training, evaluation on test set, and generates comprehensive analysis including model interpretation and performance reports across all datasets.
- Parameters:
seed (int) – Random seed for reproducible training results.