Usage#
Note
This page is under active development.
Getting Started#
To get started with heiDGAF, use the provided docker-compose.yml to bootstrap your environment:
$ HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml up
If you want to run containers individually, use:
$ HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.kafka.yml up
$ docker run ...
Make sure you set the environment variable HOST_IP to your host’s IP address, so that the services can communicate with each other.
Installation#
Create a virtual environment and install all Python requirements:
$ python -m venv .venv
$ source .venv/bin/activate
(.venv) $ sh install_requirements.sh
Now, you can start each module, e.g. the Inspector:
(.venv) $ python src/inspector/main.py
Configuration#
Logline format configuration#
Configure the format and validation rules for DNS server loglines through flexible field definitions that support timestamps, IP addresses, regular expressions, and list-based validation.
Configuration Overview#
Users can define the format and fields of their DNS server loglines through the
pipeline.log_collection.collector.logline_format parameter. This configuration allows complete customization
of field types, validation rules, and filtering criteria for incoming log data.
For example, a logline might look like this:
2025-04-04T14:45:32.458123Z NXDOMAIN 192.168.3.152 10.10.0.3 test.com AAAA 192.168.15.34 196b
Field Definition Structure#
Each list entry of the parameter defines one field of the input logline, and the order of the entries corresponds to the order of the values in each logline. Each list entry itself consists of a list with two to four entries depending on the field type. For example, a field definition might look like this:
[ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
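Because the order of the entries mirrors the order of the values in a logline, a logline can be mapped to named fields by position. The following is a minimal sketch of that idea, not heiDGAF's own parser. It assumes values are separated by whitespace, and all field names other than those shown in this page's examples (`dns_server_ip`, `response_ip`, `size`) are hypothetical.

```python
# Hypothetical field names, listed in the same order as the values in a logline.
FIELD_NAMES = [
    "timestamp", "status_code", "client_ip", "dns_server_ip",
    "domain_name", "record_type", "response_ip", "size",
]


def split_logline(logline: str) -> dict:
    """Pair each whitespace-separated value with the field name at the same position."""
    values = logline.split()
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} values, got {len(values)}")
    return dict(zip(FIELD_NAMES, values))


fields = split_logline(
    "2025-04-04T14:45:32.458123Z NXDOMAIN 192.168.3.152 10.10.0.3 "
    "test.com AAAA 192.168.15.34 196b"
)
print(fields["status_code"])  # NXDOMAIN
```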
Field Names and Requirements#
The first entry of each field definition always corresponds to the name of the field. Certain field names are required for proper pipeline operation, while others are forbidden as they are reserved for internal use.
| Category | Field Names |
|---|---|
| Required | |
| Forbidden | |
Required fields must be present in the configuration as they are essential for pipeline processing. Forbidden fields are reserved for internal communication and cannot be used as custom field names.
Field Types and Validation#
The second entry specifies the field type. The type determines how validation parameters are defined, that is, what the third and fourth entries contain.
There are four field types available:
| Field type | Format of 3rd entry | Format of 4th entry | Description |
|---|---|---|---|
| Timestamp | Timestamp format string | (not used) | Validates timestamp fields using Python's strptime format. Automatically converts to ISO format for internal processing. Example: %Y-%m-%dT%H:%M:%S.%fZ |
| IpAddress | (not used) | (not used) | Validates IPv4 and IPv6 addresses. No additional parameters required. |
| RegEx | RegEx pattern as string | (not used) | Validates field content against a regular expression pattern. If the pattern matches, the field is valid. |
| ListItem | List of allowed values | List of relevant values (optional) | Validates field values against an allowed list. Optionally defines relevant values for filtering in later pipeline stages. All relevant values must also be in the allowed list. If not specified, all allowed values are deemed relevant. |
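To illustrate what each field type checks, here is a rough sketch using only the Python standard library. The function names are hypothetical and this is not heiDGAF's implementation. In particular, it assumes that "the pattern matches" means a match anchored at the start of the value (`re.match`).

```python
import ipaddress
import re
from datetime import datetime


def check_timestamp(value: str, fmt: str) -> bool:
    """Timestamp: the value must parse with the given strptime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def check_ip_address(value: str) -> bool:
    """IpAddress: the value must be a valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def check_regex(value: str, pattern: str) -> bool:
    """RegEx: assumed here to mean a match anchored at the start of the value."""
    return re.match(pattern, value) is not None


def check_list_item(value: str, allowed: list, relevant=None) -> bool:
    """ListItem: the value must be allowed; relevant values must be a subset of allowed."""
    if relevant is not None and not set(relevant) <= set(allowed):
        raise ValueError("every relevant value must also be in the allowed list")
    return value in allowed


def is_relevant(value: str, allowed: list, relevant=None) -> bool:
    """Without an explicit relevant list, every allowed value counts as relevant."""
    return value in (allowed if relevant is None else relevant)


print(check_timestamp("2025-04-04T14:45:32.458123Z", "%Y-%m-%dT%H:%M:%S.%fZ"))  # True
print(is_relevant("NOERROR", ["NOERROR", "NXDOMAIN"], ["NXDOMAIN"]))  # False
```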
Configuration Examples#
Here are examples for each field type:
logline_format:
- [ "timestamp", Timestamp, "%Y-%m-%dT%H:%M:%S.%fZ" ]
- [ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
- [ "client_ip", IpAddress ]
- [ "domain_name", RegEx, '^(?=.{1,253}$)((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)' ]
- [ "record_type", ListItem, [ "A", "AAAA" ] ]
Logging Configuration#
The following parameters control the logging behavior.
| Parameter | Description |
|---|---|
| base | The |
| modules | For each module, the |
If a debug field is set to false, only info-level logging is shown. By default, all debug fields are set to false.
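The effect of a debug field can be pictured with Python's standard `logging` module. This is only a sketch: the nested dictionary layout and the module names `inspector` and `detector` are assumptions made for illustration, not the actual configuration schema.

```python
import logging

# Hypothetical excerpt of a parsed logging configuration (layout and names assumed).
logging_config = {
    "base": {"debug": False},
    "modules": {
        "inspector": {"debug": True},
        "detector": {"debug": False},
    },
}


def level_for(debug: bool) -> int:
    """debug=True enables debug-level output; otherwise only info-level and above."""
    return logging.DEBUG if debug else logging.INFO


logging.basicConfig(level=level_for(logging_config["base"]["debug"]))
for name, settings in logging_config["modules"].items():
    logging.getLogger(name).setLevel(level_for(settings["debug"]))
```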
Pipeline Configuration#
The following parameters control the behavior of each stage of the heiDGAF pipeline, including the functionality of the modules.
pipeline.log_storage#
| Parameter | Default Value | Description |
|---|---|---|
| input_file | | Path of the input file, to which data is appended during usage. Keep this setting unchanged when using Docker; modify the |
pipeline.log_collection#
| Parameter | Description |
|---|---|
| logline_format | Defines the expected format for incoming log lines. See the Logline format configuration section for more details. |
| Parameter | Default Value | Description |
|---|---|---|
| batch_size | | Number of entries in a Batch, at which it is sent due to reaching the maximum fill state. |
| batch_timeout | | Time after which a Batch is sent. Mainly relevant for Batches that only contain a small number of entries, and do not reach the size limit for a longer time period. |
| subnet_id.ipv4_prefix_length | | The number of bits to trim from the client’s IPv4 address for use as Subnet ID. |
| subnet_id.ipv6_prefix_length | | The number of bits to trim from the client’s IPv6 address for use as Subnet ID. |
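The sketch below illustrates how these four parameters could behave. It is not heiDGAF's code and rests on several assumptions: the prefix length is read in the usual CIDR sense (number of leading bits kept), the Subnet ID is rendered as a CIDR string, the timeout is measured in seconds, and the timer starts when the first entry is added to an empty Batch. The `Batch` class and `subnet_id` function are hypothetical names.

```python
import ipaddress
import time


def subnet_id(client_ip: str, ipv4_prefix_length: int, ipv6_prefix_length: int) -> str:
    """Derive a subnet identifier by keeping only the leading prefix bits (assumed reading)."""
    address = ipaddress.ip_address(client_ip)
    prefix = ipv4_prefix_length if address.version == 4 else ipv6_prefix_length
    # strict=False zeroes the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{client_ip}/{prefix}", strict=False))


class Batch:
    """Collects entries until the size limit or the timeout is reached."""

    def __init__(self, batch_size: int, batch_timeout: float, clock=time.monotonic):
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout  # assumed to be seconds
        self.clock = clock
        self.entries = []
        self.started = None

    def add(self, entry) -> None:
        if not self.entries:
            self.started = self.clock()  # timer starts with the first entry
        self.entries.append(entry)

    def should_send(self) -> bool:
        if not self.entries:
            return False
        full = len(self.entries) >= self.batch_size
        timed_out = self.clock() - self.started >= self.batch_timeout
        return full or timed_out


print(subnet_id("192.168.3.152", 24, 64))  # 192.168.3.0/24
```

The injectable `clock` argument is a design choice for this sketch only; it makes the timeout path easy to exercise without waiting.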
pipeline.data_inspection#
| Parameter | Default Value | Description |
|---|---|---|
| mode | | Mode of operation for the data inspector. |
| ensemble.model | | Model to use when inspector mode is |
| ensemble.module | | Python module for the ensemble model. |
| ensemble.model_args | | Additional arguments for the ensemble model. |
| models.model | | Model to use for data inspection. |
| models.module | | Base Python module for inspection models. |
| models.model_args | | Additional arguments for the model. |
| models.model_args.is_global | | |
| anomaly_threshold | | Threshold for classifying an observation as an anomaly. |
| score_threshold | | Threshold for the anomaly score. |
| time_type | | Unit of time used in time range calculations. |
| time_range | | Time window for data inspection. |
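The table does not spell out how score_threshold and anomaly_threshold interact. One plausible reading, shown here purely as an assumption, is that score_threshold marks individual observations as anomalous and anomaly_threshold is the share of anomalous observations above which a batch is flagged. The function name `is_suspicious` is hypothetical.

```python
def is_suspicious(scores, score_threshold: float, anomaly_threshold: float) -> bool:
    """Flag a batch when the share of high-scoring observations exceeds a limit.

    Assumed semantics: a score at or above score_threshold counts as an anomaly,
    and the batch is flagged if the anomaly share exceeds anomaly_threshold.
    """
    if not scores:
        return False
    anomalies = sum(1 for score in scores if score >= score_threshold)
    return anomalies / len(scores) > anomaly_threshold


print(is_suspicious([0.1, 0.9, 0.8, 0.2], score_threshold=0.5, anomaly_threshold=0.01))  # True
```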
pipeline.data_analysis#
| Parameter | Default Value | Description |
|---|---|---|
| model | | Model to use for the detector |
| checksum | Not given here | Checksum for the model file to ensure integrity |
| base_url | | Base URL for downloading the model if not present locally |
| threshold | | Threshold for the detector’s classification. |
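A checksum comparison of this kind can be sketched as follows. The hash algorithm is not stated on this page, so SHA-256 is an assumption here, and `file_sha256` and `verify_model` are hypothetical helper names. The demo writes a throwaway file instead of downloading a real model.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected_checksum: str) -> bool:
    """Return True if the file's digest equals the configured checksum."""
    return file_sha256(path) == expected_checksum.lower()


# Demo with a temporary stand-in for a model file.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.bin"
    model_path.write_bytes(b"example model bytes")
    expected = hashlib.sha256(b"example model bytes").hexdigest()
    intact = verify_model(model_path, expected)
    tampered = verify_model(model_path, "0" * 64)

print(intact, tampered)  # True False
```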
Environment Configuration#
The following parameters control the infrastructure of the software.
| Parameter | Default Value | Description |
|---|---|---|
| kafka_brokers | | Hostnames and ports of the Kafka brokers, given as a list. |
| kafka_topics | Not given here | Kafka topic names given as strings. These topics are used for the data transfer between the modules. |
| monitoring.clickhouse_server.hostname | | Hostname of the ClickHouse server. Used by Grafana. |
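Kafka clients commonly expect broker addresses as a comma-separated `host:port` string (the `bootstrap.servers` setting). As a small sketch, a broker list could be converted like this. The entry layout with `hostname` and `port` keys and the placeholder values are assumptions, not the documented schema.

```python
# Hypothetical broker entries; hostnames and ports are placeholders.
kafka_brokers = [
    {"hostname": "kafka1", "port": 9092},
    {"hostname": "kafka2", "port": 9092},
]


def bootstrap_servers(brokers) -> str:
    """Join broker entries into a comma-separated host:port string."""
    return ",".join(f"{broker['hostname']}:{broker['port']}" for broker in brokers)


print(bootstrap_servers(kafka_brokers))  # kafka1:9092,kafka2:9092
```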