Usage#

Note

This page is under active development.

Getting Started#

To use heiDGAF, just use the provided docker-compose.yml to quickly bootstrap your environment:

$ HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml up

If you want to run containers individually, use:

$ HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.kafka.yml up
$ docker run ...

Make sure you set the environment variable HOST_IP to your host’s IP address, so that the services can communicate with each other.

Installation#

Install all Python requirements.

$ python -m venv .venv
$ source .venv/bin/activate
(.venv) $ sh install_requirements.sh

Now, you can start each module, e.g. the Inspector:

(.venv) $ python src/inspector/main.py

Configuration#

Logline format configuration#

Configure the format and validation rules for DNS server loglines through flexible field definitions that support timestamps, IP addresses, regular expressions, and list-based validation.

Configuration Overview#

Users can define the format and fields of their DNS server loglines through the pipeline.log_collection.collector.logline_format parameter. This configuration allows complete customization of field types, validation rules, and filtering criteria for incoming log data.

For example, a logline might look like this:

2025-04-04T14:45:32.458123Z NXDOMAIN 192.168.3.152 10.10.0.3 test.com AAAA 192.168.15.34 196b

Field Definition Structure#

Each list entry of the parameter defines one field of the input logline, and the order of the entries corresponds to the order of the values in each logline. Each list entry itself consists of a list with two to four entries depending on the field type. For example, a field definition might look like this:

[ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]

Field Names and Requirements#

The first entry of each field definition always corresponds to the name of the field. Certain field names are required for proper pipeline operation, while others are forbidden as they are reserved for internal use.

Required and forbidden field names#

Category

Field Names

Required

timestamp, status_code, client_ip, record_type, domain_name

Forbidden

logline_id, batch_id

Required fields must be present in the configuration as they are essential for pipeline processing. Forbidden fields are reserved for internal communication and cannot be used as custom field names.

Field Types and Validation#

The second entry specifies the type of the field. Depending on the type defined, the method for defining validation parameters varies. The third and fourth entries change depending on the type.

There are four field types available:

Field types#

Field type

Format of 3rd entry

Format of 4th entry

Description

Timestamp

Timestamp format string

(not used)

Validates timestamp fields using Python’s strptime format. Automatically converts to ISO format for internal processing. Example: "%Y-%m-%dT%H:%M:%S.%fZ"

IpAddress

(not used)

(not used)

Validates IPv4 and IPv6 addresses. No additional parameters required.

RegEx (Regular Expression)

RegEx pattern as string

(not used)

Validates field content against a regular expression pattern. If the pattern matches, the field is valid.

ListItem

List of allowed values

List of relevant values (optional)

Validates field values against an allowed list. Optionally defines relevant values for filtering in later pipeline stages. All relevant values must also be in the allowed list. If not specified, all allowed values are deemed relevant.

Configuration Examples#

Here are examples for each field type:

logline_format:
  - [ "timestamp", Timestamp, "%Y-%m-%dT%H:%M:%S.%fZ" ]
  - [ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
  - [ "client_ip", IpAddress ]
  - [ "domain_name", RegEx, '^(?=.{1,253}$)((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)' ]
  - [ "record_type", ListItem, [ "A", "AAAA" ] ]

Logging Configuration#

The following parameters control the logging behavior.

logging Parameters#

Parameter

Description

base

The debug field enables debug-level logging if set to true for all files, that do not contain the main modules.

modules

For each module, the debug field can be set to show debug-level logging messages.

If a debug field is set to false, only info-level logging is shown. By default, all the fields are set to false.

Pipeline Configuration#

The following parameters control the behavior of each stage of the heiDGAF pipeline, including the functionality of the modules.

pipeline.log_storage#

logserver Parameters#

Parameter

Default Value

Description

input_file

"/opt/file.txt"

Path of the input file, to which data is appended during usage.

Keep this setting unchanged when using Docker; modify the MOUNT_PATH in docker/.env instead.

pipeline.log_collection#

collector Parameters#

Parameter

Description

logline_format

Defines the expected format for incoming log lines. See the Logline format configuration section for more details.

batch_handler Parameters#

Parameter

Default Value

Description

batch_size

10000

Number of entries in a Batch, at which it is sent due to reaching the maximum fill state.

batch_timeout

30.0

Time after which a Batch is sent. Mainly relevant for Batches that only contain a small number of entries, and do not reach the size limit for a longer time period.

subnet_id.ipv4_prefix_length

24

The number of bits to trim from the client’s IPv4 address for use as Subnet ID.

subnet_id.ipv6_prefix_length

64

The number of bits to trim from the client’s IPv6 address for use as Subnet ID.

pipeline.data_inspection#

inspector Parameters#

Parameter

Default Value

Description

mode

univariate (options: multivariate, ensemble)

Mode of operation for the data inspector.

ensemble.model

WeightEnsemble

Model to use when inspector mode is ensemble.

ensemble.module

streamad.process

Python module for the ensemble model.

ensemble.model_args

Additional Arguments for the ensemble model.

models.model

ZScoreDetector

Model to use for data inspection

models.module

streamad.model

Base python module for inspection models

models.model_args

Additional arguments for the model

models.model_args.is_global

false

anomaly_threshold

0.01

Threshold for classifying an observation as an anomaly.

score_threshold

0.5

Threshold for the anomaly score.

time_type

ms

Unit of time used in time range calculations.

time_range

20

Time window for data inspection

pipeline.data_analysis#

detector Parameters#

Parameter

Default Value

Description

model

rf option: XGBoost

Model to use for the detector

checksum

Not given here

Checksum for the model file to ensure integrity

base_url

https://heibox.uni-heidelberg.de/d/0d5cbcbe16cd46a58021/

Base URL for downloading the model if not present locally

threshold

0.5

Threshold for the detector’s classification.

Environment Configuration#

The following parameters control the infrastructure of the software.

environment Parameters#

Parameter

Default Value

Description

kafka_brokers

hostname: kafka1, port: 8097, hostname: kafka2, port: 8098, hostname: kafka3, port: 8099

Hostnames and ports of the Kafka brokers, given as list.

kafka_topics

Not given here

Kafka topic names given as strings. These topics are used for the data transfer between the modules.

monitoring.clickhouse_server.hostname

clickhouse-server

Hostname of the ClickHouse server. Used by Grafana.