Releases: stefanDeveloper/heiDGAF
v1.0.1
Overview
This release addresses a critical stability issue in the Inspector component where message timestamps outside the configured time window could trigger an IndexError, halting data inspection and analysis.
Bug Fix: Out-of-Bounds Index
Issue:
When messages arrived with timestamps before begin_timestamp or after end_timestamp, the internal _count_errors() method attempted to write counts beyond the bounds of the counts array, resulting in:
IndexError: index XXXX is out of bounds for axis 0 with size YYYY
Resolution:
- Added bounds checking and filtering of invalid time indices.
- Safely ignore timestamps that fall outside the configured time range.
- Added clear warnings in logs when such messages are skipped.
- Introduced new unit tests verifying correct counting and stability.
Result:
The Inspector now handles out-of-range messages gracefully without crashing, ensuring uninterrupted operation and accurate in-range metric aggregation.
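A minimal sketch of the fix's core idea, assuming the counts are held in a NumPy array indexed by time bucket (names and shapes here are illustrative, not the actual implementation):

```python
import logging
import numpy as np

logger = logging.getLogger(__name__)

def count_per_bucket(timestamps, begin_ts, end_ts, n_buckets):
    """Count messages per time bucket, skipping out-of-range timestamps."""
    counts = np.zeros(n_buckets, dtype=int)
    bucket_width = (end_ts - begin_ts) / n_buckets
    indices = ((np.asarray(timestamps) - begin_ts) // bucket_width).astype(int)
    # Bounds check: keep only indices that land inside the counts array.
    valid = (indices >= 0) & (indices < n_buckets)
    if not valid.all():
        logger.warning("Skipping %d out-of-range timestamps", int((~valid).sum()))
    np.add.at(counts, indices[valid], 1)
    return counts
```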
Additional Improvements
- Minor logging improvements for debugging time range and timestamp issues.
- Enhanced test coverage for _count_errors() edge cases (empty batches, duplicate timestamps, out-of-range messages).
Verification
A new test (test_count_errors_valid_and_out_of_range) confirms that:
- Counts are correctly computed for valid timestamps.
- Out-of-range messages are ignored safely.
- No IndexError or unexpected array resizing occurs.
Upgrade Notes
No configuration changes are required.
It is recommended to upgrade to this version if your Inspector processes messages that may arrive late or have timestamp drift.
v1.0.0
heiDGAF Release Notes
Version 1.0.0-rc1 - Release Candidate
Release Date: October 2025
Overview
We are excited to announce the first release candidate of heiDGAF (Heidelberg Domain Generation Algorithm Framework), a comprehensive real-time DNS anomaly detection pipeline designed for cybersecurity applications. This release provides a complete end-to-end solution for detecting malicious domain generation algorithms (DGAs) and suspicious DNS traffic patterns.
Key Features
Real-Time Processing Pipeline
- 5-Stage Architecture: Modular pipeline design with Log Storage, Log Collection, Log Filtering, Inspection, and Detection stages
- Apache Kafka Integration: Asynchronous message processing with exactly-once semantics
- ClickHouse Database: High-performance analytics database for monitoring and logging
- Docker Support: Complete containerized deployment with docker-compose
Advanced Anomaly Detection
- Time-Series Analysis: StreamAD-based anomaly detection with univariate, multivariate, and ensemble models (see the sketch after this list)
- Machine Learning Classification: Pre-trained ML models for DGA detection using Random Forest and XGBoost
- Two-Tier Detection: Initial time-series filtering followed by domain-level classification
- Configurable Thresholds: Flexible scoring and anomaly thresholds for different deployment scenarios
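To illustrate the StreamAD-based time-series stage, here is a minimal univariate example following StreamAD's documented quickstart pattern; the detector choice, threshold, and data source are placeholders, not heiDGAF's actual wiring:

```python
from streamad.model import ZScoreDetector
from streamad.util import StreamGenerator, UnivariateDS

ds = UnivariateDS()                 # example dataset bundled with StreamAD
stream = StreamGenerator(ds.data)   # replay the data as a stream

detector = ZScoreDetector()
for x in stream.iter_item():
    score = detector.fit_score(x)   # None during warm-up, a float afterwards
    if score is not None and score > 0.9:
        print(f"anomaly candidate: {x} (score={score:.3f})")
```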
Comprehensive Data Processing
- Flexible Log Format: Configurable logline parsing with type validation and relevance filtering
- Batch Processing: Intelligent batching with subnet-based grouping and temporal windowing
- Feature Engineering: Advanced domain name feature extraction including entropy, character distributions, and linguistic patterns (entropy example after this list)
- Data Validation: Multi-stage validation ensuring data integrity throughout the pipeline
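As an example of one such feature, Shannon entropy over a domain label's character distribution tends to separate algorithmically generated names from human-chosen ones. A self-contained sketch (not heiDGAF's exact extraction code):

```python
import math
from collections import Counter

def shannon_entropy(label: str) -> float:
    """Shannon entropy (bits/char) of the character distribution in a label."""
    counts = Counter(label)
    total = len(label)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

print(shannon_entropy("google"))        # ~1.92: typical human-chosen name
print(shannon_entropy("xq7b2kd9fjw3"))  # ~3.58: DGA-like, near-uniform chars
```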
Technical Specifications
Supported Detection Models
Time-Series Anomaly Detection (StreamAD):
- Univariate: ZScoreDetector, KNNDetector, SpotDetector, SRDetector, OCSVMDetector, MadDetector, SArimaDetector
- Multivariate: xStreamDetector, RShashDetector, HSTreeDetector, LodaDetector, RrcfDetector
- Ensemble: WeightEnsemble, VoteEnsemble
Machine Learning Classification (see the sketch after this list):
- Random Forest (pre-trained, default)
- XGBoost support
- LightGBM support
- Custom model integration via URL-based download with SHA256 validation
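A hedged sketch of the second detection tier, assuming a scikit-learn-style classifier applied to extracted domain features; the file name and feature layout are illustrative (pickle is currently the only supported serialization, see Known Limitations):

```python
import pickle
import numpy as np

# Illustrative path; real models are downloaded and SHA256-validated.
with open("rf_model.pickle", "rb") as f:
    model = pickle.load(f)

# Illustrative feature vector: e.g. entropy, label length, digit ratio.
features = np.array([[3.58, 12, 0.42]])
p_dga = model.predict_proba(features)[0, 1]
print(f"P(DGA) = {p_dga:.2f}")
```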
Data Sources & Datasets
- Training Datasets: CIC-Bell-DNS-2021, DGTA-BENCH, DGArchive, Bambenek, heiCLOUD
- Input Formats: DNS log files, Kafka topics
- Output: Real-time alerts, monitoring dashboards, JSON warning logs
Monitoring & Operations
Comprehensive Monitoring
- Fill Level Tracking: Real-time monitoring of data volumes across all pipeline stages
- Performance Metrics: Batch processing statistics, timing data, and throughput monitoring
- Health Checks: Stage-specific monitoring with busy state management
- Alert Management: Structured alert generation with risk scoring
Configuration Management
- Centralized Configuration: Single YAML file for all pipeline settings (example after this list)
- Environment-Specific Settings: Separate configurations for development, testing, and production
- Hot-Swappable Parameters: Runtime configuration updates for thresholds and model parameters
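For instance, reading the centralized settings from Python might look like the following; the key names are illustrative, not heiDGAF's actual schema (consult config.yaml for that):

```python
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Illustrative keys only; the real schema is defined in config.yaml.
batch_size = config["pipeline"]["batch_size"]          # e.g. 10000
threshold = config["detection"]["anomaly_threshold"]   # e.g. 0.9
```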
Installation & Deployment
System Requirements
- Python 3.11+
- Apache Kafka cluster
- ClickHouse database
- Docker & docker-compose (optional)
Quick Start
# Clone repository
git clone https://github.com/stefanDeveloper/heiDGAF.git
# Docker deployment
HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml up
Configuration
- Default configuration in config.yaml
- Environment-specific overrides in docker/.env
- Flexible logline format configuration with type validation
Performance Features
Scalability
- Horizontal Scaling: Multi-broker Kafka setup with partitioned topics
- Batch Optimization: Configurable batch sizes (default: 10,000 entries) with timeout handling (sketched after this list)
- Memory Efficiency: Streaming processing with bounded memory usage
- Concurrent Processing: Asynchronous processing across all pipeline stages
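The size-or-timeout batching idea can be sketched as follows; this is a simplified generator, not the actual pipeline code:

```python
import time
from typing import Iterable, Iterator

def batch_stream(messages: Iterable, max_size: int = 10_000,
                 timeout_s: float = 5.0) -> Iterator[list]:
    """Yield a batch when it reaches max_size or timeout_s has elapsed.

    Note: the timeout is only checked as messages arrive; a production
    implementation would flush from a separate timer.
    """
    batch, deadline = [], time.monotonic() + timeout_s
    for msg in messages:
        batch.append(msg)
        if len(batch) >= max_size or time.monotonic() >= deadline:
            yield batch
            batch, deadline = [], time.monotonic() + timeout_s
    if batch:  # flush whatever remains at end of stream
        yield batch
```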
Reliability
- Exactly-Once Processing: Kafka exactly-once semantics for data consistency (see the sketch after this list)
- Error Handling: Comprehensive exception handling with graceful degradation
- Data Validation: Multi-level validation ensuring data integrity
- Monitoring Integration: Full observability with ClickHouse analytics
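For reference, exactly-once publishing with confluent-kafka-python's transactional producer looks roughly like this; the broker address, transactional.id, and topic name are placeholders, and heiDGAF's actual client code may differ:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "transactional.id": "inspector-1",       # placeholder id
})
producer.init_transactions()

producer.begin_transaction()
producer.produce("pipeline-inspections", value=b'{"alert": "..."}')
producer.commit_transaction()  # or abort_transaction() on failure
```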
Security Features
Model Integrity
- SHA256 Validation: Cryptographic validation of downloaded models (see the sketch after this list)
- Secure Downloads: HTTPS-based model retrieval with checksum verification
- Local Caching: Secure local storage of validated models
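A minimal sketch of HTTPS download plus checksum verification, assuming the expected digest is known in advance (function and file names are illustrative):

```python
import hashlib
import urllib.request

def fetch_model(url: str, expected_sha256: str, dest: str) -> None:
    """Download a model over HTTPS and verify its SHA256 before saving."""
    data = urllib.request.urlopen(url).read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"checksum mismatch: {digest} != {expected_sha256}")
    with open(dest, "wb") as f:
        f.write(data)
```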
Data Protection
- Input Validation: Comprehensive input sanitization and type checking
- Anomaly Isolation: Secure processing of suspicious data without exposure
- Audit Trails: Complete logging of all processing stages for forensic analysis
Documentation & Training
Comprehensive Documentation
- API Documentation: Complete class and method documentation with consistent docstring style
- Configuration Guide: Detailed configuration examples and best practices
- Deployment Guide: Step-by-step deployment instructions
- Model Training: Custom model training utilities and documentation
Training & Explanation Tools
- Model Training Pipeline: End-to-end training workflow for custom models
- Feature Engineering: Advanced domain name feature extraction utilities
- Model Explanation: Interpretation and visualization tools for trained models
- Dataset Utilities: Data preprocessing and validation tools for multiple dataset formats
What's New in RC1
Code Quality Improvements
- Standardized Docstrings: All modules now follow consistent documentation style
- Type Annotations: Complete type hints throughout the codebase
- Error Handling: Enhanced exception handling and logging
Enhanced Training Pipeline
- Multi-Dataset Support: Support for DGTA, Bambenek, CIC, DGArchive, and heiCLOUD datasets
- Feature Extraction: Comprehensive domain name feature engineering
- Model Export: Automated model packaging with SHA256 checksums
- Hyperparameter Optimization: Optuna-based hyperparameter tuning (sketched below)
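A compact example of the Optuna pattern used for tuning; the search space and synthetic data below are illustrative, not the project's actual training setup:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # toy data

def objective(trial: optuna.Trial) -> float:
    clf = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 300),
        max_depth=trial.suggest_int("max_depth", 2, 16),
        random_state=0,
    )
    return cross_val_score(clf, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```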
Configuration Updates
- Time Window Configuration: 20ms time windows for high-resolution anomaly detection
- Model Parameters: Updated default thresholds and model configurations
- Kafka Topics: Standardized topic naming convention
Known Limitations
- IPv6 Support: Limited IPv6 subnet handling (requires manual configuration)
- Model Format: Currently supports pickle-based model serialization only
- Real-time Constraints: Processing latency dependent on batch size and model complexity
- Resource Requirements: Memory usage scales with batch size and model complexity
Upgrade Path
This is the first release candidate. The final 1.0.0 release will maintain backward compatibility for:
- Configuration file formats
- Kafka topic structures
- ClickHouse schema definitions
- Model artifact formats
Contributing
We welcome contributions! Please see our contributing guidelines for:
- Code style and formatting requirements
- Testing procedures and coverage requirements
- Documentation standards
- Model contribution guidelines
License
heiDGAF is released under the EUPL-1.2 license. Pre-trained models are also licensed under EUPL-1.2.
Repository: GitHub
Documentation: Read the Docs
Support: GitHub Issues