Configuration¶

Drakkar configuration is built on pydantic-settings and supports YAML files, environment variable overrides, and inline Python objects. Every option has a sensible default so you can start with a minimal config and grow from there.

Looking for an annotated example?

The companion Config Reference page is a single drakkar.yaml showing every field with a one-line comment and its DK_* env-var override. Use it to scan what’s available; come back here for the deep tables and prose.

Configuration Loading¶

Drakkar resolves its configuration from three sources, applied in order. Later sources override earlier ones:

Built-in defaults – every field has a default value in the Pydantic models.
YAML file – structured configuration loaded from disk.
Environment variables – DK_-prefixed env vars, deep-merged on top of YAML.

YAML File Resolution¶

The YAML file path is determined by the first match:

The config_path argument passed to DrakkarApp.
The DK_CONFIG environment variable.
If neither is set, Drakkar runs with env-only configuration (all defaults + env overrides).

# Option 1: explicit path
app = DrakkarApp(handler=MyHandler(), config_path='drakkar.yaml')

# Option 2: env var (set DK_CONFIG=/etc/drakkar/config.yaml)
app = DrakkarApp(handler=MyHandler())

# Option 3: inline config object (no file needed)
from drakkar.config import DrakkarConfig, KafkaConfig, ExecutorConfig

app = DrakkarApp(
    handler=MyHandler(),
    config=DrakkarConfig(
        kafka=KafkaConfig(brokers='kafka:9092', source_topic='my-events'),
        executor=ExecutorConfig(max_executors=8),
    ),
)

Note

When you pass a config object directly, no YAML file or env var loading occurs – you have full control.

Environment Variable Overrides¶

Environment variables use the DK_ prefix with __ (double underscore) as the nesting delimiter:

# Override kafka.brokers
export DK_KAFKA__BROKERS=kafka-prod:9092

# Override executor.max_executors
export DK_EXECUTOR__MAX_EXECUTORS=16

# Override debug.port
export DK_DEBUG__PORT=9000

Env vars are parsed into a nested dict structure and deep-merged on top of the YAML values. Leaf values from env vars always win over YAML values; nested dicts are merged recursively.

Loading Order Summary¶

defaults --> YAML file --> environment variables (deep merge)

Root Config (`DrakkarConfig`)¶

Top-level settings that control worker identity and cluster grouping.

Field	Type	Default	Description
`worker_name_env`	`str`	`'WORKER_ID'`	Name of the environment variable that holds the worker name. Used in logs, metrics, and the debug UI. If the env var is empty or unset, falls back to `drakkar-{hex_id}`.
`cluster_name`	`str`	`''`	Logical cluster name for grouping workers in the debug UI. Workers with the same cluster name are displayed together and can cross-trace messages.
`cluster_name_env`	`str`	`''`	Name of an environment variable that holds the cluster name. If set and non-empty, overrides the static `cluster_name` value.

worker_name_env: HOSTNAME
cluster_name: search-cluster
cluster_name_env: DK_CLUSTER

Kafka Source (`kafka:`)¶

Settings for the Kafka consumer that reads input messages.

Field	Type	Default	Description
`brokers`	`str`	`'localhost:9092'`	Kafka bootstrap servers (comma-separated for multiple brokers). Also used as the fallback for sink and DLQ brokers when they are left empty.
`source_topic`	`str`	`'input-events'`	The Kafka topic to consume messages from.
`consumer_group`	`str`	`'drakkar-workers'`	Consumer group ID. All workers in the same group share partition assignments.
`max_poll_records`	`int`	`100`	Maximum number of messages returned per poll batch. Higher values improve throughput; lower values reduce latency.
`max_poll_interval_ms`	`int`	`300000`	Maximum time (ms) between poll calls before Kafka considers the consumer dead and triggers a rebalance. Increase this if your tasks take a long time.
`session_timeout_ms`	`int`	`45000`	Session timeout (ms) for group membership. If the broker does not receive a heartbeat within this window, the consumer is removed from the group.
`heartbeat_interval_ms`	`int`	`3000`	Interval (ms) between heartbeats sent to the broker. Should be less than `session_timeout_ms / 3`.
`startup_align_enabled`	`bool`	`true`	When `true`, delay the first Kafka subscribe until a shared wall-clock boundary so a fleet of workers converges on a single rebalance. Disable for single-process dev runs.
`startup_min_wait_seconds`	`float`	`4.0`	Minimum seconds to sleep before aligning. Acts as a buffer for slow init (DB connects, schema migrations, cache warm-up). Must be `>= 0`.
`startup_align_interval_seconds`	`int`	`10`	Alignment interval in seconds. Workers wake at the next `time.time() % interval == 0` boundary — default `10` aligns on `:00/:10/:20/:30/:40/:50` of every minute. Must be `>= 1`.

kafka:
  brokers: kafka-1:9092,kafka-2:9092
  source_topic: search-requests
  consumer_group: search-workers
  max_poll_records: 200
  max_poll_interval_ms: 600000
  session_timeout_ms: 60000
  heartbeat_interval_ms: 5000
  # Rolling-deploy: serialize group rebalances by waking all workers at
  # the same wall-clock moment instead of whenever each finishes init.
  startup_align_enabled: true
  startup_min_wait_seconds: 4.0
  startup_align_interval_seconds: 10

Staggered startup alignment¶

During a rolling deploy, workers come up one at a time over a span of seconds. Each fresh subscribe call triggers a Kafka consumer-group rebalance, which stalls consumption on every other worker in the group. A fleet of 10 workers that boots over ~15 seconds can cause 10 cascading rebalances and several seconds of effective downtime.

With startup_align_enabled: true (the default), each worker runs its normal init (on_startup, cache engine, recorder, periodic tasks, sink connects) and then — before calling consumer.subscribe() — waits until:

At least startup_min_wait_seconds have elapsed (lets slow-init workers catch up to fast ones).
The wall-clock Unix-epoch seconds are a multiple of startup_align_interval_seconds (default 10 — i.e. :00/:10/:20/:30/:40/:50).

Workers whose init completes anywhere in the same 10-second window then all subscribe at the next boundary within one “tick” of each other, collapsing N rebalances into 1.

The sleep window is logged as startup_align_waiting / startup_align_done lifecycle events with the target instant so deploy runbooks can account for the deliberate pause.

Tuning: - Larger fleets: keep startup_align_interval_seconds: 10 (default) — works well up to dozens of workers. - Slow-init workers (seconds of migrations): raise startup_min_wait_seconds so the boundary is likely to fall AFTER the slowest worker’s init. - Very small clusters / dev iteration: set startup_align_enabled: false to skip the pause entirely.

Executor Pool (`executor:`)¶

Controls the subprocess executor pool that runs user-defined binaries.

Field	Type	Default	Constraints	Description
`binary_path`	`str \\| None`	`None`	min length 1 if set	Default binary path for all tasks. If `None`, each ExecutorTask returned by arrange() must provide its own `binary_path`, otherwise the task fails with a clear error. See Binary Path Resolution.
`env`	`dict[str, str]`	`{}`		Environment variables passed to all executor subprocesses. Merged on top of the (filtered) parent process env. Per-task `ExecutorTask.env` overrides these on conflict. See Environment Variables.
`env_inherit_parent`	`bool`	`true`		When `true`, the parent process env is passed to subprocesses (with `env_inherit_deny` patterns applied). Set `false` to run subprocesses with only `executor.env` + `ExecutorTask.env` — fully isolated from the parent env.
`env_inherit_deny`	`list[str]`	see below		Case-insensitive `fnmatch` patterns matched against parent env var names. Matching vars are not inherited by subprocesses even when `env_inherit_parent` is `true`. Default excludes `DK_` internals and common secret names so operator-configured secrets never leak to executor binaries. Set to `[]` to trust the full parent env. Default patterns: `DK_`, `PASSWORD`, `SECRET`, `TOKEN`, `_KEY`, `_DSN`, `CREDENTIAL`.
`max_executors`	`int`	`4`	>= 1	Maximum number of concurrent subprocesses. Controls the `asyncio.Semaphore` size – tasks beyond this limit wait in a queue. See Concurrency and Backpressure.
`task_timeout_seconds`	`int`	`120`	>= 1	Wall-clock timeout (seconds) per subprocess. If a process exceeds this, it is killed and treated as a failure.
`window_size`	`int`	`100`	>= 1	Maximum number of messages collected per arrange() window. Larger windows allow more batching in `arrange()`; smaller windows reduce latency.
`max_retries`	`int`	`3`	>= 0	Maximum number of retry attempts per failed task (0 = no retries). A task can run up to `max_retries + 1` times total.
`drain_timeout_seconds`	`int`	`30`	>= 1	Maximum time (seconds) to wait for in-flight tasks during shutdown or partition revocation. When drain times out, offsets for still-in-flight tasks are not committed — those messages will replay on restart (at-least-once). Tune together with `task_timeout_seconds`.
`backpressure_high_multiplier`	`int`	`32`	>= 1	Multiplier for the pause threshold. When total queued messages reach `max_executors * backpressure_high_multiplier`, Kafka consumption is paused.
`backpressure_low_multiplier`	`int`	`4`	>= 1	Multiplier for the resume threshold. When total queued messages drop to `max(1, max_executors * backpressure_low_multiplier)`, Kafka consumption resumes.

Backpressure Formula¶

Backpressure prevents unbounded memory growth by pausing Kafka consumption when too many messages are buffered:

high_watermark = max_executors * backpressure_high_multiplier
low_watermark  = max(1, max_executors * backpressure_low_multiplier)

With defaults (max_executors=4):

High watermark = 4 * 32 = 128 – pause consumption
Low watermark = max(1, 4 * 4) = 16 – resume consumption

The gap between high and low watermarks prevents rapid pause/resume oscillation (hysteresis).

executor:
  binary_path: /usr/local/bin/my-processor
  max_executors: 8
  task_timeout_seconds: 300
  window_size: 50
  max_retries: 5
  drain_timeout_seconds: 10
  backpressure_high_multiplier: 16
  backpressure_low_multiplier: 2

Sinks (`sinks:`)¶

Sinks define where processed results are delivered. See Sinks for payload models, routing, and the delivery lifecycle. Each sink type is a dictionary mapping instance names to their configuration. You can configure multiple instances of the same type (e.g., two Kafka sinks writing to different topics).

Kafka Sink (`sinks.kafka.<name>`)¶

Produces messages to a Kafka topic.

Field	Type	Default	Description
`topic`	`str`	(required)	Target Kafka topic for output messages.
`brokers`	`str`	`''`	Kafka brokers for this sink. If empty, inherits from `kafka.brokers` (same cluster as the source).
`ui_url`	`str`	`''`	URL to a web UI for this sink (e.g., Kafka UI, Kowl). Displayed as a link in the debug dashboard.

PostgreSQL Sink (`sinks.postgres.<name>`)¶

Inserts rows into a PostgreSQL database via an asyncpg connection pool.

Field	Type	Default	Constraints	Description
`dsn`	`str`	(required)		PostgreSQL connection string (e.g., `postgresql://user:pass@host:5432/db`).
`pool_min`	`int`	`2`	>= 1	Minimum number of connections in the pool.
`pool_max`	`int`	`10`	>= 1	Maximum number of connections in the pool.
`ui_url`	`str`	`''`		URL to a database management UI (e.g., pgAdmin).

MongoDB Sink (`sinks.mongo.<name>`)¶

Inserts documents into a MongoDB database via motor AsyncIOMotorClient.

Field	Type	Default	Description
`uri`	`str`	(required)	MongoDB connection URI (e.g., `mongodb://host:27017`).
`database`	`str`	(required)	Target database name.
`ui_url`	`str`	`''`	URL to a MongoDB management UI (e.g., Mongo Express).

HTTP Sink (`sinks.http.<name>`)¶

Sends JSON payloads to an HTTP endpoint.

Field	Type	Default	Constraints	Description
`url`	`str`	(required)	`http://` or `https://` scheme; non-empty host; must not be a cloud metadata endpoint	Target URL for HTTP requests. Validated at config load time.
`method`	`str`	`'POST'`		HTTP method to use.
`timeout_seconds`	`int`	`30`	>= 1	Request timeout in seconds.
`headers`	`dict[str, str]`	`{}`		Additional HTTP headers sent with each request.
`max_retries`	`int`	`3`	>= 0	Maximum retry attempts for failed HTTP requests.
`ui_url`	`str`	`''`		URL to a related web UI.

Cloud metadata endpoints are rejected

To prevent accidental IAM-credential leaks via SSRF-like misconfiguration, the following hosts cannot be used as an HTTP sink target: 169.254.169.254 (AWS / Azure / GCP / Alibaba / OpenStack IMDS), metadata.google.internal, metadata.packet.net, 100.100.100.200 (Alibaba), 192.0.0.192 (Oracle). Private, loopback, and internal hostnames are not blocked — internal webhook services remain legitimate targets.

Redis Sink (`sinks.redis.<name>`)¶

Sets key-value pairs in Redis.

Field	Type	Default	Description
`url`	`str`	`'redis://localhost:6379/0'`	Redis connection URL.
`key_prefix`	`str`	`''`	Prefix prepended to all keys (e.g., `cache:` produces keys like `cache:my-key`).
`ui_url`	`str`	`''`	URL to a Redis management UI (e.g., RedisInsight).

Filesystem Sink (`sinks.filesystem.<name>`)¶

Appends JSONL lines to files on disk.

Field	Type	Default	Description
`base_path`	`str`	`''`	Optional base directory. Individual payloads specify their own full paths.
`ui_url`	`str`	`''`	URL to a file browser or related UI.

Example: Multiple Named Sinks¶

sinks:
  kafka:
    search-results:
      topic: search-results
    analytics:
      topic: analytics-events
      brokers: kafka-analytics:9092

  postgres:
    main-db:
      dsn: postgresql://user:pass@db:5432/myapp
      pool_min: 5
      pool_max: 20
      ui_url: http://pgadmin.internal:5050

  mongo:
    logs:
      uri: mongodb://mongo:27017
      database: app_logs

  http:
    webhook:
      url: https://api.example.com/webhook
      method: POST
      timeout_seconds: 15
      headers:
        Authorization: "Bearer ${API_TOKEN}"
      max_retries: 5

  redis:
    cache:
      url: redis://redis:6379/0
      key_prefix: "result:"
      ui_url: http://redis-insight.internal:8001

  filesystem:
    archive:
      base_path: /data/archive

Circuit Breaker (`sinks.circuit_breaker`)¶

Every registered sink has a per-instance circuit breaker. The breaker trips after failure_threshold consecutive terminal failures (retries exhausted + DLQ action) and skips all deliveries for cooldown_seconds before allowing a single probe through. A successful probe closes the breaker; a failing probe reopens with a fresh cooldown. See Sinks → Circuit Breaker for the full state machine and metrics.

Field	Type	Default	Constraints	Description
`failure_threshold`	`int`	`5`	>= 1	Consecutive terminal failures required to trip the breaker. A `SKIP` outcome does NOT count — it’s operator intent, not a health signal.
`cooldown_seconds`	`float`	`30.0`	>= 0.1	Seconds the breaker stays open before promoting to half-open and allowing a single probe through.

sinks:
  circuit_breaker:
    failure_threshold: 5
    cooldown_seconds: 30.0

The breaker defaults are reasonable for most deployments — lower failure_threshold in latency-sensitive pipelines where one stuck sink would block the whole delivery fan-out; raise cooldown_seconds when probing a recovering downstream is itself expensive.

Dead Letter Queue (`dlq:`)¶

Failed sink deliveries can be routed to a DLQ Kafka topic. The DLQ captures the original payloads, error details, and metadata for later inspection or reprocessing.

Field	Type	Default	Description
`topic`	`str`	`''`	DLQ Kafka topic name. If empty, auto-derived as `{source_topic}_dlq` (e.g., `input-events_dlq`).
`brokers`	`str`	`''`	Kafka brokers for the DLQ. If empty, inherits from `kafka.brokers`.

dlq:
  topic: failed-events
  brokers: kafka-prod:9092

Metrics (`metrics:`)¶

Prometheus metrics endpoint configuration.

Field	Type	Default	Constraints	Description
`enabled`	`bool`	`true`		Enable or disable the Prometheus metrics HTTP server.
`port`	`int`	`9090`	1–65535	Port for the Prometheus metrics endpoint.

metrics:
  enabled: true
  port: 9090

Logging (`logging:`)¶

Structured logging configuration via structlog.

Field	Type	Default	Constraints	Description
`level`	`str`	`'INFO'`		Log level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
`format`	`str`	`'json'`	`json` or `console`	Output format. `json` produces machine-readable structured logs. `console` produces colorized human-readable output for local development.
`output`	`str`	`'stderr'`		Log destination: `stderr`, `stdout`, or a file path. File paths support `{worker_id}` and `{cluster_name}` template variables. Parent directories are created automatically.

logging:
  level: INFO
  format: json
  output: stderr  # or stdout, or /var/log/drakkar/{worker_id}.log

Debug / Flight Recorder (`debug:`)¶

The debug subsystem provides a flight recorder (SQLite-backed event log), a web UI dashboard, WebSocket live streaming, and worker autodiscovery. This is the largest configuration section.

Authentication¶

The debug UI’s auth is opt-in by default. With debug.enabled=true and debug.auth_token empty (the default), the worker emits a structured debug_ui_unauthenticated warning at startup naming the bound host:port and the two opt-in paths (YAML key + env var), then continues starting normally. The UI is read-only by design — no endpoint stops a worker, replays Kafka messages, mutates sinks, or fakes pipeline data — and Drakkar is intended to run inside a private contour (VPC / internal cluster / operator-only ingress), so the framework treats “unauthenticated + warned” as a reasonable starting point rather than a misconfiguration.

To require auth on the protected endpoints (database download, merge, message probe) and the WebSocket live-event stream, set a strong token:

Generate a strong value: python -c "import secrets; print(secrets.token_urlsafe(32))".
Configure via YAML: set debug.auth_token: <value> in your config.
Or via environment: export DK_DEBUG__AUTH_TOKEN=<value> (overrides YAML when both are set).

When the token is set, protected endpoints reject requests without a matching Authorization: Bearer <token> header (or ?token=<token> query parameter); the WebSocket additionally validates the Origin header against allowed_ws_origins (or the request’s Host header). Comparison uses secrets.compare_digest for timing-side-channel safety; leading/trailing whitespace in the configured token is stripped on load (a auth_token: " " of only spaces is treated as empty and the warning still fires).

The unauthenticated-startup warning runs at DrakkarApp._async_run() startup, before the recorder and debug server are constructed. See drakkar/app_security.py::warn_if_debug_unauthenticated for the implementation.

Core Settings¶

Field	Type	Default	Constraints	Description
`enabled`	`bool`	`true`		Enable or disable the entire debug feature. Set to `false` to skip the flight recorder, web UI, and all associated overhead.
`host`	`str`	`'127.0.0.1'`		Bind address for the debug server. Default `127.0.0.1` (localhost only). Use `0.0.0.0` to expose on all interfaces. Auth is opt-in regardless of host — when binding to a non-loopback address inside anything other than a fully-trusted private network, set `auth_token` (see below).
`port`	`int`	`8080`	1–65535	Port for the debug web UI (FastAPI server).
`auth_token`	`str`	`''`		Bearer token for sensitive debug endpoints (database download, merge, message probe) and for the WebSocket live-event stream at `/ws`. Empty (the default) disables auth entirely — every endpoint is reachable without credentials and the WebSocket skips both token and Origin checks. This is intentional: the UI is read-only and intended for private-network deployments, and a startup warning (`debug_ui_unauthenticated`) names the unauthenticated posture in logs. When set to a non-empty value, protected HTTP endpoints require `Authorization: Bearer <token>` header or `?token=<token>` query parameter; WebSocket connections without a valid token are closed with code 4401. Comparison uses `secrets.compare_digest` to avoid timing side-channels. Leading/trailing whitespace is stripped on config load so `auth_token: " secret "` in YAML still works (and a token of only spaces is treated as empty). Read-only pages (dashboard, live, partitions, sinks, history) are always accessible.
`allowed_ws_origins`	`list[str]`	`[]`		Explicit allowlist of WebSocket `Origin` header values. Only consulted when `auth_token` is set (empty token = no origin check, dev workflow preserved). Empty list + non-empty `auth_token` = same-origin fallback: Origin host must match the `Host` header. Non-empty list = strict allowlist; any Origin not in the list is rejected with close code 4403. Comparison is case-insensitive and normalizes default ports (`:80` for http, `:443` for https) so `https://ops.internal` and `https://ops.internal:443` are equivalent. Missing Origin header (non-browser clients) is always accepted – the token check already authenticated them.
`debug_url`	`str`	`''`		External URL for the debug UI. Used when workers discover each other – if set, this URL is advertised instead of the auto-detected `http://{ip}:{port}`. Useful behind load balancers or Kubernetes ingresses.
`db_dir`	`str`	`'/tmp'`		Directory for SQLite database files. Set to `''` to run without any disk persistence (in-memory only, WebSocket streaming still works). Use a shared filesystem (e.g., NFS, EFS) for cross-worker autodiscovery and merge.

Persistence Flags¶

These flags control which tables are created in the SQLite database. All require db_dir to be non-empty. Any combination is valid.

Field	Type	Default	Description
`store_events`	`bool`	`true`	Write processing events (consumed, task_started, task_completed, task_failed, etc.) to the `events` table. Disable to reduce disk I/O on high-throughput workers.
`store_config`	`bool`	`true`	Write worker configuration to the `worker_config` table. This enables autodiscovery – other workers sharing the same `db_dir` can find and link to this worker.
`store_state`	`bool`	`true`	Periodically snapshot worker state (uptime, partitions, pool utilization, queue depth, counters) to the `worker_state` table.
`state_sync_interval_seconds`	`int`	`10`	>= 1

Exposed Environment Variables¶

Field	Type	Default	Description
`expose_env_vars`	`list[str]`	`[]`	List of environment variable names to capture and store in `worker_config.env_vars_json`. Useful for recording deployment metadata (e.g., `['GIT_SHA', 'DEPLOY_ENV', 'K8S_POD_NAME']`).

Database Rotation and Retention¶

Field	Type	Default	Constraints	Description
`rotation_interval_minutes`	`int`	`60`	>= 1	How often (minutes) to rotate the SQLite database file. On rotation, a new timestamped file is created and the old one is finalized. A `-live.db` symlink always points to the current file.
`retention_hours`	`int`	`24`	>= 1	Delete rotated database files older than this many hours.
`retention_max_events`	`int`	`100000`	>= 100	Upper bound on total events across all DB files. Also controls the maximum number of retained DB files (`max_files = retention_max_events / 10000`).

Output Storage¶

Field	Type	Default	Constraints	Description
`store_output`	`bool`	`true`		Include subprocess stdout/stderr in event records. Disable to save disk space when output is large or not needed for debugging.
`flush_interval_seconds`	`int`	`5`	>= 1	How often (seconds) the in-memory event buffer is flushed to SQLite.
`max_buffer`	`int`	`50000`	>= 1000	Maximum number of events held in the in-memory buffer. When full, oldest events are dropped (ring buffer).
`max_flush_retries`	`int`	`3`	>= 1	How many times a flush batch is re-queued on transient `OperationalError` (database is locked, disk I/O error, etc.) before the batch is dropped. On drop, `drakkar_recorder_flush_batches_dropped_total` ticks; on each retry `drakkar_recorder_flush_retries_total` ticks.
`max_ui_rows`	`int`	`5000`	>= 100	Maximum number of rows returned to the debug web UI in list views.

Duration Thresholds¶

These thresholds control which events are recorded, logged, and streamed. See Duration Thresholds for detailed behavior and Performance Tuning for recommendations. They help reduce noise in high-throughput systems where most tasks complete quickly.

Field	Type	Default	Constraints	Description
`log_min_duration_ms`	`int`	`500`	>= 0	Minimum task duration (ms) to emit a `slow_task_completed` or `slow_task_failed` log message. Set to `0` to log all tasks.
`ws_min_duration_ms`	`int`	`500`	>= 0	Minimum task duration (ms) to broadcast via WebSocket to the live UI. Fast tasks that complete under this threshold are invisible in the live view (reduces UI noise). Failed tasks always appear regardless. Set to `0` to show all tasks.
`event_min_duration_ms`	`int`	`0`	>= 0	Minimum task duration (ms) to persist to the SQLite database. Set above `0` to skip storing fast tasks entirely.
`output_min_duration_ms`	`int`	`500`	>= 0	Minimum task duration (ms) to include stdout/stderr in the persisted event record. Tasks under this threshold are recorded but without output data.

Prometheus Integration¶

These settings add clickable Prometheus graph links to the debug dashboard.

Field	Type	Default	Description
`prometheus_url`	`str`	`''`	Base URL of your Prometheus server (e.g., `http://prometheus:9090`). If empty, no Prometheus links are shown in the debug UI.
`prometheus_rate_interval`	`str`	`'5m'`	Rate interval used in PromQL `rate()` expressions for dashboard links (e.g., `1m`, `5m`, `15m`).
`prometheus_worker_label`	`str`	`''`	PromQL label filter for worker-scoped queries. Supports template variables: `{worker_id}`, `{cluster_name}`, `{metrics_port}`, `{debug_port}`. If empty, defaults to `instance="{hostname}:{metrics_port}"`. Example: `worker_id="{worker_id}"`.
`prometheus_cluster_label`	`str`	`''`	PromQL label filter for cluster-wide queries. Supports the same template variables. Example: `cluster="{cluster_name}"`. If empty, cluster-wide links are not shown.

Custom Links¶

Field	Type	Default	Description
`custom_links`	`list[dict[str, str]]`	`[]`	List of custom links displayed in the debug dashboard navigation. Each entry is a dict with `name` and `url` keys. URL values support template variables: `{worker_id}`, `{cluster_name}`, `{metrics_port}`, `{debug_port}`.

debug:
  enabled: true
  port: 8080
  debug_url: https://debug.example.com/worker-1
  db_dir: /shared/drakkar-debug
  store_events: true
  store_config: true
  store_state: true
  state_sync_interval_seconds: 10
  expose_env_vars:
    - GIT_SHA
    - DEPLOY_ENV
    - K8S_POD_NAME
  rotation_interval_minutes: 60
  retention_hours: 48
  retention_max_events: 200000
  store_output: true
  flush_interval_seconds: 5
  max_buffer: 50000
  max_flush_retries: 3
  max_ui_rows: 5000
  log_min_duration_ms: 1000
  ws_min_duration_ms: 500
  event_min_duration_ms: 100
  output_min_duration_ms: 1000
  prometheus_url: http://prometheus:9090
  prometheus_rate_interval: 5m
  prometheus_worker_label: 'worker_id="{worker_id}"'
  prometheus_cluster_label: 'cluster="{cluster_name}"'
  custom_links:
    - name: Grafana Dashboard
      url: http://grafana:3000/d/drakkar?var-worker={worker_id}
    - name: Kibana Logs
      url: http://kibana:5601/app/discover#/?_a=(query:(match_phrase:(worker_id:'{worker_id}')))

Cache (`cache:`)¶

A handler-accessible key/value cache with in-memory hot reads, write-behind SQLite persistence, and optional cross-worker peer-sync. Disabled by default – when enabled: false, handler.cache is a no-op stub so handler code can call self.cache.set(...) unconditionally without if guards.

See Cache for the full API (set / peek / get / delete) and the periodic loops (flush, cleanup, peer-sync).

Cache Settings¶

Field	Type	Default	Constraints	Description
`enabled`	`bool`	`false`		Master switch. When `false`, the cache is a no-op stub. When `true` without a `db_dir` (anywhere), the engine warns and continues without persistence – in-memory only.
`db_dir`	`str`	`''`		Directory for the per-worker `<worker_id>-cache.db` SQLite file. Empty falls back to `debug.db_dir`. Use a shared filesystem (NFS, EFS) for peer-sync to discover other workers’ cache files.
`flush_interval_seconds`	`float`	`3.0`	> 0	Interval (seconds) for the write-behind loop that drains dirty in-memory entries to SQLite. Lower = less data loss on crash; higher = less write amplification.
`cleanup_interval_seconds`	`float`	`60.0`	> 0	Interval (seconds) for the loop that deletes rows whose `expires_at_ms` has passed and refreshes Prometheus DB-size gauges.
`max_memory_entries`	`int \\| null`	`10000`	>= 1 or `null`	Cap for the in-memory LRU. `null` = unbounded (a startup warning fires so the choice is visible in logs). The DB is the source of truth, so eviction never loses data – evicted entries re-warm on the next `get()`.

Peer Sync (`cache.peer_sync:`)¶

The peer-sync loop pulls recent entries from sibling workers’ -cache.db files (LWW merge by updated_at_ms). Requires debug.store_config: true for autodiscovery – if disabled, peer sync silently no-ops.

Field	Type	Default	Constraints	Description
`enabled`	`bool`	`true`		When `false`, only the local SQLite is used – no cross-worker propagation. Flush and cleanup loops still run.
`interval_seconds`	`float`	`30.0`	> 0	Interval (seconds) between peer-sync cycles.
`batch_size`	`int`	`500`	>= 1	Maximum rows pulled from each peer per cycle.
`timeout_seconds`	`float`	`5.0`	> 0	Per-peer read timeout (seconds). One slow peer cannot block the rest.
`cycle_deadline_seconds`	`float \\| null`	`null`	>= 0.1 and `<` `interval_seconds`	Hard wall-clock cap on a single sync cycle. `null` derives `interval_seconds * 0.9`. Must be strictly less than `interval_seconds` – config load fails otherwise so the misconfiguration surfaces at startup.

cache:
  enabled: true
  db_dir: /shared/drakkar-cache       # empty falls back to debug.db_dir
  flush_interval_seconds: 3.0
  cleanup_interval_seconds: 60.0
  max_memory_entries: 10000           # null = unbounded (warns)
  peer_sync:
    enabled: true
    interval_seconds: 30.0
    batch_size: 500
    timeout_seconds: 5.0
    cycle_deadline_seconds: null      # null = interval_seconds * 0.9

Webapp (`webapp:`)¶

Optional synchronous-HTTP entry point. Disabled by default – when enabled: false, no FastAPI server runs and the handler’s HTTP hooks are never invoked. Webapp users declare HttpRequestT / HttpResponseT as the third and fourth generic parameters of BaseDrakkarHandler; missing types raise ConfigurationError at startup.

See Webapp for the full feature guide (enabling, hooks, request/response shape, status codes, shutdown semantics).

Webapp Settings¶

Field	Type	Default	Constraints	Description
`enabled`	`bool`	`false`		Master switch. When `false`, the FastAPI server is not started and the HTTP hooks are not invoked.
`host`	`str`	`'0.0.0.0'`		Interface uvicorn binds. Use `'127.0.0.1'` for host-private deployments.
`port`	`int`	`8090`		Port uvicorn binds. Distinct from the metrics and debug-UI ports.
`path`	`str`	`'/process'`	starts with `'/'`, length > 1	Single POST route the framework registers.
`sinks_enabled`	`bool`	`false`		When `true`, calls `on_message_complete` after the executor fan-out and routes returned `CollectResult` payloads through the SinkManager. When `false`, sinks are skipped and the response carries `sinks: null`.
`request_timeout_seconds`	`float`	`30.0`	> 0	Per-request budget enforced via `asyncio.wait_for` on the webapp loop. On timeout the client receives a 504 and the runner’s post-execute hooks are cooperatively cancelled.
`max_concurrent`	`int`	`64`	> 0	Per-worker semaphore capacity for in-flight HTTP requests. The 65^th concurrent request returns 503 `status='capacity'` immediately rather than queuing.
`clients`	`list[WebClientConfig]`	one anonymous client (`name='anonymous'`, `token=''`, `rpm=4`)	length >= 1	Configured tenants. Empty `clients: []` fails at config load.

Webapp Clients (`webapp.clients[]`)¶

Field	Type	Default	Constraints	Description
`name`	`str`	required	non-empty	Tenant name. Used in metric labels (`drakkar_webapp_requests_total{client=...}`), recorder rows, and the response body.
`token`	`str`	`''`	at most one client may have empty token; non-empty tokens unique	Bearer token presented in `Authorization: Bearer <token>`. Empty token = anonymous slot for requests without an `Authorization` header.
`rpm`	`int`	`4`	> 0	Per-client requests-per-minute cap, enforced on a 60-second sliding window.

webapp:
  enabled: true
  host: 0.0.0.0
  port: 8090
  path: /process
  sinks_enabled: false
  request_timeout_seconds: 30.0
  max_concurrent: 64
  clients:
    - name: anonymous
      token: ""
      rpm: 4
    - name: tenant-a
      token: "secret-tenant-a-token"
      rpm: 60

When every configured client has an empty token, the worker logs a webapp_unauthenticated_warning at startup so private-network deployments that should have had a token configured surface in log aggregation.

Annotated `drakkar.yaml` example¶

A copy-paste-ready YAML showing every field with one-line comments and the matching DK_* env-var override sits on its own page: Config Reference. Use this page (Configuration) for the deep tables and prose; use the Reference for a quick scan of “what can I change here?”

Configuration¶

Configuration Loading¶

YAML File Resolution¶

Environment Variable Overrides¶

Loading Order Summary¶

Root Config (DrakkarConfig)¶

Kafka Source (kafka:)¶

Staggered startup alignment¶

Executor Pool (executor:)¶

Backpressure Formula¶

Sinks (sinks:)¶

Kafka Sink (sinks.kafka.<name>)¶

PostgreSQL Sink (sinks.postgres.<name>)¶

MongoDB Sink (sinks.mongo.<name>)¶

HTTP Sink (sinks.http.<name>)¶

Redis Sink (sinks.redis.<name>)¶

Filesystem Sink (sinks.filesystem.<name>)¶