# Config Calculator
Interactive calculator that recommends Drakkar configuration based on your hardware and workload characteristics. See Performance Tuning for detailed bottleneck analysis and tuning strategies.
**Starting point, not a final answer.** These are estimated values to use as a starting point for tuning; they are not calibrated against real production workloads. Every deployment is different: binary startup cost, message size, sink latency, and OS scheduling all affect the optimal values. Use these recommendations as an initial config, then iterate based on the "What to Watch After Deploying" section below.
## What to Watch After Deploying
After applying the recommended config, monitor these metrics to understand how well the worker is performing and where to tune further.
### Is the worker keeping up?

| Metric | Good | Bad | Action |
|---|---|---|---|
| `rate(drakkar_messages_consumed_total[5m])` | Stable or matching the production rate | Dropping or zero | Check whether the consumer is paused (backpressure) or partitions were revoked |
| `drakkar_backpressure_active` | 0 most of the time | Stuck at 1 | Increase `max_executors` or add horizontal workers |
| `drakkar_total_queued` | Stable, not growing | Growing over time | Processing rate < production rate. Scale up or batch in `arrange()` |
### Is the executor pool sized correctly?

| Metric | Good | Bad | Action |
|---|---|---|---|
| `drakkar_executor_pool_active` | 50-80% of `max_executors` on average | Pegged at `max_executors` constantly | Add more slots or more workers |
| `rate(drakkar_executor_idle_slot_seconds_total[5m])` | Near zero | High (slots idle while messages wait) | `arrange()` is the bottleneck: make it faster or reduce `window_size` |
| `rate(drakkar_consumer_idle_seconds_total[5m])` | Low when the topic has data | High while the topic has data | Worker is over-provisioned for this workload |
### Are tasks healthy?

| Metric | Good | Bad | Action |
|---|---|---|---|
| `histogram_quantile(0.95, rate(drakkar_executor_duration_seconds_bucket[5m]))` | Close to the expected p95 | Much higher than expected | Binary performance degraded, or resource contention |
| `rate(drakkar_executor_tasks_total{status="failed"}[5m])` | Near zero or matching the expected failure rate | Spiking | Check `on_error()` logic, binary health, input data |
| `rate(drakkar_executor_timeouts_total[5m])` | Zero | Non-zero | Increase `task_timeout_seconds` or investigate stuck processes |
| `rate(drakkar_task_retries_total[5m])` | Low | High | Transient failures are frequent: check error patterns |
### Are sinks healthy?

| Metric | Good | Bad | Action |
|---|---|---|---|
| `rate(drakkar_sink_deliver_errors_total[5m])` | Zero | Non-zero | Check sink connectivity, credentials, capacity |
| `histogram_quantile(0.95, rate(drakkar_sink_deliver_duration_seconds_bucket[5m]))` | Low (< 100 ms for Kafka/Redis, < 500 ms for Postgres/HTTP) | High or spiking | Sink is overloaded or the network is degraded |
| `rate(drakkar_sink_dlq_messages_total[5m])` | Zero | Non-zero | Deliveries are failing and being routed to the DLQ: investigate sink errors |
| `rate(drakkar_dlq_send_failures_total[5m])` | Zero | Non-zero | Critical: both the sink and the DLQ failed, so data is being lost |
### Is the handler code fast enough?

| Metric | Good | Bad | Action |
|---|---|---|---|
| `histogram_quantile(0.95, rate(drakkar_handler_duration_seconds_bucket{hook="arrange"}[5m]))` | Much lower than the task duration | Comparable to or exceeding the task duration | `arrange()` is the bottleneck: cache lookups, reduce I/O |
| `histogram_quantile(0.95, rate(drakkar_handler_duration_seconds_bucket{hook="collect"}[5m]))` | Much lower than the task duration | High | `collect()` is doing too much work: move heavy logic elsewhere |
## Iterating on config

- Start with the calculator values
- Run under production-like load for 10+ minutes
- Check the tables above
- Adjust one parameter at a time, then observe for another 10 minutes
- Typical iteration cycle: `max_executors` first, then `window_size`, then the debug thresholds last (they affect observability, not throughput)
## Principles

The calculator follows these rules to derive each parameter:
### Executor sizing

`max_executors` = available cores minus 20% reserved. The reserved cores handle the asyncio event loop, the OS, the Kafka consumer, and sink I/O. Each executor slot runs one subprocess consuming one CPU core, so going beyond the available cores causes context switching without any throughput gain. The value is further capped at `partitions / workers * 4`, since there is no point having more slots than potential queued work.
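This sizing rule can be sketched as a small helper. The 20% reservation and the `partitions / workers * 4` cap come from the rule above; the function itself is illustrative, not part of Drakkar:

```python
def recommended_max_executors(cores: int, partitions: int, workers: int) -> int:
    """Illustrative sizing rule: 80% of cores, capped by potential queued work."""
    # Reserve ~20% of cores for the event loop, OS, Kafka consumer, and sink I/O.
    by_cores = max(1, int(cores * 0.8))
    # Each worker sees roughly partitions / workers partitions; a 4x factor
    # bounds the slots by how much work could actually be queued.
    by_queued_work = max(1, (partitions // workers) * 4)
    return min(by_cores, by_queued_work)

print(recommended_max_executors(cores=16, partitions=12, workers=3))  # 12: core-bound
print(recommended_max_executors(cores=32, partitions=4, workers=2))   # 8: partition-bound
```

The two example calls show each cap winning in turn: plentiful partitions make the core budget the limit, while few partitions per worker make the queued-work cap the limit.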
`window_size` targets 2-5 seconds of aggregate work per window (shorter for slow tasks, longer for fast ones). This balances two forces: larger windows reduce `arrange()` call frequency and enable batching, while smaller windows reduce commit latency, since offsets only commit after the slowest task in the window finishes.
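One way to express that target in code. This is a sketch under assumptions: `p50_task_seconds` is an assumed input (a measured median task duration), and the 3-second default is simply a midpoint of the 2-5 second range stated above:

```python
def recommended_window_size(p50_task_seconds: float, max_executors: int,
                            target_seconds: float = 3.0) -> int:
    """Size the window to hold ~target_seconds of aggregate work."""
    # With max_executors slots running in parallel, the pool completes
    # roughly max_executors / p50 tasks per second of wall clock.
    tasks_per_second = max_executors / p50_task_seconds
    return max(1, round(tasks_per_second * target_seconds))

print(recommended_window_size(p50_task_seconds=0.05, max_executors=8))   # 480: fast tasks, large window
print(recommended_window_size(p50_task_seconds=30.0, max_executors=8))   # 1: slow tasks, small window
```

This reproduces the "shorter for slow tasks, longer for fast" behavior automatically: the window shrinks as the median task duration grows.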
### Kafka consumer

`max_poll_records` = enough messages per poll to feed one window cycle across the active partitions. Too low starves the pipeline (partition queues empty between polls); too high wastes memory on queued messages that will not be processed for minutes.
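A minimal sketch of that balance, assuming a modest headroom factor (the 1.5x value is an assumption, not a documented Drakkar constant):

```python
def recommended_max_poll_records(window_size: int, headroom: float = 1.5) -> int:
    """Fetch roughly one window's worth of messages per poll, plus headroom."""
    # Too low starves the pipeline (partition queues empty between polls);
    # too high wastes memory on messages queued for minutes.
    return max(1, int(window_size * headroom))

print(recommended_max_poll_records(200))  # 300
```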
`session_timeout_ms` controls how quickly Kafka detects a dead worker. Lower means a faster rebalance, but more false positives if the event loop is temporarily busy. Fast tasks (heartbeat-friendly) use 10 s; slow tasks use 45-60 s to avoid spurious rebalances.
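As a sketch, the selection reduces to a threshold on task duration. The 10 s and 45 s values come from the rule above; the 1-second cutoff separating "fast" from "slow" is an assumption for illustration:

```python
def recommended_session_timeout_ms(p95_task_seconds: float) -> int:
    """Pick a heartbeat-friendly timeout for fast tasks, a forgiving one for slow."""
    # Fast tasks keep the event loop responsive, so a 10 s timeout is safe
    # and gives quick dead-worker detection. Slow tasks may briefly stall
    # heartbeats, so widen the timeout to avoid spurious rebalances.
    return 10_000 if p95_task_seconds < 1.0 else 45_000

print(recommended_session_timeout_ms(0.2))    # 10000
print(recommended_session_timeout_ms(120.0))  # 45000
```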
`max_poll_interval_ms` must exceed the worst-case window duration (all tasks hitting the timeout). If a window takes longer, Kafka kicks the consumer out of the group and triggers a rebalance.
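The worst case follows directly from the parameters above: the window drains in waves of `max_executors` tasks, each wave taking at most `task_timeout_seconds`. A sketch, with an assumed 1.5x safety factor on top:

```python
import math

def recommended_max_poll_interval_ms(window_size: int, max_executors: int,
                                     task_timeout_seconds: float,
                                     safety_factor: float = 1.5) -> int:
    """Exceed the worst-case window duration: every task hitting its timeout."""
    # Tasks run max_executors at a time, so the window drains in waves.
    waves = math.ceil(window_size / max_executors)
    worst_case_seconds = waves * task_timeout_seconds
    return int(worst_case_seconds * safety_factor * 1000)

print(recommended_max_poll_interval_ms(64, 8, 30.0))  # 360000 (6 minutes)
```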
### Backpressure

`backpressure_high_multiplier` and `low_multiplier` control the pause/resume hysteresis: `high_watermark = max_executors * high_mult`. Fast tasks drain quickly and need less buffer; slow tasks need more buffer but should not over-fetch, since each message represents minutes of work. The gap between the high and low watermarks prevents rapid pause/resume cycling.
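The hysteresis can be sketched as follows. The high-watermark formula is stated above; the symmetric low watermark and the example multiplier values are assumptions for illustration:

```python
def backpressure_watermarks(max_executors: int,
                            high_multiplier: float,
                            low_multiplier: float) -> tuple[int, int]:
    """Pause consuming above `high`, resume below `low`; the gap is hysteresis."""
    high = int(max_executors * high_multiplier)
    low = int(max_executors * low_multiplier)
    # The gap between high and low prevents rapid pause/resume cycling.
    if low >= high:
        raise ValueError("low_multiplier must be below high_multiplier")
    return high, low

print(backpressure_watermarks(8, high_multiplier=4.0, low_multiplier=2.0))  # (32, 16)
```

With these example values, the consumer pauses once 32 messages are queued and only resumes after the queue drains below 16, so a queue hovering near the threshold does not flap the consumer on and off.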
### Debug thresholds

All four `*_min_duration_ms` thresholds follow the same principle: hide noise, preserve signal. For fast workloads (p80 < 50 ms), the majority of tasks are routine, and only slow outliers and failures matter. For slow workloads (p80 > 500 ms), every task is worth observing. The thresholds scale with the p80.
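A sketch of that scaling. The 500 ms boundary comes from the text above; the 2x multiplier for faster workloads is an assumed illustration of "thresholds scale with the p80", not the calculator's actual formula:

```python
def debug_min_duration_ms(p80_ms: float) -> int:
    """Hide noise, preserve signal: scale the logging threshold with p80."""
    if p80_ms > 500:
        # Slow workload: every task is worth observing, so log them all.
        return 0
    # Fast/medium workload: only surface tasks well above the routine p80.
    return int(p80_ms * 2)

print(debug_min_duration_ms(30))   # 60: routine fast tasks stay hidden
print(debug_min_duration_ms(800))  # 0: every task is logged
```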
`store_output` trades disk I/O for debuggability. For fast tasks producing small stdout, the per-second write volume is high and the data is rarely needed. For slow tasks, stdout often contains the only clue about what went wrong.