# Failure detector tuning
The failure detector decides when a peer is unreachable or down. Three knobs:
```ts
Cluster.join(system, {
  host,
  port,
  seeds,
  failureDetector: {
    heartbeatIntervalMs: 500, // default
    unreachableAfterMs: 2_000, // default
    downAfterMs: 5_000, // default
  },
});
```

This page is the operational tuning guide: when to change these values, by how much, and which symptoms each one affects. For the conceptual model, see failure detector.
## The defaults

| Setting | Default | Meaning |
|---|---|---|
| `heartbeatIntervalMs` | 500 | How often heartbeats are sent. |
| `unreachableAfterMs` | 2 000 | Silence threshold for unreachable. |
| `downAfterMs` | 5 000 | Silence threshold for down. |

On LAN clusters the defaults are fine. Cross-region or noisy networks need adjustment.
## Symptom-driven tuning

### Symptom: false-positive unreachable

A member flips to unreachable, then back to reachable, then unreachable again, flapping every few seconds.

Cause: `unreachableAfterMs` is too tight for your network's jitter. Brief GC pauses or transient network delays exceed the threshold.

Fix: raise `unreachableAfterMs`.
```ts
failureDetector: {
  heartbeatIntervalMs: 500,
  unreachableAfterMs: 5_000, // ← was 2_000
  downAfterMs: 15_000, // ← was 5_000 (keep ratio similar)
}
```

### Symptom: slow failover
A node dies, but the cluster takes 10+ seconds to react. Singletons and sharded entities don't move for a long time.

Cause: `unreachableAfterMs` plus the downing-strategy delay is too long. The cluster is waiting before deciding.

Fix: lower `unreachableAfterMs` and `downAfterMs` proportionally.
```ts
failureDetector: {
  heartbeatIntervalMs: 250, // ← was 500
  unreachableAfterMs: 1_000, // ← was 2_000
  downAfterMs: 3_000, // ← was 5_000
}
```

Trade-off: more false positives on jittery networks. Measure your network's RTT and jitter first.
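To reason about failover speed before changing anything, it helps to bound the gap between a real crash and the down declaration. Silence is measured from the last heartbeat actually received, so a node that crashes right after heartbeating maximizes the window. A minimal sketch (the helper name and `downingDelayMs` parameter are illustrative, not library API):

```ts
// Hypothetical helper: bounds on the time from a real crash to the
// member being declared down, given a failure-detector config.
function downLatencyRangeMs(
  cfg: { heartbeatIntervalMs: number; downAfterMs: number },
  downingDelayMs = 0, // extra delay added by the downing strategy
): { minMs: number; maxMs: number } {
  return {
    // Crash just before the next heartbeat: silence clock already running.
    minMs: cfg.downAfterMs - cfg.heartbeatIntervalMs + downingDelayMs,
    // Crash just after a heartbeat: silence clock starts from scratch.
    maxMs: cfg.downAfterMs + downingDelayMs,
  };
}
```

With the defaults (500 / 5 000) a crash is downed after 4.5–5 s plus any downing-strategy delay; with the tuned values above (250 / 3 000), after 2.75–3 s.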
### Symptom: cross-region cluster flaps

A WAN-spanning cluster: members in the far region intermittently show unreachable, then recover seconds later.

Cause: the defaults assume LAN-style latencies (sub-millisecond RTT). On a 100 ms-RTT WAN, normal jitter plus occasional packet loss can stretch heartbeat gaps past the 2 s threshold.
Fix: substantially raise all three thresholds.
```ts
failureDetector: {
  heartbeatIntervalMs: 1_000, // ← was 500
  unreachableAfterMs: 10_000, // ← was 2_000
  downAfterMs: 30_000, // ← was 5_000
}
```

For cross-region clusters, bias toward stability over fast detection. A 30-second window catches real failures while tolerating WAN jitter.
### Symptom: heartbeat bandwidth too high

Cluster transport bandwidth shows a constant ~5% of capacity spent on heartbeats alone.

Cause: each member heartbeats every 500 ms to every peer, so large clusters multiply the traffic.

Fix: raise `heartbeatIntervalMs`.

```ts
failureDetector: {
  heartbeatIntervalMs: 2_000, // ← was 500
  unreachableAfterMs: 6_000, // ← keep ratio
  downAfterMs: 15_000,
}
```

Bandwidth scales linearly with frequency: going from 500 ms to 2 s cuts heartbeat traffic 4×.
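The multiplication is easy to estimate up front. A back-of-envelope sketch, assuming a full mesh where every member heartbeats every peer each interval (the function name and byte size are illustrative, not library API):

```ts
// Hypothetical estimator: total heartbeat traffic for a full-mesh cluster.
function heartbeatBandwidthBytesPerSec(
  members: number,
  heartbeatIntervalMs: number,
  bytesPerHeartbeat: number,
): number {
  // Each member sends one heartbeat per interval to every other member.
  const messagesPerSec = (members * (members - 1) * 1000) / heartbeatIntervalMs;
  return messagesPerSec * bytesPerHeartbeat;
}
```

For a 20-member cluster with 200-byte heartbeats, 500 ms intervals cost 152 000 bytes/s; at 2 s it drops to 38 000 bytes/s, the 4× reduction quoted above.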
## Keeping the ratios sane

- `unreachableAfterMs` ≥ 3 × `heartbeatIntervalMs`
- `downAfterMs` ≥ 2 × `unreachableAfterMs`

These minimums prevent flapping from a single missed heartbeat:

- `unreachableAfterMs >= 3 × heartbeat` tolerates up to 2 consecutive missed heartbeats before flagging.
- `downAfterMs >= 2 × unreachable` keeps a member unreachable for a while before declaring it down.
The defaults (500 / 2000 / 5000) follow this rule: 2000 = 4× 500, 5000 = 2.5× 2000.
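The two rules are mechanical enough to check in code. A minimal sketch of a config sanity check you might run before calling `Cluster.join` (the `validateFailureDetector` helper is hypothetical, not part of the library):

```ts
interface FailureDetectorConfig {
  heartbeatIntervalMs: number;
  unreachableAfterMs: number;
  downAfterMs: number;
}

// Hypothetical helper: returns a list of ratio-rule violations (empty = ok).
function validateFailureDetector(cfg: FailureDetectorConfig): string[] {
  const problems: string[] = [];
  if (cfg.unreachableAfterMs < 3 * cfg.heartbeatIntervalMs) {
    problems.push("unreachableAfterMs should be >= 3 x heartbeatIntervalMs");
  }
  if (cfg.downAfterMs < 2 * cfg.unreachableAfterMs) {
    problems.push("downAfterMs should be >= 2 x unreachableAfterMs");
  }
  return problems;
}
```

The defaults pass cleanly; a config like 500 / 1 000 / 1 500 violates both rules.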
## Tuning per-environment

Different environments often need different values:

```hocon
actor-ts.cluster.failure-detector {
  heartbeat-interval = ${?HEARTBEAT_INTERVAL}
  unreachable-after = ${?UNREACHABLE_AFTER}
  down-after = ${?DOWN_AFTER}
}
```

Set these via environment variables per environment:
```sh
# Local dev (fast):
export HEARTBEAT_INTERVAL=250ms
export UNREACHABLE_AFTER=1s

# Production WAN:
export HEARTBEAT_INTERVAL=1s
export UNREACHABLE_AFTER=10s
```

Don't hard-code production values; the difference between dev and prod tuning is real.
## Measuring before tuning

```ts
cluster.subscribe(MemberUnreachable, (evt) => {
  metrics
    .counter('cluster_unreachable_total', { address: evt.member.address.toString() })
    .inc();
});

cluster.subscribe(MemberReachable, (evt) => {
  // also track time spent unreachable
});
```

Histogram-track the duration spent in the unreachable state. P99 should sit well below `downAfterMs`; otherwise legitimate flaps trigger downing. P50 plus jitter gives you your network's RTT profile.
For most teams, a week of observing in production is more useful than guessing at tuning values up front.
## Where to next

- Failure detector — the conceptual model.
- Gossip cadence tuning — the complementary knob.
- Downing strategies — what happens after a member is downed.
- Configuration — the HOCON keys.