# Failure detector
A clustered actor system needs to agree on who’s alive. The failure detector is the per-peer state machine that decides:
```text
healthy ──── no heartbeat for unreachableAfterMs ────► unreachable
   ▲                                                     │    │
   │          heartbeat received                         │    │ no heartbeat for
   └─────────────────────────────────────────────────────┘    │ downAfterMs more
                                                              ▼
                                                             down
```

- **healthy** — heartbeats arriving normally.
- **unreachable** — past the threshold; the cluster avoids routing here. Still officially a member; can recover.
- **down** — past the longer threshold; the cluster considers the peer permanently gone. Triggers downing.
The detector is per-peer — different peers can be in different states at once. The cluster’s overall behavior (routing, sharding, singleton election) reads these states to decide what to do.
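The state machine above can be sketched as a pure function of elapsed silence. This is a minimal sketch under the document's plain elapsed-time model; `detect` and `Thresholds` are illustrative names, not the framework's API:

```ts
// Hedged sketch of the per-peer decision rule, assuming plain elapsed-time
// thresholds (names are illustrative, not framework API).
type PeerState = 'healthy' | 'unreachable' | 'down';

interface Thresholds {
  unreachableAfterMs: number;
  downAfterMs: number;
}

function detect(lastSeenMs: number, nowMs: number, t: Thresholds): PeerState {
  const silenceMs = nowMs - lastSeenMs;
  if (silenceMs >= t.downAfterMs) return 'down'; // long silence: considered gone
  if (silenceMs >= t.unreachableAfterMs) return 'unreachable'; // avoid routing here
  return 'healthy'; // heartbeats arriving normally
}
```

Because the function is evaluated per peer, each peer's state depends only on its own last-seen timestamp.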
## The defaults

```ts
{
  heartbeatIntervalMs: 500,   // send a heartbeat every 500ms
  unreachableAfterMs: 2_000,  // mark unreachable after 2s of silence
  downAfterMs: 5_000,         // mark down after 5s of silence
}
```

For typical LAN clusters (1-10 nodes, sub-millisecond latency), these defaults work fine. They give:
- 2-second detection window for “this peer might be having trouble.”
- 5-second decision window for “this peer is definitively gone.”
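In heartbeat terms, those windows mean a peer must miss several consecutive heartbeats before each transition. This is plain arithmetic on the default values above, not framework API:

```ts
// Derived from the default settings above; plain arithmetic.
const heartbeatIntervalMs = 500;
const unreachableAfterMs = 2_000;
const downAfterMs = 5_000;

const missedBeforeUnreachable = unreachableAfterMs / heartbeatIntervalMs; // 4 heartbeats
const missedBeforeDown = downAfterMs / heartbeatIntervalMs;               // 10 heartbeats
```

A single dropped packet therefore never flips a peer to unreachable on its own.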
## Tuning

Override via the `failureDetector` field in the cluster settings:

```ts
Cluster.join(system, {
  host,
  port,
  seeds,
  failureDetector: {
    heartbeatIntervalMs: 1_000,
    unreachableAfterMs: 5_000,
    downAfterMs: 15_000,
  },
});
```

When to tune:
| Workload | Direction |
|---|---|
| Cross-region cluster (high RTT) | Increase all three. 100ms RTT means a single missed heartbeat is normal noise. |
| Local Docker compose (sub-ms RTT) | Decrease for faster failover tests. Production-tune up before deploying. |
| Network with periodic blips | Increase unreachableAfterMs to avoid false positives, but keep downAfterMs larger so a real failure still gets detected. |
| Cost-sensitive (chatty heartbeats) | Increase heartbeatIntervalMs — at 5s intervals, gossip + heartbeat is < 1KB/sec per peer. |
The ratio downAfterMs / unreachableAfterMs (default ~2.5x) is the
flap-tolerance window: a peer that’s marked unreachable can
recover to healthy without being downed, as long as a heartbeat
arrives within the difference.
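That flap-tolerance window is worth computing explicitly when you tune. Plain arithmetic on the default thresholds, not framework API:

```ts
// Plain arithmetic on the default thresholds.
const unreachableAfterMs = 2_000;
const downAfterMs = 5_000;

// A peer marked unreachable has this long for a heartbeat to rescue it:
const flapToleranceMs = downAfterMs - unreachableAfterMs; // 3_000ms
const ratio = downAfterMs / unreachableAfterMs;           // 2.5x
```

If you shrink the ratio toward 1x, brief network blips go straight from unreachable to down with little chance to recover.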
## Heartbeats are implicit

- Every gossip exchange counts as a heartbeat.
- Every direct message that travels over the cluster transport counts.

The framework doesn't send separate "ping" messages — any cluster traffic from a peer resets that peer's last-seen timestamp. Gossip is the most reliable source (regular interval), but application messages contribute too.
This means: a cluster with very chatty actors gets better failure detection (more heartbeats); an idle cluster relies entirely on gossip.
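A minimal sketch of the implicit-heartbeat rule (names are illustrative, not the framework's internals): any inbound cluster traffic, gossip or application message, refreshes the sender's last-seen timestamp.

```ts
// Illustrative sketch: one last-seen timestamp per peer, refreshed by any traffic.
const lastSeenMs = new Map<string, number>();

// Called for gossip exchanges and direct messages alike; there is no
// separate "ping" message type.
function onClusterTraffic(peerAddress: string, nowMs: number): void {
  lastSeenMs.set(peerAddress, nowMs);
}
```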
## What the detector decides — and doesn't

The detector returns `'healthy'` / `'unreachable'` / `'down'`.
What the cluster does with that:
| Decision | Cluster behavior |
|---|---|
| `healthy` | Normal routing. No effect. |
| `unreachable` | Mark the member unreachable in the membership table. Routers skip them; sharding doesn't allocate new shards to them. The singleton manager won't elect a leader from an unreachable side. |
| `down` | Trigger downing. If a downing strategy is configured, it decides which addresses to forcibly evict; the cluster announces those nodes as removed. |
Crucially: the detector deciding down doesn’t automatically
remove the peer. Removal goes through the downing strategy — the
detector is the signal, not the action. Without a downing
strategy, a down decision stays advisory.
## The view from the local node

Every node runs its own detector, watching its own peers. This means two nodes can disagree about whether a third is reachable:
```text
n1 ──── healthy heartbeats ────► n2
n1 ──── (network partition) ──── n3   ← n1 sees n3 as unreachable
n2 ──── healthy heartbeats ────► n3   ← n2 sees n3 as healthy
```

The cluster gossip propagates these per-node observations. A member is considered globally unreachable when enough peers report it that way — the threshold is configurable in some downing strategies (see `KeepMajority`, `KeepReferee`).
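The aggregation step might look like this. This is a sketch of a quorum check over per-node observations; the real threshold logic lives in the downing strategies, and the function name is illustrative:

```ts
// Illustrative quorum check over per-node observations; not framework API.
// `reports` maps observer address -> "does this observer see the member as unreachable?"
function globallyUnreachable(reports: Map<string, boolean>, quorum: number): boolean {
  let count = 0;
  for (const saysUnreachable of reports.values()) {
    if (saysUnreachable) count++;
  }
  return count >= quorum;
}
```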
## Custom failure detector

The framework's detector is intentionally simple: plain elapsed-time thresholds, no statistical variance tracking. For LAN scale this is sufficient. If your network needs Phi-accrual detection (variance-aware, adaptive thresholds), you'd implement the `FailureDetector` interface yourself and inject it via `Cluster.join`'s `failureDetector` override — but this isn't documented as a public extension point yet; open an issue describing your use case.
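For reference, a Phi-accrual-style detector scores silence against observed inter-arrival statistics instead of fixed thresholds. The following is a heavily simplified sketch (an exponential model over the mean interval; the real algorithm uses a normal distribution over mean and variance, and nothing here is the framework's code):

```ts
// Simplified phi-accrual sketch: phi = -log10(P(silence > t)) under an
// exponential model, so phi grows with silence relative to the observed mean.
class PhiAccrualSketch {
  private intervalsMs: number[] = [];
  private lastSeenMs?: number;

  heartbeat(nowMs: number): void {
    if (this.lastSeenMs !== undefined) {
      this.intervalsMs.push(nowMs - this.lastSeenMs);
      if (this.intervalsMs.length > 100) this.intervalsMs.shift(); // sliding window
    }
    this.lastSeenMs = nowMs;
  }

  // A common convention is to treat phi >= 8 as "suspect this peer".
  phi(nowMs: number): number {
    if (this.lastSeenMs === undefined || this.intervalsMs.length === 0) return 0;
    const mean = this.intervalsMs.reduce((a, b) => a + b, 0) / this.intervalsMs.length;
    const silenceMs = nowMs - this.lastSeenMs;
    return silenceMs / (mean * Math.LN10); // equals -log10(exp(-silence/mean))
  }
}
```

The key property: on a jittery network the observed mean grows, so the same absolute silence produces a lower phi and fewer false positives.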
## Diagnosing failure-detector decisions

Subscribe to the cluster events for visibility:

```ts
import { MemberUnreachable, MemberReachable } from 'actor-ts';

cluster.subscribe(MemberUnreachable, (evt) => {
  console.log(`${evt.member.address} marked unreachable`);
});

cluster.subscribe(MemberReachable, (evt) => {
  console.log(`${evt.member.address} marked reachable again`);
});
```

In production, wire these into metrics — a histogram of "unreachable durations" shows whether the threshold matches your network's actual blip profile.
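The duration histogram mentioned above can be fed from those same two events. A sketch, where the handler names and the plain array are stand-ins for real metric plumbing:

```ts
// Stand-in metric plumbing: record how long each peer stayed unreachable.
const unreachableSinceMs = new Map<string, number>();
const unreachableDurationsMs: number[] = [];

function onUnreachable(address: string, nowMs: number): void {
  unreachableSinceMs.set(address, nowMs);
}

function onReachable(address: string, nowMs: number): void {
  const since = unreachableSinceMs.get(address);
  if (since !== undefined) {
    unreachableDurationsMs.push(nowMs - since); // feed a real histogram here
    unreachableSinceMs.delete(address);
  }
}
```

If most recorded durations sit just past `unreachableAfterMs`, the threshold is likely too tight for your network's blip profile.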
## Where to next

- Cluster overview — the membership state machine the detector feeds.
- Downing strategies — what happens after the detector decides `down`.
- Joining and seeds — how peers first appear in the membership table.
- Configuration — the HOCON keys (`actor-ts.cluster.failure-detector.*`).
- Failure-detector tuning — the operations-focused tuning page.

The `FailureDetector` API reference covers the full surface.