Cluster overview

A cluster is a group of ActorSystems — typically one per node — that know about each other. Nodes gossip their membership state, detect each other’s failures, and route messages across the network. Once joined, code on any node can tell an actor on any other node; the cluster transport hides the wire.

Two ways to think about it:

  • From the application’s perspective: one logical actor system spread across N nodes. An ActorRef can point at an actor on any node; the same code that worked single-node still works.
  • From the runtime’s perspective: N independent systems exchanging gossip + heartbeats, each tracking who’s alive, routing messages through a chosen transport.

```ts
import { ActorSystem, Cluster } from 'actor-ts';

const system = ActorSystem.create('my-app');
const cluster = await Cluster.join(system, {
  host: '0.0.0.0',
  port: 2552,
  seeds: ['node-a:2552', 'node-b:2552'],
  roles: ['compute'],
});

// Once joined:
cluster.state;                    // → ClusterState
cluster.upMembers();              // → ReadonlyArray<Member> of up-status nodes
cluster.subscribe(MemberUp, ...); // listen for membership events
```

Three settings are doing most of the work:

| Setting | Purpose |
| --- | --- |
| `host` + `port` | This node's external address. Peers contact it here. |
| `seeds` | Other nodes' addresses. At startup, this node contacts them to join the existing cluster. An empty list means "I'm the first one" (the node auto-promotes to leader). |
| `roles` | Tags this node carries. Routers and shard regions can filter on these (`role: 'compute'` skips nodes without the tag). |

After Cluster.join resolves, the node is in the cluster (though possibly still in joining state for a few seconds until convergence).

Every member goes through a small state machine:

    joining ──(leader confirms)──▶ up
    joining ──(optional)──▶ weaklyUp ──(leader confirms)──▶ up
    up ──(failure detector)──▶ unreachable ──(heartbeat resumes)──▶ up
    up ──(cluster.leave())──▶ leaving ──▶ exiting ──▶ removed ──(tombstone TTL elapses)──▶ (forgotten)

  • joining — just announced itself. Other peers know about it but it isn’t routable yet.
  • weakly-up (optional) — gossip-visible to peers but the leader hasn’t confirmed it. Useful for partition-tolerant joins; see Weakly-up.
  • up — fully in the cluster, routable. This is the steady-state status.
  • unreachable — the failure detector flagged this peer as not heartbeating. Still officially a member, but routing avoids it. Transient; flips back to reachable if heartbeats resume.
  • leaving / exiting — the member is gracefully leaving (via cluster.leave()).
  • removed — formally evicted. Kept as a tombstone for a TTL (default 24 h) so stale gossip from a slow peer can’t accidentally resurrect it.

The cluster’s event stream surfaces every transition — subscribe to events such as MemberUp, MemberRemoved, and UnreachableMember and react, as sketched below.
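A minimal sketch of wiring that up — the handler signature and the `event.member.address` field are assumptions, extrapolated from the `cluster.subscribe(MemberUp, ...)` call shown earlier:

```ts
// Sketch only: the callback shape and event fields are assumed, not confirmed API.
cluster.subscribe(MemberUp, (event) => {
  console.log(`member up: ${event.member.address}`);
});
cluster.subscribe(UnreachableMember, (event) => {
  console.log(`member suspected: ${event.member.address}`);
});
cluster.subscribe(MemberRemoved, (event) => {
  console.log(`member evicted: ${event.member.address}`);
});
```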

Gossip is how members agree on who’s in the cluster. Every gossipIntervalMs (default 500 ms), each member picks a random reachable peer and exchanges its view of the cluster. Over a few rounds, every peer converges on the same state — without any central coordinator.

The protocol carries:

  • Membership table — every member’s address, status, roles, and version vector.
  • Reachability observations — “I haven’t heard from X recently.”

Two peers exchanging gossip merge their tables by picking the higher version for each member. This is what makes convergence work without a leader-elected coordinator: every concurrent update is eventually seen by every peer.
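A minimal sketch of that merge rule — the types here are illustrative, not the library's actual internals:

```ts
// Illustrative types; assumes a per-member version number that increases
// with every status change (higher = newer observation).
type MemberEntry = {
  address: string;
  status: 'joining' | 'weaklyUp' | 'up' | 'unreachable' | 'leaving' | 'exiting' | 'removed';
  roles: string[];
  version: number;
};

// For each member, keep whichever peer's entry carries the higher version.
// Both sides apply the same rule, so their tables converge.
function mergeMembershipTables(
  mine: ReadonlyMap<string, MemberEntry>,
  theirs: ReadonlyMap<string, MemberEntry>,
): Map<string, MemberEntry> {
  const merged = new Map(mine);
  for (const [address, entry] of theirs) {
    const existing = merged.get(address);
    if (!existing || entry.version > existing.version) {
      merged.set(address, entry);
    }
  }
  return merged;
}
```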

For the deep dive, see Joining and seeds.

The framework uses a phi accrual failure detector. Each member maintains a per-peer heartbeat history; if a peer’s heartbeats stop arriving within a statistical window, the detector flags it as suspect.

  • unreachableAfterMs — once suspicion exceeds the threshold for this long, the member is marked unreachable.
  • downAfterMs — if the suspicion persists this long, the member is downed (split-brain resolution).

The detector is adaptive: it accommodates variable network latency rather than relying on a fixed timeout. See Failure detector for the phi math.
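For intuition, here is a sketch of the phi computation using the standard phi accrual formulation (the real detector additionally keeps a sliding window of heartbeat inter-arrival times per peer to estimate the mean and standard deviation):

```ts
// Sketch of the phi calculation: suspicion grows as the current silence
// becomes statistically improbable given past heartbeat intervals.
function phi(sinceLastHeartbeatMs: number, meanMs: number, stdDevMs: number): number {
  const y = (sinceLastHeartbeatMs - meanMs) / stdDevMs;
  // Logistic approximation of the normal CDF.
  const e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
  // Probability that the heartbeat is merely late, not lost.
  const pLate = sinceLastHeartbeatMs > meanMs ? e / (1 + e) : 1 - 1 / (1 + e);
  return -Math.log10(pLate);
}
```

Reading the output: phi ≈ 1 means roughly a 10% chance the silence is normal lateness; phi ≈ 3 means roughly 0.1%.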

When the network partitions, two halves of the cluster may both remain operational but lose contact with each other. Without intervention, both sides keep running and accept conflicting writes — the classic split-brain problem.

actor-ts ships several downing strategies:

| Strategy | What it does |
| --- | --- |
| `KeepMajority` | The side with more nodes wins; the smaller side downs itself. |
| `KeepOldest` | The side containing the oldest member wins. |
| `KeepReferee` | A designated referee node's view wins. |
| `KeepQuorum` | Requires a configured quorum size; a side that falls below it downs itself. |

See Downing strategies for the full set. Pick deliberately — the default (no downing strategy) requires manual intervention during a partition.
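As a sketch, a strategy might be selected at join time — the `downing` option name and shape below are assumptions for illustration only; see Downing strategies for the actual configuration surface:

```ts
// Hypothetical configuration — the `downing` option is assumed, not confirmed API.
const cluster = await Cluster.join(system, {
  host: '0.0.0.0',
  port: 2552,
  seeds: ['node-a:2552', 'node-b:2552'],
  downing: { strategy: 'KeepMajority' }, // smaller side downs itself on a partition
});
```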

Once two nodes share a cluster, ref.tell(msg) to a foreign-node ref just works:

```ts
const remote = await system
  .actorSelection('actor-ts://my-app@10.0.0.5:2552/user/api/sessions/user-42')
  .resolveOne();

remote.tell({ kind: 'whatever' });
// → serialized, sent over the transport, delivered to the foreign actor's mailbox
```

The cluster transport serializes the message (JSON by default), includes the routing path, and the destination’s transport delivers it. Reply-to refs serialize cleanly — the receiving node re-attaches a remote-routable handle so replies flow back over the same transport.
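A sketch of a round trip — the message shape and the `replyTo` field are illustrative, not a fixed protocol, and `selfRef` is assumed to be the sending actor's own ref:

```ts
// Illustrative request carrying a serializable reply-to ref.
remote.tell({ kind: 'get-session', replyTo: selfRef });

// On the remote node, the handler replies as if the sender were local;
// the reply travels back over the same transport:
//   msg.replyTo.tell({ kind: 'session', data: session });
```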

See Refs across nodes for the wire-format details.

The cluster module is the foundation; everything interesting about distributed actor systems comes from the extensions that build on it:

| Extension | What it adds |
| --- | --- |
| Sharding | One actor per "entity key," distributed across nodes, with automatic rebalancing on membership changes. |
| Singleton | One actor cluster-wide, re-spawned elsewhere if the host node leaves. |
| DistributedPubSub | Topic-based fan-out across the cluster. |
| DistributedData | CRDT-based shared state with eventual consistency. |
| Cluster router | Routes messages across the cluster's up members at a well-known path. |
| Receptionist | Service registry: actors register, others look them up by key. |

You don’t enable these by default; you reach for them when needed. This page covers the foundation they all share — once you understand membership, gossip, and failure detection, the extensions follow.

A “cluster” of one node is valid. Calling Cluster.join with no seeds (or seeds that are unreachable) gives you a singleton cluster — the local node auto-promotes to leader, becomes up, and every extension that depends on cluster (sharding, singleton, pubsub) works as if it were in a larger cluster.

This means cluster code can be developed and tested with a single node; you don’t need a Docker Compose setup to get started. Add more nodes later by giving them seeds pointing at the first.
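For example, using the same join API as above with empty seeds:

```ts
// Single-node development "cluster": no seeds, so this node
// auto-promotes to leader and transitions straight to up.
const system = ActorSystem.create('my-app');
const cluster = await Cluster.join(system, {
  host: '127.0.0.1',
  port: 2552,
  seeds: [],
});
```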

Three primary motivations:

  1. Scale-out: more actors than one node’s memory or CPU can handle. Sharding distributes them.
  2. Fault tolerance: if one node dies, the work moves to another. Singleton and sharding handle the failover.
  3. Geographic distribution: actors near their users or data, coordinating with the rest of the cluster over the WAN.

For a single-process app, you don’t need the cluster module. For a multi-process app where each process holds independent state, you don’t need it either — just use process boundaries. Reach for it when you genuinely need shared logical state across multiple machines.

The Cluster API reference covers the join/leave/subscribe surface.