# Troubleshooting
When something’s wrong in production, this page is the starting point. Each section is a symptom, the likely cause(s), and what to check / log / measure to confirm.
For the diagnostic tools themselves:
- Logging — structured logs.
- Stock metrics — built-in actor / cluster metrics.
- Tracing — per-request flow.
## Cluster

### Symptom: cluster won’t form

Pods start, but never reach Up.
Causes to check (in order):
- Seeds unreachable — wrong addresses / DNS / RBAC.
- Cluster port firewalled — pods can’t talk on 2552.
- Different system names — `ActorSystem.create('app-a')` on one node, `'app-b'` on another.
- TLS misconfiguration — handshake fails silently.
Diagnostics:
```ts
cluster.subscribe(MemberJoined, (e) => system.log.info(`saw join: ${e.member.address}`));
cluster.subscribe(SelfUp, () => system.log.info('self up'));
```

Check whether SelfUp fires. If not, the local node hasn’t even bootstrapped. Look at the seed-provider output:

```sh
kubectl logs pod-1 | grep -i seed
# → "discovered N seeds: ..."   ← should not be empty
```

If empty, the seed provider’s selector / DNS / config is wrong.
### Symptom: cluster flaps

Members oscillate between reachable and unreachable.
Causes:
- Tight failure-detector thresholds for the network’s jitter. See failure-detector tuning.
- Network partitions (real ones — diagnose at the infrastructure level).
- GC pauses longer than the unreachable threshold.
Diagnostics:
```
rate(cluster_unreachable_duration_ms_count[5m]) > 0.1   # Frequent transitions
```

In logs:

```
[INFO ] cluster — node-X marked unreachable
[INFO ] cluster — node-X marked reachable
[INFO ] cluster — node-X marked unreachable
```

Persistent flapping like this is a signal to tune the failure-detector thresholds.
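If the network itself is healthy, loosening the failure detector usually stops the flapping. A minimal sketch of what that tuning might look like; the option names (`failureDetector`, `phiThreshold`, `acceptableHeartbeatPauseMs`) are assumptions here, so check the failure-detector tuning page for the real ones:

```ts
// Hypothetical config — option names are illustrative, not the library's documented API.
const system = ActorSystem.create('app', {
  cluster: {
    failureDetector: {
      phiThreshold: 12,                 // higher = slower to declare a node unreachable
      acceptableHeartbeatPauseMs: 6000, // tolerate GC pauses / network jitter up to ~6s
    },
  },
});
```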
## Sharding

### Symptom: sharded entities don’t spawn

Messages to a sharding region don’t reach entities.
Causes:
- Coordinator not yet started — sharding is async; messages sent before the coordinator is ready get buffered.
- No nodes match the `role` — empty up-member set, so no shards are allocated.
- `extractEntityId` returns `undefined` — the message routing has nothing to hash (see the sketch after this list).
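A quick check for the last cause: run the extractor against a sample message and make sure it never comes back `undefined`. A minimal sketch; the message shape and extractor signature here are assumptions, not this library’s exact types:

```ts
// Hypothetical message and extractor — names are illustrative.
interface AddItem {
  kind: 'AddItem';
  cartId: string;
  sku: string;
}

// Every message sent to the region must map to a defined entity id.
const extractEntityId = (msg: AddItem): string | undefined => msg.cartId;

const sample: AddItem = { kind: 'AddItem', cartId: 'cart-42', sku: 'A-1' };
console.log('entity id:', extractEntityId(sample)); // "cart-42", never undefined
```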
Diagnostics:
```
sharding_shards_hosted{type="entity-type"}
# Should equal numShards (total) across the cluster
```

Logs:

```sh
kubectl logs pod-1 | grep -i shard
# → "coordinator: allocating shard X to node-Y"
```

If you see “no candidates,” your role filter rejects every node. Check role tags in `Cluster.join`.
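For reference, the role filter only matches members that joined with that role. A minimal sketch of the join side; the options shape shown here is an assumption, not the documented signature:

```ts
// Hypothetical join options — names and address format are illustrative.
Cluster.join({
  seeds: ['app@seed-0.app-headless:2552'],
  roles: ['workers'],   // the sharding region's role filter must match one of these
});
```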
### Symptom: rebalance storm

Sharding metrics show continuous rebalances; entities flicker.
Causes:
- Aggressive allocation strategy — `LeastShardAllocationStrategy` with `rebalanceThreshold: 1` and frequent membership changes.
- Flapping cluster (see above) — every change triggers a rebalance.
Fix:
```ts
new LeastShardAllocationStrategy(/* threshold */ 5, /* max */ 3);
```

Higher threshold = less sensitivity to small imbalances.
## Persistence

### Symptom: actor takes 30 seconds to start

`preStart` takes a long time; the actor’s first `onReceive` is delayed.
Cause: a deep journal with no snapshots. Recovery replays every event ever written.
Diagnostics:
```
histogram_quantile(0.99, persistence_recovery_duration_ms_bucket)
```

If P99 recovery is > 1 second, set a snapshot policy:

```ts
override snapshotPolicy() {
  return everyNEvents(100);
}
```

See Snapshots.
### Symptom: events recovered but state is wrong

On restart, the actor’s state doesn’t match what the journal says it should be.
Causes:
- `onEvent` has a side effect — it runs during replay and alters the state path.
- `onEvent` uses `Date.now()` or randomness — non-deterministic; each replay produces different state.
- Schema change without an adapter — old events have a different shape than what `onEvent` expects.
Diagnostics: compare the actor’s state after recovery to what the journal says it should be. Replay the events manually in a test to isolate the divergence.
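The usual fix for the non-determinism cause is to capture the timestamp (or random value) in the command handler and store it in the event, so that replaying the event is pure. A minimal sketch, with hypothetical event and handler shapes:

```ts
// Hypothetical event — the shape is illustrative.
interface OrderPlaced {
  kind: 'OrderPlaced';
  orderId: string;
  placedAt: number; // captured once, at command time
}

// Command handler: the only place the clock is read.
function handlePlaceOrder(orderId: string): OrderPlaced {
  return { kind: 'OrderPlaced', orderId, placedAt: Date.now() };
}

// Event handler: pure. Same event in, same state out, on every replay.
function onEvent(state: { orders: Record<string, number> }, e: OrderPlaced) {
  return { orders: { ...state.orders, [e.orderId]: e.placedAt } };
}
```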
## Memory + performance

### Symptom: memory grows unboundedly

Heap grows linearly with uptime; eventually OOM.
Causes (most common first):
- Unbounded mailboxes with a slow consumer. Mailbox depth metric grows.
- Subscriber set leaks — actors registering with the event stream or DistributedPubSub but never unsubscribing on stop.
- DistributedData keys accumulating — an `LWWMap` with millions of keys.
- Persistent buffers — stash buffers, `ask` reply-to refs.
Diagnostics:
```
actor_mailbox_size{class=...}
# Find actors with persistently large queues

# Heap dump via runtime tools:
#   node --inspect / Bun's profiler
```

Check mailbox sizing + the leak patterns in event stream.
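For the subscriber-leak cause, the pattern to look for is a subscription with no matching unsubscription when the actor stops. A minimal sketch of the fix; the `preStart`/`postStop` hooks and the `eventStream.subscribe`/`unsubscribe` calls are assumptions about the API, not its documented shape:

```ts
// Hypothetical actor — hook and method names are illustrative.
class PriceWatcher extends Actor {
  override preStart() {
    // Subscribing here is fine, as long as it is undone on stop.
    this.context.system.eventStream.subscribe(this.self, 'price-updates');
  }

  override postStop() {
    // Without this, the event stream keeps a reference to every watcher
    // that ever started: a slow, steady leak.
    this.context.system.eventStream.unsubscribe(this.self);
  }
}
```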
### Symptom: HTTP latency high under load

Actor work is fine; HTTP responses are slow.
Cause: an actor monopolizes the event loop, starving HTTP handlers.
Fix: a per-actor `ThroughputDispatcher` on the heavy actor.
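A throughput dispatcher caps how many messages an actor processes before yielding the event loop back to other work, HTTP handlers included. A minimal sketch; the `dispatcher` option and constructor arguments are assumptions, not the documented signature:

```ts
// Hypothetical wiring — option names are illustrative.
const heavyWorker = system.actorOf(HeavyWorker.props(), {
  name: 'heavy-worker',
  // Handle at most 5 messages per scheduling turn, then yield,
  // so HTTP handlers get a slice of the event loop under load.
  dispatcher: new ThroughputDispatcher({ throughput: 5 }),
});
```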
## Tests + dev

### Symptom: tests hang at the end

Bun / Vitest test process doesn’t exit.
Cause: `await system.terminate()` not called in a fixture teardown. Leaked schedulers / actor cells keep the event loop alive.
Fix:
```ts
afterEach(async () => {
  await tk.shutdown();
});
```

See TestKit.
### Symptom: flaky timing-sensitive tests

Tests pass locally, fail in CI.
Cause: real-clock timing in tests. CI machines are slower / different.
Fix: use `ManualScheduler` and control time deterministically.
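A minimal sketch of what that looks like; the `ManualScheduler` constructor, the TestKit wiring, and the `advance` method are assumptions here, so check the TestKit page for the real API:

```ts
// Hypothetical TestKit usage — names are illustrative.
it('retries after the backoff interval', async () => {
  const clock = new ManualScheduler();
  const tk = await TestKit.create({ scheduler: clock });
  const probe = tk.probe();

  tk.spawn(RetryingClient.props({ backoffMs: 5_000, replyTo: probe.ref }));

  // No real waiting: jump the virtual clock past the backoff.
  clock.advance(5_000);

  await probe.expectMessage({ kind: 'Retried' });
  await tk.shutdown();
});
```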
## Where to start when nothing’s obvious

1. Check logs for ERROR-level entries. Filter by time window around the issue.
2. Check stock metrics — what changed? (rate of restarts, mailbox depth, member count).
3. Check cluster events on the event stream.
4. If a request is failing, follow the trace ID through logs / trace backend.
5. Reproduce in a multi-node-spec test if you can.
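For step 4, structured logs make the trace-ID hop a one-liner, assuming the logger emits a `traceId` field (an assumption; see the Logging page for the actual field name):

```sh
# Hypothetical field name — adjust to whatever your log format actually emits.
kubectl logs deploy/app --since=30m | grep '"traceId":"<failing-request-trace-id>"'
```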
## Where to next

- Operations overview — the broader production checklist.
- FAQ — common questions and pitfalls.
- Stock metrics — what to read when symptoms emerge.
- Logging — how to make logs actually useful.
- Tracing — per-request flow when logs aren’t enough.