Operations overview

The framework runs the same code in dev and prod, but production deployment has its own set of considerations that don’t matter on a laptop:

Topic            Concerns
─────────────────────────────────────────────────────────
Deployment       How does a node start, register with the cluster, accept traffic, and shut down cleanly?
Tuning           Gossip cadence, mailbox sizes, failure-detector thresholds. Defaults are sensible, but workloads vary.
Security         TLS on the cluster transport, secret management, key rotation.
Upgrades         Schema migrations, rolling deployments without downtime.
Troubleshooting  What logs, metrics, and traces tell you about a misbehaving system.

This section maps each to a deeper page.

For a non-trivial actor system going to production:

□ Persistence backend chosen + production-grade
(e.g. SQLite for single-node, Cassandra for multi-node)
□ Cluster wired up (if multi-node) with concrete seed strategy
□ Downing strategy configured (split-brain protection)
□ Coordinated shutdown configured with SIGTERM hooks
□ Health checks exposed via HttpManagement
□ Metrics exposed via Prometheus or OTel
□ Log aggregation hooked up (structured logging via withFields/MDC)
□ TLS enabled on the cluster transport
□ Secrets via env vars, NOT in application.conf
□ Kubernetes manifests with PreStop hook + grace period
□ Tracing optional but configured if multi-actor flows exist
□ Runbook for "actor X keeps crashing" / "cluster won't form"

None of these are framework-specific tricks — they’re general production hygiene applied to an actor system. Each is covered in its own page.
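As one illustration of the "secrets via env vars" item, the following is a minimal sketch of a fail-fast secret lookup. The helper `requireSecret` and the variable name `CLUSTER_SHARED_SECRET` are hypothetical and not part of the framework; the environment is passed in as a plain object so the function is easy to test.

```typescript
// Hypothetical helper, not a framework API: read a secret from the
// environment and fail at startup if it is missing or blank.
type Env = Record<string, string | undefined>;

function requireSecret(name: string, env: Env): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    // Fail fast: a missing secret should stop the node before it joins the cluster.
    throw new Error(`Missing required secret: ${name}`);
  }
  return value;
}

// Illustrative usage with an in-memory environment.
// In a real Node.js service you would pass `process.env` instead.
const exampleEnv: Env = { CLUSTER_SHARED_SECRET: "example-value" };
const clusterSecret = requireSecret("CLUSTER_SHARED_SECRET", exampleEnv);
```

Failing at startup rather than on first use tends to surface a misconfigured deployment during rollout, while the previous version is still serving traffic.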

How              When
─────────────────────────────────────────────────────────
Kubernetes       Cloud + container orchestration. StatefulSets, headless services, RBAC for the K8s API seed provider.
Docker Compose   Local multi-node clusters for testing and small deployments.
Process manager  systemd / PM2, for bare-metal or VM deployments.

K8s is the most common production path; the page covers Deployment vs StatefulSet, PreStop hooks, and the seed-provider configuration.
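To make the interaction between the PreStop hook and the grace period concrete, here is a rough sketch of bounding shutdown time. It assumes that the PreStop hook runs inside the pod's termination grace period, so the application only has the remainder once SIGTERM arrives. The helpers `shutdownBudgetMs` and `terminateWithin` are hypothetical, and `terminate` stands in for whatever stops your actor system.

```typescript
// Hypothetical helpers, not framework APIs.

/** Milliseconds left for the app once the PreStop hook has used part of the grace period. */
function shutdownBudgetMs(gracePeriodSec: number, preStopSec: number, safetyMarginSec = 2): number {
  return Math.max(0, (gracePeriodSec - preStopSec - safetyMarginSec) * 1000);
}

/** Run `terminate`, but give up waiting after `budgetMs`. */
async function terminateWithin(
  terminate: () => Promise<void>,
  budgetMs: number,
): Promise<"clean" | "timed-out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed-out">((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), budgetMs);
  });
  try {
    return await Promise.race([terminate().then(() => "clean" as const), timeout]);
  } finally {
    // Clear the timer so it does not keep the process alive after a clean stop.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Wiring sketch for Node.js, assuming `system.terminate()` stops the actor system:
// process.on("SIGTERM", () => {
//   void terminateWithin(() => system.terminate(), shutdownBudgetMs(30, 5))
//     .then((outcome) => process.exit(outcome === "clean" ? 0 : 1));
// });
```

Exiting non-zero on a timeout is a design choice that makes slow shutdowns visible in pod status instead of ending in a silent SIGKILL.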

Defaults work for most workloads — reach for these pages when you see specific symptoms:

Knob               When to tune
─────────────────────────────────────────────────────────
Gossip cadence     Large clusters (>20 nodes) where the default 1-second gossip is wasteful, or small clusters where you want faster convergence.
Failure detector   Tight networks (LAN) where 2-second unreachable detection is too slow, or noisy networks where it triggers false alarms.
Mailbox sizing     When you see memory growth from unbounded mailboxes, or when bounded mailboxes are dropping more than expected.
Dispatcher tuning  When you observe HTTP latency degradation under actor-heavy load, or low CPU utilization with high actor throughput.

If you can’t articulate the symptom, don’t tune. Defaults aren’t optimal but they’re rarely bad.
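The mailbox-sizing trade-off can be sketched with a toy bounded queue that counts what it drops. This is an illustration only, not the framework's mailbox implementation; the class `BoundedMailbox` and its members are hypothetical names.

```typescript
// Illustrative only: a drop-newest bounded mailbox with a drop counter.
class BoundedMailbox<T> {
  private readonly queue: T[] = [];
  private readonly capacity: number;
  private dropped = 0;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  /** Enqueue a message; returns false and counts a drop when the mailbox is full. */
  offer(message: T): boolean {
    if (this.queue.length >= this.capacity) {
      this.dropped += 1;
      return false;
    }
    this.queue.push(message);
    return true;
  }

  /** Dequeue the oldest message, or undefined when the mailbox is empty. */
  poll(): T | undefined {
    return this.queue.shift();
  }

  get size(): number {
    return this.queue.length;
  }

  get droppedCount(): number {
    return this.dropped;
  }
}

// A mailbox capped at 2 messages accepts two and drops the third.
const mailbox = new BoundedMailbox<string>(2);
mailbox.offer("a");
mailbox.offer("b");
const acceptedThird = mailbox.offer("c");
```

A cap turns unbounded memory growth into a countable drop rate, which is the symptom you would then watch before changing the size again.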

Three layers:

Concern                                   Page
─────────────────────────────────────────────────────────
Cluster transport TLS + auth              Cluster security
Encryption at rest for persisted data     Master key rotation
TLS everywhere (HTTP, brokers, journals)  TLS everywhere

The cluster transport defaults to unauthenticated TCP — fine inside a private network, dangerous on the public internet. Always enable TLS + shared-secret auth for any cluster that crosses untrusted boundaries.
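For intuition about what shared-secret auth has to get right, the sketch below compares a presented secret against the expected one in constant time using Node's `crypto.timingSafeEqual`. The framework performs its own handshake, which is not shown here; `secretsMatch` is a hypothetical helper for illustration only.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helper, not a framework API.
// Hashing both inputs first gives equal-length buffers, which
// timingSafeEqual requires, and avoids leaking the secret's length.
function secretsMatch(presented: string, expected: string): boolean {
  const a = createHash("sha256").update(presented, "utf8").digest();
  const b = createHash("sha256").update(expected, "utf8").digest();
  return timingSafeEqual(a, b);
}
```

A plain `===` comparison can return early on the first differing character, which is why a constant-time comparison is the usual choice for secrets. This check only makes sense on top of TLS, since a shared secret sent over plain TCP can be read off the wire.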

The framework supports rolling deployments without downtime when configured for it. There are two distinct concerns here: schema migrations and the rolling deployment itself.

The rolling-migration page is the most practical — a step-by-step recipe for “I need to add a field to my event without taking the cluster down.”

Symptom                            → Likely cause
─────────────────────────────────────────────────────────
Actor isn't receiving messages     → Stopped ref, dead-letter, mailbox full
Cluster won't form                 → Seed nodes unreachable, port conflict
Sharded entities won't spawn       → Coordinator not on a leader, role mismatch
PersistentActor takes 30s to start → No snapshot, deep journal; set snapshotPolicy
Tests hang at the end              → Forgot await system.terminate()
Memory grows unboundedly           → Unbounded mailboxes; pick a mailbox cap

See Troubleshooting for the diagnostic-by-symptom catalog.
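The slow-start row is easier to reason about with a toy model of recovery. The sketch below is not the framework's persistence API; it uses a hypothetical counter whose state is rebuilt by folding events, and shows how a snapshot reduces the number of events that must be replayed.

```typescript
// Illustrative types, not framework APIs.
type CounterEvent = { seq: number; delta: number };
type Snapshot = { seq: number; value: number };

function recover(
  journal: CounterEvent[],
  snapshot?: Snapshot,
): { value: number; replayed: number } {
  let value = snapshot !== undefined ? snapshot.value : 0;
  const fromSeq = snapshot !== undefined ? snapshot.seq : 0;
  let replayed = 0;
  for (const event of journal) {
    // Events at or before the snapshot are already reflected in its value.
    // A real journal would skip them at read time instead of scanning past them.
    if (event.seq <= fromSeq) continue;
    value += event.delta;
    replayed += 1;
  }
  return { value, replayed };
}

// A journal of 10,000 events, each adding 1.
const journal: CounterEvent[] = Array.from({ length: 10000 }, (_, i) => ({ seq: i + 1, delta: 1 }));

const cold = recover(journal); // no snapshot: every event is replayed
const warm = recover(journal, { seq: 9900, value: 9900 }); // snapshot: only the tail is replayed
```

Both recoveries reach the same state, but the snapshot-based one replays 100 events instead of 10,000, which is the effect a snapshot policy is meant to achieve.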