Coordinated shutdown

system.terminate() stops the actor system, but a production app usually has work to do before that: drain in-flight HTTP requests, tell the cluster you’re leaving, flush a journal, close broker connections. And those steps have an order — leave the cluster before you stop the sharding region; stop the HTTP server before the actors that handle requests.

Coordinated shutdown is the DSL for that. You register tasks against named phases; the framework runs them in dependency order, one phase after the next, each with a timeout cap. Calling run() from any trigger (SIGTERM, K8s PreStop hook, an admin endpoint) executes the whole pipeline once.

import {
  ActorSystem,
  CoordinatedShutdownId,
  Phases,
} from 'actor-ts';

const system = ActorSystem.create('my-app');
const cs = system.extension(CoordinatedShutdownId);

cs.addTask(Phases.ServiceUnbind, 'close-http', async () => {
  await httpServer.close();
});

cs.addTask(Phases.ServiceRequestsDone, 'drain-in-flight', async () => {
  await waitForInFlightRequests(/* up to 10s */);
});

cs.installProcessHooks(); // SIGTERM/SIGINT → cs.run(ProcessTerminateReason)

Three things happen when SIGTERM lands:

  1. The runtime calls cs.run(new ProcessTerminateReason('SIGTERM')).
  2. The phases run in canonical order. Inside each phase, all registered tasks run in parallel; the phase waits for them all (or for their timeouts).
  3. The pipeline finishes with the built-in actor-system-terminate task, which calls system.terminate() for you.

Everything in between — HTTP unbinding, cluster leave, journal flush — is whatever you added.
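
Conceptually, the pipeline is a sequential loop over phases, with the tasks inside each phase started in parallel. Here is a minimal sketch of that idea, not the library's internals; the Phase/Task shapes and the withTimeout helper are assumptions for illustration:

import type { Reason } from 'actor-ts';

type Phase = { name: string; timeoutMs: number; recover: boolean };
type Task = (reason: Reason) => Promise<void> | void;

// Race one task against its phase's timeout, clearing the timer afterwards.
async function withTimeout(run: () => Promise<void> | void, ms: number): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    await Promise.race([
      Promise.resolve(run()),
      new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error('task timed out')), ms);
      }),
    ]);
  } finally {
    clearTimeout(timer);
  }
}

async function runPipeline(
  phases: Phase[], // assumed already in topological order
  tasksByPhase: Map<string, Task[]>,
  reason: Reason,
): Promise<void> {
  for (const phase of phases) {
    const tasks = tasksByPhase.get(phase.name) ?? [];
    // All tasks in a phase start together; the phase waits for every one to settle.
    const results = await Promise.allSettled(
      tasks.map((task) => withTimeout(() => task(reason), phase.timeoutMs)),
    );
    // With recover: false, a failure or timeout halts the whole pipeline.
    if (!phase.recover && results.some((r) => r.status === 'rejected')) return;
  }
}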

The canonical phases, listed in execution order:

 #  Phase name                         Typical tasks
 1  before-service-unbind              Last-chance announcements before the server stops accepting connections.
 2  service-unbind                     Stop the HTTP server / gRPC server / WebSocket listener from accepting new connections.
 3  service-requests-done              Wait for in-flight requests to finish; abort the rest.
 4  service-stop                       Close client connections, release sockets.
 5  before-cluster-shutdown            Optional pre-cluster-leave hooks.
 6  cluster-sharding-shutdown-region   Tell the sharding region to hand off entities.
 7  cluster-leave                      Issue a Cluster.leave(); gossip leaving status.
 8  cluster-exiting                    Wait for the cluster to acknowledge the leave.
 9  cluster-exiting-done               Confirm the cluster transition is complete.
10  cluster-shutdown                   Tear down cluster transports.
11  before-actor-system-terminate      Last-chance app-level cleanup (flush journals, close brokers).
12  actor-system-terminate             The built-in system.terminate() task.

You don’t have to use every phase. Empty phases are no-ops; only phases with registered tasks do anything. In a single-node app without clustering, only phases 1-4 and 11-12 see tasks.

The Phases constant exports the canonical names — prefer it over string literals for autocomplete:

import { Phases } from 'actor-ts';

cs.addTask(Phases.ServiceUnbind, ...); // ✓ typed
cs.addTask('service-unbind', ...);     // ✗ stringly-typed, no autocomplete

For app-specific work that doesn’t fit a canonical phase, declare your own:

cs.addPhase({
  name: 'flush-metrics',
  timeoutMs: 3_000,
  dependsOn: [Phases.BeforeActorSystemTerminate],
  recover: true,
});

cs.addTask('flush-metrics', 'push-prometheus', async () => {
  await metricsRegistry.flush();
});

The dependsOn field is what makes the order DAG-shaped rather than linear — your phase runs after before-actor-system-terminate but before actor-system-terminate (because the latter has the former in its own implicit chain).

The framework does a topological sort of the phase graph, so cycles fail loudly at registration time (Error: cycle in phase dependencies).
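
For instance, a sketch of what that looks like (the phase names a and b are hypothetical, and this assumes dependsOn may name a phase that is registered later):

// Two custom phases that depend on each other: the cycle is detected
// during registration, and addPhase throws.
cs.addPhase({ name: 'a', timeoutMs: 1_000, dependsOn: ['b'], recover: true });
cs.addPhase({ name: 'b', timeoutMs: 1_000, dependsOn: ['a'], recover: true });
// → Error: cycle in phase dependencies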

Every task is a function from a Reason to void | Promise<void>:

type ShutdownTask = (reason: Reason) => Promise<void> | void;

The reason lets a task branch on why shutdown was triggered:

cs.addTask(Phases.ClusterLeave, 'gossip-leave', async (reason) => {
  if (reason instanceof ClusterDowningReason) {
    // We were downed — don't bother gossiping a leave.
    return;
  }
  await cluster.leave();
});

Built-in Reason classes:

Class                            When
ProcessTerminateReason(signal)   SIGTERM/SIGINT via installProcessHooks.
ActorSystemTerminateReason       User called system.terminate() directly.
ClusterLeavingReason             Cluster initiated a graceful leave.
ClusterDowningReason             Cluster forced this node out.
UnknownReason                    Trigger not specified.

You can subclass Reason for app-specific triggers (AdminEndpointReason, HotReloadReason, etc.).
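
A sketch of what that can look like, assuming Reason is exported as a class with a no-argument constructor (AdminEndpointReason and HotReloadReason here are app code, not part of actor-ts):

import { Reason } from 'actor-ts';

// App-specific triggers: shutdown requested by an admin endpoint or a hot reload.
class AdminEndpointReason extends Reason {}
class HotReloadReason extends Reason {}

await cs.run(new AdminEndpointReason());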

All tasks in a phase run concurrently — they’re started together and the phase waits for the last one (or its timeout). If you have ordering requirements within a phase (task B must wait for task A), put them in different phases with a dependsOn.
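
For example, a sketch with hypothetical phase names (the checkpoint helpers stand in for your own code):

// 'upload-checkpoint' depends on 'write-checkpoint', so no upload task
// starts until every write task has finished (or timed out).
cs.addPhase({
  name: 'write-checkpoint',
  timeoutMs: 5_000,
  dependsOn: [Phases.BeforeActorSystemTerminate],
  recover: true,
});
cs.addPhase({
  name: 'upload-checkpoint',
  timeoutMs: 5_000,
  dependsOn: ['write-checkpoint'],
  recover: true,
});

cs.addTask('write-checkpoint', 'write', async () => {
  await writeCheckpointToDisk(); // hypothetical app helper
});
cs.addTask('upload-checkpoint', 'upload', async () => {
  await uploadCheckpointToS3(); // hypothetical app helper
});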

Each phase has a timeoutMs (default 5 s), and every task in the phase is raced against it. A task that doesn’t finish in time is logged, and then one of two things happens:

  • recover: true (the default): the phase is treated as complete anyway, and the next phase starts.
  • recover: false: the pipeline halts; subsequent phases are not run and shutdown stops mid-flight.

Override per phase:

cs.setPhaseTimeout(Phases.ServiceRequestsDone, 30_000); // 30s drain budget

Or define your own phase with the timeoutMs and recover you want:

cs.addPhase({
  name: 'aggressive-cleanup',
  timeoutMs: 1_000, // strict cap
  dependsOn: [Phases.BeforeActorSystemTerminate],
  recover: false, // failure → halt
});

Installing the process hooks is one call:

cs.installProcessHooks();
// Or: cs.installProcessHooks(['SIGTERM', 'SIGINT', 'SIGUSR2']);

This attaches handlers that call cs.run(new ProcessTerminateReason(signal)). Calling it twice is harmless (idempotent). Tests usually skip the hooks; production always wires them up.

removeProcessHooks() undoes them — useful for tests that instantiate a system, run, and tear down inside a single process.
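
A sketch of that test lifecycle, assuming a Jest-style beforeEach/afterEach and that ActorSystem.create returns an ActorSystem instance:

import { ActorSystem, CoordinatedShutdownId } from 'actor-ts';

let system: ActorSystem;

beforeEach(() => {
  system = ActorSystem.create('test-app');
  system.extension(CoordinatedShutdownId).installProcessHooks();
});

afterEach(async () => {
  system.extension(CoordinatedShutdownId).removeProcessHooks(); // no stray signal handlers
  await system.terminate(); // stop the system without running the full pipeline
});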

In Kubernetes, the pod-shutdown sequence is:

1. K8s runs the PreStop hook (if configured) and waits for it to complete.
2. K8s then sends SIGTERM. The grace-period clock (terminationGracePeriodSeconds) starts when termination begins, so PreStop time counts against it.
3. If the process is still alive when the grace period expires, K8s sends SIGKILL.

The standard recipe:

// On SIGTERM, run coordinated shutdown:
cs.installProcessHooks();
// PreStop hook script (in your container image):
// #!/bin/sh
// sleep 10 # give upstream LBs time to drain this pod
// exit 0

The sleep in PreStop gives the load balancer time to drop this pod from rotation before SIGTERM arrives and the actor system starts shutting down, so new requests stop being routed to a pod that is about to unbind its server.

See Operations — Kubernetes for the full deployment manifest.

cs.run() is idempotent: calling it multiple times returns the same in-flight promise. If three independent triggers (SIGTERM, an admin endpoint, a cluster downing) all call run, the pipeline still executes only once. The first call starts it; subsequent calls await the same completion.

This matters because in production you often have multiple shutdown paths:

// SIGTERM path:
cs.installProcessHooks();

// Admin-endpoint path:
app.post('/shutdown', async (req, res) => {
  await cs.run(new AdminEndpointReason());
  res.send('ok');
});

// Cluster downing path is auto-wired by the cluster extension.
// Cluster downing path is auto-wired by the cluster extension.

All three end up running the same shutdown sequence once.

By the time the promise resolves:

  • Every task in every phase has either succeeded or timed out.
  • The built-in actor-system-terminate task has called system.terminate(), which has stopped every actor and closed the dispatcher and scheduler.
  • The process is free to exit (process.exit(0)). Nothing left to do.

A common shell of a production main:

async function main() {
  const system = ActorSystem.create('my-app');
  const cs = system.extension(CoordinatedShutdownId);

  // Register tasks...

  cs.installProcessHooks();

  // Block until shutdown completes (e.g. via SIGTERM).
  await new Promise(() => {}); // never resolves; the hooks drive shutdown
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});

When SIGTERM arrives, the hook fires cs.run(...), the pipeline runs, the system terminates, and Node exits cleanly because there are no more handles keeping the loop alive.

The CoordinatedShutdown API reference covers addTask, addPhase, run, and the full phase constant set.