Replicated snapshots

In single-writer event sourcing, snapshots save the state at a particular seqNr — recovery loads the snapshot, replays events after that seqNr.

In replicated event sourcing, the picture is more complex — there’s no single linear seqNr; instead, events are partially ordered across replicas via vector clocks. Snapshots must carry the vector clock alongside the state.
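The causal partial order can be sketched with a small component-wise comparison helper. This is an illustrative sketch, not the framework's actual API; missing components count as 0:

```typescript
type VectorClock = Record<string, number>;

// Compare two vector clocks under the causal partial order.
// "before":     a happened-before b (every component of a <= b, and a != b)
// "after":      b happened-before a
// "concurrent": neither dominates the other
function compareVC(
  a: VectorClock,
  b: VectorClock,
): "equal" | "before" | "after" | "concurrent" {
  let aBehind = false; // some component where a < b
  let bBehind = false; // some component where b < a
  for (const replica of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const x = a[replica] ?? 0;
    const y = b[replica] ?? 0;
    if (x < y) aBehind = true;
    if (y < x) bBehind = true;
  }
  if (aBehind && bBehind) return "concurrent";
  if (aBehind) return "before";
  if (bBehind) return "after";
  return "equal";
}
```

Two events from different replicas with clocks like `{ A: 1 }` and `{ B: 1 }` compare as concurrent, which is exactly the case where conflict resolution must run.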

Snapshot contents:
- state (after applying all events causally seen at the time)
- vector clock ({ A: 100, B: 95, C: 60 })

On recovery:

1. Load snapshot.
2. Read events from journal AFTER the snapshot's vc.
3. Apply each (re-running conflict resolution if any are concurrent).
4. Ready.

The vc lets the recovering replica skip events that causally precede the snapshot (already incorporated).

Snapshot-frequency heuristics are the same as for single-writer event sourcing, but bias toward snapshotting more often:

  • Replicated workloads accumulate events from multiple replicas at once.
  • The journal grows faster (N replicas × per-replica rate).
  • Recovery has to re-run conflict resolution for every concurrent event not covered by the snapshot.

class Account extends ReplicatedEventSourcedActor<...> {
  override snapshotPolicy() {
    return everyNEvents(100); // snapshot after every 100 events
  }
}

For replicated entities accumulating 1000 events/day across all replicas, snapshot every 100 means at most 100 events to re-process on recovery — sub-second.

A replicated snapshot record looks like:

{
  state: State,
  vectorClock: { A: 100, B: 95, C: 60 },
  seqNr: 205, // local-replica seqNr, kept for compatibility
  ts: 1716297600000
}

The snapshot is a normal snapshot blob with the vector clock as an extra metadata field. Existing snapshot stores (in-memory, SQLite, object-storage) handle it without changes — the framework adds the vc transparently.
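Because the vector clock is just another field on the snapshot record, any store that persists an opaque blob can hold it. A minimal round-trip sketch, with field names following the example record above (the JSON serialization is an assumption for illustration):

```typescript
interface SnapshotRecord<S> {
  state: S;
  vectorClock: Record<string, number>;
  seqNr: number;
  ts: number;
}

// Round-trip through JSON, as a blob-oriented snapshot store might:
// the vector clock survives as plain metadata, no schema change needed.
function serialize<S>(snap: SnapshotRecord<S>): string {
  return JSON.stringify(snap);
}

function deserialize<S>(blob: string): SnapshotRecord<S> {
  return JSON.parse(blob) as SnapshotRecord<S>;
}
```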

preStart():
  ↓ load latest snapshot
  ↓ state = snapshot.state
  ↓ vc = snapshot.vectorClock
  ↓ read events from journal
  ↓ for each event in journal:
      if event.vc <= snapshot.vc: skip (already incorporated)
      else if event.vc concurrent with vc: invoke resolver
      else: apply via onEvent
  ↓ ready

The “skip” case is what makes snapshots bound recovery time — events written before the snapshot are skipped.
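The recovery loop above can be sketched in TypeScript. The event shape, resolver signature, and `dominatedBy` helper are illustrative assumptions standing in for the framework's own machinery:

```typescript
type VC = Record<string, number>;

interface Event<S> {
  vc: VC;
  apply: (state: S) => S;
}

// true iff every component of a is <= the matching component of b,
// i.e. a causally precedes (or equals) b. Missing components count as 0.
function dominatedBy(a: VC, b: VC): boolean {
  return Object.keys(a).every((r) => a[r] <= (b[r] ?? 0));
}

function recover<S>(
  snapshot: { state: S; vectorClock: VC },
  journal: Event<S>[],
  resolve: (state: S, ev: Event<S>) => S, // conflict resolver for concurrent events
): S {
  let state = snapshot.state;
  for (const ev of journal) {
    if (dominatedBy(ev.vc, snapshot.vectorClock)) continue; // skip: already in snapshot
    const causallyAfter = dominatedBy(snapshot.vectorClock, ev.vc);
    state = causallyAfter ? ev.apply(state) : resolve(state, ev); // apply or resolve
  }
  return state;
}
```

The `continue` branch is the bound on recovery time: every event the snapshot already incorporates costs one comparison, not a replay.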

All standard snapshot stores work:

  • InMemorySnapshotStore — tests.
  • SqliteSnapshotStore — single-node (rare for replicated ES).
  • ObjectStorageSnapshotStore — shared across replicas.

For replicated ES with replicas across multiple regions, a shared snapshot store is critical: each replica recovers faster when it can load the latest snapshot taken by any replica, not just its own.

{
  journal: sharedJournal,
  snapshotStore: new ObjectStorageSnapshotStore({
    backend: new S3ObjectStorageBackend({ /* shared bucket */ }),
  }),
}

Long-lived deployments accumulate retired replicas in vector clocks:

vc { A: 1000, B: 500, C: 200, RETIRED-D: 50, RETIRED-E: 30 }

The retired replicas’ components are inert, but they take up space and slow down comparisons.

The framework’s snapshot machinery can prune retired replicas from the vector clock at snapshot time:

class Account extends ReplicatedEventSourcedActor<...> {
  override pruneVectorClockOnSnapshot(): ReplicaId[] {
    // Return replica IDs we know are retired
    return ['RETIRED-D', 'RETIRED-E'];
  }
}

Snapshots then store smaller vector clocks; recovery skips considering retired replicas.

Use carefully: pruning a replica that is still alive (merely quiet) effectively forgets its history. Only prune after confirmed retirement, i.e. the replica has been decommissioned or removed from rotation for much longer than the replication lag.
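Pruning at snapshot time amounts to dropping the retired components before the clock is persisted. A hypothetical helper mirroring `pruneVectorClockOnSnapshot` above:

```typescript
type VC = Record<string, number>;

// Drop components belonging to retired replicas. This is only safe once a
// replica is confirmed decommissioned: its counter can never advance again,
// so removing it cannot change the outcome of any future comparison.
function pruneVC(vc: VC, retired: string[]): VC {
  const gone = new Set(retired);
  return Object.fromEntries(
    Object.entries(vc).filter(([replica]) => !gone.has(replica)),
  );
}
```

Applied to the example clock above, `pruneVC` shrinks `{ A: 1000, B: 500, C: 200, RETIRED-D: 50, RETIRED-E: 30 }` down to the three live components.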

Example: a concurrent write racing a snapshot.

1. Replica A writes event_A at t1.
2. Replica A snapshots at t2; the snapshot reflects event_A, so snapshot.vc = { A: 1 }.
3. Meanwhile, replica B concurrently writes event_B at t1.5, with vc { B: 1 }, which is not in the snapshot.
4. Replica A reads event_B at t3: snapshot.vc { A: 1 } vs event_B.vc { B: 1 } → concurrent, so the resolver is invoked and the result applied.

Concurrent events that arrive after the snapshot are handled at read-time by the resolver — same as without snapshots.
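The scenario above can be checked mechanically. A self-contained sketch, using the usual component-wise concurrency test (missing components count as 0):

```typescript
type VC = Record<string, number>;

// Two clocks are concurrent when each is ahead of the other in at
// least one component, i.e. neither dominates the other.
function concurrent(a: VC, b: VC): boolean {
  let aAhead = false;
  let bAhead = false;
  for (const r of new Set([...Object.keys(a), ...Object.keys(b)])) {
    if ((a[r] ?? 0) > (b[r] ?? 0)) aAhead = true;
    if ((b[r] ?? 0) > (a[r] ?? 0)) bAhead = true;
  }
  return aAhead && bAhead;
}

const snapshotVC: VC = { A: 1 }; // replica A's snapshot after event_A
const eventBVC: VC = { B: 1 };   // replica B's concurrent write
// concurrent(snapshotVC, eventBVC) is true, so the resolver runs
```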

Snapshot writes for replicated ES are slightly heavier than single-writer:

  • Vector clock serialization — typically 50-200 bytes extra per snapshot.
  • Resolver state merging — if the snapshot is taken during concurrent-write reconciliation, the merge runs first.

Both are negligible in most cases; snapshot size is dominated by the state itself.