std.cluster

std.cluster currently exposes the local supervised-actor runtime for the :cluster profile tracer bullet.

This page documents what exists now. Local grains have a source-level activation shell, a local activation registry that enforces one live writer per durable identity, a local namespace lookup layer, and explicit GrainStore-backed lifecycle callbacks. Compiler-generated state serializers, placement, membership, migration, and remote transport remain future :cluster work.

The current facade is local-only:

  • LocalActorSystem owns one cluster-budget Nursery and one Supervisor.
  • Actors run as nursery tasks.
  • Supervisor strategies are one_for_one, one_for_all, and rest_for_one.
  • Restart policies are permanent, transient, and temporary.
  • Restart budgets and pledge-violation restart opt-in are exposed.
  • Child status snapshots expose lifecycle, actor id, task id, task state, last exit reason, and restart count.
  • Actor tombstones can be counted, classified for repeated deterministic patterns, and mirrored to a caller-provided sink.
  • Grain declarations with durable identity syntax lower through the same local supervised activation shell as actors.
  • Local grain lookup/start maps (grain_type, grain_id) to one stable local actor reference, so duplicate activation attempts reuse the existing live activation instead of creating a second mutator.
  • Local namespace lookup maps (grain_type, namespace) to an internal durable grain id, then routes duplicate lookups through the same single-writer activation registry.
  • Persistent local grain start can invoke caller-provided load/store callbacks backed by GrainStoreBytes after setup, after message/timeout boundaries, and before teardown.
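
The strategy and policy names above follow the Erlang/OTP vocabulary. As a rough mental model, and assuming std.cluster uses the usual OTP meanings (this page does not spell them out) with slot order standing in for start order, the Python sketch below computes which slots a restart touches and whether a child's policy allows a restart. The function names and string reason codes are illustrative and are not part of std.cluster.

```python
# Illustrative model only: assumes OTP-style semantics for the strategy and
# policy names. It is not the std.cluster implementation.

def should_restart(policy, reason, restart_pledge_violations=False):
    """Decide whether a child's restart policy permits a restart."""
    if reason == "pledge_violated" and not restart_pledge_violations:
        return False                     # pledge violations are opt-in
    if policy == "permanent":
        return True                      # restart after any exit
    if policy == "transient":
        return reason not in ("normal", "shutdown")   # failures only
    return False                         # "temporary": never restart


def affected_slots(strategy, failed, slot_count):
    """Slots touched when the child in slot `failed` is restarted."""
    if strategy == "one_for_one":
        return [failed]
    if strategy == "one_for_all":
        return list(range(slot_count))
    if strategy == "rest_for_one":
        # Assumption: later slots were started after the failed child.
        return list(range(failed, slot_count))
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, under these assumptions a failure in slot 1 of a four-slot rest_for_one supervisor touches slots 1, 2, and 3.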

Define a message protocol with message. The declaration uses tagged variants:

message CounterMsg {
    Tick,
    Stop,
}

Attach the protocol to an actor with actor Name(msg: Msg). The payload binding name is part of the header; the current generated handler still receives the raw i64 tag as __msg.

@mailbox(capacity: 4)
actor Counter(msg: CounterMsg) do
    var count: u64 = 0
    receive do
        count += __msg
    end
end

For a source-level Janus actor, the compiler emits the supervised start wrappers:

ActorName_start_supervised(system: u64, slot: u64, policy: u32) -> u64
ActorName_start_supervised_ref(system: u64, slot: u64, policy: u32) -> u64

ActorName_start_supervised returns the transient ActorId. ActorName_start_supervised_ref starts the actor and returns a stable local actor reference for the supervised (system, slot) identity. Use the _ref form for production send and observation paths that should survive supervisor restarts.

{.profile: cluster.}
use std.cluster.local as cluster
message CounterMsg {
    Tick,
    Stop,
}
@mailbox(capacity: 4)
actor Counter(msg: CounterMsg) do
    var count: u64 = 0
    receive do
        count += __msg
    end
end
pub func main() -> i32 do
    let system = cluster.local_new(
        1 as u64,
        cluster.STRATEGY_ONE_FOR_ONE,
        1 as u64,
    )
    if system == 0 as u64 do return 1 end
    let counter = Counter_start_supervised_ref(
        system,
        0 as u64,
        cluster.POLICY_PERMANENT,
    )
    if counter == 0 as u64 do return 2 end
    if cluster.local_ref_mailbox_capacity(counter) != 4 as i64 do
        return 3
    end
    if cluster.local_ref_try_send(counter, 1 as i64) != 1 as i32 do
        return 4
    end
    if cluster.local_shutdown(system) != 1 as i32 do return 5 end
    if cluster.local_destroy(system) != 1 as i32 do return 6 end
    return 0
end

The wrapper hides the setup/handler/destroy runtime-entry plumbing. Public Janus APIs accept typed callables or generated actor/grain starters; callable addresses are not ordinary u64 values on the language surface.

The compiler accepts the final local-persistent grain header shape and emits the same local supervised start wrapper used by actors:

@persist(via: GrainStoreBytes)
@lifecycle(activation: .lazy)
grain User(id: u64, msg: UserMsg) do
    var count: u64 = 0
    receive do
        UserMsg.Ping => do
            count += 1
        end,
        UserMsg.Stop => do
            return 0
        end,
    end
end

For the compiler slice, User_start_supervised(system, slot, policy) remains an activation shell over the local actor runtime. For the runtime registry slice, use std.cluster.local to locate or start a grain activation by durable numeric identity:

let user_ref = cluster.local_grain_lookup_or_start(
    system,
    100 as u64, // grain type id
    42 as u64, // durable grain id
    0 as u64, // local supervisor slot
    cluster.POLICY_PERMANENT,
    4 as u64, // mailbox capacity
    user_setup,
    user_handler,
    user_destroy,
)

If another call uses the same (grain_type, grain_id), the runtime returns the same stable local reference while the activation is live. This pins the first grain runtime invariant: one durable identity has one active local writer.
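
The single-writer invariant can be pictured as a map keyed by (grain_type, grain_id). The Python sketch below is a toy model of that behaviour, not the runtime's data structure: the class and method names are hypothetical, and treating 0 as the failure value is inferred from the `counter == 0` check in the earlier example.

```python
# Toy model of the single-writer invariant. Names are hypothetical and only
# mirror the behaviour described on this page.

class LocalGrainRegistry:
    def __init__(self):
        self._live = {}        # (grain_type, grain_id) -> stable ref
        self._next_ref = 1     # assumption: 0 is reserved for failure

    def lookup_or_start(self, grain_type, grain_id):
        key = (grain_type, grain_id)
        ref = self._live.get(key)
        if ref is None:
            # Cold path: start exactly one activation for this identity.
            ref = self._next_ref
            self._next_ref += 1
            self._live[key] = ref
        # Warm path: reuse the live activation instead of a second writer.
        return ref

    def active_count(self):
        return len(self._live)
```

Two lookups for (100, 42) return the same reference and leave the active count at one, which is the property the paragraph above pins down.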

The local namespace layer resolves human-readable namespace keys to internal durable grain ids before entering the same activation registry:

let user_ref = cluster.local_grain_lookup_or_start_namespace(
    system,
    100 as u64, // grain type id
    "users/alice", // local namespace key
    0 as u64, // local supervisor slot
    cluster.POLICY_PERMANENT,
    4 as u64, // mailbox capacity
    user_setup,
    user_handler,
    user_destroy,
)

local_grain_namespace_lookup returns the mapped internal id, or 0 when the namespace is unbound. local_grain_lookup_or_start_namespace derives and stores an internal id on first lookup, then returns the same live activation ref for duplicate namespace lookups. local_grain_namespace_bind can bind aliases to an existing id; rebinding an existing namespace to a different id is rejected.
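
The three namespace rules (0 for unbound, derive once on first lookup, reject rebinding to a different id) can be sketched in Python as follows. This is a model under assumptions: the class is hypothetical, ids are handed out sequentially here although the runtime's derivation is not documented on this page, and the 1/0 return values of bind and the acceptance of an identical rebind are guesses.

```python
# Hypothetical model of the local namespace layer; not the runtime's code.

class NamespaceTable:
    def __init__(self):
        self._ids = {}         # (grain_type, namespace) -> internal grain id
        self._next_id = 1      # assumption: sequential ids, 0 means unbound

    def lookup(self, grain_type, namespace):
        return self._ids.get((grain_type, namespace), 0)

    def lookup_or_derive(self, grain_type, namespace):
        key = (grain_type, namespace)
        if key not in self._ids:
            # First lookup: derive and store an internal id.
            self._ids[key] = self._next_id
            self._next_id += 1
        return self._ids[key]

    def bind(self, grain_type, namespace, grain_id):
        key = (grain_type, namespace)
        current = self._ids.get(key)
        if current is not None and current != grain_id:
            return 0           # rebinding to a different id is rejected
        self._ids[key] = grain_id
        return 1               # assumption: 1 signals success
```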

For local persistence, use the persistent lookup/start variant and pass lifecycle callbacks:

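let user_ref = cluster.local_grain_lookup_or_start_persistent(
    system,
    100 as u64, // grain type id
    42 as u64, // durable grain id
    0 as u64, // local supervisor slot
    cluster.POLICY_PERMANENT,
    4 as u64, // mailbox capacity
    user_setup,
    user_handler,
    user_destroy,
    store_ctx as u64, // caller-provided store context
    load,
    store,
)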

The load/store callbacks use this shape:

pub func load(ctx: u64, grain_type: u64, grain_id: u64, state: u64) -> i32 do
    // Return >= 0 for a valid cold miss or restore, negative for fatal load.
end
pub func store(ctx: u64, grain_type: u64, grain_id: u64, state: u64) -> i32 do
    // Return 1 when durable state was committed, 0 on failure.
end

ctx is the caller-provided store context, commonly a pointer to a GrainStoreBytes facade. The runtime calls load after setup returns a state pointer, calls store after each message and timeout handler, and calls store once more before teardown. A store failure stops the activation at that handler boundary, so it never keeps running as though an uncommitted in-memory mutation were durable.

Use local_grain_persistence_load_failures(system) and local_grain_persistence_store_failures(system) to inspect persistence callback failures observed by the local runtime. The counters are scoped to the local actor system handle and increment only when a user-provided load callback returns a negative value or a store callback returns anything other than 1.
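
The return-code conventions for load and store, and the way they feed the two failure counters, can be summarised in a short Python sketch. The helper names are hypothetical, and stopping the activation after a fatal load is an assumption; only the counting rules come from the text above.

```python
# Sketch of how callback results map to the failure counters. Helper names
# are hypothetical; only the counting rules come from this page.

class PersistenceCounters:
    def __init__(self):
        self.load_failures = 0
        self.store_failures = 0


def run_load(counters, load, ctx, grain_type, grain_id, state):
    """Return True when the activation may proceed."""
    rc = load(ctx, grain_type, grain_id, state)
    if rc < 0:                           # negative: fatal load
        counters.load_failures += 1
        return False                     # assumption: activation is abandoned
    return True                          # >= 0: cold miss or restore


def run_store(counters, store, ctx, grain_type, grain_id, state):
    """Return True when the activation may keep running."""
    rc = store(ctx, grain_type, grain_id, state)
    if rc != 1:                          # anything other than 1 is a failure
        counters.store_failures += 1
        return False                     # the handler boundary becomes a stop
    return True
```

Note the asymmetry: load treats every non-negative value as success, while store accepts only the exact value 1.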

The current registry and namespace layer are still local and partly in-memory. They do not yet provide compiler-generated GrainStore serializers, passivation, migration, remote routing, cross-node placement, or durable namespace persistence. Those are separate runtime layers.

The local grain registry helpers are:

cluster.local_grain_lookup_or_start(system, grain_type, grain_id, slot, policy, capacity, setup, handler, destroy) -> u64
cluster.local_grain_lookup_or_start_persistent(system, grain_type, grain_id, slot, policy, capacity, setup, handler, destroy, ctx, load, store) -> u64
cluster.local_grain_ref_try_send(grain_ref, msg) -> i32
cluster.local_grain_active_count(system) -> u64
cluster.local_grain_persistence_load_failures(system) -> u64
cluster.local_grain_persistence_store_failures(system) -> u64
cluster.local_grain_namespace_lookup(system, grain_type, namespace) -> u64
cluster.local_grain_namespace_bind(system, grain_type, namespace, grain_id) -> i32
cluster.local_grain_lookup_or_start_namespace(system, grain_type, namespace, slot, policy, capacity, setup, handler, destroy) -> u64
cluster.local_grain_lookup_or_start_namespace_persistent(system, grain_type, namespace, slot, policy, capacity, setup, handler, destroy, ctx, load, store) -> u64

Stable local actor references are scalar handles. They encode the local system handle, child slot, and slot generation, not the runtime ActorId, so a permanent or transient child keeps the same reference after restart. If you stop a child and reuse the slot for a different child, the old reference becomes invalid instead of aliasing the replacement. The current ref helpers are:

cluster.local_actor_ref(system, slot) -> u64
cluster.local_ref_try_send(actor_ref, msg) -> i32
cluster.local_ref_child_actor_id(actor_ref) -> i32
cluster.local_ref_child_lifecycle(actor_ref) -> i32
cluster.local_ref_child_task_state(actor_ref) -> i32
cluster.local_ref_child_last_exit(actor_ref) -> i32
cluster.local_ref_mailbox_len(actor_ref) -> i64
cluster.local_ref_mailbox_capacity(actor_ref) -> i64
cluster.local_ref_stop_child(actor_ref, reason) -> i32
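
To make the (system handle, slot, generation) encoding concrete, the Python sketch below packs the three fields into one integer and shows why a restart keeps a reference valid while slot reuse invalidates it. The bit layout and every name in it are hypothetical; the real encoding is not documented on this page.

```python
# Hypothetical layout: 32-bit system handle, 16-bit slot, 16-bit generation.

def pack_ref(system, slot, generation):
    assert 0 < system < 2**32 and 0 <= slot < 2**16 and 0 <= generation < 2**16
    return (system << 32) | (slot << 16) | generation


def unpack_ref(ref):
    return ref >> 32, (ref >> 16) & 0xFFFF, ref & 0xFFFF


class Slots:
    """Tracks the current generation of each slot in one local system."""

    def __init__(self, system, slot_count):
        self.system = system
        self.generation = [0] * slot_count

    def ref(self, slot):
        return pack_ref(self.system, slot, self.generation[slot])

    def restart(self, slot):
        # Same child restarted: the ActorId changes, the generation does not,
        # so existing references stay valid.
        pass

    def reuse_for_new_child(self, slot):
        self.generation[slot] += 1       # old references stop matching

    def is_valid(self, ref):
        system, slot, generation = unpack_ref(ref)
        return (system == self.system
                and slot < len(self.generation)
                and self.generation[slot] == generation)
```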

Capability-gated callers use the same reference shape with explicit ClusterLocalCap authority:

cluster.local_actor_ref_cap(cap, system, slot) -> u64
cluster.local_ref_try_send_cap(cap, actor_ref, msg) -> i32
cluster.local_ref_child_actor_id_cap(cap, actor_ref) -> i32
cluster.local_ref_child_lifecycle_cap(cap, actor_ref) -> i32
cluster.local_ref_child_task_state_cap(cap, actor_ref) -> i32
cluster.local_ref_child_last_exit_cap(cap, actor_ref) -> i32
cluster.local_ref_mailbox_len_cap(cap, actor_ref) -> i64
cluster.local_ref_mailbox_capacity_cap(cap, actor_ref) -> i64
cluster.local_ref_stop_child_cap(cap, actor_ref, reason) -> i32

Use ActorRef[Msg] for compile-time message protocol checks on direct spawned actors. Use the scalar local actor reference above for the supervised local bridge path.

Inside receive, you can either write normal statements against __msg or write bare match arms. Bare arms desugar to match __msg { ... }:

receive do
    0 => do
        count += 1
    end,
    1 => do
        return 0
    end,
    else => do
        count = count
    end,
end

For typed message protocols, receive arms can match named variants, destructure payload fields, guard on destructured bindings, and include a timeout arm:

message CounterMsg {
    Tick,
    Set { value: u64 },
    Stop,
}
receive do
    CounterMsg.Tick => do
        count += 1
    end,
    CounterMsg.Set { value } when value >= 0 as u64 => do
        count += value
    end,
    CounterMsg.Stop => do
        return 0
    end,
    else => do
        count = count
    end,
    after 0 => do
        count = count
    end,
end

The shorthand { value } binds the payload field named value into the arm scope. Message payload fields must be SBI-conformant; pointer-typed fields are rejected at declaration time with E2530.

For compiler-generated supervised actors, an after N => ... arm is wired into the local runtime. The compiler emits an ActorName_timeout(actor) helper and the generated ActorName_start_supervised* wrappers register it with the mailbox timeout. Delivered messages still call ActorName_handler(actor, msg); an empty mailbox at the timeout boundary calls ActorName_timeout(actor).
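
The dispatch rule in the previous paragraph reduces to a two-way choice at each receive boundary. The Python sketch below models only that choice: the wait of N before the timeout fires is elided, and the function name is hypothetical.

```python
from collections import deque

# Minimal model of one receive boundary. The timed wait for `after N` is
# elided; only the handler-versus-timeout choice is shown.

def run_boundary(mailbox, handler, timeout):
    """Deliver one message if present, otherwise fire the timeout helper."""
    if mailbox:
        return handler(mailbox.popleft())   # like ActorName_handler(actor, msg)
    return timeout()                        # like ActorName_timeout(actor)
```

A mailbox holding one message therefore produces one handler call followed by timeout calls on later boundaries, until new messages arrive.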

Direct spawned actors can use typed actor references:

pub func send_tick(ref: ActorRef[CounterMsg]) -> i32 do
    ref.send(CounterMsg.Tick)
    return 0
end
pub func spawn_counter() -> ActorRef[CounterMsg] do
    return spawn Counter()
end

ActorRef[Msg] is a compile-time protocol witness over the current actor handle ABI. The compiler checks direct ref.send(Msg.UnitVariant) calls, typed local bindings, and direct return spawn Actor() expressions. Unit variants lower to their i64 tag. Payload-carrying variants are supported: fields transfer through boxed slot arrays, and receive arms can destructure them with Msg.Variant { field } patterns. All message fields must be SBI-conformant (owned, by-value, no pointers); the compiler rejects non-conformant declarations with E2530.

SPEC-029 sendability is enforced before actor payload delivery ships. For proven actor, channel, and mailbox send boundaries:

  • ref T payloads are rejected with E2801.
  • iso T payloads are accepted and the binding is consumed.
  • Reading a consumed iso binding emits E2802.
  • val T and tag T payloads are sendable.

This is a type check, not a serialization trait check. Janus does not require a Serialize trait for actor messages. Wire-ready message payloads must use SBI-compatible layout when the distributed transport path lands.

Use local_stop_child when a caller wants to stop a live child without applying its restart policy:

let stopped = cluster.local_stop_child(
    system,
    0 as u64,
    cluster.STOP_REASON_SHUTDOWN,
)

Shutdown and normal stop reasons do not create tombstones. Abnormal, killed, and pledge-violation stop reasons do create tombstones, but still do not restart the child. local_handle_crash and local_handle_exit remain the restart-policy paths for simulated or observed actor exits.

The local actor mailbox is bounded. Actors without @mailbox use the runtime channel default: one pending handoff slot. The public send surface is non-blocking:

let sent = cluster.local_try_send(system, 0 as u64, 42 as i64)
let sent_ref = cluster.local_ref_try_send(actor_ref, 42 as i64)

Return codes are stable for the current tracer bullet:

  • 1: the message was accepted.
  • 0: the child slot is empty or the mailbox is full.
  • -1: the mailbox channel is closed.

Use @mailbox(capacity: N) on a compiler-generated actor to set the supervised actor mailbox capacity. The compiler also uses the same value for direct spawn Actor() mailboxes. The overflow argument is parsed for the source surface, but the current runtime path enforces capacity only. Production callers should treat 0 as backpressure or missing-child rejection and retry, drop, or escalate according to their actor protocol.
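
One way to act on the three return codes is a bounded retry that gives up on backpressure and never retries a closed mailbox. The Python sketch below models that policy against a toy bounded mailbox; the class, the helper, and the retry policy are illustrative assumptions, and only the codes 1, 0, and -1 come from this page.

```python
from collections import deque

SENT, REJECTED, CLOSED = 1, 0, -1        # return codes listed above


class Mailbox:
    """Toy bounded mailbox; the default models one pending handoff slot."""

    def __init__(self, capacity=1):
        self.capacity = capacity
        self.queue = deque()
        self.closed = False

    def try_send(self, msg):
        if self.closed:
            return CLOSED
        if len(self.queue) >= self.capacity:
            return REJECTED              # full: caller sees backpressure
        self.queue.append(msg)
        return SENT


def send_with_retry(mailbox, msg, attempts, on_backpressure):
    """Retry a bounded number of times, then report the rejection."""
    for _ in range(attempts):
        rc = mailbox.try_send(msg)
        if rc != REJECTED:
            return rc                    # accepted, or closed: stop retrying
        on_backpressure()                # e.g. yield, back off, or escalate
    return REJECTED
```

Because 0 covers both a full mailbox and an empty child slot, a caller that needs to tell them apart can check the lifecycle or mailbox accessors before retrying.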

Mailbox pressure is observable through scalar status accessors:

let pending = cluster.local_child_mailbox_len(system, 0 as u64)
let slots = cluster.local_child_mailbox_capacity(system, 0 as u64)

The default reports slots == 1. An actor declared with @mailbox(capacity: 4) reports slots == 4. Both functions return -1 when the slot has no live child.

The Janus facade exposes local supervisor and child status without exposing actor state:

let supervisor_state = cluster.local_supervisor_state(system)
let lifecycle = cluster.local_child_lifecycle(system, 0 as u64)
let task_state = cluster.local_child_task_state(system, 0 as u64)
let last_exit = cluster.local_child_last_exit(system, 0 as u64)

local_supervisor_state returns:

  • SUPERVISOR_STATE_RUNNING
  • SUPERVISOR_STATE_STOPPED
  • SUPERVISOR_STATE_FAILED
  • -1 for an invalid handle

local_child_lifecycle returns:

  • CHILD_LIFECYCLE_UNCONFIGURED
  • CHILD_LIFECYCLE_CONFIGURED
  • CHILD_LIFECYCLE_RUNNING
  • CHILD_LIFECYCLE_STOPPED
  • CHILD_LIFECYCLE_FAILED
  • -1 for an invalid handle or slot

local_child_task_state returns TASK_STATE_READY, TASK_STATE_RUNNING, TASK_STATE_BLOCKED, TASK_STATE_BUDGET_EXHAUSTED, TASK_STATE_COMPLETED, TASK_STATE_CANCELLED, or -1 when no live task is present.

local_child_last_exit returns the same STOP_REASON_* codes used by local_handle_exit, or -1 when no exit is recorded.

Every status accessor has a _cap form that consumes ClusterLocalCap. These accessors report lifecycle and pressure only; they do not expose actor-local variables or grain-owned state.

LocalActorSystem is the ergonomic root for the local tracer bullet. It keeps callers on the public std.cluster path instead of reaching into runtime internals. The examples that follow use the Zig-side API unless they are introduced as "From Janus".

const cluster = @import("std_cluster");
var system = try cluster.LocalActorSystem.init(
    allocator,
    1, // nursery id
    "root", // supervisor id
    .one_for_one,
    2, // child slots
);
defer system.deinit();

Children are started from ChildSpec values. A child start function receives the actor-system nursery and the allocator owned by the supervisor.

fn startWorker(nursery: *cluster.Nursery, allocator: std.mem.Allocator) !cluster.SupervisedChild {
    const actor = try allocator.create(cluster.Actor);
    errdefer allocator.destroy(actor);
    actor.* = try cluster.Actor.init(allocator, 1, 1);
    errdefer actor.deinit();
    const task = cluster.spawn(nursery, actor, workerHandler) orelse return error.ActorSpawnRejected;
    return .{ .actor = actor, .task = task };
}
_ = try system.startChild(0, .{
    .id = "worker",
    .start_fn = startWorker,
    .restart = .permanent,
});

You can also configure children first and start them later:

try system.configureChild(0, .{
    .id = "worker",
    .start_fn = startWorker,
    .restart = .permanent,
});
const started = try system.startConfiguredChildren();

Use handleCrash for ordinary abnormal actor failure:

try system.handleCrash(0);

Use handleExit when the caller knows the exact stop reason:

try system.handleExit(0, .pledge_violated);

The Janus facade exposes the same path with stable STOP_REASON_* codes:

if cluster.local_handle_exit(
    system,
    0 as u64,
    cluster.STOP_REASON_PLEDGE_VIOLATED,
) != 1 as i32 do return 1 end

Use handleExitAt for deterministic restart-window tests or runtime loops that already have a timestamp:

try system.handleExitAt(0, .abnormal, 100);
const status = system.statusAt(100);

Abnormal terminal exits now produce actor tombstones. Normal exits and shutdown exits are intentionally skipped; tombstones are for failure classes that may need replay, audit, or repair.

The local runtime keeps the existing bounded in-memory tombstone index and can also mirror each tombstone to a caller-provided sink.

The current low-level sink hook is a raw-address bridge API: local_set_tombstone_sink_addr(system, ctx_addr, append_addr). Treat it as an internal bridge surface, not a user-facing Janus callback API. User .jan documentation must not teach @intFromPtr or raw function-address conversion. The public typed callback surface belongs to the std.exec/cluster cleanup work.

The callback receives an opaque context pointer and a callback-scoped record pointer. Copy or persist the record during the callback; do not retain record_raw.

Sink counters are exposed for monitoring:

let stored = cluster.local_tombstone_sink_appends(system)
let failed = cluster.local_tombstone_sink_failures(system)

Stable stop-reason codes are available as STOP_REASON_NORMAL, STOP_REASON_SHUTDOWN, STOP_REASON_ABNORMAL, STOP_REASON_KILLED, STOP_REASON_PLEDGE_VIOLATED, and STOP_REASON_MIGRATION_ABORTED.

The supervisor hot index can classify the latest tombstone against prior tombstones with the same deterministic pattern: child slot, spec id, stop reason, code version, and input digest. Janus exposes scalar accessors for the current local runtime:

let matches = cluster.local_tombstone_classify_match_count(
    system,
    now_seconds,
    3 as u32,
    60 as i64,
)
let deadly = cluster.local_tombstone_classify_deadly(
    system,
    now_seconds,
    3 as u32,
    60 as i64,
)
let oldest = cluster.local_tombstone_classify_oldest_sequence(
    system,
    now_seconds,
    3 as u32,
    60 as i64,
)

matches is the number of hot-index tombstones matching the latest pattern inside the window. deadly returns 1 when matches reaches the threshold. oldest returns the oldest matching tombstone sequence, or 0 when no latest tombstone exists. Each function also has a _cap form that consumes ClusterLocalCap.
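
The three accessors can be read as projections of one classification pass over the hot index. The Python sketch below is a model under assumptions, not the runtime's algorithm: the Tombstone type and classify function are hypothetical, the latest tombstone is assumed to count toward its own match total, and the window is assumed to include tombstones exactly window_seconds old.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tombstone:
    sequence: int
    timestamp_seconds: int
    # (child slot, spec id, stop reason, code version, input digest)
    pattern: tuple


def classify(hot_index, now_seconds, threshold, window_seconds):
    """Return (match_count, deadly, oldest_sequence) for the latest tombstone."""
    if not hot_index:
        return 0, 0, 0                   # no latest tombstone exists
    latest = max(hot_index, key=lambda t: t.sequence)
    matches = [
        t for t in hot_index
        if t.pattern == latest.pattern
        and now_seconds - t.timestamp_seconds <= window_seconds
    ]
    deadly = 1 if len(matches) >= threshold else 0
    oldest = min((t.sequence for t in matches), default=0)
    return len(matches), deadly, oldest
```

With a threshold of 3 and a 60-second window, two matching tombstones inside the window give a match count of 2 and deadly equal to 0; a third matching failure inside the same window would flip deadly to 1.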

std.cluster.tombstones converts callback records into canonical STL events. The adapter keeps cluster supervision and STL storage separate: the sink copies scalar tombstone fields, builds an ActorTombstone, and appends through an std.stl.lsm_store.LSMStore.

use std.cluster.local as cluster
use std.cluster.tombstones as tombstones
use std.db.lsm as lsm
use std.stl.lsm_store as lsm_store
use std.stl.store as store
pub func tombstone_sink(ctx: u64, record_raw: u64) -> i32 do
    let gs = as[*lsm.GrainStoreBytes](ctx)
    var stl = lsm_store.make_store(gs)
    var t = tombstones.zero()
    t.sequence = cluster.tombstone_sequence(record_raw)
    t.child = cluster.tombstone_child(record_raw)
    t.reason = cluster.tombstone_reason(record_raw)
    t.attempt_count = cluster.tombstone_attempt_count(record_raw)
    t.timestamp_seconds = cluster.tombstone_timestamp_seconds(record_raw)
    if tombstones.append_lsm(&stl, &t) != store.STORE_OK do
        return 0
    end
    return 1
end

The sink context should point at the borrowed GrainStoreBytes. The callback creates a short-lived LSMStore wrapper over that same store; fresh wrappers can rescan LSM truth later for count, rank lookup, and flush.

The local actor system can route a completed nursery task back to the supervised child slot:

const task = system.childTaskAt(1) orelse return error.MissingTask;
task.markCompleted(5);
const restarted_idx = try system.handleTaskCompleteByTask(task);

Stale task handles are rejected. This matters after a restart, because the old task pointer must not be allowed to affect the replacement child.

Restart budgets are opt-in:

system.setRestartLimit(2, 60);

From Janus:

_ = cluster.local_set_restart_limit(system, 2 as u32, 60 as i64)

The limit is counted per restart window. When the budget is exhausted, the supervisor moves to failed, records the failed child and reason, and stops remaining active children according to the implemented supervisor failure cleanup.
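
A restart budget of "2 restarts per 60 seconds" can be modelled as a sliding window over restart timestamps. The Python sketch below assumes a sliding window; the runtime may instead use fixed windows, and the class and method names are hypothetical.

```python
# Sliding-window model of a restart budget. The window style is an
# assumption; this page only says the limit is counted per restart window.

class RestartBudget:
    def __init__(self, max_restarts, window_seconds):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.restarts = []               # timestamps of recent restarts
        self.exhausted = False

    def try_restart(self, now_seconds):
        # Forget restarts that have aged out of the window.
        self.restarts = [t for t in self.restarts
                         if now_seconds - t < self.window_seconds]
        if len(self.restarts) >= self.max_restarts:
            self.exhausted = True        # the supervisor would move to failed
            return False
        self.restarts.append(now_seconds)
        return True
```

Under this model a third crash within the same 60 seconds exhausts a (2, 60) budget, while a third crash after the earlier restarts have aged out is allowed.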

Janus callers can test exhaustion through:

let exhausted = cluster.local_restart_limit_exhausted(system)

Pledge violations do not restart by default. This is intentional because pledge failure is a capability boundary event, not an ordinary crash. Local systems can explicitly opt in:

system.setRestartPledgeViolations(true);

From Janus:

_ = cluster.local_set_restart_pledge_violations(system, 1 as u32)

Use stopChild, stopChildren, or shutdown for explicit lifecycle control:

try system.stopChild(0, .shutdown);
_ = try system.stopChildren(.killed);
system.shutdown();

shutdown stops active children and moves the supervisor to stopped.

The facade exposes supervisor and child snapshots:

const supervisor_status = system.status();
const child_status = system.childStatus(0);
const failure = system.failure();

SupervisorStatus includes:

  • strategy and state
  • slot count
  • configured, active, stopped, and failed child counts
  • total restarts
  • restart exhaustion metadata
  • restart limit and remaining restarts

ChildStatus includes:

  • lifecycle
  • configured spec id and restart policy
  • actor id and task id when running
  • task state when available
  • last exit reason
  • restart count

Current limits:

  • Local runtime only.
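  • No distributed grain API. Grain support is limited to the local activation shell, registry, namespace lookup, and callback-based persistence described above.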
  • No placement, membership, gossip, or remote send.
  • No automatic actor registry integration.
  • No hot reload.
  • No persistence for actor state. Actor tombstones can be persisted to STL; live actor state replay remains future work.
  • Slot type is u64. Heterogeneous typed state and non-u64 payload fields remain future work.

The current goal is a correct local supervised-actor tracer bullet. Distributed :cluster features build on this surface later.