Clustering

Hive supports decentralized clustering in which up to ~5 hived nodes form a peer group with leaderless, eventually consistent state replication. Every node is its own authority: it applies writes locally and immediately, then gossips them to all connected peers, which merge them. There is no leader, no election, and no quorum - the cluster stays usable with any number of nodes online (from 1 to 5), and it recovers even when every node goes down and the cluster is cold-restarted.

Overview

Leaderless replication: every node applies its own writes locally and gossips them to all peers. Peers merge incoming writes but never re-broadcast them (Hive is a full mesh, so one hop reaches everyone).
Conflict resolution: projects and teams use last-writer-wins by a per-object updated_at timestamp; sessions use a generation-counter + tombstone-TTL scheme (owner-authoritative). Distinct objects are never lost on merge.
Anti-entropy on connect: when a peer connects, both sides exchange a full state snapshot and merge it per-object, so a node that was offline catches up the moment it reconnects.
Client failover: clients connect to any node via the multi-node connection list; every node serves reads and writes locally.
No process migration: agent/PTY processes are OS-local. A session is owned by the node running its process; if that node is offline the session shows as unreachable (its metadata is still replicated) until the owner returns.

Why leaderless? Hive's availability requirement (any node may be offline at any time, and often only 1-2 are online) is an AP requirement. Consensus (Raft) is CP and by design cannot make progress on a minority partition or a solo node, so leader election was the wrong tool. Going leaderless permanently removes the entire election bug class (split-vote livelock, catch-up wedge, phantom-leader, lonely-takeover).
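
The quorum argument can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it models nothing beyond the standard majority rule that consensus protocols such as Raft rely on.

```python
def raft_quorum(cluster_size: int) -> int:
    """Smallest majority: more than half of the configured nodes."""
    return cluster_size // 2 + 1


def can_commit(cluster_size: int, online: int) -> bool:
    """A consensus group only makes progress while a majority is reachable."""
    return online >= raft_quorum(cluster_size)


if __name__ == "__main__":
    # With 5 configured nodes, anything below 3 online stalls a consensus group.
    for online in range(1, 6):
        print(f"5 configured, {online} online -> can commit: {can_commit(5, online)}")
```

With 5 configured nodes the majority is 3, so the "often only 1-2 online" case is exactly the case where a consensus-based cluster would refuse writes.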

How It Works

Node Roles

There are no distinct roles. Every node is a co-equal authority for its own writes. For backward compatibility with the existing protocol, ClusterInfo.role reports every node as Leader and leader_id points at the node itself - this conveys "no single leader," and the daemon's local-apply paths key off is_leader() (always true).

Startup Sequence

Node starts and is immediately its own authority (applies writes locally).
Connects to configured peers via peer WebSocket (peer port).
On each peer connection (inbound or outbound), both sides send a full state snapshot for anti-entropy and merge what they are missing.
Local writes apply instantly and gossip to whichever peers are connected; offline peers catch up via anti-entropy when they reconnect.

Gossip and Anti-Entropy

apply_local_mutation (the single funnel every local write passes through) applies the mutation to local state and broadcasts the corresponding PeerMessage to all connected peers. Inbound replicated mutations (SessionUpserted / SessionRemoved / ProjectUpserted / ProjectDeleted / TeamUpserted / TeamRemoved) are applied to local state but not re-broadcast - this avoids gossip loops in the full mesh.

On PeerJoined, the node sends its full current state via ReplicationSender::send_snapshot. Because both sides fire PeerJoined, both exchange snapshots and converge. Snapshots are merged by the safe per-object ClusterState::restore (below), never blindly overwritten.
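
The one-hop gossip rule and the snapshot exchange on connect can be modelled in a few lines. This is a toy in-memory sketch, not Hive's code: the Node class, the connect helper, and the flat key/value state are assumptions made for illustration, and merge_snapshot is only a stand-in for the real per-object merge.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy node; `state` maps object id -> value (hypothetical shape)."""
    name: str
    state: dict = field(default_factory=dict)
    peers: list = field(default_factory=list)  # currently connected peers
    sent: int = 0                              # messages this node broadcast

    def apply_local_mutation(self, key, value):
        # Single funnel for local writes: apply first, then gossip one hop.
        self.state[key] = value
        for peer in self.peers:
            self.sent += 1
            peer.on_peer_message(key, value)

    def on_peer_message(self, key, value):
        # Inbound replicated mutation: apply locally, never re-broadcast.
        self.state[key] = value

    def merge_snapshot(self, snapshot):
        # Stand-in for the per-object merge: only fill in what is missing.
        for key, value in snapshot.items():
            self.state.setdefault(key, value)


def connect(a: Node, b: Node):
    # Both sides observe the join, so both send a full snapshot.
    a.peers.append(b)
    b.peers.append(a)
    snap_a, snap_b = dict(a.state), dict(b.state)
    b.merge_snapshot(snap_a)
    a.merge_snapshot(snap_b)
```

In a three-node mesh built with connect, a write on one node produces exactly two messages (one per peer), and the receivers send none, which is why no loop can form.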

Conflict Resolution (Safe Merge)

Two independent nodes each own their own state, so the per-node version counter is not comparable across nodes - the old "reject snapshot if its version is not newer" guard is gone. ClusterState::restore merges per object instead:

Sessions: union. Incoming sessions honour the existing tombstone + generation punch-through rules (see Session Liveness Is Owner-Authoritative); a local session the peer lacks is kept. Tombstones are unioned (latest removed_at, highest recorded generation).
Projects / Teams: union, last-writer-wins (LWW) by updated_at on a key collision; an exact-timestamp tie keeps the existing entry (deterministic). Every project/team create/update is stamped updated_at = now centrally in apply_local_mutation, so a local write always wins over stale gossip. Incremental ProjectUpserted / TeamUpserted apply is LWW too - an out-of-order replicated upsert older than what we hold is dropped.

Practical-consistency note: for a single user, concurrent edits to the same object on two nodes that cannot reach each other are rare; when they do happen, LWW silently drops the older edit. Distinct objects are always preserved.
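
The project/team rule above (union, newer updated_at wins, exact tie keeps the existing entry) can be sketched as follows. The Project type and merge_projects function are hypothetical names for illustration; only the updated_at field and the merge rule come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    updated_at: int  # e.g. milliseconds since epoch (unit is an assumption)


def merge_projects(local: dict, incoming: dict) -> dict:
    """Union two project maps; on a key collision the newer updated_at wins."""
    merged = dict(local)
    for key, theirs in incoming.items():
        ours = merged.get(key)
        # Strictly newer wins; an exact tie keeps the entry we already hold.
        if ours is None or theirs.updated_at > ours.updated_at:
            merged[key] = theirs
    return merged


if __name__ == "__main__":
    local = {"p1": Project("old name", 100), "p2": Project("mine", 50)}
    incoming = {"p1": Project("new name", 200), "p3": Project("theirs", 10)}
    print(merge_projects(local, incoming))
```

The same comparison covers the incremental case: a replicated upsert whose updated_at is not newer than the held entry leaves the map unchanged.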

State Replication

Replicated data includes:

Session metadata (SessionInfo: id, name, status, working_dir, model, timestamps)
Recent output history (for replay)
Project settings (name, working_dir, default model, timestamps, updated_at)
Teams (members, status, results, updated_at)

Not replicated:

Running agent/PTY processes (OS-local)
Live broadcast channels (recreated on demand)

Session Liveness Is Owner-Authoritative

Every session is owned by exactly one node - the node that holds the live PTY/SDK process. Only that owner may add, keep, or remove its own session:

No node reaps a remote peer's sessions on transient signals. A dropped ownership announce, a brief link flap, or a disconnect does not remove the peer's sessions. Those processes keep running across a network blip; under the earlier leader-based design, tearing their cluster state down on churn (e.g. an election storm) destroyed live sessions. The session simply shows as unreachable until the owner returns.
A departed node's sessions are pruned only by an explicit remove_node. This trades lingering phantoms for a guarantee that churn never loses a live session.
Generation-stamped recovery. Each SessionCreated/SessionUpdated an owner emits carries a monotonic generation (seeded from wall-clock millis, so it increases across restarts too). A SessionRemoved records the removed generation as a tombstone. If a session is ever removed erroneously, the live owner re-asserts it with a higher generation that punches through the tombstone, restoring it to cluster state. Owners re-assert on reconnect, on receiving a snapshot, and on a periodic timer.
"Owner unknown" is not "dead." A node asked to deliver input to - or kill - a session it does not positively own (a PTY-less mirror, or an entry transiently absent from state) returns a soft, retryable error and never declares the session dead - so a routing hiccup cannot cascade into a cluster-wide removal. A kill request first waits briefly for ownership to settle (it can read as unknown during churn) before routing; only if it stays unknown does the soft error surface, which the client retries. Killing the local mirror in that window would emit a SessionRemoved the live owner simply refuses and re-announces, falsely reporting success while the PTY survives.

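The generation and tombstone rules described in this section can be sketched as a small table. This is a simplified model under stated assumptions: SessionTable and next_generation are hypothetical names, the tombstone TTL is left out, and the exact comparison (an upsert at or below the tombstoned generation is dropped) is one reading of "a higher generation punches through".

```python
import time


class SessionTable:
    """Toy registry: live sessions by generation, plus removal tombstones."""

    def __init__(self):
        self.live = {}        # session id -> highest generation seen
        self.tombstones = {}  # session id -> generation recorded at removal

    def upsert(self, session_id: str, generation: int) -> bool:
        # At or below the tombstoned generation: stale, so it stays removed.
        if generation <= self.tombstones.get(session_id, -1):
            return False
        # Older than what we already hold: out-of-order, ignore.
        if generation < self.live.get(session_id, -1):
            return False
        self.live[session_id] = generation
        self.tombstones.pop(session_id, None)  # newer generation punches through
        return True

    def remove(self, session_id: str, generation: int) -> None:
        # Record the removed generation so replayed older upserts stay dead.
        self.live.pop(session_id, None)
        previous = self.tombstones.get(session_id, -1)
        self.tombstones[session_id] = max(previous, generation)


def next_generation(previous: int) -> int:
    """Monotonic counter seeded from wall-clock millis, so it grows across restarts."""
    return max(previous + 1, int(time.time() * 1000))
```

If a removal at generation 10 arrives by mistake, a replay of generation 10 is refused, while the owner's re-assertion at generation 11 restores the session.
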
Node Resource Metrics

Each daemon samples its own host resource usage - CPU %, memory, swap, and network throughput - every 10 seconds into a rolling 1-hour ring (360 samples, in memory; not persisted). Collection uses the cross-platform sysinfo crate and runs on a dedicated OS thread so it never competes with the async runtimes.

Two derived views leave each node:

A compact summary - the current sample plus 1-hour averages - is broadcast to peers every tick via PeerMetricsAnnounce (a Lossy control-lane message) and carried in each node's NodeInfo.metrics. The daemon re-pushes ClusterStatus on the same tick, so the app's node cards show live CPU / memory / swap / network for every node, updating roughly every 10 seconds.
The full per-sample series is fetched on demand with GetNodeMetricsHistory { node_id } → NodeMetricsHistory. A request for a remote node is forwarded to that node (which owns its own ring) and the response relayed back. This keeps the hot path lean - only summaries ride the cluster continuously; the heavier series is paid for only when something is charting it.
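
The ring and its two derived views can be modelled with a bounded deque. This is a sketch under assumptions: MetricsRing, Sample, and the two fields shown are hypothetical, and the real daemon collects more metrics (swap, network) through the sysinfo crate rather than anything shown here.

```python
from collections import deque
from dataclasses import dataclass
from statistics import fmean

SAMPLE_INTERVAL_SECS = 10
RING_CAPACITY = 3600 // SAMPLE_INTERVAL_SECS  # 360 samples = 1 hour


@dataclass(frozen=True)
class Sample:
    cpu_pct: float
    mem_bytes: int


class MetricsRing:
    def __init__(self, capacity: int = RING_CAPACITY):
        # A bounded deque drops the oldest sample automatically when full.
        self.samples = deque(maxlen=capacity)

    def push(self, sample: Sample) -> None:
        self.samples.append(sample)

    def summary(self) -> dict:
        """Compact view: current sample plus averages over the retained window."""
        if not self.samples:
            return {}
        return {
            "current": self.samples[-1],
            "avg_cpu_pct": fmean(s.cpu_pct for s in self.samples),
            "avg_mem_bytes": fmean(s.mem_bytes for s in self.samples),
        }

    def history(self) -> list:
        """Full per-sample series, intended to be fetched only on demand."""
        return list(self.samples)
```

The summary stays a fixed, small size regardless of how full the ring is, which is what makes it cheap to broadcast on every tick; history() is the heavier view that would be served only to a charting client.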

The app's cluster view renders the live values and an inline sparkline per metric on each node card.

Node Failure and Recovery

Leaderless replication has no failover step - there is no leader to replace:

When a node goes offline, the remaining nodes keep applying their own writes locally and gossiping to each other; the cluster stays fully usable.
Sessions owned by the offline node show as unreachable (their OS-local processes cannot be reached from the rest of the cluster) but their replicated metadata is preserved.
Clients connect to any node via the --nodes list; every node serves reads and writes locally.
When the node comes back, anti-entropy on reconnect exchanges full snapshots in both directions and the per-object merge reconciles everything it missed.
Even an all-down then cold restart converges: each node loads its own persisted state and re-merges as peers reconnect.

Setup Guide

Prerequisites

Two or more machines, up to the ~5-node limit (or separate ports on one machine for testing)
Same cluster_token on all nodes
Same token (client auth) on all nodes for seamless failover
Network connectivity between all nodes on the peer port

Two-Node Cluster

Node A (10.0.0.1):

toml

bind = "0.0.0.0"
port = 9178
token = "client-auth-token"
claude_bin = "claude"

[cluster]
peers = ["10.0.0.2:9179"]
peer_port = 9179
cluster_token = "my-cluster-secret"
advertise_addresses = ["10.0.0.1:9178"]
display_name = "node-a"

Node B (10.0.0.2):

toml

bind = "0.0.0.0"
port = 9178
token = "client-auth-token"
claude_bin = "claude"

[cluster]
peers = ["10.0.0.1:9179"]
peer_port = 9179
cluster_token = "my-cluster-secret"
advertise_addresses = ["10.0.0.2:9178"]
display_name = "node-b"

Local Testing (Single Machine)

Node A (port 9178/9179):

toml

bind = "127.0.0.1"
port = 9178
token = "test-token"

[cluster]
peers = ["127.0.0.1:9279"]
peer_port = 9179
cluster_token = "test-cluster"
advertise_addresses = ["127.0.0.1:9178"]
display_name = "local-a"

Node B (port 9278/9279):

toml

bind = "127.0.0.1"
port = 9278
token = "test-token"

[cluster]
peers = ["127.0.0.1:9179"]
peer_port = 9279
cluster_token = "test-cluster"
advertise_addresses = ["127.0.0.1:9278"]
display_name = "local-b"

Start each with separate config files or config directories.
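
The single-machine setup is easy to get wrong because each node's peers entry must point at the other node's peer_port. The check below is a hypothetical helper, not part of Hive: it mirrors the two configs above as plain dictionaries and assumes that a peer is dialled at the host of its first advertise address combined with its peer_port.

```python
def peer_address(cfg: dict) -> str:
    """host:peer_port another node would dial (host taken from advertise_addresses[0])."""
    host = cfg["cluster"]["advertise_addresses"][0].rsplit(":", 1)[0]
    return f"{host}:{cfg['cluster']['peer_port']}"


def check_pair(a: dict, b: dict) -> list:
    """Return human-readable problems found in a two-node configuration."""
    problems = []
    if a["cluster"]["cluster_token"] != b["cluster"]["cluster_token"]:
        problems.append("cluster_token differs")
    if a["token"] != b["token"]:
        problems.append("client token differs")
    for src, dst in ((a, b), (b, a)):
        name = src["cluster"]["display_name"]
        if peer_address(dst) not in src["cluster"]["peers"]:
            problems.append(f"{name} does not list {peer_address(dst)} in peers")
        if src["port"] == src["cluster"]["peer_port"]:
            problems.append(f"{name} uses the same port for clients and peers")
    return problems


# The two local test configs from this section, as dictionaries.
local_a = {"port": 9178, "token": "test-token", "cluster": {
    "peers": ["127.0.0.1:9279"], "peer_port": 9179, "cluster_token": "test-cluster",
    "advertise_addresses": ["127.0.0.1:9178"], "display_name": "local-a"}}
local_b = {"port": 9278, "token": "test-token", "cluster": {
    "peers": ["127.0.0.1:9179"], "peer_port": 9279, "cluster_token": "test-cluster",
    "advertise_addresses": ["127.0.0.1:9278"], "display_name": "local-b"}}

if __name__ == "__main__":
    print(check_pair(local_a, local_b) or "configs look consistent")
```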

Client Configuration

bash

export HIVE_TOKEN="client-auth-token"
export HIVE_NODES="10.0.0.1:9178,10.0.0.2:9178"

# Commands work transparently - the client connects to any node
hive ls
hive new --dir /project
hive send <session-id> "hello"
hive cluster-status
hive rename-node <node-id> "my-node"   # rename a node at runtime
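
How a client might walk the node list can be sketched as below. This is an assumption about behaviour, not the hive client's actual logic: parse_nodes and connect_any are hypothetical names, the try-in-list-order strategy is a guess, and dial is injected so the sketch does not depend on any particular transport.

```python
def parse_nodes(value: str) -> list:
    """Split a comma-separated host:port list such as the HIVE_NODES value."""
    nodes = []
    for item in value.split(","):
        item = item.strip()
        if item:
            host, port = item.rsplit(":", 1)
            nodes.append((host, int(port)))
    return nodes


def connect_any(nodes, dial):
    """Return the first connection `dial` can open, trying nodes in list order."""
    errors = []
    for host, port in nodes:
        try:
            return dial(host, port)
        except OSError as exc:
            # Unreachable node: remember why and move on to the next one.
            errors.append(f"{host}:{port}: {exc}")
    raise ConnectionError("no node reachable: " + "; ".join(errors))
```

Because every node serves reads and writes locally, whichever node answers first is sufficient; for a real probe, dial could wrap socket.create_connection((host, port), timeout=2).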

Desktop App

The Tauri desktop app provides a graphical Cluster Status view with:

Connection state at a glance (every node is a co-equal authority - there is no leader/follower distinction in leaderless mode)
Pencil icons to rename any node inline
The connected node is labeled with its display name (or "Connected Node" if unnamed)
An Open session as user button on each node card

Node cards (including the connected node) are ordered by display name (falling back to node id), so a card's position is stable and predictable.

Known nodes are remembered across restarts. When a peer first connects, its identity (node id, name, address) is recorded in the daemon's persisted peer_nodes registry. A node that is configured but not currently connected stays in the view as an Offline card instead of disappearing - even after a fleet-wide update wipes every daemon's in-memory peer cache. Removing a node (its card's trash icon) clears it from the registry so it does not reappear.

Open the app, connect to any cluster node, and navigate to Cluster to see the full topology.

Open session as user

Each node card has an Open session as user action. It opens a modal listing that node's non-system OS users (plus a shell picker) and launches a single one-off terminal session that:

runs as the selected user (via runuser on Unix / hive-runas.exe on Windows - the same mechanism a project's run_as_users mapping uses), and
starts in that user's home directory, so the session is grouped under the username in the workspace.

Unlike the per-project run-as mapping, this session is not tied to any project. The home directory comes from the node's OsUserInfo.home_dir (resolved from /etc/passwd on Unix); when no home directory is known the user is not selectable.
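
On Unix, the selection rule could look roughly like the sketch below, which parses passwd-format text. It is illustrative only: selectable_users is a hypothetical function, and treating UIDs below 1000 as system accounts is an assumption (the threshold differs between distributions, and Hive's own definition of "non-system" is not stated here).

```python
def selectable_users(passwd_text: str, min_uid: int = 1000) -> list:
    """Return users that have a home directory and a UID at or above min_uid."""
    users = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 7:
            continue  # malformed entry
        # passwd layout: name:password:UID:GID:GECOS:home:shell
        name, _pw, uid, _gid, _gecos, home, shell = fields[:7]
        if not uid.isdigit() or int(uid) < min_uid:
            continue  # assumed to be a system account
        if not home:
            continue  # no known home directory -> not selectable
        users.append({"name": name, "home_dir": home, "shell": shell})
    return users


if __name__ == "__main__":
    sample = "root:x:0:0:root:/root:/bin/bash\nalice:x:1000:1000::/home/alice:/bin/bash\n"
    print(selectable_users(sample))
```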

Edit Cluster Configuration dialog

The Settings button at the top of the Cluster Status view opens an Edit Cluster Configuration dialog that surfaces the most common settings without hand-editing config.toml. Saves persist to disk and broadcast to every connected peer:

Timing - heartbeat_interval_ms, min_quorum, election_timeout_min/max_ms (retained for backward compatibility; no longer affect leaderless behaviour)
Advertise Addresses - comma-separated host:port list of addresses peers use to reach this node (leave blank to keep the auto-detected value)
Adoption Password - the optional user_secret that allows adopt-mode access without the full token
Default PTY Shell - per-node only; sets default_pty_shell (e.g. bash, pwsh); blank falls back to platform auto-detection
Cluster Update Source - the remote-update manifest URL, bearer token, auto-check toggle and interval (see Remote-URL Updates); replicated to peers via SyncRemoteUpdateConfig

Switch to Edit as text to edit all of the above as a flat key=value block - useful when you want to round-trip a setting that isn't exposed in the form. Note: shell presets and peer_nodes are managed elsewhere (the per-node Shells section of the node config dialog and Adopt Node respectively) and aren't editable from this dialog.

Verification

1. Connectivity Test

Start both nodes. Check logs:

INFO starting Hive daemon
INFO cluster peer server started peer_bind="0.0.0.0:9179"
INFO cluster mode enabled node_id=a1b2c3d4-...
...
INFO sent anti-entropy snapshot to newly connected peer peer=b2c3d4e5-...

Running hive cluster-status on either node should list both nodes as connected.

2. Replication Test

Create a session on either node:

bash

hive new --dir /tmp/test --name "test-session"

The session should appear in hive ls on both nodes (it gossips to the peer).

3. Offline / Merge Test

bash

# Connect to the cluster from a client
hive --nodes 10.0.0.1:9178,10.0.0.2:9178 ls

# Take one node offline, create a project on the still-up node, then bring the
# offline node back. Anti-entropy on reconnect merges the new project both ways.
hive --nodes 10.0.0.1:9178,10.0.0.2:9178 ls
# Same sessions/projects visible from either node

Limitations

~5 nodes - the full-mesh gossip is designed for small clusters.
No process migration - agent/PTY processes are OS-local; a session is only interactive on its owning node, and is unreachable while that node is offline.
Peer transport is plain WebSocket - peer connections do not use TLS.
No automatic peer discovery - peers must be explicitly configured.
LWW drops one side of a same-object conflict - concurrent edits to the same project/team on two nodes that cannot reach each other resolve by updated_at; the older edit is silently dropped (rare for a single user; distinct objects are never lost).

Tuning

Leaderless replication has no election or heartbeat timers to tune. The heartbeat_interval_ms / election_timeout_* / min_quorum config keys are retained for backward compatibility but no longer affect behaviour.
