Cluster Troubleshooting Runbook

This page covers the current operational checks for Hive cluster deployments. It focuses on commands and config fields that exist in the current CLI and daemon.

1. Node Does Not Join the Cluster

Symptoms

  • hive cluster-status never shows the new node
  • Logs show repeated peer connection failures
  • The node starts, but no peer connections appear

Check

  1. Verify the shared cluster secret on every node:

    toml
    [cluster]
    cluster_token = "same-secret-on-every-node"
  2. Verify peer addresses:

    toml
    [cluster]
    peers = ["10.0.0.2:9179", "10.0.0.3:9179"]
    peer_port = 9179
  3. Verify the node is advertising a reachable client address:

    toml
    [cluster]
    advertise_addresses = ["10.0.0.1:9178"]
  4. Verify network reachability:

    bash
    curl http://10.0.0.1:9178/health
    nc -zv 10.0.0.1 9179
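
The reachability check above can be repeated for every entry in peers. The following is a minimal sketch, not part of the hive CLI: the function names are hypothetical, the peer list is passed in by hand as host:port pairs, and it assumes every node serves /health on client port 9178 as in the example config.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the hive CLI.

# Split "host:port" into its two parts (IPv4 address or hostname only).
peer_host() { printf '%s\n' "${1%:*}"; }
peer_port() { printf '%s\n' "${1##*:}"; }

# Probe one peer: the client health endpoint first, then the peer port.
# Assumes the client port is 9178, as in the example config.
check_peer() {
  local host port
  host=$(peer_host "$1")
  port=$(peer_port "$1")
  if curl -fsS --max-time 5 "http://${host}:9178/health" >/dev/null; then
    echo "${host}: client port 9178 OK"
  else
    echo "${host}: client port 9178 UNREACHABLE"
  fi
  if nc -z -w 5 "$host" "$port"; then
    echo "${host}: peer port ${port} OK"
  else
    echo "${host}: peer port ${port} UNREACHABLE"
  fi
}
```

Usage, run from the node that fails to join: for p in 10.0.0.2:9179 10.0.0.3:9179; do check_peer "$p"; done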

Fix

  • Correct cluster_token, peers, peer_port, or advertise_addresses
  • Restart the daemon after changing cluster settings

2. Peers Connect but Sessions Do Not Replicate

Symptoms

  • A node does not show sessions or projects created on a peer
  • A node that was offline comes back but still shows stale state

Cause

Leaderless replication gossips each local write to all peers once, then runs anti-entropy on every reconnect: both sides exchange a full StateSnapshot and merge it per object. If a node is missing a peer's objects, the gossip message was lost while the peer was unreachable and the reconnect anti-entropy has not run yet. This is almost always a peer-connectivity problem, not a state-machine wedge.

Check

  1. Confirm both nodes see each other as connected:

    bash
    hive cluster-status
  2. Inspect logs for the anti-entropy snapshot exchange on reconnect:

    bash
    RUST_LOG=hive_cluster=debug,hive_daemon=debug hived

    Look for the log line "sent anti-entropy snapshot to newly connected peer".

  3. Confirm each node can still reach the other on both the client and peer ports.

Fix

  • Restore peer connectivity. Once the peer connection re-establishes, the reconnect anti-entropy exchange merges both sides' state automatically - no restart needed. Distinct objects are never lost; concurrent edits to the same object resolve last-writer-wins by updated_at.
  • If a node still does not converge, re-check peers, peer_port, and advertise_addresses on both sides.
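
To make the merge rule concrete, here is an illustration of a per-object last-writer-wins merge. It is a sketch of the rule only, not Hive's implementation: the record format (object id, numeric updated_at, payload, one record per line) is made up for the example.

```bash
#!/usr/bin/env bash
# Illustration only: merge snapshot records read from stdin.
# Each record is "object_id updated_at payload" (a hypothetical format).
# For every object_id, the record with the highest updated_at wins;
# objects that exist on only one side pass through unchanged.
lww_merge() {
  LC_ALL=C sort -k1,1 -k2,2nr | awk '!seen[$1]++'
}
```

Feeding it node A's records (s1 100 alpha, s2 200 beta) followed by node B's records (s1 150 alpha2, s3 50 gamma) yields s1 150 alpha2, s2 200 beta, and s3 50 gamma: the newer edit to s1 wins, and the objects that existed on only one side are both kept.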

3. Client Auth Works on One Node but Fails on Another

Hive uses two different tokens:

Token          Purpose
token          Client auth for CLI and app connections
cluster_token  Peer auth between cluster nodes

Check

  1. For CLI/app connection failures, verify that HIVE_TOKEN matches the node's token, or the cluster-wide access token you intend to use.
  2. For node-to-node failures, verify every node has the same cluster_token.
  3. Inspect logs for wrong token, auth rejected, or timeout messages.

Fix

  • Update the client token you are using
  • Or update the daemon config so all peer nodes share the same cluster_token
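
When comparing cluster_token across nodes, it can be easier to compare a fingerprint than to paste the secret into a chat or ticket. The sketch below is a hypothetical helper, not a hive command. It assumes the token sits on a single line of the form cluster_token = "..." in config.toml and that sha256sum is available (on macOS, shasum -a 256 is the usual substitute).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a SHA-256 fingerprint of the cluster_token
# found in the given config file. The client "token" key is not matched.
# If no cluster_token line is found, this prints the hash of an empty string.
cluster_token_fingerprint() {
  sed -n 's/^[[:space:]]*cluster_token[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "$1" \
    | head -n 1 | tr -d '\n' | sha256sum | cut -d' ' -f1
}
```

Run it against each node's config.toml; every node in the cluster should print the same fingerprint.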

4. A Node Goes Offline

What to expect

Leaderless clustering has no leader, no election, and no quorum, so there is no failover to wait on. Every remaining node keeps serving reads and writes for the state it already holds - a node going offline never stalls the rest of the cluster. The only thing lost while a node is offline is interactive access to the sessions it owns: agent and PTY processes are OS-local, so a session is only interactive on its owning node and is unreachable until that node returns.

When the node comes back, anti-entropy on reconnect merges any state that diverged while it was away (see section 2).

Symptoms that are NOT a cluster fault

  • hive cluster-status shows a peer as not connected - this is expected while that node is down; the rest of the cluster is unaffected.
  • Sessions owned by the offline node fail to load or accept input - expected; they resume when the owner returns.

Check

  1. Confirm the still-up nodes see each other as connected:

    bash
    hive cluster-status
  2. Confirm the offline node is actually down (vs. a peer-connectivity break) and bring it back or restore connectivity on peer_port.
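
One way to tell a down node from a peer-connectivity break is to probe the client port and the peer port separately, as in section 1, and combine the two results. The mapping below is a rough sketch of that reasoning, not daemon behavior, and classify_node is a hypothetical name.

```bash
#!/usr/bin/env bash
# classify_node CLIENT_OK PEER_OK   (each argument is "yes" or "no")
# CLIENT_OK: did curl http://<host>:9178/health succeed?
# PEER_OK:   did nc -z <host> <peer_port> succeed?
classify_node() {
  case "$1/$2" in
    yes/yes) echo "reachable: not a connectivity problem" ;;
    yes/no)  echo "peer-connectivity break: daemon is up but peer_port is blocked" ;;
    no/no)   echo "node down or host unreachable" ;;
    no/yes)  echo "client port blocked or daemon unhealthy" ;;
    *)       echo "usage: classify_node yes|no yes|no" >&2; return 2 ;;
  esac
}
```

Run the two probes from a node that is still up, then pass the outcomes in, for example classify_node yes no.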

5. Node Drops Out After Restart or IP Change

Symptoms

  • A node was healthy before restart but disappears afterward
  • DHCP or Tailscale address changed

Check

  1. Confirm the node's current client-facing address.
  2. Update advertise_addresses if the node should now be reached at a different host:port.
  3. Update peers on the other nodes if the peer endpoint changed.

Example:

toml
[cluster]
advertise_addresses = ["100.64.1.12:9178"]
peers = ["100.64.1.13:9179", "100.64.1.14:9179"]

Fix

  • Save the corrected addresses
  • Restart the changed node
  • If needed, restart the peers so they reconnect using the updated config

6. Remote Update From URL Fails

Check

  1. Verify the manifest URL is reachable:

    bash
    curl -I https://updates.example.com/hive/manifest.json
  2. Verify the manifest contains an artifact for the target platform.

  3. For private endpoints, verify HIVE_UPDATE_TOKEN or --remote-token.

Current commands

bash
hive update --from https://updates.example.com/hive/manifest.json
hive update-node --from https://updates.example.com/hive/manifest.json

Fix

  • Correct the manifest URL
  • Publish the missing platform artifact
  • Supply the correct bearer token
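
The HTTP status returned for the manifest URL usually points at which of these fixes applies. The sketch below assumes the update endpoint accepts the token as a standard Authorization: Bearer header and answers with conventional status codes; both are assumptions about your update server, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Print the HTTP status code for a manifest URL. The bearer token is
# sent only when HIVE_UPDATE_TOKEN is set. curl prints 000 when no
# HTTP response was received at all.
manifest_status() {
  if [ -n "${HIVE_UPDATE_TOKEN:-}" ]; then
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
      -H "Authorization: Bearer ${HIVE_UPDATE_TOKEN}" "$1"
  else
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
  fi
}

# Map a status code to the most likely fix from the list above.
explain_status() {
  case "$1" in
    2??)     echo "reachable: check the manifest for the target platform artifact" ;;
    401|403) echo "auth rejected: supply the correct bearer token" ;;
    404)     echo "not found: correct the manifest URL" ;;
    000)     echo "no response: check DNS, TLS, and network reachability" ;;
    *)       echo "unexpected status $1" ;;
  esac
}
```

Usage: explain_status "$(manifest_status https://updates.example.com/hive/manifest.json)"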

7. Mobile App Is Very Slow When Several Sessions Are Already Open

Symptoms

  • The app launches or resumes on Android, but the workspace stays sluggish for several seconds
  • Switching into the workspace with many restored sessions causes delayed paints or input lag
  • Daemon logs show a burst of GetSessionHistory requests immediately after reconnect

Cause

  • The app restores background session subscriptions after reconnect.
  • Older builds also requested full history for every restored pane session, even when the terminal was not currently attached on screen.
  • On mobile this multiplies terminal replay work and can stall the first interactive paint.

Fix

  • Update to a build that only refreshes full history for terminals that are actually attached and visible.
  • Background sessions still re-subscribe for live output, but they defer history replay until the user actually opens them.

Check

  1. Open Logs or the daemon log folder.
  2. Reproduce an app launch/resume with several open sessions.
  3. Confirm that reconnect now shows history fetches for the visible terminal(s), not every restored pane session.
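
Step 3 can be made measurable by counting history requests in a captured log. This is a small sketch: it assumes the daemon log is a plain-text file in which each request mentions GetSessionHistory on its own line, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Count the lines in a daemon log file that mention GetSessionHistory.
# Prints 0 (and returns a non-zero status, as grep does) when none match.
count_history_fetches() {
  grep -c 'GetSessionHistory' "$1"
}
```

Capture the log for a single reconnect before and after updating. On a fixed build the count should be close to the number of visible terminals rather than the number of restored pane sessions.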

8. Android Gboard Suggestion Taps Duplicate Earlier Text In Form Fields

Symptoms

  • Accepting a suggested word in an app form field inserts the earlier text again
  • The duplication is intermittent and happens more often with rapid mobile typing
  • Plain typing usually works, but suggestion-chip picks can replay part or all of the existing buffer

Cause

  • Android IME composition can keep an uncommitted in-progress value inside the input element.
  • Older builds only stored the parent v-model value, so any rerender during composition wrote that stale value back into the DOM.
  • Gboard then interpreted the stale write-back as a reset and re-sent the buffered text.

Fix

  • Update to a build that keeps a local in-progress value inside shared app Input and Textarea components until composition ends.

Check

  1. Open any form field in the app, such as session rename, task command, or project notes.
  2. Type a few words with Gboard and accept a suggestion chip mid-sentence.
  3. Confirm the field now keeps only the intended text instead of re-inserting the earlier buffer.

9. Daemon Lags Under Heavy CPU Load From Other Processes

Symptoms

  • Peer-connection keepalive pings time out and peers mark this node offline while the host is otherwise busy (media transcoding, builds, backups)
  • WebSocket sessions stutter or disconnect even though the network is fine
  • hive cluster-status from another node reports this node as flapping

What users now see in clients

  • Recent CLI and app builds surface non-fatal daemon notices instead of leaving the slowdown silent.
  • Remote-session attaches can show a notice that the live PTY is running on a different cluster node and Hive is forwarding traffic there.
  • If a viewer falls behind on terminal output, the client can show warnings such as "This client is falling behind on live session output" or "Hive stopped one live output stream because this client could not keep up".

What hived does automatically

  • On every start, hived raises its own CPU and I/O priority: nice -10 on Linux/macOS via setpriority(2), best-effort I/O priority 0 on Linux, and HIGH_PRIORITY_CLASS on Windows via SetPriorityClass. On Linux it also lowers its OOM score to make the daemon a last-resort kill target. This applies on fresh installs, self-updates, and manual restarts. A line like "raised CPU priority (nice = -10)" appears in the daemon log on success; a warning is logged if the OS denies the request.
  • Spawned workload processes are pushed to low priority so Claude, shells, task commands, headless agents, and CPU-heavy builds do not compete with the daemon's control plane. On Linux they are set to nice +10, SCHED_BATCH, and the lowest best-effort I/O priority (BELOW_NORMAL_PRIORITY_CLASS on Windows). PTY sessions re-apply this briefly after spawn so wrappers such as runuser cannot pass daemon priority to the real shell.
  • On Linux, when the systemd unit delegates the cgroup (Delegate=yes), hived splits its service cgroup into daemon/ and workload/ leaves, moves its own threads into daemon/, and weights daemon/ ~100x over workload/ (cpu.weight 10000 vs 100). Spawned PTY trees are placed in workload/ and their descendants inherit it, so a build that fans out long after spawn still cannot starve the daemon. This is an OS-enforced CPU reservation rather than the CFS hint that nice alone provides. A line like "cgroup workload isolation enabled" appears on success; "cgroup workload isolation unavailable" (e.g. an older unit without Delegate=yes, or cgroup v1) means it fell back to nice/ioprio only. The split is best-effort and never blocks startup or session spawn.
  • The systemd units generated by the CLI, desktop app, and hive.ps1 deploy set Nice=-10, IOSchedulingClass=best-effort, IOSchedulingPriority=0, CPUWeight=10000, IOWeight=10000, OOMScoreAdjust=-900, Delegate=yes, MemoryMin=256M, and MemoryLow=256M so the elevated priority and cgroup delegation are in place before the binary's own startup code runs.
  • No application timer ever drops a client connection (SSH semantics). A widening pong gap is logged as connection stale and held; output sheds via the resumable ring instead of timing out; genuine death is detected by the kernel (TCP keepalive + TCP_USER_TIMEOUT on the listener) and surfaces as a socket error. The only daemon-initiated teardown is the zombie reap of a connection with no attached sessions that has also stopped answering pings. See docs/transport.md (Keepalive).
  • If replicated state briefly loses a live PTY session's owner while output is still visible, the owner re-announces for several seconds and PTY input probes connected peers before surfacing "owner is not known yet".
  • The peer circuit breaker credits host-load stall before disconnecting. A node thrashing in swap cannot drain its own outbound channels in time and would otherwise time out every peer send at once and partition itself out of a healthy mesh. Instead, when at least 5 s of local scheduling stall is credited across a timeout streak, the link is held (holding peer, not partitioning self) and recovers when load drops; a genuinely dead peer is still reaped by the per-peer receiver task and TCP keepalive. See docs/transport.md.
  • Nothing slow runs on a connection's select loop. Session-stream operations, claims, kills, history fetches, and every peer-forwarding request are dispatched to ordered off-loop workers or detached tasks, so a slow peer's RPC timeout can never stall this client's heartbeat and output while a different node is overloaded - the same protection long applied to keystrokes and resizes.
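
For an existing unit that predates these settings, the directives listed above can be supplied as a systemd drop-in. The sketch below only prints the fragment. The values are the ones listed above; the function name and the drop-in file name are hypothetical, and the unit name hived is an assumption to adjust for your install.

```bash
#!/usr/bin/env bash
# Print a systemd drop-in carrying the priority and cgroup directives
# that the generated units set. Nothing is written to disk here.
print_priority_dropin() {
  cat <<'EOF'
[Service]
Nice=-10
IOSchedulingClass=best-effort
IOSchedulingPriority=0
CPUWeight=10000
IOWeight=10000
OOMScoreAdjust=-900
Delegate=yes
MemoryMin=256M
MemoryLow=256M
EOF
}
```

To apply it, create /etc/systemd/system/hived.service.d/ with sudo mkdir -p, pipe the output through sudo tee into a file such as priority.conf in that directory, then run sudo systemctl daemon-reload && sudo systemctl restart hived.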

Check

  1. Confirm hived is actually running with raised priority:

    bash
    # Linux
    ps -o pid,ni,cmd -C hived
    # NI column should be -10 (or close to it)
    powershell
    # Windows
    Get-Process hived | Select-Object Id, PriorityClass
    # PriorityClass should be High
  2. On Linux, if the NI value is 0, the daemon's self-elevation was denied - check journalctl -u hived | grep "raise CPU priority" for the reason (usually capability/permission related).

  3. Confirm session children are not inheriting hived's elevated nice value:

    bash
    ps -eo pid,ppid,ni,cmd | grep -E 'hived|claude|codex|bash|zsh|pwsh' | grep -v grep
    # hived should be around -10; session children should be 0 or higher
    # (+10 where workload de-prioritization was applied), never negative.

Fix

  • For a service install on Linux, ensure the unit was deployed via current hive.ps1 deploy or hive daemon install so the Nice= and CPUWeight= directives are present. If editing an existing unit by hand, add the priority directives under [Service], then run sudo systemctl daemon-reload && sudo systemctl restart hived.
  • If a specific offending process (e.g. a wedged transcoder) is monopolising the CPU, cap or kill it rather than raising hived further - REALTIME_PRIORITY_CLASS / SCHED_RR are intentionally avoided because a hot loop in hived under real-time scheduling can lock up the entire host.

10. Terminal Cursor Leaves Blinking Copies While Claude or Codex Is Working

Symptoms

  • While a Claude Code or Codex PTY session is actively repainting, the cursor appears to blink in old positions around the screen
  • The text itself keeps updating, but stale cursor blocks briefly remain behind

Cause

  • Some Chromium-backed canvas/WebGL terminal renderer paths can leave cursor blink artifacts behind during rapid TUI repaint bursts.

Fix

  • Update to a build that suppresses cursor blinking while output is actively streaming, then restores it after the burst goes idle with a clean repaint.

Useful Commands

bash
# Show cluster topology
hive cluster-status

# Discover Hive nodes on Tailscale
hive discover

# Read local daemon status
hive daemon status

# Read local daemon config
hive daemon get-config

# Read local daemon token
hive daemon get-token

# Check daemon health endpoint
curl http://localhost:9178/health

# Run the daemon with verbose logs
RUST_LOG=hive_cluster=debug,hive_daemon=debug hived

Config Paths

See configuration.md for the full reference. Common paths:

  • Linux: ~/.config/hive/Hive/config.toml
  • macOS: ~/Library/Application Support/com.hive.Hive/config.toml
  • Windows: C:\Users\<user>\AppData\Roaming\hive\Hive\config\config.toml

Hive - remote AI coding agents over WebSocket.