
ctx Hub: Failure modes¶
What can go wrong, what the system does about it, and what you
should do. Complementary to
ctx Hub Operations.
Design posture¶
The hub is best-effort knowledge sharing, not a durable
ledger. Local .context/ files are the source of truth for
each project; the hub is a fan-out channel. This framing
informs every failure-mode decision below.
Network¶
Client loses connection mid-stream¶
What happens: ctx connection listen detects the EOF, waits
with exponential backoff, and reconnects. On reconnect it passes
its last-seen sequence; the hub replays everything newer.
What you should do: nothing. If reconnects are looping, check
firewall state on the hub and ctx hub status output.
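The reconnect-and-replay behavior can be sketched as follows. `stream` is a hypothetical stand-in for the hub connection, not part of the ctx API: it yields `(seq, entry)` pairs newer than `after` and raises `ConnectionError` when the connection drops mid-stream.

```python
import time

def listen_with_backoff(stream, last_seen=0, base=1.0, cap=60.0,
                        max_retries=5, sleep=time.sleep):
    """Resilient listen loop (sketch). Tracks the last sequence we
    acknowledged so a reconnect resumes exactly where we left off,
    and backs off exponentially between reconnect attempts."""
    received, delay, retries = [], base, 0
    while True:
        try:
            for seq, entry in stream(after=last_seen):
                last_seen = seq          # resume point for the next reconnect
                delay = base             # a healthy stream resets the backoff
                received.append(entry)
            return received, last_seen   # stream closed cleanly
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise                    # reconnects are looping: give up
            sleep(delay)                 # wait, then reconnect
            delay = min(delay * 2, cap)  # exponential backoff, capped
```

Because the client passes `after=last_seen`, the hub replays only entries newer than the last one acknowledged, matching the behavior described above.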
Partition — majority side reachable¶
What happens: clients routed to the majority side continue to publish and listen. The minority nodes step down to followers that cannot accept writes (Raft quorum lost).
What you should do: let it heal. When the partition closes, followers catch up via sequence-based sync automatically.
Partition — split brain (no quorum)¶
What happens: no node holds a majority, so no leader is
elected. All nodes become read-only. ctx connection publish and
ctx add --share fail with a "no leader" error; local writes
still succeed.
What you should do: fix the network. If the partition is
permanent (e.g., a data center is gone), bootstrap a new cluster
from the survivors with ctx hub peer remove for the dead nodes.
Hub unreachable during ctx add --share¶
What happens: the local write succeeds; the share step prints
a warning, and the command exits non-zero only because the share
failed. --share is best-effort; it never blocks local context
updates.
What you should do: run ctx connection publish later to
backfill, or simply share the same entry again. The hub
deduplicates by entry ID, so repeats are harmless.
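The dedup-by-ID behavior is what makes retrying a share safe. A minimal sketch of that fan-in logic, assuming only that each entry is a dict with an `"id"` field:

```python
def merge_shared(feed_entries, seen_ids=None):
    """Idempotent merge (sketch): keep one copy per entry ID, so
    re-publishing after a failed --share never creates duplicates."""
    seen_ids = set() if seen_ids is None else seen_ids
    kept = []
    for entry in feed_entries:
        if entry["id"] in seen_ids:
            continue                 # duplicate publish: drop silently
        seen_ids.add(entry["id"])
        kept.append(entry)
    return kept
```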
Storage¶
Disk full on the leader¶
What happens: entries.jsonl append fails. The hub rejects
writes with an error and stays up for read traffic. Clients
retry; followers keep their in-sync status using whatever the
leader already wrote.
What you should do: free disk or grow the volume, then nothing else — the hub resumes accepting writes on the next append attempt.
Corrupt entries.jsonl¶
What happens: if the last line is a partial JSON write from a crash, the hub truncates it on startup and logs a warning. If any earlier line is malformed, the hub refuses to start.
What you should do: inspect with
jq -c . <data-dir>/entries.jsonl > /dev/null to find the bad
line. Move the bad region to a .quarantine file, then start.
Nothing is ever silently dropped.
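The quarantine pass can be scripted. This sketch assumes nothing about the entry schema beyond one JSON value per line; good lines go back into entries.jsonl, bad lines into the .quarantine file with their line numbers for forensics:

```python
import json

def quarantine_bad_lines(lines):
    """Split a JSONL log into parseable and malformed lines (sketch).
    Returns (good, bad): good lines verbatim, bad lines paired with
    their 1-based line numbers so nothing is silently dropped."""
    good, bad = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue                       # ignore blank lines
        try:
            json.loads(line)
            good.append(line)
        except json.JSONDecodeError:
            bad.append((lineno, line))     # keep the location for forensics
    return good, bad
```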
meta.json / entries.jsonl sequence mismatch¶
What happens: the hub refuses to start. This usually means someone copied one file without the other.
What you should do: restore both files from the same backup,
or accept the higher sequence by regenerating meta.json from
entries.jsonl (manual for now — file a bug).
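If you attempt the manual regeneration, its core is a scan for the highest sequence in the log. This sketch assumes a `seq` field on each entries.jsonl line and a trivial `{"seq": N}` shape for meta.json; both are assumptions about the on-disk format, not documented behavior:

```python
import json

def rebuild_meta(entries_lines):
    """Recover the sequence counter from the log (sketch): the
    regenerated meta.json must carry the highest sequence present
    in entries.jsonl so the hub accepts the pair as consistent."""
    highest = 0
    for line in entries_lines:
        if line.strip():
            highest = max(highest, json.loads(line)["seq"])
    return {"seq": highest}
```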
Cluster¶
Leader loss, clean shutdown¶
What happens: ctx hub stop triggers stepdown first, so
a new leader is elected before the old one exits. In-flight
writes drain. Clients reconnect to the new leader transparently.
Leader crash, hard fail (kill -9, power loss)¶
What happens: Raft detects the missing heartbeat and elects a new leader within a few seconds. Writes the old leader accepted but had not yet replicated can be lost — see the Raft-lite warning in the cluster recipe.
What you should do: if you need stronger durability, run
ctx connection listen on a dedicated "collector" project that
persists entries locally as a write-ahead backup.
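A collector along those lines only needs to persist each entry before doing anything else with it. A minimal sketch, assuming entries arrive as JSON-serializable dicts and `wal` is a text file opened in append mode:

```python
import json

def append_wal(entry, wal):
    """Write-ahead backup for a collector project (sketch): persist
    the entry the moment it arrives, before any other processing,
    so a leader hard-fail cannot lose it on our side."""
    wal.write(json.dumps(entry, sort_keys=True) + "\n")
    wal.flush()   # push to the OS buffer; use os.fsync(wal.fileno())
                  # on a real file for crash durability
```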
Split-brain after rejoin¶
What happens: Raft reconciles: the minority side's uncommitted writes are discarded, and the majority's log is authoritative.
What you should do: nothing automatic. If you know the
minority had important writes, grep for them in
<data-dir>/entries.jsonl.rejected (written by the reconciliation
pass) and replay them with ctx connection publish.
Auth and tokens¶
Lost admin token¶
What happens: you cannot register new projects.
What you should do: retrieve it from
<data-dir>/admin.token. If that file is also gone, stop the hub
and regenerate — note that all existing client tokens keep
working; only new registrations need the admin token.
Compromised admin token¶
What happens: anyone with the token can register new projects and publish. They cannot read existing entries without a client token for a project that subscribes.
What you should do: rotate the admin token
(regenerate <data-dir>/admin.token and restart), revoke
suspicious client registrations via clients.json, and audit
entries.jsonl for unexpected origins.
Compromised client token¶
What happens: the attacker can publish as that project and
read anything that project is subscribed to. Because Origin
is self-asserted on publish, the attacker can also publish
entries tagged with any other project's name, so
attribution in entries.jsonl cannot be trusted after a
token compromise.
What you should do: remove the client's entry from
clients.json, restart the hub, and re-register the legitimate
project with a fresh token. Audit entries.jsonl for entries
published after the compromise timestamp and quarantine any
that look suspicious — remember that Origin on those entries
proves nothing.
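The audit step can be scripted. This sketch assumes each entries.jsonl line carries an ISO-8601 `ts` field and a self-asserted `origin` field (both assumptions about the schema), and it treats the origin as untrustworthy metadata, not proof:

```python
import json
from datetime import datetime

def entries_after(lines, compromise_time):
    """Pull every entry published at or after the compromise (sketch).
    The origin on each result is self-asserted and proves nothing."""
    cutoff = datetime.fromisoformat(compromise_time)
    suspect = []
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        if datetime.fromisoformat(entry["ts"]) >= cutoff:
            suspect.append(entry)
    return suspect
```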
Compromised hub host¶
What happens: <data-dir>/clients.json stores client
tokens verbatim (not hashed). Anyone with read access to
that file has every client token in hand and can impersonate
any registered project until each one is rotated.
What you should do: treat it as a total hub compromise.
Stop the hub, wipe <data-dir> (keep a forensic copy first),
regenerate the admin token, and have every client re-register.
See Security model
for the mitigations that reduce the blast radius while the
hashing follow-up is pending.
Clock skew¶
Hub entries carry a timestamp assigned by the publishing client. The hub does not rewrite timestamps. Clients with significant clock skew will publish entries that look out of order in the shared feed.
What you should do: run NTP on all client machines. If you see entries dated in the future or far past, the publisher's clock is the culprit.
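A skew check along those lines can run over the shared feed. The field names (`ts`, `origin`) and thresholds here are illustrative assumptions, not ctx behavior:

```python
from datetime import datetime, timedelta

def flag_skewed(entries, now, tolerance=timedelta(minutes=5),
                max_age=timedelta(days=30)):
    """Flag entries whose client-assigned timestamp is implausibly
    far from the reference clock (sketch). Future-dated entries are
    the classic symptom of a publisher without NTP."""
    flagged = []
    for entry in entries:
        ts = datetime.fromisoformat(entry["ts"])
        if ts > now + tolerance:
            flagged.append((entry["origin"], "future-dated"))
        elif ts < now - max_age:
            flagged.append((entry["origin"], "implausibly old"))
    return flagged
```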
The short list¶
| Symptom | First thing to check |
|---|---|
| Client can't reach hub | Firewall, then ctx hub status |
| "No leader" errors | Cluster quorum — run ctx hub status on each peer |
| Hub won't start after crash | Last line of entries.jsonl |
| Entries missing after restore | Check meta.json sequence vs local .sync-state.json |
| Duplicate entries in shared feed | Client replayed after restore — safe, dedup by ID |
| Followers lagging | Disk or network on the follower, not the leader |