WebSocket protocol
Envelope format, HMAC, sequencing, and the worker connection lifecycle.
The WSS link between the control plane and each worker uses a tiny custom envelope format defined in internal/wsproto.
Why a custom envelope?
WebSocket itself is just framed bytes — there's no built-in concept of "this message is type X" or "this message hasn't been replayed". We add a tiny JSON envelope on top with five fields (type, ID, sequence, timestamp, payload) and an HMAC. The HMAC + sequence number + clock-skew bound mean a captured frame can't be tampered with, replayed, or held back and delivered out of order.
If "HMAC" or "sequence number" are new, see the Glossary.
Envelope
```json
{
  "t":  "<message type>",
  "i":  "<random 16-byte hex>",
  "s":  "<uint64 sequence>",
  "ts": "<unix millis>",
  "p":  "<JSON payload>",
  "h":  "<hex HMAC-SHA256>"
}
```

- The HMAC is computed over the canonical byte string `type|id|seq|ts|payload`.
- The shared key is `sha256(api_key)` — derived independently on both sides. The raw `api_key` is configured on the worker, and the control plane stores only `api_key_hash` (lowercase hex of the same sha256). At hub upgrade time the server recovers the 32 raw bytes by hex-decoding the stored hash; both sides therefore arrive at identical key material without it ever crossing the wire.
- Constants: `MaxMessageBytes = 2 MiB`, `MaxClockSkew = 5 minutes`.
- Both sides keep a per-direction monotonic sequence counter and a sliding-window dedup set (capacity 1024, pruned at gap > 256).
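As a concrete sketch of the scheme above, here is a minimal, self-contained Go version. The `Envelope` struct and the `DeriveKey`/`Sign`/`Verify` names are illustrative, not the actual `internal/wsproto` API; only the canonical string `type|id|seq|ts|payload` and the `sha256(api_key)` key derivation come from the spec:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// Envelope mirrors the wire keys above; illustrative, not the real type.
type Envelope struct {
	Type    string `json:"t"`
	ID      string `json:"i"`
	Seq     uint64 `json:"s"`
	TS      int64  `json:"ts"`
	Payload string `json:"p"`
	HMAC    string `json:"h"`
}

// DeriveKey mirrors both sides: key material is sha256(api_key).
func DeriveKey(apiKey string) []byte {
	sum := sha256.Sum256([]byte(apiKey))
	return sum[:]
}

// canonical builds the byte string the HMAC covers: type|id|seq|ts|payload.
func canonical(e Envelope) []byte {
	return []byte(e.Type + "|" + e.ID + "|" + strconv.FormatUint(e.Seq, 10) +
		"|" + strconv.FormatInt(e.TS, 10) + "|" + e.Payload)
}

// Sign fills in the hex HMAC-SHA256 field.
func Sign(e *Envelope, key []byte) {
	mac := hmac.New(sha256.New, key)
	mac.Write(canonical(*e))
	e.HMAC = hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the MAC and compares in constant time (hmac.Equal).
func Verify(e Envelope, key []byte) bool {
	mac := hmac.New(sha256.New, key)
	mac.Write(canonical(e))
	want := hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(want), []byte(e.HMAC))
}

func main() {
	key := DeriveKey("example-api-key")
	env := Envelope{Type: "heartbeat", ID: "00ffaa", Seq: 1, TS: 1700000000000, Payload: "{}"}
	Sign(&env, key)
	fmt.Println(Verify(env, key)) // true
}
```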
Validation order
Every received envelope is checked in this order:
- HMAC — constant-time compare against `Verify(env, key)`.
- Clock — reject if `|now - ts| > 5 minutes`.
- Sequence — reject if `seq` was seen recently or is below the cutoff.

Failures close the connection (worker side) or send a `TypeError` envelope (server side) and disconnect.
Message types
| Type | Direction | Payload | Purpose |
|---|---|---|---|
| `register` | worker → server | `RegisterPayload` | First HTTP register call (not WS) |
| `registered` | server → worker | `RegisteredPayload` | Confirms `worker_id` |
| `heartbeat` | worker → server | `HeartbeatPayload` | Liveness + resource usage (every 30s) |
| `job` | server → worker | `JobPayload` (with `DeployPayload`) | Dispatch a deploy/restart/start/stop/delete |
| `job_status` | worker → server | `JobStatusPayload` | Mid-job progress (0–100) |
| `job_result` | worker → server | `JobResultPayload` | Terminal job outcome |
| `app_status` | worker → server | `AppStatusPayload` | App stopped / state change |
| `violation` | worker → server | `ViolationPayload` | OOM / disk / crashloop notice |
| `runtime_log` | worker → server | `RuntimeLogPayload` (batch) | Container stdout/stderr lines |
| `build_log` | worker → server | `BuildLogPayload` | Streaming build output |
| `app_stats` | worker → server | `AppStatsPayload` | CPU/memory snapshots |
| `error` | server → worker | `ErrorPayload` | Structured error report |
Connection lifecycle (worker side)
Constants in `internal/worker/transport.go`:
- Handshake timeout — 10 s
- Write deadline — 10 s
- Pong wait — 90 s
- Ping period — 72 s
Reconnect semantics
Two design rules in `connectAndServe()` keep the wire state coherent across drops:

- `outSeq` is monotonic across reconnects — never reset. Resetting would open a window where late stale envelopes from the previous connection get accepted with the same seq numbers as fresh ones, corrupting the server's view.
- The stale send buffer is drained before a fresh connection starts streaming. Anything queued during the outage carries old timestamps / seq numbers that would now fail the 5-minute clock check anyway, so dropping it locally is cheaper than letting the server reject it.
Server-side `acceptSeq` uses the sliding-window dedup set, so it accepts whatever first seq the worker sends after a fresh upgrade — no resync handshake needed.
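A minimal sketch of the two rules, assuming a channel-based send buffer; `transport`, `nextSeq`, and `drainStale` are illustrative names, not the actual `connectAndServe()` internals:

```go
package main

import "fmt"

// transport owns state that must outlive any single connection.
type transport struct {
	outSeq  uint64      // monotonic across reconnects: never reset
	sendBuf chan []byte // frames queued for the current connection
}

// nextSeq hands out the next outbound sequence number. Because outSeq lives
// on the transport rather than the connection, a reconnect cannot reuse
// numbers that stale envelopes from the old connection already carried.
func (tr *transport) nextSeq() uint64 {
	tr.outSeq++
	return tr.outSeq
}

// drainStale empties anything queued while the link was down. Those frames
// carry old timestamps and seqs that would fail the 5-minute clock check
// anyway, so dropping them locally is cheaper than a server rejection.
func (tr *transport) drainStale() int {
	dropped := 0
	for {
		select {
		case <-tr.sendBuf:
			dropped++
		default:
			return dropped
		}
	}
}

func main() {
	tr := &transport{sendBuf: make(chan []byte, 8)}
	tr.sendBuf <- []byte("queued during outage")
	fmt.Println(tr.nextSeq(), tr.drainStale(), tr.nextSeq()) // 1 1 2
}
```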
Worker auth on reconnect
Worker WSS upgrade carries `Authorization: Bearer <token>` — header only, no query-string fallback (the `?api_key=` form was removed as part of the 9c571e9 security pass). The token is whatever the worker's `tokenFn` returns at dial time:

- Preferred: an HS256 JWT with claims `{sub: workerID, key_hash: <api_key_hash>, iat, exp, jti}`. The signing key is HKDF-SHA256 over `cfg.APIServerSecret()` with purpose `wisehosting-worker-jwt-v1`. JWTs are minted at registration and refreshed every ~13 minutes (TTL 15 min, refresh skew 2 min).
- Fallback: the raw `api_key` if the JWT is missing or expired (e.g. control-plane downtime during refresh). The raw key always works — JWTs are an additional layer, not a replacement.
The hub resolves the credential via `(*Hub) resolveAuthForUpgrade(token)`:

- If the token has three dot-separated segments, parse it as a JWT and verify the HS256 signature against the HKDF-derived key. On success, fetch the worker row by `claims.Sub` and require `worker.APIKeyHash == claims.KeyHash` — this binds the JWT to the current key, so rotating the API key invalidates every outstanding JWT.
- Otherwise, treat the token as a raw API key, hash it (`sha256(rawKey)` → hex), and look up by `api_key_hash` via `FindOnlineWorkerByAPIKey`.
`FindOnlineWorkerByAPIKey` matches any worker by key regardless of status — a worker that just went offline from missed heartbeats is still allowed to reconnect; otherwise the scheduler's status flip would lock it out forever.
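The dispatch shape can be sketched as follows; `isJWT` and `hashAPIKey` are illustrative helpers, but the three-segment test and the lowercase-hex sha256 form match the behaviour described above:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// isJWT applies the shape test: three dot-separated segments means "try the
// JWT path first"; anything else is treated as a raw API key.
func isJWT(token string) bool {
	return strings.Count(token, ".") == 2
}

// hashAPIKey produces the stored api_key_hash form: lowercase hex of
// sha256(raw key). This is what the raw-key path looks up.
func hashAPIKey(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(isJWT("eyJh.eyJz.c2ln")) // true: JWT path
	fmt.Println(isJWT("wk_rawapikey"))   // false: raw-key path
	fmt.Println(len(hashAPIKey("k")))    // 64 hex characters
}
```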
The dialer also pins `NextProtos: ["http/1.1"]`. Without this, a CDN that supports HTTP/2 may negotiate h2 via ALPN and silently break the WebSocket upgrade — gorilla/websocket only speaks HTTP/1.1.
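For illustration, the ALPN pin as a plain `crypto/tls` config; the real dialer passes something like this through gorilla/websocket's `Dialer`:

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// dialerTLSConfig pins ALPN to HTTP/1.1 so the TLS handshake never offers
// h2; gorilla/websocket only speaks HTTP/1.1, so an h2 negotiation would
// silently break the upgrade.
func dialerTLSConfig() *tls.Config {
	return &tls.Config{
		NextProtos: []string{"http/1.1"},
	}
}

func main() {
	fmt.Println(dialerTLSConfig().NextProtos)
}
```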
On a successful upgrade the hub:

- Closes any existing connection for the same worker ID ("replaced by new connection").
- Marks the row `online` if it was `offline`.
- Calls `replayPendingJobs()` immediately to re-send assigned-but-un-acked jobs.
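A sketch of the takeover step, assuming the hub keeps a per-worker map of close signals; all names here are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// hub tracks one live connection per worker via a close signal.
type hub struct {
	mu    sync.Mutex
	conns map[string]chan struct{}
}

// adopt installs a fresh connection for workerID, first closing any
// predecessor ("replaced by new connection"). Marking the row online and
// calling replayPendingJobs() would follow at this point.
func (h *hub) adopt(workerID string) chan struct{} {
	h.mu.Lock()
	defer h.mu.Unlock()
	if old, ok := h.conns[workerID]; ok {
		close(old) // the old connection's goroutines see this and exit
	}
	done := make(chan struct{})
	h.conns[workerID] = done
	return done
}

func main() {
	h := &hub{conns: map[string]chan struct{}{}}
	first := h.adopt("worker-1")
	_ = h.adopt("worker-1") // reconnect: first is now closed
	select {
	case <-first:
		fmt.Println("old connection replaced")
	default:
	}
}
```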
Server-side rate limiting
`internal/api/ws.go` token-buckets log ingress per worker:

- Sustained: 256 KiB/s
- Burst: 1 MiB
- Max line length: 8 KiB
- Max lines per `runtime_log` batch: 5000
Lines beyond these limits are dropped silently to protect the log bus and the database.