WebSocket protocol
Envelope format, HMAC, sequencing, and the worker connection lifecycle.
The WSS link between the control plane and each worker uses a tiny custom envelope format defined in internal/wsproto.
Why a custom envelope?
WebSocket itself is just framed bytes — there's no built-in concept of "this message is type X" or "this message hasn't been replayed". We add a tiny JSON envelope on top with five fields (type, ID, sequence, timestamp, payload) and an HMAC. The HMAC + sequence number + clock-skew bound mean a captured frame can't be tampered with, replayed, or held back and delivered out of order.
If "HMAC" or "sequence number" are new, see the Glossary.
Envelope
```json
{
  "t":  "<message type>",
  "i":  "<random 16-byte hex>",
  "s":  "<uint64 sequence>",
  "ts": "<unix millis>",
  "p":  "<JSON payload>",
  "h":  "<hex HMAC-SHA256>"
}
```

- The HMAC is computed over the canonical byte string `type|id|seq|ts|payload`.
- The shared key is `sha256(api_key)` — derived independently on both sides. The raw `api_key` is configured on the worker, and the control plane stores only `api_key_hash` (lowercase hex of the same sha256). At hub upgrade time the server recovers the 32 raw bytes by hex-decoding the stored hash; both sides therefore arrive at identical key material without it ever crossing the wire.
- Constants: `MaxMessageBytes = 2 MiB`, `MaxClockSkew = 5 minutes`.
- Both sides keep a per-direction monotonic sequence counter and a sliding-window dedup set (capacity 1024, pruned at gap > 256).
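As a concrete sketch of the scheme above, here is a minimal, self-contained Go version. The `Envelope` struct and the `DeriveKey`/`Sign`/`Verify` names are illustrative, not the actual `internal/wsproto` API; only the canonical string `type|id|seq|ts|payload` and the `sha256(api_key)` key derivation come from the spec:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// Envelope mirrors the wire keys above; illustrative, not the real type.
type Envelope struct {
	Type    string `json:"t"`
	ID      string `json:"i"`
	Seq     uint64 `json:"s"`
	TS      int64  `json:"ts"`
	Payload string `json:"p"`
	HMAC    string `json:"h"`
}

// DeriveKey mirrors both sides: key material is sha256(api_key).
func DeriveKey(apiKey string) []byte {
	sum := sha256.Sum256([]byte(apiKey))
	return sum[:]
}

// canonical builds the byte string the HMAC covers: type|id|seq|ts|payload.
func canonical(e Envelope) []byte {
	return []byte(e.Type + "|" + e.ID + "|" + strconv.FormatUint(e.Seq, 10) +
		"|" + strconv.FormatInt(e.TS, 10) + "|" + e.Payload)
}

// Sign fills in the hex HMAC-SHA256 field.
func Sign(e *Envelope, key []byte) {
	mac := hmac.New(sha256.New, key)
	mac.Write(canonical(*e))
	e.HMAC = hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the MAC and compares in constant time (hmac.Equal).
func Verify(e Envelope, key []byte) bool {
	mac := hmac.New(sha256.New, key)
	mac.Write(canonical(e))
	want := hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(want), []byte(e.HMAC))
}

func main() {
	key := DeriveKey("example-api-key")
	env := Envelope{Type: "heartbeat", ID: "00ffaa", Seq: 1, TS: 1700000000000, Payload: "{}"}
	Sign(&env, key)
	fmt.Println(Verify(env, key)) // true
}
```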
Validation order
Every received envelope is checked in this order:
- HMAC — constant-time compare against `Verify(env, key)`.
- Clock — reject if `|now - ts| > 5 minutes`.
- Sequence — reject if `seq` was seen recently or is below the cutoff.

Failures close the connection (worker side) or send a `TypeError` envelope (server side) and disconnect.
Message types
| Type | Direction | Payload | Purpose |
|---|---|---|---|
| `register` | worker → server | `RegisterPayload` | First HTTP register call (not WS) |
| `registered` | server → worker | `RegisteredPayload` | Confirms `worker_id` |
| `heartbeat` | worker → server | `HeartbeatPayload` | Liveness + resource usage (every 30s) |
| `job` | server → worker | `JobPayload` (with `DeployPayload`) | Dispatch a deploy/restart/start/stop/delete |
| `job_status` | worker → server | `JobStatusPayload` | Mid-job progress (0–100) |
| `job_result` | worker → server | `JobResultPayload` | Terminal job outcome |
| `app_status` | worker → server | `AppStatusPayload` | App stopped / state change |
| `violation` | worker → server | `ViolationPayload` | OOM / disk / crashloop notice |
| `runtime_log` | worker → server | `RuntimeLogPayload` (batch) | Container stdout/stderr lines |
| `build_log` | worker → server | `BuildLogPayload` | Streaming build output |
| `app_stats` | worker → server | `AppStatsPayload` | CPU/memory snapshots |
| `error` | server → worker | `ErrorPayload` | Structured error report |
Connection lifecycle (worker side)
Constants in `internal/worker/transport.go`:
- Handshake timeout — 10 s
- Write deadline — 10 s
- Pong wait — 90 s
- Ping period — 72 s
Reconnect semantics
Two design rules in `connectAndServe()` keep the wire state coherent across drops:

- `outSeq` is monotonic across reconnects — never reset. Resetting would open a window where late stale envelopes from the previous connection get accepted with the same seq numbers as fresh ones, corrupting the server's view.
- The stale send buffer is drained before a fresh connection starts streaming. Anything queued during the outage carries old timestamps / seq numbers that would now fail the 5-minute clock check anyway, so dropping it locally is cheaper than letting the server reject it.
Server-side `acceptSeq` uses the sliding-window dedup set, so it accepts whatever first seq the worker sends after a fresh upgrade — no resync handshake needed.
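A minimal sketch of the two rules, assuming a channel-based send buffer; `transport`, `nextSeq`, and `drainStale` are illustrative names, not the actual `connectAndServe()` internals:

```go
package main

import "fmt"

// transport owns state that must outlive any single connection.
type transport struct {
	outSeq  uint64      // monotonic across reconnects: never reset
	sendBuf chan []byte // frames queued for the current connection
}

// nextSeq hands out the next outbound sequence number. Because outSeq lives
// on the transport rather than the connection, a reconnect cannot reuse
// numbers that stale envelopes from the old connection already carried.
func (tr *transport) nextSeq() uint64 {
	tr.outSeq++
	return tr.outSeq
}

// drainStale empties anything queued while the link was down. Those frames
// carry old timestamps and seqs that would fail the 5-minute clock check
// anyway, so dropping them locally is cheaper than a server rejection.
func (tr *transport) drainStale() int {
	dropped := 0
	for {
		select {
		case <-tr.sendBuf:
			dropped++
		default:
			return dropped
		}
	}
}

func main() {
	tr := &transport{sendBuf: make(chan []byte, 8)}
	tr.sendBuf <- []byte("queued during outage")
	fmt.Println(tr.nextSeq(), tr.drainStale(), tr.nextSeq()) // 1 1 2
}
```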
Worker auth on reconnect
Worker WSS upgrade carries `Authorization: Bearer <token>` — header only, no query-string fallback (the `?api_key=` form was removed as part of the 9c571e9 security pass). The token is whatever the worker's `tokenFn` returns at dial time:

- Preferred: an HS256 JWT with claims `{sub: workerID, key_hash: <api_key_hash>, iat, exp, jti}`. The signing key is HKDF-SHA256 over `cfg.APIServerSecret()` with purpose `wisehosting-worker-jwt-v1`. JWTs are minted at registration and refreshed every ~13 minutes (TTL 15 min, refresh skew 2 min).
- Fallback: the raw `api_key` if the JWT is missing or expired (e.g. control-plane downtime during refresh). The raw key always works — JWTs are an additional layer, not a replacement.
The hub resolves the credential via `(*Hub) resolveAuthForUpgrade(token)`:

- If the token has three dot-separated segments, parse it as a JWT and verify the HS256 signature against the HKDF-derived key. On success, fetch the worker row by `claims.Sub` and require `worker.APIKeyHash == claims.KeyHash` — this binds the JWT to the current key, so rotating the API key invalidates every outstanding JWT.
- Otherwise, treat the token as a raw API key, hash it (`sha256(rawKey)` → hex), and look up by `api_key_hash` via `FindOnlineWorkerByAPIKey`.
`FindOnlineWorkerByAPIKey` matches any worker by key regardless of status — a worker that just went offline from missed heartbeats is still allowed to reconnect; otherwise the scheduler's status flip would lock it out forever.
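The dispatch shape can be sketched as follows; `isJWT` and `hashAPIKey` are illustrative helpers, but the three-segment test and the lowercase-hex sha256 form match the behaviour described above:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// isJWT applies the shape test: three dot-separated segments means "try the
// JWT path first"; anything else is treated as a raw API key.
func isJWT(token string) bool {
	return strings.Count(token, ".") == 2
}

// hashAPIKey produces the stored api_key_hash form: lowercase hex of
// sha256(raw key). This is what the raw-key path looks up.
func hashAPIKey(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(isJWT("eyJh.eyJz.c2ln")) // true: JWT path
	fmt.Println(isJWT("wk_rawapikey"))   // false: raw-key path
	fmt.Println(len(hashAPIKey("k")))    // 64 hex characters
}
```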
The dialer also pins `NextProtos: ["http/1.1"]`. Without this, a CDN that supports HTTP/2 may negotiate h2 via ALPN and silently break the WebSocket upgrade — gorilla/websocket only speaks HTTP/1.1.
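For illustration, the ALPN pin as a plain `crypto/tls` config; the real dialer passes something like this through gorilla/websocket's `Dialer`:

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// dialerTLSConfig pins ALPN to HTTP/1.1 so the TLS handshake never offers
// h2; gorilla/websocket only speaks HTTP/1.1, so an h2 negotiation would
// silently break the upgrade.
func dialerTLSConfig() *tls.Config {
	return &tls.Config{
		NextProtos: []string{"http/1.1"},
	}
}

func main() {
	fmt.Println(dialerTLSConfig().NextProtos)
}
```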
On a successful upgrade the hub:

- Closes any existing connection for the same worker ID ("replaced by new connection").
- Marks the row `online` if it was `offline`.
- Calls `replayPendingJobs()` immediately to re-send assigned-but-un-acked jobs.
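A sketch of the takeover step, assuming the hub keeps a per-worker map of close signals; all names here are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// hub tracks one live connection per worker via a close signal.
type hub struct {
	mu    sync.Mutex
	conns map[string]chan struct{}
}

// adopt installs a fresh connection for workerID, first closing any
// predecessor ("replaced by new connection"). Marking the row online and
// calling replayPendingJobs() would follow at this point.
func (h *hub) adopt(workerID string) chan struct{} {
	h.mu.Lock()
	defer h.mu.Unlock()
	if old, ok := h.conns[workerID]; ok {
		close(old) // the old connection's goroutines see this and exit
	}
	done := make(chan struct{})
	h.conns[workerID] = done
	return done
}

func main() {
	h := &hub{conns: map[string]chan struct{}{}}
	first := h.adopt("worker-1")
	_ = h.adopt("worker-1") // reconnect: first is now closed
	select {
	case <-first:
		fmt.Println("old connection replaced")
	default:
	}
}
```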
Server-side rate limiting
`internal/api/ws.go` token-buckets log ingress per worker:

- Sustained: 256 KiB/s
- Burst: 1 MiB
- Max line length: 8 KiB
- Max lines per `runtime_log` batch: 5000
Lines beyond these limits are dropped silently to protect the log bus and the database.