
WebSocket protocol

Envelope format, HMAC, sequencing, and the worker connection lifecycle.

The WSS link between the control plane and each worker uses a tiny custom envelope format defined in internal/wsproto.

Why a custom envelope?

WebSocket itself is just framed bytes: there's no built-in concept of "this message is type X" or "this message hasn't been replayed". We add a tiny JSON envelope on top with five fields (type, ID, sequence, timestamp, payload) and an HMAC. Together, the HMAC, the sequence number, and the clock-skew bound mean a captured frame can't be tampered with, replayed, or held back and delivered out of order.

If "HMAC" or "sequence number" are new, see the Glossary.

Envelope

{
  "t":  "<message type>",
  "i":  "<random 16-byte hex>",
  "s":  "<uint64 sequence>",
  "ts": "<unix millis>",
  "p":  "<JSON payload>",
  "h":  "<hex HMAC-SHA256>"
}
  • The HMAC is computed over the canonical byte string type|id|seq|ts|payload.
  • The shared key is sha256(api_key) — derived independently on both sides. The raw api_key is configured on the worker, and the control plane stores only api_key_hash (lowercase hex of the same sha256). At hub upgrade time the server recovers the 32 raw bytes by hex-decoding the stored hash; both sides therefore arrive at identical key material without it ever crossing the wire.
  • Constants: MaxMessageBytes = 2 MiB, MaxClockSkew = 5 minutes.
  • Both sides keep a per-direction monotonic sequence counter and a sliding-window dedup set (capacity 1024, pruned at gap > 256).
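A minimal sketch of the signing scheme described above. The names Envelope, deriveKey, sign, and verify are illustrative, not the exact internal/wsproto identifiers; only the canonical string type|id|seq|ts|payload and the sha256(api_key) derivation come from the spec.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Envelope mirrors the five wire fields plus the HMAC; field names
// here are illustrative, not the real internal/wsproto structs.
type Envelope struct {
	Type, ID, Payload string
	Seq               uint64
	TS                int64 // unix millis
	HMAC              string
}

// deriveKey produces the shared key: sha256 of the raw API key.
// Both sides compute this independently, so the key never crosses the wire.
func deriveKey(apiKey string) []byte {
	sum := sha256.Sum256([]byte(apiKey))
	return sum[:]
}

// sign computes HMAC-SHA256 over the canonical string type|id|seq|ts|payload.
func sign(env Envelope, key []byte) string {
	mac := hmac.New(sha256.New, key)
	fmt.Fprintf(mac, "%s|%s|%d|%d|%s", env.Type, env.ID, env.Seq, env.TS, env.Payload)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the MAC and compares in constant time via hmac.Equal.
func verify(env Envelope, key []byte) bool {
	want, err := hex.DecodeString(env.HMAC)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, key)
	fmt.Fprintf(mac, "%s|%s|%d|%d|%s", env.Type, env.ID, env.Seq, env.TS, env.Payload)
	return hmac.Equal(mac.Sum(nil), want)
}

func main() {
	key := deriveKey("example-api-key")
	env := Envelope{Type: "heartbeat", ID: "a1b2", Seq: 7, TS: 1700000000000, Payload: `{}`}
	env.HMAC = sign(env, key)
	fmt.Println(verify(env, key)) // true: untampered envelope passes
}
```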

Validation order

Every received envelope is checked in this order:

  1. HMAC — constant-time compare against Verify(env, key).
  2. Clock — reject if |now - ts| > 5 minutes.
  3. Sequence — reject if seq was seen recently or is below the cutoff.

On failure, the worker side closes the connection; the server side sends a TypeError envelope and then disconnects.

Message types

Type         Direction         Payload                          Purpose
register     worker → server   RegisterPayload                  First HTTP register call (not WS)
registered   server → worker   RegisteredPayload                Confirms worker_id
heartbeat    worker → server   HeartbeatPayload                 Liveness + resource usage (every 30 s)
job          server → worker   JobPayload (with DeployPayload)  Dispatch a deploy/restart/start/stop/delete
job_status   worker → server   JobStatusPayload                 Mid-job progress (0–100)
job_result   worker → server   JobResultPayload                 Terminal job outcome
app_status   worker → server   AppStatusPayload                 App stopped / state change
violation    worker → server   ViolationPayload                 OOM / disk / crashloop notice
runtime_log  worker → server   RuntimeLogPayload (batch)        Container stdout/stderr lines
build_log    worker → server   BuildLogPayload                  Streaming build output
app_stats    worker → server   AppStatsPayload                  CPU/memory snapshots
error        server → worker   ErrorPayload                     Structured error report

Connection lifecycle (worker side)

Constants in internal/worker/transport.go:

  • Handshake timeout — 10 s
  • Write deadline — 10 s
  • Pong wait — 90 s
  • Ping period — 72 s

Reconnect semantics

Two design rules in connectAndServe() keep the wire state coherent across drops:

  • outSeq is monotonic across reconnects — never reset. Resetting would open a window where late stale envelopes from the previous connection get accepted with the same seq numbers as fresh ones, corrupting the server's view.
  • Stale send buffer is drained before a fresh connection starts streaming. Anything queued during the outage carries old timestamps / seq numbers that would now fail the 5-minute clock check anyway, so dropping it locally is cheaper than letting the server reject it.

Server-side acceptSeq uses the sliding-window dedup set, so it accepts whatever first seq the worker sends after a fresh upgrade — no resync handshake needed.
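A sketch of how such a sliding-window dedup set can behave. The type and method names are assumptions (the real implementation lives in internal/wsproto); the 1024 capacity and 256 gap are the documented constants.

```go
package main

import "fmt"

// seqWindow sketches the per-direction dedup set: it remembers recent
// sequence numbers, caps memory at capacity, and prunes everything more
// than gap below the highest seq seen.
type seqWindow struct {
	seen     map[uint64]struct{}
	high     uint64
	capacity int    // 1024 in the doc's scheme
	gap      uint64 // prune at gap > 256
}

func newSeqWindow() *seqWindow {
	return &seqWindow{seen: make(map[uint64]struct{}), capacity: 1024, gap: 256}
}

// accept returns true for a fresh sequence number. Because the window
// starts empty, the first seq after a reconnect is always fresh, which
// is why no resync handshake is needed.
func (w *seqWindow) accept(seq uint64) bool {
	if _, dup := w.seen[seq]; dup {
		return false // replayed
	}
	if w.high > w.gap && seq < w.high-w.gap {
		return false // below the cutoff: too old to track
	}
	w.seen[seq] = struct{}{}
	if seq > w.high {
		w.high = seq
	}
	// once over capacity, drop entries that fell below the cutoff
	if len(w.seen) > w.capacity {
		for s := range w.seen {
			if w.high > w.gap && s < w.high-w.gap {
				delete(w.seen, s)
			}
		}
	}
	return true
}

func main() {
	w := newSeqWindow()
	fmt.Println(w.accept(500)) // true: any first seq is accepted
	fmt.Println(w.accept(500)) // false: replay
	fmt.Println(w.accept(100)) // false: below the 256-gap cutoff
}
```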

Worker auth on reconnect

Worker WSS upgrade carries Authorization: Bearer <token> — header only, no query-string fallback (the ?api_key= form was removed as part of the 9c571e9 security pass). The token is whatever the worker's tokenFn returns at dial time:

  • Preferred: a HS256 JWT with claims {sub: workerID, key_hash: <api_key_hash>, iat, exp, jti}. The signing key is HKDF-SHA256 over cfg.APIServerSecret() with purpose wisehosting-worker-jwt-v1. JWTs are minted at registration and refreshed every ~13 minutes (TTL 15 min, refresh skew 2 min).
  • Fallback: the raw api_key if the JWT is missing or expired (e.g. control-plane downtime during refresh). The raw key always works — JWTs are an additional layer, not a replacement.

The hub resolves the credential via (*Hub) resolveAuthForUpgrade(token):

  1. If the token has three dot-separated segments, parse as a JWT and verify the HS256 signature against the HKDF-derived key. On success, fetch the worker row by claims.Sub and require worker.APIKeyHash == claims.KeyHash — this binds the JWT to the current key, so rotating the API key invalidates every outstanding JWT.
  2. Otherwise, treat the token as a raw API key, hash it (sha256(rawKey) → hex), and look up by api_key_hash via FindOnlineWorkerByAPIKey.

FindOnlineWorkerByAPIKey matches any worker by key regardless of status — a worker that just went offline from missed heartbeats is still allowed to reconnect, otherwise the scheduler's status flip would lock it out forever.

The dialer also pins NextProtos: ["http/1.1"]. Without this, a CDN that supports HTTP/2 may negotiate h2 via ALPN and silently break the WebSocket upgrade — gorilla/websocket only speaks HTTP/1.1.
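The pin itself is one line of TLS configuration; a sketch (the helper name is illustrative, and the config is what the dialer's TLSClientConfig would carry):

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// pinnedTLSConfig returns a TLS config whose ALPN list offers only
// http/1.1, so an h2-capable CDN cannot negotiate HTTP/2 and silently
// break the WebSocket upgrade.
func pinnedTLSConfig() *tls.Config {
	return &tls.Config{NextProtos: []string{"http/1.1"}}
}

func main() {
	fmt.Println(pinnedTLSConfig().NextProtos)
}
```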

On a successful upgrade the hub:

  1. Closes any existing connection for the same worker ID (replaced by new connection).
  2. Marks the row online if it was offline.
  3. Calls replayPendingJobs() immediately to re-send assigned-but-un-acked jobs.

Server-side rate limiting

internal/api/ws.go rate-limits log ingress per worker with a token bucket:

  • Sustained: 256 KiB/s
  • Burst: 1 MiB
  • Max line length: 8 KiB
  • Max lines per runtime_log batch: 5000

Lines beyond these limits are dropped silently to protect the log bus and the database.
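A self-contained sketch of a per-worker bucket with those numbers. The real limiter in internal/api/ws.go may be structured differently; only the 256 KiB/s rate, 1 MiB burst, and drop-don't-queue behaviour come from the doc.

```go
package main

import "fmt"

// bucket is a minimal token bucket metered in bytes.
type bucket struct {
	tokens float64 // bytes currently available
	rate   float64 // refill rate, bytes per second (256 KiB/s)
	burst  float64 // cap (1 MiB)
	last   float64 // time of last refill, in seconds
}

func newBucket() *bucket {
	return &bucket{tokens: 1 << 20, rate: 256 << 10, burst: 1 << 20}
}

// allow refills by elapsed time, then spends n bytes if available.
// Over-limit lines return false and are simply dropped, never queued.
func (b *bucket) allow(now float64, n int) bool {
	b.tokens += (now - b.last) * b.rate
	b.last = now
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	if float64(n) > b.tokens {
		return false
	}
	b.tokens -= float64(n)
	return true
}

func main() {
	b := newBucket()
	fmt.Println(b.allow(0, 1<<20))   // true: a full 1 MiB burst fits
	fmt.Println(b.allow(0, 1))       // false: bucket drained, no time passed
	fmt.Println(b.allow(1, 256<<10)) // true: one second refills 256 KiB
}
```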
