Relay Observability
Track message delivery, inspect traces, monitor metrics, and debug failures
Relay tracks every message from publish to final delivery. The tracing system records timing, budget consumption, and error details for each delivery attempt. Aggregate metrics give you a snapshot of bus health, and the dead letter queue captures messages that could not be delivered.
Message Tracing
Every message published through Relay gets a trace ID. Each delivery to an endpoint creates a span — a complete record of that delivery attempt: when it was sent, when it arrived, when processing completed, and whether it succeeded.
Trace Span Fields
Looking Up a Trace
curl http://localhost:4242/api/relay/messages/01HX.../trace
The response includes every span in the trace chain, ordered by sentAt ascending:
{
  "traceId": "01HXABC123",
  "spans": [
    {
      "id": "01HXDEF456",
      "messageId": "01HXABC123",
      "traceId": "01HXABC123",
      "subject": "relay.agent.backend",
      "status": "delivered",
      "sentAt": "2025-02-26T12:00:00.000Z",
      "deliveredAt": "2025-02-26T12:00:00.050Z",
      "processedAt": "2025-02-26T12:00:00.200Z",
      "errorMessage": null,
      "metadata": null
    }
  ]
}
For messages that fan out to multiple endpoints, the trace contains one span per endpoint.
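As a rough illustration of working with a fanned-out trace, the sketch below fetches a trace and counts its spans by status. It assumes the response shape shown in the example above; `fetch_trace`, `summarize_trace`, and the `base_url` default are hypothetical names chosen for this sketch, not part of Relay.

```python
import json
import urllib.request
from collections import Counter


def fetch_trace(message_id: str, base_url: str = "http://localhost:4242") -> dict:
    """Fetch the trace for one message from the REST endpoint shown above."""
    url = f"{base_url}/api/relay/messages/{message_id}/trace"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def summarize_trace(trace: dict) -> dict:
    """Count spans per status and list the spans that are not 'delivered'."""
    spans = trace.get("spans", [])
    by_status = Counter(span["status"] for span in spans)
    not_delivered = [span["id"] for span in spans if span["status"] != "delivered"]
    return {
        "spanCount": len(spans),
        "byStatus": dict(by_status),
        "notDelivered": not_delivered,
    }
```

With a fan-out to several endpoints, `notDelivered` gives the span IDs worth inspecting first.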
Trace Statuses
| Status | Meaning |
|---|---|
| sent | Message published, delivery in progress |
| delivered | Message delivered and processed successfully |
| failed | Subscription handler threw an error |
| timeout | Message rejected (budget exceeded, access denied, TTL expired, no matching endpoints, or circuit breaker open) |
A healthy message moves from sent to delivered. The time between sentAt and deliveredAt is delivery latency. The time between deliveredAt and processedAt is processing latency, which includes the time the adapter (e.g., Claude Code) takes to handle the message.
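The two latencies can be derived from a span's timestamps. This is a minimal sketch that assumes the ISO 8601 timestamps shown in the example response, and that deliveredAt or processedAt may be missing or null on spans that never completed (an assumption, not confirmed by the source). The function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' (UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def span_latencies_ms(span: dict) -> dict:
    """Return delivery and processing latency in whole milliseconds."""
    one_ms = timedelta(milliseconds=1)
    sent = parse_ts(span["sentAt"])
    delivered = parse_ts(span["deliveredAt"]) if span.get("deliveredAt") else None
    processed = parse_ts(span["processedAt"]) if span.get("processedAt") else None

    delivery: Optional[int] = None
    processing: Optional[int] = None
    if delivered is not None:
        delivery = (delivered - sent) // one_ms
        if processed is not None:
            processing = (processed - delivered) // one_ms
    return {"deliveryMs": delivery, "processingMs": processing}
```

For the example span above, this yields a delivery latency of 50 ms and a processing latency of 150 ms.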
Using MCP Tools
Agents can inspect traces without HTTP calls using the built-in MCP tools:
relay_get_trace    Get the full delivery trace for a message by ID
relay_get_metrics  Get aggregate delivery metrics for the bus
relay_get_trace accepts a messageId and returns the same trace data as the REST endpoint. Use this when an agent needs to verify a message was delivered before proceeding.
Delivery Metrics
Relay computes aggregate delivery metrics from the trace store using live SQL aggregates. These give you a summary of bus health without inspecting individual traces.
Fetching Metrics
curl http://localhost:4242/api/relay/trace/metrics
{
  "totalMessages": 1542,
  "deliveredCount": 1480,
  "failedCount": 12,
  "deadLetteredCount": 50,
  "avgDeliveryLatencyMs": 45.2,
  "p95DeliveryLatencyMs": null,
  "activeEndpoints": 8,
  "budgetRejections": {
    "hopLimit": 0,
    "ttlExpired": 0,
    "cycleDetected": 0,
    "budgetExhausted": 0
  }
}
Metrics Fields
Interpreting the Numbers
A healthy bus has a high delivered-to-total ratio and low dead letter counts.
- High deadLetteredCount relative to total: Check budget rejections. A spike in hopLimit rejections often indicates a message loop between two agents. TTL expirations may mean agents are too slow to process within the 1-hour window.
- Rising failedCount: Subscriber handlers are throwing errors. Check individual traces for the errorMessage field. The circuit breaker automatically stops delivering to endpoints with 5 consecutive failures.
- High avgDeliveryLatencyMs: A single slow endpoint can pull up the mean. Inspect traces filtered by endpoint to find the bottleneck.
Relay metrics reflect the full history of the current database, not a recent time window.
p95DeliveryLatencyMs currently returns null and budgetRejections counters return 0 — these
are tracked at the RelayCore level but not yet aggregated into trace store metrics.
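One way to turn the metrics response into a quick health check is sketched below. The ratio thresholds are arbitrary placeholders chosen for illustration, not Relay defaults, and `bus_health` is a hypothetical helper. Because p95DeliveryLatencyMs may be null, the sketch passes it through unchanged instead of comparing against it.

```python
def bus_health(
    metrics: dict,
    max_dead_letter_ratio: float = 0.05,  # placeholder threshold, not a Relay default
    max_failed_ratio: float = 0.01,       # placeholder threshold, not a Relay default
) -> dict:
    """Compute delivery ratios from the metrics response and flag outliers."""
    total = metrics["totalMessages"]
    if total == 0:
        return {"deliveredRatio": None, "deadLetterRatio": None,
                "failedRatio": None, "p95Ms": None, "warnings": ["no_messages"]}

    delivered_ratio = metrics["deliveredCount"] / total
    dead_letter_ratio = metrics["deadLetteredCount"] / total
    failed_ratio = metrics["failedCount"] / total

    warnings = []
    if dead_letter_ratio > max_dead_letter_ratio:
        warnings.append("high_dead_letter_ratio")
    if failed_ratio > max_failed_ratio:
        warnings.append("high_failed_ratio")

    return {
        "deliveredRatio": delivered_ratio,
        "deadLetterRatio": dead_letter_ratio,
        "failedRatio": failed_ratio,
        "p95Ms": metrics.get("p95DeliveryLatencyMs"),  # may be None
        "warnings": warnings,
    }
```

Since the metrics cover the full history of the database, a sudden problem can be diluted by older traffic; comparing two snapshots taken some time apart gives a better picture of recent behavior.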
The DorkOS client includes a Delivery Metrics Dashboard in the Relay panel. Access it from the Relay tab in the sidebar when Relay is enabled.
Debugging Failed Deliveries
Check the dead letter queue
Dead letters are messages that could not be delivered. Fetch them:
curl http://localhost:4242/api/relay/dead-letters
Each dead letter includes the original envelope with subject, payload, and budget. Common reasons:
- No matching endpoints: The subject has no registered endpoints. Verify with GET /api/relay/endpoints.
- Budget exceeded: The message's hop count, TTL, or call budget was exhausted.
- Access denied: The sender lacks permission to publish to the target subject. Check access-rules.json.
Inspect the message trace
If the message was delivered but the handler failed, look up the trace:
curl http://localhost:4242/api/relay/messages/{messageId}/trace
Check status and errorMessage on each span. A failed status with an error message tells you exactly what went wrong. For Claude Code adapter failures, the error typically includes the Agent SDK error message.
Check endpoint health
If deliveries are being rejected, the endpoint's circuit breaker may be open. The circuit breaker opens after 5 consecutive failures (configurable in ~/.dork/relay/config.json). After a 30-second cooldown, it allows a single probe through. If the probe succeeds, normal delivery resumes.
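The open, cooldown, and probe cycle described above can be modeled as a small state machine. This is an illustrative sketch only, not Relay's implementation: the class and method names are hypothetical, and it assumes that a failed probe reopens the breaker and restarts the cooldown, which the source does not specify.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures, then allows one probe after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed
        self.probe_in_flight = False

    def allow(self) -> bool:
        """Return True if a delivery may be attempted right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at < self.cooldown_seconds:
            return False  # still cooling down
        if self.probe_in_flight:
            return False  # only a single probe is let through
        self.probe_in_flight = True
        return True

    def record_success(self) -> None:
        """A success (including a successful probe) closes the breaker."""
        self.consecutive_failures = 0
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self) -> None:
        """Count a failure; open on threshold or on a failed probe (assumed)."""
        self.consecutive_failures += 1
        if self.probe_in_flight or self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()
            self.probe_in_flight = False
```

Injecting the clock keeps the sketch testable without waiting out a real 30-second cooldown.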
Check rate limits and backpressure
If a sender is publishing too quickly, messages are rejected by the rate limiter. The default allows 100 messages per 60-second window per sender. If an endpoint's mailbox has too many unprocessed messages, backpressure triggers at 80% capacity (warning) and 100% capacity (rejection).
Both settings are tunable in ~/.dork/relay/config.json and hot-reloaded without a restart.
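To make the two limits concrete, the sketch below models a per-sender limiter with the default 100 messages per 60 seconds, plus a backpressure check at 80% and 100% of mailbox capacity. The source does not say whether Relay uses a sliding or fixed window; a sliding window is assumed here, and all names are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most max_messages per sender within any window_seconds span."""

    def __init__(self, max_messages: int = 100, window_seconds: float = 60.0) -> None:
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._sent = defaultdict(deque)  # sender -> timestamps of accepted messages

    def allow(self, sender: str, now: float) -> bool:
        timestamps = self._sent[sender]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_messages:
            return False
        timestamps.append(now)
        return True


def backpressure_state(pending: int, capacity: int, warn_percent: int = 80) -> str:
    """Classify mailbox fill level as 'ok', 'warn' (>= 80%), or 'reject' (>= 100%)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if pending >= capacity:
        return "reject"
    if pending * 100 >= capacity * warn_percent:
        return "warn"
    return "ok"
```

The integer comparison in `backpressure_state` avoids floating-point rounding at the exact 80% boundary.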
SSE Stream for Real-Time Monitoring
Subscribe to the Relay SSE stream for live visibility:
curl -N "http://localhost:4242/api/relay/stream?subject=%3E"
The subject query parameter filters which messages appear. Use > (URL-encoded as %3E) to see all messages, or provide a pattern like relay.agent.* to filter by audience.
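The sketch below builds the stream URL for a pattern and shows how such patterns might match subjects. The matching rules are an assumption: they follow NATS-style wildcards, where * matches exactly one dot-separated token and a trailing > matches one or more remaining tokens. Relay's actual semantics may differ, and both function names are hypothetical.

```python
from urllib.parse import quote


def stream_url(pattern: str, base_url: str = "http://localhost:4242") -> str:
    """Build the SSE stream URL with the subject pattern percent-encoded."""
    return f"{base_url}/api/relay/stream?subject={quote(pattern, safe='')}"


def subject_matches(pattern: str, subject: str) -> bool:
    """Match a subject against a pattern using assumed NATS-style wildcards."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for i, token in enumerate(pattern_tokens):
        if token == ">":
            # '>' must be the last pattern token and cover at least one subject token.
            return i == len(pattern_tokens) - 1 and len(subject_tokens) > i
        if i >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[i]:
            return False
    return len(pattern_tokens) == len(subject_tokens)
```

For example, `stream_url(">")` produces the same URL as the curl command above.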
The stream emits four event types:
| Event | Description |
|---|---|
| relay_connected | Initial connection confirmation with the filter pattern |
| relay_message | A message envelope matching the subscription pattern |
| relay_backpressure | A backpressure signal from an endpoint approaching or exceeding mailbox capacity |
| relay_signal | Other signals (dead letters, typing indicators, delivery receipts) |
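A client consuming the stream needs to split it into events by type. The sketch below is a minimal parser for the standard SSE wire format (event: and data: fields separated by blank lines). It does not assume anything about the payload inside each event, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_type, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def count_event_types(lines: Iterable[str]) -> dict:
    """Tally how many events of each type appear in the stream."""
    return dict(Counter(event for event, _ in parse_sse(lines)))
```

Feeding this the decoded lines of a streaming HTTP response gives a running view of how many relay_message, relay_backpressure, and relay_signal events arrive.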
The SSE stream is for debugging and monitoring. Long-lived SSE connections consume server resources — close them when no longer needed. For production monitoring, use the REST API or the UI dashboard.