Relay Observability
Track message delivery, inspect traces, monitor metrics, and debug failures in the Relay message bus
Relay tracks every message from publish to final delivery. The tracing system records timing information, budget consumption, and error details for each delivery attempt. Aggregate metrics provide a bird's-eye view of your message bus health, and the dead letter queue captures messages that could not be delivered.
This guide covers how to use traces, metrics, and debugging tools to understand what is happening inside Relay.
Message Tracing
Every message published through Relay is assigned a trace ID, and each delivery to an endpoint creates a span. A span records the full lifecycle of a single delivery attempt: when it was sent, when it arrived at the endpoint, when processing completed, and whether it succeeded or failed.
Trace Span Fields
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique ID of this span |
| `messageId` | string | ID of the message being delivered |
| `traceId` | string | ID shared by all spans in the trace |
| `subject` | string | Subject the message was delivered to |
| `status` | string | One of `sent`, `delivered`, `failed`, `timeout` |
| `sentAt` | string | ISO timestamp when the message was published |
| `deliveredAt` | string \| null | ISO timestamp when the message arrived at the endpoint |
| `processedAt` | string \| null | ISO timestamp when processing completed |
| `errorMessage` | string \| null | Error details for a failed delivery |
| `metadata` | object \| null | Additional span metadata |
Looking Up a Trace
To inspect the full delivery trace for a message, use the trace endpoint with the message ID:
```bash
curl http://localhost:4242/api/relay/messages/01HX.../trace
```

The response includes every span in the trace chain, ordered by `sentAt` ascending:
```json
{
  "traceId": "01HXABC123",
  "spans": [
    {
      "id": "01HXDEF456",
      "messageId": "01HXABC123",
      "traceId": "01HXABC123",
      "subject": "relay.agent.backend",
      "status": "delivered",
      "sentAt": "2025-02-26T12:00:00.000Z",
      "deliveredAt": "2025-02-26T12:00:00.050Z",
      "processedAt": "2025-02-26T12:00:00.200Z",
      "errorMessage": null,
      "metadata": null
    }
  ]
}
```

For messages that fan out to multiple endpoints, the trace contains one span per endpoint.
Trace Statuses
Each span transitions through a lifecycle of statuses:
| Status | Meaning |
|---|---|
| `sent` | Message published, delivery in progress |
| `delivered` | Message delivered and processed successfully |
| `failed` | Subscription handler threw an error |
| `timeout` | Message rejected (budget exceeded, access denied, TTL expired, no matching endpoints, or circuit breaker open) |
A healthy message moves from `sent` to `delivered`. The time between `sentAt` and `deliveredAt` is the delivery latency. The time between `deliveredAt` and `processedAt` is the processing latency, which includes the time the adapter (e.g., Claude Code) takes to handle the message.
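The latency arithmetic can be sketched directly from a span's timestamps. The field names below come from the trace example above; the helper functions themselves are illustrative, not part of Relay's API:

```typescript
interface TraceSpan {
  sentAt: string;             // ISO timestamp: message published
  deliveredAt: string | null; // ISO timestamp: arrived at the endpoint
  processedAt: string | null; // ISO timestamp: handler finished
}

// Delivery latency: time between publish and arrival at the endpoint.
function deliveryLatencyMs(span: TraceSpan): number | null {
  if (!span.deliveredAt) return null;
  return Date.parse(span.deliveredAt) - Date.parse(span.sentAt);
}

// Processing latency: time the adapter spent handling the message.
function processingLatencyMs(span: TraceSpan): number | null {
  if (!span.deliveredAt || !span.processedAt) return null;
  return Date.parse(span.processedAt) - Date.parse(span.deliveredAt);
}

// The example span above yields 50 ms delivery and 150 ms processing latency.
const exampleSpan: TraceSpan = {
  sentAt: "2025-02-26T12:00:00.000Z",
  deliveredAt: "2025-02-26T12:00:00.050Z",
  processedAt: "2025-02-26T12:00:00.200Z",
};
```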
Using MCP Tools
Agents running inside DorkOS can inspect traces using the built-in MCP tools without making HTTP calls:
- `relay_get_trace` -- Get the full delivery trace for a message by ID
- `relay_get_metrics` -- Get aggregate delivery metrics for the bus

The `relay_get_trace` tool accepts a `messageId` parameter and returns the same trace data as the REST endpoint. This is useful for agents that need to verify whether a message they sent was successfully delivered before proceeding with the next step.
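As an illustrative sketch of that verify-before-proceeding pattern, the polling helper below takes any trace-fetching function as a parameter (standing in for a `relay_get_trace` call or the REST endpoint); the helper itself is not part of Relay:

```typescript
type TraceStatus = "sent" | "delivered" | "failed" | "timeout";

interface Trace {
  spans: { status: TraceStatus }[];
}

// Poll a trace until every span has left the "sent" state, or give up.
// `getTrace` is a placeholder for whatever fetches the trace.
async function waitForDelivery(
  getTrace: (messageId: string) => Promise<Trace>,
  messageId: string,
  { attempts = 10, intervalMs = 500 } = {}
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    const { spans } = await getTrace(messageId);
    if (spans.length > 0 && spans.every((s) => s.status !== "sent")) {
      // Settled: success only if no span failed or timed out.
      return spans.every((s) => s.status === "delivered");
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // still in flight after all attempts
}
```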
Delivery Metrics Dashboard
Relay computes aggregate delivery metrics from the trace store using live SQL aggregates. These metrics provide a summary of bus health without requiring you to inspect individual traces.
Fetching Metrics
Retrieve the current metrics snapshot from the trace metrics endpoint:
```bash
curl http://localhost:4242/api/relay/trace/metrics
```

```json
{
  "totalMessages": 1542,
  "deliveredCount": 1480,
  "failedCount": 12,
  "deadLetteredCount": 50,
  "avgDeliveryLatencyMs": 45.2,
  "p95DeliveryLatencyMs": null,
  "activeEndpoints": 8,
  "budgetRejections": {
    "hopLimit": 0,
    "ttlExpired": 0,
    "cycleDetected": 0,
    "budgetExhausted": 0
  }
}
```

Metrics Fields
| Field | Type | Description |
|---|---|---|
| `totalMessages` | number | Total messages recorded in the trace store |
| `deliveredCount` | number | Messages delivered and processed successfully |
| `failedCount` | number | Deliveries where the subscriber handler threw an error |
| `deadLetteredCount` | number | Messages that could not be delivered to any endpoint |
| `avgDeliveryLatencyMs` | number | Mean time between `sentAt` and `deliveredAt` |
| `p95DeliveryLatencyMs` | number \| null | 95th-percentile delivery latency (currently `null`; see the note below) |
| `activeEndpoints` | number | Endpoints currently registered on the bus |
| `budgetRejections` | object | Rejection counts by cause: `hopLimit`, `ttlExpired`, `cycleDetected`, `budgetExhausted` |
Interpreting the Numbers
A healthy Relay bus has a high delivered-to-total ratio and low dead letter counts. Here are patterns to watch for:
- High `deadLetteredCount` relative to total -- Check the budget rejections breakdown to find the root cause. A spike in `hopLimit` rejections often indicates a message loop between two agents. TTL expirations may mean agents are too slow to process messages within the default 1-hour window.
- Rising `failedCount` -- Subscriber handlers are throwing errors. Check individual traces' `errorMessage` field to identify which adapter or endpoint is failing. The circuit breaker will automatically stop delivering to endpoints with 5 consecutive failures.
- High `avgDeliveryLatencyMs` -- Deliveries are slow on average. This is often caused by a single slow endpoint pulling up the mean. Inspect traces filtered by endpoint to find the bottleneck.
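One way to automate these checks is a small health summary computed over the metrics snapshot. The 10% and 1% thresholds here are illustrative assumptions, not Relay defaults:

```typescript
interface RelayMetrics {
  totalMessages: number;
  deliveredCount: number;
  failedCount: number;
  deadLetteredCount: number;
}

// Flag the warning patterns described above. Thresholds (10% dead letters,
// 1% failures) are assumptions -- tune them for your own bus.
function summarizeHealth(m: RelayMetrics): string[] {
  const warnings: string[] = [];
  if (m.totalMessages === 0) return warnings;
  if (m.deadLetteredCount / m.totalMessages > 0.1) {
    warnings.push("high dead letter ratio: check budget rejections for loops or TTL expiry");
  }
  if (m.failedCount / m.totalMessages > 0.01) {
    warnings.push("rising failures: inspect traces' errorMessage to find the failing endpoint");
  }
  return warnings;
}

// With the example snapshot above (1542 total, 50 dead-lettered, 12 failed):
// 50/1542 ≈ 3.2% dead letters and 12/1542 ≈ 0.8% failures -> no warnings.
```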
Relay metrics are computed as live SQL aggregates over the trace store. They reflect the full history of the current Relay database, not just a recent time window. To reset metrics, you would need to clear the `relay_traces` table in the database. Note: `p95DeliveryLatencyMs` currently returns `null` and `budgetRejections` counters return 0 -- these are tracked at the RelayCore level but not yet aggregated into the trace store metrics.
The DorkOS client UI includes a Delivery Metrics Dashboard in the Relay panel that visualizes these numbers with success/failure breakdowns and latency distribution. Access it from the Relay tab in the sidebar when Relay is enabled.
Debugging Failed Deliveries
When a message does not reach its destination, Relay provides several tools to diagnose the issue. Start with the dead letter queue, then drill into individual traces, and check the reliability subsystems.
Check the dead letter queue
Dead letters are messages that could not be delivered to any endpoint. Fetch them from the dead letters endpoint:
```bash
curl http://localhost:4242/api/relay/dead-letters
```

Each dead letter includes the original envelope with its subject, payload, and budget. The reason for rejection is embedded in the envelope metadata. Common reasons include:

- No matching endpoints -- The subject has no registered endpoints. Verify the endpoint is registered with `GET /api/relay/endpoints`.
- Budget exceeded -- The message's hop count, TTL, or call budget was exhausted before delivery.
- Access denied -- The sender does not have permission to publish to the target subject. Check `access-rules.json`.
Inspect the message trace
If the message was delivered but the handler failed, look up the trace:
```bash
curl http://localhost:4242/api/relay/messages/{messageId}/trace
```

Check the `status` and `errorMessage` fields on each span. A `failed` status with an error message tells you exactly what went wrong in the subscriber handler. For Claude Code adapter failures, the error typically includes the Agent SDK error message.
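Scanning a trace for handler failures can be done mechanically. This helper is illustrative, using the span fields shown earlier:

```typescript
interface Span {
  subject: string;
  status: string;
  errorMessage: string | null;
}

// Pair each failing subject with its handler error so you can see at a
// glance which endpoint in a fan-out broke, and why.
function failedSpans(spans: Span[]): { subject: string; error: string }[] {
  return spans
    .filter((s) => s.status === "failed")
    .map((s) => ({ subject: s.subject, error: s.errorMessage ?? "unknown error" }));
}
```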
Check endpoint health
If deliveries are being rejected, the endpoint's circuit breaker may be open. The Relay metrics endpoint reports system-level metrics, and individual endpoint health can be assessed by looking at consecutive failure patterns in the traces.
The circuit breaker opens after 5 consecutive failures (configurable in `~/.dork/relay/config.json`). After a 30-second cooldown, it allows a single probe message through. If the probe succeeds, normal delivery resumes.
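The open/probe/close behavior can be sketched as a small state machine using the documented defaults (5 consecutive failures, 30-second cooldown). This is an illustration of the mechanism, not Relay's actual implementation:

```typescript
// Minimal circuit breaker sketch. The clock is injectable so the
// cooldown can be tested without waiting in real time.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold = 5,
    private cooldownMs = 30_000,
    private now: () => number = Date.now
  ) {}

  // May a message be delivered right now?
  allows(): boolean {
    if (this.openedAt === null) return true; // closed: deliver normally
    // Open: allow a single probe once the cooldown has elapsed.
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null; // probe succeeded: resume normal delivery
  }

  recordFailure(): void {
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= this.threshold) {
      this.openedAt = this.now(); // trip open (or restart the cooldown)
    }
  }
}
```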
Check rate limits and backpressure
If a sender is publishing too quickly, their messages are rejected by the rate limiter. The default allows 100 messages per 60-second window per sender. If an endpoint's mailbox has too many unprocessed messages, backpressure kicks in at 80% capacity (warning) and 100% capacity (rejection).
Both settings are tunable in `~/.dork/relay/config.json` and hot-reloaded without a restart.
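A sliding-window limiter matching the documented default (100 messages per 60-second window per sender) might look like the sketch below; this is an illustration, not Relay's actual code:

```typescript
// Per-sender sliding-window rate limiter. Timestamps older than the
// window are discarded on each publish attempt.
class RateLimiter {
  private sent = new Map<string, number[]>(); // sender -> publish timestamps

  constructor(
    private limit = 100,
    private windowMs = 60_000,
    private now: () => number = Date.now
  ) {}

  // Returns true if the sender may publish; records the attempt if allowed.
  tryPublish(sender: string): boolean {
    const t = this.now();
    const recent = (this.sent.get(sender) ?? []).filter(
      (ts) => t - ts < this.windowMs
    );
    if (recent.length >= this.limit) {
      this.sent.set(sender, recent);
      return false; // over the per-sender limit for this window
    }
    recent.push(t);
    this.sent.set(sender, recent);
    return true;
  }
}
```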
SSE Stream for Real-Time Monitoring
For real-time visibility, subscribe to the Relay SSE stream. The stream emits events for every message delivery and signal:
```bash
curl -N "http://localhost:4242/api/relay/stream?subject=%3E"
```

The `subject` query parameter filters which messages appear in the stream. Use `>` (URL-encoded as `%3E`) to see all messages, or provide a specific pattern like `relay.agent.*` to filter by audience.
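These filters appear to follow NATS-style wildcard semantics (`*` matches exactly one dot-separated token, `>` matches one or more trailing tokens); under that assumption, pattern matching can be sketched as:

```typescript
// Subject filter matching, assuming NATS-style wildcards: "*" matches
// one token, ">" swallows the remainder of the subject.
function subjectMatches(pattern: string, subject: string): boolean {
  const p = pattern.split(".");
  const s = subject.split(".");
  for (let i = 0; i < p.length; i++) {
    if (p[i] === ">") return s.length > i; // ">" matches all remaining tokens
    if (i >= s.length) return false;       // subject ran out of tokens
    if (p[i] !== "*" && p[i] !== s[i]) return false; // literal token mismatch
  }
  return p.length === s.length; // no trailing unmatched subject tokens
}
```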
The stream emits four event types:
| Event | Description |
|---|---|
| `relay_connected` | Initial connection confirmation with the filter pattern |
| `relay_message` | A message envelope matching the subscription pattern |
| `relay_backpressure` | A backpressure signal from an endpoint approaching or exceeding mailbox capacity |
| `relay_signal` | Other signals (dead letters, typing indicators, delivery receipts, etc.) |
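To see what these events look like on the wire, here is a minimal parser for the SSE frame format; real consumers would use the browser's `EventSource` or an SSE client library instead:

```typescript
// SSE events are separated by a blank line; each carries "event:" and
// "data:" fields. This parses a complete chunk of frames.
interface SseEvent {
  event: string;
  data: string;
}

function parseSse(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of chunk.split("\n\n")) {
    const lines = block.split("\n");
    const event = lines.find((l) => l.startsWith("event:"))?.slice(6).trim();
    const data = lines
      .filter((l) => l.startsWith("data:"))
      .map((l) => l.slice(5).trim())
      .join("\n"); // multi-line data fields are joined per the SSE spec
    if (event) events.push({ event, data });
  }
  return events;
}
```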
The SSE stream is primarily intended for debugging and monitoring. In production, use the REST API for polling metrics and the UI dashboard for visual monitoring. Long-lived SSE connections consume server resources and should be closed when no longer needed.