Relay Observability
Track message delivery, inspect traces, monitor metrics, and debug failures in the Relay message bus
Relay tracks every message from publish to final delivery. The tracing system records timing information, budget consumption, and error details for each delivery attempt. Aggregate metrics provide a bird's-eye view of your message bus health, and the dead letter queue captures messages that could not be delivered.
This guide covers how to use traces, metrics, and debugging tools to understand what is happening inside Relay.
Message Tracing
Every message published through Relay is assigned a trace ID, and each delivery to an endpoint creates a span. A span records the full lifecycle of a single delivery attempt: when it was sent, when it arrived at the endpoint, when processing completed, and whether it succeeded or failed.
Trace Span Fields
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique ID of this span |
| `messageId` | string | ID of the message being delivered |
| `traceId` | string | ID shared by all spans in the trace |
| `subject` | string | Subject the message was delivered to |
| `status` | string | One of `sent`, `delivered`, `failed`, `timeout` |
| `sentAt` | string | ISO timestamp when the message was published |
| `deliveredAt` | string \| null | ISO timestamp when the message arrived at the endpoint |
| `processedAt` | string \| null | ISO timestamp when processing completed |
| `errorMessage` | string \| null | Error details for a failed delivery |
| `metadata` | object \| null | Additional span metadata |
Looking Up a Trace
To inspect the full delivery trace for a message, use the trace endpoint with the message ID:
```bash
curl http://localhost:4242/api/relay/messages/01HX.../trace
```

The response includes every span in the trace chain, ordered by `sentAt` ascending:
```json
{
  "traceId": "01HXABC123",
  "spans": [
    {
      "id": "01HXDEF456",
      "messageId": "01HXABC123",
      "traceId": "01HXABC123",
      "subject": "relay.agent.backend",
      "status": "delivered",
      "sentAt": "2025-02-26T12:00:00.000Z",
      "deliveredAt": "2025-02-26T12:00:00.050Z",
      "processedAt": "2025-02-26T12:00:00.200Z",
      "errorMessage": null,
      "metadata": null
    }
  ]
}
```

For messages that fan out to multiple endpoints, the trace contains one span per endpoint.
Trace Statuses
Each span transitions through a lifecycle of statuses:
| Status | Meaning |
|---|---|
| `sent` | Message published, delivery in progress |
| `delivered` | Message delivered and processed successfully |
| `failed` | Subscription handler threw an error |
| `timeout` | Message rejected (budget exceeded, access denied, TTL expired, no matching endpoints, or circuit breaker open) |
A healthy message moves from `sent` to `delivered`. The time between `sentAt` and `deliveredAt` is the delivery latency. The time between `deliveredAt` and `processedAt` is the processing latency, which includes the time the adapter (e.g., Claude Code) takes to handle the message.
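The latency arithmetic can be sketched directly from a span's timestamps. The field names below come from the trace example above; the helper functions themselves are illustrative, not part of Relay's API:

```typescript
interface TraceSpan {
  sentAt: string;             // ISO timestamp: message published
  deliveredAt: string | null; // ISO timestamp: arrived at the endpoint
  processedAt: string | null; // ISO timestamp: handler finished
}

// Delivery latency: time between publish and arrival at the endpoint.
function deliveryLatencyMs(span: TraceSpan): number | null {
  if (!span.deliveredAt) return null;
  return Date.parse(span.deliveredAt) - Date.parse(span.sentAt);
}

// Processing latency: time the adapter spent handling the message.
function processingLatencyMs(span: TraceSpan): number | null {
  if (!span.deliveredAt || !span.processedAt) return null;
  return Date.parse(span.processedAt) - Date.parse(span.deliveredAt);
}

// The example span above yields 50 ms delivery and 150 ms processing latency.
const exampleSpan: TraceSpan = {
  sentAt: "2025-02-26T12:00:00.000Z",
  deliveredAt: "2025-02-26T12:00:00.050Z",
  processedAt: "2025-02-26T12:00:00.200Z",
};
```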
Using MCP Tools
Agents running inside DorkOS can inspect traces using the built-in MCP tools without making HTTP calls:
- `relay_get_trace` -- Get the full delivery trace for a message by ID
- `relay_get_metrics` -- Get aggregate delivery metrics for the bus

The `relay_get_trace` tool accepts a `messageId` parameter and returns the same trace data as the REST endpoint. This is useful for agents that need to verify whether a message they sent was successfully delivered before proceeding with the next step.
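As an illustrative sketch of that verify-before-proceeding pattern, the polling helper below takes any trace-fetching function as a parameter (standing in for a `relay_get_trace` call or the REST endpoint); the helper itself is not part of Relay:

```typescript
type TraceStatus = "sent" | "delivered" | "failed" | "timeout";

interface Trace {
  spans: { status: TraceStatus }[];
}

// Poll a trace until every span has left the "sent" state, or give up.
// `getTrace` is a placeholder for whatever fetches the trace.
async function waitForDelivery(
  getTrace: (messageId: string) => Promise<Trace>,
  messageId: string,
  { attempts = 10, intervalMs = 500 } = {}
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    const { spans } = await getTrace(messageId);
    if (spans.length > 0 && spans.every((s) => s.status !== "sent")) {
      // Settled: success only if no span failed or timed out.
      return spans.every((s) => s.status === "delivered");
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // still in flight after all attempts
}
```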
Delivery Metrics Dashboard
Relay computes aggregate delivery metrics from the trace store using live SQL aggregates. These metrics provide a summary of bus health without requiring you to inspect individual traces.
Fetching Metrics
Retrieve the current metrics snapshot from the trace metrics endpoint:
```bash
curl http://localhost:4242/api/relay/trace/metrics
```

```json
{
  "totalMessages": 1542,
  "deliveredCount": 1480,
  "failedCount": 12,
  "deadLetteredCount": 50,
  "avgDeliveryLatencyMs": 45.2,
  "p95DeliveryLatencyMs": null,
  "activeEndpoints": 8,
  "budgetRejections": {
    "hopLimit": 0,
    "ttlExpired": 0,
    "cycleDetected": 0,
    "budgetExhausted": 0
  }
}
```

Metrics Fields
| Field | Type | Description |
|---|---|---|
| `totalMessages` | number | Total messages recorded in the trace store |
| `deliveredCount` | number | Messages delivered and processed successfully |
| `failedCount` | number | Deliveries where the subscriber handler threw an error |
| `deadLetteredCount` | number | Messages that could not be delivered to any endpoint |
| `avgDeliveryLatencyMs` | number | Mean time between `sentAt` and `deliveredAt` |
| `p95DeliveryLatencyMs` | number \| null | 95th-percentile delivery latency (currently `null`; see the note below) |
| `activeEndpoints` | number | Endpoints currently registered on the bus |
| `budgetRejections` | object | Rejection counts by cause: `hopLimit`, `ttlExpired`, `cycleDetected`, `budgetExhausted` |
Interpreting the Numbers
A healthy Relay bus has a high delivered-to-total ratio and low dead letter counts. Here are patterns to watch for:
- High `deadLetteredCount` relative to total -- Check the budget rejections breakdown to find the root cause. A spike in `hopLimit` rejections often indicates a message loop between two agents. TTL expirations may mean agents are too slow to process messages within the default 1-hour window.
- Rising `failedCount` -- Subscriber handlers are throwing errors. Check individual traces' `errorMessage` field to identify which adapter or endpoint is failing. The circuit breaker will automatically stop delivering to endpoints with 5 consecutive failures.
- High `avgDeliveryLatencyMs` -- Deliveries are slow on average. This is often caused by a single slow endpoint pulling up the mean. Inspect traces filtered by endpoint to find the bottleneck.
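One way to automate these checks is a small health summary computed over the metrics snapshot. The 10% and 1% thresholds here are illustrative assumptions, not Relay defaults:

```typescript
interface RelayMetrics {
  totalMessages: number;
  deliveredCount: number;
  failedCount: number;
  deadLetteredCount: number;
}

// Flag the warning patterns described above. Thresholds (10% dead letters,
// 1% failures) are assumptions -- tune them for your own bus.
function summarizeHealth(m: RelayMetrics): string[] {
  const warnings: string[] = [];
  if (m.totalMessages === 0) return warnings;
  if (m.deadLetteredCount / m.totalMessages > 0.1) {
    warnings.push("high dead letter ratio: check budget rejections for loops or TTL expiry");
  }
  if (m.failedCount / m.totalMessages > 0.01) {
    warnings.push("rising failures: inspect traces' errorMessage to find the failing endpoint");
  }
  return warnings;
}

// With the example snapshot above (1542 total, 50 dead-lettered, 12 failed):
// 50/1542 ≈ 3.2% dead letters and 12/1542 ≈ 0.8% failures -> no warnings.
```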
Relay metrics are computed as live SQL aggregates over the trace store. They reflect the full history of the current Relay database, not just a recent time window. To reset metrics, you would need to clear the `relay_traces` table in the database. Note: `p95DeliveryLatencyMs` currently returns `null` and `budgetRejections` counters return 0 -- these are tracked at the RelayCore level but not yet aggregated into the trace store metrics.
The DorkOS client UI includes a Delivery Metrics Dashboard in the Relay panel that visualizes these numbers with success/failure breakdowns and latency distribution. Access it from the Relay tab in the sidebar when Relay is enabled.
Debugging Failed Deliveries
When a message does not reach its destination, Relay provides several tools to diagnose the issue. Start with the dead letter queue, then drill into individual traces, and check the reliability subsystems.
Check the dead letter queue
Dead letters are messages that could not be delivered to any endpoint. Fetch them from the dead letters endpoint:
```bash
curl http://localhost:4242/api/relay/dead-letters
```

Each dead letter includes the original envelope with its subject, payload, and budget. The reason for rejection is embedded in the envelope metadata. Common reasons include:

- No matching endpoints -- The subject has no registered endpoints. Verify the endpoint is registered with `GET /api/relay/endpoints`.
- Budget exceeded -- The message's hop count, TTL, or call budget was exhausted before delivery.
- Access denied -- The sender does not have permission to publish to the target subject. Check `access-rules.json`.
Inspect the message trace
If the message was delivered but the handler failed, look up the trace:
```bash
curl http://localhost:4242/api/relay/messages/{messageId}/trace
```

Check the `status` and `errorMessage` fields on each span. A `failed` status with an error message tells you exactly what went wrong in the subscriber handler. For Claude Code adapter failures, the error typically includes the Agent SDK error message.
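Scanning a trace for handler failures can be done mechanically. This helper is illustrative, using the span fields shown earlier:

```typescript
interface Span {
  subject: string;
  status: string;
  errorMessage: string | null;
}

// Pair each failing subject with its handler error so you can see at a
// glance which endpoint in a fan-out broke, and why.
function failedSpans(spans: Span[]): { subject: string; error: string }[] {
  return spans
    .filter((s) => s.status === "failed")
    .map((s) => ({ subject: s.subject, error: s.errorMessage ?? "unknown error" }));
}
```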
Check endpoint health
If deliveries are being rejected, the endpoint's circuit breaker may be open. The Relay metrics endpoint reports system-level metrics, and individual endpoint health can be assessed by looking at consecutive failure patterns in the traces.
The circuit breaker opens after 5 consecutive failures (configurable in `~/.dork/relay/config.json`). After a 30-second cooldown, it allows a single probe message through. If the probe succeeds, normal delivery resumes.
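The open/probe/close behavior can be sketched as a small state machine using the documented defaults (5 consecutive failures, 30-second cooldown). This is an illustration of the mechanism, not Relay's actual implementation:

```typescript
// Minimal circuit breaker sketch. The clock is injectable so the
// cooldown can be tested without waiting in real time.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;

  constructor(
    private threshold = 5,
    private cooldownMs = 30_000,
    private now: () => number = Date.now
  ) {}

  // May a message be delivered right now?
  allows(): boolean {
    if (this.openedAt === null) return true; // closed: deliver normally
    // Open: allow a single probe once the cooldown has elapsed.
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null; // probe succeeded: resume normal delivery
  }

  recordFailure(): void {
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= this.threshold) {
      this.openedAt = this.now(); // trip open (or restart the cooldown)
    }
  }
}
```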
Check rate limits and backpressure
If a sender is publishing too quickly, their messages are rejected by the rate limiter. The default allows 100 messages per 60-second window per sender. If an endpoint's mailbox has too many unprocessed messages, backpressure kicks in at 80% capacity (warning) and 100% capacity (rejection).
Both settings are tunable in `~/.dork/relay/config.json` and hot-reloaded without a restart.
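A sliding-window limiter matching the documented default (100 messages per 60-second window per sender) might look like the sketch below; this is an illustration, not Relay's actual code:

```typescript
// Per-sender sliding-window rate limiter. Timestamps older than the
// window are discarded on each publish attempt.
class RateLimiter {
  private sent = new Map<string, number[]>(); // sender -> publish timestamps

  constructor(
    private limit = 100,
    private windowMs = 60_000,
    private now: () => number = Date.now
  ) {}

  // Returns true if the sender may publish; records the attempt if allowed.
  tryPublish(sender: string): boolean {
    const t = this.now();
    const recent = (this.sent.get(sender) ?? []).filter(
      (ts) => t - ts < this.windowMs
    );
    if (recent.length >= this.limit) {
      this.sent.set(sender, recent);
      return false; // over the per-sender limit for this window
    }
    recent.push(t);
    this.sent.set(sender, recent);
    return true;
  }
}
```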
SSE Stream for Real-Time Monitoring
For real-time visibility, subscribe to the Relay SSE stream. The stream emits events for every message delivery and signal:
```bash
curl -N "http://localhost:4242/api/relay/stream?subject=%3E"
```

The `subject` query parameter filters which messages appear in the stream. Use `>` (URL-encoded as `%3E`) to see all messages, or provide a specific pattern like `relay.agent.*` to filter by audience.
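These filters appear to follow NATS-style wildcard semantics (`*` matches exactly one dot-separated token, `>` matches one or more trailing tokens); under that assumption, pattern matching can be sketched as:

```typescript
// Subject filter matching, assuming NATS-style wildcards: "*" matches
// one token, ">" swallows the remainder of the subject.
function subjectMatches(pattern: string, subject: string): boolean {
  const p = pattern.split(".");
  const s = subject.split(".");
  for (let i = 0; i < p.length; i++) {
    if (p[i] === ">") return s.length > i; // ">" matches all remaining tokens
    if (i >= s.length) return false;       // subject ran out of tokens
    if (p[i] !== "*" && p[i] !== s[i]) return false; // literal token mismatch
  }
  return p.length === s.length; // no trailing unmatched subject tokens
}
```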
The stream emits four event types:
| Event | Description |
|---|---|
| `relay_connected` | Initial connection confirmation with the filter pattern |
| `relay_message` | A message envelope matching the subscription pattern |
| `relay_backpressure` | A backpressure signal from an endpoint approaching or exceeding mailbox capacity |
| `relay_signal` | Other signals (dead letters, typing indicators, delivery receipts, etc.) |
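To see what these events look like on the wire, here is a minimal parser for the SSE frame format; real consumers would use the browser's `EventSource` or an SSE client library instead:

```typescript
// SSE events are separated by a blank line; each carries "event:" and
// "data:" fields. This parses a complete chunk of frames.
interface SseEvent {
  event: string;
  data: string;
}

function parseSse(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of chunk.split("\n\n")) {
    const lines = block.split("\n");
    const event = lines.find((l) => l.startsWith("event:"))?.slice(6).trim();
    const data = lines
      .filter((l) => l.startsWith("data:"))
      .map((l) => l.slice(5).trim())
      .join("\n"); // multi-line data fields are joined per the SSE spec
    if (event) events.push({ event, data });
  }
  return events;
}
```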
The SSE stream is primarily intended for debugging and monitoring. In production, use the REST API for polling metrics and the UI dashboard for visual monitoring. Long-lived SSE connections consume server resources and should be closed when no longer needed.