When the gateway runs in platform mode (OTARI_AI_TOKEN is set), it
delegates per-request authorization and provider-credential resolution to a
peer platform service over HTTP. This document describes the wire contract
the gateway expects from that peer.
The default peer implementation is otari,
but any service that implements this contract can stand in.
Endpoints
The gateway calls two endpoints, both rooted at the configured platform base URL:
Endpoint | Purpose |
POST {base}/gateway/provider-keys/resolve | Authorize a request and return one or more provider credentials to try |
POST {base}/gateway/usage | Report the outcome of an attempt back to the platform |
{base} is the gateway's platform base_url setting. The gateway appends the endpoint path to it literally and adds no prefix of its own, so any API-version prefix the peer service exposes its routes under must be part of the base URL. For the reference otari deployment that prefix is /api/v1, so the base URL is http://backend:8000/api/v1 and the gateway ends up POSTing to http://backend:8000/api/v1/gateway/provider-keys/resolve.
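As an illustration of the literal concatenation, here is a minimal Python sketch. The function name `endpoint_url` and the two path constants are hypothetical names chosen for this example; only the paths and the example base URL come from this document.

```python
# Hypothetical helper: the names below are illustrative, not the gateway's code.
RESOLVE_PATH = "/gateway/provider-keys/resolve"
USAGE_PATH = "/gateway/usage"


def endpoint_url(base_url: str, path: str) -> str:
    """Join base URL and path literally: no prefix added, nothing normalised."""
    return base_url + path


# The reference otari deployment carries its /api/v1 prefix in the base URL.
print(endpoint_url("http://backend:8000/api/v1", RESOLVE_PATH))
```

A base URL configured without the prefix would therefore produce a path the peer does not serve, which is why the prefix belongs in the setting itself.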
Authentication
Both endpoints require X-Gateway-Token: <gw_...> in the request headers. This
proves the caller is the gateway instance configured against this platform
deployment. The resolve endpoint additionally requires X-User-Token: <tk_...>,
which is the workspace API token forwarded opaquely from the end user's
Authorization: Bearer ... header.
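The header rules above can be sketched as follows. This is a rough illustration, not the gateway's implementation: `gateway_headers` is a hypothetical name, and rejecting a non-Bearer Authorization value with `ValueError` is an assumption made for the example.

```python
from typing import Dict, Optional


def gateway_headers(gateway_token: str, authorization: Optional[str] = None) -> Dict[str, str]:
    """Build request headers for the platform peer.

    Pass the end user's Authorization header value for resolve calls;
    leave it out for usage reports, which only need X-Gateway-Token.
    """
    headers = {"X-Gateway-Token": gateway_token}
    if authorization is not None:
        scheme, _, token = authorization.partition(" ")
        # Assumption: anything other than "Bearer <token>" is rejected here.
        if scheme.lower() != "bearer" or not token.strip():
            raise ValueError("expected 'Authorization: Bearer <token>'")
        # The token is forwarded opaquely; it is never parsed or validated.
        headers["X-User-Token"] = token.strip()
    return headers
```

For example, `gateway_headers("gw_abc", "Bearer tk_123")` yields both headers, while `gateway_headers("gw_abc")` yields only the gateway token.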
Resolve
Request
POST /gateway/provider-keys/resolve
X-Gateway-Token: gw_...
X-User-Token: tk_...
Content-Type: application/json
{
"model": "gpt-4o-mini",
"provider": "openai" // optional; otherwise inferred from model prefix
}
Response — multi-attempt shape (preferred)
{
"request_id": "01HXY...",
"fallback_enabled": true,
"attempts": [
{
"attempt_id": "01HX1...",
"position": 0,
"provider": "anthropic",
"model": "claude-sonnet-4-5",
"api_key": "sk-ant-...",
"api_base": null,
"managed": false
},
{
"attempt_id": "01HX2...",
"position": 1,
"provider": "openai",
"model": "gpt-4o",
"api_key": "sk-...",
"api_base": "https://api.openai.com/v1",
"managed": false
}
]
}
The gateway iterates attempts in order. On a retryable failure it moves to the
next entry; on success it stops. The attempt_id of the entry that ultimately
succeeded (or the last one tried, on total failure) is what the gateway echoes
back via X-Correlation-ID and reports through /gateway/usage.
request_id groups every attempt_id from the same resolve call so the
platform can attribute spend, render trace timelines, and emit fallback events.
The gateway also surfaces it as the X-Otari-Request-ID response header.
fallback_enabled is informational: the platform sets it when its routing
policy actually allows fallback (i.e. the policy has multiple enabled entries
and fallback_enabled = true). The gateway does not consult it when deciding
whether to fall back; it uses len(attempts) > 1 instead.
attempts MUST contain at least one entry. An empty list is treated as a
platform bug and surfaced as 502 Bad Gateway.
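The iteration rules above can be sketched roughly as below. All class and function names (`RetryableError`, `EmptyRouteError`, `AllAttemptsFailed`, `run_attempts`) are hypothetical, and how the gateway actually decides that a failure is retryable is not specified here, so the sketch assumes the caller signals it by raising `RetryableError`.

```python
from typing import Any, Callable, Dict, Tuple


class RetryableError(Exception):
    """Hypothetical marker for failures that allow falling through."""


class EmptyRouteError(Exception):
    """Empty attempts list: a platform bug, surfaced as 502 Bad Gateway."""


class AllAttemptsFailed(Exception):
    """Every attempt failed; carries the attempt_id of the last one tried."""

    def __init__(self, correlation_id: str, cause: Exception):
        super().__init__(f"all attempts failed, last was {correlation_id}")
        self.correlation_id = correlation_id
        self.cause = cause


def run_attempts(route: Dict[str, Any], call: Callable[[Dict[str, Any]], Any]) -> Tuple[str, Any]:
    """Try attempts in list order; return (attempt_id, result) of the first success."""
    attempts = route.get("attempts") or []
    if not attempts:
        raise EmptyRouteError("platform returned no attempts")
    last_id, last_error = None, None
    for attempt in attempts:
        try:
            result = call(attempt)
        except RetryableError as exc:
            # Retryable: remember this attempt and move to the next entry.
            last_id, last_error = attempt["attempt_id"], exc
            continue
        # Success: this attempt_id is what gets echoed and reported.
        return attempt["attempt_id"], result
    raise AllAttemptsFailed(last_id, last_error)
```

Any exception other than `RetryableError` leaves the loop immediately, which matches the rule that only retryable failures trigger fallback.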
Response — single-attempt shape
The gateway also accepts a flat payload:
{
"provider": "openai",
"model": "gpt-4o-mini",
"api_key": "sk-...",
"api_base": "https://api.openai.com/v1",
"managed": true,
"correlation_id": "01HXC..."
}
The gateway maps this onto a single-attempt route (attempts = [{...}],
fallback_enabled = false) and behaves as it always has: no retry loop, errors
propagate to the client. New platform implementations should prefer the
multi-attempt shape.
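One way to picture the mapping from the flat payload to a single-attempt route is the sketch below. `normalize_route` is a hypothetical name. Two details are assumptions rather than documented behaviour: the flat correlation_id is reused as the attempt_id, and request_id is left as None because the flat shape does not carry one.

```python
from typing import Any, Dict


def normalize_route(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a multi-attempt route for either accepted response shape."""
    if "attempts" in payload:
        return payload  # already the multi-attempt shape
    attempt = {
        # Assumption: the flat correlation_id plays the role of attempt_id.
        "attempt_id": payload.get("correlation_id"),
        "position": 0,
        "provider": payload["provider"],
        "model": payload["model"],
        "api_key": payload["api_key"],
        "api_base": payload.get("api_base"),
        "managed": payload.get("managed", False),
    }
    # Assumption: no request_id exists in the flat shape, so it stays None.
    return {"request_id": None, "fallback_enabled": False, "attempts": [attempt]}
```

With exactly one attempt in the result, the len(attempts) > 1 check is false, so downstream code needs no special case to skip the retry loop.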
Failure
Status | Behaviour |
| Mapped through to the client as-is. |
| Mapped to |
Network/timeout | Mapped to |
Usage report
After every attempt — successful or failed — the gateway sends:
POST /gateway/usage
X-Gateway-Token: gw_...
Content-Type: application/json
{
"correlation_id": "01HX1...", // = the attempt_id from the resolve response
"status": "success" | "error",
"usage": { // present on success only
"prompt_tokens": 13,
"completion_tokens": 7,
"total_tokens": 20
},
"error_class": "http_401" // optional on error; omitted when the
// gateway can't classify the failure
// (e.g. mid-stream errors). See below.
}
A multi-attempt request that iterates two attempts produces two usage reports —
one per attempt — sharing the same request_id (recoverable via the original
resolve response). The platform is responsible for correlating them.
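A minimal sketch of assembling the report body for one attempt, following the field rules above. `build_usage_report` and its parameters are hypothetical names; only the JSON field names and their presence rules come from this document.

```python
from typing import Any, Dict, Optional


def build_usage_report(
    attempt_id: str,
    usage: Optional[Dict[str, int]] = None,
    failed: bool = False,
    error_class: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the JSON body for POST /gateway/usage for a single attempt."""
    body: Dict[str, Any] = {"correlation_id": attempt_id}
    if failed:
        body["status"] = "error"
        # error_class is omitted entirely when the failure is unclassified.
        if error_class is not None:
            body["error_class"] = error_class
    else:
        body["status"] = "success"
        body["usage"] = usage  # present on success only
    return body
```

A request that falls through from a first attempt to a second would call this twice, once with `failed=True` for the abandoned attempt and once with the token counts for the successful one.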
error_class is a short tag describing why the attempt was abandoned:
Tag | Cause |
|
|
|
|
| Provider returned an HTTP status code (e.g. |
| Any other exception class |
The field is omitted entirely when the gateway can't classify the failure
back to an exception — this happens with mid-stream errors surfaced via the
SSE channel, where only an error string is available. Treat a missing
error_class as "uncategorised error" when aggregating.
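On the platform side, the aggregation advice above might look like the following sketch. `count_error_classes` is a hypothetical name, and the bucket label "uncategorised" is just one possible choice.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_error_classes(reports: Iterable[Dict[str, Any]]) -> Counter:
    """Count failed attempts per error_class, bucketing missing tags together."""
    counts: Counter = Counter()
    for report in reports:
        if report.get("status") != "error":
            continue  # successful attempts carry no error_class
        # A missing (or empty) error_class is an uncategorised error.
        counts[report.get("error_class") or "uncategorised"] += 1
    return counts
```

Reading the field with `.get()` rather than indexing keeps the aggregation from failing on reports produced by mid-stream errors.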
Retry semantics
The usage endpoint is called as a background task on the gateway side. It
retries on transient failures (timeout, network error, 5xx) up to
PLATFORM_USAGE_MAX_RETRIES times with exponential backoff
(0.25s, 0.5s, 1s). It does not retry on 401, 404, 409, 422 —
those are treated as terminal client errors.
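The retry policy can be sketched as below. This is an illustration under stated assumptions, not the gateway's code: `send_with_retries` and `backoff_delays` are hypothetical names, `send` stands in for the HTTP call and is assumed to return a status code or raise `OSError` (which covers `TimeoutError` and `ConnectionError`), the retry count is read as "retries after the first try", and any status below 500 is treated as final even though only 401, 404, 409 and 422 are listed above.

```python
import time
from typing import Callable, List, Optional


def backoff_delays(max_retries: int, base: float = 0.25) -> List[float]:
    """Exponential backoff schedule: 0.25s, 0.5s, 1s, ..."""
    return [base * (2 ** i) for i in range(max_retries)]


def send_with_retries(
    send: Callable[[], int],
    max_retries: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call send() until it returns a non-5xx status or retries run out.

    Returns the last HTTP status seen, or None if the final try failed
    at the network level.
    """
    delays = backoff_delays(max_retries)
    status: Optional[int] = None
    for attempt in range(max_retries + 1):
        try:
            status = send()
        except OSError:
            status = None  # timeout or network error: transient
        if status is not None and status < 500:
            return status  # success or terminal client error: no retry
        if attempt < max_retries:
            sleep(delays[attempt])
    return status
```

Injecting `sleep` keeps the function testable without real delays; in a background task the same structure would use an async sleep instead.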
Streaming
Streaming requests (stream: true) iterate attempts just like non-streaming
requests, with one structural difference: the gateway can only fall through
before any bytes have been flushed to the client. Once an attempt yields its
first chunk, the gateway commits to that attempt; any further error
propagates to the SSE channel as today.
The mechanism is a per-attempt first-chunk gate. For each attempt:
1. Open the upstream stream (acompletion(stream=True, ...)). If this raises
   (the provider returned 401/5xx, or a network error occurred before the
   stream even opened), classify the error: retryable failures move to the
   next attempt; non-retryable failures propagate.
2. Wait for the first chunk with a bounded timeout
   (STREAMING_FALLBACK_FIRST_CHUNK_TIMEOUT_MS, default 2000 ms). If the
   upstream raises before yielding or the wait times out, move to the next
   attempt.
3. Once a first chunk is in hand, commit. Stitch it back onto the iterator
   and start flushing SSE chunks to the client.
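The three steps of the gate can be sketched with asyncio as follows. This is a simplified illustration, not the gateway's implementation: `RetryableError`, `open_with_gate` and `stream_with_fallback` are hypothetical names, each attempt is modelled as a zero-argument coroutine function that opens a stream (standing in for the upstream call), and error classification is reduced to "`RetryableError` or timeout means fall through".

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Iterable, Optional

Opener = Callable[[], Awaitable[AsyncIterator[Any]]]


class RetryableError(Exception):
    """Hypothetical marker for failures that allow falling through."""


async def _next_chunk(iterator: AsyncIterator[Any]) -> Any:
    return await iterator.__anext__()


async def open_with_gate(open_stream: Opener, timeout_s: float) -> AsyncIterator[Any]:
    stream = await open_stream()  # step 1: may raise before the stream opens
    iterator = stream.__aiter__()
    # Step 2: bounded wait for the first chunk.
    first = await asyncio.wait_for(_next_chunk(iterator), timeout_s)

    async def stitched() -> AsyncIterator[Any]:
        yield first  # step 3: put the first chunk back in front
        async for chunk in iterator:
            yield chunk

    return stitched()


async def stream_with_fallback(openers: Iterable[Opener], timeout_s: float) -> AsyncIterator[Any]:
    """Return the stream of the first attempt that yields a chunk in time."""
    last_error: Optional[BaseException] = None
    for open_stream in openers:
        try:
            return await open_with_gate(open_stream, timeout_s)
        except (RetryableError, asyncio.TimeoutError) as exc:
            last_error = exc  # nothing flushed yet, safe to try the next one
    if last_error is None:
        raise RuntimeError("no attempts to try")
    raise last_error
```

Once `stream_with_fallback` returns, the caller is committed: errors raised while iterating the returned stream are no longer caught by the loop, which mirrors the rule that fallback is only possible before the first flush.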
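Latency contract: effectively zero added latency in the success case, since
the first chunk is held only for the microseconds it takes to call the SSE
response builder. In the failure case, each abandoned attempt costs at most
first_chunk_timeout_seconds, i.e. the first-chunk timeout configured via
STREAMING_FALLBACK_FIRST_CHUNK_TIMEOUT_MS (2 seconds by default).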
What this catches: auth errors (401/403), rate-limits (429),
upstream 5xx, connection failures, hung connections, "stream opens but
errors before yielding."
What this doesn't catch: errors that arrive after the first chunk has
flushed (mid-stream connection drops, refusal messages embedded in normal
content chunks). These are out of reach without either prefix-buffering
(which would add visible latency on every request) or a client-cooperative
restart event (which would break OpenAI SDK compatibility).
Mid-stream failover is not currently planned. If a future client SDK starts
honouring a custom restart event, it could be added behind that capability
flag.
Configuration
Env var | Default | Notes |
OTARI_AI_TOKEN | — | Setting this enables platform mode. |
| | Per-resolve timeout. |
| | Per-usage-report timeout. |
PLATFORM_USAGE_MAX_RETRIES | | Max retries for transient usage-report failures. |
STREAMING_FALLBACK_FIRST_CHUNK_TIMEOUT_MS | 2000 | Per-attempt budget for the streaming first-chunk gate. |
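As a small illustration of reading two of these settings, the sketch below takes the environment as a mapping so it can be exercised without touching the real process environment. The function names are hypothetical, and treating an empty OTARI_AI_TOKEN as "not set" is an assumption made for this example.

```python
import os
from typing import Mapping


def platform_mode_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Platform mode is on when OTARI_AI_TOKEN is set.

    Assumption: an empty string counts as unset.
    """
    return bool(env.get("OTARI_AI_TOKEN"))


def first_chunk_timeout_seconds(env: Mapping[str, str] = os.environ) -> float:
    """Per-attempt first-chunk budget in seconds (default 2000 ms)."""
    millis = int(env.get("STREAMING_FALLBACK_FIRST_CHUNK_TIMEOUT_MS", "2000"))
    return millis / 1000.0
```

Converting to seconds at load time is convenient because Python's asyncio timeouts are expressed in seconds.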
Source: mozilla-ai/otari/docs/platform-protocol.md
