Rate limits
How OwnLLM applies budgets, concurrency limits, and per-tenant QPS.
OwnLLM has three layers of throttling. Knowing which one fired is the first step to working around it.
| Layer | Scope | Knob |
|---|---|---|
| Budget | Per API key, monthly | budgetMonthlyCents on the key |
| Concurrency | Per API key, in-flight | Hard-coded (32 in v1) |
| Tenant QPS | Per tenant, per second | Hard-coded (32 QPS in v1) |
Budgets
Budgets are tracked in cents and applied to chat completions only.
Each model has a flat cost_per_1k_tokens we maintain internally;
the cost is debited from the key's monthly counter on every
completed request.
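As a rough illustration of the debit arithmetic, the sketch below assumes the per-model rate is a whole number of cents per 1,000 tokens and that fractional cents round up. Both the rounding rule and the helper names (`estimate_debit_cents`, `would_exceed_budget`) are assumptions for illustration, not documented OwnLLM behaviour.

```python
def estimate_debit_cents(total_tokens: int, cost_per_1k_tokens_cents: int) -> int:
    """Estimate the cents debited for one completed request.

    Assumption: the rate is an integer number of cents per 1,000 tokens
    and partial cents are rounded up. The real rounding rule is internal.
    """
    if total_tokens < 0 or cost_per_1k_tokens_cents < 0:
        raise ValueError("tokens and rate must be non-negative")
    # Ceiling division done in integers to avoid float rounding surprises.
    return -(-(total_tokens * cost_per_1k_tokens_cents) // 1000)


def would_exceed_budget(spent_cents: int, budget_monthly_cents: int, debit_cents: int) -> bool:
    """True if this debit would push the monthly counter past the budget."""
    return spent_cents + debit_cents > budget_monthly_cents
```

For example, under these assumptions a 2,500-token request at 2 cents per 1,000 tokens would be estimated at 5 cents.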
When a key would push past its budget, the API returns 429 with
code: "budget_exceeded". The user can:
- Bump the budget in the chat web at Profile → API keys (their tenant admin can also do it).
- Wait until the next billing cycle (1st of the calendar month).
v1 has no soft warning at 80%: the request that would push past the budget gets the full 429. A soft warning is on the roadmap.
Concurrency limit
A single API key can have at most 32 chat completions in flight at
once. Past that, the API returns 429 with code: "concurrency_limit".
This rarely matters for human-driven chat — it's only an issue when
running an agent that bursts. Use a queue or backpressure on the
client side; respect the Retry-After header.
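One way to apply client-side backpressure is a semaphore sized to the per-key limit. The sketch below is a minimal illustration, not an official client: `run_bounded`, `retry_after_seconds`, and `MAX_IN_FLIGHT` are hypothetical names, and it assumes `Retry-After` carries a number of seconds rather than an HTTP date.

```python
import asyncio
from typing import Awaitable, Callable, List, Mapping, Sequence, TypeVar

T = TypeVar("T")

MAX_IN_FLIGHT = 32  # v1 per-key concurrency limit


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as seconds, falling back to `default`.

    Assumption: the header is a delay in seconds. The HTTP-date form is
    not parsed here. Pass a case-insensitive mapping if your HTTP library
    provides one; a plain dict lookup is case-sensitive.
    """
    raw = headers.get("Retry-After")
    if raw is None:
        return default
    try:
        return max(0.0, float(raw))
    except ValueError:
        return default


async def run_bounded(
    calls: Sequence[Callable[[], Awaitable[T]]],
    limit: int = MAX_IN_FLIGHT,
) -> List[T]:
    """Run all calls, keeping at most `limit` of them in flight at once."""
    slots = asyncio.Semaphore(limit)  # created inside the running loop

    async def guarded(call: Callable[[], Awaitable[T]]) -> T:
        async with slots:
            return await call()

    return await asyncio.gather(*(guarded(c) for c in calls))
```

Each element of `calls` would wrap one chat completion request. Setting `limit` slightly below 32 leaves headroom for other processes sharing the same key.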
Tenant QPS
The whole tenant is capped at 32 successful completions per second
in v1. This is intentionally generous — it only fires under abuse
or runaway agents. When it fires, every key in the tenant gets 429
with code: "tenant_qps_limit".
If you hit this with legitimate traffic, get in touch; the cap can be raised on request.
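Since each layer returns 429 with a different code, a client can branch on that code to decide what to do next. The sketch below assumes the error body is JSON with a top-level `code` field; the body shape, the `Throttle` enum, and the helper names are assumptions for illustration.

```python
import json
from enum import Enum
from typing import Optional


class Throttle(Enum):
    BUDGET = "budget_exceeded"
    CONCURRENCY = "concurrency_limit"
    TENANT_QPS = "tenant_qps_limit"


def classify_429(status: int, body: str) -> Optional[Throttle]:
    """Return which throttling layer fired, or None if it cannot be told.

    Assumption: the 429 body is a JSON object with a top-level "code".
    """
    if status != 429:
        return None
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(payload, dict):
        return None
    code = payload.get("code")
    if not isinstance(code, str):
        return None
    try:
        return Throttle(code)
    except ValueError:  # unknown code
        return None


def should_retry(throttle: Optional[Throttle]) -> bool:
    """Concurrency and QPS limits are transient; a spent budget is not."""
    return throttle in (Throttle.CONCURRENCY, Throttle.TENANT_QPS)
```

Retrying a `budget_exceeded` response only helps after the budget is raised or the next billing cycle starts, so this sketch treats it as non-retryable.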
Headers
Successful and failed responses set:
X-RateLimit-Limit-Requests: 32
X-RateLimit-Remaining-Requests: 27
X-RateLimit-Reset-Requests: 1714500030
X-RateLimit-Limit-Tokens-Monthly: 500000
X-RateLimit-Remaining-Tokens-Monthly: 412310
Use them to back off proactively instead of waiting for a 429.
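A minimal sketch of proactive back-off from these headers follows. It assumes `X-RateLimit-Reset-Requests` is a Unix timestamp in seconds (the example value looks like one, but the format is not stated here); `proactive_delay` and its parameters are hypothetical.

```python
import time
from typing import Mapping, Optional


def proactive_delay(
    headers: Mapping[str, str],
    now: Optional[float] = None,
    min_remaining: int = 1,
) -> float:
    """Seconds to wait before the next request, based on response headers.

    Assumption: X-RateLimit-Reset-Requests is a Unix timestamp in seconds.
    Returns 0.0 when the headers are missing or malformed, or when at
    least `min_remaining` requests are still available.
    """
    try:
        remaining = int(headers["X-RateLimit-Remaining-Requests"])
        reset_at = float(headers["X-RateLimit-Reset-Requests"])
    except (KeyError, ValueError):
        return 0.0
    if remaining >= min_remaining:
        return 0.0
    current = time.time() if now is None else now
    # Never return a negative delay if the reset time has already passed.
    return max(0.0, reset_at - current)
```

A caller would sleep for the returned number of seconds before sending the next request; raising `min_remaining` makes the client back off earlier.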
What is not rate-limited
- /v1/models (GET) is uncapped at this scale.
- /v1/chat/completions (POST) stream chunks are uncapped: the limit is on request count, not byte rate.
- The Atlas dashboard's polling and the Atlas heartbeat are not user requests; they don't consume the per-key budget.
Why so few knobs?
OwnLLM's main throughput bottleneck is the GPU on the paired machine, not the proxy. Adding more proxy-level throttling only helps with abuse; the GPU naturally rate-limits the throughput of real workloads. For most tenants the budget is the only limit you ever notice.