Rate limits

How OwnLLM applies budgets, concurrency limits, and per-tenant QPS.

OwnLLM has three layers of throttling. Knowing which one fired is the first step to working around it.

Layer        Scope                   Knob
Budget       Per API key, monthly    budgetMonthlyCents on the key
Concurrency  Per API key, in-flight  Hard-coded (32 in v1)
Tenant QPS   Per tenant, per second  Hard-coded (32 QPS in v1)

Budgets

Budgets are tracked in cents and applied to chat completions only. Each model has a flat cost_per_1k_tokens we maintain internally; the cost is debited from the key's monthly counter on every completed request.
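As a rough illustration of the debit arithmetic, the sketch below prices one completed request and subtracts it from a key's monthly budget. The model names, the per-1k prices, and the round-up-to-a-whole-cent rule are assumptions made for the example; the real cost table is internal to OwnLLM and its rounding is not documented here.

```python
import math
from fractions import Fraction

# Hypothetical prices in cents per 1k tokens (not OwnLLM's real table).
COST_PER_1K_TOKENS_CENTS = {"small-model": 2, "large-model": 10}


def request_cost_cents(model: str, total_tokens: int) -> int:
    """Cost of one completed request, rounded up to a whole cent (assumed rounding)."""
    exact = Fraction(total_tokens, 1000) * COST_PER_1K_TOKENS_CENTS[model]
    return math.ceil(exact)


def remaining_budget_cents(budget_monthly_cents: int, used_cents: int, cost_cents: int) -> int:
    """Budget left on the key after debiting one more completed request."""
    return budget_monthly_cents - (used_cents + cost_cents)


# A 2,500-token request on the hypothetical large model costs 25 cents.
cost = request_cost_cents("large-model", 2500)
left = remaining_budget_cents(budget_monthly_cents=1000, used_cents=400, cost_cents=cost)
```

Under these assumed prices, a key with a 1000-cent budget and 400 cents already used would have 575 cents left after that request.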

When a request would push a key past its monthly budget, the API returns 429 with code: "budget_exceeded". The user can:

  1. Bump the budget in the chat web at Profile → API keys (their tenant admin can also do it).
  2. Wait until the next billing cycle (1st of the calendar month).

There's no soft warning at 80% in v1: the request that would push past the budget gets the full 429. A soft warning is on the roadmap.
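Until a server-side warning exists, a client can approximate one itself. The sketch below reads the two monthly headers listed under Headers and flags when usage crosses 80%. It assumes those headers track the same monthly allowance that triggers budget_exceeded, and that `headers` is a plain dict keyed by the exact header names; the function names are hypothetical.

```python
WARN_AT_PERCENT = 80  # client-side threshold; v1 sends no warning of its own


def monthly_percent_used(headers: dict) -> int:
    """Whole-number percentage of the monthly allowance already used."""
    limit = int(headers["X-RateLimit-Limit-Tokens-Monthly"])
    remaining = int(headers["X-RateLimit-Remaining-Tokens-Monthly"])
    if limit <= 0:
        return 100  # treat a zero allowance as fully used
    used = max(0, limit - remaining)
    return used * 100 // limit  # integer math avoids float rounding at the threshold


def should_warn(headers: dict, threshold: int = WARN_AT_PERCENT) -> bool:
    """True once usage has reached the client-side warning threshold."""
    return monthly_percent_used(headers) >= threshold
```

Most HTTP libraries expose headers through a case-insensitive mapping, which also works here as long as it supports lookup by name.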

Concurrency limit

A single API key can have at most 32 chat completions in flight at once. Past that, the API returns 429 with code: "concurrency_limit".

This rarely matters for human-driven chat — it's only an issue when running an agent that bursts. Use a queue or backpressure on the client side; respect the Retry-After header.
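One way to apply client-side backpressure is a semaphore sized to the per-key limit. This is a minimal sketch using Python's asyncio; `run_bounded` is a hypothetical helper, and each job is assumed to be a zero-argument callable that returns a coroutine performing one chat completion.

```python
import asyncio

MAX_IN_FLIGHT = 32  # per-key concurrency limit in v1


async def run_bounded(jobs, limit: int = MAX_IN_FLIGHT):
    """Run coroutine factories with at most `limit` of them in flight at once."""
    gate = asyncio.Semaphore(limit)

    async def guarded(job):
        # Each job waits here until one of the `limit` slots is free.
        async with gate:
            return await job()

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

If several processes share one key, the limit would need to be divided between them, since each process only sees its own semaphore.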

Tenant QPS

The whole tenant is capped at 32 successful completions per second in v1. This is intentionally generous — it only fires under abuse or runaway agents. When it fires, every key in the tenant gets 429 with code: "tenant_qps_limit".

If you're a real user hitting this, get in touch — the cap moves up on demand.
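Since all three layers answer with 429, a client has to look at the code to tell them apart. The sketch below maps the three documented codes to their layer. The exact response body shape is an assumption: it accepts a JSON object with either a top-level `code` field or a nested `error.code`, and the function names are hypothetical.

```python
import json

LAYER_BY_CODE = {
    "budget_exceeded": "budget",
    "concurrency_limit": "concurrency",
    "tenant_qps_limit": "tenant_qps",
}


def throttle_layer(status: int, body: str):
    """Return the layer that fired for a 429, "unknown" if unrecognised, else None."""
    if status != 429:
        return None
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(payload, dict):
        return "unknown"
    code = payload.get("code")  # assumed top-level field
    if code is None and isinstance(payload.get("error"), dict):
        code = payload["error"].get("code")  # assumed nested alternative
    if not isinstance(code, str):
        return "unknown"
    return LAYER_BY_CODE.get(code, "unknown")


def is_transient(layer) -> bool:
    """Concurrency and tenant QPS clear on their own; a spent budget does not."""
    return layer in ("concurrency", "tenant_qps")
```

A retry loop would typically back off and retry only when `is_transient` is true, and surface a budget error to the user instead.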

Headers

Both successful and failed responses set these headers:

X-RateLimit-Limit-Requests:            32
X-RateLimit-Remaining-Requests:        27
X-RateLimit-Reset-Requests:            1714500030
X-RateLimit-Limit-Tokens-Monthly:      500000
X-RateLimit-Remaining-Tokens-Monthly:  412310

Use them to back off proactively instead of waiting for a 429.
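A proactive back-off could look like the sketch below, which pauses once the remaining request count runs low. It assumes X-RateLimit-Reset-Requests is a Unix timestamp in seconds (the sample value is consistent with that, but the format is not stated above); the `floor` parameter and the function name are hypothetical.

```python
import time


def seconds_to_wait(headers: dict, now: float = None, floor: int = 1) -> float:
    """Seconds to pause before the next request; 0.0 while there is headroom."""
    remaining = int(headers["X-RateLimit-Remaining-Requests"])
    if remaining > floor:
        return 0.0
    # Assumed to be Unix epoch seconds.
    reset_at = int(headers["X-RateLimit-Reset-Requests"])
    current = time.time() if now is None else now
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, reset_at - current)
```

A caller would sleep for the returned number of seconds before sending the next request, and still handle a 429 in case another client on the same key used the remaining headroom first.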

What is not rate-limited

  • /v1/models (GET) is uncapped at this scale.
  • /v1/chat/completions (POST) stream chunks are uncapped: the limit applies to request count, not byte rate.
  • The Atlas dashboard's polling and the Atlas heartbeat are not user requests; they don't consume the per-key budget.

Why so few knobs?

OwnLLM's main throughput bottleneck is the GPU on the paired machine, not the proxy. Adding more proxy-level throttling only helps with abuse; the GPU naturally rate-limits the throughput of real workloads. For most tenants the budget is the only limit you ever notice.
