Rate limits

OwnLLM has three layers of throttling. Knowing which one fired is the first step to working around it.

Layer	Scope	Knob
Budget	Per API key, monthly	`budgetMonthlyCents` on the key
Concurrency	Per API key, in-flight	Hard-coded (32 in v1)
Tenant QPS	Per tenant, per second	Hard-coded (32 QPS in v1)

Budgets

Budgets are tracked in cents and apply to chat completions only. Each model has a flat `cost_per_1k_tokens` rate that we maintain internally; the cost of each completed request is debited from the key's monthly counter.
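As a rough illustration of the arithmetic, here is a minimal sketch of how a per-request cost could be estimated on the client side. The rate value, the function names, and the assumption that cost scales linearly with the request's total token count are all hypothetical; the actual rates are internal, and the server's rounding rules are not documented here.

```python
from decimal import Decimal


def request_cost_cents(total_tokens: int, cost_per_1k_tokens: Decimal) -> Decimal:
    """Estimate one request's cost in cents from a flat per-1k-token rate.

    Assumes linear scaling with total tokens and no rounding; the server
    may count tokens or round differently.
    """
    return (Decimal(total_tokens) / Decimal(1000)) * cost_per_1k_tokens


def remaining_cents(budget_monthly_cents: int, spent_cents: Decimal) -> Decimal:
    """Cents left on the key this month, never below zero."""
    return max(Decimal(0), Decimal(budget_monthly_cents) - spent_cents)


# Hypothetical numbers: 2,500 tokens at 0.4 cents per 1k tokens.
cost = request_cost_cents(2500, Decimal("0.4"))
left = remaining_cents(500, cost)
```

Decimal is used instead of float so that fractional cents do not drift as many small requests accumulate.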

When a request would push a key past its monthly budget, the API returns 429 with code: "budget_exceeded". The user has two options:

Raise the budget in the chat web app at Profile → API keys (their tenant admin can also do it).
Wait until the next billing cycle (1st of the calendar month).

There is no soft warning at 80% in v1: the request that would push past the budget gets the full 429. A soft warning is on the roadmap.

Concurrency limit

A single API key can have at most 32 chat completions in flight at once. Past that, the API returns 429 with code: "concurrency_limit".

This rarely matters for human-driven chat — it's only an issue when running an agent that bursts. Use a queue or backpressure on the client side; respect the Retry-After header.
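One way to apply client-side backpressure is a semaphore that caps in-flight requests at the per-key limit. The sketch below is only an illustration: `send` is a hypothetical coroutine you supply, and its return shape (status, Retry-After in seconds or None, body) is an assumption, as is Retry-After being expressed in seconds.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List, Optional, Tuple

MAX_IN_FLIGHT = 32  # per-key concurrency limit in v1

# Assumed response shape: (HTTP status, Retry-After seconds or None, body).
Response = Tuple[int, Optional[float], Any]
Sender = Callable[[Any], Awaitable[Response]]


async def run_bounded(
    jobs: Iterable[Any],
    send: Sender,
    max_in_flight: int = MAX_IN_FLIGHT,
    max_attempts: int = 5,
) -> List[Tuple[int, Any]]:
    """Run every job through `send` with at most max_in_flight at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    slots = asyncio.Semaphore(max_in_flight)

    async def run_one(job: Any) -> Tuple[int, Any]:
        async with slots:  # the slot is held across retries
            for attempt in range(max_attempts):
                status, retry_after, body = await send(job)
                if status != 429 or attempt == max_attempts - 1:
                    return status, body
                # Prefer the server's Retry-After; otherwise back off exponentially.
                delay = retry_after if retry_after is not None else float(2 ** attempt)
                await asyncio.sleep(delay)
        raise AssertionError("unreachable")

    # gather returns results in the same order as the jobs.
    return await asyncio.gather(*(run_one(job) for job in jobs))
```

This sketch retries every 429 the same way. A real client would likely stop immediately on "budget_exceeded", since waiting a few seconds does not restore a monthly budget.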

Tenant QPS

The whole tenant is capped at 32 successful completions per second in v1. This is intentionally generous — it only fires under abuse or runaway agents. When it fires, every key in the tenant gets 429 with code: "tenant_qps_limit".

If you're a legitimate user hitting this, get in touch; the cap can be raised on request.
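Since each layer calls for a different reaction, a client can branch on the code that accompanies the 429. The sketch below maps the three documented codes to a suggested action. The `Action` names are made up for illustration, and how you extract the code string from the response body is left to the caller, because the body layout is not specified here.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PROCEED = "proceed"
    STOP_UNTIL_BUDGET_CHANGES = "stop_until_budget_changes"
    RETRY_WITH_FEWER_IN_FLIGHT = "retry_with_fewer_in_flight"
    BACK_OFF_TENANT_WIDE = "back_off_tenant_wide"
    UNKNOWN = "unknown"


# The three codes documented for 429 responses.
_ACTIONS = {
    "budget_exceeded": Action.STOP_UNTIL_BUDGET_CHANGES,
    "concurrency_limit": Action.RETRY_WITH_FEWER_IN_FLIGHT,
    "tenant_qps_limit": Action.BACK_OFF_TENANT_WIDE,
}


def classify(status: int, code: Optional[str]) -> Action:
    """Pick a reaction for a response, given its status and error code."""
    if status != 429:
        return Action.PROCEED
    return _ACTIONS.get(code or "", Action.UNKNOWN)
```

Returning `UNKNOWN` for an unrecognized code keeps the client conservative if new throttling codes appear later.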

Headers

Both successful and failed responses set these headers:

X-RateLimit-Limit-Requests:        32
X-RateLimit-Remaining-Requests:    27
X-RateLimit-Reset-Requests:        1714500030
X-RateLimit-Limit-Tokens-Monthly:  500000
X-RateLimit-Remaining-Tokens-Monthly: 412310

Use them to back off proactively instead of waiting for a 429.
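A minimal sketch of proactive backoff based on these headers follows. It assumes that X-RateLimit-Reset-Requests is a Unix timestamp in seconds (the sample value looks like one, but this is not stated) and that the monthly headers can be read as a simple used/remaining ratio. The function names are hypothetical.

```python
import time
from typing import Dict, Mapping, Optional


def _lowercased(headers: Mapping[str, str]) -> Dict[str, str]:
    # HTTP header names are case-insensitive, so normalize before lookup.
    return {name.lower(): value for name, value in headers.items()}


def seconds_until_safe(
    headers: Mapping[str, str],
    now: Optional[float] = None,
    min_remaining: int = 1,
) -> float:
    """Seconds to pause before the next request; 0.0 if there is headroom."""
    h = _lowercased(headers)
    remaining = int(h["x-ratelimit-remaining-requests"])
    reset_at = int(h["x-ratelimit-reset-requests"])  # assumed Unix seconds
    if remaining >= min_remaining:
        return 0.0
    current = time.time() if now is None else now
    return max(0.0, float(reset_at - current))


def monthly_fraction_used(headers: Mapping[str, str]) -> float:
    """Share of the monthly allowance already consumed, from 0.0 to 1.0."""
    h = _lowercased(headers)
    limit = int(h["x-ratelimit-limit-tokens-monthly"])
    remaining = int(h["x-ratelimit-remaining-tokens-monthly"])
    if limit <= 0:
        return 1.0
    return 1.0 - remaining / limit
```

A client could call `seconds_until_safe` before each request and sleep for the returned duration, and could log its own warning once `monthly_fraction_used` crosses a threshold such as 0.8.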

What is not rate-limited

/v1/models (GET) is uncapped at this scale.
/v1/chat/completions (POST) stream chunks are uncapped; the limit applies to request count, not byte rate.
The Atlas dashboard's polling and the Atlas heartbeat are not user requests; they don't consume the per-key budget.

Why so few knobs?

OwnLLM's main throughput bottleneck is the GPU on the paired machine, not the proxy. More proxy-level throttling would only help against abuse; the GPU naturally limits the throughput of real workloads. For most tenants, the budget is the only limit they ever notice.