Key selection & rotation
How llm-rotate picks a key, tracks its health, and rotates away from failing credentials.
The core job of llm-rotate is choosing a healthy key for every call and
reacting when one fails. This page covers the selection strategies and the key
health state machine.
Selection strategies
Set the strategy globally via configure(strategy=...) or in
defaults.selection_strategy. The default is health_aware.
| Strategy | Behavior |
|---|---|
| health_aware (default) | Scores candidates by priority, weight, recent failures, and route health. Avoids keys that have been failing. |
| round_robin | Rotates through a provider's keys in order. |
| priority | Always picks the highest-priority available key. |
| weighted | Random selection proportional to each key's weight. |
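
To make the three simpler strategies concrete, here is a minimal sketch of how each one could pick from a pool. The `Key` dataclass and the function names are hypothetical and exist only for illustration; this is not llm-rotate's internal code, and the health_aware scoring is left out because its formula is not described here.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    # Hypothetical stand-in for a registered key; not llm-rotate's own type.
    name: str
    priority: int = 0
    weight: float = 1.0


def pick_priority(keys):
    """priority: always take the highest-priority key."""
    return max(keys, key=lambda k: k.priority)


def pick_weighted(keys, rng=random):
    """weighted: random choice proportional to each key's weight."""
    return rng.choices(keys, weights=[k.weight for k in keys], k=1)[0]


def round_robin(keys):
    """round_robin: an endless iterator that walks the keys in order."""
    return itertools.cycle(keys)


keys = [
    Key("openai-primary", priority=10, weight=3.0),
    Key("openai-backup", priority=1, weight=1.0),
]
rr = round_robin(keys)
first_three = [next(rr).name for _ in range(3)]
```

With these weights, `pick_weighted` would return openai-primary about three times as often as openai-backup over many calls.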

    configure(
        registry={"keys": [...]},
        use_keys=[...],
        strategy="health_aware",
        priorities={"openai-primary": 10, "openai-backup": 1},
    )

The health state machine
Every key carries a runtime health state. Transitions are driven by call outcomes and the cooldown/quarantine timers from configuration.
| State | Meaning |
|---|---|
| ACTIVE | Healthy and eligible for selection. |
| COOLDOWN | Temporarily rested after a rate-limit; skipped until the cooldown elapses. |
| QUARANTINE | Sidelined after repeated or auth failures; skipped until quarantine elapses. |
| PROBATION | Recovering: allowed back but watched closely. |
| DISABLED | Taken out of rotation entirely. |
A typical lifecycle:

    ACTIVE ──rate limit──▶ COOLDOWN ──timer──▶ PROBATION ──success──▶ ACTIVE
    ACTIVE ──auth fail ×N─▶ QUARANTINE ──timer─▶ PROBATION
    PROBATION ──failure──▶ QUARANTINE

- A rate-limit moves the key to COOLDOWN for cooldown_seconds.
- Repeated auth/permission failures move it to QUARANTINE for quarantine_seconds.
- After max_consecutive_failures, the key is taken out of rotation.
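
The lifecycle above can be read as a small transition table. The sketch below encodes only the arrows shown in the diagram. The event names are made up for illustration, treating PROBATION as selectable is an assumption drawn from "allowed back", and the max_consecutive_failures rule is left out because the page does not say which state it leads to.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    COOLDOWN = "cooldown"
    QUARANTINE = "quarantine"
    PROBATION = "probation"
    DISABLED = "disabled"


# (current state, event) -> next state, following the lifecycle diagram.
# Event names are illustrative, not llm-rotate identifiers.
TRANSITIONS = {
    (State.ACTIVE, "rate_limit"): State.COOLDOWN,
    (State.ACTIVE, "auth_failures"): State.QUARANTINE,
    (State.COOLDOWN, "timer_elapsed"): State.PROBATION,
    (State.QUARANTINE, "timer_elapsed"): State.PROBATION,
    (State.PROBATION, "success"): State.ACTIVE,
    (State.PROBATION, "failure"): State.QUARANTINE,
}


def next_state(state, event):
    """Return the next state, or stay put when the event does not apply."""
    return TRANSITIONS.get((state, event), state)


def is_eligible(state):
    """Assumption: only ACTIVE and PROBATION keys can be selected."""
    return state in (State.ACTIVE, State.PROBATION)
```

Keeping the transitions in one table means an unexpected event (a timer firing on an ACTIVE key, say) leaves the state unchanged rather than raising.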
What triggers rotation
When a call fails, llm-rotate classifies the error and decides whether to
retry on another key. Retryable categories include:
- rate_limit → cooldown, try next key
- quota_exhausted → try next key
- invalid_auth / permission_denied → quarantine, try next key
- timeout, transient_server_error, connection_error → retry with backoff
- model_unavailable, broker_route_unavailable → try next key / route
Non-retryable request errors (a malformed request, say) are surfaced immediately rather than burning through your key pool. See Errors for the full taxonomy.
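
One way to picture this classification is a lookup from error category to a health effect and a follow-up action, plus a delay function for the "retry with backoff" cases. This is a rough sketch: the tuple layout, the action labels, the `bad_request` category used in the usage note, and the `base` and `cap` values are all assumptions, and the full-jitter formula is one common choice rather than the one llm-rotate is known to use.

```python
import random

# Category -> (health effect on the failing key, what to do next).
# Mirrors the retryable categories listed above; the layout is an assumption.
RETRYABLE = {
    "rate_limit": ("cooldown", "next_key"),
    "quota_exhausted": (None, "next_key"),
    "invalid_auth": ("quarantine", "next_key"),
    "permission_denied": ("quarantine", "next_key"),
    "timeout": (None, "backoff"),
    "transient_server_error": (None, "backoff"),
    "connection_error": (None, "backoff"),
    "model_unavailable": (None, "next_key_or_route"),
    "broker_route_unavailable": (None, "next_key_or_route"),
}


def plan(category):
    """Return (health_effect, action); unlisted categories are raised at once."""
    return RETRYABLE.get(category, (None, "raise"))


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))
```

Here `plan("bad_request")` falls through to `(None, "raise")`, which matches the idea that a malformed request should not consume the key pool.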
Retries
Each call retries up to max_retries times (default 3) across keys, with
exponential backoff and jitter. Override per call:

    response = await lm.chat(
        "gpt-4o-mini",
        [{"role": "user", "content": "Hello"}],
        max_retries=5,
    )

Proactive rate limiting
Beyond reacting to 429s, llm-rotate avoids them in two ways. Both are opt-in
and zero-cost when unused:
- Client-side budgets. Set rate_limit_rpm and/or rate_limit_tpm on a key, and a sliding-window limiter skips that key once it would exceed its budget in the trailing 60 seconds; selection simply moves on to another key. Keys without budgets configured are never throttled.
- Header-driven cooldowns. When a provider response carries rate-limit headers (retry-after, remaining-requests/tokens), the key is moved into COOLDOWN before it starts returning errors.
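
The client-side budget can be illustrated with a small sliding-window counter. The sketch below covers only a requests-per-minute budget. `SlidingWindowBudget` and `first_within_budget` are hypothetical names, and the injectable `clock` parameter is there only to make the behaviour easy to test; llm-rotate's own limiter may be structured differently.

```python
import time
from collections import deque


class SlidingWindowBudget:
    """Tracks request timestamps over a trailing window (60 s by default)."""

    def __init__(self, rpm, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.window = window
        self.clock = clock
        self._stamps = deque()

    def _evict(self, now):
        # Drop timestamps that have fallen out of the trailing window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()

    def would_exceed(self):
        """True if one more request now would go over the budget."""
        self._evict(self.clock())
        return len(self._stamps) >= self.rpm

    def record(self):
        """Note that a request was just sent on this key."""
        self._stamps.append(self.clock())


def first_within_budget(key_names, budgets):
    """Return the first key not over budget; keys without a budget always pass."""
    for name in key_names:
        budget = budgets.get(name)
        if budget is None or not budget.would_exceed():
            return name
    return None
```

Because a key with no entry in `budgets` is never checked, the unbudgeted path does no extra work, in line with budgets being opt-in.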
Advisory leases
Under concurrency, llm-rotate takes a short advisory lease on the key it
selects so simultaneous requests spread across the pool instead of stampeding a
single near-limit key. Leases are advisory only: if every key is leased,
selection falls back to a leased key rather than failing, so a fully leased pool
never blocks a call. With the Redis backend, leases coordinate across workers.
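
The fallback behaviour can be sketched with an in-process lease table. Everything here is an assumption made for illustration: the `AdvisoryLeases` class, the 2-second `ttl`, and the rule of falling back to the lease that expires soonest. A multi-worker setup would need a shared store in place of the dictionary.

```python
import time


class AdvisoryLeases:
    """In-process lease table; a shared store would be needed across workers."""

    def __init__(self, ttl=2.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._expires = {}  # key name -> lease expiry time

    def is_leased(self, name):
        expiry = self._expires.get(name)
        return expiry is not None and expiry > self.clock()

    def acquire(self, candidates):
        """Prefer an unleased key; otherwise reuse the lease expiring soonest."""
        if not candidates:
            raise ValueError("no candidate keys")
        free = [n for n in candidates if not self.is_leased(n)]
        if free:
            chosen = free[0]
        else:
            # Advisory only: never fail just because every key is leased.
            chosen = min(candidates, key=lambda n: self._expires[n])
        self._expires[chosen] = self.clock() + self.ttl
        return chosen
```

Two concurrent calls over the candidates a and b would get a and then b; a third call arriving before either lease expires still gets a key instead of an error.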
When everything is exhausted
If every eligible key (and every fallback provider) is unavailable, the call
raises NoAvailableKeyError, which includes a health
report and the earliest time a key is expected to become available again.
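
As a rough picture of what such an error can carry, the sketch below builds an exception from per-key health entries and derives the earliest recovery time. `PoolExhausted`, `KeyHealth`, and their attributes are hypothetical stand-ins; the actual fields on NoAvailableKeyError are not documented on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeyHealth:
    # Hypothetical per-key entry in a health report.
    name: str
    state: str
    available_at: Optional[float]  # None when no timer applies


class PoolExhausted(Exception):
    """Hypothetical stand-in for an exhaustion error carrying a health report."""

    def __init__(self, report):
        self.report = list(report)
        times = [h.available_at for h in self.report if h.available_at is not None]
        self.earliest_available_at = min(times) if times else None
        super().__init__(
            f"no key available; earliest recovery at {self.earliest_available_at}"
        )


def seconds_until_retry(err, now):
    """How long a caller might wait before trying again, or None if unknown."""
    if err.earliest_available_at is None:
        return None
    return max(0.0, err.earliest_available_at - now)
```

A caller could use the returned delay to schedule a retry or to report to the user when capacity is expected back.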
By default, health state lives in an in-memory store, so it resets on restart and isn't shared across processes. For multi-worker deployments, switch on the Redis state store so cooldowns, rotation, and leases coordinate across every worker.