Architecture · 8 min read · 10 September 2025

AI System Reliability Patterns: Fallbacks, Circuit Breakers, and Graceful Degradation

Reliable AI systems are not systems that never fail. They are systems that fail gracefully, recover quickly, and always give users a usable experience even when the AI component is unavailable.

Ajay Prajapat

AI Systems Architect

LLM APIs go down. Latency spikes. Rate limits are hit. Models are deprecated. An AI system that treats an external model API as a single point of failure will fail in production, and users will see it. The reliability patterns for AI systems borrow from distributed systems engineering (fallbacks, circuit breakers, bulkheads, and graceful degradation) and apply them to the specific failure modes of AI components.

Fallback Patterns for AI Components

Model fallback

When the primary model is unavailable or degraded, route to a secondary model from a different provider once a defined number of retries against the primary has failed. The secondary model may have different capability characteristics, so design the output schema to be compatible with both, and monitor the fallback rate as an indicator of primary model health.
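A minimal sketch of the retry-then-fallback flow described above. The function names, the `retries` default, and the two stand-in model callables are assumptions made for illustration; real provider clients would take their place.

```python
class AllModelsFailed(Exception):
    """Raised when every configured model has exhausted its retries."""


def call_with_fallback(prompt, models, retries=2):
    """Try each (name, call) pair in order and return (name, output).

    Returning the name of the model that served the request lets the
    caller count how often the fallback is used.
    """
    errors = []
    for name, call in models:
        for attempt in range(1 + retries):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real system would catch narrower errors
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
    raise AllModelsFailed("; ".join(errors))


# Hypothetical stand-ins for two providers' client calls.
def primary_model(prompt):
    raise TimeoutError("primary provider unavailable")


def secondary_model(prompt):
    return f"summary of: {prompt}"


served_by, output = call_with_fallback(
    "quarterly report",
    [("primary", primary_model), ("secondary", secondary_model)],
)
print(served_by, "->", output)
```

Both callables return the same shape (a plain string here), which is the point of keeping the output schema compatible across models.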

Cached response fallback

For repeated queries (FAQ, standard questions), serve cached responses when the model is unavailable. The user gets a response — potentially one generated in a prior successful call — rather than an error. This is appropriate when the response is not time-sensitive and the query pattern is repetitive.
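One way this could look, as a sketch: successful answers are stored under a normalised query key and replayed when the model call fails. The `CachedFallback` and `FlakyModel` classes and the in-memory dictionary are illustrative assumptions; a production cache would likely live in a shared store and expire entries.

```python
import hashlib


class CachedFallback:
    """Wraps a model call and serves the last good answer on failure."""

    def __init__(self, model_call):
        self._model_call = model_call
        self._cache = {}

    @staticmethod
    def _key(query):
        # Normalise case and whitespace so near-identical FAQ queries match.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, query):
        """Return (answer, served_from_cache)."""
        key = self._key(query)
        try:
            answer = self._model_call(query)
        except Exception:
            if key in self._cache:
                return self._cache[key], True
            raise  # nothing cached for this query: surface the failure
        self._cache[key] = answer
        return answer, False


class FlakyModel:
    """Hypothetical model client that can be switched off for the demo."""

    def __init__(self):
        self.available = True

    def __call__(self, query):
        if not self.available:
            raise ConnectionError("model unavailable")
        return f"answer to: {query}"


model = FlakyModel()
faq = CachedFallback(model)
fresh = faq.ask("What is your refund policy?")
model.available = False
stale = faq.ask("what is your  refund policy?")
print(fresh, stale)
```

The second element of the returned tuple tells the caller that the answer came from the cache, so the interface can flag it as possibly out of date.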

Graceful degradation

When AI is unavailable, degrade to a non-AI version of the feature rather than failing completely. An AI-powered document extraction system falls back to a manual review queue. An AI recommendation engine falls back to rule-based recommendations. The user experience degrades, but the workflow continues.
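The recommendation example could be sketched as follows. The catalog, the rule ("show items from the user's last category"), and all names are hypothetical; the point is only that the caller always gets a usable result, plus a flag saying whether it was degraded.

```python
from dataclasses import dataclass


@dataclass
class Recommendations:
    items: list
    degraded: bool  # True when the non-AI path produced the result


# Hypothetical catalog used by the rule-based path.
CATALOG = {"books": ["atlas", "novel"], "games": ["chess", "go"]}


def rule_based(last_category):
    """Non-AI rule: recommend items from the user's last category."""
    return list(CATALOG.get(last_category, []))


def recommend(last_category, ai_recommender):
    try:
        return Recommendations(ai_recommender(last_category), degraded=False)
    except Exception:
        # The AI path failed: keep the workflow going with the rules.
        return Recommendations(rule_based(last_category), degraded=True)


def unavailable_ai(last_category):
    raise ConnectionError("model API unavailable")


result = recommend("games", unavailable_ai)
print(result)
```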

Circuit Breaker Pattern for AI APIs

A circuit breaker monitors the success rate of calls to an external AI API. When the failure rate exceeds a threshold, the circuit "opens" and subsequent calls fail fast (without waiting for timeout) and route to the fallback. After a defined period, the circuit "half-opens" and allows a limited number of test calls. If they succeed, the circuit closes and normal operation resumes; if they fail, the circuit opens again for another hold period.

  • Open threshold: open the circuit when failure rate exceeds 50% over a 60-second window
  • Open duration: hold the circuit open for 30-120 seconds before allowing test calls
  • Half-open test: allow 10% of calls through; close the circuit when success rate returns to >95%
  • Alert on circuit open: every circuit open event should trigger an alert — it is a reliability incident
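The states and thresholds above can be sketched as a small class. This is a simplified illustration, not a production implementation: it admits a single trial call in the half-open state instead of sampling 10% of traffic, it is not thread-safe, and the `min_calls` guard, the injectable `clock`, and the `on_open` alert hook are assumptions added here.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the API."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, window_seconds=60.0,
                 open_seconds=30.0, min_calls=5, clock=time.monotonic,
                 on_open=None):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self.min_calls = min_calls  # assumption: do not trip on tiny samples
        self.clock = clock
        self.on_open = on_open      # alert hook, called on every open event
        self.state = "closed"
        self._opened_at = 0.0
        self._results = deque()     # (timestamp, succeeded) within the window

    def call(self, fn, *args):
        now = self.clock()
        if self.state == "open":
            if now - self._opened_at < self.open_seconds:
                raise CircuitOpenError("circuit open: use the fallback")
            self.state = "half_open"  # hold period over: allow a trial call
        try:
            result = fn(*args)
        except Exception:
            self._record(now, succeeded=False)
            raise
        self._record(now, succeeded=True)
        return result

    def _record(self, now, succeeded):
        if self.state == "half_open":
            # The trial call alone decides whether to close or reopen.
            if succeeded:
                self.state = "closed"
                self._results.clear()
            else:
                self._trip(now)
            return
        self._results.append((now, succeeded))
        while now - self._results[0][0] > self.window_seconds:
            self._results.popleft()
        failures = sum(1 for _, ok in self._results if not ok)
        if (len(self._results) >= self.min_calls
                and failures / len(self._results) > self.failure_threshold):
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self._opened_at = now
        self._results.clear()
        if self.on_open is not None:
            self.on_open(now)


class FakeClock:
    """Controllable clock so the demo runs without real waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def failing():
    raise TimeoutError("model API timed out")


def healthy():
    return "ok"


clock, alerts, states = FakeClock(), [], []
breaker = CircuitBreaker(clock=clock, on_open=alerts.append)
for _ in range(5):
    try:
        breaker.call(failing)
    except TimeoutError:
        pass
states.append(breaker.state)
try:
    breaker.call(healthy)
    fast_failed = False
except CircuitOpenError:
    fast_failed = True       # rejected immediately, API never called
clock.now = 31.0             # past the 30-second hold period
recovered = breaker.call(healthy)
states.append(breaker.state)
print(states, fast_failed, recovered, alerts)
```

Callers would catch `CircuitOpenError` and route to whichever fallback applies to the feature.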

Bulkhead Pattern: Isolating AI Failures

The bulkhead pattern isolates different AI use cases so that a failure or performance problem in one does not cascade to others. If the AI summarisation feature is overwhelmed with requests, this should not affect the AI classification feature. Separate thread pools, connection pools, and rate limit budgets per AI feature prevent cascade failures.
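A minimal sketch of per-feature isolation using one semaphore per AI feature. The `Bulkhead` class, the limits, and the toy `classify` function are illustrative assumptions; separate thread pools or connection pools follow the same idea. The demo nests calls so that it is deterministic without spawning threads.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a feature has no free concurrency slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args):
        # Reject immediately instead of queueing behind a saturated feature.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} is at capacity")
        try:
            return fn(*args)
        finally:
            self._slots.release()


# Each feature gets its own budget (limits are arbitrary for the demo).
summarisation = Bulkhead("summarisation", max_concurrent=1)
classification = Bulkhead("classification", max_concurrent=1)


def classify(text):
    return "invoice" if "invoice" in text else "other"


def busy_summarise(text):
    """Runs while holding the only summarisation slot."""
    outcome = {}
    try:
        summarisation.run(lambda t: t[:10], text)
        outcome["second_summary"] = "accepted"
    except BulkheadFullError:
        outcome["second_summary"] = "rejected"
    # Classification has its own slots, so it is unaffected.
    outcome["classification"] = classification.run(classify, text)
    return outcome


outcome = summarisation.run(busy_summarise, "invoice #42 for consulting")
print(outcome)
```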

Timeout Design for AI Systems

Default HTTP client timeouts (often around 30 seconds, though the value varies by client and library) are poorly matched to LLM inference latency. Design timeouts based on the p99 latency of your specific model and task: set the timeout at 1.5-2x the p99 latency. A call that exceeds this timeout should fail fast and route to the fallback. A user waiting 60 seconds for a response has a worse experience than one who is served a fallback immediately.
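As a worked sketch, the timeout can be derived from observed latencies and enforced around the call. The latency sample, the nearest-rank p99 calculation, and the helper names are assumptions for illustration. Note that waiting on a future with a timeout does not stop the underlying request, so a real client should also set its own request timeout.

```python
import concurrent.futures
import math
import time

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def timeout_from_p99(latencies, multiplier=1.5):
    """Nearest-rank p99 of the observed latencies, times the multiplier."""
    ordered = sorted(latencies)
    rank = math.ceil(len(ordered) * 99 / 100)
    return multiplier * ordered[rank - 1]


def call_with_timeout(fn, timeout_seconds, fallback):
    """Run fn in a worker thread; serve the fallback if it takes too long."""
    future = _executor.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only effective if the call has not started yet
        return fallback()


# Hypothetical latency sample in seconds: mostly fast, with a slow tail.
observed = [0.8] * 98 + [2.0, 4.0]
budget = timeout_from_p99(observed)


def slow_model():
    time.sleep(0.3)  # stands in for a slow inference call
    return "model answer"


# A tiny timeout is used here so the demo finishes quickly.
answer = call_with_timeout(slow_model, timeout_seconds=0.05,
                           fallback=lambda: "cached answer")
print(budget, answer)
```

With this sample the p99 is 2.0 seconds, so a 1.5x multiplier gives a 3.0-second budget.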
