Architecture · 7 min read · 2 September 2025

AI Error Monitoring and Alerting: What to Watch and When to Wake Someone Up

AI systems degrade in ways that traditional error monitoring does not catch. The alerts that matter for AI systems are different from the alerts that matter for conventional software.

Ajay Prajapat

AI Systems Architect

A traditional web service generates clear signals when something is wrong: HTTP 500s spike, error rate rises, latency increases, database query time degrades. AI systems can be delivering progressively worse quality outputs while generating no errors at all — successful API calls returning low-quality responses that pass validation and reach users. Traditional monitoring catches infrastructure failures; AI-specific monitoring catches quality failures that infrastructure metrics miss.

The Five Monitoring Dimensions for AI Systems

1. Infrastructure health (standard)

  • API error rate by model and endpoint
  • p50, p95, p99 latency
  • Queue depth and consumer lag (for async systems)
  • Rate limit hit rate

2. Cost anomalies

  • Cost per request vs historical baseline
  • Token count per request (input + output) vs baseline
  • Total daily cost vs budget
  • Sudden cost spikes (often indicate runaway retries or malformed inputs)
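
A per-request cost check against a rolling baseline can be sketched in a few lines. The 1,000-request window, 50-sample warm-up, and 3x spike multiplier below are illustrative defaults, not prescriptions; tune them per workload.

```python
from collections import deque


class CostAnomalyDetector:
    """Flags requests whose cost deviates sharply from a rolling baseline.

    Window size, warm-up count, and the 3x multiplier are assumptions.
    """

    def __init__(self, window: int = 1000, spike_multiplier: float = 3.0):
        self.costs = deque(maxlen=window)  # rolling window of recent costs
        self.spike_multiplier = spike_multiplier

    def record(self, cost_usd: float) -> bool:
        """Record one request's cost; return True if it looks anomalous."""
        # Baseline is the mean of the window *before* this request,
        # so a spike cannot dilute its own baseline.
        baseline = sum(self.costs) / len(self.costs) if self.costs else None
        self.costs.append(cost_usd)
        # Stay quiet until there is enough history for a stable baseline
        if baseline is None or len(self.costs) < 50:
            return False
        return cost_usd > baseline * self.spike_multiplier
```

The same shape works for token counts per request; only the recorded value changes. A runaway-retry bug typically trips this check within a handful of requests, well before the daily budget total moves.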

3. Output quality metrics

  • Quality score distribution (% of outputs above/below quality threshold)
  • Low-confidence output rate (what % of outputs have confidence below the review threshold)
  • Human override rate (what % of auto-approved outputs are later corrected)
  • Evaluation set score (weekly automated evaluation against ground truth)
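
The first three quality rates can all be computed from one batch of output records. The record keys and the 0.7 quality / 0.5 confidence thresholds below are hypothetical; substitute whatever your pipeline actually logs.

```python
def quality_metrics(outputs: list[dict]) -> dict:
    """Compute output-quality rates from a batch of output records.

    Each record is assumed to carry: 'quality_score', 'confidence',
    'auto_approved', 'human_corrected'. Thresholds are illustrative.
    """
    total = len(outputs)
    below_quality = sum(1 for o in outputs if o["quality_score"] < 0.7)
    low_conf = sum(1 for o in outputs if o["confidence"] < 0.5)
    # Override rate is measured only over auto-approved outputs:
    # it asks how often the automation was wrong to approve.
    approved = [o for o in outputs if o["auto_approved"]]
    overridden = sum(1 for o in approved if o["human_corrected"])
    return {
        "below_quality_rate": below_quality / total,
        "low_confidence_rate": low_conf / total,
        "human_override_rate": overridden / len(approved) if approved else 0.0,
    }
```

Note the denominator difference: override rate divides by approved outputs, not all outputs, which keeps it meaningful when most outputs are routed to human review anyway.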

4. Data quality signals

  • Input data quality metrics (completeness, format compliance)
  • Distribution shift indicators (statistical distance of current inputs from baseline)
  • Missing required fields rate
  • Upstream data freshness (when did the source data last update?)
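
One common statistical distance for the shift indicator is the Population Stability Index (PSI). A minimal sketch, assuming a numeric input feature and equal-width bins from the baseline's range; the usual rule of thumb is PSI < 0.1 means no meaningful shift, 0.1-0.25 moderate, > 0.25 major.

```python
import math


def population_stability_index(baseline, current, bins: int = 10) -> float:
    """PSI between a baseline sample and current inputs (sketch, not a library).

    Bin edges come from the baseline's range; zero buckets are smoothed
    so the log term is always defined.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[max(0, min(idx, bins - 1))] += 1  # clamp out-of-range values
        # Additive smoothing keeps every bucket nonzero
        return [(c + 0.5) / (len(sample) + 0.5 * bins) for c in counts]

    b, c = proportions(baseline), proportions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

Run it daily per feature against a frozen baseline sample; a PSI crossing 0.25 is a strong signal that the model is now seeing inputs it was not validated on.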

5. Business outcome metrics

  • The business metric the AI system is designed to move — measured continuously
  • Downstream process health (if AI outputs feed another process, track that process's health)
  • Customer-visible error rate or satisfaction score

Alert Tiering: What Wakes Someone Up vs What Goes on the Dashboard

  • Page immediately: API error rate >10%, cost anomaly >3x baseline, quality score drop >15% in 1 hour, safety or compliance output failure
  • Notify next business day: quality score drop >5% sustained over 24 hours, cost >120% of budget for 48 hours, evaluation set score drop >3%
  • Dashboard only: gradual quality trends, cost creep within budget, low-confidence output rate trends
  • Never alert on: normal operating variance — set baselines on rolling 30-day averages, not fixed numbers
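
The tiering above can be encoded as a routing function so thresholds live in one reviewable place. The metric names and deviation conventions here are assumptions (relative change vs. baseline for rates and scores, a multiple of baseline for per-request cost); the thresholds mirror the tiers in the text.

```python
from enum import Enum


class Tier(Enum):
    PAGE = "page_immediately"
    NEXT_DAY = "notify_next_business_day"
    DASHBOARD = "dashboard_only"


def route_alert(metric: str, deviation: float, sustained_hours: float) -> Tier:
    """Map a metric deviation to an alert tier (illustrative sketch).

    deviation: relative change vs. rolling baseline (0.15 = 15%), except
    'cost_per_request', which is a multiple of baseline (3.2 = 3.2x).
    """
    # Page immediately
    if metric == "safety_failure":
        return Tier.PAGE
    if metric == "api_error_rate" and deviation > 0.10:
        return Tier.PAGE
    if metric == "cost_per_request" and deviation > 3.0:
        return Tier.PAGE
    if metric == "quality_score" and deviation < -0.15 and sustained_hours <= 1:
        return Tier.PAGE
    # Notify next business day
    if metric == "quality_score" and deviation < -0.05 and sustained_hours >= 24:
        return Tier.NEXT_DAY
    if metric == "cost_vs_budget" and deviation > 0.20 and sustained_hours >= 48:
        return Tier.NEXT_DAY
    if metric == "eval_set_score" and deviation < -0.03:
        return Tier.NEXT_DAY
    # Everything else is a trend, not an incident
    return Tier.DASHBOARD
```

Anything that falls through lands on the dashboard by default, which is the safe failure mode: an unrecognised metric never silently pages anyone.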

Baseline Management

Alert thresholds set against fixed numbers become stale as the system evolves and usage patterns change. Use rolling baselines: alert when a metric deviates significantly from its rolling 7-day or 30-day average. This adapts to normal growth in volume, expected seasonal patterns, and gradual system improvements without requiring manual threshold updates.
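
A rolling baseline with a deviation band can be as simple as mean-plus-k-standard-deviations over the window. The hourly sampling, 7-day window, and k=3 below are assumptions to tune per metric.

```python
import statistics
from collections import deque


class RollingBaseline:
    """Alert when a value deviates from its rolling mean by > k std devs.

    7*24 hourly samples approximates a rolling 7-day baseline; swap the
    window for 30*24 to get the 30-day variant. k=3 is a common default.
    """

    def __init__(self, window: int = 7 * 24, k: float = 3.0):
        self.samples = deque(maxlen=window)
        self.k = k

    def observe(self, value: float) -> bool:
        """Add one sample; return True if it breaches the deviation band."""
        breach = False
        if len(self.samples) >= 2:
            mean = statistics.fmean(self.samples)
            stdev = statistics.stdev(self.samples)
            # A flat series (stdev == 0) never alerts on equal values
            breach = stdev > 0 and abs(value - mean) > self.k * stdev
        self.samples.append(value)  # the new value joins the baseline after the check
        return breach
```

Because the window slides, gradual volume growth raises the baseline with it, while a genuine step change still lands far outside the band.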

