Architecture · 7 min read · 2 September 2025

AI Error Monitoring and Alerting: What to Watch and When to Wake Someone Up

AI systems degrade in ways that traditional error monitoring does not catch. The alerts that matter for AI systems are different from the alerts that matter for conventional software.

Ajay Prajapat

AI Systems Architect

A traditional web service generates clear signals when something is wrong: HTTP 500s spike, error rate rises, latency increases, database query time degrades. AI systems can be delivering progressively worse quality outputs while generating no errors at all — successful API calls returning low-quality responses that pass validation and reach users. Traditional monitoring catches infrastructure failures; AI-specific monitoring catches quality failures that infrastructure metrics miss.

The Five Monitoring Dimensions for AI Systems

1. Infrastructure health (standard)

  • API error rate by model and endpoint
  • p50, p95, p99 latency
  • Queue depth and consumer lag (for async systems)
  • Rate limit hit rate

2. Cost anomalies

  • Cost per request vs historical baseline
  • Token count per request (input + output) vs baseline
  • Total daily cost vs budget
  • Sudden cost spikes (often indicate runaway retries or malformed inputs)
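
A per-request cost check against a rolling baseline can be sketched in a few lines. The 1,000-request window, 50-sample warm-up, and 3x spike multiplier below are illustrative defaults, not prescriptions; tune them per workload.

```python
from collections import deque


class CostAnomalyDetector:
    """Flags requests whose cost deviates sharply from a rolling baseline.

    Window size, warm-up count, and the 3x multiplier are assumptions.
    """

    def __init__(self, window: int = 1000, spike_multiplier: float = 3.0):
        self.costs = deque(maxlen=window)  # rolling window of recent costs
        self.spike_multiplier = spike_multiplier

    def record(self, cost_usd: float) -> bool:
        """Record one request's cost; return True if it looks anomalous."""
        # Baseline is the mean of the window *before* this request,
        # so a spike cannot dilute its own baseline.
        baseline = sum(self.costs) / len(self.costs) if self.costs else None
        self.costs.append(cost_usd)
        # Stay quiet until there is enough history for a stable baseline
        if baseline is None or len(self.costs) < 50:
            return False
        return cost_usd > baseline * self.spike_multiplier
```

The same shape works for token counts per request; only the recorded value changes. A runaway-retry bug typically trips this check within a handful of requests, well before the daily budget total moves.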

3. Output quality metrics

  • Quality score distribution (% of outputs above/below quality threshold)
  • Low-confidence output rate (what % of outputs have confidence below the review threshold)
  • Human override rate (what % of auto-approved outputs are later corrected)
  • Evaluation set score (weekly automated evaluation against ground truth)
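
The first three quality rates can all be computed from one batch of output records. The record keys and the 0.7 quality / 0.5 confidence thresholds below are hypothetical; substitute whatever your pipeline actually logs.

```python
def quality_metrics(outputs: list[dict]) -> dict:
    """Compute output-quality rates from a batch of output records.

    Each record is assumed to carry: 'quality_score', 'confidence',
    'auto_approved', 'human_corrected'. Thresholds are illustrative.
    """
    total = len(outputs)
    below_quality = sum(1 for o in outputs if o["quality_score"] < 0.7)
    low_conf = sum(1 for o in outputs if o["confidence"] < 0.5)
    # Override rate is measured only over auto-approved outputs:
    # it asks how often the automation was wrong to approve.
    approved = [o for o in outputs if o["auto_approved"]]
    overridden = sum(1 for o in approved if o["human_corrected"])
    return {
        "below_quality_rate": below_quality / total,
        "low_confidence_rate": low_conf / total,
        "human_override_rate": overridden / len(approved) if approved else 0.0,
    }
```

Note the denominator difference: override rate divides by approved outputs, not all outputs, which keeps it meaningful when most outputs are routed to human review anyway.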

4. Data quality signals

  • Input data quality metrics (completeness, format compliance)
  • Distribution shift indicators (statistical distance of current inputs from baseline)
  • Missing required fields rate
  • Upstream data freshness (when did the source data last update?)
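
One common statistical distance for the shift indicator is the Population Stability Index (PSI). A minimal sketch, assuming a numeric input feature and equal-width bins from the baseline's range; the usual rule of thumb is PSI < 0.1 means no meaningful shift, 0.1-0.25 moderate, > 0.25 major.

```python
import math


def population_stability_index(baseline, current, bins: int = 10) -> float:
    """PSI between a baseline sample and current inputs (sketch, not a library).

    Bin edges come from the baseline's range; zero buckets are smoothed
    so the log term is always defined.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[max(0, min(idx, bins - 1))] += 1  # clamp out-of-range values
        # Additive smoothing keeps every bucket nonzero
        return [(c + 0.5) / (len(sample) + 0.5 * bins) for c in counts]

    b, c = proportions(baseline), proportions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

Run it daily per feature against a frozen baseline sample; a PSI crossing 0.25 is a strong signal that the model is now seeing inputs it was not validated on.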

5. Business outcome metrics

  • The business metric the AI system is designed to move — measured continuously
  • Downstream process health (if AI outputs feed another process, track that process's health)
  • Customer-visible error rate or satisfaction score

Alert Tiering: What Wakes Someone Up vs What Goes on the Dashboard

  • Page immediately: API error rate >10%, cost anomaly >3x baseline, quality score drop >15% in 1 hour, safety or compliance output failure
  • Notify next business day: quality score drop >5% sustained over 24 hours, cost >120% of budget for 48 hours, evaluation set score drop >3%
  • Dashboard only: gradual quality trends, cost creep within budget, low-confidence output rate trends
  • Never alert on: normal operating variance — set baselines on rolling 30-day averages, not fixed numbers
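
The tiering above can be encoded as a routing function so thresholds live in one reviewable place. The metric names and deviation conventions here are assumptions (relative change vs. baseline for rates and scores, a multiple of baseline for per-request cost); the thresholds mirror the tiers in the text.

```python
from enum import Enum


class Tier(Enum):
    PAGE = "page_immediately"
    NEXT_DAY = "notify_next_business_day"
    DASHBOARD = "dashboard_only"


def route_alert(metric: str, deviation: float, sustained_hours: float) -> Tier:
    """Map a metric deviation to an alert tier (illustrative sketch).

    deviation: relative change vs. rolling baseline (0.15 = 15%), except
    'cost_per_request', which is a multiple of baseline (3.2 = 3.2x).
    """
    # Page immediately
    if metric == "safety_failure":
        return Tier.PAGE
    if metric == "api_error_rate" and deviation > 0.10:
        return Tier.PAGE
    if metric == "cost_per_request" and deviation > 3.0:
        return Tier.PAGE
    if metric == "quality_score" and deviation < -0.15 and sustained_hours <= 1:
        return Tier.PAGE
    # Notify next business day
    if metric == "quality_score" and deviation < -0.05 and sustained_hours >= 24:
        return Tier.NEXT_DAY
    if metric == "cost_vs_budget" and deviation > 0.20 and sustained_hours >= 48:
        return Tier.NEXT_DAY
    if metric == "eval_set_score" and deviation < -0.03:
        return Tier.NEXT_DAY
    # Everything else is a trend, not an incident
    return Tier.DASHBOARD
```

Anything that falls through lands on the dashboard by default, which is the safe failure mode: an unrecognised metric never silently pages anyone.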

Baseline Management

Alert thresholds set against fixed numbers become stale as the system evolves and usage patterns change. Use rolling baselines: alert when a metric deviates significantly from its rolling 7-day or 30-day average. This adapts to normal growth in volume, expected seasonal patterns, and gradual system improvements without requiring manual threshold updates.
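
A rolling baseline with a deviation band can be as simple as mean-plus-k-standard-deviations over the window. The hourly sampling, 7-day window, and k=3 below are assumptions to tune per metric.

```python
import statistics
from collections import deque


class RollingBaseline:
    """Alert when a value deviates from its rolling mean by > k std devs.

    7*24 hourly samples approximates a rolling 7-day baseline; swap the
    window for 30*24 to get the 30-day variant. k=3 is a common default.
    """

    def __init__(self, window: int = 7 * 24, k: float = 3.0):
        self.samples = deque(maxlen=window)
        self.k = k

    def observe(self, value: float) -> bool:
        """Add one sample; return True if it breaches the deviation band."""
        breach = False
        if len(self.samples) >= 2:
            mean = statistics.fmean(self.samples)
            stdev = statistics.stdev(self.samples)
            # A flat series (stdev == 0) never alerts on equal values
            breach = stdev > 0 and abs(value - mean) > self.k * stdev
        self.samples.append(value)  # the new value joins the baseline after the check
        return breach
```

Because the window slides, gradual volume growth raises the baseline with it, while a genuine step change still lands far outside the band.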

