Designing for High and Variable Latency
LLM inference typically takes anywhere from 500 ms to 5 s for complex tasks, with high variance between requests. This reshapes the design of any API that exposes the capability.
- Streaming by default for user-facing endpoints: stream partial responses as they are generated rather than waiting for the full response. Perceived latency drops dramatically because the user sees the first tokens long before generation finishes
- Async for background tasks: for AI tasks that do not require immediate response (document processing, batch analysis), use an async pattern — POST to start, GET or webhook for result
- Explicit timeouts at every layer: set timeout values at every hop (client, proxy, gateway) that reflect realistic LLM latency, rather than relying on common 30-second defaults that silently kill long-running requests
- Distinguish client timeout from server timeout: a client that times out after 30s has not necessarily seen the AI task fail — design status APIs so clients can check whether processing completed and retrieve the result
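The streaming point above can be sketched with a plain generator standing in for an LLM streaming API (the token list and delays here are illustrative, not from any real model):

```python
import time

def generate_tokens(prompt):
    # Stand-in for an LLM streaming API: yields tokens as they are "generated".
    for token in ["Streaming", " cuts", " perceived", " latency."]:
        time.sleep(0.01)  # simulated per-token generation delay
        yield token

def handle_streaming(prompt):
    # Forward each chunk as soon as it arrives instead of buffering
    # the full response; measure time to first token vs. total time.
    start = time.monotonic()
    first_chunk_at = None
    chunks = []
    for token in generate_tokens(prompt):
        if first_chunk_at is None:
            first_chunk_at = time.monotonic() - start  # time to first token
        chunks.append(token)  # in a real server: write to the open connection
    total = time.monotonic() - start
    return "".join(chunks), first_chunk_at, total

text, ttft, total = handle_streaming("hello")
print(text)
print(ttft < total)  # the first token arrives well before the full response
```

The gap between `ttft` and `total` is exactly what streaming buys: the user starts reading while the rest is still being generated.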
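The async POST-then-GET pattern can be sketched with an in-memory job store and a background thread (endpoint names and the `jobs` dict are illustrative; a real service would use a durable store and a task queue):

```python
import threading
import time
import uuid

jobs = {}  # job_id -> {"status": ..., "result": ...}; illustrative in-memory store

def start_job(payload):
    # POST /jobs — return immediately with a job id; work runs in the background.
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "pending", "result": None}

    def run():
        time.sleep(0.05)  # stand-in for a slow LLM batch task
        jobs[job_id] = {"status": "done", "result": f"processed:{payload}"}

    threading.Thread(target=run, daemon=True).start()
    return job_id

def get_job(job_id):
    # GET /jobs/{id} — client polls (or the server pushes a webhook on completion).
    return jobs[job_id]

job_id = start_job("report.pdf")
print(get_job(job_id)["status"])  # almost certainly still "pending" here
while get_job(job_id)["status"] != "done":
    time.sleep(0.01)
print(get_job(job_id)["result"])
```

The client never holds a connection open for the full task duration; it only pays for short status checks.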
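The client-timeout-vs-server-timeout distinction can be demonstrated with `concurrent.futures`: the client's wait gives up, but the server-side work keeps running and completes anyway (the delays are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

executor = ThreadPoolExecutor()

def slow_llm_task():
    time.sleep(0.1)  # stand-in for an LLM call that outlives the client's patience
    return "completed"

future = executor.submit(slow_llm_task)
client_saw_timeout = False
try:
    # Client-side timeout shorter than the task: only the *wait* gives up.
    future.result(timeout=0.02)
except FutureTimeout:
    client_saw_timeout = True

# The server-side work was never cancelled and eventually finishes.
time.sleep(0.15)
print(client_saw_timeout)  # True: the client gave up
print(future.done())       # True: the work still completed
print(future.result())     # "completed" — a status API lets the client recover this
```

This is why a status endpoint matters: without it, a client that timed out has no way to learn that the work actually succeeded, and will often retry a task that already ran.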