Alert When Context Window Usage Exceeds 90%
Problem: When a prompt exceeds the context window, tokens get silently dropped - no warning, no error. Outputs degrade even before the limit is hit.
Symptom: Incomplete answers, hallucinations, broken agent flows, and misleading LLM metrics.
Fix: Log token count per request and alert when usage exceeds 90% of the model limit.
Prevent Silent LLM Failures
Problem
When a prompt exceeds the context window, part of it gets silently dropped - no error, no log, no warning. Which part goes depends on the serving stack: some setups cut the earliest tokens, others the newest. This is known as prompt truncation. The model keeps responding, but with incomplete context. Outputs still look plausible, but logic fails, hallucinations creep in, and tool calls misfire. Worse: metrics like hallucination rate, grounding precision, and step success in agent chains get corrupted - and you won’t know why.
If your system behaves fine in dev but fails in prod - especially under long chats or deep tool use - suspect prompt truncation.
Example - Agent Fails Silently Mid-Flow
Your agent prompt includes system instructions, chat history, and recent tool outputs. One more step gets added - and it pushes the prompt over the model limit.
The final tool result gets dropped. The agent loops or stalls. Looks like a logic bug. It’s just truncation.
Example - RAG Answer Misses the Best Chunks
Your retriever pulls high-relevance documents and appends them to the prompt. Instructions and formatting push the input past the token limit.
The last - and most relevant - chunks get cut. The answer sounds grounded but is wrong. Looks like bad retrieval. It’s silent truncation.
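To see where the budget went before blaming the retriever, count tokens per prompt component. A minimal sketch, assuming tiktoken with the cl100k_base encoding as a stand-in for your model’s actual tokenizer; the component names and the 8,192-token limit are illustrative:

```python
import tiktoken

# Assumption: cl100k_base approximates the target model's tokenizer.
enc = tiktoken.get_encoding("cl100k_base")

CONTEXT_LIMIT = 8192  # hypothetical limit for an 8k-context model

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Illustrative prompt parts; in a real pipeline these come from your chain.
components = {
    "system_instructions": "Answer strictly from the provided context.",
    "chat_history": "user: ...\nassistant: ...",
    "retrieved_chunks": "\n\n".join(["chunk one ...", "chunk two ..."]),
}

per_part = {name: count_tokens(text) for name, text in components.items()}
total = sum(per_part.values())
print(per_part)
print(f"total: {total} / {CONTEXT_LIMIT} ({100 * total / CONTEXT_LIMIT:.1f}%)")
```

A per-component breakdown shows which part blew the budget - usually history or chunks, not instructions.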
Fix
- Track every prompt: Log total token count before each model call
- Monitor usage: Calculate percent used (e.g., 88%) and alert at 90% - or at a lower threshold if prompt shape varies - as in the sketch below
- Catch truncation early: Log both pre- and post-truncation lengths (if supported) and flag high-risk prompts
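A minimal guard implementing the steps above. The logger name, threshold constant, and helper are assumptions to adapt, not a specific library’s API:

```python
import json
import logging

logger = logging.getLogger("llm.context")  # hypothetical logger name

ALERT_THRESHOLD = 0.90  # alert at 90% usage; lower it if prompt shape varies

def check_context_usage(request_id: str, prompt_tokens: int,
                        context_limit: int) -> dict:
    """Compute context usage before the model call and log an alertable record."""
    usage = prompt_tokens / context_limit
    record = {
        "id": request_id,
        "prompt_tokens": prompt_tokens,
        "context_limit": context_limit,
        "context_usage_percent": round(100 * usage, 1),
    }
    # WARNING-level entries are what the alerting rule should match on.
    level = logging.WARNING if usage >= ALERT_THRESHOLD else logging.INFO
    logger.log(level, json.dumps(record))
    return record
```

Call it with the token count from your tokenizer and the limit of the model you are about to hit; the record it emits matches the log example below.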
Log Example
```json
{
  "id": "rag-chain-lookup-42",
  "prompt_tokens": 7180,
  "context_limit": 8192,
  "context_usage_percent": 87.6
}
```
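Feeding the same numbers into the guard sketched above reproduces this record:

```python
record = check_context_usage("rag-chain-lookup-42",
                             prompt_tokens=7180, context_limit=8192)
# 7180 / 8192 = 87.6% -> logged at INFO, below the alert line.
# At 7373 prompt tokens (90.0%), the same call emits a WARNING instead.
```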
Impact
- Catch truncation early - detect and alert before the model loses context, so answers stay intact and agent flows don’t break unexpectedly
- Isolate prompt length as a root cause - avoid wasting hours debugging hallucinations, tool failures, or broken chains that were caused by silent truncation
- Restore trust in LLM metrics - grounding precision, hallucination rate, and agent step success become meaningful again once prompt integrity is verified on every request