Alert When Context Window Usage Exceeds 90%

TL;DR

Problem: LLMs drop the earliest tokens when a prompt exceeds the context window - no warning, no error. Outputs degrade even before the limit is hit.

Symptom: Incomplete answers, hallucinations, broken agent flows, and misleading LLM metrics

Fix: Log the token count for every request and alert when usage exceeds 90% of the model's context limit

Prevent Silent LLM Failures

Problem

When a prompt exceeds the context window, the LLM silently drops the earliest tokens - no error, no log, no warning. This is known as prompt truncation. The model keeps responding, but with incomplete context. Outputs still look plausible, but logic fails, hallucinations creep in, and tool calls misfire. Worse: metrics like hallucination rate, grounding precision, and step success in agent chains get corrupted - and you won’t know why.

If your system behaves fine in dev but fails in prod - especially under long chats or deep tool use - suspect prompt truncation.

(Figure: context window overflow)

Example - Agent Fails Silently Mid-Flow

Your agent prompt includes system instructions, chat history, and recent tool outputs. One more step gets added - and it pushes the prompt over the model limit.

The final tool result gets dropped. The agent loops or stalls. Looks like a logic bug. It’s just truncation.

Example - RAG Answer Misses the Best Chunks

Your retriever pulls high-relevance documents and appends them to the prompt. Instructions and formatting push the input past the token limit.

The last - and most relevant - chunks get cut. The answer sounds grounded but is wrong. Looks like bad retrieval. It’s silent truncation.

Fix

  • Track every prompt: Log the total token count before each model call
  • Monitor usage: Calculate the percent of the context window used (e.g., 88%) and alert at 90% - or lower if prompt size varies widely
  • Catch truncation early: Log both pre- and post-truncation lengths (if the provider exposes them) and flag high-risk prompts - see the sketch below
(Figure: context window 80 percent full)
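
A minimal sketch of this check in Python, assuming tiktoken for token counting and a fixed 8,192-token context limit - both are illustrative stand-ins, so substitute your model's tokenizer and published limit. It counts tokens before the call, emits a structured record like the one in the Log Example below, and warns once usage crosses 90%.

import json
import logging

import tiktoken  # assumed tokenizer library; use whatever matches your model

CONTEXT_LIMIT = 8192      # illustrative; set to your model's context window
ALERT_THRESHOLD = 0.90    # alert when usage exceeds 90%

logger = logging.getLogger("llm.context")

def check_context_usage(request_id: str, prompt: str, model: str = "gpt-4") -> dict:
    # Count prompt tokens before the model call
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = len(enc.encode(prompt))
    usage = prompt_tokens / CONTEXT_LIMIT

    # Log a structured record for every request
    record = {
        "id": request_id,
        "prompt_tokens": prompt_tokens,
        "context_limit": CONTEXT_LIMIT,
        "context_usage_percent": round(usage * 100, 1),
    }
    logger.info(json.dumps(record))

    # Fire an alert before truncation actually happens
    if usage >= ALERT_THRESHOLD:
        logger.warning(
            "Context usage %.1f%% exceeds %.0f%% threshold for request %s",
            usage * 100, ALERT_THRESHOLD * 100, request_id,
        )
    return record

Wire the warning into whatever alerting channel you already use; the point is that the signal exists before the model silently drops context.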

Log Example

{
  "id": "rag-chain-lookup-42",
  "prompt_tokens": 7180,
  "context_limit": 8192,
  "context_usage_percent": 87.7
}
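
At 7,180 of 8,192 tokens this request sits at roughly 87.6% - under the 90% threshold, but close enough that one more retrieved chunk or tool result could push it over the limit. That is exactly the situation the alert is meant to surface before truncation happens.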

Impact

  • Catch truncation early - detect and alert before the model loses context, so answers stay intact and agent flows don’t break unexpectedly
  • Isolate prompt length as a root cause - avoid wasting hours debugging hallucinations, tool failures, or broken chains that were caused by silent truncation
  • Restore trust in LLM metrics - grounding precision, hallucination rate, and agent step success become meaningful again when prompt integrity is guaranteed