Counting tokens before you send a prompt to an LLM API is a 1-second habit that prevents three problems: context window errors, surprise costs, and slow responses. Here is exactly how to do it for any model in 2026.
1. Context window errors. Every LLM has a hard token limit per request. GPT-4o is 128K. Claude is 200K. Gemini 2.5 Pro is 2M. Send more than the limit and the API rejects your call with an error. Counting first catches this before you waste a network round-trip.
2. Cost surprise prevention. A 50K token input on GPT-4o costs $0.125. On Claude Opus 4 it costs $0.75. If you don't know the input size before sending, you don't know the cost. Counting first removes the surprise.
3. Model routing. If your input is small (under 4K tokens), a cheap model is fine. If your input is huge (50K+), you need a model with a large context window — and you may want to use a cheaper model to control cost. Counting first lets you route intelligently.
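That routing decision can be sketched in a few lines. The thresholds and model names below are illustrative assumptions, not recommendations — tune them for your own workload and pricing:

```javascript
// Illustrative routing thresholds (assumed values, not official guidance).
// Small inputs go to a cheap model; huge inputs need a large context window.
function routeModel(inputTokens) {
  if (inputTokens < 4_000) return "gpt-4o-mini";       // cheap, small context is fine
  if (inputTokens < 120_000) return "gpt-4o";          // fits in 128K with room for output
  if (inputTokens < 190_000) return "claude-sonnet-4"; // 200K window
  return "gemini-2.5-pro";                             // 1M+ window for everything else
}
```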
For one-off checks, paste your prompt into the Token Counter — it shows the count in seconds. If it fits, send. If it doesn't, truncate or switch models.

Open Token Counter →

For production systems, count programmatically before each API call:
```javascript
// Pseudocode for any LLM
function safeCallLLM(systemPrompt, userPrompt, model) {
  const totalTokens = countTokens(systemPrompt + userPrompt);
  const contextLimit = MODELS[model].contextWindow;
  const reservedForOutput = 4000;

  if (totalTokens > contextLimit - reservedForOutput) {
    // Option 1: truncate the user prompt to fit what's left after the system prompt
    userPrompt = truncateToFit(
      userPrompt,
      contextLimit - reservedForOutput - countTokens(systemPrompt)
    );
    // Option 2: switch to a model with a larger context window
    // model = pickLargerModel(totalTokens);
    // Option 3: chunk the input and process each piece separately
  }

  return callAPI(systemPrompt, userPrompt, model);
}
```
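`countTokens` above is a placeholder for a real tokenizer. When no tokenizer is available, a common rough heuristic is about 4 characters per token for English prose — an assumption that can be off by 20% or more for code or non-English text, so treat it as a sketch, not a substitute:

```javascript
// Rough token estimate: ~4 characters per token for English prose (heuristic).
// Use a real tokenizer (tiktoken, the provider's count API) when accuracy matters.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}
```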
Common token counting libraries:

- OpenAI: tiktoken (Python) or @dqbd/tiktoken (JS)
- Anthropic: anthropic.messages.count_tokens()
- Google: genai.count_tokens()
- Hugging Face: the tokenizers library

| Model | Context window | Reserve for output | Safe input limit |
|---|---|---|---|
| GPT-4o, GPT-4.1 | 128K | 4K | 124K |
| GPT-4o mini | 128K | 4K | 124K |
| Claude Haiku 3.5 | 200K | 4K | 196K |
| Claude Sonnet 4 | 200K | 8K | 192K |
| Claude Opus 4 | 200K | 8K | 192K |
| Gemini 2.0 Flash | 1M | 8K | 992K |
| Gemini 2.5 Flash | 1M | 8K | 992K |
| Gemini 2.5 Pro | 2M | 8K | 1.99M |
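The table maps directly onto a lookup object like the `MODELS` table the pseudocode above assumes. A sketch with values copied from the table (keys are illustrative names, and only a subset of rows is shown):

```javascript
// Context windows and output reserves from the table above (subset).
const MODELS = {
  "gpt-4o":           { contextWindow: 128_000,   reservedForOutput: 4_000 },
  "gpt-4o-mini":      { contextWindow: 128_000,   reservedForOutput: 4_000 },
  "claude-haiku-3.5": { contextWindow: 200_000,   reservedForOutput: 4_000 },
  "claude-sonnet-4":  { contextWindow: 200_000,   reservedForOutput: 8_000 },
  "claude-opus-4":    { contextWindow: 200_000,   reservedForOutput: 8_000 },
  "gemini-2.0-flash": { contextWindow: 1_000_000, reservedForOutput: 8_000 },
  "gemini-2.5-pro":   { contextWindow: 2_000_000, reservedForOutput: 8_000 },
};

// Safe input limit = context window minus the output reserve.
function safeInputLimit(model) {
  const { contextWindow, reservedForOutput } = MODELS[model];
  return contextWindow - reservedForOutput;
}
```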
Strategy 1 — Truncate. Drop the oldest content first (chat history, less relevant retrieved context). Keep the most recent and most relevant.
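Oldest-first truncation can be sketched like this, assuming chat history is an array of messages and `countFn` is any token counter (real tokenizer or heuristic — the helper names here are hypothetical):

```javascript
// Drop the oldest messages until the history fits the token budget.
// countFn maps a message to its token count.
function truncateHistory(messages, budget, countFn) {
  const kept = [...messages];
  while (kept.length > 1 && kept.reduce((n, m) => n + countFn(m), 0) > budget) {
    kept.shift(); // drop oldest first, keep the most recent
  }
  return kept;
}
```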
Strategy 2 — Summarize. Replace long history or context with a shorter summary. Cost: one extra API call to generate the summary, then use the summary instead of the raw content.
Strategy 3 — Switch models. If your input is 150K tokens, GPT-4o (128K) won't fit. Switch to Claude (200K) or Gemini (1M+).
Strategy 4 — Chunk and process separately. Split the input into manageable pieces, process each, then combine results. Most common for very long documents.
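Chunking can be sketched as a fixed-size split with a small overlap so content isn't lost at chunk boundaries. This version is character-based for simplicity; a token-based splitter would slice on tokenizer output instead (sizes are illustrative):

```javascript
// Split text into overlapping character chunks.
// chunkSize and overlap are in characters; pick sizes that map to your token budget.
function chunkText(text, chunkSize, overlap) {
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```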
Strategy 5 — Use embeddings + retrieval. Instead of sending all context, embed it once, then retrieve the most relevant chunks per query. Reduces input from 200K tokens to 5K tokens for most queries.
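The retrieval step reduces to a nearest-neighbor lookup over precomputed embedding vectors. A minimal sketch, assuming the chunks were embedded elsewhere (by any embedding model) and stored alongside their text:

```javascript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the text of the k chunks most similar to the query embedding.
function topKChunks(queryEmbedding, chunks, k) {
  return [...chunks]
    .sort((x, y) => cosine(queryEmbedding, y.embedding) - cosine(queryEmbedding, x.embedding))
    .slice(0, k)
    .map((c) => c.text);
}
```

In production you'd use a vector database for this, but the principle is the same: only the top-k chunks go into the prompt, not the whole corpus.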
Think of every API call as a budget. The budget is your context window. You spend tokens on:

- The system prompt
- The user message
- Conversation history
- Retrieved context (documents, RAG chunks)
- The output you reserve for the model's response

Add them up before each call. Make sure they fit. If they don't, drop the lowest-priority item or switch models.
Before you send any prompt:

1. Count the input tokens (with a tokenizer library or a quick paste into a counter).
2. Look up the model's context window and subtract your output reserve.
3. If the input fits, send. If not, truncate, summarize, chunk, or switch models.

Or just paste it all into the Token Counter and skip the math.
Count tokens for any prompt before sending. Free, instant.
Open Token Counter →