Counting tokens before you send a prompt to an LLM API is a 1-second habit that prevents three problems: context window errors, surprise costs, and slow responses. Here is exactly how to do it for any model in 2026.
1. Context window errors. Every LLM has a hard token limit per request. GPT-4o is 128K. Claude is 200K. Gemini 2.5 Pro is 2M. Send more than the limit and the API rejects your call with an error. Counting first catches this before you waste a network round-trip.
2. Cost surprise prevention. A 50K token input on GPT-4o costs $0.125. On Claude Opus 4 it costs $0.75. If you don't know the input size before sending, you don't know the cost. Counting first removes the surprise.
3. Model routing. If your input is small (under 4K tokens), a cheap model is fine. If your input is huge (50K+), you need a model with a large context window — and you may want to use a cheaper model to control cost. Counting first lets you route intelligently.
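That routing decision can be sketched in a few lines. The thresholds and model names below are illustrative assumptions, not recommendations — tune them for your own workload and pricing:

```javascript
// Illustrative routing thresholds (assumed values, not official guidance).
// Small inputs go to a cheap model; huge inputs need a large context window.
function routeModel(inputTokens) {
  if (inputTokens < 4_000) return "gpt-4o-mini";       // cheap, small context is fine
  if (inputTokens < 120_000) return "gpt-4o";          // fits in 128K with room for output
  if (inputTokens < 190_000) return "claude-sonnet-4"; // 200K window
  return "gemini-2.5-pro";                             // 1M+ window for everything else
}
```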
For one-off checks, paste your prompt into the Token Counter — it shows the count in seconds. If it fits, send. If it doesn't, truncate or switch models.

Open Token Counter →

For production systems, count programmatically before each API call:
```javascript
// Pseudocode for any LLM
function safeCallLLM(systemPrompt, userPrompt, model) {
  const totalTokens = countTokens(systemPrompt + userPrompt);
  const contextLimit = MODELS[model].contextWindow;
  const reservedForOutput = 4000;

  if (totalTokens > contextLimit - reservedForOutput) {
    // Option 1: truncate the user prompt to fit what's left after the system prompt
    userPrompt = truncateToFit(
      userPrompt,
      contextLimit - reservedForOutput - countTokens(systemPrompt)
    );
    // Option 2: switch to a model with a larger context window
    // model = pickLargerModel(totalTokens);
    // Option 3: chunk the input and process each piece separately
  }

  return callAPI(systemPrompt, userPrompt, model);
}
```
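`countTokens` above is a placeholder for a real tokenizer. When no tokenizer is available, a common rough heuristic is about 4 characters per token for English prose — an assumption that can be off by 20% or more for code or non-English text, so treat it as a sketch, not a substitute:

```javascript
// Rough token estimate: ~4 characters per token for English prose (heuristic).
// Use a real tokenizer (tiktoken, the provider's count API) when accuracy matters.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}
```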
Common token counting libraries:

- OpenAI: tiktoken (Python) or @dqbd/tiktoken (JS)
- Anthropic: anthropic.messages.count_tokens()
- Google: genai.count_tokens()
- Hugging Face: the tokenizers library

| Model | Context window | Reserve for output | Safe input limit |
|---|---|---|---|
| GPT-4o, GPT-4.1 | 128K | 4K | 124K |
| GPT-4o mini | 128K | 4K | 124K |
| Claude Haiku 3.5 | 200K | 4K | 196K |
| Claude Sonnet 4 | 200K | 8K | 192K |
| Claude Opus 4 | 200K | 8K | 192K |
| Gemini 2.0 Flash | 1M | 8K | 992K |
| Gemini 2.5 Flash | 1M | 8K | 992K |
| Gemini 2.5 Pro | 2M | 8K | 1.99M |
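The table maps directly onto a lookup object like the `MODELS` table the pseudocode above assumes. A sketch with values copied from the table (keys are illustrative names, and only a subset of rows is shown):

```javascript
// Context windows and output reserves from the table above (subset).
const MODELS = {
  "gpt-4o":           { contextWindow: 128_000,   reservedForOutput: 4_000 },
  "gpt-4o-mini":      { contextWindow: 128_000,   reservedForOutput: 4_000 },
  "claude-haiku-3.5": { contextWindow: 200_000,   reservedForOutput: 4_000 },
  "claude-sonnet-4":  { contextWindow: 200_000,   reservedForOutput: 8_000 },
  "claude-opus-4":    { contextWindow: 200_000,   reservedForOutput: 8_000 },
  "gemini-2.0-flash": { contextWindow: 1_000_000, reservedForOutput: 8_000 },
  "gemini-2.5-pro":   { contextWindow: 2_000_000, reservedForOutput: 8_000 },
};

// Safe input limit = context window minus the output reserve.
function safeInputLimit(model) {
  const { contextWindow, reservedForOutput } = MODELS[model];
  return contextWindow - reservedForOutput;
}
```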
Strategy 1 — Truncate. Drop the oldest content first (chat history, less relevant retrieved context). Keep the most recent and most relevant.
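Oldest-first truncation can be sketched like this, assuming chat history is an array of messages and `countFn` is any token counter (real tokenizer or heuristic — the helper names here are hypothetical):

```javascript
// Drop the oldest messages until the history fits the token budget.
// countFn maps a message to its token count.
function truncateHistory(messages, budget, countFn) {
  const kept = [...messages];
  while (kept.length > 1 && kept.reduce((n, m) => n + countFn(m), 0) > budget) {
    kept.shift(); // drop oldest first, keep the most recent
  }
  return kept;
}
```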
Strategy 2 — Summarize. Replace long history or context with a shorter summary. Cost: one extra API call to generate the summary, then use the summary instead of the raw content.
Strategy 3 — Switch models. If your input is 150K tokens, GPT-4o (128K) won't fit. Switch to Claude (200K) or Gemini (1M+).
Strategy 4 — Chunk and process separately. Split the input into manageable pieces, process each, then combine results. Most common for very long documents.
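Chunking can be sketched as a fixed-size split with a small overlap so content isn't lost at chunk boundaries. This version is character-based for simplicity; a token-based splitter would slice on tokenizer output instead (sizes are illustrative):

```javascript
// Split text into overlapping character chunks.
// chunkSize and overlap are in characters; pick sizes that map to your token budget.
function chunkText(text, chunkSize, overlap) {
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```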
Strategy 5 — Use embeddings + retrieval. Instead of sending all context, embed it once, then retrieve the most relevant chunks per query. Reduces input from 200K tokens to 5K tokens for most queries.
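The retrieval step reduces to a nearest-neighbor lookup over precomputed embedding vectors. A minimal sketch, assuming the chunks were embedded elsewhere (by any embedding model) and stored alongside their text:

```javascript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the text of the k chunks most similar to the query embedding.
function topKChunks(queryEmbedding, chunks, k) {
  return [...chunks]
    .sort((x, y) => cosine(queryEmbedding, y.embedding) - cosine(queryEmbedding, x.embedding))
    .slice(0, k)
    .map((c) => c.text);
}
```

In production you'd use a vector database for this, but the principle is the same: only the top-k chunks go into the prompt, not the whole corpus.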
Think of every API call as a budget. The budget is your context window. You spend tokens on:

- The system prompt
- The user message
- Conversation history
- Retrieved context (documents, RAG chunks)
- The output you reserve for the model's response

Add them up before each call. Make sure they fit. If they don't, drop the lowest-priority item or switch models.
Before you send any prompt:

1. Count the input tokens (with a tokenizer library or a quick paste into a counter).
2. Look up the model's context window and subtract your output reserve.
3. If the input fits, send. If not, truncate, summarize, chunk, or switch models.

Or just paste it all into the Token Counter and skip the math.
Count tokens for any prompt before sending. Free, instant.
Open Token Counter →