Counting OpenAI tokens

I wrote a little library that estimates token usage for the OpenAI chat API. You can find it on GitHub and npm, but if you want to know why I built it, read on!

Why count tokens?

It’s important to use your precious LLM context window effectively. Even though context windows are getting bigger, shorter prompts cost less, leave more room for the completions, and seem to perform better. When you’ve got too much to pack into a prompt, you need to decide what to remove. This might mean omitting some less-than-crucial information, or it could mean using an LLM to summarise a chunk of the context to “compress” it.

To do this well, you need to accurately count the tokens in your prompt. Counting tokens used to be easy: you’d feed your text into a tokeniser, and get a number out. OpenAI even provide an official tokeniser called tiktoken.
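(tiktoken itself is a Python library, but there are JavaScript ports. As a quick illustration, here's what counting tokens in a plain string looks like using the js-tiktoken port:)

import { encodingForModel } from "js-tiktoken";

// Tokenise a plain string and count the result
const enc = encodingForModel("gpt-3.5-turbo");
const tokens = enc.encode("You are a helpful robot. Be kind.");
console.log(tokens.length); // number of tokens in the string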

These days, the story is more complex.

Chat messages are turned into ChatML

OpenAI encourages use of the chat API rather than the completion API. To use this API, you pass an array of message objects, each of which has a role and content. Although the API itself speaks JSON, when the messages are passed to the underlying model, they're formatted as “ChatML”. Here’s an example of what ChatML looks like:

<|im_start|>system
You are a helpful robot.
Be kind.<|im_end|>
<|im_start|>user
How are you<|im_end|>
<|im_start|>assistant
I am doing well! How are you doing?<|im_end|>

This means that running tiktoken over JSON.stringify(messages) will give the wrong result, and instead you need to consider the format used under the hood.
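To give a flavour of what this means in practice, here's a minimal sketch (again using js-tiktoken) of how you might approximate the per-message cost once the ChatML wrapper is taken into account. The overhead constant is an assumption for illustration rather than a published figure: the <|im_start|> and <|im_end|> markers encode as single special tokens, so they're treated as a fixed per-message cost instead of being run through the text tokeniser.

import { getEncoding } from "js-tiktoken";

const enc = getEncoding("cl100k_base");

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Each message becomes: <|im_start|>{role}\n{content}<|im_end|>\n
// The wrapper markers are special single tokens, so approximate them with a
// fixed per-message overhead rather than encoding them as text.
const PER_MESSAGE_OVERHEAD = 4; // assumption for illustration; calibrate against real API responses

function estimateMessageTokens(message: ChatMessage): number {
  return (
    PER_MESSAGE_OVERHEAD +
    enc.encode(message.role).length +
    enc.encode(message.content).length
  );
}

function estimateChatTokens(messages: ChatMessage[]): number {
  // The reply is primed with its own ChatML prefix, which costs a few more tokens
  const replyPrimingOverhead = 3;
  return messages.reduce((sum, m) => sum + estimateMessageTokens(m), replyPrimingOverhead);
}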

Function definitions become TypeScript declarations

The function calling API is a great way of getting more machine-friendly output from the OpenAI models. You provide an array of function definitions, specified as JSON schema objects, and the model can choose to call one of those functions by returning its name and arguments.

The function definitions you pass to the API also count towards your prompt token count. However, as with messages, getting the token count isn’t as simple as passing the JSON-encoded function definitions to tiktoken. OpenAI appears to turn the function definitions into TypeScript type declarations under the hood, which means that this:

{
  "name": "getWeather",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "The city to get the weather for"
      },
      "unit": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"]
      }
    }
  }
}

Gets turned into something like this:

namespace functions {
  type getWeather = (_: {
    // The city to get the weather for
    location?: string;
    unit?: "celsius" | "fahrenheit";
  }) => any;
} // namespace functions
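As a rough illustration, here's a simplified sketch of rendering function definitions in that TypeScript-style format, so that the rendered text (rather than the raw JSON) is what gets tokenised. It only handles the shapes in the example above (a flat object with string and enum properties); the real conversion has many more cases, and details like indentation and newlines affect the exact count.

interface FunctionDef {
  name: string;
  description?: string;
  parameters: {
    type: "object";
    properties: Record<string, { type?: string; description?: string; enum?: string[] }>;
  };
}

// Render function definitions in the TypeScript-declaration style shown above.
// Simplified: nested objects, arrays, required properties etc. are not handled.
function renderFunctions(functions: FunctionDef[]): string {
  const lines: string[] = ["namespace functions {", ""];
  for (const fn of functions) {
    if (fn.description) lines.push(`// ${fn.description}`);
    lines.push(`type ${fn.name} = (_: {`);
    for (const [name, prop] of Object.entries(fn.parameters.properties)) {
      if (prop.description) lines.push(`// ${prop.description}`);
      const type = prop.enum
        ? prop.enum.map((v) => JSON.stringify(v)).join(" | ")
        : prop.type ?? "any";
      lines.push(`${name}?: ${type};`);
    }
    lines.push("}) => any;", "");
  }
  lines.push("} // namespace functions");
  return lines.join("\n");
}

Tokenising the output of something like renderFunctions gets you much closer to what the model actually sees than tokenising the original JSON does.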

Reproducing the exact prompt is an uphill battle

Understanding the formatting of chat messages and function declarations will get you closer to a precise estimate, but unless you can reproduce the exact prompt OpenAI is using, it’s hard to predict the token count exactly. Fortunately, the API does return the number of prompt tokens used, so with lots of trial and error it’s possible to come up with a relatively good estimate by observing what changes when you change the prompt. For instance (the sketch after this list pulls these adjustments together):

  • Each completion seems to carry a fixed 3-token overhead
  • If a message includes the name property, it uses one token fewer than it would otherwise
  • Using the function calling API appears to add a fixed overhead of 9 tokens per completion, presumably in part to add a system message that includes the TypeScript function declarations. If you provide your own system messages, you can remove 4 tokens from that 9, but you need to add a trailing newline to the first of your system messages.
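Here's a sketch of what applying those adjustments might look like. It assumes the message and function token counts have already been computed (for example, with the sketches earlier in this post), and the offsets are the empirical values described above, so treat them as approximations that may drift as OpenAI changes the underlying prompt format.

function applyPromptOverheads(opts: {
  messageTokens: number;      // tokens used by the ChatML-formatted messages
  functionTokens?: number;    // tokens used by the rendered function declarations, if any
  hasSystemMessage?: boolean; // whether the prompt already includes a system message
}): number {
  let total = opts.messageTokens + 3; // fixed per-completion overhead
  if (opts.functionTokens !== undefined) {
    // Function calling appears to add a fixed 9-token overhead...
    total += opts.functionTokens + 9;
    if (opts.hasSystemMessage) {
      // ...4 of which go away if you supply your own system message (remembering
      // to add a trailing newline to the first one)
      total -= 4;
    }
  }
  return total;
}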

Ultimately, I’d love to see OpenAI publish this logic as a library, or even just document how the prompt is constructed.

In the meantime, I’ve published the results of my trial and error into a TypeScript library called openai-chat-tokens. It takes care of adjusting for ChatML, converting the JSON-schema-style functions to TypeScript declarations, and factoring in the other fixed overheads and offsets I’ve found.
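Using it looks roughly like this (see the README for the exact API):

import { promptTokensEstimate } from "openai-chat-tokens";

const estimate = promptTokensEstimate({
  messages: [
    { role: "system", content: "You are a helpful robot. Be kind." },
    { role: "user", content: "How are you" },
  ],
  functions: [
    {
      name: "getWeather",
      parameters: {
        type: "object",
        properties: {
          location: { type: "string", description: "The city to get the weather for" },
          unit: { type: "string", enum: ["celsius", "fahrenheit"] },
        },
      },
    },
  ],
});

console.log(estimate); // estimated prompt token count for this request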

If you use it and find cases where it’s wrong, add a test case! If you have an OpenAI API token, you can automatically validate the test data against the OpenAI API by setting validateAll to true in the tests. And if you know of a better way of calculating prompt token usage (other than calling the OpenAI API, which is slow and pricey), I’d love to hear it.