Tokenization in Code2Prompt

When working with language models, text must first be transformed into a format the model can understand: tokens, which are sequences of numbers. This transformation is handled by a tokenizer.


A tokenizer converts raw text into tokens, which are the building blocks for how language models process input. These tokens can represent words, subwords, or even individual characters, depending on the tokenizer’s design.

For code2prompt, we use the tiktoken tokenizer. It is efficient, robust, and optimized for OpenAI models. You can explore its functionality in the official repository:

👉 tiktoken GitHub Repository

If you want to learn more about tokenizers in general, check out the

👉 Mistral Tokenization Guide.

Tokenization is implemented using tiktoken-rs. tiktoken supports these encodings used by OpenAI models:

| CLI argument | Encoding name | OpenAI models |
| --- | --- | --- |
| `cl100k` | `cl100k_base` | ChatGPT models, `text-embedding-ada-002` |
| `p50k` | `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |
| `p50k_edit` | `p50k_edit` | Edit models such as `text-davinci-edit-001`, `code-davinci-edit-001` |
| `r50k` (or `gpt2`) | `r50k_base` (or `gpt2`) | GPT-3 models like `davinci` |
| `o200k` | `o200k_base` | GPT-4o models |

For more context on the different tokenizers, see the OpenAI Cookbook.