
Tokenization in Code2Prompt

When working with language models, text must first be transformed into a format the model can understand: tokens, which are represented as sequences of numbers. This transformation is handled by a tokenizer.


A tokenizer converts raw text into tokens, which are the building blocks for how language models process input. These tokens can represent words, subwords, or even individual characters, depending on the tokenizer’s design.
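The three granularities mentioned above can be illustrated with a toy sketch. This is not how tiktoken works internally (tiktoken uses byte pair encoding with a pretrained vocabulary); every function name and the tiny subword list below are assumptions made up for illustration. The sketch only shows the general idea: split text into pieces, then map each piece to an integer ID.

```python
import re


def build_vocab(pieces):
    """Assign each distinct piece an integer ID in order of first appearance."""
    vocab = {}
    for piece in pieces:
        if piece not in vocab:
            vocab[piece] = len(vocab)
    return vocab


def encode(pieces, vocab):
    """Map each piece to its integer ID."""
    return [vocab[piece] for piece in pieces]


def decode(ids, vocab):
    """Invert the vocabulary and join the pieces back into text."""
    id_to_piece = {i: piece for piece, i in vocab.items()}
    return "".join(id_to_piece[i] for i in ids)


def subword_split(text, subwords):
    """Greedy longest-match split; falls back to single characters."""
    ordered = sorted(subwords, key=len, reverse=True)
    pieces = []
    i = 0
    while i < len(text):
        match = next((s for s in ordered if text.startswith(s, i)), text[i])
        pieces.append(match)
        i += len(match)
    return pieces


text = "tokenizers tokenize"

# Character-level: one piece per character.
char_pieces = list(text)

# Word-level: keep whitespace as its own piece so decoding is lossless.
word_pieces = re.findall(r"\s+|\w+|[^\w\s]", text)

# Subword-level: hypothetical hand-picked subword list, not a real vocabulary.
subword_pieces = subword_split(text, ["token", "izer", "ize"])
subword_vocab = build_vocab(subword_pieces)
subword_ids = encode(subword_pieces, subword_vocab)

print(subword_pieces)  # ['token', 'izer', 's', ' ', 'token', 'ize']
print(subword_ids)     # [0, 1, 2, 3, 0, 4]
```

Note how the subword split reuses the ID for `token` in both words, while a word-level split would treat `tokenizers` and `tokenize` as unrelated pieces. Real tokenizers learn their vocabulary from data rather than taking a hand-written list.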

code2prompt uses the tiktoken tokenizer. It is efficient, robust, and optimized for OpenAI models. You can explore its functionality in the official repository:

👉 tiktoken GitHub Repository

To see a tokenizer in action and experiment with it, try Tiktokenizer:

👉 Tiktokenizer

To learn more about tokenizers in general, see the Mistral tokenization guide:

👉 Mistral Tokenization Guide

If you want a technical deep dive into tokenizers, check out Andrej Karpathy’s blog post on building your own tokenizer.

👉 Build Your Own Tokenizer

Tokenization is implemented using tiktoken-rs. tiktoken supports these encodings used by OpenAI models:

| CLI Argument | Encoding name | OpenAI models |
| --- | --- | --- |
| cl100k | cl100k_base | ChatGPT models, text-embedding-ada-002 |
| p50k | p50k_base | Code models, text-davinci-002, text-davinci-003 |
| p50k_edit | p50k_edit | Use for edit models like text-davinci-edit-001, code-davinci-edit-001 |
| r50k | r50k_base (or gpt2) | GPT-3 models like davinci |
| gpt2 | o200k_base | GPT-4o models |

For more context on the different tokenizers, see the OpenAI Cookbook.