Tokenization in Code2Prompt
When working with language models, text must first be transformed into a format the model can understand: a sequence of tokens, each represented by a numeric ID. This transformation is handled by a tokenizer.
What is a Tokenizer?
A tokenizer converts raw text into tokens, which are the building blocks for how language models process input. These tokens can represent words, subwords, or even individual characters, depending on the tokenizer’s design.
For code2prompt, we use the tiktoken tokenizer. It’s efficient, robust, and optimized for OpenAI models. You can explore its functionality in the official tiktoken repository. If you want to learn more about tokenizers in general, see the OpenAI Cookbook reference at the end of this section.
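To make this concrete, here is a minimal sketch that uses the tiktoken-rs crate to turn a short string into token IDs. The example text and the choice of the cl100k_base encoding are arbitrary for illustration, not something code2prompt requires.

```rust
// Minimal sketch: encode a string into token IDs with tiktoken-rs.
// Requires the `tiktoken-rs` crate in Cargo.toml; the example text is arbitrary.
use tiktoken_rs::cl100k_base;

fn main() {
    // Load the cl100k_base encoding (used by ChatGPT-era models).
    let bpe = cl100k_base().expect("failed to load cl100k_base");

    let text = "Tokenization turns text into numbers.";
    let tokens = bpe.encode_with_special_tokens(text);

    // The model sees this sequence of numeric IDs, not the raw text.
    println!("{:?}", tokens);
    println!("{} tokens", tokens.len());
}
```

Counting tokens this way is also how you can estimate whether a generated prompt will fit in a model’s context window.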
Implementation in code2prompt
Tokenization is implemented using tiktoken-rs. tiktoken supports these encodings used by OpenAI models:
| CLI Argument | Encoding name | OpenAI models |
| --- | --- | --- |
| cl100k | cl100k_base | ChatGPT models, text-embedding-ada-002 |
| p50k | p50k_base | Code models, text-davinci-002, text-davinci-003 |
| p50k_edit | p50k_edit | Use for edit models like text-davinci-edit-001, code-davinci-edit-001 |
| r50k | r50k_base (or gpt2) | GPT-3 models like davinci |
| o200k | o200k_base | GPT-4o models |
For more context on the different tokenizers, see the OpenAI Cookbook.
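As a rough illustration of how the table above could translate into code, the sketch below maps a CLI-style encoding argument to the corresponding tiktoken-rs constructor. The function name encoding_from_arg and the exact argument spellings are assumptions made for this example, not code2prompt’s actual implementation.

```rust
// Illustrative sketch only: map a CLI-style encoding argument to a
// tiktoken-rs encoding. Names and spellings here are assumptions,
// not code2prompt's real implementation.
// Note: o200k_base requires a recent tiktoken-rs release.
use tiktoken_rs::{cl100k_base, o200k_base, p50k_base, p50k_edit, r50k_base, CoreBPE};

fn encoding_from_arg(arg: &str) -> CoreBPE {
    let encoding = match arg {
        "cl100k" => cl100k_base(),
        "p50k" => p50k_base(),
        "p50k_edit" => p50k_edit(),
        "r50k" | "gpt2" => r50k_base(),
        "o200k" => o200k_base(),
        other => panic!("unknown encoding: {other}"),
    };
    encoding.expect("failed to load encoding")
}

fn main() {
    let bpe = encoding_from_arg("cl100k");
    let tokens = bpe.encode_with_special_tokens("fn main() {}");
    println!("{} tokens", tokens.len());
}
```

The important point is simply that each CLI argument selects one of the encodings in the table, and that choice determines how text is split and therefore how many tokens a given prompt costs.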