Tokenizers encode chats and associated message data into tokens for training and inference.

TokenSlice

TokenSlice(
    start: int,
    end: int,
    type: SliceType,
    obj: SliceObj | None = None,
    metadata: dict[str, Any] | None = None,
)
Represents a slice of tokens within a tokenized chat.

end

end: int
The ending index of the slice in the token list.

metadata

metadata: dict[str, Any] | None = None
Additional metadata associated with this slice, if any.

obj

obj: SliceObj | None = None
The original object this slice corresponds to, if any.

start

start: int
The starting index of the slice in the token list.

type

type: SliceType
The type of the slice (e.g. message, tool_call, etc.).
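The `start` and `end` attributes are indices into the chat's token list, so standard list slicing recovers a slice's tokens. A minimal illustration with toy token values (not from a real tokenizer), assuming Python's usual half-open `[start, end)` convention:

```python
# Toy token list standing in for TokenizedChat.tokens.
tokens = [101, 2023, 2003, 1037, 3231, 102]

# Hypothetical slice bounds covering a "message" region of the chat.
slice_start, slice_end = 1, 5

# The slice's tokens are recovered with ordinary list slicing.
message_tokens = tokens[slice_start:slice_end]
print(message_tokens)  # [2023, 2003, 1037, 3231]
```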

TokenizedChat

TokenizedChat(
    text: str,
    tokens: list[int],
    slices: list[TokenSlice],
    obj: Chat | None = None,
    metadata: dict[str, Any] | None = None,
)
A tokenized representation of a chat, containing the full text, token list, and structured slices of tokens.

metadata

metadata: dict[str, Any] | None = None
Additional metadata associated with the tokenized chat, if any.

obj

obj: Chat | None = None
The original chat object, if available.

slices

slices: list[TokenSlice]
Structured slices of tokens, each representing a part of the chat.

text

text: str
The full text of the chat, formatted as a single string.

tokens

tokens: list[int]
The list of tokens representing the chat text.

Tokenizer

Base class for all rigging tokenizers. This class provides common functionality and methods for tokenizing chats.

model

model: str
The model name to be used by the tokenizer.

decode

decode(tokens: list[int]) -> str
Decodes a list of tokens back into a string.
Parameters:
  • tokens (list[int]) – The list of tokens to decode.
Returns:
  • str – The decoded string.

encode

encode(text: str) -> list[int]
Encodes the given text into a list of tokens.
Parameters:
  • text (str) – The text to encode.
Returns:
  • list[int] – A list of tokens representing the encoded text.
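The contract between `encode` and `decode` is a round trip: decoding the tokens produced by `encode` yields the original text. A toy stand-in sketch (a whitespace vocabulary, not rigging's actual implementation) that demonstrates the contract:

```python
# Toy vocabulary mapping words to token ids, and its inverse.
vocab = {"hello": 0, "world": 1}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text: str) -> list[int]:
    # Split on whitespace and look up each word's id.
    return [vocab[word] for word in text.split()]

def decode(tokens: list[int]) -> str:
    # Map ids back to words and rejoin with spaces.
    return " ".join(inv_vocab[t] for t in tokens)

ids = encode("hello world")
print(ids)          # [0, 1]
print(decode(ids))  # hello world
```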

format_chat

format_chat(chat: Chat) -> str
Formats the chat into a string representation.
Parameters:
  • chat (Chat) – The chat object to format.
Returns:
  • str – A string representation of the chat.

tokenize_chat

tokenize_chat(chat: Chat) -> TokenizedChat
Transforms a chat into a tokenized format with structured slices.
Parameters:
  • chat (Chat) – The chat object to tokenize.
Returns:
  • TokenizedChat – A TokenizedChat object containing the tokenized chat data.
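A common training use for the output of `tokenize_chat` is building a loss mask from the slices. A sketch of that pattern, with the TokenizedChat structure mocked as plain dicts (the `role` key is hypothetical metadata, not a documented TokenSlice field):

```python
# Mock tokenize_chat output: a token list plus message slices.
tokens = [1, 2, 3, 4, 5, 6, 7, 8]
slices = [
    {"start": 0, "end": 4, "type": "message", "role": "user"},
    {"start": 4, "end": 8, "type": "message", "role": "assistant"},
]

# Keep loss only on tokens inside assistant-message slices.
mask = [0] * len(tokens)
for s in slices:
    if s["role"] == "assistant":
        for i in range(s["start"], s["end"]):
            mask[i] = 1

print(mask)  # [0, 0, 0, 0, 1, 1, 1, 1]
```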

TransformersTokenizer

A tokenizer implementation using Hugging Face Transformers. This class provides tokenization capabilities for chat conversations using transformers models and their associated tokenizers.

apply_chat_template_kwargs

apply_chat_template_kwargs: dict[str, Any] = Field(
    default_factory=dict
)
Additional keyword arguments for applying the chat template.

decode_kwargs

decode_kwargs: dict[str, Any] = Field(default_factory=dict)
Additional keyword arguments for decoding tokens.

encode_kwargs

encode_kwargs: dict[str, Any] = Field(default_factory=dict)
Additional keyword arguments for encoding text.

tokenizer

tokenizer: PreTrainedTokenizer
The underlying PreTrainedTokenizer instance.

encode

encode(text: str) -> list[int]
Encodes the given text into a list of tokens.
Parameters:
  • text (str) – The text to encode.
Returns:
  • list[int] – A list of tokens representing the encoded text.

from_obj

from_obj(
    tokenizer: PreTrainedTokenizer,
) -> TransformersTokenizer
Creates a new instance of TransformersTokenizer from an already loaded tokenizer.
Parameters:
  • tokenizer (PreTrainedTokenizer) – The tokenizer associated with the model.
Returns:
  • TransformersTokenizer – The TransformersTokenizer instance.

get_tokenizer

get_tokenizer(identifier: str) -> Tokenizer
Gets a tokenizer by an identifier string. Uses Transformers by default. Identifier strings are formatted like <provider>!<model>,<**kwargs> (provider is optional and defaults to transformers if not specified).
Examples:
  • "meta-llama/Meta-Llama-3-8B-Instruct" -> TransformersTokenizer(model="meta-llama/Meta-Llama-3-8B-Instruct")
  • "transformers!microsoft/Phi-4-mini-instruct" -> TransformersTokenizer(model="microsoft/Phi-4-mini-instruct")
Parameters:
  • identifier (str) – The identifier string to use to get a tokenizer.
Returns:
  • Tokenizer – The tokenizer object.
Raises:
  • InvalidTokenizerError – If the identifier is invalid.
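The identifier format splits on an optional `!` separator. A simplified parser illustrating that format (not rigging's actual implementation, and it ignores the `,<**kwargs>` part):

```python
def parse_identifier(identifier: str) -> tuple[str, str]:
    # Split at the first "!"; everything before it is the provider.
    provider, sep, model = identifier.partition("!")
    if not sep:
        # No "!" present: the whole string is the model name and the
        # provider defaults to "transformers".
        return "transformers", identifier
    return provider, model

print(parse_identifier("meta-llama/Meta-Llama-3-8B-Instruct"))
# ('transformers', 'meta-llama/Meta-Llama-3-8B-Instruct')
print(parse_identifier("transformers!microsoft/Phi-4-mini-instruct"))
# ('transformers', 'microsoft/Phi-4-mini-instruct')
```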

register_tokenizer

register_tokenizer(
    provider: str,
    tokenizer_cls: type[Tokenizer] | LazyTokenizer,
) -> None
Registers a tokenizer class for a provider id. This lets you use get_tokenizer with a custom tokenizer class.
Parameters:
  • provider (str) – The name of the provider.
  • tokenizer_cls (type[Tokenizer] | LazyTokenizer) – The tokenizer class to register.
Returns:
  • None
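Under the hood this follows a familiar provider-registry pattern. A sketch of that pattern with a plain dict (rigging's internal registry may differ, and `MyTokenizer` is a hypothetical custom class):

```python
# A simple provider-to-class registry.
registry: dict[str, type] = {}

def register_tokenizer(provider: str, tokenizer_cls: type) -> None:
    # Associate the provider id with the tokenizer class.
    registry[provider] = tokenizer_cls

class MyTokenizer:
    """Hypothetical custom tokenizer class."""

register_tokenizer("my-provider", MyTokenizer)
print(registry["my-provider"].__name__)  # MyTokenizer
```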