Tokenizers encode chats and associated message data into tokens for training and inference.

TokenSlice

TokenSlice(
    start: int,
    end: int,
    type: SliceType,
    obj: SliceObj | None = None,
    metadata: dict[str, Any] | None = None,
)
Represents a slice of tokens within a tokenized chat.

end

end: int
The ending index of the slice in the token list.

metadata

metadata: dict[str, Any] | None = None
Additional metadata associated with this slice, if any.

obj

obj: SliceObj | None = None
The original object this slice corresponds to, if any.

start

start: int
The starting index of the slice in the token list.

type

type: SliceType
The type of the slice (e.g. message, tool_call, etc.).
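The `start` and `end` attributes are indices into the chat's token list, so standard list slicing recovers a slice's tokens. A minimal illustration with toy token values (not from a real tokenizer), assuming Python's usual half-open `[start, end)` convention:

```python
# Toy token list standing in for TokenizedChat.tokens.
tokens = [101, 2023, 2003, 1037, 3231, 102]

# Hypothetical slice bounds covering a "message" region of the chat.
slice_start, slice_end = 1, 5

# The slice's tokens are recovered with ordinary list slicing.
message_tokens = tokens[slice_start:slice_end]
print(message_tokens)  # [2023, 2003, 1037, 3231]
```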

TokenizedChat

TokenizedChat(
    text: str,
    tokens: list[int],
    slices: list[TokenSlice],
    obj: Chat | None = None,
    metadata: dict[str, Any] | None = None,
)
A tokenized representation of a chat, containing the full text, token list, and structured slices of tokens.

metadata

metadata: dict[str, Any] | None = None
Additional metadata associated with the tokenized chat, if any.

obj

obj: Chat | None = None
The original chat object, if available.

slices

slices: list[TokenSlice]
Structured slices of tokens, each representing a part of the chat.

text

text: str
The full text of the chat, formatted as a single string.

tokens

tokens: list[int]
The list of tokens representing the chat text.

Tokenizer

Base class for all rigging tokenizers. This class provides common functionality and methods for tokenizing chats.

model

model: str
The model name to be used by the tokenizer.

decode

decode(tokens: list[int]) -> str
Decodes a list of tokens back into a string.
Parameters:
  • tokens (list[int]) – The list of tokens to decode.
Returns:
  • str – The decoded string.

encode

encode(text: str) -> list[int]
Encodes the given text into a list of tokens.
Parameters:
  • text (str) – The text to encode.
Returns:
  • list[int] – A list of tokens representing the encoded text.
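The contract between `encode` and `decode` is a round trip: decoding the tokens produced by `encode` yields the original text. A toy stand-in sketch (a whitespace vocabulary, not rigging's actual implementation) that demonstrates the contract:

```python
# Toy vocabulary mapping words to token ids, and its inverse.
vocab = {"hello": 0, "world": 1}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text: str) -> list[int]:
    # Split on whitespace and look up each word's id.
    return [vocab[word] for word in text.split()]

def decode(tokens: list[int]) -> str:
    # Map ids back to words and rejoin with spaces.
    return " ".join(inv_vocab[t] for t in tokens)

ids = encode("hello world")
print(ids)          # [0, 1]
print(decode(ids))  # hello world
```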

format_chat

format_chat(chat: Chat) -> str
Formats the chat into a string representation.
Parameters:
  • chat (Chat) – The chat object to format.
Returns:
  • str – A string representation of the chat.

tokenize_chat

tokenize_chat(chat: Chat) -> TokenizedChat
Transforms a chat into a tokenized format with structured slices.
Parameters:
  • chat (Chat) – The chat object to tokenize.
Returns:
  • TokenizedChat – A TokenizedChat object containing the tokenized chat data.
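A common training use for the output of `tokenize_chat` is building a loss mask from the slices. A sketch of that pattern, with the TokenizedChat structure mocked as plain dicts (the `role` key is hypothetical metadata, not a documented TokenSlice field):

```python
# Mock tokenize_chat output: a token list plus message slices.
tokens = [1, 2, 3, 4, 5, 6, 7, 8]
slices = [
    {"start": 0, "end": 4, "type": "message", "role": "user"},
    {"start": 4, "end": 8, "type": "message", "role": "assistant"},
]

# Keep loss only on tokens inside assistant-message slices.
mask = [0] * len(tokens)
for s in slices:
    if s["role"] == "assistant":
        for i in range(s["start"], s["end"]):
            mask[i] = 1

print(mask)  # [0, 0, 0, 0, 1, 1, 1, 1]
```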

TransformersTokenizer

A tokenizer implementation using Hugging Face Transformers. This class provides tokenization capabilities for chat conversations using transformers models and their associated tokenizers.

apply_chat_template_kwargs

apply_chat_template_kwargs: dict[str, Any] = Field(
    default_factory=dict
)
Additional keyword arguments for applying the chat template.

decode_kwargs

decode_kwargs: dict[str, Any] = Field(default_factory=dict)
Additional keyword arguments for decoding tokens.

encode_kwargs

encode_kwargs: dict[str, Any] = Field(default_factory=dict)
Additional keyword arguments for encoding text.

tokenizer

tokenizer: PreTrainedTokenizer
The underlying PreTrainedTokenizer instance.

encode

encode(text: str) -> list[int]
Encodes the given text into a list of tokens.
Parameters:
  • text (str) – The text to encode.
Returns:
  • list[int] – A list of tokens representing the encoded text.

from_obj

from_obj(
    tokenizer: PreTrainedTokenizer,
) -> TransformersTokenizer
Creates a new instance of TransformersTokenizer from an already loaded tokenizer.
Parameters:
  • tokenizer (PreTrainedTokenizer) – The tokenizer associated with the model.
Returns:
  • TransformersTokenizer – The TransformersTokenizer instance.

get_tokenizer

get_tokenizer(identifier: str) -> Tokenizer
Gets a tokenizer by an identifier string. Uses Transformers by default. Identifier strings are formatted like <provider>!<model>,<**kwargs> (provider is optional and defaults to transformers if not specified).
Examples:
  • "meta-llama/Meta-Llama-3-8B-Instruct" -> TransformersTokenizer(model="meta-llama/Meta-Llama-3-8B-Instruct")
  • "transformers!microsoft/Phi-4-mini-instruct" -> TransformersTokenizer(model="microsoft/Phi-4-mini-instruct")
Parameters:
  • identifier (str) – The identifier string to use to get a tokenizer.
Returns:
  • Tokenizer – The tokenizer object.
Raises:
  • InvalidTokenizerError – If the identifier is invalid.
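The identifier format splits on an optional `!` separator. A simplified parser illustrating that format (not rigging's actual implementation, and it ignores the `,<**kwargs>` part):

```python
def parse_identifier(identifier: str) -> tuple[str, str]:
    # Split at the first "!"; everything before it is the provider.
    provider, sep, model = identifier.partition("!")
    if not sep:
        # No "!" present: the whole string is the model name and the
        # provider defaults to "transformers".
        return "transformers", identifier
    return provider, model

print(parse_identifier("meta-llama/Meta-Llama-3-8B-Instruct"))
# ('transformers', 'meta-llama/Meta-Llama-3-8B-Instruct')
print(parse_identifier("transformers!microsoft/Phi-4-mini-instruct"))
# ('transformers', 'microsoft/Phi-4-mini-instruct')
```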

register_tokenizer

register_tokenizer(
    provider: str,
    tokenizer_cls: type[Tokenizer] | LazyTokenizer,
) -> None
Registers a tokenizer class for a provider id. This lets you use get_tokenizer with a custom tokenizer class.
Parameters:
  • provider (str) – The name of the provider.
  • tokenizer_cls (type[Tokenizer] | LazyTokenizer) – The tokenizer class to register.
Returns:
  • None
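Under the hood this follows a familiar provider-registry pattern. A sketch of that pattern with a plain dict (rigging's internal registry may differ, and `MyTokenizer` is a hypothetical custom class):

```python
# A simple provider-to-class registry.
registry: dict[str, type] = {}

def register_tokenizer(provider: str, tokenizer_cls: type) -> None:
    # Associate the provider id with the tokenizer class.
    registry[provider] = tokenizer_cls

class MyTokenizer:
    """Hypothetical custom tokenizer class."""

register_tokenizer("my-provider", MyTokenizer)
print(registry["my-provider"].__name__)  # MyTokenizer
```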