Underlying LLMs (or any function which completes text) are represented as generators in Rigging. They are typically instantiated from identifier strings using the get_generator function. The base interface is flexible and designed to support optimizations where the underlying mechanism allows them (async batching, K/V caching, etc.).

Identifiers

Much like database connection strings, Rigging generators can be represented as strings which define the provider, model, API key, generation params, etc. that should be used.
Throughout our code, we frequently use these generator identifiers as CLI arguments, environment variables, and API parameters. They are convenient for passing around complex configurations without having to define model settings in multiple places. They are also used to serialize generators when chats are saved to storage, so you can load them back later without reconfiguring the generator each time.
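As a quick sketch, a script might read the identifier from an environment variable and pass it straight to get_generator (the RIGGING_GENERATOR variable name here is just hypothetical):
import os

import rigging as rg

# Fall back to a default identifier when the (hypothetical) variable isn't set
generator = rg.get_generator(os.environ.get("RIGGING_GENERATOR", "gpt-4.1"))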
Here are some examples of valid identifiers:
gpt-4.1
openai/o3-mini
gemini/gemini-2.5-pro
claude-4-sonnet-latest
vllm_hosted/meta-llama/Llama-3.1-8B-Instruct
ollama/qwen3

openai/gpt-4,api_key=sk-1234
anthropic/claude-3-7-haiku-latest,stop=output:;---,seed=1337
together_ai/meta-llama/Llama-3-70b-chat-hf
openai/google/gemma-7b,api_base=https://integrate.api.nvidia.com/v1
Identifiers are formally defined as follows:
<provider>!<model>,<**kwargs>
  • provider maps to a particular subclass of Generator (optional).
  • model is any str value, typically used by the provider to indicate a specific LLM to target.
  • kwargs are used to carry:
    1. API key (,api_key=...) or the base URL (,api_base=...) for the model provider.
    2. Serialized GenerateParams fields like temperature, stop tokens, etc.
    3. Additional provider-specific attributes to set on the constructed generator class. For instance, you can set the LiteLLMGenerator.max_connections property by passing ,max_connections= in the identifier string.
The provider is optional, and Rigging will fall back to litellm/LiteLLMGenerator by default. You can view the LiteLLM docs for more information about supported model providers and parameters. Building generators from string identifiers is optional, but it's a convenient way to represent complex LLM configurations.
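For illustration, a single identifier can carry the provider, model, a generation param, and a generator attribute all at once (the model choice below is only an example):
import rigging as rg

# Explicit litellm provider, a GenerateParams field (temperature),
# and a LiteLLMGenerator attribute (max_connections) in one string
generator = rg.get_generator(
    "litellm!openai/gpt-4.1,temperature=0.7,max_connections=5"
)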
Back to Strings: Any generator can be converted back into an identifier using either to_identifier or get_identifier.
generator = rg.get_generator("gpt-3.5-turbo,temperature=0.5")
print(generator.to_identifier())
# litellm!gpt-3.5-turbo,temperature=0.5

API Keys

All generators carry a .api_key attribute which can be set directly, or by passing ,api_key= as part of an identifier string. Not all generators will require one, but they are common enough that we include the attribute as part of the base class. Typically you will be using a library like LiteLLM underneath, and can simply use environment variables:
export OPENAI_API_KEY=...
export TOGETHER_API_KEY=...
export TOGETHERAI_API_KEY=...
export MISTRAL_API_KEY=...
export ANTHROPIC_API_KEY=...
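If environment variables aren't an option, here is a minimal sketch of supplying the key directly (the model and key below are placeholders):
import rigging as rg

# As part of the identifier ...
generator = rg.get_generator("gpt-4.1,api_key=sk-1234")

# ... or on the generator after construction
generator = rg.get_generator("gpt-4.1")
generator.api_key = "sk-1234"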

Rate Limits

Generators that leverage remote services (like LiteLLM) expose properties for managing connection/request limits, as shown in the sketch after this list:
  • LiteLLMGenerator.max_connections
  • LiteLLMGenerator.min_delay_between_requests
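As a rough sketch, these can be set through identifier kwargs (as described above) or as attributes on the constructed generator:
import rigging as rg

# Via the identifier ...
generator = rg.get_generator("gpt-4.1,max_connections=5")

# ... or directly as attributes after construction
generator.max_connections = 2
generator.min_delay_between_requests = 1000  # see the LiteLLMGenerator docs for the expected units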
However, a more flexible solution is ChatPipeline.wrap() paired with a library like backoff to catch broad or specific errors, such as rate limits or general connection issues.
import rigging as rg

import backoff
import backoff.types

def on_backoff(details: backoff.types.Details) -> None:
    print(f"Backing off {details['wait']:.2f}s")

pipeline = (
    rg.get_generator("claude-3-haiku-20240307")
    .chat("Give me a 4 word phrase about machines.")
    .wrap(
        backoff.on_exception(
            backoff.expo,
            Exception,  # This should be scoped down
            on_backoff=on_backoff,
        )
    )
)

chats = await pipeline.run_many(50)
You’ll find that the exception consistency inside LiteLLM can be quite poor. Different providers throw different types of exceptions for all kinds of status codes, response data, etc. With that said, you can typically find a target list that works well for your use-case.
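As an example, a narrower target might look like the following. This is only a sketch; which exception classes are available depends on your installed LiteLLM version:
import backoff
import litellm

retry = backoff.on_exception(
    backoff.expo,
    (litellm.RateLimitError, litellm.APIConnectionError),  # assumed to exist in your litellm version
    on_backoff=on_backoff,
)

# Drop this in for the broad Exception target above, e.g. `.wrap(retry)`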

Local Models

We have experimental support for both vLLM and transformers generators for loading and running models directly in the same Python process. In general vLLM is more consistent with Rigging’s preferred API, but the dependency requirements are heavier. Where needed, you can wrap an existing model into a rigging generator by using the VLLMGenerator.from_obj() or TransformersGenerator.from_obj() methods. These are helpful for any picky model construction that might not play well with our rigging constructors.
The use of these local generators requires the vllm and transformers packages to be installed. You can use rigging[all] to install them all at once, or pick your preferred package individually.
import rigging as rg

tiny_llama = rg.get_generator(
    "vllm!TinyLlama/TinyLlama-1.1B-Chat-v1.0," \
    "gpu_memory_utilization=0.3," \
    "trust_remote_code=True"
)

llama_3 = rg.get_generator(
    "transformers!meta-llama/Meta-Llama-3-8B-Instruct"
)
Loading and Unloading: You can use the Generator.load and Generator.unload methods to better control memory usage. Local providers are typically lazy and load the model into memory only when first needed.
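A minimal sketch of explicit control, reusing the tiny_llama generator from above:
tiny_llama.load()    # eagerly load the model into memory
chat = await tiny_llama.chat("Hello!").run()
tiny_llama.unload()  # release the model when you're done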

Self-Hosted Models

In addition to loading models directly inside the Python process, you often want to access models via some self-hosted server like Ollama or vLLM. Using self-hosted models is well supported in the LiteLLM ecosystem, and usually just requires some consideration for the API base URL and API key. Beyond specific servers, many services expose models in the “openai-compatible” format, which can be used with the openai/ LiteLLM prefix (usually just openai/<model>,api_base=http://...,api_key=...).

Self-Hosted Ollama

We’ll load the qwen3:0.6b model from Ollama, and the ollama server will host the model on http://localhost:11434 by default.
$ ollama run qwen3:0.6b
...

$ ollama ps
NAME          ID              SIZE      PROCESSOR    UNTIL
qwen3:0.6b    7df6b6e09427    2.3 GB    100% GPU     4 minutes from now
Using this model in Rigging is as simple as using the ollama/ or ollama_chat/ prefixes:
import rigging as rg

qwen = rg.get_generator("ollama/qwen3:0.6b")

chat = await qwen.chat("Hello!").run()
print(chat.conversation)

# [user]: Hello!

# [assistant]: <think>
# Okay, the user said "Hello!" so I need to respond appropriately. Let me start by acknowledging their greeting.
# A simple "Hello!" is good, but maybe add a friendly note to make it more engaging. I should keep it short and positive.
# Let me check if there's any specific context I should consider, but since there's no additional info, just a standard
# greeting should work. Make sure the response is welcoming and open-ended.
# </think>

# Hello! How can I assist you today? 😊
If you are running the Ollama server somewhere besides localhost, just pass the api_base to the generator:
qwen = rg.get_generator(
  "ollama/qwen3:0.6b,api_base=http://remote-server:11434"
)

Self-Hosted vLLM Server

vLLM ships with its own openai-compatible server via the vllm serve command. LiteLLM uses the hosted_vllm/ prefix to connect to it; otherwise you can use the openai/ prefix noted below.
$ vllm serve Qwen/Qwen3-0.6B
Rigging can now use the vLLM model for Chat or Completions:
import rigging as rg

qwen = rg.get_generator("hosted_vllm/Qwen/Qwen3-0.6B,api_base=http://<server>:8000/v1")

chat = await qwen.chat("Hello!").run()
print(chat.conversation)

# [user]: Hello!

# [assistant]: <think>
# Okay, the user said "Hello!" so I need to respond appropriately. Let me start by acknowledging their greeting.
# A simple "Hello!" is good, but maybe add a friendly note to make it more engaging. I should keep it short and positive.
# Let me check if there's any specific context I should consider, but since there's no additional info, just a standard
# greeting should work. Make sure the response is welcoming and open-ended.
# </think>

# Hello! How can I assist you today? 😊

Self-Hosted OpenAI-Compatible Server

For most other self-hosted models, the server will expose OpenAI-compatible endpoints, and you can use the openai/ prefix for LiteLLM as noted in their docs:
Selecting openai as the provider routes your request to an OpenAI-compatible endpoint using the upstream official OpenAI Python API library. This library requires an API key for all requests, either through the api_key parameter or the OPENAI_API_KEY environment variable. If you don’t want to provide a fake API key in each request, consider using a provider that directly matches your OpenAI-compatible endpoint, such as hosted_vllm or llamafile.
import rigging as rg

gen = rg.get_generator("openai/<model>,api_base=http://<server>/v1"

chat = await gen.chat("Hello").run()
print(chat.conversation)
You can also use the openai/ prefix along with api_key= and api_base= for vLLM:
gen = rg.get_generator(
  "openai/<model>,api_base=http://localhost:8000/v1,api_key=sk-none"
)

Overload Generation Params

When working with both CompletionPipeline and ChatPipeline, you can overload and update any generation params by using the associated .with_() function.
import rigging as rg

pipeline = rg.get_generator("gpt-3.5-turbo,max_tokens=50").chat([
    {"role": "user", "content": "Say a haiku about boats"},
])

for temp in [0.1, 0.5, 1.0]:
    chat = await pipeline.with_(temperature=temp).run()
    print(chat.last.content)

HTTP Generator

The HTTPGenerator allows you to wrap any HTTP endpoint as a generator, making it easy to integrate external LLMs or AI services into your Rigging pipelines. It works by defining a specification that maps message content into HTTP requests and parses responses back into messages. The specification is assigned to the .spec field on the generator, and can be applied as a Python dictionary, JSON string, YAML string, or base64-encoded JSON/YAML string. This flexibility allows you to easily share and reuse specifications across different parts of your application.

For most common use cases, you can use one of the built-in factory methods. These provide a simple, high-level interface and give you full autocompletion for configuration in your IDE.

For JSON APIs: HTTPGenerator.for_json_endpoint()

This is the perfect choice for APIs that accept a JSON request body and return a JSON response.
import rigging as rg

# Define a generator for a fictional API
openai_clone = rg.HTTPGenerator.for_json_endpoint(
    model="some-model-v1",
    api_key="MY_SECRET_KEY",
    url="https://api.example.com/v1/generate",
    auth={"header": "Authorization", "format": "Bearer {api_key}"},
    request_body={
        "model_name": "$model",
        "prompt": "$content",
        "history": "$messages",
    },
    response={"content_path": "$.choices[0].text"}
)

chat = await openai_clone.chat("Tell me a story about a brave robot.").run()
print(chat.last.content)

For Text & Template-based APIs: HTTPGenerator.for_text_endpoint()

If you’re interacting with a simpler API that takes plain text, this factory is ideal. The entire request body is a single Jinja2 template string.
import rigging as rg

text_api = rg.HTTPGenerator.for_text_endpoint(
    url="http://api.example.com/prompt",
    api_key="MY_OTHER_KEY",
    auth={"header": "X-Api-Key"},
    # The entire request body is a Jinja2 template
    request_template="User prompt for model {{ model }}: {{ content }}",
    # The response is parsed with a regex to find the content
    response_pattern="API Response: (.*)"
)
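Running it works like any other generator (the endpoint above is fictional, so this is purely illustrative):
chat = await text_api.chat("Summarize this in one line.").run()
print(chat.last.content)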

Advanced Usage with HTTPSpec

For maximum control, you can bypass the factories and define a full HTTPSpec object yourself. This is useful for complex scenarios involving multi-step transformations or other non-standard requirements.
import rigging as rg

spec = r"""
request:
  url: "https://{{ model }}.platform.dreadnode.io/submit"
  headers:
    "X-Api-Key": "{{ api_key }}"
    "Content-Type": "application/json"
  transforms:
    - type: "json"
      pattern: {
        "data": "$content"
      }
response:
  transforms:
    - type: "jsonpath"
      pattern: $.flag,output,message
"""

crucible = rg.get_generator("http!spanglish,api_key=<key>") # (1)
crucible.spec = spec

chat = await crucible.chat("A flag please").run()

print(chat.conversation)
# [user]: A flag please
#
# [assistant]: Una bandera, por favor.
1. We're using the .model field on the generator to carry our Crucible challenge name.
Saving schemas: Encoded YAML is the default storage format when an HTTP generator is serialized to an identifier using to_identifier. This also means that when we save our chats to storage, they maintain their HTTP specification.
print(crucible.to_identifier())
# http!spanglish,spec=eyJyZXF1ZXN0Ijp7InVyb...

State Management

You can make any HTTPGenerator stateful by using the .state dictionary. This is a mutable dictionary that you can use to store any dynamic information—like session IDs or temporary credentials—that needs to be accessed by your templates.
import uuid

import rigging as rg

# Create a generator and set an initial state
my_api = rg.HTTPGenerator.for_json_endpoint(...)
my_api.state["session_id"] = str(uuid.uuid4())

# The request_body can now reference this state
# request_body={"prompt": "$content", "session": "{{ state.session_id }}"}

Hooks

For advanced scenarios like handling expiring credentials, you can combine the state dictionary with an async hook function. The hook is called after every HTTP request, allowing it to inspect the response and dynamically update the state before automatically retrying.
import rigging as rg
import httpx

# 1. Define a hook to refresh a token on a 401 Unauthorized error
async def refresh_token_hook(generator: rg.generator.HTTPGenerator, response: httpx.Response) -> rg.generator.HookAction:
    if response.status_code == 401:
        print("Token expired, refreshing...")
        # (Your logic to get a new token would go here)
        new_token = "new-super-secret-token"

        # Update the generator's state in-place
        generator.state["access_token"] = new_token

        # Tell the generator to retry the request
        return "retry"

    # For all other responses, continue as normal
    return "continue"

# 2. Configure the generator with the hook and initial state
stateful_api = rg.HTTPGenerator.for_json_endpoint(
    url="https://api.secure.com/v1/live/chat",
    hook=refresh_token_hook,
    state={"access_token": "initial-token"},
    auth={"header": "Authorization", "format": "Bearer {{ state.access_token }}"},
    request_body={"query": "$content"},
    response={"content_path": "$.data.result"}
)

# This call will be transparently retried if the token has expired
chat = await stateful_api.chat("What are my account details?").run()

Template Context Variables

When building requests with either the factories or a full spec, the following RequestTransformContext variables are available in your Jinja templates (e.g., {{ variable }}) and JSON value substitutions (e.g., "$variable"):
  • content: Content of the last message.
  • messages: List of all message objects in the history.
  • api_key: The API key from the generator’s configuration.
  • model: The model identifier from the generator’s configuration.
  • state: The generator’s mutable state dictionary (see State Management above).
  • params: Generation parameters from the current call (e.g., temperature).
  • all_content: Concatenated content of all messages.
  • role: Role of the last message (user/assistant/system).
For advanced spec files with multiple transformation steps, the output of the previous step is also available to the next as result, data, or output.

Transforms

The HTTP generator supports different types of transforms for both request building and response parsing. Each serves a specific purpose and has its own pattern syntax.
Transform Chaining: Transforms are applied in sequence, with each transform's output becoming the input for the next. This allows you to build complex processing pipelines:
transforms:
  - type: "jsonpath"
    pattern: "$.data"  # Extract data object
  - type: "jinja"
    pattern: "{{ result | tojson }}"  # Convert to string
  - type: "regex"
    pattern: "message: (.*)"  # Extract specific field
Jinja (request + response) The jinja transform type provides full Jinja2 template syntax. Access context variables directly and use Jinja2 filters and control structures.
transforms:
  - type: "jinja"
    pattern: >
      {
        "content": "{{ all_content }}",
        "timestamp": "{{ now() }}",
        {% if params.temperature > 0.5 %}
        "mode": "creative"
        {% endif %}
      }
JSON (request only) The json transform type lets you build JSON request bodies using a template object. Use $ prefix to reference context variables, with dot notation for nested access:
transforms:
  - type: "json"
    pattern: {
      "messages": "$messages",
      "temperature": "$params.temperature",
      "content": "$content",
      "data": "$state.data",
      "static_field": "hello"
    }
JSONPath (response only) The jsonpath transform type uses JSONPath expressions to extract data from JSON responses:
transforms:
  - type: "jsonpath"
    pattern: "$.choices[0].message.content"
Regex (response only) The regex transform type uses regular expressions to extract content from text responses:
transforms:
  - type: "regex"
    pattern: "<output>(.*?)</output>"

Writing a Generator

All generators should inherit from the Generator base class, and can elect to implement handlers for messages and/or texts:
  • async def generate_messages(...) - Used for ChatPipeline.run variants.
  • async def generate_texts(...) - Used for CompletionPipeline.run variants.
If your generator doesn’t implement a particular method like text completions, Rigging will simply raise a NotImplementedError for you. It’s currently undecided whether generators should prefer to provide weak overloads for compatibility, or whether they should ignore methods which can’t be used optimally to help provide clarity to the user about capability. You’ll find we’ve opted for the former strategy in our generators.
Generators operate in a batch context by default, taking in groups of message lists or texts. Whether your implementation takes advantage of this batching is up to you, but you should optimize for it where possible.
Generators don’t make any assumptions about the underlying mechanism that completes text. You might use a local model, an API endpoint, or static code. The base class is designed to be flexible and support a wide variety of use cases. You'll find that api_key, model, and generation params are common enough that they are included in the base class.
import typing as t

from rigging import Generator, GenerateParams, Message, GeneratedMessage

class Custom(Generator):
    # model: str
    # api_key: str
    # params: GenerateParams

    custom_field: bool

    async def generate_messages(
        self,
        messages: t.Sequence[t.Sequence[Message]],
        params: t.Sequence[GenerateParams],
    ) -> t.Sequence[GeneratedMessage]:
        # merge_with is an easy way to combine overloads
        params = [
            self.params.merge_with(p).to_dict() for p in params
        ]

        # Access self vars where needed
        api_key = self.api_key
        model_id = self.model
        custom = self.custom_field

        # Build output messages with stop reason, usage, etc.
        # output_messages = ...

        return output_messages


generator = Custom(model='foo', custom_field=True)
generator.chat(...)
Use the register_generator function to add your generator class under a custom provider id so it can be used with get_generator.
import rigging as rg

rg.register_generator('custom', Custom)

custom = rg.get_generator('custom!foo,custom_field=True')