Models Module¶
The cogent.models module provides a 3-tier API for working with LLMs - from simple string-based models to full control with direct SDK access.
🎯 3-Tier Model API¶
Cogent offers three levels of abstraction - choose based on your needs:
Tier 1: High-Level (String Models) ⭐ Recommended¶
The simplest way to get started. Just use model name strings:
from cogent import Agent
# Auto-resolves to gpt-5.4
agent = Agent("Helper", model="gpt4")
# Auto-resolves to gemini-2.5-flash
agent = Agent("Helper", model="gemini")
# Auto-resolves to claude-sonnet-4
agent = Agent("Helper", model="claude")
# Provider prefix for explicit control
agent = Agent("Helper", model="anthropic:claude-opus-4")
agent = Agent("Helper", model="openai:gpt-5.4")
30+ Model Aliases:
- gpt4, gpt4-mini, gpt4-turbo, gpt35, gpt5, gpt5-mini, gpt5-nano
- claude, claude-opus, claude-haiku
- gemini, gemini-flash, gemini-flash-lite, gemini-pro, gemini3, gemini-3.1 ⚠️
- llama, llama-70b, llama-8b, mixtral, qwen
- mistral, mistral-small, mistral-small-4, mistral-medium, mistral-large
- command, command-a, command-r, command-r7b, command-reasoning, command-vision
- grok, grok-4, grok-4.20, grok-fast
- deepseek, deepseek-r1
- cerebras, cerebras-70b, cerebras-qwen, cerebras-qwen-235b, cerebras-glm, cerebras-gpt-oss
- ollama
⚠️ = Preview model (not production-ready)
API Key Loading (Priority Order):
1. Explicit api_key= parameter (highest)
2. Environment variables (includes .env when loaded)
3. Config file cogent.toml / cogent.yaml or ~/.cogent/config.* (lowest)
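For example, an explicit key wins over an environment variable (a minimal sketch; the key values are placeholders):
import os
from cogent.models import create_chat
os.environ["OPENAI_API_KEY"] = "sk-from-env"  # priority 2
llm = create_chat("openai", "gpt-5.4", api_key="sk-explicit")  # priority 1 wins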
Tier 2: Medium-Level (Factory Functions)¶
For when you need a model instance without an agent. Supports 4 flexible usage patterns:
from cogent.models import create_chat
# Pattern 1: Model name only (auto-detects provider)
llm = create_chat("gpt-5.4") # OpenAI
llm = create_chat("gemini-2.5-pro") # Google Gemini
llm = create_chat("claude-sonnet-4") # Anthropic
llm = create_chat("llama-3.1-8b-instant") # Groq
llm = create_chat("mistral-small-latest") # Mistral
# Pattern 2: Provider:model syntax (explicit provider prefix)
llm = create_chat("openai:gpt-5.4")
llm = create_chat("gemini:gemini-2.5-flash")
llm = create_chat("anthropic:claude-sonnet-4-20250514")
# Pattern 3: Separate provider and model arguments
llm = create_chat("openai", "gpt-5.4")
llm = create_chat("gemini", "gemini-2.5-pro")
llm = create_chat("anthropic", "claude-sonnet-4")
# Pattern 4: With additional configuration
llm = create_chat("gpt-5.4", temperature=0.7, max_tokens=1000)
llm = create_chat("openai", "gpt-5.4", api_key="sk-custom...")
# Use the model
response = await llm.ainvoke("What is 2+2?")
print(response.content)
Auto-Detection: Patterns 1 and 2 automatically detect the provider from model name prefixes:
- OpenAI: gpt-, o1-, o3-, o4-, text-embedding-, gpt-audio, gpt-realtime, sora-
- Gemini: gemini-, text-embedding-
- Anthropic: claude-
- xAI: grok-
- DeepSeek: deepseek-
- Cerebras: llama3.1- (opinionated default — use cerebras:* for explicit routing)
- Mistral: mistral-, ministral-, magistral-, devstral-, codestral-, voxtral-, ocr-
- Cohere: command-, c4ai-aya-, embed-, rerank-
- Groq: llama-, mixtral-, qwen-, gemma-
- Cloudflare: @cf/
Tier 3: Low-Level (Direct Model Classes)¶
For maximum control over model configuration:
from cogent.models import OpenAIChat, AnthropicChat, GeminiChat
# Full control over all parameters
model = OpenAIChat(
model="gpt-5.4",
temperature=0.7,
max_tokens=2000,
api_key="sk-...",
organization="org-...",
)
model = GeminiChat(
model="gemini-2.5-flash",
temperature=0.9,
api_key="...",
)
model = AnthropicChat(
model="claude-sonnet-4-20250514",
max_tokens=4096,
api_key="sk-ant-...",
)
When to Use Each Tier:
| Tier | Use Case | Example |
|---|---|---|
| Tier 1 (Strings) | Quick prototyping, simple agents | Agent(model="gpt4") |
| Tier 2 (Factory) | Reusable model instances | create_chat("claude") |
| Tier 3 (Direct) | Custom config, advanced features | OpenAIChat(temperature=0.9) |
Model Catalog¶
ModelCatalog is a queryable collection of model metadata populated by live provider API calls. There is no bundled static catalog — every fetch hits the real API and returns current model IDs, pricing, and context windows.
Fetch models from a provider¶
from cogent.models.catalog import ModelCatalog
# Fetch from one provider
catalog = await ModelCatalog.from_provider("openai")
# List active models
for m in catalog.list_models():
print(m.id, m.context_window)
# Query helpers
catalog.list_models(provider="openai", capability="tools")
catalog.get_model("gpt-5.4")
catalog.is_available("claude-sonnet-4-6")
catalog.find_latest(family="gpt-5.4")
catalog.cheapest(capability="tools", by="input")
catalog.summary() # {provider: {status: count}}
Supported provider names: "openai", "anthropic", "groq", "mistral", "gemini", "xai", "deepseek", "cerebras", "cohere", "openrouter".
Fetch from OpenRouter (all providers in one call)¶
OpenRouter exposes 200+ models from all major providers with live pricing and context-window data in a single request.
catalog = await ModelCatalog.from_openrouter()
# Models from every provider, with pricing
for m in catalog.list_models(status=None):
print(f"{m.provider}/{m.id} ${m.input_cost_per_1m}/1M in")
# Find the cheapest tool-capable model across all providers
best = catalog.cheapest(capability="tools")
print(best.id, best.input_cost_per_1m)
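The result can be fed back into the factory. A sketch, assuming OpenRouter model IDs take the provider/model form printed above:
from cogent.models import create_chat
best = catalog.cheapest(capability="tools")
# Assumption: best.id does not already include the provider segment
llm = create_chat("openrouter", f"{best.provider}/{best.id}")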
Cache results locally¶
Save a snapshot to disk and reload it for offline use or to avoid redundant API calls.
# Save
catalog = await ModelCatalog.from_openrouter()
catalog.save("~/.cogent/models.json")
# Load
catalog = ModelCatalog.load("~/.cogent/models.json")
print(catalog.fetched_at) # ISO-8601 timestamp of when it was fetched
The saved format is {"fetched_at": "...", "models": [...]} — a plain snapshot with no schema versioning. Use fetched_at to decide whether to refresh.
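A minimal refresh sketch, assuming fetched_at parses with datetime.fromisoformat (a naive timestamp is treated as UTC):
from datetime import datetime, timedelta, timezone
from cogent.models.catalog import ModelCatalog
path = "~/.cogent/models.json"
catalog = ModelCatalog.load(path)
fetched = datetime.fromisoformat(catalog.fetched_at)
if fetched.tzinfo is None:
    fetched = fetched.replace(tzinfo=timezone.utc)
# Refresh the snapshot when it is more than a week old
if datetime.now(timezone.utc) - fetched > timedelta(days=7):
    catalog = await ModelCatalog.from_openrouter()
    catalog.save(path)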
discover_models.py script¶
The scripts/discover_models.py utility probes provider APIs and prints available models. It delegates to ModelCatalog.from_provider() internally.
# Print all providers
uv run python scripts/discover_models.py
# Single provider
uv run python scripts/discover_models.py --provider anthropic
# OpenRouter (200+ models with pricing)
uv run python scripts/discover_models.py --provider openrouter
# Save to cache
uv run python scripts/discover_models.py --save ~/.cogent/models.json
Configuration¶
.env File (Recommended for Development)¶
Create a .env file in your project root:
# .env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=AIza...
GROQ_API_KEY=gsk_...
Cogent automatically loads .env files using python-dotenv.
Model Overrides (Environment + Config)¶
You can override default chat or embedding models via env vars or config files.
Environment variables (highest):
OPENAI_CHAT_MODEL=gpt-4.1
OPENAI_EMBEDDING_MODEL=text-embedding-3-large
GEMINI_CHAT_MODEL=gemini-2.5-flash
GEMINI_EMBEDDING_MODEL=gemini-embedding-001
MISTRAL_CHAT_MODEL=mistral-small-latest
MISTRAL_EMBEDDING_MODEL=mistral-embed
GROQ_CHAT_MODEL=llama-3.1-8b-instant
COHERE_CHAT_MODEL=command-a-03-2025
COHERE_EMBEDDING_MODEL=embed-english-v3.0
CLOUDFLARE_CHAT_MODEL=@cf/meta/llama-3.1-8b-instruct
CLOUDFLARE_EMBEDDING_MODEL=@cf/baai/bge-base-en-v1.5
GITHUB_CHAT_MODEL=gpt-4.1
GITHUB_EMBEDDING_MODEL=text-embedding-3-large
OLLAMA_CHAT_MODEL=qwen2.5:7b
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
Config file (fallback): the same overrides can be set in cogent.toml or cogent.yaml, as described in the next section.
Config File (Recommended for Production)¶
Create a config file at one of these locations:
TOML Format (cogent.toml or ~/.cogent/config.toml):
[models]
default = "gpt4"
[models.openai]
api_key = "sk-..."
organization = "org-..."
[models.anthropic]
api_key = "sk-ant-..."
[models.gemini]
api_key = "..."
[models.groq]
api_key = "gsk_..."
YAML Format (cogent.yaml or ~/.cogent/config.yaml):
models:
default: gpt4
openai:
api_key: sk-...
organization: org-...
anthropic:
api_key: sk-ant-...
gemini:
api_key: ...
Environment Variables¶
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GEMINI_API_KEY=AIza...
export GROQ_API_KEY=gsk_...
Provider Support¶
All chat models accept multiple input formats:
1. Simple String (Most Convenient)¶
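A plain string is treated as a single user message (same as the create_chat example above):
response = await model.ainvoke("What is 2+2?")
print(response.content)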
2. List of Dicts (Standard Format)¶
response = await model.ainvoke([
{"role": "system", "content": "You are helpful"},
{"role": "user", "content": "Hello"},
])
3. Message Objects (Type-Safe)¶
from cogent.core.messages import SystemMessage, HumanMessage
response = await model.ainvoke([
SystemMessage(content="You are helpful"),
HumanMessage(content="Hello"),
])
OpenAI¶
from cogent.models import OpenAIChat, OpenAIEmbedding
# Tier 1: Simple string
agent = Agent("Helper", model="gpt4")
# Tier 2: Factory
model = create_chat("gpt4")
model = create_chat("openai", "gpt-5.4")
# Tier 3: Direct
model = OpenAIChat(
model="gpt-5.4",
temperature=0.7,
max_tokens=2000,
api_key="sk-...", # Or OPENAI_API_KEY env var
)
# Embeddings
embeddings = OpenAIEmbedding(model="text-embedding-3-small")
# Primary API with metadata
result = await embeddings.aembed(["Hello world"])
print(result.embeddings) # Vectors
print(result.metadata) # Full metadata
# Convenience for single text
vector = await embeddings.aembed_one("Query")
With tools:
from cogent.tools import tool
@tool
def search(query: str) -> str:
"""Search the web."""
return f"Results for: {query}"
model = OpenAIChat(model="gpt-5.4")
bound = model.bind_tools([search])
response = await bound.ainvoke([
{"role": "user", "content": "Search for AI news"}
])
if response.tool_calls:
print(response.tool_calls)
xAI (Grok)¶
from cogent.models import XAIChat
# Latest flagship model (2M context, reasoning always on)
model = XAIChat(model="grok-4.20", api_key="xai-...")
# Non-reasoning variant (faster for latency-sensitive tasks)
model = XAIChat(model="grok-4.20-0309-non-reasoning")
# Fast agentic model with reasoning (2M context)
model = XAIChat(model="grok-4-1-fast-reasoning")
# Vision model
model = XAIChat(model="grok-2-vision-1212")
# With reasoning effort (grok-3-mini only)
model = XAIChat(model="grok-3-mini", reasoning_effort="high")
response = await model.ainvoke("What is 101 * 3?")
print(response.metadata.tokens.reasoning_tokens)
Available Models:
- grok-4.20-0309-reasoning (alias: grok, grok-4.20, grok-4.20-reasoning): Latest flagship — 2M context, fast + reasoning
- grok-4.20-0309-non-reasoning (alias: grok-4.20-non-reasoning): Non-reasoning variant — 2M context
- grok-4.20-multi-agent-0309: Multi-agent optimised variant — 2M context, reasoning
- grok-4-0709 (alias: grok-4): Grok 4 stable snapshot — 256K context, reasoning
- grok-4-1-fast-reasoning (alias: grok-fast-reasoning): Fast model with explicit reasoning — 2M context
- grok-4-1-fast-non-reasoning (alias: grok-fast, grok-fast-non-reasoning): Fast model without reasoning — 2M context
- grok-3, grok-3-mini: Previous generation
- grok-2-vision-1212 (alias: grok-vision): Vision model
Environment Variable: XAI_API_KEY
DeepSeek¶
from cogent.models import DeepSeekChat
# Standard chat model
model = DeepSeekChat(model="deepseek-chat", api_key="sk-...")
# Reasoning model with Chain of Thought
model = DeepSeekChat(model="deepseek-reasoner")
response = await model.ainvoke("9.11 and 9.8, which is greater?")
# Access reasoning content
if hasattr(response, 'reasoning'):
print("Reasoning:", response.reasoning)
print("Answer:", response.content)
Available Models:
- deepseek-chat: General chat model with function calling
- deepseek-reasoner: Reasoning model with Chain of Thought (no function calling)
Environment Variable: DEEPSEEK_API_KEY
Note: DeepSeek Reasoner does NOT support function calling, temperature, or sampling parameters.
Cerebras (Ultra-Fast Inference)¶
from cogent.models import CerebrasChat
# Llama 3.1 8B (default)
model = CerebrasChat(model="llama3.1-8b", api_key="csk-...")
# Llama 3.3 70B
model = CerebrasChat(model="llama-3.3-70b")
# Streaming
async for chunk in model.astream(messages):
print(chunk.content, end="")
Available Models:
- llama3.1-8b: Llama 3.1 8B (default) — alias cerebras, cerebras-llama
- llama-3.3-70b: Llama 3.3 70B — alias cerebras-70b
- qwen-3-32b: Qwen 3 32B — alias cerebras-qwen
- qwen-3-235b-a22b-instruct-2507: Qwen 3 235B MoE — alias cerebras-qwen-235b
- zai-glm-4.7: ZAI GLM 4.7 — alias cerebras-glm
- gpt-oss-120b: GPT OSS 120B (reasoning model) — alias cerebras-gpt-oss
Note: All Cerebras aliases use explicit cerebras:model routing. Use cerebras:gpt-oss-120b or the cerebras-gpt-oss alias — bare gpt-oss-* strings are NOT routed to OpenAI.
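For example, both explicit forms resolve to the Cerebras-hosted model (a sketch based on the alias and prefix rules above):
from cogent import Agent
from cogent.models import create_chat
llm = create_chat("cerebras:gpt-oss-120b")         # explicit provider prefix
agent = Agent("Helper", model="cerebras-gpt-oss")  # Tier 1 alias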
Environment Variable: CEREBRAS_API_KEY
Note: Cerebras provides industry-leading inference speed using Wafer-Scale Engine (WSE-3).
Cloudflare Workers AI¶
from cogent.models import CloudflareChat, CloudflareEmbedding
# Chat models
model = CloudflareChat(
model="@cf/meta/llama-3.3-70b-instruct",
account_id="...",
api_key="...",
)
# Embeddings
embeddings = CloudflareEmbedding(
model="@cf/baai/bge-base-en-v1.5",
account_id="...",
api_key="...",
)
Available Models: All Cloudflare Workers AI models with @cf/ prefix
Environment Variables: CLOUDFLARE_ACCOUNT_ID, CLOUDFLARE_API_TOKEN
Azure AI Foundry (GitHub Models)¶
from cogent.models.azure import AzureAIFoundryChat
# GitHub Models
model = AzureAIFoundryChat.from_github(
model="meta/Meta-Llama-3.1-8B-Instruct",
token=os.getenv("GITHUB_TOKEN"),
)
# Azure AI Foundry endpoint
model = AzureAIFoundryChat(
model="gpt-5.4-mini",
endpoint="https://...",
api_key="...",
)
Available via GitHub Models: Llama, Phi, Mistral, Cohere, and more
Environment Variable: GITHUB_TOKEN
OpenRouter¶
OpenRouter is a unified API gateway that routes to 200+ models from OpenAI, Anthropic, Google, Meta, Mistral, and many others. Use it to access any model through a single API key, enable automatic fallbacks, or add web search to any request.
from cogent.models.openrouter import OpenRouterChat
# Tier 1: String alias
agent = Agent("Helper", model="or-claude") # → anthropic/claude-3.5-sonnet
agent = Agent("Helper", model="or-gpt4o") # → openai/gpt-4o
agent = Agent("Helper", model="or-auto") # → OpenRouter auto-router
# Tier 2: Factory
from cogent.models import create_chat
llm = create_chat("openrouter", "mistralai/mistral-7b-instruct:free")
# Tier 3: Direct class (full control)
llm = OpenRouterChat(
model="anthropic/claude-sonnet-4",
temperature=0.7,
max_tokens=4096,
)
Environment Variable: OPENROUTER_API_KEY
Available Aliases: or-gpt4o, or-gpt4o-mini, or-claude, or-claude-haiku, or-gemini, or-llama, or-llama-free, or-mistral-free, or-auto
Provider Routing¶
Control which underlying providers OpenRouter routes to, and whether fallbacks are allowed:
llm = OpenRouterChat(
model="anthropic/claude-sonnet-4",
provider={
"order": ["Anthropic", "AWS Bedrock"], # try in order
"allow_fallbacks": False, # hard-fail if both unavailable
"require_parameters": True, # only providers supporting all params
},
)
Model Fallbacks¶
Supply a ranked list of fallback models. OpenRouter tries each in order if a model is unavailable or rate-limited:
llm = OpenRouterChat(
model="anthropic/claude-opus-4",
fallback_models=[
"anthropic/claude-sonnet-4",
"openai/gpt-4o",
],
)
Plugins¶
Plugins extend what any model can do without modifying your prompt.
Web search — attaches live search results to the request (see the sketch below).
Response healing — automatically retries malformed structured-output responses.
File parser and context compression are also available; pass any plugin object supported by the OpenRouter plugins API.
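A sketch of attaching the web plugin on the constructor, assuming the direct class accepts the same plugins field as the model_kwargs shorthand shown later in this section:
llm = OpenRouterChat(
    model="openai/gpt-4o",
    plugins=[{"id": "web", "max_results": 3}],  # attach live web search results
)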
Sampling Parameters¶
All standard and OpenRouter-specific sampling params are supported:
llm = OpenRouterChat(
model="meta-llama/llama-3.3-70b-instruct",
temperature=0.8,
top_p=0.9,
top_k=40,
frequency_penalty=0.2,
presence_penalty=0.1,
repetition_penalty=1.05,
min_p=0.05,
top_a=0.1,
seed=42,
stop=["END"],
)
Tool Choice¶
Pass tool_choice through bind_tools to force, prevent, or select specific tool use:
from cogent.tools import tool
@tool
def search(query: str) -> str:
"""Search the web."""
return f"Results for: {query}"
# Force the model to call a tool
bound = llm.bind_tools([search], tool_choice="required")
# Force a specific function
bound = llm.bind_tools([search], tool_choice={"type": "function", "function": {"name": "search"}})
# Disable tool use entirely
bound = llm.bind_tools([search], tool_choice="none")
Anthropic Beta Features¶
For Anthropic models routed through OpenRouter, pass beta feature flags via the anthropic_beta field:
llm = OpenRouterChat(
model="anthropic/claude-sonnet-4",
anthropic_beta="interleaved-thinking-2025-05-14",
)
# Multiple betas
llm = OpenRouterChat(
model="anthropic/claude-sonnet-4",
anthropic_beta=["interleaved-thinking-2025-05-14", "prompt-caching-2024-07-31"],
)
Reasoning Control¶
For reasoning/thinking models, control effort, token budget, and whether reasoning tokens appear in the response:
# Exclude reasoning tokens from the response (model still thinks internally)
llm = OpenRouterChat(model="deepseek/deepseek-v3.2", reasoning_exclude=True)
# Set a specific token budget (Anthropic models, some Qwen and Gemini 2.5)
llm = OpenRouterChat(model="anthropic/claude-3.7-sonnet", reasoning_max_tokens=2000)
# Control effort level (OpenAI o-series, Grok, Gemini 3)
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="low")
# Disable thinking entirely on a model that reasons by default
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="none")
# Combine: low effort, exclude from response
llm = OpenRouterChat(model="openai/o3-mini", reasoning_effort="low", reasoning_exclude=True)
| Parameter | Type | Description |
|---|---|---|
| reasoning_effort | "xhigh" \| "high" \| "medium" \| "low" \| "minimal" \| "none" | Effort level. "none" disables thinking entirely. Supported by OpenAI o-series, Grok, Gemini 3. |
| reasoning_max_tokens | int | Token budget for reasoning. Supported by Anthropic, some Qwen and Gemini 2.5. Cannot combine with reasoning_effort. |
| reasoning_exclude | bool | When True, the model thinks internally but reasoning tokens are not returned. Works with both other params. |
Cost and Cache Metadata¶
Every response carries cost and cache metadata in AIMessage.metadata:
response = await llm.ainvoke("Hello")
meta = response.metadata
print(meta.cost) # USD cost, e.g. 0.000123
print(meta.native_finish_reason) # raw finish reason from the provider
print(meta.usage.cached_tokens) # prompt tokens served from cache
print(meta.usage.cache_write_tokens) # tokens written to prompt cache
print(meta.usage.reasoning_tokens) # thinking/reasoning tokens used
model_kwargs Shorthand¶
When using the string or factory path, pass OpenRouter-specific options via model_kwargs:
agent = Agent(
name="Researcher",
model="openrouter:openai/gpt-4o",
model_kwargs={
"plugins": [{"id": "web", "max_results": 3}],
"provider": {"order": ["OpenAI"]},
"fallback_models": ["anthropic/claude-sonnet-4"],
"seed": 42,
},
)
OpenAI (continued)¶
if response.tool_calls:
    for call in response.tool_calls:
        print(f"Tool: {call['name']}, Args: {call['args']}")
Responses API (Beta):
OpenAI's Responses API is optimized for tool use and structured outputs. Use the use_responses_api=True parameter:
from cogent.models.openai import OpenAIChat
# Standard Chat Completions API (default)
model = OpenAIChat(model="gpt-5.4")
# Responses API (optimized for tool use)
model = OpenAIChat(model="gpt-5.4", use_responses_api=True)
# Works seamlessly with tools
bound = model.bind_tools([search_tool, calc_tool])
response = await bound.ainvoke(messages)
The Responses API provides better performance for multi-turn tool conversations while maintaining the same interface.
Azure OpenAI¶
Enterprise Azure deployments with Azure AD support:
from cogent.models.azure import AzureEntraAuth, AzureOpenAIChat, AzureOpenAIEmbedding
# With API key
model = AzureOpenAIChat(
azure_endpoint="https://your-resource.openai.azure.com",
deployment="gpt-5.4",
api_key="your-api-key",
api_version="2024-02-01",
)
# With Entra ID (DefaultAzureCredential)
model = AzureOpenAIChat(
azure_endpoint="https://your-resource.openai.azure.com",
deployment="gpt-5.4",
entra=AzureEntraAuth(method="default"), # Uses DefaultAzureCredential
)
# With Entra ID (Managed Identity)
# - System-assigned MI: omit client_id
# - User-assigned MI: set client_id (recommended when multiple identities exist)
model = AzureOpenAIChat(
azure_endpoint="https://your-resource.openai.azure.com",
deployment="gpt-5.4",
entra=AzureEntraAuth(
method="managed_identity",
client_id="<USER_ASSIGNED_MANAGED_IDENTITY_CLIENT_ID>",
),
)
# Embeddings
embeddings = AzureOpenAIEmbedding(
azure_endpoint="https://your-resource.openai.azure.com",
deployment="text-embedding-ada-002",
entra=AzureEntraAuth(method="default"),
)
result = await embeddings.aembed(["Document text"])
Environment variables:
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_VERSION=2024-02-01
AZURE_OPENAI_DEPLOYMENT=gpt-5.4
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002
# Auth selection
AZURE_OPENAI_AUTH_TYPE=managed_identity # api_key | default | managed_identity | client_secret
# API key auth
# AZURE_OPENAI_API_KEY=your-api-key
# Managed identity auth (user-assigned MI)
# AZURE_OPENAI_CLIENT_ID=...
# Service principal auth (client secret)
# AZURE_OPENAI_TENANT_ID=...
# AZURE_OPENAI_CLIENT_ID=...
# AZURE_OPENAI_CLIENT_SECRET=...
Anthropic¶
Claude models with native SDK:
from cogent.models.anthropic import AnthropicChat
model = AnthropicChat(
model="claude-sonnet-4-6",
max_tokens=4096,
api_key="sk-ant-...", # Or ANTHROPIC_API_KEY env var
)
response = await model.ainvoke([
{"role": "user", "content": "Explain quantum computing"}
])
Claude-specific features:
# System message
response = await model.ainvoke(
messages=[{"role": "user", "content": "Hello"}],
system="You are a helpful coding assistant.",
)
# With tools
model = AnthropicChat(model="claude-sonnet-4-6")
bound = model.bind_tools([search_tool])
Groq¶
Ultra-fast inference for supported models:
from cogent.models.groq import GroqChat
model = GroqChat(
model="llama-3.3-70b-versatile",
api_key="gsk_...", # Or GROQ_API_KEY env var
)
response = await model.ainvoke([
{"role": "user", "content": "Write a haiku about coding"}
])
Available models:
| Model | Description |
|---|---|
| llama-3.3-70b-versatile | Llama 3.3 70B |
| llama-3.1-8b-instant | Fast Llama 3.1 8B |
| mixtral-8x7b-32768 | Mixtral 8x7B |
| gemma2-9b-it | Gemma 2 9B |
Responses API (Beta):
Groq also supports OpenAI's Responses API for optimized tool use:
from cogent.models.groq import GroqChat
# Standard Chat Completions API (default)
model = GroqChat(model="llama-3.3-70b-versatile")
# Responses API (optimized for tool use)
model = GroqChat(model="llama-3.3-70b-versatile", use_responses_api=True)
# Works seamlessly with tools
bound = model.bind_tools([search_tool])
response = await bound.ainvoke(messages)
Google Gemini¶
Google's Gemini models:
from cogent.models.gemini import GeminiChat, GeminiEmbedding
model = GeminiChat(
model="gemini-2.5-flash", # Default (upgraded from 2.0)
api_key="...", # Or GOOGLE_API_KEY env var
)
response = await model.ainvoke([
{"role": "user", "content": "What is the capital of France?"}
])
# Gemini 3 Preview (Not Production Ready)
model = GeminiChat(model="gemini-3-flash-preview")
# ⚠️ WARNING: Preview models may have breaking changes or be removed
# Native Thinking (opt-in for cost efficiency)
model = GeminiChat(
model="gemini-2.5-flash",
thinking_budget=16384, # Enable thinking (default: 0 = disabled)
)
# Embeddings
embeddings = GeminiEmbedding(model="text-embedding-004")
Available Models:
- gemini-2.5-pro, gemini-2.5-flash (Stable, 1M context, thinking support)
- gemini-2.0-flash (Stable)
- gemini-3-pro-preview, gemini-3-flash-preview ⚠️ (Preview only, thinking support)
Native Thinking:
- Default: thinking_budget=0 (disabled) - cost-efficient for most tasks
- Enable: Set thinking_budget > 0 (recommended: 8192-16384 tokens)
- Cost: Thinking tokens are billed separately - only enable when needed
- Use Cases: Complex reasoning, multi-step problems, strategic planning
Pass via Agent:
from cogent import Agent
# Enable thinking for this agent
agent = Agent(
name="Thinker",
model="gemini-2.5-flash",
model_kwargs={"thinking_budget": 16384},
)
Ollama¶
Local models via Ollama:
from cogent.models.ollama import OllamaChat, OllamaEmbedding
# Chat (requires `ollama run llama3.2`)
model = OllamaChat(
model="llama3.2",
base_url="http://localhost:11434",
)
response = await model.ainvoke([
{"role": "user", "content": "Hello!"}
])
# Embeddings
embeddings = OllamaEmbedding(model="nomic-embed-text")
xAI (Grok)¶
Grok models with reasoning capabilities:
from cogent.models.xai import XAIChat
# Latest flagship (2M context, reasoning)
model = XAIChat(
model="grok-4.20",
api_key="...", # Or XAI_API_KEY env var
)
# Non-reasoning variant (same price, no internal reasoning)
model = XAIChat(model="grok-4.20-0309-non-reasoning")
# Fast agentic model with reasoning (2M context)
model = XAIChat(model="grok-4-1-fast-reasoning")
# Fast without reasoning (cheaper for high-volume)
model = XAIChat(model="grok-4-1-fast-non-reasoning")
# With reasoning effort control (grok-3-mini only)
model = XAIChat(model="grok-3-mini", reasoning_effort="high")
# or use with_reasoning()
model = XAIChat(model="grok-3-mini").with_reasoning("high")
response = await model.ainvoke([
{"role": "user", "content": "What is 101 * 3?"}
])
# Reasoning tokens tracked in metadata
if response.metadata.tokens:
print(f"Reasoning tokens: {response.metadata.tokens.reasoning_tokens}")
Available models:
| Model | Alias | Context | Reasoning | Description |
|---|---|---|---|---|
| grok-4.20-0309-reasoning | grok, grok-4.20, grok-4.20-reasoning | 2M | ✅ | Latest flagship — fast + reasoning |
| grok-4.20-0309-non-reasoning | grok-4.20-non-reasoning | 2M | ❌ | Latest flagship — non-reasoning variant |
| grok-4.20-multi-agent-0309 | — | 2M | ✅ | Multi-agent optimised variant |
| grok-4-0709 | grok-4 | 256K | ✅ | Grok 4 stable snapshot |
| grok-4-1-fast-reasoning | grok-fast-reasoning | 2M | ✅ | Fast agentic with explicit reasoning |
| grok-4-1-fast-non-reasoning | grok-fast, grok-fast-non-reasoning | 2M | ❌ | Fast agentic without reasoning |
| grok-3-mini | — | — | configurable | Supports reasoning_effort (low/high) |
| grok-2-vision-1212 | grok-vision | — | ❌ | Image understanding |
| grok-code-fast-1 | grok-code | — | ❌ | Code-optimized |
Features:
- Function/tool calling (all models)
- Structured outputs (JSON mode)
- Reasoning (all grok-4.20 and grok-4 models; grok-3-mini via reasoning_effort)
- Vision (grok-2-vision-1212)
- 2M context window (grok-4.20 and grok-4-1-fast models)
DeepSeek¶
DeepSeek models with Chain of Thought reasoning:
from cogent.models.deepseek import DeepSeekChat
# Standard chat model
model = DeepSeekChat(
model="deepseek-chat",
api_key="...", # Or DEEPSEEK_API_KEY env var
)
# Reasoning model (exposes Chain of Thought)
model = DeepSeekChat(model="deepseek-reasoner")
response = await model.ainvoke("9.11 and 9.8, which is greater?")
# Access reasoning content (Chain of Thought)
if hasattr(response, 'reasoning'):
print("Reasoning:", response.reasoning)
print("Answer:", response.content)
Available models:
| Model | Tools | Description |
|---|---|---|
| deepseek-chat | ✅ | General chat model with tool support |
| deepseek-reasoner | ❌ | Reasoning model with CoT (no tools) |
Note: deepseek-reasoner does NOT support:
- Function calling/tools
- temperature, top_p, presence_penalty, frequency_penalty
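One way to encode this constraint when picking a model (a hypothetical helper, not part of the library):
from cogent.models.deepseek import DeepSeekChat
def pick_deepseek(needs_tools: bool) -> DeepSeekChat:
    # deepseek-reasoner rejects tools and sampling params, so fall back to deepseek-chat
    return DeepSeekChat(model="deepseek-chat" if needs_tools else "deepseek-reasoner")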
Custom Endpoints¶
Any OpenAI-compatible endpoint (vLLM, Together AI, etc.):
from cogent.models.custom import CustomChat, CustomEmbedding
# vLLM
model = CustomChat(
base_url="http://localhost:8000/v1",
model="meta-llama/Llama-3.2-3B-Instruct",
)
# Together AI
model = CustomChat(
base_url="https://api.together.xyz/v1",
model="meta-llama/Llama-3-70b-chat-hf",
api_key="...",
)
# Custom embeddings
embeddings = CustomEmbedding(
base_url="http://localhost:8000/v1",
model="BAAI/bge-small-en-v1.5",
)
Factory Function¶
Create models dynamically by provider:
from cogent.models import create_chat, create_embedding
# OpenAI
model = create_chat("openai", model="gpt-5.4")
# Azure
model = create_chat(
"azure",
deployment="gpt-5.4",
azure_endpoint="https://your-resource.openai.azure.com",
entra=AzureEntraAuth(method="default"),
)
# Anthropic
model = create_chat("anthropic", model="claude-sonnet-4-20250514")
# Groq
model = create_chat("groq", model="llama-3.3-70b-versatile")
# Gemini
model = create_chat("gemini", model="gemini-2.0-flash")
# Ollama
model = create_chat("ollama", model="llama3.2")
# xAI (Grok)
model = create_chat("xai", model="grok-4-1-fast")
# DeepSeek
model = create_chat("deepseek", model="deepseek-chat")
model = create_chat("deepseek", model="deepseek-reasoner") # Reasoning model
# Custom
model = create_chat(
"custom",
base_url="http://localhost:8000/v1",
model="my-model",
)
Mock Models¶
For testing without API calls:
from cogent.models import MockChatModel, MockEmbedding
# Predictable responses
model = MockChatModel(responses=["Hello!", "How can I help?"])
response = await model.ainvoke([{"role": "user", "content": "Hi"}])
print(response.content) # "Hello!"
response = await model.ainvoke([{"role": "user", "content": "Help"}])
print(response.content) # "How can I help?"
# Mock embeddings
embeddings = MockEmbedding(dimension=384)
vectors = await embeddings.embed_texts(["test"])
print(len(vectors[0])) # 384
Streaming¶
All models support streaming with complete metadata:
from cogent.models import ChatModel
model = ChatModel(model="gpt-5.4")
async for chunk in model.astream([
{"role": "user", "content": "Write a story"}
]):
print(chunk.content, end="", flush=True)
# Access metadata in all chunks
if chunk.metadata:
print(f"\nModel: {chunk.metadata.model}")
print(f"Response ID: {chunk.metadata.response_id}")
# Token usage available in final chunk
if chunk.metadata.tokens:
print(f"Tokens: {chunk.metadata.tokens.total_tokens}")
print(f"Finish: {chunk.metadata.finish_reason}")
Streaming Metadata¶
All 10 chat providers return complete metadata during streaming:
| Provider | Model | Finish Reason | Token Usage | Notes |
|---|---|---|---|---|
| OpenAI | ✅ | ✅ | ✅ | Uses stream_options={"include_usage": True} |
| Gemini | ✅ | ✅ | ✅ | Extracts from usage_metadata |
| Groq | ✅ | ✅ | ✅ | Compatible with OpenAI pattern |
| Mistral | ✅ | ✅ | ✅ | Metadata accumulation |
| Cohere | ✅ | ✅ | ✅ | Event-based streaming (message-end) |
| Anthropic | ✅ | ✅ | ✅ | Snapshot-based metadata |
| Cloudflare | ✅ | ✅ | ✅ | Stream options support |
| Ollama | ✅ | ✅ | ✅ | Local model metadata |
| Azure OpenAI | ✅ | ✅ | ✅ | Stream options support |
| Azure AI Foundry / GitHub | ✅ | ✅ | ✅ | Stream options via model_extras |
Metadata Structure:
@dataclass
class MessageMetadata:
id: str | None # Response ID
timestamp: str | None # ISO 8601 timestamp
model: str | None # Model name/version
tokens: TokenUsage | None # Token counts
finish_reason: str | None # stop, length, error
response_id: str | None # Provider response ID
duration: float | None # Request duration (ms)
correlation_id: str | None # For tracing
@dataclass
class TokenUsage:
prompt_tokens: int
completion_tokens: int
total_tokens: int
reasoning_tokens: int | None # Reasoning tokens (if available)
Note: reasoning_tokens is populated by models that support reasoning/thinking (o1/o3, deepseek-reasoner, Claude extended thinking, Gemini thinking, Grok).
Streaming Pattern:
- Content chunks — Include partial metadata (model, response_id, timestamp)
- Final chunk — Empty content with complete metadata (finish_reason, tokens)
# Example streaming flow
async for chunk in model.astream(messages):
# Chunks 1-N: Content with partial metadata
if chunk.content:
print(chunk.content, end="")
# Final chunk: Complete metadata
if chunk.metadata and chunk.metadata.finish_reason:
print(f"\n\nCompleted with {chunk.metadata.tokens.total_tokens} tokens")
Embeddings¶
All 9 embedding providers support a standardized API with rich metadata and flexible usage patterns:
from cogent.models import OpenAIEmbedding, GeminiEmbedding, OllamaEmbedding
embedder = OpenAIEmbedding(model="text-embedding-3-small")
# Primary API: embed() / aembed() - Returns EmbeddingResult with full metadata
result = await embedder.aembed(["Hello world", "Cogent"])
print(result.embeddings) # list[list[float]] - the actual vectors
print(result.metadata.model) # "text-embedding-3-small"
print(result.metadata.tokens) # TokenUsage(prompt=4, completion=0, total=4)
print(result.metadata.dimensions) # 1536
print(result.metadata.duration) # 0.181 seconds
print(result.metadata.num_texts) # 2
# Convenience: embed_one() / aembed_one() - Single text, returns vector only
vector = await embedder.aembed_one("Single text")
print(len(vector)) # 1536
# Sync versions
result = embedder.embed(["Text 1", "Text 2"])
vector = embedder.embed_one("Single text")
# VectorStore protocol: embed_texts() / embed_query() - Async, no metadata
vectors = await embedder.embed_texts(["Doc1", "Doc2"]) # list[list[float]]
query_vec = await embedder.embed_query("Search query") # list[float]
Standardized API Summary:
| Method | Input | Returns | Async | Metadata |
|---|---|---|---|---|
| embed(texts) | list[str] | EmbeddingResult | ❌ | ✅ |
| aembed(texts) | list[str] | EmbeddingResult | ✅ | ✅ |
| embed_one(text) | str | list[float] | ❌ | ❌ |
| aembed_one(text) | str | list[float] | ✅ | ❌ |
| embed_texts(texts) | list[str] | list[list[float]] | ✅ | ❌ |
| embed_query(text) | str | list[float] | ✅ | ❌ |
| dimension | property | int | - | - |
Embedding Metadata¶
All 9 embedding providers return complete metadata:
| Provider | Token Usage | Notes |
|---|---|---|
| OpenAI | ✅ | Extracts from response.usage.prompt_tokens |
| Cohere | ✅ | Extracts from response.meta.billed_units.input_tokens |
| Mistral | ✅ | Uses OpenAI SDK, provides token counts |
| Azure OpenAI | ✅ | Extracts from response.usage like OpenAI |
| Gemini | ❌ | API doesn't provide token counts for embeddings |
| Ollama | ❌ | Local embeddings, no token tracking |
| Cloudflare | ❌ | API doesn't track tokens |
| Mock | ❌ | Test embedding, no real tokens |
| Custom | ⚡ | Conditional - depends on underlying API |
Metadata Structure:
@dataclass
class EmbeddingMetadata:
id: str # Unique request ID
timestamp: str # ISO 8601 timestamp
model: str | None # Model name/version
tokens: TokenUsage | None # Token usage (if available)
duration: float # Request duration (seconds)
dimensions: int | None # Vector dimensions
num_texts: int # Number of texts embedded
@dataclass
class EmbeddingResult:
embeddings: list[list[float]] # The actual embedding vectors
metadata: EmbeddingMetadata # Complete metadata
Usage Examples:
# Use case 1: Need metadata for cost tracking
result = await embedder.aembed(["Text 1", "Text 2"])
vectors = result.embeddings
tokens = result.metadata.tokens # Track token usage for billing
duration = result.metadata.duration # Monitor performance
# Use case 2: Simple embedding without metadata
vector = await embedder.aembed_one("Query text") # Just returns the vector
# Use case 3: VectorStore integration (protocol compliance)
# These methods are used internally by VectorStore
vectors = await embedder.embed_texts(["Document 1", "Document 2"])
query_vec = await embedder.embed_query("Search query")
# Use case 4: Sync batch embedding
result = embedder.embed(large_batch) # Sync version for compatibility
Observability Benefits:
- Cost tracking — Monitor token usage across providers
- Performance — Track request duration and batch sizes
- Debugging — Trace requests with unique IDs and timestamps
- Model versioning — Know which embedding model version was used
- Capacity planning — Understand dimensions and text counts
Thinking & Reasoning¶
Several providers offer "reasoning" or "thinking" models that expose their chain-of-thought process. Cogent provides unified access to these capabilities.
Feature Comparison¶
| Provider | Models | Control Parameter | Access Reasoning | Structured Output |
|---|---|---|---|---|
| Anthropic | claude-sonnet-4, claude-opus-4 | thinking_budget | msg.thinking | ✅ via thinking |
| OpenAI | o1, o3, o4-mini | reasoning_effort | Hidden | ✅ |
| Gemini | gemini-2.5-* | thinking_budget | msg.thinking | ✅ |
| xAI | grok-3-mini | reasoning_effort | Hidden | ✅ |
| DeepSeek | deepseek-reasoner | Always on | msg.reasoning | ❌ |
Anthropic Extended Thinking¶
Claude models support extended thinking with configurable token budgets:
from cogent.models.anthropic import AnthropicChat
# Enable extended thinking with budget
model = AnthropicChat(
model="claude-sonnet-4-20250514",
thinking={"type": "enabled", "budget_tokens": 10000},
)
response = await model.ainvoke([
{"role": "user", "content": "Solve this step by step: 15! / (12! * 3!)"}
])
# Access thinking content
if response.thinking:
print("Thinking:", response.thinking)
print("Answer:", response.content)
Using ReasoningConfig:
from cogent.models.anthropic import AnthropicChat
from cogent.reasoning import ReasoningConfig
# Create config
config = ReasoningConfig(budget_tokens=10000)
# Apply to model
model = AnthropicChat(model="claude-sonnet-4-20250514")
thinking_model = model.with_reasoning(config)
response = await thinking_model.ainvoke(messages)
Features:
- Thinking exposed in msg.thinking attribute
- Works with streaming (thinking streamed first)
- Compatible with with_structured_output() via thinking
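A sketch combining the two, assuming extended thinking and with_structured_output() chain as described above:
from pydantic import BaseModel
from cogent.models.anthropic import AnthropicChat
class Answer(BaseModel):
    value: int
model = AnthropicChat(
    model="claude-sonnet-4-20250514",
    thinking={"type": "enabled", "budget_tokens": 10000},
).with_structured_output(Answer)
response = await model.ainvoke([
    {"role": "user", "content": "Compute 15! / (12! * 3!) and return only the value"}
])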
OpenAI Reasoning Models¶
OpenAI's o-series models (o1, o3, o4-mini) have built-in reasoning:
from cogent.models.openai import OpenAIChat
# Reasoning effort: "low", "medium", "high"
model = OpenAIChat(
model="o4-mini",
reasoning_effort="high", # More thorough reasoning
)
response = await model.ainvoke([
{"role": "user", "content": "Prove that sqrt(2) is irrational"}
])
Using ReasoningConfig:
from cogent.models.openai import OpenAIChat
from cogent.reasoning import ReasoningConfig
model = OpenAIChat(model="o4-mini")
reasoning_model = model.with_reasoning(ReasoningConfig(effort="high"))
Notes:
- Reasoning is internal (not exposed in response)
- No thinking budget - use reasoning_effort instead
- Supports structured output with json_schema response format
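A sketch pairing reasoning_effort with structured output, assuming the two compose as described:
from pydantic import BaseModel
from cogent.models.openai import OpenAIChat
class Proof(BaseModel):
    claim: str
    steps: list[str]
model = OpenAIChat(model="o4-mini", reasoning_effort="high").with_structured_output(
    Proof, method="json_schema"
)
response = await model.ainvoke([
    {"role": "user", "content": "Prove that sqrt(2) is irrational"}
])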
Gemini Thinking¶
Gemini 2.5 and 3.0 models support thinking with budget control:
from cogent.models.gemini import GeminiChat
model = GeminiChat(
model="gemini-2.5-flash-preview-05-20", # or gemini-3-flash-preview
thinking_budget=8000, # Token budget for thinking
)
response = await model.ainvoke([
{"role": "user", "content": "What's the optimal strategy in this game?"}
])
# Access thinking
if response.thinking:
print("Thought process:", response.thinking)
Using ReasoningConfig:
from cogent.models.gemini import GeminiChat
from cogent.reasoning import ReasoningConfig
model = GeminiChat(model="gemini-2.5-flash-preview-05-20")
thinking_model = model.with_reasoning(ReasoningConfig(budget_tokens=8000))
xAI Reasoning¶
Grok 4.20 and grok-4 are always-on reasoning models. grok-3-mini supports configurable reasoning effort:
from cogent.models.xai import XAIChat
# grok-4.20 is a reasoning model — no parameters needed
model = XAIChat(model="grok-4.20")
response = await model.ainvoke([
{"role": "user", "content": "Explain the halting problem"}
])
# Use non-reasoning variant to skip reasoning (faster/cheaper)
model = XAIChat(model="grok-4.20-0309-non-reasoning")
# grok-3-mini: configurable reasoning effort
model = XAIChat(
model="grok-3-mini",
reasoning_effort="high", # "low" or "high"
)
Using with_reasoning():
from cogent.models.xai import XAIChat
model = XAIChat(model="grok-3-mini")
reasoning_model = model.with_reasoning(effort="high")
Notes:
- grok-4.20, grok-4, grok-4-1-fast-reasoning: reasoning always on, no reasoning_effort parameter
- grok-4.20-0309-non-reasoning, grok-4-1-fast-non-reasoning: reasoning disabled
- grok-3-mini supports reasoning_effort ("low" or "high")
- presencePenalty, frequencyPenalty, and stop are not supported by grok-4 reasoning models
- Reasoning is internal (not exposed in response)
DeepSeek Reasoner¶
DeepSeek's reasoner model exposes its chain-of-thought:
from cogent.models.deepseek import DeepSeekChat
model = DeepSeekChat(model="deepseek-reasoner")
response = await model.ainvoke([
{"role": "user", "content": "Prove the Pythagorean theorem"}
])
# Access reasoning content
if response.reasoning:
print("Chain of thought:", response.reasoning)
print("Final answer:", response.content)
Streaming reasoning:
async for chunk in model.astream(messages):
if chunk.reasoning:
print(f"[Reasoning] {chunk.reasoning}", end="", flush=True)
if chunk.content:
print(chunk.content, end="", flush=True)
Notes:
- Reasoning always enabled for deepseek-reasoner
- Does NOT support tools or structured output
- Use deepseek-chat for non-reasoning use cases
ReasoningConfig¶
Unified configuration for reasoning across providers:
from cogent.reasoning import ReasoningConfig
# Token budget (Anthropic, Gemini)
config = ReasoningConfig(budget_tokens=10000)
# Effort level (OpenAI, xAI)
config = ReasoningConfig(effort="high")
# Both (uses appropriate one per provider)
config = ReasoningConfig(budget_tokens=10000, effort="high")
Provider mapping:
| Provider | budget_tokens | effort |
|---|---|---|
| Anthropic | ✅ thinking.budget_tokens | ❌ |
| OpenAI | ❌ | ✅ reasoning_effort |
| Gemini | ✅ thinking_budget | ❌ |
| xAI | ❌ | ✅ reasoning_effort |
| DeepSeek | ❌ (always on) | ❌ |
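Because the config carries both fields, one object can be reused across providers; each with_reasoning() call picks the field it supports (per the mapping above):
from cogent.reasoning import ReasoningConfig
from cogent.models.anthropic import AnthropicChat
from cogent.models.openai import OpenAIChat
config = ReasoningConfig(budget_tokens=10000, effort="high")
claude = AnthropicChat(model="claude-sonnet-4-20250514").with_reasoning(config)  # uses budget_tokens
o4 = OpenAIChat(model="o4-mini").with_reasoning(config)  # uses effort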
Structured Output¶
Chat models support structured output via with_structured_output() for type-safe JSON responses.
Provider Support¶
| Provider | Method | Strict Mode |
|---|---|---|
| OpenAI | json_schema | ✅ |
| Anthropic | Tool-based | ✅ |
| Gemini | response_schema | ✅ |
| Groq | json_mode | ❌ |
| xAI | json_schema | ✅ |
| DeepSeek | deepseek-chat only | ❌ |
| Ollama | json_mode | ❌ |
Basic Usage¶
from pydantic import BaseModel, Field
from cogent.models.openai import OpenAIChat
class Person(BaseModel):
name: str = Field(description="Full name")
age: int = Field(description="Age in years")
# Configure model for structured output
llm = OpenAIChat(model="gpt-5.4").with_structured_output(Person)
response = await llm.ainvoke([
{"role": "user", "content": "Extract: John Doe is 30 years old"}
])
# Response content is JSON matching schema
import json
data = json.loads(response.content)
print(data) # {"name": "John Doe", "age": 30}
Schema Types¶
from dataclasses import dataclass
from typing import TypedDict
# Pydantic (recommended)
class PersonPydantic(BaseModel):
name: str
age: int
# Dataclass
@dataclass
class PersonDataclass:
name: str
age: int
# TypedDict
class PersonTypedDict(TypedDict):
name: str
age: int
# JSON Schema dict
person_schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
},
"required": ["name", "age"]
}
# All work with with_structured_output()
llm.with_structured_output(PersonPydantic)
llm.with_structured_output(PersonDataclass)
llm.with_structured_output(PersonTypedDict)
llm.with_structured_output(person_schema)
Methods¶
# json_schema (default, strict typing)
llm.with_structured_output(Person, method="json_schema")
# json_mode (less strict, more compatible)
llm.with_structured_output(Person, method="json_mode")
With Tools¶
Structured output and tools can be combined (the model decides when to use each):
@tool
def get_weather(location: str) -> str:
"""Get weather for a location."""
return f"Sunny in {location}"
llm = OpenAIChat(model="gpt-5.4")
llm = llm.bind_tools([get_weather])
llm = llm.with_structured_output(Person)
Agent-Level Structured Output¶
For most use cases, use the Agent's output parameter instead:
from cogent import Agent
agent = Agent(
name="Extractor",
model="gpt4",
output=Person, # Automatic validation and retry
)
result = await agent.run("Extract: John Doe, 30 years old")
print(result.data) # Person(name="John Doe", age=30)
See Agent Documentation for more details.
Base Classes¶
BaseChatModel¶
Protocol for all chat models:
from cogent.models.base import BaseChatModel
class BaseChatModel(Protocol):
async def ainvoke(
self,
messages: list[dict],
**kwargs,
) -> AIMessage: ...
async def astream(
self,
messages: list[dict],
**kwargs,
) -> AsyncIterator[AIMessage]: ...
def bind_tools(
self,
tools: list[BaseTool],
) -> BaseChatModel: ...
AIMessage¶
Response type from chat models:
from cogent.models.base import AIMessage
@dataclass
class AIMessage:
content: str
tool_calls: list[dict] | None = None
usage: dict | None = None # {"input_tokens": ..., "output_tokens": ...}
raw: Any = None # Original provider response
BaseEmbedding¶
Standardized protocol for all embedding models:
from cogent.models.base import BaseEmbedding
from cogent.core.messages import EmbeddingResult
class BaseEmbedding(ABC):
# Primary methods - return full metadata
@abstractmethod
def embed(self, texts: list[str]) -> EmbeddingResult:
"""Embed texts synchronously with metadata."""
...
@abstractmethod
async def aembed(self, texts: list[str]) -> EmbeddingResult:
"""Embed texts asynchronously with metadata."""
...
# Convenience methods - single text, no metadata
def embed_one(self, text: str) -> list[float]:
"""Embed single text synchronously, returns vector only."""
...
async def aembed_one(self, text: str) -> list[float]:
"""Embed single text asynchronously, returns vector only."""
...
# VectorStore protocol - async, no metadata
async def embed_texts(self, texts: list[str]) -> list[list[float]]:
"""Embed texts for VectorStore (async, returns vectors only)."""
...
async def embed_query(self, text: str) -> list[float]:
"""Embed query for VectorStore (async, returns vector only)."""
...
@property
def dimension(self) -> int:
"""Return embedding dimension."""
...
All 9 providers implement this API:
- OpenAIEmbedding
- AzureOpenAIEmbedding
- OllamaEmbedding
- CohereEmbedding
- GeminiEmbedding
- CloudflareEmbedding
- MistralEmbedding
- CustomEmbedding
- MockEmbedding
API Reference¶
ChatModel Aliases¶
| Alias | Actual Class |
|---|---|
| ChatModel | OpenAIChat |
| EmbeddingModel | OpenAIEmbedding |
Provider Classes¶
| Provider | Chat Class | Embedding Class |
|---|---|---|
| OpenAI | OpenAIChat | OpenAIEmbedding |
| Azure | AzureOpenAIChat | AzureOpenAIEmbedding |
| Anthropic | AnthropicChat | - |
| Groq | GroqChat | - |
| Gemini | GeminiChat | GeminiEmbedding |
| xAI | XAIChat | - |
| DeepSeek | DeepSeekChat | - |
| Ollama | OllamaChat | OllamaEmbedding |
| OpenRouter | OpenRouterChat | - |
| Custom | CustomChat | CustomEmbedding |
Factory Functions¶
| Function | Description |
|---|---|
| create_chat(provider, **kwargs) | Create chat model for any provider |
| create_embedding(provider, **kwargs) | Create embedding model for any provider |