Model Providers
Configure and select LLM providers per agent, session, or call.
Fabric Harness uses an explicit provider/model-id reference everywhere a model is selected. There is no implicit "default OpenAI" — you opt into a provider by configuring credentials and naming the model.
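To illustrate what an explicit reference carries, here is a minimal sketch of splitting one into its two parts. `parseModelRef` and `ModelRef` are hypothetical names, not Fabric Harness APIs, and splitting at the first slash is an assumption inferred from references whose model id itself contains slashes.

```typescript
// Hypothetical helper for illustration; not part of the Fabric Harness API.
interface ModelRef {
  provider: string;
  modelId: string;
}

function parseModelRef(ref: string): ModelRef {
  // Assumption: only the FIRST slash separates provider from model id,
  // so model ids that contain slashes stay intact.
  const slash = ref.indexOf('/');
  if (slash <= 0 || slash === ref.length - 1) {
    throw new Error(`Expected "provider/model-id", got "${ref}"`);
  }
  return { provider: ref.slice(0, slash), modelId: ref.slice(slash + 1) };
}

console.log(parseModelRef('openai/gpt-4o'));
console.log(parseModelRef('cloudflare/@cf/openai/gpt-oss-120b'));
```

Rejecting a bare model id mirrors the "no implicit default provider" rule: a reference without a provider part is treated as an error rather than guessed.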
Reference format
provider/model-id
Examples:
openai/gpt-4o
anthropic/claude-sonnet-4-6
gemini/gemini-2.5-pro
bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0
Setting credentials
Put provider keys in a single .env.local at the repo or workspace root. Fabric Harness auto-loads .env and .env.local files, and variables already set in the shell still win:
cp .env.example .env.local
# OPENAI_API_KEY=...
# ANTHROPIC_API_KEY=...
# AZURE_OPENAI_ENDPOINT=https://....openai.azure.com
# AZURE_OPENAI_API_KEY=...
Use explicit --env <file> only for overrides. Never paste API keys into source files or session artifacts.
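The precedence described above (auto-loaded files first, shell environment on top) can be sketched as a simple layered merge. `mergeEnvSources` is a hypothetical helper, and the assumption that .env.local overrides .env is a common convention rather than something this page states.

```typescript
type EnvMap = Record<string, string | undefined>;

// Hypothetical sketch: later sources override earlier ones.
// Assumed order: .env, then .env.local, then the shell environment.
function mergeEnvSources(dotEnv: EnvMap, dotEnvLocal: EnvMap, shellEnv: EnvMap): Record<string, string> {
  const merged: Record<string, string> = {};
  for (const source of [dotEnv, dotEnvLocal, shellEnv]) {
    for (const key of Object.keys(source)) {
      const value = source[key];
      // Unset variables never override a value from an earlier source.
      if (value !== undefined) merged[key] = value;
    }
  }
  return merged;
}

const effectiveEnv = mergeEnvSources(
  { OPENAI_API_KEY: 'from-dotenv', FABRIC_MODEL: 'from-dotenv' },
  { OPENAI_API_KEY: 'from-local' },
  { FABRIC_MODEL: 'from-shell' },
);
console.log(effectiveEnv);
```

In this sketch the key comes from .env.local while the model comes from the shell, which is the "shell env still wins" behavior in miniature.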
Selecting the model
The first non-empty value wins, in this order:
- CLI flag: fh run ask --model openai/gpt-5.5
- Environment: FABRIC_MODEL=openai/gpt-5.5
- .fabricharness/config.ts → run.model or agent.model
- Agent-declared default: agent({ model: 'openai/gpt-5.5' })
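The lookup order above amounts to a "first non-empty value" scan. The sketch below is illustrative only: `pickModel` and `ModelSources` are hypothetical names, and throwing when nothing is configured is an assumption consistent with there being no implicit default provider.

```typescript
// Hypothetical shape collecting the four sources in the order listed above.
interface ModelSources {
  cliFlag?: string;      // fh run ask --model ...
  envVar?: string;       // FABRIC_MODEL
  configFile?: string;   // run.model or agent.model in .fabricharness/config.ts
  agentDefault?: string; // agent({ model: ... })
}

function pickModel(sources: ModelSources): string {
  const ordered = [sources.cliFlag, sources.envVar, sources.configFile, sources.agentDefault];
  // Treat undefined and blank strings as "empty" and skip them.
  const chosen = ordered.find((value) => value !== undefined && value.trim() !== '');
  if (chosen === undefined) {
    throw new Error('No model configured; name one explicitly as provider/model-id');
  }
  return chosen;
}

console.log(pickModel({ envVar: 'openai/gpt-4o', agentDefault: 'anthropic/claude-sonnet-4-6' }));
```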
Per-call override:
await session.prompt('Summarize', { model: 'openai/gpt-5.5' });
Mock provider
For local development and tests, openai/gpt-5.5 returns deterministic stub responses. It honors the typed-result schema where possible.
export default agent({
// ...
model: process.env.FABRIC_MODEL ?? 'openai/gpt-5.5',
});
Provider env names
Fabric Harness knows the standard env names for common providers:
- OPENAI_API_KEY
- ANTHROPIC_API_KEY
- OPENROUTER_API_KEY
- GEMINI_API_KEY
- GOOGLE_API_KEY
- AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT
- GROQ_API_KEY
- MISTRAL_API_KEY
- COHERE_API_KEY
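A preflight check against these names might look like the sketch below. It is not how Fabric Harness resolves credentials internally: `missingProviderEnv` is a hypothetical helper, the table covers only three providers, and the `azure` prefix is an assumed provider name.

```typescript
// Hypothetical mapping from provider prefix to required env names.
// The 'azure' prefix is an assumption; only the env names come from the list above.
const REQUIRED_ENV: Record<string, string[]> = {
  openai: ['OPENAI_API_KEY'],
  anthropic: ['ANTHROPIC_API_KEY'],
  azure: ['AZURE_OPENAI_API_KEY', 'AZURE_OPENAI_ENDPOINT'],
};

// Returns the env names that are required for the provider but unset or empty.
function missingProviderEnv(provider: string, env: Record<string, string | undefined>): string[] {
  const required = REQUIRED_ENV[provider] ?? [];
  return required.filter((name) => !env[name]);
}

console.log(missingProviderEnv('azure', { AZURE_OPENAI_API_KEY: 'set' }));
```

Passing the environment in as a plain object keeps the check easy to test; in a Node process you would typically pass `process.env`.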
Cloudflare Workers AI binding
When deploying to Cloudflare Workers with fh build --target cloudflare, you can route inference through the platform binding (env.AI.run()) instead of HTTP — no API tokens, no egress, runs at the edge.
import { CloudflareWorkersAIModelProvider } from '@fabric-harness/cloudflare/workers-ai';
import { init } from '@fabric-harness/sdk';
export default {
async fetch(request: Request, env: Env) {
const fabric = await init({
modelProvider: new CloudflareWorkersAIModelProvider({
binding: env.AI,
defaultModel: '@cf/meta/llama-3.1-8b-instruct',
}),
});
// ...
},
};
wrangler.toml/jsonc:
[ai]
binding = "AI"
The provider handles both the modern { choices: [...] } and the legacy { response: '...' } Workers AI response shapes. Optional Cloudflare AI Gateway routing supports enterprise logging and routing options:
new CloudflareWorkersAIModelProvider({
binding: env.AI,
defaultModel: '@cf/meta/llama-3.1-8b-instruct',
gateway: {
id: 'prod-gateway',
skipCache: false,
cacheTtl: 3600,
collectLog: true,
eventId: request.headers.get('x-request-id') ?? undefined,
metadata: { tenant: 'acme', environment: 'prod' },
},
models: {
'@cf/meta/llama-3.1-8b-instruct': {
contextWindowTokens: 8192,
maxOutputTokens: 2048,
supportsTools: true,
},
},
});
Model metadata feeds context budgeting, auto-compaction, and admin UIs. The provider includes built-in metadata for common Workers AI chat models and accepts models / defaultModelInfo overrides for private or newly released models.
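The two Workers AI response shapes mentioned above can be normalized with a small guard function. This is a sketch, not the provider's actual code: `extractWorkersAiText` is a hypothetical name, and the `choices[0].message.content` layout is assumed to follow the OpenAI chat-completions convention.

```typescript
// Hypothetical normalizer for the two response shapes.
function extractWorkersAiText(result: unknown): string {
  if (typeof result !== 'object' || result === null) {
    throw new Error('Workers AI result is not an object');
  }
  const shaped = result as { choices?: unknown; response?: unknown };

  // Modern shape: { choices: [{ message: { content: '...' } }] } (layout assumed).
  if (Array.isArray(shaped.choices)) {
    const first = shaped.choices[0] as { message?: { content?: unknown } } | undefined;
    const content = first?.message?.content;
    if (typeof content === 'string') return content;
  }

  // Legacy shape: { response: '...' }.
  if (typeof shaped.response === 'string') return shaped.response;

  throw new Error('Unrecognized Workers AI response shape');
}

console.log(extractWorkersAiText({ response: 'legacy reply' }));
console.log(extractWorkersAiText({ choices: [{ message: { content: 'modern reply' } }] }));
```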
OpenAI-compatible gateways
Many AI gateway products speak the OpenAI Chat Completions request/response shape: Vercel AI Gateway, Helicone, Portkey, LiteLLM (self-hosted), internal corp proxies. Wire any of them with the generic aiGateway() helper:
import { aiGateway, init } from '@fabric-harness/sdk';
// Helicone
const fabric = await init({
modelProvider: aiGateway({
baseUrl: 'https://oai.helicone.ai/v1',
apiKey: process.env.OPENAI_API_KEY!,
headers: { 'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}` },
defaultModel: 'gpt-4o',
name: 'helicone',
}),
});
// Self-hosted LiteLLM (named separately so both examples can live in one module)
const litellmFabric = await init({
modelProvider: aiGateway({
baseUrl: 'http://litellm.internal:4000/v1',
apiKey: process.env.LITELLM_KEY!,
defaultModel: 'azure/gpt-4o',
}),
});
Vercel AI Gateway preset
Vercel AI Gateway gets a thin preset with the gateway URL pre-baked:
import { vercelAIGateway, init } from '@fabric-harness/sdk';
const fabric = await init({
modelProvider: vercelAIGateway({
apiKey: process.env.AI_GATEWAY_API_KEY!,
defaultModel: 'anthropic/claude-3-5-sonnet-20241022',
}),
});
baseUrl defaults to https://ai-gateway.vercel.sh/v1; override for staging or self-hosted.
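For orientation, the OpenAI Chat Completions shape these gateways share boils down to a POST to `<baseUrl>/chat/completions` with a Bearer key and a `{ model, messages }` body. The sketch below builds such a request without sending it. `buildChatCompletionsCall` and its types are hypothetical, and how aiGateway() assembles requests internally is not documented here, so treat this as an illustration of the wire format only.

```typescript
interface GatewaySketchOptions {
  baseUrl: string;
  apiKey: string;
  headers?: Record<string, string>;
  defaultModel: string;
}

interface ChatMessageSketch {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatCompletionsCall {
  url: string;
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) an OpenAI-style chat completions request.
function buildChatCompletionsCall(
  opts: GatewaySketchOptions,
  messages: ChatMessageSketch[],
  model?: string,
): ChatCompletionsCall {
  return {
    // Strip trailing slashes so the path is joined with exactly one.
    url: `${opts.baseUrl.replace(/\/+$/, '')}/chat/completions`,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${opts.apiKey}`,
      ...opts.headers, // gateway-specific extras, e.g. Helicone-Auth
    },
    body: JSON.stringify({ model: model ?? opts.defaultModel, messages }),
  };
}

const heliconeCall = buildChatCompletionsCall(
  {
    baseUrl: 'https://oai.helicone.ai/v1',
    apiKey: '<provider-key>',
    headers: { 'Helicone-Auth': 'Bearer <helicone-key>' },
    defaultModel: 'gpt-4o',
  },
  [{ role: 'user', content: 'Hello' }],
);
console.log(heliconeCall.url);
```

Because only the base URL and extra headers differ between gateways, a single generic helper plus thin presets is a natural design.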
Foundry runtime (Azure)
When deploying an agent into the Azure AI Foundry Hosted Agent runtime — or any Azure compute (ACA Job, AKS pod, VM) with a managed identity — FoundryRuntimeModelProvider calls the Foundry-managed Azure OpenAI surface using a Bearer token instead of an API key:
import { FoundryRuntimeModelProvider } from '@fabric-harness/azure/foundry-runtime';
import { init } from '@fabric-harness/sdk';
const fabric = await init({
modelProvider: new FoundryRuntimeModelProvider({
defaultModel: 'gpt-4o',
}),
});
The Hosted Agent runtime injects AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT, and FOUNDRY_AGENT_TOKEN automatically. Outside the Foundry runtime, install @azure/identity (an optional peer dependency); the provider then falls back to DefaultAzureCredential for the workload's managed identity.
The runtime adapter that lets runAgent execute inside Foundry's container is still preview-blocked, but the model provider works today on any Azure compute with a managed identity.
Spend caps
Per-call and per-session USD ceilings prevent runaway spend. Configure them through init({ costLimit }):
const fabric = await init({
costLimit: {
perCall: 0.10, // throw if a single model call exceeds $0.10
perSession: 1.00, // throw if cumulative session spend exceeds $1.00
onExceed: 'throw', // 'throw' (default) | 'approve'
},
});
When onExceed is 'approve', the loop pauses on a violation and emits approval_requested with kind: 'cost-limit'. Approve via your existing approval UI (or fh approve <id>) to release the loop; deny to throw CostLimitExceededError.
Limits evaluate after each call's cost lands on usage.costUsd. Forks and replays start with a fresh budget — replay is a debug action, not production work.
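The evaluation order described above (the call completes, its cost is recorded, then limits are checked) can be sketched as a pure function. `evaluateCostLimit` and its types are hypothetical, and checking perCall before perSession is an assumption made for the sketch.

```typescript
interface CostLimitSketch {
  perCall?: number;    // USD ceiling for a single model call
  perSession?: number; // USD ceiling for cumulative session spend
}

type CostVerdict =
  | { ok: true; sessionTotal: number }
  | { ok: false; sessionTotal: number; reason: 'perCall' | 'perSession' };

// Hypothetical check run AFTER a call's cost is known, so the spend that
// triggers a violation has already happened.
function evaluateCostLimit(limit: CostLimitSketch, sessionTotalBefore: number, callCostUsd: number): CostVerdict {
  const sessionTotal = sessionTotalBefore + callCostUsd;
  if (limit.perCall !== undefined && callCostUsd > limit.perCall) {
    return { ok: false, sessionTotal, reason: 'perCall' };
  }
  if (limit.perSession !== undefined && sessionTotal > limit.perSession) {
    return { ok: false, sessionTotal, reason: 'perSession' };
  }
  return { ok: true, sessionTotal };
}

const sketchLimit: CostLimitSketch = { perCall: 0.1, perSession: 1.0 };
console.log(evaluateCostLimit(sketchLimit, 0.5, 0.05));  // within both limits
console.log(evaluateCostLimit(sketchLimit, 0.95, 0.08)); // session total exceeds 1.00
```

Because the check runs after the fact, a ceiling bounds how far spend can overshoot (by at most one call) rather than guaranteeing it is never crossed.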
Cross-process aggregation
For "tenant X spends ≤ $50/day" or "company-wide ≤ $100/hour" caps, pair perScope + scopeKey + store with a cross-process CostBudgetStore:
import { init, inMemoryCostBudgetStore } from '@fabric-harness/sdk';
import { postgresCostBudgetStore } from '@fabric-harness/node';
const fabric = await init({
costLimit: {
perScope: 50.00,
scopeKey: `tenant:${tenantId}:day:${todayIso}`,
store: postgresCostBudgetStore({ client: pgClient }), // or inMemoryCostBudgetStore() for single-process
onExceed: 'throw',
},
});
The store is the source of truth: multiple agents and multiple processes share the running total. fabric-harness never interprets scopeKey; you pick the convention (per-tenant, per-day, per-org). Reset semantics (daily rollover, billing period close) are also yours to define; call store.reset(scopeKey) from a scheduled task.
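Since the key convention is left to you, one option is to embed the UTC calendar day in the key so every process rolls over at the same instant. `dailyTenantScopeKey` below is a hypothetical helper that follows the `tenant:<id>:day:<date>` pattern from the example above; the choice of UTC is an assumption.

```typescript
// Hypothetical helper: builds a per-tenant, per-day scope key.
// Using the UTC date means all processes agree on when "today" changes,
// regardless of their local time zones.
function dailyTenantScopeKey(tenantId: string, now: Date = new Date()): string {
  const day = now.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return `tenant:${tenantId}:day:${day}`;
}

console.log(dailyTenantScopeKey('acme', new Date(Date.UTC(2026, 0, 31, 23, 59))));
console.log(dailyTenantScopeKey('acme', new Date(Date.UTC(2026, 1, 1, 0, 0))));
```

With a date-stamped key, a new day naturally starts a new running total, so an explicit reset is only needed if you want to reclaim old entries.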
Anthropic prompt caching
When an Anthropic response includes cache_read_input_tokens / cache_creation_input_tokens, fabric-harness records them on usage.cachedInputTokens and usage.cacheWriteTokens, and the cost calculator discounts billed input tokens by the cached read amount (and adds the cache-write surcharge when present). fh metrics shows a new Cache: read=N write=N line so you can see how much you're saving.
$ fh metrics ask-1f4f...
Tokens: input=120000 output=2400 total=122400
Cache: read=96000 write=0
Cost: $0.046800
Cache-read tokens are billed at ~10% of the standard input rate on Claude models, so long, stable system prompts yield the largest savings.
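The discount arithmetic can be sketched as follows. Everything numeric here is illustrative: the per-million-token rates and the 25% cache-write surcharge are assumed values, not the harness price table, so the result is not meant to reproduce the sample output above. The sketch also assumes that input tokens include cached reads and cache writes, as the description of the cost calculator suggests.

```typescript
interface CacheAwareUsage {
  inputTokens: number;       // total input, assumed to include cached reads
  cachedInputTokens: number; // served from cache
  cacheWriteTokens: number;  // written to cache on this call
  outputTokens: number;
}

interface IllustrativeRates {
  inputPerMTok: number;        // USD per million input tokens (assumed)
  outputPerMTok: number;       // USD per million output tokens (assumed)
  cacheReadMultiplier: number; // fraction of the input rate for cached reads
  cacheWriteSurcharge: number; // extra fraction of the input rate for cache writes (assumed)
}

// Hypothetical estimator mirroring the description: cached reads are billed
// at a reduced rate and cache writes carry a surcharge on top of the input rate.
function estimateCachedCostUsd(usage: CacheAwareUsage, rates: IllustrativeRates): number {
  const uncachedInput = usage.inputTokens - usage.cachedInputTokens;
  const inputCost = uncachedInput * rates.inputPerMTok;
  const cacheReadCost = usage.cachedInputTokens * rates.inputPerMTok * rates.cacheReadMultiplier;
  const cacheWriteCost = usage.cacheWriteTokens * rates.inputPerMTok * rates.cacheWriteSurcharge;
  const outputCost = usage.outputTokens * rates.outputPerMTok;
  return (inputCost + cacheReadCost + cacheWriteCost + outputCost) / 1e6;
}

const sketchRates: IllustrativeRates = {
  inputPerMTok: 3,
  outputPerMTok: 15,
  cacheReadMultiplier: 0.1,
  cacheWriteSurcharge: 0.25,
};
const withCache = estimateCachedCostUsd(
  { inputTokens: 120000, cachedInputTokens: 96000, cacheWriteTokens: 0, outputTokens: 2400 },
  sketchRates,
);
const withoutCache = estimateCachedCostUsd(
  { inputTokens: 120000, cachedInputTokens: 0, cacheWriteTokens: 0, outputTokens: 2400 },
  sketchRates,
);
console.log(withCache.toFixed(4), withoutCache.toFixed(4));
```

Under these assumed rates the cached call costs roughly a third of the uncached one, because 80% of the input is billed at a tenth of the normal rate.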
Per-call cost telemetry
Every model call is enriched with a USD estimate from a static price table (mainline OpenAI, Anthropic, Gemini, Bedrock, Cohere). Cost shows up in fh metrics and on OpenTelemetry spans as gen_ai.usage.cost_usd — see CLI → metrics.
Override or extend the catalog at runtime when you have custom-rate contracts:
import { registerModelPrices } from '@fabric-harness/sdk';
registerModelPrices([
{ provider: 'openai', model: 'gpt-4o', inputPerMTok: 1.5, outputPerMTok: 6, effectiveAt: '2026-05-08', notes: 'Enterprise contract' },
]);
Reasoning effort
Reasoning-capable models accept a thinkingLevel controlling how much the model thinks before answering:
type ThinkingLevel = 'off' | 'minimal' | 'low' | 'medium' | 'high' | 'xhigh';
It is configurable at three scopes; the most specific wins:
const agent = await init({ model: 'cloudflare/@cf/openai/gpt-oss-120b', thinkingLevel: 'medium' });
const session = await agent.session('s1', { thinkingLevel: 'high' }); // per-session override
await session.prompt('think hard about this', { thinkingLevel: 'xhigh' }); // per-call override
The level is capability-gated: reasoning-capable providers map it to their native control, and others ignore it without raising an error. 'off' (or unset) requests no reasoning.
- Default loop (pi-agent-core): works for every provider; pi-ai handles capability detection + per-provider mapping and clamps the level to what each model supports.
- Native Fabric providers: the Cloudflare Workers AI binding and OpenAI-compatible providers map it to reasoning_effort; Anthropic to thinking.budget_tokens (with max_tokens raised above the budget); Gemini/Vertex to thinkingConfig.thinkingBudget. Mapping is gated to known reasoning families (o-series / gpt-5 / gpt-oss, Claude 3.7/4.x, Gemini 2.5).
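The scope resolution and clamping described in this section can be sketched in a few lines. The helper names are hypothetical, and modeling "what each model supports" as a single maximum level is an assumption made for illustration; the real capability detection lives in the harness and pi-ai.

```typescript
type ThinkingLevelSketch = 'off' | 'minimal' | 'low' | 'medium' | 'high' | 'xhigh';

// Levels in ascending order of reasoning effort.
const LEVEL_ORDER: ThinkingLevelSketch[] = ['off', 'minimal', 'low', 'medium', 'high', 'xhigh'];

// Most specific scope wins: call, then session, then agent; unset means 'off'.
function resolveThinkingLevel(
  agentLevel?: ThinkingLevelSketch,
  sessionLevel?: ThinkingLevelSketch,
  callLevel?: ThinkingLevelSketch,
): ThinkingLevelSketch {
  return callLevel ?? sessionLevel ?? agentLevel ?? 'off';
}

// Assumption: a model's support is expressed as the highest level it accepts.
function clampThinkingLevel(requested: ThinkingLevelSketch, maxSupported: ThinkingLevelSketch): ThinkingLevelSketch {
  return LEVEL_ORDER.indexOf(requested) > LEVEL_ORDER.indexOf(maxSupported) ? maxSupported : requested;
}

const requestedLevel = resolveThinkingLevel('medium', 'high', 'xhigh');
console.log(requestedLevel, clampThinkingLevel(requestedLevel, 'high'));
```

Clamping rather than throwing matches the capability-gated behavior: an unsupported level degrades to the nearest supported one instead of failing the call.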