Model Providers

Configure and select LLM providers per agent, session, or call.

Fabric Harness uses an explicit provider/model-id reference everywhere a model is selected. There is no implicit "default OpenAI" — you opt into a provider by configuring credentials and naming the model.

Reference format

provider/model-id

Examples:

openai/gpt-4o
anthropic/claude-sonnet-4-6
gemini/gemini-2.5-pro
bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0
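Some model ids contain their own slashes, dots, or colons (the Bedrock id above, or a Workers AI id such as @cf/openai/gpt-oss-120b), so a reference is best split at the first slash only. The helper below is a hypothetical sketch of that rule, not Fabric Harness's actual parser; the name parseModelRef and the ModelRef type are assumptions made for illustration.

```typescript
// Hypothetical helper (not part of the Fabric Harness API): splits a
// "provider/model-id" reference at the FIRST slash, so model ids that
// contain "/", "." or ":" survive intact.
interface ModelRef {
  provider: string;
  modelId: string;
}

function parseModelRef(ref: string): ModelRef {
  const slash = ref.indexOf('/');
  // Reject references with no slash, an empty provider, or an empty model id.
  if (slash <= 0 || slash === ref.length - 1) {
    throw new Error(`Expected "provider/model-id", got "${ref}"`);
  }
  return { provider: ref.slice(0, slash), modelId: ref.slice(slash + 1) };
}

console.log(parseModelRef('bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0'));
console.log(parseModelRef('cloudflare/@cf/openai/gpt-oss-120b'));
```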

Setting credentials

Put provider keys once in a repo/workspace .env.local; Fabric Harness auto-loads .env and .env.local files, while shell env still wins:

cp .env.example .env.local
# OPENAI_API_KEY=...
# ANTHROPIC_API_KEY=...
# AZURE_OPENAI_ENDPOINT=https://....openai.azure.com
# AZURE_OPENAI_API_KEY=...

Use explicit --env <file> only for overrides. Never paste API keys into source files or session artifacts.
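The layering described above can be pictured as a simple merge in which later sources override earlier ones. This is only a sketch: the source states that shell variables win, while the assumption that .env.local overrides .env, and the function name mergeEnv, are illustrative guesses rather than documented behavior.

```typescript
// Sketch of env layering. Assumption: .env.local overrides .env.
// The source only guarantees that shell variables win over both files.
type EnvMap = Record<string, string | undefined>;

function mergeEnv(dotEnv: EnvMap, dotEnvLocal: EnvMap, shell: EnvMap): EnvMap {
  // Start from the files, with .env.local layered over .env.
  const merged: EnvMap = { ...dotEnv, ...dotEnvLocal };
  // Shell values that are actually set replace anything loaded from files.
  for (const key of Object.keys(shell)) {
    const value = shell[key];
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

const effective = mergeEnv(
  { OPENAI_API_KEY: 'from-dot-env' },
  { OPENAI_API_KEY: 'from-dot-env-local', ANTHROPIC_API_KEY: 'from-dot-env-local' },
  { ANTHROPIC_API_KEY: 'from-shell' },
);
console.log(effective);
```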

Selecting the model

The first non-empty value wins, in this order:

  1. CLI flag: fh run ask --model openai/gpt-5.5
  2. Environment: FABRIC_MODEL=openai/gpt-5.5
  3. .fabricharness/config.ts: run.model or agent.model
  4. Agent-declared default: agent({ model: 'openai/gpt-5.5' })
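The precedence list above amounts to a "first non-empty value" lookup. The sketch below models it with a plain object; the ModelSources type and resolveModel function are hypothetical names, and throwing when nothing is configured is an assumption based on the statement that there is no implicit default.

```typescript
// Hypothetical model of the selection order; not the actual Fabric Harness resolver.
interface ModelSources {
  cliFlag?: string;      // fh run ... --model <ref>
  envVar?: string;       // FABRIC_MODEL
  configFile?: string;   // run.model or agent.model from the config file
  agentDefault?: string; // agent({ model })
}

function resolveModel(sources: ModelSources): string {
  // Highest-priority source first.
  const ordered = [sources.cliFlag, sources.envVar, sources.configFile, sources.agentDefault];
  const winner = ordered.find((value) => value !== undefined && value.trim() !== '');
  if (winner === undefined) {
    // Assumption: with no implicit default, an unconfigured model is an error.
    throw new Error('No model configured');
  }
  return winner;
}

// An empty FABRIC_MODEL is skipped, so the config file wins over the agent default.
console.log(
  resolveModel({ envVar: '', configFile: 'anthropic/claude-sonnet-4-6', agentDefault: 'openai/gpt-4o' }),
);
```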

Per-call override:

await session.prompt('Summarize', { model: 'openai/gpt-5.5' });

Mock provider

For local development and tests, the mock provider returns deterministic stub responses. It honors the typed-result schema where possible.

export default agent({
  // ...
  model: process.env.FABRIC_MODEL ?? 'openai/gpt-5.5',
});

Provider env names

Fabric Harness knows the standard env names for common providers:

  • OPENAI_API_KEY
  • ANTHROPIC_API_KEY
  • OPENROUTER_API_KEY
  • GEMINI_API_KEY
  • GOOGLE_API_KEY
  • AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT
  • GROQ_API_KEY
  • MISTRAL_API_KEY
  • COHERE_API_KEY

Cloudflare Workers AI binding

When deploying to Cloudflare Workers with fh build --target cloudflare, you can route inference through the platform binding (env.AI.run()) instead of HTTP — no API tokens, no egress, runs at the edge.

import { CloudflareWorkersAIModelProvider } from '@fabric-harness/cloudflare/workers-ai';

export default {
  async fetch(request: Request, env: Env) {
    const fabric = await init({
      modelProvider: new CloudflareWorkersAIModelProvider({
        binding: env.AI,
        defaultModel: '@cf/meta/llama-3.1-8b-instruct',
      }),
    });
    // ...
  },
};

wrangler.toml/jsonc:

[ai]
binding = "AI"

The provider handles both the modern { choices: [...] } and the legacy { response: '...' } Workers AI response shapes. Optional Cloudflare AI Gateway routing exposes enterprise logging and caching options:

new CloudflareWorkersAIModelProvider({
  binding: env.AI,
  defaultModel: '@cf/meta/llama-3.1-8b-instruct',
  gateway: {
    id: 'prod-gateway',
    skipCache: false,
    cacheTtl: 3600,
    collectLog: true,
    eventId: request.headers.get('x-request-id') ?? undefined,
    metadata: { tenant: 'acme', environment: 'prod' },
  },
  models: {
    '@cf/meta/llama-3.1-8b-instruct': {
      contextWindowTokens: 8192,
      maxOutputTokens: 2048,
      supportsTools: true,
    },
  },
});

Model metadata feeds context-budgeting/auto-compaction and admin UIs. The provider includes built-in metadata for common Workers AI chat models and accepts models / defaultModelInfo overrides for private or newly released models.
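To make the two response shapes mentioned earlier concrete, here is a minimal normalizer. It assumes the modern shape follows the OpenAI-style choices[0].message.content layout; that layout and the name extractText are assumptions for illustration, not the provider's actual code.

```typescript
// Sketch only. Assumed shapes:
//   modern: { choices: [{ message: { content: string } }] }
//   legacy: { response: string }
function extractText(raw: unknown): string {
  if (typeof raw === 'object' && raw !== null) {
    const obj = raw as Record<string, unknown>;

    // Prefer the modern shape when a choices array is present.
    const choices = obj.choices;
    if (Array.isArray(choices)) {
      const first = choices[0] as { message?: { content?: unknown } } | undefined;
      const content = first?.message?.content;
      if (typeof content === 'string') return content;
    }

    // Fall back to the legacy single-string shape.
    if (typeof obj.response === 'string') return obj.response;
  }
  throw new Error('Unrecognized Workers AI response shape');
}

console.log(extractText({ choices: [{ message: { content: 'hello from choices' } }] }));
console.log(extractText({ response: 'hello from legacy' }));
```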

OpenAI-compatible gateways

Many AI gateway products speak the OpenAI Chat Completions request/response shape: Vercel AI Gateway, Helicone, Portkey, LiteLLM (self-hosted), internal corp proxies. Wire any of them with the generic aiGateway() helper:

import { aiGateway, init } from '@fabric-harness/sdk';

// Helicone
const fabric = await init({
  modelProvider: aiGateway({
    baseUrl: 'https://oai.helicone.ai/v1',
    apiKey: process.env.OPENAI_API_KEY!,
    headers: { 'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}` },
    defaultModel: 'gpt-4o',
    name: 'helicone',
  }),
});

// Self-hosted LiteLLM (an alternative to the Helicone setup above; pick one per app)
const litellmFabric = await init({
  modelProvider: aiGateway({
    baseUrl: 'http://litellm.internal:4000/v1',
    apiKey: process.env.LITELLM_KEY!,
    defaultModel: 'azure/gpt-4o',
  }),
});
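For orientation, an OpenAI-compatible gateway expects a POST to {baseUrl}/chat/completions with a Bearer token and a JSON body containing model and messages. The sketch below builds such a request without sending it; GatewayOptions and buildChatRequest are hypothetical names, and how aiGateway() assembles its requests internally is not specified here.

```typescript
// Sketch of an OpenAI Chat Completions style request. Names are illustrative;
// this is not the aiGateway() implementation.
interface GatewayOptions {
  baseUrl: string;
  apiKey: string;
  headers?: Record<string, string>;
  defaultModel: string;
}

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface BuiltRequest {
  url: string;
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

function buildChatRequest(opts: GatewayOptions, messages: ChatMessage[], model?: string): BuiltRequest {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${opts.apiKey}`,
    ...opts.headers, // gateway-specific headers, e.g. Helicone-Auth
  };
  return {
    url: `${opts.baseUrl.replace(/\/+$/, '')}/chat/completions`, // tolerate a trailing slash
    method: 'POST',
    headers,
    body: JSON.stringify({ model: model ?? opts.defaultModel, messages }),
  };
}

const request = buildChatRequest(
  { baseUrl: 'https://oai.helicone.ai/v1', apiKey: 'sk-placeholder', defaultModel: 'gpt-4o' },
  [{ role: 'user', content: 'Summarize' }],
);
console.log(request.url);
```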

Vercel AI Gateway preset

Vercel AI Gateway gets a thin preset with the gateway URL pre-baked:

import { vercelAIGateway, init } from '@fabric-harness/sdk';

const fabric = await init({
  modelProvider: vercelAIGateway({
    apiKey: process.env.AI_GATEWAY_API_KEY!,
    defaultModel: 'anthropic/claude-3-5-sonnet-20241022',
  }),
});

baseUrl defaults to https://ai-gateway.vercel.sh/v1; override it for staging or self-hosted deployments.

Foundry runtime (Azure)

When deploying an agent into the Azure AI Foundry Hosted Agent runtime — or any Azure compute (ACA Job, AKS pod, VM) with a managed identity — FoundryRuntimeModelProvider calls the Foundry-managed Azure OpenAI surface using a Bearer token instead of an API key:

import { FoundryRuntimeModelProvider } from '@fabric-harness/azure/foundry-runtime';
import { init } from '@fabric-harness/sdk';

const fabric = await init({
  modelProvider: new FoundryRuntimeModelProvider({
    defaultModel: 'gpt-4o',
  }),
});

The Hosted Agent runtime injects AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT, and FOUNDRY_AGENT_TOKEN automatically. Outside the Foundry runtime, install @azure/identity (optional peer dep) and the provider falls back to DefaultAzureCredential for the workload's managed identity.

The runtime adapter that lets runAgent execute inside Foundry's container is still preview-blocked, but the model provider works today on any Azure compute with a managed identity.

Spend caps

Per-call and per-session USD ceilings prevent runaway spend. Configure them through init({ costLimit }):

const fabric = await init({
  costLimit: {
    perCall: 0.10,       // throw if a single model call exceeds $0.10
    perSession: 1.00,    // throw if cumulative session spend exceeds $1.00
    onExceed: 'throw',   // 'throw' (default) | 'approve'
  },
});

When onExceed is 'approve', the loop pauses on a violation and emits approval_requested with kind: 'cost-limit'. Approve via your existing approval UI (or fh approve <id>) to release the loop; deny to throw CostLimitExceededError.

Limits evaluate after each call's cost lands on usage.costUsd. Forks and replays start with a fresh budget — replay is a debug action, not production work.
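Because limits are evaluated after a call's cost is known, the call that crosses a ceiling has already been paid for; the limit stops the next one. The tracker below sketches that behavior for the 'throw' mode only. SessionBudget is a hypothetical class, and the local CostLimitExceededError is a stand-in for the SDK's error, not its real definition.

```typescript
// Stand-in error class for illustration; the SDK's own class may differ.
class CostLimitExceededError extends Error {
  scope: 'perCall' | 'perSession';
  constructor(scope: 'perCall' | 'perSession', spentUsd: number, limitUsd: number) {
    super(`${scope} limit exceeded: spent $${spentUsd.toFixed(4)} > $${limitUsd.toFixed(4)}`);
    this.name = 'CostLimitExceededError';
    this.scope = scope;
  }
}

interface CostLimit {
  perCall?: number;
  perSession?: number;
}

// Hypothetical tracker: record() is called once a call's cost is known.
class SessionBudget {
  private sessionTotalUsd = 0;
  private readonly limit: CostLimit;

  constructor(limit: CostLimit) {
    this.limit = limit;
  }

  record(callCostUsd: number): number {
    // The cost is added first: the money is already spent when the check runs.
    this.sessionTotalUsd += callCostUsd;
    if (this.limit.perCall !== undefined && callCostUsd > this.limit.perCall) {
      throw new CostLimitExceededError('perCall', callCostUsd, this.limit.perCall);
    }
    if (this.limit.perSession !== undefined && this.sessionTotalUsd > this.limit.perSession) {
      throw new CostLimitExceededError('perSession', this.sessionTotalUsd, this.limit.perSession);
    }
    return this.sessionTotalUsd;
  }
}

const budget = new SessionBudget({ perCall: 0.1, perSession: 1.0 });
console.log(budget.record(0.04)); // well under both ceilings
```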

Cross-process aggregation

For "tenant X spends ≤ $50/day" or "company-wide ≤ $100/hour" caps, pair perScope + scopeKey + store with a cross-process CostBudgetStore:

import { init, inMemoryCostBudgetStore } from '@fabric-harness/sdk';
import { postgresCostBudgetStore } from '@fabric-harness/node';

const fabric = await init({
  costLimit: {
    perScope: 50.00,
    scopeKey: `tenant:${tenantId}:day:${todayIso}`,
    store: postgresCostBudgetStore({ client: pgClient }),  // or inMemoryCostBudgetStore() for single-process
    onExceed: 'throw',
  },
});

The store is the source of truth — multiple agents / multiple processes share the running total. fabric-harness never interprets scopeKey; you pick the convention (per-tenant, per-day, per-org). Reset semantics (daily rollover, billing period close) are also yours — call store.reset(scopeKey) from a scheduled task.
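Since the key convention is yours, one option is to embed the time window in the key, as the tenant example above does. A new window then produces a new key and starts from zero without an explicit reset. The helpers below are hypothetical and use UTC dates; whether old keys are pruned is left to the store and is not covered here.

```typescript
// Hypothetical scopeKey builders. Embedding the window in the key means the
// running total rolls over naturally when the window changes.
function dailyScopeKey(tenantId: string, now: Date): string {
  const day = now.toISOString().slice(0, 10); // UTC date, YYYY-MM-DD
  return `tenant:${tenantId}:day:${day}`;
}

function hourlyScopeKey(orgId: string, now: Date): string {
  const hour = now.toISOString().slice(0, 13); // UTC date and hour, YYYY-MM-DDTHH
  return `org:${orgId}:hour:${hour}`;
}

const now = new Date('2026-05-08T14:30:00Z');
console.log(dailyScopeKey('acme', now));
console.log(hourlyScopeKey('acme', now));
```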

Anthropic prompt caching

When an Anthropic response includes cache_read_input_tokens / cache_creation_input_tokens, fabric-harness records them on usage.cachedInputTokens and usage.cacheWriteTokens, and the cost calculator discounts billed input tokens by the cached read amount (and adds the cache-write surcharge when present). fh metrics shows a new Cache: read=N write=N line so you can see how much you're saving.

$ fh metrics ask-1f4f...
Tokens: input=120000 output=2400 total=122400
Cache: read=96000 write=0
Cost: $0.046800

Cache-read tokens are billed at ~10% of the standard input rate on Claude models, so long, stable system prompts benefit most.
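The discount can be worked through with a small calculation. The rates below are placeholders chosen for easy arithmetic, not the price table Fabric Harness ships, and the sketch assumes the reported input count includes cached reads. It omits the cache-write surcharge and is not meant to reproduce the fh metrics sample above.

```typescript
// Worked sketch of the cache-read discount. All rates are hypothetical.
interface Rates {
  inputPerMTok: number;        // USD per million uncached input tokens
  outputPerMTok: number;       // USD per million output tokens
  cacheReadMultiplier: number; // fraction of the input rate charged for cached reads
}

interface Usage {
  inputTokens: number;       // assumed to include cached reads
  outputTokens: number;
  cachedInputTokens: number;
}

function estimateCostUsd(usage: Usage, rates: Rates): number {
  const uncachedInput = usage.inputTokens - usage.cachedInputTokens;
  const inputCost = (uncachedInput * rates.inputPerMTok) / 1e6;
  const cachedCost = (usage.cachedInputTokens * rates.inputPerMTok * rates.cacheReadMultiplier) / 1e6;
  const outputCost = (usage.outputTokens * rates.outputPerMTok) / 1e6;
  return inputCost + cachedCost + outputCost;
}

const rates: Rates = { inputPerMTok: 3, outputPerMTok: 15, cacheReadMultiplier: 0.1 };
const withCache = estimateCostUsd({ inputTokens: 100000, outputTokens: 1000, cachedInputTokens: 80000 }, rates);
const withoutCache = estimateCostUsd({ inputTokens: 100000, outputTokens: 1000, cachedInputTokens: 0 }, rates);
console.log(withCache.toFixed(4), withoutCache.toFixed(4));
```

With these placeholder rates, 80,000 of 100,000 input tokens served from cache brings the estimate from roughly $0.315 down to roughly $0.099.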

Per-call cost telemetry

Every model call is enriched with a USD estimate from a static price table (mainline OpenAI, Anthropic, Gemini, Bedrock, Cohere). Cost shows up in fh metrics and on OpenTelemetry spans as gen_ai.usage.cost_usd — see CLI → metrics.

Override or extend the catalog at runtime when you have custom-rate contracts:

import { registerModelPrices } from '@fabric-harness/sdk';

registerModelPrices([
  { provider: 'openai', model: 'gpt-4o', inputPerMTok: 1.5, outputPerMTok: 6, effectiveAt: '2026-05-08', notes: 'Enterprise contract' },
]);
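Conceptually, an override replaces the catalog entry for the same provider and model. The sketch below models the catalog as a map keyed by provider/model-id. The names registerPrices and lookupPrice, the last-registration-wins rule, and the "built-in" price are all assumptions for illustration; the SDK's real resolution rules (for example, how effectiveAt is used) are not described in this section.

```typescript
// Hypothetical in-memory price catalog; not the SDK's implementation.
interface ModelPrice {
  provider: string;
  model: string;
  inputPerMTok: number;
  outputPerMTok: number;
  effectiveAt?: string;
  notes?: string;
}

const priceTable = new Map<string, ModelPrice>();

// Assumption: a later registration for the same provider/model replaces the earlier one.
function registerPrices(prices: ModelPrice[]): void {
  for (const price of prices) {
    priceTable.set(`${price.provider}/${price.model}`, price);
  }
}

function lookupPrice(ref: string): ModelPrice | undefined {
  return priceTable.get(ref);
}

// Placeholder "built-in" entry, followed by the contract override from the example above.
registerPrices([{ provider: 'openai', model: 'gpt-4o', inputPerMTok: 2.5, outputPerMTok: 10 }]);
registerPrices([
  { provider: 'openai', model: 'gpt-4o', inputPerMTok: 1.5, outputPerMTok: 6, effectiveAt: '2026-05-08', notes: 'Enterprise contract' },
]);

console.log(lookupPrice('openai/gpt-4o'));
```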

Reasoning effort

Reasoning-capable models accept a thinkingLevel controlling how much the model thinks before answering:

type ThinkingLevel = 'off' | 'minimal' | 'low' | 'medium' | 'high' | 'xhigh';

It is configurable at three scopes; the most specific wins:

const agent = await init({ model: 'cloudflare/@cf/openai/gpt-oss-120b', thinkingLevel: 'medium' });
const session = await agent.session('s1', { thinkingLevel: 'high' });   // per-session override
await session.prompt('think hard about this', { thinkingLevel: 'xhigh' }); // per-call override

The level is capability-gated: reasoning-capable providers map it to their native control, others ignore it (no error). 'off' (or unset) requests no reasoning.

  • Default loop (pi-agent-core): works for every provider; pi-ai handles capability detection + per-provider mapping and clamps the level to what each model supports.
  • Native Fabric providers: Cloudflare Workers AI binding & OpenAI-compatible map to reasoning_effort; Anthropic to thinking.budget_tokens (with max_tokens raised above the budget); Gemini/Vertex to thinkingConfig.thinkingBudget. Gated to known reasoning families (o-series / gpt-5 / gpt-oss, Claude 3.7/4.x, Gemini 2.5).
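The scope resolution and the Anthropic mapping described above can be sketched as follows. The token budgets per level are invented for illustration (the source does not list them), as are the function names and the rule for raising max_tokens. The thinking: { type: 'enabled', budget_tokens } request shape and the requirement that max_tokens exceed the budget come from Anthropic's API.

```typescript
type ThinkingLevel = 'off' | 'minimal' | 'low' | 'medium' | 'high' | 'xhigh';

// Most-specific wins: call, then session, then agent. Unset everywhere means 'off'.
function resolveThinkingLevel(agent?: ThinkingLevel, session?: ThinkingLevel, call?: ThinkingLevel): ThinkingLevel {
  return call ?? session ?? agent ?? 'off';
}

// Hypothetical budgets; the real per-level values are not documented here.
const BUDGET_TOKENS: Record<Exclude<ThinkingLevel, 'off'>, number> = {
  minimal: 1024,
  low: 2048,
  medium: 8192,
  high: 16384,
  xhigh: 32768,
};

interface AnthropicParams {
  max_tokens: number;
  thinking?: { type: 'enabled'; budget_tokens: number };
}

function toAnthropicParams(level: ThinkingLevel, requestedMaxTokens: number): AnthropicParams {
  if (level === 'off') return { max_tokens: requestedMaxTokens };
  const budget = BUDGET_TOKENS[level];
  // max_tokens must exceed the thinking budget. Assumption: when the caller's
  // value is too small, keep it as room for the answer on top of the budget.
  const maxTokens = requestedMaxTokens > budget ? requestedMaxTokens : budget + requestedMaxTokens;
  return { max_tokens: maxTokens, thinking: { type: 'enabled', budget_tokens: budget } };
}

const level = resolveThinkingLevel('medium', 'high', undefined);
console.log(level, toAnthropicParams(level, 4096));
```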