Voice

Bidirectional audio streaming for phone calls, browser mic/speaker, and kiosk audio. Realtime mode (OpenAI) + pipeline mode (Deepgram + ElevenLabs / Cartesia) behind one VoiceSession contract.

VoiceSession is fabric-harness's surface for voice-capable LLMs. It streams audio bytes in both directions, surfaces text transcripts, and round-trips tool calls — same governance posture as text agents (cost telemetry, audit, approvals all apply).

fabric-harness ships two voice modes behind the same VoiceSession contract:

  • Realtime mode: one model owns audio in + out + tools. OpenAI Realtime today; Anthropic / Gemini Live when GA.
  • Pipeline mode: STT → LLM → TTS. Compose Deepgram / Cartesia STT with any LLM and ElevenLabs / Cartesia TTS. Swap providers without changing the caller code.

See Voice Providers for the provider matrix, pricing, and guidance on choosing between the two modes. Telephony bridges (Twilio Media Streams, Vonage, Plivo) live in user space; scaffold one with fh add.

The contract

interface VoiceSession {
  sendAudio(frame: Uint8Array): Promise<void>;
  sendText(text: string): Promise<void>;
  submitToolResult(input: { id: string; output: unknown }): Promise<void>;
  cancelResponse(): Promise<void>;
  events(): AsyncIterable<VoiceEvent>;
  close(): Promise<void>;
}

type VoiceEvent =
  | { type: 'session_open' }
  | { type: 'audio_delta'; audio: Uint8Array }
  | { type: 'text_delta'; delta: string; role: 'assistant' | 'user' }
  | { type: 'transcript'; text: string; role: 'assistant' | 'user' }
  | { type: 'tool_call'; id: string; name: string; input: unknown }
  | { type: 'response_done'; usage?: ModelUsage }
  | { type: 'error'; message: string };

audio_delta carries raw PCM 16-bit (or g711_ulaw for telephony). tool_call flags an LLM tool invocation — your code runs the tool and calls submitToolResult({ id, output }) to feed the result back.
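Because VoiceEvent is a discriminated union, a switch on event.type narrows each branch and lets the compiler flag unhandled variants. The sketch below folds events into a per-turn summary. TurnState, emptyTurn, and applyEvent are hypothetical names for illustration, not SDK exports, and ModelUsage is simplified to an opaque record so the example stands alone.

```typescript
// Local copy of the contract's event union so the sketch is self-contained.
// ModelUsage is simplified here; the real type comes from the SDK.
type ModelUsage = Record<string, number>;

type VoiceEvent =
  | { type: 'session_open' }
  | { type: 'audio_delta'; audio: Uint8Array }
  | { type: 'text_delta'; delta: string; role: 'assistant' | 'user' }
  | { type: 'transcript'; text: string; role: 'assistant' | 'user' }
  | { type: 'tool_call'; id: string; name: string; input: unknown }
  | { type: 'response_done'; usage?: ModelUsage }
  | { type: 'error'; message: string };

// Hypothetical per-turn summary; not part of fabric-harness.
interface TurnState {
  audioBytes: number;
  transcript: string[];
  pendingToolCalls: string[];
  done: boolean;
  error?: string;
}

function emptyTurn(): TurnState {
  return { audioBytes: 0, transcript: [], pendingToolCalls: [], done: false };
}

// Pure reducer: returns a new state for each incoming event.
function applyEvent(state: TurnState, event: VoiceEvent): TurnState {
  switch (event.type) {
    case 'session_open':
    case 'text_delta':
      return state; // nothing to record in this summary
    case 'audio_delta':
      return { ...state, audioBytes: state.audioBytes + event.audio.byteLength };
    case 'transcript':
      return { ...state, transcript: [...state.transcript, `${event.role}: ${event.text}`] };
    case 'tool_call':
      return { ...state, pendingToolCalls: [...state.pendingToolCalls, event.id] };
    case 'response_done':
      return { ...state, done: true };
    case 'error':
      return { ...state, done: true, error: event.message };
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a variant is added.
      const unreachable: never = event;
      throw new Error(`Unhandled event: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

Keeping the reducer pure makes it easy to unit-test with recorded event sequences, independent of any live audio connection.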

OpenAI Realtime

import { OpenAIRealtimeVoiceProvider } from '@fabric-harness/sdk';

const provider = new OpenAIRealtimeVoiceProvider({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o-realtime-preview',
});

const voice = await provider.connect({
  instructions: 'You are a friendly intake agent. One question at a time.',
  voice: 'alloy',
  audioFormat: 'pcm16',                   // or 'g711_ulaw' for Twilio
  tools: [submitFieldTool],
  turnDetection: 'server_vad',            // OpenAI handles barge-in detection
});

// Pump audio in (mic / phone)
mic.on('data', (frame) => voice.sendAudio(frame));

// Pump audio out (speaker / phone)
for await (const event of voice.events()) {
  if (event.type === 'audio_delta') speaker.write(event.audio);
  if (event.type === 'tool_call') {
    const output = await runTool(event.name, event.input);
    await voice.submitToolResult({ id: event.id, output });
  }
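  if (event.type === 'error') throw new Error(event.message);
  if (event.type === 'response_done') break;  // stops after one response; keep looping for a multi-turn call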
}

The provider uses the platform WebSocket (Node 22+ and browsers), so there is no ws dependency on the consumer side.

Pipeline mode (ElevenLabs / Cartesia / Deepgram)

When you want a non-OpenAI vendor combo — voice cloning, multilingual TTS, on-prem STT, or just lower per-minute cost — use PipelineVoiceProvider. It composes a SttProvider, a text ModelProvider, and a TtsProvider into the same VoiceSession contract. Swap any of the three without touching caller code.

import {
  PipelineVoiceProvider,
  DeepgramSttProvider,
  ElevenLabsTtsProvider,
  AnthropicModelProvider,
} from '@fabric-harness/sdk';

const provider = new PipelineVoiceProvider({
  stt: new DeepgramSttProvider({ apiKey: process.env.DEEPGRAM_API_KEY!, model: 'nova-3' }),
  tts: new ElevenLabsTtsProvider({
    apiKey: process.env.ELEVENLABS_API_KEY!,
    model: 'eleven_turbo_v2_5',
    defaultVoice: '21m00Tcm4TlvDq8ikWAM',  // 'Rachel'
  }),
  llm: new AnthropicModelProvider({ apiKey: process.env.ANTHROPIC_API_KEY! }),
  model: 'claude-haiku-4-5-20251001',
});

const voice = await provider.connect({
  instructions: 'Friendly intake agent. One question at a time.',
  tools: [submitFieldTool],
});
// Same loop as realtime mode — events(), sendAudio(), submitToolResult().

Conversation flow:

  1. Caller pumps audio into voice.sendAudio().
  2. Deepgram emits final transcripts → pipeline forwards as the next user turn.
  3. LLM generates the assistant reply (with tool calls if relevant).
  4. Assistant text streams to ElevenLabs → audio bytes arrive as audio_delta events.
  5. response_done fires with rolled-up usage (inputTokens, outputTokens, sttSeconds, ttsCharacters, costUsd).

Barge-in: the STT VAD emits speech_started when the user interrupts; the pipeline aborts the in-flight LLM + TTS streams and returns to listening. Your UI should drain its audio buffer when audio_delta events stop arriving.
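The flow and barge-in rule above can be pictured as a small state machine. This is a conceptual sketch only: Phase, PipelineSignal, and step are hypothetical names, and the real pipeline's internal states are not documented here. The point it illustrates is that speech_started always returns to listening and requests an abort only if something was in flight.

```typescript
// Hypothetical model of the pipeline's turn-taking; not SDK code.
type Phase = 'listening' | 'thinking' | 'speaking';

type PipelineSignal =
  | 'final_transcript' // STT produced a final user transcript
  | 'first_tts_audio'  // TTS began returning audio for the reply
  | 'response_done'    // reply fully generated and synthesized
  | 'speech_started';  // STT VAD heard the user start talking

interface Step {
  phase: Phase;
  abortInFlight: boolean; // true when LLM + TTS streams should be cancelled
}

function step(phase: Phase, signal: PipelineSignal): Step {
  switch (signal) {
    case 'final_transcript':
      // A final transcript starts a new assistant turn only while listening.
      return { phase: phase === 'listening' ? 'thinking' : phase, abortInFlight: false };
    case 'first_tts_audio':
      return { phase: phase === 'thinking' ? 'speaking' : phase, abortInFlight: false };
    case 'response_done':
      return { phase: 'listening', abortInFlight: false };
    case 'speech_started':
      // Barge-in: cancel whatever is running and go back to listening.
      return { phase: 'listening', abortInFlight: phase !== 'listening' };
  }
}
```

For example, step('speaking', 'speech_started') yields { phase: 'listening', abortInFlight: true }, while the same signal during listening changes nothing.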

Single-vendor variant (no ElevenLabs key needed):

import { CartesiaSttProvider, CartesiaTtsProvider } from '@fabric-harness/sdk';

new PipelineVoiceProvider({
  stt: new CartesiaSttProvider({ apiKey: process.env.CARTESIA_API_KEY! }),
  tts: new CartesiaTtsProvider({ apiKey: process.env.CARTESIA_API_KEY! }),
  llm: someLlmProvider,
  model: 'gpt-4.1-mini',
});

For the full matrix (when to pick which mode, latency benchmarks, cost comparisons), see Voice Providers.

Tool calls

Tool execution is the caller's responsibility, not the voice session's. When tool_call lands:

  1. Look up the tool by name in your registry.
  2. Run it through whatever governance gates apply — approval policies, cost caps, rate limits — using the v1-era primitives unchanged.
  3. Call voice.submitToolResult({ id, output }).

This keeps the audio path lean and avoids reimplementing the loop machinery on the voice side.
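A minimal sketch of steps 1 and 2, assuming a Map-based registry and two simplified gates (a remaining-budget check and a per-tool approval flag). Tool, ToolCall, Decision, and planToolCall are hypothetical shapes for illustration; in a real agent the gates would be the SDK's existing approval and cost primitives.

```typescript
// Hypothetical shapes; the real registry and policy types live in your code.
interface ToolCall {
  id: string;
  name: string;
  input: unknown;
}

interface Tool {
  name: string;
  requiresApproval: boolean;
  run(input: unknown): Promise<unknown>;
}

type Decision =
  | { kind: 'run'; tool: Tool }
  | { kind: 'needs_approval'; tool: Tool }
  | { kind: 'reject'; output: { error: string } };

// Decide what to do with a tool_call before executing anything.
function planToolCall(
  registry: Map<string, Tool>,
  call: ToolCall,
  remainingBudgetUsd: number,
): Decision {
  // Step 1: look up the tool by name.
  const tool = registry.get(call.name);
  if (!tool) {
    return { kind: 'reject', output: { error: `Unknown tool: ${call.name}` } };
  }
  // Step 2: apply governance gates (simplified stand-ins here).
  if (remainingBudgetUsd <= 0) {
    return { kind: 'reject', output: { error: 'Cost cap reached' } };
  }
  return tool.requiresApproval
    ? { kind: 'needs_approval', tool }
    : { kind: 'run', tool };
}
```

For a 'run' decision, await tool.run(call.input) and pass the result to voice.submitToolResult({ id: call.id, output }). For a 'reject' decision, submitting decision.output as the tool result is one reasonable option, since it gives the model something to respond to instead of leaving the call unanswered.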

Cost

Realtime audio is billed per audio token at materially higher rates than text. fabric-harness records audioInputTokens / audioOutputTokens on usage and rolls them into costUsd via the static price table:

$ fh metrics call-3a8b...
Tokens: input=1200 output=400 total=1600
Audio: input=12345 output=6789
Cost: $0.234567

Override the rates with registerModelPrices for negotiated contracts:

import { registerModelPrices } from '@fabric-harness/sdk';

registerModelPrices([{
  provider: 'openai',
  model: 'gpt-4o-realtime-preview',
  inputPerMTok: 4,
  outputPerMTok: 16,
  audioInputPerMTok: 80,
  audioOutputPerMTok: 160,
  effectiveAt: '2026-05-08',
  notes: 'Enterprise contract',
}]);
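To see how per-million-token rates like these turn into a dollar figure, the arithmetic can be sketched as a linear sum. This assumes text and audio token counts are disjoint and that no other charges apply; the exact formula fabric-harness uses is not shown in this page, and estimateCostUsd is a hypothetical helper, not an SDK function. The usage numbers are made up for the example.

```typescript
// Rates in USD per million tokens, mirroring the price-table fields above.
interface Rates {
  inputPerMTok: number;
  outputPerMTok: number;
  audioInputPerMTok: number;
  audioOutputPerMTok: number;
}

// Assumed to be disjoint counts (text tokens exclude audio tokens).
interface Usage {
  inputTokens: number;
  outputTokens: number;
  audioInputTokens: number;
  audioOutputTokens: number;
}

// Hypothetical helper: tokens * (USD per 1M tokens) / 1M = USD.
function estimateCostUsd(usage: Usage, rates: Rates): number {
  const weighted =
    usage.inputTokens * rates.inputPerMTok +
    usage.outputTokens * rates.outputPerMTok +
    usage.audioInputTokens * rates.audioInputPerMTok +
    usage.audioOutputTokens * rates.audioOutputPerMTok;
  return weighted / 1e6;
}

const exampleRates: Rates = {
  inputPerMTok: 4,
  outputPerMTok: 16,
  audioInputPerMTok: 80,
  audioOutputPerMTok: 160,
};

const exampleUsage: Usage = {
  inputTokens: 1000,
  outputTokens: 500,
  audioInputTokens: 10000,
  audioOutputTokens: 5000,
};

// 0.004 + 0.008 + 0.8 + 0.8 = 1.612 USD; audio dominates the total.
const exampleCostUsd = estimateCostUsd(exampleUsage, exampleRates);
```

With these illustrative numbers, audio accounts for $1.60 of the $1.612 total, which is why audio token counts deserve the most attention when budgeting a call.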

Telephony bridges

Phone calls? Don't add a Twilio dep to fabric-harness. Use fh add to scaffold a project-local bridge:

fh add https://www.twilio.com/docs/voice/twiml/stream --category voice-telephony | claude
fh add https://developer.vonage.com/voice/voice-api/code-snippets --category voice-telephony | cursor-agent

This emits the canonical telephony spec (packages/sdk/connector-spec/voice-telephony.md) with a header pointing at the provider's docs. The coding agent reads the docs, follows the spec, and produces a single file at ./connectors/<provider>-bridge.ts. Twilio uses μ-law 8kHz natively — set audioFormat: 'g711_ulaw' and skip the resample for the lowest-latency path.
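For orientation, the core of a generated Twilio bridge is usually a pair of frame converters between Twilio's JSON messages and the raw bytes VoiceSession expects. The sketch below assumes the Twilio Media Streams message shape (an event field, a streamSid, and base64 audio in media.payload); check the Twilio docs before relying on it. decodeInbound and encodeOutbound are hypothetical helper names, not part of the telephony spec or the SDK.

```typescript
// Assumed shape of a Twilio Media Streams "media" message.
interface TwilioMediaMessage {
  event: 'media';
  streamSid: string;
  media: { payload: string }; // base64-encoded mu-law 8kHz audio
}

// Inbound: Twilio JSON text frame -> raw bytes for voice.sendAudio().
// Returns null for non-media messages (connected, start, stop, ...).
function decodeInbound(raw: string): Uint8Array | null {
  const msg = JSON.parse(raw) as { event?: string; media?: { payload?: string } };
  if (msg.event !== 'media' || !msg.media || !msg.media.payload) return null;
  const binary = atob(msg.media.payload);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Outbound: audio_delta bytes -> Twilio JSON text frame.
function encodeOutbound(streamSid: string, frame: Uint8Array): string {
  let binary = '';
  for (let i = 0; i < frame.length; i++) binary += String.fromCharCode(frame[i]);
  const msg: TwilioMediaMessage = {
    event: 'media',
    streamSid,
    media: { payload: btoa(binary) },
  };
  return JSON.stringify(msg);
}
```

With audioFormat: 'g711_ulaw' on the voice session, the decoded bytes can be forwarded as-is in both directions, which is what makes the no-resample path possible.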

Browser-direct voice via fh server

fh server ships a WS /sessions/:id/voice upgrade handler (v1.11+). The server creates the OpenAI Realtime connection on behalf of the browser — provider API keys stay server-side. Tool calls relay through the agent's existing tool registry, so approval policies, cost caps, and rate limiters apply to voice tools just like text tools.

import { connectFabricVoice } from '@fabric-harness/sdk';

const handle = connectFabricVoice({
  url: `wss://app.example.com/sessions/${sessionId}/voice`,
  authToken,
  tenantId: 'acme',
  voice: 'alloy',
  instructions: 'Friendly intake agent. One question at a time.',
  onEvent: (event) => {
    if (event.type === 'audio') speaker.write(event.audio);
    if (event.type === 'tool_call') {
      // Optional: handle tool calls client-side. Default is server-side relay.
    }
    if (event.type === 'cost_limit') {
      console.warn('Hit cost ceiling', event);
      handle.close();
    }
  },
  onError: console.error,
});

mic.on('frame', (pcm16) => handle.sendAudio(pcm16));

Server requirements:

  • OPENAI_API_KEY env var (if it is missing, the bridge fails with WebSocket close code 1011 / error).
  • ws peer dep installed (pnpm add ws).
  • Optional: FABRIC_HARNESS_API_TOKEN, extractAuthToken, X-Fabric-Tenant — same pipeline as the chat WS.

Query params customize per-connection: ?model=gpt-4o-realtime-preview&voice=alloy&audioFormat=g711_ulaw&instructions=....
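Building that URL by hand is error-prone once instructions contain spaces or punctuation. A small helper using the standard URL API handles the encoding; buildVoiceUrl and VoiceQuery are hypothetical names for illustration, and only the four query params listed above are assumed to exist.

```typescript
// The per-connection query params named above.
interface VoiceQuery {
  model?: string;
  voice?: string;
  audioFormat?: 'pcm16' | 'g711_ulaw';
  instructions?: string;
}

// Hypothetical helper: builds the WS URL with properly encoded params.
function buildVoiceUrl(base: string, sessionId: string, query: VoiceQuery = {}): string {
  const url = new URL(`/sessions/${encodeURIComponent(sessionId)}/voice`, base);
  const keys: (keyof VoiceQuery)[] = ['model', 'voice', 'audioFormat', 'instructions'];
  for (const key of keys) {
    const value = query[key];
    // Skip unset params so the server-side defaults apply.
    if (value !== undefined) url.searchParams.set(key, value);
  }
  return url.toString();
}

const exampleVoiceUrl = buildVoiceUrl('wss://app.example.com', 'abc123', {
  voice: 'alloy',
  audioFormat: 'g711_ulaw',
  instructions: 'Friendly intake agent. One question at a time.',
});
```

The resulting string can be passed as the url option of connectFabricVoice.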

Capture pipeline

  1. navigator.mediaDevices.getUserMedia({ audio: true }).
  2. Pipe through an AudioWorklet that resamples to PCM 16-bit 24kHz mono (or μ-law 8kHz for audioFormat: 'g711_ulaw').
  3. Forward each frame via handle.sendAudio(buffer).
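The sample-format half of step 2 can be sketched as two pure functions: one packs Web Audio float samples into little-endian PCM 16-bit bytes, the other applies the standard G.711 μ-law companding to a single 16-bit sample. Resampling from the device rate to 24kHz or 8kHz is a separate step and is not shown. floatToPcm16 and pcm16ToMulaw are hypothetical helper names, not SDK exports.

```typescript
// Convert Web Audio float samples (nominally -1..1) to little-endian PCM16 bytes.
function floatToPcm16(samples: Float32Array): Uint8Array {
  const view = new DataView(new ArrayBuffer(samples.length * 2));
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Asymmetric scale so -1 maps to -32768 and +1 maps to 32767.
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true); // true = little-endian
  }
  return new Uint8Array(view.buffer);
}

// Encode one signed 16-bit sample as a G.711 mu-law byte.
function pcm16ToMulaw(sample: number): number {
  const BIAS = 0x84;  // 132, added before finding the segment
  const CLIP = 32635; // largest magnitude that stays in range after the bias
  const sign = sample < 0 ? 0x80 : 0;
  const magnitude = Math.min(Math.abs(sample), CLIP) + BIAS;
  // Segment number = position of the highest set bit among bits 7..14.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  // mu-law bytes are stored bit-inverted.
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}
```

Inside an AudioWorklet, these would run on each process() block after resampling, with the resulting bytes posted to the main thread for handle.sendAudio().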

Cost governance

VoiceConnectOptions.costBudget (and the WS bridge's costLimit option) wires voice into the v1.4 cost-budget machinery. On every response.done, the tracker observes the call cost and emits a cost_limit event when a ceiling is crossed. With onExceed: 'approve', your requestCostLimitApproval is called before the session continues. Pair with tenantCostLimit() for per-tenant per-period ceilings.
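Conceptually, the tracker is a running total compared against a ceiling after each response. The sketch below illustrates that idea only: createCostTracker and its return shape are hypothetical, and the real cost-budget machinery (including how onExceed and approvals are wired) is not reproduced here.

```typescript
// Hypothetical result of observing one response's cost.
interface BudgetObservation {
  totalUsd: number;
  crossed: boolean; // true only on the observation that first crosses the ceiling
  over: boolean;    // true for as long as the total exceeds the ceiling
}

// Closure-based tracker: accumulate per-response cost against a ceiling.
function createCostTracker(ceilingUsd: number) {
  let totalUsd = 0;
  return {
    // Call once per completed response with that response's cost.
    observe(responseCostUsd: number): BudgetObservation {
      const wasOver = totalUsd > ceilingUsd;
      totalUsd += responseCostUsd;
      const over = totalUsd > ceilingUsd;
      return { totalUsd, crossed: over && !wasOver, over };
    },
  };
}
```

Distinguishing crossed from over means a cost_limit style notification would fire once at the moment the ceiling is passed, instead of on every later response.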

See also