Voice
Bidirectional audio streaming for phone calls, browser mic/speaker, and kiosk audio. Realtime mode (OpenAI) + pipeline mode (Deepgram + ElevenLabs / Cartesia) behind one VoiceSession contract.
VoiceSession is fabric-harness's surface for voice-capable LLMs. It streams audio bytes in both directions, surfaces text transcripts, and round-trips tool calls — same governance posture as text agents (cost telemetry, audit, approvals all apply).
fabric-harness ships two voice modes behind the same VoiceSession contract:
- Realtime mode: one model owns audio in + out + tools. OpenAI Realtime today; Anthropic / Gemini Live when GA.
- Pipeline mode: STT → LLM → TTS. Compose Deepgram / Cartesia STT with any LLM and ElevenLabs / Cartesia TTS. Swap providers without changing the caller code.
See Voice Providers for the provider matrix, pricing, and the picking heuristic. Telephony bridges (Twilio Media Streams, Vonage, Plivo) live in user-space; scaffold one with fh add.
The contract
interface VoiceSession {
sendAudio(frame: Uint8Array): Promise<void>;
sendText(text: string): Promise<void>;
submitToolResult(input: { id: string; output: unknown }): Promise<void>;
cancelResponse(): Promise<void>;
events(): AsyncIterable<VoiceEvent>;
close(): Promise<void>;
}
type VoiceEvent =
| { type: 'session_open' }
| { type: 'audio_delta'; audio: Uint8Array }
| { type: 'text_delta'; delta: string; role: 'assistant' | 'user' }
| { type: 'transcript'; text: string; role: 'assistant' | 'user' }
| { type: 'tool_call'; id: string; name: string; input: unknown }
| { type: 'response_done'; usage?: ModelUsage }
| { type: 'error'; message: string };
audio_delta carries raw PCM 16-bit (or g711_ulaw for telephony). tool_call flags an LLM tool invocation: your code runs the tool and calls submitToolResult({ id, output }) to feed the result back.
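As a rough illustration of consuming the contract above, the sketch below folds a sequence of events into a per-turn summary. The VoiceEvent union is re-declared locally so the snippet stands alone, ModelUsage is simplified to an assumed shape, and summarizeTurn / TurnSummary are hypothetical names rather than SDK exports.

```typescript
// Local copy of the event union from the contract, so the sketch is self-contained.
// ModelUsage is an assumed, simplified shape; the SDK type may carry more fields.
type ModelUsage = { inputTokens?: number; outputTokens?: number; costUsd?: number };

type VoiceEvent =
  | { type: 'session_open' }
  | { type: 'audio_delta'; audio: Uint8Array }
  | { type: 'text_delta'; delta: string; role: 'assistant' | 'user' }
  | { type: 'transcript'; text: string; role: 'assistant' | 'user' }
  | { type: 'tool_call'; id: string; name: string; input: unknown }
  | { type: 'response_done'; usage?: ModelUsage }
  | { type: 'error'; message: string };

// Hypothetical summary shape, for illustration only.
interface TurnSummary {
  audioBytes: number;
  assistantText: string;
  toolCalls: string[];
  errors: string[];
  done: boolean;
}

function summarizeTurn(events: Iterable<VoiceEvent>): TurnSummary {
  const summary: TurnSummary = {
    audioBytes: 0,
    assistantText: '',
    toolCalls: [],
    errors: [],
    done: false,
  };
  for (const event of events) {
    switch (event.type) {
      case 'audio_delta':
        summary.audioBytes += event.audio.byteLength;
        break;
      case 'text_delta':
        // Only assistant deltas contribute to the reply text.
        if (event.role === 'assistant') summary.assistantText += event.delta;
        break;
      case 'tool_call':
        summary.toolCalls.push(event.name);
        break;
      case 'error':
        summary.errors.push(event.message);
        break;
      case 'response_done':
        summary.done = true;
        break;
      case 'session_open':
      case 'transcript':
        break; // Nothing to accumulate for these in this sketch.
      default: {
        // Exhaustiveness check: compilation fails if the union gains a new member.
        const unreachable: never = event;
        throw new Error(`Unhandled event: ${JSON.stringify(unreachable)}`);
      }
    }
  }
  return summary;
}
```

Switching on event.type with a never-typed default is one way to make the compiler flag any event variant the handler forgot.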
OpenAI Realtime
import { OpenAIRealtimeVoiceProvider } from '@fabric-harness/sdk';
const provider = new OpenAIRealtimeVoiceProvider({
apiKey: process.env.OPENAI_API_KEY!,
model: 'gpt-4o-realtime-preview',
});
const voice = await provider.connect({
instructions: 'You are a friendly intake agent. One question at a time.',
voice: 'alloy',
audioFormat: 'pcm16', // or 'g711_ulaw' for Twilio
tools: [submitFieldTool],
turnDetection: 'server_vad', // OpenAI handles barge-in detection
});
// Pump audio in (mic / phone)
mic.on('data', (frame) => voice.sendAudio(frame));
// Pump audio out (speaker / phone)
for await (const event of voice.events()) {
if (event.type === 'audio_delta') speaker.write(event.audio);
if (event.type === 'tool_call') {
const output = await runTool(event.name, event.input);
await voice.submitToolResult({ id: event.id, output });
}
if (event.type === 'response_done') break;
}
The provider uses the platform WebSocket (Node 22+ and browsers), so there is no ws dependency on the consumer side.
Pipeline mode (ElevenLabs / Cartesia / Deepgram)
When you want a non-OpenAI vendor combo — voice cloning, multilingual TTS, on-prem STT, or just lower per-minute cost — use PipelineVoiceProvider. It composes a SttProvider, a text ModelProvider, and a TtsProvider into the same VoiceSession contract. Swap any of the three without touching caller code.
import {
PipelineVoiceProvider,
DeepgramSttProvider,
ElevenLabsTtsProvider,
AnthropicModelProvider,
} from '@fabric-harness/sdk';
const provider = new PipelineVoiceProvider({
stt: new DeepgramSttProvider({ apiKey: process.env.DEEPGRAM_API_KEY!, model: 'nova-3' }),
tts: new ElevenLabsTtsProvider({
apiKey: process.env.ELEVENLABS_API_KEY!,
model: 'eleven_turbo_v2_5',
defaultVoice: '21m00Tcm4TlvDq8ikWAM', // 'Rachel'
}),
llm: new AnthropicModelProvider({ apiKey: process.env.ANTHROPIC_API_KEY! }),
model: 'claude-haiku-4-5-20251001',
});
const voice = await provider.connect({
instructions: 'Friendly intake agent. One question at a time.',
tools: [submitFieldTool],
});
// Same loop as realtime mode: events(), sendAudio(), submitToolResult().
Conversation flow:
- Caller pumps audio into voice.sendAudio().
- Deepgram emits final transcripts → pipeline forwards as the next user turn.
- LLM generates the assistant reply (with tool calls if relevant).
- Assistant text streams to ElevenLabs → audio bytes arrive as audio_delta events.
- response_done fires with rolled-up usage (inputTokens, outputTokens, sttSeconds, ttsCharacters, costUsd).
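The flow above can be sketched with async generators. This is a minimal illustration under simplifying assumptions, not the PipelineVoiceProvider implementation: the Stt, Llm, and Tts function types are hypothetical stand-ins for the real provider interfaces, TTS is invoked once per text delta, and tool calls, usage roll-up, and barge-in are left out.

```typescript
// Hypothetical stage signatures; the SDK's SttProvider / ModelProvider / TtsProvider differ.
type Stt = (audio: AsyncIterable<Uint8Array>) => AsyncIterable<string>; // final transcripts
type Llm = (userTurn: string) => AsyncIterable<string>; // assistant text deltas
type Tts = (text: string) => AsyncIterable<Uint8Array>; // synthesized audio chunks

type PipelineEvent =
  | { type: 'transcript'; text: string; role: 'user' }
  | { type: 'text_delta'; delta: string; role: 'assistant' }
  | { type: 'audio_delta'; audio: Uint8Array }
  | { type: 'response_done' };

async function* runPipeline(
  audio: AsyncIterable<Uint8Array>,
  stt: Stt,
  llm: Llm,
  tts: Tts,
): AsyncGenerator<PipelineEvent> {
  // Each final transcript from STT becomes one user turn.
  for await (const text of stt(audio)) {
    yield { type: 'transcript', text, role: 'user' };
    // Stream the LLM reply; synthesize each delta as soon as it arrives.
    for await (const delta of llm(text)) {
      yield { type: 'text_delta', delta, role: 'assistant' };
      for await (const chunk of tts(delta)) {
        yield { type: 'audio_delta', audio: chunk };
      }
    }
    // One response_done per completed assistant turn.
    yield { type: 'response_done' };
  }
}
```

Because each stage is just a function over async iterables, swapping a vendor means passing a different function; the event stream seen by the caller keeps the same shape.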
Barge-in: the STT VAD emits speech_started when the user interrupts; the pipeline aborts the in-flight LLM + TTS streams and returns to listening. Your UI should drain its audio buffer when audio_delta events stop arriving.
Single-vendor variant (no ElevenLabs key needed):
import { CartesiaSttProvider, CartesiaTtsProvider } from '@fabric-harness/sdk';
new PipelineVoiceProvider({
stt: new CartesiaSttProvider({ apiKey: process.env.CARTESIA_API_KEY! }),
tts: new CartesiaTtsProvider({ apiKey: process.env.CARTESIA_API_KEY! }),
llm: someLlmProvider,
model: 'gpt-4.1-mini',
});
For the full matrix (when to pick which mode, latency benchmarks, cost comparisons), see Voice Providers.
Tool calls
Tool execution is the caller's responsibility, not the voice session's. When tool_call lands:
- Look up the tool by name in your registry.
- Run it through whatever governance gates apply (approval policies, cost caps, rate limits) using the v1-era primitives unchanged.
- Call voice.submitToolResult({ id, output }).
This keeps the audio path lean and avoids reimplementing the loop machinery on the voice side.
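The three steps can be sketched as a single dispatcher. Everything here is hypothetical scaffolding for illustration: ToolHandler, Gate, and handleToolCall are not SDK exports, and the gate stands in for whichever approval, cost-cap, or rate-limit primitives your agent already uses. The only contract call relied on is submitToolResult({ id, output }).

```typescript
// Hypothetical types; the SDK's tool and policy types may look different.
type ToolHandler = (input: unknown) => Promise<unknown>;
type Gate = (name: string, input: unknown) => Promise<{ allowed: boolean; reason?: string }>;

interface ToolCall {
  id: string;
  name: string;
  input: unknown;
}

// The slice of VoiceSession this sketch needs.
interface ToolResultSink {
  submitToolResult(input: { id: string; output: unknown }): Promise<void>;
}

async function handleToolCall(
  call: ToolCall,
  registry: Map<string, ToolHandler>,
  gate: Gate,
  voice: ToolResultSink,
): Promise<void> {
  let output: unknown;
  const handler = registry.get(call.name); // Step 1: look up by name.
  if (!handler) {
    output = { error: `Unknown tool: ${call.name}` };
  } else {
    const verdict = await gate(call.name, call.input); // Step 2: governance gate.
    if (!verdict.allowed) {
      output = { error: verdict.reason ?? 'Denied by policy' };
    } else {
      try {
        output = await handler(call.input);
      } catch (err) {
        output = { error: err instanceof Error ? err.message : String(err) };
      }
    }
  }
  // Step 3: always answer, so the model is not left waiting on a result.
  await voice.submitToolResult({ id: call.id, output });
}
```

Returning an error payload instead of throwing is a design choice in this sketch: it lets the model hear about a denied or failed tool and recover within the conversation.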
Cost
Realtime audio is billed per audio token at materially higher rates than text. fabric-harness records audioInputTokens / audioOutputTokens on usage and rolls them into costUsd via the static price table:
$ fh metrics call-3a8b...
Tokens: input=1200 output=400 total=1600
Audio: input=12345 output=6789
Cost: $0.234567
Override the rates with registerModelPrices for negotiated contracts:
import { registerModelPrices } from '@fabric-harness/sdk';
registerModelPrices([{
provider: 'openai',
model: 'gpt-4o-realtime-preview',
inputPerMTok: 4,
outputPerMTok: 16,
audioInputPerMTok: 80,
audioOutputPerMTok: 160,
effectiveAt: '2026-05-08',
notes: 'Enterprise contract',
}]);
Telephony bridges
Phone calls? Don't add a Twilio dep to fabric-harness. Use fh add to scaffold a project-local bridge:
fh add https://www.twilio.com/docs/voice/twiml/stream --category voice-telephony | claude
fh add https://developer.vonage.com/voice/voice-api/code-snippets --category voice-telephony | cursor-agent
This emits the canonical telephony spec (packages/sdk/connector-spec/voice-telephony.md) with a header pointing at the provider's docs. The coding agent reads the docs, follows the spec, and produces a single file at ./connectors/<provider>-bridge.ts. Twilio uses μ-law 8kHz natively, so set audioFormat: 'g711_ulaw' and skip the resample for the lowest-latency path.
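For orientation, the core of such a bridge is small when no resampling is needed. The sketch below covers only the framing of Twilio Media Streams messages, where audio travels as base64 inside JSON media messages. decodeTwilioMedia and encodeTwilioMedia are hypothetical helper names, Buffer assumes a Node runtime, and the generated bridge file may be structured quite differently.

```typescript
// Subset of Twilio Media Streams inbound messages that this sketch handles.
type TwilioInbound =
  | { event: 'connected' }
  | { event: 'start'; start: { streamSid: string } }
  | { event: 'media'; media: { payload: string } } // base64 μ-law, 8 kHz
  | { event: 'stop' };

// Returns the audio bytes of an inbound 'media' message, or null for any other event.
// The result can be passed straight to voice.sendAudio() when audioFormat is 'g711_ulaw'.
function decodeTwilioMedia(raw: string): Uint8Array | null {
  const msg = JSON.parse(raw) as TwilioInbound;
  if (msg.event !== 'media') return null;
  return new Uint8Array(Buffer.from(msg.media.payload, 'base64'));
}

// Wraps audio_delta bytes as an outbound 'media' message for the given stream.
function encodeTwilioMedia(streamSid: string, audio: Uint8Array): string {
  return JSON.stringify({
    event: 'media',
    streamSid,
    media: { payload: Buffer.from(audio).toString('base64') },
  });
}
```

The streamSid arrives in the inbound start message; a bridge would typically remember it and reuse it for every outbound frame.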
Browser-direct voice via fh server
fh server ships a WS /sessions/:id/voice upgrade handler (v1.11+). The server creates the OpenAI Realtime connection on behalf of the browser — provider API keys stay server-side. Tool calls relay through the agent's existing tool registry, so approval policies, cost caps, and rate limiters apply to voice tools just like text tools.
import { connectFabricVoice } from '@fabric-harness/sdk';
const handle = connectFabricVoice({
url: `wss://app.example.com/sessions/${sessionId}/voice`,
authToken,
tenantId: 'acme',
voice: 'alloy',
instructions: 'Friendly intake agent. One question at a time.',
onEvent: (event) => {
if (event.type === 'audio') speaker.write(event.audio);
if (event.type === 'tool_call') {
// Optional: handle tool calls client-side. Default is server-side relay.
}
if (event.type === 'cost_limit') {
console.warn('Hit cost ceiling', event);
handle.close();
}
},
onError: console.error,
});
mic.on('frame', (pcm16) => handle.sendAudio(pcm16));
Server requirements:
- OPENAI_API_KEY env var (the bridge fails with 1011 / error if missing).
- ws peer dep installed (pnpm add ws).
- Optional: FABRIC_HARNESS_API_TOKEN, extractAuthToken, X-Fabric-Tenant (same pipeline as the chat WS).
Query params customize per-connection: ?model=gpt-4o-realtime-preview&voice=alloy&audioFormat=g711_ulaw&instructions=....
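One way to assemble such a URL without hand-escaping the instructions text is the standard URL API. buildVoiceUrl and VoiceQueryOptions below are hypothetical names for illustration; only the path shape and the four query parameter names come from the text above.

```typescript
// Hypothetical options type; keys mirror the query params listed above.
interface VoiceQueryOptions {
  model?: string;
  voice?: string;
  audioFormat?: 'pcm16' | 'g711_ulaw';
  instructions?: string;
}

function buildVoiceUrl(base: string, sessionId: string, options: VoiceQueryOptions = {}): string {
  const url = new URL(`/sessions/${encodeURIComponent(sessionId)}/voice`, base);
  for (const [key, value] of Object.entries(options)) {
    // Omitted options are left off the URL entirely.
    if (value !== undefined) url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Example: a telephony-format session with custom instructions.
const voiceUrl = buildVoiceUrl('wss://app.example.com', 'abc123', {
  audioFormat: 'g711_ulaw',
  instructions: 'Friendly intake agent. One question at a time.',
});
```

searchParams.set percent-encodes values, so instructions containing spaces, ampersands, or question marks survive the round trip.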
Capture pipeline
- navigator.mediaDevices.getUserMedia({ audio: true }).
- Pipe through an AudioWorklet that resamples to PCM 16-bit 24kHz mono (or μ-law 8kHz for audioFormat: 'g711_ulaw').
- Forward each frame via handle.sendAudio(buffer).
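The conversion step inside the worklet can be sketched as two pure functions. This is an illustrative approach, not the SDK's worklet: resampleLinear and floatToPcm16 are hypothetical names, linear interpolation is used for brevity, and μ-law encoding is not shown.

```typescript
// Linear-interpolation resampler. Adequate for a sketch; a production worklet
// should low-pass filter before downsampling to limit aliasing.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate === toRate) return input;
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// Converts Web Audio float samples in [-1, 1] to signed 16-bit PCM, clamping out-of-range input.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return out;
}

// Example: a 48 kHz mono capture frame converted to 24 kHz PCM16 bytes.
const captured = new Float32Array(128);
const pcm = floatToPcm16(resampleLinear(captured, 48000, 24000));
const frameBytes = new Uint8Array(pcm.buffer);
```

A Uint8Array view over the Int16Array buffer, like frameBytes here, has the shape sendAudio expects; the bytes are in platform order, which is little-endian on the platforms browsers commonly run on.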
Cost governance
VoiceConnectOptions.costBudget (and the WS bridge's costLimit option) wires voice into the v1.4 cost-budget machinery. On every response.done, the tracker observes the call cost and emits a cost_limit event when a ceiling is crossed. With onExceed: 'approve', your requestCostLimitApproval is called before the session continues. Pair with tenantCostLimit() for per-tenant per-period ceilings.
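To make the mechanics concrete, here is a minimal sketch of per-response cost calculation plus a ceiling check. It is not the v1.4 tracker: costUsdFor and createBudgetTracker are hypothetical names, the rate fields mirror the registerModelPrices example, and the rates themselves are the illustrative contract numbers from that example rather than published pricing.

```typescript
// Per-million-token rates, mirroring the registerModelPrices fields.
interface Rates {
  inputPerMTok: number;
  outputPerMTok: number;
  audioInputPerMTok: number;
  audioOutputPerMTok: number;
}

// Assumed usage shape for one response.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  audioInputTokens: number;
  audioOutputTokens: number;
}

function costUsdFor(usage: Usage, rates: Rates): number {
  const microDollars =
    usage.inputTokens * rates.inputPerMTok +
    usage.outputTokens * rates.outputPerMTok +
    usage.audioInputTokens * rates.audioInputPerMTok +
    usage.audioOutputTokens * rates.audioOutputPerMTok;
  return microDollars / 1e6;
}

// Hypothetical tracker: accumulates cost per response and reports when the ceiling is crossed.
function createBudgetTracker(limitUsd: number) {
  let spentUsd = 0;
  return {
    observe(cost: number): { spentUsd: number; exceeded: boolean } {
      spentUsd += cost;
      return { spentUsd, exceeded: spentUsd > limitUsd };
    },
  };
}

const rates: Rates = { inputPerMTok: 4, outputPerMTok: 16, audioInputPerMTok: 80, audioOutputPerMTok: 160 };
const usage: Usage = { inputTokens: 1000, outputTokens: 500, audioInputTokens: 10000, audioOutputTokens: 5000 };

// (1000*4 + 500*16 + 10000*80 + 5000*160) / 1e6 = 1.612 USD per response.
const perResponse = costUsdFor(usage, rates);

const tracker = createBudgetTracker(3);
const first = tracker.observe(perResponse); // 1.612 spent, under the 3 USD ceiling
const second = tracker.observe(perResponse); // 3.224 spent, ceiling crossed
```

In this example the audio tokens account for 1.60 of the 1.612 USD, which is why a ceiling tuned for text agents can trip quickly on voice sessions.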
See also
- Cost telemetry
- Connector catalog
- Session memory — useful for storing collected fields across calls