Voice Providers
Provider matrix for realtime + pipeline voice agents — OpenAI Realtime, ElevenLabs, Cartesia, Deepgram. When to pick which, how to compose them, latency / pricing tradeoffs.
fabric-harness supports two voice modes — realtime (one model owns audio in + out + tools) and pipeline (separate STT, LLM, and TTS providers). This page is the cheat sheet for picking and composing providers.
TL;DR
- Want one vendor, lowest latency, English-leaning? Use OpenAI Realtime (realtime mode). One WebSocket, one bill, server-side VAD included.
- Want best-in-class voice quality (clones, emotion, multilingual) at higher latency? Use pipeline mode with Deepgram STT + ElevenLabs TTS. Mix any LLM you already trust.
- Want a single non-OpenAI vendor? Use pipeline mode with Cartesia STT + Cartesia TTS. One voice vendor, one voice bill, low latency; you still supply the LLM.
- Want to keep API keys server-side for browser apps? All four providers can sit behind WS /sessions/:id/voice. The browser never sees the key.
Mode comparison
| Aspect | Realtime mode | Pipeline mode |
|---|---|---|
| Architecture | One WS, model handles audio in + out + tools | Three streams: STT → LLM → TTS |
| Latency to first audio | ~300-600ms | ~600-1200ms |
| Voice flexibility | Provider's voices only | Any TTS vendor — voice clones, emotion, accents |
| Language support | Provider-bound | Any combo — Deepgram nova-3 covers 30+ languages |
| Tool calling | Native to the model | Through the underlying LLM (Anthropic/OpenAI/Gemini) |
| Cost (per minute, English) | $0.10-0.30 | $0.05-0.15 |
| Barge-in | Server VAD | STT VAD + caller cancels TTS |
| Best for | Phone agents, intake bots, low-latency UX | Multilingual, branded voice, vendor-flexibility, governance-heavy |
Provider matrix
| Provider | Role | Default model | Strengths | Pricing (May 2026) |
|---|---|---|---|---|
| OpenAI Realtime | Realtime end-to-end | gpt-4o-realtime-preview | One-vendor simplicity, server VAD, built-in tools | $5/M text-in, $20/M text-out, $100/M audio-in, $200/M audio-out |
| ElevenLabs | TTS only | eleven_turbo_v2_5 | Best voice quality, voice cloning, emotion control | ~$50/M chars (Turbo), ~$33/M chars (Flash) |
| Cartesia | TTS + STT | sonic-2 (TTS), ink-whisper (STT) | Single-vendor pipeline, lowest TTS latency | TTS ~$30/M chars, STT ~$0.0125/min |
| Deepgram | STT only | nova-3 | Best STT accuracy + latency, 30+ languages, VAD/endpointing | $0.0043/min streaming |
Rates are vendor-stated and captured for the static price table at the date shown. Override with registerModelPrices for negotiated contracts.
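As a rough worked example of how the per-minute and per-character rates in the matrix combine, the sketch below estimates the STT + TTS portion of a pipeline call. The `PipelineRates` type and `estimateVoiceCostUsd` function are illustrative names, not part of `@fabric-harness/sdk`, and the call length and character count are assumed values. LLM token cost is left out.

```typescript
// Rates copied from the provider matrix above (vendor-stated, May 2026).
// These names are hypothetical and exist only for this sketch.
interface PipelineRates {
  sttUsdPerMinute: number;       // e.g. Deepgram nova-3 streaming
  ttsUsdPerMillionChars: number; // e.g. ElevenLabs Turbo
}

const DEEPGRAM_PLUS_ELEVENLABS_TURBO: PipelineRates = {
  sttUsdPerMinute: 0.0043,
  ttsUsdPerMillionChars: 50,
};

// Estimate the STT + TTS cost of a call. LLM tokens are billed separately
// and are deliberately excluded here.
function estimateVoiceCostUsd(
  rates: PipelineRates,
  sttSeconds: number,
  ttsCharacters: number,
): number {
  const sttCost = (sttSeconds / 60) * rates.sttUsdPerMinute;
  const ttsCost = (ttsCharacters / 1_000_000) * rates.ttsUsdPerMillionChars;
  return sttCost + ttsCost;
}

// Assumption: a 5-minute call in which the agent speaks 2,000 characters.
const estimate = estimateVoiceCostUsd(DEEPGRAM_PLUS_ELEVENLABS_TURBO, 300, 2000);
console.log(estimate.toFixed(4)); // prints 0.1215
```

In this assumed call the TTS characters dominate the voice cost (about $0.10 of the $0.12), which is why the TTS tier matters more than the STT vendor when optimizing spend.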
Realtime mode (OpenAI)
import { OpenAIRealtimeVoiceProvider } from '@fabric-harness/sdk';

const provider = new OpenAIRealtimeVoiceProvider({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o-realtime-preview',
});

const voice = await provider.connect({
  instructions: 'You are a friendly intake agent.',
  voice: 'alloy',
  audioFormat: 'pcm16', // or 'g711_ulaw' for Twilio.
  tools: [submitFieldTool],
  turnDetection: 'server_vad',
});

The model owns the audio loop end-to-end. Tool calls relay through the existing tool_call / submitToolResult contract, the same as in pipeline mode.
Pipeline mode (BYO STT + LLM + TTS)
import {
  PipelineVoiceProvider,
  DeepgramSttProvider,
  ElevenLabsTtsProvider,
  AnthropicModelProvider,
} from '@fabric-harness/sdk';

const provider = new PipelineVoiceProvider({
  stt: new DeepgramSttProvider({
    apiKey: process.env.DEEPGRAM_API_KEY!,
    model: 'nova-3',
  }),
  tts: new ElevenLabsTtsProvider({
    apiKey: process.env.ELEVENLABS_API_KEY!,
    defaultVoice: '21m00Tcm4TlvDq8ikWAM', // 'Rachel'
    model: 'eleven_turbo_v2_5',
  }),
  llm: new AnthropicModelProvider({ apiKey: process.env.ANTHROPIC_API_KEY! }),
  model: 'claude-haiku-4-5-20251001',
});

const voice = await provider.connect({
  instructions: 'You are a friendly intake agent. One question at a time.',
  audioFormat: 'pcm16',
  tools: [submitFieldTool],
});

mic.on('frame', (pcm) => voice.sendAudio(pcm));

for await (const event of voice.events()) {
  if (event.type === 'audio_delta') speaker.write(event.audio);
  if (event.type === 'tool_call') {
    const output = await runTool(event.name, event.input);
    await voice.submitToolResult({ id: event.id, output });
  }
  if (event.type === 'response_done') {
    // event.usage contains rolled-up tokens, sttSeconds, ttsCharacters, costUsd.
  }
}

The contract is identical to realtime mode: VoiceSession events, tool calls, cost telemetry, costBudget, and cost_limit events. Swap providers without rewriting your loop.
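One way to take advantage of the shared contract is to write the event loop once against a narrow interface and hand it whichever session you connected. The sketch below is a minimal illustration under assumptions: `VoiceEvent`, `VoiceSessionLike`, `LoopHandlers`, and `runVoiceLoop` are local, hypothetical names that mirror only the events shown above, not the SDK's exported types.

```typescript
// Minimal local types mirroring the events used in the loop above.
// They are assumptions for this sketch, not the SDK's definitions.
type VoiceEvent =
  | { type: 'audio_delta'; audio: Uint8Array }
  | { type: 'tool_call'; id: string; name: string; input: unknown }
  | { type: 'response_done'; usage?: { costUsd?: number } };

interface VoiceSessionLike {
  events(): AsyncIterable<VoiceEvent>;
  submitToolResult(result: { id: string; output: unknown }): Promise<void>;
}

interface LoopHandlers {
  playAudio(chunk: Uint8Array): void;
  runTool(name: string, input: unknown): Promise<unknown>;
}

// Drives any session that satisfies VoiceSessionLike and returns the total
// cost reported across response_done events.
async function runVoiceLoop(
  voice: VoiceSessionLike,
  handlers: LoopHandlers,
): Promise<number> {
  let totalCostUsd = 0;
  for await (const event of voice.events()) {
    switch (event.type) {
      case 'audio_delta':
        handlers.playAudio(event.audio);
        break;
      case 'tool_call': {
        const output = await handlers.runTool(event.name, event.input);
        await voice.submitToolResult({ id: event.id, output });
        break;
      }
      case 'response_done':
        // usage and costUsd are optional, so default to zero.
        totalCostUsd += event.usage?.costUsd ?? 0;
        break;
    }
  }
  return totalCostUsd;
}
```

Because the loop depends only on the interface, the same function can be exercised in unit tests with a scripted fake session before it ever touches a live provider.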
Single-vendor variant (Cartesia)
import {
  PipelineVoiceProvider,
  CartesiaSttProvider,
  CartesiaTtsProvider,
} from '@fabric-harness/sdk';

const provider = new PipelineVoiceProvider({
  stt: new CartesiaSttProvider({ apiKey: process.env.CARTESIA_API_KEY! }),
  tts: new CartesiaTtsProvider({ apiKey: process.env.CARTESIA_API_KEY! }),
  llm: someLlmProvider,
  model: 'claude-haiku-4-5-20251001',
});

Barge-in
Pipeline mode wires barge-in through the STT VAD. When the user starts speaking while the agent is mid-utterance, the STT emits speech_started; the pipeline aborts the in-flight LLM call, cancels the TTS stream, and returns to listening. Your UI is responsible for stopping playback when audio_delta events stop arriving.
Realtime mode handles this server-side via turnDetection: 'server_vad'.
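The cancel-on-speech behavior described for pipeline mode can be sketched with the standard `AbortController`. This is an illustration of the general pattern, not the pipeline's actual internals: `TurnController` and `speak` are hypothetical names, and `speak` stands in for a streaming TTS call.

```typescript
// Hypothetical helper for illustration; the real pipeline may differ.
class TurnController {
  private current: AbortController | null = null;

  // Call when the agent starts a response. The returned signal should be
  // observed by both the LLM request and the TTS stream.
  beginAgentTurn(): AbortSignal {
    this.current?.abort(); // a new turn supersedes any unfinished one
    this.current = new AbortController();
    return this.current.signal;
  }

  // Call when STT reports speech_started. Returns true if an agent turn
  // was actually interrupted.
  onSpeechStarted(): boolean {
    if (!this.current || this.current.signal.aborted) return false;
    this.current.abort();
    return true;
  }
}

// Stand-in for a streaming TTS call: stops emitting once the signal aborts.
async function* speak(
  chunks: string[],
  signal: AbortSignal,
): AsyncGenerator<string> {
  for (const chunk of chunks) {
    if (signal.aborted) return;
    yield chunk;
  }
}
```

On the playback side, a UI would typically also flush any audio it has already buffered when the stream ends early, since chunks queued before the abort would otherwise keep playing over the user.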
Cost telemetry
Both modes feed the same telemetry surface. response_done.usage includes:
{
  inputTokens: number;        // LLM input
  outputTokens: number;       // LLM output
  audioInputTokens?: number;  // realtime mode only
  audioOutputTokens?: number; // realtime mode only
  sttSeconds?: number;        // pipeline mode (Deepgram/Cartesia)
  ttsCharacters?: number;     // pipeline mode (ElevenLabs/Cartesia)
  costUsd?: number;           // rolled up via static price table
}

Wire costBudget to enforce per-call / per-session / per-tenant ceilings. Voice participates in the same v1.4 cost-budget machinery as text agents.
import { CostBudgetTracker } from '@fabric-harness/sdk';

const budget = new CostBudgetTracker({ perCallUsd: 0.50, perSessionUsd: 5 });
await provider.connect({ costBudget: budget });

Choosing models
- Lowest latency: OpenAI Realtime > Cartesia (TTS) ≈ Deepgram (STT) > ElevenLabs Flash > ElevenLabs Turbo > ElevenLabs Multilingual.
- Best voice quality: ElevenLabs Multilingual v2 > ElevenLabs Turbo v2.5 > Cartesia Sonic-2 > OpenAI Realtime voices.
- Best STT accuracy (English): Deepgram nova-3 > Cartesia ink-whisper.
- Best multilingual STT: Deepgram nova-3 (30+ langs) > Cartesia.
- Cheapest: Pipeline with Deepgram + ElevenLabs Flash + Haiku/Gemini Flash. Roughly 1/3 the cost of OpenAI Realtime for English.
When to bring your own provider
Implement TtsProvider or SttProvider and pass it into PipelineVoiceProvider, or implement VoiceProvider directly to own the whole session. Reasons to build your own:
- On-prem TTS (Riva, Coqui) for compliance.
- Whisper-via-vLLM for cheap multilingual STT.
- Translate-on-the-wire layers (e.g. STT in Spanish → MT → English LLM → TTS in Spanish).
The interfaces (TtsProvider, SttProvider) are intentionally narrow — synthesize(text) → AsyncIterable<Uint8Array> and open() → SttSession are the only required methods.
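To make the TTS side of that contract concrete, here is a minimal sketch of a custom provider. The `TtsProviderLike` interface is reconstructed from the description above and is an assumption, so check the SDK's exported `TtsProvider` type before relying on it. `SilenceTtsProvider` is a toy: it emits fixed-size silent chunks rather than real speech, and the 16 kHz, 16-bit mono format is an assumed value.

```typescript
// Assumed shape, reconstructed from the prose above; not the SDK's export.
interface TtsProviderLike {
  synthesize(text: string): AsyncIterable<Uint8Array>;
}

// Toy provider: emits one 20 ms chunk of silence per word.
// Assumes 16 kHz, 16-bit mono PCM: 16000 samples/s * 2 bytes * 0.02 s = 640 bytes.
// A real on-prem provider would stream audio from its engine instead.
class SilenceTtsProvider implements TtsProviderLike {
  private static readonly BYTES_PER_CHUNK = 640;

  async *synthesize(text: string): AsyncGenerator<Uint8Array> {
    const words = text.split(/\s+/).filter((w) => w.length > 0);
    for (const _word of words) {
      // Uint8Array is zero-filled, which is silence in signed 16-bit PCM.
      yield new Uint8Array(SilenceTtsProvider.BYTES_PER_CHUNK);
    }
  }
}

// Helper for tests and demos: drain a provider into a list of chunks.
async function collectChunks(
  provider: TtsProviderLike,
  text: string,
): Promise<Uint8Array[]> {
  const chunks: Uint8Array[] = [];
  for await (const chunk of provider.synthesize(text)) chunks.push(chunk);
  return chunks;
}
```

Returning an async iterable rather than a single buffer lets the caller start playback on the first chunk and stop consuming early, which is what barge-in needs.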
See also
- Voice: the VoiceSession contract and OpenAI Realtime usage.
- Cost telemetry: pricing and rate-card overrides.
- Connector catalog: telephony bridges (Twilio, Vonage) for phone audio.