Voice Providers

Provider matrix for realtime + pipeline voice agents — OpenAI Realtime, ElevenLabs, Cartesia, Deepgram. When to pick which, how to compose them, latency / pricing tradeoffs.

fabric-harness supports two voice modes — realtime (one model owns audio in + out + tools) and pipeline (separate STT, LLM, and TTS providers). This page is the cheat sheet for picking and composing providers.

TL;DR

Want one vendor, lowest latency, English-leaning? Use OpenAI Realtime (realtime mode). One WebSocket, one bill, server-side VAD included.
Want best-in-class voice quality (clones, emotion, multilingual) at higher latency? Use pipeline mode with Deepgram STT + ElevenLabs TTS. Mix any LLM you already trust.
Want a single non-OpenAI vendor? Use pipeline mode with Cartesia STT + Cartesia TTS. One vendor, one bill, low latency.
Want to keep API keys server-side for browser apps? All four providers can sit behind WS /sessions/:id/voice. The browser never sees the key.

Mode comparison

Aspect	Realtime mode	Pipeline mode
Architecture	One WS, model handles audio in + out + tools	Three streams: STT → LLM → TTS
Latency to first audio	~300-600ms	~600-1200ms
Voice flexibility	Provider's voices only	Any TTS vendor — voice clones, emotion, accents
Language support	Provider-bound	Any combo — Deepgram nova-3 covers 30+ languages
Tool calling	Native to the model	Through the underlying LLM (Anthropic/OpenAI/Gemini)
Cost (per minute, English)	$0.10-0.30	$0.05-0.15
Barge-in	Server VAD	STT VAD + caller cancels TTS
Best for	Phone agents, intake bots, low-latency UX	Multilingual, branded voice, vendor-flexibility, governance-heavy

Provider matrix

Provider	Role	Default model	Strengths	Pricing (May 2026)
OpenAI Realtime	Realtime end-to-end	`gpt-4o-realtime-preview`	One-vendor simplicity, server VAD, built-in tools	$5/M text-in, $20/M text-out, $100/M audio-in, $200/M audio-out
ElevenLabs	TTS only	`eleven_turbo_v2_5`	Best voice quality, voice cloning, emotion control	~$50/M chars (Turbo), ~$33/M chars (Flash)
Cartesia	TTS + STT	`sonic-2` (TTS), `ink-whisper` (STT)	Single-vendor pipeline, lowest TTS latency	TTS ~$30/M chars, STT ~$0.0125/min
Deepgram	STT only	`nova-3`	Best STT accuracy + latency, 30+ languages, VAD/endpointing	$0.0043/min streaming

Rates are vendor-stated list prices, captured in the static price table as of the date shown. Override them with registerModelPrices if you have a negotiated contract.
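
To make the rates concrete, the sketch below estimates the STT plus TTS cost of one minute of pipeline audio using the Deepgram and ElevenLabs Flash figures from the matrix. The helper name `estimatePipelineSpeechCostUsd` and the figure of 800 synthesized characters per minute are assumptions for illustration, not SDK values, and LLM token cost is left out.

```typescript
// Illustrative only: rates copied from the provider matrix (May 2026).
const DEEPGRAM_STT_USD_PER_MIN = 0.0043;
const ELEVENLABS_FLASH_USD_PER_M_CHARS = 33;

// Hypothetical helper, not part of @fabric-harness/sdk.
function estimatePipelineSpeechCostUsd(sttMinutes: number, ttsCharacters: number): number {
  const stt = sttMinutes * DEEPGRAM_STT_USD_PER_MIN;
  const tts = (ttsCharacters / 1e6) * ELEVENLABS_FLASH_USD_PER_M_CHARS;
  return stt + tts;
}

// Assumption: the agent speaks about 800 characters during one minute of call time.
const perMinuteUsd = estimatePipelineSpeechCostUsd(1, 800);
console.log(perMinuteUsd.toFixed(4)); // 0.0043 (STT) + 0.0264 (TTS) = "0.0307"
```

Under these assumptions the speech legs cost about three cents per minute; the LLM's share depends on the model and prompt size.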

Realtime mode (OpenAI)

import { OpenAIRealtimeVoiceProvider } from '@fabric-harness/sdk';

const provider = new OpenAIRealtimeVoiceProvider({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o-realtime-preview',
});

const voice = await provider.connect({
  instructions: 'You are a friendly intake agent.',
  voice: 'alloy',
  audioFormat: 'pcm16',                 // or 'g711_ulaw' for Twilio.
  tools: [submitFieldTool],
  turnDetection: 'server_vad',
});

The model owns the audio loop end-to-end. Tool calls relay through the existing tool_call / submitToolResult contract — same as pipeline mode.

Pipeline mode (BYO STT + LLM + TTS)

import {
  PipelineVoiceProvider,
  DeepgramSttProvider,
  ElevenLabsTtsProvider,
  AnthropicModelProvider,
} from '@fabric-harness/sdk';

const provider = new PipelineVoiceProvider({
  stt: new DeepgramSttProvider({
    apiKey: process.env.DEEPGRAM_API_KEY!,
    model: 'nova-3',
  }),
  tts: new ElevenLabsTtsProvider({
    apiKey: process.env.ELEVENLABS_API_KEY!,
    defaultVoice: '21m00Tcm4TlvDq8ikWAM',          // 'Rachel'
    model: 'eleven_turbo_v2_5',
  }),
  llm: new AnthropicModelProvider({ apiKey: process.env.ANTHROPIC_API_KEY! }),
  model: 'claude-haiku-4-5-20251001',
});

const voice = await provider.connect({
  instructions: 'You are a friendly intake agent. One question at a time.',
  audioFormat: 'pcm16',
  tools: [submitFieldTool],
});

// `mic`, `speaker`, and `runTool` stand in for your app's own audio I/O and tool dispatcher.
mic.on('frame', (pcm) => voice.sendAudio(pcm));

for await (const event of voice.events()) {
  if (event.type === 'audio_delta') speaker.write(event.audio);
  if (event.type === 'tool_call') {
    const output = await runTool(event.name, event.input);
    await voice.submitToolResult({ id: event.id, output });
  }
  if (event.type === 'response_done') {
    // event.usage contains rolled-up tokens, sttSeconds, ttsCharacters, costUsd.
  }
}

The contract is identical to realtime mode — VoiceSession events, tool calls, cost telemetry, costBudget, cost_limit events. Swap providers without rewriting your loop.
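
The loop above calls `runTool`, which the application supplies. Here is a minimal sketch of such a dispatcher, assuming tools are plain async functions keyed by name. The `ToolHandler` type, the `handlers` registry, and the `submit_field` tool are hypothetical and do not come from the SDK.

```typescript
// Hypothetical app-side dispatcher; none of these names come from @fabric-harness/sdk.
type ToolHandler = (input: unknown) => Promise<unknown>;

const handlers = new Map<string, ToolHandler>();

// Example tool: records one intake field and reports which field was stored.
const fields: Record<string, string> = {};
handlers.set('submit_field', async (input) => {
  const { name, value } = input as { name: string; value: string };
  fields[name] = value;
  return { ok: true, stored: name };
});

async function runTool(name: string, input: unknown): Promise<unknown> {
  const handler = handlers.get(name);
  if (!handler) {
    // Return the failure as data so the model can recover instead of the loop throwing.
    return { ok: false, error: `unknown tool: ${name}` };
  }
  try {
    return await handler(input);
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Returning errors as tool output rather than throwing keeps the event loop alive for the rest of the call; whether that is the right policy depends on the tool.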

Single-vendor variant (Cartesia)

import {
  PipelineVoiceProvider,
  CartesiaSttProvider,
  CartesiaTtsProvider,
} from '@fabric-harness/sdk';

const provider = new PipelineVoiceProvider({
  stt: new CartesiaSttProvider({ apiKey: process.env.CARTESIA_API_KEY! }),
  tts: new CartesiaTtsProvider({ apiKey: process.env.CARTESIA_API_KEY! }),
  llm: someLlmProvider,                 // placeholder: any LLM provider instance, e.g. an AnthropicModelProvider
  model: 'claude-haiku-4-5-20251001',
});

Barge-in

Pipeline mode wires barge-in through the STT VAD. When the user starts speaking while the agent is mid-utterance, the STT emits speech_started; the pipeline aborts the in-flight LLM call, cancels the TTS stream, and returns to listening. Your UI is responsible for stopping playback when audio_delta events stop arriving.

Realtime mode handles this server-side via turnDetection: 'server_vad'.
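
Because the UI owns playback in pipeline mode, audio that is already buffered has to be discarded when the user interrupts. Below is a minimal sketch of a playback buffer with a flush step. The `PlaybackBuffer` class is hypothetical, and which signal should trigger `flush()` (an interruption event surfaced by your session, or a gap in audio_delta events) is an assumption to check against the SDK.

```typescript
// Hypothetical client-side buffer; not part of @fabric-harness/sdk.
class PlaybackBuffer {
  private chunks: Uint8Array[] = [];

  // Queue a chunk received in an audio_delta event.
  enqueue(chunk: Uint8Array): void {
    this.chunks.push(chunk);
  }

  // Hand the next chunk to the audio device, or undefined when the buffer is empty.
  next(): Uint8Array | undefined {
    return this.chunks.shift();
  }

  // Drop everything not yet played; returns the number of bytes discarded.
  flush(): number {
    const dropped = this.chunks.reduce((sum, c) => sum + c.byteLength, 0);
    this.chunks = [];
    return dropped;
  }
}

const playback = new PlaybackBuffer();
playback.enqueue(new Uint8Array(320));
playback.enqueue(new Uint8Array(320));
playback.next();                        // first chunk goes to the speaker
const droppedBytes = playback.flush();  // user barged in: discard the remaining 320 bytes
```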

Cost telemetry

Both modes feed the same telemetry surface. response_done.usage includes:

{
  inputTokens: number;       // LLM input
  outputTokens: number;      // LLM output
  audioInputTokens?: number; // realtime mode only
  audioOutputTokens?: number;// realtime mode only
  sttSeconds?: number;       // pipeline mode (Deepgram/Cartesia)
  ttsCharacters?: number;    // pipeline mode (ElevenLabs/Cartesia)
  costUsd?: number;          // rolled up via static price table
}
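
Since every response_done event carries a usage object of this shape, per-session totals can be kept with a small accumulator. The sketch below redeclares the shape locally as `VoiceUsage` (a hypothetical name; the SDK may export its own type) and sums the fields, treating absent optional fields as zero.

```typescript
// Local copy of the usage shape shown above; the type name is an assumption.
interface VoiceUsage {
  inputTokens: number;
  outputTokens: number;
  audioInputTokens?: number;
  audioOutputTokens?: number;
  sttSeconds?: number;
  ttsCharacters?: number;
  costUsd?: number;
}

type UsageTotals = Required<VoiceUsage>;

function emptyUsage(): UsageTotals {
  return {
    inputTokens: 0, outputTokens: 0,
    audioInputTokens: 0, audioOutputTokens: 0,
    sttSeconds: 0, ttsCharacters: 0, costUsd: 0,
  };
}

// Add one turn's usage to the running totals; missing optional fields count as 0.
function addUsage(total: UsageTotals, turn: VoiceUsage): UsageTotals {
  return {
    inputTokens: total.inputTokens + turn.inputTokens,
    outputTokens: total.outputTokens + turn.outputTokens,
    audioInputTokens: total.audioInputTokens + (turn.audioInputTokens ?? 0),
    audioOutputTokens: total.audioOutputTokens + (turn.audioOutputTokens ?? 0),
    sttSeconds: total.sttSeconds + (turn.sttSeconds ?? 0),
    ttsCharacters: total.ttsCharacters + (turn.ttsCharacters ?? 0),
    costUsd: total.costUsd + (turn.costUsd ?? 0),
  };
}
```

In the event loop this would be called as `sessionTotal = addUsage(sessionTotal, event.usage)` inside the response_done branch.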

Wire costBudget to enforce per-call / per-session / per-tenant ceilings — voice participates in the same v1.4 cost-budget machinery as text agents.

import { CostBudgetTracker } from '@fabric-harness/sdk';

const budget = new CostBudgetTracker({ perCallUsd: 0.50, perSessionUsd: 5 });
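// Pass the tracker alongside your usual connect options (instructions, tools, audioFormat).
const voice = await provider.connect({ costBudget: budget });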

Choosing models

Lowest latency (fastest first): OpenAI Realtime > Cartesia (TTS) ≈ Deepgram (STT) > ElevenLabs Flash > ElevenLabs Turbo > ElevenLabs Multilingual.
Best voice quality: ElevenLabs Multilingual v2 > ElevenLabs Turbo v2.5 > Cartesia Sonic-2 > OpenAI Realtime voices.
Best STT accuracy (English): Deepgram nova-3 > Cartesia ink-whisper.
Best multilingual STT: Deepgram nova-3 (30+ langs) > Cartesia.
Cheapest: Pipeline with Deepgram + ElevenLabs Flash + Haiku/Gemini Flash. Roughly 1/3 the cost of OpenAI Realtime for English.

When to bring your own provider

Implement TtsProvider or SttProvider and pass the instance into PipelineVoiceProvider, or implement VoiceProvider directly if you need to own the whole voice loop. Reasons to build your own:

On-prem TTS (Riva, Coqui) for compliance.
Whisper-via-vLLM for cheap multilingual STT.
Translate-on-the-wire layers (e.g. STT in Spanish → MT → English LLM → TTS in Spanish).

The interfaces (TtsProvider, SttProvider) are intentionally narrow — synthesize(text) → AsyncIterable<Uint8Array> and open() → SttSession are the only required methods.
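
A custom TTS provider can therefore be quite small. The sketch below declares a local `TtsProvider` interface that mirrors the signature quoted above (the SDK's real interface may carry additional optional members) and implements it with a hypothetical `SilenceTtsProvider` that emits silent PCM16 frames sized by text length. The 24 kHz sample rate, 20 ms frames, and 50 ms-per-character pacing are arbitrary assumptions.

```typescript
// Local stand-in for the SDK interface; only the required method is modeled.
interface TtsProvider {
  synthesize(text: string): AsyncIterable<Uint8Array>;
}

// Hypothetical provider that "speaks" silence; useful as a test double.
class SilenceTtsProvider implements TtsProvider {
  private readonly sampleRate = 24000; // Hz, assumed
  private readonly frameMs = 20;       // frame duration, assumed
  private readonly msPerChar = 50;     // pacing, assumed

  async *synthesize(text: string): AsyncGenerator<Uint8Array> {
    // 16-bit mono PCM: 2 bytes per sample.
    const bytesPerFrame = ((this.sampleRate * this.frameMs) / 1000) * 2;
    const frames = Math.ceil((text.length * this.msPerChar) / this.frameMs);
    for (let i = 0; i < frames; i++) {
      yield new Uint8Array(bytesPerFrame); // zero-filled buffer = silence
    }
  }
}

// Drain a provider and report how many bytes of audio it produced.
async function countAudioBytes(provider: TtsProvider, text: string): Promise<number> {
  let bytes = 0;
  for await (const chunk of provider.synthesize(text)) bytes += chunk.byteLength;
  return bytes;
}
```

A real provider would replace the loop body with chunks streamed from the synthesis engine; the async-generator shape stays the same.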