Technology

The real-time orchestration stack behind Frontbell.

Real-time agentic orchestration over a phone call is a hard problem. The agent has to qualify the work, book the job, generate the estimate, dispatch the crew, and follow up — while sounding human at sub-second latency, and never executing an action it cannot honestly attest to. We built our own stack because the off-the-shelf parts don't compose to that bar.

Architecture

End-to-end streaming, no batch handoffs.

Audio enters one side of the pipeline and a synthesized response leaves the other. Every stage streams. The pipeline never waits for a complete utterance, a complete LLM response, or a complete audio buffer.

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Caller  │───▶│  WebRTC  │───▶│   VAD    │───▶│ Streaming│───▶│   LLM    │
│   PSTN   │    │   +AEC   │    │  Local   │    │   ASR    │    │  Stream  │
└──────────┘    └────┬─────┘    └──────────┘    └──────────┘    └─────┬────┘
                     │                                                │
                     ▼                                                ▼
                ┌──────────┐                                    ┌──────────┐
                │   Echo   │                                    │   Tool   │
                │  Cancel  │                                    │  Engine  │
                └──────────┘                                    │  (state) │
                                                                └─────┬────┘
                                                                      ▼
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Caller  │◀───│  WebRTC  │◀───│   TTS    │◀───│  Stream  │◀───│   LLM    │
│  hears   │    │  Mixer   │    │  Stream  │    │  Buffer  │    │  Output  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘

Each box is a streaming transform. The orchestrator holds a single conversation state across them, can interrupt any downstream stage when the caller starts speaking again, and gates every tool call against an explicit state machine before it executes.
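To make the interruption behavior concrete, here is a minimal sketch of a streamed response that is cut off when the caller starts speaking. It is an illustration under stated assumptions, not Frontbell's production code: `fake_llm`, `speak`, and `respond_with_barge_in` are hypothetical names, the token list is made up, and a plain `asyncio.Event` stands in for whatever signal the VAD stage actually raises.

```python
import asyncio
from typing import AsyncIterator


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streamed LLM response."""
    for token in ["Sure,", " I", " can", " book", " that."]:
        await asyncio.sleep(0.01)  # simulated per-token delay
        yield token


async def speak(tokens: AsyncIterator[str], spoken: list[str]) -> None:
    """Hypothetical stand-in for streaming TTS: consumes tokens as they arrive."""
    async for token in tokens:
        spoken.append(token)


async def respond_with_barge_in(prompt: str, barge_in: asyncio.Event) -> list[str]:
    """Stream a response, cancelling the downstream stage if the caller talks."""
    spoken: list[str] = []
    playback = asyncio.create_task(speak(fake_llm(prompt), spoken))
    interrupt = asyncio.create_task(barge_in.wait())

    # Whichever finishes first wins: a completed response or a barge-in.
    _, pending = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Collect the cancelled task so nothing is left running in the background.
    await asyncio.gather(*pending, return_exceptions=True)
    return spoken
```

If the event is never set, the full response is returned; if it is set mid-response, only the tokens already handed to the playback stage are returned, which is the information an orchestrator would need to know what the caller actually heard.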

Latency

The sub-second budget.

Human conversational turn-taking expects ≤ 300ms of silence. We can't hit that round-trip through ASR + LLM + TTS today, but we can stay under 1 second — which is enough to feel cooperative rather than transactional.

~150ms: Streaming ASR, first partial transcript

~250ms: LLM time-to-first-token (frontier model, streamed)

~180ms: TTS time-to-first-audio (streaming voice)

~720ms: Frontbell p50 end-to-end response (production target)
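As a quick sanity check on those figures, the sketch below adds up the three first-output latencies and compares the total to the p50 target. The serial sum is a deliberately naive model; the page does not say how the remaining time is spent, so treating the difference as headroom for transport, endpointing, and jitter is an assumption. The names `STAGE_MS`, `serial_sum_ms`, and `headroom_ms` are illustrative.

```python
# First-output latencies quoted above, in milliseconds (approximate figures).
STAGE_MS = {
    "asr_first_partial": 150,
    "llm_first_token": 250,
    "tts_first_audio": 180,
}
P50_TARGET_MS = 720  # end-to-end production target quoted above


def serial_sum_ms(stages: dict[str, int]) -> int:
    """Naive critical path: each stage's first output feeds the next stage."""
    return sum(stages.values())


def headroom_ms(stages: dict[str, int], target_ms: int) -> int:
    """Time left in the target after the naive critical path."""
    return target_ms - serial_sum_ms(stages)


print(serial_sum_ms(STAGE_MS))               # 580
print(headroom_ms(STAGE_MS, P50_TARGET_MS))  # 140
```

Under this model the three stages account for roughly 580ms of the 720ms target, leaving about 140ms for everything else on the path.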

Every component runs concurrently. The TTS starts speaking before the LLM finishes generating; the LLM starts decoding before the ASR finalizes the utterance. That overlap is where the budget gets recovered.
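One common way to get that overlap between LLM and TTS is to flush the token stream to the synthesizer at phrase boundaries instead of waiting for the full response. The sketch below shows the idea; it is an assumption about how a stream buffer could work, not a description of Frontbell's. `phrase_chunks`, `BOUNDARIES`, and `min_chars` are hypothetical names and values.

```python
from typing import Iterable, Iterator

# Punctuation that is treated as a safe place to hand text to TTS (assumed set).
BOUNDARIES = (".", "!", "?", ",", ";", ":")


def phrase_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable phrases as soon as they are long enough and end cleanly."""
    buffer = ""
    for token in tokens:
        buffer += token
        text = buffer.strip()
        # Flush only at a boundary, and only once the phrase is long enough
        # that the synthesized audio will not sound clipped.
        if len(text) >= min_chars and text.endswith(BOUNDARIES):
            yield text
            buffer = ""
    # Flush whatever remains when the token stream ends.
    if buffer.strip():
        yield buffer.strip()


tokens = ["Sure", ",", " I", " can", " book", " that", " for", " Tuesday", ".",
          " Does", " 9am", " work", "?"]
for phrase in phrase_chunks(tokens):
    print(phrase)
```

With these inputs the first phrase is released as soon as the period arrives, so synthesis of the first sentence can begin while the second is still being generated.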

Components

What we built versus what we use.

Voice transport

Custom WebRTC-based full-duplex transport with caller-side and agent-side audio mixing. PSTN bridge through carrier-grade telephony with per-tenant subaccount isolation.

Acoustic preprocessing

Local Voice Activity Detection for endpointing and barge-in detection. Custom RMS-based echo cancellation for half-duplex carrier conditions where the caller's audio bleeds back into our outbound stream.
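To illustrate what an RMS-based check can look like, here is a small sketch that compares the energy of two concurrent audio frames and treats one as bleed when it is weak relative to the other. This is a simple energy gate, not full acoustic echo cancellation, and it is not Frontbell's algorithm. The function names, the thresholds `floor` and `echo_ratio`, the 16-bit PCM sample format, and the assumed direction of the bleed (agent audio returning on the caller leg) are all assumptions.

```python
import math
from typing import Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_caller_speech(
    inbound: Sequence[int],
    outbound: Sequence[int],
    floor: float = 300.0,
    echo_ratio: float = 0.6,
) -> bool:
    """Decide whether an inbound frame is real caller speech.

    A frame counts as speech only if it is above an absolute noise floor AND
    louder than a fraction of what is being played out at the same moment.
    """
    level = rms(inbound)
    if level < floor:
        return False  # too quiet to be speech
    return level > echo_ratio * rms(outbound)


silence = [0] * 160
agent = [1000, -1000] * 80   # agent audio currently playing
echo = [400, -400] * 80      # attenuated copy arriving on the other leg
caller = [1000, -1000] * 80  # caller talking over the agent

print(is_caller_speech(echo, agent))    # False: likely bleed, do not barge in
print(is_caller_speech(caller, agent))  # True: caller is talking over the agent
```

A gate like this trades accuracy for speed: it needs no adaptive filter, but a quiet caller speaking over loud playback can be misclassified as echo.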

ASR

Streaming third-party ASR for production. Evaluating self-hosted alternatives (including NVIDIA Riva Parakeet) for latency and per-call unit economics as we scale.

Reasoning

Frontier LLM via streaming API for natural conversation. Smaller fine-tuned models for intent classification, named-entity extraction, and routing — on a path to self-hosted inference once volume justifies it.

TTS

Streaming neural TTS with sub-200ms time-to-first-audio. Per-tenant voice selection. Voice cloning planned post-launch (Block 76).

Tool execution

Patent-pending state-machine workflow engine. Every tool call (book appointment, send SMS, look up customer, create estimate) is gated against a YAML-defined state graph. The agent cannot execute a tool the current state doesn't permit. Eliminates an entire class of LLM-hallucinated actions.
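The gating idea can be sketched in a few lines. This is an illustration only: the state names, the graph schema, and the `ToolGate` class are hypothetical, and the graph is written as a Python dict here rather than loaded from YAML so the example has no dependencies.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical state graph: state -> {permitted tool: next state}.
GRAPH = {
    "greeting": {"look_up_customer": "qualifying"},
    "qualifying": {"look_up_customer": "qualifying", "create_estimate": "estimated"},
    "estimated": {"book_appointment": "booked"},
    "booked": {"send_sms": "booked"},
}


class ToolNotPermitted(Exception):
    """Raised when the model asks for a tool the current state does not allow."""


@dataclass
class ToolGate:
    graph: dict
    state: str = "greeting"
    log: list = field(default_factory=list)

    def execute(self, tool: str, handler: Callable[[], Any]) -> Any:
        allowed = self.graph.get(self.state, {})
        if tool not in allowed:
            # The handler is never called, so no side effect can occur.
            raise ToolNotPermitted(f"{tool!r} is not permitted in state {self.state!r}")
        result = handler()
        # Advance only after the handler succeeded.
        self.state = allowed[tool]
        self.log.append(tool)
        return result


gate = ToolGate(GRAPH)
try:
    gate.execute("book_appointment", lambda: "booked")  # suggested too early
except ToolNotPermitted as err:
    print(err)
gate.execute("look_up_customer", lambda: {"id": 1})
print(gate.state)  # qualifying
```

Because the state advances only after a handler returns, the log doubles as a record of what actually ran, which is what lets the agent's spoken confirmation be checked against executed actions.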

Memory

Per-tenant persistent conversation memory across voice, SMS, and chat. Vector + relational hybrid; we don't reset context between channels for the same customer.

Telephony

Twilio subaccount-isolated tenants, 10DLC-compliant SMS, per-tenant brand and campaign registration. Carrier-grade SLA, not a softphone wrapper.

Inference roadmap

From API-call economics to per-call GPU economics.

For a launch product, frontier LLM APIs and managed ASR/TTS are the right call — they let one developer ship a voice agent that holds a real conversation. As call volume scales, the economics shift toward self-hosted inference for the latency-bound, high-frequency parts of the pipeline.

Self-hosted ASR + TTS

Streaming ASR is the largest single component of our per-call streaming spend and the largest contributor to first-token latency. We're benchmarking NVIDIA Riva (Parakeet ASR + Magpie/FastPitch TTS variants) against managed alternatives on cost, latency, and word error rate for trade-specific vocabulary.
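For reference, word error rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a standard dynamic-programming implementation with simple lowercase-and-split normalization, which is an assumption; a real benchmark would need agreed rules for punctuation, numerals, and brand names. The sample phrases are illustrative, not benchmark results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One misheard trade term out of four words.
print(word_error_rate("PEX vs. copper repipe", "pecs vs. copper repipe"))  # 0.25
```

Scoring a held-out set of trade phrases this way gives a single number per provider that can sit next to cost and latency in the comparison.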

Self-hosted intent & routing

Intent classification, lead scoring, and "should this go to a human" decisions are well-suited to small fine-tuned models. We plan to evaluate TensorRT-LLM for high-throughput, low-latency inference of these models on commodity GPUs.

Voice cloning (post-launch)

In-app guided owner voice capture, then per-tenant cloned outbound voice. Voice synthesis is the most GPU-intensive part of the pipeline at scale; this is where dedicated inference hardware matters most.

Pattern engine (post-launch)

Self-learning prediction and lead-temperature scoring trained on per-tenant call data. We're tracking NIM microservices and NeMo for the fine-tuning + serving loop, scoped after our July 2026 public launch stabilizes.

We are vendor-flexible by design: the production pipeline runs against streaming APIs today, and components are pluggable so we can swap inference providers without touching the orchestrator.
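One way to get that kind of pluggability is to have the orchestrator depend on a small interface rather than on a vendor SDK. The sketch below uses a `typing.Protocol` for the ASR stage; the interface shape and the class names `StreamingASR`, `ManagedASR`, `SelfHostedASR`, and `Orchestrator` are hypothetical, and the providers return placeholder strings instead of calling any real service.

```python
from typing import Iterable, Iterator, Protocol


class StreamingASR(Protocol):
    """Anything that turns an audio chunk stream into partial transcripts."""

    def transcribe(self, audio: Iterable[bytes]) -> Iterator[str]: ...


class ManagedASR:
    """Placeholder for a hosted streaming API (no real network calls here)."""

    def transcribe(self, audio: Iterable[bytes]) -> Iterator[str]:
        for chunk in audio:
            yield f"managed:{len(chunk)}"


class SelfHostedASR:
    """Placeholder for a self-hosted model behind the same interface."""

    def transcribe(self, audio: Iterable[bytes]) -> Iterator[str]:
        for chunk in audio:
            yield f"self-hosted:{len(chunk)}"


class Orchestrator:
    """Depends only on the protocol, so providers can be swapped at construction."""

    def __init__(self, asr: StreamingASR) -> None:
        self.asr = asr

    def partials(self, audio: Iterable[bytes]) -> list[str]:
        return list(self.asr.transcribe(audio))


audio = [b"ab", b"cde"]
print(Orchestrator(ManagedASR()).partials(audio))
print(Orchestrator(SelfHostedASR()).partials(audio))
```

The orchestrator code is identical in both calls; only the object passed to the constructor changes. The same pattern would apply to the LLM and TTS stages.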

Why we did it this way

Generic agent frameworks don't survive contact with the phone.

Latency budgets are unforgiving

A chat UI tolerates 5-second responses. A phone call doesn't — and stitching together pre-built ASR → LLM → TTS without streaming pushes you past the cliff. We had to own the streaming orchestration to hit a sub-second budget.

Tool calls have to be honest

An LLM that hallucinates "I booked your appointment" without actually booking it is worse than no agent. Our state-machine workflow engine validates every tool transition before execution — the LLM can suggest, but the engine decides.

Telephony has compliance edges

10DLC, A2P consent, per-tenant brand registration, recording disclosure. Nothing about AI changes those constraints — and most agent frameworks ignore them. We designed for them.

Trades have specific vocabulary

"Two coats Sherwin-Williams Emerald," "PEX vs. copper repipe," "240V receptacle in the panel." Generic ASR models miss these. Trade-tuned models close that gap and are the cleanest case for self-hosted inference.

Want to hear a real call?

Frontbell opens its private founder-cohort beta in June 2026; public launch is mid-July 2026. Get in touch for an early demo or to talk inference partnerships.
