The three questions that decide your voice AI stack
Most voice AI deployments stall on the wrong stack choice in the first two weeks. Here's the framework we use to keep our clients out of that hole.
By Jannis Moore
Most teams we sit with have already evaluated three voice providers, watched twelve YouTube comparisons, and posted in two Slack communities. They're stuck. Not because the information isn't there, but because the comparisons are written for the wrong buyer: usually a solo developer optimizing for hello-world demos.
Here are the three questions we ask in the first thirty minutes of every Voice AI Roadmap. They settle most of the stack debate before it starts.
1. What's your latency tolerance?
Not "what feels fast". Your tolerance.
A scheduling agent for a multi-location service business can tolerate 800ms to 1.2s of perceived turnaround time. People expect a half-second pause when a receptionist looks up the calendar. They don't expect a half-second pause in the middle of a sentence.
A high-stakes outbound call where the model has to handle interruption, retry, and steer the conversation has a much tighter ceiling. If you're shipping end-to-end speech-to-speech with no cascaded pipeline, you can hit 200-400ms perceived latency. If you're cascading STT → LLM → TTS, you'll be in the 700-1200ms band on most stacks. Each works for a different problem. Neither works for both.
The question we want answered: "What does your buyer hang up on?" That sets your latency budget, and the latency budget eliminates half of the providers on your list.
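One way to make the budget concrete is to write it down as arithmetic. The sketch below is a minimal illustration, not a benchmark: the stack names and per-stage millisecond figures are placeholders we made up to sit inside the bands mentioned above, and you would replace them with numbers you measure yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """A candidate stack and its per-stage latency in milliseconds (assumed values)."""
    name: str
    stage_ms: dict

    def perceived_ms(self) -> int:
        # For a cascade, the stages run in sequence, so the caller feels the sum.
        return sum(self.stage_ms.values())


def within_budget(stacks, budget_ms):
    """Return the names of stacks whose perceived latency fits the budget."""
    return [s.name for s in stacks if s.perceived_ms() <= budget_ms]


# Hypothetical candidates; the figures are illustrative placeholders only.
CANDIDATES = [
    Stack("cascaded-example", {"stt": 250, "llm": 450, "tts": 200}),
    Stack("speech-to-speech-example", {"model": 300}),
]

# A tight budget drops the cascade; a relaxed one keeps both.
tight = within_budget(CANDIDATES, budget_ms=600)
relaxed = within_budget(CANDIDATES, budget_ms=1200)
```

The point of the exercise is the filter, not the numbers: once the budget is fixed, a stack either fits or it does not.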
2. Where do your numbers terminate?
Telephony is the part nobody talks about until it breaks production at 11pm.
If you're calling US numbers, you have many options and price competition is fierce. The moment you need international numbers, regulated regions, or local presence dialing, your shortlist collapses. Some providers handle telephony in-house, some white-label Twilio, some white-label something cheaper that drops calls under load.
We've watched two seven-figure deployments wobble because the telephony partner the platform sat on top of had silent regional outages the platform didn't surface. The platform looked fine in the dashboard. The calls just weren't connecting in the Midwest.
The question to answer: "What carriers and regions matter?" Then ask the provider to show you their telephony partner and incident history, not just their model cards.
3. How much does your agent need to actually do?
This is the question that filters builders from operators.
A demo agent that books an appointment is straightforward. A production agent that:
- looks up a live record in your CRM mid-call
- decides whether to escalate based on the answer
- writes a callback record back into your system
- and recovers when an API call times out three seconds in
is a different beast. The reliability of tool calls (function calls) and structured outputs varies wildly by provider. Some look great in the demo and fall over once you string four tools together. Others are slower per call but ship boringly reliable orchestration.
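The last item on that list, recovering from a timed-out API call, can be sketched in a few lines. This is a minimal illustration using Python's standard asyncio, not any provider's SDK: `call_tool_with_fallback`, `slow_crm_lookup`, and `fast_crm_lookup` are hypothetical names, and the short sleep stands in for a slow CRM.

```python
import asyncio


async def call_tool_with_fallback(tool, *, timeout_s, fallback):
    """Await a tool call, returning `fallback` if it exceeds `timeout_s` seconds."""
    try:
        return await asyncio.wait_for(tool(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The agent keeps talking instead of going silent on the caller.
        return fallback


async def slow_crm_lookup():
    # Hypothetical stand-in for a CRM call that takes too long.
    await asyncio.sleep(0.2)
    return {"status": "ok"}


async def fast_crm_lookup():
    # Hypothetical stand-in for a CRM call that answers promptly.
    return {"status": "ok"}


FALLBACK = {"status": "unavailable"}

timed_out = asyncio.run(
    call_tool_with_fallback(slow_crm_lookup, timeout_s=0.05, fallback=FALLBACK)
)
answered = asyncio.run(
    call_tool_with_fallback(fast_crm_lookup, timeout_s=0.05, fallback=FALLBACK)
)
```

In a real flow the timeout would be closer to the three seconds mentioned above, and the fallback would typically trigger an escalation or a callback record rather than a bare status.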
We test this by building a deliberately ugly version of the customer's actual flow on every shortlisted provider. Three tools. One that times out on purpose. One that returns nonsense. One that returns the right thing 80% of the time. Whatever survives that test is the provider that survives production.
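The three misbehaving tools can be stubbed along these lines. This is a sketch under our own assumptions, not the exact harness: the function names and the result shape are hypothetical, the timeout is simulated by raising `TimeoutError`, and the 80% tool fails on every fifth call so the run is repeatable. What you would actually measure is how each provider's agent copes when wired to these stubs, which depends on the provider's SDK and is not shown here.

```python
import itertools


def tool_times_out(_query):
    # Simulates a tool that never answers in time.
    raise TimeoutError("simulated tool timeout")


def tool_returns_nonsense(_query):
    # Simulates a tool that answers with an unusable payload.
    return {"unexpected": "???"}


def make_flaky_tool():
    """Build a tool that returns a bad payload on every fifth call (80% correct)."""
    calls = itertools.count(1)

    def tool(query):
        if next(calls) % 5 == 0:
            return {"unexpected": None}
        return {"record_id": query, "status": "found"}

    return tool


def is_valid(result):
    # Assumed result shape: a dict with a record_id and status "found".
    return (
        isinstance(result, dict)
        and result.get("status") == "found"
        and "record_id" in result
    )


def run_harness(tools, attempts=10):
    """Call each tool `attempts` times and report the fraction of valid results."""
    report = {}
    for name, tool in tools.items():
        ok = 0
        for i in range(attempts):
            try:
                result = tool(f"rec-{i}")
            except TimeoutError:
                continue  # a timeout counts as a failed attempt
            if is_valid(result):
                ok += 1
        report[name] = ok / attempts
    return report


report = run_harness(
    {
        "timeout": tool_times_out,
        "nonsense": tool_returns_nonsense,
        "flaky": make_flaky_tool(),
    }
)
```

Making the flaky tool deterministic is a deliberate choice: when a provider fails the run, you want to replay the same failure rather than chase a random one.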
A worked example
A multi-location property management client came to us with a vendor preference baked in by an internal champion who had built a great demo on it. We ran the three questions:
- Latency tolerance: their inbound use case was rent-payment-by-phone. Callers won't wait. Their preferred vendor's perceived latency on the full pipeline was 1100ms. Their tolerance was 600ms. That should have killed it on slide one.
- Telephony: they needed local-presence dialing across 14 states. The preferred vendor white-labeled a partner that didn't support it cleanly. Workable, but a known failure mode at scale.
- Tool calls: their flow had 6 tools in it. The vendor handled 2-3 well, started losing arguments at 4, fell over at 6.
The right call wasn't to switch everything. It was to keep that vendor for the simpler outbound notification flow they already had working, and pick a different stack for the rent-payment-by-phone deployment. They avoided a six-month detour by having this 30-minute conversation on day one instead of day 90.
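The same reasoning can be written as a small check per flow. This is an illustrative sketch, not a tool we ship: the class and field names are hypothetical, the vendor and rent-payment numbers come from the example above, and the requirements for the outbound notification flow are assumed purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VendorProfile:
    perceived_latency_ms: int
    clean_local_presence: bool
    reliable_tool_count: int


@dataclass(frozen=True)
class FlowRequirements:
    latency_tolerance_ms: int
    needs_local_presence: bool
    tool_count: int


def failed_questions(vendor, flow):
    """Return which of the three questions the vendor fails for this flow."""
    failures = []
    if vendor.perceived_latency_ms > flow.latency_tolerance_ms:
        failures.append("latency")
    if flow.needs_local_presence and not vendor.clean_local_presence:
        failures.append("telephony")
    if flow.tool_count > vendor.reliable_tool_count:
        failures.append("tool calls")
    return failures


# From the example: 1100ms perceived, no clean local presence, 2-3 tools handled well.
preferred_vendor = VendorProfile(1100, False, 3)

# From the example: 600ms tolerance, local presence needed, 6 tools.
rent_payment = FlowRequirements(600, True, 6)

# ASSUMPTION: these notification-flow requirements are invented for illustration.
notifications = FlowRequirements(1200, False, 1)

rent_failures = failed_questions(preferred_vendor, rent_payment)
notification_failures = failed_questions(preferred_vendor, notifications)
```

Running the check per flow rather than per vendor is what surfaces the split decision: the same vendor can fail all three questions for one flow and pass them for another.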
When the framework breaks
The three questions are necessary but not sufficient. They don't capture compliance edges (PHI, MNPI), legal jurisdiction (EU vs US data residency), or the cost curve at scale (some providers price like a SaaS, some like infrastructure, and the math changes at 100k vs 1M minutes/month). For those, we go deeper in the Voice AI Roadmap.
But if you're stuck choosing a stack and your team is going in circles, run the three questions first. They will not give you the right answer in every case. They will give you the right next move in almost every case.
If your team is in this debate right now, we run the Voice AI Roadmap for exactly this. Fixed fee, 14 days, and you walk out with the vendor shortlist and the reasoning. Book a discovery call and talk to us.