Weekly red-team holds dental voice agent at zero emergency misroutes

Evaluations & Observability · Dental
Three-location dental DSO — extending the voice agent we built

The problem

Misrouting an emergency call (an avulsed tooth, or severe pain with fever and swelling) would be a regulatory and patient-safety failure. Aggregate accuracy metrics weren't enough; the long tail of dangerous scenarios needed dedicated coverage.

Our approach

What we evaluate, by call type

Emergency triage: 100% evaluated. Recall is the headline metric. We run 40 synthetic red-team scenarios every week, and the suite must pass before any prompt or model change goes live.
New patient bookings: sampled at 25%. The rubric asks: was the practice name correct? Were DOB, insurance, and chief complaint captured? Was the slot actually available? Did the agent set the right appointment type in Dentrix?
Existing patient reschedules: sampled at 10%.
Insurance verification: 100% evaluated, because the cost of an inaccuracy is high (a wrong copay quote at the front desk creates a service moment we want to avoid).
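A minimal sketch of how the per-call-type sampling above could be applied. The rates come from the list; the call-type keys, the `should_evaluate` function, and the hash-based sampling are illustrative assumptions, not the production implementation. Hashing the call ID keeps the decision reproducible, so the same call is always in or out of the sample.

```python
import hashlib

# Sampling rates from the list above; the call-type keys are hypothetical labels.
SAMPLE_RATES = {
    "emergency_triage": 1.00,
    "insurance_verification": 1.00,
    "new_patient_booking": 0.25,
    "existing_patient_reschedule": 0.10,
}


def should_evaluate(call_type: str, call_id: str) -> bool:
    """Decide deterministically whether a call is sent to evaluation."""
    # Assumption: call types missing from the table are always evaluated.
    rate = SAMPLE_RATES.get(call_type, 1.0)
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate
```

Because the bucket is always below 1, a rate of 1.00 evaluates every call of that type, which matches the 100% coverage for emergency triage and insurance verification.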

Outcome over 6 months

Automated insurance verification held at 85% of calls. We routed the low-confidence 15% to human callback rather than chasing higher automation, because the cost of an inaccurate answer was too high. Office managers spot-check 5 calls per week per location through the Braintrust UI; they catch about one rubric miss per month, which we then encode into the evals.
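The low-confidence routing described above can be sketched as a simple threshold check. The case study does not state the cutoff or how confidence is scored, so `CONFIDENCE_THRESHOLD`, `VerificationResult`, and `route` are hypothetical names and values used for illustration only.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the actual value is not given in the case study.
CONFIDENCE_THRESHOLD = 0.9


@dataclass
class VerificationResult:
    call_id: str
    confidence: float  # assumed to be a score in [0, 1]


def route(result: VerificationResult) -> str:
    """Keep confident answers automated; send the rest to a human."""
    if result.confidence >= CONFIDENCE_THRESHOLD:
        return "automated"
    return "human_callback"


def automation_rate(results: list[VerificationResult]) -> float:
    """Share of calls that stayed automated under the current threshold."""
    if not results:
        raise ValueError("no results to summarize")
    automated = sum(route(r) == "automated" for r in results)
    return automated / len(results)
```

Tracking `automation_rate` over time is one way to see whether a threshold change trades accuracy for automation.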

Pricing. About $1,200/month per location, out of the $2,200/month voice-agent retainer, is allocated to evals.

Stack. Braintrust for the trace-to-eval workflow · weekly synthetic red-team · monthly business review.

Outcome

All 26 weekly red-team runs passed over 6 months. The one near-miss, in month 4 (an avulsed-tooth scenario the model didn't recognize on first phrasing), was caught in red-team and never reached a real patient. Booking accuracy moved from 89% to 96%.
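For readers who want to see what a recall-based release gate on the weekly red-team suite might look like, here is a rough sketch. The `ScenarioResult` shape, the function names, and the default `required_recall` of 1.0 are assumptions chosen to match the zero-misroute goal, not a description of the actual harness.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    scenario_id: str
    expected_emergency: bool  # label of the synthetic scenario
    routed_emergency: bool    # whether the agent routed it as an emergency


def emergency_recall(results: list[ScenarioResult]) -> float:
    """Fraction of true emergency scenarios the agent routed as emergencies."""
    emergencies = [r for r in results if r.expected_emergency]
    if not emergencies:
        raise ValueError("red-team suite contains no emergency scenarios")
    caught = sum(r.routed_emergency for r in emergencies)
    return caught / len(emergencies)


def gate_passes(results: list[ScenarioResult], required_recall: float = 1.0) -> bool:
    """Block a prompt or model change unless recall meets the bar."""
    return emergency_recall(results) >= required_recall
```

With 40 scenarios, a single missed emergency drops recall to 39/40, so the gate fails and the change does not ship.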

26/26

weekly red-team passes

96%

booking accuracy (was 89%)

100%