Emergency-triage misroutes (an avulsed tooth, or severe pain with fever and swelling) would be a regulatory and patient-safety failure. Aggregate accuracy metrics weren't enough; the long tail of dangerous scenarios needed dedicated coverage.
Insurance verification recall held at 85%: we routed the low-confidence 15% to human callback rather than chasing higher automation, because the cost of an inaccurate verification was too high. Office managers spot-check 5 calls per week per location through the Braintrust UI; they catch about one rubric miss per month, which we then encode.
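The routing rule described above can be sketched roughly as follows. This is a minimal illustration, not the production logic: the `VerificationResult` type, the `route` and `automation_rate` helpers, and the `0.90` cutoff are all hypothetical, since the write-up does not state the confidence value or data model actually used.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the actual confidence value is not given in the write-up.
CONFIDENCE_THRESHOLD = 0.90


@dataclass
class VerificationResult:
    call_id: str
    confidence: float  # assumed model-reported score in [0, 1]


def route(result: VerificationResult, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Send confident verifications through automation, everything else to a human."""
    return "automated" if result.confidence >= threshold else "human_callback"


def automation_rate(results: list, threshold: float = CONFIDENCE_THRESHOLD) -> float:
    """Fraction of verifications handled without a human callback."""
    if not results:
        return 0.0
    automated = sum(1 for r in results if route(r, threshold) == "automated")
    return automated / len(results)
```

Under these assumptions, raising the threshold lowers the automation rate and sends more calls to human callback, which is the trade-off the team chose to accept.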
Pricing. Of the $2,200/month voice-agent retainer, ~$1,200/month per location is allocated to evals.
Stack. Braintrust for the trace-to-eval workflow · weekly synthetic red-team · monthly business review.
26 of 26 weekly red-team emergency scenarios passed over 6 months. The one near-miss in month 4 (an avulsed-tooth scenario the model didn't recognize on first phrasing) was caught in red-team and never reached a real patient. Booking accuracy moved from 89% to 96%.
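One way to keep emergency scenarios from being averaged away by aggregate accuracy is to treat any emergency failure as blocking, whatever the overall score. The sketch below assumes that design; the `ScenarioResult` type, the `gate` function, and the `0.95` accuracy floor are hypothetical and are not taken from the actual red-team harness.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    name: str
    is_emergency: bool
    passed: bool


def gate(results, min_overall_accuracy=0.95):
    """Return (ok, emergency_failures, overall_accuracy).

    Any failed emergency scenario blocks the run, even when the
    aggregate pass rate is above the (hypothetical) accuracy floor.
    """
    emergency_failures = [r.name for r in results if r.is_emergency and not r.passed]
    overall = sum(r.passed for r in results) / len(results) if results else 0.0
    ok = not emergency_failures and overall >= min_overall_accuracy
    return ok, emergency_failures, overall
```

For example, a run with 99 passing routine scenarios and one failed avulsed-tooth scenario would score 99% overall and still be blocked by this gate.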