Three months after launch, the underlying foundation model rolled out a new version. Aggregate quality scores looked stable. But the workers' comp appetite-match subsegment — the one that mattered to this brokerage — silently dropped 14 percentage points.
The brokerage’s CFO told us the eval module was “the thing that made the agent feel like infrastructure, not a science project.”
Pricing. ~$3,000/month of the $11K total agent retainer is allocated to evals — never sold separately when we build the agent.
Stack. Langfuse self-hosted in the client VPC · custom rubrics on 80 labelled historical submissions · weekly review queue.
Caught within 48 hours of the model rollout. Pinned the prior model on that route, re-tuned the rubric over the following week. Two production regressions caught and rolled back before the operator's CSRs noticed.