• AWS-native AI integration · ships in 6–10 weeks

Eval module catches a silent model regression in 48 hours

  • Evaluations & Observability · Insurance
  • Commercial-lines insurance brokerage — Applied Epic shop, ~140 staff

The problem

Three months after launch, the provider of the underlying foundation model rolled out a new version. Aggregate quality scores looked stable, but the workers' comp appetite-match subsegment, the one that mattered to this brokerage, silently dropped 14 percentage points.
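
A minimal sketch of how a segment-level drop can hide behind a stable aggregate. The segment names, the scores, and the 0.05 threshold are synthetic assumptions chosen for illustration, not the brokerage's data or the module's actual configuration.

```python
from collections import defaultdict
from statistics import mean


def segment_means(records):
    """Return (overall mean, {segment: mean}) for (segment, score) pairs."""
    by_segment = defaultdict(list)
    for segment, score in records:
        by_segment[segment].append(score)
    overall = mean(score for _, score in records)
    return overall, {seg: mean(scores) for seg, scores in by_segment.items()}


def flag_segment_drops(before, after, threshold=0.05):
    """List segments whose mean score fell by more than `threshold`."""
    _, seg_before = segment_means(before)
    _, seg_after = segment_means(after)
    return sorted(
        seg
        for seg in seg_before
        if seg in seg_after and seg_before[seg] - seg_after[seg] > threshold
    )


# Synthetic scores: workers' comp is 10% of traffic (hypothetical mix).
before = [("workers_comp", 0.90)] * 10 + [("other_lines", 0.85)] * 90
after = [("workers_comp", 0.76)] * 10 + [("other_lines", 0.865)] * 90

overall_before, _ = segment_means(before)
overall_after, _ = segment_means(after)
flagged = flag_segment_drops(before, after)
```

In this synthetic data the overall mean moves by less than one point, while `flagged` contains only `workers_comp`, whose mean fell 14 points. Scoring per segment as well as in aggregate is what surfaces a drop like this.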

Our approach

What was in the eval module

  • Custom rubric design: 18 scorers covering schema validity, appetite-match accuracy, citation correctness, PII handling, tone.
  • Weekly 30-minute review queue session with the senior broker: she labels the 10 most ambiguous traces of the week; her labels feed back into the rubric.
  • Monthly business review with eval-score trend chart, top failure modes, recommended fixes.
  • 24-hour response SLA on Sev-1 score drops, 5-day on Sev-2.
  • Quarterly model-upgrade regression test before any model change goes live.
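
One way the weekly review queue could pick its 10 traces, as a sketch only. It assumes "most ambiguous" means a rubric score closest to a pass/fail threshold; the `Trace` type, the 0.5 threshold, and the trace IDs are hypothetical and not taken from the module itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    score: float  # rubric score in [0, 1] (assumed scale)


def most_ambiguous(traces, k=10, decision_threshold=0.5):
    """Return the k traces whose score lies closest to the threshold."""
    return sorted(traces, key=lambda t: abs(t.score - decision_threshold))[:k]


# Hypothetical week of traces; a real queue would use k=10.
week = [
    Trace("t1", 0.10),
    Trace("t2", 0.48),
    Trace("t3", 0.95),
    Trace("t4", 0.70),
    Trace("t5", 0.55),
]
queue = most_ambiguous(week, k=2)
queue_ids = [t.trace_id for t in queue]
```

Here `queue_ids` is `["t2", "t5"]`: the two traces the scorers were least decisive about. Other definitions of ambiguity, such as disagreement between scorers, would slot into the same `key` function.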

Why this matters

The brokerage’s CFO told us the eval module was “the thing that made the agent feel like infrastructure, not a science project.”

Pricing. ~$3,000/month of the $11K total agent retainer is allocated to evals — never sold separately when we build the agent.

Stack. Langfuse self-hosted in the client VPC · custom rubrics on 80 labelled historical submissions · weekly review queue.

Outcome

We caught the drop within 48 hours of the model rollout, pinned the prior model on that route, and re-tuned the rubric over the following week. Two production regressions were caught and rolled back before the brokerage's CSRs noticed.
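
Pinning a model on one route can be as small as a lookup with a default. The sketch below is an assumption about the shape of such a pin, with hypothetical route names and model identifiers; it is not the production routing code.

```python
# Hypothetical identifiers, for illustration only.
DEFAULT_MODEL = "model-v2"  # the newly rolled-out version
PINNED_ROUTES = {
    "workers_comp_appetite_match": "model-v1",  # prior version, pinned
}


def model_for_route(route, pins=PINNED_ROUTES, default=DEFAULT_MODEL):
    """Return the pinned model for a route, else the default model."""
    return pins.get(route, default)


pinned = model_for_route("workers_comp_appetite_match")
unpinned = model_for_route("general_liability_intake")
```

Only the affected route stays on the prior version; every other route keeps the new model, and removing the pin entry restores the default once the rubric has been re-tuned.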

48h
time to detect a silent model regression
92%
appetite-match agreement at month 6 (was 87%)
0
regressions noticed by CSRs before we caught them