The vendor's monthly accuracy report showed 98% — but the methodology was opaque and the controller didn't trust it. Senior accountants were doing post-hoc review on a non-random sample and felt accuracy was deteriorating.
Keep the vendor — the agent was net-positive at 91%, just not as good as advertised. Renegotiate the SLA based on the real numbers. Add a human-review gate on net-new vendors until the recognition issue is fixed. Run our Langfuse rubric monthly going forward.
The client subsequently engaged us for a Layer-3 ongoing monitoring retainer at $4,500/month.
Eval sprints on someone else’s product are politically delicate. The vendor was not pleased when their SLA was renegotiated. We frame these engagements as “in service of the operator’s confidence, not in opposition to the vendor” — but if the data shows what the data shows, we report it.
Real aggregate accuracy was 91%, not 98% — the vendor's denominator excluded human-reviewed transactions, which were the hard ones. Client renegotiated SLA with real numbers in hand and engaged us for ongoing monitoring.