Independent eval sprint on a third-party AI bookkeeper

Evaluations & Observability · Accounting
Outsourced bookkeeping firm — ~40 staff, agent built by another vendor 9 months earlier

The problem

The vendor's monthly accuracy report showed 98% — but the methodology was opaque and the controller didn't trust it. Senior accountants were doing post-hoc review on a non-random sample and felt accuracy was deteriorating.

Our approach

What we delivered

Wrapped the vendor’s API with Langfuse tracing — no changes to the vendor’s product.
Pulled 90 days of historical transactions and hand-labelled 250 of them, with the controller as the ground-truth oracle.
Built a custom rubric: chart-of-accounts mapping accuracy, vendor recognition accuracy, false-positive flag rate, time-to-decision.
Ran the rubric against 30 days of live production traffic.
Delivered a 14-page report with the actual numbers and a recommendation.
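As a rough illustration of how a rubric like the one above can be scored, here is a minimal Python sketch. It is not the code used in the engagement: the record fields, the function names, and the definition of false-positive flag rate (flags raised on transactions the controller judged clean, divided by all clean transactions) are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class LabelledTxn:
    # All field names are hypothetical; adapt them to the real trace export.
    predicted_account: str
    true_account: str
    predicted_vendor: str
    true_vendor: str
    vendor_is_new: bool         # True if the vendor had not been seen before
    agent_flagged: bool         # the agent flagged the transaction for review
    needed_flag: bool           # controller's label: did it actually need review?
    seconds_to_decision: float


def score(txns):
    """Compute the four rubric metrics over a list of labelled transactions."""
    if not txns:
        raise ValueError("no transactions to score")
    n = len(txns)
    # Assumed definition: a false positive is a flag on a transaction
    # that the controller judged did not need review.
    clean = [t for t in txns if not t.needed_flag]
    false_flags = sum(t.agent_flagged for t in clean)
    return {
        "n": n,
        "account_mapping_accuracy":
            sum(t.predicted_account == t.true_account for t in txns) / n,
        "vendor_recognition_accuracy":
            sum(t.predicted_vendor == t.true_vendor for t in txns) / n,
        "false_positive_flag_rate": false_flags / len(clean) if clean else 0.0,
        "median_seconds_to_decision":
            median(t.seconds_to_decision for t in txns),
    }


def score_by_vendor_type(txns):
    """Report the rubric separately for known and net-new vendors."""
    groups = {
        "known": [t for t in txns if not t.vendor_is_new],
        "net_new": [t for t in txns if t.vendor_is_new],
    }
    return {name: score(group) for name, group in groups.items() if group}
```

Scoring known and net-new vendors separately matters because a single aggregate figure can hide a weak segment behind a strong one.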

Recommendation

Keep the vendor: the agent was net-positive at 91% accuracy, just not as good as advertised. Renegotiate the SLA based on the real numbers. Add a human-review gate on net-new vendors, where accuracy was 64% versus 97% on known vendors, until the recognition issue is fixed. Run our Langfuse rubric monthly going forward.
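A human-review gate of this kind can be very small. The sketch below is one possible shape, not the client's implementation: the function names, the `vendor` key, the queue labels, and the whitespace and case normalisation are all assumptions, and `known_vendors` is assumed to be a set of already-normalised names.

```python
def needs_human_review(vendor_name, known_vendors):
    """Return True when the vendor is not in the known-vendor set."""
    # Hypothetical normalisation: lowercase and collapse whitespace.
    normalised = " ".join(vendor_name.lower().split())
    return normalised not in known_vendors


def route(txn, known_vendors):
    """Send net-new vendors to review; let everything else post automatically."""
    if needs_human_review(txn["vendor"], known_vendors):
        return "human_review"
    return "auto_post"
```

For example, with `known_vendors = {"acme supplies"}`, a transaction from "ACME  Supplies" would post automatically and one from "Newco" would go to the review queue.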

The client subsequently engaged us for a Layer-3 ongoing monitoring retainer at $4,500/month.

Caveat

Eval sprints on someone else’s product are politically delicate. The vendor was not pleased when their SLA was renegotiated. We frame these engagements as “in service of the operator’s confidence, not in opposition to the vendor” — but if the data shows what the data shows, we report it.

Outcome

Real aggregate accuracy was 91%, not 98%. The vendor's denominator excluded human-reviewed transactions, which were the hard ones. The client renegotiated the SLA with real numbers in hand and engaged us for ongoing monitoring.
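The denominator effect is easy to see with a small worked example. The counts below are invented so that they reproduce the headline percentages; they are not figures from the engagement. "Correct" for a human-reviewed transaction is assumed to mean the agent's output was right before the reviewer touched it.

```python
def accuracy(correct, total):
    """Fraction of transactions the agent got right."""
    return correct / total


# Illustrative counts only, chosen to match the headline percentages.
auto_processed = {"correct": 784, "total": 800}    # never seen by a human
human_reviewed = {"correct": 126, "total": 200}    # the harder transactions

# Vendor-style figure: human-reviewed transactions are left out entirely.
vendor_style = accuracy(auto_processed["correct"], auto_processed["total"])

# Aggregate figure: every transaction counts in the denominator.
aggregate = accuracy(
    auto_processed["correct"] + human_reviewed["correct"],
    auto_processed["total"] + human_reviewed["total"],
)

print(f"excluding human-reviewed: {vendor_style:.0%}")  # 98%
print(f"all transactions:         {aggregate:.0%}")     # 91%
```

Both numbers are arithmetically valid. They answer different questions, which is why the denominator needs to be stated alongside any accuracy claim.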

91%

true aggregate accuracy (vendor claimed 98%)

97% / 64%

accuracy on known vendors vs net-new

4 wks