Case study · AI & Data

A customer-facing copilot that knew when to stop talking

Shipped a customer-facing copilot grounded in the product knowledge base, with evaluation, guardrails, and a refusal model that kept the complaint rate on copilot interactions below 0.4%.

Figure: Anonymized delivery dashboard (AI & Data outcome cockpit). After launch: ticket deflection 38%, eval pass rate 94%, complaint rate < 0.4%. Stages shown: Before, Constraint, Build, Controlled cutover, After, Measured gain.

Client: Enterprise software vendor
Industry: AI & Data
Service: Generative AI Solutions
Engagement closed: December 2024

Storyline

From constraint to measurable change.

Every engagement is framed around the business situation, the constraint that made it hard, and the decision that turned delivery into a controlled path to value.

Situation

The operating reality

The vendor’s previous chatbot pilot was withdrawn after it confidently fabricated answers about pricing and compatibility. Leadership wanted a copilot that could deflect support tickets without ever telling a customer something untrue.

Constraint

Why it was hard

Decision

The path we chose

We treated evaluation and refusal behavior as product requirements, not prompt polish, and settled both before exposing the copilot to customers.

Build

RAG with grounded citations

Documentation and KB articles were ingested with deliberate chunking and a hybrid keyword + dense retriever. Every response cited the source passages it relied on, and answers that could not cite a source defaulted to a human handoff.
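The cite-or-handoff rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the engagement's implementation: the `Passage` type, the `alpha` weighting, the `min_score` cutoff, and the sample article IDs are all hypothetical, and the "dense" score here is a term-count cosine standing in for real embedding similarity.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str  # hypothetical KB article identifier
    text: str


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query: str, passage: Passage) -> float:
    """Fraction of distinct query terms that appear in the passage."""
    q_terms = set(tokenize(query))
    if not q_terms:
        return 0.0
    return len(q_terms & set(tokenize(passage.text))) / len(q_terms)


def dense_score(query: str, passage: Passage) -> float:
    """Cosine over term counts; a stand-in for embedding similarity."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage.text))
    dot = sum(q[t] * p[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in p.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, passages, alpha=0.5, min_score=0.3, top_k=2):
    """Blend both scores and keep only passages above the cutoff."""
    scored = [
        (alpha * keyword_score(query, p) + (1 - alpha) * dense_score(query, p), p)
        for p in passages
    ]
    kept = sorted(
        (sp for sp in scored if sp[0] >= min_score),
        key=lambda sp: sp[0],
        reverse=True,
    )
    return [p for _, p in kept[:top_k]]


def answer(query, passages):
    """Answer only when at least one passage can be cited; otherwise hand off."""
    hits = retrieve(query, passages)
    if not hits:
        return {"action": "handoff", "citations": []}
    return {"action": "answer", "citations": [p.source_id for p in hits]}
```

The design point is that the handoff is the default branch: an answer is only produced when retrieval returns something citable, so a question about an undocumented topic never reaches free-form generation.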

Evaluation harness as the source of truth

A golden set of 600 graded questions ran in CI on every prompt change, with automatic graders for groundedness, refusal correctness, and tone. The launch criterion was a defined pass rate, not a leadership demo.
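A pass-rate gate of this kind might look like the sketch below. The grader logic is deliberately simplistic, and the names (`GoldenCase`, `Response`, `meets_launch_bar`), the banned-phrase list, and the 0.90 threshold are assumptions for illustration; the case study does not state the actual launch threshold or how the graders were built.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    question: str
    should_refuse: bool  # expected behavior recorded in the graded set


@dataclass(frozen=True)
class Response:
    text: str
    citations: tuple[str, ...]
    refused: bool


Grader = Callable[[GoldenCase, Response], bool]

BANNED_PHRASES = ("obviously", "as i already said")  # hypothetical tone rules


def grade_groundedness(case: GoldenCase, resp: Response) -> bool:
    # An answer must cite at least one source; a refusal needs none.
    return resp.refused or len(resp.citations) > 0


def grade_refusal(case: GoldenCase, resp: Response) -> bool:
    return resp.refused == case.should_refuse


def grade_tone(case: GoldenCase, resp: Response) -> bool:
    lowered = resp.text.lower()
    return not any(phrase in lowered for phrase in BANNED_PHRASES)


GRADERS: tuple[Grader, ...] = (grade_groundedness, grade_refusal, grade_tone)


def pass_rate(
    cases: Sequence[GoldenCase],
    respond: Callable[[str], Response],
    graders: Sequence[Grader] = GRADERS,
) -> float:
    """Share of cases where every grader passes."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = 0
    for case in cases:
        resp = respond(case.question)
        if all(grader(case, resp) for grader in graders):
            passed += 1
    return passed / len(cases)


def meets_launch_bar(rate: float, threshold: float = 0.90) -> bool:
    """CI fails the change when the rate drops below the threshold."""
    return rate >= threshold
```

In CI, `respond` would wrap the copilot under the changed prompt, and the job would exit non-zero when `meets_launch_bar` returns `False`.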

Outcome

Measured impact

38% of eligible support tickets deflected with verified-correct answers.

Eval pass rate steady at 94% across two model upgrades.

What changed after launch

New behavior

Support leaders could tune deflection against groundedness, with traces and source citations available for every answer.

Outcome

The numbers that mattered.

Ticket deflection: 38%
Eval pass rate: 94%
Complaint rate: < 0.4%

“The refusal model mattered as much as the answers. Customers trusted it because it knew when not to guess.”

N. Brooks, Chief Customer Officer · Enterprise software vendor

Before and after

The transformation the client could see.

The work was not abstract modernization. It changed day-to-day behavior, ownership, and the evidence leaders used to make decisions.

AI product launch

Before

Withdrawn chatbot pilot
Fabricated pricing and compatibility answers
No objective launch bar

After

Cited responses or human handoff
600-question eval in CI
Complaint rate below 0.4%

Architecture and delivery

A controlled path from discovery to launch.

The delivery plan made the system boundary explicit, then used rehearsals, gates, and telemetry to optimize safely before launch.

Delivery architecture

AI & Data control loop

Discover: RAG with grounded citations

Documentation and KB articles were ingested with deliberate chunking and a hybrid keyword + dense retriever. Every response cited the source passages it relied on, and answers that could not cite a source defaulted to a human handoff.

Build: Evaluation harness as the source of truth

A golden set of 600 graded questions ran in CI on every prompt change, with automatic graders for groundedness, refusal correctness, and tone. The launch criterion was a defined pass rate, not a leadership demo.

Launch: Guardrails and observability

Prompt-injection patterns, PII redaction, and topic-scope filters wrapped the model. Every response was traced end-to-end with the retrieved context preserved for post-hoc review.
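As a rough illustration of how such a wrapper could be layered, the sketch below runs redaction, an injection check, and a scope check before the model is called, then records a trace. Everything here is an assumption made for the example: the regex patterns, the `ALLOWED_TOPICS` set, and the `Trace` record are hypothetical, and a production system would more likely emit spans through a tracing library such as OpenTelemetry (listed in the stack) than use a plain dataclass.

```python
import re
import uuid
from dataclasses import dataclass, field

# Hypothetical patterns; a real deployment would maintain a much larger set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"reveal (your|the) system prompt", re.I),
]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
ALLOWED_TOPICS = {"export", "sso", "billing", "api"}  # hypothetical scope list


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def looks_like_injection(text: str) -> bool:
    return any(pattern.search(text) for pattern in INJECTION_PATTERNS)


def in_scope(text: str) -> bool:
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & ALLOWED_TOPICS)


@dataclass
class Trace:
    trace_id: str
    user_input: str  # stored only after redaction
    decision: str
    retrieved_context: list[str] = field(default_factory=list)


def guard(user_input: str, retrieved_context: list[str]) -> Trace:
    """Apply the filters in order and keep the context for later review."""
    clean = redact_pii(user_input)
    if looks_like_injection(clean):
        decision = "blocked_injection"
    elif not in_scope(clean):
        decision = "out_of_scope_handoff"
    else:
        decision = "allowed"
    return Trace(uuid.uuid4().hex, clean, decision, list(retrieved_context))
```

Redaction runs first so that nothing downstream, including the stored trace, ever sees the raw identifiers.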

What changed after launch

What shipped, and what it changed.

38% of eligible support tickets deflected with verified-correct answers.
Eval pass rate steady at 94% across two model upgrades.
Customer complaint rate on copilot interactions stayed below 0.4%.

After launch

Support leaders could tune deflection against groundedness, with traces and source citations available for every answer.
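One way to picture that trade-off is a threshold sweep over traced answers: the stricter the minimum groundedness score required to answer, the fewer tickets are deflected. The sketch below is an assumption about how such tuning might work, not a description of the delivered tooling; the `TracedAnswer` type, the 0 to 1 groundedness score, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TracedAnswer:
    groundedness: float  # hypothetical grader score between 0 and 1


def deflection_rate(traces: Sequence[TracedAnswer], min_groundedness: float) -> float:
    """Share of interactions the copilot would answer at this threshold."""
    if not traces:
        raise ValueError("no traces to evaluate")
    answered = sum(1 for t in traces if t.groundedness >= min_groundedness)
    return answered / len(traces)


def sweep(traces: Sequence[TracedAnswer], thresholds: Sequence[float]) -> dict[float, float]:
    """Deflection rate at each candidate threshold."""
    return {th: deflection_rate(traces, th) for th in thresholds}
```

Run against real traces, a sweep like this would show how much deflection each step of added strictness costs.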

Stack

What we built it with.

Python

OpenAI

pgvector

LangChain

AWS Bedrock

OpenTelemetry
