MayaLogic
Case study · AI & Data

A customer-facing copilot that knew when to stop talking

Shipped a customer-facing copilot grounded in the product knowledge base, with evaluation, guardrails, and a refusal model that kept the complaint rate on copilot interactions below 0.4%.

Anonymized delivery dashboard: AI & Data outcome cockpit (after launch)

Ticket deflection: 38%
Eval pass rate: 94%
Complaint rate: < 0.4%

Timeline: Before (constraint), Build (controlled cutover), After (measured gain)

Client: Enterprise software vendor
Industry: AI & Data
Service: Generative AI Solutions
Engagement closed: December 2024

Storyline

From constraint to measurable change.

Every engagement is framed around the business situation, the constraint that made it hard, and the decision that turned delivery into a controlled path to value.

Situation

The operating reality

The vendor’s previous chatbot pilot was withdrawn after it confidently fabricated answers about pricing and compatibility. Leadership wanted a copilot that could deflect support tickets without ever telling a customer something untrue.

Constraint

Why it was hard

The bar was strict: deflect support tickets without ever telling a customer something untrue. The withdrawn pilot had failed exactly there, confidently fabricating answers about pricing and compatibility.

Decision

The path we chose

We treated evaluation and refusal behavior as product requirements, not prompt polish, and held the copilot to them before exposing it to customers.

Outcome

The numbers that mattered.

Ticket deflection: 38%
Eval pass rate: 94%
Complaint rate: < 0.4%
“The refusal model mattered as much as the answers. Customers trusted it because it knew when not to guess.”
N. Brooks, Chief Customer Officer · Enterprise software vendor

Before and after

The transformation the client could see.

The work was not abstract modernization. It changed day-to-day behavior, ownership, and the evidence leaders used to make decisions.

Before

  • Withdrawn chatbot pilot
  • Fabricated pricing and compatibility answers
  • No objective launch bar

After

  • Cited responses or human handoff
  • 600-question eval in CI
  • Complaint rate below 0.4%
Architecture and delivery

A controlled path from discovery to launch.

The delivery plan made the system boundary explicit, then used rehearsals, gates, and telemetry to optimize safely before launch.

Delivery architecture

AI & Data control loop

Discover, Model, Build, Launch. Telemetry and feedback optimize the next release.
  1. Discover

    RAG with grounded citations

  2. Build

    Evaluation harness as the source of truth

  3. Launch

    Guardrails and observability

Build

How we shaped the work.

RAG with grounded citations

Documentation and KB articles were ingested with deliberate chunking and a hybrid keyword + dense retriever. Every response cited the source passages it relied on, and answers that could not cite a source defaulted to a human handoff.
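The two ideas in this paragraph, hybrid retrieval and "cite or hand off", can be sketched in a few lines of Python. This is an illustration under assumptions, not the delivered code: the case study does not say how keyword and dense results were combined, so reciprocal rank fusion stands in here, and `Answer`, `fuse_rankings`, `cite_or_handoff`, and the handoff message are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Answer:
    """A drafted reply plus the IDs of the passages it claims to rely on."""
    text: str
    citations: list[str] = field(default_factory=list)


# Hypothetical wording; the real handoff copy is not given in the case study.
HANDOFF_MESSAGE = "I can't confirm that from our documentation, so I'm connecting you with support."


def fuse_rankings(keyword_ids: list[str], dense_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of passage IDs with reciprocal rank fusion (an assumed method)."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ids, dense_ids):
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


def cite_or_handoff(draft: Answer, retrieved_ids: list[str]) -> dict:
    """Release the draft only if it cites at least one passage and every citation was retrieved."""
    retrieved = set(retrieved_ids)
    grounded = bool(draft.citations) and all(c in retrieved for c in draft.citations)
    if grounded:
        return {"action": "answer", "text": draft.text, "citations": draft.citations}
    return {"action": "handoff", "text": HANDOFF_MESSAGE, "citations": []}
```

The design choice worth noting is that the handoff is the default path: an answer has to earn release by citing retrieved passages, rather than a refusal having to be triggered.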

Evaluation harness as the source of truth

A golden set of 600 graded questions ran in CI on every prompt change, with automatic graders for groundedness, refusal correctness, and tone. The launch criterion was a defined pass rate, not a leadership demo.
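A pass-rate gate of this kind might look roughly like the sketch below. It assumes each golden-set question has already been graded on the three dimensions named above; `GradedCase`, `pass_rate`, `launch_gate`, and the default threshold are hypothetical, since the case study does not state the actual launch criterion.

```python
from dataclasses import dataclass


@dataclass
class GradedCase:
    """Grader verdicts for one golden-set question."""
    question_id: str
    grounded: bool         # answer is supported by the cited passages
    refusal_correct: bool  # the copilot refused exactly when it should have
    tone_ok: bool

    @property
    def passed(self) -> bool:
        # A case passes only if all three graders agree.
        return self.grounded and self.refusal_correct and self.tone_ok


def pass_rate(cases: list[GradedCase]) -> float:
    """Share of golden-set cases that passed every grader."""
    if not cases:
        raise ValueError("golden set is empty")
    return sum(c.passed for c in cases) / len(cases)


def launch_gate(cases: list[GradedCase], threshold: float = 0.9) -> tuple[bool, list[str]]:
    """Return whether the run clears the (assumed) threshold, plus the failing question IDs."""
    failures = [c.question_id for c in cases if not c.passed]
    return pass_rate(cases) >= threshold, failures
```

In a CI job, a wrapper script could exit with a non-zero status when the gate returns `False` and print the failing IDs, so a prompt change that regresses the golden set blocks the merge.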

Guardrails and observability

Prompt-injection patterns, PII redaction, and topic-scope filters wrapped the model. Every response was traced end-to-end with the retrieved context preserved for post-hoc review.
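As a rough illustration of how such a wrapper could be layered, the sketch below redacts PII, checks for injection patterns, applies a topic-scope filter, and records a trace with the retrieved context. Everything specific in it is an assumption: the regex patterns, the allowed topics, and the `Trace` record are hypothetical, and a plain dataclass stands in for whatever tracing backend was actually used.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns; a production list would be broader and maintained over time.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) system prompt", re.IGNORECASE),
]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
ALLOWED_TOPICS = {"installation", "configuration", "troubleshooting"}  # assumed scope


@dataclass
class Trace:
    """What gets preserved for post-hoc review of one interaction."""
    question: str
    retrieved_context: list[str]
    decision: str
    notes: list[str] = field(default_factory=list)


def redact_pii(text: str) -> str:
    """Replace email addresses with a placeholder (only one PII type is covered here)."""
    return EMAIL.sub("[redacted-email]", text)


def apply_guardrails(question: str, topic: str, retrieved_context: list[str]) -> Trace:
    """Redact first, then decide: block on injection, hand off when out of scope, else allow."""
    clean = redact_pii(question)
    trace = Trace(question=clean, retrieved_context=list(retrieved_context), decision="allow")
    if any(p.search(clean) for p in INJECTION_PATTERNS):
        trace.decision = "block"
        trace.notes.append("prompt-injection pattern matched")
    elif topic not in ALLOWED_TOPICS:
        trace.decision = "handoff"
        trace.notes.append(f"topic out of scope: {topic}")
    return trace
```

Redaction runs before anything is stored, so the trace kept for review never contains the raw address.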

What changed after launch

What shipped, and what it changed.

  • 38% of eligible support tickets deflected with verified-correct answers.
  • Eval pass rate steady at 94% across two model upgrades.
  • Customer complaint rate on copilot interactions stayed below 0.4%.

After launch

Support leaders could tune deflection against groundedness, with traces and source citations available for every answer.
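One way to picture that tuning is a threshold sweep over traced interactions. The sketch assumes each interaction carries a numeric groundedness score between 0 and 1, which the case study does not confirm; `sweep_thresholds` and the sample scores are hypothetical.

```python
def sweep_thresholds(scores: list[float], thresholds: list[float]) -> list[tuple[float, float]]:
    """For each threshold, return the share of questions the copilot would answer itself."""
    if not scores:
        raise ValueError("no scored interactions")
    return [(t, sum(s >= t for s in scores) / len(scores)) for t in thresholds]


# Made-up scores for illustration only.
sample_scores = [0.95, 0.8, 0.6, 0.4]
for threshold, answered_share in sweep_thresholds(sample_scores, [0.5, 0.9]):
    print(f"threshold {threshold:.1f}: copilot answers {answered_share:.0%} of questions")
```

Raising the threshold trades deflection for stricter grounding, and the stored traces let a reviewer inspect the answers that sit near the cut-off.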

Stack

What we built it with.

Python

OpenAI

pgvector

LangChain

AWS Bedrock

OpenTelemetry
