ZEAL SENTINEL

Your AI agent's independent auditor.

In 14 days, we'll surface at least 5 named, reproducible failures in your AI customer-support agent. Then we'll keep watching — so your VP CX never gets the call that starts with "did you see what our AI said?"

Book a free audit assessment → | See how it works

— THE PROBLEM

Your AI vendor is grading their own homework. So is every eval tool you run yourself.

You deployed Decagon, Sierra, Ada, Intercom Fin, or an in-house agent. The vendor's dashboard shows a high resolution rate. New eval tooling — LangSmith Engine, Braintrust — even lets your own team cluster failures and auto-write evaluators faster than ever. But faster self-grading is still self-grading. Your board is asking a different question: how do you know it's not lying? A score you produce about your own agent doesn't answer it. An independent one does.

No public company gets to sign off on its own financials. Your AI deployment shouldn't either.

Air Canada's AI chatbot was held legally liable for its outputs by a Canadian tribunal in February 2024. The deployer — not the AI vendor — bore the cost.
Moffatt v. Air Canada, 2024 BCCRT 149

— WHAT WE DO

Independent third-party eval and monitoring for AI customer-support agents.

Sentinel runs a machine-accelerated, trace-driven eval pipeline against your deployed agent — and supplies the two things a tool alone can't: the customer-support ground truth that defines 'correct,' and an independent party to vouch for the result. We're not your AI vendor, and we don't use your vendor's rubric.

Machine-accelerated failure discovery

Trace-mining clusters thousands of real conversations into named issues at scale. We turn each cluster into a severity-rated failure that matters to your policies — not a generic rubric.
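To make the clustering idea concrete, here is a toy sketch that groups short failure notes by word overlap. It is only an illustration: the page does not say how Sentinel clusters traces, and a production pipeline would more plausibly use embeddings. The function names, the Jaccard measure, and the 0.5 threshold are all assumptions.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(notes, threshold=0.5):
    """Assign each note to the first cluster whose seed note is similar enough.

    Hypothetical stand-in for trace clustering; threshold is an assumed value.
    """
    clusters = []  # each entry: {"seed": token set, "members": [note indexes]}
    for index, note in enumerate(notes):
        note_tokens = tokens(note)
        for cluster in clusters:
            if jaccard(note_tokens, cluster["seed"]) >= threshold:
                cluster["members"].append(index)
                break
        else:
            # No existing cluster matched, so this note seeds a new one.
            clusters.append({"seed": note_tokens, "members": [index]})
    return [cluster["members"] for cluster in clusters]


notes = [
    "agent promised refund outside policy window",
    "agent promised refund outside the policy window",
    "agent failed to escalate angry customer",
    "agent failed to escalate an angry customer",
]
groups = greedy_cluster(notes)
print(groups)  # [[0, 1], [2, 3]]
```

Each resulting group is only a candidate issue. Naming it and rating its severity against your policies is the human step described above.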

[Interactive demo: synthetic test coverage grid, persona × intent. Personas: Calm, Confused, Angry, Adversarial. Other labels: High risk, Adversarial, Multi-turn, Edge case, Conflicting.]

Dimension-driven synthetic testing

We generate test cases across failure dimensions: policy violations, hallucinated facts, missed escalations, brand-tone drift, PII exposure. Coverage you can't get from real conversations alone.
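As a rough sketch of what "dimension-driven" can mean in practice, the snippet below enumerates every combination of a few dimensions so no cell is left untested by accident. The failure dimensions come from the paragraph above. The persona and intent values and the test-case shape are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical persona and intent values, for illustration only.
PERSONAS = ["calm", "confused", "angry", "adversarial"]
INTENTS = ["refund", "subscription_change", "escalation_request", "account_access"]

# Failure dimensions as listed on this page.
FAILURE_DIMENSIONS = [
    "policy_violation",
    "hallucinated_fact",
    "missed_escalation",
    "brand_tone_drift",
    "pii_exposure",
]


def build_test_matrix(personas, intents, failure_dimensions):
    """Return one test-case spec per (persona, intent, failure dimension) combination."""
    combos = product(personas, intents, failure_dimensions)
    return [
        {"id": f"T-{i:03d}", "persona": persona, "intent": intent, "targets": dimension}
        for i, (persona, intent, dimension) in enumerate(combos, start=1)
    ]


matrix = build_test_matrix(PERSONAS, INTENTS, FAILURE_DIMENSIONS)
print(len(matrix))  # 4 * 4 * 5 = 80
print(matrix[0])
```

Each spec would then be expanded into one or more concrete conversations. That generation step is not shown here.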

Binary LLM judges, human-validated

Every evaluator is validated against human labels on a held-out set, scored against what 'correct' means in support: refund windows, escalation rules, brand voice, PII, compliance. We publish the validation scores — you see exactly how much to trust each judge.
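One plausible shape for those validation scores, sketched under assumptions: compare the judge's pass/fail verdicts with human labels on the held-out set and report overall agreement plus the true-positive and true-negative rates. The metric choice, function name, and sample labels are illustrative and not taken from Sentinel's actual reports.

```python
def validate_judge(human_labels, judge_labels):
    """Compare binary judge verdicts (True = pass) with human labels on a held-out set."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)          # both say pass
    tn = sum(1 for h, j in pairs if not h and not j)  # both say fail
    fp = sum(1 for h, j in pairs if not h and j)      # judge passes a human-labeled failure
    fn = sum(1 for h, j in pairs if h and not j)      # judge fails a human-labeled pass
    return {
        "agreement": (tp + tn) / len(pairs),
        "true_positive_rate": tp / (tp + fn) if tp + fn else None,
        "true_negative_rate": tn / (tn + fp) if tn + fp else None,
    }


# Made-up labels for eight held-out conversations.
human = [True, True, False, False, True, False, True, False]
judge = [True, True, False, True, True, False, False, False]
scores = validate_judge(human, judge)
print(scores)
# {'agreement': 0.75, 'true_positive_rate': 0.75, 'true_negative_rate': 0.75}
```

Reporting the two rates separately matters because a judge that passes almost everything can still show high raw agreement when failures are rare.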

Locked eval dataset + regression bench

We never tune against your eval set. You get a stable regression bench in Braintrust (or similar) that runs on every release and catches behavioral drift before it ships.
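The core check a regression bench performs can be sketched in a few lines: run the same locked cases against the baseline and the candidate release, then list every case that used to pass and no longer does. This is a tool-agnostic illustration with made-up case IDs. It does not use the Braintrust API.

```python
def find_regressions(baseline, candidate):
    """Return IDs of cases that passed in the baseline run but not in the candidate run.

    A case missing from the candidate run is treated as a failure (an assumption).
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Hypothetical per-case results: True = the binary judge passed the response.
baseline = {"refund-window-01": True, "escalation-03": True, "pii-02": False}
candidate = {"refund-window-01": True, "escalation-03": False, "pii-02": False}

regressions = find_regressions(baseline, candidate)
print(regressions)  # ['escalation-03']
```

Because the dataset is locked, a non-empty list points at a change in the agent rather than a change in the test.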

Independent audit packet

At least 5 named, reproducible failures, each with repro steps, a severity rating, estimated customer impact, and a recommended fix. The document your VP CX shows the CEO.

Monthly monitoring + drift detection

After the audit, we watch for behavioral drift in production. When a pattern degrades, you see it in the dashboard — before your customers do.
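As an illustration of how a degrading pattern might be flagged, the sketch below applies a one-sided two-proportion z-test to the failure rate of one category, comparing the audit baseline with a recent production window. The test choice, the threshold of 2.0, and the sample counts are assumptions. The page does not describe Sentinel's actual drift method.

```python
import math


def failure_rate_drift(base_fail, base_total, cur_fail, cur_total, z_threshold=2.0):
    """One-sided two-proportion z-test: has the failure rate risen since baseline?"""
    p_base = base_fail / base_total
    p_cur = cur_fail / cur_total
    pooled = (base_fail + cur_fail) / (base_total + cur_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cur_total))
    z = (p_cur - p_base) / std_err if std_err > 0 else 0.0
    return {
        "baseline_rate": p_base,
        "current_rate": p_cur,
        "z": z,
        "drifted": z > z_threshold,  # only flags increases in failure rate
    }


# Made-up counts: 20 failures in 1,000 baseline conversations, 45 in 1,000 recent ones.
result = failure_rate_drift(base_fail=20, base_total=1000, cur_fail=45, cur_total=1000)
print(round(result["z"], 2), result["drifted"])
```

A stricter threshold trades slower detection for fewer false alarms, which is a tuning decision for each failure category.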

— HOW IT WORKS

What happens in the first 14 days.

Day 0: Discovery call
45 minutes. We confirm the AI agent in scope, the integration surface, and your success criteria.

Day 1: SOW signed. Clock starts.
Read-only access to agent logs confirmed. First invoice (50% deposit) issued. Eval environment provisioned.

Days 2–4: Failure-discovery sprint
Trace-mining clusters 200+ real conversations into candidate issues; we name and severity-rate the failure taxonomy (8–12 categories) against your policies.

Days 5–8: Synthetic test generation
Dimension-driven scenarios across persona × intent × risk × ambiguity. Red-team library applied (200 adversarial patterns). All traces captured in Langfuse.

Days 9–12: Evaluators built + validated
Binary judges built and validated against human labels. Regression bench seeded in Braintrust (or similar). Eval pass-rate baseline established.

Day 13: Dashboard scaffold shipped
Next.js reliability dashboard live on Vercel. Agent Reliability Score, failure-rate by category, regression catches. Five named risks documented with reproduction steps.

Day 14: Audit packet delivered
Final invoice issued. Full audit packet delivered. Option to activate the monitoring retainer with no additional setup.

— WHAT YOU GET

The audit packet your CEO will trust and your engineering team can act on.

Named failure taxonomy (8–12 categories, derived from your real transcripts)
5+ named reproducible failures — each with repro steps, severity, customer impact, and recommended fix
Binary LLM evaluators with human-validation scores published
Locked eval dataset + regression bench in Braintrust (or similar)
Next.js reliability dashboard (Agent Reliability Score, failure-rate by category, regression catches)
Synthetic test library — 500+ AI-support-specific test cases (refund flows, subscription changes, escalation triggers, brand-tone scenarios)
Adversarial red-team library — 200 prompt-injection and policy-evasion patterns, updated quarterly
'When Your Board Asks' playbook — slide deck and talking points for your next QBR
Monthly monitoring retainer (optional) — production regression alerts via Slack or Teams

— THE GUARANTEE

Five failures or you pay nothing.

Five-Failure Guarantee

If we don't surface at least 5 named, reproducible AI-agent failures during the initial 14-day audit, your audit fee is fully refunded AND the first month of monitoring is free.

No surprises on our watch

Under active monitoring, every Severity-1 AI failure pattern is surfaced, documented, and escalated to you in real time — with full context to hand to your vendor.

We've never triggered this guarantee. But it tells you something about our methodology that we're willing to make it.

— WHO SENTINEL IS FOR

You're the right buyer if…

Good fit

✓ You've deployed an AI customer-support agent (Decagon, Sierra, Ada, Intercom Fin, Forethought, Maven AGI, or in-house) in the last 18 months
✓ Your board, CEO, or CFO has asked 'how do we know it's not lying?' in the last 90 days
✓ Your AI vendor's dashboard doesn't satisfy that question
✓ You've had at least one AI-related customer complaint or escalation you couldn't fully explain

Not a fit

✕ You haven't deployed an AI agent yet (we can refer you to the right builders)
✕ You're looking for the cheapest QA option
✕ You need a SOC 2 / GRC compliance product

— COMMON QUESTIONS

What people ask before signing.

We need read-only access to your conversation logs or a data export. We work within your existing data retention and privacy boundaries. Our work product contains anonymized failure patterns and synthetic repros — not raw transcripts.

Those tools grade human agents and bolt AI on. Sentinel is AI-native — built specifically for AI-to-customer interactions — and independent. We're not affiliated with your AI vendor, we don't use the vendor's rubrics, and we're not incentivized to make the numbers look good.

Decagon, Sierra, Ada, Intercom Fin, Forethought, Maven AGI, and in-house LangGraph or LangChain agents. Two delivery paths: for closed vendor platforms we audit from the outside off your conversation logs and deliver prioritized fixes you take to your vendor; for in-house agents we plug into your tracing and repo and can draft fixes as PRs. If you're running a custom stack, contact us for a scope assessment.

You can, and for the discovery step you probably should: it's good infrastructure. But it leaves three gaps: independence (a score you generate about your own agent doesn't satisfy a board), the support-domain ground truth that decides whether a cluster is actually a policy or brand failure, and visibility into closed vendor platforms, which those tools usually can't see at all. Sentinel supplies all three.

We run best-in-class eval infrastructure rather than reinventing it — and we're stack-agnostic (LangSmith, Braintrust, or log-based for closed platforms). What you pay for is the customer-support eval IP, the independent audit, and the operated outcome — not the plumbing.

We deliver the audit packet on Day 14 and present monitoring options. You can move into an ongoing monitoring retainer, or take the eval suite and regression bench and run it yourself — we provide a full Mintlify runbook. Monitoring is month-to-month with 30-day notice to cancel.

We're completing the founding-cohort audits. Sanitized case studies will be published at thezeal.ai/customers as they complete. Subscribe to be notified.

Ready to know what your AI agent is actually doing?

We offer a free 30-min assessment call to determine whether Sentinel is the right fit. No slides, no pitch deck — we'll ask about your deployment and tell you what we'd find.

Book a free assessment →