AI Agent Platform

Pre-Launch Red Team Assessment for Customer Service Agent

Context

A YC-backed startup building an AI customer service agent for e-commerce companies. The agent handled product recommendations, order management, and basic support escalation. Two weeks from public launch with several enterprise pilots scheduled.

Challenge

The team had run extensive automated testing but had concerns about edge cases and security that their tests couldn't cover.

  • Users manipulating the agent to access other customers' orders
  • Competitors extracting training data or system prompts
  • Edge cases causing the agent to make incorrect promises
  • Safety failures that could create legal or reputational risks

Approach

Week 1

Threat Modeling & Attack Surface Mapping

3 security researchers identified attack vectors specific to customer service context: order manipulation, identity spoofing, prompt injection via product descriptions

Week 2

Adversarial Testing

Systematic testing across 12 vulnerability categories with 400+ adversarial test cases

Week 3

Validation & Documentation

Severity assessment, reproduction steps, remediation guidance, and executive summary

Results

3 Critical vulnerabilities
7 High-severity issues
15 Medium-severity issues
23 Low-severity issues

Critical Findings

  • Prompt injection through product reviews bypassed safety guardrails
  • Role confusion attack allowed accessing other users' order history
  • Social engineering technique extracted partial system prompt
"SliceLabel found vulnerabilities in three days that our automated red team missed entirely. The prompt injection through product reviews was particularly creative—we never would have thought to test that vector." — Head of Engineering, AI Agent Startup
Healthcare AI

Compliance Dataset for FDA Pre-Submission

Context

A Series B digital health company developing an AI-powered clinical decision support tool for primary care physicians. Required human-validated evaluation data for FDA Pre-Submission meeting to demonstrate safety and effectiveness.

Challenge

  • FDA requires human expert validation, not just AI-generated benchmarks
  • Need documented methodology that withstands regulatory scrutiny
  • Must demonstrate inter-rater reliability across clinical experts
  • Need traceable credentials for all evaluators

Approach

Phase 1

Protocol Development

Worked with company's medical team and regulatory consultants to design evaluation schema aligned with FDA clinical decision support guidance

Phase 2

Expert Recruitment

Recruited 8 board-certified physicians (4 primary care, 2 internal medicine, 2 specialists in target conditions) with verified credentials

Phase 3

Calibration & Annotation

2-day physician calibration workshop followed by annotation with blind multi-pass review

Phase 4

Documentation

Complete methodology report, annotator credentials, and IRR analysis for FDA submission

Results

2,500 Case annotations
0.89 Cohen's Kappa score
8 Board-certified MDs
100% FDA accepted
"The FDA reviewer specifically commented on the quality of our evaluation methodology. SliceLabel's documentation was more thorough than what many larger companies submit." — VP Regulatory Affairs, Healthcare AI Company
Financial Services

Bias and Fairness Evaluation for Lending Model

Context

A fintech company deploying an AI-assisted underwriting model that generated preliminary creditworthiness assessments. Model outputs were reviewed by human underwriters but influenced final decisions. Needed fairness evaluation before regulatory audit.

Challenge

  • Automated fairness metrics don't capture reasoning quality differences
  • Need to evaluate explanations, not just decisions
  • Regulators expect human review of edge cases
  • Protected class analysis requires careful methodology

Results

1,200 Cases evaluated
3 Systematic patterns
23% Explanation quality gap
Pre-audit Issues resolved

Key Patterns Identified

  • Self-employment income was discounted more heavily for certain demographic groups
  • Non-traditional credit history treated inconsistently across similar risk profiles
  • Explanation quality varied significantly, potentially affecting underwriter decisions
"The explanation quality disparity was something our automated fairness metrics completely missed. That finding alone was worth the entire engagement." — Chief Risk Officer, Fintech Lender

Themes Across Our Evaluations

01

Automated Testing Has Blind Spots

LLM-as-judge and rule-based testing consistently miss creative attack vectors and subtle quality issues.

02

Edge Cases Are Where Safety Fails

Systems that pass automated benchmarks often fail on rare but important cases that require human judgment.

03

Methodology Matters for Compliance

Regulators increasingly expect documented human evaluation methodology, not just automated metrics.

04

Domain Expertise Catches What Others Miss

Specialized evaluators identify issues that generalist crowdworkers and automated systems overlook.

05

Speed and Quality Are Not Tradeoffs

With the right team and processes, rapid iteration and rigorous evaluation can coexist.

Want Similar Results?

Every AI system has unique evaluation needs. Let's discuss what rigorous human evaluation could look like for your specific challenges.

Contact us: hello@slicelabel.com