Training Session · AI Quality & Safety

LLM Evaluation Frameworks

Deploying an LLM without an evaluation framework is shipping blind. This workshop teaches teams how to measure what matters — building eval pipelines that catch failures before users do.

Format: Workshop
Audience: AI Engineers, ML Teams & Technical PMs
Agenda: 4 Blocks + Exercises
Status: Active
Session Overview

What This Workshop Covers

From automated eval pipelines to red-teaming — the full evaluation stack for teams shipping LLMs in production.

Evaluation Foundations
What makes LLM evaluation hard — the lack of ground truth, subjective quality, and why benchmarks often don't predict real-world performance.
Automated Eval Pipelines
Building scalable evaluation systems — from reference-based metrics to LLM-as-judge patterns that can run at the speed of CI/CD.
Human Evaluation Design
When automated evals aren't enough — how to design human evaluation tasks that produce consistent, useful signal without annotation chaos.
Red-Teaming & Safety
Systematic adversarial testing — finding the failure modes in your model before malicious users or edge cases find them for you.
Session Structure

The Agenda

Four blocks that cover the full evaluation stack — from why benchmarks lie to how to run a red-team session that produces actionable findings.

Block 01

Why LLM Evaluation Is Hard

The ground truth problem — why traditional ML metrics don't transfer to generative models
The dimensions of quality: accuracy, fluency, faithfulness, helpfulness, safety — and how they trade off against each other
The benchmark trap — why MMLU scores don't predict whether your RAG pipeline will fail in production
Block 02

Building Automated Eval Pipelines

Reference-based metrics: ROUGE, BLEU, BERTScore — what they measure and where they break down for generative tasks (sketched after this list)
LLM-as-judge: using a model to evaluate a model — the design patterns, prompts, and calibration tricks that make it reliable
Integrating evals into CI/CD — running evaluation gates on every model change so regressions don't reach production
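
A minimal sketch of the reference-based metrics above, assuming the rouge-score and bert-score packages are installed; the example texts are invented for illustration.

```python
# Reference-based metrics in a few lines. Assumes the rouge-score and
# bert-score packages are installed; the texts below are invented examples.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The invoice was paid on March 3rd by the finance team."
candidate = "Finance paid the invoice on March 3rd."

# ROUGE: n-gram overlap with the reference. Cheap, but purely surface-level.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print("ROUGE-L F1:", round(rouge["rougeL"].fmeasure, 3))

# BERTScore: embedding similarity, more tolerant of paraphrase than ROUGE/BLEU.
precision, recall, f1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", round(f1.item(), 3))
```

Both assume a single known-good reference exists, which is exactly where they break down for open-ended generation and why the next bullet reaches for LLM-as-judge.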
Exercise: Design an automated eval pipeline for a specific use case — define the dimensions to measure, the scoring method for each, and the pass/fail thresholds that gate a release
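
One way a solution to this exercise could start, as a hedged sketch: judge_score is a placeholder for whatever LLM-as-judge call your stack provides, and the dimensions, thresholds, and eval set are illustrative, not recommendations.

```python
# Sketch of a CI/CD eval gate: score a fixed eval set on a few quality
# dimensions and fail the build when any dimension drops below its threshold.
# judge_score is a placeholder for your own LLM-as-judge call; the dimensions
# and thresholds below are illustrative only.
import sys
from statistics import mean

THRESHOLDS = {"faithfulness": 0.90, "helpfulness": 0.80, "safety": 0.99}

def judge_score(dimension: str, prompt: str, output: str) -> float:
    """Placeholder: ask a judge model to rate output on one dimension, 0.0-1.0."""
    raise NotImplementedError("wire this to your judge model of choice")

def run_gate(eval_cases: list[dict]) -> bool:
    all_passed = True
    for dimension, threshold in THRESHOLDS.items():
        scores = [judge_score(dimension, c["prompt"], c["output"]) for c in eval_cases]
        avg = mean(scores)
        passed = avg >= threshold
        all_passed = all_passed and passed
        print(f"{dimension}: {avg:.2f} vs threshold {threshold:.2f} -> {'PASS' if passed else 'FAIL'}")
    return all_passed

if __name__ == "__main__":
    eval_cases = []  # curated prompts plus the candidate model's outputs
    sys.exit(0 if eval_cases and run_gate(eval_cases) else 1)
```

The exit code is the whole point: a nonzero exit is what lets the CI system block a release when any dimension regresses.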
Block 03

Human Evaluation That Actually Works

Task design for human raters — how to write annotation guidelines that produce consistent labels across different evaluators
Inter-rater reliability — measuring whether your human evals are trustworthy and what to do when they aren't (see the sketch after this list)
Calibrated judgment: training evaluators to apply consistent standards across edge cases and subjective calls
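
A small sketch of the inter-rater reliability check above, assuming scikit-learn is available; the labels are invented for illustration.

```python
# Inter-rater reliability via Cohen's kappa: agreement between two raters
# corrected for chance (1.0 = perfect, 0.0 = no better than chance).
# Assumes scikit-learn is installed; the labels below are invented examples.
from sklearn.metrics import cohen_kappa_score

# Two annotators labelling the same ten model outputs as good or bad.
rater_a = ["good", "good", "bad", "good", "bad", "bad", "good", "good", "bad", "good"]
rater_b = ["good", "bad", "bad", "good", "bad", "good", "good", "good", "bad", "bad"]

print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
# A low kappa usually means the annotation guidelines are ambiguous;
# the fix is rewriting the guidelines, not averaging the disagreement away.
```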
Key Principle: Human eval is your ground truth. Garbage annotation guidelines produce garbage ground truth — and you'll only find out after you've trained on it.
Block 04

Red-Teaming for Failure Modes

The red-teaming mindset — thinking like an adversary to systematically probe model behavior at the edges
Categories of failure: hallucination, refusal errors, jailbreaks, bias, and instruction-following breakdowns
Structured red-teaming protocols — how to run a session that produces actionable findings, not just anecdotes
Exercise: Red-team a deployed LLM feature — define the failure taxonomy, design adversarial prompts for each category, and document findings in a structured report
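
One possible shape for the structured report this exercise asks for, as a sketch; the failure categories mirror the taxonomy above, but the severity scale and field names are assumptions to adapt to your own feature.

```python
# Sketch of a structured red-team finding, so a session produces comparable,
# actionable records rather than anecdotes. Categories follow the taxonomy
# above; the severity scale and fields are illustrative choices.
from dataclasses import dataclass
from enum import Enum

class FailureCategory(Enum):
    HALLUCINATION = "hallucination"
    REFUSAL_ERROR = "refusal_error"
    JAILBREAK = "jailbreak"
    BIAS = "bias"
    INSTRUCTION_FOLLOWING = "instruction_following"

@dataclass
class Finding:
    category: FailureCategory
    severity: int            # e.g. 1 = cosmetic ... 4 = blocks release
    adversarial_prompt: str  # the exact prompt that triggered the failure
    observed_output: str     # what the model actually returned
    expected_behavior: str   # what a correct response would have looked like
    reproducible: bool = True
    notes: str = ""

report: list[Finding] = [
    Finding(
        category=FailureCategory.HALLUCINATION,
        severity=3,
        adversarial_prompt="Summarize the termination clause in the attached contract.",
        observed_output="Cites a 90-day notice period that appears nowhere in the source.",
        expected_behavior="Summarize only clauses present in the source, or say it is absent.",
    ),
]
```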
What You Leave With

An Evaluation System You Can Deploy

Session Frameworks

Apply immediately — not just in theory

Three ready-to-use frameworks for teams building and evaluating LLMs in production — designed to operationalize evaluation, not just discuss it.

The Eval Dimension Map
A structured breakdown of LLM quality dimensions — with recommended measurement methods, automated metrics, and human eval criteria for each.
LLM-as-Judge Template
A prompt template and calibration guide for using an LLM to evaluate LLM outputs — including the design choices that determine whether the scores are trustworthy (sketched after this list).
Red-Team Protocol
A structured red-teaming framework — failure taxonomy, adversarial prompt categories, severity scoring, and a report template for communicating findings to stakeholders.
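
For a sense of what the LLM-as-Judge Template covers, a hedged sketch of one possible judge prompt: the rubric wording and score anchors are illustrative, and ask_judge stands in for whichever model client you use.

```python
# Sketch of an LLM-as-judge prompt: a fixed rubric with anchored scores, the
# source context, and a forced JSON verdict so results can be aggregated.
# The rubric text is illustrative; ask_judge is a placeholder client call.
import json

JUDGE_PROMPT = """You are grading an AI assistant's answer for FAITHFULNESS.

Score on this scale:
  1 = contradicts the source material
  2 = contains claims not supported by the source
  3 = fully supported, but omits important qualifications
  4 = fully supported and complete

Source material:
{context}

Question:
{question}

Assistant's answer:
{answer}

Respond with JSON only: {{"score": <1-4>, "justification": "<one sentence>"}}"""

def ask_judge(prompt: str) -> str:
    """Placeholder: send the prompt to your judge model and return its raw reply."""
    raise NotImplementedError

def judge_faithfulness(context: str, question: str, answer: str) -> dict:
    reply = ask_judge(JUDGE_PROMPT.format(context=context, question=question, answer=answer))
    return json.loads(reply)
```

Calibration here means running the template over a small set of human-labelled examples and checking that the judge's scores track the human labels before those scores are trusted in an automated gate.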
Skills Covered

What You'll Be Able to Do

LLM Evaluation · Red-Teaming · Eval Pipeline Design · LLM-as-Judge · Human Annotation Design · Inter-Rater Reliability · ROUGE / BERTScore · CI/CD for AI · Hallucination Detection · Safety Testing · Benchmark Analysis · Failure Mode Taxonomy

Bring this workshop to your AI team.

Designed for teams deploying LLMs in production. Available as a focused half-day or full-day intensive. Get in touch.
