Definitive Guide To Testing LLM Applications
Table of Contents
Introduction
Design Phase
Pre-Production
Regression testing
Post-Production
Bootstrapping
Conclusion
Glossary
INTRODUCTION
Why testing matters for LLM applications
Given the challenges of building on non-deterministic LLMs, we find that companies must develop new testing approaches to guard against harmful or misleading responses, or even embarrassing brand moments, caused by their LLM app.
We’ve crafted this short guide to help you add rigor to your LLM app testing.
In this guide, you’ll learn various testing techniques across phases of the LLM
app development lifecycle — from application design, to pre-production, to
post-production. We also give advice on evaluating specific use cases such as
Retrieval Augmented Generation (RAG) applications and agents.
“Before LangSmith, we didn’t have a systematic way to improve the quality of our LLM applications. By integrating LangSmith into our application framework, we now have a cohesive approach to benchmark prompts and models for 200+ of our applications. This supports our data-driven culture at Grab and allows us to drive continuous refinement of our LLM-powered solutions.”
Padarn Wilson, Head of Engineering, ML Platform at Grab
Testing techniques across the app development cycle
[Figure: The three phases of testing across the app development cycle: the Design Phase (error handling and self-correction in your app), the Pre-Production Phase (test the app before deploying), and the Production Phase (monitor the app in production after deployment).]
DESIGN PHASE
When designing your application, you may want to consider adding checks within your application logic to prevent unwanted outputs. Applications with built-in error handling can leverage the ability for LLMs to self-correct by executing tests within the application and feeding errors back to the LLM.
To learn more about LangGraph, check out our documentation and tutorials.
Use Case: Code generation
In this example focused on code generation, our question-answer system, ChatLangChain, sometimes hallucinates import statements, which degrades the user experience. To address this, we can design a simple LangGraph control flow that tests the imports and, if they fail, passes any errors back to the LLM for correction. Encoding these checks into a system like LangGraph and having it make fixes can greatly speed up the process of code generation. Error handling should ideally focus on simple assertions to minimize cost and latency, and this simple assertion alone improved performance considerably on our evaluation set.
Tip: Always start with the simplest possible tests (e.g., assertions that can be hard-coded) because they are cheap and fast to run!
[Figure: Self-corrective code generation flow: a generation node produces a Pydantic object, which passes through an import check and a code execution check; if a check fails, the error is fed back to the generation node, otherwise the answer is returned.]
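Concretely, this kind of control flow can be expressed as a small LangGraph graph. The sketch below is illustrative rather than the exact ChatLangChain implementation; `call_llm` and the state fields are hypothetical placeholders.

```python
# Illustrative sketch of a self-correcting code-generation flow in LangGraph.
# `call_llm` is a hypothetical helper standing in for your model call.
from typing import TypedDict
from langgraph.graph import StateGraph, END


class CodeGenState(TypedDict):
    question: str
    code: str   # candidate solution produced by the LLM
    error: str  # import-check error text fed back for self-correction


def generate(state: CodeGenState) -> dict:
    # On retries, include the previous error so the LLM can correct itself.
    code = call_llm(state["question"], state.get("error", ""))
    return {"code": code}


def check_imports(state: CodeGenState) -> dict:
    # Simple, cheap assertion: try executing only the import lines.
    imports = "\n".join(
        line for line in state["code"].splitlines()
        if line.startswith(("import ", "from "))
    )
    try:
        exec(imports, {})
        return {"error": ""}
    except Exception as exc:
        return {"error": str(exc)}


def route(state: CodeGenState) -> str:
    return "retry" if state["error"] else "done"


graph = StateGraph(CodeGenState)
graph.add_node("generate", generate)
graph.add_node("check_imports", check_imports)
graph.set_entry_point("generate")
graph.add_edge("generate", "check_imports")
graph.add_conditional_edges("check_imports", route, {"retry": "generate", "done": END})
app = graph.compile()
```

Keeping the check to a plain execution of the import lines keeps cost and latency low, in line with the tip above; in practice you would also cap the number of retries.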
Use Case: Self-corrective RAG
Sometimes, simple assertions are not a sufficient test. For example, RAG systems can suffer from hallucinations and low-quality retrieval (e.g., if a user’s question falls outside the domain of the indexed documents). In both cases, tests that can reason over text are needed and cannot be easily captured as simple assertions. In these cases, we can use an LLM to perform the testing.
Tip: Use of LLM-as-judge evaluators for error handling should be done with care to minimize latency and cost. See the “Pre-Production” section for more on LLM-as-judge evaluators.
[Figure: Self-corrective RAG flow: a question is routed to retrieval if it relates to the indexed documents, or to web search if not; retrieved documents are graded for relevance (any irrelevant documents trigger web search), an answer is generated, then checked for hallucinations and for whether it answers the question before being returned.]
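For instance, the hallucination check in this flow can be implemented as an LLM-as-judge call with structured output. This is a minimal sketch assuming the langchain-openai package; the model name and prompt wording are illustrative, not the exact grader used here.

```python
# Minimal LLM-as-judge hallucination grader (illustrative).
# Requires langchain-openai and an OpenAI API key.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class HallucinationGrade(BaseModel):
    """Binary grade: is the generated answer grounded in the retrieved documents?"""
    grounded: bool = Field(description="True only if every claim is supported by the documents")


grader = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(HallucinationGrade)


def grade_hallucination(documents: str, generation: str) -> bool:
    prompt = (
        "You are grading whether an answer is grounded in a set of retrieved facts.\n\n"
        f"FACTS:\n{documents}\n\n"
        f"ANSWER:\n{generation}\n\n"
        "Return grounded=True only if every claim in the answer is supported by the facts."
    )
    return grader.invoke(prompt).grounded
```

A similar grader can score document relevance or whether the answer addresses the question, with the graph re-retrieving or re-generating whenever a grade fails.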
Pre-Production
The goal of pre-production testing is to measure the performance of your
application, ensuring continuous improvement and catching any regressions
on scenarios that you expect to pass. To begin testing, you’ll first need to
define a set of scenarios that you want to test. Below, we’ll walk through
prepping a dataset and evaluation criteria, and how to then use regression
testing to measure and improve LLM app performance over time.
Prerequisites: Build a dataset
Datasets are collections of examples that serve as inputs and (optionally)
expected outputs used to evaluate your LLM app.
Gathering good data is often the hardest part of testing, given the large
amount of time and context needed to pull together a quality dataset to
benchmark your system. However, you don’t need many data points to get
started.
For offline evaluation, you can also optionally label expected outputs (i.e.
ground truth references) for the data points you are testing on. This lets you
compare your application’s response with the ground truth references.
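As a sketch of what this looks like with the LangSmith Python SDK (the dataset name and examples here are illustrative, and an API key is assumed to be configured):

```python
# Create a small dataset of inputs and optional ground-truth outputs in LangSmith.
from langsmith import Client

client = Client()
dataset = client.create_dataset("qa-regression-suite")

client.create_examples(
    inputs=[
        {"question": "How do I add memory to a LangGraph agent?"},
        {"question": "What does a RAG pipeline retrieve?"},
    ],
    outputs=[  # optional ground-truth references for offline evaluation
        {"answer": "Compile the graph with a checkpointer."},
        {"answer": "Documents relevant to the user's question."},
    ],
    dataset_id=dataset.id,
)
```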
Advanced: Self-improving evaluation
Few-shot learning can improve accuracy and align outputs with human preferences by providing examples of correct behavior. This is especially useful when it’s tough to explain in instructions how the LLM should behave, or if the output is expected to have a particular format. In LangSmith, self-improving evaluation looks like the following (sketched in code after the list):
1. The LLM evaluator provides feedback on generated outputs, assessing factors like correctness, relevance, or other criteria.
2. Humans add corrections to modify or correct the LLM evaluator’s feedback in LangSmith. This is where human preferences and judgment are captured.
3. These corrections are stored as few-shot examples in LangSmith, with an option to leave explanations for the corrections.
4. The few-shot examples are incorporated into the evaluator’s prompt in subsequent evaluation runs.
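A rough sketch of the idea, with the storage of corrections simplified to a plain Python list (LangSmith persists these for you; the model and prompt wording are illustrative):

```python
# Fold human corrections back into an LLM-as-judge prompt as few-shot examples.
from langchain_openai import ChatOpenAI

# Hypothetical store of past corrections (in LangSmith these are saved for you).
corrections = [
    {
        "output": "The capital of Australia is Sydney.",
        "original_verdict": "correct",
        "human_correction": "Incorrect: the capital of Australia is Canberra.",
    },
]


def build_judge_prompt(candidate_output: str) -> str:
    few_shots = "\n\n".join(
        f"Output: {c['output']}\n"
        f"Original verdict: {c['original_verdict']}\n"
        f"Human correction: {c['human_correction']}"
        for c in corrections
    )
    return (
        "Grade the following output for factual correctness. "
        "Follow the prior human corrections when they apply.\n\n"
        f"Prior corrections:\n{few_shots}\n\n"
        f"Output to grade:\n{candidate_output}\n\nVerdict:"
    )


judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
verdict = judge.invoke(build_judge_prompt("The capital of France is Paris.")).content
```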
Advanced: Pairwise evaluation
Evaluating LLM app outputs can be challenging in isolation, especially for tasks where human preference is hard to encode in a set of rules. Ranking outputs preferentially (for example, asking which of two outputs is more informative, vague, or safe) can be less cognitively demanding for human or LLM-as-judge evaluators than grading each output individually.
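A minimal pairwise LLM-as-judge sketch, assuming langchain-openai; the prompt wording and model are illustrative.

```python
# Minimal pairwise LLM-as-judge: prefer one of two candidate outputs.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class Preference(BaseModel):
    preferred: int = Field(description="1 if output A is better, 2 if output B is better")
    reason: str = Field(description="Brief justification for the preference")


judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(Preference)


def pairwise_judge(question: str, output_a: str, output_b: str) -> Preference:
    prompt = (
        f"Question: {question}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        "Which output is more informative, specific, and safe? "
        "Answer 1 for A or 2 for B, with a brief reason."
    )
    return judge.invoke(prompt)
```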
Regression testing
In traditional software, tests pass based on functional requirements, leading to
more stable behavior once validated. In contrast, AI models show variable performance due to model drift (i.e. degradation due to changes in data distribution or updates to the model) and sensitivity to small changes in prompts or inputs. As such, regression testing for LLM applications is even more critical and should be done frequently.
Once you've defined a dataset and evaluator, you may want to:
(1) Measure your LLM application's performance across experiments to identify
improved app versions to ship.
(2) Monitor app performance over time to prevent regressions.
For (1), you can select and compare experiments associated with a dataset.
Runs that regressed are auto-highlighted in red, while runs that improved are
in green — showing aggregate stats and allowing you to comb through
specific examples. This is useful for migrating models or prompts, which may
result in performance improvements or regression on specific examples.
For (2), you can set a baseline (e.g., the current deployed version of your app)
and compare it against prior versions to detect unexpected regressions. If a
regression occurs, you can isolate both the app version and the specific
examples that contain performance changes.
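In code, this amounts to running each candidate app version as a named experiment against the same dataset. The sketch below assumes the langsmith SDK with an API key configured; `answer_question`, the dataset name, and the evaluator are illustrative.

```python
# Run an experiment over a LangSmith dataset so successive app versions can be compared.
from langsmith.evaluation import evaluate


def my_app(inputs: dict) -> dict:
    # Call your LLM application here; answer_question is a hypothetical entry point.
    return {"answer": answer_question(inputs["question"])}


def exact_match(run, example) -> dict:
    # Cheap evaluator: compare the app's answer to the reference output.
    predicted = run.outputs["answer"].strip().lower()
    reference = example.outputs["answer"].strip().lower()
    return {"key": "exact_match", "score": int(predicted == reference)}


# Each call produces a named experiment on the dataset; comparing experiments
# in LangSmith highlights per-example regressions (red) and improvements (green).
evaluate(
    my_app,
    data="qa-regression-suite",   # the dataset built in the previous section
    evaluators=[exact_match],
    experiment_prefix="prompt-v2",
)
```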
“LangSmith has made it easier than ever to curate and maintain high-signal LLM testing suites.”
Use Case: Evaluating agents

Testing an agent’s final response involves:
Inputs: User input and (optionally) predefined tools
Output: Agent’s final response
Evaluator: LLM-as-Judge

Testing an agent’s trajectory involves:
Inputs: User input and (optionally) predefined tools
Output: Expected sequence of tool calls (in other words, the reference tool trajectory)

Note: Dataset creation may also be more difficult, as it’s easier to generate a dataset for earlier steps in an agent trajectory, but harder to do so for later steps, which must account for prior agent actions and responses.

[Figure: Agent evaluation: given an input, the agent selects among tools (Tool #1, Tool #2); the selected tools are compared against a reference tool trajectory covering all steps.]
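A minimal sketch of a trajectory check, with the extraction of tool-call names from an agent run simplified to plain lists of strings:

```python
# Compare an agent's actual tool calls against a reference trajectory.
def trajectory_scores(actual_tools: list[str], reference_tools: list[str]) -> dict:
    return {
        # Strict: same tools called in the same order as the reference trajectory.
        "exact_match": int(actual_tools == reference_tools),
        # Looser: every reference tool was called at least once, in any order.
        "all_steps_present": int(set(reference_tools) <= set(actual_tools)),
    }


# Example: the agent skipped the calculator step from the reference trajectory.
print(trajectory_scores(
    actual_tools=["search", "summarize"],
    reference_tools=["search", "calculator", "summarize"],
))  # {'exact_match': 0, 'all_steps_present': 0}
```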
Post-Production
Though essential, testing in the pre-production phase won’t catch everything.
Only after you’ve shipped to production can you get insights into how your
LLM application fares under real user scenarios. Beyond checking for spikes in
latency or errors, you’ll need to assess characteristics such as relevance,
accuracy, or toxicity. In post-production, a monitoring system can help detect
when your LLM app performance veers off course, allowing you to isolate
valuable failure cases.
Prerequisites: Set up tracing
If you haven’t yet, you’ll need to set up tracing to gain visibility into a
meaningful portion of your production traffic. LangSmith makes this easy to
do and offers insights into not only the user input and application response,
but also all interim steps the application took to arrive at the response. This
level of detail is helpful for writing specific step assertions or debugging
issues later on.
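Getting traces flowing can be as simple as decorating your application’s entry point. The sketch below assumes the langsmith SDK with an API key and tracing enabled via environment variables; `retrieve` and `generate_answer` are hypothetical helpers.

```python
# Instrument the application so each request is logged as a trace in LangSmith.
from langsmith import traceable


@traceable(name="answer_question")
def answer_question(question: str) -> str:
    docs = retrieve(question)               # hypothetical retrieval step
    return generate_answer(question, docs)  # hypothetical generation step
```

If the helper functions are decorated as well, their calls appear as child runs, giving you the interim steps mentioned above.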
[Chart: LLM call count per day over several weeks of production traffic.]
This can provide baseline insight into your LLM app performance, on top of
which you’ll track qualitative metrics (covered in the next section).
“LangSmith has been instrumental in accelerating our AI adoption and enhancing our ability to identify and resolve issues that impact application reliability. With LangSmith, we’ve created custom feedback loops, improving our AI application accuracy by 40% and reducing deployment time by 50%.”
Varadarajan Srinivasan, VP of Data Science, AI and ML Engineering at Acxiom
There are at least two types of feedback you can collect in production to improve app performance:

Feedback from users: You can directly collect user feedback, which can be explicit or implicit. Adding a thumbs-up / thumbs-down button to your application is an easy way to record user satisfaction with the application’s response. You can also ask users to explain why their expectations were or weren’t met. In LangSmith, you can attach user feedback to any trace or intermediate run (i.e. span) of a trace, including annotating traces inline or reviewing runs together in an annotation queue; a sketch for attaching such feedback programmatically follows below.

Feedback from online evaluators: You can also run automated evaluators (typically LLM-as-judge) over a sample of production traces to score qualities such as relevance or hallucination in real time, without needing ground-truth references.
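Attaching a user’s rating to the trace that produced the response looks roughly like this with the langsmith SDK; the run ID is a placeholder for the traced request the user rated.

```python
# Attach thumbs-up / thumbs-down feedback to a trace (or any intermediate run).
from langsmith import Client

client = Client()
run_id = "00000000-0000-0000-0000-000000000000"  # placeholder: UUID of the traced request

client.create_feedback(
    run_id,
    key="user_score",
    score=1,  # 1 for thumbs up, 0 for thumbs down
    comment="Answer matched my expectations.",  # optional explanation from the user
)
```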
Use Case: Online evaluation for RAG
To assess the RAG app’s performance in real time, you’ll likely want to test:
(1) whether the application is hallucinating responses,
(2) whether the response is relevant and properly addresses the user’s question, and
(3) what types of questions users are asking.
For example, an LLM-as-judge hallucination grader might include criteria and scoring instructions like the following excerpt:

(2) Ensure the STUDENT ANSWER does not contain "hallucinated" information outside the scope of the FACTS.

Score:
A score of 1 means that the student's answer meets all of the criteria. This is the highest (best) score.
A score of 0 means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.
Bootstrapping
After setting up tracing and online evaluators, you’ll start to catch errors in
your application in production. Ideally, you modify your application to fix these
errors. You can also fold these errors back into your test dataset used for
offline evaluation, in order to prevent the same mistakes in future releases of
your application.
[Figure: The three phases of testing across the app development cycle: the Design Phase (error handling and self-correction in your app), the Pre-Production Phase (test the app before deploying), and the Production Phase (monitor the app in production after deployment).]
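A sketch of that loop with the langsmith SDK, assuming you have the ID of a failing production run and reuse the offline dataset from the pre-production section (names are illustrative):

```python
# Fold a production failure back into the offline evaluation dataset.
from langsmith import Client

client = Client()

failed_run = client.read_run("00000000-0000-0000-0000-000000000000")  # run flagged by users or online evaluators
dataset = client.read_dataset(dataset_name="qa-regression-suite")

client.create_example(
    inputs=failed_run.inputs,
    outputs={"answer": "<corrected reference answer written during review>"},
    dataset_id=dataset.id,
)
```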
Conclusion
As large language models evolve rapidly, robust testing is needed to improve
the system built around the model and provide lasting value. In this guide, we
discussed three layers of testing that you can consider: (1) error handling
within the application itself, (2) pre-production testing, and (3) post-production monitoring.
Glossary
Agents: An agent is a system that uses an LLM to decide the control flow of
an application.
Agent trajectory: The series of steps an agent took to complete a given task.
Online evaluation: Online evaluation allows you to evaluate an application in
production. This type of evaluation scores performance in real time, as your
application handles user inputs. Notably, it does not rely on a ground-truth reference response for comparison.