Definitive Guide To Testing LLM Applications
Table of Contents
Introduction
Design Phase
Pre-Production
Regression testing
Post-Production
Bootstrapping
Conclusion
Glossary
INTRODUCTION
Why testing matters for LLM applications
Given the challenges of building on non-deterministic LLMs, we find that companies must develop new testing approaches to guard against harmful or misleading responses, or even embarrassing brand moments, caused by their LLM app.
We’ve crafted this short guide to help you add rigor to your LLM app testing.
In this guide, you’ll learn various testing techniques across phases of the LLM
app development lifecycle — from application design, to pre-production, to
post-production. We also give advice on evaluating specific use cases such as
Retrieval Augmented Generation (RAG) applications and agents.
“Before LangSmith, we didn’t have a systematic way to improve the quality of our LLM applications. By integrating LangSmith into our application framework, we now have a cohesive approach to benchmark prompts and models for 200+ of our applications. This supports our data-driven culture at Grab and allows us to drive continuous refinement of our LLM-powered solutions.”
Padarn Wilson, Head of Engineering, ML Platform at Grab
Testing techniques across the app development cycle
[Figure: The three phases of testing across the app development cycle: the Design Phase (error handling and self-correction in your app), the Pre-Production Phase (test the app before deploying), and the Production Phase (monitor the app in production after deployment).]
DESIGN PHASE
When designing your application, you may want to consider adding checks within your application logic to prevent unwanted outputs. Applications with built-in error handling can leverage the ability for LLMs to self-correct by executing tests within the application and feeding errors back to the LLM.
To learn more about LangGraph, check out our documentation and tutorials.
Use Case: Code generation
In this example focused on code generation, our question-answer system, ChatLangChain, sometimes hallucinates import statements, which degrades the user experience. To address this, we can design a simple LangGraph control flow that tests the imports and, if they fail, passes any errors back to the LLM for correction. Encoding these checks into a system like LangGraph and having it make fixes can greatly speed up the process of code generation. Error handling should ideally focus on simple assertions to minimize cost and latency, and this simple assertion alone improved performance considerably on our evaluation set.
Tip: Always start with the simplest possible tests (e.g., assertions that can be hard-coded) because they are cheap and fast to run!
[Figure: Self-corrective code generation flow: a generation node produces a Pydantic object, which passes through an import check and a code execution check; if a check fails, the error is fed back to the generation node, otherwise the answer is returned.]
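Concretely, this kind of control flow can be expressed as a small LangGraph graph. The sketch below is illustrative rather than the exact ChatLangChain implementation; `call_llm` and the state fields are hypothetical placeholders.

```python
# Illustrative sketch of a self-correcting code-generation flow in LangGraph.
# `call_llm` is a hypothetical helper standing in for your model call.
from typing import TypedDict
from langgraph.graph import StateGraph, END


class CodeGenState(TypedDict):
    question: str
    code: str   # candidate solution produced by the LLM
    error: str  # import-check error text fed back for self-correction


def generate(state: CodeGenState) -> dict:
    # On retries, include the previous error so the LLM can correct itself.
    code = call_llm(state["question"], state.get("error", ""))
    return {"code": code}


def check_imports(state: CodeGenState) -> dict:
    # Simple, cheap assertion: try executing only the import lines.
    imports = "\n".join(
        line for line in state["code"].splitlines()
        if line.startswith(("import ", "from "))
    )
    try:
        exec(imports, {})
        return {"error": ""}
    except Exception as exc:
        return {"error": str(exc)}


def route(state: CodeGenState) -> str:
    return "retry" if state["error"] else "done"


graph = StateGraph(CodeGenState)
graph.add_node("generate", generate)
graph.add_node("check_imports", check_imports)
graph.set_entry_point("generate")
graph.add_edge("generate", "check_imports")
graph.add_conditional_edges("check_imports", route, {"retry": "generate", "done": END})
app = graph.compile()
```

Keeping the check to a plain execution of the import lines keeps cost and latency low, in line with the tip above; in practice you would also cap the number of retries.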
Use Case: Self-corrective RAG
Sometimes, simple assertions are not a sufficient test. For example, RAG systems can suffer from hallucinations and low-quality retrieval (e.g., if a user’s question falls outside the domain of the indexed documents). In both cases, tests that can reason over text are needed and cannot be easily captured as simple assertions. In these cases, we can use an LLM to perform the testing.
Tip: Use of LLM-as-judge evaluators for error handling should be done with care to minimize latency and cost. See the “Pre-Production” section for more on LLM-as-judge evaluators.
[Figure: Self-corrective RAG flow: a question is routed to retrieval if it relates to the indexed documents, or to web search if not; retrieved documents are graded for relevance (any irrelevant documents trigger web search), an answer is generated, then checked for hallucinations and for whether it answers the question before being returned.]
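For instance, the hallucination check in this flow can be implemented as an LLM-as-judge call with structured output. This is a minimal sketch assuming the langchain-openai package; the model name and prompt wording are illustrative, not the exact grader used here.

```python
# Minimal LLM-as-judge hallucination grader (illustrative).
# Requires langchain-openai and an OpenAI API key.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class HallucinationGrade(BaseModel):
    """Binary grade: is the generated answer grounded in the retrieved documents?"""
    grounded: bool = Field(description="True only if every claim is supported by the documents")


grader = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(HallucinationGrade)


def grade_hallucination(documents: str, generation: str) -> bool:
    prompt = (
        "You are grading whether an answer is grounded in a set of retrieved facts.\n\n"
        f"FACTS:\n{documents}\n\n"
        f"ANSWER:\n{generation}\n\n"
        "Return grounded=True only if every claim in the answer is supported by the facts."
    )
    return grader.invoke(prompt).grounded
```

A similar grader can score document relevance or whether the answer addresses the question, with the graph re-retrieving or re-generating whenever a grade fails.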
Pre-Production
The goal of pre-production testing is to measure the performance of your
application, ensuring continuous improvement and catching any regressions
on scenarios that you expect to pass. To begin testing, you’ll first need to
define a set of scenarios that you want to test. Below, we’ll walk through
prepping a dataset and evaluation criteria, and how to then use regression
testing to measure and improve LLM app performance over time.
Prerequisites: Build a dataset
Datasets are collections of examples that serve as inputs and (optionally)
expected outputs used to evaluate your LLM app.
Gathering good data is often the hardest part of testing, given the large
amount of time and context needed to pull together a quality dataset to
benchmark your system. However, you don’t need many data points to get
started.
For offline evaluation, you can also optionally label expected outputs (i.e.
ground truth references) for the data points you are testing on. This lets you
compare your application’s response with the ground truth references.
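As a sketch of what this looks like with the LangSmith Python SDK (the dataset name and examples here are illustrative, and an API key is assumed to be configured):

```python
# Create a small dataset of inputs and optional ground-truth outputs in LangSmith.
from langsmith import Client

client = Client()
dataset = client.create_dataset("qa-regression-suite")

client.create_examples(
    inputs=[
        {"question": "How do I add memory to a LangGraph agent?"},
        {"question": "What does a RAG pipeline retrieve?"},
    ],
    outputs=[  # optional ground-truth references for offline evaluation
        {"answer": "Compile the graph with a checkpointer."},
        {"answer": "Documents relevant to the user's question."},
    ],
    dataset_id=dataset.id,
)
```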
Advanced: Self-improving evaluation
Few-shot learning can improve accuracy and align outputs with human preferences by providing examples of correct behavior. This is especially useful when it’s tough to explain in instructions how the LLM should behave, or if the output is expected to have a particular format. In LangSmith, self-improving evaluation looks like the following (sketched in code after the list):
1. The LLM evaluator provides feedback on generated outputs, assessing factors like correctness, relevance, or other criteria.
2. Humans add corrections to modify or correct the LLM evaluator’s feedback in LangSmith. This is where human preferences and judgment are captured.
3. These corrections are stored as few-shot examples in LangSmith, with an option to leave explanations for the corrections.
4. The few-shot examples are incorporated into the evaluator’s prompt in subsequent evaluation runs.
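A rough sketch of the idea, with the storage of corrections simplified to a plain Python list (LangSmith persists these for you; the model and prompt wording are illustrative):

```python
# Fold human corrections back into an LLM-as-judge prompt as few-shot examples.
from langchain_openai import ChatOpenAI

# Hypothetical store of past corrections (in LangSmith these are saved for you).
corrections = [
    {
        "output": "The capital of Australia is Sydney.",
        "original_verdict": "correct",
        "human_correction": "Incorrect: the capital of Australia is Canberra.",
    },
]


def build_judge_prompt(candidate_output: str) -> str:
    few_shots = "\n\n".join(
        f"Output: {c['output']}\n"
        f"Original verdict: {c['original_verdict']}\n"
        f"Human correction: {c['human_correction']}"
        for c in corrections
    )
    return (
        "Grade the following output for factual correctness. "
        "Follow the prior human corrections when they apply.\n\n"
        f"Prior corrections:\n{few_shots}\n\n"
        f"Output to grade:\n{candidate_output}\n\nVerdict:"
    )


judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
verdict = judge.invoke(build_judge_prompt("The capital of France is Paris.")).content
```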
Advanced: Pairwise evaluation
Evaluating LLM app outputs can be challenging in isolation, especially for tasks where human preference is hard to encode in a set of rules. Ranking outputs preferentially (for example, asking which of two outputs is more informative, vague, or safe) can be less cognitively demanding for human or LLM-as-judge evaluators than grading each output individually.
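A minimal pairwise LLM-as-judge sketch, assuming langchain-openai; the prompt wording and model are illustrative.

```python
# Minimal pairwise LLM-as-judge: prefer one of two candidate outputs.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class Preference(BaseModel):
    preferred: int = Field(description="1 if output A is better, 2 if output B is better")
    reason: str = Field(description="Brief justification for the preference")


judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(Preference)


def pairwise_judge(question: str, output_a: str, output_b: str) -> Preference:
    prompt = (
        f"Question: {question}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        "Which output is more informative, specific, and safe? "
        "Answer 1 for A or 2 for B, with a brief reason."
    )
    return judge.invoke(prompt)
```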
Regression testing
In traditional software, tests pass based on functional requirements, leading to
more stable behavior once validated. In contrast, AI models show variable performance due to model drift (i.e. degradation due to changes in data distribution or updates to the model) and sensitivity to small changes in prompts or inputs. As such, regression testing for LLM applications is even more critical and should be done frequently.
Once you've defined a dataset and evaluator, you may want to:
(1) Measure your LLM application's performance across experiments to identify
improved app versions to ship.
(2) Monitor app performance over time to prevent regressions.
For (1), you can select and compare experiments associated with a dataset.
Runs that regressed are auto-highlighted in red, while runs that improved are
in green — showing aggregate stats and allowing you to comb through
specific examples. This is useful for migrating models or prompts, which may
result in performance improvements or regression on specific examples.
For (2), you can set a baseline (e.g., the current deployed version of your app)
and compare it against prior versions to detect unexpected regressions. If a
regression occurs, you can isolate both the app version and the specific
examples that contain performance changes.
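In code, this amounts to running each candidate app version as a named experiment against the same dataset. The sketch below assumes the langsmith SDK with an API key configured; `answer_question`, the dataset name, and the evaluator are illustrative.

```python
# Run an experiment over a LangSmith dataset so successive app versions can be compared.
from langsmith.evaluation import evaluate


def my_app(inputs: dict) -> dict:
    # Call your LLM application here; answer_question is a hypothetical entry point.
    return {"answer": answer_question(inputs["question"])}


def exact_match(run, example) -> dict:
    # Cheap evaluator: compare the app's answer to the reference output.
    predicted = run.outputs["answer"].strip().lower()
    reference = example.outputs["answer"].strip().lower()
    return {"key": "exact_match", "score": int(predicted == reference)}


# Each call produces a named experiment on the dataset; comparing experiments
# in LangSmith highlights per-example regressions (red) and improvements (green).
evaluate(
    my_app,
    data="qa-regression-suite",   # the dataset built in the previous section
    evaluators=[exact_match],
    experiment_prefix="prompt-v2",
)
```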
“LangSmith has made it easier than ever to curate and maintain high-signal LLM testing suites.”
Use Case: Evaluating agents

Testing an agent’s final response involves:
Inputs: User input and (optionally) predefined tools
Output: Agent’s final response
Evaluator: LLM-as-Judge

Testing an agent’s trajectory involves:
Inputs: User input and (optionally) predefined tools
Output: Expected sequence of tool calls (in other words, the reference tool trajectory)

Note: Dataset creation may also be more difficult, as it’s easier to generate a dataset for earlier steps in an agent trajectory, but harder to do so for later steps, which must account for prior agent actions and responses.

[Figure: Agent evaluation: given an input, the agent selects among tools (Tool #1, Tool #2); the selected tools are compared against a reference tool trajectory covering all steps.]
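A minimal sketch of a trajectory check, with the extraction of tool-call names from an agent run simplified to plain lists of strings:

```python
# Compare an agent's actual tool calls against a reference trajectory.
def trajectory_scores(actual_tools: list[str], reference_tools: list[str]) -> dict:
    return {
        # Strict: same tools called in the same order as the reference trajectory.
        "exact_match": int(actual_tools == reference_tools),
        # Looser: every reference tool was called at least once, in any order.
        "all_steps_present": int(set(reference_tools) <= set(actual_tools)),
    }


# Example: the agent skipped the calculator step from the reference trajectory.
print(trajectory_scores(
    actual_tools=["search", "summarize"],
    reference_tools=["search", "calculator", "summarize"],
))  # {'exact_match': 0, 'all_steps_present': 0}
```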
Post-Production
Though essential, testing in the pre-production phase won’t catch everything.
Only after you’ve shipped to production can you get insights into how your
LLM application fares under real user scenarios. Beyond checking for spikes in
latency or errors, you’ll need to assess characteristics such as relevance,
accuracy, or toxicity. In post-production, a monitoring system can help detect
when your LLM app performance veers off course, allowing you to isolate
valuable failure cases.
Prerequisites: Set up tracing
If you haven’t yet, you’ll need to set up tracing to gain visibility into a
meaningful portion of your production traffic. LangSmith makes this easy to
do and offers insights into not only the user input and application response,
but also all interim steps the application took to arrive at the response. This
level of detail is helpful for writing specific step assertions or debugging
issues later on.
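Getting traces flowing can be as simple as decorating your application’s entry point. The sketch below assumes the langsmith SDK with an API key and tracing enabled via environment variables; `retrieve` and `generate_answer` are hypothetical helpers.

```python
# Instrument the application so each request is logged as a trace in LangSmith.
from langsmith import traceable


@traceable(name="answer_question")
def answer_question(question: str) -> str:
    docs = retrieve(question)               # hypothetical retrieval step
    return generate_answer(question, docs)  # hypothetical generation step
```

If the helper functions are decorated as well, their calls appear as child runs, giving you the interim steps mentioned above.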
[Chart: LLM call count per day over several weeks of production traffic.]
This can provide baseline insight into your LLM app performance, on top of
which you’ll track qualitative metrics (covered in the next section).
“LangSmith has been instrumental in accelerating our AI adoption and enhancing our ability to identify and resolve issues that impact application reliability. With LangSmith, we’ve created custom feedback loops, improving our AI application accuracy by 40% and reducing deployment time by 50%.”
Varadarajan Srinivasan, VP of Data Science, AI and ML Engineering at Acxiom
There are at least two types of feedback you can collect in production to improve app performance:

Feedback from users: You can directly collect user feedback, which can be explicit or implicit. Adding a thumbs-up / thumbs-down button to your application is an easy way to record user satisfaction with the application’s response. You can also ask users to explain why their expectations were or weren’t met. In LangSmith, you can attach user feedback to any trace or intermediate run (i.e. span) of a trace, including annotating traces inline or reviewing runs together in an annotation queue; a sketch for attaching such feedback programmatically follows below.

Feedback from online evaluators: You can also run automated evaluators (typically LLM-as-judge) over a sample of production traces to score qualities such as relevance or hallucination in real time, without needing ground-truth references.
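Attaching a user’s rating to the trace that produced the response looks roughly like this with the langsmith SDK; the run ID is a placeholder for the traced request the user rated.

```python
# Attach thumbs-up / thumbs-down feedback to a trace (or any intermediate run).
from langsmith import Client

client = Client()
run_id = "00000000-0000-0000-0000-000000000000"  # placeholder: UUID of the traced request

client.create_feedback(
    run_id,
    key="user_score",
    score=1,  # 1 for thumbs up, 0 for thumbs down
    comment="Answer matched my expectations.",  # optional explanation from the user
)
```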
Use Case: Online evaluation for RAG
To assess the RAG app’s performance in real time, you’ll likely want to test:
(1) whether the application is hallucinating responses,
(2) whether the response is relevant and properly addresses the user’s question, and
(3) what types of questions users are asking.
For example, an LLM-as-judge hallucination grader might include criteria and scoring instructions like the following excerpt:

(2) Ensure the STUDENT ANSWER does not contain "hallucinated" information outside the scope of the FACTS.

Score:
A score of 1 means that the student's answer meets all of the criteria. This is the highest (best) score.
A score of 0 means that the student's answer does not meet all of the criteria. This is the lowest possible score you can give.
Bootstrapping
After setting up tracing and online evaluators, you’ll start to catch errors in
your application in production. Ideally, you modify your application to fix these
errors. You can also fold these errors back into your test dataset used for
offline evaluation, in order to prevent the same mistakes in future releases of
your application.
[Figure: The three phases of testing across the app development cycle: the Design Phase (error handling and self-correction in your app), the Pre-Production Phase (test the app before deploying), and the Production Phase (monitor the app in production after deployment).]
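A sketch of that loop with the langsmith SDK, assuming you have the ID of a failing production run and reuse the offline dataset from the pre-production section (names are illustrative):

```python
# Fold a production failure back into the offline evaluation dataset.
from langsmith import Client

client = Client()

failed_run = client.read_run("00000000-0000-0000-0000-000000000000")  # run flagged by users or online evaluators
dataset = client.read_dataset(dataset_name="qa-regression-suite")

client.create_example(
    inputs=failed_run.inputs,
    outputs={"answer": "<corrected reference answer written during review>"},
    dataset_id=dataset.id,
)
```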
Conclusion
As large language models evolve rapidly, robust testing is needed to improve
the system built around the model and provide lasting value. In this guide, we
discussed three layers of testing that you can consider: (1) error handling
within the application itself, (2) pre-production testing, and (3) post-production monitoring.
Glossary
Agents: An agent is a system that uses an LLM to decide the control flow of
an application.
Agent trajectory: The series of steps an agent took to complete a given task.
Online evaluation: Online evaluation allows you to evaluate an application in
production. This type of evaluation scores performance in real time, as your
application handles user inputs. Notably, it does not rely on a ground-truth reference response for comparison.