UXPA 2021, Baltimore MD
How do you know your users feel satisfied?
Evidence-based best practices to give you confidence
Four metrics used to collect satisfaction and related
sentiment, all of which assess users’ perceptions. We’ll present:
• The what, when, how, pros, and cons
• Research on their reliability and validity
• Best practices to give you confidence when you use them
What we’ll cover:
Methods of measuring satisfaction
• Customer satisfaction (CSAT) / Overall satisfaction (OSAT)
• Net Promoter Score (NPS)
• System Usability Scale (SUS)
• Usability Metric for User Experience (UMUX & UMUX-LITE)
Brian Utesch
User Research Practice Lead
IBM, CHQ – Marketing
Steve Woodburn
Senior UX Research & Designer
IBM, CIO Design
Jon G. Temple
IBM Design Principal
IBM, CIO Design
Annette Tassone
UX Research & Design Lead
IBM, CIO Design
Customer Satisfaction / Overall Satisfaction
What is satisfaction?
Customer’s perception of the
degree to which the customer’s
requirements have been fulfilled.
- ISO definition
Why measure satisfaction?
Happy - Productive
Customer Satisfaction Overall Satisfaction
Question / prompt
Analysis / result
“Overall, how satisfied…?”
“How would you rate your overall satisfaction…?”
“How satisfied were you with our customer service?”
(#Satisfied + #Very satisfied) / Total #responses x 100
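A minimal sketch of the top-two-box calculation above, assuming a five-point labeled scale. The function name `csat_percent` and the sample responses are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def csat_percent(responses, top_box=("Satisfied", "Very satisfied")):
    """Return the percentage of responses that carry one of the top-box labels."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    satisfied = sum(counts[label] for label in top_box)
    return 100 * satisfied / len(responses)


# Hypothetical survey data, for illustration only.
responses = (
    ["Very satisfied"] * 12
    + ["Satisfied"] * 18
    + ["Neutral"] * 10
    + ["Dissatisfied"] * 6
    + ["Very dissatisfied"] * 4
)
print(csat_percent(responses))  # 60.0
```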
Defined by
Settings for CSAT / OSAT measure
Surveys/questionnaires Testing
CSAT / OSAT scale presentation
Be consistent
CSAT / OSAT Emoji scales
Happy or Not
Analyzing CSAT / OSAT results
Sat rate = (Actual total value / Maximum possible value) x 100
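One way to read the sat rate formula is as the sum of all ratings divided by the highest total the respondents could have given. The sketch below assumes that reading and a 1-to-5 numeric scale; `sat_rate`, `scale_max` and the ratings are hypothetical.

```python
def sat_rate(ratings, scale_max=5):
    """Sum of ratings as a percentage of the maximum possible sum."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > scale_max for r in ratings):
        raise ValueError("rating outside the scale")
    return 100 * sum(ratings) / (scale_max * len(ratings))


# Hypothetical ratings: they sum to 32 out of a possible 40.
ratings = [5, 4, 4, 3, 5, 2, 4, 5]
print(sat_rate(ratings))  # 80.0
```

For the same ratings, a top-two-box count (ratings of 4 or 5) gives 6 of 8, or 75%, which is one reason to stay consistent in how the measure is analyzed.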
Best practices for CSAT / OSAT
• Be prompt
• Consider context
• Be specific
• Use 5 to 11 items
• Use labels and anchors – semantic,
numeric, emoji
• Visual treatments can help
• Be consistent – in measure and analysis
• Allow for comments
• Be careful, in application and
Satisfaction and repurchase – weak relationship
We are UX professionals
We care about usability & productivity
Satisfaction vs. Usability
• Correlation of r = .53
• “Users prefer the design with the highest usability
metrics 70% of the time.” (Nielsen, 2012)
What are we measuring?
What should we be measuring?
CSAT / OSAT – the good and the bad
Pros
• Flexible in format and use
• Familiar
• Ease of administration
• Ease of analysis
Cons
• Short-term indication
• Cultural bias
• Non-diagnostic without specific prompts and/or comments
• Subjective: satisfaction does NOT equal performance or usability
• Non-standard
NPS has muscled its way into UX metrics
The Net Promoter Score (NPS)
Overall Satisfaction (OSAT)
There is always an easy solution
to every human problem: neat,
plausible, and wrong.
- H.L. Mencken
What is the Net Promoter Score (NPS)?
• Fred Reichheld (2003) - Harvard
Business Review - NPS is the
“One Number You Need to Grow”
• Measures recommendation -
engagement and loyalty
• The Net Promoter Score (NPS) ranges
from -100 to +100
How likely are you to recommend Foobar to a colleague?
0 (Not at all likely) 1 2 3 4 5 6 7 8 9 10
• Adoption by Business and Marketing teams
• Executives love it: A single score!
• “Best practice”
• Eventually adopted by UX research and
design. Or else
NPS Calculation: Drives extreme scores
The many uses of NPS
• Brand loyalty
• Compensation
• OKRs
• Comparing apps
Reasons why NPS may be a poor fit for UX Research
1. Inapplicable question
2. Wacky mathematics
3. Diagnostically weak
4. Comments analysis a must
5. Inappropriate for small samples
6. What construct is really being measured?
Inapplicable Question: Do you recommend X to your colleagues or friends?
• Recommendation implies choice or at least opportunity
• NPS is used where the question does not make sense (no choice)
• Can you recommend something you are forced to use, where there is only one app (e.g., internal Enterprise apps)?
• UX researchers care about satisfaction, not loyalty
How likely are you to recommend breathing air to a colleague?
0 (Not at all likely) 1 2 3 4 5 6 7 8 9 10
Wacky math
• Jared Spool has argued against the method of calculation
• Discards data (7s and 8s, the passives)
• Sacrifices diagnostic sensitivity to focus on the extremes
• Represents any distribution less well than a simple mean.
• Below: 3 distributions of scores with NPS = 0 (Fraser, 2017)
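To illustrate how different distributions can share one NPS, here is a small sketch using the standard NPS definition (percentage of 9s and 10s minus percentage of 0s through 6s). The three distributions are hypothetical, chosen for illustration; they are not the ones shown by Fraser (2017).

```python
from statistics import mean


def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical distributions of 10 scores each.
distributions = {
    "all passives": [8] * 10,
    "polarized": [0] * 5 + [10] * 5,
    "near the cut-offs": [6] * 5 + [9] * 5,
}

for name, scores in distributions.items():
    print(f"{name}: NPS = {nps(scores):.0f}, mean = {mean(scores):.1f}")
```

All three produce an NPS of 0, while their means differ (8.0, 5.0 and 7.5), which is the sense in which a simple mean can describe a distribution better than the NPS.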
Wacky math
Spool argues that NPS hides UX success.
Question in each example: How likely are you to recommend Foobar to a colleague? (0 = Not at all likely, through 10)
Example 1: N=50, all scores of 0
+6 point improvement x 50: all scores of 6
Nothing detected despite vast improvement: NPS = -100 before and after the 6 pt improvement
Example 2: N=50, all scores of 8
+1 point improvement x 50: all scores of 9
Yet giant jumps with minor changes: NPS = 0 to 100 after the 1 pt improvement
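The two scenarios above can be checked with a few lines of code. This sketch assumes the standard NPS definition (promoters score 9 or 10, detractors score 0 through 6); the variable names are hypothetical.

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Scenario 1: 50 respondents at 0, each improving by 6 points.
before_large = [0] * 50
after_large = [s + 6 for s in before_large]

# Scenario 2: 50 respondents at 8, each improving by 1 point.
before_small = [8] * 50
after_small = [s + 1 for s in before_small]

print(nps(before_large), nps(after_large))  # -100.0 -100.0
print(nps(before_small), nps(after_small))  # 0.0 100.0
```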
Diagnostically Weak
UX professionals need to understand three things:
1. What is driving the score?
2. Did the design intervention make a difference?
3. How much of the score is attributable to ease of use?
• NPS: direction + magnitude
• “Loyal” – not why
• Reasons for loyalty (cheaper, performance, familiarity, features, etc.)
• “ease of use” – what we in UX care about – one possible explanation
• What worked?
Comments analysis a must
• Dependent on analyzing comments
• Time consuming and laborious
• Short cuts: Analyses may be superficial (simple keywords or sentiment)
• Subjective, not statistically generalizable
• Most users leave no comment
• Speaks for the sample? Population?
Comment analysis can be problematic:
1. Start with N=1000
2. 20% leave comments (N=200)
3. Reveals 6-10 categories of various sizes
4. Most categories will be small
5. Individually, low statistical power (say, n=6 to n=60)
Inappropriate for small samples
Recommended minimum sample size: n = 150 to 200
• NPS is often used for small samples:
o Conference presentation (e.g., 20-30 audience members)
o Usability test (e.g., 6-9 participants)
• Internal IBM study: sampled groups of 30, 100, 200 and 300 scores, and calculated the NPS for each
• Example: with NPS = 0 and n = 30, the true NPS may lie anywhere from -26.5 to +26.5
• As n -> 200, the margin of error (MoE) reached 12
• The margin of error is rather large for small samples (30, 100) [at a 95% confidence level]
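A rough sketch of how sample size drives the NPS margin of error. It treats each response as +1 (promoter), 0 (passive) or -1 (detractor) and uses a normal approximation. The promoter and detractor shares of 30% each are hypothetical, and this is not necessarily the method used in the internal study, so the figures will not match the slide exactly.

```python
from math import sqrt
from statistics import NormalDist


def nps_margin_of_error(promoter_share, detractor_share, n, confidence=0.95):
    """Approximate margin of error, in NPS points, for a sample of size n."""
    nps = promoter_share - detractor_share
    # Variance of a per-respondent score of +1, 0 or -1.
    variance = promoter_share + detractor_share - nps ** 2
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 100 * z * sqrt(variance / n)


# Hypothetical case: 30% promoters and 30% detractors, so NPS = 0.
for n in (30, 100, 200, 300):
    moe = nps_margin_of_error(0.3, 0.3, n)
    print(f"n = {n}: NPS = 0 +/- {moe:.1f}")
```

Under these assumptions the interval at n = 30 is more than twice as wide as at n = 200, in line with the recommendation to collect at least 150 to 200 responses.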
What construct is really being measured?
• Measures loyalty – the focus on extremes
• Ries, Tassone, Hager & Elie (2017) - highly
correlated with satisfaction (range: r=.75
to r=.85)
• Good proxy for satisfaction
But: Are loyalty and satisfaction really
overlapping constructs?
Or: Do users ignore wording of the
question to make sense of it?
Literally asks about making recommendations,
not sat. In linguistics, can use substitution to
I am satisfied with being a man vs. I
recommend being a man to a friend
I am satisfied with my wife vs. I
recommend my wife to a colleague
The good news: Any measure of satisfaction is better than none
Some great things I can say about NPS:
• High visibility / commonly accepted
by executives
• Highly correlated with satisfaction
• A familiar / rapid way for the
respondent to provide feedback
• Useful for comparing applications
and to industry benchmarks
• Useful for comparing different
implementations of the same app
• Opens the conversation between
users and the product team
Best practices for NPS
• No cheating!
• Comments for all ratings (0-10)
• Intercept survey to target verified users
• Time use of intercept surveys
• Email campaign: short time window. Not for
infrequently used applications
• When reporting NPS, include:
o Survey method
o Distribution of scores
o Margin of Error with a 95% confidence interval
• Min 150 – 200 responses
I don’t always prefer
to use NPS
But when I do, I follow
some best practices
Gratuitous internet meme
What is a better fit than NPS for UX research?
The Net Promoter Score (NPS)
The System Usability Scale (SUS)
The System Usability Scale—SUS
• Created by John Brooke in 1986 to measure perceived usability of
systems and products.
• Is known as the “quick and dirty” usability scale
• Standard, researched and vetted, heavily cited, commonly used,
The System Usability Scale—SUS
John Brooke and his SUS helped to develop ISO 9241-11, Guidance on Usability
Usability is the extent to which a product can
be used by specified users to achieve
specified goals with effectiveness, efficiency,
and satisfaction in a specified context of use.
- ISO definition
What is the SUS ?
• UX questionnaire that reliably assesses perceived usability
• Set of 10 attitudinal questions answered using a 5-point bipolar agreement scale
• Yields a single score ranging 0—100 with an associated letter grade
• The average SUS score ranges from 65-70, a C
The 10 SUS Questions - SUS Classic (Alternating tone)
The 10 SUS Questions - All Positive Statements
I think that I would need the support of a technical
person to be able to use this system.
I think that I could use this system without the
support of a technical person.
Calculating a SUS Score – for Alternating Tone
Calculators are readily available – be sure to not change question order in the data set.
Subtract 1 from the response to positively worded
questions, and subtract the response from 5 for
negatively worded questions.
Add up the converted responses
Multiply the total by 2.5
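A minimal sketch of the scoring steps for one respondent. It assumes the standard alternating-tone SUS, where odd-numbered items are positively worded (response minus 1) and even-numbered items are negatively worded and reverse-scored (5 minus response). The function name and sample answers are hypothetical.

```python
def sus_score(responses):
    """Score one respondent's 10 answers (each 1-5) to the alternating-tone SUS."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("expected 10 responses, each from 1 to 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1  # positively worded item
        else:
            total += 5 - response  # negatively worded item, reverse-scored
    return total * 2.5


# Hypothetical answers, listed in the original question order.
print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))  # 85.0
```

Because the conversion depends on item position, the answers must stay in the original question order in the data set, even if the questions were shown to users in a randomized order.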
How to report and compare the score
• The score of 77.8 earns a
letter grade of B+
• When you want to compare
or trend a score, calculate
and report the Margin of
Error at a 90% or 95%
confidence level
Sauro and Lewis (2016)
What SUS research tells us
The SUS has been well vetted by many researchers and practitioners and
the scale is cited in thousands of articles.
• Mean scores
• Reliable, and reliable with small sample sizes
• Problems with alternating tone
• Like NPS, predicts revenue growth
• Alignment with usability ratings
References: Bangor et al. (2008, 2009); Finstad (2006); Lewis, Utesch & Maher (2015); Sauro (2011, 2019); Tullis and Stetson (2004)
Pros
• An industry standard and benchmark
• Reliable and valid
• Not just a score: additional insights
o Expectations of future use
o Amount of training and support needed
• Easy to administer and answer
• Relatively speaking, a large sample size is not needed
• Versatile: works for any type of user interface
Cons
• 10 questions is a lot to answer
• Complex calculation
• Risk of user error with alternating tone
• Not diagnostic
My perspective is—
• The questions have intuitive validity
• Reliable, valid, and comparable to other
tools and industry benchmarks
• Perfect as a follow-up to any evaluative
UX research
• Provides you with an overall usability
score and additional information on
expected future use and training and
support needed
• Not the best choice for longer surveys
To be confident in your score…
• Ask the level of experience with the tool – as it impacts
questionnaire ratings*
• Best to administer it electronically, immediately following tool usage or the
UX test; this simplifies scoring and also gives you the ability to randomize
the order
• If you use the original set of questions with alternating tone, tell the
user upfront that they alternate
• Use the name of the system or product in the question, e.g., iPhone.
• Calculate the margin of error
Can we measure usability more efficiently?
Usability Metric for User Experience – Lite
The System Usability Scale (SUS)
Everything should be
made as simple as
possible, but not simpler.
- Albert Einstein
The SUS “problem”
• “Not-so-quick and not-so-dirty”
• Real-estate
• Inefficient
• UMUX: Reduce questions
UMUX – Finstad (2010)
• Usability Metric for User Experience
• “I thought [the system] was easy to use” (r=.89)
• ISO 9241: Effectiveness + Efficiency + Satisfaction
Effectiveness
• [This system] allows me to accomplish my tasks.
• I think I would need a system with more features for my tasks.
• I would not need to supplement [this system] with an additional one.
• [This system’s] capabilities would not meet my requirements.
Efficiency
• [This system] saves me time.
• I tend to make a lot of mistakes with [this system].
• I don’t make many errors with [this system].
• I have to spend a lot of time correcting things with [this system].
Satisfaction
• I am satisfied with [this system].
• I would prefer to use something other than [this system].
• Given a choice, I would choose [this system] over
• Using [this system] was a frustrating experience.
Overall
• [This system] is easy to use.
Strong correlation with SUS
EFFECTIVENESS: [The system’s] capabilities meet my requirements.
SATISFACTION: Using [this system] is a frustrating experience.
OVERALL: [This system] is easy to use.
EFFICIENCY: I have to spend too much time correcting things with [this system].
Each item is rated on a 7-point scale (1 to 7).
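The slide does not show how the four items combine into a score. The sketch below follows the scoring usually attributed to Finstad (2010), offered here as an assumption: positively worded items contribute the response minus 1, negatively worded items contribute 7 minus the response, and the sum is scaled to 0-100. The function and parameter names are hypothetical.

```python
def umux_score(effectiveness, satisfaction, overall, efficiency):
    """Score the four UMUX items (each rated 1-7) on a 0-100 scale."""
    items = (effectiveness, satisfaction, overall, efficiency)
    if any(r not in range(1, 8) for r in items):
        raise ValueError("each response must be from 1 to 7")
    # Positively worded: "capabilities meet my requirements", "easy to use".
    positive = (effectiveness - 1) + (overall - 1)
    # Negatively worded: "frustrating experience", "too much time correcting".
    negative = (7 - satisfaction) + (7 - efficiency)
    return 100 * (positive + negative) / 24


# Hypothetical responses from one participant.
print(round(umux_score(6, 2, 6, 3), 1))  # 79.2
```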
The UMUX “problem”
• Some criticism of the UMUX
• 10 to 4 items isn’t a huge improvement
UMUX-Lite – Lewis, Utesch, and Maher (2013, 2015)
• Pressure to increase response rate
• Reduce SUS further
• Technology Acceptance Model – TAM (Davis, 1986)
• Usefulness + Ease of use = Tech acceptance
Correspondence with TAM
[The system’s] capabilities meet my requirements.
[This system] is easy to use.
Using [this system] is a frustrating experience.
I have to spend too much time correcting things
with [this system].
• High correlation with SUS (r=.85)
• High correlation with NPS (r=.88)
USEFULNESS: [The system’s] capabilities meet my requirements.
EASE OF USE: [This system] is easy to use.
Each item is rated on a 7-point scale (1 to 7).
Calculation A: ((item 1 + item 2) - 2) * 100/12
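Calculation A translates directly into code. In this sketch, item 1 is the usefulness rating and item 2 is the ease-of-use rating, each from 1 to 7; the function name is hypothetical.

```python
def umux_lite(usefulness, ease_of_use):
    """UMUX-Lite, calculation A: two 7-point items scaled to 0-100."""
    if any(r not in range(1, 8) for r in (usefulness, ease_of_use)):
        raise ValueError("each response must be from 1 to 7")
    return ((usefulness + ease_of_use) - 2) * 100 / 12


print(umux_lite(7, 7))  # 100.0
print(umux_lite(1, 1))  # 0.0
print(umux_lite(6, 5))  # 75.0
```

Subtracting 2 shifts the lowest possible sum (1 + 1) to zero, and dividing by 12 (the highest shifted sum) stretches the result to a 0-100 range.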
• Short
• Simple to administer
• Consistent with the TAM
• Leverage SUS norms
• Diverse applicability
• Independent validation
• Expandable framework
• Relatively new
Pros Cons
Subjective aspects, like satisfaction and usability, often collected via
• Satisfaction
• Net Promoter
• Usability (SUS,
The frequency, intensity and overall level of user involvement.
User adoption and initial uptake of a product or features.
The continued use of a product or features over time.
Task success
How efficiently, effectively and successfully users accomplish key tasks using the product or features.
• Depth of use
• Frequency of use
• Adoption rate %
• New users %
• Returning users %
• Bounce rate
• % tasks completed
Summary: When you want to understand…
…a “finger in the air” gauge of user
When you want an easy to understand
When your execs want a single measure of
…usability and want to have additional insights.
When you have users with enough patience to complete it
…usability, but more efficiently.
When you don’t have the screen real-estate for more items

