Usability, Customer Experience & Statistics

Top-Box Scoring of Rating Scale Data

Jeff Sauro • December 14, 2010

Rating scales are used widely.  Ways of interpreting rating scale results also vary widely.

What exactly does a 4.1 on a 5-point scale mean?

In the absence of any benchmark or historical data, researchers and managers look at so-called top-box and top-two-box scores (boxes refer to the response options).

For example, on a five-point scale, respondents who select the most favorable response, "strongly agree," fall into the top box.

Dividing this top-box count by the total number of responses generates a top-box proportion.
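
As a rough sketch of that arithmetic in Python (the ratings are made up for illustration, and the function name is hypothetical rather than taken from any library):

```python
def top_box_proportion(responses, scale_max):
    """Share of responses that picked the most favorable option."""
    if not responses:
        raise ValueError("responses must not be empty")
    top_box_count = sum(1 for r in responses if r == scale_max)
    return top_box_count / len(responses)

# Made-up ratings on a 1-5 agreement scale, where 5 = "strongly agree"
ratings = [5, 4, 4, 3, 5, 2, 4, 5, 1, 4]
top_box = top_box_proportion(ratings, scale_max=5)
print(f"Top-box proportion: {top_box:.0%}")  # 3 of 10 responses are 5s
```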

The idea behind this practice is that you're counting only those respondents who express a strong attitude toward a statement. This applies to standard Likert item options (strongly disagree to strongly agree) as well as to other response options, such as from "definitely will not purchase" to "definitely will purchase."

Top Two-Box Scores

Top-two-box scores include responses to the two most favorable response options. On five-point Likert-type scales this would include all agree responses ("agree" and "strongly agree").

Top-two-box scoring is popular for rating scales with between 7 and 11 points. For example, the 11-point Net Promoter question, "How likely are you to recommend this product to a friend?", has top-two boxes of 9 and 10.

The top-two-box responses are called "promoters," and responses from 0 to 6 (the bottom seven boxes) are called "detractors." The Net Promoter Score gets its name from subtracting the proportion of detractors from the proportion of promoters.
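
A minimal Python sketch of that subtraction, assuming responses are integers from 0 to 10. The function name and the sample responses are made up for illustration:

```python
def net_promoter_score(responses):
    """Proportion of promoters (9-10) minus proportion of detractors (0-6)."""
    if not responses:
        raise ValueError("responses must not be empty")
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return (promoters - detractors) / len(responses)

# Made-up responses: 4 promoters, 3 detractors, 3 in between (7s and 8s)
sample = [10, 9, 8, 7, 6, 3, 10, 9, 5, 8]
nps = net_promoter_score(sample)
print(f"Net Promoter Score: {nps:.0%}")  # (4 - 3) / 10
```

Note that the 7s and 8s count toward the total number of responses but toward neither group.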

The appeal of top-box scores is that they are intuitive. It doesn't matter if the ratings are about agreeing, purchasing or recommending. You're basically cutting to the chase and only considering the highly opinionated folks.  

Top-Box Scoring Loses Information

There are two major disadvantages to this scoring method: You lose information about both precision and variability.  When you go from 7 response options to 2 or from 11 to 2, a response of a 1 becomes the same thing as a 5. Information is lost.

Losing precision and variability means it's harder to track improvements, such as changes in attitudes after a new feature or service was launched.

For example, the following 42 responses to the Net Promoter question came from a usability test of the American Airlines website:

Response Category | Raw Responses | # of Responses
Detractors | 0, 0, 0, 2, 3, 3, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6 | 22
Passives | 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8 | 12
Promoters (Top 2-Box) | 9, 9, 9, 10, 10, 10, 10, 10 | 8

Table 1: 42 actual responses to the question "How likely are you to recommend to a friend?"

The top-two-box score is 8/42 = 19%. There are 22 detractors (0 to 6), so the Net Promoter Score is a paltry (8 - 22)/42 = -14/42 = -33% (which seems consistent with some recent criticism).

Now, let's say American Airlines hires the best UX folks to improve the website and the following new scores are obtained after the improvement:

Response Category | Raw Responses | # of Responses
Detractors | 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6 | 22
Passives | 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8 | 12
Promoters (Top 2-Box) | 10, 10, 10, 10, 10, 10, 10, 10 | 8

Table 2: 42 potential responses to the question "How likely are you to recommend to a friend?" after a hypothetical change in design.

The differences here are fewer respondents who completely hated the website (the 0s, 2s and 3s are gone), most of the 5s moving up to 6s, and three 9s changing to 10s.

Yet the top-two-box score is still 19% and the Net Promoter Score is still -33%. There were no differences in the box scores because the changes all occurred within categories.
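
To check this, here is a small Python sketch that recomputes both box scores from the responses in Table 1 and Table 2. The helper names are made up for illustration:

```python
# Responses from Table 1 (actual) and Table 2 (hypothetical redesign)
before = [0] * 3 + [2] + [3] * 3 + [4] + [5] * 9 + [6] * 5 + [7] * 6 + [8] * 6 + [9] * 3 + [10] * 5
after = [4, 5] + [6] * 20 + [7] * 6 + [8] * 6 + [10] * 8

def top_two_box(responses):
    """Proportion of 9s and 10s on the 0-10 scale."""
    return sum(1 for r in responses if r >= 9) / len(responses)

def net_promoter_score(responses):
    """Proportion of promoters (9-10) minus proportion of detractors (0-6)."""
    detractors = sum(1 for r in responses if r <= 6)
    return top_two_box(responses) - detractors / len(responses)

for label, data in (("Before", before), ("After", after)):
    print(f"{label}: n = {len(data)}, "
          f"top-two-box = {top_two_box(data):.0%}, "
          f"NPS = {net_promoter_score(data):.0%}")
# Before: n = 42, top-two-box = 19%, NPS = -33%
# After: n = 42, top-two-box = 19%, NPS = -33%
```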

Use Means & Standard Deviations

In the behavioral sciences, marketing and certainly applied research, it is acceptable to take averages and standard deviations of rating scale data. Only measurement purists will take issue with this practice.

The average rating on the website before the changes was 6.12 with a standard deviation of 2.71 (those are actual scores from 42 users).  After the hypothetical changes, the average increased 16% to 7.12 with less variability, having a standard deviation of 1.64 (those are realistic but fictitious scores).
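
These figures can be reproduced from the two tables with Python's standard `statistics` module. This is a sketch that assumes the sample standard deviation (n - 1 in the denominator), which is what `statistics.stdev` computes:

```python
from statistics import mean, stdev

# Responses from Table 1 (actual) and Table 2 (hypothetical redesign)
before = [0] * 3 + [2] + [3] * 3 + [4] + [5] * 9 + [6] * 5 + [7] * 6 + [8] * 6 + [9] * 3 + [10] * 5
after = [4, 5] + [6] * 20 + [7] * 6 + [8] * 6 + [10] * 8

mean_before, sd_before = mean(before), stdev(before)
mean_after, sd_after = mean(after), stdev(after)
pct_change = (mean_after - mean_before) / mean_before

print(f"Before: mean = {mean_before:.2f}, SD = {sd_before:.2f}")  # 6.12, 2.71
print(f"After:  mean = {mean_after:.2f}, SD = {sd_after:.2f}")    # 7.12, 1.64
print(f"Change in mean: {pct_change:.0%}")                        # 16%
```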

The improvement in scores is statistically significant (t = 2.05, p = .045) despite there being no difference in top-box and Net Promoter Scores.
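
The article does not say which t-test was used. As one plausible reconstruction, the sketch below runs Welch's two-sample t-test (unequal variances) on the two tables using only the standard library. The p-value comes from numerically integrating the t density, and the function names are made up for illustration. In practice a statistics package would do this in one call.

```python
import math
from statistics import mean, variance

before = [0] * 3 + [2] + [3] * 3 + [4] + [5] * 9 + [6] * 5 + [7] * 6 + [8] * 6 + [9] * 3 + [10] * 5
after = [4, 5] + [6] * 20 + [7] * 6 + [8] * 6 + [10] * 8

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for b vs. a."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(b) - mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the t-density area between -|t| and |t|."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    # Simpson's rule on [0, |t|]; steps must be even
    h = abs(t) / steps
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - 2 * (total * h / 3)

t_stat, df = welch_t(before, after)
p_value = two_sided_p(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df:.1f}, p = {p_value:.3f}")
```

Under these assumptions the t statistic comes out at about 2.05 with a p-value a little under .05, in line with the figures reported above.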

Top-box and top-two-box scoring systems have the benefit of simplicity but at the cost of losing information. Even rather large changes can be masked when rating scale data with many options is reduced to two or three options. It can mean the difference between showing no improvement and a statistically significant one.

Top-box scoring has its place for quickly assessing results and especially for stand-alone studies when there's no meaningful comparison or benchmark. If the results ever get compared though, you'll want a more precise scoring system to have a good chance of detecting any differences in attitudes from design changes.

About Jeff Sauro

Jeff Sauro is the founding principal of MeasuringU, a company providing statistics and usability consulting to Fortune 1000 companies.
He is the author of over 20 journal articles and 5 books on statistics and the user-experience.

Related Topics

Rating Scale, Survey, Net Promoter Score, Top Box Scoring

Posted Comments

There are 6 Comments

January 19, 2015 | Heather Siemens wrote:

Hi Jeff, I find it interesting that you use Net Promoter Score as an example of a metric that does not capture response variability and statistical significance in your article on Top-Box Scoring.

If you were talking exclusively overall satisfaction scores and the like, I would agree with your assessment. However, Net Promoter is a formula that does not rely on these types of changes between scores, only on the changes between boxes... that is to say, an NPS of 65% means something, an NPS of 6.50 means nothing.

You can't take a measure where the meaningful metric is derived by which box responses fall into, and argue that variability and precision is absent, when to gain this so-called variability and precision the significance of the score becomes completely lost.

November 19, 2014 | Elizabeth Scott wrote:

Thanks for such a useful article! I have never come across the definition of top-box score and top-box proportion. 

September 23, 2014 | Matteo Vacca wrote:

Hi Jeff,
Thank you for your useful article. I have read a lot of articles written by you and they always help me in what I am doing. About customer satisfaction and Net Promoter Score, do you think a single question (like the one in NPS) is enough to measure customer satisfaction?
If not, I have a question: how do you integrate the Net Promoter Score question in a survey with several questions that use a 5-point Likert scale rather than a 10-point one?
I hope that was clear.
Thank you again.

July 9, 2012 | Mr. Htay wrote:

Dear Sir,
I'd like to know about the calculation of Index based on Top 2 boxes of rating 5 point scale question. I found in a report top-two boxes summary Index is 203, even sample sizes is only 26. It's in a AC Nielsen report.
It may be not a sum of top 2 boxes. If we sum it the total value of top 2 boxes is 130, at most and 104 is least.
Could you please share me any calculation about index?

Best Regards,
Mr. Htay

December 17, 2010 | Jeff Sauro wrote:

Steve, Thanks for your note.

In my experience companies use NPS as a key metric they track (and get bonuses based on), so to say the movement of the NPS needle is high-stakes is no exaggeration. I agree that revenue is the ultimate metric but it is often a lagging indicator.

What's more, many teams (like product dev) don't have access to sales numbers (especially in public companies) and so rely on NPS as the best proxy for future revenue growth.

So while it would be unusual to see a significant improvement in raw scores and identical top-box scores, it is very common to see an improvement that is only statistically significant for the mean difference and not for the difference in proportions of top-boxes or NPS scores.

I see this all the time when companies are tracking NPS on quarterly dashboards. While the total sample sizes from the NPS surveys are large, when sub-grouped for product, sample sizes dip below 100 and so every bit of precision is needed. If my bonus was based on showing statistical improvements in the NPS needle, I'd put my money (literally) on the mean rather than top-box.

Of course, if the NPS went down, then I'd want the top-box score because I'd have a better shot at arguing it was only a chance fluctuation and not a real drop. :)

December 15, 2010 | Steve Bernstein, Waypoint Group wrote:

Hi Jeff,
Thanks for the article and perhaps we can discuss this further. I'm not sure if you're catching the background research and action around Net Promoter, which was based on a series of longitudinal studies that examined actual customer behaviors associated with customer feedback (documented in Reichheld's book and dozens of case studies over the years). The beauty of Net Promoter (and associated "top box" scoring as you mentioned) lies more in its ability to easily communicate action. For example, I honestly don't know how to tell an executive in 30 seconds or less what a "7.23 average satisfaction rating" really means. But executives do know what it means if we told them that only 43% of their customers are Promoters, and that the differential annual value between a Promoter and a Detractor is $162 (as a real-world example of a B2C company we recently worked with). Armed with knowledge of what creates Promoters and Detractors, executives can make good decisions and gain a leading indicator of progress by watching the % of promoters grow in their business or segment.

While it's true that the hypothetical example you posed could occur (although I've never seen it in all my years doing this work), I think that any organization that would focus on a score, including NPS, is missing the point. Aggregating numbers into a score always enables people to focus on scores, where a focus on improvement and the resulting financial metrics instead would better serve the business.
