
Best Practices for Using Statistics on Small Sample Sizes

Jeff Sauro • August 13, 2013

Some people think that if you have a small sample size you can't use statistics.

Put simply, this is wrong, but it's a common misconception.
There are appropriate statistical methods to deal with small sample sizes.

Although one researcher's "small" is another's large, when I refer to small sample sizes I mean studies that have typically between 5 and 30 users total—a size very common in usability studies.

But user research isn't the only field that deals with small sample sizes. Studies involving fMRI scans, which are expensive to run, often have small sample sizes [PDF], as do studies using laboratory animals.

While there are equations that allow us to properly handle small "n" studies, it's important to know that there are limitations to these smaller sample studies: you are limited to seeing big differences or big "effects."

To put it another way, statistical analysis with small samples is like making astronomical observations with binoculars. You are limited to seeing big things: planets, stars, moons and the occasional comet.  But just because you don't have access to a high-powered telescope doesn't mean you cannot conduct astronomy. Galileo, in fact, discovered Jupiter's moons with a telescope with the same power as many of today's binoculars.

Likewise, just because you don't have a large sample size doesn't mean you cannot use statistics. Again, the key limitation is that you can only detect large differences between designs or measures.

Fortunately, in user-experience research we are often most concerned about these big differences—differences users are likely to notice, such as changes in the navigation structure or the improvement of a search results page. 

Here are the procedures we've tested for common small-sample user research; we will cover them all at the UX Boot Camp in Denver next month.


Comparing Two Independent Groups

If you need to compare completion rates, task times, or rating-scale data for two independent groups, there are two procedures you can use for small and large sample sizes. The right one depends on the type of data you have: continuous or discrete-binary.

Comparing Means: If your data is generally continuous (not binary), such as task time or rating scales, use the two sample t-test. It's been shown to be accurate for small sample sizes.
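As a sketch of how this comparison works, the two-sample t statistic can be computed with nothing beyond the Python standard library. Welch's variant shown here does not assume equal variances; the rating data and the critical value are hypothetical illustrations, not from the article:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t statistic (Welch's variant, no equal-variance
    assumption) with Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical 7-point ratings from two independent groups of 6 users each
design_a = [6, 5, 7, 6, 5, 6]
design_b = [4, 3, 5, 4, 4, 3]
t, df = welch_t(design_a, design_b)
# Compare |t| against the two-sided 95% critical value t(0.975, df=10) ≈ 2.228
print(round(t, 2), round(df, 1), abs(t) > 2.228)  # → 4.6 10.0 True
```

Even with only six users per group, the two-point difference in mean ratings is large enough to reach significance, which illustrates the "big effects only" limitation above.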

Comparing Two Proportions: If your data is binary (pass/fail, yes/no), then use the N-1 Two-Proportion Test. This is a variation on the better-known Chi-Square test (it is algebraically equivalent to the N-1 Chi-Square test). When expected cell counts fall below one, the Fisher Exact Test tends to perform better. The online calculator handles this for you, and we discuss the procedure in Chapter 5 of Quantifying the User Experience.
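A minimal sketch of the N-1 adjustment: take the classic pooled z-test for two proportions and scale the statistic by sqrt((N-1)/N), which is algebraically equivalent to multiplying the Pearson Chi-Square statistic by (N-1)/N. The completion counts below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def n1_two_proportion(x1, n1, x2, n2):
    """N-1 two-proportion test: the pooled z-test for two proportions
    scaled by sqrt((N-1)/N). Returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    n = n1 + n2
    p = (x1 + x2) / n                        # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se * sqrt((n - 1) / n)   # the N-1 adjustment
    p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided
    return z, p_value

# Hypothetical completion counts: 11 of 12 users vs. 5 of 9 users
z, p = n1_two_proportion(11, 12, 5, 9)
```

For these counts the adjusted z is about 1.88 (p ≈ 0.06), slightly smaller than the unadjusted z, reflecting the small total N.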

Confidence Intervals

When you want to know what the plausible range is for the user population from a sample of data, you'll want to generate a confidence interval. While the confidence interval width will be rather wide (usually 20 to 30 percentage points), the upper or lower boundary of the intervals can be very helpful in establishing how often something will occur in the total user population.

For example, if you wanted to know whether users would read a sheet that said "Read this first" when installing a printer, and six out of eight users didn't read the sheet in an installation study, you'd know that at least 40% of all users would likely skip the sheet, a substantial proportion.

There are three approaches to computing confidence intervals based on whether your data is binary, task-time or continuous.

Confidence interval around a mean: If your data is generally continuous (not binary) such as rating scales, order amounts in dollars, or the number of page views, the confidence interval is based on the t-distribution (which takes into account sample size).
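As a sketch, a t-based confidence interval around a mean needs only the sample mean, standard deviation, and a critical value from the t-distribution. The ratings below are hypothetical, and the critical value is hardcoded rather than looked up:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical 7-point satisfaction ratings from 10 users
ratings = [5, 6, 4, 7, 6, 5, 6, 4, 5, 6]
n = len(ratings)
m, s = mean(ratings), stdev(ratings)
t_crit = 2.262                    # two-sided 95% critical value of t, df = 9
margin = t_crit * s / sqrt(n)
lo, hi = m - margin, m + margin   # ≈ 4.7 to 6.1
```

Note how wide the interval is at n = 10: about 1.4 points on a 7-point scale, consistent with the "20 to 30 percentage points" caveat above.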

Confidence interval around task-time: Task-time data is positively skewed, with a lower boundary of 0 seconds. It's not uncommon for some users to take 10 to 20 times longer than others to complete the same task. To handle this skew, the time data needs to be log-transformed; the confidence interval is computed on the log data, then transformed back for reporting. The online calculator handles all this.
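The log-transform-and-back procedure can be sketched in a few lines. The task times are hypothetical and the t critical value is hardcoded; exponentiating the mean of the logs yields the geometric mean:

```python
from math import log, exp, sqrt
from statistics import mean, stdev

# Hypothetical task times in seconds for 10 users (positively skewed)
times = [40, 36, 53, 56, 110, 48, 34, 44, 30, 40]
logs = [log(t) for t in times]
m, s, n = mean(logs), stdev(logs), len(logs)
t_crit = 2.262                      # two-sided 95% critical value of t, df = 9
lo = exp(m - t_crit * s / sqrt(n))  # transform back to seconds
hi = exp(m + t_crit * s / sqrt(n))
geo_mean = exp(m)                   # geometric mean, the center to report
```

Because the interval is built on the log scale, it is asymmetric around the geometric mean once transformed back, which matches the skew of the raw times.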

Confidence interval around a binary measure: For an accurate confidence interval around binary measures like completion rate or yes/no questions, the Adjusted Wald interval performs well for all sample sizes.
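A minimal sketch of the Adjusted Wald interval: add z²/2 successes and z² trials, then compute the classic Wald interval on the adjusted proportion. Applied to the printer-sheet example above (6 of 8 users skipped the sheet), the lower bound lands at roughly 40%:

```python
from math import sqrt

def adjusted_wald(successes, n, z=1.96):
    """Adjusted Wald confidence interval for a proportion:
    add z^2/2 successes and z^2 trials, then compute the
    classic Wald interval on the adjusted values."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# 6 of 8 users skipped the "Read this first" sheet
lo, hi = adjusted_wald(6, 8)   # lower bound ≈ 0.40, upper ≈ 0.94
```

The adjustment pulls the interval away from the extremes, which is why it stays accurate even at very small sample sizes where the classic Wald interval fails.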

Point Estimates (The Best Averages)

The "best" estimate for reporting an average time or average completion rate for any study may vary depending on the study goals.  Keep in mind that even the "best" single estimate will still differ from the actual average, so using confidence intervals provides a better method for estimating the unknown population average.

For the best overall average for small sample sizes, we have two recommendations for task-time and completion rates, and a more general recommendation for all sample sizes for rating scales.

Completion Rate: For small-sample completion rates, there are only a few possible values for each task. For example, with five users attempting a task, the only possible outcomes are 0%, 20%, 40%, 60%, 80% and 100% success. It's not uncommon to have 100% completion rates with five users. There's something about reporting perfect success at this sample size that doesn't resonate well. It sounds too good to be true.

We experimented [PDF] with several estimators for small sample sizes and found that the LaPlace estimator and the simple proportion (referred to as the Maximum Likelihood Estimator) generally work well for the usability test data we examined. When you want the best estimate, the calculator will generate it based on our findings.
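Both estimators are one-liners. The LaPlace estimator adds one imaginary success and one imaginary failure before taking the proportion, which tempers the "too good to be true" 100% result at small n (the 5-of-5 example is hypothetical):

```python
def laplace_estimate(successes, n):
    """LaPlace point estimate: add one success and one failure."""
    return (successes + 1) / (n + 2)

def mle_estimate(successes, n):
    """Maximum Likelihood Estimate: the simple observed proportion."""
    return successes / n

# Hypothetical study: 5 of 5 users complete the task
print(mle_estimate(5, 5))      # 1.0 — perfect success, "too good to be true"
print(laplace_estimate(5, 5))  # ≈ 0.857 — tempered toward 50%
```

As n grows, the two estimates converge, so the choice only matters for the small-sample studies this article is about.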

Rating Scales: Rating scales are a funny type of metric, in that most of them are bounded on both ends (e.g., 1 to 5, 1 to 7, or 1 to 10), unless you are Spinal Tap, of course. For small and large sample sizes, we've found the mean to be the best average, outperforming the median [PDF]. There are in fact many ways to report scores from rating scales, including top-two boxes. The one you report depends on both the sensitivity of the measure and what's used in your organization.

Average Time: One long task time can skew the arithmetic mean and make it a poor measure of the middle. In such situations, the median is a better indicator of the typical or "average" time. Unfortunately, the median tends to be less accurate and more biased than the mean when sample sizes are less than about 25. In these circumstances, the geometric mean (average of the log values transformed back) tends to be a better measure of the middle. When sample sizes get above 25, the median works fine.
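The three candidate "middles" can be compared directly; Python's standard `statistics` module includes a geometric mean. The skewed task times below are hypothetical:

```python
from statistics import mean, median, geometric_mean

# Hypothetical task times (seconds): one slow user skews the arithmetic mean
times = [25, 28, 30, 32, 34, 38, 42, 240]
print(mean(times))            # 58.625 — pulled up by the 240 s outlier
print(median(times))          # 33.0
print(geometric_mean(times))  # ≈ 41.5 — a better "middle" at n < 25
```

The geometric mean sits between the outlier-inflated arithmetic mean and the median, which is why it tends to be the least biased estimate of the middle for small-sample time data.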

About Jeff Sauro

Jeff Sauro is the founding principal of MeasuringU, a company providing statistics and usability consulting to Fortune 1000 companies.
He is the author of over 20 journal articles and 5 books on statistics and the user-experience.


Posted Comments

There are 3 Comments

November 11, 2014 | Rasmus wrote:

In comparisons, wouldn't it be a better idea to use the Mann-Whitney rank sum test to determine whether two samples are drawn from distributions with the same mean? Student's t-test assumes that the samples are drawn from a normal distribution, which can be both hard to verify for small samples and a hard assumption to fulfill.

October 22, 2014 | Steve wrote:

While it is possible to perform statistical analysis at small sample sizes, I often have a hard time justifying any inferential statistics based on the demographics used in my sample versus the wider population of users. For example, in a sample of 6 users typically used in usability study, it's hard to have a representative sample of the demographics of the wider population, particularly for consumer products with a wide audience, such as e-commerce sites. How would you justify the use of confidence intervals to a skeptical audience, in this case?  

October 7, 2014 | Rebka wrote:

I want to calculate the confidence interval in a situation that seems like a combination of a binary measure and a mean. **If** a customer makes a purchase (and this is the case about 1% of the time) then the purchase amount is continuous. How does one proceed for a small sample size? Does this type of problem show up in any of your books? 
