Sample Size Oddities

By Steve Baty

Published: November 17, 2008

“The larger the proportion of a population that holds a given opinion, the fewer people you need to interview when doing user research. Conversely, the smaller the minority of people who share an opinion, the more people you need to interview.”

It might seem counterintuitive, but the larger the proportion of a population that holds a given opinion, the fewer people you need to interview when doing user research. Conversely, the smaller the minority of people who share an opinion, the more people you need to interview.

Mariana Da Silva has written an article about sample sizes in market research—or user research—titled “The More the Merrier.” In the article, Mariana made a comment that has caused some consternation—and for good reason.

“It all comes down to the size of the effect you intend to detect. Imagine you wanted to know whether people in London are taller than people in New York. If people in London and people in New York are actually pretty much the same height, you will need to measure a high number of citizens of both cities. If, on the other hand, people in London were particularly tall and people in New York were shorter than average, this will be obvious after measuring just a handful of people.”—Mariana Da Silva

Surely, popular thinking went, the larger the difference, the more people you’d need to ask to make sure it was real? It makes intuitive sense, but ignores the underlying principles of probability theory that govern such situations.

Now, before there’s a stampede for the exit, this article is not going to be heavy on mathematics, probability, statistics, or any other related esoterica. What we’re going to do is take a look at the underlying principles of probability theory—in general terms—and see how we can make use of them to understand issues such as the following:

  • how many people to include in a usability test
  • how to efficiently identify population norms and popular beliefs
  • how to do quick-and-easy A/B test analysis

Then we’ll move on to take a look at a case study that shows why a large sample size doesn’t always guarantee accuracy in user research, when such situations can arise, and what we can do about it.

Understanding Optimal Usability Test Size

“Conventional wisdom holds that you can do usability testing with just a handful of users. With more, you’ll see diminishing returns on each successive test session.”

Across the usability landscape, conventional wisdom holds—as characterized by the title of Jakob Nielsen’s 2000 Alertbox article, “Why You Only Need to Test with 5 Users”—that you can do usability testing with just a handful of users. With more, you’ll see diminishing returns on each successive test session, because earlier participants will likely already have uncovered the bulk of the issues each new participant finds.

Nielsen’s reasoning is that each user—on average and in isolation—identifies approximately 31% of all usability issues in the product under evaluation. So, the first test session uncovers 31% of all issues; the second uncovers another 31%, overlapping substantially with the first; and so on. After five test sessions, you’ve already recorded approximately 85% of all issues, so the value of the sixth, seventh, and subsequent test sessions gets lower and lower.
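
To see where that figure comes from, here is a minimal sketch of the arithmetic, assuming (per Nielsen’s model) that each session independently uncovers about 31% of all issues:

  # Cumulative share of usability issues found after n test sessions,
  # assuming each session independently uncovers a fraction p of all issues.
  def issues_found(p, n):
      return 1 - (1 - p) ** n

  for n in range(1, 8):
      print(n, round(issues_found(0.31, n), 2))
  # prints: 1 0.31, 2 0.52, 3 0.67, 4 0.77, 5 0.84, 6 0.89, 7 0.93

The curve flattens quickly, which is the whole argument for running small test rounds often.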

Identifying Norms and Minority Views

“Some problems are more widespread, or are experienced by more people, than others. … The more prominent problems are the ones that are likely to show up early and repeatedly.”

There’s another way to explain this observation: Some problems are more widespread, or are experienced by more people, than others. Because we choose users at random for our usability tests, the more prominent problems are the ones that are likely to show up early and repeatedly.

In other words, as we do tests with more users, we not only learn about what issues people experience, but if we look at the overlap between the issues users find—even with only a handful of users—we gain an understanding of which problems are likely to be the most widespread among the target audience. So, even with a very small test base, we can be reasonably sure we’re identifying the problems that will affect the biggest proportion of the user base.

We can use the same principle to identify new features or changes to a product that would appeal to the most people—a principle ethnographers use to identify population norms among cultural groups. If we ask a small, randomly selected group of people what product changes they’d like to see, the suggestions that are most popular across the entire user population are the ones most likely to come up within that group.

But this also highlights a danger of small sample-size tests and surveys: Minority voices don’t get heard. The issues that affect small segments of the target population are less likely to show up in a small random sample of users—and so, you’re more likely to miss them.
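
To put a rough number on that risk, assume a truly random sample and an issue that affects, say, only 5% of your users. The chance of missing it entirely is easy to work out:

  # Probability that an issue affecting a fraction p of users never
  # appears at all in a random sample of n participants.
  def miss_probability(p, n):
      return (1 - p) ** n

  print(round(miss_probability(0.05, 5), 2))   # 0.77 -- a 5-user test misses it ~77% of the time
  print(round(miss_probability(0.05, 20), 2))  # 0.36 -- even 20 users miss it roughly a third of the time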

If your user research needs to include the voices of minority segments within your overall audience, it is important to plan for this ahead of time. There are a number of different options at your disposal:

  • When selecting your test or survey participants, ensure that you include at least some participants who represent each minority segment. We sometimes refer to this as a stratified sample; see the sketch following this list.
  • Run tests or surveys using a lot more participants. This also has the advantage of reducing the overall level of error in your test data.
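
As a minimal sketch of the first option (the participant names, segment labels, and quota of two per segment below are purely illustrative assumptions), recruiting a stratified sample might look something like this:

  import random
  from collections import defaultdict

  # Hypothetical recruiting pool: (participant, segment) pairs.
  pool = [("P01", "majority"), ("P02", "majority"), ("P03", "minority_a"),
          ("P04", "minority_a"), ("P05", "minority_b"), ("P06", "majority"),
          ("P07", "minority_b"), ("P08", "majority")]

  def stratified_sample(pool, per_segment=2):
      by_segment = defaultdict(list)
      for person, segment in pool:
          by_segment[segment].append(person)
      # Draw the same number of participants at random from every segment,
      # so minority segments are guaranteed a voice in the study.
      return {segment: random.sample(people, min(per_segment, len(people)))
              for segment, people in by_segment.items()}

  print(stratified_sample(pool))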

Let us now return to the subject of the Mariana Da Silva quotation. Why is it that we don’t have to measure as many people if heights differ greatly between the two populations? Don’t we still need to measure a decent-sized sample, calculate averages and confidence intervals, then carry out some sort of significance test?

The short answer is: No.

If the two populations are very different—in terms of their distributions of heights—it’s likely we’ll very quickly see that difference reflected in the mean and standard deviation of our test data. For example, let’s assume we’ve measured the heights of ten men from each city and found that the averages differ by 10 centimeters, or about 4 inches. That’s a large observed difference. But what can we conclude from it? Our initial response might be that it’s likely just an anomaly in the sampling—we just happened to pick taller Londoners.

However, as we measure more people from each city and the height differential continues to appear, the likelihood that the difference is down to random chance shrinks very quickly. It just isn’t very likely that we’re randomly, yet consistently, choosing to measure abnormally tall people in London—or abnormally short people in New York.

Now compare this to what happens when an observed difference is very small. With a small difference, the possibility that it’s due to random chance remains plausible for much longer—that is, across a much, much larger number of measurements. Therefore, we need much larger sample sizes before our statistical analysis can conclude that the difference is real.
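
Here is a rough simulation of that idea. The heights are made-up numbers drawn from normal distributions, and the Welch t statistic is just one standard way of comparing two samples:

  import math, random, statistics

  random.seed(1)

  def welch_t(a, b):
      # Welch's t statistic: the observed difference in means, measured
      # in units of its own estimated uncertainty.
      va, vb = statistics.variance(a), statistics.variance(b)
      return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

  # Ten simulated heights (cm) per city, with a spread of about 7 cm.
  new_york     = [random.gauss(175, 7) for _ in range(10)]
  london_tall  = [random.gauss(185, 7) for _ in range(10)]  # big 10 cm gap
  london_close = [random.gauss(176, 7) for _ in range(10)]  # tiny 1 cm gap

  print(round(welch_t(london_tall, new_york), 1))   # expect a large |t|: hard to blame on chance
  print(round(welch_t(london_close, new_york), 1))  # expect a small |t|: could easily be chance

With only ten people per city, the 10-centimeter gap already yields a t value that is hard to explain away as chance, while the 1-centimeter gap does not; detecting the latter reliably would take hundreds of measurements per city.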

Analyzing A/B Tests

“Testing or surveying with large numbers of people can be difficult, costly, and not necessarily valuable—although there are some forms of testing, such as A/B testing or online, self-administered surveys, that overcome many of these issues.”

Testing or surveying with large numbers of people can be difficult, costly, and—as I mentioned previously—not necessarily valuable, although there are some forms of testing, such as A/B testing or online, self-administered surveys, that overcome many of these issues. In A/B testing particularly, we can also apply some of the principles I discussed previously to reduce the length and size of a test.

In the early stages of an A/B test, a large difference in performance can be an indicator of a substantial difference between two designs. For example, we might run an A/B test on a Web site and record the results for a small number of users—say, 100 to 200. If we observe a large difference, we can shut down the test early. If we observe a small difference, we should continue running the test until we’ve observed the behavior of 2,000 to 5,000 users and can bring formal analysis techniques to bear.
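
Here is a quick, hedged sketch of that kind of early check. The visitor and conversion counts are invented, and the two-proportion z-test is just one common way of doing the formal analysis:

  import math

  def two_proportion_z(conv_a, n_a, conv_b, n_b):
      # Two-proportion z-test: is the gap in conversion rates bigger than
      # random chance alone would plausibly produce?
      p_a, p_b = conv_a / n_a, conv_b / n_b
      pooled = (conv_a + conv_b) / (n_a + n_b)
      se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
      z = (p_a - p_b) / se
      p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
      return round(z, 2), round(p_value, 3)

  # Illustrative early results after roughly 150 visitors per variant.
  print(two_proportion_z(30, 150, 12, 150))  # big gap: p-value around 0.003, safe to stop early
  print(two_proportion_z(18, 150, 15, 150))  # small gap: p-value around 0.58, keep the test running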

Large sample sizes are no guarantee of accuracy, however. A recent California election offers a case in point that is worth reviewing.

Proposition 8: A Case Study in Surveys

“A sample size of 2,300 for the exit polls reduced the survey error down below +/- 2%, suggesting a clear likelihood of defeat for Proposition 8.”

In an election on November 4, 2008, the people of California voted on Proposition 8, a ban on gay marriage. As people cast their votes and left the polling centers, exit polls—which are a type of field survey—recorded how 2,300 people had voted just moments before.

Exit polls for Proposition 8 showed a majority (52%) voted against the proposition. Younger people were more likely to be against the proposition than the elderly; college-educated people were also more likely to be against it; and people without a college education were more likely to be in favor.

More importantly, a sample size of 2,300 for the exit polls reduced the survey error down below +/- 2%, suggesting a clear likelihood of defeat for Proposition 8.
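
That figure is consistent with the standard back-of-the-envelope margin of error for a proportion. Here is a minimal check, assuming a simple random sample and a roughly 50/50 split:

  import math

  n, p = 2300, 0.5                            # sample size and an assumed ~50/50 split
  margin = 1.96 * math.sqrt(p * (1 - p) / n)  # 95% margin of error for a proportion
  print(round(margin * 100, 1))               # about 2.0 percentage points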

As the polls closed and the count started to come in, the actual data was completely different. The exact opposite, in fact. When the votes had all been counted, 53% of voters had endorsed Proposition 8—well outside the sampling error of the exit polls.

Clearly something was going on with the survey.

There are a number of different phenomena that may have contributed to this strange result:

  • The exit polls may not have been representative of the overall population of California.
  • Some people may have reported a vote that was different from the one they actually placed.

“When confronted by a survey question that touches on topics of some sensitivity or areas of social taboo, people are more likely to choose the response that represents—in their own minds—the answer the interviewer wants to hear.”

While the first case is possible, and we can’t exclude it from our consideration, the second case is also possible—and is much more interesting. When confronted by a survey question that touches on topics of some sensitivity or areas of social taboo, people are more likely to choose the response that represents—in their own minds—the answer the interviewer wants to hear. Examples of such topics might include sexual practices, drug taking, needle sharing, and criminal behavior. In surveys—particularly face-to-face surveys—the prevalence of an undesirable activity or behavior tends to be under-reported, unless survey designers and interviewers take great care to ensure that doesn’t happen.

One option is to make it very clear to each respondent that accuracy is important, and there is no right answer. Another option is to let people answer a question in a manner that removes the desire to please the interviewer—that is, either through a self-administered survey or—as in this case—through the confidential mechanism of an election.

And this, I believe, is why we witnessed such a variance between the exit poll and the actual voting. When confronted by a real person, asking how they voted, people responded by giving what was ostensibly the right answer—that they had voted No on the proposition. However, inside the polling booth, where they had privacy, they voted the way they really felt: Yes.

Conclusion

Usability testing doesn’t always have to be a compromise between certainty and sample size. By understanding the underlying principles at work, we can design our user research to make the most efficient use of our available time and energy and still achieve meaningful results.

6 Comments

You make some excellent points. As much as I’m a supporter of one-on-one usability testing, I think we in the usability community have to start accepting that it’s usually a qualitative test. Sitting down with five people to walk through a site, no matter how well structured, isn’t an experiment; it’s an exploratory method. It’s a perfectly valid and useful method and great for those Aha! moments, but you can’t, mathematically speaking, generalize those five people to a larger population.

Split (A/B) and multivariate Web site testing is experimental science, and it’s much easier to assess significance and generalize results, if those experiments are conducted properly. We tend to call both one-on-one and A/B testing usability testing, but I think grouping them this way has caused a lot of confusion and even abuse of the methods.

The problem of minorities in sampling emerged as a central issue in my research on the Special Broadcasting Service’s use of Nielsen/OzTAM ratings. SBS is—or arguably was—more focused on an ethnic and multicultural minority audience within the broader Australian community.

This is a splintered minority that is often under-represented in TV ratings, which are based on a sample size of a few thousand homes across Australia. (There are also logistical problems getting these ratings boxes into the homes of non-English-speaking families.) SBS has its own research methods to find out how ethnic groups use its programming, but it’s really no wonder to me that their reliance on commercial audience measurements has gone hand-in-hand with growth in commercially oriented behavior.

Thanks, as always, Steve for explaining this in an easily digestible manner.

There is a third option that needs to be considered in this instance: Was the question that was asked inside the polling place the same as the one that was asked outside?

It is quite common for referenda questions to be worded very carefully, which often leads to some confusion. This is certainly the case here: you record that the result was yes, but I am not clear on whether that was yes to the ban or yes to the activity. If the exit question was then worded differently, it may be that what was a Yes inside was then a No on the outside. Clearly many people who had concentrated on the fact that they had to vote Yes would then automatically answer Yes outside, even though that was the opposite answer.

Sometimes, it appears that the authorities have deliberately made a question hard to comprehend. I recall a recent one here where the question ran something like: “Do you support the council’s preferred option?” Not hard to see which answer they were hoping for there, but quite hard to work out what was going on when the second question was: “Do you support plan B?” (There was, of course, no option for the opposite of plan A.)

Oh, one more thing. You owe me a coffee for not giving our blog post discussion a nod :)

Thanks everyone for the comments. I’m glad to see the topic generating discussion!

Dr Pete: I think it’s important to recognize that one-on-one testing meets a number of needs for user experience practitioners. It is exploratory; and it’s very good at meeting that need. It is also useful in identifying—with a small audience—whether some key performance metrics for a site or application are on the right track.

But most important, to my mind, is that testing of this type is not just good for having Aha! moments. It’s also incredibly useful at highlighting those “Oh, God!” moments when you realize you’ve done something stupid.

Tim: The SBS is a classic example of the importance of careful sampling in identifying the needs of minority audience segments. It’s a good lesson for us all to remember.

Pat: I’ll buy you a coffee to make up for my oversight. The usit.com.au blog is a great resource—and a place for interesting discussion. :)

Mike: That’s a really perceptive observation! The wording of referenda can be deliberately misleading or confusing. I can’t comment on that possibility in this case, as I don’t know the wording used in the exit poll question. My understanding of the referendum itself was that it was fairly unambiguous.

This is an important issue to bear in mind when designing survey questions, and worth reiterating here.
