Group Sizes In Usability Testing
What is the optimal participant size for a usability test?
How To Specify Participant Group Size for Usability Studies: A Practitioner’s Guide, by Ritch Macefield, is a great introduction to the subject of choosing sample sizes for usability testing. I have tried to summarise it below because I believe it is a great resource; I do not take credit for any of the insights that follow.
Macefield’s article helps practitioners specify the participant group size, or “sample size”, for a usability study and articulate the basis, risks, and implications associated with this specification.
Logically, increasing the group size of a usability study increases the study’s reliability. However, as the sample size increases, so do the cost and duration of the study. How, then, do we find the optimal group size for our needs without jeopardising the reliability and accuracy of the results?
THE BROAD ISSUES
Tensions In Commercial Contexts
The main tension within study design is finding the balance between obtaining the most reliable findings and staying within the time and budget constraints of the overall project.
Therefore, our goal should be to work with stakeholders to reach a study design that is realistic and optimal for the project as a whole, balancing cost and time with reliability.
Application Of Research Literature
Nielsen (2000) suggests that a usability study with 5 participants will discover over 80% of the problems with an interface. However, this does not mean that any one particular study will achieve this figure.
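Figures like this are usually derived from a simple cumulative discovery model: if each participant independently finds any given problem with probability p, then n participants are expected to find 1 - (1 - p)^n of the problems. Below is a minimal sketch of that arithmetic. The value p = 0.31 is an assumption on my part (it is the average commonly quoted from Nielsen's own work, not a number taken from the article summarised here), and the function name is purely illustrative.

```python
def expected_discovery(p: float, n: int) -> float:
    """Expected share of problems found by n participants.

    Assumes every participant independently finds any given problem
    with the same probability p (a strong simplification).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must not be negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # p = 0.31 is an assumed average detection rate, not a measured one.
    for n in range(1, 11):
        print(f"{n:2d} participants: {expected_discovery(0.31, n):.1%}")
```

Under these assumptions, 5 participants land a little above 84%, which is where the "over 80%" rule of thumb comes from. The result is an expectation across many studies, not a guarantee for any single one.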
In a study conducted by Virzi (1992) and Nielsen (1993), 100 groups of 5 participants were used to discover problems with an interface. The mean percentage of problems discovered across all 100 groups was about 85%. Broken down, however, the statistics mean that for any one particular group of 5 there is a 95% chance that the percentage of problems found will lie in the range 66.5%-100%. In the actual study, whilst one of the groups found almost 100% of the problems, another found only 55% of them.
There can also be a tendency for practitioners to read statistics with a positive skew, since results are open to biased interpretation.
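To get a feel for why a single group of 5 can land a long way from the mean, here is a small simulation sketch. Everything in it is hypothetical: 40 problems, a per-participant detection probability of 0.31, and full independence between participants and problems. It does not use the data from the study above and is not expected to reproduce its exact figures; it only illustrates that the spread between groups can be wide even when the average looks reassuring.

```python
import random
from statistics import mean


def simulate_group(n_participants, n_problems, p, rng):
    """Share of problems found by one group (each problem counted once)."""
    found = 0
    for _ in range(n_problems):
        # A problem counts as found if at least one participant hits it.
        if any(rng.random() < p for _ in range(n_participants)):
            found += 1
    return found / n_problems


def simulate_groups(n_groups=100, n_participants=5, n_problems=40,
                    p=0.31, seed=1):
    """Run n_groups independent groups; all default values are assumptions."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [simulate_group(n_participants, n_problems, p, rng)
            for _ in range(n_groups)]


if __name__ == "__main__":
    shares = simulate_groups()
    print(f"mean {mean(shares):.1%}, "
          f"min {min(shares):.1%}, max {max(shares):.1%}")
```

Changing `n_problems` or `p` shows how quickly the gap between the best and worst group widens when problems are fewer or harder to spot.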
STUDIES RELATED TO PROBLEM DISCOVERY
For usability studies concerned with interfaces, there are 2 important, interrelated facts:
- Problems with interfaces are highly subjective. A feature may create a problem for one type of user but not for another, and this could even depend on other factors such as mood or time of day. Problems can also arise from the interrelationships between several features, so it may not always be easy to pin an issue down.
- Ranking the severity of a problem is also subjective.
What Problem Discovery Level Is Appropriate For A Study?
We could argue that higher problem detection levels are desirable in the following contexts:
- Highly secure environments (e.g. the military)
- Safety critical applications (e.g. emergency services)
- Where socio-economic or political stakes are high (e.g. governmental applications)
- Enterprise critical applications (e.g. online banking)
- When a previous study yields suspect or inconclusive results
Ultimately, we need to consider the implications of undiscovered problems remaining after a study, and the possibility of fixing them later.
To summarise, the optimal group size depends on how important it is that all problems are found.
Complexity Of The Study
Nielsen’s testing used relatively simple, closed/specific tasks with little variation. In 2001, Spool and Schroeder conducted more complex studies, with more open tasks and greater variation in participants, and found that 5 participants discovered only 35% of the problems with an interface. Therefore, the more complex the thing being studied, the larger the number of participants required.
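One rough way to see what this means for group size is to reuse the simple model in which each participant independently finds any given problem with probability p, so that n participants find 1 - (1 - p)^n of the problems. The sketch below backs out p from "5 participants found 35%" and then asks how many participants would be needed to reach a target discovery level. This is my own illustration and the function names are hypothetical; the equal-and-independent assumption is exactly what more open, varied studies tend to violate, so treat the numbers as indicative only.

```python
import math


def per_participant_rate(share_found: float, n: int) -> float:
    """Back out p from the share of problems found by n participants."""
    return 1.0 - (1.0 - share_found) ** (1.0 / n)


def participants_needed(p: float, target: float) -> int:
    """Smallest n for which 1 - (1 - p)**n reaches the target share."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    simple_p = 0.31  # assumed average rate for simple, closed tasks
    complex_p = per_participant_rate(0.35, 5)  # implied by 35% from 5 people
    for label, p in (("simple", simple_p), ("complex", complex_p)):
        print(f"{label}: p = {p:.3f}, "
              f"n for 80% = {participants_needed(p, 0.80)}")
```

Under these assumptions, reaching 80% takes 5 participants for the simple case but 19 for the complex one, which makes the cost of complexity fairly concrete.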
The following can increase a study’s complexity:
- complexity/openness of tasks being performed
- complexity/number of metrics used
- participant and target group diversity and degree to which participants match user group
- potential for study contamination
- nature/variation of participant training
Studies Related To Problem Discovery In Early Conceptual Prototypes
Studies designed to test early conceptual prototypes, which may introduce wholly new interface paradigms, should be considered separately. They are typically concerned with discovering severe usability problems at an early stage, to avoid wasting resources. Because they are tested so early, these prototypes often contain more errors than more mature prototypes would. They may also present significantly more usability issues because they employ unconventional interfaces that users are not familiar with. The product may also be more limited in scope because it is new.
We can therefore assume that these prototypes are likely to contain more usability problems than normal, which makes it more likely that fewer participants are required to discover them.
In contrast to problem discovery studies, comparative studies are summative. This makes their results better suited to analysis using established statistical methods. They are also often definitive in their outcome: their results form the basis of important commercial decisions, so they are effectively hypothesis tests that compare absolute outcomes.
One approach to understanding the actual effect size of the study is to run an open-ended study, in which we increment the number of participants in a group until one of the following 3 things happens:
- The data becomes statistically significant
- It becomes clear that increasing the number of participants will never produce a statistically significant result
- Continuing the study is no longer viable due to budget or time constraints
This approach, however, is not always viable, since studies need to be time-boxed and budgeted before they are conducted. One way around this is to specify a very large group size that is likely to produce significant findings, and then terminate the study early if any of the above 3 events occurs. This too, however, could turn out to be unviable. Ultimately, the group size will still depend on the likelihood of finding statistically significant data and on the context. We do, however, have further data from the research community stating that a comparative study using 8-25 participants per group is sensible (Landauer 1988, Nielsen 1993), and that 10-12 (Spyridakis and Fisher 1992, Rubin 1994, Faulkner 2003) is a good baseline range.
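The open-ended approach can be sketched as a loop that adds one participant per group at a time and stops either when the difference becomes significant or when a pre-agreed maximum is reached. This is only an illustration under my own assumptions: the permutation test, the 0.05 threshold, the minimum of 3 per group and all function names are hypothetical choices, not something the article prescribes. The second stopping condition (deciding that significance will never be reached) is a judgement call and is left out.

```python
import random
from statistics import mean


def permutation_p_value(a, b, rng, resamples=2000):
    """Two-sided Monte Carlo permutation test on the difference of means."""
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (resamples + 1)


def run_open_ended_study(draw_a, draw_b, max_per_group,
                         alpha=0.05, min_per_group=3, seed=0):
    """Add one participant per group until significant or out of budget.

    draw_a / draw_b are callables returning one participant's score
    (for example, a task completion time) for each design.
    """
    rng = random.Random(seed)
    a, b = [], []
    p = None
    while len(a) < max_per_group:
        a.append(draw_a())
        b.append(draw_b())
        if len(a) < min_per_group:
            continue  # too few participants to test anything yet
        p = permutation_p_value(a, b, rng)
        if p < alpha:
            return len(a), "significant", p
    return len(a), "budget exhausted", p


if __name__ == "__main__":
    data_rng = random.Random(42)
    # Hypothetical task times in seconds for two competing designs.
    result = run_open_ended_study(
        draw_a=lambda: data_rng.gauss(60, 10),
        draw_b=lambda: data_rng.gauss(75, 10),
        max_per_group=25,
    )
    print(result)
```

One caveat worth keeping in mind: checking for significance after every added participant increases the chance of a false positive, so a result obtained this way deserves more caution than one from a group size fixed in advance.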
Unfortunately, there is no “one size fits all” solution. The optimal group size for a usability study depends on the study’s complexity, severity and context. Practitioners should understand that results can be subjective, and should therefore treat these studies as formative / diagnostic exercises rather than as objective scientific experiments.
Virzi (1992) suggests that the optimal group size in terms of commercial cost-benefit may be between 3 and 20, with the upper end being more appropriate for commercial studies. Faulkner (2003) found that a group size of 10 participants will reveal a minimum of 82% of the problems, although this finding was also based on simple studies. The research by Turner (2006) suggests that a group size of 7 may actually be optimal, even for a complex study.
It is easy to argue that for most studies related to problem discovery a group size of 3-20 participants is valid, with 5-10 being a sensible baseline range. The group size should be increased in line with the study’s complexity, severity and context, and decreased for studies related to early conceptual prototypes.