Better Experimental Design for Better User Testing

March 27th, 2012

Written by

Robert Kerr

An ordinary Orlov Trotter horse, but is it capable of mental arithmetic?

User testing is a key part of validating design decisions, but what if the design of the test itself is invalid? Researchers who are aware of the pitfalls that affect experimental design are better placed to run tests that lead to more elegant end-user experiences.

In the late 19th century a horse known as Clever Hans astonished audiences with his ability to stamp out answers to arithmetic questions. He did this with startling accuracy. His fame grew until an investigation determined that Hans was not doing arithmetic at all: he was responding to subtle cues in the body language of his trainer.

This phenomenon became known as the “Clever Hans effect” and highlights the potential for flawed experimental design to produce misleading research findings.

Never work with children or animals.

W.C. Fields

The Clever Hans effect reminds us that the charming vulnerabilities and character flaws inherent in people make research design as much an art as it is a science. The following behaviors reveal more challenges that have faced researchers over the years plus some techniques for overcoming them.

Experimenter bias

In his 1979 paper, Bias in Analytic Research, David Sackett identifies a long list of bias forms that can impact the validity of research into people. Specifically, knowledge and expectations of the person conducting the test can influence the findings.

This shouldn’t surprise us. User testing is particularly susceptible because it tends to involve people who have an emotional investment in the project at hand. After all, these project members are the ones taking qualitative notes while observing a participant.

To counter this bias, a user test should be conducted by at least two people: one who maintains a dialogue with the participant, and another who takes notes. To ensure a fair account of the test, the note-taker should be someone who is not directly involved in the project and who has been briefed to capture feedback exactly as they receive it.

The Hawthorne Effect

Under test conditions a tardy workforce can become a hive of activity.

It’s not just the person running/recording the experiment who’s likely to exhibit bias, though. Researcher Henry A. Landsberger found that, while workers don’t become more productive in different light conditions (as was hypothesized), they do work harder when being observed by a team of scientists.

This relates to a strange habit a lot of researchers seem to adopt. We send out invitations to a “User Testing Session” and the first thing we explain to our participants is that we’re testing the application not the user. Why not just call the session something else? This is probably the smallest change that can have the greatest impact on validity.

Better alternatives might be:

Application test
Application review
User research
Research study

Benchmark first

The Hawthorne Effect tells us that people feel self-conscious when being watched. Lucky for us, it’s relatively easy to observe people from afar by performing remote research and/or web analytics. If conducting an in-person study is the only way to go, though, the Hawthorne Effect can be better neutralized by first benchmarking the task.

For example, let’s imagine we’re testing our new website selling shoes. We want to examine the findability of products but we don’t have any definitive analytics. We do, however, have stats for how long it takes to complete registration. Although we’re not particularly interested in registration in this test, we could begin our tests by asking users to register. Then we could compare the time it takes users to register in person with the time measured via analytics. The comparison gives us an estimate of how much being observed changes performance, which can then be used to adjust the rest of the research findings.
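
As a rough sketch of that comparison, the adjustment could be a simple ratio of mean completion times. This is an assumption made for illustration: no particular formula is prescribed here, and the function names and timings below are hypothetical.

```python
from statistics import mean


def observation_factor(in_person_times, analytics_times):
    """Ratio of mean in-person time to mean unobserved (analytics) time.

    A value below 1.0 suggests participants work faster when watched.
    """
    return mean(in_person_times) / mean(analytics_times)


def adjust(observed_time, factor):
    """Estimate the unobserved time for a task that was measured in person."""
    return observed_time / factor


# Hypothetical registration timings, in seconds.
in_person = [48, 52, 50]
analytics = [60, 65, 55]

factor = observation_factor(in_person, analytics)

# Rescale a product search that took 40 seconds under observation.
estimated = adjust(40, factor)
print(f"factor={factor:.2f}, estimated unobserved time={estimated:.1f}s")
```

A single ratio like this assumes that observation affects every task to the same degree, which is unlikely to hold exactly, so it is best treated as a sanity check rather than a correction.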

Demand characteristics

The Hawthorne Effect is just one of a variety of “demand characteristics.” This term refers to the tendency for participants to form an interpretation of the desired outcome and try to conform to or rebel against that vision.

Weber and Cook (1972) identified four personas to capture the tendencies of test subjects:

The Good Subject – keen to give the experimenter the data they’re after
The Faithful Subject – honest despite anticipating the expected result
The Negativistic Subject – determined to give contrary data
The Apprehensive Subject – awkward and keen to give the socially desired response

Let’s briefly discuss how these personas can affect your research.

Screw you!

“Screw you buddy!” Some participants will intentionally sabotage test results.

Persona #3 exhibits the “Screw you effect,” which can be observed when inviting members of a team to participate in the test of a competing product. The solution to this problem lies in figuring out how many people you need to include in order to push these side effects into insignificance.

Jeff Sauro provides a Sample Size Calculator for calculating the number of subjects required to produce data that is not just significant but can be extrapolated with confidence. Although this tool is not designed specifically for overcoming demand characteristics, considering the recommendations should give you a better idea of a suitable test sample.
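
To give a feel for the kind of arithmetic such a calculator performs, here is a minimal sketch using the textbook normal-approximation formula for estimating a proportion (for example, a task completion rate). It is not the method behind Sauro's tool, and the function name and defaults are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p=0.5):
    """Participants needed to estimate a proportion within +/- margin.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2.
    p = 0.5 is the most conservative guess when the true rate is unknown.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Participants needed for a +/- 10 point margin at 95% confidence.
print(sample_size_for_proportion(margin=0.10))
```

Tighter margins grow the requirement quickly: halving the margin roughly quadruples the number of participants, which is worth knowing before committing to a recruitment budget.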

A “double-blind” approach (keeping both proctors and participants unaware of the true purpose of the test, if necessary with a cover story) can also dissuade subjects from guessing the desired outcome. In the example of the online footwear store, users could be told that the site is being assessed for load performance. This would suggest that the tasks were truly designed to test the site, not the users themselves.

Mike B Fisher’s article “Meet the Respondents: Understanding User Personalities (Part 1)” offers more insight into the range of personality types you may encounter, and Part 2 of the article suggests further ways to neutralize the threat.

Attitude polarization

Given the opportunity, test subjects tend to adopt herd mentality.

Solomon Asch’s conformity experiments of the 1950s reveal just how powerful the responses of other group members are during a test. Asch used deception, placing a genuine test subject within a group of actors playing the parts of other subjects and asking the group to respond to simple visual perception tasks. The stooges would unanimously give the incorrect answer, often perplexing the genuine subject and encouraging them to conform.

The tendency to conformity in our society is so strong that reasonably intelligent and well-meaning young people are willing to call white black.

Solomon Asch

The herd mentality is a strong human trait. Conducting user testing sessions with large groups in a shared room makes it very tempting for subjects to mimic the feedback of those around them. The HiPPO effect (Highest Paid Person’s Opinion) can further entice people to agree with their seniors’ comments.

Ideally, test sessions should be run one participant at a time, back to back, assuming this is consistent with end-user environmental factors. Where testing a large group simultaneously is unavoidable, randomizing task order can help alleviate this problem, as it limits the extent to which people will overhear responses to the task they’re working on.

Order effect

Participants will perform better at a specific task with practice.

The order in which tasks are performed can affect the results, as users will often perform better as their familiarity increases. This is called “order effect.”

Our user test of the online shoe store is susceptible to this effect. Take a series of three tasks:

Browse shoes
Browse special offers
Search for accessories

We may observe users having difficulty with local navigation on task 1 but no issues with similar navigation in tasks 2 and 3. In this simple example, it would be naive to think there was an issue with local navigation only when browsing shoes. In some cases this is less obvious.

The solution is to randomize the order in which selected tasks are presented. In preparation for a face-to-face user test this can be easily done by printing each task on a separate piece of paper and manually shuffling. Of course, tasks are often contextual so careful consideration needs to be given to which tasks can be shuffled.

Some tools have this feature built in. For example, SurveyMonkey, which can be used for unmoderated testing, includes an option to randomize questions automatically.
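
If you are preparing task sheets yourself, the shuffling can also be scripted. The following is a minimal sketch, not a feature of any particular tool: the function name, the `participant_id` seed and the `fixed_first` parameter are all hypothetical, the last being one way to keep context-dependent tasks in place.

```python
import random


def task_order(tasks, participant_id, fixed_first=()):
    """Return a reproducible shuffled task order for one participant.

    Tasks listed in fixed_first stay at the start, in the order given,
    for cases where later tasks depend on them (e.g. registration).
    """
    # Seeding with the participant ID lets the same order be regenerated later.
    rng = random.Random(participant_id)
    shuffled = [t for t in tasks if t not in fixed_first]
    rng.shuffle(shuffled)
    return list(fixed_first) + shuffled


tasks = ["Browse shoes", "Browse special offers", "Search for accessories"]
for pid in range(1, 4):
    print(pid, task_order(tasks, pid))
```

Recording the order each participant received alongside their results makes it possible to check afterwards whether performance on a task depended on where it fell in the session.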

Funding Bias

Behind every research experiment lurks a project sponsor, and their allegiance may lie closer to financial considerations than to objective interpretation of findings.

Turner & Spilich (2006) reported that research studies funded by the tobacco industry were more likely to conclude that smoking improved cognitive performance than those undertaken by independent researchers.

No matter how a user test is conducted the findings are open for interpretation by people who are rarely objective.

UX consultants may be inclined to find more major issues in order to justify their day rate and to secure further work. Product managers may judge issues to be minor if they’re standing in the way of a much-awaited deployment. It is up to the recipients of user research to challenge conclusions and make researchers earn their keep.

Conclusions

While a great deal of research has gone into identifying these issues and more, you don’t need a degree in psychology to understand how they can apply to user research.

Here’s a checklist to keep you on track:

Before you start

Benchmark. Take advantage of any data you already have access to. If you can benchmark a task you can assess how much influence the various forms of bias have on your results.
Choose wisely. If you’re testing with people you know, try to select a range of personalities. Remember, the insights provided by “Screw you” types are just as valuable as those given by the “Good” subjects.
Calculate. If you don’t know what you’re letting yourself in for, give some thought to what sample size will be sufficient to push the confounding variable of personality type into insignificance.
Shuffle the deck. Order effect is easy to guard against, so learn to recognize it in advance.
Test the app. Invite your participants to anything but a “user test.” It’s a misleading term so avoid it.

During the session

Know yourself. What would you like the outcome to be? Designers hate finding out how ugly their babies are but the warm fuzzy feeling of concluding you were right when the results suggest otherwise only lasts a moment. Bad design lasts a lot longer.
Know your users. Even if you don’t actually know them, expect them to be awkward, too eager to please, confused, ecstatic, nervous and so on. Prepare a few rehearsed phrases to keep things moving and be ready to improvise.

When you’re done

Share. It’s important for the entire team to understand how users experience their product.
Learn. Highlight the findings that validate your design decisions but make sure to acknowledge areas for improvement.
Iterate. You’re ready to improve your design. Remember, the data is just a tool. You don’t have to follow users’ suggestions blindly. You’re still the designer.

By taking the time to implement sound experimental design we can improve the quality of data returned by research, show the wider community that UX is a science as much as an art, and keep designing killer experiences. Here’s to better user tests…or whatever you choose to call them!