What Happens to Task Ratings When You Interrupt Users?

Jeff Sauro, PhD

In usability testing we ask users to complete tasks and often ask them to rate how difficult or easy the task was. Does it matter when you ask this question?

What happens if we interrupt users during the task instead of asking it after the task experience is over?

Almost ten years ago, researchers at Intel asked 28 users to complete various tasks on websites such as Yahoo.com, WeatherChannel.com, Delta.com, and Hertz.com.

Half the users were interrupted after 30 and 120 seconds and asked to rate the task ease on a 7-point scale (concurrent ratings). The other half attempted the same tasks and answered the same question, but only after the task was completed (post-task ratings).

They found users rated tasks 26% higher after the task than during it. The concurrent group’s post-task ratings were also lower than those of the post-task-only group, suggesting that the act of rating concurrently itself lowers post-task ratings.

5-Second Tests Using the SUS

In earlier research, I found that when users are interrupted after only five seconds, their System Usability Scale (SUS) scores are the same as those of users who had no time limit. However, users who were interrupted after 60 seconds gave higher SUS ratings than both the 5-second and no-time-limit groups.

These results suggested that interruption either had no effect or actually increased ratings on post-test questionnaires.
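For readers less familiar with SUS scoring, responses to the ten items are converted to a score from 0 to 100. Here is a minimal Python sketch of that standard scoring procedure; the responses shown are hypothetical, not data from any of these studies:

```python
# Minimal sketch of standard SUS scoring (0-100 scale).
def sus_score(responses):
    """Convert ten 1-5 SUS item responses into a single 0-100 score."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for i, r in enumerate(responses, start=1):
        # Odd items are positively worded (score = response - 1);
        # even items are negatively worded (score = 5 - response).
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# Hypothetical responses on the usual 1-5 agreement scale.
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # -> 85.0
```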

Task-Level Ratings Are Different from Overall Impressions

One possibility for the difference is simply that users respond to task-level questions differently than they do to questions about overall website usability. To investigate further, I recruited 66 users and interrupted them during website task attempts.

Tasks Interrupted at 5 and 60 Seconds

I randomly assigned users to one of two retail websites (Michaels.com and ContainerStore.com) and one of three interruption conditions: 5 seconds, 60 seconds, or no interruption. All users were asked to answer the 7-point Single Ease Question (SEQ) after attempting to locate an item on the store’s website.
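As an aside, this kind of between-subjects setup can be sketched in a few lines of Python. The websites and conditions below come from the study, but the balanced-allocation procedure is just an illustrative assumption, not a description of how participants were actually assigned:

```python
import itertools
import random

websites = ["Michaels.com", "ContainerStore.com"]
conditions = ["5 seconds", "60 seconds", "no interruption"]

# All 6 website x condition cells, repeated to cover 66 participants
# (11 per cell), then shuffled so order is random but group sizes stay balanced.
cells = list(itertools.product(websites, conditions)) * 11
random.shuffle(cells)

participants = [f"P{i:02d}" for i in range(1, 67)]
assignments = dict(zip(participants, cells))

print(assignments["P01"])  # e.g. ('ContainerStore.com', '60 seconds')
```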

The graph below shows that the results concur with the earlier research. Users who were interrupted at 5 and 60 seconds rated the task as more difficult than users who had no time limit [F(2, 65) = 3.45, p < .04].


Figure 1: Task-ease ratings were lower when users rated the task after only 5 or 60 seconds than when they had unlimited time to complete it

The full-time group’s average rating was 50% higher than the 5-second group’s and 26% higher than the 60-second group’s. In the earlier research conducted at Intel, the 30- and 120-second groups were combined, so the 60-second group is likely the best comparison; both showed a 26% difference.
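To make the comparison concrete, here is a minimal sketch of a one-way ANOVA and the percentage computation, using made-up SEQ ratings rather than the study’s raw data:

```python
from scipy import stats

# Hypothetical 7-point SEQ ratings for the three conditions (illustrative only).
seq_5s   = [4, 3, 5, 4, 4, 3, 5]
seq_60s  = [5, 4, 5, 5, 4, 5, 5]
seq_full = [6, 6, 5, 7, 6, 6, 6]

# One-way ANOVA across the three interruption conditions.
f_stat, p_value = stats.f_oneway(seq_5s, seq_60s, seq_full)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# Percentage difference of the no-interruption mean relative to each interrupted group.
mean = lambda xs: sum(xs) / len(xs)
for label, group in [("5-second", seq_5s), ("60-second", seq_60s)]:
    diff = (mean(seq_full) - mean(group)) / mean(group) * 100
    print(f"Full-time mean is {diff:.0f}% higher than the {label} group")
```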

Users also completed the SUS, and the differences between groups were not statistically significant (p > .10).

With this sample size there is a clear difference between the 5-second and full-time groups; however, the 60-second group is only marginally different from each adjacent group. There also appears to be a linear pattern: ratings improve with more time. A larger sample size would be needed to confirm this pattern.
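The pairwise checks behind that statement can be sketched the same way, again with hypothetical ratings standing in for the real data:

```python
from scipy import stats

# Hypothetical SEQ ratings; compare adjacent conditions with two-sample t-tests.
seq_5s   = [4, 3, 5, 4, 4, 3, 5]
seq_60s  = [5, 4, 5, 5, 4, 5, 5]
seq_full = [6, 6, 5, 7, 6, 6, 6]

pairs = [("5-second", seq_5s, "60-second", seq_60s),
         ("60-second", seq_60s, "full-time", seq_full)]

for name_a, a, name_b, b in pairs:
    t, p = stats.ttest_ind(a, b)
    print(f"{name_a} vs {name_b}: t = {t:.2f}, p = {p:.3f}")
```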

When Should Task Ratings Be Taken?

Are users artificially rating tasks as easier at the end of the task, or are they artificially rating them as harder during the task? It’s hard to say, and I can think of compelling explanations for both theories.

Give me more time!: Interrupted users might rate tasks as more difficult because they didn’t feel they had enough time to complete them.

Oh, that task was easy because I completed it: Post-task ratings might be inflated because users give less weight to the negative aspects of the task and are influenced by a sense of accomplishment.

 

Conclusions

A few things to consider about interrupting users to obtain concurrent ratings:

  • Interrupted users will rate tasks as more difficult, but they don’t necessarily consider the overall website more difficult.
  • Concurrent ratings will add time to the test session and to task time (around 15% more).
  • Don’t mix ratings: compare post-task ratings to other post-task ratings and concurrent ratings to other concurrent ratings.
  • Concurrent ratings might be better than post-task ratings for diagnosing interaction problems (much like eye-tracking), since they let you identify problem points more precisely than a single task-level rating.

 
