EVALUATION PARADIGMS AND TECHNIQUES

Human Computer Interaction (CS408)

Lecture 29. Evaluation Part I

Learning Goals

The aim of this lecture is to introduce you to the study of Human Computer Interaction,

so that after studying this you will be able to:

· Understand what evaluation is in the development process

· Understand different evaluation paradigms and techniques

What to evaluate?

There is a huge variety of interactive products with a vast array of features that need

to be evaluated. Some features, such as the sequence of links to be followed to find an

item on a website, are often best evaluated in a laboratory, since such a setting allows

the evaluators to control what they want to investigate. Other aspects, such as whether

a collaborative toy is robust and whether children enjoy interacting with it, are better

evaluated in natural settings, so that evaluators can see what children do when left to

their own devices.

John Gould and his colleagues (Gould et al., 1990; Gould and Lewis, 1985)

recommended three principles for developing the 1984 Olympic Message System (OMS):

· Focus on users and their tasks

· Observe, measure, and analyze their performance with the system

· Design iteratively

Since the OMS study, a number of new evaluation techniques have been developed.

There has also been a growing trend towards observing how people interact with the

system in their work, home, and other settings, the goal being to obtain a better

understanding of how the product is (or will be) used in its intended setting. For

example, at work people are frequently being interrupted by phone calls, others

knocking at their door, email arriving, and so on--to the extent that many tasks are

interrupt-driven. Only rarely does someone carry a task out from beginning to end

without stopping to do something else. Hence the way people carry out an activity

(e.g., preparing a report) in the real world is very different from how it may be

observed in a laboratory. Furthermore, this observation has implications for the way

products should be designed.

Why do you need to evaluate?

Just as designers shouldn't assume that everyone is like them, they also shouldn't

presume that following design guidelines guarantees good usability. Evaluation is

needed to check that users can use the product and like it. Furthermore, nowadays

users look for much more than just a usable system, as the Nielsen Norman Group, a

usability consultancy company, points out (www.nngroup.com):

"User experience" encompasses all aspects of the end-user's interaction ...

the first requirement for an exemplary user experience is to meet the exact

needs of the customer, without fuss or bother. Next comes simplicity and

elegance that produce products that are a joy to own, a joy to use."

Bruce Tognazzini, another successful usability consultant, comments (www.asktog.com) that:

"Iterative design, with its repeating cycle of design and testing, is the

only validated methodology in existence that will consistently produce

successful results. If you don't have user-testing as an integral part of

your design process you are going to throw buckets of money down the

drain."

Tognazzini points out that there are five good reasons for investing in user

testing:

1. Problems are fixed before the product is shipped, not after.

2. The team can concentrate on real problems, not imaginary ones.

3. Engineers code instead of debating.

4. Time to market is sharply reduced.

5. Finally, upon first release, your sales department has a rock-solid design it can sell

without having to pepper their pitches with how it will all actually work in release 1.1

or 2.0.

Now that there is a diversity of interactive products, it is not surprising that the range

of features to be evaluated is very broad. For example, developers of a new web

browser may want to know if users find items faster with their product. Government

authorities may ask if a computerized system for controlling traffic lights results in

fewer accidents. Makers of a toy may ask if six-year-olds can manipulate the controls

and whether they are engaged by its furry case and pixie face. A company that

develops the casing for cell phones may ask if the shape, size, and color of the case are

appealing to teenagers. A new dotcom company may want to assess market reaction

to its new home page design.

This diversity of interactive products, coupled with new user expectations, poses

interesting challenges for evaluators, who, armed with many well tried and tested

techniques, must now adapt them and develop new ones. As well as usability, user

experience goals can be extremely important for a product's success.

When to evaluate?

The product being developed may be a brand-new product or an upgrade of an existing

product. If the product is new, then considerable time is usually invested in market

research. Designers often support this process by developing mockups of the potential

product that are used to elicit reactions from potential users. As well as helping to

assess market need, this activity contributes to understanding users' needs and early

requirements. As we said in an earlier lecture, sketches, screen mockups, and other

low-fidelity prototyping techniques are used to represent design ideas. Many of these same

techniques are used to elicit users' opinions in evaluation (e.g., questionnaires and

interviews), but the purpose and focus of evaluation are different. The goal of evaluation

is to assess how well a design fulfills users' needs and whether users like it.

In the case of an upgrade, there is limited scope for change and attention is focused on

improving the overall product. This type of design is well suited to usability

engineering in which evaluations compare user performance and attitudes with those

for previous versions. Some products, such as office systems, go through many

versions, and successful products may reach double-digit version numbers. In

contrast, new products do not have previous versions and there may be nothing

comparable on the market, so more radical changes are possible if evaluation results

indicate a problem.

Evaluations done during design to check that the product continues to meet users'

needs are known as formative evaluations. Evaluations that are done to assess the

success of a finished product, such as those to satisfy a sponsoring agency or to check

that a standard is being upheld, are known as summative evaluations. Agencies such as

the National Institute of Standards and Technology (NIST) in the USA, the International

Organization for Standardization (ISO), and the British Standards Institution (BSI) set standards

by which products produced by others are evaluated.

29.1 Evaluation paradigms and techniques

Before we describe the techniques used in evaluation studies, we shall start by

proposing some key terms. Terminology in this field tends to be loose and often

confusing so it is a good idea to be clear from the start what you mean. We start with

the much-used term user studies, defined by Abigail Sellen in her interview as

follows: "user studies essentially involve looking at how people behave either in their

natural [environments], or in the laboratory, both with old technologies and with new

ones." Any kind of evaluation, whether it is a user study or not, is guided either

explicitly or implicitly by a set of beliefs that may also be underpinned by theory.

These beliefs and the practices (i.e., the methods or techniques) associated with them

are known as an evaluation paradigm, which you should not confuse with the

interaction paradigms. Often evaluation paradigms are related to a particular

discipline in that they strongly influence how people from the discipline think about

evaluation. Each paradigm has particular methods and techniques associated with it.

So that you are not confused, we want to state explicitly that we will not be

distinguishing between methods and techniques. We tend to talk about techniques, but

you may find that others call them methods. An example of the relationship

between a paradigm and the techniques used by evaluators following that paradigm

can be seen for usability testing, which is an applied science and engineering

paradigm. The techniques associated with usability testing are: user testing in a

controlled environment; observation of user activity in the controlled environment and

the field; and questionnaires and interviews.

Evaluation paradigms

In this lecture we identify four core evaluation paradigms: (1) "quick and dirty" evaluations;

(2) usability testing; (3) field studies; and (4) predictive evaluation. Other

people may use slightly different terms to refer to similar paradigms.

"Quick and dirty" evaluation

A "quick and dirty" evaluation is a common practice in which designers informally

get feedback from users or consultants to confirm that their ideas are in line with

users' needs and are liked. "Quick and dirty" evaluations can be done at any stage and

the emphasis is on fast input rather than carefully documented findings. For example,

early in design developers may meet informally with users to get feedback on ideas

for a new product (Hughes et al., 1994). At later stages similar meetings may occur to

try out an idea for an icon, check whether a graphic is liked, or confirm that

information has been appropriately categorized on a webpage. This approach is often

called "quick and dirty" because it is meant to be done in a short space of time.

Getting this kind of feedback is an essential ingredient of successful design.

As discussed in earlier lectures, any involvement with users will be highly informative

and you can learn a lot early in design by observing what people do and talking to

them informally. The data collected is usually descriptive and informal and it is fed

back into the design process as verbal or written notes, sketches and anecdotes, etc.

Another source of feedback is consultants, who use their knowledge of user behavior,

the marketplace, and technical know-how to review software quickly and provide

suggestions for improvement. It is an approach that has become particularly popular

in web design where the emphasis is usually on short timescales.

Usability testing

Usability testing was the dominant approach in the 1980s (Whiteside et al., 1998), and

remains important, although, as you will see, field studies and heuristic evaluations

have grown in prominence. Usability testing involves measuring typical users'

performance on carefully prepared tasks that are typical of those for which the system

was designed. Users' performance is generally measured in terms of number of errors

and time to complete the task. As the users perform these tasks, they are watched and

recorded on video and by logging their interactions with software. This observational

data is used to calculate performance times, identify errors, and help explain why the

users did what they did. User satisfaction questionnaires and interviews are also used

to elicit users' opinions.
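
The two basic measures described above, time to complete a task and number of errors, can be derived from an interaction log. The following is a minimal sketch, assuming a hypothetical log format in which each event carries a timestamp in seconds and a kind such as "task_start", "task_end", "error", or "action". The event names and the sample log are illustrative assumptions, not part of any particular logging tool.

```python
from dataclasses import dataclass


@dataclass
class LogEvent:
    timestamp: float  # seconds since the session started (assumed unit)
    kind: str         # "task_start", "task_end", "error", or "action" (hypothetical names)


def summarize_task(events):
    """Return (completion_time_in_seconds, error_count) for one logged task."""
    start = next(e.timestamp for e in events if e.kind == "task_start")
    end = next(e.timestamp for e in events if e.kind == "task_end")
    errors = sum(1 for e in events if e.kind == "error")
    return end - start, errors


# Hypothetical log for one participant performing one task.
log = [
    LogEvent(0.0, "task_start"),
    LogEvent(4.5, "action"),
    LogEvent(9.0, "error"),
    LogEvent(15.5, "action"),
    LogEvent(21.0, "task_end"),
]

time_taken, error_count = summarize_task(log)
print(f"time: {time_taken:.1f} s, errors: {error_count}")
```

In practice the log would be combined with video and notes, since the numbers alone do not explain why the users did what they did.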

The defining characteristic of usability testing is that it is strongly controlled by the

evaluator (Mayhew, 1999). There is no mistaking that the evaluator is in charge!

Typically tests take place in laboratory-like conditions that are controlled. Casual

visitors are not allowed and telephone calls are stopped, and there is no possibility of

talking to colleagues, checking email, or doing any of the other tasks that most of us

rapidly switch among in our normal lives. Everything that the participant does is

recorded--every key press, comment, pause, expression, etc.--so that it can be used as

data.

Quantifying users' performance is a dominant theme in usability testing. However,

unlike research experiments, variables are not manipulated and the typical number of

participants is too small for much statistical analysis. User satisfaction data from

questionnaires tends to be categorized and average ratings are presented. Sometimes

video or anecdotal evidence is also included to illustrate problems that users

encounter. Some evaluators then summarize this data in a usability specification so

that developers can use it to test future prototypes or versions of the product against it.

Optimal performance levels and minimal levels of acceptance are often specified and

current levels noted. Changes in the design can then be agreed and engineered--hence

the term "usability engineering."
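
A usability specification of this kind can be pictured as a small table of measures, each with a minimal acceptable level, an optimal level, and the current level. The sketch below is one possible representation, not a standard format. The measure names, the threshold values, and the three-way rating are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpecItem:
    """One row of a hypothetical usability specification."""
    measure: str
    minimal: float                # worst level that is still acceptable
    optimal: float                # level the team is aiming for
    lower_is_better: bool = True  # true for times and error counts


def rate(item: SpecItem, current: float) -> str:
    """Classify a current level against the specified levels."""
    minimal, optimal = item.minimal, item.optimal
    if not item.lower_is_better:
        # Flip signs so the comparisons below also work when higher is better.
        current, minimal, optimal = -current, -minimal, -optimal
    if current <= optimal:
        return "optimal"
    if current <= minimal:
        return "acceptable"
    return "below minimum"


# Illustrative values only.
spec = [
    SpecItem("time to complete task (s)", minimal=120, optimal=60),
    SpecItem("errors per task", minimal=3, optimal=0),
    SpecItem("satisfaction rating (1-5)", minimal=3, optimal=4.5, lower_is_better=False),
]
current_levels = [95, 4, 4.6]

for item, current in zip(spec, current_levels):
    print(f"{item.measure}: {current} -> {rate(item, current)}")
```

Re-running the same check against each new prototype shows whether an agreed design change moved the current levels toward the specified ones.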

Field studies

The distinguishing feature of field studies is that they are done in natural settings with

the aim of increasing understanding about what users do naturally and how

technology impacts them. In product design, field studies can be used to (1) help

identify opportunities for new technology; (2) determine requirements for design; (3)

facilitate the introduction of technology; and (4) evaluate technology (Bly, 1997).

We introduced qualitative techniques such as interviews, observation, participant

observation, and ethnography that are used in field studies. The exact choice of

techniques is often influenced by the theory used to analyze the data. The data takes

the form of events and conversations that are recorded as notes, or by audio or video

recording, and later analyzed using a variety of analysis techniques such as content,

discourse, and conversational analysis. These techniques vary considerably. In content

analysis, for example, the data is analyzed into content categories, whereas in

discourse analysis the use of words and phrases is examined. Artifacts are also

collected. In fact, anything that helps to show what people do in their natural contexts

can be regarded as data.
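
As a rough illustration of content analysis, the sketch below tallies field notes that an evaluator has already assigned to content categories. The notes and the category names are hypothetical. Deciding on the categories and coding each note is the evaluator's judgment, and that step is not automated here.

```python
from collections import Counter

# Hypothetical field notes, each already coded with one content category.
coded_notes = [
    ("Phone rings while user is drafting the report", "interruption"),
    ("User asks a colleague where the print option is", "seeks help"),
    ("Colleague knocks on the door mid-task", "interruption"),
    ("User returns to the report after reading email", "task resumption"),
    ("New email alert pulls user away from the report", "interruption"),
]


def tally_categories(notes):
    """Count how many notes fall into each content category."""
    return Counter(category for _text, category in notes)


counts = tally_categories(coded_notes)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```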

In this lecture we distinguish between two overall approaches to field studies. The

first involves observing explicitly and recording what is happening, as an outsider

looking on. Qualitative techniques are used to collect the data, which may then be

analyzed qualitatively or quantitatively. For example, the number of times a particular

event is observed may be presented in a bar graph with means and standard

deviations.

In some field studies the evaluator may be an insider or even a participant.

Ethnography is a particular type of insider evaluation in which the aim is to explore

the details of what happens in a particular social setting. "In the context of human

computer interaction, ethnography is a means of studying work (or other activities) in

order to inform the design of information systems and understand aspects of their use"

(Shapiro, 1995, p. 8).

Predictive evaluation

In predictive evaluations experts apply their knowledge of typical users, often guided

by heuristics, to predict usability problems. Another approach involves theoretically

based models. The key feature of predictive evaluation is that users need not be present,

which makes the process quick, relatively inexpensive, and thus attractive to

companies; but it has limitations.

In recent years heuristic evaluation, in which experts review the software product

guided by tried and tested heuristics, has become popular (Nielsen and Mack, 1994).

Usability guidelines (e.g., always provide clearly marked exits) were designed

primarily for evaluating screen-based products (e.g., form fill-ins, library catalogs,

etc.). With the advent of a range of new interactive products (e.g., the web, mobiles,

collaborative technologies), this original set of heuristics has been found insufficient.

While some are still applicable (e.g., speak the users' language), others are

inappropriate. New sets of heuristics are also needed that are aimed at evaluating

different classes of interactive products. In particular, specific heuristics are needed

that are tailored to evaluating web-based products, mobile devices, collaborative

technologies, computerized toys, etc. These should be based on a combination of

usability and user experience goals, new research findings and market research. Care

is needed in using sets of heuristics. Designers are sometimes led astray by findings

from heuristic evaluations that turn out not to be as accurate as they at first seemed.
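
One simple way to exercise that care is to merge the problem lists from several experts and see which problems were reported by only one of them. The sketch below assumes each expert's report is a list of (heuristic, problem) pairs. The expert names and problem descriptions are invented for illustration, and treating a single report as a signal to double-check is an assumption, not an established rule.

```python
from collections import defaultdict


def merge_reports(reports):
    """Map each (heuristic, problem) pair to the sorted list of experts who reported it."""
    merged = defaultdict(set)
    for expert, findings in reports.items():
        for heuristic, problem in findings:
            merged[(heuristic, problem)].add(expert)
    return {key: sorted(experts) for key, experts in merged.items()}


# Hypothetical reports from three experts.
reports = {
    "expert_1": [
        ("clearly marked exits", "no cancel button on checkout form"),
        ("speak the users' language", "error text shows internal codes"),
    ],
    "expert_2": [
        ("clearly marked exits", "no cancel button on checkout form"),
    ],
    "expert_3": [
        ("speak the users' language", "menu labels use developer jargon"),
    ],
}

merged = merge_reports(reports)
# Problems seen by a single expert may deserve a second look before acting on them.
single_reports = [key for key, experts in merged.items() if len(experts) == 1]

for (heuristic, problem), experts in merged.items():
    print(f"[{heuristic}] {problem}: reported by {', '.join(experts)}")
```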

The table below summarizes the key aspects of each evaluation paradigm for the following

issues:

· the role of users

· who controls the process and the relationship between evaluators and users during the evaluation

· the location of the evaluation

· when the evaluation is most useful

· the type of data collected and how it is analyzed

· how the evaluation findings are fed back into the design process

· the philosophy and theory that underlie the evaluation paradigms

Evaluation paradigms: "Quick and dirty", Usability testing, Field studies, Predictive

Role of users
· "Quick and dirty": Natural behavior.
· Usability testing: To carry out set tasks.
· Field studies: Natural behavior.
· Predictive: Users generally not involved.

Who controls
· "Quick and dirty": Evaluators take minimum control.
· Usability testing: Evaluators strongly in control.
· Field studies: Evaluators try to develop relationships with users.
· Predictive: Expert evaluators.

Location
· "Quick and dirty": Natural environment or laboratory.
· Usability testing: Laboratory.
· Field studies: Natural environment.
· Predictive: Laboratory-oriented but often happens on customer's premises.

When used
· "Quick and dirty": Any time you want to get feedback about a design quickly. Techniques from other evaluation paradigms can be used, e.g., experts review software.
· Usability testing: With a prototype or product.
· Field studies: Most often used early in design to check that users' needs are being met or to assess problems or design opportunities.
· Predictive: Expert reviews (often done by consultants) with a prototype, but can occur at any time. Models are used to assess specific aspects of a potential design.

Type of data
· "Quick and dirty": Usually qualitative, informal descriptions.
· Usability testing: Quantitative. Sometimes statistically validated. Users' opinions collected by questionnaire or interview.
· Field studies: Qualitative descriptions often accompanied with sketches, scenarios, quotes, other artifacts.
· Predictive: List of problems from expert reviews. Quantitative figures from model, e.g., how long it takes to perform a task using two designs.

Fed back into design by
· "Quick and dirty": Sketches, quotes, descriptive report.
· Usability testing: Report of performance measures, errors, etc. Findings provide a benchmark for future versions.
· Field studies: Descriptions that include quotes, sketches, anecdotes, and sometimes time logs.
· Predictive: Reviewers provide a list of problems, often with suggested solutions. Times calculated from models are given to designers.

Philosophy
· "Quick and dirty": User-centered, highly practical approach.
· Usability testing: Applied approach based on experimentation, i.e., usability engineering.
· Field studies: May be objective observation or ethnographic.
· Predictive: Practical heuristics and practitioner expertise underpin expert reviews. Theory underpins models.

Techniques

There are many evaluation techniques and they can be categorized in various ways,

but in this lecture we will examine techniques for:

· observing users

· asking users their opinions

· asking experts their opinions

· testing users' performance

· modeling users' task performance to predict the efficacy of a user interface

The brief descriptions below offer an overview of each category. Be aware that some

techniques are used in different ways in different evaluation paradigms.

Observing users

Observation techniques help to identify needs leading to new types of products and

help to evaluate prototypes. Notes, audio, video, and interaction logs are well-known

ways of recording observations and each has benefits and drawbacks. Obvious

challenges for evaluators are how to observe without disturbing the people being

observed and how to analyze the data, particularly when large quantities of video data

are collected or when several different types must be integrated to tell the story (e.g.,

notes, pictures, sketches from observers).

Asking users

Asking users what they think of a product--whether it does what they want; whether

they like it; whether the aesthetic design appeals; whether they had problems using it;

whether they want to use it again--is an obvious way of getting feedback. Inter views

and questionnaires are the main techniques for doing this. The questions asked can be

unstructured or tightly structured. They can be asked of a few people or of hundreds.

Interview and questionnaire techniques are also being developed for use with email

and the web.

Asking experts

Software inspections and reviews are long established techniques for evaluating

software code and structure. During the 1980s versions of similar techniques were

developed for evaluating usability. Guided by heuristics, experts step through tasks

role-playing typical users and identify problems. Developers like this approach because

it is usually relatively inexpensive and quick to perform compared with laboratory

and field evaluations that involve users. In addition, experts frequently suggest

solutions to problems.

User testing

Measuring user performance to compare two or more designs has been the bedrock of

usability testing. As we said earlier when discussing usability testing, these tests are

usually conducted in controlled settings and involve typical users performing typical,

well-defined tasks. Data is collected so that performance can be analyzed. Generally

the time taken to complete a task, the number of errors made, and the navigation path

through the product are recorded. Descriptive statistical measures such as means and

standard deviations are commonly used to report the results.
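
The descriptive statistics mentioned above can be computed with Python's standard library. This is a minimal sketch that compares task completion times for two designs. The times are made-up sample data, and with so few participants the figures describe the sample only and do not support strong statistical claims.

```python
from statistics import mean, stdev


def describe(times):
    """Return (mean, sample standard deviation) of task times in seconds."""
    return mean(times), stdev(times)


# Hypothetical completion times in seconds, five participants per design.
design_a = [41.0, 38.5, 45.0, 39.5, 36.0]
design_b = [33.0, 35.5, 30.0, 36.5, 30.0]

for name, times in (("Design A", design_a), ("Design B", design_b)):
    avg, sd = describe(times)
    print(f"{name}: mean = {avg:.1f} s, sd = {sd:.1f} s, n = {len(times)}")
```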

Modeling users' task performance

There have been various attempts to model human-computer interaction so as to

predict the efficiency and problems associated with different designs at an early stage

without building elaborate prototypes. These techniques are successful for systems

with limited functionality such as telephone systems. GOMS and the keystroke model

are the best known techniques.
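
To give a feel for how the keystroke model produces a prediction, the sketch below sums operator times for two alternative designs. The operator values are the commonly cited estimates from Card, Moran, and Newell's keystroke-level model (K for a key or button press by an average skilled typist, P for pointing with a mouse, H for moving the hands between keyboard and mouse, M for mental preparation). The two operator sequences are hypothetical and were written only for this illustration; a real analysis would follow the model's rules for placing M operators.

```python
# Commonly cited keystroke-level model operator times, in seconds.
OPERATOR_SECONDS = {
    "K": 0.20,  # press a key or mouse button (average skilled typist)
    "P": 1.10,  # point with the mouse to a target on screen
    "H": 0.40,  # move hands between keyboard and mouse
    "M": 1.35,  # mentally prepare for the next step
}


def predict_time(operators):
    """Predict task time as the sum of the operator times in the sequence."""
    return sum(OPERATOR_SECONDS[op] for op in operators)


# Hypothetical sequences for deleting a file in two designs.
mouse_design = ["H", "M", "P", "K", "M", "P", "K"]  # reach mouse, select file, click delete
shortcut_design = ["M", "K", "K"]                   # think, then a two-key shortcut

print(f"mouse design:    {predict_time(mouse_design):.2f} s")
print(f"shortcut design: {predict_time(shortcut_design):.2f} s")
```

The prediction covers only error-free expert performance of a fixed method, which is one reason such models suit systems with limited functionality.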
