
EVALUATION PARADIGMS AND TECHNIQUES

Human Computer Interaction (CS408)
Lecture 29. Evaluation - Part I
Learning Goals
The aim of this lecture is to introduce you to the study of Human Computer Interaction,
so that after studying this you will be able to:
·  Understand what evaluation is in the development process
·  Understand different evaluation paradigms and techniques
What to evaluate?
There is a huge variety of interactive products with a vast array of features that need
to be evaluated. Some features, such as the sequence of links to be followed to find an
item on a website, are often best evaluated in a laboratory, since such a setting allows
the evaluators to control what they want to investigate. Other aspects, such as whether
a collaborative toy is robust and whether children enjoy interacting with it, are better
evaluated in natural settings, so that evaluators can see what children do when left to
their own devices.
John Gould and his colleagues (Gould et al., 1990; Gould and Lewis, 1985)
recommended three principles for developing the 1984 Olympic Message System:
·  Focus on users and their tasks
·  Observe, measure, and analyze their performance with the system
·  Design iteratively
Since the OMS study, a number of new evaluation techniques have been developed.
There has also been a growing trend towards observing how people interact with the
system in their work, home, and other settings, the goal being to obtain a better
understanding of how the product is (or will be) used in its intended setting. For
example, at work people are frequently being interrupted by phone calls, others
knocking at their door, email arriving, and so on--to the extent that many tasks are
interrupt-driven. Only rarely does someone carry a task out from beginning to end
without stopping to do something else. Hence the way people carry out an activity
(e.g., preparing a report) in the real world is very different from how it may be
observed in a laboratory. Furthermore, this observation has implications for the way
products should be designed.
Why do you need to evaluate?
Just as designers shouldn't assume that everyone is like them, they also shouldn't
presume that following design guidelines guarantees good usability. Evaluation is
needed to check that users can use the product and like it. Furthermore, nowadays
users look for much more than just a usable system, as the Nielsen Norman Group, a
usability consultancy company, point out (www.nngroup.com):
"User experience" encompasses all aspects of the end-user's interaction ...
the first requirement for an exemplary user experience is to meet the exact
needs of the customer, without fuss or bother. Next comes simplicity and
elegance that produce products that are a joy to own, a joy to use."
Bruce Tognazzini, another successful usability consultant, comments (www.asktog.com) that:
"Iterative design, with its repeating cycle of design and testing, is the
only validated methodology in existence that will consistently produce
successful results. If you don't have user-testing as an integral part of
your design process you are going to throw buckets of money down the
drain."
Tognazzini points out that there are five good reasons for investing in user
testing:
1. Problems are fixed before the product is shipped, not after.
2. The team can concentrate on real problems, not imaginary ones.
3. Engineers code instead of debating.
4. Time to market is sharply reduced.
5. Finally, upon first release, your sales department has a rock-solid design it can sell
without having to pepper their pitches with how it will all actually work in release 1.1
or 2.0.
Now that there is a diversity of interactive products, it is not surprising that the range
of features to be evaluated is very broad. For example, developers of a new web
browser may want to know if users find items faster with their product. Government
authorities may ask if a computerized system for controlling traffic lights results in
fewer accidents. Makers of a toy may ask if six-year-olds can manipulate the controls
and whether they are engaged by its furry case and pixie face. A company that
develops the casing for cell phones may ask if the shape, size, and color of the case is
appealing to teenagers. A new dotcom company may want to assess market reaction
to its new home page design.
This diversity of interactive products, coupled with new user expectations, poses
interesting challenges for evaluators, who, armed with many well tried and tested
techniques, must now adapt them and develop new ones. As well as usability, user
experience goals can be extremely important for a product's success.
When to evaluate?
The product being developed may be a brand-new product or an upgrade of an exist-
ing product. If the product is new, then considerable time is usually invested in market
research. Designers often support this process by developing mockups of the potential
product that are used to elicit reactions from potential users. As well as helping to
assess market need, this activity contributes to understanding users' needs and early
requirements. As we said in an earlier lecture, sketches, screen mockups, and other low-
fidelity prototyping techniques are used to represent design ideas. Many of these same
techniques are used to elicit users' opinions in evaluation (e.g., questionnaires and
interviews), but the purpose and focus of evaluation are different. The goal of eval-
uation is to assess how well a design fulfills users' needs and whether users like it.
In the case of an upgrade, there is limited scope for change and attention is focused on
improving the overall product. This type of design is well suited to usability
engineering in which evaluations compare user performance and attitudes with those
for previous versions. Some products, such as office systems, go through many
versions, and successful products may reach double-digit version numbers. In
contrast, new products do not have previous versions and there may be nothing
comparable on the market, so more radical changes are possible if evaluation results
indicate a problem.
Evaluations done during design to check that the product continues to meet users'
needs are known as formative evaluations. Evaluations that are done to assess the
success of a finished product, such as those to satisfy a sponsoring agency or to check
that a standard is being upheld, are known as summative evaluations. Agencies such as
the National Institute of Standards and Technology (NIST) in the USA, the International
Standards Organization (ISO) and the British Standards Institute (BSI) set standards
by which products produced by others are evaluated.
29.1 Evaluation paradigms and techniques
Before we describe the techniques used in evaluation studies, we shall start by
proposing some key terms. Terminology in this field tends to be loose and often
confusing so it is a good idea to be clear from the start what you mean. We start with
the much-used term user studies, defined by Abigail Sellen in her interview as
follows: "user studies essentially involve looking at how people behave either in their
natural [environments], or in the laboratory, both with old technologies and with new
ones." Any kind of evaluation, whether it is a user study or not, is guided either
explicitly or implicitly by a set of beliefs that may also be underpinned by theory.
These beliefs and the practices (i.e., the methods or techniques) associated with them
are known as an evaluation paradigm, which you should not confuse with the
"interaction paradigms. Often evaluation paradigms are related to a particular
discipline in that they strongly influence how people from the discipline think about
evaluation. Each paradigm has particular methods and techniques associated with it.
So that you are not confused, we want to state explicitly that we will not be
distinguishing between methods and techniques. We tend to talk about techniques, but
you may find that others call them methods. An example of the relationship
between a paradigm and the techniques used by evaluators following that paradigm
can be seen for usability testing, which is an applied science and engineering
paradigm. The techniques associated with usability testing are: user testing in a
controlled environment; observation of user activity in the controlled environment and
the field; and questionnaires and interviews.
Evaluation paradigms
In this lecture we identify four core evaluation paradigms: (1) "quick and dirty" eval-
uations; (2) usability testing; (3) field studies; and (4) predictive evaluation. Other
people may use slightly different terms to refer to similar paradigms.
"Quick and dirty" evaluation
A "quick and dirty" evaluation is a common practice in which designers informally
get feedback from users or consultants to confirm that their ideas are in line with
users" needs and are liked. "Quick and dirty" evaluations can be done at any stage and
the emphasis is on fast input rather than carefully documented findings. For example,
early in design developers may meet informally with users to get feedback on ideas
for a new product (Hughes et al., 1994). At later stages similar meetings may occur to
try out an idea for an icon, check whether a graphic is liked, or confirm that
information has been appropriately categorized on a webpage. This approach is often
called "quick and dirty" because it is meant to be done in a short space of time.
Getting this kind of feedback is an essential ingredient of successful design.
As discussed in earlier lectures, any involvement with users will be highly informa-
tive and you can learn a lot early in design by observing what people do and talking to
them informally. The data collected is usually descriptive and informal and it is fed
back into the design process as verbal or written notes, sketches and anecdotes, etc.
Another source comes from consultants, who use their knowledge of user behavior,
the market place and technical know-how, to review software quickly and provide
suggestions for improvement. It is an approach that has become particularly popular
in web design where the emphasis is usually on short timescales.
Usability testing
Usability testing was the dominant approach in the 1980s (Whiteside et al., 1988), and
remains important, although, as you will see, field studies and heuristic evaluations
have grown in prominence. Usability testing involves measuring typical users'
performance on carefully prepared tasks that are typical of those for which the system
was designed. Users' performance is generally measured in terms of number of errors
and time to complete the task. As the users perform these tasks, they are watched and
recorded on video and by logging their interactions with software. This observational
data is used to calculate performance times, identify errors, and help explain why the
users did what they did. User satisfaction questionnaires and interviews are also used
to elicit users' opinions.
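
To make this concrete, the short Python sketch below reduces a hypothetical interaction log to the two measures just mentioned, time on task and error count. The log format, task name, and numbers are invented for illustration; they do not come from any particular logging tool.

```python
# A minimal sketch (invented data): each log entry records
# (participant, task, seconds taken, number of errors).
from statistics import mean

log = [
    ("P1", "find-item", 74.2, 1),
    ("P2", "find-item", 91.0, 3),
    ("P3", "find-item", 66.5, 0),
]

times = [seconds for _, _, seconds, _ in log]
errors = [errs for _, _, _, errs in log]

print(f"Mean time on task: {mean(times):.1f} s")
print(f"Mean error count : {mean(errors):.1f}")
```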
The defining characteristic of usability testing is that it is strongly controlled by the
evaluator (Mayhew, 1999). There is no mistaking that the evaluator is in charge!
Typically tests take place in laboratory-like conditions that are controlled. Casual
visitors are not allowed and telephone calls are stopped, and there is no possibility of
talking to colleagues, checking email, or doing any of the other tasks that most of us
rapidly switch among in our normal lives. Everything that the participant does is
recorded--every key press, comment, pause, expression, etc., so that it can be used as
data.
Quantifying users' performance is a dominant theme in usability testing. However,
unlike research experiments, variables are not manipulated and the typical number of
participants is too small for much statistical analysis. User satisfaction data from
questionnaires tends to be categorized and average ratings are presented. Sometimes
video or anecdotal evidence is also included to illustrate problems that users
encounter. Some evaluators then summarize this data in a usability specification so
that developers can use it to test future prototypes or versions of the product against it.
Optimal performance levels and minimal levels of acceptance are often specified and
current levels noted. Changes in the design can then be agreed and engineered--hence
the term "usability engineering."
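
As an illustration of such a usability specification, the sketch below records an optimal performance level and a minimal level of acceptance for one attribute and checks a measured result against them. The attribute name and all values are invented; a real specification would be agreed with the development team.

```python
# A minimal sketch (invented values) of checking a measured result against
# a usability specification entry.
spec = {
    "attribute": "time to place an order",
    "worst_acceptable_seconds": 120,  # minimal level of acceptance
    "target_seconds": 60,             # optimal performance level
}

measured_seconds = 95  # e.g., mean time observed with the latest prototype

if measured_seconds > spec["worst_acceptable_seconds"]:
    verdict = "fails the specification"
elif measured_seconds > spec["target_seconds"]:
    verdict = "acceptable, but short of the target"
else:
    verdict = "meets or beats the target"

print(f'{spec["attribute"]}: {measured_seconds} s -> {verdict}')
```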
Field studies
The distinguishing feature of field studies is that they are done in natural settings with
the aim of increasing understanding about what users do naturally and how
technology impacts them. In product design, field studies can be used to (1) help
identify opportunities for new technology; (2) determine requirements for design; (3)
facilitate the introduction of technology; and (4) evaluate technology (Bly, 1997).
We introduced qualitative techniques such as interviews, observation, participant
observation, and ethnography that are used in field studies. The exact choice of
techniques is often influenced by the theory used to analyze the data. The data takes
the form of events and conversations that are recorded as notes, or by audio or video
recording, and later analyzed using a variety of analysis techniques such as content,
discourse, and conversational analysis. These techniques vary considerably. In content
analysis, for example, the data is analyzed into content categories, whereas in
discourse analysis the use of words and phrases is examined. Artifacts are also
collected. In fact, anything that helps to show what people do in their natural contexts
can be regarded as data.
In this lecture we distinguish between two overall approaches to field studies. The
first involves observing explicitly and recording what is happening, as an outsider
looking on. Qualitative techniques are used to collect the data, which may then be
analyzed qualitatively or quantitatively. For example, the number of times a particular
event is observed may be presented in a bar graph with means and standard
deviations.
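
For instance, a few lines of Python are enough to turn such event counts into the summary figures mentioned above; the event being counted and the per-session counts here are invented for illustration.

```python
# A minimal sketch (invented data): occurrences of one coded event,
# "asks a colleague for help", in five observation sessions.
from statistics import mean, stdev

event_counts = [4, 7, 2, 5, 6]

print(f"Mean occurrences per session: {mean(event_counts):.1f}")
print(f"Standard deviation          : {stdev(event_counts):.1f}")
```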
In some field studies the evaluator may be an insider or even a participant.
Ethnography is a particular type of insider evaluation in which the aim is to explore
the details of what happens in a particular social setting. "In the context of human
computer interaction, ethnography is a means of studying work (or other activities) in
order to inform the design of information systems and understand aspects of their use"
(Shapiro, 1995, p. 8).
Predictive evaluation
In predictive evaluations experts apply their knowledge of typical users, often guided
by heuristics, to predict usability problems. Another approach involves theoretically
based models. The key feature of predictive evaluation is that users need not be pres-
ent, which makes the process quick, relatively inexpensive, and thus attractive to
companies; but it has limitations.
In recent years heuristic evaluation in which experts review the software product
guided by tried and tested heuristics has become popular (Nielsen and Mack, 1994).
Usability guidelines (e.g., always provide clearly marked exits) were designed
primarily for evaluating screen-based products (e.g. form fill-ins, library catalogs,
etc.). With the advent of a range of new interactive products (e.g., the web, mobiles,
collaborative technologies), this original set of heuristics has been found insufficient.
While some are still applicable (e.g., speak the users' language), others are
inappropriate. New sets of heuristics are also needed that are aimed at evaluating
different classes of interactive products. In particular, specific heuristics are needed
that are tailored to evaluating web-based products, mobile devices, collaborative
technologies, computerized toys, etc. These should be based on a combination of
usability and user experience goals, new research findings and market research. Care
is needed in using sets of heuristics. Designers are sometimes led astray by findings
from heuristic evaluations that turn out not to be as accurate as they at first seemed.
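
The sketch below shows one simple way an evaluator might record heuristic evaluation findings so that the most serious problems can be dealt with first. The record structure and the four-point severity scale are assumptions made for illustration; the heuristics named are the examples quoted above, and the two problems are invented.

```python
# A minimal sketch (invented findings) of recording heuristic evaluation results.
from dataclasses import dataclass


@dataclass
class Problem:
    heuristic: str    # the heuristic the problem violates
    description: str  # what the expert observed
    severity: int     # assumed scale: 1 = cosmetic ... 4 = usability catastrophe


findings = [
    Problem("Provide clearly marked exits",
            "No obvious way to cancel the checkout wizard", 3),
    Problem("Speak the users' language",
            "Error message refers to 'transaction rollback'", 2),
]

# Report the most severe problems first so designers can prioritize fixes.
for p in sorted(findings, key=lambda p: p.severity, reverse=True):
    print(f"[severity {p.severity}] {p.heuristic}: {p.description}")
```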
The summary below compares the key aspects of each evaluation paradigm on the following issues:
·  the role of users
·  who controls the process and the relationship between evaluators and users during the evaluation
·  the location of the evaluation
·  when the evaluation is most useful
·  the type of data collected and how it is analyzed
·  how the evaluation findings are fed back into the design process
·  the philosophy and theory that underlies the evaluation paradigms.
Role of users
·  "Quick and dirty": natural behavior.
·  Usability testing: users carry out set tasks.
·  Field studies: natural behavior.
·  Predictive: users generally not involved.

Who controls the process
·  "Quick and dirty": evaluators take minimum control.
·  Usability testing: evaluators strongly in control.
·  Field studies: evaluators try to develop relationships with users.
·  Predictive: expert evaluators.

Location
·  "Quick and dirty": natural environment or laboratory.
·  Usability testing: laboratory.
·  Field studies: natural environment.
·  Predictive: laboratory-oriented, but often happens on the customer's premises.

When used
·  "Quick and dirty": any time you want to get feedback about a design quickly; techniques from other evaluation paradigms can be used, e.g., experts review software.
·  Usability testing: with a prototype or product.
·  Field studies: most often used early in design to check that users' needs are being met or to assess problems or design opportunities.
·  Predictive: expert reviews (often done by consultants) with a prototype, but can occur at any time; models are used to assess specific aspects of a potential design.

Type of data
·  "Quick and dirty": usually qualitative, informal descriptions.
·  Usability testing: quantitative, sometimes statistically validated; users' opinions collected by questionnaire or interview.
·  Field studies: qualitative descriptions, often accompanied by sketches, scenarios, quotes, and other artifacts.
·  Predictive: list of problems from expert reviews; quantitative figures from models, e.g., how long it takes to perform a task using two designs.

Fed back into design by
·  "Quick and dirty": sketches, quotes, descriptive report.
·  Usability testing: report of performance measures, errors, etc.; findings provide a benchmark for future versions.
·  Field studies: descriptions that include quotes, sketches, anecdotes, and sometimes time logs.
·  Predictive: reviewers provide a list of problems, often with suggested solutions; times calculated from models are given to designers.

Philosophy
·  "Quick and dirty": user-centered, highly practical approach.
·  Usability testing: applied approach based on experimentation, i.e., usability engineering.
·  Field studies: may be objective observation or ethnographic.
·  Predictive: practical heuristics and practitioner expertise underpin expert reviews; theory underpins models.
Techniques
There are many evaluation techniques and they can be categorized in various ways,
but in this lecture we will examine techniques for:
·  observing users
·  asking users their opinions
·  asking experts their opinions
·  testing users" performance
·  modeling users' task performance to predict the efficacy of a user interface
The brief descriptions below offer an overview of each category. Be aware that some
techniques are used in different ways in different evaluation paradigms.
Observing users
Observation techniques help to identify needs leading to new types of products and
help to evaluate prototypes. Notes, audio, video, and interaction logs are well-known
ways of recording observations and each has benefits and drawbacks. Obvious
challenges for evaluators are how to observe without disturbing the people being
observed and how to analyze the data, particularly when large quantities of video data
are collected or when several different types must be integrated to tell the story (e.g.,
notes, pictures, sketches from observers).
Asking users
Asking users what they think of a product--whether it does what they want; whether
they like it; whether the aesthetic design appeals; whether they had problems using it;
whether they want to use it again--is an obvious way of getting feedback. Interviews
and questionnaires are the main techniques for doing this. The questions asked can be
unstructured or tightly structured. They can be asked of a few people or of hundreds.
Interview and questionnaire techniques are also being developed for use with email
and the web.
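
As a small illustration, responses to one tightly structured questionnaire item (for example, a 1-5 rating of "I found the product easy to use") are often summarized with an average and a distribution, as in the sketch below; the ratings are invented.

```python
# A minimal sketch (invented responses): summarizing one 1-5 rating item.
from collections import Counter
from statistics import mean

responses = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]  # one rating per respondent

print(f"Average rating: {mean(responses):.1f} / 5")
print("Distribution  :", dict(sorted(Counter(responses).items())))
```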
Asking experts
Software inspections and reviews are long established techniques for evaluating
software code and structure. During the 1980s versions of similar techniques were
developed for evaluating usability. Guided by heuristics, experts step through tasks
role-playing typical users and identify problems. Developers like this approach because
it is usually relatively inexpensive and quick to perform compared with laboratory
and field evaluations that involve users. In addition, experts frequently suggest
solutions to problems.
User testing
Measuring user performance to compare two or more designs has been the bedrock of
usability testing. As we said earlier when discussing usability testing, these tests are
usually conducted in controlled settings and involve typical users performing typical,
well-defined tasks. Data is collected so that performance can be analyzed. Generally
the time taken to complete a task, the number of errors made, and the navigation path
through the product are recorded. Descriptive statistical measures such as means and
standard deviations are commonly used to report the results.
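
A minimal sketch of such a comparison is given below; the two designs and all timings are invented, and only the descriptive statistics described above (mean and standard deviation) are reported.

```python
# A minimal sketch (invented timings): the same task performed with two designs.
from statistics import mean, stdev

times_seconds = {
    "Design A": [48.2, 55.1, 61.0, 45.9, 52.3],  # one value per participant
    "Design B": [39.7, 44.0, 41.5, 47.2, 40.8],
}

for design, samples in times_seconds.items():
    print(f"{design}: mean = {mean(samples):.1f} s, sd = {stdev(samples):.1f} s")
```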
Modeling users' task performance
There have been various attempts to model human-computer interaction so as to
predict the efficiency and problems associated with different designs at an early stage
without building elaborate prototypes. These techniques are successful for systems
with limited functionality such as telephone systems. GOMS and the keystroke level model
are the best known techniques.
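
As a rough illustration of the keystroke level model, the sketch below sums the published operator times from Card, Moran, and Newell over an assumed operator sequence; the task decomposition (dialing a seven-digit number) is an invented example, not one taken from the lecture.

```python
# A minimal keystroke level model (KLM) sketch using standard operator times.
OPERATOR_SECONDS = {
    "K": 0.28,  # press a key or button (average non-secretarial typist)
    "P": 1.10,  # point at a target with a pointing device
    "H": 0.40,  # home hands on the keyboard or device
    "M": 1.35,  # mental preparation
}

# Assumed task: mentally prepare, home on the keypad, then press seven digits.
task = ["M", "H"] + ["K"] * 7

predicted = sum(OPERATOR_SECONDS[op] for op in task)
print(f"Predicted task time: {predicted:.2f} seconds")
```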