Human Computer Interaction (CS408)                                                                VU

Lecture 29
Evaluation - Part I
Learning Goals

The aim of this lecture is to introduce you to the study of Human Computer Interaction, so that after studying this you will be able to:

· Understand what evaluation is in the development process
· Understand different evaluation paradigms and techniques
What to evaluate?

There is a huge variety of interactive products with a vast array of features that need to be evaluated. Some features, such as the sequence of links to be followed to find an item on a website, are often best evaluated in a laboratory, since such a setting allows the evaluators to control what they want to investigate. Other aspects, such as whether a collaborative toy is robust and whether children enjoy interacting with it, are better evaluated in natural settings, so that evaluators can see what children do when left to their own devices.
John Gould and his colleagues (Gould et al., 1990; Gould and Lewis, 1985) recommended three principles for developing the 1984 Olympic Message System:

· Focus on users and their tasks
· Observe, measure, and analyze their performance with the system
· Design iteratively
Since the OMS study, a number of new evaluation techniques have been developed. There has also been a growing trend towards observing how people interact with the system in their work, home, and other settings, the goal being to obtain a better understanding of how the product is (or will be) used in its intended setting. For example, at work people are frequently being interrupted by phone calls, others knocking at their door, email arriving, and so on, to the extent that many tasks are interrupt-driven. Only rarely does someone carry a task out from beginning to end without stopping to do something else. Hence the way people carry out an activity (e.g., preparing a report) in the real world is very different from how it may be observed in a laboratory. Furthermore, this observation has implications for the way products should be designed.
Why do you need to evaluate?

Just as designers shouldn't assume that everyone is like them, they also shouldn't presume that following design guidelines guarantees good usability. Evaluation is needed to check that users can use the product and like it. Furthermore, nowadays users look for much more than just a usable system, as the Nielsen Norman Group, a usability consultancy company, points out (www.nngroup.com):
"User
experience" encompasses all
aspects of the end-user's interaction
...
the
first requirement for an
exemplary user experience is to meet
the exact
needs of
the customer, without fuss or
bother. Next comes
simplicity and
elegance
that produce products that
are a joy to own, a joy to
use. "
Bruce Tognazzini, another successful usability consultant, comments (www.asktog.com) that:

"Iterative design, with its repeating cycle of design and testing, is the only validated methodology in existence that will consistently produce successful results. If you don't have user-testing as an integral part of your design process you are going to throw buckets of money down the drain."

Tognazzini points out that there are five good reasons for investing in user testing:
1. Problems are fixed before the product is shipped, not after.
2. The team can concentrate on real problems, not imaginary ones.
3. Engineers code instead of debating.
4. Time to market is sharply reduced.
5. Finally, upon first release, your sales department has a rock-solid design it can sell without having to pepper their pitches with how it will all actually work in release 1.1 or 2.0.
Now that there is a diversity of interactive products, it is not surprising that the range of features to be evaluated is very broad. For example, developers of a new web browser may want to know if users find items faster with their product. Government authorities may ask if a computerized system for controlling traffic lights results in fewer accidents. Makers of a toy may ask if six-year-olds can manipulate the controls and whether they are engaged by its furry case and pixie face. A company that develops the casing for cell phones may ask if the shape, size, and color of the case is appealing to teenagers. A new dotcom company may want to assess market reaction to its new home page design.
This diversity of interactive products, coupled with new user expectations, poses interesting challenges for evaluators, who, armed with many well tried and tested techniques, must now adapt them and develop new ones. As well as usability, user experience goals can be extremely important for a product's success.
When to evaluate?

The product being developed may be a brand-new product or an upgrade of an existing product. If the product is new, then considerable time is usually invested in market research. Designers often support this process by developing mockups of the potential product that are used to elicit reactions from potential users. As well as helping to assess market need, this activity contributes to understanding users' needs and early requirements. As we said in an earlier lecture, sketches, screen mockups, and other low-fidelity prototyping techniques are used to represent design ideas. Many of these same techniques are used to elicit users' opinions in evaluation (e.g., questionnaires and interviews), but the purpose and focus of evaluation are different. The goal of evaluation is to assess how well a design fulfills users' needs and whether users like it.
In the case of an upgrade, there is limited scope for change and attention is focused on improving the overall product. This type of design is well suited to usability engineering, in which evaluations compare user performance and attitudes with those for previous versions. Some products, such as office systems, go through many versions, and successful products may reach double-digit version numbers. In contrast, new products do not have previous versions and there may be nothing comparable on the market, so more radical changes are possible if evaluation results indicate a problem.
Evaluations done during design to check that the product continues to meet users' needs are known as formative evaluations. Evaluations that are done to assess the success of a finished product, such as those to satisfy a sponsoring agency or to check that a standard is being upheld, are known as summative evaluations. Agencies such as the National Institute of Standards and Technology (NIST) in the USA, the International Standards Organization (ISO), and the British Standards Institute (BSI) set standards by which products produced by others are evaluated.
29.1 Evaluation paradigms and techniques
Before we describe the techniques used in evaluation studies, we shall start by proposing some key terms. Terminology in this field tends to be loose and often confusing, so it is a good idea to be clear from the start what you mean. We start with the much-used term user studies, defined by Abigail Sellen in her interview as follows: "user studies essentially involve looking at how people behave either in their natural [environments], or in the laboratory, both with old technologies and with new ones."
Any kind of evaluation, whether it is a user study or not, is guided either explicitly or implicitly by a set of beliefs that may also be underpinned by theory. These beliefs and the practices (i.e., the methods or techniques) associated with them are known as an evaluation paradigm, which you should not confuse with "interaction paradigms". Often evaluation paradigms are related to a particular discipline in that they strongly influence how people from the discipline think about evaluation. Each paradigm has particular methods and techniques associated with it.
So that you are not confused, we want to state explicitly that we will not be distinguishing between methods and techniques. We tend to talk about techniques, but you may find that others call them methods. An example of the relationship between a paradigm and the techniques used by evaluators following that paradigm can be seen for usability testing, which is an applied science and engineering paradigm. The techniques associated with usability testing are: user testing in a controlled environment; observation of user activity in the controlled environment and the field; and questionnaires and interviews.
Evaluation paradigms

In this lecture we identify four core evaluation paradigms: (1) "quick and dirty" evaluations; (2) usability testing; (3) field studies; and (4) predictive evaluation. Other people may use slightly different terms to refer to similar paradigms.
"Quick
and dirty" evaluation
A "quick
and dirty" evaluation is a
common practice in which designers
informally
get
feedback from users or
consultants to confirm that
their ideas are in line
with
users"
needs and are liked.
"Quick and dirty"
evaluations can be done at
any stage and
the emphasis is on fast input rather than carefully documented findings. For example, early in design developers may meet informally with users to get feedback on ideas for a new product (Hughes et al., 1994). At later stages similar meetings may occur to try out an idea for an icon, check whether a graphic is liked, or confirm that information has been appropriately categorized on a webpage. This approach is often called "quick and dirty" because it is meant to be done in a short space of time. Getting this kind of feedback is an essential ingredient of successful design.
As discussed in earlier lectures, any involvement with users will be highly informative and you can learn a lot early in design by observing what people do and talking to them informally. The data collected is usually descriptive and informal and it is fed back into the design process as verbal or written notes, sketches, anecdotes, etc. Another source comes from consultants, who use their knowledge of user behavior, the marketplace, and technical know-how to review software quickly and provide suggestions for improvement. It is an approach that has become particularly popular in web design, where the emphasis is usually on short timescales.
Usability testing

Usability testing was the dominant approach in the 1980s (Whiteside et al., 1998), and remains important, although, as you will see, field studies and heuristic evaluations have grown in prominence. Usability testing involves measuring typical users' performance on carefully prepared tasks that are typical of those for which the system was designed. Users' performance is generally measured in terms of number of errors and time to complete the task. As the users perform these tasks, they are watched and recorded on video and by logging their interactions with software. This observational data is used to calculate performance times, identify errors, and help explain why the users did what they did. User satisfaction questionnaires and interviews are also used to elicit users' opinions.
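To make the measurement step concrete, the sketch below shows one minimal way such observational data might be reduced to a completion time and an error count; the log format and the event names (start, error, task_complete) are hypothetical, not taken from the lecture.

    from datetime import datetime

    # Hypothetical interaction log for one participant: (timestamp, event) pairs.
    def summarize_session(log):
        fmt = "%H:%M:%S"
        start = end = None
        errors = 0
        for timestamp, event in log:
            t = datetime.strptime(timestamp, fmt)
            if event == "start":
                start = t
            elif event == "error":
                errors += 1
            elif event == "task_complete":
                end = t
        return {"completion_time_s": (end - start).total_seconds(),
                "errors": errors}

    session = [("09:02:10", "start"), ("09:02:45", "error"),
               ("09:05:30", "task_complete")]
    print(summarize_session(session))  # {'completion_time_s': 200.0, 'errors': 1}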
The defining characteristic of usability testing is that it is strongly controlled by the evaluator (Mayhew, 1999). There is no mistaking that the evaluator is in charge! Typically tests take place in laboratory-like conditions that are controlled. Casual visitors are not allowed, telephone calls are stopped, and there is no possibility of talking to colleagues, checking email, or doing any of the other tasks that most of us rapidly switch among in our normal lives. Everything that the participant does is recorded: every key press, comment, pause, expression, etc., so that it can be used as data.
Quantifying users' performance is a dominant theme in usability testing. However, unlike research experiments, variables are not manipulated and the typical number of participants is too small for much statistical analysis. User satisfaction data from questionnaires tends to be categorized and average ratings are presented. Sometimes video or anecdotal evidence is also included to illustrate problems that users encounter. Some evaluators then summarize this data in a usability specification so that developers can use it to test future prototypes or versions of the product against it. Optimal performance levels and minimal levels of acceptance are often specified and current levels noted. Changes in the design can then be agreed and engineered, hence the term "usability engineering".
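A usability specification of this kind is essentially a table of measurable attributes with acceptable and planned levels. The sketch below is a hypothetical illustration; the attribute names and target values are invented for the example.

    # Hypothetical usability specification: for each attribute, the current level,
    # the minimum acceptable level, and the planned target (lower is better here).
    usability_spec = [
        {"attribute": "time to complete search task (s)",
         "now": 95, "worst_acceptable": 120, "planned": 60},
        {"attribute": "errors per task",
         "now": 3.0, "worst_acceptable": 4.0, "planned": 1.0},
    ]

    def check(spec):
        for row in spec:
            status = ("meets planned level" if row["now"] <= row["planned"]
                      else "acceptable" if row["now"] <= row["worst_acceptable"]
                      else "below acceptable level")
            print(row["attribute"], "->", status)

    check(usability_spec)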
Field studies

The distinguishing feature of field studies is that they are done in natural settings with the aim of increasing understanding about what users do naturally and how
technology impacts them. In product design, field studies can be used to (1) help identify opportunities for new technology; (2) determine requirements for design; (3) facilitate the introduction of technology; and (4) evaluate technology (Bly, 1997).
We introduced qualitative techniques such as interviews, observation, participant observation, and ethnography that are used in field studies. The exact choice of techniques is often influenced by the theory used to analyze the data. The data takes the form of events and conversations that are recorded as notes, or by audio or video recording, and later analyzed using a variety of analysis techniques such as content, discourse, and conversational analysis. These techniques vary considerably. In content analysis, for example, the data is analyzed into content categories, whereas in discourse analysis the use of words and phrases is examined. Artifacts are also collected. In fact, anything that helps to show what people do in their natural contexts can be regarded as data.
In this lecture we distinguish between two overall approaches to field studies. The first involves observing explicitly and recording what is happening, as an outsider looking on. Qualitative techniques are used to collect the data, which may then be analyzed qualitatively or quantitatively. For example, the number of times a particular event is observed may be presented in a bar graph with means and standard deviations.
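As a minimal sketch of that kind of quantitative treatment (the coded events and session data are hypothetical), observed events can be tallied per category and summarized with a mean and standard deviation across sessions:

    from collections import Counter
    from statistics import mean, stdev

    # Hypothetical field notes: events coded for each observation session.
    sessions = [
        ["asks_colleague", "checks_email", "asks_colleague"],
        ["checks_email", "phone_interruption"],
        ["asks_colleague", "phone_interruption", "checks_email", "checks_email"],
    ]

    categories = ["asks_colleague", "checks_email", "phone_interruption"]
    for category in categories:
        per_session = [Counter(s)[category] for s in sessions]
        print(category, "mean:", round(mean(per_session), 2),
              "sd:", round(stdev(per_session), 2))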
In some field studies the evaluator may be an insider or even a participant. Ethnography is a particular type of insider evaluation in which the aim is to explore the details of what happens in a particular social setting. "In the context of human computer interaction, ethnography is a means of studying work (or other activities) in order to inform the design of information systems and understand aspects of their use" (Shapiro, 1995, p. 8).
Predictive evaluation

In predictive evaluations experts apply their knowledge of typical users, often guided by heuristics, to predict usability problems. Another approach involves theoretically based models. The key feature of predictive evaluation is that users need not be present, which makes the process quick, relatively inexpensive, and thus attractive to companies; but it has limitations.
In recent years heuristic evaluation, in which experts review the software product guided by tried and tested heuristics, has become popular (Nielsen and Mack, 1994). Usability guidelines (e.g., always provide clearly marked exits) were designed primarily for evaluating screen-based products (e.g., form fill-ins, library catalogs, etc.). With the advent of a range of new interactive products (e.g., the web, mobiles, collaborative technologies), this original set of heuristics has been found insufficient. While some are still applicable (e.g., speak the users' language), others are inappropriate. New sets of heuristics are also needed that are aimed at evaluating different classes of interactive products. In particular, specific heuristics are needed that are tailored to evaluating web-based products, mobile devices, collaborative technologies, computerized toys, etc. These should be based on a combination of usability and user experience goals, new research findings, and market research. Care is needed in using sets of heuristics. Designers are sometimes led astray by findings from heuristic evaluations that turn out not to be as accurate as they at first seemed.
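The output of a heuristic evaluation is typically a list of problems, each tied to the heuristic it violates and given a severity rating. The sketch below is hypothetical; the heuristic names follow Nielsen's widely used set, but the problems and ratings are invented.

    # Hypothetical findings recorded during a heuristic evaluation.
    # Severity: 1 = cosmetic ... 4 = usability catastrophe.
    findings = [
        {"heuristic": "Visibility of system status",
         "problem": "No feedback while search results load", "severity": 3},
        {"heuristic": "User control and freedom",
         "problem": "No clearly marked exit from the checkout flow", "severity": 4},
        {"heuristic": "Consistency and standards",
         "problem": "Two different icons are used for 'save'", "severity": 2},
    ]

    # Report the most severe problems first so designers can prioritize fixes.
    for f in sorted(findings, key=lambda f: f["severity"], reverse=True):
        print(f["severity"], "|", f["heuristic"], "|", f["problem"])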
The table below summarizes the key aspects of each evaluation paradigm for the following issues:

· the role of users
· who controls the process and the relationship between evaluators and users during the evaluation
· the location of the evaluation
· when the evaluation is most useful
· the type of data collected and how it is analyzed
· how the evaluation findings are fed back into the design process
· the philosophy and theory that underlies the evaluation paradigms
Evaluation paradigms compared: "quick and dirty", usability testing, field studies, and predictive evaluation.

Role of users
· "Quick and dirty": Natural behavior.
· Usability testing: To carry out set tasks.
· Field studies: Natural behavior.
· Predictive: Users generally not involved.

Who controls
· "Quick and dirty": Evaluators take minimum control.
· Usability testing: Evaluators strongly in control.
· Field studies: Evaluators try to develop relationships with users.
· Predictive: Expert evaluators.

Location
· "Quick and dirty": Natural environment or laboratory.
· Usability testing: Laboratory.
· Field studies: Natural environment.
· Predictive: Laboratory-oriented but often happens on customer's premises.

When used
· "Quick and dirty": Any time you want to get feedback about a design quickly. Techniques from other evaluation paradigms can be used, e.g., experts review software.
· Usability testing: With a prototype or product.
· Field studies: Most often used early in design to check that users' needs are being met or to assess problems or design opportunities.
· Predictive: Expert reviews (often done by consultants) with a prototype, but can occur at any time. Models are used to assess specific aspects of a potential design.

Type of data
· "Quick and dirty": Usually qualitative, informal descriptions.
· Usability testing: Quantitative. Sometimes statistically validated. Users' opinions collected by questionnaire or interview.
· Field studies: Qualitative descriptions, often accompanied with sketches, scenarios, quotes, and other artifacts.
· Predictive: List of problems from expert reviews. Quantitative figures from models, e.g., how long it takes to perform a task using two designs.

Fed back into design by...
· "Quick and dirty": Sketches, quotes, descriptive report.
· Usability testing: Report of performance measures, errors, etc. Findings provide a benchmark for future versions.
· Field studies: Descriptions that include quotes, sketches, anecdotes, and sometimes time logs.
· Predictive: Reviewers provide a list of problems, often with suggested solutions. Times calculated from models are given to designers.

Philosophy
· "Quick and dirty": User-centered, highly practical approach.
· Usability testing: Applied approach based on experimentation, i.e., usability engineering.
· Field studies: May be objective observation or ethnographic.
· Predictive: Practical heuristics and practitioner expertise underpin expert reviews. Theory underpins models.
Techniques

There are many evaluation techniques and they can be categorized in various ways, but in this lecture we will examine techniques for:

· observing users
· asking users their opinions
· asking experts their opinions
· testing users' performance
· modeling users' task performance to predict the efficacy of a user interface
The brief descriptions below offer an overview of each category. Be aware that some techniques are used in different ways in different evaluation paradigms.
Observing users

Observation techniques help to identify needs leading to new types of products and help to evaluate prototypes. Notes, audio, video, and interaction logs are well-known ways of recording observations, and each has benefits and drawbacks. Obvious challenges for evaluators are how to observe without disturbing the people being observed and how to analyze the data, particularly when large quantities of video data are collected or when several different types must be integrated to tell the story (e.g., notes, pictures, sketches from observers).
Asking users

Asking users what they think of a product (whether it does what they want, whether they like it, whether the aesthetic design appeals, whether they had problems using it, whether they want to use it again) is an obvious way of getting feedback. Interviews and questionnaires are the main techniques for doing this. The questions asked can be unstructured or tightly structured. They can be asked of a few people or of hundreds. Interview and questionnaire techniques are also being developed for use with email and the web.
Asking experts

Software inspections and reviews are long-established techniques for evaluating software code and structure. During the 1980s versions of similar techniques were developed for evaluating usability. Guided by heuristics, experts step through tasks role-playing typical users and identify problems. Developers like this approach because it is usually relatively inexpensive and quick to perform compared with laboratory and field evaluations that involve users. In addition, experts frequently suggest solutions to problems.
User testing

Measuring user performance to compare two or more designs has been the bedrock of usability testing. As we said earlier when discussing usability testing, these tests are usually conducted in controlled settings and involve typical users performing typical, well-defined tasks. Data is collected so that performance can be analyzed. Generally the time taken to complete a task, the number of errors made, and the navigation path through the product are recorded. Descriptive statistical measures such as means and standard deviations are commonly used to report the results.
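For example, completion times collected for the same task on two alternative designs can be reported with a mean and standard deviation per design; the figures below are hypothetical:

    from statistics import mean, stdev

    # Hypothetical completion times (seconds) for the same task on two designs.
    times = {
        "Design A": [48, 55, 62, 50, 58],
        "Design B": [71, 65, 80, 74, 69],
    }

    for design, samples in times.items():
        print(design, "mean:", round(mean(samples), 1),
              "sd:", round(stdev(samples), 1))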
Modeling users' task performance

There have been various attempts to model human-computer interaction so as to predict the efficiency and problems associated with different designs at an early stage, without building elaborate prototypes. These techniques are successful for systems with limited functionality such as telephone systems. GOMS and the keystroke level model are the best known techniques.
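As an illustration of how the keystroke level model makes such predictions, expert task time is estimated by summing standard operator times for keystrokes (K), pointing (P), homing between keyboard and mouse (H), and mental preparation (M). The sketch below uses the commonly cited operator estimates; the task sequences being compared are hypothetical.

    # Commonly cited keystroke level model operator times, in seconds.
    OPERATOR_TIME = {
        "K": 0.2,   # press a key or button (skilled typist)
        "P": 1.1,   # point at a target on screen with a mouse
        "H": 0.4,   # home hands between keyboard and mouse
        "M": 1.35,  # mental preparation
    }

    def klm_predict(sequence):
        # sequence is a string of operators, e.g. "MHPK" =
        # think, move hand to mouse, point at a menu item, click.
        return sum(OPERATOR_TIME[op] for op in sequence)

    # Hypothetical comparison: performing a command via a menu vs. a shortcut key.
    print("menu route:", round(klm_predict("MHPKPK"), 2), "s")
    print("keyboard shortcut:", round(klm_predict("MKK"), 2), "s")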