Research Methods (STA630) - VU

Lesson 18

CRITERIA FOR GOOD MEASUREMENT
Now that we have seen how to operationally define variables, it is important to make sure that the instrument we develop to measure a particular concept is indeed measuring the variable accurately, and that we are in fact measuring the concept we set out to measure. This ensures that in operationally defining perceptual and attitudinal variables we have not overlooked some important dimensions and elements, or included some irrelevant ones. The scales developed are often imperfect, and errors are prone to occur in the measurement of attitudinal variables. The use of better instruments will ensure more accuracy in results, which in turn will enhance the scientific quality of the research. Hence, in some way, we need to assess the "goodness" of the measure developed.
What should be the characteristics of a good measurement? An intuitive answer to this question is that the tool should be an accurate indicator of what we are interested in measuring. In addition, it should be easy and efficient to use. There are three major criteria for evaluating a measurement tool: validity, reliability, and sensitivity.
Validity
Validity is the ability of an instrument (for example, one measuring an attitude) to measure what it is supposed to measure. That is, when we ask a set of questions (i.e., develop a measuring instrument) with the hope that we are tapping the concept, how can we be reasonably certain that we are indeed measuring the concept we set out to measure and not something else? There is no quick answer. Researchers have attempted to assess validity in different ways, including asking questions such as "Is there consensus among my colleagues that my attitude scale measures what it is supposed to measure?", "Does my measure correlate with others' measures of the 'same' concept?", and "Does the behavior expected from my measure predict the actual observed behavior?" Researchers expect the answers to provide some evidence of a measure's validity.
What is relevant depends on the nature of the research problem and the researcher's judgment. One way to approach this question is to organize the answer according to measure-relevant types of validity. One widely accepted classification consists of three major types of validity: (1) content validity, (2) criterion-related validity, and (3) construct validity.
(1) Content Validity
The content validity of a measuring instrument (the composite of measurement scales) is the extent to which it provides adequate coverage of the investigative questions guiding the study. If the instrument contains a representative sample of the universe of subject matter of interest, then its content validity is good. To evaluate the content validity of an instrument, one must first agree on what dimensions and elements constitute adequate coverage. To put it differently, content validity is a function of how well the dimensions and elements of a concept have been delineated. Consider the concept of feminism, which implies a person's commitment to a set of beliefs creating full equality between men and women in areas of the arts, intellectual pursuits, family, work, politics, and authority relations. Does this definition provide adequate coverage of the different dimensions of the concept? Suppose we then use the following two questions to measure feminism:

1. Should men and women get equal pay for equal work?
2. Should men and women share household tasks?

These two questions do not provide coverage of all the dimensions delineated earlier; the instrument definitely falls short of adequate content validity for measuring feminism.
A panel of persons can attest to the content validity of the instrument by judging how well it meets the standard. For a performance test, such a panel independently assesses the test items.
It judges each item to be essential, useful but not essential, or not necessary in assessing the performance of a relevant behavior.
Face validity is considered a basic and very minimum index of content validity. Face validity indicates that the items intended to measure a concept do, on the face of it, look like they measure the concept. For example, few people would accept a measure of college students' math ability that asked students: 2 + 2 = ? On the face of it, this is not a valid measure of college-level math ability. Face validity is thus a subjective agreement among professionals that a scale logically appears to reflect accurately what it is supposed to measure. When it appears evident to experts that the measure provides adequate coverage of the concept, the measure has face validity.
(2) Criterion-Related Validity
Criterion validity uses some standard or criterion to indicate a construct accurately. The validity of an indicator is verified by comparing it with another measure of the same construct in which researchers have confidence. There are two subtypes of this kind of validity.
Concurrent validity: To have concurrent validity, an indicator must be associated with a preexisting indicator that is judged to be valid. For example, suppose we create a new test to measure intelligence. For it to be concurrently valid, it should be highly associated with existing IQ tests (assuming the same definition of intelligence is used). This means that most people who score high on the old measure should also score high on the new one, and vice versa. The two measures may not be perfectly associated, but if they measure the same or a similar construct, it is logical for them to yield similar results.
Predictive validity: Criterion validity whereby an indicator predicts future events that are logically related to a construct is called predictive validity. It cannot be used for all measures. The measure and the action predicted must be distinct from, but indicate, the same construct. Predictive measurement validity should not be confused with prediction in hypothesis testing, where one variable predicts a different variable in the future.
Consider the scholastic assessment tests given to candidates seeking admission to different subjects. These are supposed to measure the scholastic aptitude of the candidates: the ability to perform in the institution as well as in the subject. If such a test has high predictive validity, then candidates who get high test scores will subsequently do well in their subjects. If students with high scores perform the same as students with average or low scores, then the test has low predictive validity.
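As an illustration only, a minimal sketch of how predictive validity might be checked numerically: correlate admission-test scores with the performance those candidates later show. The data and variable names below are invented for demonstration and are not part of the lesson.

```python
import numpy as np

# Hypothetical data: admission-test scores and the candidates' later performance (GPA)
test_scores = np.array([82, 91, 65, 74, 88, 59, 95, 70, 77, 84])
later_gpa = np.array([3.4, 3.8, 2.5, 2.9, 3.6, 2.3, 3.9, 2.8, 3.1, 3.5])

# Pearson correlation between the test and the future criterion
r = np.corrcoef(test_scores, later_gpa)[0, 1]
print(f"Predictive validity coefficient: r = {r:.2f}")
# A high positive r suggests high predictive validity; r near 0 suggests low.
```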
(3) Construct Validity
Construct validity is for measures with multiple indicators. It addresses the question: if the measure is valid, do the various indicators operate in a consistent manner? It requires a definition with clearly specified conceptual boundaries. In order to evaluate construct validity, we consider both the theory and the measuring instrument being used. Construct validity is assessed through convergent validity and discriminant validity.
Convergent Validity: This kind of validity applies when multiple indicators converge, or are associated with one another. Convergent validity means that multiple measures of the same construct hang together or operate in similar ways. For example, we measure the construct "education" by asking people how much education they have completed, looking at their institutional records, and asking them to complete a test of school-level knowledge. If the measures do not converge (e.g., people who claim to have a college degree have no record of attending college, or those with a college degree perform no better than high school dropouts on the test), then our measure has weak convergent validity and we should not combine all three indicators into one measure.
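A minimal sketch of how convergence among the three education indicators could be examined, assuming each indicator has been scored numerically for the same respondents; the data and variable names are hypothetical.

```python
import numpy as np

# Hypothetical scores for the same respondents on three indicators of "education"
self_report = np.array([12, 16, 10, 14, 18, 12, 16, 11])      # years of education claimed
record_years = np.array([12, 16, 10, 13, 18, 11, 16, 10])     # years in institutional records
knowledge_test = np.array([55, 80, 40, 62, 88, 50, 78, 45])   # school-level knowledge test

indicators = np.vstack([self_report, record_years, knowledge_test])
corr = np.corrcoef(indicators)
print("Inter-indicator correlations:\n", np.round(corr, 2))
# If all off-diagonal correlations are high, the indicators converge and may be
# combined into one measure; weak correlations signal poor convergent validity.
```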
Discriminant Validity: Also called divergent validity, discriminant validity is the opposite of convergent validity. It means that the indicators of one construct hang together or converge, but also diverge from, or are negatively associated with, opposing constructs.
Discriminant validity says that if two constructs A and B are very different, then measures of A and B should not be associated. For example, suppose we have 10 items that measure political conservatism, and people answer all 10 in similar ways. But we have also put 5 questions in the same questionnaire that measure political liberalism. Our measure of conservatism has discriminant validity if the 10 conservatism items hang together and are negatively associated with the 5 liberalism items.
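The sketch below illustrates one way the conservatism/liberalism example could be checked: total scores on the hypothetical 10-item conservatism scale should correlate negatively with total scores on the 5-item liberalism scale. The simulated responses are invented for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 100 respondents: a latent political orientation drives both scales
orientation = rng.normal(size=100)  # higher = more conservative
conservatism_items = orientation[:, None] + rng.normal(scale=0.5, size=(100, 10))
liberalism_items = -orientation[:, None] + rng.normal(scale=0.5, size=(100, 5))

conservatism_score = conservatism_items.sum(axis=1)
liberalism_score = liberalism_items.sum(axis=1)

r = np.corrcoef(conservatism_score, liberalism_score)[0, 1]
print(f"Conservatism vs. liberalism: r = {r:.2f}")
# A clearly negative r is consistent with discriminant (divergent) validity.
```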
Reliability
The reliability of a measure indicates the extent to which it is without bias (error free) and hence ensures consistent measurement across time and across the various items in the instrument. In other words, the reliability of a measure is an indication of the stability and consistency with which the instrument measures the concept, and it helps to assess the "goodness" of a measure.
Stability of Measures
The ability of a measure to remain the same over time, despite uncontrollable testing conditions or the state of the respondents themselves, is indicative of its stability and low vulnerability to changes in the situation. This attests to its "goodness" because the concept is measured stably, no matter when the measurement is done. Two tests of stability are test-retest reliability and parallel-form reliability.
(1) Test-Retest Reliability: The test-retest method of determining reliability involves administering the same scale to the same respondents at two separate times to test for stability. If the measure is stable over time, the test, administered under the same conditions each time, should obtain similar results. For example, suppose a researcher measures job satisfaction and finds that 64 percent of the population is satisfied with their jobs. If the study is repeated a few weeks later under similar conditions and the researcher again finds that 64 percent of the population is satisfied with their jobs, it appears that the measure has repeatability. The high stability correlation, or consistency, between the measures at time 1 and time 2 indicates a high degree of reliability. This was at the aggregate level; the same exercise can be applied at the individual level. When the measuring instrument produces unpredictable results from one testing to the next, the results are said to be unreliable because of error in measurement.
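At the individual level, test-retest reliability is usually summarized as the correlation between the scores obtained at the two administrations. A minimal sketch with invented job-satisfaction scores follows.

```python
import numpy as np

# Hypothetical job-satisfaction scores for the same 8 respondents, a few weeks apart
time_1 = np.array([4, 5, 3, 2, 5, 4, 3, 4])
time_2 = np.array([4, 5, 3, 3, 5, 4, 2, 4])

stability = np.corrcoef(time_1, time_2)[0, 1]
print(f"Test-retest (stability) coefficient: r = {stability:.2f}")
# A coefficient close to 1 indicates that the measure is stable over time.
```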
There are two problems with measures of test-retest reliability that are common to all longitudinal studies. First, the initial measurement may sensitize the respondents to their participation in a research project and subsequently influence the results of the second measurement. Further, if the time between the measures is long, there may be attitude change or other maturation of the subjects. Thus it is possible for a reliable measure to indicate low or moderate correlation between the first and the second administration, but this low correlation may be due to an attitude change over time rather than to a lack of reliability.
(2) Parallel-Form Reliability: When responses on two comparable sets of measures tapping the same construct are highly correlated, we have parallel-form reliability. It is also called equivalent-form reliability. Both forms have similar items and the same response format, the only changes being the wording and the order or sequence of the questions. What we try to establish here is the error variability resulting from the wording and ordering of the questions. If two such comparable forms are highly correlated, we may be fairly certain that the measures are reasonably reliable, with minimal error variance caused by wording, ordering, or other factors.
Internal Consistency of Measures
Internal consistency of measures is indicative of the homogeneity of the items in the measure that tap the construct. In other words, the items should "hang together as a set" and be capable of independently measuring the same concept, so that the respondents attach the same overall meaning to each of the items. This can be seen by examining whether the items and the subsets of items in the measuring instrument are highly correlated.
Consistency can be examined through inter-item consistency reliability and split-half reliability.
(1) Inter-Item Consistency Reliability: This is a test of the consistency of respondents' answers to all the items in a measure. To the degree that items are independent measures of the same concept, they will be correlated with one another.
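Inter-item consistency is often summarized with Cronbach's coefficient alpha; the lesson does not name that index, so the sketch below is offered only as a common way of operationalizing the idea, using an invented response matrix.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical responses of 6 people to 5 items intended to tap one concept
responses = np.array([
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [5, 5, 4, 5, 5],
    [3, 3, 3, 2, 3],
    [4, 4, 5, 4, 4],
    [1, 2, 1, 1, 2],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
# Values closer to 1 indicate that the items hang together as a set.
```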
(2) Split-Half Reliability: Split-half reliability reflects the correlation between two halves of an instrument. The estimates can vary depending on how the items in the measure are split into two halves. The technique of splitting halves is the most basic method for checking internal consistency when measures contain a large number of items. In the split-half method the researcher takes the results obtained from one half of the scale items (e.g., odd-numbered items) and checks them against the results from the other half (e.g., even-numbered items). A high correlation tells us there is similarity (or homogeneity) among the items.
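A minimal sketch of the split-half check: odd-numbered and even-numbered item totals are correlated, and (a common practice, though not covered in the lesson) the Spearman-Brown formula adjusts that correlation up to full test length. The data are invented.

```python
import numpy as np

# Hypothetical responses of 6 people to an 8-item scale
responses = np.array([
    [4, 5, 4, 4, 5, 4, 5, 4],
    [2, 2, 3, 2, 2, 3, 2, 2],
    [5, 5, 4, 5, 5, 5, 4, 5],
    [3, 3, 3, 2, 3, 3, 3, 2],
    [4, 4, 5, 4, 4, 4, 5, 4],
    [1, 2, 1, 1, 2, 1, 2, 1],
])

odd_half = responses[:, 0::2].sum(axis=1)    # items 1, 3, 5, 7
even_half = responses[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
full_test = (2 * r_half) / (1 + r_half)      # Spearman-Brown correction
print(f"Half-test r = {r_half:.2f}, full-test reliability = {full_test:.2f}")
```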
It is important to note that reliability is a necessary but not sufficient condition for the goodness of a measure. For example, one could reliably measure a concept, establishing high stability and consistency, but it may not be the concept that one had set out to measure. Validity ensures the ability of a scale to measure the intended concept.
Sensitivity
The sensitivity of a scale is an important measurement concept, particularly when changes in attitudes or other hypothetical constructs are under investigation. Sensitivity refers to an instrument's ability to measure variability in stimuli or responses accurately. A dichotomous response category, such as "agree or disagree," does not allow the recording of subtle attitude changes. A more sensitive measure, with numerous response categories on the scale, may be needed. For example, adding "strongly agree," "mildly agree," "neither agree nor disagree," "mildly disagree," and "strongly disagree" as categories increases a scale's sensitivity.
The sensitivity of a scale based on a single question or single item can also be increased by adding more questions or items. In other words, because index measures allow for a greater range of possible scores, they are more sensitive than single-item measures.
Practicality:
The scientific requirements of a project call for the measurement process to be reliable and valid, while the operational requirements call for it to be practical. Practicality has been defined as economy, convenience, and interpretability.