Lesson 18
CRITERIA FOR GOOD MEASUREMENT
Now that we have seen how to operationally define variables, it is important to make sure that the
instrument we develop to measure a particular concept is measuring it accurately, and that we are in
fact measuring the concept we set out to measure. This ensures that in operationally defining
perceptual and attitudinal variables, we have not overlooked some important dimensions and elements,
or included some irrelevant ones. The scales developed are often imperfect, and errors are prone to
occur in the measurement of attitudinal variables. The use of better instruments will ensure more
accuracy in results, which in turn will enhance the scientific quality of the research. Hence, in some
way, we need to assess the "goodness" of the measure developed.
What should be the characteristics of a good measurement? An intuitive answer to this question is that
the tool should be an accurate indicator of what we are interested in measuring. In addition, it should be
easy and efficient to use.  There are three major criteria for evaluating a measurement tool: validity,
reliability, and sensitivity.
Validity
Validity is the ability of an instrument (for example measuring an attitude) to measure what it is
supposed to measure. That is, when we ask a set of questions (i.e. develop a measuring instrument) with
the hope that we are tapping the concept, how can we be reasonably certain that we are indeed
measuring the concept we set out to do and not something else? There is no quick answer.
Researchers have attempted to assess validity in different ways, including asking questions such as "Is
there consensus among my colleagues that my attitude scale measures what it is supposed to measure?"
and "Does my measure correlate with others' measures of the `same' concept?" and "Does the behavior
expected from my measure predict the actual observed behavior?" Researchers expect the answers to
provide some evidence of a measure's validity.
What is relevant depends on the nature of the research problem and the researcher's judgment. One way
to approach this question is to organize the answer according to measure-relevant types of validity. One
widely accepted classification consists of three major types of validity: (1) content validity, (2)
criterion-related validity, and (3) construct validity.
(1) Content Validity
The content validity of a measuring instrument (the composite of measurement scales) is the extent to
which it provides adequate coverage of the investigative questions guiding the study. If the instrument
contains a representative sample of the universe of subject matter of interest, then the content validity is
good. To evaluate the content validity of an instrument, one must first agree on what dimensions and
elements constitute adequate coverage. To put it differently, content validity is a function of how well
the dimensions and elements of a concept have been delineated. Consider the concept of feminism,
which implies a person's commitment to a set of beliefs supporting full equality between men and
women in the arts, intellectual pursuits, family, work, politics, and authority relations. Does this definition
provide adequate coverage of the different dimensions of the concept? Then we have the following two
questions to measure feminism:
1. Should men and women get equal pay for equal work?
2. Should men and women share household tasks?
These two questions do not cover all the dimensions delineated earlier, so the instrument definitely
falls short of adequate content validity for measuring feminism.
A panel of persons can attest to the content validity of the instrument by judging how well it meets the
standard. For a performance test, the panel independently assesses the test items, judging each item to
be essential, useful but not essential, or not necessary in assessing performance of a relevant behavior.
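A minimal sketch of how such panel judgments can be scored, using Lawshe's content validity ratio (a standard formula the lesson itself does not name); the panel size and ratings below are invented for illustration:

```python
# Lawshe's content validity ratio (CVR) for panel judgments: an item's CVR
# rises as more panelists rate it "essential". Formula and sample ratings
# are illustrative assumptions, not from the lecture.

def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """CVR = (n_e - N/2) / (N/2); ranges from -1 to +1."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 judges; counts of "essential" ratings per item.
ratings_essential = {"item_1": 9, "item_2": 5, "item_3": 2}
for item, n_e in ratings_essential.items():
    print(item, round(content_validity_ratio(n_e, 10), 2))
# item_1 -> 0.8 (strong agreement), item_3 -> -0.6 (drop or revise)
```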
Face validity is considered a basic and very minimum index of content validity. Face validity
indicates that the items intended to measure a concept do, on the face of it, look like they measure the
concept. For example, few people would accept a measure of college students' math ability that asked
students: 2 + 2 = ? On the face of it, this is not a valid measure of college-level math ability. Face
validity is thus a subjective agreement among professionals that a scale logically appears to reflect
accurately what it is supposed to measure. When it appears evident to experts that the measure
provides adequate coverage of the concept, the measure has face validity.
(2) Criterion-Related Validity
Criterion validity uses some standard or criterion to indicate a construct accurately. The validity of an
indicator is verified by comparing it with another measure of the same construct in which research has
confidence. There are two subtypes of this kind of validity.
Concurrent validity: To have concurrent validity, an indicator must be associated with a preexisting
indicator that is judged to be valid. For example, we create a new test to measure intelligence. For it to
be concurrently valid, it should be highly associated with existing IQ tests (assuming the same definition
of intelligence is used). It means that most people who score high on the old measure should also score
high on the new one, and vice versa. The two measures may not be perfectly associated, but if they
measure the same or a similar construct, it is logical for them to yield similar results.
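A minimal sketch of this check in Python, assuming invented scores for the same eight respondents on an established IQ test and on the new test:

```python
# Checking concurrent validity: correlate scores on a new intelligence
# test with scores on an established IQ test for the same respondents.
# The data below are invented for illustration.
import numpy as np

old_iq = np.array([95, 110, 102, 125, 88, 118, 100, 131])
new_test = np.array([48, 57, 52, 66, 44, 60, 50, 69])

r = np.corrcoef(old_iq, new_test)[0, 1]
print(f"concurrent validity correlation: r = {r:.2f}")
# A high positive r supports concurrent validity; r near 0 undermines it.
```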
Predictive validity: Criterion validity whereby an indicator predicts future events that are logically
related to a construct is called predictive validity. It cannot be used for all measures. The measure and
the action predicted must be distinct from, but indicate, the same construct. Predictive measurement
validity should not be confused with prediction in hypothesis testing, where one variable predicts a
different variable in the future.
Look at the scholastic assessment tests given to candidates seeking admission in different subjects.
These are supposed to measure the scholastic aptitude of the candidates: the ability to perform in the
institution as well as in the subject. If this test has high predictive validity, then candidates who get
high test scores will subsequently do well in their subjects. If students with high scores perform the
same as students with average or low scores, then the test has low predictive validity.
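A minimal sketch of the same idea, assuming invented admission test scores and later first-year grade averages for eight candidates:

```python
# Assessing predictive validity: correlate admission test scores collected
# now with a later criterion (first-year grade average). Names and data
# are illustrative assumptions.
import numpy as np

admission_score = np.array([620, 540, 700, 580, 660, 510, 690, 600])
first_year_gpa = np.array([3.1, 2.6, 3.8, 2.9, 3.4, 2.4, 3.6, 3.0])

r = np.corrcoef(admission_score, first_year_gpa)[0, 1]
print(f"predictive validity correlation: r = {r:.2f}")
# High r: high scorers later do well; r near 0: low predictive validity.
```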
(3) Construct Validity
Construct validity is for measures with multiple indicators. It addresses the question: If the measure is
valid, do the various indicators operate in a consistent manner? It requires a definition with clearly
specified conceptual boundaries. In order to evaluate construct validity, we consider both theory and
the measuring instrument being used. This is assessed through convergent validity and discriminant
validity.
Convergent Validity: This kind of validity applies when multiple indicators converge or are associated
with one another. Convergent validity means that multiple measures of the same construct hang
together or operate in similar ways. For example, we measure the construct "education" by asking
people how much education they have completed, looking at their institutional records, and asking
people to complete a test of school-level knowledge. If the measures do not converge (i.e., people who
claim to have a college degree have no record of attending college, or those with a college degree perform no better than high
school dropouts on the test), then our test has weak convergent validity and we should not combine all
three indicators into one measure.
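A minimal sketch of this convergence check, assuming invented values for the three education indicators:

```python
# Checking convergent validity: three indicators of "education"
# (self-reported years, years from institutional records, and a knowledge
# test score) should correlate highly. Data are invented.
import numpy as np

self_report = np.array([12, 16, 10, 14, 18, 12, 16, 11])
records = np.array([12, 16, 10, 13, 18, 12, 15, 11])
test_score = np.array([55, 78, 43, 66, 85, 52, 74, 47])

corr = np.corrcoef([self_report, records, test_score])
print(np.round(corr, 2))
# If all off-diagonal correlations are high, the indicators converge and
# may reasonably be combined into one education measure.
```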
Discriminant Validity:  Also called divergent validity, discriminant validity is the opposite of
convergent validity. It means that the indicators of one construct hang together or converge, but also
diverge or are negatively associated with opposing constructs. It says that if two constructs A and B are
very different, then measures of A and B should not be associated. For example, we have 10 items that
measure political conservatism. People answer all 10 in similar ways. But we have also put 5 questions
in the same questionnaire that measure political liberalism. Our measure of conservatism has
discriminant validity if the 10 conservatism items hang together and are negatively associated with the
5 liberalism items.
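A minimal sketch of this check, assuming invented respondent totals on the conservatism and liberalism items:

```python
# Checking discriminant validity: total scores on the 10 conservatism
# items should correlate negatively with totals on the 5 liberalism items.
# The respondent-level totals below are invented.
import numpy as np

conservatism_total = np.array([42, 18, 35, 27, 45, 22, 38, 30])
liberalism_total = np.array([8, 21, 11, 15, 6, 19, 10, 14])

r = np.corrcoef(conservatism_total, liberalism_total)[0, 1]
print(f"between-construct correlation: r = {r:.2f}")
# A clearly negative r supports discriminant (divergent) validity.
```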
Reliability
The reliability of a measure indicates the extent to which it is without bias (error free) and hence ensures
consistent measurement across time and across the various items in the instrument. In other words, the
reliability of a measure is an indication of the stability and consistency with which the instrument
measures the concept and helps to assess the "goodness" of a measure.
Stability of Measures
The ability of the measure to remain the same over time, despite uncontrollable testing conditions or
the state of the respondents themselves, is indicative of its stability and low vulnerability to changes in
the situation. This attests to its "goodness" because the concept is stably measured, no matter when it is
done. Two tests of stability are test-retest reliability and parallel-form reliability.
(1) Test-retest Reliability: Test-retest method of determining reliability involves administering the
same scale to the same respondents at two separate times to test for stability. If the measure is stable
over time, the test, administered under the same conditions each time, should obtain similar results. For
example, suppose a researcher measures job satisfaction and finds that 64 percent of the population is
satisfied with their jobs. If the study is repeated a few weeks later under similar conditions, and the
researcher again finds that 64 percent of the population is satisfied with their jobs, it appears that the
measure has repeatability. A high stability correlation, or consistency between the two measures at
time 1 and at time 2, indicates a high degree of reliability. This was at the aggregate level; the same
exercise can be applied at the individual level. When the measuring instrument produces unpredictable
results from one testing to the next, the results are said to be unreliable because of error in measurement.
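A minimal sketch of the individual-level version of this check, assuming invented satisfaction scores at the two administrations:

```python
# Test-retest reliability at the individual level: correlate each
# respondent's job-satisfaction score at time 1 with the same respondent's
# score a few weeks later. Data are invented.
import numpy as np

time_1 = np.array([4, 2, 5, 3, 4, 1, 5, 3])  # 1-5 satisfaction scores
time_2 = np.array([4, 2, 4, 3, 5, 1, 5, 2])

r = np.corrcoef(time_1, time_2)[0, 1]
print(f"test-retest (stability) coefficient: r = {r:.2f}")
# A high r indicates the measure yields consistent results over time.
```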
There are two problems with measures of test-retest reliability that are common to all longitudinal
studies. Firstly, the first measure may sensitize the respondents to their participation in a research
project and subsequently influence the results of the second measure. Further if the time between the
measures is long, there may be attitude change or other maturation of the subjects. Thus it is possible
for a reliable measure to indicate low or moderate correlation between the first and the second
administration, but this low correlation may be due to an attitude change over time rather than to lack of
reliability.
(2) Parallel-Form Reliability: When responses on two comparable sets of measures tapping the same
construct are highly correlated, we have parallel-form reliability. It is also called equivalent-form
reliability. Both forms have similar items and the same response format, the only changes being the
wording and the order or sequence of the questions. What we try to establish here is the error variability
resulting from wording and ordering of the questions. If two such comparable forms are highly
correlated, we may be fairly certain that the measures are reasonably reliable, with minimal error
variance caused by wording, ordering, or other factors.
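A minimal sketch, assuming invented total scores on two comparable forms of the same scale:

```python
# Parallel-form (equivalent-form) reliability: administer two comparable
# forms of the same scale and correlate the total scores. The two forms'
# totals below are invented.
import numpy as np

form_a = np.array([18, 25, 12, 22, 30, 16, 27, 20])
form_b = np.array([17, 26, 13, 21, 29, 15, 28, 19])

r = np.corrcoef(form_a, form_b)[0, 1]
print(f"parallel-form reliability: r = {r:.2f}")
# High r: little error variance due to wording or ordering differences.
```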
Internal Consistency of Measures
Internal consistency of measures is indicative of the homogeneity of the items in the measure that tap
the construct. In other words, the items should 'hang together as a set' and be capable of independently
measuring the same concept so that the respondents attach the same overall meaning to each of the
items. This can be seen by examining if the items and the subsets of items in the measuring instrument
are highly correlated. Consistency can be examined through the inter-item consistency reliability and
split-half reliability.
(1) Inter-Item Consistency Reliability: This is a test of the consistency of respondents' answers to all the
items in a measure. To the degree that items are independent measures of the same concept, they will
be correlated with one another.
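The coefficient most often reported for inter-item consistency is Cronbach's alpha; the lesson does not name it, so treat the following sketch, with invented ratings, as a standard supplement:

```python
# Cronbach's alpha for inter-item consistency. Rows are respondents,
# columns are items on the same scale; the ratings are invented.
import numpy as np

items = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
])

k = items.shape[1]                          # number of items
item_vars = items.var(axis=0, ddof=1)       # variance of each item
total_var = items.sum(axis=1).var(ddof=1)   # variance of summed scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")    # values near 1 = high consistency
```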
(2) Split-Half Reliability: Split-half reliability reflects the correlation between the two halves of an
instrument. The estimates could vary depending on how the items in the measure are split into two
halves. The split-half technique is the most basic method for checking internal consistency when
measures contain a large number of items. In the split-half method the researcher may take the results
obtained from one half of the scale items (e.g., odd-numbered items) and check them against the
results from the other half of the items (e.g., even-numbered items). A high correlation tells us there is
similarity (or homogeneity) among the items.
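A minimal sketch, assuming invented ratings on eight items; the Spearman-Brown correction used to step the half-test correlation up to full-test length is a standard adjustment the lesson does not name:

```python
# Split-half reliability: correlate odd-numbered with even-numbered item
# totals, then adjust the half-test correlation with the Spearman-Brown
# prophecy formula. Rows are respondents, columns are 8 items; invented.
import numpy as np

items = np.array([
    [4, 4, 5, 4, 4, 5, 4, 4],
    [2, 3, 2, 2, 3, 2, 2, 3],
    [5, 5, 4, 5, 5, 4, 5, 5],
    [3, 2, 3, 3, 2, 3, 3, 2],
    [4, 5, 4, 4, 5, 4, 4, 5],
])

odd_half = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8
r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)       # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, corrected reliability = {r_full:.2f}")
```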
It is important to note that reliability is a necessary but not a sufficient condition for the goodness of
a measure. For example, one could reliably measure a concept, establishing high stability and
consistency, but it may not be the concept that one had set out to measure. Validity ensures the ability
of a scale to measure the intended concept.
Sensitivity
The sensitivity of a scale is an important measurement concept, particularly when changes in attitudes or
other hypothetical constructs are under investigation. Sensitivity refers to an instrument's ability to
accurately measure variability in stimuli or responses. A dichotomous response category, such as
"agree or disagree," does not allow the recording of subtle attitude changes. A more sensitive measure,
with numerous items on the scale, may be needed. For example adding "strongly agree," "mildly
agree," "neither agree nor disagree," "mildly disagree," and "strongly disagree" as categories increases a
scale's sensitivity.
The sensitivity of a scale based on a single question or item can also be increased by adding more
questions or items. In other words, because index measures allow for a greater range of possible
scores, they are more sensitive than a single item.
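A minimal sketch of this range argument, assuming five hypothetical 5-point items:

```python
# Why index measures are more sensitive: a single agree/disagree item
# allows only 2 possible scores, while five 5-point Likert items summed
# into an index allow 21 distinct totals (5 through 25). Item counts are
# illustrative assumptions.
n_items, n_categories = 5, 5

single_dichotomous = 2                               # "agree" or "disagree"
index_scores = n_items * (n_categories - 1) + 1      # 21 possible totals
print(single_dichotomous, "vs", index_scores, "possible score values")
```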
Practicality: The scientific requirements of a project call for the measurement process to be reliable
and valid, while the operational requirements call for it to be practical. Practicality has been defined as
economy, convenience, and interpretability.