Reliability and Validity: Concepts in Tension

Kathleen Yancey, in her oft-cited essay “Looking Back as We Look Forward: Historicizing Writing Assessment,” argues that “writing assessment is commonly understood as an exercise in balancing the twin concepts of validity and reliability” (135). And it certainly seems true that these concepts are, if not exactly opposed, at least in tension with each other. Validity, as Yancey understands it, is “measuring what you intend to measure,” while reliability means that “you can measure it consistently” (135). Not surprisingly, rhetoricians tend to favor the former, which relies more on arguments, while psychometricians tend to favor the latter, which relies more on numbers.

But the picture is perhaps more complex than Professor Yancey’s essay would suggest. Roberta Camp, in “Changing the Model for the Direct Assessment of Writing,” agrees that validity and reliability perform a balancing act in assessment, pointing to the popularity of multiple choice tests (high reliability) with psychometricians and direct writing samples (high validity) with writing teachers, and to the common compromise of merging the two in assessments (103-106). But she also argues that validity, as a construct, can only be fully realized when based on a rich theory of writing.

Lorrie Shepard further complicates the traditional view of validity in her essay “The Centrality of Test Use and Consequences for Test Validity,” arguing that validity cannot and should not be considered apart from the consequences of an assessment’s use, even when the assessment is used as intended (5). Michael Neal agrees, seeing validity as a combination of accuracy (which the Yancey definition addresses, to some degree) and appropriateness (is this the right assessment for this situation?).

Reliability, too, is a multifaceted thing, as is made clear in Cherry and Meyer’s “Reliability Issues in Holistic Assessment.” The authors remind us that reliability involves the concepts of measurement error, analysis of variance, and context. It is certainly not simply a matter of inter-rater reliability, as is often understood in the rhet/comp community. Ultimately, the authors argue that “strictly speaking, ‘reliability’ refers not to a characteristic (or set of characteristics) of a particular test but to the confidence that test users can place in scores yielded by the test as a basis for making certain kinds of decisions” (53). So here, interestingly, reliability seems to bleed a little into Shepard’s concept of consequential validity, and everything points back to my previous post: assessment is never neutral. There is always something at stake besides measuring writing, so we must measure with care.