Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA, April, 2006.

Rasch scoring complications arise when calibrating a single-prompt writing assessment scored on six polytomous traits in which sparse data are present. This paper compares empirical calibration results from the Andrich Rating Scale and Masters Partial Credit models on field test data from one state in three grades. The impact of calibration model and various data treatments on item difficulty parameters and student ability estimates is examined. Results indicate that if data are missing in score categories for some items, the Rating Scale model is a viable alternative to the Partial Credit model, because the two models show few empirical differences in person estimation results.

Abstract

The federal No Child Left Behind Act of 2001 (2002) requires states to assess student performance in English language arts and mathematics yearly in grades 3 through 8 and at least once in high school. As mandatory state-wide testing increases, so does concern about the amount of time students spend taking tests in lieu of receiving instruction. Assessment of writing is particularly challenging because a student needs much more time to respond to one essay prompt than to one multiple-choice reading item. However, many teachers prefer that states assess writing by evaluating student products from authentic writing activities rather than through multiple-choice items. Therefore, some states have chosen to assess student writing on a single essay prompt. In many of these cases, several traits of the student's response are scored and treated as separate items. A common scoring framework is the six-trait analytic writing rubric, developed by Ruth Culham of Northwest Regional Educational Laboratory (2002), in which the six traits are idea/content, organization, voice, word choice, sentence fluency, and conventions. Under the six-trait rubric, each trait is scored from one to six.

Scoring a single-prompt writing test with the six-trait analytic rubric introduces issues related to sparse data, potential instability of item difficulty estimates over time, and possible violations of local independence of items (traits). In addition, the estimate instability may lead to difficulties in form-to-form equating. This paper addresses Rasch calibration and scoring complications that arise with sparse data. Sparse data are of particular concern because they can prevent Rasch models from maintaining the structure of categories in each trait. Several factors can contribute to sparse data. Treatment of condition codes, which are assigned to responses that are off-topic or in a different language, may influence the unidimensionality of the measure. Where more than one rater scores each trait, decisions about how adjacent and discrepant score points are resolved also matter: using more than one rater increases the number of score levels and, ultimately, the chance that any given score level has fewer than the number of observations needed for calibration. In addition, the problem of sparse data is often aggravated by the necessity of calibrating items on samples of students rather than entire populations in order to meet scoring and reporting schedules.
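The effect of rater resolution on the number of score levels can be illustrated with a minimal sketch. It assumes, purely for illustration, that two ratings of 1 through 6 are resolved by summing them; the rating pairs, the function names, and the min_count cutoff are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import product

RATINGS = range(1, 7)  # each rater assigns 1 through 6 on a trait


def summed_levels():
    """All score levels that can occur when two ratings are summed."""
    return sorted({a + b for a, b in product(RATINGS, RATINGS)})


def sparse_levels(scores, levels, min_count):
    """Score levels observed fewer than min_count times (including never)."""
    counts = Counter(scores)
    return [lvl for lvl in levels if counts[lvl] < min_count]


# Hypothetical rating pairs for one trait (made-up data, not study data).
pairs = [(3, 3), (3, 4), (4, 4), (4, 4), (2, 3), (5, 4), (3, 3), (4, 5)]
summed = [a + b for a, b in pairs]

# One rater yields 6 possible levels; summing two raters yields 11 (2..12).
print(len(list(RATINGS)), len(summed_levels()))

# min_count=2 is an arbitrary illustrative cutoff, not a calibration rule.
print(sparse_levels(summed, summed_levels(), min_count=2))
```

With the same number of students spread over nearly twice as many levels, more levels fall below any fixed minimum count, which is the mechanism described above.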

Perhaps the fundamental challenge of calibrating single-prompt writing assessments with sparse data is maintaining theoretical assumptions while ensuring accurate estimation of step difficulties across the score range. Unfortunately, sparse data may leave practitioners with only two options: (a) use a model and data treatment that can account for categories with very few observations while sacrificing theoretical assumptions, or (b) maintain theoretical assumptions by using a model or data treatment that cannot provide estimates for all score points because sparse data cannot be accommodated. Some Rasch models for calibration of polytomous items may be able to handle sparse data better than others. Of particular interest are the Andrich Rating Scale model (Andrich, 1978) and the Masters Partial Credit model (Masters, 1982).


In the case of the single-prompt writing test, the Rating Scale model uses information from all traits to determine thresholds between categories, which are then held constant across all traits. Thus, it may be better able to provide estimates of thresholds for all score categories, even categories that have limited or no data. There may be little theoretical basis, however, for assuming that the category structure of score points is the same across all scored traits in a single-prompt writing test.
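The structural difference between the two models can be sketched as follows. The sketch uses the standard textbook formulation, in which the log-odds of scoring in category k rather than k-1 is the person ability minus a step difficulty. Under the Rating Scale model that step difficulty is an item (trait) location plus a threshold shared by all traits; under the Partial Credit model each trait has its own steps. The function names and all numeric values are hypothetical, and the notation may differ from that in the complete paper.

```python
import math


def category_probs(theta, steps):
    """Category probabilities for one person on one polytomous Rasch item.

    steps[k-1] is the step difficulty for moving from category k-1 to k,
    so an item with m steps has m + 1 score categories.
    """
    # Log-numerator for category x is the sum of (theta - step) up to x;
    # the lowest category has log-numerator 0 by convention.
    log_num = [0.0]
    for step in steps:
        log_num.append(log_num[-1] + (theta - step))
    peak = max(log_num)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]


def rating_scale_probs(theta, item_difficulty, shared_thresholds):
    """Rating Scale model: one threshold set shared by every trait."""
    return category_probs(theta, [item_difficulty + t for t in shared_thresholds])


def partial_credit_probs(theta, item_steps):
    """Partial Credit model: each trait has its own step difficulties."""
    return category_probs(theta, item_steps)


# Hypothetical values: six score categories imply five thresholds.
shared = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(rating_scale_probs(0.0, item_difficulty=0.3, shared_thresholds=shared))
print(partial_credit_probs(0.0, item_steps=[-1.5, -0.9, 0.4, 1.1, 2.6]))
```

Because the shared thresholds are estimated from all traits pooled together, a category that is empty for one trait can still receive a threshold estimate under the Rating Scale model, whereas the Partial Credit model needs observations in every category of every trait.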

This study empirically examines and compares different methods for dealing with sparse data when calibrating a single-prompt writing test. Of particular interest were treatments of condition codes, resolution and treatment of multiple ratings, and choice of calibration model. Results of each calibration were evaluated for Rasch diagnostic and model fit criteria as compiled by Jackson and Popovich (2003) based on Waugh and Addison (1998), Linacre (1994) and Wright and Linacre (1994). Results were then examined for the impact of calibration model and data treatments on item difficulty parameters and student ability estimates.
