Chapter 12 Partial Credit Models - Part II

In this session, we will learn about
- Considerations when scoring partial credit items
- Partial credit models and two-parameter models

12.1 The maximum (highest) score of an item

We frequently see that items in a test are given different maximum marks. For example, one item may be marked out of 5, while another is marked out of 2. What factors determine the maximum score an item should get? When we ask test writers about the criteria that guide them in assigning a maximum score to an item, we frequently get the following responses: “It depends on the item difficulty,” or, “It depends on how much time students need to spend answering this question.” Sometimes, the test writers look at the number of easily discernible answer categories and decide from that how many score points an item should be assigned.

We will say upfront that the maximum score assigned to an item should not depend on the item difficulty. This is particularly the case for multiple-choice items. When an item is difficult, there is often a lot of guessing, so a correct answer is less likely to reflect high ability and more likely to reflect luck.

Below, we discuss factors relating to item scores.

12.2 Item Weight, Item Information and Item Discrimination

Suppose Item 1 is scored out of a maximum of 5, while Item 2 is scored out of a maximum of 1. Then Item 1 has five times the weight of Item 2 in the overall test score. That is, if you obtain the correct answer for Item 1, you get 5 points, while getting Item 2 correct adds only 1 score point to your test score. From the test taker’s point of view, Item 1 is more “important” than Item 2, as there is an opportunity of scoring five points. From the test writer’s point of view, Item 1 should also be “important” in distinguishing ability levels among students, since a higher test score should reflect a higher ability. This notion of “importance” can be translated into “item information.” When an item has more “item information,” it provides more power in separating low and high ability students. That is, an item with more information is a more discriminating item, since discrimination is a measure of the extent of separation of ability levels.
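The weighting described above can be illustrated with a small sketch. The item names and the helper functions here are hypothetical and used only for illustration; the maximum scores of 5 and 1 are the ones from the example in the text.

```python
# Maximum scores for the two items in the example (item names are hypothetical).
max_scores = {"item1": 5, "item2": 1}

def total_score(item_scores):
    """Sum the points a test taker earned across all items."""
    return sum(item_scores.values())

def weight_share(max_scores):
    """Fraction of the maximum test score that each item contributes."""
    total = sum(max_scores.values())
    return {item: m / total for item, m in max_scores.items()}

# Two hypothetical test takers, each answering exactly one item fully correctly.
only_item1 = {"item1": 5, "item2": 0}
only_item2 = {"item1": 0, "item2": 1}

print(total_score(only_item1), total_score(only_item2))
print(weight_share(max_scores))
```

A test taker who succeeds only on Item 1 ends up well ahead of one who succeeds only on Item 2, which is reasonable only if Item 1 really does separate ability levels better.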

Consequently, we can turn the discussion around and say that items that are more discriminating should have higher maximum scores, or higher item weight in the test.

Conceptually, item discrimination and item difficulty are unrelated. That is, maximum item score should not be dependent on item difficulty, but it should be dependent on item discrimination.

A highly discriminating item is also an “important” item since it relates more closely to the construct being measured. So, in considering the maximum score for an item, we can think of the “importance” of the item in terms of the construct. As an example, when we measure the propensity for developing a disease such as skin cancer, we may ask questions about skin colour, hair colour, eye colour, family history, etc. The factors that relate highly to skin cancer should get higher scores, while factors that are moderately related to skin cancer should get lower scores. We do not look at the number of categories into which we can separate hair colour to determine the number of score points. Instead, we need to think of how well hair colour predicts skin cancer.

12.3 Codes and Scores

In discussing response categories and scoring, we need to mention the difference between “codes” and “scores.” We use the term “codes” to denote the labelling of different response categories. While “codes” may look like 0, 1, 2, 3 etc., these codes are not necessarily the scores for each response category. In fact, it is best to have a separate set of “scores” matching the set of “codes.” For example, we may use 0, 1, 2, 3 to code eye colour for “black,” “brown,” “green” and “blue.” But the scores for these four codes might be “0,” “0,” “1,” “2,” or, “0,” “1,” “1,” “2,” or other combinations, depending on how strongly each eye colour is related to skin cancer. Nevertheless, when we are collecting data, we can use many response category “codes” to capture the data, and decide on the scores based on what the data tell us. That is, we do not assign scores to response categories a priori.
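One way to keep codes and scores apart in practice is to store the recorded codes unchanged and apply a separate code-to-score mapping afterwards. The sketch below uses the eye colour codes and the two candidate scorings from the text; the variable names, the function and the recorded responses are made up for illustration.

```python
# Codes only label the response categories; they carry no scoring meaning.
EYE_COLOUR_CODES = {0: "black", 1: "brown", 2: "green", 3: "blue"}

# Two candidate code-to-score mappings mentioned in the text.
SCORING_A = {0: 0, 1: 0, 2: 1, 3: 2}
SCORING_B = {0: 0, 1: 1, 2: 1, 3: 2}

def score_responses(codes, scoring):
    """Translate recorded codes into scores under a chosen scoring scheme."""
    return [scoring[c] for c in codes]

# Hypothetical recorded codes for five respondents.
recorded = [3, 1, 0, 2, 1]

scores_a = score_responses(recorded, SCORING_A)
scores_b = score_responses(recorded, SCORING_B)
print(scores_a)
print(scores_b)
```

Because the raw codes are never overwritten, the scoring can be revised after looking at the data without recoding the responses.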

12.4 Two-parameter Model (2PL)

In fact, the two-parameter model does precisely that: it estimates the weight of each item from the data. Some two-parameter models estimate item weights at the item level, while others estimate them at the response category level. For 2PL, there is a clear separation between “codes” and “scores.” In contrast, the partial credit model (PCM) assigns scores a priori, and the scores are not estimated. A further difference between 2PL models and PCM is that the scores for PCM are integers, while the scores for 2PL item categories can be any real number.
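The contrast between fixed integer scores and estimated real-valued scores can be sketched numerically. The parameterization below is an assumption made for illustration, not one taken from this chapter: the probability of category k is taken to be proportional to exp(score_k * theta - c_k), where c_k is the cumulative sum of the step parameters up to category k. The step parameters and the "estimated" category scores are invented values, not the output of any fitted model.

```python
import math

def category_probs(theta, scores, intercepts):
    """Category probabilities with P(k) proportional to exp(scores[k]*theta - intercepts[k])."""
    logits = [s * theta - c for s, c in zip(scores, intercepts)]
    m = max(logits)  # subtract the maximum for numerical stability
    expo = [math.exp(l - m) for l in logits]
    total = sum(expo)
    return [e / total for e in expo]

def cumulative_steps(deltas):
    """Cumulative sums of step parameters, with 0 for the lowest category."""
    out = [0.0]
    for d in deltas:
        out.append(out[-1] + d)
    return out

deltas = [-0.5, 0.5]                 # hypothetical step parameters
intercepts = cumulative_steps(deltas)

# PCM: category scores are fixed a priori as the integers 0, 1, 2.
pcm = category_probs(0.0, [0, 1, 2], intercepts)

# 2PL-type model: category scores may be any real numbers (invented values here).
two_pl = category_probs(0.0, [0.0, 0.4, 1.7], intercepts)

print(pcm)
print(two_pl)
```

The same function serves both cases; the only difference is whether the score vector is fixed in advance or treated as something to be estimated.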

This chapter is currently incomplete. More will be added in due course.