Chapter 8 Equating Tests. Part 1: Theory

In this session, we will learn about
- why we need to equate tests
- frequently used methods to equate tests

8.1 Why equating is needed to align tests

When a student scores high on a test, we do not know whether it is because the student is of high ability or because the test is easy, unless there is some external information that helps us to make that judgement. This is also reflected in the mathematical expression for the probability of success on an item, where the probability is a function of \(\theta-\delta\). If we add any constant to both \(\delta\) and \(\theta\), the difference between \(\theta\) and \(\delta\) remains the same. That is, there is no unique solution for the values of \(\delta\) and \(\theta\). Consequently, we set an arbitrary zero on the \(\theta\) scale to fix the values of \(\delta\) and \(\theta\). Typically, we set constraint = "items" (mean of item difficulties = 0) or constraint = "cases" (mean of student abilities = 0). Since the zero on the \(\theta\) scale is arbitrarily set, the estimated \(\theta\)s and \(\delta\)s are not directly comparable from one test to another.
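In TAM, this choice is made through the constraint argument of tam.mml. The following is a minimal sketch, where resp is an assumed scored (0/1) response matrix:

library(TAM)
# Fix the scale origin by constraining the mean item difficulty to zero
mod_items <- tam.mml(resp, constraint = "items")
# Or fix the origin by constraining the mean ability to zero
mod_cases <- tam.mml(resp, constraint = "cases")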

8.2 Methods for aligning two different tests on the same \(\theta\) scale

First, if two tests have completely different items administered to two completely different groups of people, we will not be able to equate the tests. If the scores on one test are higher than the scores on the other test, we cannot tell whether one group of students is of higher ability or whether one test has easier items. So the first condition for equating tests is that there must be something in common between the two tests: either some items appear in both tests, or some students took both tests. If the two tests have common items, we can make sure the common items have the same item parameters across the two tests and scale the other items relative to the common items. If the two tests have the same candidates, we can assume that the candidates have the same abilities while taking the two tests, and thus make a judgement about the relative difficulties of the two tests.

The following figure shows two tests with common items, where different students take each of the tests:

Figure 8.1: Two tests with common items

There are at least three frequently used methods for equating tests with common items: the Concurrent Scaling Method, the Shift (and Scale) Method, and the Anchoring Method.

8.3 Concurrent Scaling Method

As the name suggests, the item response data from both tests are combined into one single file and scaled together. That is, the combined data shown in Figure 8.1 are analysed as one single data set. For Test 1 students, there will be responses for the Test 1 items and the common items, but missing responses for the Test 2 unique items. For Test 2 students, there will be missing responses for the Test 1 items that are not common items. As IRT can handle missing responses with ease, we can carry out a single IRT calibration. There will be one set of item parameters for the common items across both tests. Test 1 items and Test 2 items will each be calibrated relative to the common items, so items from both tests will be on the same scale.
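As a minimal sketch, assuming the scored responses from both tests have already been merged by item name into a single data frame resp_all (with NA for the items a student did not take), concurrent calibration is a single call:

library(TAM)
# One joint calibration; the estimation handles the missing responses
mod_concurrent <- tam.mml(resp_all)
mod_concurrent$xsi   # item difficulties for both tests, on one scale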

This method is typically useful when we have two tests that are “concurrent,” such as two different forms of a test administered at the same time. The two tests have equal status: neither is more important, and neither is a predecessor of the other (such as a test administered a year earlier). That is, we do not have a set of item parameters that was calibrated some time ago and that we want to retain. When we have an old test and a new test, and we want to place the new test on the same scale as the old test, the Shift (and Scale) Method or the Anchoring Method will be a better choice.

8.4 Shift (and scale) Method

In the case of placing the results of a new test onto the same scale as an old test, we carry out separate calibrations of the two tests. The old test will already have a set of calibrated item parameters, and the new test will have another set, so for the common items there will be two sets of item parameters. Calculate the mean difficulty of the common items in the old calibration and in the new calibration, say \(\mu_{old}\) and \(\mu_{new}\). To make the mean of the new set equal to the mean of the old set, we need to add \(d=\mu_{old}-\mu_{new}\) to the new set of parameters. All estimated parameters in the new calibration need to have \(d\) added to them: all item parameters and all person ability parameters. This is called the Shift Method.
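A minimal sketch of the Shift Method in R, assuming vectors delta_common_old and delta_common_new hold the common items' difficulties from the two calibrations, and delta_new and theta_new hold all item and ability estimates from the new calibration:

# d = mu_old - mu_new, the constant shift
d <- mean(delta_common_old) - mean(delta_common_new)

delta_new_shifted <- delta_new + d   # shift all new item difficulties
theta_new_shifted <- theta_new + d   # shift all new ability estimates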

For the Shift Method, we only apply a constant shift to the parameters; we do not adjust for the scale (spread) of the parameters. If the spreads of the two sets of common item parameters are very different, an adjustment can be applied to the scale as well, in addition to the shift. To adjust for the scale factor, calculate the standard deviation of the common item parameters for each test, say \(\sigma_{old}\) and \(\sigma_{new}\). If \(\sigma_{old}\) is quite different from \(\sigma_{new}\), the two tests are not spreading the students out to the same extent. An adjustment using both the shift and the scale factor can be made in the following way. If \(\delta_{new}\) is an item parameter for the new test, compute

\[\begin{equation} \delta_{adjusted\_new} = \frac{\delta_{new}-\mu_{new}}{\sigma_{new}}\times\sigma_{old}+\mu_{old}\tag{8.1} \end{equation}\]

Similarly, to adjust for the ability parameters, compute

\[\begin{equation} \theta_{adjusted\_new} = \frac{\theta_{new}-\mu_{new}}{\sigma_{new}}\times\sigma_{old}+\mu_{old}\tag{8.2} \end{equation}\]

The parameters \(\delta_{adjusted\_new}\) and \(\theta_{adjusted\_new}\) are now placed on the same scale as the parameters on the old test.
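Using the same assumed vectors as in the Shift Method sketch above, Equations (8.1) and (8.2) can be applied as follows:

mu_old <- mean(delta_common_old);  sigma_old <- sd(delta_common_old)
mu_new <- mean(delta_common_new);  sigma_new <- sd(delta_common_new)

# Standardise against the new common items, then rescale to the old metric
delta_adjusted_new <- (delta_new - mu_new) / sigma_new * sigma_old + mu_old
theta_adjusted_new <- (theta_new - mu_new) / sigma_new * sigma_old + mu_old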

8.5 Anchoring Method

A variation of the Shift Method is the Anchoring Method. In calibrating the item parameters of the new test, we fix the parameters of the common items at their values from the old test. In this way, the parameters of the common items for the new test are not estimated but fixed at the old parameter values. Compared with the Shift Method, the Anchoring Method allows fewer degrees of freedom, since it restricts each common item parameter to a fixed value. In contrast, the Shift Method allows the common item parameters to take new values, provided the mean of the common item parameters for the new set is the same as the mean of the common items for the old set.
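In TAM, anchoring can be done through the xsi.fixed argument of tam.mml, which takes a two-column matrix of parameter indices and fixed values. A minimal sketch, assuming the common items occupy the first 29 columns of the new test's scored response matrix resp_new and delta_common_old holds their difficulties from the old calibration:

library(TAM)
# Column 1: position of the item parameter; column 2: value to fix it at
anchor <- cbind(1:29, delta_common_old)
mod_anchored <- tam.mml(resp_new, xsi.fixed = anchor)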

8.6 Exercise 1

The data files for this exercise come from two numeracy tests with overlapping items. The file N1.csv contains 50 items, of which the first 21 are unique to Test 1 and the remaining 29 are common items shared with Test 2 (N2.csv). Test 2 also has 50 items, of which the first 29 are the common items with Test 1. The keys for these two tests are in a specification file that you can download.

Install the R package plyr so that we can use the rbind.fill function to merge the two data files easily.

Set your working directory, and use the following code to read in the data files and to merge the two files.

# Read the raw responses as character data (each file has 50 item columns)
N1 <- read.csv("N1.csv", stringsAsFactors = FALSE, colClasses = rep("character", 50))
N2 <- read.csv("N2.csv", stringsAsFactors = FALSE, colClasses = rep("character", 50))

# Merge the two data frames by column name; unshared items become NA
library(plyr)
resp_raw <- rbind.fill(N1, N2)

# Answer key for the items in the merged file, one character per item
key <- "32213211431123114141111143411111211312111111323121414324231413131111242"
key <- unlist(strsplit(key, ""))

The rbind.fill function merges the two data frames by column name: columns with the same name in the two data files are merged into one column. The merged file has the structure depicted in Figure 8.1, with NAs (not available) for the items that a student did not attempt, such as the Test 1 unique items for Test 2 students and the Test 2 unique items for Test 1 students.

Score the merged data file using the CTT score function.
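A minimal sketch, assuming the CTT score function and the resp_raw and key objects created above:

library(CTT)
scored <- score(resp_raw, key, output.scored = TRUE)
resp <- scored$scored   # matrix of scored (0/1) responses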

Carry out a CTT analysis on the scored data first. You will find that there are issues because, by default, the CTT itemAnalysis function deletes any record that contains NA (missing values). Since every record contains some NAs, all records are deleted.

Use the NA.Delete option when calling itemAnalysis, for example:
IA <- itemAnalysis(resp, NA.Delete = FALSE)

With the NA.Delete = FALSE option, the CTT package converts all NAs to zero. This is not appropriate, since students never attempted those questions and should not receive a score of zero for them. Nevertheless, some results can be produced by the CTT analysis. We note here that when students take different sets of items, CTT is no longer appropriate; IRT should be used instead, since IRT can easily estimate item and person parameters with incomplete data sets.

Run an IRT analysis on the merged data set. In particular, run tam.ctt and produce a table of distractor analyses. Use the R Markdown code for Appendix A in Section 7.2 as a reference.
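A minimal sketch, assuming the scored matrix resp and the raw responses resp_raw from the steps above:

library(TAM)
mod <- tam.mml(resp)              # concurrent calibration of the merged data
wle <- tam.wle(mod)               # WLE ability estimates

# Distractor analysis of the raw responses, broken down by ability
ctt_table <- tam.ctt(resp_raw, wlescore = wle$theta)
head(ctt_table)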

8.7 Homework

Download files test1.csv and test2.csv. These two files contain common items, as shown by common item names in the files. The files contain scored item responses. Merge the two files and run a concurrent IRT analysis. Use R Markdown to produce a report of item analysis results, including Wright Map and ICC plots.