Chapter 13 Estimating Population Characteristics - Part I: Issues and Solutions

In this session, we will learn about
- The differences between estimating individuals and groups.
- The concepts behind Joint Maximum Likelihood (JML) and Marginal Maximum Likelihood (MML) estimation.
- Issues arising from the JML estimation method.

13.1 Purposes of assessments

The purposes of assessments differ. Some assessments aim to provide an estimate of the attainment level of each student. Other assessments aim to provide the characteristics of the distribution of student attainments for a country, or for a subgroup. Depending on the purpose of the assessment, the analysis method differs. In this session, we will focus on the differences between assessing individuals and assessing groups.

13.2 Assessment design - sample-based or census

If we are interested in the characteristics of the abilities of students in a group (e.g., at the country level, or at the region/state level), we can sample students in the country/region, and use the sample to make inferences about the population. If we are interested in providing individual students with their attainment levels, such as providing certification for entrance into schools/universities, we clearly need to test every student. It is costly to test everyone, so if we want assessments to inform policy, for example, we can use a sample of students to give us an idea of the group characteristics. This is the first difference between assessing a population and assessing individuals. For most large-scale international assessments, the focus is on the population, and not on individuals, so sampling is used for the assessments.

A second difference between assessing individuals and assessing groups is that the item response models used are different, which is the focus of this chapter.

13.3 Sampling error and Measurement error

Sample-based assessments will incur some inaccuracy due to variation across the possible samples that could be chosen. This inaccuracy is referred to as sampling error. The computation of sampling errors adds complexity to the computation of standard errors associated with estimates. We will discuss this in later chapters. In addition to sampling errors, we also have measurement errors, which refer to the inaccuracy in the estimates of individual student ability measures. The final standard error computed for a population estimate will include both the sampling error and the measurement error. In this chapter, we will focus on the effect of measurement error on the estimation of population characteristics.

13.4 How to estimate population statistics

If we want to estimate population statistics such as the mean score and the variance of ability measures in a population, we can take a sample of students and compute sample mean and sample variance. We then make inferences from the sample statistics to the population statistics. This is a common approach in making statistical inferences in general. However, in the case of measuring students, the measure at each individual student level is not precise and is subject to measurement error. In contrast, when we measure height or weight, we have very precise instruments that have very small error margins. In testing students, we take a (rather small) sample of the students’ capabilities, as testing time is always limited, so the measurement error is somewhat larger than measurement errors associated with physical measurements we commonly encounter.

13.5 Effect of Measurement Error

When individuals’ ability scores have measurement errors, the aggregated scores will still give an unbiased estimate of the population mean score. However, the measurement errors will inflate the estimate of the population variance. For the mean score, inaccuracies occur in both directions: over-estimates and under-estimates of the true score, so the measurement errors tend to cancel out, leaving the mean score an unbiased estimate. However, the additional variation in student scores due to measurement errors will make the variance larger: the larger the measurement errors, the more inflated the variance.
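Before the IRT exercise below, this effect can be seen with a minimal sketch (an illustrative example, not part of the chapter's exercises): we add normally distributed "measurement error" to a set of true scores and compare the means and variances. The object names (true_score, observed) and the error standard deviation of 0.5 are assumptions for illustration only.

set.seed(1)
true_score <- rnorm(10000)                      # true scores: mean 0, variance 1
observed <- true_score + rnorm(10000, sd=0.5)   # add measurement error with SD 0.5
mean(true_score); mean(observed)                # both means remain close to 0
var(true_score); var(observed)                  # observed variance inflated to about 1 + 0.5^2 = 1.25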

13.6 Exercise 1 - Checking the effect of measurement error

To see the effect of measurement error on population estimates, we will do a simulation exercise in R. We simulate item response data that fit the Rasch model, estimate each student’s ability, and calculate the mean ability and the variance of the abilities.

In Section 10.2, we introduced R code for simulation. A function is created to generate Rasch item responses given the number of students (N) and the number of items (I). After generating the data, we fit the Rasch model using the tam.jml (joint maximum likelihood, JML) estimation method and compute student ability estimates. Finally, we compute the mean and variance of the student ability estimates. The R code is as follows:

library(TAM)
generateRasch <- function(N,I){
  theta <- rnorm( N ) # student abilities from normal distribution with mean 0 and var 1
  p1 <- plogis( outer( theta , seq( -2 , 2 , len=I ) , "-" ) )  #item diff from -2 to 2
  resp <- 1 * ( p1 > matrix( runif( N*I ) , nrow=N , ncol=I ) )  # item responses
  colnames(resp) <- paste("I" , 1:I, sep="")
  return(list(resp=resp,theta=theta))
}
#Generate Rasch item responses for 10000 students and 30 items
generateData <- generateRasch(10000,30)
resp <- generateData$resp

#Fit item response data to the Rasch model and estimate student abilities
mod1 <- tam.jml(resp)
mean(mod1$xsi)
wle <- mod1$WLE  # student ability estimates are stored in variable wle
mean(wle)
var(wle)

rel <- mod1$WLEreliability  #Test reliability
corrected_var <- rel*var(wle)  #Corrected variance estimate

You will notice that both the mean item difficulty, "mean(mod1$xsi)," and the mean ability, "mean(wle)," are quite close to zero, while "var(wle)" is larger than 1. Since student abilities were drawn from a normal distribution with mean 0 and variance 1, and the item difficulties were centred around 0, the estimated mean ability and mean item difficulty are close to the generating values, but the estimated variance is higher than the generating variance of 1.

As an exercise, re-run the simulation with different numbers of students (N) and different numbers of items (I). How do the values of N and I affect the estimation of mean and variance of student abilities?

In summary, the joint maximum likelihood (JML) estimation method results in an over-estimation of the population variance, through a two-step process of estimating individual abilities and then aggregating the abilities.

13.7 Overcoming the problem of inflated variance estimates

A correction factor can be applied to adjust the over-estimated variance. The test reliability can be used as the correction factor: if we multiply the variance of the ability estimates by the test reliability, the result will be closer to the generating (true) variance. This works because the reliability is, approximately, the ratio of the true score variance to the observed score variance, so multiplying the observed (inflated) variance by the reliability approximately recovers the true variance.
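As a quick check (assuming the objects from Exercise 1 are still in the workspace, so that the generating abilities are available in generateData$theta), we can compare the corrected variance with the variance of the generating abilities:

corrected_var             # reliability-corrected variance of the WLE ability estimates
var(generateData$theta)   # variance of the generating ("true") abilities, close to 1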

While the variance can be adjusted through a correction factor, it is not always easy to adjust other statistics such as percentile points for the over-estimation. In addition, when a test is too difficult or too easy for the students, the JML ability estimates are biased because of ceiling or floor effects. The test reliability does not work so well as a correction factor in that case.

13.8 A better method for estimating population mean and variance

In the procedures described above for estimating the population mean and variance, there are two steps. First, students’ individual abilities are estimated. Second, the mean and variance of these estimates are computed. This two-step process leads to biased variance estimates because of measurement errors. A better way is to estimate the population mean and variance directly, rather than through a two-step process. To estimate the population mean and variance directly, the model we use for the item responses needs to include the population mean and variance. One such estimation method is called Marginal Maximum Likelihood (MML) estimation, also described as a Bayesian IRT model. In contrast, the estimation method we used in our two-step process is the joint maximum likelihood (JML) method. You will notice that in the R code, we used the ‘tam.jml’ function to carry out the JML method. To carry out the MML method, we use the ‘tam.mml’ function instead.

A more detailed description of the MML estimation method is given below. An understanding of the concepts of MML will help us understand terminology such as ‘plausible values,’ ‘posterior distribution’ and ‘conditioning variables,’ which are frequently referred to in large-scale assessments. However, before going into the description of the MML method, we will do some practical exercises to see how MML estimation recovers the population mean and variance.

13.9 Exercise 2 - Directly estimate mean and variance using MML estimation method

library(TAM)
generateRasch <- function(N,I){
  theta <- rnorm( N ) # student abilities from normal distribution with mean 0 and var 1
  p1 <- plogis( outer( theta , seq( -2 , 2 , len=I ) , "-" ) )  #item diff from -2 to 2
  resp <- 1 * ( p1 > matrix( runif( N*I ) , nrow=N , ncol=I ) )  # item responses
  colnames(resp) <- paste("I" , 1:I, sep="")
  return(list(resp=resp))
}
#Generate Rasch item responses for 10000 students and 30 items
resp <- generateRasch(10000,30)$resp

#Fit item response data using the MML estimation method
mod2 <- tam.mml(resp)
mean(mod2$xsi$xsi)  #Mean of item difficulties
mod2$beta #Mean student abilities
mod2$variance  #Variance of student abilities

As an exercise, re-run the simulation in Exercise 2 with different numbers of students (N) and different numbers of items (I). How do the values of N and I affect the estimation of mean and variance of student abilities? Are the variance estimates still inflated? Compare the results with the JML estimation method.

13.10 What is the difference between JML and MML estimation methods?

We will use an example to explain the differences between the JML and MML estimation methods. Let’s assume that we have an item pool of 1000 items. Some students know 10% (p=0.1) of the items, some know 30% (p=0.3), and others know 50%, 70% and 90% of the items, respectively. We will call these proportions (p=0.1, 0.3, 0.5, 0.7, 0.9) the “true abilities” of the five groups of students, as they would be obtained if all 1000 items were administered. However, in a test, only 10 items are randomly selected. For each of the five groups of students, what is the probability of getting 0 correct, 1 correct, ..., 10 correct? The following table shows the probabilities, calculated from a binomial distribution. The rows of Table 13.1 are the five ability groups. The columns of Table 13.1 are the test scores out of a maximum of 10. Each cell shows the probability of getting a particular test score given the ability group.

Table 13.1: Probabilities of number of successes
p 0 1 2 3 4 5 6 7 8 9 10
0.1 0.349 0.387 0.194 0.057 0.011 0.001 0.000 0.000 0.000 0.000 0.000
0.3 0.028 0.121 0.233 0.267 0.200 0.103 0.037 0.009 0.001 0.000 0.000
0.5 0.001 0.010 0.044 0.117 0.205 0.246 0.205 0.117 0.044 0.010 0.001
0.7 0.000 0.000 0.001 0.009 0.037 0.103 0.200 0.267 0.233 0.121 0.028
0.9 0.000 0.000 0.000 0.000 0.000 0.001 0.011 0.057 0.194 0.387 0.349

When the students take a 10-item test, we don’t know which group the students come from; we only know their test scores out of 10. We need to infer their “true ability” (p in the table) from their test scores. Suppose a student obtained a score of 2 out of 10. We read down the column “2” in Table 13.1. We see that the probabilities are 0.194, 0.233, 0.044, 0.001 and 0.000 respectively for p = 0.1, 0.3, 0.5, 0.7 and 0.9. Since 0.233 is the largest of the five probabilities, we decide that students who obtain a score of 2 belong to the second group (the p=0.3 group). That is, we estimate that students with a score of 2 on this test have an ability of 0.3.
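The probabilities in Table 13.1, and this maximum-likelihood decision for a score of 2, can be reproduced with a few lines of R. This is only a sketch of the counting argument, not the actual JML algorithm; the object names (p_groups, prob_table) are illustrative.

p_groups <- c(0.1, 0.3, 0.5, 0.7, 0.9)   # "true abilities" of the five groups
prob_table <- t(sapply(p_groups, function(p) dbinom(0:10, size=10, prob=p)))
dimnames(prob_table) <- list(p=p_groups, score=0:10)
round(prob_table, 3)                     # reproduces Table 13.1

# For a test score of 2, choose the group with the largest probability
p_groups[which.max(prob_table[, "2"])]   # 0.3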

This process of estimating student abilities is the underlying idea of the JML estimation method. Of course, in the actual JML estimation, we do not use percentages for students’ abilities; we use “logits,” and the metric is a little different. Nevertheless, the idea is to find the ability that is the most likely one to produce a particular test score. Once we estimate the student abilities, we then aggregate them to form population estimates such as the mean and variance.

In MML estimation, there is an additional assumption about the population distribution. Table 13.2 has an added first column showing the number of students in each ability group, so the first two columns of Table 13.2 show the population distribution of student abilities. This distribution is called the prior distribution. In real life, we do not know this distribution, and it has to be estimated. For this example, let us assume we know this distribution, and we will explain how this prior distribution is used in the estimation process.

Table 13.2: Number of Students by Test Scores
N p 0 1 2 3 4 5 6 7 8 9 10
350 0.1 122 136 68 20 4 1 0 0 0 0 0
400 0.3 11 48 93 107 80 41 15 4 1 0 0
125 0.5 0 1 5 15 26 31 26 15 5 1 0
75 0.7 0 0 0 1 3 8 15 20 18 9 2
50 0.9 0 0 0 0 0 0 1 3 10 19 17
1000 133 185 166 143 113 81 57 42 34 29 19

The figures in Table 13.2 are computed by multiplying the first column, N, by the probabilities in Table 13.1. They are therefore the expected numbers of students in each ability group obtaining each test score. The last row in Table 13.2 shows the sum of the numbers of students in each column. That is, the bottom row is a table margin showing how many students are expected to obtain a particular score over the whole population of students.
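Continuing the sketch above, the expected counts in Table 13.2 can be obtained by multiplying each row of the probability table by the corresponding group size (n_group is an illustrative name for the vector of group sizes):

n_group <- c(350, 400, 125, 75, 50)   # number of students in each ability group
expected <- n_group * prob_table      # expected counts (unrounded)
round(expected)                       # reproduces the body of Table 13.2, up to rounding
colSums(round(expected))              # bottom row of Table 13.2: expected students per score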

We can convert Table 13.2 to probabilities by dividing each cell by the column total in the bottom row (the margin), giving the probability of being in an ability group given a test score. For example, for ability group 0.3 and a test score of 5, divide 41 by 81. Table 13.3 shows the results.

Table 13.3: Proportion of Students in each ability group by Test Scores
N p 0 1 2 3 4 5 6 7 8 9 10
350 0.1 0.92 0.74 0.41 0.14 0.04 0.01 0.00 0.00 0.00 0.00 0.00
400 0.3 0.08 0.26 0.56 0.75 0.71 0.51 0.26 0.10 0.03 0.00 0.00
125 0.5 0.00 0.01 0.03 0.10 0.23 0.38 0.46 0.36 0.15 0.03 0.00
75 0.7 0.00 0.00 0.00 0.01 0.03 0.10 0.26 0.48 0.53 0.31 0.11
50 0.9 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.07 0.29 0.66 0.89
1000 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

For a student with a test score of 5, Table 13.3 shows that the probabilities that the student belongs to each of the five ability groups are 0.01, 0.51, 0.38, 0.10 and 0.00, respectively. Since the maximum probability is 0.51, corresponding to the second group (p=0.3), we would assign a student with a test score of 5 to ability group 0.3. This differs from Table 13.1, under which we would assign a student with a test score of 5 to ability group 0.5, which has the highest probability (0.246). The reason for the difference between using Table 13.1 and Table 13.3 is that Table 13.1 makes no assumption about the population distribution (prior distribution). In contrast, Table 13.3 takes into account the population distribution of the number of students in each ability group. Because there are more students in the lower ability groups, the estimated individual student ability becomes lower.
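The conversion to Table 13.3, and the resulting assignment for a score of 5, can be sketched as follows, continuing from the code above:

posterior <- sweep(expected, 2, colSums(expected), "/")   # divide each column by its total
round(posterior, 2)                                       # reproduces Table 13.3, up to rounding
p_groups[which.max(posterior[, "5"])]                     # 0.3: group assigned to a score of 5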

In real life, we do not know the prior distribution (the population ability distribution), so we will not know the probabilities in Table 13.3. The data we collect tell us how many students obtained each test score; these are the numbers in the last row (the table margin) of Table 13.2. Given these “marginal” figures from our data, we estimate what the ability distribution (prior distribution) should be. In this way, the student ability distribution is estimated directly from the student item responses, and not from estimated individual abilities.

This approach, with its assumption of a prior (ability) distribution, forms the conceptual basis of Marginal Maximum Likelihood estimation, although the actual process uses logistic models rather than the simple counts we have described. The key message to remember is that MML estimation estimates the student ability distribution directly (i.e., not through a two-step process), while JML estimation first estimates individual abilities and then forms aggregates (a two-step process).

13.11 Further Terminology - Posterior distribution

In Table 13.3, each column of probabilities is called the posterior distribution for students with a particular test score. For example, the posterior distribution for students with a test score of 5 is shown in Table 13.4.

Table 13.4: Posterior distribution of students with a test score of 5
Ability group   Probability of being in the group, given a test score of 5
0.1 0.01
0.3 0.51
0.5 0.38
0.7 0.10
0.9 0.00
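
Using the sketch from Section 13.10 (assuming the posterior matrix computed there is still available), this column can be extracted directly:

round(posterior[, "5"], 2)   # posterior distribution for a test score of 5, as in Table 13.4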

The reason Table 13.4 is called the posterior distribution (in contrast to the prior distribution) is that this distribution is derived after testing, combining the information from the prior distribution with the test results. To make inferences about individual students’ ability levels, the posterior distribution is used, leading to terms such as “plausible values,” which will be discussed in the next chapter.

13.12 Homework

For Exercise 1 (13.6) and Exercise 2 (13.9) in this Chapter, answer the questions in these sections and present your results.