Differential Item Functioning

Getting ready

If the packages are not already installed on your computer use the function install.packages(). Then we include the libraries:

library(eRm)
library(iarm)
library(PP)
library(rms)
library(r2r)  # library(2R2)
library(splines)
library(DT)

We use the complete cases from the dataset AMTS (include in the R-package iarm) as an example. We include the data set and check for missing values

data(amts)
apply(is.na(amts), 2, sum)

      id   agegrp      sex      age     time  address     name     year 
       0        0        0        0        1        0        0        0 
     dob    month  firstww  monarch countbac 
       0        0        0        0        0

next, we select only the complete cases

cc.amts <- complete.cases(amts)
amts <- amts[cc.amts, ]

and we store the exogenous variables in one object and the scale items in another item

dat.exo.amts <- amts[, 2:3]
dat.items.amts <- amts[, 4:13]

In this dataset, the exogenous variables are agegrp and sex and we want to (i) test invariance of the item difficulties across age groups, (ii) test invariance of the item difficulties across gender groups. We start out with descriptive statistics for the two variables

PF_table(dat.exo.amts$sex, dat.items.amts)
PF_table(dat.exo.amts$agegrp, dat.items.amts)

Differential Item Functioning

Differential item functioning (DIF) is the lack of invariance of an item difficulty to an exogenous variable. In other words, DIF occurs when sample subgroups respond differently to an individual item despite equal level of ability. These systematic differences in the responses can be problematic as they indicate that the questionnaire is not fair across the complete sample and comparisons across subgroups can be problematic. For example, in an educational test with DIF items may disadvantage one subgroup against the other. Items with DIF make a scale unsuitable for measurement and comparisons across a population group. If an item shows DIF, the item can be either removed, split or the model has to be extended to take into account the DIF.

There are two types of DIF: uniform and non-uniform DIF. Uniform DIF means that the difference between some subgroups is consistent over the range of ability levels, non-uniform DIF means inconsistent differences.

Typically, items are checked for DIF for exogenous variables, i.e. person or sample characteristics, such as ‘age group’, ‘gender’, ‘educational level’, ‘health condition’, ‘country’, ‘language’, etc. The decision on which sample characteristics to test for item DIF is left to the researcher. Note that a sufficient number of cases are necessary to perform a reliable DIF analysis.

There are several methods to detect DIF. We illustrate five possible approaches that can all be applied to dichotomous data.

Andersens likelihood ratio (LR) test of DIF
Two-way ANOVA of the residuals
Partial Gamma Coefficient
Wald test

1. Andersens likelihood ratio (LR) test of DIF

Andersens Likelihood Ratio test is a test for model fit which can also be used as a global test for DIF. This test does not directly compare the estimated item parameters found in sample subgroups. Instead, it compares the maximum of the conditional likelihood when the item parameters are allowed to vary across subgroups to the maximum of the conditional likelihood under the Rasch model, where the difficulty parameters (\(\beta\)) have to be invariant. An index can be derived which accounts for the model fit under the two conditions, with free and invariant item difficulty parameters. In absence of DIF, subgroups will have the same parameter estimate and, consequently, the maximum conditional likelihoods will be approximately equal. Graphically, if the item difficulties in sample subgroups are invariant, i.e. roughly equal, the sub group specific item difficulty parameter estimates fall on a diagonal reference line. The R-package eRm, which is directly loaded with iarm, includes Andersen’s LR-test LRtest() together with a plotting function plotGOF() which allows to graphically detect potential DIF-items. This plotting function is only available for exogenous factor variables with two levels.

We fit the Rasch model to the AMTS items, compute Andersens LR test for age group using

RM.amts <- RM(dat.items.amts)
LRtest(RM.amts, splitcr = dat.exo.amts$agegrp)


Andersen LR-test: 
LR-value: 18.558 
Chi-square df: 18 
p-value:  0.419

Next, using the same object RM.amts, we compute Andersens LR test for the Rasch model using the variable sex and include a plot (xlab and ylab change the axis labels, xlimandylimadjusts the range of the axes, andtlab` chooses item labels instead of numbers)

LRtest(RM.amts, splitcr = dat.exo.amts$sex)


Andersen LR-test: 
LR-value: 18.753 
Chi-square df: 9 
p-value:  0.027

plotGOF(LRtest(RM.amts, splitcr = dat.exo.amts$sex), xlab = "Difficulty for Female",
    ylab = "Difficulty for Male", tlab = "item", xlim = c(-2, 2), ylim = c(-2, 2))

The statistic and the figure support indicates some evidence of DIF for sex. In fact, in absence of DIF, Andersen’s LR test is supposed to be non-significant (p-value > 0.05) also the difficulty of the items are supposed to tally and to be on the figure’s diagonal. The graphical analysis with the goodness-of-fit plot allows to determine which items difficulty estimates depart most from the expected concordance. In order to know how far-off a point an be so that it is still a variation by chance, plotGOF() allows to draw 95% confidence ellipses. When an item’s confidence ellipse does not cross the reference line, the probability that the item parameter are equal across subgroups is < 0.05, i.e. unlikely. Confidence intervals can be added to the figures using

plotGOF(LRtest(RM.amts, splitcr = dat.exo.amts$sex), xlab = "Beta for Female", ylab = "Beta for Male",
    tlab = "number", xlim = c(-2, 2), ylim = c(-2, 2), conf = list(gamma = 0.95,
        col = 1))

Note that we add CI-ellipses and that in tlab the option ‘number’ instead of ‘labels’ makes the plot easier to look at (try using tlab="identify" for interactively labelling through clicking).

The DIF analysis with Andersen’s LR showed significant results with p-value < 0.05 only for the variable sex. The goodness-of-fit plot, with confidence ellipses, indicates departures for two items, item 6 and item 8.

colnames(dat.items.amts)[c(6, 8)]

[1] "dob"     "firstww"

It appears that it is easier for females to remember their birthday given the same level of cognitive functioning than their male counterparts. On the other hand, male participants more easily recollect facts about WWI.

An inherent problem with model tests is that small deviations will be statistically signficant for large sample sizes. The larger the sample, the smaller the confidence ellipse and more easily the null hypothesis of fit is rejected. Let’s check if another approach will confirm these findings.

2. Two-way ANOVA of the residuals

Group invariance is tested through a Differential Item Functioning (DIF) procedure. DIF, which used to be called item bias (Groenvold M et al. 1996), is where, given the level of the trait under consideration, the response to an item should be the same, irrespective of group membership (e.g. gender). Thus, for example, when males and females have the same level of well-being, their response to an item indicative of well-being should be the same. There are many different ways to assess DIF, including logistic regression, structural equation modelling (group factorial invariance), and ANOVA tests with (e.g.) gender as a main effect and groups of people at different levels of the trait as another main effect. For gender, if the main effect is significant, then Uniform DIF is observed, and if there is an interaction effect between gender and the grouping of people along the trait, then non-uniform DIF is observed (Terrasi J 2000). These ANOVA tests are based upon the residuals.

With this approach, item DIF is determined by observing the standardized residuals of the Rasch analysis to detect non-random patterns which can be explained by an exogenous variable. To do so, the standardized residuals of the Rasch analysis are submitted to an analysis of variance (2-way ANOVA) of the scale items for score class intervals and an exogenous variable. Uniform DIF is present if the exogenous variable has a significant effect, for non-uniform DIF the interaction term is relevant.

We choose to have around 50 people per score class interval, so with a sample size of N=196, the number of class intervals is set to 4 with the option nci=.

options(scipen=10) #output number format option see ?option 'scipen:'
twoway.anova.amts <- anova_DIF(dat.items.amts,
                    dat.exo.amts, #all exogenous variables can be tested at once
                    nci=4,  #number of score class interval 
                    p.adj="BH") #correction for multiple testing 
twoway.anova.amts

With the Benjamini-Hochberg correction for the false discovery rate (p.adj="BH": recommended approach), the lowest p-value is found for item firstww (p-value = 0.0732) supporting no evidence of DIF for sex and age group for the AMTS scale. The correction for multiple testing results in more conservative levels of significance.

3. Partial gamma coefficient

Items should function in the same way for all subgroups of persons. An item shows DIF if there is a significant association between the item score and an exogenous variable, controlling for the scale score. In the following approach, Partial Gamma coefficients are used as test statistics to determine the presence of sample subgroups scoring systematically higher on some items. The gamma coefficient ranges from -1 to 1. The function partgam_DIF() from the R-package iarm provides Gamma coefficients, standard errors, p-values and confidence limits for every pair of an item and an exogenous variable.

DIF, as well as LID, in a set of items \({x}_{1}\) to \({x}_{k}\) can be tested using the partial gamma. A general assumption is that, in presence of item independence the probability of each item score \(x_{1}\) to \(x_{k}\) given a latent trait \(\theta\) equals the product of the single item scores given the latent trait

\[\mathrm{P}(x_{1}, \dots, x_{k} | \theta) = \prod_{i=1}^{k} \mathrm{P}(x_{i}|\theta)\] A rest score \(\mathrm{R}\), which is the raw sum score minus the contribution of item \(x_{j}\) is defined as:

\[\mathrm{R^{(j)}}=\sum_{i \neq j} x_{i}\]

The probability of having a local depedency between item \(x_{j}\) and some other item \(x_{k}\) given the rest score is defined as the product of the single probabilities of \(x_{j}\) and \(x_{k}\) given the rest score \(\mathrm{R^{(j)}}\).

\[\mathrm{P}(x_{j}, x_{k} | \mathrm{R^{(j)})=\mathrm{P}(x_{j}| \mathrm{R^{(j)}}) \mathrm{P}(x_{k}| \mathrm{R^{(j)}})} \] \[\mathrm{P}(.|\mathrm{R}^{(j)}) : corr(x_{j},x_{k})=0\] \[\mathrm{P}(.| \mathrm{R}^{(k)}) : corr(x_{j},x_{k})=0\]

partgam_DIF(dat.items.amts, dat.exo.amts)

       Item    Var   gamma     se pvalue padj.BH sig   lower   upper
1       age agegrp  0.0556 0.2397 0.8167  1.0000     -0.4143  0.5254
2       age    sex  0.3913 0.3253 0.2290  1.0000     -0.2462  1.0000
3      time agegrp  0.1508 0.2399 0.5297  1.0000     -0.3194  0.6210
4      time    sex -0.3035 0.2469 0.2190  1.0000     -0.7874  0.1804
5   address agegrp  0.2386 0.1972 0.2265  1.0000     -0.1480  0.6252
6   address    sex -0.0096 0.2379 0.9677  1.0000     -0.4760  0.4567
7      name agegrp -0.1549 0.2640 0.5572  1.0000     -0.6723  0.3624
8      name    sex  0.0769 0.2996 0.7974  1.0000     -0.5103  0.6641
9      year agegrp -0.1742 0.2223 0.4332  1.0000     -0.6099  0.2615
10     year    sex  0.2782 0.2638 0.2917  1.0000     -0.2389  0.7953
11      dob agegrp -0.0380 0.2869 0.8947  1.0000     -0.6004  0.5244
12      dob    sex -0.7000 0.1954 0.0003  0.0068  ** -1.0000 -0.3170
13    month agegrp  0.3784 0.2003 0.0589  1.0000     -0.0142  0.7710
14    month    sex -0.3333 0.3075 0.2784  1.0000     -0.9361  0.2694
15  firstww agegrp -0.1901 0.2376 0.4236  1.0000     -0.6558  0.2756
16  firstww    sex  0.5525 0.2019 0.0062  0.1243      0.1567  0.9483
17  monarch agegrp -0.2453 0.2006 0.2214  1.0000     -0.6385  0.1479
18  monarch    sex -0.0476 0.2765 0.8633  1.0000     -0.5895  0.4943
19 countbac agegrp -0.1277 0.1832 0.4857  1.0000     -0.4868  0.2313
20 countbac    sex -0.1969 0.2250 0.3815  1.0000     -0.6379  0.2441

The evidence for DIF is weaker than what found with Anderson’s conditioning LR-test, this may be due to the correction for multiple testing. However, direction of the Gamma coefficient for dob and firstww support what found ealier, than given the same level of cognitive functioning females are more likely to remember their date of birth and males the dates of WWI.

4. Wald test

A Wald test can be used to test for invariance of items by specifying either a split criterion for the variable (mean or median) or by using a dichotomous grouping variable. In that sense, the Wald test can be used for the testing DIF for exogenous variables with two-levels. The function Waldtest() is found in the R-package eRm.

RM.amts <- RM(dat.items.amts)
Waldtest(RM.amts, amts$sex)


Wald test on item level (z-values):

              z-statistic p-value
beta age           -1.704   0.088
beta time           1.326   0.185
beta address       -0.166   0.868
beta name          -0.185   0.853
beta year          -0.780   0.435
beta dob            1.966   0.049
beta month          1.450   0.147
beta firstww       -2.549   0.011
beta monarch       -0.472   0.637
beta countbac       1.450   0.147

Here, as found in some of the previous approaches, the variables firstww and dob can be flagged for uniform DIF for sex as their z-statistics are significant (p-value < 0.05)

Dealing with DIF

As mentionned earlier, options to solve DIF in scales is either to remove items with DIF, to use a more extended Rasch model which can take into account the DIF or to do item split. In what follows, we will sketch the item split approach (beyond the scope of this course to go through this in detail)

Splitting

The earlier analyses to test for DIF showed somewhat different degrees of evidence for DIF. Globally, two items were not systematically but more frequently flagged for DIF for sex, namely dob followed by firstww. In what follows, the first steps of an item split approach are shown. In general, item DIF is solved stepwise starting with the item showing the highest DIF, here for example dob. The function split_dif() found in the R-package R2R facilitates transforming the DIF variable into separate variables, with one variable with values for each level of an exogenous variable. This variable is added to the dataset and the non-split version of the item has to be removed from the original data.

Briefly sketching how splitting can be done

split the item with DIF ‘dob’ for the exogenous variable which causes DIF and remove the original version of the item

DIF.split.amts <- cbind(amts, split_dif(dob ~ sex, amts))
DIF.split.amts <- DIF.split.amts[, -which(colnames(DIF.split.amts) %in% "dob")]
head(DIF.split.amts)

Run a Rasch analysis with the new variable.

model.split.amts <- RM(DIF.split.amts[, 4:14])
summary(model.split.amts)

Create a conversion tables with split items

# create a dataframe with each individual row score.
score.ability.amts <- as.data.frame( 
                cbind(
                DIF.split.amts[,4:14], # calibration data
                ability=person.parameter(model.split.amts)$theta.table[,1], # ability estimate
                sex=amts$sex)) # exogenous variable for which an item was split
scores.female.amts <- subset(score.ability.amts, sex=="female", -dob_male)
scores.male.amts <- subset(score.ability.amts, sex=="male", -dob_female)

# remove the exogeneous variable

scores.female.amts <- scores.female.amts[,-which(colnames(scores.female.amts)=="sex")]
scores.male.amts <- scores.male.amts[,-which(colnames(scores.male.amts)=="sex")]

# pass the data matrix to a function (requires the libraries splines and scales)
convert <- function(x){  #x=items and last column the abilities for a sample subgroup
  x=x[complete.cases(x[,-ncol(x)]),]
  Sc=unique(cbind(rowSums(x[,-ncol(x)]), x[,ncol(x)]))
  min_max=range(Sc[,1])
  interp=predict(interpSpline(Sc[,1], Sc[,2]), min_max[1]:min_max[2]) 
  return(cbind(row_score=interp$x, 
               logit=interp$y, 
               converted=round(scales::rescale(interp$y, to=c(min_max[1], min_max[2])),1) 
               ))
}

#conversion table for female subgroup

convert(scores.female.amts)

#conversion table for male subgroup

convert(scores.female.amts)

Differential Item Functioning

2022-06-09

Getting ready

Differential Item Functioning

1. Andersens likelihood ratio (LR) test of DIF

2. Two-way ANOVA of the residuals

3. Partial gamma coefficient

4. Wald test

Dealing with DIF

Splitting

Briefly sketching how splitting can be done