If the packages are not already installed on your computer use the
function install.packages()
. Then we include the
libraries:
library(eRm)
library(iarm)
library(PP)
library(rms)
library(r2r) # library(2R2)
library(splines)
library(DT)
We use the complete cases from the dataset AMTS
(include
in the R-package iarm
) as an example. We include the data
set and check for missing values
data(amts)
apply(is.na(amts), 2, sum)
id agegrp sex age time address name year
0 0 0 0 1 0 0 0
dob month firstww monarch countbac
0 0 0 0 0
next, we select only the complete cases
cc.amts <- complete.cases(amts)
amts <- amts[cc.amts, ]
and we store the exogenous variables in one object and the scale items in another item
dat.exo.amts <- amts[, 2:3]
dat.items.amts <- amts[, 4:13]
In this dataset, the exogenous variables are agegrp
and
sex
and we want to (i) test invariance of the item
difficulties across age groups, (ii) test invariance of the item
difficulties across gender groups. We start out with descriptive
statistics for the two variables
PF_table(dat.exo.amts$sex, dat.items.amts)
PF_table(dat.exo.amts$agegrp, dat.items.amts)
Differential item functioning (DIF) is the lack of invariance of an item difficulty to an exogenous variable. In other words, DIF occurs when sample subgroups respond differently to an individual item despite equal level of ability. These systematic differences in the responses can be problematic as they indicate that the questionnaire is not fair across the complete sample and comparisons across subgroups can be problematic. For example, in an educational test with DIF items may disadvantage one subgroup against the other. Items with DIF make a scale unsuitable for measurement and comparisons across a population group. If an item shows DIF, the item can be either removed, split or the model has to be extended to take into account the DIF.
There are two types of DIF: uniform and non-uniform DIF. Uniform DIF means that the difference between some subgroups is consistent over the range of ability levels, non-uniform DIF means inconsistent differences.
Typically, items are checked for DIF for exogenous variables, i.e. person or sample characteristics, such as ‘age group’, ‘gender’, ‘educational level’, ‘health condition’, ‘country’, ‘language’, etc. The decision on which sample characteristics to test for item DIF is left to the researcher. Note that a sufficient number of cases are necessary to perform a reliable DIF analysis.
There are several methods to detect DIF. We illustrate five possible approaches that can all be applied to dichotomous data.
Andersens Likelihood Ratio test is a test for model fit which can
also be used as a global test for DIF. This test does not directly
compare the estimated item parameters found in sample subgroups.
Instead, it compares the maximum of the conditional likelihood when the
item parameters are allowed to vary across subgroups to the maximum of
the conditional likelihood under the Rasch model, where the difficulty
parameters (\(\beta\)) have to be
invariant. An index can be derived which accounts for the model fit
under the two conditions, with free and invariant item difficulty
parameters. In absence of DIF, subgroups will have the same parameter
estimate and, consequently, the maximum conditional likelihoods will be
approximately equal. Graphically, if the item difficulties in sample
subgroups are invariant, i.e. roughly equal, the sub group specific item
difficulty parameter estimates fall on a diagonal reference line. The
R-package eRm
, which is directly loaded with
iarm
, includes Andersen’s LR-test LRtest()
together with a plotting function plotGOF()
which allows to
graphically detect potential DIF-items. This plotting function is only
available for exogenous factor variables with two levels.
We fit the Rasch model to the AMTS items, compute Andersens LR test for age group using
RM.amts <- RM(dat.items.amts)
LRtest(RM.amts, splitcr = dat.exo.amts$agegrp)
Andersen LR-test:
LR-value: 18.558
Chi-square df: 18
p-value: 0.419
Next, using the same object RM.amts
, we compute
Andersens LR test for the Rasch model using the variable
sex
and include a plot (xlab
and
ylab
change the axis labels,
xlimand
ylimadjusts the range of the axes, and
tlab`
chooses item labels instead of numbers)
LRtest(RM.amts, splitcr = dat.exo.amts$sex)
Andersen LR-test:
LR-value: 18.753
Chi-square df: 9
p-value: 0.027
plotGOF(LRtest(RM.amts, splitcr = dat.exo.amts$sex), xlab = "Difficulty for Female",
ylab = "Difficulty for Male", tlab = "item", xlim = c(-2, 2), ylim = c(-2, 2))
The statistic and the figure support indicates some evidence of DIF
for sex. In fact, in absence of DIF, Andersen’s LR test is supposed to
be non-significant (p-value > 0.05) also the difficulty of the items
are supposed to tally and to be on the figure’s diagonal. The graphical
analysis with the goodness-of-fit plot allows to determine which items
difficulty estimates depart most from the expected concordance. In order
to know how far-off a point an be so that it is still a variation by
chance, plotGOF()
allows to draw 95% confidence ellipses.
When an item’s confidence ellipse does not cross the reference line, the
probability that the item parameter are equal across subgroups is <
0.05, i.e. unlikely. Confidence intervals can be added to the figures
using
plotGOF(LRtest(RM.amts, splitcr = dat.exo.amts$sex), xlab = "Beta for Female", ylab = "Beta for Male",
tlab = "number", xlim = c(-2, 2), ylim = c(-2, 2), conf = list(gamma = 0.95,
col = 1))
Note that we add CI-ellipses and that in tlab
the option
‘number’ instead of ‘labels’ makes the plot easier to look at (try using
tlab="identify"
for interactively labelling through
clicking).
The DIF analysis with Andersen’s LR showed significant results with p-value < 0.05 only for the variable sex. The goodness-of-fit plot, with confidence ellipses, indicates departures for two items, item 6 and item 8.
colnames(dat.items.amts)[c(6, 8)]
[1] "dob" "firstww"
It appears that it is easier for females to remember their birthday given the same level of cognitive functioning than their male counterparts. On the other hand, male participants more easily recollect facts about WWI.
An inherent problem with model tests is that small deviations will be statistically signficant for large sample sizes. The larger the sample, the smaller the confidence ellipse and more easily the null hypothesis of fit is rejected. Let’s check if another approach will confirm these findings.
Group invariance is tested through a Differential Item Functioning (DIF) procedure. DIF, which used to be called item bias (Groenvold M et al. 1996), is where, given the level of the trait under consideration, the response to an item should be the same, irrespective of group membership (e.g. gender). Thus, for example, when males and females have the same level of well-being, their response to an item indicative of well-being should be the same. There are many different ways to assess DIF, including logistic regression, structural equation modelling (group factorial invariance), and ANOVA tests with (e.g.) gender as a main effect and groups of people at different levels of the trait as another main effect. For gender, if the main effect is significant, then Uniform DIF is observed, and if there is an interaction effect between gender and the grouping of people along the trait, then non-uniform DIF is observed (Terrasi J 2000). These ANOVA tests are based upon the residuals.
With this approach, item DIF is determined by observing the standardized residuals of the Rasch analysis to detect non-random patterns which can be explained by an exogenous variable. To do so, the standardized residuals of the Rasch analysis are submitted to an analysis of variance (2-way ANOVA) of the scale items for score class intervals and an exogenous variable. Uniform DIF is present if the exogenous variable has a significant effect, for non-uniform DIF the interaction term is relevant.
We choose to have around 50 people per score class interval, so with
a sample size of N=196, the number of class intervals is set to 4 with
the option nci=
.
options(scipen=10) #output number format option see ?option 'scipen:'
twoway.anova.amts <- anova_DIF(dat.items.amts,
dat.exo.amts, #all exogenous variables can be tested at once
nci=4, #number of score class interval
p.adj="BH") #correction for multiple testing
twoway.anova.amts
With the Benjamini-Hochberg correction for the false discovery rate
(p.adj="BH"
: recommended approach), the lowest
p-value is found for item firstww
(p-value = 0.0732) supporting no evidence of DIF for sex and
age group for the AMTS scale. The correction for multiple
testing results in more conservative levels of significance.
Items should function in the same way for all subgroups of persons.
An item shows DIF if there is a significant association between the item
score and an exogenous variable, controlling for the scale score. In the
following approach, Partial Gamma coefficients are used as test
statistics to determine the presence of sample subgroups scoring
systematically higher on some items. The gamma coefficient ranges from
-1 to 1. The function partgam_DIF()
from the R-package
iarm
provides Gamma coefficients, standard errors,
p-values and confidence limits for every pair of an item and an
exogenous variable.
DIF, as well as LID, in a set of items \({x}_{1}\) to \({x}_{k}\) can be tested using the partial gamma. A general assumption is that, in presence of item independence the probability of each item score \(x_{1}\) to \(x_{k}\) given a latent trait \(\theta\) equals the product of the single item scores given the latent trait
\[\mathrm{P}(x_{1}, \dots, x_{k} | \theta) = \prod_{i=1}^{k} \mathrm{P}(x_{i}|\theta)\] A rest score \(\mathrm{R}\), which is the raw sum score minus the contribution of item \(x_{j}\) is defined as:
\[\mathrm{R^{(j)}}=\sum_{i \neq j} x_{i}\]
The probability of having a local depedency between item \(x_{j}\) and some other item \(x_{k}\) given the rest score is defined as the product of the single probabilities of \(x_{j}\) and \(x_{k}\) given the rest score \(\mathrm{R^{(j)}}\).
\[\mathrm{P}(x_{j}, x_{k} | \mathrm{R^{(j)})=\mathrm{P}(x_{j}| \mathrm{R^{(j)}}) \mathrm{P}(x_{k}| \mathrm{R^{(j)}})} \] \[\mathrm{P}(.|\mathrm{R}^{(j)}) : corr(x_{j},x_{k})=0\] \[\mathrm{P}(.| \mathrm{R}^{(k)}) : corr(x_{j},x_{k})=0\]
partgam_DIF(dat.items.amts, dat.exo.amts)
Item Var gamma se pvalue padj.BH sig lower upper
1 age agegrp 0.0556 0.2397 0.8167 1.0000 -0.4143 0.5254
2 age sex 0.3913 0.3253 0.2290 1.0000 -0.2462 1.0000
3 time agegrp 0.1508 0.2399 0.5297 1.0000 -0.3194 0.6210
4 time sex -0.3035 0.2469 0.2190 1.0000 -0.7874 0.1804
5 address agegrp 0.2386 0.1972 0.2265 1.0000 -0.1480 0.6252
6 address sex -0.0096 0.2379 0.9677 1.0000 -0.4760 0.4567
7 name agegrp -0.1549 0.2640 0.5572 1.0000 -0.6723 0.3624
8 name sex 0.0769 0.2996 0.7974 1.0000 -0.5103 0.6641
9 year agegrp -0.1742 0.2223 0.4332 1.0000 -0.6099 0.2615
10 year sex 0.2782 0.2638 0.2917 1.0000 -0.2389 0.7953
11 dob agegrp -0.0380 0.2869 0.8947 1.0000 -0.6004 0.5244
12 dob sex -0.7000 0.1954 0.0003 0.0068 ** -1.0000 -0.3170
13 month agegrp 0.3784 0.2003 0.0589 1.0000 -0.0142 0.7710
14 month sex -0.3333 0.3075 0.2784 1.0000 -0.9361 0.2694
15 firstww agegrp -0.1901 0.2376 0.4236 1.0000 -0.6558 0.2756
16 firstww sex 0.5525 0.2019 0.0062 0.1243 0.1567 0.9483
17 monarch agegrp -0.2453 0.2006 0.2214 1.0000 -0.6385 0.1479
18 monarch sex -0.0476 0.2765 0.8633 1.0000 -0.5895 0.4943
19 countbac agegrp -0.1277 0.1832 0.4857 1.0000 -0.4868 0.2313
20 countbac sex -0.1969 0.2250 0.3815 1.0000 -0.6379 0.2441
The evidence for DIF is weaker than what found with Anderson’s
conditioning LR-test, this may be due to the correction for multiple
testing. However, direction of the Gamma coefficient for
dob
and firstww
support what found ealier,
than given the same level of cognitive functioning females are more
likely to remember their date of birth and males the dates of WWI.
A Wald test can be used to test for invariance of items by specifying
either a split criterion for the variable (mean or median) or by using a
dichotomous grouping variable. In that sense, the Wald test can be used
for the testing DIF for exogenous variables with two-levels. The
function Waldtest()
is found in the R-package
eRm
.
RM.amts <- RM(dat.items.amts)
Waldtest(RM.amts, amts$sex)
Wald test on item level (z-values):
z-statistic p-value
beta age -1.704 0.088
beta time 1.326 0.185
beta address -0.166 0.868
beta name -0.185 0.853
beta year -0.780 0.435
beta dob 1.966 0.049
beta month 1.450 0.147
beta firstww -2.549 0.011
beta monarch -0.472 0.637
beta countbac 1.450 0.147
Here, as found in some of the previous approaches, the variables
firstww
and dob
can be flagged for uniform DIF
for sex as their z-statistics are significant (p-value <
0.05)
As mentionned earlier, options to solve DIF in scales is either to remove items with DIF, to use a more extended Rasch model which can take into account the DIF or to do item split. In what follows, we will sketch the item split approach (beyond the scope of this course to go through this in detail)
The earlier analyses to test for DIF showed somewhat different
degrees of evidence for DIF. Globally, two items were not systematically
but more frequently flagged for DIF for sex, namely dob
followed by firstww
. In what follows, the first steps of an
item split approach are shown. In general, item DIF is solved stepwise
starting with the item showing the highest DIF, here for example
dob
. The function split_dif()
found in the
R-package R2R
facilitates transforming the DIF variable
into separate variables, with one variable with values for each level of
an exogenous variable. This variable is added to the dataset and the
non-split version of the item has to be removed from the original
data.
split the item with DIF ‘dob’ for the exogenous variable which causes DIF and remove the original version of the item
DIF.split.amts <- cbind(amts, split_dif(dob ~ sex, amts))
DIF.split.amts <- DIF.split.amts[, -which(colnames(DIF.split.amts) %in% "dob")]
head(DIF.split.amts)
Run a Rasch analysis with the new variable.
model.split.amts <- RM(DIF.split.amts[, 4:14])
summary(model.split.amts)
Create a conversion tables with split items
# create a dataframe with each individual row score.
score.ability.amts <- as.data.frame(
cbind(
DIF.split.amts[,4:14], # calibration data
ability=person.parameter(model.split.amts)$theta.table[,1], # ability estimate
sex=amts$sex)) # exogenous variable for which an item was split
scores.female.amts <- subset(score.ability.amts, sex=="female", -dob_male)
scores.male.amts <- subset(score.ability.amts, sex=="male", -dob_female)
# remove the exogeneous variable
scores.female.amts <- scores.female.amts[,-which(colnames(scores.female.amts)=="sex")]
scores.male.amts <- scores.male.amts[,-which(colnames(scores.male.amts)=="sex")]
# pass the data matrix to a function (requires the libraries splines and scales)
convert <- function(x){ #x=items and last column the abilities for a sample subgroup
x=x[complete.cases(x[,-ncol(x)]),]
Sc=unique(cbind(rowSums(x[,-ncol(x)]), x[,ncol(x)]))
min_max=range(Sc[,1])
interp=predict(interpSpline(Sc[,1], Sc[,2]), min_max[1]:min_max[2])
return(cbind(row_score=interp$x,
logit=interp$y,
converted=round(scales::rescale(interp$y, to=c(min_max[1], min_max[2])),1)
))
}
#conversion table for female subgroup
convert(scores.female.amts)
#conversion table for male subgroup
convert(scores.female.amts)