5 Data management
The estimated amount of time to complete this chapter is 1-2 hours.
Often we do not receive the data files in a useful form but we will have to do some rearranging and transformation of the data in order to prepare for the statistical analyses. This includes subsetting datasets, merging datasets as well as defining new variables and deleting variables.
5.1 Merge two data sets
Information on our patients can occur in two or more data sets. E.g. consider a situation where one data set contains the information from sleepData
and another data set contains individual specific information such as the following (the following data is for illustrative purposes and has absolutely
nothing to do with the orginal data in the Cushny and Peebles paper):
<- data.frame( ID = 1:10, BMI = 18:27, smoking = rep( c("Yes", "No"), 5) )
ptInfo ptInfo
ID BMI smoking
1 1 18 Yes
2 2 19 No
3 3 20 Yes
4 4 21 No
5 5 22 Yes
6 6 23 No
7 7 24 Yes
8 8 25 No
9 9 26 Yes
10 10 27 No
The ptInfo
thus contains information on BMI and whether the patient is smoking or not (we used the rep()
(REPeat) function to repeat the vector c('Yes','No')
5 times to obtain a length of 10).
If we wish to add this new information onto the sleepData
data frame, we can use the function merge()
.
To merge two data sets it is important to identify which variable(s) that connects the two data sets. In the sleepData
and ptInfo
example it is the id-variable ID
. ID
is then called a key.
Using the merge()
function where ID
is the key between the two data sets, we create a new data set sleepData2
that contains information from the sleepData and ptInformation data set:
<- sleep
sleepData <- merge( x = sleepData, y = ptInfo, by = "ID" )
sleepData2 head( sleepData2 )
ID extra group BMI smoking
1 1 0.7 1 18 Yes
2 1 1.9 2 18 Yes
3 10 2.0 1 27 No
4 10 3.4 2 27 No
5 2 -1.6 1 19 No
6 2 0.8 2 19 No
From the Environment pane we see that the sleepData2
data contains 20 observations and 5 variables. I.e. it is a copy of the original sleepData
with the two variables BMI
and smoking
from the ptInfo
added.
Sometimes there may be more than one variable needed to merge two data sets properly (see e.g. statmethods.net to learn the syntax).
5.2 Subsetting
A subset of a data frame is a data frame containing only some of the rows or some of the columns (variables). In Chapter 3.3 we saw how we could define such smaller data sets using indexing. A useful built-in function is the subset()
command that creates a subset of an existing data set based on one or more logical conditions on the variables in the data set. The function has the following structure:
subset(x, subset)
Here x
is a data set and subset
is the condition (logical expression).
How to use the function is shown below:
<- sleep
sleepData <- subset(sleepData, group == 1)
drug1 <- subset(sleepData, group == 2) drug2
Here the first subset only contains the data lines where the group variable equals the value 1. Note that we use double equal signs for comparisons (asking whether group
“is equal to” 1). The second subset similarly contains the data lines where the group variable equals 2. The sleep
data has 20 observations and from the Environment pane we see that we now have two datasets drug1
and drug2
each with 10 observations and 3 variables. Have a look at each of the datasets to see what they contain (using e.g. View()
).
5.3 Long vs. wide format
A data set may contain several measurements on the same individuals like the sleepData
that has two measuements per patient. Such a data set often occur in either of the two formats: wide or long. We will in this section explain the differences between the two formats and show how to change from one to another.
We can represent the sleepData
data set both in long and wide format. In the long format the data contains one row for each measurement and each patient contributes with several lines. In the wide format there is only one row for each patient and each measurement is in a separate column. The original format of the sleepData
data set is the long format. Here long format refers to the fact that we have repeated measurements (multiple measurements on the same individual represented by ID
) and the measurements are listed on top of each other instead of side-by-side. Often the long format is preferred when doing a statistical analysis, but this is not always the case.
5.3.1 From long to wide format
A wide format of the sleepData
data set can be achieved by placing values of the extra
column of group 2 next to the group 1 extra
column and thereby reducing the data frame from 20 lines to 10 lines.
We may use the above two subsets drug1
and drug2
containing the group 1 respectively group 2 measurements (Chapter 5.2). In wide format we will need 10 lines (one for each individual) and three variables (the ID
, the extra hours slept on drug 1 and the extra hours slept on drug 2).
From drug1
we need the ID
and the extra hours slept on drug1. We rename the extra
variable to extra1
as we need a new name to make it clear that these measurements were taken while the patients were on drug 1.
# Show the names of the variables in the data set:
names( drug1 )
[1] "extra" "group" "ID"
# Replace variable name no 1 with "extra1":
names( drug1 )[1] <- "extra1"
# Delete the group-variable:
$group <- NULL
drug1head( drug1 )
extra1 ID
1 0.7 1
2 -1.6 2
3 -0.2 3
4 -1.2 4
5 -0.1 5
6 3.4 6
Note that we deleted the group
variable by assigning it the value NULL
which is basically just an empty vector.
We do the same for the drug2
-data set but instead of deleting the group variable using the NULL
value we refer specifically to the column we wish to delete using indexing:
names( drug2 )
[1] "extra" "group" "ID"
names( drug2 )[1] <- "extra2"
# Delete the group-variable (column no 2)
<- drug2[ ,-2]
drug2 head( drug2 )
extra2 ID
11 1.9 1
12 0.8 2
13 1.1 3
14 0.1 4
15 -0.1 5
16 4.4 6
We could also have specified that we only wanted column no 1 and 3 as drug2 <- drug1[ , c(1,3) ]
.
Note that we carefully defined the two datasets, drug1
and drug2
, to only have the ID
variable in common in order to be able to ‘set the data sets side by side.’ We may do that using the merge()
function described above (Chapter 5.1):
<- merge( drug1, drug2, by='ID')
sleepWide sleepWide
ID extra1 extra2
1 1 0.7 1.9
2 10 2.0 3.4
3 2 -1.6 0.8
4 3 -0.2 1.1
5 4 -1.2 0.1
6 5 -0.1 -0.1
7 6 3.4 4.4
8 7 3.7 5.5
9 8 0.8 1.6
10 9 0.0 4.6
Voila! However note that merge()
has changed the order of the ID variable. If we want our data sorted by the ID
variable we order the data by ID
using the function order()
:
<- sleepWide[ order(sleepWide$ID), ]
sleepWide sleepWide
ID extra1 extra2
1 1 0.7 1.9
3 2 -1.6 0.8
4 3 -0.2 1.1
5 4 -1.2 0.1
6 5 -0.1 -0.1
7 6 3.4 4.4
8 7 3.7 5.5
9 8 0.8 1.6
10 9 0.0 4.6
2 10 2.0 3.4
To see a few more examples of sorting data sets, see e.g. statmethods.net.
Exercise 1: Order the data by ID
using indexing instead of order()
. You first need to run sleepWide <- merge( drug1, drug2, by='ID')
to redefine the unsorted data frame. Now use indexing (Chapter 3.3) to have the appropriate order of the data.
Click here for a solution.
In the unsorted data set we first need the first row, then the third, fourth etc until the 10’th row. The second row should be the last. I.e. we need the following order of the data lines:
<- merge( drug1, drug2, by="ID" )
sleepWide <- sleepWide[ c(1,3:10,2), ]
sleepWide sleepWide
ID extra1 extra2
1 1 0.7 1.9
3 2 -1.6 0.8
4 3 -0.2 1.1
5 4 -1.2 0.1
6 5 -0.1 -0.1
7 6 3.4 4.4
8 7 3.7 5.5
9 8 0.8 1.6
10 9 0.0 4.6
2 10 2.0 3.4
If you need to organize your own data from long to wide it might be worth having a closer look at the dcast()
function in the reshape2
package.
5.3.2 From wide to long format
To reproduce the long format of the sleepData
data set from the sleepWide
data set, the two columns extra1
and extra2
must be placed on top of each other and we need a variable indicating which group the measurement belongs to (like the group
variable in sleepData
). This can be achieved by 1) separating extra1
and extra2
into two separate data frames, 2) adding a group
variable to each data frame, 3) rename the variables of the data frames to contain the same variable names and 4) pasting the two data frames on top of each other.
# 1) make a copy of the wideData, delete extra2
<- sleepWide
wide1 $extra2 <- NULL
wide1# 2) adding the group variable
$group <- 1
wide1# 3) rename the extra1 variable
names( wide1 )[2] <- "extra"
We do the same for the extra2
variable
<- sleepWide
wide2 $extra1 <- NULL
wide2$group <- 2
wide2names( wide2 )[2] <- "extra"
The two data sets now contain identical variable names and we may place them on top of each other using rbind()
(row bind):
# 4)
<- rbind( wide1, wide2 )
sleepLong sleepLong
ID extra group
1 1 0.7 1
2 2 -1.6 1
3 3 -0.2 1
4 4 -1.2 1
5 5 -0.1 1
6 6 3.4 1
7 7 3.7 1
8 8 0.8 1
9 9 0.0 1
10 10 2.0 1
11 1 1.9 2
12 2 0.8 2
13 3 1.1 2
14 4 0.1 2
15 5 -0.1 2
16 6 4.4 2
17 7 5.5 2
18 8 1.6 2
19 9 4.6 2
20 10 3.4 2
Voila! We now have a data set identical to the sleep
data except that the original extra
variable now has the name extra
.
sleepLong
data set such that the variable in the first column is the ID
variable, the second is the group
and the third is extra
.
Click here for a solution.
# we need the order corresponding 1st, 3rd and 2nd variable
<- sleepLong[ , c(1,3,2)]
sleepLong # or referring to the variable names
<- sleepLong[ , c("ID","group","extra")] sleepLong
Exercise 3: Reorder the
sleepLong
data such that the first two rows correspond to individual with ID=1
, the third and fourth row to individual with ID=2
etc.
Click here for a solution.
We can sort the data according to the ID
variable using order()
:
<- sleepLong[ order(sleepLong$ID) , ]
sleepLong head(sleepLong)
ID group extra
1 1 1 0.7
11 1 2 1.9
2 2 1 -1.6
12 2 2 0.8
3 3 1 -0.2
13 3 2 1.1
Exercise 4: Reorder the sleepLong
data such that the first row correspond to ID=1
and group=2
, the second ID=1
and group=1
, the third ID=2
and group=1
etc. I.e. a data frame sorted according to increasing ID
and decreasing group
within each individual. Consult statmethods.net to find out how to do that.
Click here for a solution.
We can sort the data according toincreasing ID
and decreasing group
within ID
:
<- sleepLong[ order(sleepLong$ID,-sleepLong$group) , ]
sleepLong head(sleepLong)
ID group extra
11 1 2 1.9
1 1 1 0.7
12 2 2 0.8
2 2 1 -1.6
13 3 2 1.1
3 3 1 -0.2
If you need to organize your own data from long to wide it might be worth having a closer look at the melt()
function in the reshape2()
package.
5.4 Defining new variables
In this chapter we show how to define various kinds of new variables based on the variables in the SundBy data set.
5.4.1 Calculation of new variables
The data SundBy is loaded into a data frame named d
(as described in Chapter 4):
<- read.csv('https://biostat.ku.dk/R/data/sundby.csv') d
We can calculate new variables based on the variables in the data set. For example we often need to log-transform our variables when performing a regression analysis. We will then need a new variable in the data set being the log of the variable of interest, for example the log of weight:
$logwgt <- log( d$wgt ) d
Note that we may use $-notation to add a new variable to a data set as well as we use $-notation to refer to an existing variable.
In the video below (5:28 min)we show how to add a BMI-variable to the SundBy data set. The BMI is determined from weight and height using the formula \[\begin{eqnarray*} {\rm BMI} &=& \frac{\rm weight\ in\ kilo}{\rm (height\ in\ meters)^2} \end{eqnarray*}\]
In the video we defined the bmi
variable in two different ways:
# Defining height in meters before calculating BMI
$htms <- d$ht / 100
d$bmi <- d$wgt / d$htms^2
d
# Calculating height in meters inside the calculation - be careful with ()!
$bmi <- d$wgt / (d$ht/100)^2 d
An important point, when defining new variables, is that we should always
check the values of the new variables to make sure that we did the calculations right. In the video we carefully used the head()
command every time we modified the code.
5.4.2 Grouped variables from numeric variables
Sometimes we wish to break a quantitative (numeric or integer) variable into
groups. A group variable is also termed a categorical variable or a factor variable. For this purpose we may use the cut()
function.
In the video below (5:23 min) we demonstrate how to make a grouped version of the BMI-variable with values “Underweight” / “Normal” / “Overweight” / “Obese” corresponding to the BMI groups <= 20, 20-25, 25-30, > 30:
Click here to find the code produced in the video
<- read.csv("https://biostat.ku.dk/R/data/sundby0_English.csv")
d $bmi <- d$wgt / (d$ht/100)^2
dhead(d)
summary( d$bmi )
$bmiGrp <- cut( d$bmi, breaks=c(16,20,25,30,36),
dlabels=c('Underweight','Normal','Overweight','Obese'))
summary( d$bmiGrp )
Contents of the video:
The cut()
function is used to split a numerical variable into a grouped variable based on its values. The function takes the following arguments:
cut(x, breaks, labels, right)
Here x
is the numerical vector to be divided into groups. The argument breaks
is a vector specifying the values defining the groups. Arguments labels
and right
are optional.
Having specified breaks=c(16,20,25,30,36)
as in the video, the groups will be defined as (16,20], (20-25] etc. The open paranthesis ( is to be understood such that the value is not included in the interval, the bracket paranthesis such that the value is included in the interval. E.g. the second group (20-25] will contain values from > 20 (but not =20!) to 25 (25 included). To define the groups the other way around, i.e. including the lower value in each interval (e.g. [20,25)), use additional argument right=FALSE
.
Warning: Always make sure that the minimum (maximum) value specified in the breaks are smaller (larger) than the smallest (largest) observed value - otherwise missing values will occur. Use e.g.
a summary()
of your variable to determine these values.
The label
argument can be used to add informative names to the groups, e.g. labels=c('Underweight','Normal','Overweight','Obese')
. There have to be as many names as groups and the names have to be specified using quotation around each name.
Quiz
How many people are normal weight in the SundBy data set?
Test your result and find multiple solutions here.
5.4.3 Binary variables
The ifelse()
function can be used to create new variables with 2 levels. It has the following form:
ifelse(test, yes, no)
Here test
is a logical statement that produces a TRUE/FALSE value, if the statement is TRUE, the custom value for yes
is returned and if the statement is FALSE, the custom value for no
is returned. The yes
and no
arguments can be any type of element (numeric, integer, character), but they must be of the same type.
Say we wish to make an indicator that returns 1 if the patient is normal weight (BMI above 20 and below 25) and 0 otherwise. This can be done using the ifelse
function:
$normalBMI <- ifelse( 20 < d$bmi & d$bmi <= 25, 1, 0 ) d
Using the head()
function, we find that the indicator seems to return the correct result:
head( d, 6 )
id gender age wgt ht bmi bmiGrp normalBMI
1 1434 2 31 70.0 172 23.66144 Normal 1
2 978 1 28 68.0 170 23.52941 Normal 1
3 701 2 55 NA 165 NA <NA> NA
4 134 1 26 76.0 180 23.45679 Normal 1
5 292 2 21 60.0 175 19.59184 Underweight 0
6 1481 2 27 61.5 178 19.41043 Underweight 0
5.4.4 Combining groups
Sometimes we wish to group character vectors with a large number of categories into fewer categories. To do that we use a manual approach.
5.4.4.1 Combining two groups
There are only very few underweight people in the SundBy data and we may therefore wish to combine the underweight and normal weight people. We do that by making a copy of the bmiGrp
variable named bmi3Grp
and next assigning the underweight people the value Normal
:
$bmi3Grp <- d$bmiGrp
d$bmi3Grp[ d$bmiGrp == 'Underweight' ] <- 'Normal' d
We should always perform a check of our new variable. We can do that using summary()
:
> summary( d$bmiGrp )
's
Underweight Normal Overweight Obese NA 18 128 35 12 7
> summary( d$bmi3Grp )
Underweight Normal Overweight Obese NA's
0 146 35 12 7
Note that the bmi3Grp
variable now as expected has 18+128=146 people registered as normal weight.
Also note that the empty group "underweight"
is not deleted from the levels of the variable. To delete this level we may use the droplevels()
function:
$bmi3Grp <- droplevels( d$bmi3Grp)
dsummary( d$bmi3Grp )
Normal Overweight Obese NA's
146 35 12 7
5.4.4.2 Combine several groups
Suppose we want a variable indicating whether a person is overweight or not. Here we may use the ifelse()
function described above or we may use a manual approach:
$obese[ d$bmiGrp == 'Obese' ] <- 'Obese'
d$obese[ d$bmiGrp %in% c('Underweight','Normal','Overweight') ] <- 'Not Obese' d
We use %in%
and not ==
when we wish to match with more than one element.
Note that we cannot follow the procedure used in Section 5.4.4.1 as we need to add a new level Not obese
for our factor variable
> d$obese <- d$bmiGrp
> d$obese[ d$bmiGrp %in% c('Underweight','Normal weight','Overweight') ] <- 'Not obese'
in `[<-.factor`(`*tmp*`, d$bmiGrp %in%
Warning c("Underweight", "Normal weight", : ugyldigt faktorniveau,
NA oprettet
We might also define the new variable in several steps instead of using %in%
:
$obese[ d$bmiGrp == 'Obese' ] <- 'Obese'
d$obese[ d$bmiGrp == 'Underweight' ] <- 'Not Obese'
d$obese[ d$bmiGrp == 'Normal weight' ] <- 'Not Obese'
d$obese[ d$bmiGrp == 'Overweight' ] <- 'Not Obese' d
5.5 Exercises
Before you start the exercises it is important that you understood how to use the functions in this chapter. We also expect that you have knowledge about handling missing values and some of the functions from other chapters. If you do not - read the content summaries and maybe the previous chapter + the chapther about vectors and data frames.
All the exercises in this chapter are based on the sleepData2
defined above (Ch. 5.1).
Exercise 1A:
By help of the subset()
-function, create a new data frame including only smokers.
Click here for a solution.
<- subset(sleepData2, smoking=="Yes")
sleepQ1A sleepQ1A
ID extra group BMI smoking
1 1 0.7 1 18 Yes
2 1 1.9 2 18 Yes
7 3 1.1 2 20 Yes
8 3 -0.2 1 20 Yes
11 5 -0.1 1 22 Yes
12 5 -0.1 2 22 Yes
15 7 5.5 2 24 Yes
16 7 3.7 1 24 Yes
19 9 0.0 1 26 Yes
20 9 4.6 2 26 Yes
Exercise 1B:
By help of the subset()
-function, create a new data frame including only smokers with a BMI at 20 or above.
Click here for a solution.
<- subset(sleepData2, smoking=="Yes" & BMI >= 20)
sleepQ1C sleepQ1C
ID extra group BMI smoking
7 3 1.1 2 20 Yes
8 3 -0.2 1 20 Yes
11 5 -0.1 1 22 Yes
12 5 -0.1 2 22 Yes
15 7 5.5 2 24 Yes
16 7 3.7 1 24 Yes
19 9 0.0 1 26 Yes
20 9 4.6 2 26 Yes
Exercise 1C:
By help of the subset()
-function, create a new data frame including only smokers with BMI below 20 or above 25.
Click here for a solution.
<- subset(sleepData2, smoking=="Yes" & (BMI < 20 | BMI > 25 ) )
sleepQ1C sleepQ1C
ID extra group BMI smoking
1 1 0.7 1 18 Yes
2 1 1.9 2 18 Yes
19 9 0.0 1 26 Yes
20 9 4.6 2 26 Yes
NB: Be aware of the parantheses as AND (&
) precedes OR (|
) (like multiplication precedes addition). If you do not add the paranthesis around the BMI-expression (e.g. using smoking=="Yes" & BMI < 20 | BMI > 25
) you will have a different result corresponding to (smoking=="Yes" & BMI < 20) | BMI > 25
, namely underweight smokers and all overweight patients.
Exercise 2: Use the cut()
function to assign a grouped variable to the sleepData2
data set. Name the new variable change
. The groups of change
is defined as follows: “Reduction”: extra
is less than -0.5, ” No change”: extra
is between -0.5 and 0.5, “Increase”: extra
is greater than 0.5. For how many of the measurements were sleep increased resp. decreased?
Click here for a solution.
$change <- cut( sleepData2$extra, breaks=c(-2,-0.5,0.5,6),
sleepData2labels=c("Reduction", "No change", "Increase") )
summary( sleepData2$change )
Reduction No change Increase
2 5 13
# 13 measurements had increased sleep, 2 reduced
Exercise 3: Create a new 0/1-variable using the ifelse()
function that indicates whether or not sleep increased (1: extra
>0 vs 0: extra
<= 0).
For how many measurements were sleep increased using this definition?
Click here for a solution.
$increase01 <- ifelse( sleepData2$extra > 0, 1, 0)
sleepData2summary( sleepData2$increase01 )
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 1.0 0.7 1.0 1.0
Thus, with a mean of 0.7 of the 0/1-variable, 70% of the measurements corresponding to 0.7*20=14 measurements. However, this is a rather strange way of finding the number of measurements - an easier way that you will learn later is to make a table of the variable:
table( sleepData2$increase01)
0 1
6 14