5 Data management

The estimated amount of time to complete this chapter is 1-2 hours.

Often we do not receive the data files in a useful form but we will have to do some rearranging and transformation of the data in order to prepare for the statistical analyses. This includes subsetting datasets, merging datasets as well as defining new variables and deleting variables.

5.1 Merge two data sets

Information on our patients can occur in two or more data sets. E.g. consider a situation where one data set contains the information from sleepData and another data set contains individual specific information such as the following (the following data is for illustrative purposes and has absolutely nothing to do with the orginal data in the Cushny and Peebles paper):

ptInfo <- data.frame( ID = 1:10, BMI = 18:27, smoking = rep( c("Yes", "No"), 5) )
ptInfo

   ID BMI smoking
1   1  18     Yes
2   2  19      No
3   3  20     Yes
4   4  21      No
5   5  22     Yes
6   6  23      No
7   7  24     Yes
8   8  25      No
9   9  26     Yes
10 10  27      No

The ptInfo thus contains information on BMI and whether the patient is smoking or not (we used the rep() (REPeat) function to repeat the vector c('Yes','No') 5 times to obtain a length of 10).

If we wish to add this new information onto the sleepData data frame, we can use the function merge(). To merge two data sets it is important to identify which variable(s) that connects the two data sets. In the sleepData and ptInfo example it is the id-variable ID. ID is then called a key. Using the merge() function where ID is the key between the two data sets, we create a new data set sleepData2 that contains information from the sleepData and ptInformation data set:

sleepData <- sleep
sleepData2 <- merge( x = sleepData, y = ptInfo, by = "ID" )
head( sleepData2 )

  ID extra group BMI smoking
1  1   0.7     1  18     Yes
2  1   1.9     2  18     Yes
3 10   2.0     1  27      No
4 10   3.4     2  27      No
5  2  -1.6     1  19      No
6  2   0.8     2  19      No

From the Environment pane we see that the sleepData2 data contains 20 observations and 5 variables. I.e. it is a copy of the original sleepData with the two variables BMI and smoking from the ptInfo added.

Sometimes there may be more than one variable needed to merge two data sets properly (see e.g. statmethods.net to learn the syntax).

5.2 Subsetting

A subset of a data frame is a data frame containing only some of the rows or some of the columns (variables). In Chapter 3.3 we saw how we could define such smaller data sets using indexing. A useful built-in function is the subset() command that creates a subset of an existing data set based on one or more logical conditions on the variables in the data set. The function has the following structure:

subset(x, subset)

Here x is a data set and subset is the condition (logical expression). How to use the function is shown below:

sleepData <- sleep 
drug1 <- subset(sleepData, group == 1)
drug2 <- subset(sleepData, group == 2)

Here the first subset only contains the data lines where the group variable equals the value 1. Note that we use double equal signs for comparisons (asking whether group “is equal to” 1). The second subset similarly contains the data lines where the group variable equals 2. The sleep data has 20 observations and from the Environment pane we see that we now have two datasets drug1 and drug2 each with 10 observations and 3 variables. Have a look at each of the datasets to see what they contain (using e.g. View()).

5.3 Long vs. wide format

A data set may contain several measurements on the same individuals like the sleepData that has two measuements per patient. Such a data set often occur in either of the two formats: wide or long. We will in this section explain the differences between the two formats and show how to change from one to another.

We can represent the sleepData data set both in long and wide format. In the long format the data contains one row for each measurement and each patient contributes with several lines. In the wide format there is only one row for each patient and each measurement is in a separate column. The original format of the sleepData data set is the long format. Here long format refers to the fact that we have repeated measurements (multiple measurements on the same individual represented by ID) and the measurements are listed on top of each other instead of side-by-side. Often the long format is preferred when doing a statistical analysis, but this is not always the case.

5.3.1 From long to wide format

A wide format of the sleepData data set can be achieved by placing values of the extra column of group 2 next to the group 1 extra column and thereby reducing the data frame from 20 lines to 10 lines.

We may use the above two subsets drug1 and drug2 containing the group 1 respectively group 2 measurements (Chapter 5.2). In wide format we will need 10 lines (one for each individual) and three variables (the ID, the extra hours slept on drug 1 and the extra hours slept on drug 2).

From drug1 we need the ID and the extra hours slept on drug1. We rename the extra variable to extra1 as we need a new name to make it clear that these measurements were taken while the patients were on drug 1.

# Show the names of the variables in the data set:
names( drug1 )

[1] "extra" "group" "ID"

# Replace variable name no 1 with "extra1":
names( drug1 )[1] <- "extra1"
# Delete the group-variable:
drug1$group <- NULL
head( drug1 )

  extra1 ID
1    0.7  1
2   -1.6  2
3   -0.2  3
4   -1.2  4
5   -0.1  5
6    3.4  6

Note that we deleted the group variable by assigning it the value NULL which is basically just an empty vector. We do the same for the drug2-data set but instead of deleting the group variable using the NULL value we refer specifically to the column we wish to delete using indexing:

names( drug2 )

[1] "extra" "group" "ID"

names( drug2 )[1] <- "extra2"
# Delete the group-variable (column no 2)
drug2 <- drug2[ ,-2]
head( drug2 )

   extra2 ID
11    1.9  1
12    0.8  2
13    1.1  3
14    0.1  4
15   -0.1  5
16    4.4  6

We could also have specified that we only wanted column no 1 and 3 as drug2 <- drug1[ , c(1,3) ].

Note that we carefully defined the two datasets, drug1 and drug2, to only have the ID variable in common in order to be able to ‘set the data sets side by side.’ We may do that using the merge() function described above (Chapter 5.1):

sleepWide <- merge( drug1, drug2, by='ID')
sleepWide

   ID extra1 extra2
1   1    0.7    1.9
2  10    2.0    3.4
3   2   -1.6    0.8
4   3   -0.2    1.1
5   4   -1.2    0.1
6   5   -0.1   -0.1
7   6    3.4    4.4
8   7    3.7    5.5
9   8    0.8    1.6
10  9    0.0    4.6

Voila! However note that merge() has changed the order of the ID variable. If we want our data sorted by the ID variable we order the data by ID using the function order():

sleepWide <- sleepWide[ order(sleepWide$ID), ]
sleepWide

   ID extra1 extra2
1   1    0.7    1.9
3   2   -1.6    0.8
4   3   -0.2    1.1
5   4   -1.2    0.1
6   5   -0.1   -0.1
7   6    3.4    4.4
8   7    3.7    5.5
9   8    0.8    1.6
10  9    0.0    4.6
2  10    2.0    3.4

To see a few more examples of sorting data sets, see e.g. statmethods.net.

Exercise 1: Order the data by ID using indexing instead of order(). You first need to run sleepWide <- merge( drug1, drug2, by='ID') to redefine the unsorted data frame. Now use indexing (Chapter 3.3) to have the appropriate order of the data.

Click here for a solution.

In the unsorted data set we first need the first row, then the third, fourth etc until the 10’th row. The second row should be the last. I.e. we need the following order of the data lines:

sleepWide <- merge( drug1, drug2, by="ID" )
sleepWide <- sleepWide[ c(1,3:10,2), ]
sleepWide

   ID extra1 extra2
1   1    0.7    1.9
3   2   -1.6    0.8
4   3   -0.2    1.1
5   4   -1.2    0.1
6   5   -0.1   -0.1
7   6    3.4    4.4
8   7    3.7    5.5
9   8    0.8    1.6
10  9    0.0    4.6
2  10    2.0    3.4

If you need to organize your own data from long to wide it might be worth having a closer look at the dcast() function in the reshape2 package.

5.3.2 From wide to long format

To reproduce the long format of the sleepData data set from the sleepWide data set, the two columns extra1 and extra2 must be placed on top of each other and we need a variable indicating which group the measurement belongs to (like the group variable in sleepData). This can be achieved by 1) separating extra1 and extra2 into two separate data frames, 2) adding a group variable to each data frame, 3) rename the variables of the data frames to contain the same variable names and 4) pasting the two data frames on top of each other.

# 1) make a copy of the wideData, delete extra2
wide1 <- sleepWide
wide1$extra2 <- NULL
# 2) adding the group variable
wide1$group <- 1
# 3) rename the extra1 variable
names( wide1 )[2] <- "extra"

We do the same for the extra2 variable

wide2 <- sleepWide
wide2$extra1 <- NULL
wide2$group <- 2
names( wide2 )[2] <- "extra"

The two data sets now contain identical variable names and we may place them on top of each other using rbind() (row bind):

# 4) 
sleepLong <- rbind( wide1, wide2 )
sleepLong

   ID extra group
1   1   0.7     1
2   2  -1.6     1
3   3  -0.2     1
4   4  -1.2     1
5   5  -0.1     1
6   6   3.4     1
7   7   3.7     1
8   8   0.8     1
9   9   0.0     1
10 10   2.0     1
11  1   1.9     2
12  2   0.8     2
13  3   1.1     2
14  4   0.1     2
15  5  -0.1     2
16  6   4.4     2
17  7   5.5     2
18  8   1.6     2
19  9   4.6     2
20 10   3.4     2

Voila! We now have a data set identical to the sleep data except that the original extra variable now has the name extra.

Exercise 2: Reorder the sleepLong data set such that the variable in the first column is the ID variable, the second is the group and the third is extra.

Click here for a solution.

# we need the order corresponding 1st, 3rd and 2nd variable
sleepLong <- sleepLong[ , c(1,3,2)]
# or referring to the variable names
sleepLong <- sleepLong[ , c("ID","group","extra")]

Exercise 3: Reorder the sleepLong data such that the first two rows correspond to individual with ID=1, the third and fourth row to individual with ID=2 etc.

Click here for a solution.

We can sort the data according to the ID variable using order():

sleepLong <- sleepLong[ order(sleepLong$ID) , ]
head(sleepLong)

   ID group extra
1   1     1   0.7
11  1     2   1.9
2   2     1  -1.6
12  2     2   0.8
3   3     1  -0.2
13  3     2   1.1

Exercise 4: Reorder the sleepLong data such that the first row correspond to ID=1 and group=2, the second ID=1 and group=1, the third ID=2 and group=1 etc. I.e. a data frame sorted according to increasing ID and decreasing group within each individual. Consult statmethods.net to find out how to do that.

Click here for a solution.

We can sort the data according toincreasing ID and decreasing group within ID:

sleepLong <- sleepLong[ order(sleepLong$ID,-sleepLong$group) , ]
head(sleepLong)

   ID group extra
11  1     2   1.9
1   1     1   0.7
12  2     2   0.8
2   2     1  -1.6
13  3     2   1.1
3   3     1  -0.2

If you need to organize your own data from long to wide it might be worth having a closer look at the melt() function in the reshape2() package.

5.4 Defining new variables

In this chapter we show how to define various kinds of new variables based on the variables in the SundBy data set.

5.4.1 Calculation of new variables

The data SundBy is loaded into a data frame named d (as described in Chapter 4):

d <- read.csv('https://biostat.ku.dk/R/data/sundby.csv')

We can calculate new variables based on the variables in the data set. For example we often need to log-transform our variables when performing a regression analysis. We will then need a new variable in the data set being the log of the variable of interest, for example the log of weight:

d$logwgt <- log( d$wgt )

Note that we may use $-notation to add a new variable to a data set as well as we use $-notation to refer to an existing variable.

In the video below (5:28 min)we show how to add a BMI-variable to the SundBy data set. The BMI is determined from weight and height using the formula \[\begin{eqnarray*} {\rm BMI} &=& \frac{\rm weight\ in\ kilo}{\rm (height\ in\ meters)^2} \end{eqnarray*}\]

In the video we defined the bmi variable in two different ways:

# Defining height in meters before calculating BMI
d$htms <- d$ht / 100
d$bmi <- d$wgt / d$htms^2

# Calculating height in meters inside the calculation - be careful with ()!
d$bmi <- d$wgt / (d$ht/100)^2

An important point, when defining new variables, is that we should always check the values of the new variables to make sure that we did the calculations right. In the video we carefully used the head() command every time we modified the code.

5.4.2 Grouped variables from numeric variables

Sometimes we wish to break a quantitative (numeric or integer) variable into groups. A group variable is also termed a categorical variable or a factor variable. For this purpose we may use the cut() function.

In the video below (5:23 min) we demonstrate how to make a grouped version of the BMI-variable with values “Underweight” / “Normal” / “Overweight” / “Obese” corresponding to the BMI groups <= 20, 20-25, 25-30, > 30:

Click here to find the code produced in the video

d <- read.csv("https://biostat.ku.dk/R/data/sundby0_English.csv")
d$bmi <- d$wgt / (d$ht/100)^2
head(d)

summary( d$bmi )

d$bmiGrp <- cut( d$bmi, breaks=c(16,20,25,30,36),
                 labels=c('Underweight','Normal','Overweight','Obese'))
summary( d$bmiGrp )

Contents of the video:

The cut() function is used to split a numerical variable into a grouped variable based on its values. The function takes the following arguments:

cut(x, breaks, labels, right)

Here x is the numerical vector to be divided into groups. The argument breaks is a vector specifying the values defining the groups. Arguments labels and right are optional.

Having specified breaks=c(16,20,25,30,36) as in the video, the groups will be defined as (16,20], (20-25] etc. The open paranthesis ( is to be understood such that the value is not included in the interval, the bracket paranthesis such that the value is included in the interval. E.g. the second group (20-25] will contain values from > 20 (but not =20!) to 25 (25 included). To define the groups the other way around, i.e. including the lower value in each interval (e.g. [20,25)), use additional argument right=FALSE.

Warning: Always make sure that the minimum (maximum) value specified in the breaks are smaller (larger) than the smallest (largest) observed value - otherwise missing values will occur. Use e.g. a summary() of your variable to determine these values.

The label argument can be used to add informative names to the groups, e.g. labels=c('Underweight','Normal','Overweight','Obese'). There have to be as many names as groups and the names have to be specified using quotation around each name.

Quiz

How many people are normal weight in the SundBy data set?

Test your result and find multiple solutions here.

5.4.3 Binary variables

The ifelse() function can be used to create new variables with 2 levels. It has the following form:

ifelse(test, yes, no)

Here test is a logical statement that produces a TRUE/FALSE value, if the statement is TRUE, the custom value for yes is returned and if the statement is FALSE, the custom value for no is returned. The yes and no arguments can be any type of element (numeric, integer, character), but they must be of the same type.

Say we wish to make an indicator that returns 1 if the patient is normal weight (BMI above 20 and below 25) and 0 otherwise. This can be done using the ifelse function:

d$normalBMI <- ifelse( 20 < d$bmi & d$bmi <= 25, 1, 0 )

Using the head() function, we find that the indicator seems to return the correct result:

head( d, 6 )

    id gender age  wgt  ht      bmi      bmiGrp normalBMI
1 1434      2  31 70.0 172 23.66144      Normal         1
2  978      1  28 68.0 170 23.52941      Normal         1
3  701      2  55   NA 165       NA        <NA>        NA
4  134      1  26 76.0 180 23.45679      Normal         1
5  292      2  21 60.0 175 19.59184 Underweight         0
6 1481      2  27 61.5 178 19.41043 Underweight         0

5.4.4 Combining groups

Sometimes we wish to group character vectors with a large number of categories into fewer categories. To do that we use a manual approach.

5.4.4.1 Combining two groups

There are only very few underweight people in the SundBy data and we may therefore wish to combine the underweight and normal weight people. We do that by making a copy of the bmiGrp variable named bmi3Grp and next assigning the underweight people the value Normal:

d$bmi3Grp <- d$bmiGrp
d$bmi3Grp[ d$bmiGrp == 'Underweight' ] <- 'Normal'

We should always perform a check of our new variable. We can do that using summary():

> summary( d$bmiGrp )
Underweight      Normal  Overweight       Obese        NA's 
         18         128          35          12           7 
> summary( d$bmi3Grp )
Underweight      Normal  Overweight       Obese        NA's 
          0         146          35          12           7

Note that the bmi3Grp variable now as expected has 18+128=146 people registered as normal weight.

Also note that the empty group "underweight" is not deleted from the levels of the variable. To delete this level we may use the droplevels() function:

d$bmi3Grp <- droplevels( d$bmi3Grp)
summary( d$bmi3Grp )

    Normal Overweight      Obese       NA's 
       146         35         12          7

5.4.4.2 Combine several groups

Suppose we want a variable indicating whether a person is overweight or not. Here we may use the ifelse() function described above or we may use a manual approach:

d$obese[ d$bmiGrp == 'Obese' ] <- 'Obese'
d$obese[ d$bmiGrp %in% c('Underweight','Normal','Overweight') ] <- 'Not Obese'

We use %in% and not == when we wish to match with more than one element.

Note that we cannot follow the procedure used in Section 5.4.4.1 as we need to add a new level Not obese for our factor variable

> d$obese <- d$bmiGrp
> d$obese[ d$bmiGrp %in% c('Underweight','Normal weight','Overweight') ] <- 'Not obese'
Warning in `[<-.factor`(`*tmp*`, d$bmiGrp %in%
c("Underweight", "Normal weight", : ugyldigt faktorniveau,
NA oprettet

We might also define the new variable in several steps instead of using %in%:

d$obese[ d$bmiGrp == 'Obese' ] <- 'Obese'
d$obese[ d$bmiGrp == 'Underweight' ] <- 'Not Obese'
d$obese[ d$bmiGrp == 'Normal weight' ] <- 'Not Obese'
d$obese[ d$bmiGrp == 'Overweight' ] <- 'Not Obese'

5.5 Exercises

Before you start the exercises it is important that you understood how to use the functions in this chapter. We also expect that you have knowledge about handling missing values and some of the functions from other chapters. If you do not - read the content summaries and maybe the previous chapter + the chapther about vectors and data frames. All the exercises in this chapter are based on the sleepData2 defined above (Ch. 5.1).

Exercise 1A: By help of the subset()-function, create a new data frame including only smokers.

Click here for a solution.

sleepQ1A <- subset(sleepData2, smoking=="Yes")
sleepQ1A

   ID extra group BMI smoking
1   1   0.7     1  18     Yes
2   1   1.9     2  18     Yes
7   3   1.1     2  20     Yes
8   3  -0.2     1  20     Yes
11  5  -0.1     1  22     Yes
12  5  -0.1     2  22     Yes
15  7   5.5     2  24     Yes
16  7   3.7     1  24     Yes
19  9   0.0     1  26     Yes
20  9   4.6     2  26     Yes

Exercise 1B: By help of the subset()-function, create a new data frame including only smokers with a BMI at 20 or above.

Click here for a solution.

sleepQ1C <- subset(sleepData2, smoking=="Yes" & BMI >= 20)
sleepQ1C

   ID extra group BMI smoking
7   3   1.1     2  20     Yes
8   3  -0.2     1  20     Yes
11  5  -0.1     1  22     Yes
12  5  -0.1     2  22     Yes
15  7   5.5     2  24     Yes
16  7   3.7     1  24     Yes
19  9   0.0     1  26     Yes
20  9   4.6     2  26     Yes

Exercise 1C: By help of the subset()-function, create a new data frame including only smokers with BMI below 20 or above 25.

Click here for a solution.

sleepQ1C <- subset(sleepData2, smoking=="Yes" & (BMI < 20 | BMI > 25 ) )
sleepQ1C

   ID extra group BMI smoking
1   1   0.7     1  18     Yes
2   1   1.9     2  18     Yes
19  9   0.0     1  26     Yes
20  9   4.6     2  26     Yes

NB: Be aware of the parantheses as AND (&) precedes OR (|) (like multiplication precedes addition). If you do not add the paranthesis around the BMI-expression (e.g. using smoking=="Yes" & BMI < 20 | BMI > 25) you will have a different result corresponding to (smoking=="Yes" & BMI < 20) | BMI > 25, namely underweight smokers and all overweight patients.

Always look at your data to make sure you got it right! ___

Exercise 2: Use the cut() function to assign a grouped variable to the sleepData2 data set. Name the new variable change. The groups of change is defined as follows: “Reduction”: extra is less than -0.5, ” No change”: extra is between -0.5 and 0.5, “Increase”: extra is greater than 0.5. For how many of the measurements were sleep increased resp. decreased?

Click here for a solution.

sleepData2$change <- cut( sleepData2$extra, breaks=c(-2,-0.5,0.5,6),
                          labels=c("Reduction", "No change", "Increase") )
summary( sleepData2$change )

Reduction No change  Increase 
        2         5        13

# 13 measurements had increased sleep, 2 reduced

Exercise 3: Create a new 0/1-variable using the ifelse() function that indicates whether or not sleep increased (1: extra>0 vs 0: extra<= 0).

For how many measurements were sleep increased using this definition?

Click here for a solution.

sleepData2$increase01 <- ifelse( sleepData2$extra > 0, 1, 0)
summary( sleepData2$increase01 )

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     1.0     0.7     1.0     1.0

Thus, with a mean of 0.7 of the 0/1-variable, 70% of the measurements corresponding to 0.7*20=14 measurements. However, this is a rather strange way of finding the number of measurements - an easier way that you will learn later is to make a table of the variable:

table( sleepData2$increase01)


 0  1 
 6 14