3 Data structures in R
The estimated amount of time to complete this chapter is 1-2.5 hours.
In this chapter you will be given an introduction to data formats and structures in R, in particular how data sets are organized and how we access the values in a data set. You probably have been working with data sets in Excel. Data sets are organized in a similar way in R, namely in rows and columns. You will read about data sets and watch two videos. Finally there is a quiz where you will be guided through some of the steps illustrated in the videos using a specific R data set.
Depending on your familiarity with R, we expect you to use between 45 min and 3 hours on this chapter.
3.1 Data types
Every object in R is of a certain type. This type is found using the typeof() command, which returns the type of the object in the parentheses. We mention some of the most common data types here.
Numeric
All calculations we have done so far have been with numbers of the numeric class. These include integers, decimal numbers, constants like \(\pi\) or \(e\) and many more. To check if an object, for instance \(\pi\), is numeric, write
> is.numeric(pi)
[1] TRUE
Note that numeric is not a type, but several types of objects are in the numeric class. The type of \(\pi\) is the same as the type of 2:
> typeof(pi)
[1] "double"
> typeof(2)
[1] "double"
The double type is the most common data type since practically all numbers are of this type.
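As a small illustration (not part of the original material), the sketch below shows one way to create a number that is not of the double type. It assumes the L suffix, which asks R to store a whole number as an integer; the variable names are made up for this example.

```r
# A whole number typed without a suffix is stored as a double
typeOfPlain <- typeof(2)      # "double"

# The L suffix asks R to store the number as an integer instead
typeOfSuffixed <- typeof(2L)  # "integer"

# Both types belong to the numeric class
bothNumeric <- is.numeric(2) && is.numeric(2L)  # TRUE

typeOfPlain
typeOfSuffixed
bothNumeric
```

In everyday calculations the difference rarely matters, since R converts between the two types when needed.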
Text values
Often a variable will not be a number but a word or a combination of letters with a certain meaning. Objects like these are of the character type. We create a character type object and then check that it actually is of the desired type.
> name <- "health_variable"
> typeof(name)
[1] "character"
Character strings can include symbols such as _, - and &, and they can also include digits.
> name <- "health_variable_1"
> typeof(name)
[1] "character"
Note that this might sometimes lead to confusion whenever a number, for instance 2.5, is saved as a character string instead of as a number:
> variable <- "2.5"
> typeof(variable)
[1] "character"
This is a common issue when working with real life data. If we would like the variable to be treated as a numeric object of the double type, we can change the type by writing
> variable <- "2.5"
> variable <- as.numeric(variable)
> typeof(variable)
[1] "double"
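The conversion only works when the text actually looks like a number. The following sketch (an illustration with made-up values, not taken from the original material) suggests what to expect otherwise: entries that cannot be read as numbers become NA, and R issues a warning, which is silenced here with suppressWarnings().

```r
# Three text values; the last one is not a number
values <- c("2.5", "3", "abc")

# as.numeric() warns that NAs are introduced; suppressWarnings() hides that warning
converted <- suppressWarnings(as.numeric(values))

converted         # 2.5 3.0 NA
is.na(converted)  # FALSE FALSE TRUE
```

Checking for NA values after a conversion is a simple way to spot entries that were not stored as proper numbers.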
Logical values
A logical value is a value indicating whether something is TRUE or FALSE. To check if two things are equal in R, we have to use two equals signs, ==. For instance
> 7 + 11 == 18
[1] TRUE
> 7 + 11 == exp(5)
[1] FALSE
Luckily the output tells us that the first statement is true while the second is false. To check if two things are not equal, write !=
> "variable" != "cariable"
[1] TRUE
Likewise we can use operators such as <= (smaller than or equal to), >= (greater than or equal to), along with < and >.
> 2 >= 3
[1] FALSE
It is worth noting that R can perform calculations with logical values as it stores TRUE values as 1 and FALSE values as 0.
> 3 + (5 == 5)
[1] 4
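Because TRUE counts as 1 and FALSE as 0, logical values can also be summed. The sketch below is an illustration with a made-up vector called answers; it uses the c() function, which is introduced in the next section, to hold several logical values at once.

```r
# Four logical values (made up for illustration)
answers <- c(TRUE, FALSE, TRUE, TRUE)

# sum() counts the number of TRUE values
numberTrue <- sum(answers)      # 3

# mean() gives the proportion of TRUE values
proportionTrue <- mean(answers) # 0.75

numberTrue
proportionTrue
```

This is a convenient way to count how many values fulfil a condition.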
3.2 Vectors
If we have several objects of the same type, we can combine them in a vector by using the combine function, c(). The numbers 2, 5 and -3.5 are stored in a vector called y by
> y <- c(2, 5, -3.5)
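What happens if the objects are not of the same type? The sketch below is an illustration that is not part of the video, and the vector names are made up. It suggests that c() converts all elements to one common type so that the result is still a valid vector.

```r
# Numbers and logical values: the logical value is converted to a number
numbersAndLogical <- c(1, TRUE)
typeof(numbersAndLogical)   # "double"

# As soon as one element is text, everything becomes text
mixed <- c(1, "a", TRUE)
typeof(mixed)               # "character"
mixed                       # "1" "a" "TRUE"
```

If a vector of numbers unexpectedly has type character, a stray text entry is a likely cause.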
How to do calculations using vectors is illustrated in the video below (9:45 min).
Code produced in the video:
# Author: Anne
# Description: Basic data structures in R
# vectors
x <- c(1,0,1,0,1,1,1,1,0,0)
# first introduction to a function - how to get help
?c
help(c)
(x+2)^2 # apply a calculation to all entries in a vector
height <- c(1.65, 1.79, 1.62, 1.87) # store more meaningful vectors
weight <- c(55.2, 89.7, 49.8, 92.0)
bmi <- weight/height^2 # use the vectors to calculate new information
bmi2 <- weight/(height^2)
firstName <- c("Anne", "Anna", "Anders", "Andreas") # store characters
mathMajor <- c(TRUE, FALSE, TRUE, FALSE) # store logical values
typeof(firstName) # find the type
firstName
mathMajor
# indexes for vectors
firstName[3]
firstName[c(1,2,3)]
firstName[3] <- "Andre" # change an entry
firstName
Contents of the video:
A vector is a collection of elements all of the same type. A coarse overview of the types includes:
- Numeric (decimal numbers and integers)
  - E.g. 1, -1, 0, 3.98, 3.14 etc.
- Logical (true or false indicators)
  - TRUE/FALSE and T/F
- Character (names, levels etc.)
  - E.g. "Apple", "Pear", "A", "B" etc.
Besides having a type, a vector also has a length (i.e. the number of elements). The length can be found by:
length(vector)
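As a small illustration (reusing the vector y from the start of this section; the other names are made up), the following sketch shows length() together with calculations that are applied to every element of a vector.

```r
y <- c(2, 5, -3.5)

# Number of elements in the vector
n <- length(y)     # 3

# Arithmetic is applied to each element separately
doubled <- y * 2   # 4 10 -7

# Some functions summarise the whole vector in one number
total <- sum(y)    # 3.5

n
doubled
total
```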
To extract elements of a vector, one can use indexing. An example of indexing is shown below, where element nr. 1 and elements nr. 2 to 5 are extracted, respectively:
> x <- c(1,-1,0,4,5,9,57)
> x
[1]  1 -1  0  4  5  9 57
> x[1]
[1] 1
> x[2:5]
[1] -1  0  4  5
You can also extract based on conditions instead of position. Say that you want all of the values larger than 1:
> x[x>1]
[1]  4  5  9 57
x > 1 is a condition that gives a logical vector as output:
> x>1
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
This logical output is what selects the correct elements in the vector x:
> x[c(FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE)]
[1]  4  5  9 57
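Conditions can also be combined. The sketch below is an illustration that goes slightly beyond the video; it assumes the operators & (and) and | (or) as well as the function which(), which returns the positions where a condition is TRUE. The result names are made up.

```r
x <- c(1, -1, 0, 4, 5, 9, 57)

# Positions (not values) where the condition holds
positions <- which(x > 1)         # 4 5 6 7

# Both conditions must hold: larger than 1 AND smaller than 10
between <- x[x > 1 & x < 10]      # 4 5 9

# At least one condition must hold: negative OR larger than 10
extremes <- x[x < 0 | x > 10]     # -1 57

positions
between
extremes
```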
3.3 Data frames
Data sets in R are saved in data frames. A data frame consists of vectors of equal length, organized in columns, thereby making it two-dimensional. A data frame thus consists of rows (one for each subject in our study) and columns (one for each variable/vector). In this video (12 min) you are introduced to data frames: how they can be defined and how to locate specific values within a data frame.
Code produced in the video:
# Author: Anne
# Description: Basic data structures in R
# vectors
x <- c(1,0,1,0,1,1,1,1,0,0)
# first introduction to a function - how to get help
?c
help(c)
(x+2)^2 # apply a calculation to all entries in a vector
height <- c(1.65, 1.79, 1.62, 1.87) # store more meaningful vectors
weight <- c(55.2, 89.7, 49.8, 92.0)
bmi <- weight/height^2 # use the vectors to calculate new information
bmi2 <- weight/(height^2)
firstName <- c("Anne", "Anna", "Anders", "Andreas") # store characters
mathMajor <- c(TRUE, FALSE, TRUE, FALSE) # store logical values
typeof(firstName) # find the type
firstName
mathMajor
# indexes for vectors
firstName[3]
firstName[c(1,2,3)]
firstName[3] <- "Andre" # change an entry
firstName
# collecting data (data.frame)
roomies <- data.frame(firstName,height,weight,mathMajor)
roomies$age <- c(54,25,46,76)
roomies
# age # gives an error: age only exists as a column inside roomies
roomies$age # collect the age variable from the roomies dataset
# indexes on a data frame
roomies[2]
roomies[,2] # second column
roomies[2,] # second row
roomies[2,"age"]
roomies[2,5]
roomies$age[2]
Contents of the video:
A data frame may consist of different types of data, but all values within a column must be of the same type. An example of a data frame named kids containing different data types is the following:
> kids <- data.frame(subject = c(1,2,3,4), gender = c("F","M","F","M"), age = c(7, 5, 9, 2))
> kids
  subject gender age
1       1      F   7
2       2      M   5
3       3      F   9
4       4      M   2
For each of the 4 kids, the data set contains the subject id, gender and age. To find the number of observations (rows, one for each subject) and the number of variables (columns), we may use the command dim:
> dim(kids)
[1] 4 3
The data consists of four observations (kids) and there are three columns (variables) in total.
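A few related commands give the same information piece by piece. The sketch below is an illustration that recreates the kids data frame from above; nrow(), ncol(), names() and str() are standard R functions, while the result names are made up for this example.

```r
kids <- data.frame(subject = c(1, 2, 3, 4),
                   gender = c("F", "M", "F", "M"),
                   age = c(7, 5, 9, 2))

numberOfRows <- nrow(kids)     # 4
numberOfColumns <- ncol(kids)  # 3
columnNames <- names(kids)     # "subject" "gender" "age"

# str() prints a compact overview: dimensions, column names and column types
str(kids)
```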
Extracting elements of a data frame can be done in multiple ways. To extract one particular element, the two following methods can be applied:
kids[i,j]
kids[i,"gender"]
Here i can be any number between 1 and the number of rows in the data frame (4 in our example), and j can be any number between 1 and the number of columns in the data frame (3 in our example). j can also be a column/variable name in the data set. Referring to a column by name requires single (' ') or double (" ") quotation marks around the variable name. The code above will not run as written because i and j are not defined. An example of how to obtain the value in the 2nd row of the 2nd column is:
> kids[2,2]
[1] M
Levels: F M
> kids[2,"gender"]
[1] M
Levels: F M
(In R version 4.0.0 and later, data.frame() keeps text columns as character by default, so the output is simply [1] "M" without the Levels line.)
Extracting entire rows and columns is also possible:
# extract entire 2nd row
kids[2,]
# extract entire 2nd column
kids[,2]
kids[,"gender"]
kids$gender
The $ notation, used for extracting the values of the gender column, uses that gender is a name of one of the columns in the kids data set.
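The different ways of extracting a column do not all return the same kind of object. The following sketch is an illustration with made-up result names, recreating the kids data frame: with a comma, or with $, the column comes back as a plain vector, whereas single brackets without a comma return a data frame with one column.

```r
kids <- data.frame(subject = c(1, 2, 3, 4),
                   gender = c("F", "M", "F", "M"),
                   age = c(7, 5, 9, 2))

asVector <- kids[, "gender"]  # the values of the column as a vector
asFrame <- kids["gender"]     # a data frame with a single column

is.data.frame(asVector)       # FALSE
is.data.frame(asFrame)        # TRUE

# The $ notation and the comma notation give identical results
sameResult <- identical(kids$gender, kids[, 2])  # TRUE
sameResult
```

For calculations on a column, the vector form is usually the one you want.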
To extract e.g. the first 3 rows of the data set one can use a sequence. The sequence 1:3 generates the following:
> 1:3
[1] 1 2 3
Using the sequence on the kids data set extracts the first 3 rows:
> kids[1:3,]
  subject gender age
1       1      F   7
2       2      M   5
3       3      F   9
The first 3 columns can be extracted in the same manner: kids[,1:3]. The : operator creates a sequence, and it can create sequences from any negative or positive number to any negative or positive number. Try to play around a bit with positive and negative numbers!
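A few sequences to start from are sketched below. This is only an illustration with made-up names; seq() is a standard R function that is not covered in the text and is included here as one option for sequences with a step size other than 1.

```r
# Counting downwards
down <- 3:1                  # 3 2 1

# From a negative to a positive number
around <- -2:2               # -2 -1 0 1 2

# Every second number from 1 to 9
odd <- seq(1, 9, by = 2)     # 1 3 5 7 9

down
around
odd
```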
The same way as with vectors, you can also extract observations based on a specific condition on the rows. If we wish to see the females only, we may use:
> kids[kids$gender=="F",]
  subject gender age
1       1      F   7
3       3      F   9
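Several conditions can be combined in the same way. The sketch below is an illustration with made-up result names, recreating the kids data frame; it also shows subset(), a standard R function not covered in the text, as one possible alternative to the bracket notation.

```r
kids <- data.frame(subject = c(1, 2, 3, 4),
                   gender = c("F", "M", "F", "M"),
                   age = c(7, 5, 9, 2))

# Females older than 8: both conditions must be TRUE for a row to be kept
olderGirls <- kids[kids$gender == "F" & kids$age > 8, ]
olderGirls   # only subject 3

# subset() lets us refer to the column names directly
youngKids <- subset(kids, age < 6)
youngKids    # subjects 2 and 4
```

Note the comma after the condition in the bracket notation: it tells R that the condition selects rows and that all columns should be kept.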
3.4 Quiz
R has several built-in data sets. In this quiz we will consider the data set named sleep, based on the paper by Cushny and Peebles (1905), The action of optical isomers: II hyoscines, comparing the effect of two soporific drugs on the number of hours of sleep in a group of 10 patients.
In the sleep study, each patient was studied over several nights given 1) Hyoscyamine, 2) Hyoscine or 3) no treatment. The average hours of sleep with each treatment were registered. We are interested in comparing each of the two treatments to no treatment as well as comparing the two active treatments.
The sleep data contains two measurements for each patient: the difference in average hours of sleep with Hyoscyamine compared to control, and with Hyoscine compared to control. The data has three variables: extra is the difference between hours of sleep on treatment and hours of sleep without treatment, group indicates the treatment (1=Hyoscyamine, 2=Hyoscine) and ID is the patient id number:
sleep
extra group ID
1 0.7 1 1
2 -1.6 1 2
3 -0.2 1 3
4 -1.2 1 4
5 -0.1 1 5
6 3.4 1 6
7 3.7 1 7
8 0.8 1 8
9 0.0 1 9
10 2.0 1 10
11 1.9 2 1
12 0.8 2 2
13 1.1 2 3
14 0.1 2 4
15 -0.1 2 5
16 4.4 2 6
17 5.5 2 7
18 1.6 2 8
19 4.6 2 9
20 3.4 2 10
We note that the first 10 lines correspond to treatment group 1, the last 10 to group 2. Each patient contributes two measurements, one for each treatment.
Access to the data set is obtained by typing and running:
sleepData <- sleep
Note that the data set sleepData now appears in your Environment. You can view the data in RStudio by typing sleepData or View(sleepData).
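For larger data sets, printing everything is rarely practical. As a small illustration (not part of the quiz; the result names are made up), the standard functions head() and tail() show only the first or last rows of a data frame:

```r
# Copy the built-in data set, as described above
sleepData <- sleep

# First 3 rows
firstRows <- head(sleepData, 3)
firstRows

# Last 3 rows
lastRows <- tail(sleepData, 3)
lastRows
```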
In the quizzes you will be introduced to some new commands that are very useful when working with data frames. There is a total of 4 quiz questions.
Quiz question 1
How many records (observations) does the data contain? How many variables?
Start the quiz here. You might find the answer to this quiz obvious - do the quiz anyway to learn a few more commands you may use to find the answer.
As specified in the welcome text to this introduction, the quiz will not work if you are using an old version of the browser (find a complete list of supported browsers here)
Quiz question 2
How many times was an increase in the average hours of sleep observed when comparing treatment to placebo? (I.e. how many observations in the extra column have a value > 0?)
Start the quiz here.
Quiz question 3
Assign a new vector to your data set called extraMinutes which is the extra sleep calculated in minutes instead of hours.
Which of the following commands can be used to achieve this?
- sleep$extraMinutes <- sleep$extra*60
- sleepData$extraMinutes <- sleepData$extra/60
- sleepData[,"extraMinutes"] <- sleepData[,"extra"]*60
- sleepData[,"extraMinutes"] <- sleepData[,"extra"]/60
Start the quiz here.
Quiz question 4
Missing data values are represented by the value NA (Not Available).
What happens when running the code:
sleepData[3,"extra"] <- NA
sleepData
Start the quiz here.