R command sheet

 

The homepage www.statmethods.net contains a short and clear presentation of many of the commands below.

 

This page is dynamic and will be updated during the course as more commands are needed.

 

Loading and viewing data

 

First set your working directory to the folder containing your data file. You can either browse from RStudio using the menu Tools -> Set Working Directory -> Choose Directory.

Alternatively you can copy the location to the command setwd(), e.g.:

 

setwd( 'C:/Users/Susanne/Rworks/' )

 

Load data using the command read.dbf() from the foreign-package:

 

library(foreign)

d <- read.dbf( 'datafile.dbf' )

 

head( d )

head( d, n=10 )

Prints the first 6 lines of the data set named d. The additional argument n sets the number of lines to be printed (here to 10)

View( d )

Shows the data set in a spreadsheet-like (Excel look-alike) viewer in a new tab.

 

 

Data manipulations

 

 

d2 <- subset(d, Grp==1)

Define a sub data set of d named d2 containing only those elements of d for which the logical condition Grp==1 is fulfilled. I.e. only observations in group 1.

d$var1

Use $-notation to access a variable named var1 in data set d.

d$new.var <- d$var1 / d$var2

Define a new variable in data set d named new.var containing var1 divided by var2.

d$grp.var <- cut( d$var1, breaks=c(0,100,200,1000),

labels=c('label1','label2','label3') )

Creates a factor (group variable) in d named grp.var by chopping d$var1 into pieces. Cut points are specified by the breaks argument. Make sure that the lowest and highest values in breaks cover the whole range of d$var1; values falling outside the breaks become missing (NA).

Specify labels argument to control the naming of the groups.

d$heavy <- 1*( d$weight > 100)

 

An example of defining a binary (0/1) variable based on a quantitative variable
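The commands in this section can be tried out on a small made-up data set. The sketch below is only an illustration: the column names Grp, weight and height and all the values are hypothetical.

```r
# Hypothetical toy data; names and values are assumptions for illustration
d <- data.frame(Grp    = c(1, 1, 2, 2, 1),
                weight = c(80, 120, 95, 150, 101),
                height = c(1.6, 1.8, 1.7, 1.9, 1.75))

# Keep only the observations in group 1
d2 <- subset(d, Grp == 1)

# New variable computed from two existing ones
d$ratio <- d$weight / d$height

# Chop weight into three groups; the breaks cover the whole range of weight
d$grp.var <- cut(d$weight, breaks = c(0, 100, 200, 1000),
                 labels = c('label1', 'label2', 'label3'))

# Binary 0/1 variable: a TRUE/FALSE condition multiplied by 1
d$heavy <- 1 * (d$weight > 100)

# Cross-check the two derived variables against each other
table(d$grp.var, d$heavy)
```

Here the groups and the 0/1 variable agree by construction, because the first cut point (100) equals the threshold used for heavy.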

 

Exporting data

 

The data set named d can be exported in dbf-format by specifying

library(foreign)
write.dbf( d, 'newFile.dbf')

which will generate a dBase file with name newFile.dbf in your working directory.

You can also export a csv-file (Comma Separated Values). Excel can read such files. Simply write:
write.csv2( d, 'newFile.csv')
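to have R create a csv-file in your working directory. Note that write.csv2() separates the fields with semicolons and uses a comma as decimal mark (the convention Excel uses in many European locales); write.csv() uses commas between fields and a decimal point.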
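A minimal round-trip sketch of the csv export. It assumes a made-up data frame and writes to a temporary file, so nothing in your working directory is touched. The argument row.names=FALSE and the reader read.csv2() are additions not mentioned above.

```r
# Hypothetical toy data, written to a temporary file
d <- data.frame(id = 1:3, weight = c(70.5, 82.1, 65.0))
f <- tempfile(fileext = '.csv')

# write.csv2 puts ';' between fields and uses ',' as decimal mark
write.csv2(d, f, row.names = FALSE)

# Look at the raw text that was written
readLines(f)

# read.csv2 is the matching reader (same separator and decimal mark)
d.back <- read.csv2(f)
d.back
```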

 

Working with vectors

 

seq( from=x, to=y)

Generates a sequence of numbers from x to y with steps of size 1. Additional arguments:

by=b specifies the steps to be of size b.

length=l specifies the total length to equal l.

See item F Day 1 for more details.

rep(x, times=t)

Repeats element x t times. See item G Day 1 for more details

x[ 1:4 ]

Use [] to pick out specific elements of a vector x. See Item H Day 1.

sort( x )

Sorts elements in x.

rank( x )

Determines the ranks of x.

length( x )

Gives the length of the vector, i.e. the number of elements.

which( x )

Tells which elements of x are TRUE, i.e. x has to be a TRUE/FALSE vector (e.g. a condition like y>600).
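The vector commands above, tried on a short made-up vector (the values are chosen only for illustration):

```r
x <- c(30, 10, 20, 40)   # hypothetical example vector

seq(from = 1, to = 5)               # 1 2 3 4 5
seq(from = 0, to = 1, by = 0.25)    # 0.00 0.25 0.50 0.75 1.00
seq(from = 0, to = 10, length = 3)  # 0 5 10
rep(7, times = 3)                   # 7 7 7

x[1:2]          # first two elements: 30 10
sort(x)         # 10 20 30 40
rank(x)         # 3 1 2 4
length(x)       # 4
which(x > 15)   # positions where the condition is TRUE: 1 3 4
```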

 

Working with data

 

dim( d )

Gives dimension of the data set d.

summary(d)

Gives a summary of each of the variables in d

median( d$var1 )

Median of variable named var1 in data set d. You might need argument na.rm=T if any missing values.

quantile( d$var1 )

Quantiles of a variable named var1 in data set d. You need to add argument na.rm=T if any missing values. Extra argument probs can be used to control which quantiles to compute, e.g. probs=c(.025,.975) to determine lower and upper 2.5% quantile

max( d$var1 )

Max of variable var1.

min( d$var1 )

Min of variable var1.

range( d$var1 )

Range gives min and max of var1.

All three functions require na.rm=T if there are missing values.

mean( d$var1)

Mean of variable named var1 in data set d. You might need argument na.rm=T if any missing values

sd( d$var1 )

Standard deviation of variable named var1 in d. You might need argument na.rm=T if any missing values.
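A short sketch of how missing values affect these summaries, using a hypothetical variable with one NA:

```r
# Hypothetical variable with one missing value
d <- data.frame(var1 = c(2, 4, NA, 8, 6))

mean(d$var1)                   # NA, because of the missing value
mean(d$var1, na.rm = TRUE)     # 5
median(d$var1, na.rm = TRUE)   # 5
sd(d$var1, na.rm = TRUE)
range(d$var1, na.rm = TRUE)    # 2 8
quantile(d$var1, probs = c(.025, .975), na.rm = TRUE)
```

Writing TRUE instead of T is slightly safer, since T is an ordinary variable that can be overwritten.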

 

Generating data

 

sample( x, no, replace=T)

Draws no numbers from a vector x with replacement. Replace T with F to draw without replacement.
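A small sketch of sampling with and without replacement. The call to set.seed() is an addition not mentioned above; it only makes the random draw repeatable.

```r
set.seed(1)   # assumption: fix the seed so the draws can be repeated
x <- 1:10     # hypothetical vector to draw from

with.rep    <- sample(x, 5, replace = TRUE)    # values may occur more than once
without.rep <- sample(x, 5, replace = FALSE)   # all values are distinct

with.rep
without.rep
```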

 

 

Calculations

 

sqrt( X )

Calculate square root of a number X

log( X )

Natural log of X

log2( X ) or log( X, base=2)

Log base 2 of X

log10( X ) or log( X, base=10)

Log base 10 of X

exp( X )

Anti-log of X (natural log), exp(1)=e~2.72

X^y

The y-th power of X

round( X, 2 )

Rounds the number X to 2 decimals
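A few worked numbers for the calculation commands (the input values are arbitrary examples):

```r
X <- 8   # arbitrary example value

sqrt(16)            # 4
log2(X)             # 3
log(X, base = 2)    # same result as log2(X)
log10(1000)         # 3
exp(log(X))         # 8 (up to rounding error): exp undoes the natural log
X^2                 # 64
round(exp(1), 2)    # 2.72
```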

 

 

Tables

 

mytable <- matrix( c(1,2,3,4), nrow=2)

Make a 2x2 table with elements 1-4

table( d$x )

One-way table of x.

Add argument useNA='ifany' to count also the missing values.

table( d$x , d$y )

Tabulate x vs y

table( d$x , d$y, d$z )

Tabulate x vs y stratified on z

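mytable <- table(d$x, d$y )

Saves the two-way table of x vs y in an object named mytable.

prop.table(mytable)

Determine cell proportions (multiply by 100 to get percentages)

prop.table(mytable, 1)

Determine row proportions

prop.table(mytable, 2)

Determine column proportions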

chisq.test( mytable, correct=F )

Pearson chi-square test of independence between x and y. If argument correct=F is omitted for 2 by 2 tables, Yates' continuity correction will be applied

prop.test(mytable)

Test and CI of the difference between two proportions

binom.test(x,n)

Exact binomial test observing x of n possible successes. Tests hypothesis p=0.5 per default (change with additional argument p=). Provides exact CI.

fisher.test( mytable )

Fisher test of independence between x and y

oddsratio( mytable, method='wald' )

Calculates OR in a 2 by 2 table. The oddsratio() command is found in the package 'epitools' (i.e. use library(epitools) before using oddsratio()). Use method='wald' to have the CI based on the Wald method. Use rev='rows' to reverse the rows (i.e. flip OR).

riskratio( mytable, method='wald' )

Calculates relative risk RR in a 2 by 2 table. The command is found in the package 'epitools' (i.e. use library(epitools) before using riskratio()). CI are based on the Wald method. Use rev='rows' to reverse the rows (i.e. flip RR)
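The table commands can be sketched on a made-up 2 by 2 table. The counts and the dimension names below are hypothetical, and the odds ratio is computed by hand so that the example runs without the 'epitools' package.

```r
# Hypothetical 2x2 counts: rows = exposure, columns = outcome
mytable <- matrix(c(10, 20, 30, 40), nrow = 2,
                  dimnames = list(exposure = c('no', 'yes'),
                                  outcome  = c('no', 'yes')))
mytable

prop.table(mytable)      # cell proportions, sum to 1
prop.table(mytable, 1)   # row proportions, each row sums to 1
prop.table(mytable, 2)   # column proportions, each column sums to 1

chisq.test(mytable, correct = FALSE)   # Pearson chi-square, no Yates correction
fisher.test(mytable)                   # Fisher's exact test

# Sample odds ratio by hand: (a*d) / (b*c)
or.hand <- (mytable[1, 1] * mytable[2, 2]) / (mytable[1, 2] * mytable[2, 1])
or.hand
```

Note that matrix() fills the table column by column, so the first column holds 10 and 20.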

 

 

T-test

 

t.test( d$var1)

One sample t-test on variable named var1

t.test(d$var1, mu=7)

One sample t-test, test hypothesis that mean = 7

t.test( d$var1~d$group, var.equal=T)

Two-sample t-test comparing the means of var1 in the two groups specified by a variable named group.

t.test( d1$var1, d2$var2, var.equal=T)

Alternative way of requesting the t-test. Here we compare the mean of var1 in data set d1 with the mean of var2 in data set d2.

t.test(d1$var1, d2$var2, paired=T )

A paired t-test comparing mean of var1 to mean of var2 (supposed to be measured on same individuals / animals / items etc)

var.test( d$var1~d$group )

A formal test of whether the variances of var1 in the two groups specified by group are the same.
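A sketch of the two-sample comparison on simulated data. The group labels, sample sizes and means are made up, set.seed() is only there to make the simulation repeatable, and the formula is written with a data= argument instead of $-notation (both forms give the same test).

```r
set.seed(42)   # assumption: fixed seed so the simulated data are reproducible

# Hypothetical data: 20 observations in each of two groups
d <- data.frame(group = rep(c('A', 'B'), each = 20),
                var1  = c(rnorm(20, mean = 10, sd = 2),
                          rnorm(20, mean = 12, sd = 2)))

# Formal test of equal variances in the two groups
var.test(var1 ~ group, data = d)

# Two-sample t-test assuming equal variances
res <- t.test(var1 ~ group, data = d, var.equal = TRUE)
res$p.value    # two-sided p-value
res$conf.int   # CI for the difference in means

# Without var.equal=TRUE, R performs the Welch test (unequal variances)
t.test(var1 ~ group, data = d)
```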

 

 

Non-parametric comparisons

 

wilcox.test( d$var1, mu=7 )

One-sample Wilcoxon test investigating whether median equals 7

wilcox.test(d$var1~d$group )

Two-sample Wilcoxon test investigating whether the medians in the two groups specified by group are equal. Only works for two groups.

wilcox.test(d1$var1, d2$var2)

Alternative use. Here we compare the median of var1 in data set d1 with the median of var2 in data set d2

kruskal.test(d$var1~d$group)

Kruskal-Wallis test. Can be used for comparing 2 or more groups.
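A sketch of the non-parametric tests on simulated skewed data. The three groups and their distributions are hypothetical; the Wilcoxon test is run on a subset with only two groups, since it cannot handle more.

```r
set.seed(7)   # assumption: fixed seed so the simulated data are reproducible

# Hypothetical skewed data in three groups
d <- data.frame(group = rep(c('A', 'B', 'C'), each = 10),
                var1  = c(rexp(10, rate = 1),
                          rexp(10, rate = 0.5),
                          rexp(10, rate = 0.25)))

# Wilcoxon needs exactly two groups, so group C is dropped first
d.ab <- subset(d, group != 'C')
w <- wilcox.test(var1 ~ group, data = d.ab)
w$p.value

# Kruskal-Wallis handles all three groups at once
k <- kruskal.test(var1 ~ group, data = d)
k$p.value
```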

 

 

Correlation

 

cor.test( d$x, d$y )

Determines Pearson correlation between x and y. Add option method='spearman' or method='kendall' to determine Spearman's correlation or Kendall's tau. NB: in this form cor.test takes no data-option, so $-notation is needed (a formula form, cor.test(~ x + y, data=d), also exists).
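A short sketch of cor.test on simulated data (the variables and the seed are made up for illustration):

```r
set.seed(3)   # assumption: fixed seed so the simulated data are reproducible

# Hypothetical data where y depends on x plus noise
d <- data.frame(x = rnorm(30))
d$y <- d$x + rnorm(30)

ct <- cor.test(d$x, d$y)   # Pearson correlation
ct$estimate                # the estimated correlation
ct$conf.int                # 95% CI (given for Pearson only)
ct$p.value

cor.test(d$x, d$y, method = 'spearman')   # rank-based alternative
```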

 

Linear regression models

 

lm1 <- lm(y ~ x, data=d)

lm=Linear Model. Performs linear regression analysis of y on x.
Always save the result in an object (here named lm1).

summary(lm1)

Gives a summary of the results from the fitted regression model lm1.

coef(lm1)

Estimated coefficients from the model (intercept and slope)

confint(lm1)

Confidence intervals of estimated parameters (intercept and slope).

plot(lm1)

Gives various plots used for model assessment. Add option which=1 for residuals vs predicted values, which=2 for quantile-quantile plot, which=4 for Cook's distance.

predict(lm1, newdata=newD)

Prediction of mean values for individuals with values of x specified in a data frame newD, e.g. newD <- data.frame(x=1:5). newD MUST contain the same variable names as used in the model. Add option interval='confidence' to get confidence intervals for the means, or interval='prediction' to get prediction intervals.

cooks.distance(lm1)

Determines Cook's distance for each individual. Compare with 4/n.

dfbetas(lm1)

DFBETAS for each individual (estimate of how much an observation has affected the estimated coefficients). Compare with 2/sqrt(n).
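The regression commands above can be sketched end to end on simulated data. The true intercept and slope (2 and 0.5), the sample size and the seed are arbitrary assumptions.

```r
set.seed(1)   # assumption: fixed seed so the simulated data are reproducible

# Hypothetical data: straight line plus noise
d <- data.frame(x = 1:20)
d$y <- 2 + 0.5 * d$x + rnorm(20)

lm1 <- lm(y ~ x, data = d)
coef(lm1)      # estimated intercept and slope
confint(lm1)   # 95% confidence intervals

# newD must use the same variable name (x) as the model
newD <- data.frame(x = c(5, 10, 15))
ci <- predict(lm1, newdata = newD, interval = 'confidence')
pr <- predict(lm1, newdata = newD, interval = 'prediction')
ci
pr   # same fitted values, but wider intervals

# Flag influential observations using the rules of thumb 4/n and 2/sqrt(n)
n <- nrow(d)
which(cooks.distance(lm1) > 4 / n)
which(apply(abs(dfbetas(lm1)), 1, max) > 2 / sqrt(n))
```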

 

Logistic regression models

 

glm1 <- glm( y01 ~factor(group), data=d, family=binomial )

glm=Generalized Linear Model. Performs a logistic regression analysis. The response y01 has to have values 0 or 1.
Always save the result in an object (here named glm1).

summary( glm1 )

Gives a summary of the results from the fitted regression model glm1.

relevel( factor( group ), ref=r )

Is used to change the reference group of a factor variable used in regression analysis. The argument r to ref specifies which group should be the reference (1=1st level, 2=2nd level etc.).

coef( glm1 )

Requests the estimated coefficients (differences in log-odds (ie log OR) for a logistic regression).

confint.default( glm1 )

Calculates confidence intervals from a model fit (e.g. glm1) based on the Wald-method.

drop1( glm1, test='Chisq')

Performs an overall test for each variable in the model, asking whether each term may be deleted assuming the remaining terms are kept in the model.

anova( glm1, glm2)

Compares two models, the one being a coarser version of the other.

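predict( glm1, newdata=newD, type='response')

Predict probabilities from the model glm1 using values from a data frame named newD. The argument type='response' is what requests probabilities; without it the predictions are on the log-odds scale. See also predict() under Linear regression models above.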
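A sketch of a logistic regression on simulated data. The three groups, their true probabilities (0.2, 0.4, 0.6) and the seed are assumptions made for illustration; exponentiating the coefficients to get odds ratios is an extra step not listed above.

```r
set.seed(2)   # assumption: fixed seed so the simulated data are reproducible

# Hypothetical data: 0/1 response in three groups of 40
d <- data.frame(group = rep(1:3, each = 40))
d$y01 <- rbinom(nrow(d), size = 1, prob = c(0.2, 0.4, 0.6)[d$group])

glm1 <- glm(y01 ~ factor(group), data = d, family = binomial)

coef(glm1)                   # log-odds scale; group 1 is the reference
exp(coef(glm1))              # intercept: odds in group 1; others: OR vs group 1
exp(confint.default(glm1))   # Wald confidence intervals on the OR scale
drop1(glm1, test = 'Chisq')  # overall test of the group variable

# Predicted probability of y01 = 1 in each group
newD <- data.frame(group = 1:3)
p <- predict(glm1, newdata = newD, type = 'response')
p
```

With a single grouping variable in the model, the predicted probabilities equal the observed proportions of 1s in each group.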

 

 

Calculating p-values by hand

 

Two-sided p-values

 

2* (1 - pnorm( x ) )

From a normal distribution with observed test statistic. x is the positive value of the test statistic. Wald-test or Z-test.

2* (1 - pt( x, df=f ) )

From a t-distribution with f degrees of freedom. x is the positive value of the test statistic.
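A worked example with the test statistic 1.96 and, for the t-distribution, 10 degrees of freedom (both values are arbitrary). The last line is an equivalent form, not mentioned above, that also works when the test statistic is negative.

```r
x <- 1.96   # arbitrary example value of the test statistic

p.norm <- 2 * (1 - pnorm(x))        # approximately 0.05
p.t    <- 2 * (1 - pt(x, df = 10))  # larger: the t-distribution has heavier tails
p.norm
p.t

# Equivalent form that works for either sign of the test statistic
2 * pnorm(-abs(x))
```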

 

Plots

 

hist( x )

Make histogram of x

boxplot( y~x )

Boxplots of y for each value of x

stripchart( y~ x)

Stripchart of y on groups defined by x. Use additional arguments:

vertical=T if vertical rather than horizontal plot,

method='jitter' to add noise in the data points (noise is not added in y, only around x).

par(las=1)

Run this command before running plot to rotate numbers on y-axis.

plot(x, y)

Plot y as a function of x

plot(x,y,type='b')

Plot y as a function of x, plot type 'b' (see my video https://www.youtube.com/watch?v=FQZhsEXCAUM)

Other arguments you will need :

col='blue', see other color choices from link on Podio, item I. pch=2 to control plot symbols, see again item I.

xlim=c(a,b). ylim=c(a,b). xlab='X label'.

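ylab='Y label'. main='Your title' to add title to the plot.

axes=F to suppress the default axes, so that they can be customized using the axis()-function next. (R is case sensitive: the arguments are ylab and axes, in lower case.)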

scatter.smooth(x,y)

Plots y as a function of x and adds a smoother to the plot (a line being a kind of a moving average). The line can be controlled with parameters given in additional argument lpars, e.g.

lpars=list( col='red', lwd=2 )

axis(1)

Add x-axis to plot. With additional argument at=c(0,6) ticks are drawn at 0 and 6 - otherwise R will make the choice on where to put the ticks.

axis(2)

Add y-axis. You may also use the at-argument.

lines(x,y)

Add lines to a plot, plotting y as a function of x. Same additional arguments as to the plot function

points(x,y)

Adds points to a plot, plotting y as a function of x. Same additional arguments as to the plot function

arrows( 3, L, 3, U, code=3, length=0 )

Draws an arrow from (3, L) to (3, U). code=3 puts arrowheads at both ends. Use length=0 to suppress the arrowheads, leaving a plain line segment.

abline(a,b)

Draws a straight line with intercept a, slope b. E.g. abline(0,1).

abline( h=0 )

Draws a horizontal line at y=0.

abline( v=0 )

Draws a vertical line at x=0.

abline( lm1 )

Adds estimated regression line from a fitted linear model (named lm1).

legend('topright', c('x','y') )

Add legend to a plot in the top right corner. The legend will contain two lines with text 'x' and 'y' respectively.

Alternative placements are 'bottomright', 'topleft', 'bottomleft'.

Add extra arguments :

pch=c(1,2) to have plot symbols of type 1 and 2 in the two lines.

lty=c(1,2) to have two line types.

Use inset=0.05 to move the legend towards the center of the plot by 5% of the plot region.
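Several of the plotting commands combined in one sketch. The data are simulated, and the seed, labels and tick positions are arbitrary choices for illustration.

```r
set.seed(5)   # assumption: fixed seed so the simulated data are reproducible

# Hypothetical data: straight line plus noise
x <- 1:10
y <- 3 + 2 * x + rnorm(10)
lm1 <- lm(y ~ x)

par(las = 1)   # horizontal numbers on the y-axis

# Points joined by lines, with the default axes suppressed
plot(x, y, type = 'b', col = 'blue', pch = 2,
     xlab = 'X label', ylab = 'Y label', main = 'Your title',
     axes = FALSE)

axis(1, at = c(1, 5, 10))   # x-axis with ticks at chosen positions
axis(2)                     # y-axis, R chooses the ticks

abline(lm1, lty = 2)        # fitted regression line, dashed

# Legend: first entry has symbol and solid line, second only a dashed line
legend('topleft', c('data', 'fitted line'),
       pch = c(2, NA), lty = c(1, 2), col = c('blue', 'black'),
       inset = 0.05)
```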