1 Introduction to R

Welcome to the first taste of BAN400. We will start by downloading and installing the tools that we need to start coding, and then we will explore some of the most basic aspects of the R programming language. After most of the videos we have included a small problem that we encourage you to try before you move on to the next topic.

Before we start, you need to install two items on your computer. Please do the installations in the following order:

  1. The R Programming Language: Navigate to cran.uib.no and download the version of R that corresponds to your operating system. Run the installation as you would for any other program that you install on your system.

  2. RStudio: Navigate to posit.co/download/rstudio-desktop/, and download the version of “RStudio Desktop” that corresponds to your operating system. Run the installation as you would for any other program that you install on your computer.

Both R and RStudio are free to download and free to use.

1.1 Tour of RStudio

In this video we open up RStudio for the first time and take a small tour of the user interface.

1.2 Calculations and Variables

We move on to write our first R commands. It is important that you start getting a feel for programming right away, and the best way to do that is to type in the code lines yourself, just as in the video above (no copy/paste!), and to check that you get the same results.

# We can use R as a calculator:
2+2

# Pretty simple! We must use parentheses if we have more complicated expressions:
(2+8)/2
2+8/2

# Variables are important in R. We can save just about anything in the
# computer's memory by giving it a name:
a <- 5
a
a*4 

b <- 3

# R performs all operations on the right hand side before assigning the value to c:
c <- a+b
c

# No errors, warnings or questions when overwriting!
c <- 4

c <- c + 2
c

# Let's make an error!
d

# We can name things more or less what we want. Choosing good names is not a
# trivial problem in large projects!
whatever_we_want <- "hello world"
whatever_we_want

Exercise:

Pick your favorite three integers and store them in three different variables. Calculate your magic number, which is the sum of these three integers. Store your magic number in a new variable. Give your new variable a name that clearly identifies what it is.

number1 <- 1
number2 <- 87
number3 <- 101

magic_number <- number1 + number2 + number3

1.3 Vectors

Vectors are very important in R. We remember perhaps from our math classes that vectors may represent points in space; in R, a vector is a way to store more than one number (or string, or some other data type) under a single variable name. When doing statistics, this may for example be a set of observations.

In this video we first create a vector of numbers using the c()-function, and then we look at various ways to extract/pick out the elements: to subset. Python coders will notice two important distinctions from what they are used to:

  • In R we start counting from 1, and not 0!
  • When trying to subset using an index that is out of the range of the vector, we do not get an error message; we just get back the missing value NA.

Furthermore, we use the Up-arrow to get back the last command that we have executed in the console. You can even tap the up-arrow again in order to go further back in your command history (and of course use the down-arrow to navigate the other way).

# We can make a vector in the following way:
vector1 <- c(3, 5, 7.8, 10, 2, 0.16, -3)

# Print out
vector1

# Subsetting (The first item has index 1!)
vector1[1]        # Square brackets to subset
vector1[10]       # Out of range: returns NA, not an error
vector1[2:5]      # Subset a sequence
vector1[c(1,3)]   # Subset using another vector!

# The letter "c" stands for "combine". R makes it very easy to work with
# vectors:
vector1 - 1
vector1*3

# We can use *functions* to calculate various things:
length(vector1)
mean(vector1)
sum(vector1)
sd(vector1)

# We can make a vector of strings as well:
vector2 <- c("hello", "world")

# A vector can only contain one data type!

# Perhaps we need the standard deviation later?
sd_vector1 <- sd(vector1)
sd_vector1
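The rule that a vector can hold only one data type can be illustrated with a small sketch. The variable names below are made up for this example; the point is that c() quietly converts mixed input to a common type instead of giving an error.

```r
# Mixing numbers and strings: everything is converted to character
mixed <- c(1, "hello", 3.5)
mixed
class(mixed)

# Mixing numbers and logical values: TRUE and FALSE become 1 and 0
numbers_and_logicals <- c(10, TRUE, FALSE)
numbers_and_logicals
class(numbers_and_logicals)
```

This is worth keeping in mind when a column of numbers unexpectedly behaves like text: a single stray string is enough to turn the whole vector into characters.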

Exercise:

Calculate the maximum and minimum values of vector1, as well as the median. (Hint, and this will be the most important lesson you will learn in this course: If you do not know the name of the function, Google it!)

# Relevant Google searches: "minimum value r", "maximum r", "median r"

min(vector1)
max(vector1)
median(vector1)

1.4 Packages

In this video we install our first package in R. There are two major takeaways from this:

  • We install the package on our computer using the install.packages()-function. We only have to do this once per computer.
  • If we are going to use some of the functions in a package we need to load it using the library()-command. You have to do that every time you restart R (and we will later see that we will typically load all the packages we need at the beginning of the scripts that we write).

# In order to install the package readxl, we run the following command.
# We run this command only once.
install.packages("readxl")

# When we are going to use it, we load it using the "library()"-function, 
# and we need to repeat this every time we restart R.
library(readxl)

Exercise: Install the following packages. We will make use of them (and several others) later in the course: ggplot2, dplyr, tidyr and lubridate.

install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("lubridate")
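Since install.packages() only needs to run once per computer, one possible pattern is to check first which packages are already present. This is a sketch rather than part of the course material: the variable names are made up, and the actual install call is left commented out so that nothing is downloaded by accident.

```r
# Packages we expect to need later in the course
needed <- c("ggplot2", "dplyr", "tidyr", "lubridate")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
is_installed

# Pick out the names of the packages that are not yet installed
missing_packages <- needed[!is_installed]

if (length(missing_packages) > 0) {
  message("Not installed yet: ", paste(missing_packages, collapse = ", "))
  # install.packages(missing_packages)   # uncomment to install them
} else {
  message("All packages are already installed.")
}
```

Note that install.packages() accepts a vector of package names, so several packages can be installed with a single call.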

1.5 Working directory

We introduce the concept of a working directory, which is the folder where R looks for files that we are going to read into the memory, and where R puts the files that we create, for instance image files of plots.

There are two central functions:

  • getwd() prints out the current working directory.

  • setwd("C:/path/to/folder") sets the working directory to the specified folder. As a general rule we will not use setwd() in our scripts (the reason for that will become clear later), but rather use RStudio’s menu system to change the working directory. In practice we will rarely need to do even that, which will also become clear in a short while.

We may however have to deal with file paths in our code, and make the following technical notes:

  • On UNIX systems (such as Mac or Linux) the file paths look different: they do not start with a drive letter such as C:\.
  • On Windows, we always use the forward slash / to separate the folders in R code, and not the usual backslash \. This may be counter-intuitive to some, but in programming the backslash usually has a special meaning (the escape character) and must not be used for anything else. On UNIX systems we also use the forward slash, but that is the system standard for writing file paths, so it does not require any special attention from users of those operating systems.
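To make the notes on file paths concrete, here is a small sketch. The folder names are placeholders, not real locations on your computer. It shows that file.path() joins the parts of a path with forward slashes, and what happens if you write a Windows-style path with backslashes anyway.

```r
# Build a path from its parts; file.path() joins them with "/"
folder <- file.path("path", "to", "folder")
folder

# Inside an R string, a backslash must itself be escaped with a backslash.
# cat() prints the string as it really is, with single backslashes:
windows_style <- "C:\\path\\to\\folder"
cat(windows_style, "\n")

# Convert the backslashes to forward slashes
forward_style <- gsub("\\", "/", windows_style, fixed = TRUE)
forward_style
```

Writing paths with forward slashes from the start avoids the escaping altogether, and the same code then works on Windows, Mac and Linux.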

Exercise: Make sure that you have completed the following tasks before proceeding to the next lesson:

  • You have created a dedicated folder on your computer where you will collect all material that we will use today.
  • You have downloaded the file testdata.xls and put it in your newly created folder.
  • You have changed your working directory to this folder.
  • You have positively confirmed that your working directory now is correctly set.
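One way to carry out the last check from the console is sketched below. It assumes that the file has been saved under exactly the name testdata.xls; the variable names are only for illustration. If file.exists() returns FALSE, either the working directory or the file name is not what you expect.

```r
# Where is R looking for files right now?
current_folder <- getwd()
current_folder

# Which files does R see in that folder?
files_here <- list.files()
files_here

# Is the data file among them?
found <- file.exists("testdata.xls")
found
```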

1.6 Reading data

We read our first small data file into the memory of R and apply some simple operations to it. We will spend much more time working with data in R in later lessons.

# The data is in the .xls-format, so we need the readxl-package in order to load 
# it into R.
library(readxl)

# Inside this package, there is a function called read_excel:
read_excel("testdata.xls")

# That's fine, but in order to use this data, we need to save it in a variable
testdata <- read_excel("testdata.xls")

# Print out the (top of the) data set.
testdata

# Now we see the data in the environment. We can look at it by typing the name that we
# gave it. We can also pick out individual columns using the $-sign:
testdata$X1

# Calculate the mean for X1 and X2:
mean(testdata$X1)
mean(testdata$X2)

# How many rows/observations do we have?
nrow(testdata)

Exercise:

  1. How many columns does our data set have?
  2. Can you find a way to print out a vector that contains the sum of the X1 and X2 columns in testdata?
  3. What is the total sum of all the numbers in the X1 and X2 columns of testdata?

# 1
ncol(testdata)

# 2 
testdata$X1 + testdata$X2

# 3
sum(testdata$X1 + testdata$X2)
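When meeting a new data set, a few more base R functions are handy for getting an overview. Since the contents of testdata.xls are not reproduced here, the sketch below uses a small made-up data frame, example_data, with the same column names; the numbers are invented for illustration only.

```r
# A made-up stand-in for testdata
example_data <- data.frame(
  X1 = c(1.2, 3.4, 5.6),
  X2 = c(0.5, 2.5, 4.5)
)

names(example_data)     # column names
dim(example_data)       # number of rows and columns
str(example_data)       # compact overview of the structure
summary(example_data)   # summary statistics for each column

# The mean of every column in one go
col_means <- colMeans(example_data)
col_means
```

The same calls should work on testdata once it has been read in with read_excel(), provided that all of its columns are numeric.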

1.7 Plotting

We introduce ggplot2, the main plotting package that we will use in this course, and its basic syntax. For comparison, we first make the same scatterplot with the plot() function that comes with R.

# A simple scatterplot
plot(testdata$X1, testdata$X2)

# Making adjustments to the plot
plot(testdata$X1, testdata$X2,
     pch = 20,
     bty = "l",
     xlab = "X1",
     ylab = "X2")

# Load the ggplot2-package
library(ggplot2)

# Here is the code for creating a simple scatterplot of the X1 and X2 columns in
# our data set:
ggplot(testdata, aes(x = X1, y = X2)) + geom_point()

# First make the plot, then save it to a file
ggplot(testdata, aes(x = X1, y = X2)) + geom_point()
ggsave("testplot.pdf")

# A more flexible way to do it is to save the plot in a variable, and then
# supply the name of the plot to the ggsave-function:
p <- ggplot(testdata, aes(x = X1, y = X2)) + geom_point()
ggsave("testplot.pdf", p)

# That way, we can save the plot p at any time, we do not have to do it directly
# after the plotting commands.

Exercise: Can you figure out how to make the dots in the plot bigger and blue?

ggplot(testdata, aes(x = X1, y = X2)) + geom_point(colour = "blue", size = 5)
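The same layering idea extends to labels and to the size of the saved file. The sketch below is one possible variation, not the course's own solution: it uses a made-up data frame in place of testdata, an invented file name, and R's temporary folder so that running it does not write anything into your project folder.

```r
library(ggplot2)

# A made-up stand-in for testdata
example_data <- data.frame(
  X1 = c(1.2, 3.4, 5.6, 7.8),
  X2 = c(0.5, 2.5, 4.5, 3.0)
)

# Each "+" adds a new layer or setting to the plot
p <- ggplot(example_data, aes(x = X1, y = X2)) +
  geom_point(colour = "blue", size = 3) +
  labs(title = "Made-up example data",
       x = "First variable",
       y = "Second variable")

# Save with an explicit width and height (in inches by default)
out_file <- file.path(tempdir(), "example_plot.pdf")
ggsave(out_file, p, width = 6, height = 4)
```

Setting width and height explicitly makes the saved figure look the same regardless of how large the plot window in RStudio happens to be.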

1.8 Writing scripts

In this video we stop writing code directly in the console, and rather write our code in a script file, which is simply a pure text file containing commands. There are two important new concepts that we have to pay attention to when writing scripts:

  • The comment character #: Everything after this character on a line in an R script is ignored when executing the script. We can use the comment character to add small comments to our code, briefly explaining what is going on. This is a great help for other people trying to understand what you have done and, perhaps most importantly, for the future you returning to a project. The comment character # is the same as in Python.
  • The keybinding Ctrl - Enter (Cmd - Enter on a Mac) executes the line where your cursor is located in the script. You can also select several lines and execute all of them using this shortcut.

# Introduction to R
# -------------------

# Load packages 
library(readxl)
library(ggplot2)

# Read our data set
testdata <- read_xls("testdata.xls")

# Make a scatterplot of the X1 and X2-variables
p <- ggplot(testdata, aes(x = X1, y = X2)) + 
    geom_point() + 
    ggtitle("Scatterplot of testdata") +
    theme_classic()
ggsave("testplot.pdf", p)

Exercise: Make sure to save your script as an .R-file in the folder that we created for this session. Close RStudio. Then navigate to the folder and double click on the script file. Hopefully RStudio opens (if not, right click, select “Open in” and then RStudio, and confirm if prompted to set RStudio as the default program for opening .R-files).

Find out what your working directory is now. What just happened? How is this useful?

Opening RStudio by double clicking on the script file automatically sets the working directory to the location of the script file. Very useful when returning to a project.

1.9 The pipe operators %>% and |>

Since version 4.1.0, R has included the pipe operator |>. A pipe operator allows you to write code that is closer to how we read English. To see how it works, consider the code below, that applies three functions to a vector:

# Create some data:
x <- (-500):500

# Want to calculate mean of sqrt of abs values of x...
# one way: via temporary variables: 
abs.x <- abs(x)  
sqrt.abs.x <- sqrt(abs.x)
alt.1 <- mean(sqrt.abs.x)

print(alt.1)
[1] 14.91415

Assuming we only care about the final result alt.1, the script above creates two unnecessary variables in memory (abs.x and sqrt.abs.x), and clutters up the environment.

We could instead nest the three function calls:

alt.2 <- mean(sqrt(abs(x)))
print(alt.2)
[1] 14.91415

Nesting the function calls doesn’t store the intermediary calculations, so we don’t clutter up the environment. However, reading what is happening on this line is more challenging than it needs to be: you have to read from right to left. If the function calls have more than one argument, you would also need to keep track of which of the right-parentheses belongs to which function.

This is when the pipe operator comes to the rescue. The pipe “passes” whatever is on the left hand side to the right hand side of the expression. As a simple example, sqrt(2) can be written with a pipe as:

2 |> sqrt()
[1] 1.414214

The script above can be read as “The number 2, then the square root”. Pipes can also be used with multiple arguments. The two statements below give the same answer:

mean(c(1,2,NA), na.rm=TRUE)
[1] 1.5
c(1,2,NA) |> mean(na.rm=TRUE)
[1] 1.5

By default, the pipe operator inserts the value of the left hand side as the first argument in the function call on the right hand side, so e.g. x |> mean(na.rm=TRUE) is equivalent to mean(x, na.rm=TRUE). If you want the value to be inserted as another argument, an underscore _ may be used as a placeholder for the left hand side value (available since R 4.2.0). However, the argument that receives the placeholder must then be named:

atan2(x=1, y=2)
[1] 1.107149
2 |> atan2(x=1, y=_)
[1] 1.107149

The Tidyverse-package magrittr also contains a pipe operator, %>%. The Tidyverse-pipe behaves similarly to |> in simple cases, but it uses a period . as its placeholder. The Tidyverse-pipe is generally more mature and has more features than the base-pipe |> (see here for a more thorough comparison of the two).

In this course we’ll generally use the Tidyverse-pipe. The shortcut for the pipe in RStudio is Ctrl - Shift - M on Windows, and Cmd - Shift - M on a Mac.

We will use this technique extensively throughout the course!

library(magrittr)
2 %>% . ^ 2
[1] 4
2 %>% atan2(1, .)
[1] 0.4636476
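For comparison, the two magrittr examples above could be written with the base-pipe roughly as follows. This is a sketch that assumes R 4.2.0 or newer, since it uses the _ placeholder, and the variable names are invented. The base-pipe cannot pipe into a bare expression such as . ^ 2, so an anonymous function is used instead.

```r
# Base-pipe version of 2 %>% . ^ 2, using an anonymous function
squared <- 2 |> (\(v) v^2)()
squared

# Base-pipe version of 2 %>% atan2(1, .): the placeholder argument must be named
angle <- 2 |> atan2(y = 1, x = _)
angle
```

The extra syntax needed here is one example of what is meant by the Tidyverse-pipe having more features than the base-pipe.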

Exercise:

Use magrittr to calculate the equivalent of alt.1 and alt.2

alt.3 <-
  x %>% 
  abs %>%
  sqrt %>%
  mean

alt.1
alt.2
alt.3

Alternatively, with the base-pipe. Note that |> is pickier than %>% about parentheses: each function name must be followed by ().

x |>
  abs() |>
  sqrt() |>
  mean() 
[1] 14.91415