Decision Making and Loops
Overview
Class Date: 9/17/2024 -- In Class
Teaching: 90 min
Exercises: 30 min
Questions
How do I write code to make decisions about data?
How do I use the same code to treat different data sets in different ways?
How can I perform the same operations on multiple data sets or across multiple subsets of a single dataset?
Objectives
- Write conditional statements with if(), else, and ifelse().
- Correctly evaluate expressions containing && (“and”) and || (“or”).
- Use a for loop to repeat an analysis across different data subsets or to process multiple files.
- Use list.files() to get a list of filenames that match a simple pattern.
- Understand the basic purpose of regular expressions and where to go to learn more.
- Explain the basic process underlying what a for loop does.
- Correctly write for loops to repeat simple calculations.
In Class
Our previous lessons have shown us how to read and manipulate data, define our own functions, calculate basic statistics, and generate basic charts. However, the programs we have written so far always do the same things, regardless of what data they’re given. We want programs to make choices based on the values they are manipulating.
Conditionals
In order to have the output of a script depend on specific attributes of the input data, we need to write code that automatically decides between multiple options. The tool R gives us for doing this is called a conditional statement, and looks like this:
num <- 37
if (num > 100) {
  print("greater")
} else {
  print("not greater")
}
print("done")
[1] "not greater"
[1] "done"
The second line uses an if statement to tell R that we want to make a choice. If the test following if in parentheses () is TRUE, the body of the if (the line(s) of code contained in the curly braces {}) is executed. If the test is FALSE, the body of the else is executed instead. Only one or the other is ever executed (never both).
In the example above, the test num > 100 returns the value FALSE:
num > 100
[1] FALSE
which is why the code inside the if block was skipped and the code inside the else block was executed instead.
Conditional statements don’t have to include an else. If there isn’t one, R simply does nothing if the test is FALSE:
if (num > 100) {
  print("num is greater than 100")
}
We can chain several tests together when there are more than two options. Let’s use this to write a function that returns the sign of a number:
what.sign <- function(num) {
  if (num > 0) {
    return(1)
  } else if (num == 0) {
    return(0)
  } else {
    return(-1)
  }
}
what.sign(-3)
[1] -1
what.sign(0)
[1] 0
what.sign(2/3)
[1] 1
Note that when combining else and if in an else if statement, the if portion still requires its own condition in parentheses to specify the test. A lone else never takes a condition: its body is executed only if none of the preceding conditions was satisfied.
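As a further sketch of chained tests, the hypothetical function classify_speed() below sorts a single speed value into one of three categories. The function name and the cutoffs 30 and 40 are illustrative assumptions, not part of the lesson data.

```r
# Hypothetical helper: classify one speed value using chained tests.
# The cutoffs 30 and 40 are arbitrary choices for illustration.
classify_speed <- function(speed) {
  if (speed > 40) {
    label <- "fast"
  } else if (speed >= 30) {
    # only reached when the first test was FALSE
    label <- "moderate"
  } else {
    # only reached when every test above was FALSE
    label <- "slow"
  }
  return(label)
}

classify_speed(45)
classify_speed(32)
classify_speed(25)
```

The tests are checked from top to bottom, and only the body belonging to the first TRUE test runs, so the order of the tests matters.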
We can combine tests using logical operators. Two ampersands, &&, symbolizing “and”, or two vertical bars, ||, symbolizing “or”, join two separate tests into one. Use && to check whether both tests are true:
if (1 > 0 && -1 > 0) {
  print("both parts are true")
} else {
  print("at least one part is not true")
}
[1] "at least one part is not true"
or if either test is true (||):
if (1 > 0 || -1 > 0) {
  print("at least one part is true")
} else {
  print("neither part is true")
}
[1] "at least one part is true"
In this case, “either” means “either or both”, not “either one or the other but not both”. If you want the latter, use the exclusive or function, xor().
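The short sketch below, using made-up logical values, contrasts || with xor(). It also shows the single-character operators & and |, which compare vectors element by element rather than producing one TRUE or FALSE.

```r
a <- TRUE
b <- TRUE

a || b         # inclusive "or": TRUE because at least one is TRUE
xor(a, b)      # exclusive "or": FALSE because both are TRUE
xor(a, FALSE)  # TRUE because exactly one is TRUE

# & and | work element by element on whole vectors
x <- c(1, 5, 10)
in_range <- x > 2 & x < 8
in_range
```

Because if() expects a single TRUE or FALSE, && and || are the usual choice inside an if test, while & and | are the ones to reach for when testing every element of a vector.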
R includes the function ifelse() to make if/else statements more efficient in certain circumstances.
?ifelse
The first argument is the test, the second is the value to return where the test is TRUE, and the third is the value to return where the test is FALSE. Because ifelse() is vectorized, it gives us a useful way to apply the same if/else choice to each element of an input vector. Let’s use our carSpeeds data set to repeat the replacement of "Blue" with "Green" in the car color variable, as we did in an earlier lesson, but this time using ifelse():
carSpeeds <- read.csv(file = 'data/car-speeds.csv')
head(carSpeeds)
Color Speed State
1 Blue 32 NewMexico
2 Red 45 Arizona
3 Blue 35 Colorado
4 White 34 Arizona
5 Red 25 Arizona
6 Blue 41 Arizona
# replace all "Blue" entries in the Color column with "Green"
carSpeeds$Color <- ifelse(carSpeeds$Color == 'Blue', 'Green', carSpeeds$Color)
head(carSpeeds)
Color Speed State
1 Green 32 NewMexico
2 Red 45 Arizona
3 Green 35 Colorado
4 White 34 Arizona
5 Red 25 Arizona
6 Green 41 Arizona
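To see the vectorized behavior in isolation, here is a minimal sketch on a small made-up vector. The values and the cutoff of 40 are assumptions for illustration and are not read from car-speeds.csv.

```r
# made-up speeds, typed in by hand for illustration
speeds <- c(32, 45, 35, 34, 25, 41)

# one test per element: "fast" where the test is TRUE, "slow" elsewhere
speed_class <- ifelse(speeds > 40, "fast", "slow")
speed_class
```

The result has the same length as the input, with one answer per element, which is why ifelse() fits naturally when filling or replacing a data frame column.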
Choosing Plots Based on Data
Write a function plot_dist() that plots a boxplot if the length of the vector is greater than a specified threshold (e.g. the vector contains more than 10 data points) and a stripchart otherwise. To do this you’ll use the R functions boxplot() and stripchart().
For instance, your function should have the following behavior when presented with subsets of the data from the inflammation-01.csv data set:
dat <- read.csv("data/inflammation-01.csv", header = FALSE)
plot_dist(dat[, 10], threshold = 10) # day (column) 10
plot_dist(dat[1:5, 10], threshold = 10) # samples (rows) 1-5 on day (column) 10
Solution
plot_dist <- function(x, threshold) {
  if (length(x) > threshold) {
    boxplot(x)
  } else {
    stripchart(x, vertical = TRUE)
  }
}
Repeating tasks with for loops
Suppose we want to print each word in a sentence. One way is to use six print statements:
best_practice <- c("Let", "the", "computer", "do", "the", "work")
print_words <- function(sentence) {
  print(sentence[1])
  print(sentence[2])
  print(sentence[3])
  print(sentence[4])
  print(sentence[5])
  print(sentence[6])
}
print_words(best_practice)
[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] "work"
… but that’s a bad approach for two reasons:
- It doesn’t scale: if we want to print the elements in a vector that’s hundreds long, we’d be better off just typing them in.
- It is not robust: if we give it a longer vector, it only prints part of the data, and if we give it a shorter vector, it prints NA values because we’re asking for elements that don’t exist.
print_words(c(best_practice,"please"))
[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] "work"
print_words(best_practice[-6])
[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] NA
A better approach is to use a for loop:
print_words <- function(sentence) {
  for (word in sentence) {
    print(word)
  }
}
print_words(best_practice)
[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] "work"
This version takes fewer lines of code, especially compared with a version of print_words that prints every word in a hundred-word vector, and it is more robust as well:
print_words(best_practice[-6])
[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
The improved version of print_words uses a for loop to repeat an operation (in this case, the print() function) once for each thing in a collection (i.e. each element in the best_practice character vector).
The general form of a for loop is:
for (variable in collection) {
  do things with variable
}
We can name the loop variable anything we like (with the usual restrictions on variable names). The in is a formal part of the for syntax and is required.
Note that the condition (variable in collection) is enclosed in parentheses, and the body of the loop is enclosed in curly braces { }, similar to if statements. For a single-line loop body, as here, the braces aren’t actually needed:
for(i in 1:10) print(i)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
… but it is a best practice to include them as we did, because it avoids ambiguity.
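A common variation, sketched here as an illustration, loops over the positions of a vector rather than its values, which is useful when the position itself is needed. seq_along() is a base R function that returns the integers from 1 up to the length of its argument; the variable name numbered is an assumption made for this example.

```r
best_practice <- c("Let", "the", "computer", "do", "the", "work")

# set up an empty character vector with one slot per word
numbered <- character(length(best_practice))

# i takes the values 1, 2, ..., 6 in turn
for (i in seq_along(best_practice)) {
  numbered[i] <- paste(i, best_practice[i])
}
numbered
```

Looping over positions lets the body both read from one vector and write to the matching slot of another.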
Here’s another loop that repeatedly updates a variable in order to count the number of entries in a vector, carrying out the same task as the length() function:
# define the vector of interest
vowels <- c("a", "e", "i", "o", "u")
# initialize a variable to store the current value for length during
# each iteration of the loop
len <- 0
# execute the for loop
for (v in vowels) {
  len <- len + 1
}
# Display the calculated vector length
len
[1] 5
It’s worth tracing the execution of this little program step by step:
- Since there are five elements in the vector vowels, the statement inside the loop will be executed five times.
- The first time around, len is zero (the value assigned to it before the loop) and v is "a".
- The statement adds 1 to the old value of len, producing 1, and updates len to refer to that new value.
- The next time around, v is "e" and len is 1, so len is updated to be 2.
- After three more updates, len is 5; since there are no remaining elements in the vector vowels, the loop finishes.
Note that a loop variable is just a variable that’s being used to record progress in a loop. Even though we don’t explicitly use the value of v in the code within the braces, it is still defined and updated during each loop. The value of v will also be retained after the loop is over, and we can re-use variables previously defined as loop variables as well:
# start with "letter" defined
letter <- "z"
# execute a for loop that uses "letter" as the loop variable
for (letter in c("a", "b", "c")) {
print(letter)
}
[1] "a"
[1] "b"
[1] "c"
# after the loop, letter is:
letter
[1] "c"
Note that length() is much faster than any R function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven’t seen yet, so we should always use it when we can. Whenever possible, use available functions before creating your own.
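The same accumulate-in-a-loop pattern works for other running totals. As an illustration with made-up numbers, the sketch below adds up a vector by hand and checks the result against the built-in sum(), which is the better choice in practice.

```r
# made-up values for illustration
values <- c(3, 8, 1, 6)

# start the running total at zero, then add each element in turn
total <- 0
for (v in values) {
  total <- total + v
}
total

# the built-in function gives the same answer in one call
total == sum(values)
```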
Exercises
Printing Numbers
Write a function that prints the first N natural numbers, one per line:
print_N(3)
[1] 1
[1] 2
[1] 3
Solution
print_N <- function(N) {
  nseq <- seq(N)
  for (num in nseq) {
    print(num)
  }
}
Using for loops to process multiple files
In many real-world projects, you will want to process a series of files that contain data in the same format using the same set of analysis steps. A for loop is a useful tool for this purpose.
In the data folder we have a series of files containing information about inflammation. These files give data on patients treated with a new drug for arthritis. Each file contains a series of patients (in rows) with a series of inflammation measurements on subsequent days (in columns). Each file contains information from the same set of patients on different rounds of treatment. We looked at the first file in this series previously. Let’s read in the file to recall the data format:
inflam1 <- read.csv("data/inflammation-01.csv", header = FALSE)
head(inflam1)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
1 0 0 1 3 1 2 4 7 8 3 3 3 10 5 7 4 7 7 12 18 6
2 0 1 2 1 2 1 3 2 2 6 10 11 5 9 4 4 7 16 8 6 18
3 0 1 1 3 3 2 6 2 5 9 5 7 4 5 4 15 5 11 9 10 19
4 0 0 2 0 4 2 2 1 6 7 10 7 9 13 8 8 15 10 10 7 17
5 0 1 1 3 3 1 3 5 2 4 4 7 6 5 3 10 8 10 6 17 9
6 0 0 1 2 2 4 2 1 6 4 7 6 6 9 9 15 4 16 18 12 12
V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40
1 13 11 11 7 7 4 6 8 8 4 4 5 7 3 4 2 3 0 0
2 4 12 5 12 7 11 5 11 3 3 5 4 4 5 5 1 1 0 1
3 14 12 17 7 12 11 7 4 2 10 5 4 2 2 3 2 2 1 1
4 4 4 7 6 15 6 4 9 11 3 5 6 3 3 4 2 3 2 1
5 14 9 7 13 9 12 6 7 7 9 6 3 2 2 4 2 0 1 1
6 5 18 9 5 3 10 3 12 7 8 4 7 3 5 4 4 3 2 1
To begin, let’s calculate and plot the minimum, maximum, and mean inflammation across patients for each day:
# use apply to calculate min, max, and mean inflammation for each day and save them to variables
avg_day_inflam <- apply(inflam1, 2, mean)
max_day_inflam <- apply(inflam1, 2, max)
min_day_inflam <- apply(inflam1, 2, min)
# ------------------------------------------
# Plot min, max, and mean on the same chart
# (using different line colors)
# First we need to grab the overall maximum
# and minimum values in our data set to be
# sure we make the plot window big enough
y_min <- min(inflam1)
y_max <- max(inflam1)
# first initiate the plot with "min" data
plot(min_day_inflam, # plot minimum first
type = "l", col = "blue", # plot the minimum as blue lines
xlab = "Day", ylab = "Inflammation", # axis labels
ylim = c(y_min, y_max)) # define plot limits
# use the lines() function to add max and mean to the current plot
lines(avg_day_inflam,
col = "black")
lines(max_day_inflam,
col = "red")
# add a legend
legend(x = "topright", # location on chart
lty = 1, # this tells R to make lines
legend = c("Maximum", "Mean", "Minimum"), # define/order the labels
col = c("red","black","blue")) # define/order the colors
The chart looks good, but something is strange with the data. The maximum measurement across patients appears to increase exactly linearly up to day 20, then decline exactly linearly thereafter. The minimum value looks to stair-step up to day 20, then stair-step down thereafter.
Does this happen with all the files? Let’s write a script to run the above analysis on all files and save the resulting report to a PDF file. For this, we can use a for loop to carry out the following steps:
- Identify a list of files with a similar naming pattern.
- Initiate a PDF file for capturing the output plots.
- Set up a for loop to process each file in the list.
- For each file, complete the following steps:
  a. Calculate minimum, mean, and maximum across patients for each day.
  b. Generate a plot of minimum, mean, and maximum inflammation across the time course for each file.
- Finalize the PDF file with dev.off().
The first point requires that we have some way to recognize patterns in file names. This can be accomplished with the list.files() function, which we looked at briefly in an earlier lesson.
?list.files
We can use the pattern argument to specify a search pattern using regular expressions (aka regex), which are a systematic language for searching strings of text. For example, we can use the pattern argument to look for all .csv files:
list.files(path = "data", pattern = ".csv")
[1] "car-speeds-cleaned.csv" "car-speeds-corrected-9999.csv"
[3] "car-speeds-corrected-na.csv" "car-speeds-corrected.csv"
[5] "car-speeds.csv" "combined-inflammation.csv"
[7] "inflammation-01.csv" "inflammation-02.csv"
[9] "inflammation-03.csv" "inflammation-04.csv"
[11] "inflammation-05.csv" "inflammation-06.csv"
[13] "inflammation-07.csv" "inflammation-08.csv"
[15] "inflammation-09.csv" "inflammation-10.csv"
[17] "inflammation-11.csv" "inflammation-12.csv"
[19] "Nadeau2_table.csv" "sample-gendercorrected.csv"
[21] "sample-noquotes.csv" "sample-processed.csv"
[23] "sample.csv" "small-01.csv"
[25] "small-02.csv" "small-03.csv"
Since we have a bunch of .csv files in this folder, this is not specific enough. How about file names containing the word “inflammation”?
list.files(path = "data", pattern = "inflammation")
[1] "combined-inflammation.csv" "inflammation-01.csv"
[3] "inflammation-02.csv" "inflammation-03.csv"
[5] "inflammation-04.csv" "inflammation-05.csv"
[7] "inflammation-06.csv" "inflammation-07.csv"
[9] "inflammation-08.csv" "inflammation-09.csv"
[11] "inflammation-10.csv" "inflammation-11.csv"
[13] "inflammation-12.csv"
Almost, but not quite there. If we want to be more specific, the search string gets a bit more complicated. If we want to extract all examples from a list that start with “inflammation-” and end in “.csv” (but contain anything else in between), we can use the following regex terms:
- ^ indicates the beginning of a string.
- .* is a wild card, specifying that any sequence of characters, of any length, can occur in this place.
- $ indicates the end of a string.
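These terms can be tried out without touching the file system. The sketch below applies them with the base R function grepl() to a hand-typed vector of names; "inflammation-notes.txt" is a hypothetical name added only to show a non-match. One detail goes beyond the three terms above: a bare . in a regex matches any single character, so the sketch writes \\. to insist on a literal period before csv.

```r
# hand-typed names for illustration; the .txt entry is hypothetical
file_names <- c("inflammation-01.csv", "combined-inflammation.csv",
                "inflammation-02.csv", "inflammation-notes.txt")

# TRUE where a name starts with "inflammation-" and ends in ".csv"
matches <- grepl(pattern = "^inflammation-.*\\.csv$", x = file_names)
matches

# keep only the matching names
file_names[matches]
```

The second name fails because of ^ (it does not start with “inflammation-”), and the fourth fails because of $ (it does not end in “.csv”).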
Regular Expressions (regex)
The use of regex can be very powerful in allowing scripts to identify specific characteristics in textual data.
Regular expressions are a whole separate language unto themselves and there are many resources online for learning to use and test regex. Here are a couple that I have used:
- RegexOne – regular expression tutorials
- regexr – an online tool that lets you test out a regular expression against a text string.
- RexEgg regex cheat sheet – a regex cheat sheet that I have found useful (and some other resources as well).
We will also add the full.names argument to tell list.files() that we want it to return the full path to each file name, not just the file name itself:
# "\\." matches a literal period (a bare "." would match any character)
inflam.files <- list.files(path = "data",
                           pattern = "^inflammation-.*\\.csv$",
                           full.names = TRUE)
inflam.files
[1] "data/inflammation-01.csv" "data/inflammation-02.csv"
[3] "data/inflammation-03.csv" "data/inflammation-04.csv"
[5] "data/inflammation-05.csv" "data/inflammation-06.csv"
[7] "data/inflammation-07.csv" "data/inflammation-08.csv"
[9] "data/inflammation-09.csv" "data/inflammation-10.csv"
[11] "data/inflammation-11.csv" "data/inflammation-12.csv"
Now we have a way to pull just the “inflammation” file names out automatically and assign them to a variable, which we can use later. Now that we have the last piece of the puzzle, we can build our loop. The output will be saved to a PDF file with a page graphing the maximum, mean, and minimum inflammation value across patients over time for each file.
# First, grab our file list using pattern matching
inflam.files <- list.files(path = "data",
                           pattern = "^inflammation-.*\\.csv$",  # "\\." is a literal period
                           full.names = TRUE)
# Initiate the PDF file to store the graphs
pdf(file = "results/inflammation-by-file.pdf",
height = 5, width = 5)
# Start the for loop to cycle through the files
for (file.c in inflam.files) {
  # read in the current file
  inflam.c <- read.csv(file = file.c, header = FALSE)
  # calculate min, mean, and max values by day
  avg_day_inflam <- apply(inflam.c, 2, mean)
  max_day_inflam <- apply(inflam.c, 2, max)
  min_day_inflam <- apply(inflam.c, 2, min)
  # Plot min, max, and mean on the same chart for this file
  # look up overall max and min values in the current file to set plot size
  y_min <- min(inflam.c)
  y_max <- max(inflam.c)
  # first initiate the plot with "min" data
  plot(min_day_inflam,                      # plot minimum first
       type = "l", col = "blue",            # plot the minimum in blue
       xlab = "Day", ylab = "Inflammation", # axis labels
       ylim = c(y_min, y_max))              # define plot limits
  # use the lines() function to add max and mean to the current plot
  lines(avg_day_inflam,                     # plot mean second
        col = "black")                      # draw mean in black
  lines(max_day_inflam,                     # plot maximum third
        col = "red")                        # draw maximum in red
  # add a legend
  legend(x = "topright",
         lty = 1,                                   # use lines of style 1 (solid)
         legend = c("Maximum", "Mean", "Minimum"),  # order the labels
         col = c("red", "black", "blue"))           # define colors in label order
}
# finalize the PDF file by turning off the graphics device
dev.off()
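A loop over files can also collect one number per file instead of drawing a plot. The sketch below is an illustration under stated assumptions: it writes two tiny made-up files (demo-01.csv and demo-02.csv, hypothetical names) to a temporary folder so that it runs anywhere, then stores the overall maximum of each file in a named vector called file_max.

```r
# create two tiny made-up files in a temporary folder (illustration only)
tmp_dir <- file.path(tempdir(), "loop-demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.table(matrix(1:6, nrow = 2), file.path(tmp_dir, "demo-01.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(matrix(11:16, nrow = 2), file.path(tmp_dir, "demo-02.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# find the demo files by pattern, keeping the full path to each
demo.files <- list.files(path = tmp_dir, pattern = "^demo-.*\\.csv$",
                         full.names = TRUE)

# one slot per file, named by file name, filled in as the loop runs
file_max <- numeric(length(demo.files))
names(file_max) <- basename(demo.files)

for (file.c in demo.files) {
  dat.c <- read.csv(file = file.c, header = FALSE)
  file_max[basename(file.c)] <- max(dat.c)
}
file_max
```

Creating the results vector before the loop and filling it by name keeps each value tied to the file it came from. With the real data, the same structure would apply to inflam.files in place of demo.files.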
Key Points
- Use if (condition) {take action} to start a conditional statement, else if (condition) {take alternative action} to provide additional tests, and else {take alternative action} to provide a default alternative.
- Use == to test for equality.
- X && Y is only true if both X and Y are true.
- X || Y is true if either X, Y, or both, are true.
- Use for (variable in collection) {take repeated action} to process the elements of a collection one at a time.
- Use regular expressions (regex) to perform pattern searches in textual data.