Getting started with R and RStudio -- Additional Detail
Overview
Class Date: 8/27/2024 -- On Your Own
Teaching: 90 min
Exercises: 30 min
Questions
How can I write R code that other people can understand and use?
How do I manage the variables, files, and memory usage in my workspace?
How do I write my own functions?
How do I install and load packages to access application-specific tools?
Where can I get help?
Objectives
Employ best practices for documenting code so that others can follow your work.
Identify and segregate distinct components in your code using # or #-.
Understand the importance of requirements and dependencies for your code.
Learn how to remove variables from the workspace.
Learn how to free up memory for projects using large data files.
Know the basic syntax for writing and executing your own functions.
Know how to install and load packages.
Know what resources are available if you get stuck online and at the University of Arizona.
On Your Own
Best practices when writing code in R (and in general)
What if I handed you this and asked you to explain it:
newX <- aperm(X, c(s.call, s.ans))
dim(newX) <- c(prod(d.call), d2)
ans <- vector("list", d2)
if (length(d.call) < 2L) {
    if (length(dn.call))
        dimnames(newX) <- c(dn.call, list(NULL))
    for (i in 1L:d2) {
        tmp <- forceAndCall(1, FUN, newX[, i], ...)
        if (!is.null(tmp))
            ans[[i]] <- tmp
    }
}
else for (i in 1L:d2) {
    tmp <- forceAndCall(1, FUN, array(newX[, i], d.call,
        dn.call), ...)
    if (!is.null(tmp))
        ans[[i]] <- tmp
}
ans.list <- is.recursive(ans[[1L]])
l.ans <- length(ans[[1L]])
ans.names <- names(ans[[1L]])
Got it? Yeah, me neither. For the most part, if you give raw code to someone else, this is what it will look like to them. With time and a lot of messing around, they could probably figure it out, but there is a better way.
Over time, the coding community has developed a set of best practices to follow when writing code to ease the confusion and frustration when someone else tries to read and interpret that code. This isn’t just for other people. In a year, that other person trying to interpret your code may be future you.
Before we start, here is an exercise to illustrate the point with even a simple example:
Best coding practices
Here is what I would consider bad and good practice in coding:
Bad:
# assignvalues, then average
x = 21
value2 = 26
bob = (x+value2)/2
bob
Good:
# This script will average the body weights of two mice.

# First enter the body weights of two mice (in grams!)
bw.mouse1 = 21
bw.mouse2 = 26

# Next calculate the average body weight of the two mice
bw.average = (bw.mouse1 + bw.mouse2) / 2

# Output the average body weight
bw.average
Compare the “Good” and “Bad” examples. What aspects of the “Good” code help you understand the script?
Solution
Here are some things that I came up with:
- Comments are clear and describe the goals of each step of the code.
- The header comment summarizes the overall goal of the entire script.
- Comments are frequent and explain specific steps.
- Variable names are descriptive.
- Variable name formats are consistent.
- Steps are visually separated by blank lines to add structure.
- Spaces are added to equations to add structure.
Here are a few habits to develop that will improve your relationship with other coders and your future selves:
Annotation – Headers with key meta-data about the script
In the first lesson we briefly mentioned the primary tool for keeping your code organized: the # comment operator. Annotate your code thoroughly. Use # to add text to your code that will be ignored by the computer, but helpful to another person trying to figure out just what you did.
The goal is to allow someone who has never looked at your script before to understand what you are trying to accomplish. Just putting something down is not enough. Include enough detail to make them understand.
In addition to adding comments to describe individual actions, adding a detailed header that outlines the purpose and usage of a script is very helpful to frame your analysis. Starting your code with an annotated description of what the code does will also help you when you revisit or change the script in the future. Just a line or two at the beginning of the file can save a lot of time and effort when trying to understand what a particular script does.
# This is code to replicate the analyses and figures from my 2014 Science
# paper. Code developed by Sarah Supp, Tracy Teal, and Jon Borelli. Last
# updated 8/22/2019.
Actually, even this is a bit vague.
- Which Science paper? Can you provide a link or PubMed ID?
- Which analysis in that paper?
- What is the purpose of the analysis?
- Which figure(s) are being reproduced?
Any detail that you can add will help you down the line.
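One way to make the header more useful is to spell these details out explicitly. Here is a hypothetical, more complete version (the figure number, file names, and placeholders below are invented for illustration):
# Purpose: reproduce Figure 2 from Supp et al. (2014), Science.
#          Link / PubMed ID: <add link here>
# Inputs:  data/raw_counts.csv (one row per site and year)
# Outputs: figures/figure2.pdf
# Authors: Sarah Supp, Tracy Teal, Jon Borelli
# Last updated: 8/22/2019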
Be explicit about the requirements and dependencies of your code
Many scripts will require tools that are not available in the base R language and will require one or more add-on packages. There are hundreds of packages available for numerous specialized applications, and we will be using several in this course. Installing and loading these packages is outlined below.
Loading any necessary packages up front (using library()) is a nice way of indicating which packages are necessary to run your code. It can be frustrating to make it two-thirds of the way through a long-running script only to find out that a critical dependency hasn’t been installed. It is good practice to load all required packages early in your script using a section like this (this is just an example; you don’t need to run this, and in fact, it won’t work if you don’t have the packages installed):
# Load required packages
library(ggplot2)
library(reshape)
library(vegan)
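If you want to be extra careful, you can also check for and install any missing packages before loading them. Here is a minimal sketch of that pattern (the package names are just the examples above):
# identify required packages that are not yet installed
pkgs_needed <- c("ggplot2", "reshape", "vegan")
pkgs_missing <- pkgs_needed[!pkgs_needed %in% rownames(installed.packages())]

# install anything that is missing, then load everything
if (length(pkgs_missing) > 0) {
  install.packages(pkgs_missing)
}
for (pkg in pkgs_needed) {
  library(pkg, character.only = TRUE)
}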
Another way you can be explicit about the requirements of your code and improve its reproducibility is to limit the “hard-coding” of the input and output files for your script. If your code will read in data from a file, define a variable early in your code that stores the path to that file:
### Note: this code is not meant to be run, just to provide an example
### of soft-coding a set of fictional file names.
# define input and output file names
input_file <- "data/data.csv"
output_file <- "data/results.csv"
# read input
input_data <- read.csv(input_file)
# get number of samples in data
sample_number <- nrow(input_data)
# generate results
results <- some_other_function(input_file, sample_number)
# write results
write.table(results, output_file)
This is preferable to repeatedly writing out the input file paths (and way easier to update if you move or rename one of your files):
### Note: this code is not meant to be run, just to provide an example
### of hard-coding a set of fictional file names.
# read input
input_data <- read.csv("data/data.csv")
# get number of samples in data
sample_number <- nrow(input_data)
# generate results
results <- some_other_function("data/data.csv", sample_number)
# write results
write.table("data/results.csv", output_file)
Naming Conventions
Historically, R programmers have used a variety of conventions for naming variables. The . character can be a valid part of a variable name in R; thus weight.kg <- 57.5 is a legitimate variable name. This is often confusing to R newcomers who have programmed in languages where . has a more significant meaning. Today, most R programmers:
- Start variable names with lowercase letters
- Separate words in variable names with underscores
- Use only lowercase letters, underscores, and numbers in variable names
The book R Packages includes a chapter on this and other style considerations.
How closely you follow these standards is up to you. As a good rule of thumb, choose a convention for naming R objects that is descriptive, that you can remember, and that you will stick to using.
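For example, here is a quick sketch contrasting names that follow these conventions with names that are legal but harder to read (the variables are hypothetical):
# preferred: lowercase words separated by underscores, descriptive
mouse_weight_g <- 21
treatment_group <- "control"

# legal in R, but discouraged under the convention above
mouse.weight.g <- 21   # dots can confuse programmers from other languages
MouseWeightG <- 21     # mixed case is inconsistent with the convention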
Identify and segregate distinct components in your code
It’s easy to annotate and mark your code using # or #- to set off sections of your code and to make finding specific parts of your code easier. For example, it’s often helpful when writing code to separate the function definitions. If you create only one or a few custom functions in your script, put them toward the top of your code. If you have written many functions, put them all in their own .R file and then source those files. source() will define all of these functions so that your code can make use of them as needed.
source("my_genius_fxns.R")
Other ideas
- Keep your code in bite-sized chunks. If a single function or loop gets too long, consider looking for ways to break it into smaller pieces (e.g., write functions to carry out chunks of code all in one line; break complex analyses into multiple loops with intermediate results saved in between).
- Don’t repeat yourself: automate! If you are repeating the same code over and over, use a loop or a function to repeat that code for you. Needless repetition doesn’t just waste time, it also increases the likelihood you’ll make a costly mistake.
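For instance, here is a minimal sketch (the file names are hypothetical) that replaces three copy-pasted read.csv() calls with a single lapply():
# load several similarly formatted files with one call instead of
# repeating read.csv() for each file by hand
file_names <- c("files/dataset1.csv", "files/dataset2.csv", "files/dataset3.csv")
datasets <- lapply(file_names, read.csv, header = TRUE)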
- Keep all of your source files for a project in the same directory, then use relative paths as necessary to access them. For example, to load data use:
# set working directory
setwd("C:/Users/Karthik/Documents/sannic-project")
# load datasets 1 and 2 from the files subfolder
dat1 <- read.csv(file = "files/dataset1.csv", header = TRUE)
dat2 <- read.csv(file = "files/dataset2.csv", header = TRUE)
rather than:
dat1 <- read.csv(file = "C:/Users/Karthik/Documents/sannic-project/files/dataset1.csv", header = TRUE)
dat2 <- read.csv(file = "C:/Users/Karthik/Documents/sannic-project/files/dataset2.csv", header = TRUE)
- R can run into memory issues. It is a common problem to run out of memory after running R scripts for a long time. To inspect the objects in your current R environment, you can list the objects, search current packages, and remove objects that are currently not in use. A good practice when running long lines of computationally intensive code is to remove temporary objects after they have served their purpose. However, sometimes R will not clean up unused memory for a while after you delete objects. You can force R to tidy up its memory using gc() (discussed in more detail later).
# Generate sample dataset of 1000 rows
interim_object <- data.frame(rep(1:100, 10),
rep(101:200, 10),
rep(201:300, 10))
# Report the memory size allocated to the object
object.size(interim_object)
# Removes the object itself, but not necessarily the memory allotted to it
rm("interim_object")
# Force R to release memory it is no longer using
gc()
- Don’t save a session history (the default option in R, when it asks if you want an .RData file). Instead, start in a clean environment so that older objects don’t remain in your environment any longer than they need to. If that happens, it can lead to unexpected results. Your script should explicitly load any data and create any variables that it needs, rather than relying on a variable that you assigned in the console panel sometime last week.
- Wherever possible, keep track of sessionInfo() somewhere in your project folder. Session information is invaluable because it captures all of the packages used in the current project. If a newer version of a package changes the way a function behaves, you can always go back and re-install the version that worked. (Note: at least on CRAN, all older versions of packages are permanently archived.)
# Take a look at the information provided by sessionInfo():
sessionInfo()
# Here is how to save:
current.session <- sessionInfo()
session.file <- paste0("./Project X R Session Info for ", Sys.Date(), ".Rdata")
save(current.session, file=session.file)
rm(current.session)
# ... and to load old files to look at them
load(file=session.file)
current.session
- Do not ever modify your original raw data files. Doing so risks overwriting or unintentionally modifying the values of your starting data. Instead, always save modified data under a new name. You can even set critical raw data files to Read Only in your operating system to prevent accidental overwriting.
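You can also do this from within R itself. Base R’s Sys.chmod() sets file permissions (a minimal sketch; the file name is hypothetical, and on Windows only the read-only flag is honored):
# set a raw data file to read-only (mode "0444" = read permission for everyone)
Sys.chmod("data/raw_measurements.csv", mode = "0444")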
- Collaborate. Grab a buddy and practice “code review”. Review is used for preparing experiments and manuscripts; why not use it for code as well? Code can easily be a major scientific achievement and the product of lots of hard work!
- Develop your code using version control and frequent updates! You can find lessons about version control on software-carpentry.org/lessons.
Cleaning up
Keeping a tidy workspace can be just as critical when coding as when conducting experiments at the bench. Learning to code takes a lot of trial and error, particularly in the beginning. A well-designed script should be self-contained, meaning that you should be able to clear out all past variables, re-run the script from scratch, and arrive at the same outcome. Periodically clearing your workspace of old variables prevents errors that occur when you accidentally use the same variable name in two different places, or accidentally overwrite a critical variable while testing a new function. There are a variety of tools available to help keep your workspace tidy.
You can use the ls() command to list all variables currently defined in your workspace.
# first define a couple of variables
x <- 1
y <- 100
# now look at what we have defined
ls()
[1] "args" "dest_md" "missing_pkgs" "required_pkgs"
[5] "src_rmd" "x" "y"
Try clicking on the brush icon in the Environment panel and clicking Yes in the pop-up box. What happened to your variables? What happens if you try to display the variable in the console?
x
Error in eval(expr, envir, enclos): object 'x' not found
y
Error in eval(expr, envir, enclos): object 'y' not found
R throws an error message when you ask it to return the values of x or y because you just removed both variables from memory; R no longer knows they exist. You can also remove variables one at a time.
# again, define a few test variables and check your workspace
x = 3
y = x + 2
z1 = x - y
z2 = x + y
ls()
[1] "args" "dest_md" "missing_pkgs" "required_pkgs"
[5] "src_rmd" "x" "y" "z1"
[9] "z2"
# remove x and y one at a time, then remove all remaining variables and
# observe what happens to your workspace
rm(x)
ls()
[1] "args" "dest_md" "missing_pkgs" "required_pkgs"
[5] "src_rmd" "y" "z1" "z2"
rm(y)
ls()
[1] "args" "dest_md" "missing_pkgs" "required_pkgs"
[5] "src_rmd" "z1" "z2"
# remove all variables currently in environment (this won't work on the website, but try it in RStudio)
rm(list = ls())
ls()
Often when you get into more complex analyses in R, you end up with a lot of data stored in memory. In some cases, clearing a variable with rm() does not immediately release all of the memory that R has allocated for that variable. You can force R to release that memory using the gc() function, which stands for “Garbage Collection”.
# Calling gc() is simple and forces R to release memory to the operating system
gc()
Don’t get bogged down in the details, but we can check memory usage before creating, after creating, and after removing a large vector. The memory usage at each stage is indicated by “Vcells” (the V stands for “Vector”). Removing the object and running gc() frees up some of the memory. Note that gc() also gives you a memory usage report.
# check memory before creating vector
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 619330 33.1 1324998 70.8 1324998 70.8
Vcells 1140471 8.8 8388608 64.0 1901433 14.6
# create a big vector (this can take some time) and recheck memory with
# vector loaded; Vcells used (MB) goes up
x <- integer(100000); for(i in 1:100000) x <- c(x, i)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 619421 33.1 1324998 70.8 1324998 70.8
Vcells 1240837 9.5 8388608 64.0 8388562 64.0
# remove vector and check memory a final time; Vcells used (MB) returns to
# its original value
rm(x)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 619392 33.1 1324998 70.8 1324998 70.8
Vcells 1140674 8.8 8388608 64.0 8388562 64.0
You may find that this speeds up processing when you have a script that is repeatedly loading and releasing large data files.
It is generally good practice to start off each session with a clean slate. When you start RStudio, the Environment will be clear (unless you load a previously saved workspace). If you are finished with one analysis and starting another, it is a good idea to clear your Environment. That way, if you reuse a variable name (e.g., x), you won’t accidentally end up using the value stored in your workspace during the previous analysis.
Writing your own functions
A good way to consolidate code that you intend to use repeatedly is to build that code into a new function. Instead of repeating the entire set of code over and over, you can instead just call the function with a few commands.
Let’s define a new function FtoC that converts a temperature value in Fahrenheit (the argument) to Celsius (the output):
FtoC <- function(temp_F) {
temp_C <- (temp_F - 32) * (5 / 9)
return(temp_C)
}
Now let’s give it a try:
FtoC(32) # output to console
[1] 0
Temp_C <- FtoC(32) # assign the output to a new variable called Temp_C
Temp_C
[1] 0
Functions can be just a few short lines (as above), or span thousands of lines of code employing many other functions.
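As one more sketch (FtoTemp is a hypothetical function, not part of the lesson), arguments can also have default values, which makes a function flexible without complicating the common case:
# convert Fahrenheit to Celsius by default, or to Kelvin on request
FtoTemp <- function(temp_F, scale = "C") {
  temp_C <- (temp_F - 32) * (5 / 9)
  if (scale == "K") {
    return(temp_C + 273.15)  # Kelvin is Celsius shifted by 273.15
  }
  return(temp_C)
}

FtoTemp(32)        # uses the default scale; returns 0
FtoTemp(32, "K")   # returns 273.15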
Installing and loading packages
R comes preloaded with a decently wide range of functions for basic data manipulation, but you will inevitably want to do something more interesting. There are many, many ‘packages’ that are available online for download. Each package is a set of related functions designed for a specific area of analysis. For example, later in the course we will use specialized packages that contain functions for power analysis (the pwr package) and survival analysis (the survival package).
To use a package, you have to do two things:
- Install – Download the package from an external source and make the functions accessible to R and RStudio. This is accomplished with the install.packages() function and only has to be done once per computer.
- Load – Installing the package does not immediately allow you to access the associated functions. You first have to load the package into your local workspace. This is done with the library() function.
Let’s give it a try by installing and loading the pwr package:
# Install required packages
install.packages("pwr")
Installing package into 'C:/Users/sutph/AppData/Local/R/win-library/4.4'
(as 'lib' is unspecified)
package 'pwr' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\sutph\AppData\Local\Temp\RtmpwjFJOc\downloaded_packages
# Load required packages into memory
library("pwr")
Generally, a Google search will let you find packages for just about any type of analysis. These usually come from one of two sources (which R mostly takes care of automatically, with a few exceptions):
- CRAN for basic R packages
- Bioconductor for biological packages
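For Bioconductor packages specifically, installation usually goes through the BiocManager helper package rather than install.packages() directly. A minimal sketch (“limma” is just an example Bioconductor package):
# install the BiocManager helper from CRAN once
install.packages("BiocManager")

# then use it to install Bioconductor packages
BiocManager::install("limma")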
In situations where installing a package is more complicated than simply calling the install.packages() function from within RStudio, the installation instructions are generally listed near the top of the homepage or documentation associated with that package.
Resources – where to turn for help
There are a variety of resources available when you have questions. Here are a few places to look:
Getting help in R
R has a built-in function called help() that provides basic information about specific functions. You can either use this function like any other function, or use the ? operator. Let’s give it a try to get more detailed information on how the setwd() function works.
help(setwd)
?setwd
When you enter either command, notice that the Help panel opens (lower right pane in RStudio). This panel provides information on the purpose, inputs, and outputs of the queried function. It also provides useful examples of how to use the function at the end of the documentation. help() is usually a good first place to look to get a feel for what a function is doing.
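If you don’t know the exact name of the function you need, help.search() (or its shorthand ??) searches the documentation of your installed packages for a keyword:
# search installed documentation for a topic
help.search("standard deviation")
??"standard deviation"   # shorthand for help.search()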
Online Forums
Perhaps the most valuable resource available to both basic and advanced R users is the vast online community. Many people are actively using the R language for a variety of analysis tasks. You will find that most questions have already been asked and answered on one of the various online forums. If you can’t find the answer, sign up for a forum and post your question. You will generally get an answer (or a pointer to another forum where the question has already been answered) within a day or two.
To find answers, the simplest way is to just type what you are trying to do into Google. Preceding your question with “R” will tend to find questions within the R community:
R calculate standard deviation
There are many online forums with R user discussions, but I personally find the most common and useful to be Stack Overflow.
The Carpentries
The Carpentries, and Software Carpentry in particular, offer free online introductory programming lessons for R and other languages. Much of the early part of this course is based on the Software Carpentry material. They are a good resource for those just delving into coding (or even just R) for the first time.
Resources at the University of Arizona
The University of Arizona has several resources for both statistics and programming:
- Statistics resources
- Statistics consulting (pay service, free consultation)
- BIO5 StatLab (pay service, free consultation)
- Cancer Center Biostatistics & Bioinformatics
Key Points
Use # to add comments to programs. Annotate your code!
Name and style code consistently.
Break code into small, discrete pieces.
Factor out common operations rather than repeating them.
Keep all of the source files for a project in one directory and use relative paths to access them.
Keep track of session information in your project folder.
Start each program with a description of what it does.
Load all required packages at the beginning of a script.
Use #---- comments to mark off sections of code.
When working with large datasets, removing variables that you no longer need and releasing memory can improve performance.
Writing functions for sets of commands that you will be running repeatedly will make your data processing more efficient.
Use install.packages() to install new packages and library() to load and use installed packages in your current workspace.
Use help() and ? to get basic information about functions.
Online forums have the answers to almost any R-related question you can come up with. Know how to Google your question!
The University of Arizona offers several resources to assist with both biostatistics and programming.