MCB 585: Multidisciplinary/Quantitative Approaches to Solving Biological Problems: Basic Operation

Key Points

Getting started with R and RStudio
  • R is the programming language; RStudio is a user-friendly environment for interacting with R.

  • Using RStudio can make programming in R much more productive.

  • Consider what working directory you are in when sourcing a script and loading data.

  • Functions are the basic tool used to manipulate data and other objects in R.

  • Function calls take the form: output <- function_name(argument1, argument2, ...).

  • Arguments can be passed to functions by matching based on name, by position, or by omitting them (in which case the default value is used).

Getting started with R and RStudio -- Additional Detail
  • Use # to add comments to programs.

  • Annotate your code!

  • Name and style code consistently.

  • Break code into small, discrete pieces.

  • Factor out common operations rather than repeating them.

  • Keep all of the source files for a project in one directory and use relative paths to access them.

  • Keep track of session information in your project folder.

  • Start each program with a description of what it does.

  • Load all required packages at the beginning of a script.

  • Use #---- comments to mark off sections of code.

  • When working with large datasets, removing variables that you no longer need and releasing memory can improve performance.

  • Writing functions for sets of commands that you will be running repeatedly will make your data processing more efficient.

  • Use install.packages() to install new packages and library() to load and use installed packages in your current workspace.

  • Use help() and ? to get basic information about functions.

  • Online forums have the answers to almost any R-related question you can come up with. Know how to Google your question!

  • The University of Arizona offers several resources to assist with both biostatistics and programming.

Basic Data Types and Data Structures in R
  • The most commonly encountered data types in R are character, numeric, and logical.

  • R’s basic data structures are vectors and matrices.

  • Objects may have attributes, such as name, dimension, and class.

  • Use object[x] and object[x, y] to select a single element from a 1- and 2-dimensional data structure, respectively.

  • Use from:to to specify a sequence that includes the indices from from to to.
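
A minimal sketch of these indexing and sequence rules (the values below are arbitrary examples):

v <- c(10, 20, 30, 40, 50)    # a numeric vector
m <- matrix(1:6, nrow = 2)    # a 2 x 3 matrix
v[2]                          # single element of a 1-dimensional structure: 20
m[1, 3]                       # single element of a 2-dimensional structure: 5
v[2:4]                        # the sequence 2:4 selects indices 2 through 4: 20 30 40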

R Data Types -- In-Depth
  • R’s basic data types are character, numeric, integer, complex, and logical.

  • R’s data structures include the vector, list, matrix, data frame, and factors. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g. lists, data frames).

  • Factors are used to represent categorical data.

  • Factors can be ordered or unordered.

  • Some R functions have special methods for handling factors.

  • The function dim gives the dimensions of a data structure.
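
An illustrative sketch of factors and dim (the genotype and dose labels are hypothetical examples):

genotype <- factor(c("WT", "mutant", "WT", "mutant"))    # unordered factor
levels(genotype)                                         # "mutant" "WT"
dose <- factor(c("low", "medium", "high"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)                           # ordered factor
dim(matrix(1:6, nrow = 2))                               # 2 3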

Data Frames, Basic Indexing, Reading/Writing Data
  • All the indexing and subsetting that works on matrices also works on data frames.

  • Use object[x, y] to select a single element from a data frame.

  • Each column of a data frame can be directly addressed by specifying the column name using the $ operator (e.g. mydataframe$column1).

  • Data in data structures can be accessed by specifying the appropriate index, by logical vector, or by column/row name (if defined).

  • Import data from a .csv or .txt file using the read.table(...), read.csv(...), and read.delim(...) functions.

  • Write data to a new .csv or .txt file using the write.table(...) and write.csv(...) functions.
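
A minimal sketch of reading, indexing, and writing a data frame ("sample-data.csv" and "column1" are hypothetical names; substitute your own):

dat <- read.csv("sample-data.csv")        # import a .csv file
dat[1, 2]                                 # single element: row 1, column 2
dat$column1                               # a whole column, addressed by name
dat[dat$column1 > 0, ]                    # rows selected with a logical vector
write.csv(dat, file = "sample-data-copy.csv", row.names = FALSE)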

Lists and Advanced Indexing
  • Lists are a standard data structure in R in which each element can contain any other R object.

  • Lists can contain elements of different classes, unlike vectors.

  • Data frames are a specific type of list in which all elements are vectors of the same length. Each vector can contain data of different classes.

  • Use object[[x]] to select a single element from a list.

  • Each element of a list can be assigned a name that can be addressed using the $ operator (e.g. mylist$element1).

  • Different indexing methods can be combined to efficiently extract desired data subsets for further analysis.
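
A short sketch of list construction and indexing (the element names are arbitrary examples):

mylist <- list(element1 = 1:5,
               element2 = "some text",
               element3 = matrix(1:4, nrow = 2))   # elements of different classes
mylist[[1]]             # the first element itself (a numeric vector)
mylist[1]               # a one-element list containing that vector
mylist$element2         # an element addressed by name
mylist$element1[2:3]    # combining indexing methods: values 2 and 3 of element1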

Manipulating and Plotting Data
  • Use mean, max, min, and sd to calculate simple statistics.

  • Use plot to create simple visualizations.

  • Display simple graphs.

  • Save a plot in a pdf file using pdf(...) and stop writing to the pdf file with dev.off().
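
A minimal sketch of these commands, using simulated values ("my-plot.pdf" is a hypothetical file name):

x <- rnorm(100)                   # simulated data for illustration
mean(x); max(x); min(x); sd(x)    # simple statistics
plot(x, type = "l")               # display a simple graph

pdf("my-plot.pdf")                # start writing plots to a pdf file
plot(x, type = "l")
dev.off()                         # stop writing to the pdf file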

Advanced Data Manipulation and Plotting
  • Use apply(), lapply(), and sapply() to calculate statistics across the rows or columns of a data frame.

  • There are more advanced tools for complex data manipulation, including the dplyr and data.table packages.

  • Use aggregate() to calculate statistics based on the structure of a dataset.
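
A sketch of apply() and aggregate(), assuming a hypothetical data frame dat with numeric columns value1 and value2 and a grouping column group:

apply(dat[, c("value1", "value2")], MARGIN = 2, FUN = mean)   # column means
sapply(dat[, c("value1", "value2")], sd)                      # column standard deviations
aggregate(value1 ~ group, data = dat, FUN = mean)             # group-wise means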

Decision Making and Loops
  • Use if (condition) {take action} to start a conditional statement, else if (condition) {take alternative action} to provide additional tests, and else {take alternative action} to provide a default alternative.

  • Use == to test for equality.

  • X && Y is only true if both X and Y are true.

  • X || Y is true if either X, Y, or both, are true.

  • Use for (variable in collection) {take repeated action} to process the elements of a collection one at a time.

  • Use regular expressions (regex) to perform pattern searches in textual data.
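
The if/else and for constructs are shown in the Control Flow section below; as a sketch of regular expression searches, using hypothetical sample names:

samples <- c("ctrl_01", "ctrl_02", "treat_01", "treat_02")
grepl("^ctrl", samples)     # TRUE TRUE FALSE FALSE: names starting with "ctrl"
grep("_02$", samples)       # 2 4: indices of names ending in "_02"
sub("_", "-", samples)      # replace the first "_" in each name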

Decision Making and Loops -- Additional Detail
  • There are a number of ways to use logical operators, relational operators, and various functions to ask complex and specific questions in the form of if statements.

  • Two common ways to index the loop variable in for loops are to use the form for(variable in collection) to pull the elements (“variable”) from a vector (“collection”) one at a time, or to use the form for(i in 1:length(collection)) or for(i in seq_along(collection)) to instead pull each index (i.e. location) in a vector one at a time.

  • Use functions such as apply instead of for loops to conduct repeated operations on values contained within defined subsets of a data structure.

  • Use list.files(path = "path", pattern = "pattern", full.names = TRUE) to create a list of files whose names match a pattern.
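
A minimal sketch of index-based looping and list.files() (the "data" directory and the file pattern are placeholders for your own project layout):

collection <- c(5, 10, 15)
for (i in seq_along(collection)) {
	print(collection[i] * 2)    # work with the index rather than the value itself
}

files <- list.files(path = "data", pattern = "\\.csv$", full.names = TRUE)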

Distributions and Normality
  • Samples are sets of observations drawn from a population.

  • The population distribution describes the characteristics of the observed phenotype in the population of interest.

  • The sampling distribution describes the characteristics of all possible samples of a given size.

  • Use hist(), density(), dnorm(), qqnorm(), and qqline() to visually assess whether a sample is normally distributed.
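
A minimal sketch of these visual checks, using a simulated sample:

x <- rnorm(50, mean = 10, sd = 2)                         # simulated sample
hist(x, freq = FALSE)                                     # histogram on a density scale
lines(density(x))                                         # smoothed density estimate
curve(dnorm(x, mean = 10, sd = 2), add = TRUE, lty = 2)   # theoretical normal curve
qqnorm(x); qqline(x)                                      # quantile-quantile plot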

Distributions and Normality -- Additional Detail
  • Use shapiro.test() to perform a Shapiro-Wilk test of normality. However, the test's power depends on sample size: a non-significant result may reflect an insufficient sample size rather than genuine normality.

  • Simple manipulations such as taking the logarithm or the square root of a sample can reversibly transform the observations into a more normal distribution.
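
A short sketch of both points, using a simulated right-skewed sample:

x <- rlnorm(50)         # log-normal, so clearly non-normal
shapiro.test(x)         # small P-value suggests departure from normality
shapiro.test(log(x))    # the log-transformed sample is much closer to normal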

Hypothesis Testing
  • Understand basic model development and testing.

  • All statistical tests make assumptions about your sample and population. Understanding these assumptions is critical to running a valid test.
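
As one concrete illustration (the two groups below are simulated, and the t-test's normality and independence assumptions are taken as given):

control   <- rnorm(20, mean = 10, sd = 2)
treatment <- rnorm(20, mean = 12, sd = 2)
t.test(control, treatment)    # two-sample t-test comparing the group means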

Multiple Test Correction
  • Running multiple comparisons increases your chance of making a Type I error.

  • Different multiple test correction strategies correct for different types of errors (Type I vs. Type II) using different strategies.

  • The basic outcome of a multiple test correction is to lower the P-value threshold (α) below which you reject the null hypothesis.
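
A minimal sketch using p.adjust() with hypothetical raw P-values:

p <- c(0.001, 0.012, 0.030, 0.048, 0.210)
p.adjust(p, method = "bonferroni")    # controls the family-wise Type I error rate
p.adjust(p, method = "BH")            # Benjamini-Hochberg false discovery rate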

Survival Analysis
  • Time-to-event data combine an event observation (event vs. no event) with an observation time for each subject.

  • Time-to-event data analysis violates several assumptions made by standard tests, including normality and independence of observations.

  • Censoring complicates analysis and is not handled by standard statistical tests. Instead, we use the Log-Rank test for basic survival comparisons.

  • Unlike the survival function, age-specific mortality at any given time does not depend on previous observations.
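
A minimal sketch of a Kaplan-Meier fit and Log-Rank test using the survival package (the built-in lung dataset stands in for course data):

library(survival)
fit <- survfit(Surv(time, status) ~ sex, data = lung)    # Kaplan-Meier survival curves
plot(fit)                                                # survival curves by group
survdiff(Surv(time, status) ~ sex, data = lung)          # Log-Rank test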

Advanced Survival Analysis
  • The survfit() function creates a complex object in R containing life table information. These data can be extracted using the strata variable (see the sketch below).

  • Once extracted, the life table data can be used to calculate and plot age-specific mortality (don’t forget to use log scale on your y-axis!).
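
A sketch of extracting life table information from a survfit object and plotting a crude age-specific mortality estimate (the lung dataset and the hazard calculation below are illustrative; the exact approach used in class may differ):

library(survival)
fit <- survfit(Surv(time, status) ~ sex, data = lung)
group  <- rep(names(fit$strata), fit$strata)     # stratum label for each time point
hazard <- fit$n.event / fit$n.risk               # deaths per individual at risk
keep   <- group == "sex=1" & hazard > 0          # one stratum; drop zeros before logging
plot(fit$time[keep], hazard[keep], log = "y",
     xlab = "Time", ylab = "Age-specific mortality")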

Power Analysis
  • The goal of experimental design is to minimize hypothesis testing errors.

  • The primary tool for improving statistical power is sample size.

  • R has basic tools for power analysis in the pwr package.
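
A minimal sketch using the pwr package (the effect size and power values are example inputs):

library(pwr)    # install.packages("pwr") if it is not already installed
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8,
           type = "two.sample")    # returns the required sample size per group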

Simulating Experiments
  • Experiment simulations can be used to understand the behavior of an idealized system (i.e. a system lacking noise from sources like subjective observation and measurement error).

  • Power analyses can be conducted using pilot data and (potentially) removing assumptions about distribution.

  • In essence, an experiment simulation is used to build a ‘real’ sampling distribution from which to draw statistical parameters like significance and power.
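
A minimal sketch of a simulation-based power estimate (the group means, standard deviations, and sample sizes are assumed pilot values):

set.seed(42)
n.sim <- 1000
p.values <- replicate(n.sim, {
	control   <- rnorm(20, mean = 10, sd = 2)
	treatment <- rnorm(20, mean = 12, sd = 2)
	t.test(control, treatment)$p.value
})
mean(p.values < 0.05)    # fraction of simulated experiments reaching significance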

Final Projects
  • MCB 585 individual final projects.

Basic Operation

List objects in the current environment: ls()

Remove an object from the current environment: rm(x)

Remove all objects from the current environment: rm(list = ls())

Control Flow

if(x > 0){
	print("value is positive")
} else if (x < 0){
	print("value is negative")
} else{
	print("value is neither positive nor negative")
}
for (i in 1:5) {
	print(i)
}

This will print:

1
2
3
4
5

Functions

is_positive <- function(integer_value){
	if(integer_value > 0){
	   TRUE
	} else {
	   FALSE
	}
}

In R, the value of the last evaluated expression in a function is automatically returned.

increment_me <- function(value_to_increment, value_to_increment_by = 1){
	value_to_increment + value_to_increment_by
}

increment_me(4) will return 5

increment_me(4, 6) will return 10

apply(dat, MARGIN = 2, mean) will return the average (mean) of each column in dat

Packages
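
install.packages() downloads and installs a package from CRAN; library() loads an installed package into the current workspace (ggplot2 is just an example package name):

install.packages("ggplot2")    # run once per machine
library(ggplot2)               # run in every script or session that uses the package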

Glossary

argument
A value given to a function or program when it runs. The term is often used interchangeably (and inconsistently) with parameter.
call stack
A data structure inside a running program that keeps track of active function calls. Each call’s variables are stored in a stack frame; a new stack frame is put on top of the stack for each call, and discarded when the call is finished.
comma-separated values (CSV)
A common textual representation for tables in which the values in each row are separated by commas.
comment
A remark in a program that is intended to help human readers understand what is going on, but is ignored by the computer. Comments in Python, R, and the Unix shell start with a # character and run to the end of the line; comments in SQL start with --, and other languages have other conventions.
conditional statement
A statement in a program that might or might not be executed depending on whether a test is true or false.
dimensions (of an array)
An array’s extent, represented as a vector. For example, an array with 5 rows and 3 columns has dimensions (5,3).
documentation
Human-language text written to explain what software does, how it works, or how to use it.
encapsulation
The practice of hiding something’s implementation details so that the rest of a program can worry about what it does rather than how it does it.
for loop
A loop that is executed once for each value in some kind of set, list, or range. See also: while loop.
function body
The statements that are executed inside a function.
function call
A use of a function in another piece of software.
function composition
The immediate application of one function to the result of another, such as f(g(x)).
index
A subscript that specifies the location of a single value in a collection, such as a single pixel in an image.
loop variable
The variable that keeps track of the progress of the loop.
notional machine
An abstraction of a computer used to think about what it can and will do.
parameter
A variable named in the function’s declaration that is used to hold a value passed into the call. The term is often used interchangeably (and inconsistently) with argument.
pipe
A connection from the output of one program to the input of another. When two or more programs are connected in this way, they are called a “pipeline”.
return statement
A statement that causes a function to stop executing and return a value to its caller immediately.
silent failure
Failing without producing any warning messages. Silent failures are hard to detect and debug.
slice
A regular subsequence of a larger sequence, such as the first five elements or every second element.
stack frame
A data structure that provides storage for a function’s local variables. Each time a function is called, a new stack frame is created and put on the top of the call stack. When the function returns, the stack frame is discarded.
standard input (stdin)
A process’s default input stream. In interactive command-line applications, it is typically connected to the keyboard; in a pipe, it receives data from the standard output of the preceding process.
standard output (stdout)
A process’s default output stream. In interactive command-line applications, data sent to standard output is displayed on the screen; in a pipe, it is passed to the standard input of the next process.
string
Short for “character string”, a sequence of zero or more characters.
while loop
A loop that keeps executing as long as some condition is true. See also: for loop.