Advanced Data Manipulation and Plotting

Overview

Class Date: 9/12/2024 -- On Your Own
Teaching: 90 min
Exercises: 30 min

Questions

How do I rapidly calculate statistics based on relevant variables in a dataset?

How to I begin to build more complex graphics?

What tools are available for generating publication quality graphics in R?

Objectives

Use pre-build functions like rowSums() to perform specific operations on data frame elements.

Use the apply() function to rapidly apply general functions across rows or columns of a data frame.

Use the aggregate() function to rapidly segregate dependent variables of interest by one or mor indpendent variables and calculate useful descriptive statistics.

Locate and employ tools in R base graphics to generate custom graphics.

On Your Own

Advanced Data Manipulation

In Class we covered a few of the basic functions for calculating descriptive statistics. Here we will examine a few techniques for using these functions more efficiently across data sets.

The `apply()` function

To start, let’s reload the aneurisms dataset that we were using In Class, and specifically the version with the gender capitalization corrected:

dat <- read.csv(file="data/sample-gendercorrected.csv", header=TRUE, stringsAsFactors=FALSE)

In this dataset, we have aneurism counts in patient eyes at four time points. What if our phenotype of interest is the maximum number of aneurisms across quarters for each patient, or perhaps the average across quarters? How do we calculate these values efficiently for each patient. We want to perform the repeat the same operation across a margin of the data frame (in this case a each row, but it can similarly apply to columns):

Operations Across Margins

In particular, we want to perform the functions max() or mean() on the last 4 columns of every row. To support this, we can use the apply() function.

?apply

apply() allows us to repeat a function on each of the rows (MARGIN = 1) or columns (MARGIN = 2) of a data frame. To obtain the maximum or average inflammation of each patient we will need to calculate the max or mean of all of the rows (MARGIN = 1) of the relevant subset of the dat data frame. Let’s start with mean:

avg_aneurisms_subject <- apply(dat[,6:9], 1, mean)
avg_aneurisms_subject

  [1] 173.25 213.25 186.50 198.00 189.50 285.00 158.00 246.00 196.25 177.25
 [11] 189.75 224.25 188.00 156.25 187.00 169.75 178.75 213.25 168.75 170.00
 [21] 180.00 133.50 230.25 225.00 187.50 165.25 246.50 187.25 126.25 174.50
 [31] 169.75 173.00 207.00 188.25 172.50 164.50 236.75 252.25 167.00 242.50
 [41] 191.25 156.50 179.75 180.50 148.50 260.25 184.50 177.50 187.50 193.00
 [51] 168.25 158.00 189.50 236.25 184.50 188.00 188.75 230.00 166.75 123.25
 [61] 201.50 194.50 195.50 163.75 256.25 185.50 232.00 164.50 224.25 136.75
 [71] 215.00 182.50 188.50 165.25 241.25 230.75 191.25 196.50 191.00 192.50
 [81] 227.00 153.00 182.00 237.00 185.25 210.00 153.50 199.25 159.50 152.00
 [91] 236.75 256.00 183.50 166.25 148.25 168.50 233.00 152.25 230.75 143.00

Each element in the avg_aneurisms_subject vector is the mean of the four values in the Aneurisms_q1, Aneurisms_q2, Aneurisms_q3, and Aneurisms_q4 columns for the corresponding row in the dat data frame. Note that the length of this vector is the same as the number of rows in the dat data frame:

length(avg_aneurisms_subject)

[1] 100

dim(dat)[1]

[1] 100

A convenient way to associated this data with the corresponding patient is to directly add the calculated data to the data frame in a new column instead of assigning it to a separate vector:

dat$Aneurisms_avg <- apply(dat[,6:9], 1, mean)

By changing to MARGIN = 2 (calculate across columns), we ask a different question, namely what is the average number of aneurisms across patients for each quarter:

avg_aneurisms_quarter <- apply(dat[,6:9], 2, mean)
avg_aneurisms_quarter

Aneurisms_q1 Aneurisms_q2 Aneurisms_q3 Aneurisms_q4 
      158.84       168.03       219.85       217.94 

Maximum aneurisms

Determine the maximum and total number of aneurisms for each patient and add these values to the data frame. Save the results to a new file called “sample-processed.csv”.

Solution

First calculate the relevant descriptive statistics and add them directly to the dat data frame using the apply function.

dat$Aneurisms_max <- apply(dat[,6:9], 1, max)
dat$Aneurisms_total <- apply(dat[,6:9], 1, sum)

Check the output:

head(dat)

      ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
Sub001      M    Control           132 16.0          114          140
Sub002      M Treatment2           139 17.2          148          209
Sub003      M Treatment2           130 19.5          196          251
Sub004      F Treatment1           105 15.7          199          140
Sub005      M Treatment1           125 19.9          188          120
Sub006      M Treatment2           112 14.3          260          266
  Aneurisms_q3 Aneurisms_q4 Aneurisms_avg Aneurisms_max Aneurisms_total
        202          237        173.25           237             693
        248          248        213.25           248             853
        122          177        186.50           251             746
        233          220        198.00           233             792
        222          228        189.50           228             758
        320          294        285.00           320            1140

And save the final file using write.csv():

write.csv(dat, file = "data/sample-processed.csv", row.names = F)

Argument Order in Functions

Notice that we did not specify the argument names in apply(). Looking at the help file, the argument order is X (indicating the input data), then MARGIN, then FUN (for function to apply). So long as you enter arguments in the requested order, you do not need to be explicitly about which argument you are referring too. apply() just assumes that the first argument entered is the value for X, the second is the > value for MARGIN, and the third is the value for FUN. You can mix up the order, but then you have to be explicit about which argument you are defining.

For example, this will return an error:
avg_aneurisms_subject <- apply(mean, 1, dat[,6:9])
while this will be interpreted identically to the earlier function call:
avg_aneurisms_subject <- apply(FUN = mean, MARGIN = 1, X = dat[,6:9])

Efficient Alternatives

Some common operations have more efficient alternatives. For example, you can calculate the row-wise or column-wise means with rowMeans() and colMeans(), respectively. Similarly for rowSums() and colSums().

Aggregating data

Looking at summary statistics of the entire data set is great, but usually we are more interested in comparing these statistics across groups. R has a function called aggregate() that lets you apply a function (or more than one function) to a dependent variable (or more than one dependent variable) with the data separated by one or more independent variables.

?aggregate

As with the boxplot() function, aggregate() uses formula notation. To get a feel for how the function operates, let’s calculate the mean() of our newly calculated average aneurisms variable, broken down by treatment group. aggregate() requires three arguments: (1) the formula that defines how the data should be broken down, (2) the data to be used, and (3) the function that you would like to perform:

mean_by_group <- aggregate(Aneurisms_avg ~ Group, data = dat, FUN = mean)
mean_by_group

       Group Aneurisms_avg
  Control      170.1083
Treatment1      175.9857
Treatment2      224.3929

Great! aggregate() selected the subset of Aneurisms_avg values for each treatment group and returned the mean. What if we also want to see the breakdown by Gender?

mean_by_group_gender <- aggregate(Aneurisms_avg ~ Group + Gender, data = dat, FUN = mean)
mean_by_group_gender

       Group Gender Aneurisms_avg
  Control      F      172.8214
Treatment1      F      182.6667
Treatment2      F      229.5385
  Control      M      167.7344
Treatment1      M      172.5000
Treatment2      M      221.3523

So far so good. Now, what if we wean to calculate more than just the mean? Maybe standard deviation to examin variability as well? This makes things a bit more complicated, but aggregate() is up to the task. First, aggregate() can technically only take one “input” function. However, we can get around this by defining our own input function() on the fly that includes multiple sub-functions combined using the c() function:

avg_aneurisms_summary <- aggregate(Aneurisms_avg ~ Group + Gender, data = dat, 
    FUN = function(x) c(mean(x), sd(x))) 
avg_aneurisms_summary

       Group Gender Aneurisms_avg.1 Aneurisms_avg.2
  Control      F       172.82143        22.85519
Treatment1      F       182.66667        13.72056
Treatment2      F       229.53846        20.05717
  Control      M       167.73438        19.01523
Treatment1      M       172.50000        17.57581
Treatment2      M       221.35227        29.26873

Now, this new object avg_aneurisms_summary looks like a standard data frame on the surface, and even claims to be one:

class(avg_aneurisms_summary)

[1] "data.frame"

However, if you actually look at the structure there is something odd going on:

str(avg_aneurisms_summary)

'data.frame':	6 obs. of  3 variables:
 $ Group        : chr  "Control" "Treatment1" "Treatment2" "Control" ...
 $ Gender       : chr  "F" "F" "F" "M" ...
 $ Aneurisms_avg: num [1:6, 1:2] 173 183 230 168 172 ...

Instead of creating separate columns for the calculated mean() and sd() values, aggregate() made those values columns of a numeric matrix and assigned that matrix to a single “column” of the data frame:

# This doesn't actually contain any information:
avg_aneurisms_summary$Aneurisms_avg.1

NULL

# While this contaims a matrix (rather than a vector)
avg_aneurisms_summary$Aneurisms_avg

         [,1]     [,2]
[1,] 172.8214 22.85519
[2,] 182.6667 13.72056
[3,] 229.5385 20.05717
[4,] 167.7344 19.01523
[5,] 172.5000 17.57581
[6,] 221.3523 29.26873

To fix this odd formating, we can use the function do.call() to convert the “matrix-within-a-data-frame” into a normal data frame, and then fix the column names to make them meaningful. We can actually just do this all in one step from the beginning:

# first run the aggregate function with do.call
avg_aneurisms_summary <- do.call(data.frame, 
                     aggregate(Aneurisms_avg ~ Group + Gender, data = dat,                      FUN = function(x) c(mean(x), sd(x)))) 

# check the baseline names of the output (not helpful!)
names(avg_aneurisms_summary)

[1] "Group"           "Gender"          "Aneurisms_avg.1" "Aneurisms_avg.2"

# correct the names of the newly created variables to reflect the functions that 
# were used to generate them
names(avg_aneurisms_summary)[c(3,4)] <- c("Aneurisms_avg_mean", "Aneurisms_avg_sd")

# Now check out the structure. This looks more like what we might have first expected
avg_aneurisms_summary

       Group Gender Aneurisms_avg_mean Aneurisms_avg_sd
  Control      F           172.8214         22.85519
Treatment1      F           182.6667         13.72056
Treatment2      F           229.5385         20.05717
  Control      M           167.7344         19.01523
Treatment1      M           172.5000         17.57581
Treatment2      M           221.3523         29.26873

str(avg_aneurisms_summary)

'data.frame':	6 obs. of  4 variables:
 $ Group             : chr  "Control" "Treatment1" "Treatment2" "Control" ...
 $ Gender            : chr  "F" "F" "F" "M" ...
 $ Aneurisms_avg_mean: num  173 183 230 168 172 ...
 $ Aneurisms_avg_sd  : num  22.9 13.7 20.1 19 17.6 ...

Finally, we will often want to apply the same set of functions to multiple dependent variables. To tie this all together, let’s calculate the same mean() and sd() for both Aneurisms_avg and Aneurisms_total. Do do this, we need to add an additional dependent variable to our formula notation, and bind them together using the cbind() function:

# First calculate the statistics using aggregate() and convert the output to a
# standard data frame using do.call()
aneurisms_summary <- do.call(data.frame, 
                             aggregate(cbind(Aneurisms_avg, Aneurisms_total) ~ Group + Gender, 
                                       data = dat, 
                                       FUN = function(x) c(mean(x), sd(x)))) 

# don't forget to correct the column names to something more meaningfule 
names(aneurisms_summary)[3:6] <- c("Aneurisms_avg_mean", "Aneurisms_avg_sd",
                                          "Aneurisms_total_mean", "Aneurisms_total_sd")

# ... and take a look at the results
aneurisms_summary

       Group Gender Aneurisms_avg_mean Aneurisms_avg_sd Aneurisms_total_mean
  Control      F           172.8214         22.85519             691.2857
Treatment1      F           182.6667         13.72056             730.6667
Treatment2      F           229.5385         20.05717             918.1538
  Control      M           167.7344         19.01523             670.9375
Treatment1      M           172.5000         17.57581             690.0000
Treatment2      M           221.3523         29.26873             885.4091
  Aneurisms_total_sd
         91.42078
         54.88224
         80.22868
         76.06091
         70.30324
        117.07493

Using the aggregate() function to it’s full potential is, admittedly, a bit convoluted. If you don’t quite get all of the detail, especially how to format and deal with the output of a multiple-dependent-variable, multiple-independent-variable, multiple-function aggregate() call, that’s okay. The goal here is to demonstrate that what is possible with the basic functions available in R. Even the basic single-function aggregate() can be really useful in speeding up data processing. Start there and work you way up.

Additional Data Manipulation Exercises

Subsetting More Data

Suppose you want to determine the maximum number of aneurisms for subject 5 across quarters two to four. To do this you would extract the relevant subset from the data frame and calculate the maximum value. Which of the following lines of R code gives the correct answer?

max(dat[5, ])

max(dat[7:9, 5])

max(dat[5, 7:9])

max(dat[5, 7, 9])

Solution

Answer: 3

Explanation: You want to extract the part of the data frame representing data for patient 5 from quarter two (column 7) to quarter 4 (column 9). In this data frame, patient data is organised in rows and the days are represented by the columns. Subscripting in R follows the [i, j] principle, where i = rows and j = columns. Thus, answer 3 is correct since the patient is represented by the value for i (5) and the days are represented by the values in j, which is a subset spanning day 9 to 7.

Subsetting and Re-Assignment

Using the data frame dat from above: Let’s pretend there was something wrong with the instrument on the first two quarters for every second patient (subjects 2, 4, 6, etc.), which resulted in the measurements being twice as large as they should be.

Write a vector containing each affected patient (hint: ?seq).
Create a new data frame in which you halve the first two quarters’ values in only those patients.
Print out the corrected data frame to check that your code has fixed the problem.

Solution

whichPatients <- seq(2, dim(dat)[1], 2) # i.e., which rows
whichQuarters <- seq(6,7)         # i.e., which columns indicates quarters 1 and 2; you could also use c(6,7) or 6:7
dat2 <- dat # make a copy of the original data
# check the size of your subset: returns `30 5`, that is 30 [rows=patients] by 5 [columns=quarters]
dim(dat2[whichPatients, whichQuarters])

[1] 50  2

dat2[whichPatients, whichQuarters] <- dat[whichPatients, whichQuarters] / 2
dat2

        ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
 Sub001      M    Control           132 16.0        114.0        140.0
 Sub002      M Treatment2           139 17.2         74.0        104.5
 Sub003      M Treatment2           130 19.5        196.0        251.0
 Sub004      F Treatment1           105 15.7         99.5         70.0
 Sub005      M Treatment1           125 19.9        188.0        120.0
 Sub006      M Treatment2           112 14.3        130.0        133.0
 Sub007      F    Control           173 17.7        135.0         98.0
 Sub008      M Treatment2           108 19.8        108.0        119.0
 Sub009      M Treatment2           131 19.4        117.0        215.0
Sub010      F    Control           129 18.8         94.0         72.0
Sub011      M Treatment1           126 14.8        134.0        155.0
Sub012      F Treatment2            96 15.3         76.0         88.5
Sub013      F    Control            77 16.5        112.0        220.0
Sub014      M    Control           158 12.6         54.5         75.0
Sub015      M    Control            81 14.3        146.0        140.0
Sub016      M    Control           137 15.9         48.5         86.0
Sub017      F Treatment1           147 18.4        165.0        157.0
Sub018      M Treatment2           130 18.3         79.0        132.5
Sub019      M Treatment1           105 15.4        178.0        109.0
Sub020      F Treatment1            92 14.3         53.5         94.0
Sub021      F    Control           111 12.7        174.0        160.0
Sub022      M Treatment1           122 15.4         48.5         55.0
Sub023      F Treatment2            97 17.2        187.0        239.0
Sub024      F Treatment2           118 17.3         94.0         95.5
Sub025      M Treatment1            82 16.7        114.0        199.0
Sub026      M Treatment1           123 19.6         57.5         80.0
Sub027      M Treatment2           126 15.0        128.0        249.0
Sub028      F Treatment1            94 16.1         56.0        115.0
Sub029      M    Control           135 17.6        136.0        109.0
Sub030      F    Control           108 18.6         51.5         74.0
Sub031      F Treatment1           133 18.3        132.0        151.0
Sub032      M Treatment1           108 16.7         59.0         77.0
Sub033      M Treatment2           122 12.5        166.0        176.0
Sub034      M Treatment1           134 14.3         76.0         52.5
Sub035      M Treatment1           145 19.7        191.0        148.0
Sub036      F    Control           133 17.6         76.0         89.0
Sub037      F Treatment2            90 17.0        161.0        270.0
Sub038      M Treatment2           118 12.2        119.5         92.0
Sub039      M Treatment1           113 15.1        132.0        137.0
Sub040      M Treatment2           115 17.7         84.0        127.5
Sub041      F Treatment1           142 19.0        140.0        184.0
Sub042      M Treatment1           114 14.7         83.0         42.5
Sub043      M    Control           139 15.2        141.0        160.0
Sub044      M Treatment1            90 15.3         80.5         84.0
Sub045      F    Control           126 12.9        103.0        111.0
Sub046      F Treatment2           109 18.4        115.5        120.0
Sub047      M    Control           125 18.1        192.0        141.0
Sub048      M    Control            99 15.6         89.0         90.0
Sub049      M    Control           122 19.5        167.0        123.0
Sub050      M Treatment1           111 13.5         67.5         75.0
Sub051      M Treatment2           109 13.5        150.0        166.0
Sub052      F Treatment1           134 13.7         96.0         40.0
Sub053      F Treatment1           113 18.7        153.0        153.0
Sub054      F Treatment2           105 12.2        102.5        132.0
Sub055      M Treatment2           125 16.9        117.0        194.0
Sub056      F Treatment2           123 19.5         99.5         59.5
Sub057      M    Control           155 12.1        182.0        129.0
Sub058      M Treatment2           117 17.0         90.0         98.0
Sub059      M Treatment1           116 19.2        111.0        111.0
Sub060      F    Control           133 14.7         50.5         49.0
Sub061      F    Control            94 20.0        166.0        167.0
Sub062      F Treatment2           106 14.1         79.0         85.5
Sub063      F Treatment1           144 14.7        189.0        178.0
Sub064      M Treatment1           149 16.6         94.5         50.5
Sub065      F Treatment2           108 15.0        239.0        189.0
Sub066      M Treatment1           116 15.0         92.5        112.0
Sub067      F Treatment2           136 13.8        224.0        112.0
Sub068      F    Control            98 14.8         52.0         69.5
Sub069      M Treatment2           148 19.1        222.0        199.0
Sub070      M    Control            74 18.9         53.5         49.0
Sub071      M Treatment2           147 17.7        153.0        255.0
Sub072      M    Control           116 17.4         59.0         82.5
Sub073      F Treatment1           133 15.5        102.0        184.0
Sub074      M    Control            97 13.1         94.0         62.5
Sub075      M Treatment2           132 12.2        180.0        283.0
Sub076      F Treatment2           153 17.0         89.0        107.0
Sub077      M Treatment1           151 17.7        168.0        184.0
Sub078      M Treatment1           121 19.5         59.0         85.0
Sub079      M Treatment1           116 19.5        169.0        114.0
Sub080      F    Control           104 12.8         78.0         69.0
Sub081      M Treatment2           111 17.6        232.0        211.0
Sub082      M Treatment1            62 17.7         94.0         54.0
Sub083      M Treatment2           124 14.2        169.0        168.0
Sub084      M Treatment2           124 19.2        120.5        116.5
Sub085      M Treatment2           109 16.0         65.0        207.0
Sub086      F    Control           117 15.2        112.5         92.5
Sub087      F    Control            90 17.6        104.0        116.0
Sub088      F Treatment1           158 17.6         89.5         79.0
Sub089      M Treatment1           113 15.1        103.0        140.0
Sub090      M    Control           150 17.8         56.0         65.0
Sub091      F Treatment2           115 16.2        226.0        170.0
Sub092      M Treatment2            83 16.6        114.0        110.5
Sub093      F    Control           116 19.1        209.0        142.0
Sub094      F Treatment1           141 17.2         76.5         52.0
Sub095      M    Control           108 13.6        111.0        118.0
Sub096      M    Control           102 14.6         74.0         66.0
Sub097      F Treatment2            90 19.6        141.0        196.0
Sub098      M Treatment1           133 17.0         96.5         56.0
Sub099      M Treatment2            83 16.2        130.0        226.0
Sub100      M Treatment1           122 18.4         63.0         78.5
    Aneurisms_q3 Aneurisms_q4 Aneurisms_avg Aneurisms_max Aneurisms_total
          202          237        173.25           237             693
          248          248        213.25           248             853
          122          177        186.50           251             746
          233          220        198.00           233             792
          222          228        189.50           228             758
          320          294        285.00           320            1140
          154          245        158.00           245             632
          279          251        246.00           279             984
          181          272        196.25           272             785
         192          185        177.25           192             709
         247          223        189.75           247             759
         323          245        224.25           323             897
         225          195        188.00           225             752
         177          189        156.25           189             625
         239          223        187.00           239             748
         203          207        169.75           207             679
         200          193        178.75           200             715
         243          187        213.25           265             853
         206          182        168.75           206             675
         167          218        170.00           218             680
         203          183        180.00           203             720
         194          133        133.50           194             534
         281          214        230.25           281             921
         256          265        225.00           265             900
         242          195        187.50           242             750
         158          228        165.25           228             661
         294          315        246.50           315             986
         281          126        187.25           281             749
         105          155        126.25           155             505
         219          228        174.50           228             698
         234          162        169.75           234             679
         260          160        173.00           260             692
         253          233        207.00           253             828
         197          299        188.25           299             753
         166          185        172.50           191             690
         158          170        164.50           178             658
         232          284        236.75           284             947
         317          269        252.25           317            1009
         193          206        167.00           206             668
         273          274        242.50           274             970
         239          202        191.25           239             765
         179          196        156.50           196             626
         179          239        179.75           239             719
         212          181        180.50           212             722
         254          126        148.50           254             594
         260          310        260.25           310            1041
         180          225        184.50           225             738
         169          183        177.50           183             710
         236          224        187.50           236             750
         208          279        193.00           279             772
         153          204        168.25           204             673
         138          222        158.00           222             632
         236          216        189.50           236             758
         269          207        236.25           269             945
         216          211        184.50           216             738
         183          251        188.00           251             752
         226          218        188.75           226             755
         250          294        230.00           294             920
         244          201        166.75           244             667
         178          116        123.25           178             493
         232          241        201.50           241             806
         237          212        194.50           237             778
         177          238        195.50           238             782
         193          172        163.75           193             655
         297          300        256.25           300            1025
         151          182        185.50           224             742
         304          288        232.00           304             928
         211          204        164.50           211             658
         280          196        224.25           280             897
         204          138        136.75           204             547
         218          234        215.00           255             860
         220          227        182.50           227             730
         246          222        188.50           246             754
         191          157        165.25           191             661
         204          298        241.25           298             965
         291          240        230.75           291             923
         184          229        191.25           229             765
         249          249        196.50           249             786
         248          233        191.00           248             764
         218          258        192.50           258             770
         219          246        227.00           246             908
         180          136        153.00           188             612
         180          211        182.00           211             728
         292          182        237.00           292             948
         234          235        185.25           235             741
         195          235        210.00           235             840
         173          221        153.50           221             614
         216          244        199.25           244             797
         209          186        159.50           209             638
         175          191        152.00           191             608
         307          244        236.75           307             947
         316          259        256.00           316            1024
         199          184        183.50           209             734
         194          214        166.25           214             665
         173          191        148.25           191             593
         200          194        168.50           200             674
         322          273        233.00           322             932
         123          181        152.25           193             609
         286          281        230.75           286             923
        129          160        143.00           160             572

Using the apply function on patient data

The apply function can be used to summarize datasets and subsets of data across rows and columns using the MARGIN argument. Suppose you want to calculate the mean number of aneurisms for specific quarters and patients in the patient dataset.

Use a combination of the apply function and indexing to:

Calculate the mean number of aneurisms for patients 1 to 5 over all four quarters.

Calculate the mean number of aneurisms for quarters 1 to 3 (across all patients).

Calculate the mean number of aneurisms for each patient in treatment group 2 day (across all quarters).

Think about the number of rows and columns you would expect as the result before each apply call and check your intuition by applying the mean function.
Solution
 # 1.
apply(dat[1:5, 6:9], 1, mean)

# 2.
apply(dat[, 6:8], 2, mean)

# 3.
apply(dat[dat$Group == "Treatment2", 6:9], 1, mean)

Plotting – Additional Detail

R is a robust platform for creating publication quality graphics, regardless of the complexity of you data and presentation. We really just scratched the surface In Class and there is no way we will be able to adequately cover the available tools in one session of one class. The goal is to give you a taste of what you can do (i.e. virtually anything you can image) and where to look for the tools to begin to do so.

Multipanel plots

Most figures contain multiple panels containing different information. One option is to generate each panel as a separate figure and stitch them together after the fact using a PDF editor or some other graphics program. However, R has the tools to build multipanel plots. Let’s look at how we can put multiple related panels in the same plot. This requires going beyond the plot() and boxplot() functions.

Take a look at the ?plot documentation. Notice that there really aren’t that many input options, but that there is a “…” listed, which references the par() function. This will also introduce you to the par() function. Under the surface, there are hundreds of tweaks that you can make to plots through arguments. R has a set of predefined graphical parameters that are all stored in the background. The par() function queries and sets these parameters. If you don’t give it any arguments, par() will return the current settings for all of these variables. We can see what settings are available and what the defaults are using ?par.

?par

Before you start to fiddle with par(), it is a good idea to store all of the current settings to a variable, which can be used to restore the default if we need to later on.

plot.options <- par()

Now to take a look at what we have:

head(plot.options) # this is a list with a bunch of graphical related variables

$xlog
[1] FALSE

$ylog
[1] FALSE

$adj
[1] 0.5

$ann
[1] TRUE

$ask
[1] FALSE

$bg
[1] "white"

length(plot.options) # there are 72 variables that we can tweak in par() alone, and some have multiple parameters

[1] 72

Consider the iris data set. If we want to make a multipanel plot, say with box plots of Petal Length and Petal Width, and then the correlation between them, we can use par() before running plot to tell R how many rows and columns of figure panels we want to create. We use this by passing the argument mfrow = c(<# of rows>, <# of columns>) to par(). This changes the value of the background graphical parameters. Note that we later have to change this back if we want to make a single-panel plot.

Once we set mfrow we can then run our previous plot commands in the order we want them displayed. Let’s give it a try:

# get the current graphical parameters
old.plot.options <- par() # it never hurts to save a copy of the old parameters, so we can reset them if we break something

# setup a plot with 3 figure panels in a row using the mfrow argument
par(mfrow = c(1,3)) # 1 row, 3 columns

# Panel 1: Box plot with jittered stripchart points for Petal Length
boxplot(Petal.Length ~ Species, data = iris,
        outline = FALSE, ylab = "Petal Length (cm)")
stripchart(Petal.Length ~ Species, data = iris,
           pch = 16, method = "jitter",
           add=TRUE, vert=TRUE)

# Panel 2: Box plot with jittered stripchart points for Petal Width
boxplot(Petal.Width ~ Species, data = iris,
        outline = FALSE, ylab = "Petal Width (cm)")
stripchart(Petal.Width ~ Species, data = iris,
           pch = 16, method = "jitter",
           add=TRUE, vert=TRUE)

# Panel 3: Scatter plot comparing Petal Length to Petal Width
col.list <- c("red", "blue", "green")
col.vector <- col.list[iris$Species]
plot(iris$Petal.Length, iris$Petal.Width,
     xlab = "Petal Length (cm)", ylab = "Petal Width (cm)",
     col = col.vector)
legend("bottomright", legend = levels(iris$Species), col=col.list, pch=1)

# and finally, reset the mfrow argument to a single panel
par(mfrow = c(1,1))

Box plots with multiple independent variables – getting to publication quality

Box plots begin to tax the default behavior of R when it comes to things like axis labeling. Let’s use the gender-corrected and processed version of our aneurism data (“sample-processed.csv”) to take a look at plotting when multiple independent variables are in play. First, load the data:

dat <- read.csv("data/sample-processed.csv", stringsAsFactors = T)
head(dat)

      ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
Sub001      M    Control           132 16.0          114          140
Sub002      M Treatment2           139 17.2          148          209
Sub003      M Treatment2           130 19.5          196          251
Sub004      F Treatment1           105 15.7          199          140
Sub005      M Treatment1           125 19.9          188          120
Sub006      M Treatment2           112 14.3          260          266
  Aneurisms_q3 Aneurisms_q4 Aneurisms_avg Aneurisms_max Aneurisms_total
        202          237        173.25           237             693
        248          248        213.25           248             853
        122          177        186.50           251             746
        233          220        198.00           233             792
        222          228        189.50           228             758
        320          294        285.00           320            1140

What if we want to plot total aneurisms broken down by both gender and treatment group. The formula notation used in boxplot() makes this pretty straightforward:

# use the '+' to indicate multiple grouping variables
boxplot(Aneurisms_total ~ Group + Gender, data = dat)

We have the data in the right place, but the group labels are all horizontal, so they can’t be displayed. There is also no clear visual way to distinguish how the treatment groups and gender groups are separated.

We can take several steps to improve the look in chart:

Use the at argument to change the box spacing. By default, boxes are drawn centered over the integers 1, 2, 3, …, n, where n is the number of boxes. By redefining these locations and skipping numbers, you can add spaces between groups.
Use box colors to indicate treatment groups (col argument plus the legend() function) and x-axis labels to indicate gender. There are other ways to do this, but this is one of the more visually appealing.
Tick marks are often not useful in box plots. To turn them off, we will use the xaxt to run off all x-axis feature, and then use the separate function axis() to add the ones we want back (sans tick marks).
Make the y-axis label look better with ylab.
Add all points with stripchart().

# draw the box plot first
boxplot(Aneurisms_total ~ Group + Gender, data = dat,
        xaxt = "n", xlab = "", # turn off all x-axis labeling (so we can redraw it below)
        ylab = "# Aneurisms", # y axis label
        at = c(1,2,3, 5,6,7), # leave a space between gender groups
        col = c("grey","blue","red")) # color boxes by treatment for legend

# now add the points with stripchart with the same parameters as box
stripchart(Aneurisms_total ~ Group + Gender, data = dat,
        add = TRUE, vert = TRUE, # add the points to the current boxplot vertically
        at = c(1,2,3, 5,6,7), # leave a space between gender groups
        col = c("black"), pch = 16, # color and symbol for points
        method = "jitter") # jitter the points

# next add gender labels to the x-axis
axis(side = 1, # add axis labels to the x-axis
     at = c(2,6), # place the labels under the middle boxes of each group
     labels = c("female","male"), # gender labels
     tick = FALSE) # do not add tick marks

# now add a legend to indicate
legend(x = "top", inset = -0.13, # place the legend on top, but nudge it up a bit
       xpd = T, # allow the legend to be drawn outside the plot area
       bty = "n", # don't draw a box around the legend
       ncol = 3, # use 3 columns (one row, since we have 3 groups)
       fill = c("grey", "blue", "red"), # fill boxes with the same colors as boxplot
       legend = levels(dat$Group)) # labels by treatment group

This ends up with a fairly complex script for a single chart, but when you start to directly generate publication quality graphics in R, this is what it tends to look like. Once you go through the process of setting up a plot that you like once, you can always copy that whole chunk of code into the next project (or write a custom plot function with your own defaults!). Over time, you build up a sort of personal “library” of code chunks and can make really nice graphics in relatively little time.

Other resources for plotting

This lesson gives you access to the two most common chart types used in biology: scatterplots and box plots (which can usually be used in place of bar charts). This is just the tip of a very large iceberg. R has the capability of producing any visualization you can imagine, and there are full workshops and courses dedicated just to generating graphics in R. The potential applications are as varied as your projects, and the best thing you can do is to sit down and start using the functions with Google search close at hand.

We don’t have the time to do everything, but we will be using both of these chart types throughout the course, and introducing several others (histograms, density charts, line graphs, Kaplan-Meier charts, etc). For now, a few resources that will help you along the way:

A cheat sheet for the graphics tools that come with R (aka “R base graphics”) can be downloaded here.
The lattice package makes some improvements to R base graphics and functions. The defaults are better for example.
The other commonly graphics package is called ggplot2. I have tended toward base graphics, because you can literally do anything with them, but ggplot2 is easier in some ways, once you get the syntax down (which is very different). Cheat sheet is here.
Google is your friend. There are probably thousands of tutorials, forum Q&As, cheat sheets, etc. to answer your project specific questions. For a taste, go to Google and search “publication quality graphics in R”

Exercises

On the road – does what you drive or where you drive it matter?

Read in our sample data on car speeds (car-speeds.csv) and visualize differences in car speed by both state and color.

Solution

# set data and results directories
data.dir <- "data"
results.dir <- "results"

# load data
carSpeeds <- read.csv(paste0(data.dir,"/car-speeds.csv"))

# setup two panel figure
par(mfrow = c(1,2))

# use a box plot with jittered strip chart overlay to plot car speed as a function
# of state
boxplot(Speed ~ State, data = carSpeeds,
        outline = FALSE, ylab = "Car Speed (mph)")
stripchart(Speed ~ State, data = carSpeeds,
        pch = 16, method = "jitter",
        add=TRUE, vert=TRUE)

# use a box plot with jittered strip chart overlay to plot car speed as a function
# of car color
boxplot(Speed ~ Color, data = carSpeeds,
        outline = FALSE, ylab = "Car Speed (mph)")
stripchart(Speed ~ Color, data = carSpeeds,
        pch = 16, method = "jitter",
            add=TRUE, vert=TRUE)

Blood pressure as a risk factor for aneurism

Does blood pressure appear to have an impact on aneurism risk?

Examine these questions using the file that we created in the “Correcting Case in Gender” exercise earlier (sample-gendercorrected.csv). What, other than blood pressure, may explain any apparent correlation (or lack there of?).

Save your results to a PDF file.
Solution

I left this intentionally a bit vague, which means that there are several solutions. But that is often the nature of real data in real research.

Since we are trying to gt a general idea here, I am first going to calculate the average quarterly aneurisms and the total yearly aneurisms (you could also look at each quarter independently).
# set data and results directories
data.dir <- "data"
results.dir <- "results"

# load data
data.an <- read.csv(paste0(data.dir,"/sample-gendercorrected.csv"))

# define new columns for average quarterly and total annual aneurisms and calculate
data.an$Aneurisms_qmean <- rowMeans(data.an[,6:9])
data.an$Aneurisms_total <- rowSums(data.an[,6:9])
head(data.an)
      ID Gender      Group BloodPressure  Age Aneurisms_q1 Aneurisms_q2
1 Sub001      M    Control           132 16.0          114          140
2 Sub002      M Treatment2           139 17.2          148          209
3 Sub003      M Treatment2           130 19.5          196          251
4 Sub004      F Treatment1           105 15.7          199          140
5 Sub005      M Treatment1           125 19.9          188          120
6 Sub006      M Treatment2           112 14.3          260          266
  Aneurisms_q3 Aneurisms_q4 Aneurisms_qmean Aneurisms_total
1          202          237          173.25             693
2          248          248          213.25             853
3          122          177          186.50             746
4          233          220          198.00             792
5          222          228          189.50             758
6          320          294          285.00            1140
# initiate a PDF file and setup two panel figure
pdf(file=paste0(results.dir, "/aneurisms-bloodpressure-chart.pdf"), width = 8, height = 4)
par(mfrow = c(1,2))

# use a scatter plot to look for apparent correlation between blood pressure and
# both summary measures of aneurism (in separate panels)
plot(data.an$BloodPressure, data.an$Aneurisms_qmean,
     xlab = "Blood Pressure (mmHg)", ylab = "Mean Aneurisms per Quarter")

plot(data.an$BloodPressure, data.an$Aneurisms_total,
     xlab = "Blood Pressure (mmHg)", ylab = "Aneurisms per Year")

# turn off PDF device
dev.off()
png 
  2 
There is no apparent correlation, but there are a lot of potential confounders within the data set that we are not examining in this exercise so far (treatment, gender, age)

Correlations between petal and sepal in iris data

We looked a correlation between petal length and width. What about correlations between petal and sepal? Generate scatter plots to look at length and width correlations, broken down by species.

Save your results to a PDF.

Solution

# set data and results directories
data.dir <- "data"
results.dir <- "results"

# initiate a PDF file and setup two panel figure
pdf(file=paste0(results.dir, "/iris-petal-sepal-correlations.pdf"), width = 8, height = 4)
par(mfrow = c(1,2))

# Define colors
col.list <- c("red", "blue", "green")

# And we use the species.index in place of iris$Species to do the indexing and generate
# our color list for each point on the plot
col.vector <- col.list[iris$Species]

# Plot petal length x sepal length, then petal width x sepal width in subsequent panels
# we only need one legend, which after an initial look at the data, fits best in the top
# left of the first panel
plot(iris$Petal.Length, iris$Sepal.Length,xlab = "Petal Length (cm)",
      ylab = "Sepal Length (cm)", col = col.vector)
legend("topleft", legend = levels(iris$Species), col=col.list, pch=1)
plot(iris$Petal.Width, iris$Sepal.Width,xlab = "Petal Width (cm)",
      ylab = "Sepal Width (cm)", col = col.vector)

# turn off the PDF device
dev.off()

png 
  2 

Exploring plot options

R has hundreds of ways to change the way plots are displayed. The only way to really learn how to manipulate charts is to start playing with the options to see what they do. Below are a range of options that I have used at one point or another.

Try starting with one of the plots with today’s lesson, change one or more of the following parameters, then rerun the plot script to see what changed. A good way to do this is to plot two charts with a single argument change to the RStudio plot panel. The arrows at the top left of the panel let you look through previous charts and see what changed. Alternatively, write the two charts to subsequent pages of a PDF file for comparison.

Arguments to the plot() function Remember that default values can be found in ?plot or by running par() with no arguments, and argument descriptions can be accessed in ?par.

bty

border

col

lty

lwd

main

sub

type

xlab, ylab

xlim, ylim

xaxs, yaxs

xaxt, yaxt

xlog, ylog

xpd

Arguments to the par() function *Note that these are typically passed to par() prior to running running a plot function. Run par() for defaults and ?par for argument descriptions.

cex

mar

mfrow

oma

tcl

You can set most of the arguments listed above in plot() using the par() function as well, and it will set the default for future charts.

Functions that add elements to existing plots For this set, first generate a plot (e.g. with plot() or boxplot()) and then call the following functions. Use ? to see what arguments are required for each

abline()

axis()

box()

legend()

lines()

mtext()

points()

text()

title()

Key Points

Use apply(), lapply(), and sapply() to calculate statistics across the rows or columns of a data frame.

There are more advanced tools for complex data manipulation, including the dplyr and data.table packages.

Use aggregate() to calculate statistics based on the structure of a dataset.

previous episode

MCB 585: Multidisciplinary/Quantitative Approaches to Solving Biological Problems

next episode

Advanced Data Manipulation and Plotting

Overview

On Your Own