Multiple Test Correction

Overview

Class Date: 9/24/2024 -- On Your Own
Teaching: 90 min
Exercises: 30 min
Questions
  • What is the consequence of running multiple statistical comparisons?

  • How do we define a family of tests?

  • What are the strategies for controlling error in multiple testing?

Objectives
  • Understand why we correct our interpretation of statistical tests when running multiple tests in a study.

  • Understand what each common type of multiple test comparison controls for and when to choose each method.

  • Use the p.adjust() function in R to correct P-values for multiple comparisons.

On Your Own

Multiple test correction

When you run a statistical test, you draw a conclusion based on the probability of observing the data from your sample if the null hypothesis is true. Each time you run a statistical test, there is a chance of making either a Type I Error (rejecting \(H_0\) when \(H_0\) is true) or a Type II Error (accepting \(H_0\) when \(H_1\) is true).

As we discussed In Class, we set the P-value threshold, \(\alpha\), in such a way that we control Type I Errors. The commonly used value, \(\alpha = 0.05\), means that there is a 5% change of getting the observed result on any given test even if \(H_0\) is true. Alternatively, if we were to run the experiment repeatedly, we would expect the statistical test to return 1 P-value below the threshold in every 20 repeats, on average. Note that this does not mean that there is a 5% chance that the \(H_0\) is true based on the observed data.

So what happens when, within a given study, you run 20 statistical tests? Randall Monroe gives an excellent example of what can happen (and how it can be, and often is, misinterpreted):

XKCD #882: Significance
Image credit xkcd.com

 

In this fictitious example, the researchers test whether acne is associated with each of 20 different jelly bean colors, and find that eating green jelly beans is significantly associated with acne (!!).

Does this mean that green jelly beans cause acne? No, of course not. Even if the association were real, perhaps people with acne just like green jelly beans better? However, even that is going too far. Recall that setting \(\alpha\) = 0.05 means that we expect to reject \(H_0\) 5% of the time (or 1 time in 20) when \(H_0\) is true. So the illustrated test gave the expected outcome if jelly beans (of any color) had no connection to acne.

We have to consider the family of tests that we are conducting within a study to draw appropriate conclusions about our data. The process of multiple test correction formalizes this procedure.

All in the family

We said that you have to consider your P-values in the context of the family of tests that you are running in your study. So what do we mean by family? It turns out that this term is not well defined.

The XKCD jelly bean example is one extreme, where the P-value for each individual test in a clearly linked set of tests was interpreted on it’s own. The set of jelly bean tests were obviously related, and should be considered as a whole. Thus we consider these a family.

 

Near the other extreme is your average Cell paper, which can have 10s to 100,000s (or more in the age of systems biology) statistical tests in one paper. Should you group all of the various P-values calculated in one paper across figures and data types and treat them all as a family? Clearly not.

 

So where do you draw the line? Usually a family of tests are a series of statistical tests on similar data types using the same statistical test interpretation that are carried out together. Usually the interpretation of each test in the family will have some relation to the other test. Perhaps you are running a screen and want to pick candidates for further testing, for example. There is not a bright line that separates an independent series of tests from a family of tests. Ultimately, you have to use your own judgement.

 

There are a variety of approaches to correcting for multiple comparisons. The two most commonly used criteria are:

These tend to be employed in different types of studies with different goals. We will explore the R implementation of the most widely used tests in these categories.


Controlling Familywise Error Rate (FWER)

The most conservative version of a multiple test correction is designed to control the familywise error rate or FWER. The FWER is the probability that your family of tests includes one or more false positives (Type I Errors). FWER is the simplest to understand, because it is the direct familywise analog to \(\alpha\) for a single test.

We can calculate the FWER for \(\alpha = 0.05\) as the number of comparisons, \(n_{tests}\), increases:

      \(FWER = 1 - 0.95^n_{tests}\)

Plotting this equation:

 

As you repeat your tests, the chance that at least one result will be a false positive increases rapidly for the first few tests and begins to approach certainty once you get to around 60 tests or so. Keep in mind that this is at least one, so the higher you go the more false positives you will tend to accumulate.

 

The Bonferroni correction

The most direct way to control FWER is to simply lower the critical P-value to the point where your expected number of Type I Errors once again falls below your original single-test threshold. To do this, divide your original \(\alpha\) by the number of tests conducted:

      \(\alpha_{corrected} = \frac{alpha}{n_{tests}}\)

 

This is called the Bonferroni correction, after Carlo Emilio Bonferroni. This correction makes no assumptions about your data, population, or sampling method, so it works for any situation. All you need are your P-values from each individual statistical test, and the number of tests run. For the jelly bean example, the researchers performed 20 tests with \(\alpha = 0.05\), so:

      \(\alpha_{corrected} = \frac{alpha}{n_{tests}} = \frac{0.05}{20} = 0.0025\)

 

We now need a much lower P-value to reject the null hypothesis. As a practical alternative, we can leave \(\alpha\) alone and correct the P-value by multiplying by the number of tests (max P-value is always 1):

      \(P_{corrected} = min(P*n_{tests}, 1)\)

 

Let’s look at a real example. A mouse aging study was conducted on a set of commonly used inbred mouse strains. While longevity was the primary endpoint, the researchers also collected body composition data for each strain across lifespan. Let’s start by laoding the body composition dataset and taking a quick look at the structure:

body.comp <- read.delim("data/inbred.body.composition.txt")
str(body.comp)
'data.frame':	1187 obs. of  10 variables:
 $ strain       : chr  "129S1/SvImJ" "129S1/SvImJ" "129S1/SvImJ" "129S1/SvImJ" ...
 $ sex          : chr  "f" "f" "f" "f" ...
 $ animal_id    : chr  "SC-7749" "SC-7751" "SC-7752" "SC-7747" ...
 $ age          : int  6 6 6 6 6 6 6 6 6 6 ...
 $ body_weight_g: num  20.2 22.8 22.5 23.1 21.6 19.6 24 27.4 29.4 28.3 ...
 $ BMI          : num  2.66 2.88 2.97 2.92 2.85 2.71 2.96 3.38 3.33 3.49 ...
 $ percent_fat  : num  22.1 22.2 22.6 22.8 23.1 24 24.3 28.3 17.6 17.7 ...
 $ total_mass_g : num  17.9 19.8 19.4 20.6 18.9 17.3 21.3 25.1 26.5 25.9 ...
 $ lean_mass_g  : chr  "13.9" "15.4" "15" "15.9" ...
 $ fat_mass_g   : num  4.45 5.08 5.08 5.27 4.99 4.69 5.84 7.76 5.16 5 ...

 

Let’s imagine that we want to identify strains with a body weight difference between males and females as part of a study to examine sex differences during aging. We can perform an analysis to calculate the t-test P-value comparing body weight of male vs. female mice in each strain.

# grab strain list
strain.list <- unique(body.comp$strain)

# initialize data frame to store output information (P-values)
bw.by.sex <- data.frame(strain = strain.list,
                        P = numeric(length = length(strain.list)))

# cycle through each strain and compare male and female body weight 
# using a t-test
for (i.strain in 1:length(strain.list)) {
  # grab current strain name
  strain.c <- strain.list[i.strain]
  
  # run t-test for male vs. female body weight for current strain
  t.test.c <- t.test(body_weight_g ~ sex, 
                     data = body.comp[body.comp$strain == strain.c,],
                     type = "two.sample",
                     alternative = "two.sided")
  
  # update data frame
  bw.by.sex$P[i.strain] <- t.test.c$p.value
}

# display strains with significant P-values (alpha < 0.05)
strains.bw.by.sex <- bw.by.sex$strain[bw.by.sex$P < 0.05]
strains.bw.by.sex
 [1] "129S1/SvImJ"       "BALB/cByJ"         "BUB/BnJ"          
 [4] "C3H/HeJ"           "C57BL/10J"         "C57BL/6J"         
 [7] "C57BLKS/J"         "C57BR/cdJ"         "C57L/J"           
[10] "FVB/NJ"            "LP/J"              "MOLF/EiJ"         
[13] "NOD.B10Sn-H2<b>/J" "NON/ShiLtJ"        "P/J"              
[16] "PL/J"              "PWD/PhJ"           "RIIIS/J"          
[19] "SM/J"              "SWR/J"             "WSB/EiJ"          
# note the number of strains with a significant body weight difference
# between sexes relative to the total number of strains
length(strain.list)
[1] 32
length(strains.bw.by.sex)
[1] 21

 

We have 21 strains where sex significantly impacts body weight according to our initial approach. However, with 32 strains we ran 32 t-tests, and our threshold \(\alpha = 0.05\) means that we expect 1 to 2 of these to be false positives. Let’s see how the results change when we apply the Bonferroni multiple test correction to minimize our change of including a false positive.

# add a row for corrected P-value
bw.by.sex$P.bonferroni <- NA

# calculate Bonferroni correction manually, using the min() function
# to correct any P-value that ends up > 1 to 1 (note that min is not a 
# vectorized function, so we have to construct a for loop or use apply)
for (i in 1:length(bw.by.sex$P)) {
  bw.by.sex$P.bonferroni[i] <- min(bw.by.sex$P[i]*length(bw.by.sex$P), 1)
}

 

R has a function, p.adjust() which automatically calculates the adjusted P-value using a variety of methods:

# calculate Bonferroni corrected P-value using p.adjust()
P.bonferroni.auto <- p.adjust(bw.by.sex$P, method = "bonferroni")

# check against manual 
bw.by.sex$P.bonferroni == P.bonferroni.auto
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE
# how many strains now have significant differences?
strains.bw.by.sex.bf <- bw.by.sex$strain[bw.by.sex$P.bonferroni < 0.05]
strains.bw.by.sex.bf
 [1] "BALB/cByJ"         "BUB/BnJ"           "C3H/HeJ"          
 [4] "C57BL/10J"         "C57BL/6J"          "C57BLKS/J"        
 [7] "C57L/J"            "LP/J"              "NOD.B10Sn-H2<b>/J"
[10] "P/J"               "PL/J"              "RIIIS/J"          
[13] "SM/J"              "SWR/J"            
length(strains.bw.by.sex.bf)
[1] 14

 

Applying the Bonferroni method reduces the number of strains with significant body weight between sexes from 21 to 14.

As you may already realize, the Bonferroni test reduces Type I Error rates \((\alpha)\) at the cost of increasing the risk of committing a Type II Error \((\beta)\); aka “false negative”; accepting \(H_0\) when \(H_1\) is true), and thus reducing power \((power = 1 - \beta)\)) to detect real differences when they are present. Thankfully, there is a universally better option that does at least a bit better.

 

The Holm-Bonferroni correction

The primary advantage to the Bonferroni is that the procedure is intuitive and the calculations can be done in your head. This makes it ideal for a back-of-the-envelope first-pass examination of you data to see if you have any comparisons that are in the ballpark.

There is a closely related procedure called the Holm-Bonferroni (or just Holm) method for multiple test correction, named for Sture Holm, who codified the method and is currently serving as Professor Emeritus of Mathematical Sciences at Chalmers University. The procedure and proof is more complicated. While it can be done by hand without too much trouble, we will generally require a computer for efficient processing (which is the point of this course anyway!). If you are interested in the mathematics, there are many places that describe the underlying principles, and the Wikipedia article on Holm-Bonferroni method does a decent job.

The bottom line is that the Holm method does just as good a job at controlling FWER, but does so with better power in all cases. This method is also available in the p.adjust() function:

# calculate Holm-Bonferroni corrected P-value using p.adjust()
bw.by.sex$P.holm <- p.adjust(bw.by.sex$P, method = "holm")

# how many strains now have significant differences?
strains.bw.by.sex.holm <- bw.by.sex$strain[bw.by.sex$P.bonferroni < 0.05]
strains.bw.by.sex.holm
 [1] "BALB/cByJ"         "BUB/BnJ"           "C3H/HeJ"          
 [4] "C57BL/10J"         "C57BL/6J"          "C57BLKS/J"        
 [7] "C57L/J"            "LP/J"              "NOD.B10Sn-H2<b>/J"
[10] "P/J"               "PL/J"              "RIIIS/J"          
[13] "SM/J"              "SWR/J"            
length(strains.bw.by.sex.holm)
[1] 14

 

Still 14 strains, so the Holm method produces similar results to the Bonferroni in this case. How do the P-values compare?

# plot P-values for Bonferroni vs. Holm with a line for equal values
plot(bw.by.sex$P.bonferroni, bw.by.sex$P.holm)
lines(x = c(0,1), y = c(0,1))

 

Despite not finding any new significant strains, it is clear that Holm calculates a lower adjusted P-value in each case relative to Bonferroni. This visually demonstrates that Bonferroni is more conservative.

While the Holm method is all around better than the Bonferroni method, all FWER are very conservative, strictly limiting the probability of any Type I Error occurring to a defined maximum of 1 per family of tests. This if generally fine if you are making a few 10s of comparisons. But what about 1,000s or 100,000s of comparisons, which is commonly needed to analyze modern ‘-omics’ data? FWER methods will control Type I Errors, but will also tremendously limit your power to detect any real differences that are actually present. For large data sets, We need a different type of correction.


Controlling False Discovery Rate (FDR)

A second class of multiple test correction focuses on “discoveries”–comparisons with significant P-values–and limits the fraction of total discoveries that are false (i.e. the False Discovery Rate, or FDR), in contrast to FWER methods, which control the absolute number of false discoveries without regard to the total number of comparisons.

Let’s say we run 1000 comparisons, and 50 are significant using the uncorrected P-value (we reject \(H_0\) in 50 comparisons). FWER will ensure that we only have a small chance to reject \(H_0\) incorrectly (i.e. when \(H_0\) is, in fact, true). In the process, after correction, we only have 20 significant values remaining. In contrast, FDR analysis may remove 5, leaving 45 positives in the corrected data set, with an expectation that 4-5 of these remaining are incorrectly rejecting \(H_0\). The second approach has a lot more power to detect true differences–it potentially “discovers” ~15 true positives that were incorrectly rejected by the FWER method–at the cost of a higher Type I Error rate.

 

The Benjamini-Hochberg FDR method

The most common FDR procedure is the Benjamini-Hochberg method, named for Yoav Benjamini and Yosef Hochberg of Tel Aviv University, who originally described the False Discovery Rate criteria for correcting multiple comparisons.. Again, for more detail on the underlying mathematics, the Wikipedia page on False Discovery Rate has a decent explanation. For our purposes, let’s look at the practical output by applying the correction to our body weight data using p.adjust():

# calculate Benjamini-Hochberg corrected P-value using p.adjust()
bw.by.sex$P.BH <- p.adjust(bw.by.sex$P, method = "BH")

# how many strains now have significant differences?
strains.bw.by.sex.BH <- bw.by.sex$strain[bw.by.sex$P.BH < 0.05]
strains.bw.by.sex.BH
 [1] "129S1/SvImJ"       "BALB/cByJ"         "BUB/BnJ"          
 [4] "C3H/HeJ"           "C57BL/10J"         "C57BL/6J"         
 [7] "C57BLKS/J"         "C57BR/cdJ"         "C57L/J"           
[10] "FVB/NJ"            "LP/J"              "MOLF/EiJ"         
[13] "NOD.B10Sn-H2<b>/J" "NON/ShiLtJ"        "P/J"              
[16] "PL/J"              "RIIIS/J"           "SM/J"             
[19] "SWR/J"            
length(strains.bw.by.sex.BH)
[1] 19

 

Note that the meaning of our P-value threshold is different under FWER and FDR methods. In FDR, by setting \(\alpha = 0.05\) we are stating that we expect 5% of significant P-values to be false positives. Since we now have 19 strains with significant differences in body weight, we expect that ~1 of these strains does not actually have a body weight difference between sexes. Let’s compare these P-values to those calculated using the FWER method:

# plot P-values for Bonferroni vs. Benjamini-Hochberg with a line for 
# equal values
plot(bw.by.sex$P.bonferroni, bw.by.sex$P.BH)
lines(x = c(0,1), y = c(0,1))

# plot P-values for Holm vs. Benjamini-Hochberg with a line for equal 
# values
plot(bw.by.sex$P.holm, bw.by.sex$P.BH)
lines(x = c(0,1), y = c(0,1))

 

We can see that the P-values for the Bonferroni and Holm methods are a lot more conservative than the FDR method.

Which approch is correct? It depends entirely on the goals of your study, and whether you can tolerate Type I or Type II Errors better within the context of those goals. A false positive is much less of an issue if you are screening a drug library for potential candidates that will then be followed up with additional studies, than it will be for a cancer diagnosis where a patient may be subjected to harmful treatment regimes like radiation or chemotherapy.


Other methods

There are other tests that make different assumptions about your data, sampling, and test procedures. We will not cover these here, but see the p.adjust() documentation, and various statistics textbooks and websites for additional information should you come across a more complex data set.


Exercises

Is body weight associated with body composition in aged mice?

Both fat reserves and muscle mass (one component of lean mass) have been associated with better outcomes following health challenges in elderly humans and in mouse models.

In mice, body weight increases early in life, plateaus, and then decrease later in life. In the mouse aging study that we examined earlier, we have information on both total body weight and on different measures of body composition at different ages. We want to know if, in older mice, the body weight is consistently associated with body composition.

Use the data in “inbred.body.composition.txt” to determine if body weight of old (20-month-old) mice correlates consitently with body composition (as indicated by fat percentage) across differe strains of mice.

Solution

First, let’s makes sure our data is properly loaded, and grab just the subset for the 20-month-old mice.

body.comp <- read.delim("data/inbred.body.composition.txt")
body.comp.20 <- body.comp[body.comp$age == 20,]

 

Next we will take a quick look at the distribution of our variables of interest, staring with body weight.

qqnorm(body.comp.20$body_weight_g)
qqline(body.comp.20$body_weight_g)

As we have seen that strain can impact normality, and we are breaking down our analysis by strain anyway, let’s spot check a few strains.

par(mfrow = c(2,2))
qqnorm(body.comp.20$body_weight_g[body.comp.20$strain == "C57BL/6J"])
qqline(body.comp.20$body_weight_g[body.comp.20$strain == "C57BL/6J"])
qqnorm(body.comp.20$body_weight_g[body.comp.20$strain == "KK/HlJ"])
qqline(body.comp.20$body_weight_g[body.comp.20$strain == "KK/HlJ"])
qqnorm(body.comp.20$body_weight_g[body.comp.20$strain == "BUB/BnJ"])
qqline(body.comp.20$body_weight_g[body.comp.20$strain == "BUB/BnJ"])
qqnorm(body.comp.20$body_weight_g[body.comp.20$strain == "129S1/SvImJ"])
qqline(body.comp.20$body_weight_g[body.comp.20$strain == "129S1/SvImJ"])

 

Looks normal enough within each strain to go forward under that assumption. Now let’s check fat percentage:

qqnorm(body.comp.20$percent_fat)
qqline(body.comp.20$percent_fat)

 

This looks very similar, so we are probably looking at the same story as body weight with respect to non-normality being driven by pooling the strains, but to be sure:

par(mfrow = c(2,2))
qqnorm(body.comp.20$percent_fat[body.comp.20$strain == "C57BL/6J"])
qqline(body.comp.20$percent_fat[body.comp.20$strain == "C57BL/6J"])
qqnorm(body.comp.20$percent_fat[body.comp.20$strain == "KK/HlJ"])
qqline(body.comp.20$percent_fat[body.comp.20$strain == "KK/HlJ"])
qqnorm(body.comp.20$percent_fat[body.comp.20$strain == "BUB/BnJ"])
qqline(body.comp.20$percent_fat[body.comp.20$strain == "BUB/BnJ"])
qqnorm(body.comp.20$percent_fat[body.comp.20$strain == "129S1/SvImJ"])
qqline(body.comp.20$percent_fat[body.comp.20$strain == "129S1/SvImJ"])

 

Looks normal enough. We will proceed with that assumption. Looking at the UCLA Institute for Digital Research & Education Table, we are looking at the correlation between 1 interval continuous dependent variable and 1 interval continuous independent variable. The R function for this sort of correlation is cor.test():

?cor.test

 

For correlation tests, the parametric version is Pearson (assumes a normal distribution), while the non-parametric version is Spearman. We will use Pearson because we have data that appears to have an approximately normal distribution. Now we just need to set up our analysis loop:

# grab strain list
strain.list <- unique(body.comp$strain)

# initialize data frame to store output information (P-values)
bw.fat.corr <- data.frame(strain = strain.list,
                        P = numeric(length = length(strain.list)))

# cycle through each strain and compare male and female body weight 
# using a t-test
for (i.strain in 1:length(strain.list)) {
  # grab current strain name
  strain.c <- strain.list[i.strain]
  
  # run t-test for male vs. female body weight for current strain
  cor.test.c <- cor.test(body.comp$body_weight_g[body.comp$strain == strain.c], 
                       body.comp$percent_fat[body.comp$strain == strain.c], 
                       alternative = "two.sided", method = "pearson")
  
  # update data frame
  bw.fat.corr$P[i.strain] <- cor.test.c$p.value
}

# Now we can apply our Holm-Bonferroni correction (and for good
# measure, the Benjamini-Hochberg FDR test, since or main goal is to
# get a sense for the number of strains with a correlation)
bw.fat.corr$P.holm <- p.adjust(bw.fat.corr$P, method = "holm")
bw.fat.corr$P.BH <- p.adjust(bw.fat.corr$P, method = "BH")

# display strains with significant P-values (alpha < 0.05)
strains.bw.fat.corr <- bw.fat.corr$strain[bw.fat.corr$P < 0.05]
strains.bw.fat.corr.holm <- bw.fat.corr$strain[bw.fat.corr$P.holm < 0.05]
strains.bw.fat.corr.BH <- bw.fat.corr$strain[bw.fat.corr$P.BH < 0.05]
length(strains.bw.fat.corr)
[1] 20
length(strains.bw.fat.corr.holm)
[1] 17
length(strains.bw.fat.corr.BH)
[1] 20

 

The range of strains is 14-21 of 32 for our three P-values (uncorrected, Holm, and BH). The third is based on false discovery rate, which may be the best choice here. Since we are trying to get a sense of how many strain of our 32 have correlated body weight and body composition, we care about minimizing both Type I and Type II Errors. The FDR gives more balance to both sides of this trade-off.

If you want to go further, there are many followup questsion. Since we see only maybe half of the strains with a correlation, this may mean that body weight is not consistently driven by changes in body composition, but perhaps that both lean and fat mass are declining with age. Alternatively, sex may be masking an effect in some strains but not others. There are many questions we would want to address if we were to move forward with this data set:

  • Are body weight and sex correlated similarly for each sex within each strain?
  • Do we have sufficient numbers of mice to detect a corrlation in the strains with non-significant p-values?
  • How does this correlation shift with age?
  • Does the change in body weight between time points correlate with change in body composition between time points?

While we won’t be going into this much detail here, you can take these as additional exercises if you would like more practice.


Key Points

  • Running multiple comparisons increases you chance of making a Type I Error.

  • Different multiple test correction strategies correct for different types of errors (Type I vs. Type II) using different strategies.

  • The basic outcome of a multiple test correction is to lower the P-value threshold (\(lpha\)) below which you reject the null hypothesis.