Final Projects

Overview

Class Date: 12/3/2024, 12/5/2024, 12/10/2024
Teaching: 145 min
Exercises: 0 min
Questions
  • What are the requirements of the MCB 585 individual class projects?

Objectives
  • Use the skills learned in MCB 585 to conduct a new (or reproduce an old) analysis on a large dataset.

Final project details

MCB585 will culminate in the completion of individual final projects for each student. The goal of these projects is to apply the skills learned in this course to a novel large dataset to evaluate a hypothesis, design an experiment, or explore the behavior of a system.

Each project will consist of three parts:

  1. Proposal (due 11/7/2023). Each student should submit a written proposal (~1/2 page) that outlines the following:
    • The question being addressed.
    • o The dataset(s) to be used, with descriptive details (e.g. number of samples, variables of interest, potential confounding variables). If an unpublished dataset is being used, the data should be included with the proposal. For large datasets (e.g. outputs from high-throughput systems, genomic data) please provide one example input file in fastq, counts (or another appropriate format) through Box, Microsoft OneDrive, Dropbox, or Google Drive. For large microscopy datasets, a sample image should be provided.
    • The proposed analysis plan, including which tools/skills from the MCB585 coursework will be employed.
    • A description of potential pitfalls that may be encountered and alternative approaches.
Note: The main purpose of the proposal is to allow the instructors to judge the scope of the project and provide feedback prior to the students beginning analysis. In some cases, instructors may request a brief meeting to clarify details or suggest alternative approaches.
  1. Report and code (due 12/3/2024). A brief (1-2 pages of text plus figures) report describing the findings of the project should be turned in once analysis is complete. The report should include, as a separate attachment in addition to the report, a well-documented R or MATLAB script or scripts that can be run to reproduce the analysis presented.

  2. Presentations (12/3/2024, 12/5/2024, 12/10/2024). During the last 3 days of class, each student will give a 10-15 minute presentation describing:

    • The biological question that they chose to explore.
    • The computational approach that they used in their analysis.
    • What they discovered.
    • Any problems that they encountered along the way, and how they addressed the problems.

Project notes

  1. The goal of this project is to conduct analysis of datasets for which the tools in R and Matlab provide a signficant advantage. We prefer that datasets be selected that contain at least 1000 data points. Please ask in advance if you have a dataset in mind that does not meet this criteria.

  2. Projects should make use of at least one, and preferrably multiple, of the tools presented in the course.

  3. Projects that utilize webtools or other analysis pipelines are allowed; however, at least half of the analysis should be performed as original code created by the student in R or MATLAB, and a full understanding of the algorithms underlying the webtools should be demonstrated in the writeup and presentation. For image analysis, a cell profiler pipeline can be used without having to write the code in MATLAB.

  4. The goal of this course is to provide you with a set of practical tools that can be employed in your own research. As such, we strongly encourage you to use your own data in your project. An alternative is to select a paper that conducts and interesting analysis and reproduce one of the figure (using their data or a similar dataset). If you do not currently have data useful for this purpose, you are welcome to use the datasets provided as part of the course (downloaded at the beginning of the course setup, or here). There are also many online resources with publically available datasets that you can use for your project. For example:


Examples

Here are a few examples projects. Feel free to use one of these or some modifed version of these if you are having trouble coming up with a project of you own.

  1. Inbred mouse power charts. Using the inbred mouse lifespan data, construct a series of power charts to use as a reference for mouse aging studies. Consider potential confounding factors, and compare the t-test method to the simulation method for determining sample size. Be sure to account for differences in variability between male and female mice.

  2. Re-examination of observation frequency in lifespan analysis. In 2017, the Dana Miller (University of Washington) and Michael Petrascheck (Scripps Research Institute) published a joint paper examining how several technical aspects of experimental design affect reproducibility (Petrascheck & Miller, 2017). They used a large set of wild-type C. elegans lifespan observations modeled using a Gompertz distribution. Reproduce this analysis using a novel dataset (“CHARGE.screen.lifespan.data.txt” in MCB585-sample-data.zip, updated 10/21/2019) modeled using a Weibull distribution.

  3. Demonstrate the predictions of the Central Limit Theorem. We have repeatedly discussed the prediction of the Central Limit Theorem that, regardless of the shape of the population distribution, the sampling distribution for a phenotype will be approximately normal for large samples. Using a variety of simulated population distributions with different shapes (e.g. highly skewed, multi-modal), explore the relationship between sample size and normality of the sampling distribution. For the set of distributions provided by default in R, see ?distributions. Others are available through add-on packages.

  4. Identifying molecular subtypes of breast cancer. Normalize TCGA breast cancer data from cBioPortal, select the most variable genes, and perform hierarchical, k-means, and model-based clustering on the tumors. Compare and contrast your clusters with subtyping based on presence of ER, PR, and HER2 markers, or based on the PAM50 signature. Use survival analysis to identify differences in prognosis between the subtypes.

 


Key Points

  • MCB 585 individual final projects.