Final Projects
Overview
Class Date: 12/3/2024, 12/5/2024, 12/10/2024
Teaching: 145 min
Exercises: 0 min
Questions
What are the requirements of the MCB 585 individual class projects?
Objectives
Use the skills learned in MCB 585 to conduct a new (or reproduce an old) analysis on a large dataset.
Final project details
Overview & Goals
The final project is the capstone experience for MCB 585. It serves as an opportunity to synthesize and apply the quantitative and computational skills learned throughout the course to a novel, large dataset. The core objective is to use these skills to formally evaluate a hypothesis, design a quantitative experiment, or explore the behavior of a complex biological system. We strongly encourage you to use data from your own research, as a key goal of this course is to provide you with practical tools for your graduate work. However, several alternatives are available if you do not have a suitable dataset. The project is composed of three parts: (1) a proposal, (2) a final written report with code, and (3) an oral presentation. Together, these components account for 35% of your final MCB 585 grade.
Due Dates & Submission Instructions
The three parts of the final project are due as follows:
- Part 1: Project Proposal. Due Tuesday 11/4/2025
- Part 2: Written Report with Code. Due Tuesday 12/2/2025
- Part 3: Oral Presentation. Delivered during class on 12/4/2025, 12/9/2025, and 12/11/2025 (a signup sheet will be available toward the end of November).
Submission. Project proposals and written reports should be submitted to the corresponding D2L Assignment drop box. The oral presentation only needs to be delivered on your assigned day; nothing needs to be turned in separately.
Part 1: Project Proposal
The purpose of the proposal is to ensure your project has an appropriate scope and to allow instructors to provide early feedback. In some cases, instructors may request a brief meeting to clarify details or suggest alternative approaches.
- Value: 5% of final grade
- Length: ~1 page
- Due Date: 11/4/2025
Your proposal must outline the following:
- Biological Question or Analysis Goal: Clearly state the specific question or hypothesis you intend to address.
- Dataset Description: Identify the dataset(s) you will use. Provide descriptive details, including:
- The source of the data (e.g., your own research, public repository, published paper)
- The number of samples in the dataset (datasets should preferably contain at least 1,000)
- The primary variable or variables of interest
- Any potential confounding variables.
- Data Submission: If you plan to use an unpublished dataset, it must be included with your proposal. For large datasets (e.g., high-throughput sequencing, genomics), provide a single example input file (e.g., FASTQ, counts file, sample image).
- Note: Please ask for approval in advance if your chosen dataset does not meet these criteria. We do make exceptions if the project is a good fit for the course material.
- Proposed Analysis Plan: Describe the analytical approach you plan to take. Specify which tools and skills from the MCB 585 coursework you will use. Projects must use at least one, and preferably multiple, tools from the course.
- Potential Pitfalls: Briefly describe potential challenges you might encounter (e.g., technical issues, unexpected data structure) and propose alternative approaches you could take.
Part 2: Written Report & Code
You will submit a written report of your findings, accompanied by the code used to perform the analysis.
- Value: 15% of final grade
- Length: 1-2 pages of single-spaced text, plus figures and tables
- Due Date: Tuesday 12/2/2025
The report should be structured like a brief scientific paper with the following sections:
- Introduction: Provide background on the biological question and state the specific goals of your project.
- Dataset Description: Briefly describe your selected dataset, including number of samples, primary variables of interest, and confounding variables.
- Methods:
- Describe your analysis in sufficient detail that it could be replicated independently.
- If you use web tools or other pre-existing pipelines, you must demonstrate your understanding of the underlying algorithm(s).
- Your final report must describe all tools used, including a precise description of how generative AI was employed (if applicable).
- Results: Present your findings in text, supported by figures and tables, as needed.
- Discussion & Challenges: Interpret your results. Discuss any technical problems you encountered and explain how you addressed them.
- Code Submission: You must submit a well-documented R or MATLAB script, or a CellProfiler workflow, that we can use to reproduce your analysis. “Well-documented” means your code includes descriptive comments and clear variable names that make the purpose of each step obvious.
- If you use tools other than R or MATLAB for more than half of your analysis, you must receive instructor permission.
- Your code must be submitted for review prior to your final presentation.
Part 3: Oral Presentation
During the last few class meetings, each student will deliver a presentation on their project.
- Value: 15% of final grade
- Length: Plan to give your presentation within a 15-minute slot, structured as a 10-12 minute talk followed by 3-5 minutes for questions and discussion
- Date: 12/4/2025, 12/9/2025, and 12/11/2025 (a signup sheet will be available toward the end of November)
Your presentation should describe:
- Biological Question: The background and motivation for your project, and the specific goal of your analysis.
- Dataset Description: Briefly describe your selected dataset, including number of samples, primary variables of interest, and confounding variables.
- Computational Approach: Describe the computational tools that you employed to address the question of interest.
- Results: What you discovered through your analysis.
- Discussion & Challenges: Any problems you encountered and how you addressed them.
Additional Information
Choosing a Project Topic: You have several options for your project.
- Use Your Data (Strongly Encouraged!): Apply the tools from the course to learn something new about a dataset from your own thesis research.
- Reproduce a Published Finding: Select a paper with an interesting analysis and reproduce one of its figures using either data provided with the paper or a similar dataset.
- Public or Course Dataset: Analyze a publicly available dataset from an online resource or one of the datasets provided in the course. If you choose a course dataset, you must conduct an analysis that extends beyond what we did in class (or apply an analysis from class to a different dataset than the one used).
Example Project Ideas: Here are a few sample projects. You are free to use or adapt these examples if you are unable to find a suitable project on your own.
- Lifespan Power Charts for Inbred Mice: Using the inbred mouse lifespan dataset, construct power charts to serve as a reference for aging studies. Compare the t-test and simulation methods for determining sample size, and account for variability between male and female mice. Use your power charts to compare experimental design parameters between different inbred strains.
- Reexamine the Impact of Observation Frequency on Lifespan Analysis: A 2017 Petrascheck & Miller paper (DOI: 10.3389/fgene.2017.00092) used a Gompertz distribution to model C. elegans lifespan observations. Reproduce their analysis using the “CHARGE.screen.lifespan.data.txt” dataset provided in the course materials. Use the Weibull distribution method presented in class in place of the Gompertz distribution.
- Demonstrate the Prediction of the Central Limit Theorem: The Central Limit Theorem predicts that a sampling distribution will be approximately normal for large samples, regardless of the population distribution’s shape. Using simulated population distributions of various shapes (e.g., skewed, multi-modal), explore the relationship between sample size and the normality of the sampling distribution. Demonstrate that the Central Limit Theorem is accurate and, for each selected distribution, estimate the sample size at which the shape of the population distribution becomes irrelevant. Compare this sample size between distributions.
- Identifying Molecular Subtypes of Breast Cancer: Using a TCGA breast cancer dataset, normalize the data, select for the most variable genes, and perform hierarchical, k-means, and model-based clustering. Compare your resulting clusters to established subtypes (e.g., ER/PR/HER2 status or PAM50) and use survival analysis to identify differences in prognosis between the identified subtypes.
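As a rough illustration of the Central Limit Theorem idea above, here is a minimal sketch. It is written in Python with only the standard library, purely for illustration; a course project would implement the same logic in R or MATLAB. The exponential population, the sample sizes, the 2,000 replicates, the seed, and the use of skewness as a simple stand-in for "normality" are all assumptions chosen for this example, not course requirements.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples, rng):
    """Return n_samples means, each computed from sample_size random draws."""
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def skewness(values):
    """Third standardized moment; close to 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


def draw_exponential(rng):
    # Hypothetical right-skewed population: exponential with mean 1.
    return rng.expovariate(1.0)


rng = random.Random(585)  # fixed seed so the run is repeatable
skew_by_n = {}
for n in (2, 10, 50, 200):
    # Build the sampling distribution of the mean for this sample size.
    means = sample_means(draw_exponential, n, 2000, rng)
    skew_by_n[n] = skewness(means)
    print(f"n = {n:3d}  skewness of sample means = {skew_by_n[n]:.2f}")
```

For an exponential population, the skewness of the sample mean is expected to shrink roughly as 2/sqrt(n), so the printed values should approach 0 as n grows. A full project would repeat this for several population shapes and use a more formal measure of normality (for example, a Q-Q plot or a normality test).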
Use of Generative AI (e.g., ChatGPT, Gemini, Claude):
- Permitted Use: You may use generative AI tools to assist with coding and debugging for your project.
- Prohibited Use: You should not use AI tools to fully generate your code, generate the final written answers, or write the narrative for your report. The writing and scientific interpretation must be your own.
- Responsibility: You are responsible for fact-checking and critically examining any output that you receive from an AI tool. These tools can produce code that is flawed or information that is biased or incorrect. This is particularly true for specialized analyses like power and survival analyses, where current models have a tendency to try to “reinvent the wheel” and derive analysis tools from scratch (and do so poorly).
- Citation Requirement: If you use generative AI, you must include an acknowledgement in your final report that describes precisely how these tools were employed. Please follow MLA guidelines for citing generative AI.
Project notes
The goal of this project is to analyze datasets for which the tools in R and MATLAB provide a significant advantage. We prefer that you select datasets that contain at least 1,000 data points. Please ask in advance if you have a dataset in mind that does not meet this criterion.
Projects should make use of at least one, and preferably multiple, of the tools presented in the course.
Projects that utilize webtools or other analysis pipelines are allowed; however, at least half of the analysis should be performed as original code created by the student using tools covered in class, and a full understanding of the algorithms underlying the webtools should be demonstrated in the writeup and presentation.
The goal of this course is to provide you with a set of practical tools that can be employed in your own research. As such, we strongly encourage you to use your own data in your project. An alternative is to select a paper that conducts an interesting analysis and reproduce one of the figures (using their data or a similar dataset). If you do not currently have data useful for this purpose, you are welcome to use the datasets provided as part of the course (downloaded at the beginning of the course setup, or here). There are also many online resources with publicly available datasets that you can use for your project. For example:
Key Points
MCB 585 individual final projects.