Manuscript/Report Template for a Data Analysis Project
This exercise implements usage of GitHub and some basic data analysis. This is the basic structure of a data analysis project.
Rachel Robertson contributed to this exercise.
1 Class Exercise
This dataset comprises heights, weights, and gender identification of nine individuals. In my additions, I included two new column variables with the age and political party affiliation of these individuals.
Figure 1 shows a boxplot figure produced by one of the R scripts.
Democratic party affiliation appears to be overrepresented in our sample, as Green party and Republican party affiliations are fewer, and thus don’t have enough datapoints to fully show a boxplot. Nonetheless, it appears that the more conservative the party affiliation, Libertarian and Republican, the taller the person. It would be interesting to further differentiate this boxplot based on sex, as males tend to be taller and also tend to affiliate with the Republican party.
Figure 2 shows a scatterplot figure produced by one of the R scripts.
It appears that there is an inverse relationship between age and weight, with older individuals weighing less and younger individuals weighing more. This is not what we would normally expect, so perhaps our sampling methods have resulted in a biased sample. Perhaps we have a group of obese children and malnourished adults.
2 Manuscript Template for a Data Analysis Project
Below is a template for a manuscript if the above exercise were actually part of a true data analysis project.
This uses MS Word as output format. See here for more information. You can switch to other formats, like html or pdf. See the Quarto documentation for other formats.
3 Summary/Abstract
In a true data analysis project, this is where your summary/abstract would go. A good format is the IMRaD format: Introduction, Methods, Results, and Discussion. See George Mason’s website for more information.
The following sections could be part of your final paper.
4 Introduction
4.1 General Background Information
In this section, you would provide enough background on your topic that others can understand the why and how of your analysis.
4.2 Description of data and data source
Here you would describe what the data is, what it contains, where it is from, etc. Eventually this might be part of a methods section.
4.3 Questions/Hypotheses to be addressed
Here you would state in plain language the research questions you plan to answer with this analysis.
5 Methods
Here you would describe your methods. That should describe the data, the cleaning processes, and the analysis approaches. You might want to provide a shorter description here and all the details in the supplement.
5.1 Data aquisition
As applicable, explain where and how you got the data. If you directly import the data from an online source, you can combine this section with the next.
5.2 Data import and cleaning
Here you would write code that reads in the file and cleans it so it’s ready for analysis. Since this will be fairly long code for most datasets, it might be a good idea to have it in one or several R scripts. If that is the case, explain here briefly what kind of cleaning/processing you do, and provide more details and well documented code somewhere (e.g. as supplement in a paper). All materials, including files that contain code, should be commented well so everyone can follow along.
5.3 Statistical analysis
Here you would explain anything related to your statistical analyses.
6 Results
6.1 Exploratory/Descriptive analysis
Here you would use a combination of text/tables/figures to explore and describe your data. Show the most important descriptive results here. Additional ones should go in the supplement. Even more can be in the R and Quarto files that are part of your project.
Table 1 shows a summary of the data.
Note the loading of the data providing a relative path using the ../../
notation. (Two dots means a folder up). You never want to specify an absolute path like C:\ahandel\myproject\results\
because if you share this with someone, it won’t work for them since they don’t have that path. You can also use the here
R package to create paths. See examples of that below. I recommend the here
package, but I’m showing the other approach here just in case you encounter it.
skim_type | skim_variable | n_missing | complete_rate | factor.ordered | factor.n_unique | factor.top_counts | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
factor | Gender | 0 | 1 | FALSE | 3 | M: 4, F: 3, O: 2 | NA | NA | NA | NA | NA | NA | NA | NA |
factor | Party | 0 | 1 | FALSE | 4 | Dem: 5, Lib: 2, Gre: 1, Rep: 1 | NA | NA | NA | NA | NA | NA | NA | NA |
numeric | Height | 0 | 1 | NA | NA | NA | 165.66667 | 15.97655 | 133 | 156 | 166 | 178 | 183 | ▂▁▃▃▇ |
numeric | Weight | 0 | 1 | NA | NA | NA | 70.11111 | 21.24526 | 45 | 55 | 70 | 80 | 110 | ▇▂▃▂▂ |
numeric | Age | 0 | 1 | NA | NA | NA | 38.88889 | 23.22953 | 9 | 23 | 33 | 56 | 77 | ▅▇▂▂▅ |
6.2 Basic statistical analysis
To get some further insight into your data, if reasonable you could compute simple statistics (e.g. simple models with 1 predictor) to look for associations between your outcome(s) and each individual predictor variable. Though note that unless you pre-specified the outcome and main exposure, any “p<0.05 means statistical significance” interpretation is not valid.
Figure 3 shows a scatterplot figure produced by one of the R scripts.
6.3 Full analysis
Here yo uwould use one or several suitable statistical/machine learning methods to analyze your data and to produce meaningful figures, tables, etc. This might again be code that is best placed in one or several separate R scripts that need to be well documented. You want the code to produce figures and data ready for display as tables, and save those. Then you load them here.
Example Table 2 shows a summary of a linear model fit.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 149.2726967 | 23.3823360 | 6.3839942 | 0.0013962 |
Weight | 0.2623972 | 0.3512436 | 0.7470519 | 0.4886517 |
GenderM | -2.1244913 | 15.5488953 | -0.1366329 | 0.8966520 |
GenderO | -4.7644739 | 19.0114155 | -0.2506112 | 0.8120871 |
7 Discussion
7.1 Summary and Interpretation
Here you summarize what you did, what you found and what it means.
7.2 Strengths and Limitations
Here you discuss what you perceive as strengths and limitations of your analysis.
7.3 Conclusions
Here you state the main take-home messages.
Include citations in your Rmd file using bibtex, the list of references will automatically be placed at the end
This paper (Leek & Peng, 2015) discusses types of analyses.
These papers (McKay, Ebell, Billings, et al., 2020; McKay, Ebell, Dale, Shen, & Handel, 2020) are good examples of papers published using a fully reproducible setup similar to the one shown in this template.
Note that this cited reference will show up at the end of the document, the reference formatting is determined by the CSL file specified in the YAML header. Many more style files for almost any journal are available. You also specify the location of your bibtex reference file in the YAML. You can call your reference file anything you like, I just used the generic word references.bib
but giving it a more descriptive name is probably better.
8 References
To cite other work (important everywhere, but likely happens first in introduction), make sure your references are in the bibtex file specified in the YAML header above (here dataanalysis_template_references.bib
) and have the right bibtex key. Then you can include like this:
Examples of reproducible research projects can for instance be found in (McKay, Ebell, Billings, et al., 2020; McKay, Ebell, Dale, et al., 2020)