--- title: "Looking at data for TAM I" format: docx editor: visual csl: C:\\Users\\tam2\\Dropbox\\ctr\\mee.csl bibliography: C:\\Users\\tam2\\Dropbox\\ctr\\MainBibFile.bib --- # Introduction ```{r} library(readxl) ``` While preparing to deliver the lecture in Tópicos Avançados de Microbiologia I, we agreed it would be interesting to work over a dataset from Microbiologia to illustrate some of the concepts introduced in class. Prof. Lélia Chambel (LC) suggested I could use a dataset they collected, to explore the growth of four different species of fungi under different conditions. In this document I explore the dataset that was sent to me via email by LC on the Thu 9/14/2023 7:09 PM, to comply with the bonus slide I added to the material given to students: ![](Figs4Rmd/tamSlide.png) It contains, in Lélia's words, the following data: "Sobre os dados é uma ótima ideia para passar da teoria à prática... Não sei se serão adequados mas, numa outra UC deste semestre, os alunos vão realizar um trabalho com fungos e avaliação do seu crescimento, ao longo do tempo, em diferentes condições (meios de cultura e temperaturas). Em anexo envio os dados obtidos, no ano anterior, para um dos fungos (analisamos 4 fungos diferentes). De uma forma muito geral, cada fungo é colocado em 4 meios diferentes (meio sem NaCl e meios com 3 concentrações diferentes de NaCl) e 3 temperaturas (22, 28, 37ºC). Em cada caso, o crescimento é medido ao longo de vários dias. O objetivo é avaliar o efeito do NaCl no crescimento dos fungos assim como da temperatura; perceber se os 4 fungos respondem de igual forma ou diferente. Estes dados são analisados utilizando uma análise de ANOVA (o colega Artur Lourenço faz essa análise com os alunos). Se te parecer que os dados são adequados para"outras análises" posso enviar os dos outros 3 fungos e podemos também falar um pouco pois será mais fácil explicar." Here I reproduce an image with the fungi cultures: ![](Figs4Rmd/Alternaria sp..jpg) So in this document I explore the actual growth data, highlighting some of the concepts and ideas I refer also in class, in particular, the fundamental aspects about how to structure one's data to facilitate effective data analysis, as described by @Broman2017. ![](Figs4Rmd/BromanWoo.png) I strongly recommend anyone having to deal with data and Excel to read said paper. To begin with, this is a dynamic report in RMarkdown (https://bookdown.org/yihui/rmarkdown/), which means that if the data were to change, the analysis and reporting would remain the same, we would just have to recompile the .Rmd. If additional data would be obtained, it should be just a matter of making small modifications to the code to deal with that, and then recompile the document. Another advantage of it being a dynamic report in RMarkdown is that it can be output into different formats just by recompiling the document, including htlm, pdf or Word. If you want to obtain a template to do your own RMarkdown documents, you can find one here: https://github.com/TiagoAMarques/RMarkdownTemplate Having dynamic documents allows one to quickly adapt to changes while making the analysis fully transparent, and hence easily reproducible. In the following I will: - Read the data in - Manipulate the data - Explore the data - Implement some regression models to explore the effects of the different factors on the fungi growth - Conclude with some final thoughts # Reading the data First things first, lets read the data in. The data is in an Excel workshhet, but in a format that makes it hard to read and to analyse in an automated and integrated versatile way. Following the guidelines of @Broman2017, and with LC's permission, I will use this data to illustrate how one could format data to make it amenable to any analysis one might want to conduct later. In doing so I will prepare a document that might be distributed to the students in Tópicos Avançados de Microbiologia 1, to illustrate the points made by @Broman2017 regarding how to organize the data in spreadsheets. A first thing to do is to define a suitable data object to hold all the data, which should be, as @Broman2017 suggest, a rectangular data structure with rows as records and columns as variables. Note that we will need at the very least columns for: - size, the response variable, which I label `size`, and then, potential explanatory covariates - species, as there are 4 different species, which I label `Sp` (we only have data for 1 pecies, Alternaria sp., but making it explicit would allow to add other species later) - temperature, as we have 3 different temperatures (22, 28, 37ºC), the which I label `temp` - concentration, as we have a medium without NaCl and mediums with 3 different concentrations of NaCl, which I label `conc` - time, as measurements are made over time for the same sampling units, which I label `time` we will also had a column to hold the number of the replicate (there are about 4 per combination, see below), which I will label `rep`. Notice how in doing so I have considered two of the suggestions in @Broman2017, by: - using variable names that are simple but intuitive to others, to ensure that the workflow is as transparent as possible, and - creating a data dictionary, ensuring that others reading this document will know what everything is. ```{r} fungus<-data.frame(Sp=character(0),temp=numeric(0),conc=numeric(0),time=numeric(0),size=numeric(0),rep=character(0)) ``` Now, we can read all the data from the Excel file. This requires many lines of code, since the data is spread across multiple sub-tables and different sheets in the original document: ```{r,echo=TRUE} #| echo: false AltcPDAT22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "B5:C10",col_names = c("time","size")) AltcPDAT22$Sp <- "Alternaria" AltcPDAT22$temp <- 22 AltcPDAT22$conc <- 0 AltcPDAT22$rep <- "R1" fungus<-rbind(fungus,AltcPDAT22) #Tabela1 PDA AltcPDAT22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "B15:C20",col_names = c("time","size")) AltcPDAT22$Sp <- "Alternaria" AltcPDAT22$temp <- 22 AltcPDAT22$conc <- 0 AltcPDAT22$rep <- "R2" fungus<-rbind(fungus,AltcPDAT22) #Tabela2 PDA AltcPDAT22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "B25:C28",col_names = c("time","size")) AltcPDAT22$Sp <- "Alternaria" AltcPDAT22$temp <- 22 AltcPDAT22$conc <- 0 AltcPDAT22$rep <- "R3" fungus<-rbind(fungus,AltcPDAT22) #Tabela3 PDA AltcPDAT22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "B35:C38",col_names = c("time","size")) AltcPDAT22$Sp <- "Alternaria" AltcPDAT22$temp <- 22 AltcPDAT22$conc <- 0 AltcPDAT22$rep <- "R4" fungus<-rbind(fungus,AltcPDAT22) #Tabela4 PDA AltcPDAT28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "M5:N10",col_names = c("time","size")) AltcPDAT28$Sp <- "Alternaria" AltcPDAT28$temp <- 28 AltcPDAT28$conc <- 0 AltcPDAT28$rep <- "R5" fungus<-rbind(fungus,AltcPDAT28) #Tabela5 PDA AltcPDAT28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "M15:N20",col_names = c("time","size")) AltcPDAT28$Sp <- "Alternaria" AltcPDAT28$temp <- 28 AltcPDAT28$conc <- 0 AltcPDAT28$rep <- "R6" fungus<-rbind(fungus,AltcPDAT28) #Tabela6 PDA AltcPDAT28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "M25:N28",col_names = c("time","size")) AltcPDAT28$Sp <- "Alternaria" AltcPDAT28$temp <- 28 AltcPDAT28$conc <- 0 AltcPDAT28$rep <- "R7" fungus<-rbind(fungus,AltcPDAT28) #Tabela7 PDA AltcPDAT28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "M35:N38",col_names = c("time","size")) AltcPDAT28$Sp <- "Alternaria" AltcPDAT28$temp <- 28 AltcPDAT28$conc <- 0 AltcPDAT28$rep <- "R8" fungus<-rbind(fungus,AltcPDAT28) #Tabela8 PDA AltcPDAT37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "X5:Y10",col_names = c("time","size")) AltcPDAT37$Sp <- "Alternaria" AltcPDAT37$temp <- 37 AltcPDAT37$conc <- 0 AltcPDAT37$rep <- "R9" fungus<-rbind(fungus,AltcPDAT37) #Tabela9 PDA AltcPDAT37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "X15:Y20",col_names = c("time","size")) AltcPDAT37$Sp <- "Alternaria" AltcPDAT37$temp <- 37 AltcPDAT37$conc <- 0 AltcPDAT37$rep <- "R10" fungus<-rbind(fungus,AltcPDAT37) #Tabela10 PDA AltcPDAT37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "X25:Y28",col_names = c("time","size")) AltcPDAT37$Sp <- "Alternaria" AltcPDAT37$temp <- 37 AltcPDAT37$conc <- 0 AltcPDAT37$rep <- "R11" fungus<-rbind(fungus,AltcPDAT37) #Tabela11 PDA AltcPDAT37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " PDA", range = "X35:Y38",col_names = c("time","size")) AltcPDAT37$Sp <- "Alternaria" AltcPDAT37$temp <- 37 AltcPDAT37$conc <- 0 AltcPDAT37$rep <- "R12" fungus<-rbind(fungus,AltcPDAT37) #Tabela12 PDA Alt.c2.5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "B5:C10",col_names = c("time","size")) Alt.c2.5.T22$Sp <- "Alternaria" Alt.c2.5.T22$temp <- 22 Alt.c2.5.T22$conc <- 2.5 Alt.c2.5.T22$rep <- "R13" fungus<-rbind(fungus,Alt.c2.5.T22) #Tabela1 NaCl-2,5% Alt.c2.5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "B14:C19",col_names = c("time","size")) Alt.c2.5.T22$Sp <- "Alternaria" Alt.c2.5.T22$temp <- 22 Alt.c2.5.T22$conc <- 2.5 Alt.c2.5.T22$rep <- "R14" fungus<-rbind(fungus,Alt.c2.5.T22) #Tabela2 NaCl-2,5% Alt.c2.5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "B23:C26",col_names = c("time","size")) Alt.c2.5.T22$Sp <- "Alternaria" Alt.c2.5.T22$temp <- 22 Alt.c2.5.T22$conc <- 2.5 Alt.c2.5.T22$rep <- "R15" fungus<-rbind(fungus,Alt.c2.5.T22) #Tabela3 NaCl-2,5% Alt.c2.5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "B32:C35",col_names = c("time","size")) Alt.c2.5.T22$Sp <- "Alternaria" Alt.c2.5.T22$temp <- 22 Alt.c2.5.T22$conc <- 2.5 Alt.c2.5.T22$rep <- "R16" fungus<-rbind(fungus,Alt.c2.5.T22) #Tabela4 NaCl-2,5% Alt.c2.5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "M5:N10",col_names = c("time","size")) Alt.c2.5.T28$Sp <- "Alternaria" Alt.c2.5.T28$temp <- 28 Alt.c2.5.T28$conc <- 2.5 Alt.c2.5.T28$rep <- "R17" fungus<-rbind(fungus,Alt.c2.5.T28) #Tabela5 NaCl-2,5% Alt.c2.5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "M14:N19",col_names = c("time","size")) Alt.c2.5.T28$Sp <- "Alternaria" Alt.c2.5.T28$temp <- 28 Alt.c2.5.T28$conc <- 2.5 Alt.c2.5.T28$rep <- "R18" fungus<-rbind(fungus,Alt.c2.5.T28) #Tabela6 NaCl-2,5% Alt.c2.5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "M23:N26",col_names = c("time","size")) Alt.c2.5.T28$Sp <- "Alternaria" Alt.c2.5.T28$temp <- 28 Alt.c2.5.T28$conc <- 2.5 Alt.c2.5.T28$rep <- "R19" fungus<-rbind(fungus,Alt.c2.5.T28) #Tabela7 NaCl-2,5% Alt.c2.5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "M32:N35",col_names = c("time","size")) Alt.c2.5.T28$Sp <- "Alternaria" Alt.c2.5.T28$temp <- 28 Alt.c2.5.T28$conc <- 2.5 Alt.c2.5.T28$rep <- "R20" fungus<-rbind(fungus,Alt.c2.5.T28) #Tabela8 NaCl-2,5% Alt.c2.5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "X5:Y10",col_names = c("time","size")) Alt.c2.5.T37$Sp <- "Alternaria" Alt.c2.5.T37$temp <- 37 Alt.c2.5.T37$conc <- 2.5 Alt.c2.5.T37$rep <- "R21" fungus<-rbind(fungus,Alt.c2.5.T37) #Tabela9 NaCl-2,5% Alt.c2.5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "X14:Y19",col_names = c("time","size")) Alt.c2.5.T37$Sp <- "Alternaria" Alt.c2.5.T37$temp <- 37 Alt.c2.5.T37$conc <- 2.5 Alt.c2.5.T37$rep <- "R22" fungus<-rbind(fungus,Alt.c2.5.T37) #Tabela10 NaCl-2,5% Alt.c2.5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "X23:Y26",col_names = c("time","size")) Alt.c2.5.T37$Sp <- "Alternaria" Alt.c2.5.T37$temp <- 37 Alt.c2.5.T37$conc <- 2.5 Alt.c2.5.T37$rep <- "R23" fungus<-rbind(fungus,Alt.c2.5.T37) #Tabela11 NaCl-2,5% Alt.c2.5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 2,5% NaCl", range = "X32:Y35",col_names = c("time","size")) Alt.c2.5.T37$Sp <- "Alternaria" Alt.c2.5.T37$temp <- 37 Alt.c2.5.T37$conc <- 2.5 Alt.c2.5.T37$rep <- "R24" fungus<-rbind(fungus,Alt.c2.5.T37) #Tabela12 NaCl-2,5% Alt.c5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "B5:C10",col_names = c("time","size")) Alt.c5.T22$Sp <- "Alternaria" Alt.c5.T22$temp <- 22 Alt.c5.T22$conc <- 5 Alt.c5.T22$rep <- "R25" fungus<-rbind(fungus,Alt.c5.T22) #Tabela1 NaCl-5% Alt.c5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "B14:C19",col_names = c("time","size")) Alt.c5.T22$Sp <- "Alternaria" Alt.c5.T22$temp <- 22 Alt.c5.T22$conc <- 5 Alt.c5.T22$rep <- "R26" fungus<-rbind(fungus,Alt.c5.T22) #Tabela2 NaCl-5% Alt.c5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "B23:C26",col_names = c("time","size")) Alt.c5.T22$Sp <- "Alternaria" Alt.c5.T22$temp <- 22 Alt.c5.T22$conc <- 5 Alt.c5.T22$rep <- "R27" fungus<-rbind(fungus,Alt.c5.T22) #Tabela3 NaCl-5% Alt.c5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "B32:C35",col_names = c("time","size")) Alt.c5.T22$Sp <- "Alternaria" Alt.c5.T22$temp <- 22 Alt.c5.T22$conc <- 5 Alt.c5.T22$rep <- "R28" fungus<-rbind(fungus,Alt.c5.T22) #Tabela4 NaCl-5% Alt.c5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "M5:N10",col_names = c("time","size")) Alt.c5.T28$Sp <- "Alternaria" Alt.c5.T28$temp <- 28 Alt.c5.T28$conc <- 5 Alt.c5.T28$rep <- "R29" fungus<-rbind(fungus,Alt.c5.T28) #Tabela5 NaCl-5% Alt.c5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "M14:N19",col_names = c("time","size")) Alt.c5.T28$Sp <- "Alternaria" Alt.c5.T28$temp <- 28 Alt.c5.T28$conc <- 5 Alt.c5.T28$rep <- "R30" fungus<-rbind(fungus,Alt.c5.T28) #Tabela6 NaCl-5% Alt.c5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "M23:N26",col_names = c("time","size")) Alt.c5.T28$Sp <- "Alternaria" Alt.c5.T28$temp <- 28 Alt.c5.T28$conc <- 5 Alt.c5.T28$rep <- "R31" fungus<-rbind(fungus,Alt.c5.T28) #Tabela7 NaCl-5% Alt.c5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "M32:N35",col_names = c("time","size")) Alt.c5.T28$Sp <- "Alternaria" Alt.c5.T28$temp <- 28 Alt.c5.T28$conc <- 5 Alt.c5.T28$rep <- "R32" fungus<-rbind(fungus,Alt.c5.T28) #Tabela8 NaCl-5% Alt.c5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "X5:Y10",col_names = c("time","size")) Alt.c5.T37$Sp <- "Alternaria" Alt.c5.T37$temp <- 37 Alt.c5.T37$conc <- 5 Alt.c5.T37$rep <- "R33" fungus<-rbind(fungus,Alt.c5.T37) #Tabela9 NaCl-5% Alt.c5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "X14:Y19",col_names = c("time","size")) Alt.c5.T37$Sp <- "Alternaria" Alt.c5.T37$temp <- 37 Alt.c5.T37$conc <- 5 Alt.c5.T37$rep <- "R34" fungus<-rbind(fungus,Alt.c5.T37) #Tabela10 NaCl-5% Alt.c5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "X23:Y26",col_names = c("time","size")) Alt.c5.T37$Sp <- "Alternaria" Alt.c5.T37$temp <- 37 Alt.c5.T37$conc <- 5 Alt.c5.T37$rep <- "R35" fungus<-rbind(fungus,Alt.c5.T37) #Tabela11 NaCl-5% Alt.c5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 5% NaCl", range = "X32:Y35",col_names = c("time","size")) Alt.c5.T37$Sp <- "Alternaria" Alt.c5.T37$temp <- 37 Alt.c5.T37$conc <- 5 Alt.c5.T37$rep <- "R36" fungus<-rbind(fungus,Alt.c5.T37) #Tabela12 NaCl-5% Alt.c7.5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "B5:C10",col_names = c("time","size")) Alt.c7.5.T22$Sp <- "Alternaria" Alt.c7.5.T22$temp <- 22 Alt.c7.5.T22$conc <- 7.5 Alt.c7.5.T22$rep <- "R37" fungus<-rbind(fungus,Alt.c7.5.T22) #Tabela1 NaCl-7,5% Alt.c7.5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "B14:C19",col_names = c("time","size")) Alt.c7.5.T22$Sp <- "Alternaria" Alt.c7.5.T22$temp <- 22 Alt.c7.5.T22$conc <- 7.5 Alt.c7.5.T22$rep <- "R38" fungus<-rbind(fungus,Alt.c7.5.T22) #Tabela2 NaCl-7,5% Alt.c7.5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "B23:C26",col_names = c("time","size")) Alt.c7.5.T22$Sp <- "Alternaria" Alt.c7.5.T22$temp <- 22 Alt.c7.5.T22$conc <- 7.5 Alt.c7.5.T22$rep <- "R39" fungus<-rbind(fungus,Alt.c7.5.T22) #Tabela3 NaCl-7,5% Alt.c7.5.T22<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "B32:C35",col_names = c("time","size")) Alt.c7.5.T22$Sp <- "Alternaria" Alt.c7.5.T22$temp <- 22 Alt.c7.5.T22$conc <- 7.5 Alt.c7.5.T22$rep <- "R40" fungus<-rbind(fungus,Alt.c7.5.T22) #Tabela4 NaCl-7,5% Alt.c7.5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "M5:N10",col_names = c("time","size")) Alt.c7.5.T28$Sp <- "Alternaria" Alt.c7.5.T28$temp <- 28 Alt.c7.5.T28$conc <- 7.5 Alt.c7.5.T28$rep <- "R41" fungus<-rbind(fungus,Alt.c7.5.T28) #Tabela5 NaCl-7,5% Alt.c7.5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "M14:N19",col_names = c("time","size")) Alt.c7.5.T28$Sp <- "Alternaria" Alt.c7.5.T28$temp <- 28 Alt.c7.5.T28$conc <- 7.5 Alt.c7.5.T28$rep <- "R42" fungus<-rbind(fungus,Alt.c7.5.T28) #Tabela6 NaCl-7,5% Alt.c7.5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "M23:N26",col_names = c("time","size")) Alt.c7.5.T28$Sp <- "Alternaria" Alt.c7.5.T28$temp <- 28 Alt.c7.5.T28$conc <- 7.5 Alt.c7.5.T28$rep <- "R43" fungus<-rbind(fungus,Alt.c7.5.T28) #Tabela7 NaCl-7,5% Alt.c7.5.T28<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "M32:N35",col_names = c("time","size")) Alt.c7.5.T28$Sp <- "Alternaria" Alt.c7.5.T28$temp <- 28 Alt.c7.5.T28$conc <- 7.5 Alt.c7.5.T28$rep <- "R44" fungus<-rbind(fungus,Alt.c7.5.T28) #Tabela8 NaCl-7,5% Alt.c7.5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "X5:Y10",col_names = c("time","size")) Alt.c7.5.T37$Sp <- "Alternaria" Alt.c7.5.T37$temp <- 37 Alt.c7.5.T37$conc <- 7.5 Alt.c7.5.T37$rep <- "R45" fungus<-rbind(fungus,Alt.c7.5.T37) #Tabela9 NaCl-7,5% Alt.c7.5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "X14:Y19",col_names = c("time","size")) Alt.c7.5.T37$Sp <- "Alternaria" Alt.c7.5.T37$temp <- 37 Alt.c7.5.T37$conc <- 7.5 Alt.c7.5.T37$rep <- "R46" fungus<-rbind(fungus,Alt.c7.5.T37) #Tabela10 NaCl-7,5% Alt.c7.5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "X23:Y26",col_names = c("time","size")) Alt.c7.5.T37$Sp <- "Alternaria" Alt.c7.5.T37$temp <- 37 Alt.c7.5.T37$conc <- 7.5 Alt.c7.5.T37$rep <- "R47" fungus<-rbind(fungus,Alt.c7.5.T37) #Tabela11 NaCl-7,5% Alt.c7.5.T37<- read_excel("Dados_Alternaria sp..xlsx", sheet = " 7,5% NaCl", range = "X32:Y35",col_names = c("time","size")) Alt.c7.5.T37$Sp <- "Alternaria" Alt.c7.5.T37$temp <- 37 Alt.c7.5.T37$conc <- 7.5 Alt.c7.5.T37$rep <- "R48" fungus<-rbind(fungus,Alt.c7.5.T37) #Tabela12 NaCl-7,5% ``` You might think that all these lines of code correspond to lots of work. This was necessary given the original format. but if you look at the code, it is mostly copy-paste of the same structure for each table, pointing at where to find the data for each replicate within each sheet (corresponding to different NaCl concentrations), where results for the multiple temperatures are separated in sub-tables. If you do NOT want to see all the above lines in your report, just change the `chunk` option `echo` in the .Rmd file from `echo=TRUE` to `echo=FALSE`. Now that now we have the data in the format of @Broman2017, we can do anything we might want with it with a couple lines of code. We also export the data as a flat text file (a .txt) for safety, as @Broman2017 suggest. And as they also suggest, do not forget to keep backup copies of your precious original data. The last thing you want is to have to redo all the analysis because a laptop is lot or some file gets corrupted. If something goes wrong, having a safety copy of the full dataset in fundamental. ```{r} write.table(fungus,file="fungus.txt",quote=FALSE) ``` # Exploring the data We start by exploring the data, making sure we have what we wanted to have. We can take a peak at the first few lines of data ```{r} head(fungus,n=10) ``` as well as the last few lines of data ```{r} tail(fungus,n=10) ``` if something wrong happened in the import, that might help diagnose it. We can also see the structure of the data ```{r} str(fungus) ``` and look at an omnibus summary of the data ```{r} summary(fungus) ``` Something weird is happening with the time variable, as it seems like there is 1 value is not a number, which created issues with the import. We can track the issue to cell X35 in worksheet "PDA", where a "i" was recorded instead of a, presumably, 5. That violates a few of the rules from @Broman2017, namely that one should be consistent, and using only one type of data for any given variable. It was just a slip up, but it could be enough to break a bioinformatics pipeline. Respecting the rule that says an analyst should not change the original data, he should always ask the data owner to change the data instead and work from a clean master data file, but here I will change the data in R and then continue the analysis. I assume that instance of "i" was supposed to be 5. Note that I will use implicitly conditional code, such that if later the data gets corrected, my code still works the same. ```{r} fungus$time[fungus$time=="i"]<-5 fungus$time<-as.numeric(as.character(fungus$time)) ``` it all looks as it should, therefore we proceed. As noted, with this structure we can do anything we'd like with it. As examples, we could see the sizes as a function of temperatures, concentrations, or times, pooled across all the other factors. First,,size as a function of temperatures, ```{r} with(fungus,boxplot(size~temp)) ``` which seems to show that sizes were larger at intermediate temperatures. Then we focus on concentration of NACl: ```{r} with(fungus,boxplot(size~conc)) ``` which seems to show that the larger the NaCl concentration, the smaller the sizes, and finally we look at the times ```{r} with(fungus,boxplot(size~time)) ``` We can see a clear effect of increasing size with time, as would be expected. Of course, interpreting those plots by themselves is not sensible, because once conditioning on a single factor, we ignore the variability and the sources of dependency across the other factors. As an example, if looking at the results from the different temperatures, we naively ignore that observations occur within replicates over time. We can take a look at the number of observations per factor level ```{r} with(fungus,table(temp,conc)) ``` and we can see everything is almost perfectly balanced, with 20 observations for each of 3 by 4 = 12 factorial combinations (just one with 21 observations). However, as noted, we must be careful, as these do not correspond to independent data points. The data correspond to in general 4 replicates measured at 5 time points, and an adequate analysis must incorporate that correlation structure. For illustration, We can take a look at one of the 12 factorial combinations, say for temperature = 22 and NaCl concentration = 0. ```{r} with((fungus[fungus$temp==22 & fungus$conc==0,]),plot(time,size,pch=substr(rep,2,2))) ``` It would be great to make a single plot with all the data, with lines joining the points corresponding to the different replicates, and say line widths representing levels of temperature, and line type representing the amount of NaCl. This is easily done using `ggplot2` capabilities, which allow us to look at the data from multiple angles. As an example, all the data in a single plot ```{r} library(ggplot2) ggplot(data=fungus,aes(x=time,y=size,color=as.factor(temp),linewidth=conc/(7.5/2),group=rep))+ geom_line()+scale_linewidth(range = c(0, 2),breaks=unique(fungus$conc/(7.5/2))) ``` or if we prefer to separate for easier viewing, separate by level of concentration, and using color to highlight temperature ```{r} ggplot(data=fungus,aes(x=time,y=size,color=as.factor(temp),group=rep))+ geom_line()+facet_wrap(~conc) ``` or conversely, separate by temperature level, and using colour to highlight concentration ```{r} ggplot(data=fungus,aes(x=time,y=size,color=as.factor(conc),group=rep))+ geom_line()+facet_wrap(~temp) ``` Having now explored what the data looks like, we move on to try to model it. # Modelling the data The endgame of many analysis is a model, which will be able to provide a means to make inferences based on the data, i.e., make genaral statements about the population from where the samples were collected. To model the data data at hand will probably require a model with random effects, where the replicate itself is a random effect. Since the data are sizes, hence strictly positive, a Gamma response variable might be a sensible option. Nonetheless, we will start also by exploring Gaussian responses, as if we can keep it Normal, things are usually far easier. As a first try, we consider a simple GLM, where we use a long link function to ensure positive predictions from the model. Remember, with a standard linear model predictions are not constraied in any way, hence, for a response variable that is strictly possible, one could obtain inadmissible estimates, which pose obvious problems. ```{r} glmGam<-glm(size~time+conc+as.factor(temp),family=Gamma(link="log"),data=fungus) glmGaul<-glm(size~time+conc+as.factor(temp),family=gaussian(link="log"),data=fungus) glmGau<-glm(size~time+conc+as.factor(temp),family=gaussian,data=fungus) summary(glmGam) summary(glmGaul) summary(glmGau) AIC(glmGau,glmGaul,glmGam) ``` As noted above, such analysis would ignore the dependencies in the observations over time within replicates, treating all the data points as independent. Interestingly, the Gaussian model without a log link seems to be preferred, which could mean a Linear Model might be enough for this data. I borrow here on an image presented in class, from Popovic et al. (in press). Four principles for improved statistical ecology. Methods in Ecology and Evolution. ![](Figs4Rmd/popovicetal.png) highlighting that one of the key aspects of effective statistical data analysis is to consider a model that respects the structure and dependencies of the data. Therefore, we attempt a GLMM with replicate as a random effect, to account for the fact that there is (temporal) autocorrelation in the data ```{r} library(MASS) glmm1<-glmmPQL(fixed=size~time+conc+as.factor(temp),family=gaussian,data=fungus,random=~1|rep) summary(glmm1) ``` All the models lead to similar conclusions. As would be expected, the size increases over time. As one increases the concentration of NaCl the growth seems to start from a lower level, while for intermediate temperatures the growths seems to start from a higher level. We could now investigate further subtleties, as potential differences in slopes as a function of interaction between factors or in relation with the random effects, but that does not seem to be an obvious necessity, and further, is beyhond what was the objective of this document, to illustrate the advantages of having data in the format of @Broman2017. Additionally, while we only looked at data from one species, we were told there was data from 3 additional species. One could: 1. repeat this procedure for the other 3 species 2. explore a simple integrated model for all species, eventually considering interactions between species and the factors # Conclusion This document describes in a single integrated workflow, hence easily reproduced by others, a full analysis pipeline, considering: 1. Data import 2. Data manipulation 3. Data exploration 4. Data analysis 5. Data interpretation By doing so in a single document that describes all the steps and includes, as needed, note, some or all the code to repeat them, we have enabled transparency and reproducibility. We did so while illustrating several of the principles This is the kind of thing that I would like to see students being able to do with their own data. While learning how to work with dynamic documents involves many concepts and tools and might seem to be behind an insurmountable steep learning curve at first, once you know your ways around R, that becomes a relatively easy task, with tangible and intangible dividends on the initial investment to be obtained over the course of a lifetime dealing with data. If you are looking for an easy way to start learning R by yourself, there are many excelent and free options online, but I leave here a link to my own startup kit: https://github.com/TiagoAMarques/AnIntro2RTutorial Enjoy the learning!