--- title: "Aula TP 8 -- Resolução FT 6" author: "João Malato e Tiago A. Marques" date: "Novembro de 2019" output: prettydoc::html_pretty: theme: cayman highlight: github toc: true toc_depth: 1 --- # Disclaimer Este documento fornece algumas propostas de resolução aos exercícios da folha de trabalho `TP6.pdf`, resolvida durante a aula prática número 8, na semana de 4 a 8 de Outubro de 2019. É de notar que pode haver (quase) infinitas maneiras de solucionar cada um dos exercícios. Se esta resolução não for *exactamente igual* à que foi apresentada na aula, ou à que cada um de vós conseguiu criar, não tem mal. O importante aqui é que sejam capazes de compreender o que fizeram e entender o que é apresentado neste documento para que possam mais à frente reproduzi-lo de forma independente. Neste documento assumimos que os ficheiros de dados se encontram na mesma directoria do .Rmd. Caso não seja o caso, será necessário primeiro identificar essa directoria. Além do HTML em si, iremos disponibilizar o .Rmd para que possam também ver como se faz este relatório e quem sabe colher alguma inspiração para implementar nas aulas. Nota: caso queiram usar este template precisam instalar o package `prettydoc` para isso acontecer, Podem dar uma vista de olhos aos diversos temas e como instalar [aqui](https://prettydoc.statr.me/index.html). Feito este preliminar, passemos à resolução.
*** # Exercício 1. **Avalie os pressupostos e efectue uma análise de variância aos dados da densidade de uma espécie de árvore em várias serras de Portugal (dados `DataTP6serras.csv`). O que pode concluir quanto à comparação da densidade desta espécie nas várias serras?** ### Leitura dos dados e análise exploratória Como sempre podemos começar o exercício fazendo a leitura dos dados e correndo um breve análise para ficar a saber um pouco mais sobre eles. ```{r} # ler os dados separados por ";" data.serras <- read.csv2("DataTP6serras.csv") # ver os dados head(data.serras) # estrutura dos dados str(data.serras) ``` Temos 80 observações com 2 variáveis (colunas). Uma variável categórica (`factor`), `Serra`, para o número de serras, com quatro categorias possíveis, e outra variável numérica, `Dens.arvores`, para a densidade de árvores na respectiva serra. ```{r} # sumário dos dados summary(data.serras) ``` Conseguimos também ver que para cada umas das quatro serras foram feitas 20 observações com as medições das densidades a variarem, de uma maneira geral, entre `r min(data.serras$Dens.arvores)` e `r max(data.serras$Dens.arvores)`, com uma mediana igual a `r round(median(data.serras$Dens.arvores), 2)` e uma média ligeiramente acima, nos `r round(mean(data.serras$Dens.arvores), 2)` Visualmente podemos ver as variações na densidade das árvores através um diagrama de caixa (boxplot). ```{r, fig.align='center', fig.width=7, fig.height=6} # para ajudar aos olhos podemos definir um conjunto de cores que caracterizem cada serra cores.serras <- c("#1f3f2b", "#52843c", "#e9ed60", "#be945b") # criar o boxplot boxplot(Dens.arvores ~ Serra, data = data.serras, ylim = c(10, 50), lty = 1, col = cores.serras, ylab="Densidade da espécie de árvore") ``` Através esta análise podemos inferir que na Lousã há claramente uma maior densidade da espécie de árvores de interesse relativamente às outras serras analisadas. Podemos também dizer que a Serra do Marão é onde esta espécie se encontra em menor densidade. Quanto à Serra da Estrela e À Serra da Malcata pode-se dizer que embora a densidade na Serra da Estrela seja ligeiramente inferior, as densidades quanto à espécie parecem ser as mais próximas. Podemos também analisar as diferentes densidades através de histogramas ```{r, fig.align='center', fig.width=7, fig.height=6} par(mfrow=c(2,2)) # ajustar a estrutura para podermos ter 4 gráficos (2 linhas e 2 colunas) hist(data.serras$Dens.arvores[data.serras$Serra == "Estrela"], main = "Estrela", xlab = "Densidade", col=cores.serras[1]) hist(data.serras$Dens.arvores[data.serras$Serra == "Lousa"], main = "Lousã", xlab = "Densidade", col=cores.serras[2]) hist(data.serras$Dens.arvores[data.serras$Serra == "Malcata"], main = "Malcata", xlab = "Densidade", col=cores.serras[3]) hist(data.serras$Dens.arvores[data.serras$Serra == "Marao"], main = "Marão", xlab = "Densidade", col=cores.serras[4]) ``` > **Pro tip:** Quando se cria histogramas é preciso ter atenção ao que se pretende ver. É necessário ter cuidado com os valores dos eixos! Se ajustarmos os eixos para as frequências e para as densidades torna-se mais fácil a análise e comparação entre serras. ```{r, fig.align='center', fig.width=7, fig.height=6} par(mfrow=c(2,2)) hist(data.serras$Dens.arvores[data.serras$Serra == "Estrela"], main = "Estrela", xlab = "Densidade", col=cores.serras[1], xlim = c(10,50), ylim = c(0,6)) hist(data.serras$Dens.arvores[data.serras$Serra == "Lousa"], main = "Lousã", xlab = "Densidade", col=cores.serras[2], xlim = c(10,50), ylim = c(0,6)) hist(data.serras$Dens.arvores[data.serras$Serra == "Malcata"], main = "Malcata", xlab = "Densidade", col=cores.serras[3], xlim = c(10,50), ylim = c(0,6)) hist(data.serras$Dens.arvores[data.serras$Serra == "Marao"], main = "Marão", xlab = "Densidade", col=cores.serras[4], xlim = c(10,50), ylim = c(0,6)) # Opção mais avançada para ter tudo de uma só vez: # 1. criar um vector com os nomes das serras # nomes.serras <- levels(data.serras$Serra) # 2. criar histogramas para as quatro usando um ciclo for # par(mfrow=c(2,2)) # for(i in 1:4) { # hist(data.serras$Dens.arvores[data.serras$Serra == nomes.serras[i]], # xlab = "Densidade", main = nomes.serras[i], col=cores.serras[i], # xlim = c(10,50), ylim = c(0,6)) # } ``` ### Análise de variância De uma maneira geral é costume dizer-se que para se fazer uma análise de variância partimos do princípio que os dados têm uma distribuição Gaussiana (Normal). De um ponto de vista formal esta ideia, que é generalisada, é fundamentalmente errada. Na verdade assume-se esta normalidade não para os dados mas sim para os **resíduos**. Ou seja, vamos testar se os resíduos do modelo ajustado são Gaussianos. Este modelo pode ser um modelo linear com a densidade (`Dens.arvores`) como variável resposta, explicada pelo tipo de serra (`Serra`). ```{r} # modelo linear para a densidade explicada pelas serras mod.serras <- lm(Dens.arvores ~ Serra, data = data.serras) summary(mod.serras) # vamos testar a normalidade dos resíduos deste modelo! res.serras <- residuals(mod.serras) res.serras ``` Podemos assim testar normalidade através do teste de Shapiro-Wilk (função `shapiro.test()`) aplicado aos resíduos (`res.serras`), testanto assim para uma hipótese nula do tipo $$H_0:\ \text{A amostra provém de uma população com distribuição Gaussiana.}$$ ```{r} shapiro.test(res.serras) ``` *(Para um nível de significância de 5% rejeitamos a hipótese nula. Mas aqui, par podermos continuar, vamos assumir que tinhamos decidido usar um nível de significância de 1%).* A um nível de significância de 1% não rejeitamos a hipótese $H_0$, podendo então partir para estudar a heterocedasticidade do modelo (função `bartlett.test()`). O teste de Bartlett, ou teste de homogeneidade de variâncias, testa a hipótese $$H_0:\ \text{Existe heterocedasticidade nos resíduos do modelo (i.e., igualdade das variâncias).}$$
> **Defenition:** Para saberem mais sobre heterocedasticidade cliquem [aqui](https://en.wikipedia.org/wiki/Heteroscedasticity). ```{r} # Teste de Bartlett bartlett.test(Dens.arvores ~ Serra, data = data.serras) ``` Com um p-value de 0.8865 não rejeitamos a hipótese nula para igualdade das variâncias. Podemos assim proceder para uma análise paramétrica, i.e., uma análise de variância simples (one way ANOVA). ### Análise de Variância Simples Esta análise faz-se usando a função `aov()` e vamos testar a igualdade entre as diferentes médias das densidades em cada serra, ou seja se forem iguais não existe uma relação entre as serras e a densidade das árvores da espécie (há sempre a mesma densidade independentemente da serra observada). A nossa hipótese a testar aqui vai ser $$H_0:\ \mu_{Estrela} = \mu_{Lousã} = \mu_{Malcata} = \mu_{Marão} \ (\text{São todas iguais.})\\ vs. \\ H_1:\ \text{Pelo menos uma é diferente.}$$ ```{r} # analysis of variance anova.serras <- aov(Dens.arvores ~ Serra, data = data.serras) # sumário dos dados summary(anova.serras) ``` O p-value resultante é muito inferior ao nosso nível de significância escolhido, pelo que podemos dizer que temos evidência estatística suficiente para rejeitar a hipótese nula, assumindo portanto uma relação entre a densidade das árvores específicas e a serra em que elas se encontram. Com este teste ficamos a saber que **as médias das densidades nas quatro serras não são todas iguais**, mas tendo quatro serras registadas não conseguimos saber especificamente que serras são diferentes umas das outras. Para isso temos de estudar cada par e inferir se são ou não significativamente diferentes. Isto faz-se aplicando o teste de Tukey (função `TukeyHSD()`) que nos devolve os p-values associados aos pares de serras. ```{r} tuk.serras <- TukeyHSD(anova.serras) tuk.serras ``` Podemos olhar para as colunas `diff` e `p adj` e ver quais as serras que não rejeitam a hipótese de igualdade. Para um nível de significância de 5% apenas o par **Malcata-Estrela** não é significativamente diferente (p-value > 0.05). Todos os outros pares são (p-value << 0.05). Podemos representar estes resultados graficamente. ```{r, fig.align='center', fig.width=7, fig.height=6} par(mar=c(4, 7, 2, 1)) # ajustar as margens para se poder ver os eixos plot(tuk.serras, las=1) ``` > **Note:** Perhaps surprisingly, or not (!), all densities are potentially different from each other at a 10% significance level. Strictly, at a 5% significance level, Estrela and Malcata could not be considered statistically significantly different. All others would be considered significantly different even at a 1% significance.
*** # Exercício 2. **Sabendo que os pressupostos de normalidade e homocedasticidade não são verificados no conjunto de dados do exercício anterior, efectue um teste de Kruskal-Wallis para comparar as densidades da espécie nas várias serras.** So, despite having assumed a 1% significance level before, if we had used a 5% or 10% significance, we would reject the Gaussianity assumption. Here we implement the corresponding non-paramteric equivalent to the ANOVA, the KW test. ```{r} kruskal.test(Dens.arvores ~ Serra, data = data.serras) ``` The outcome is the same as for the ANOVA, we would reject the H0 that all samples came from populations with the same mean. We therefore conduct the non-parametric equivalent to the Tukey test, i.e. the Dunn test ```{r} library(dunn.test) dunn.test(data.serras$Dens.arvores, data.serras$Serra) ``` The conclusions are very similar to those of the ANOVA. If anything, the Kruskal-Wallis is less powerful than the ANOVA, but here the signal in the data was very strong, so the conclusions do not change. Actually, note that the reported probabilities above are not P-values, and to compare these with the required significance level we need to compare these with $\alpha/2$. See in exercício 3 below further details about this topic.
*** # Exercício 3. **Efectuou uma experiência para avaliar o efeito de diferentes dietas no crescimento de um peixe (dados `DataTP6dietas.csv`). Efectue o teste que lhe pareça adequado para avaliar esse efeito. Qual a dieta que escolheria para maximizar o crescimento do peixe?** As usual, read in the data first and check it read OK. ```{r} data.dietas <- read.csv2("DataTP6dietas.csv") ``` ```{r} head(data.dietas) str(data.dietas) summary(data.dietas) ``` We see we have `r nrow(data.dietas)` records for each of `r ncol(data.dietas)` dietas. We can look at the data, and it immediately became apparent that one of the diets leads to much higher growth. ```{r, fig.align='center', fig.width=7, fig.height=6} cores.dietas <- c("#ddbaba", "#ddbad6", "#babcdd", "#bad9dd") boxplot(crescimento ~ dieta, data = data.dietas, col = cores.dietas, xlab = "Dieta", ylab = "Crescimento") ``` ```{r, fig.align='center', fig.width=7, fig.height=6} medias <- tapply(data.dietas$crescimento, data.dietas$dieta, mean) barplot(medias, col = cores.dietas) ``` It also becomes evident that the assumptions of the parametric test that we could use in this context (i.e. comparing more than 2 means, the ANOVA) might not hold. We test formally the Gaussian assumption ```{r} tapply(data.dietas$crescimento, data.dietas$dieta, shapiro.test) ``` There is no strong evidence to reject the H0 that the data come from Gaussian populations (despite the fact that strictly, the Dieta D H0 would be rejected, at the 5 or 10% significance level, but not at the 1% significance level). In any case, the ANOVA is known to be robust to the failure of the Gaussian assumption, and if that was the only thing going on, we would probably proceed with an ANOVA. However, when we test formally the homocedasticity ```{r} bartlett.test(crescimento ~ dieta, data = data.dietas) ``` we reject H0 that all the variances are equal at any of the usual significance levels. Therefore, we have to proceed to the non-parametric alternative to the ANOVA, i.e., the Kruskal-Wallis test ```{r} kruskal.test(crescimento ~ dieta, data = data.dietas) ``` Given the P-value, we reject the H0 that all means (strictly, being a non-parametric test, that all medians!) are the same. We then follow up with a multiple comparison a posteriori test, the Dunn test ```{r} dunn.test(data.dietas$crescimento, data.dietas$dieta) ``` The conclusion is that we can reject the equality of the means for all diets even at the 5% significance level except for the A-C and B-D diets, for which there's no evidence to suggest a difference. Usually, the reported probability by a significance test is a P-value, so you compare it directly with your significance level $\alpha$ and reject the null if P-value<$\alpha$. Note that the `dunn.test` output is at odds with that, and with all that we have seen before. As the last two lines of the output above make explicit, the p reported is not a P-value (and to be fair, it makes no sense to report it like this). Therefore, you have to divide your significance level by 2 (because this is a bilateral test) and that is the value you compare with the reported probabilities. This is why the `dunn.test` is easier to use with the explicit use of the argument `alpha`, so that the "*" can be correctly interpreted as a significant difference for the $\alpha$ we want. In the analysis below we set $\alpha$=0.1, and the B-D difference becomes, at the 10% significance level, statistically significant. ```{r} # alterando o nivel de significância para 10% dunn.test(data.dietas$crescimento, data.dietas$dieta, alpha = 0.1) ``` Therefore, the best diet to use, assuming we want heavier animals, is diet D. Note this was actually obvious from the first look at the data, so the statistical analysis just confirmed what the intuition was telling us.
*** # Exercício 4. **Realizou uma experiência idêntica à do exercício anterior, mas introduziu um outro factor –- a temperatura (`DataTP6dietastemperatura.csv`). Efectue uma análise de variância para avaliar o efeito dos factores seleccionados no crescimento do peixe, admitindo que os pressupostos são satisfeitos.** By now you know the drill... read the data and see it is OK ```{r} data.dt <- read.csv2("DataTP6dietastemperatura.csv") ``` ```{r} head(data.dt) str(data.dt) summary(data.dt) ``` A variável temperatura é aqui vista como um vector com valores numéricos quando a queremos como uma variável categórica. Podemos assim criar a variável `ftemp` que representa a temperatura como um factor com duas categorias relativas aos dois níveis de temperatura possíveis. ```{r} data.dt$ftemp <- factor(data.dt$temperatura) ``` ```{r} par(mar=c(8,4,0.5,0.5)) boxplot(crescimento~dieta+ftemp,data=data.dt,las=2,xlab="") mtext("Tratamento",1,7) ``` This boxplot sequence is non-ideal to compare across diets and across temperatures. Here is an alternative way to do the boxplots. This uses ggplot2, a different paradigm for plots (related to the tydiverse paradigm of data analysis). This is outside what we do in Ecologia Numérica, so please don't ask me about this code, but these look really cool... don't they? ```{r} library(ggplot2) # Change box plot colors by groups ggplot(data.dt, aes(x=dieta,y=crescimento,fill=temperatura))+geom_boxplot() ``` We can see e.g. that for all diets the higher temperature leads to higher growth, except for diet C. This would point to an interaction (i.e., different effects of one factor depending on the levels of the other) between diet and temperature. Now, run the ANOVA analysis (since we are told the assumptions hold, but we could always test these!) ```{r} anovadietast<-aov(crescimento~dieta*ftemp,data=data.dt) summary(anovadietast) ``` We can see that both factors, as well as the interaction effect (`dieta:temperatura`) are significant at the usual significance levels. Therefore, we can use the Tukey test to see what exactly is different. The output is rather lengthy ```{r} TukeyHSD(anovadietast,data=dietat) par(mfrow=c(1,1),mar=c(4,15,2,1)) plot(TukeyHSD(anovadietast),las=2) ``` Just as examples, we would not reject the H0 that `dieta C:20-dieta A:10` or `dieta A:20-dieta D:10` treatments would be different, but we would reject at the usual significance levels the equality of the means of either `dieta D:20-dieta A:20` or `dieta D:20-dieta C:20` treatments. As noted above, the result of the different diets seems to be dependent on the temperature. The best diet would be diet D, at temperature 20. However, if one had to use temperature 10, there would be not enough evidence to suggest that D was better than B (P-value 0.9812870) or C (P-value 0.8256648). Hence, assuming that the cost of having the temperature at 20 was too high, and that the diet B was much cheaper than D, we might opt for diet B. This is the kind of inferential insight that would show me that you know what you are doing!
*** # Exercício 5. **Pretende avaliar o efeito da temperatura, oxigénio dissolvido e pH da água na mortalidade de um copépode. Sintetize as condições experimentais indicadas nos dados `DataTP6mortalidade.csv`.** **Admitindo que os pressupostos de normalidade e homocedasticidade são satisfeitos, efectue uma análise de variância para avaliar as várias hipóteses em apreciação. Retire as principais conclusões. Se pretendesse apenas avaliar os efeitos dos factores, e não as suas interacções, como poderia proceder?** Mamma mia... here I go again... read the data in and check it's OK ```{r} mort <- read.csv("DataTP6mortalidade.csv", sep=";") str(mort) ``` We now have the mortality of the animals as a function of 3 factors, temperature, oxygen and pH. Note that these are all numeric, but strictly we need them as factors (a numerical variable that only takes 2 or 3 values can't really be used as anything but as a factor, since there are not enough values to evaluate how the variable impacts on the response in a continuous scale) ```{r} mort$temperatura=as.factor(mort$temperatura) mort$oxigenio=as.factor(mort$oxigenio) mort$ph=as.factor(mort$ph) ``` And now we can see how many levels are in each level ```{r} summary(mort) ``` We have 2 levels for each factor variable, which we can consider the high and low level, say (independently of the values). ```{r} boxplot(mortalidade~as.factor(temperatura)*as.factor(oxigenio)*as.factor(ph),data=mort) ``` We can now fit the model with all interactions (i.e. including a 3 way interaction) ```{r} anovamort<-aov(mortalidade~as.factor(temperatura)*as.factor(oxigenio)*as.factor(ph),data=mort) summary(anovamort) ``` Apparently, only the pH main effect is significant. However, even the 3 way interaction is significant, which means that it is very hard to generalize interpretations, since the effect of the pH will be different depending on the level of the temperature and oxygen. But these two statements seem contradictory... While if one looks at the effect of temperature or oxygen alone there is no apparent effect, that effect shows up when pH is considered. Each combination of the different levels seems important and potentially different. We could test for each pairwise treatment combination if there are differences given the Tukey test ```{r} TukeyHSD(anovamort) ``` If one were to ignore the interactions, the model would be fit like this ```{r} anovamort2<-aov(mortalidade~as.factor(temperatura)+as.factor(oxigenio)+as.factor(ph),data=mort) summary(anovamort2) ``` and the post hoc tests would be ```{r} TukeyHSD(anovamort2) ``` Note that in this case we would have no way to assess that temperature and oxygen can actually have an important role. However, 3 way interaction models are extremely hard to interpret in practice.
*** # Exercício 6. **Recolheu amostras a Norte e a Sul de Peniche, em várias praias da costa portuguesa sujeitas a diferente grau de hidrodinamismo (calmo e exposto), e foram quantificadas as densidades de cracas nas zonas rochosas (`DataTP6cracas.csv`). Admitindo que os pressupostos são verificados, efectue uma análise de variância hierárquica. Quais as principais conclusões?** Read the data and check it was read OK ```{r} cracas <- read.csv("DataTP6cracas.csv", sep=";") str(cracas) ``` We begin by looking at the data ```{r} boxplot(densidade~latitude+hidrodinamismo, data=cracas) ``` We implemented a nested ANOVA, as suggested. Note, as detailed below, that this would not be an obvious analysis just by looking at the data, there was a missing bit of information for this to be the right analysis to implement! We need to be explicit that exposure (=hidrodinamismo) is nested within latitude, and the analysis would be different if we defined that latitude is nested in exposure. Further, there was a missing variable in the data providing a beach label, and only then we would be able to implement the correct analysis. Here we assume each sample came from a different beach, as if say we had ```{r} cracas$Praia=1:20 cracas ``` Then the analysis is ```{r} anovacracas<-aov(densidade~latitude/hidrodinamismo,data=cracas) summary(anovacracas) ``` Our conclusion is that latitude as an effect on density, and that above and beyond that, exposure (= hidrodinamismo) also has an effect. Note that the following two formulations fit the exact same model. In the first we can appreciate that by introducing the nesting effect we implicitly require the main effect and an interaction to be present. We could therefore fit the main effect explicitly. ```{r} anovacracas2<-aov(densidade~latitude+latitude/hidrodinamismo,data=cracas) summary(anovacracas2) ``` which means that we can specify the same model explicitly as ```{r} anovacracas3<-aov(densidade~latitude+latitude:hidrodinamismo,data=cracas) summary(anovacracas3) ``` For more details on anova specification, you can find some help here: http://conjugateprior.org/2013/01/formulae-in-r-anova/ Note that the exercise was actually not ideal, and far from explicit. The analysis implemented implies that there were 20 different beaches, but actually this could have been quite different. Consider the following beach label option given by a different assumption about the data ```{r} cracas$Praia2=c(rep(1:5,times=2),rep(6:10,times=2)) cracas ``` The earlier solution implies that each sample came from a different beach (i.e., as defined by `Praia` in the code presented). Hence we assumed that each beach only had a sample, either correponding to exposed or calm. If these were 5 beaches in the north and 5 in the south, each with samples for both exposed and calm, as defined by `Praia2` the analysis would be different. Then all beaches would have samples for each level of exposure ```{r} table(cracas$latitude,cracas$hidrodinamismo) ``` and hence in fact would be no reason to implement a hierarchical ANOVA. The right model would then correspond to a full factorial design ```{r} anovacracas4<-aov(densidade~latitude*hidrodinamismo,data=cracas) summary(anovacracas4) ``` We would therefore conclude that the interaction is not significant (note this was not the case before, when exposure was not a main effect), but both exposure and latitude are ```{r} anovacracas5<-aov(densidade~latitude+hidrodinamismo,data=cracas) summary(anovacracas5) ``` Hence, ideally, the variable `Praia` would be present in the dataset. From 2020 onwards, it will!
*** # Exercício 7. **Administrou uma substância em ratos de laboratório e pretende agora medir o seu efeito na concentração de uma enzima no sangue. Efectuou medições em 10 ratos, duas horas e 24 horas após a administração da substância (dados `DataTP6ratos.csv`). Admitindo que os pressupostos são satisfeitos, efectue uma análise de variância que seja adequada e interprete os resultados.** Read the data and check it was read OK ```{r} ratos <- read.csv2("DataTP6ratos.csv") str(ratos) ``` It is best to recode both `tempo` and `rato` as factors ```{r} ratos$tempo=as.factor(ratos$tempo) ratos$rato=as.factor(ratos$rato) ``` We have measurements for an enzime measured in 10 rats, two measurements in 2 different times for each rat. This is an example of longitudinal data, a special case of repeated measurement data. ```{r} with(ratos,table(rato,tempo)) ``` We can plot the data, and it appears that the enzime measurements are higher 24 hours than 2 hours after the substance was given to the rats. ```{r} boxplot(enzima~tempo,data=ratos) ``` If we ignore the repeated measurements nature of the data, we could do a simple ANOVA. ```{r} anova.teste<-aov(enzima~tempo, data=ratos) #modelo incorrecto sem contemplar as medidas repetidas summary(anova.teste) ``` Note in fact this is the same as a t-test, as there are only 2 groups (i.e. two levels of time)! ```{r} with(ratos,t.test(enzima~tempo)) ``` and `tempo` would be considered significant at the 5% or 10% level (but not at the 1% significance level, say). In fact, accounting for the repeated measurement nature of the data is equivalent to a lack of independence in two samples, so we could also implement a paired-sample t-test ```{r} ttest=with(ratos,t.test(enzima[tempo=="2h"],enzima[tempo=="24h"],paired = TRUE)) ttest ``` which is exactly the same as doing a simple t.test over the differences ```{r} with(ratos,t.test(enzima[tempo=="2h"]-enzima[tempo=="24h"])) ``` and therefore we see that `tempo` is clearly significant (i.e. the means are different for the different times). The fact that we use the information that the measurements are within the same rats makes the test more powerful, as expected. However, this was an exercise within the Ficha de Trabalho for the ANOVA, so... Lets do an ANOVA (this is like pretending that `rato` is a block, and it will be treated as a random effect - we have repeated measurements within each `rato`) ```{r,eval=FALSE,echo=FALSE} #why does this return the same results as the formulation below anova.ratos1<-aov(enzima~rato+tempo, data=ratos) summary(anova.ratos1) ``` To define `rato` as a random effect the required sintax is ```{r} anova.ratos2<-aov(enzima ~ tempo + Error(rato), data=ratos) summary(anova.ratos2) ``` We see that the practical conclusions, i.e. the enzime measurements are significantly different in each time, are the same as above. Recal that a t distribution with $n$ degrees of freedom, squared, becomes an F distribution with (1,$n$) degrees of freedom, so note above that the t-test statistic squared is the same as the F-test statistic (`r ttest[[1]]`$^2$=`r ttest[[1]]^2`). Note that, in general, with more than 3 measurements per `rato`, the t-test trick would not be available and the only way to deal with this would be to do it the `aov` way (or using any other mixed model fitting function, and ANOVA with a fixed factor and a random factor is the simplest case of a mixed model - these can be implemented with more flexible functions than `aov`): For additional details, see e.g. http://conjugateprior.org/2013/01/formulae-in-r-anova/