Disclaimer

This document provides some proposed solutions to the exercises of the worksheet TP6.pdf, solved during practical class number 8, in the week of 4 to 8 October 2019. Note that there are (almost) infinitely many ways to solve each of the exercises. If this solution is not exactly the same as the one presented in class, or as the one each of you managed to produce, that is not a problem. What matters is that you are able to understand what you did, and to follow what is presented in this document, so that later on you can reproduce it independently.

In this document we assume that the data files are in the same directory as the .Rmd. If that is not the case, you will first need to identify that directory.

Besides the HTML itself, we will also make the .Rmd available, so that you can see how this report is made and perhaps draw some inspiration to use in class.
Note: if you want to use this template you need to install the prettydoc package. You can have a look at the various themes, and how to install them, here.

With these preliminaries out of the way, let us move on to the solutions.



Exercício 1.

Check the assumptions and carry out an analysis of variance on the data for the density of a tree species in several Portuguese serras (mountain ranges; data DataTP6serras.csv). What can you conclude about the comparison of the density of this species across the serras?

Reading the data and exploratory analysis

As usual, we can start the exercise by reading in the data and running a brief exploratory analysis to get to know them a little better.
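The code chunk that produced the output below is not echoed; here is a minimal sketch that reproduces it (the object name data.serras is taken from the model call shown further on; stringsAsFactors = TRUE is needed in R >= 4.0 to obtain the factor column shown by str()):

```r
# read the data, assuming the file is in the working directory
data.serras <- read.csv("DataTP6serras.csv", stringsAsFactors = TRUE)
# first rows and structure of the data
head(data.serras)
str(data.serras)
```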

##     Serra Dens.arvores
## 1 Estrela         21.0
## 2 Estrela         28.8
## 3 Estrela         23.1
## 4 Estrela         25.9
## 5 Estrela         25.6
## 6 Estrela         25.7
## 'data.frame':    80 obs. of  2 variables:
##  $ Serra       : Factor w/ 4 levels "Estrela","Lousa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Dens.arvores: num  21 28.8 23.1 25.9 25.6 25.7 15.7 27.1 23.2 17 ...

We have 80 observations of 2 variables (columns): a categorical variable (factor), Serra, identifying the serra, with four possible categories, and a numerical variable, Dens.arvores, with the tree density in the corresponding serra.
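The summary below can be obtained with a single call (same assumed object name):

```r
# counts per serra and distribution of the densities
summary(data.serras)
```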

##      Serra     Dens.arvores  
##  Estrela:20   Min.   :12.60  
##  Lousa  :20   1st Qu.:20.32  
##  Malcata:20   Median :25.65  
##  Marao  :20   Mean   :27.52  
##               3rd Qu.:34.05  
##               Max.   :48.20

We can also see that 20 observations were made in each of the four serras, with the measured densities ranging, overall, between 12.6 and 48.2, with a median of 25.65 and a mean slightly above it, at 27.52.

Visually, we can look at the variation in tree density through a box-and-whisker diagram (boxplot).
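A sketch of how such a boxplot could be produced (the axis label is our choice):

```r
# tree density by serra
boxplot(Dens.arvores ~ Serra, data = data.serras,
        ylab = "Dens.arvores")
```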

From this plot we can infer that Lousã clearly has a higher density of the tree species of interest than the other serras analysed. We can also say that Serra do Marão is where this species occurs at the lowest density. As for Serra da Estrela and Serra da Malcata, although the density in Serra da Estrela is slightly lower, their densities appear to be the closest of any pair.

We can also examine the different densities through histograms.

Pro tip: when creating histograms you need to pay attention to what you want to show, and in particular be careful with the axis ranges! If we fix the axes, and plot densities rather than raw frequencies, the analysis and comparison across serras becomes much easier.
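A possible implementation of that advice (a sketch; the common x-axis limits are illustrative choices):

```r
# one histogram per serra, with a common x-axis and on the density
# scale, so that the four panels are directly comparable
par(mfrow = c(2, 2))
for (s in levels(data.serras$Serra)) {
  hist(data.serras$Dens.arvores[data.serras$Serra == s],
       freq = FALSE, xlim = c(10, 50),
       main = s, xlab = "Dens.arvores")
}
par(mfrow = c(1, 1))
```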

Analysis of variance

It is often said that, to carry out an analysis of variance, we must assume the data follow a Gaussian (Normal) distribution. From a formal point of view this widespread idea is fundamentally wrong: normality is assumed not for the data, but for the residuals. In other words, we will test whether the residuals of the fitted model are Gaussian. This model can be a linear model with density (Dens.arvores) as the response variable, explained by serra (Serra).
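The fit below can be reproduced as follows (the formula and data name come from the Call line in the output; lm.serras is an assumed object name, while res.serras is the name the Shapiro-Wilk test output refers to):

```r
# linear model of density as a function of serra
lm.serras <- lm(Dens.arvores ~ Serra, data = data.serras)
summary(lm.serras)
# residuals of the fitted model, used below to check normality
res.serras <- residuals(lm.serras)
res.serras
```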

## 
## Call:
## lm(formula = Dens.arvores ~ Serra, data = data.serras)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.925 -3.331 -0.345  3.160  7.620 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23.4650     0.9665  24.278  < 2e-16 ***
## SerraLousa    17.6550     1.3669  12.916  < 2e-16 ***
## SerraMalcata   3.5600     1.3669   2.604 0.011062 *  
## SerraMarao    -4.9850     1.3669  -3.647 0.000484 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.322 on 76 degrees of freedom
## Multiple R-squared:  0.7997, Adjusted R-squared:  0.7917 
## F-statistic: 101.1 on 3 and 76 DF,  p-value: < 2.2e-16
##      1      2      3      4      5      6      7      8      9     10 
## -2.465  5.335 -0.365  2.435  2.135  2.235 -7.765  3.635 -0.265 -6.465 
##     11     12     13     14     15     16     17     18     19     20 
##  2.535 -0.365  1.635 -4.765 -5.665 -3.365  3.235 -0.865  3.135  6.035 
##     21     22     23     24     25     26     27     28     29     30 
##  4.520  1.320 -2.180 -5.180 -0.480  5.120  6.320 -2.080 -5.880 -2.980 
##     31     32     33     34     35     36     37     38     39     40 
##  7.620 -2.180 -4.380  2.920 -2.080 -1.880  7.120 -3.180 -3.280  0.820 
##     41     42     43     44     45     46     47     48     49     50 
## -4.425 -6.625 -2.225 -0.325  4.275 -3.925  0.975  6.375  5.175 -3.625 
##     51     52     53     54     55     56     57     58     59     60 
## -7.925  2.075  6.675  6.175 -1.525  0.975  6.875 -2.225 -0.425 -6.325 
##     61     62     63     64     65     66     67     68     69     70 
## -2.120 -3.320  7.080 -5.520 -4.420  0.280  6.280  1.580  4.280  0.780 
##     71     72     73     74     75     76     77     78     79     80 
##  2.780  4.680 -6.620  5.380 -5.120 -3.920 -5.020 -0.820  1.480  2.280

We can then test normality using the Shapiro-Wilk test (function shapiro.test()) applied to the residuals (res.serras), thereby testing a null hypothesis of the form

\[H_0:\ \text{The sample comes from a population with a Gaussian distribution.}\]
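In code (assuming res.serras holds the residuals extracted from the fitted model):

```r
# Shapiro-Wilk normality test on the model residuals
shapiro.test(res.serras)
```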

## 
##  Shapiro-Wilk normality test
## 
## data:  res.serras
## W = 0.96501, p-value = 0.02745

(At a 5% significance level we would reject the null hypothesis. But here, so that we can carry on, let us assume that we had decided to use a 1% significance level.)
At a 1% significance level we do not reject \(H_0\), so we can move on to study the homoscedasticity of the model (function bartlett.test()).

The Bartlett test, or test of homogeneity of variances, tests the hypothesis

\[H_0:\ \text{The residuals of the model are homoscedastic (i.e., the variances are equal).}\]

Definition: to learn more about heteroscedasticity, click here.
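The test reported below is run directly on the response, grouped by serra (a sketch, matching the data: line in the output):

```r
# Bartlett test of homogeneity of variances across serras
bartlett.test(Dens.arvores ~ Serra, data = data.serras)
```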

## 
##  Bartlett test of homogeneity of variances
## 
## data:  Dens.arvores by Serra
## Bartlett's K-squared = 0.64322, df = 3, p-value = 0.8865

With a p-value of 0.8865 we do not reject the null hypothesis of equal variances. We can therefore proceed with a parametric analysis, i.e., a one-way analysis of variance (one-way ANOVA).

One-way analysis of variance

This analysis is done with the aov() function, and we will test the equality of the mean densities across serras; in other words, if they are all equal there is no relationship between serra and the density of trees of this species (the density is the same regardless of which serra we observe). The hypothesis to be tested here is

\[H_0:\ \mu_{Estrela} = \mu_{Lousã} = \mu_{Malcata} = \mu_{Marão} \ (\text{All are equal.})\\ vs. \\ H_1:\ \text{At least one is different.}\]
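A sketch of the corresponding call (the formula matches the Fit: line of the Tukey output further on; aov.serras is an assumed object name):

```r
# one-way ANOVA of density as a function of serra
aov.serras <- aov(Dens.arvores ~ Serra, data = data.serras)
summary(aov.serras)
```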

##             Df Sum Sq Mean Sq F value Pr(>F)    
## Serra        3   5667  1889.1   101.1 <2e-16 ***
## Residuals   76   1420    18.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The resulting p-value is far below our chosen significance level, so we have enough statistical evidence to reject the null hypothesis, and we therefore assume a relationship between the density of these trees and the serra in which they occur.
With this test we learn that the mean densities in the four serras are not all equal, but, having four serras, we cannot tell specifically which serras differ from which. For that we have to examine each pair and infer whether or not they are significantly different. This is done with the Tukey test (function TukeyHSD()), which returns the p-values associated with each pair of serras.
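In code (assuming the fitted aov object is called aov.serras, a name of our choosing; the Fit: line in the output shows the same formula):

```r
# Tukey HSD multiple comparisons of means across serras
TukeyHSD(aov.serras)
```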

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Dens.arvores ~ Serra, data = data.serras)
## 
## $Serra
##                    diff          lwr        upr     p adj
## Lousa-Estrela    17.655  14.06451863  21.245481 0.0000000
## Malcata-Estrela   3.560  -0.03048137   7.150481 0.0528262
## Marao-Estrela    -4.985  -8.57548137  -1.394519 0.0026724
## Malcata-Lousa   -14.095 -17.68548137 -10.504519 0.0000000
## Marao-Lousa     -22.640 -26.23048137 -19.049519 0.0000000
## Marao-Malcata    -8.545 -12.13548137  -4.954519 0.0000001

We can look at the diff and p adj columns and see for which pairs of serras we do not reject the hypothesis of equality. At a 5% significance level only the Malcata-Estrela pair is not significantly different (p-value > 0.05); all other pairs are (p-value << 0.05).

We can also represent these results graphically.
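For instance, by plotting the Tukey confidence intervals (a sketch, with the same assumed object name as above):

```r
# confidence intervals for each pairwise difference;
# intervals crossing zero indicate non-significant pairs
plot(TukeyHSD(aov.serras))
```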

Note: Perhaps surprisingly, or not (!), all densities are potentially different from each other at a 10% significance level. Strictly, at a 5% significance level, Estrela and Malcata could not be considered statistically significantly different. All others would be considered significantly different even at a 1% significance level.



Exercício 2.

Knowing that the assumptions of normality and homoscedasticity are not met in the data set of the previous exercise, carry out a Kruskal-Wallis test to compare the densities of the species across the serras.

So, despite having assumed a 1% significance level before, if we had used a 5% or 10% significance level we would reject the Gaussianity assumption. Here we implement the corresponding non-parametric equivalent of the ANOVA, the Kruskal-Wallis (KW) test.
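A sketch of the call (matching the data: line in the output below):

```r
# Kruskal-Wallis rank sum test of density across serras
kruskal.test(Dens.arvores ~ Serra, data = data.serras)
```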

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Dens.arvores by Serra
## Kruskal-Wallis chi-squared = 57.341, df = 3, p-value = 2.173e-12

The outcome is the same as for the ANOVA: we reject the H0 that all samples came from populations with the same mean. We therefore conduct the non-parametric equivalent of the Tukey test, i.e. the Dunn test.
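The output below matches the dunn.test() function from the dunn.test package, which is not part of base R and must be installed first; a sketch:

```r
# install.packages("dunn.test")  # if not yet installed
library(dunn.test)
# pairwise comparisons between serras, no p-value adjustment (default)
dunn.test(data.serras$Dens.arvores, data.serras$Serra)
```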

##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 57.3412, df = 3, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                                 (No adjustment)                                
## Col Mean-|
## Row Mean |    Estrela      Lousa    Malcata
## ---------+---------------------------------
##    Lousa |  -5.117161
##          |    0.0000*
##          |
##  Malcata |  -1.282692   3.834468
##          |     0.0998    0.0001*
##          |
##    Marao |   2.262574   7.379736   3.545267
##          |    0.0118*    0.0000*    0.0002*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2

The conclusions are very similar to those of the ANOVA. If anything, the Kruskal-Wallis is less powerful than the ANOVA, but here the signal in the data was very strong, so the conclusions do not change.

Actually, note that the probabilities reported above are not P-values; to compare them with the required significance level we need to compare them with \(\alpha/2\). See exercício 3 below for further details on this topic.



Exercício 3.

An experiment was carried out to assess the effect of different diets on the growth of a fish (data DataTP6dietas.csv). Carry out the test you find appropriate to assess that effect. Which diet would you choose to maximize the growth of the fish?

As usual, read in the data first and check it read OK.
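A sketch of the reading and checking step (data.dietas is an assumed object name; stringsAsFactors = TRUE reproduces the factor column in R >= 4.0):

```r
data.dietas <- read.csv("DataTP6dietas.csv", stringsAsFactors = TRUE)
head(data.dietas)
str(data.dietas)
summary(data.dietas)
```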

##     dieta crescimento
## 1 dieta A         1.6
## 2 dieta A         1.9
## 3 dieta A         1.1
## 4 dieta A         1.7
## 5 dieta A         1.9
## 6 dieta A         1.9
## 'data.frame':    40 obs. of  2 variables:
##  $ dieta      : Factor w/ 4 levels "dieta A","dieta B",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ crescimento: num  1.6 1.9 1.1 1.7 1.9 1.9 1.5 1.4 1.4 2 ...
##      dieta     crescimento    
##  dieta A:10   Min.   : 1.100  
##  dieta B:10   1st Qu.: 1.600  
##  dieta C:10   Median : 3.000  
##  dieta D:10   Mean   : 4.668  
##               3rd Qu.: 5.225  
##               Max.   :23.100

We see we have 40 records, 10 for each of the 4 diets. We can look at the data, and it immediately becomes apparent that one of the diets leads to much higher growth.

It also becomes evident that the assumptions of the parametric test that we could use in this context (i.e. comparing more than 2 means, the ANOVA) might not hold. We formally test the Gaussian assumption.
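The per-diet tests below, with their data: X[[i]] lines, are typical of lapply() over a split of the response; a sketch (assuming the data frame is called data.dietas):

```r
# Shapiro-Wilk test applied separately within each diet
lapply(split(data.dietas$crescimento, data.dietas$dieta), shapiro.test)
```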

## $`dieta A`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.92604, p-value = 0.4101
## 
## 
## $`dieta B`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.94479, p-value = 0.6075
## 
## 
## $`dieta C`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.95639, p-value = 0.744
## 
## 
## $`dieta D`
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.82321, p-value = 0.02771

There is no strong evidence to reject the H0 that the data come from Gaussian populations (although, strictly, the dieta D H0 would be rejected at the 5% or 10% significance level, but not at the 1% level). In any case, the ANOVA is known to be robust to failure of the Gaussian assumption, and if that were the only thing going on, we would probably proceed with an ANOVA. However, when we formally test the homoscedasticity
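with a call like (matching the data: line in the output):

```r
# Bartlett test of homogeneity of variances across diets
bartlett.test(crescimento ~ dieta, data = data.dietas)
```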

## 
##  Bartlett test of homogeneity of variances
## 
## data:  crescimento by dieta
## Bartlett's K-squared = 100.29, df = 3, p-value < 2.2e-16

we reject H0 that all the variances are equal at any of the usual significance levels. Therefore, we have to proceed to the non-parametric alternative to the ANOVA, i.e., the Kruskal-Wallis test
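A sketch of the corresponding call:

```r
# Kruskal-Wallis rank sum test of growth across diets
kruskal.test(crescimento ~ dieta, data = data.dietas)
```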

## 
##  Kruskal-Wallis rank sum test
## 
## data:  crescimento by dieta
## Kruskal-Wallis chi-squared = 33.195, df = 3, p-value = 2.929e-07

Given the P-value, we reject the H0 that all means (strictly, this being a non-parametric test, all medians!) are the same. We then follow up with an a posteriori multiple comparison test, the Dunn test.
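Again using the dunn.test package (a sketch; data.dietas is our assumed object name):

```r
library(dunn.test)
# pairwise comparisons between diets, no p-value adjustment (default)
dunn.test(data.dietas$crescimento, data.dietas$dieta)
```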

##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 33.1951, df = 3, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                                 (No adjustment)                                
## Col Mean-|
## Row Mean |    dieta A    dieta B    dieta C
## ---------+---------------------------------
##  dieta B |  -2.652118
##          |    0.0040*
##          |
##  dieta C |   0.440423   3.092542
##          |     0.3298    0.0010*
##          |
##  dieta D |  -4.567005  -1.914886  -5.007429
##          |    0.0000*     0.0278    0.0000*
## 
## alpha = 0.05
## Reject Ho if p <= alpha/2

The conclusion is that we can reject the equality of the means for all diets even at the 5% significance level except for the A-C and B-D diets, for which there’s no evidence to suggest a difference.

Usually, the probability reported by a significance test is a P-value, so you compare it directly with your significance level \(\alpha\) and reject the null if P-value < \(\alpha\). Note that the dunn.test output is at odds with that, and with all that we have seen before. As the last two lines of the output above make explicit, the p reported is not a P-value (and, to be fair, it makes little sense to report it like this). Therefore, you have to divide your significance level by 2 (because this is a bilateral test) and that is the value you compare with the reported probabilities. This is why dunn.test is easier to use with the explicit use of the argument alpha, so that the "*" can be correctly interpreted as a significant difference for the \(\alpha\) we want. In the analysis below we set \(\alpha\)=0.1, and the B-D difference becomes, at the 10% significance level, statistically significant.
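A sketch of the corresponding call:

```r
# same test, but setting alpha explicitly so that the "*" flags
# reflect the significance level we actually want (here 10%)
dunn.test(data.dietas$crescimento, data.dietas$dieta, alpha = 0.1)
```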

##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 33.1951, df = 3, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                                 (No adjustment)                                
## Col Mean-|
## Row Mean |    dieta A    dieta B    dieta C
## ---------+---------------------------------
##  dieta B |  -2.652118
##          |    0.0040*
##          |
##  dieta C |   0.440423   3.092542
##          |     0.3298    0.0010*
##          |
##  dieta D |  -4.567005  -1.914886  -5.007429
##          |    0.0000*    0.0278*    0.0000*
## 
## alpha = 0.1
## Reject Ho if p <= alpha/2

Therefore, the best diet to use, assuming we want heavier animals, is diet D. Note this was actually obvious from the first look at the data, so the statistical analysis just confirmed what the intuition was telling us.



Exercício 4.

You carried out an experiment identical to that of the previous exercise, but introduced another factor: the temperature (DataTP6dietastemperatura.csv). Carry out an analysis of variance to assess the effect of the selected factors on fish growth, assuming the assumptions are satisfied.

By now you know the drill… read the data and see it is OK
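A sketch (the object name data.dt matches the model call shown later; stringsAsFactors = TRUE reproduces the factor column in R >= 4.0):

```r
data.dt <- read.csv("DataTP6dietastemperatura.csv", stringsAsFactors = TRUE)
head(data.dt)
str(data.dt)
summary(data.dt)
```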

##     dieta temperatura crescimento
## 1 dieta A          10         1.6
## 2 dieta A          10         1.9
## 3 dieta A          10         1.1
## 4 dieta A          10         1.7
## 5 dieta A          10         1.9
## 6 dieta A          20         3.2
## 'data.frame':    40 obs. of  3 variables:
##  $ dieta      : Factor w/ 4 levels "dieta A","dieta B",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ temperatura: int  10 10 10 10 10 20 20 20 20 20 ...
##  $ crescimento: num  1.6 1.9 1.1 1.7 1.9 3.2 4.2 3.7 4.4 3.9 ...
##      dieta     temperatura  crescimento    
##  dieta A:10   Min.   :10   Min.   : 1.100  
##  dieta B:10   1st Qu.:10   1st Qu.: 2.200  
##  dieta C:10   Median :15   Median : 4.200  
##  dieta D:10   Mean   :15   Mean   : 5.633  
##               3rd Qu.:20   3rd Qu.: 7.150  
##               Max.   :20   Max.   :23.100

The variable temperatura is read here as a vector of numeric values, when we want it as a categorical variable. We can therefore create a variable ftemp that represents temperature as a factor with two categories, corresponding to the two possible temperature levels.
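A sketch (ftemp is the name used in the model output below; storing it inside data.dt is our assumption):

```r
# temperature as a factor with two levels (10 and 20)
data.dt$ftemp <- as.factor(data.dt$temperatura)
```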

This boxplot sequence is non-ideal to compare across diets and across temperatures.

Here is an alternative way to do the boxplots. This uses ggplot2, a different paradigm for plots (related to the tidyverse paradigm of data analysis). This is outside what we do in Ecologia Numérica, so please don’t ask me about this code, but these look really cool… don’t they?

We can see e.g. that for all diets the higher temperature leads to higher growth, except for diet C. This would point to an interaction (i.e., different effects of one factor depending on the levels of the other) between diet and temperature.

Now, run the ANOVA analysis (since we are told the assumptions hold, but we could always test these!)
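A sketch of the fit (the formula and data name come from the Fit: line of the Tukey output further on; aov.dt is an assumed object name):

```r
# two-way ANOVA with interaction between diet and temperature
aov.dt <- aov(crescimento ~ dieta * ftemp, data = data.dt)
summary(aov.dt)
```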

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## dieta        3  436.6  145.53   39.66 6.85e-11 ***
## ftemp        1  119.4  119.37   32.53 2.56e-06 ***
## dieta:ftemp  3  204.1   68.04   18.54 3.74e-07 ***
## Residuals   32  117.4    3.67                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that both factors, as well as the interaction effect (dieta:ftemp), are significant at the usual significance levels. Therefore, we can use the Tukey test to see what exactly is different. The output is rather lengthy
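In code (assuming the fitted two-way model was stored in an object called aov.dt, a name of our choosing):

```r
# Tukey HSD comparisons for main effects and the interaction
TukeyHSD(aov.dt)
```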

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = crescimento ~ dieta * ftemp, data = data.dt)
## 
## $dieta
##                  diff       lwr       upr     p adj
## dieta B-dieta A  3.58  1.259024  5.900976 0.0011550
## dieta C-dieta A -0.08 -2.400976  2.240976 0.9997017
## dieta D-dieta A  7.99  5.669024 10.310976 0.0000000
## dieta C-dieta B -3.66 -5.980976 -1.339024 0.0008902
## dieta D-dieta B  4.41  2.089024  6.730976 0.0000734
## dieta D-dieta C  8.07  5.749024 10.390976 0.0000000
## 
## $ftemp
##        diff      lwr      upr   p adj
## 20-10 3.455 2.221141 4.688859 2.6e-06
## 
## $`dieta:ftemp`
##                        diff         lwr        upr     p adj
## dieta B:10-dieta A:10  2.86  -1.0643669  6.7843669 0.2941201
## dieta C:10-dieta A:10  2.22  -1.7043669  6.1443669 0.6039013
## dieta D:10-dieta A:10  3.98   0.0556331  7.9043669 0.0449154
## dieta A:20-dieta A:10  2.24  -1.6843669  6.1643669 0.5933638
## dieta B:20-dieta A:10  6.54   2.6156331 10.4643669 0.0001534
## dieta C:20-dieta A:10 -0.14  -4.0643669  3.7843669 1.0000000
## dieta D:20-dieta A:10 14.24  10.3156331 18.1643669 0.0000000
## dieta C:10-dieta B:10 -0.64  -4.5643669  3.2843669 0.9994006
## dieta D:10-dieta B:10  1.12  -2.8043669  5.0443669 0.9812870
## dieta A:20-dieta B:10 -0.62  -4.5443669  3.3043669 0.9995129
## dieta B:20-dieta B:10  3.68  -0.2443669  7.6043669 0.0789258
## dieta C:20-dieta B:10 -3.00  -6.9243669  0.9243669 0.2415612
## dieta D:20-dieta B:10 11.38   7.4556331 15.3043669 0.0000000
## dieta D:10-dieta C:10  1.76  -2.1643669  5.6843669 0.8256648
## dieta A:20-dieta C:10  0.02  -3.9043669  3.9443669 1.0000000
## dieta B:20-dieta C:10  4.32   0.3956331  8.2443669 0.0227810
## dieta C:20-dieta C:10 -2.36  -6.2843669  1.5643669 0.5302179
## dieta D:20-dieta C:10 12.02   8.0956331 15.9443669 0.0000000
## dieta A:20-dieta D:10 -1.74  -5.6643669  2.1843669 0.8336689
## dieta B:20-dieta D:10  2.56  -1.3643669  6.4843669 0.4284288
## dieta C:20-dieta D:10 -4.12  -8.0443669 -0.1956331 0.0341217
## dieta D:20-dieta D:10 10.26   6.3356331 14.1843669 0.0000000
## dieta B:20-dieta A:20  4.30   0.3756331  8.2243669 0.0237336
## dieta C:20-dieta A:20 -2.38  -6.3043669  1.5443669 0.5197753
## dieta D:20-dieta A:20 12.00   8.0756331 15.9243669 0.0000000
## dieta C:20-dieta B:20 -6.68 -10.6043669 -2.7556331 0.0001103
## dieta D:20-dieta B:20  7.70   3.7756331 11.6243669 0.0000100
## dieta D:20-dieta C:20 14.38  10.4556331 18.3043669 0.0000000

Just as examples, we would not reject the H0 of equality of means for the dieta C:20-dieta A:10 or dieta A:20-dieta D:10 comparisons, but we would reject, at the usual significance levels, the equality of the means of either the dieta D:20-dieta A:20 or dieta D:20-dieta C:20 treatments. As noted above, the result of the different diets seems to depend on the temperature. The best diet would be diet D, at temperature 20. However, if one had to use temperature 10, there would not be enough evidence to suggest that D was better than B (P-value 0.9812870) or C (P-value 0.8256648). Hence, assuming that the cost of keeping the temperature at 20 was too high, and that diet B was much cheaper than D, we might opt for diet B. This is the kind of inferential insight that would show me that you know what you are doing!



Exercício 5.

You want to assess the effect of water temperature, dissolved oxygen and pH on the mortality of a copepod. Summarize the experimental conditions given in the data DataTP6mortalidade.csv.
Assuming that the assumptions of normality and homoscedasticity are satisfied, carry out an analysis of variance to assess the various hypotheses under consideration. Draw the main conclusions. If you wanted to assess only the main effects of the factors, and not their interactions, how could you proceed?

Mamma mia… here I go again… read the data in and check it’s OK
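A sketch (the object name mort matches the model calls shown later; all columns are integer, so no factor conversion happens at read time):

```r
mort <- read.csv("DataTP6mortalidade.csv")
str(mort)
```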

## 'data.frame':    40 obs. of  4 variables:
##  $ temperatura: int  10 10 10 10 10 20 20 20 20 20 ...
##  $ oxigenio   : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ ph         : int  7 7 7 5 5 7 7 7 5 5 ...
##  $ mortalidade: int  45 56 64 49 54 22 32 41 88 79 ...

We now have the mortality of the animals as a function of 3 factors: temperature, oxygen and pH. Note that these are all numeric, but strictly we need them as factors (a numerical variable that takes only 2 or 3 distinct values can’t really be used as anything but a factor, since there are not enough values to evaluate how the variable affects the response on a continuous scale).

And now we can see how many levels each factor has
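One way to obtain the factor-based summary below (a sketch; transform() converts copies of the columns, leaving mort itself unchanged):

```r
# summary with the three explanatory variables treated as factors
summary(transform(mort,
                  temperatura = as.factor(temperatura),
                  oxigenio    = as.factor(oxigenio),
                  ph          = as.factor(ph)))
```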

##  temperatura oxigenio ph      mortalidade   
##  10:20       5:20     5:20   Min.   :22.00  
##  20:20       8:20     7:20   1st Qu.:33.50  
##                              Median :51.50  
##                              Mean   :57.95  
##                              3rd Qu.:82.25  
##                              Max.   :98.00

We have 2 levels for each factor variable, which we can consider the high and low level, say (independently of the values).

We can now fit the model with all interactions (i.e. including a 3 way interaction)
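A sketch of the fit (the formula matches the Fit: line of the Tukey output further below; aov.mort is an assumed object name):

```r
# three-way ANOVA with all interactions, converting to factors in the formula
aov.mort <- aov(mortalidade ~ as.factor(temperatura) * as.factor(oxigenio) *
                  as.factor(ph), data = mort)
summary(aov.mort)
```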

##                                                          Df Sum Sq Mean Sq
## as.factor(temperatura)                                    1     62      62
## as.factor(oxigenio)                                       1     44      44
## as.factor(ph)                                             1  17893   17893
## as.factor(temperatura):as.factor(oxigenio)                1     22      22
## as.factor(temperatura):as.factor(ph)                      1    865     865
## as.factor(oxigenio):as.factor(ph)                         1   1369    1369
## as.factor(temperatura):as.factor(oxigenio):as.factor(ph)  1    722     722
## Residuals                                                32   2970      93
##                                                          F value   Pr(>F)
## as.factor(temperatura)                                     0.673 0.417909
## as.factor(oxigenio)                                        0.475 0.495566
## as.factor(ph)                                            192.811 4.27e-15
## as.factor(temperatura):as.factor(oxigenio)                 0.242 0.625799
## as.factor(temperatura):as.factor(ph)                       9.320 0.004537
## as.factor(oxigenio):as.factor(ph)                         14.751 0.000547
## as.factor(temperatura):as.factor(oxigenio):as.factor(ph)   7.786 0.008802
## Residuals                                                                
##                                                             
## as.factor(temperatura)                                      
## as.factor(oxigenio)                                         
## as.factor(ph)                                            ***
## as.factor(temperatura):as.factor(oxigenio)                  
## as.factor(temperatura):as.factor(ph)                     ** 
## as.factor(oxigenio):as.factor(ph)                        ***
## as.factor(temperatura):as.factor(oxigenio):as.factor(ph) ** 
## Residuals                                                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Apparently, among the main effects only pH is significant. However, even the 3-way interaction is significant, which means that it is very hard to generalize interpretations, since the effect of pH will be different depending on the levels of temperature and oxygen. These two statements may seem contradictory: while looking at the effect of temperature or oxygen alone there is no apparent effect, that effect shows up once pH is considered.

Each combination of the different levels seems important and potentially different. We can test, for each pairwise treatment combination, whether there are differences, using the Tukey test
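In code (assuming the full interaction model was stored in an object called aov.mort, a name of our choosing):

```r
# Tukey HSD comparisons for all main effects and interactions
TukeyHSD(aov.mort)
```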

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mortalidade ~ as.factor(temperatura) * as.factor(oxigenio) * as.factor(ph), data = mort)
## 
## $`as.factor(temperatura)`
##       diff       lwr      upr     p adj
## 20-10  2.5 -3.705129 8.705129 0.4179088
## 
## $`as.factor(oxigenio)`
##     diff       lwr      upr     p adj
## 8-5 -2.1 -8.305129 4.105129 0.4955663
## 
## $`as.factor(ph)`
##      diff       lwr       upr p adj
## 7-5 -42.3 -48.50513 -36.09487     0
## 
## $`as.factor(temperatura):as.factor(oxigenio)`
##           diff        lwr       upr     p adj
## 20:5-10:5  4.0  -7.672287 15.672287 0.7898689
## 10:8-10:5 -0.6 -12.272287 11.072287 0.9990151
## 20:8-10:5  0.4 -11.272287 12.072287 0.9997069
## 10:8-20:5 -4.6 -16.272287  7.072287 0.7112595
## 20:8-20:5 -3.6 -15.272287  8.072287 0.8371229
## 20:8-10:8  1.0 -10.672287 12.672287 0.9955038
## 
## $`as.factor(temperatura):as.factor(ph)`
##            diff         lwr        upr     p adj
## 20:5-10:5  11.8   0.1277127  23.472287 0.0467605
## 10:7-10:5 -33.0 -44.6722873 -21.327713 0.0000001
## 20:7-10:5 -39.8 -51.4722873 -28.127713 0.0000000
## 10:7-20:5 -44.8 -56.4722873 -33.127713 0.0000000
## 20:7-20:5 -51.6 -63.2722873 -39.927713 0.0000000
## 20:7-10:7  -6.8 -18.4722873   4.872287 0.4048363
## 
## $`as.factor(oxigenio):as.factor(ph)`
##          diff        lwr        upr     p adj
## 8:5-5:5   9.6  -2.072287  21.272287 0.1372628
## 5:7-5:5 -30.6 -42.272287 -18.927713 0.0000003
## 8:7-5:5 -44.4 -56.072287 -32.727713 0.0000000
## 5:7-8:5 -40.2 -51.872287 -28.527713 0.0000000
## 8:7-8:5 -54.0 -65.672287 -42.327713 0.0000000
## 8:7-5:7 -13.8 -25.472287  -2.127713 0.0154383
## 
## $`as.factor(temperatura):as.factor(oxigenio):as.factor(ph)`
##                diff         lwr         upr     p adj
## 20:5:5-10:5:5  21.8   2.0641907  41.5358093 0.0220982
## 10:8:5-10:5:5  19.6  -0.1358093  39.3358093 0.0526494
## 20:8:5-10:5:5  21.4   1.6641907  41.1358093 0.0259985
## 10:5:7-10:5:5 -12.8 -32.5358093   6.9358093 0.4356975
## 20:5:7-10:5:5 -26.6 -46.3358093  -6.8641907 0.0027882
## 10:8:7-10:5:5 -33.6 -53.3358093 -13.8641907 0.0001100
## 20:8:7-10:5:5 -33.4 -53.1358093 -13.6641907 0.0001208
## 10:8:5-20:5:5  -2.2 -21.9358093  17.5358093 0.9999525
## 20:8:5-20:5:5  -0.4 -20.1358093  19.3358093 1.0000000
## 10:5:7-20:5:5 -34.6 -54.3358093 -14.8641907 0.0000688
## 20:5:7-20:5:5 -48.4 -68.1358093 -28.6641907 0.0000001
## 10:8:7-20:5:5 -55.4 -75.1358093 -35.6641907 0.0000000
## 20:8:7-20:5:5 -55.2 -74.9358093 -35.4641907 0.0000000
## 20:8:5-10:8:5   1.8 -17.9358093  21.5358093 0.9999879
## 10:5:7-10:8:5 -32.4 -52.1358093 -12.6641907 0.0001930
## 20:5:7-10:8:5 -46.2 -65.9358093 -26.4641907 0.0000003
## 10:8:7-10:8:5 -53.2 -72.9358093 -33.4641907 0.0000000
## 20:8:7-10:8:5 -53.0 -72.7358093 -33.2641907 0.0000000
## 10:5:7-20:8:5 -34.2 -53.9358093 -14.4641907 0.0000830
## 20:5:7-20:8:5 -48.0 -67.7358093 -28.2641907 0.0000001
## 10:8:7-20:8:5 -55.0 -74.7358093 -35.2641907 0.0000000
## 20:8:7-20:8:5 -54.8 -74.5358093 -35.0641907 0.0000000
## 20:5:7-10:5:7 -13.8 -33.5358093   5.9358093 0.3427519
## 10:8:7-10:5:7 -20.8 -40.5358093  -1.0641907 0.0330538
## 20:8:7-10:5:7 -20.6 -40.3358093  -0.8641907 0.0357707
## 10:8:7-20:5:7  -7.0 -26.7358093  12.7358093 0.9403038
## 20:8:7-20:5:7  -6.8 -26.5358093  12.9358093 0.9484389
## 20:8:7-10:8:7   0.2 -19.5358093  19.9358093 1.0000000

If one were to ignore the interactions, the model would be fit like this
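namely by replacing "*" with "+" in the formula (a sketch; the additive formula matches the Fit: line of the post hoc output below, and aov.mort.add is an assumed object name):

```r
# main effects only, no interaction terms
aov.mort.add <- aov(mortalidade ~ as.factor(temperatura) +
                      as.factor(oxigenio) + as.factor(ph), data = mort)
summary(aov.mort.add)
```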

##                        Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(temperatura)  1     62      62   0.378    0.542    
## as.factor(oxigenio)     1     44      44   0.267    0.609    
## as.factor(ph)           1  17893   17893 108.289 2.12e-12 ***
## Residuals              36   5948     165                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

and the post hoc tests would be
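A hedged sketch of the call producing these comparisons (assuming the data frame is named `mort`):

```r
# Sketch: Tukey HSD over the additive model fitted above
TukeyHSD(aov(mortalidade ~ as.factor(temperatura) + as.factor(oxigenio) +
               as.factor(ph), data = mort))
```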

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mortalidade ~ as.factor(temperatura) + as.factor(oxigenio) + as.factor(ph), data = mort)
## 
## $`as.factor(temperatura)`
##       diff      lwr      upr     p adj
## 20-10  2.5 -5.74398 10.74398 0.5424081
## 
## $`as.factor(oxigenio)`
##     diff       lwr     upr     p adj
## 8-5 -2.1 -10.34398 6.14398 0.6085814
## 
## $`as.factor(ph)`
##      diff       lwr       upr p adj
## 7-5 -42.3 -50.54398 -34.05602     0

Note that in this case we would have no way of detecting that temperature and oxygen can actually play an important role. However, 3-way interaction models are extremely hard to interpret in practice.



Exercício 6.

Samples were collected north and south of Peniche, in several beaches of the Portuguese coast subject to different degrees of hydrodynamism (calm and exposed), and the densities of barnacles in the rocky areas were quantified (DataTP6cracas.csv). Assuming the model assumptions hold, carry out a hierarchical (nested) analysis of variance. What are the main conclusions?

Read the data and check it was read OK

## 'data.frame':    20 obs. of  3 variables:
##  $ latitude      : Factor w/ 2 levels "norte","sul": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hidrodinamismo: Factor w/ 2 levels "calmo","exp": 2 2 2 2 2 1 1 1 1 1 ...
##  $ densidade     : int  21 23 25 32 34 21 19 18 21 17 ...

We begin by looking at the data

We implemented a nested ANOVA, as suggested. Note, as detailed below, that this would not be an obvious analysis just by looking at the data: a key bit of information was missing for this to be the right analysis to implement!

We need to be explicit that exposure (= hidrodinamismo) is nested within latitude; the analysis would be different if we defined latitude as nested within exposure. Further, the data were missing a variable providing a beach label, and only with it could we implement the correct analysis with confidence. Here we assume each sample came from a different beach, as if, say, we had

##    latitude hidrodinamismo densidade Praia
## 1     norte            exp        21     1
## 2     norte            exp        23     2
## 3     norte            exp        25     3
## 4     norte            exp        32     4
## 5     norte            exp        34     5
## 6     norte          calmo        21     6
## 7     norte          calmo        19     7
## 8     norte          calmo        18     8
## 9     norte          calmo        21     9
## 10    norte          calmo        17    10
## 11      sul            exp        34    11
## 12      sul            exp        44    12
## 13      sul            exp        53    13
## 14      sul            exp        45    14
## 15      sul            exp        56    15
## 16      sul          calmo        34    16
## 17      sul          calmo        37    17
## 18      sul          calmo        38    18
## 19      sul          calmo        41    19
## 20      sul          calmo        23    20

Then the analysis is
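As a self-contained sketch, the data printed above can be rebuilt by hand (in practice they would be read from DataTP6cracas.csv) and the nested model fitted with the `/` operator:

```r
# Rebuild the data exactly as printed above (normally read from
# DataTP6cracas.csv)
cracas <- data.frame(
  latitude       = rep(c("norte", "sul"), each = 10),
  hidrodinamismo = rep(rep(c("exp", "calmo"), each = 5), times = 2),
  densidade      = c(21, 23, 25, 32, 34, 21, 19, 18, 21, 17,
                     34, 44, 53, 45, 56, 34, 37, 38, 41, 23)
)
# hidrodinamismo nested within latitude: latitude/hidrodinamismo expands
# to latitude + latitude:hidrodinamismo
summary(aov(densidade ~ latitude / hidrodinamismo, data = cracas))
```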

##                         Df Sum Sq Mean Sq F value   Pr(>F)    
## latitude                 1 1513.8  1513.8   38.25 1.31e-05 ***
## latitude:hidrodinamismo  2  500.2   250.1    6.32  0.00949 ** 
## Residuals               16  633.2    39.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our conclusion is that latitude has an effect on density and that, above and beyond that, exposure (= hidrodinamismo) also has an effect.

Note that the following two formulations fit the exact same model. In the first we can appreciate that, by introducing the nesting effect, we implicitly require the main effect and an interaction to be present. We could therefore also fit the main effect explicitly.

##                         Df Sum Sq Mean Sq F value   Pr(>F)    
## latitude                 1 1513.8  1513.8   38.25 1.31e-05 ***
## latitude:hidrodinamismo  2  500.2   250.1    6.32  0.00949 ** 
## Residuals               16  633.2    39.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

which means that we can specify the same model explicitly as
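A sketch of that explicit formulation (assuming a data frame `cracas` holding the DataTP6cracas.csv data):

```r
# Main effect plus nested interaction, written out explicitly;
# equivalent to densidade ~ latitude/hidrodinamismo
summary(aov(densidade ~ latitude + latitude:hidrodinamismo, data = cracas))
```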

##                         Df Sum Sq Mean Sq F value   Pr(>F)    
## latitude                 1 1513.8  1513.8   38.25 1.31e-05 ***
## latitude:hidrodinamismo  2  500.2   250.1    6.32  0.00949 ** 
## Residuals               16  633.2    39.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For more details on anova specification, you can find some help here:

http://conjugateprior.org/2013/01/formulae-in-r-anova/

Note that the exercise was actually not ideal, and far from explicit. The analysis implemented implies that there were 20 different beaches, but actually this could have been quite different. Consider the following beach label option given by a different assumption about the data

##    latitude hidrodinamismo densidade Praia Praia2
## 1     norte            exp        21     1      1
## 2     norte            exp        23     2      2
## 3     norte            exp        25     3      3
## 4     norte            exp        32     4      4
## 5     norte            exp        34     5      5
## 6     norte          calmo        21     6      1
## 7     norte          calmo        19     7      2
## 8     norte          calmo        18     8      3
## 9     norte          calmo        21     9      4
## 10    norte          calmo        17    10      5
## 11      sul            exp        34    11      6
## 12      sul            exp        44    12      7
## 13      sul            exp        53    13      8
## 14      sul            exp        45    14      9
## 15      sul            exp        56    15     10
## 16      sul          calmo        34    16      6
## 17      sul          calmo        37    17      7
## 18      sul          calmo        38    18      8
## 19      sul          calmo        41    19      9
## 20      sul          calmo        23    20     10

The earlier solution implies that each sample came from a different beach (i.e., as defined by Praia in the code presented). Hence we assumed that each beach contributed only one sample, corresponding either to exposed or to calm.

If these were 5 beaches in the north and 5 in the south, each with samples for both exposed and calm, as defined by Praia2, the analysis would be different. Then all beaches would have samples for each level of exposure

##        
##         calmo exp
##   norte     5   5
##   sul       5   5

and hence there would in fact be no reason to implement a hierarchical ANOVA. The right model would then correspond to a full factorial design
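A sketch of the crossed (full factorial) fit under that alternative assumption (again assuming a data frame `cracas`):

```r
# Full factorial: both main effects plus their interaction
summary(aov(densidade ~ latitude * hidrodinamismo, data = cracas))
```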

##                         Df Sum Sq Mean Sq F value   Pr(>F)    
## latitude                 1 1513.8  1513.8  38.251 1.31e-05 ***
## hidrodinamismo           1  480.2   480.2  12.134  0.00307 ** 
## latitude:hidrodinamismo  1   20.0    20.0   0.505  0.48738    
## Residuals               16  633.2    39.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We would therefore conclude that the interaction is not significant (note this was not the case before, when exposure was not a main effect), but that both exposure and latitude are. Refitting without the interaction:
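A sketch of the additive refit, dropping the non-significant interaction (the data frame name `cracas` is an assumption):

```r
# Additive model without the interaction term
summary(aov(densidade ~ latitude + hidrodinamismo, data = cracas))
```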

##                Df Sum Sq Mean Sq F value   Pr(>F)    
## latitude        1 1513.8  1513.8    39.4 8.35e-06 ***
## hidrodinamismo  1  480.2   480.2    12.5  0.00254 ** 
## Residuals      17  653.2    38.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hence, ideally, the variable Praia would be present in the dataset. From 2020 onwards, it will!



Exercício 7.

A substance was administered to laboratory rats and you now want to measure its effect on the concentration of an enzyme in the blood. Measurements were taken on 10 rats, two hours and 24 hours after the substance was administered (DataTP6ratos.csv). Assuming the model assumptions are satisfied, carry out an appropriate analysis of variance and interpret the results.

Read the data and check it was read OK

## 'data.frame':    20 obs. of  3 variables:
##  $ rato  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ tempo : Factor w/ 2 levels "24h","2h": 2 2 2 2 2 2 2 2 2 2 ...
##  $ enzima: num  2.2 2.3 2.2 2.1 2.4 3.2 3.3 7.2 4.4 8.3 ...

It is best to recode both tempo and rato as factors
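A sketch of the reading and recoding steps (the object name `ratos` and the use of `read.csv` defaults are assumptions; the separator in the file may differ):

```r
# Read the data (file assumed to sit in the working directory) and
# recode the identifier and time columns as factors
ratos <- read.csv("DataTP6ratos.csv")
ratos$rato  <- as.factor(ratos$rato)
ratos$tempo <- as.factor(ratos$tempo)
str(ratos)
```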

We have measurements of an enzyme in 10 rats, with two measurements at 2 different times for each rat. This is an example of longitudinal data, a special case of repeated measures data.

##     tempo
## rato 24h 2h
##   1    1  1
##   2    1  1
##   3    1  1
##   4    1  1
##   5    1  1
##   6    1  1
##   7    1  1
##   8    1  1
##   9    1  1
##   10   1  1

We can plot the data, and it appears that the enzyme measurements are higher 24 hours after the substance was given to the rats than 2 hours after.

If we ignore the repeated measurements nature of the data, we could do a simple ANOVA.
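A sketch of that naive one-way fit (assuming the data frame is called `ratos`):

```r
# One-way ANOVA ignoring the pairing within rats
summary(aov(enzima ~ tempo, data = ratos))
```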

##             Df Sum Sq Mean Sq F value Pr(>F)  
## tempo        1  35.64   35.64   4.743  0.043 *
## Residuals   18 135.26    7.51                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that in fact this is the same as a t-test, as there are only 2 groups (i.e. two levels of time)! (The output below uses the Welch correction for unequal variances, so the p-value differs slightly from the ANOVA's.)
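A sketch of the corresponding call (assuming the data frame `ratos`):

```r
# Welch two-sample t-test of enzima between the two levels of tempo
t.test(enzima ~ tempo, data = ratos)
```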

## 
##  Welch Two Sample t-test
## 
## data:  enzima by tempo
## t = 2.1779, df = 16.185, p-value = 0.04454
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.07352186 5.26647814
## sample estimates:
## mean in group 24h  mean in group 2h 
##              6.43              3.76

and tempo would be considered significant at the 5% or 10% levels (but not, say, at the 1% significance level). In fact, accounting for the repeated-measures nature of the data amounts to recognising the lack of independence between the two samples, so we could also implement a paired-sample t-test
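A sketch of the paired test, matching the call echoed in the output below (this relies on the rows being ordered by rato within each level of tempo):

```r
# Paired t-test: measurements are matched within each rat
with(ratos, t.test(enzima[tempo == "2h"], enzima[tempo == "24h"],
                   paired = TRUE))
```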

## 
##  Paired t-test
## 
## data:  enzima[tempo == "2h"] and enzima[tempo == "24h"]
## t = -7.469, df = 9, p-value = 3.815e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.478666 -1.861334
## sample estimates:
## mean of the differences 
##                   -2.67

which is exactly the same as doing a simple t.test over the differences
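A sketch (again assuming rows are ordered by rato within each level of tempo):

```r
# One-sample t-test on the within-rat differences
with(ratos, t.test(enzima[tempo == "2h"] - enzima[tempo == "24h"]))
```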

## 
##  One Sample t-test
## 
## data:  enzima[tempo == "2h"] - enzima[tempo == "24h"]
## t = -7.469, df = 9, p-value = 3.815e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -3.478666 -1.861334
## sample estimates:
## mean of x 
##     -2.67

and therefore we see that tempo is clearly significant (i.e. the means are different at the different times). Using the information that the measurements were made within the same rats makes the test more powerful, as expected.

However, this was an exercise within the Ficha de Trabalho for ANOVA, so… let's do an ANOVA (this is like treating rato as a block, and it will be handled as a random effect - we have repeated measurements within each rato)

To define rato as a random effect, the required syntax is
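A sketch of the repeated-measures call, using the `Error()` term that produces the strata shown below (the data frame name `ratos` is an assumption):

```r
# rato enters as a random (block) effect via the Error() stratum
summary(aov(enzima ~ tempo + Error(rato), data = ratos))
```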

## 
## Error: rato
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals  9  129.5   14.39               
## 
## Error: Within
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## tempo      1  35.64   35.64   55.79 3.81e-05 ***
## Residuals  9   5.75    0.64                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the practical conclusion, i.e. that the enzyme measurements are significantly different at the two times, is the same as above.

Recall that the square of a t-distributed statistic with \(n\) degrees of freedom follows an F distribution with (1, \(n\)) degrees of freedom, so note above that the squared t-test statistic equals the F-test statistic ((-7.4690388)\(^2\) = 55.7865403).

Note that, in general, with more than 2 measurements per rato, the t-test trick would no longer be available and the only way to deal with this would be the aov way (or any other mixed-model fitting function; an ANOVA with a fixed factor and a random factor is the simplest case of a mixed model, and these can be implemented with functions more flexible than aov). For additional details, see e.g.

http://conjugateprior.org/2013/01/formulae-in-r-anova/