Exercise 1

Carry out a regression analysis (function lm) on the data in DataTP7densidade.csv, which describe the density of a species as a function of environmental variables, taking density as the dependent variable and temperature as the independent variable. Produce a suitable graphical representation of the data and of the model. Explore the results and assess whether the assumptions of the linear model are met.

Read in the data

dens <- read.csv("DataTP7densidade.csv", sep=";")

and as usual, check all is OK

str(dens)
## 'data.frame':    14 obs. of  5 variables:
##  $ Densidade  : num  51.4 72.1 53.2 83.2 57.4 66.5 98.3 74.8 92.2 97.9 ...
##  $ Turbidez   : num  0.2 1.9 0.2 10.7 6.8 10.6 9.6 6.3 10.8 9.6 ...
##  $ Salinidade : num  17.8 29.4 17.2 30.2 15.3 17.6 35.6 28.2 34.7 35.8 ...
##  $ Temperatura: num  24.6 20.7 18.5 10.6 8.9 11.1 10.6 8.8 11.9 10.8 ...
##  $ Oxigenio   : num  18.9 8 22.6 7.1 27.3 20.8 5.6 13.1 5.9 5.5 ...

We have 14 observations for 5 variables. We are told that densidade is the dependent variable, and temperatura the independent. We fit a simple regression model

reg<-lm(Densidade~Temperatura,data=dens)
summary(reg)
## 
## Call:
## lm(formula = Densidade ~ Temperatura, data = dens)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.317 -10.394   4.093  10.744  16.053 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  97.6494     9.7515  10.014 3.53e-07 ***
## Temperatura  -1.4531     0.6233  -2.331    0.038 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.3 on 12 degrees of freedom
## Multiple R-squared:  0.3117, Adjusted R-squared:  0.2544 
## F-statistic: 5.435 on 1 and 12 DF,  p-value: 0.03798

We can take a look at the model and the fitted line

with(dens,plot(Temperatura,Densidade))
abline(reg)

Note we can look at the residuals of the model

round(reg$residuals,3)
##       1       2       3       4       5       6       7       8       9 
## -10.504   4.529 -17.568   0.953 -27.317 -15.020  16.053 -10.062  11.842 
##      10      11      12      13      14 
##  15.944   7.451   6.886   3.657  13.157

We can look at the diagnostic plots for the fitted model. There is no reason to suspect that the model is not adequate, so we could use this model for predicting Densidade as a function of Temperatura. Nonetheless, note that with such a small sample size it would be almost impossible to find strong evidence against the model, unless the problems were blatant!

par(mfrow=c(2,2))
plot(reg)

The correlation \(R\) is (naturally negative, as larger temperatura leads to lower densidade)

with(dens,cor(Temperatura,Densidade))
## [1] -0.5583191

and the \(R^2\) is 0.31. Note that this value was present in the output of the summary of the model, as “Multiple R-squared”. Therefore, temperatura explains 31.17% of the variability in densidade.
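For a simple regression, the “Multiple R-squared” is just the square of the correlation coefficient, which we can easily verify:

with(dens,cor(Temperatura,Densidade))^2
summary(reg)$r.squared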

Exercise 2

Using the data from the previous exercise, carry out a multiple regression analysis with the function lm, with density as the dependent variable and all the remaining variables as independent variables. Comment on the results (which variables are important to explain density?) and check whether the assumptions are met. There are very handy functions to extract sub-components of models. On the model you use, try the functions coef, fitted, predict.lm and AIC. What is the difference between fitted and predict.lm? Based on the best model found, visually represent the relationship between the response variable and the relevant predictor(s). Also represent that same relationship if you were interested in predicting Densidade as a function of Turbidez, but with Salinidade fixed at the minimum value present in the data. Interpret the results obtained.

We can start by looking at the pairs function, which lets us visualize the pairwise relationships between the variables in a data.frame (note I use some fancy functions I found by looking into ?pairs to make the output much more slick!)

panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(dens, lower.panel = panel.smooth, upper.panel = panel.cor,
      gap=0, row1attop=FALSE,diag.panel = panel.hist)

Note that this shows that the predictor we used above, Temperatura, is in fact the one least related to Densidade, so maybe the above model was not the best possible anyway!

reg.mult<-lm(Densidade~Turbidez+Salinidade+Temperatura+Oxigenio,data=dens)
summary(reg.mult)
## 
## Call:
## lm(formula = Densidade ~ Turbidez + Salinidade + Temperatura + 
##     Oxigenio, data = dens)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1977 -1.3324 -0.7577  1.9245  4.0316 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -30.8800    37.3544  -0.827  0.42979   
## Turbidez      2.0845     0.4562   4.570  0.00135 **
## Salinidade    2.5951     0.7353   3.529  0.00642 **
## Temperatura   0.6453     0.4584   1.408  0.19281   
## Oxigenio      1.1147     0.7595   1.468  0.17622   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.079 on 9 degrees of freedom
## Multiple R-squared:  0.9761, Adjusted R-squared:  0.9655 
## F-statistic: 91.83 on 4 and 9 DF,  p-value: 2.728e-07

The conclusion is that both Turbidez and Salinidade are important to explain Densidade. Although we were told explicitly to fit a model with all variables, if some of these are not relevant, we might as well simplify the model

reg.mult2<-lm(Densidade~Turbidez+Salinidade,data=dens)
summary(reg.mult2)
## 
## Call:
## lm(formula = Densidade ~ Turbidez + Salinidade, data = dens)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6288 -1.5121 -0.7341  2.2797  4.8000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.8189     3.1332   7.921 7.17e-06 ***
## Turbidez      1.4783     0.1548   9.548 1.17e-06 ***
## Salinidade    1.5306     0.1152  13.282 4.07e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.114 on 11 degrees of freedom
## Multiple R-squared:  0.9701, Adjusted R-squared:  0.9647 
## F-statistic: 178.5 on 2 and 11 DF,  p-value: 4.13e-09

Take a look at model diagnostics

par(mfrow=c(2,2))
plot(reg.mult2)

Everything looks good, no problems are evident in the plots. Nonetheless, we could explicitly test the assumptions of the model, namely that the residuals are Gaussian.

shapiro.test(resid(reg.mult2))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(reg.mult2)
## W = 0.94903, p-value = 0.5458

There is no evidence to reject the H0 that the residuals are Gaussian.
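We might also explicitly check homoscedasticity. A minimal sketch, assuming the lmtest package is installed, using the Breusch-Pagan test (H0: the residual variance is constant):

library(lmtest)
bptest(reg.mult2)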

Note there are special functions to extract components of a fitted model. We can easily extract the coefficients of the model

coef(reg.mult2)
## (Intercept)    Turbidez  Salinidade 
##   24.818859    1.478292    1.530605

and the fitted values

fitted(reg.mult2)
##        1        2        3        4        5        6        7        8 
## 52.35928 72.62739 51.44092 86.86084 58.28949 67.42739 93.49998 77.29515 
##        9       10       11       12       13       14 
## 93.89639 93.80610 85.64682 95.37874 59.54266 86.22885

or the predicted values

predict.lm(reg.mult2)
##        1        2        3        4        5        6        7        8 
## 52.35928 72.62739 51.44092 86.86084 58.28949 67.42739 93.49998 77.29515 
##        9       10       11       12       13       14 
## 93.89639 93.80610 85.64682 95.37874 59.54266 86.22885

which is equivalent to the above. However, if we use the additional argument newdata, which MUST be a data.frame, then we can predict for any new values, say below for Salinidade values of 20 and 25.2 and Turbidez values of 15 and 30. The argument type="response" ensures you are predicting on the scale of the response. This is not required in an lm, where the link function is the identity by definition, but it is very useful for a glm or a gam, since otherwise you predict by default on the scale of the link function, which (at least most) humans have trouble relating to!!

predict.lm(reg.mult2,type="response",newdata=data.frame(Salinidade=c(20,25.2),Turbidez=c(15,30)))
##         1         2 
##  77.60533 107.73884
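predict.lm can also quantify the uncertainty of these predictions: the additional argument interval="confidence" returns a confidence interval for the mean response, while interval="prediction" returns a (wider) interval for a new observation. For example:

predict.lm(reg.mult2,newdata=data.frame(Salinidade=c(20,25.2),Turbidez=c(15,30)),interval="prediction")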

Finally, we can extract the AIC of the model

AIC(reg.mult2)
## [1] 76.15575

In fact, the last function, AIC, is often used to choose between multiple models. Let's try that here

AIC(reg.mult2,reg.mult,reg)
##           df       AIC
## reg.mult2  4  76.15575
## reg.mult   6  77.03036
## reg        3 118.06602

Not surprisingly, the simpler multiple regression model has the lowest AIC (i.e., it is the most parsimonious), and the original model with Temperatura alone is the worst.

Perhaps surprisingly, when I used Temperatura alone it was significant (say, at a significance level of 5 or 10%), but in the end it was not useful… Why? Because Temperatura is highly correlated with Turbidez (R=-0.84), and hence, once Turbidez is included, Temperatura is no longer useful. The important message is that we never really know whether there might have been a better model based on covariates that we did not collect. Imagine that you had only collected Temperatura: not knowing any better, you would assume it was a good predictor of Densidade. Once you have Turbidez, Temperatura becomes useless for predicting Densidade.
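We can verify this correlation directly:

with(dens,cor(Temperatura,Turbidez))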

Let's suppose we wanted to visualize the results of this model in a plot, just as we did with a single variable before. This becomes tricky, because a single plot cannot show more than one predictor! But there is still a way forward. We need to plot each independent variable separately, and choose a reasonable value for the variable not in the plot.

A possibility is to ignore the other variables!

regTS=lm(Densidade~Turbidez+Salinidade,data=dens)
regT=lm(Densidade~Turbidez,data=dens)
regS=lm(Densidade~Salinidade,data=dens)
par(mfrow=c(1,2))
plot(Densidade~Turbidez,data=dens)
abline(regT,col="red")
plot(Densidade~Salinidade,data=dens)
abline(regS,col="red")

but that is not sensible: each red line describes the data in its own plot, but it ignores the information in the variable not shown.

So instead we use the two-variable model, fixing the variable not in the plot at its mean, and we represent both sets of “predictions” in the next plot.

par(mfrow=c(1,2))
plot(Densidade~Turbidez,data=dens)
abline(regT,col="red")
seqT=seq(0,20,by=0.5)
predDT=predict(regTS,newdata=data.frame(Turbidez=seqT,Salinidade=mean(dens$Salinidade)))
lines(seqT,predDT,col="blue")
plot(Densidade~Salinidade,data=dens)
abline(regS,col="red")
seqS=seq(10,40,by=0.5)
predDS=predict(regTS,newdata=data.frame(Turbidez=mean(dens$Turbidez),Salinidade=seqS))
lines(seqS,predDS,col="blue")

The model accounting for the second variable (blue line) is different from the one ignoring the variable (red line). In other words, once we consider a given Turbidez, the predicted value for Densidade based on Salinidade changes, and by the same token, once we consider a given Salinidade, the predicted value for Densidade based on Turbidez changes. We make the predictions of Densidade based on Turbidez conditional on the mean value of Salinidade, and the predictions of Densidade based on Salinidade conditional on the mean value of Turbidez. Of course, we could also want to see what would happen if another value of the second variable was considered. Imagine we wanted to condition on the minimum value of the other variable.

par(mfrow=c(1,2))
plot(Densidade~Turbidez,data=dens)
abline(regT,col="red")
seqT=seq(0,20,by=0.5)
predDT=predict(regTS,newdata=data.frame(Turbidez=seqT,Salinidade=mean(dens$Salinidade)))
predDTmin=predict(regTS,newdata=data.frame(Turbidez=seqT,Salinidade=min(dens$Salinidade)))
lines(seqT,predDT,col="blue")
lines(seqT,predDTmin,col="blue",lty=2)
plot(Densidade~Salinidade,data=dens)
abline(regS,col="red")
seqS=seq(10,40,by=0.5)
predDS=predict(regTS,newdata=data.frame(Turbidez=mean(dens$Turbidez),Salinidade=seqS))
predDSmin=predict(regTS,newdata=data.frame(Turbidez=min(dens$Turbidez),Salinidade=seqS))
lines(seqS,predDS,col="blue")
lines(seqS,predDSmin,col="blue",lty=2)

The predictions are now different. They do not fit the data, but they are not wrong. They simply reflect the predicted density conditional on the minimum value of the variable not in the plot. Given that both variables positively influence density, conditioning on the minimum lowers the line (but the new line is parallel to the previous one: only the intercept changes, the slope remains the same!).
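To see why, note that for a fixed Salinidade value s the prediction line for Densidade against Turbidez is (b[1]+b[3]*s)+b[2]*Turbidez, where b are the coefficients of regTS, so changing s only shifts the intercept. A quick sketch:

b=coef(regTS)
# intercept of the Turbidez line, conditional on the mean vs the minimum Salinidade
b[1]+b[3]*mean(dens$Salinidade)
b[1]+b[3]*min(dens$Salinidade)
# the slope on Turbidez is b[2] in both cases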

Exercise 3

You estimated bird mortality at some wind farms as a function of environmental variables (DataTP7parqueseolicos.csv). Check whether the variable “altitude” is a good predictor of mortality, using a simple regression model (function lm). Characterize the relationship between the variables. Write a function that generates mortality estimates as a function of altitude (recall how to create functions, tutorial 1). Assess the performance of the regression model (for example, plot the regression object created and evaluate the assumptions of the regression model).

As usual, we begin by reading the data

peol <- read.csv("DataTP7parqueseolicos.csv", sep=";")

and making sure all went fine

str(peol)
## 'data.frame':    10 obs. of  5 variables:
##  $ mortalidade     : int  2 5 6 7 12 15 36 45 58 124
##  $ densidade.aero  : int  5 25 9 36 11 9 8 7 41 25
##  $ altitude        : int  150 236 364 489 684 789 1254 1564 1789 1987
##  $ proximidade.aero: int  3 5 6 8 9 12 25 69 154 351
##  $ estradas        : int  21 0 17 1 18 21 23 14 1 2

a summary of the data

summary(peol)
##   mortalidade     densidade.aero     altitude      proximidade.aero
##  Min.   :  2.00   Min.   : 5.00   Min.   : 150.0   Min.   :  3.0   
##  1st Qu.:  6.25   1st Qu.: 8.25   1st Qu.: 395.2   1st Qu.:  6.5   
##  Median : 13.50   Median :10.00   Median : 736.5   Median : 10.5   
##  Mean   : 31.00   Mean   :17.60   Mean   : 930.6   Mean   : 64.2   
##  3rd Qu.: 42.75   3rd Qu.:25.00   3rd Qu.:1486.5   3rd Qu.: 58.0   
##  Max.   :124.00   Max.   :41.00   Max.   :1987.0   Max.   :351.0   
##     estradas    
##  Min.   : 0.00  
##  1st Qu.: 1.25  
##  Median :15.50  
##  Mean   :11.80  
##  3rd Qu.:20.25  
##  Max.   :23.00

We can see that we have a response variable, mortalidade, and we are asked to evaluate if altitude is a good predictor using a simple regression model. We first plot the data

with(peol,plot(mortalidade~altitude))

It seems clear that as altitude increases, mortality also increases (though the relation does not seem linear; it looks more like an exponential). We can fit a regression model, and add it to the plot

lmalt=lm(mortalidade~altitude,data=peol)
with(peol,plot(mortalidade~altitude))
abline(lmalt,lty=2,col=2)

We can see the summary of the model

summary(lmalt)
## 
## Call:
## lm(formula = mortalidade ~ altitude, data = peol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.975 -10.707  -4.129   7.699  39.671 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.978726  10.139500  -1.576 0.153701    
## altitude      0.050482   0.008995   5.612 0.000503 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.1 on 8 degrees of freedom
## Multiple R-squared:  0.7975, Adjusted R-squared:  0.7721 
## F-statistic:  31.5 on 1 and 8 DF,  p-value: 0.0005029

and we can see that altitude explains about 79.75% of the variability in mortality.

The fitted model object contains all the information we need to build a function predicting mortalidade as a function of altitude. The function takes a single argument, the value of altitude, and returns the corresponding predicted value of mortality.

mypred=function(data){
  # predicted mortality = intercept + slope * altitude
  res=lmalt$coefficients[1]+lmalt$coefficients[2]*data
  return(res)
}

We can now use the new function to predict the mortality for say an altitude of 1000m

mypred(1000)
## (Intercept) 
##    34.50346

This would be represented in the plot as

with(peol,plot(mortalidade~altitude))
abline(lmalt,lty=2,col=2)
abline(v=1000,h=mypred(1000),lty=2,col=3:4)
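Since the arithmetic inside mypred is vectorized, we can also feed it several altitudes at once:

mypred(c(500,1000,1500))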

Another way of predicting for new values is to use the predict functions built into R.

new<-data.frame(altitude=c(1000))
predict.lm(lmalt,new)
##        1 
## 34.50346

Finally, we can look at the performance of the model

par(mfrow=c(2,2))
plot(lmalt)

Clearly, the model is not adequate, as there is a pattern in the residuals (a consequence of the non-linearity noted above): we have 2 positive residuals, then all the negative residuals, and finally another positive residual. We would not want to make predictions from such a model.
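As an aside (not required by the exercise), a simple way to accommodate the exponential-looking relation would be to model the logarithm of mortality; a minimal sketch (lmlogalt is just an illustrative name):

lmlogalt=lm(log(mortalidade)~altitude,data=peol)
par(mfrow=c(2,2))
plot(lmlogalt)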

Exercise 4

Carry out a multiple regression on the DataTP7parqueseolicos.csv data and interpret the results. You want to obtain a sub-model to use in other regions with wind farms. Obtain a sub-model for that purpose. (Hint: start by creating a model with all the available variables and then see which ones do not seem important.)

As we already read the data in the previous exercise, we can proceed to implement the model. But first, we plot all the variables against each other

#from the ?pairs
## put histograms on the diagonal
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
## put (absolute) correlations on the upper panels,
## with size proportional to the correlations.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(peol, lower.panel = panel.smooth, upper.panel = panel.cor,
      gap=0, row1attop=FALSE)

The very large correlation between estradas and proximidade.aero means that models with both variables might become unstable. We nonetheless fit a model with all variables.

regmul<-lm(mortalidade~., data=peol)
summary(regmul)
## 
## Call:
## lm(formula = mortalidade ~ ., data = peol)
## 
## Residuals:
##      1      2      3      4      5      6      7      8      9     10 
## -2.146  1.872 -2.155  3.056 -1.753 -1.536  7.734 -2.491 -4.593  2.010 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.418208  10.009923   0.741 0.491957    
## densidade.aero   -0.394911   0.303776  -1.300 0.250299    
## altitude          0.018310   0.004458   4.108 0.009286 ** 
## proximidade.aero  0.252193   0.029437   8.567 0.000357 ***
## estradas         -0.228630   0.446504  -0.512 0.630427    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.872 on 5 degrees of freedom
## Multiple R-squared:  0.9908, Adjusted R-squared:  0.9835 
## F-statistic:   135 on 4 and 5 DF,  p-value: 2.804e-05
par(mfrow=c(2,2))
plot(regmul)

We can check the residuals, and there is now less evidence of problems (the additional variables helped resolve the pattern in the residuals noted earlier, though strictly speaking, the response to altitude might still be non-linear!).

Given this full model, both altitude and proximidade.aero seem significant (i.e. useful to explain mortalidade). We can now check what might be the best model, e.g. using backward selection based on AIC

library(MASS)
full.model<-lm(mortalidade~.,data=peol)
best.model<-stepAIC(full.model, direction="backward")
## Start:  AIC=34.74
## mortalidade ~ densidade.aero + altitude + proximidade.aero + 
##     estradas
## 
##                    Df Sum of Sq     RSS    AIC
## - estradas          1      6.22  124.90 33.249
## <none>                           118.68 34.738
## - densidade.aero    1     40.11  158.79 35.650
## - altitude          1    400.47  519.15 47.496
## - proximidade.aero  1   1742.14 1860.81 60.262
## 
## Step:  AIC=33.25
## mortalidade ~ densidade.aero + altitude + proximidade.aero
## 
##                    Df Sum of Sq     RSS    AIC
## <none>                           124.90 33.249
## - densidade.aero    1     85.32  210.22 36.456
## - altitude          1    435.26  560.16 46.256
## - proximidade.aero  1   2465.64 2590.54 61.570
summary(best.model)
## 
## Call:
## lm(formula = mortalidade ~ densidade.aero + altitude + proximidade.aero, 
##     data = peol)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.248 -2.228 -1.510  2.738  7.114 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.665547   3.510035   0.759   0.4764    
## densidade.aero   -0.255509   0.126211  -2.024   0.0893 .  
## altitude          0.017361   0.003797   4.573   0.0038 ** 
## proximidade.aero  0.259738   0.023866  10.883 3.57e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.563 on 6 degrees of freedom
## Multiple R-squared:  0.9903, Adjusted R-squared:  0.9855 
## F-statistic: 205.1 on 3 and 6 DF,  p-value: 1.963e-06

We see that, at least by AIC, 3 variables are required to explain mortalidade. We would therefore conclude that densidade.aero, altitude and proximidade.aero are all relevant for explaining mortality.

par(mfrow=c(2,2))
plot(best.model)

Actually, with such a small number of variables, we might just fit all the models. This can be done using function bestglm in package bestglm (note that while our model is a simple linear model, the function can cope with GLMs too!). Note that the first argument, Xy, MUST contain all the variables, with the response variable in the last column.

library(bestglm)
bestGLM=bestglm(Xy=peol[,c(2:5,1)],family = gaussian,IC="AIC",RequireFullEnumerationQ=TRUE)
## Morgan-Tatar search RequireFullEnumerationQ=TRUE

The object produced by bestGLM contains a number of components.

names(bestGLM)
## [1] "BestModel"   "BestModels"  "Bestq"       "qTable"      "Subsets"    
## [6] "Title"       "ModelReport"

In this case, we get the same result as with the backwards procedure, which is reassuring; with many variables, trying all the possible combinations becomes computationally intensive.

bestGLM$BestModel
## 
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## 
## Coefficients:
##      (Intercept)    densidade.aero          altitude  proximidade.aero  
##          2.66555          -0.25551           0.01736           0.25974

Exercise 5

Apply a GLM to the DataTP7presenca data, which describe the abundance of a species as a function of some environmental variables. Use the Gamma distribution as the error distribution family (Gamma) and the inverse function as the link function (in a glm call, use the argument family=Gamma(link="inverse")). Compare the results with a Poisson GLM (in the glm call, family=poisson). Compare the predictions of the two models in a plot. What would the predicted value be in a case where estradas took the value 7? Which of the two models would you choose to model the data (see the AIC function)?

Read the data in

presenca <- read.csv("DataTP7presenca.csv", sep=";")

Look at the response variable

par(mfrow=c(1,2))
hist(presenca$presenca)
boxplot(presenca$presenca)

Fit the suggested GLM

pres.model<-glm(presenca~altitude+floresta+temperatura+estradas, data=presenca, family=Gamma(link="inverse"))
summary(pres.model)
## 
## Call:
## glm(formula = presenca ~ altitude + floresta + temperatura + 
##     estradas, family = Gamma(link = "inverse"), data = presenca)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7  
## -0.13372   0.14525   0.30615   0.20438   0.02682  -0.73211  -0.12128  
##        8         9        10  
## -0.00656   0.04631   0.03274  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.2824582  0.1578231   1.790   0.1335  
## altitude    -0.0001675  0.0001378  -1.216   0.2783  
## floresta    -0.0009144  0.0015712  -0.582   0.5858  
## temperatura -0.0090641  0.0066797  -1.357   0.2328  
## estradas     0.0207172  0.0053274   3.889   0.0115 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.1078066)
## 
##     Null deviance: 8.76092  on 9  degrees of freedom
## Residual deviance: 0.72915  on 5  degrees of freedom
## AIC: 45.853
## 
## Number of Fisher Scoring iterations: 5

We would conclude, based on this model, that only estradas might be a relevant covariate to explain presenca. Given that, we could fit a simpler model to the data

pres.model2<-glm(presenca~estradas, data=presenca, family=Gamma(link="inverse"))
summary(pres.model2)
## 
## Call:
## glm(formula = presenca ~ estradas, family = Gamma(link = "inverse"), 
##     data = presenca)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.72360  -0.10694   0.06731   0.16441   0.36233  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.050970   0.010765   4.735 0.001473 ** 
## estradas    0.018495   0.002807   6.588 0.000171 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.09651818)
## 
##     Null deviance: 8.76092  on 9  degrees of freedom
## Residual deviance: 0.98988  on 8  degrees of freedom
## AIC: 42.953
## 
## Number of Fisher Scoring iterations: 5

We could plot the estimated relation

with(presenca,plot(presenca~estradas))

It seems clear that, as the model predicts, the larger the estradas, the lower the presenca

with(presenca,plot(presenca~estradas))
estr2=seq(0,25,by=0.2)
predsG=predict.glm(pres.model2,newdata=data.frame(estradas=estr2),type="response")
lines(estr2,predsG,lty=2,col=3)

Note how the relationship is clearly non-linear on the response scale.

Actually, the response values were integers… so, could we have done it with a simpler Poisson GLM?

modelPois<-glm(presenca~estradas, data=presenca, family=poisson)
summary(modelPois)
## 
## Call:
## glm(formula = presenca ~ estradas, family = poisson, data = presenca)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.15220  -0.01242   0.02047   0.25184   0.67368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.80006    0.13646  20.520  < 2e-16 ***
## estradas    -0.10038    0.01633  -6.148 7.85e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 53.515  on 9  degrees of freedom
## Residual deviance:  2.793  on 8  degrees of freedom
## AIC: 41.465
## 
## Number of Fisher Scoring iterations: 4

We could plot the estimated relation

with(presenca,plot(presenca~estradas))

As expected, the larger the estradas, the lower the presenca. We can now add to the plot the predicted values under the two models.

estr2=seq(0,25,by=0.2)
predsP=predict.glm(modelPois,newdata=data.frame(estradas=estr2),type="response")
valest=7
predG7=predsG[estr2==valest]
predP7=predsP[estr2==valest]

When estradas is 7, the predicted value from the Gamma model would be 5.5, while from the Poisson model it would be 8.1.
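Instead of reading these values off the prediction vectors, we could also compute them directly:

predict.glm(pres.model2,newdata=data.frame(estradas=7),type="response")
predict.glm(modelPois,newdata=data.frame(estradas=7),type="response")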

with(presenca,plot(presenca~estradas))
#the Poisson predictions
lines(estr2,predsP,lty=2,col=4)
#the gama predictions
lines(estr2,predsG,lty=2,col=3)
legend("topright",legend=c("Poisson Predictions","Gamma predictions"),lty=2,col=4:3)
valest=7
abline(v=valest,lty=2,col=5)
abline(h=predG7,lty=2,col=3)
abline(h=predP7,lty=2,col=4)

How might we choose across these two? The AIC favors the Poisson, even if ever so slightly!

AIC(modelPois,pres.model2)
##             df      AIC
## modelPois    2 41.46529
## pres.model2  3 42.95326

Exercise 6

Apply a GLM to the DataTP7anchova.csv data, which refer to anchovy abundance (dependent variable) as a function of several environmental variables (independent variables). Explore the results.

Read in the data

anchova <- read.csv("DataTP7anchova.csv", sep=";")

We have 58 observations for 6 variables. We are told that anchova is the dependent variable while all the others are potentially explanatory (i.e. independent) variables. Sometimes we refer to these as covariates.

We begin by looking at the response variable (and its log10(x+1) transform)

par(mfrow=c(1,2))
hist(anchova$ANC)
loganc<-log10((anchova$ANC)+1)
hist(loganc)

We can look at the correlation across the different variables

pairs(anchova, lower.panel = panel.smooth, upper.panel = panel.cor,
      gap=0, row1attop=FALSE)

The highest correlation with the response is for Turbidity, but then again, it is not high enough to expect Turbidity alone to be a reasonable predictor.

with(anchova,plot(ANC~Turbidity))
lm1=lm(ANC~Turbidity,data=anchova)
abline(lm(ANC~Turbidity,data=anchova))

summary(lm1)
## 
## Call:
## lm(formula = ANC ~ Turbidity, data = anchova)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44504 -0.20065 -0.08559  0.18283  0.52631 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.885e-01  4.217e-02   9.213 8.17e-13 ***
## Turbidity   1.487e-04  4.492e-05   3.310  0.00163 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2382 on 56 degrees of freedom
## Multiple R-squared:  0.1637, Adjusted R-squared:  0.1487 
## F-statistic: 10.96 on 1 and 56 DF,  p-value: 0.001635

Yet, as we can see, the variable seems important to explain ANC. We can look at diagnostic plots

par(mfrow=c(2,2))
plot(lm1)

Since the residuals are not consistent with a Gaussian, we could apply a Gamma GLM. Note that the value 0 present in ANC is a problem, as a Gamma requires a strictly positive response, so 0’s are not admissible. We can use a simple trick, which is to add a small constant (like 0.001) to all data points; note this has virtually no impact on the response, besides allowing the Gamma model to be fitted. We consider both the standard link function for the Gamma, “inverse”, and the “log” link.
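As a quick check that the zero problem is real, we can count the zeros in ANC:

sum(anchova$ANC==0)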

anchmod<-glm(ANC+0.001~., data=anchova, family=Gamma(link="log"))
summary(anchmod)
## 
## Call:
## glm(formula = ANC + 0.001 ~ ., family = Gamma(link = "log"), 
##     data = anchova)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0642  -0.4833  -0.1056   0.3404   0.7205  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.329e+00  3.361e-01  -3.954 0.000233 ***
## Salinity     3.825e-02  1.985e-02   1.927 0.059454 .  
## Turbidity    3.118e-04  9.714e-05   3.210 0.002279 ** 
## Flow        -9.458e-05  1.606e-04  -0.589 0.558554    
## Temperature  1.470e-02  1.407e-02   1.045 0.300901    
## NAO          1.402e-01  1.258e-01   1.115 0.270087    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.2537669)
## 
##     Null deviance: 26.544  on 57  degrees of freedom
## Residual deviance: 22.318  on 52  degrees of freedom
## AIC: 15.988
## 
## Number of Fisher Scoring iterations: 6
anchmodI<-glm(ANC+0.001~., data=anchova, family=Gamma(link="inverse"))
summary(anchmodI)
## 
## Call:
## glm(formula = ANC + 0.001 ~ ., family = Gamma(link = "inverse"), 
##     data = anchova)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.08124  -0.47486  -0.03479   0.31995   0.75011  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.3652923  0.7537738   4.465 4.34e-05 ***
## Salinity    -0.0789998  0.0440940  -1.792 0.079013 .  
## Turbidity   -0.0005127  0.0001450  -3.537 0.000862 ***
## Flow         0.0002565  0.0003576   0.717 0.476468    
## Temperature -0.0384309  0.0291309  -1.319 0.192866    
## NAO         -0.3587082  0.2683775  -1.337 0.187178    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.2571043)
## 
##     Null deviance: 26.544  on 57  degrees of freedom
## Residual deviance: 22.601  on 52  degrees of freedom
## AIC: 16.763
## 
## Number of Fisher Scoring iterations: 5

As it stands, irrespective of the link function considered, only Turbidity seems clearly relevant to explain the numbers of anchova, though Salinity might be useful too.

anchmod<-glm(ANC+0.001~Turbidity+Salinity, data=anchova, family=Gamma(link="log"))
summary(anchmod)
## 
## Call:
## glm(formula = ANC + 0.001 ~ Turbidity + Salinity, family = Gamma(link = "log"), 
##     data = anchova)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1168  -0.5019  -0.1258   0.3641   0.7860  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.189e+00  1.402e-01  -8.478 1.46e-11 ***
## Turbidity    2.992e-04  9.477e-05   3.157  0.00259 ** 
## Salinity     4.157e-02  1.903e-02   2.184  0.03324 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.2521114)
## 
##     Null deviance: 26.544  on 57  degrees of freedom
## Residual deviance: 23.011  on 55  degrees of freedom
## AIC: 11.874
## 
## Number of Fisher Scoring iterations: 6
anchmodI<-glm(ANC+0.001~Turbidity+Salinity, data=anchova, family=Gamma(link="inverse"))
summary(anchmodI)
## 
## Call:
## glm(formula = ANC + 0.001 ~ Turbidity + Salinity, family = Gamma(link = "inverse"), 
##     data = anchova)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1334  -0.5111  -0.1441   0.3749   0.7726  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.9259334  0.3467145   8.439 1.69e-11 ***
## Turbidity   -0.0004209  0.0001304  -3.227  0.00211 ** 
## Salinity    -0.0840767  0.0421020  -1.997  0.05079 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.2590704)
## 
##     Null deviance: 26.544  on 57  degrees of freedom
## Residual deviance: 23.543  on 55  degrees of freedom
## AIC: 13.286
## 
## Number of Fisher Scoring iterations: 5

The diagnostic plots for the model (with inverse link) are

par(mfrow=c(2,2))
plot(anchmodI)

The residuals still do not seem to fit the qqplot adequately, hinting that the Gamma is not ideal either.

We could also have tried to transform the data, say using a logarithmic transformation (log10(x+1), given the 0’s)

dados.log<-cbind(anchova,loganc)
null.ancl <- glm(formula = loganc ~ 1, data = dados.log,family=gaussian(link = "identity")) 
glm.ancl <- stepAIC(null.ancl, loganc ~ Salinity + Turbidity + Flow + Temperature + NAO, data = dados.log, direction = "both")
## Start:  AIC=-132.51
## loganc ~ 1
## 
##               Df Deviance     AIC
## + Turbidity    1  0.26884 -141.10
## + Salinity     1  0.30389 -133.99
## <none>            0.32271 -132.51
## + Flow         1  0.31476 -131.95
## + Temperature  1  0.31906 -131.17
## + NAO          1  0.32250 -130.55
## 
## Step:  AIC=-141.1
## loganc ~ Turbidity
## 
##               Df Deviance     AIC
## + Salinity     1  0.25261 -142.71
## + Temperature  1  0.25843 -141.39
## <none>            0.26884 -141.10
## + Flow         1  0.25984 -141.07
## + NAO          1  0.26857 -139.16
## - Turbidity    1  0.32271 -132.51
## 
## Step:  AIC=-142.71
## loganc ~ Turbidity + Salinity
## 
##               Df Deviance     AIC
## <none>            0.25261 -142.71
## + Temperature  1  0.24627 -142.19
## + Flow         1  0.24822 -141.73
## - Salinity     1  0.26884 -141.10
## + NAO          1  0.25135 -141.00
## - Turbidity    1  0.30389 -133.99
summary(glm.ancl)
## 
## Call:
## glm(formula = loganc ~ Turbidity + Salinity, family = gaussian(link = "identity"), 
##     data = dados.log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.12816  -0.06181  -0.01001   0.05339   0.12770  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.094e-01  1.892e-02   5.781 3.62e-07 ***
## Turbidity   4.274e-05  1.279e-05   3.341   0.0015 ** 
## Salinity    4.829e-03  2.569e-03   1.880   0.0654 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.004592908)
## 
##     Null deviance: 0.32271  on 57  degrees of freedom
## Residual deviance: 0.25261  on 55  degrees of freedom
## AIC: -142.71
## 
## Number of Fisher Scoring iterations: 2

We can evaluate this model’s diagnostics

par(mfrow=c(2,2))
plot(glm.ancl)

It does seem like the residuals have heavy tails. Presumably, this would not happen with the Gamma GLM

par(mfrow=c(2,2))
plot(anchmodI)

The truth is that there are some observations that seem quite influential, namely observation 58 (outstanding in the qqplot) and 24 (large Cook’s distance). We could look at the results of models where these are excluded. You can do that yourself if you are feeling adventurous…!
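A minimal sketch of how such a refit might look, dropping rows 24 and 58 (the row numbers flagged in the diagnostic plots):

anchmodI.sub<-update(anchmodI,data=anchova[-c(24,58),])
summary(anchmodI.sub)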

Based on AIC alone,

AIC(anchmod,anchmodI,glm.ancl)
##          df        AIC
## anchmod   4   11.87442
## anchmodI  4   13.28572
## glm.ancl  4 -142.71153

Hmmm… why such different values? Well, the truth is that you cannot compare AICs across models in which the response was transformed: the transformation changes the scale (and hence the likelihood) of the data, so these comparisons are meaningless!

For comparison, had we fitted the Gaussian model to the untransformed data, the Gaussian model would be clearly preferred.

null.anc <- glm(formula = ANC+0.001 ~ 1, data = dados.log,family=gaussian(link = "identity")) 
glm.anc <- stepAIC(null.anc, ANC+0.001 ~ Salinity + Turbidity + Flow + Temperature + NAO, data = dados.log, direction = "both")
## Start:  AIC=10.5
## ANC + 0.001 ~ 1
## 
##               Df Deviance     AIC
## + Turbidity    1   3.1772  2.1396
## + Salinity     1   3.5660  8.8342
## <none>             3.7989 10.5048
## + Flow         1   3.6948 10.8928
## + Temperature  1   3.7547 11.8259
## + NAO          1   3.7957 12.4557
## 
## Step:  AIC=2.14
## ANC + 0.001 ~ Turbidity
## 
##               Df Deviance     AIC
## + Salinity     1   2.9753  0.3312
## + Temperature  1   3.0536  1.8384
## + Flow         1   3.0603  1.9657
## <none>             3.1772  2.1396
## + NAO          1   3.1734  4.0690
## - Turbidity    1   3.7989 10.5048
## 
## Step:  AIC=0.33
## ANC + 0.001 ~ Turbidity + Salinity
## 
##               Df Deviance    AIC
## <none>             2.9753 0.3312
## + Temperature  1   2.9012 0.8679
## + Flow         1   2.9171 1.1846
## + NAO          1   2.9585 2.0025
## - Salinity     1   3.1772 2.1396
## - Turbidity    1   3.5660 8.8342
AIC(anchmod,anchmodI,glm.anc)
##          df        AIC
## anchmod   4 11.8744180
## anchmodI  4 13.2857175
## glm.anc   4  0.3312284

We would most likely stick to the Gaussian model with untransformed data, even though the residuals look far from optimal (the variance seems to increase with the fitted values and the residuals do not look Gaussian, and in fact, this would probably be a can of worms to solve!).

summary(glm.anc)
## 
## Call:
## glm(formula = ANC + 0.001 ~ Turbidity + Salinity, family = gaussian(link = "identity"), 
##     data = dados.log)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.39157  -0.20702  -0.04176   0.16651   0.47556  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2925124  0.0649432   4.504 3.52e-05 ***
## Turbidity   0.0001451  0.0000439   3.304  0.00168 ** 
## Salinity    0.0170323  0.0088160   1.932  0.05852 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.05409641)
## 
##     Null deviance: 3.7989  on 57  degrees of freedom
## Residual deviance: 2.9753  on 55  degrees of freedom
## AIC: 0.33123
## 
## Number of Fisher Scoring iterations: 2
par(mfrow=c(2,2))
plot(glm.anc)

Note that the percentage of the deviance explained by the model would be

1-(glm.anc$deviance/glm.anc$null.deviance)
## [1] 0.2168037

which is really not that much. But then again, biological datasets often have much more noise than signal. This is perhaps far from satisfying, but it is what you will most often find when modelling real-life data. Things are never as clear, clean and easy as in the books.

Exercise 7

  1. Explore the data available at https://data.mendeley.com/datasets/r3xpn3mccc/2, used in a study to predict the size of beaked whale groups from their acoustic footprint. There is a good description of the data in “data4Mendeley.pdf”, but for now ignore the variables “gID”, “conf” and “cs0” as explanatory variables (in fact, cs0=cs-1, which can be interesting because there are no groups of size 0, yet a Poisson assumes 0’s exist). Use the file “modeldata.txt” to try to explain the group size (cs) as a function of the other available variables.

Read the data in

whales <- read.csv("modeldata.txt", sep="")

and check it is all OK

str(whales)
## 'data.frame':    51 obs. of  12 variables:
##  $ gID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cs       : int  2 2 3 2 5 4 1 2 4 3 ...
##  $ conf     : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ maxcount : int  740 575 5263 3214 3852 3167 1741 1378 5185 2287 ...
##  $ meancount: num  336 240 924 492 1140 ...
##  $ nhyd     : int  6 6 11 10 9 6 5 1 6 13 ...
##  $ cdur     : num  24.2 13.7 39.4 42 32 ...
##  $ nclicks  : int  2013 1438 10164 4916 10261 7490 2530 1378 9243 6818 ...
##  $ crate    : num  83.2 105.2 258 117.2 321 ...
##  $ wisk     : int  1 1 1 0 0 0 1 0 0 1 ...
##  $ direction: int  0 0 0 1 1 1 1 1 1 1 ...
##  $ cs0      : int  1 1 2 1 4 3 0 1 3 2 ...

We have 51 observations for 12 variables. We were told that cs is the dependent variable, while the others are potentially explanatory (i.e. independent) variables. Note that gID and cs0 are not explanatory variables!

We look at the response variable

table(whales$cs)
## 
##  1  2  3  4  5  6 
##  5 21 19  4  1  1

We have group size, which is an integer, so we might try a Poisson GLM. If we model group size as a function of all the other variables, surprisingly, no variable is significant!

whaleglm=glm(cs~.,data=whales[,-c(1,ncol(whales))],family=poisson)
summary(whaleglm)
## 
## Call:
## glm(formula = cs ~ ., family = poisson, data = whales[, -c(1, 
##     ncol(whales))])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.97690  -0.33728  -0.01651   0.31285   0.80632  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  7.776e-01  7.020e-01   1.108    0.268
## conf        -9.075e-03  2.419e-01  -0.038    0.970
## maxcount    -1.014e-05  1.119e-04  -0.091    0.928
## meancount   -1.339e-04  7.025e-04  -0.191    0.849
## nhyd        -5.827e-02  8.075e-02  -0.722    0.471
## cdur         7.174e-03  1.606e-02   0.447    0.655
## nclicks      1.045e-05  4.332e-05   0.241    0.809
## crate        2.653e-03  3.275e-03   0.810    0.418
## wisk        -1.605e-01  2.844e-01  -0.564    0.572
## direction    2.416e-02  2.791e-01   0.087    0.931
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18.260  on 50  degrees of freedom
## Residual deviance: 11.122  on 41  degrees of freedom
## AIC: 173.06
## 
## Number of Fisher Scoring iterations: 4

Another option might be to consider cs-1 (given that, by definition of a group, there are no groups of size 0, while the Poisson assumes 0’s are possible).

whaleglm=glm((cs-1)~.,data=whales[,-c(1,ncol(whales))],family=poisson)
summary(whaleglm)
## 
## Call:
## glm(formula = (cs - 1) ~ ., family = poisson, data = whales[, 
##     -c(1, ncol(whales))])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.62280  -0.41929  -0.05318   0.42518   1.00981  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.303e-01  9.217e-01   0.141    0.888
## conf        -1.007e-02  3.161e-01  -0.032    0.975
## maxcount    -7.431e-06  1.433e-04  -0.052    0.959
## meancount   -2.679e-04  9.414e-04  -0.285    0.776
## nhyd        -1.060e-01  1.135e-01  -0.934    0.350
## cdur         1.414e-02  2.106e-02   0.672    0.502
## nclicks      9.153e-06  5.526e-05   0.166    0.868
## crate        4.843e-03  4.369e-03   1.108    0.268
## wisk        -2.596e-01  3.733e-01  -0.695    0.487
## direction    3.786e-02  3.702e-01   0.102    0.919
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 34.198  on 50  degrees of freedom
## Residual deviance: 22.732  on 41  degrees of freedom
## AIC: 153.11
## 
## Number of Fisher Scoring iterations: 5

Still, nothing seems important!

However, when we look specifically at crate, it does seem important: its importance had been lost in the mix!

whaleglm=glm((cs-1)~crate,data=whales,family=poisson)
summary(whaleglm)
## 
## Call:
## glm(formula = (cs - 1) ~ crate, family = poisson, data = whales)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8304  -0.4809  -0.1375   0.4521   1.2735  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.0294291  0.2406810  -0.122    0.903  
## crate        0.0021919  0.0009101   2.408    0.016 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 34.198  on 50  degrees of freedom
## Residual deviance: 28.806  on 49  degrees of freedom
## AIC: 143.18
## 
## Number of Fisher Scoring iterations: 5

Could we get this using bestglm?

bestGLMw=bestglm(Xy=whales[,-c(1,2)],family = poisson,IC="AIC",RequireFullEnumerationQ=TRUE)
## Morgan-Tatar search since family is non-gaussian.
summary(bestGLMw)
## Fitting algorithm:  AIC-glm
## Best Model:
##            df deviance
## Null Model 49 26.21518
## Full Model 50 34.19845
## 
##  likelihood-ratio test - GLM
## 
## data:  H0: Null Model vs. H1: Best Fit AIC-glm
## X = 7.9833, df = 1, p-value = 0.004721
bestGLMw
## AIC
## BICq equivalent for q in (0.116528565422219, 0.783729954539495)
## Best Model:
##                  Estimate   Std. Error    z value    Pr(>|z|)
## (Intercept) -0.1932322279 0.2660331813 -0.7263463 0.467626477
## meancount    0.0006715969 0.0002338329  2.8721230 0.004077242

We end up with a different model: the exhaustive search selected meancount rather than crate, so it did not recover the model we found above! For comparison, we can fit yet another candidate model and check, via AIC, that it does not beat the simple crate-only model.

whaleglm2=glm(cs0~maxcount+nclicks+wisk+crate,data=whales,family=poisson)
summary(whaleglm2)
## 
## Call:
## glm(formula = cs0 ~ maxcount + nclicks + wisk + crate, family = poisson, 
##     data = whales)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.69286  -0.46080  -0.06297   0.39043   1.24434  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.101e-01  3.193e-01   0.345   0.7302  
## maxcount     1.695e-05  9.976e-05   0.170   0.8651  
## nclicks      1.380e-05  3.375e-05   0.409   0.6827  
## wisk        -4.542e-01  2.421e-01  -1.876   0.0607 .
## crate        1.817e-03  1.906e-03   0.953   0.3407  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 34.198  on 50  degrees of freedom
## Residual deviance: 24.404  on 46  degrees of freedom
## AIC: 144.78
## 
## Number of Fisher Scoring iterations: 5
AIC(whaleglm,whaleglm2)
##           df      AIC
## whaleglm   2 143.1803
## whaleglm2  5 144.7784

This is well beyond what you would be expected to do in Ecologia Numérica, but… we could actually use a special GLM, a zero-truncated Poisson, to model the cs response variable directly

library(VGAM)
mp7 <- vglm(cs ~ crate, family = pospoisson, data = whales)
summary(mp7)
## 
## Call:
## vglm(formula = cs ~ crate, family = pospoisson, data = whales)
## 
## Pearson residuals:
##                    Min      1Q  Median     3Q   Max
## loglink(lambda) -1.168 -0.4051 -0.1247 0.4365 1.312
## 
## Coefficients: 
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 0.4465136  0.2186045   2.043   0.0411 *
## crate       0.0017909  0.0008241   2.173   0.0298 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Name of linear predictor: loglink(lambda) 
## 
## Log-likelihood: -73.2749 on 49 degrees of freedom
## 
## Number of Fisher scoring iterations: 4 
## 
## No Hauck-Donner effect found in any of the estimates
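As with a glm, we can predict on the response scale; for instance, the expected group size at a click rate of 200 (an arbitrary value, chosen just for illustration):

predict(mp7,newdata=data.frame(crate=200),type="response")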

Exercise 8

Using the dataXY.txt data, fit a linear model, a GLM and a GAM, and choose the best model to represent the relationship that explains y as a function of x.

Read in the data

dataXY <- read.csv("data4GAM.txt", sep="")

Look at the data

with(dataXY,plot(x,y))

Now, we fit each of the models (I tried the Gamma, you can try other families for the response)

library(mgcv)
lm1=lm(y~x,data=dataXY)
glm1=glm(y~x,data=dataXY,family=Gamma(link="log"))
gam1=gam(y~s(x),data=dataXY,family=Gamma(link="log"))

We can see what these look like overlaid on the data. To get the predictions from the GLM and the GAM, we need to predict on the response scale

xs=seq(0,40,by=0.5)
predglm1=predict(glm1,newdata=data.frame(x=xs),type="response")
predgam1=predict(gam1,newdata=data.frame(x=xs),type="response")
with(dataXY,plot(x,y))
abline(lm1,lty=2,col="red")
lines(xs,predglm1,lty=2,col="green")
lines(xs,predgam1,lty=2,col="blue")

As we can see, the only model that manages to capture the non-linear relation between y and x is the GAM. We can look at the fitted smooth

plot(gam1)

and we can also look at the residuals (here, on the response scale)

plot(resid(gam1,type="response"))

and the residuals indeed look adequate. The GAM would therefore be a reasonable model to predict y from x.
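To back this choice up formally, we can compare the three models via AIC (all three were fitted to the same untransformed y, so their AICs are comparable):

AIC(lm1,glm1,gam1)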