Perform a regression analysis (function lm) on the data in DataTP7densidade.csv, concerning the density of a species as a function of environmental variables, considering density as the dependent variable and temperature as the independent one. Produce an adequate graphical representation of the data and of the model. Explore the results and assess whether the assumptions of the linear model are met.
Read in the data
dens <- read.csv("DataTP7densidade.csv", sep=";")
and as usual, check all is OK
str(dens)
## 'data.frame': 14 obs. of 5 variables:
## $ Densidade : num 51.4 72.1 53.2 83.2 57.4 66.5 98.3 74.8 92.2 97.9 ...
## $ Turbidez : num 0.2 1.9 0.2 10.7 6.8 10.6 9.6 6.3 10.8 9.6 ...
## $ Salinidade : num 17.8 29.4 17.2 30.2 15.3 17.6 35.6 28.2 34.7 35.8 ...
## $ Temperatura: num 24.6 20.7 18.5 10.6 8.9 11.1 10.6 8.8 11.9 10.8 ...
## $ Oxigenio : num 18.9 8 22.6 7.1 27.3 20.8 5.6 13.1 5.9 5.5 ...
We have 14 observations for 5 variables. We are told that densidade is the dependent variable, and temperatura the independent one. We fit a simple regression model
reg<-lm(Densidade~Temperatura,data=dens)
summary(reg)
##
## Call:
## lm(formula = Densidade ~ Temperatura, data = dens)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.317 -10.394 4.093 10.744 16.053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.6494 9.7515 10.014 3.53e-07 ***
## Temperatura -1.4531 0.6233 -2.331 0.038 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.3 on 12 degrees of freedom
## Multiple R-squared: 0.3117, Adjusted R-squared: 0.2544
## F-statistic: 5.435 on 1 and 12 DF, p-value: 0.03798
We can take a look at the model and the fitted line
with(dens,plot(Temperatura,Densidade))
abline(reg)
Note we can look at the residuals of the model
round(reg$residuals,3)
## 1 2 3 4 5 6 7 8 9
## -10.504 4.529 -17.568 0.953 -27.317 -15.020 16.053 -10.062 11.842
## 10 11 12 13 14
## 15.944 7.451 6.886 3.657 13.157
We can look at the diagnostic plots for the fitted model. There is no reason to suspect that the model is not adequate, so we could use this model for predicting Densidade as a function of Temperatura. Nonetheless, note that with such a small sample size, it would be almost impossible to find strong reasons to deem the model unsuitable, unless these were blatant!
par(mfrow=c(2,2))
plot(reg)
The correlation \(R\) is (naturally negative, as larger temperatura leads to lower densidade)
with(dens,cor(Temperatura,Densidade))
## [1] -0.5583191
and the \(R^2\) is 0.31. Note that this value was present in the output of the summary of the model, as "Multiple R-squared". Therefore, temperatura explains 31.17 % of the variability in densidade.
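Indeed, for a simple linear regression, \(R^2\) is just the square of the correlation coefficient, which we can verify directly:
# squared correlation reproduces the Multiple R-squared from summary(reg)
with(dens, cor(Temperatura, Densidade)^2)  # approximately 0.3117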
Based on the data from the previous exercise, perform a multiple regression analysis using the lm function, with density as the dependent variable and all the remaining variables as independent. Comment on the results (which variables are important to explain the density?) and check whether the assumptions are met. There are very handy functions to extract sub-components of models. On the model you use, try the functions coef, fitted, predict.lm and AIC. What is the difference between fitted and predict.lm? Based on the best model found, represent visually the relationship between the response variable and the relevant predictor(s). Also represent that same relationship as if you were interested in predicting Densidade as a function of Turbidez, but with Salinidade equal to the minimum value present in the data. Interpret the results obtained.
We can start by looking at the pairs function, which allows us to visualize the relationships between the variables in a data.frame (note I use some fancy panel functions I found by looking into ?pairs to make the output much more slick!)
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(dens, lower.panel = panel.smooth, upper.panel = panel.cor,
gap=0, row1attop=FALSE,diag.panel = panel.hist)
Note that this shows us that, in fact, the predictor we used above, Temperatura, is the one least related to Densidade, so maybe the above model was not the best possible anyway!
reg.mult<-lm(Densidade~Turbidez+Salinidade+Temperatura+Oxigenio,data=dens)
summary(reg.mult)
##
## Call:
## lm(formula = Densidade ~ Turbidez + Salinidade + Temperatura +
## Oxigenio, data = dens)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1977 -1.3324 -0.7577 1.9245 4.0316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.8800 37.3544 -0.827 0.42979
## Turbidez 2.0845 0.4562 4.570 0.00135 **
## Salinidade 2.5951 0.7353 3.529 0.00642 **
## Temperatura 0.6453 0.4584 1.408 0.19281
## Oxigenio 1.1147 0.7595 1.468 0.17622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.079 on 9 degrees of freedom
## Multiple R-squared: 0.9761, Adjusted R-squared: 0.9655
## F-statistic: 91.83 on 4 and 9 DF, p-value: 2.728e-07
The conclusion is that both Turbidez and Salinidade are important to explain Densidade. Although we were told explicitly to fit a model with all variables, if some of these are not relevant, we might as well simplify the model
reg.mult2<-lm(Densidade~Turbidez+Salinidade,data=dens)
summary(reg.mult2)
##
## Call:
## lm(formula = Densidade ~ Turbidez + Salinidade, data = dens)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6288 -1.5121 -0.7341 2.2797 4.8000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.8189 3.1332 7.921 7.17e-06 ***
## Turbidez 1.4783 0.1548 9.548 1.17e-06 ***
## Salinidade 1.5306 0.1152 13.282 4.07e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.114 on 11 degrees of freedom
## Multiple R-squared: 0.9701, Adjusted R-squared: 0.9647
## F-statistic: 178.5 on 2 and 11 DF, p-value: 4.13e-09
Take a look at model diagnostics
par(mfrow=c(2,2))
plot(reg.mult2)
Everything looks good, no problems are evident in the plots. Nonetheless, we could explicitly test the assumptions of the model, namely that the residuals are Gaussian.
shapiro.test(resid(reg.mult2))
##
## Shapiro-Wilk normality test
##
## data: resid(reg.mult2)
## W = 0.94903, p-value = 0.5458
There is no evidence to reject the H0 that the residuals are Gaussian.
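Normality is not the only assumption; homoscedasticity (constant residual variance) can also be tested formally. A minimal sketch, assuming the lmtest package is installed:
# hypothetical check: Breusch-Pagan test, H0 = constant residual variance
library(lmtest)
bptest(reg.mult2)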
Note there are special functions to extract components of a fitted model. We can easily extract the coefficients of the model
coef(reg.mult2)
## (Intercept) Turbidez Salinidade
## 24.818859 1.478292 1.530605
and the fitted values
fitted(reg.mult2)
## 1 2 3 4 5 6 7 8
## 52.35928 72.62739 51.44092 86.86084 58.28949 67.42739 93.49998 77.29515
## 9 10 11 12 13 14
## 93.89639 93.80610 85.64682 95.37874 59.54266 86.22885
or the predicted values
predict.lm(reg.mult2)
## 1 2 3 4 5 6 7 8
## 52.35928 72.62739 51.44092 86.86084 58.28949 67.42739 93.49998 77.29515
## 9 10 11 12 13 14
## 93.89639 93.80610 85.64682 95.37874 59.54266 86.22885
which is equivalent to the above; but if we use the additional argument newdata, which MUST be a data.frame, then we can predict for any new values, say below for the values of Salinidade of 20 and 25.2 and Turbidez of 15 and 30. The argument type="response" ensures you are predicting on the scale of the response. This is not required in an lm, as the link function is the identity, by definition, but is very useful for a glm or a gam, as otherwise by default you predict on the scale of the link function, which (at least most) humans have trouble relating to!!
predict.lm(reg.mult2,type="response",newdata=data.frame(Salinidade=c(20,25.2),Turbidez=c(15,30)))
## 1 2
## 77.60533 107.73884
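So that is the difference: fitted only returns predictions for the observed data, while predict.lm additionally accepts newdata. A quick sanity check of their equivalence without newdata:
# without newdata, predict.lm just returns the fitted values
all.equal(fitted(reg.mult2), predict.lm(reg.mult2))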
Finally, we can extract the AIC of the model
AIC(reg.mult2)
## [1] 76.15575
In fact, the last function, AIC, is often used to choose between multiple models. Let's try that here
AIC(reg.mult2,reg.mult,reg)
## df AIC
## reg.mult2 4 76.15575
## reg.mult 6 77.03036
## reg 3 118.06602
Not surprisingly, the simpler multiple regression model has the lowest AIC (i.e., it is the most parsimonious), and the original model with Temperatura alone is the worst.
Perhaps surprisingly, when I used Temperatura alone, it was significant (say at a significance level of 5 or 10%), but in the end it was not useful… Why? Because Temperatura is highly correlated with Turbidez (R=-0.84), and hence, once Turbidez is included, Temperatura is no longer useful. The important message is that we never really know whether there might have been a better model based on covariates that we did not collect. Imagine that you had only collected Temperatura; then, not knowing any better, you would assume it was a good predictor of Densidade. Once you have Turbidez, Temperatura becomes useless to predict Densidade.
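We can confirm that correlation directly (the pairs plot above showed its absolute value):
# Temperatura and Turbidez are strongly (negatively) correlated
with(dens, cor(Temperatura, Turbidez))  # about -0.84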
Let's suppose we wanted to visualize the results of this model in a plot, just as we did with a single variable before. This becomes tricky, because a single plot cannot show more than one predictor! But there is still a way to go: we plot each independent variable separately, and choose a reasonable value for the variable not in the plot.
A first (naive) possibility is simply to ignore the other variable!
regTS=lm(Densidade~Turbidez+Salinidade,data=dens)
regT=lm(Densidade~Turbidez,data=dens)
regS=lm(Densidade~Salinidade,data=dens)
par(mfrow=c(1,2))
plot(Densidade~Turbidez,data=dens)
abline(regT,col="red")
plot(Densidade~Salinidade,data=dens)
abline(regS,col="red")
but that is not sensible: the red line explains only the data in the plot, and it ignores the information about the variable that is not shown. So instead we use the two-variable model, and we represent both sets of "predictions" in the next plot.
par(mfrow=c(1,2))
plot(Densidade~Turbidez,data=dens)
abline(regT,col="red")
seqT=seq(0,20,by=0.5)
predDT=predict(regTS,newdata=data.frame(Turbidez=seqT,Salinidade=mean(dens$Salinidade)))
lines(seqT,predDT,col="blue")
plot(Densidade~Salinidade,data=dens)
abline(regS,col="red")
seqS=seq(10,40,by=0.5)
predDS=predict(regTS,newdata=data.frame(Turbidez=mean(dens$Turbidez),Salinidade=seqS))
lines(seqS,predDS,col="blue")
The model accounting for the second variable (blue line) is different from the one ignoring the variable (red line). In other words, once we consider a given Turbidez, the predicted value for Densidade based on Salinidade changes, and by the same token, once we consider a given Salinidade, the predicted value for Densidade based on Turbidez changes. We make the predictions of Densidade based on Turbidez conditional on the mean value of Salinidade, and the predictions of Densidade based on Salinidade conditional on the mean value of Turbidez. Of course, we could also want to see what would happen if another value of the second variable was considered. Imagine we wanted to condition on the minimum value of the other variable.
par(mfrow=c(1,2))
plot(Densidade~Turbidez,data=dens)
abline(regT,col="red")
seqT=seq(0,20,by=0.5)
predDT=predict(regTS,newdata=data.frame(Turbidez=seqT,Salinidade=mean(dens$Salinidade)))
predDTmin=predict(regTS,newdata=data.frame(Turbidez=seqT,Salinidade=min(dens$Salinidade)))
lines(seqT,predDT,col="blue")
lines(seqT,predDTmin,col="blue",lty=2)
plot(Densidade~Salinidade,data=dens)
abline(regS,col="red")
seqS=seq(10,40,by=0.5)
predDS=predict(regTS,newdata=data.frame(Turbidez=mean(dens$Turbidez),Salinidade=seqS))
predDSmin=predict(regTS,newdata=data.frame(Turbidez=min(dens$Turbidez),Salinidade=seqS))
lines(seqS,predDS,col="blue")
lines(seqS,predDSmin,col="blue",lty=2)
The predictions are now different. They do not fit the data, but they are not wrong: they simply reflect the predicted density conditional on the minimum value of the variable not in the plot. Given that both variables positively influence density, conditioning on the minimum lowers the line (but the new line is parallel to the previous one; only the intercept changes, the slope remains the same!).
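The size of that vertical shift follows directly from the model: it equals the coefficient of the conditioning variable times the difference between the two conditioning values. A minimal check:
# vertical offset between the solid and dashed blue lines (left panel)
coef(regTS)["Salinidade"] * (mean(dens$Salinidade) - min(dens$Salinidade))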
You estimated bird mortality at some wind farms, as a function of environmental variables (DataTP7parqueseolicos.csv). Check whether the variable "altitude" is a good predictor of mortality, using a simple regression model (function lm). Characterize the relationship between the variables. Write a function that generates mortality estimates as a function of altitude (recall how to create functions, tutorial 1). Assess the performance of the regression model (for example, plot the regression object you created and evaluate the assumptions of the regression model).
As usual, we begin by reading the data
peol <- read.csv("DataTP7parqueseolicos.csv", sep=";")
and making sure all went fine
str(peol)
## 'data.frame': 10 obs. of 5 variables:
## $ mortalidade : int 2 5 6 7 12 15 36 45 58 124
## $ densidade.aero : int 5 25 9 36 11 9 8 7 41 25
## $ altitude : int 150 236 364 489 684 789 1254 1564 1789 1987
## $ proximidade.aero: int 3 5 6 8 9 12 25 69 154 351
## $ estradas : int 21 0 17 1 18 21 23 14 1 2
a summary of the data
summary(peol)
## mortalidade densidade.aero altitude proximidade.aero
## Min. : 2.00 Min. : 5.00 Min. : 150.0 Min. : 3.0
## 1st Qu.: 6.25 1st Qu.: 8.25 1st Qu.: 395.2 1st Qu.: 6.5
## Median : 13.50 Median :10.00 Median : 736.5 Median : 10.5
## Mean : 31.00 Mean :17.60 Mean : 930.6 Mean : 64.2
## 3rd Qu.: 42.75 3rd Qu.:25.00 3rd Qu.:1486.5 3rd Qu.: 58.0
## Max. :124.00 Max. :41.00 Max. :1987.0 Max. :351.0
## estradas
## Min. : 0.00
## 1st Qu.: 1.25
## Median :15.50
## Mean :11.80
## 3rd Qu.:20.25
## Max. :23.00
We can see that we have a response variable, mortalidade, and we are asked to evaluate whether altitude is a good predictor using a simple regression model. We first plot the data
with(peol,plot(mortalidade~altitude))
It seems clear that as altitude increases, mortality also increases (though the relation does not seem linear; it looks more like an exponential). We can fit a regression model and add it to the plot
lmalt=lm(mortalidade~altitude,data=peol)
with(peol,plot(mortalidade~altitude))
abline(lmalt,lty=2,col=2)
We can see the summary of the model
summary(lmalt)
##
## Call:
## lm(formula = mortalidade ~ altitude, data = peol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.975 -10.707 -4.129 7.699 39.671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.978726 10.139500 -1.576 0.153701
## altitude 0.050482 0.008995 5.612 0.000503 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.1 on 8 degrees of freedom
## Multiple R-squared: 0.7975, Adjusted R-squared: 0.7721
## F-statistic: 31.5 on 1 and 8 DF, p-value: 0.0005029
and we can see that altitude explains 79.7466% of the variability in mortality.
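That unrounded percentage comes straight from the stored R-squared component of the summary:
# R-squared extracted from the summary object, as a percentage
100 * summary(lmalt)$r.squared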
The fitted model object contains all the information we need to build a function predicting mortalidade as a function of altitude. The function takes a single argument, the value of altitude, and returns the corresponding predicted value of mortality.
mypred=function(data){
  # predicted mortalidade = intercept + slope * altitude
  res=lmalt$coefficients[1]+lmalt$coefficients[2]*data
  return(res)
}
We can now use the new function to predict the mortality for, say, an altitude of 1000 m
mypred(1000)
## (Intercept)
## 34.50346
This would be represented in the plot as
with(peol,plot(mortalidade~altitude))
abline(lmalt,lty=2,col=2)
abline(v=1000,h=mypred(1000),lty=2,col=3:4)
Another way of predicting for new values is to use the predict functions built into R.
new<-data.frame(altitude=c(1000))
predict.lm(lmalt,new)
## 1
## 34.50346
Finally, we can look at the performance of the model
par(mfrow=c(2,2))
plot(lmalt)
Clearly, the model is not adequate, as there is a pattern in the residuals (note this is a consequence of the non-linearity noted above): we have 2 positive residuals, then all the negative residuals, and finally another positive residual. We would not want to make predictions from such a model.
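Given the exponential-looking relation, one hedged option would be a linear model for the log of the response (a sketch only, not part of the original analysis; its fit would need checking in the same way):
# hypothetical alternative: model the response on the log scale
lmlog <- lm(log(mortalidade) ~ altitude, data = peol)
par(mfrow = c(2, 2))
plot(lmlog)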
Perform a multiple regression on the DataTP7parqueseolicos.csv data and interpret the results. Suppose you want to obtain a sub-model to use in other regions with wind farms. Obtain a sub-model for that purpose. (Hint: start by creating a model with all the available variables and then check which ones do not seem important.)
As we already read the data in the previous exercise, we can proceed to implement the model. But before that, we plot all the variables against each other
# panel functions below are from the ?pairs examples
## put histograms on the diagonal
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
## put (absolute) correlations on the upper panels,
## with size proportional to the correlations.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(peol, lower.panel = panel.smooth, upper.panel = panel.cor,
gap=0, row1attop=FALSE)
The very large correlation between estradas and proximidade.aero means that models with both variables might become unstable.
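That collinearity could be quantified with variance inflation factors; a minimal sketch, assuming the car package is installed:
# hypothetical collinearity check via variance inflation factors
library(car)
vif(lm(mortalidade ~ ., data = peol))
We nonetheless fit a model with all variables.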
regmul<-lm(mortalidade~., data=peol)
summary(regmul)
##
## Call:
## lm(formula = mortalidade ~ ., data = peol)
##
## Residuals:
## 1 2 3 4 5 6 7 8 9 10
## -2.146 1.872 -2.155 3.056 -1.753 -1.536 7.734 -2.491 -4.593 2.010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.418208 10.009923 0.741 0.491957
## densidade.aero -0.394911 0.303776 -1.300 0.250299
## altitude 0.018310 0.004458 4.108 0.009286 **
## proximidade.aero 0.252193 0.029437 8.567 0.000357 ***
## estradas -0.228630 0.446504 -0.512 0.630427
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.872 on 5 degrees of freedom
## Multiple R-squared: 0.9908, Adjusted R-squared: 0.9835
## F-statistic: 135 on 4 and 5 DF, p-value: 2.804e-05
par(mfrow=c(2,2))
plot(regmul)
We can check the residuals, and there is now less evidence of problems (the additional variables helped solve the lack of independence - but strictly speaking, the response to altitude might still be non-linear!).
Given this full model, both altitude and proximidade.aero seem significant (i.e. useful to explain mortalidade). We can now check what might be the best model, e.g. using backward selection based on AIC
library(MASS)
full.model<-lm(mortalidade~.,data=peol)
best.model<-stepAIC(full.model, direction="backward")
## Start: AIC=34.74
## mortalidade ~ densidade.aero + altitude + proximidade.aero +
## estradas
##
## Df Sum of Sq RSS AIC
## - estradas 1 6.22 124.90 33.249
## <none> 118.68 34.738
## - densidade.aero 1 40.11 158.79 35.650
## - altitude 1 400.47 519.15 47.496
## - proximidade.aero 1 1742.14 1860.81 60.262
##
## Step: AIC=33.25
## mortalidade ~ densidade.aero + altitude + proximidade.aero
##
## Df Sum of Sq RSS AIC
## <none> 124.90 33.249
## - densidade.aero 1 85.32 210.22 36.456
## - altitude 1 435.26 560.16 46.256
## - proximidade.aero 1 2465.64 2590.54 61.570
summary(best.model)
##
## Call:
## lm(formula = mortalidade ~ densidade.aero + altitude + proximidade.aero,
## data = peol)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.248 -2.228 -1.510 2.738 7.114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.665547 3.510035 0.759 0.4764
## densidade.aero -0.255509 0.126211 -2.024 0.0893 .
## altitude 0.017361 0.003797 4.573 0.0038 **
## proximidade.aero 0.259738 0.023866 10.883 3.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.563 on 6 degrees of freedom
## Multiple R-squared: 0.9903, Adjusted R-squared: 0.9855
## F-statistic: 205.1 on 3 and 6 DF, p-value: 1.963e-06
We see that, in fact, at least considering AIC, 3 variables are required to explain mortalidade. We would therefore conclude that 3 variables, densidade.aero, altitude and proximidade.aero, are relevant for explaining mortality.
par(mfrow=c(2,2))
plot(best.model)
Actually, with such a small number of variables, we might just fit all possible models. This can be done using function bestglm in package bestglm (note that while the model here is a simple linear model, the function can cope with GLMs too!). Note that the first argument MUST contain all the variables, and the last column needs to be the response variable.
library(bestglm)
bestGLM=bestglm(Xy=peol[,c(2:5,1)],family = gaussian,IC="AIC",RequireFullEnumerationQ=TRUE)
## Morgan-Tatar search RequireFullEnumerationQ=TRUE
The object produced by bestglm contains a number of components.
names(bestGLM)
## [1] "BestModel" "BestModels" "Bestq" "qTable" "Subsets"
## [6] "Title" "ModelReport"
In this case, we get the same result as with the backward procedure, which is reassuring; for many variables, fitting all possible combinations becomes computationally intensive.
bestGLM$BestModel
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Coefficients:
## (Intercept) densidade.aero altitude proximidade.aero
## 2.66555 -0.25551 0.01736 0.25974
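If we want more than the single best model, the ranking of the top candidate subsets is also stored in the object (output not shown here):
# ranked list of the best candidate models found by the exhaustive search
bestGLM$BestModels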
Fit a GLM to the DataTP7presenca data, which describe the abundance of a species as a function of some environmental variables. Use the Gamma distribution as the error family (Gamma) and the inverse function as the link function (inverse; in the glm call, use the argument family=Gamma(link="inverse")). Compare the results with a Poisson GLM (in the glm call, family=poisson). Compare the predictions of the two models in a plot. What would the predicted value be in a case where estradas took the value 7? Which of the two models would you choose to model the data (see the function AIC)?
Read the data in
presenca <- read.csv("DataTP7presenca.csv", sep=";")
Look at the response variable
par(mfrow=c(1,2))
hist(presenca$presenca)
boxplot(presenca$presenca)
Fit the suggested GLM
pres.model<-glm(presenca~altitude+floresta+temperatura+estradas, data=presenca, family=Gamma(link="inverse"))
summary(pres.model)
##
## Call:
## glm(formula = presenca ~ altitude + floresta + temperatura +
## estradas, family = Gamma(link = "inverse"), data = presenca)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7
## -0.13372 0.14525 0.30615 0.20438 0.02682 -0.73211 -0.12128
## 8 9 10
## -0.00656 0.04631 0.03274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2824582 0.1578231 1.790 0.1335
## altitude -0.0001675 0.0001378 -1.216 0.2783
## floresta -0.0009144 0.0015712 -0.582 0.5858
## temperatura -0.0090641 0.0066797 -1.357 0.2328
## estradas 0.0207172 0.0053274 3.889 0.0115 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.1078066)
##
## Null deviance: 8.76092 on 9 degrees of freedom
## Residual deviance: 0.72915 on 5 degrees of freedom
## AIC: 45.853
##
## Number of Fisher Scoring iterations: 5
We would conclude, based on this model, that only estradas might be a relevant covariate to explain presenca. Given that, we could fit a simpler model to the data
pres.model2<-glm(presenca~estradas, data=presenca, family=Gamma(link="inverse"))
summary(pres.model2)
##
## Call:
## glm(formula = presenca ~ estradas, family = Gamma(link = "inverse"),
## data = presenca)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.72360 -0.10694 0.06731 0.16441 0.36233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.050970 0.010765 4.735 0.001473 **
## estradas 0.018495 0.002807 6.588 0.000171 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.09651818)
##
## Null deviance: 8.76092 on 9 degrees of freedom
## Residual deviance: 0.98988 on 8 degrees of freedom
## AIC: 42.953
##
## Number of Fisher Scoring iterations: 5
We could plot the estimated relation
with(presenca,plot(presenca~estradas))
It seems clear that, as the model predicts, the larger the estradas, the lower the presenca:
with(presenca,plot(presenca~estradas))
estr2=seq(0,25,by=0.2)
predsG=predict.glm(pres.model2,newdata=data.frame(estradas=estr2),type="response")
lines(estr2,predsG,lty=2,col=3)
Note how the relationship seems to be clearly non-linear on the response scale.
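That curvature is a direct consequence of the inverse link: the fitted mean is \(\mu = 1/(\beta_0 + \beta_1 \times \text{estradas})\), a hyperbola on the response scale rather than a straight line.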
Actually, the response values were integers… so, could we have done it with a simpler Poisson GLM?
modelPois<-glm(presenca~estradas, data=presenca, family=poisson)
summary(modelPois)
##
## Call:
## glm(formula = presenca ~ estradas, family = poisson, data = presenca)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.15220 -0.01242 0.02047 0.25184 0.67368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.80006 0.13646 20.520 < 2e-16 ***
## estradas -0.10038 0.01633 -6.148 7.85e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 53.515 on 9 degrees of freedom
## Residual deviance: 2.793 on 8 degrees of freedom
## AIC: 41.465
##
## Number of Fisher Scoring iterations: 4
We could plot the estimated relation
with(presenca,plot(presenca~estradas))
As expected, the larger the estradas, the lower the presenca. We can now add to the plot the predicted values under the two models.
estr2=seq(0,25,by=0.2)
predsP=predict.glm(modelPois,newdata=data.frame(estradas=estr2),type="response")
valest=7
predG7=predsG[estr2==valest]
predP7=predsP[estr2==valest]
When estradas is 7, the predicted value from the Gamma model would be 5.5, while from the Poisson model it would be 8.1.
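These values can be double-checked by inverting each link function by hand (a quick sanity sketch):
# Gamma GLM with inverse link: predicted mean = 1/(intercept + slope*7)
1/(coef(pres.model2)[1] + coef(pres.model2)[2]*7)
# Poisson GLM with log link: predicted mean = exp(intercept + slope*7)
exp(coef(modelPois)[1] + coef(modelPois)[2]*7)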
with(presenca,plot(presenca~estradas))
#the Poisson predictions
lines(estr2,predsP,lty=2,col=4)
#the gama predictions
lines(estr2,predsG,lty=2,col=3)
legend("topright",legend=c("Poisson Predictions","Gamma predictions"),lty=2,col=4:3)
valest=7
abline(v=valest,lty=2,col=5)
abline(h=predG7,lty=2,col=3)
abline(h=predP7,lty=2,col=4)
How might we choose between these two? The AIC favors the Poisson, even if ever so slightly!
AIC(modelPois,pres.model2)
## df AIC
## modelPois 2 41.46529
## pres.model2 3 42.95326
Fit a GLM to the DataTP7anchova.csv data, concerning the abundance of anchovy (dependent variable) as a function of several environmental variables (independent variables). Explore the results.
Read in the data
anchova <- read.csv("DataTP7anchova.csv", sep=";")
We have 58 observations for 6 variables. We are told that anchova (the ANC column) is the dependent variable while all the others are potentially explanatory (i.e. independent) variables. Sometimes we refer to these as covariates.
We begin by looking at the response variable (and its log10(x+1) transform)
par(mfrow=c(1,2))
hist(anchova$ANC)
loganc<-log10((anchova$ANC)+1)
hist(loganc)
We can look at the correlation across the different variables
pairs(anchova, lower.panel = panel.smooth, upper.panel = panel.cor,
gap=0, row1attop=FALSE)
The highest correlation is with Turbidity, but then again, it is not high enough to expect it to be a reasonable predictor on its own.
with(anchova,plot(ANC~Turbidity))
lm1=lm(ANC~Turbidity,data=anchova)
abline(lm(ANC~Turbidity,data=anchova))
summary(lm1)
##
## Call:
## lm(formula = ANC ~ Turbidity, data = anchova)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44504 -0.20065 -0.08559 0.18283 0.52631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.885e-01 4.217e-02 9.213 8.17e-13 ***
## Turbidity 1.487e-04 4.492e-05 3.310 0.00163 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2382 on 56 degrees of freedom
## Multiple R-squared: 0.1637, Adjusted R-squared: 0.1487
## F-statistic: 10.96 on 1 and 56 DF, p-value: 0.001635
Yet, as we can see, the variable seems important to explain ANC. We can look at diagnostic plots
par(mfrow=c(2,2))
plot(lm1)
Since the residuals are not consistent with a Gaussian, we could apply a Gamma GLM. Note that the value 0 present in ANC is a problem, as a Gamma strictly expects a positive response, so 0's are not admissible. We can use a simple trick, which is to add a small constant (like 0.001) to all data points; note this has virtually no impact on the response, besides allowing the Gamma model to be fitted. We consider both the standard link function for the Gamma, "inverse", and the "log" link.
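Before fitting, we can confirm that zeros are indeed present (a minimal check; the exact count is not shown in the original):
# how many exact zeros does the response contain?
sum(anchova$ANC == 0)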
anchmod<-glm(ANC+0.001~., data=anchova, family=Gamma(link="log"))
summary(anchmod)
##
## Call:
## glm(formula = ANC + 0.001 ~ ., family = Gamma(link = "log"),
## data = anchova)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0642 -0.4833 -0.1056 0.3404 0.7205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.329e+00 3.361e-01 -3.954 0.000233 ***
## Salinity 3.825e-02 1.985e-02 1.927 0.059454 .
## Turbidity 3.118e-04 9.714e-05 3.210 0.002279 **
## Flow -9.458e-05 1.606e-04 -0.589 0.558554
## Temperature 1.470e-02 1.407e-02 1.045 0.300901
## NAO 1.402e-01 1.258e-01 1.115 0.270087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.2537669)
##
## Null deviance: 26.544 on 57 degrees of freedom
## Residual deviance: 22.318 on 52 degrees of freedom
## AIC: 15.988
##
## Number of Fisher Scoring iterations: 6
anchmodI<-glm(ANC+0.001~., data=anchova, family=Gamma(link="inverse"))
summary(anchmodI)
##
## Call:
## glm(formula = ANC + 0.001 ~ ., family = Gamma(link = "inverse"),
## data = anchova)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.08124 -0.47486 -0.03479 0.31995 0.75011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3652923 0.7537738 4.465 4.34e-05 ***
## Salinity -0.0789998 0.0440940 -1.792 0.079013 .
## Turbidity -0.0005127 0.0001450 -3.537 0.000862 ***
## Flow 0.0002565 0.0003576 0.717 0.476468
## Temperature -0.0384309 0.0291309 -1.319 0.192866
## NAO -0.3587082 0.2683775 -1.337 0.187178
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.2571043)
##
## Null deviance: 26.544 on 57 degrees of freedom
## Residual deviance: 22.601 on 52 degrees of freedom
## AIC: 16.763
##
## Number of Fisher Scoring iterations: 5
As it stands, irrespective of the link function considered, only Turbidity seems relevant to explain the numbers of anchova, while Salinity might be useful too.
anchmod<-glm(ANC+0.001~Turbidity+Salinity, data=anchova, family=Gamma(link="log"))
summary(anchmod)
##
## Call:
## glm(formula = ANC + 0.001 ~ Turbidity + Salinity, family = Gamma(link = "log"),
## data = anchova)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1168 -0.5019 -0.1258 0.3641 0.7860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.189e+00 1.402e-01 -8.478 1.46e-11 ***
## Turbidity 2.992e-04 9.477e-05 3.157 0.00259 **
## Salinity 4.157e-02 1.903e-02 2.184 0.03324 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.2521114)
##
## Null deviance: 26.544 on 57 degrees of freedom
## Residual deviance: 23.011 on 55 degrees of freedom
## AIC: 11.874
##
## Number of Fisher Scoring iterations: 6
anchmodI<-glm(ANC+0.001~Turbidity+Salinity, data=anchova, family=Gamma(link="inverse"))
summary(anchmodI)
##
## Call:
## glm(formula = ANC + 0.001 ~ Turbidity + Salinity, family = Gamma(link = "inverse"),
## data = anchova)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1334 -0.5111 -0.1441 0.3749 0.7726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9259334 0.3467145 8.439 1.69e-11 ***
## Turbidity -0.0004209 0.0001304 -3.227 0.00211 **
## Salinity -0.0840767 0.0421020 -1.997 0.05079 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.2590704)
##
## Null deviance: 26.544 on 57 degrees of freedom
## Residual deviance: 23.543 on 55 degrees of freedom
## AIC: 13.286
##
## Number of Fisher Scoring iterations: 5
The diagnostic plots for the model (with Inverse link) are
par(mfrow=c(2,2))
plot(anchmodI)
The residuals still do not seem to follow the qqplot adequately, hinting that the Gamma is not ideal either.
We could also have tried to transform the data, say using a logarithmic transformation (log10(x+1), given the 0's)
dados.log<-cbind(anchova,loganc)
null.ancl <- glm(formula = loganc ~ 1, data = dados.log,family=gaussian(link = "identity"))
glm.ancl <- stepAIC(null.ancl, loganc ~ Salinity + Turbidity + Flow + Temperature + NAO, data = dados.log, direction = "both")
## Start: AIC=-132.51
## loganc ~ 1
##
## Df Deviance AIC
## + Turbidity 1 0.26884 -141.10
## + Salinity 1 0.30389 -133.99
## <none> 0.32271 -132.51
## + Flow 1 0.31476 -131.95
## + Temperature 1 0.31906 -131.17
## + NAO 1 0.32250 -130.55
##
## Step: AIC=-141.1
## loganc ~ Turbidity
##
## Df Deviance AIC
## + Salinity 1 0.25261 -142.71
## + Temperature 1 0.25843 -141.39
## <none> 0.26884 -141.10
## + Flow 1 0.25984 -141.07
## + NAO 1 0.26857 -139.16
## - Turbidity 1 0.32271 -132.51
##
## Step: AIC=-142.71
## loganc ~ Turbidity + Salinity
##
## Df Deviance AIC
## <none> 0.25261 -142.71
## + Temperature 1 0.24627 -142.19
## + Flow 1 0.24822 -141.73
## - Salinity 1 0.26884 -141.10
## + NAO 1 0.25135 -141.00
## - Turbidity 1 0.30389 -133.99
summary(glm.ancl)
##
## Call:
## glm(formula = loganc ~ Turbidity + Salinity, family = gaussian(link = "identity"),
## data = dados.log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.12816 -0.06181 -0.01001 0.05339 0.12770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.094e-01 1.892e-02 5.781 3.62e-07 ***
## Turbidity 4.274e-05 1.279e-05 3.341 0.0015 **
## Salinity 4.829e-03 2.569e-03 1.880 0.0654 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.004592908)
##
## Null deviance: 0.32271 on 57 degrees of freedom
## Residual deviance: 0.25261 on 55 degrees of freedom
## AIC: -142.71
##
## Number of Fisher Scoring iterations: 2
We can evaluate this model's diagnostics
par(mfrow=c(2,2))
plot(glm.ancl)
It does seem like the residuals have heavy tails. Presumably, this would not happen with the Gamma GLM
par(mfrow=c(2,2))
plot(anchmodI)
The truth is that there are some observations that seem quite influential, namely observations 58 (outstanding in the qqplot) and 24 (large Cook's distance). We could look at the results of models in which these were excluded. You can do that yourself if you are feeling adventurous…!
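A sketch of that check, for the adventurous (hypothetical; the refitted results are not shown in the original):
# hypothetical refit of the Gamma (inverse link) model without rows 24 and 58
anchmodI.sub <- update(anchmodI, data = anchova[-c(24, 58), ])
summary(anchmodI.sub)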
Based on AIC alone,
AIC(anchmod,anchmodI,glm.ancl)
## df AIC
## anchmod 4 11.87442
## anchmodI 4 13.28572
## glm.ancl 4 -142.71153
hummm… why such different values? Well, the truth is that you cannot compare AICs between models in which the response has been transformed and models in which it has not (the likelihoods are computed on different scales), so this comparison is meaningless!
For comparison, had we fitted the Gaussian model to the untransformed data, it would be clearly preferred.
null.anc <- glm(formula = ANC+0.001 ~ 1, data = dados.log,family=gaussian(link = "identity"))
glm.anc <- stepAIC(null.anc, ANC+0.001 ~ Salinity + Turbidity + Flow + Temperature + NAO, data = dados.log, direction = "both")
## Start: AIC=10.5
## ANC + 0.001 ~ 1
##
## Df Deviance AIC
## + Turbidity 1 3.1772 2.1396
## + Salinity 1 3.5660 8.8342
## <none> 3.7989 10.5048
## + Flow 1 3.6948 10.8928
## + Temperature 1 3.7547 11.8259
## + NAO 1 3.7957 12.4557
##
## Step: AIC=2.14
## ANC + 0.001 ~ Turbidity
##
## Df Deviance AIC
## + Salinity 1 2.9753 0.3312
## + Temperature 1 3.0536 1.8384
## + Flow 1 3.0603 1.9657
## <none> 3.1772 2.1396
## + NAO 1 3.1734 4.0690
## - Turbidity 1 3.7989 10.5048
##
## Step: AIC=0.33
## ANC + 0.001 ~ Turbidity + Salinity
##
## Df Deviance AIC
## <none> 2.9753 0.3312
## + Temperature 1 2.9012 0.8679
## + Flow 1 2.9171 1.1846
## + NAO 1 2.9585 2.0025
## - Salinity 1 3.1772 2.1396
## - Turbidity 1 3.5660 8.8342
AIC(anchmod,anchmodI,glm.anc)
## df AIC
## anchmod 4 11.8744180
## anchmodI 4 13.2857175
## glm.anc 4 0.3312284
We would most likely stick to the Gaussian model with untransformed data, even though the residuals look far from optimal (the variance seems to increase with the fitted values and the residuals do not look Gaussian; in fact, this would probably be a can of worms to solve!).
summary(glm.anc)
##
## Call:
## glm(formula = ANC + 0.001 ~ Turbidity + Salinity, family = gaussian(link = "identity"),
## data = dados.log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.39157 -0.20702 -0.04176 0.16651 0.47556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2925124 0.0649432 4.504 3.52e-05 ***
## Turbidity 0.0001451 0.0000439 3.304 0.00168 **
## Salinity 0.0170323 0.0088160 1.932 0.05852 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.05409641)
##
## Null deviance: 3.7989 on 57 degrees of freedom
## Residual deviance: 2.9753 on 55 degrees of freedom
## AIC: 0.33123
##
## Number of Fisher Scoring iterations: 2
par(mfrow=c(2,2))
plot(glm.anc)
Note that the percentage of the deviance explained by the model would be
1-(glm.anc$deviance/glm.anc$null.deviance)
## [1] 0.2168037
which is really not that much. But then again, biological datasets often have much more noise than signal. This is perhaps far from satisfying, but it is likely what you will most often find in real-life models. Things are never as clear, clean and easy as in the books.
Read the data in
whales <- read.csv("modeldata.txt", sep="")
and check it is all OK
str(whales)
## 'data.frame': 51 obs. of 12 variables:
## $ gID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cs : int 2 2 3 2 5 4 1 2 4 3 ...
## $ conf : int 2 1 1 1 1 1 1 1 1 1 ...
## $ maxcount : int 740 575 5263 3214 3852 3167 1741 1378 5185 2287 ...
## $ meancount: num 336 240 924 492 1140 ...
## $ nhyd : int 6 6 11 10 9 6 5 1 6 13 ...
## $ cdur : num 24.2 13.7 39.4 42 32 ...
## $ nclicks : int 2013 1438 10164 4916 10261 7490 2530 1378 9243 6818 ...
## $ crate : num 83.2 105.2 258 117.2 321 ...
## $ wisk : int 1 1 1 0 0 0 1 0 0 1 ...
## $ direction: int 0 0 0 1 1 1 1 1 1 1 ...
## $ cs0 : int 1 1 2 1 4 3 0 1 3 2 ...
We have 51 observations for 12 variables. We were told that cs is the dependent variable while all the others are potentially explanatory (i.e. independent) variables. Note that gID and cs0 are not explanatory variables!
We look at the response variable
table(whales$cs)
##
## 1 2 3 4 5 6
## 5 21 19 4 1 1
We have group size, which is an integer, so we might try a Poisson GLM. If we model group size as a function of all the other variables, surprisingly, no variable is significant!
whaleglm=glm(cs~.,data=whales[,-c(1,ncol(whales))],family=poisson)
summary(whaleglm)
##
## Call:
## glm(formula = cs ~ ., family = poisson, data = whales[, -c(1,
## ncol(whales))])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.97690 -0.33728 -0.01651 0.31285 0.80632
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.776e-01 7.020e-01 1.108 0.268
## conf -9.075e-03 2.419e-01 -0.038 0.970
## maxcount -1.014e-05 1.119e-04 -0.091 0.928
## meancount -1.339e-04 7.025e-04 -0.191 0.849
## nhyd -5.827e-02 8.075e-02 -0.722 0.471
## cdur 7.174e-03 1.606e-02 0.447 0.655
## nclicks 1.045e-05 4.332e-05 0.241 0.809
## crate 2.653e-03 3.275e-03 0.810 0.418
## wisk -1.605e-01 2.844e-01 -0.564 0.572
## direction 2.416e-02 2.791e-01 0.087 0.931
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18.260 on 50 degrees of freedom
## Residual deviance: 11.122 on 41 degrees of freedom
## AIC: 173.06
##
## Number of Fisher Scoring iterations: 4
Another option might be to consider cs-1 (given that, by definition of a group, there are no groups of size 0, but the Poisson assumes 0's are possible).
whaleglm=glm((cs-1)~.,data=whales[,-c(1,ncol(whales))],family=poisson)
summary(whaleglm)
##
## Call:
## glm(formula = (cs - 1) ~ ., family = poisson, data = whales[,
## -c(1, ncol(whales))])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.62280 -0.41929 -0.05318 0.42518 1.00981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.303e-01 9.217e-01 0.141 0.888
## conf -1.007e-02 3.161e-01 -0.032 0.975
## maxcount -7.431e-06 1.433e-04 -0.052 0.959
## meancount -2.679e-04 9.414e-04 -0.285 0.776
## nhyd -1.060e-01 1.135e-01 -0.934 0.350
## cdur 1.414e-02 2.106e-02 0.672 0.502
## nclicks 9.153e-06 5.526e-05 0.166 0.868
## crate 4.843e-03 4.369e-03 1.108 0.268
## wisk -2.596e-01 3.733e-01 -0.695 0.487
## direction 3.786e-02 3.702e-01 0.102 0.919
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 34.198 on 50 degrees of freedom
## Residual deviance: 22.732 on 41 degrees of freedom
## AIC: 153.11
##
## Number of Fisher Scoring iterations: 5
Still, nothing seems important!
However, when we look specifically at crate, it does seem important - its importance had been lost in the mix!
whaleglm=glm((cs-1)~crate,data=whales,family=poisson)
summary(whaleglm)
##
## Call:
## glm(formula = (cs - 1) ~ crate, family = poisson, data = whales)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8304 -0.4809 -0.1375 0.4521 1.2735
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0294291 0.2406810 -0.122 0.903
## crate 0.0021919 0.0009101 2.408 0.016 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 34.198 on 50 degrees of freedom
## Residual deviance: 28.806 on 49 degrees of freedom
## AIC: 143.18
##
## Number of Fisher Scoring iterations: 5
Could we get this using bestglm?
bestGLMw=bestglm(Xy=whales[,-c(1,2)],family = poisson,IC="AIC",RequireFullEnumerationQ=TRUE)
## Morgan-Tatar search since family is non-gaussian.
summary(bestGLMw)
## Fitting algorithm: AIC-glm
## Best Model:
## df deviance
## Null Model 49 26.21518
## Full Model 50 34.19845
##
## likelihood-ratio test - GLM
##
## data: H0: Null Model vs. H1: Best Fit AIC-glm
## X = 7.9833, df = 1, p-value = 0.004721
bestGLMw
## AIC
## BICq equivalent for q in (0.116528565422219, 0.783729954539495)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1932322279 0.2660331813 -0.7263463 0.467626477
## meancount 0.0006715969 0.0002338329 2.8721230 0.004077242
We finish with a different model (the automated search failed to find the best model!)
whaleglm2=glm(cs0~maxcount+nclicks+wisk+crate,data=whales,family=poisson)
summary(whaleglm2)
##
## Call:
## glm(formula = cs0 ~ maxcount + nclicks + wisk + crate, family = poisson,
## data = whales)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.69286 -0.46080 -0.06297 0.39043 1.24434
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.101e-01 3.193e-01 0.345 0.7302
## maxcount 1.695e-05 9.976e-05 0.170 0.8651
## nclicks 1.380e-05 3.375e-05 0.409 0.6827
## wisk -4.542e-01 2.421e-01 -1.876 0.0607 .
## crate 1.817e-03 1.906e-03 0.953 0.3407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 34.198 on 50 degrees of freedom
## Residual deviance: 24.404 on 46 degrees of freedom
## AIC: 144.78
##
## Number of Fisher Scoring iterations: 5
AIC(whaleglm,whaleglm2)
## df AIC
## whaleglm 2 143.1803
## whaleglm2 5 144.7784
This is completely beyond what you would be expected to do in Ecologia Numérica, but… we could actually use a special GLM, a truncated Poisson, to model the cs response variable directly
library(VGAM)
mp7 <- vglm(cs ~ crate, family = pospoisson, data = whales)
summary(mp7)
##
## Call:
## vglm(formula = cs ~ crate, family = pospoisson, data = whales)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## loglink(lambda) -1.168 -0.4051 -0.1247 0.4365 1.312
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4465136 0.2186045 2.043 0.0411 *
## crate 0.0017909 0.0008241 2.173 0.0298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Name of linear predictor: loglink(lambda)
##
## Log-likelihood: -73.2749 on 49 degrees of freedom
##
## Number of Fisher scoring iterations: 4
##
## No Hauck-Donner effect found in any of the estimates
Using the dataXY.txt data, fit a linear model, a GLM and a GAM, and choose the best model to represent the relationship explaining y as a function of x.
Read in the data
dataXY <- read.csv("data4GAM.txt", sep="")
Look at the data
with(dataXY,plot(x,y))
Now we fit each of the models (I tried the Gamma; you can try other families for the response)
library(mgcv)
lm1=lm(y~x,data=dataXY)
glm1=glm(y~x,data=dataXY,family=Gamma(link="log"))
gam1=gam(y~s(x),data=dataXY,family=Gamma(link="log"))
We can see what these look like overlaid on the data. To get the predictions from the GLM and the GAM, we need to predict on the response scale
xs=seq(0,40,by=0.5)
predglm1=predict(glm1,newdata=data.frame(x=xs),type="response")
predgam1=predict(gam1,newdata=data.frame(x=xs),type="response")
with(dataXY,plot(x,y))
abline(lm1,lty=2,col="red")
lines(xs,predglm1,lty=2,col="green") # no need to sort: xs is already increasing
lines(xs,predgam1,lty=2,col="blue")
As we can see, the only model that manages to capture the non-linear relation between y and x is the GAM. We can look at the fitted smooth
plot(gam1)
and we can also look at the residuals (here, on the response scale)
plot(resid(gam1,type="response"))
and the residuals do indeed look adequate. The GAM would therefore be a reasonable model to predict y from x.
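To make that choice formal, as the exercise asks, we could also compare the three fits by AIC (a sketch; the actual values are not shown here, but given the fits above we would expect the GAM to be favored):
# formal comparison: lower AIC indicates the preferred model
AIC(lm1, glm1, gam1)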