--- title: "Looking at the Owls data with a GEE" author: "Tiago A. Marques" date: "11/29/2019" output: html_document: default pdf_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Introduction Here we look at the "Owls" dataset used in Zuur et al. to illustrate the use of GLMs and GEEs in a data set where there is a grouping in the data, which requires that within group correlation is dealt with. # Read data in Reading the data in, obtained from the book webpage. ```{r} library(geepack) owls <- read.delim("C:/Users/tam2/Dropbox/Trabalho/DBA/20192020/ME2019/aulas/A19 FT7b/Owls.txt") str(owls) ``` # EDA A quick exploratory analysis reveals that we have two potential response variables, both related to the activity level of the chicks. These are: * SiblingNegotiation - the number of sounds produced by the chicks in the nest during a 30 second period * `NegPerChick` - the number of negotiation sounds emiter per chick per nest, i.e. `SiblingNegotiation` divided by the number of chicks in the nest, `BroodSize`. We can look at both of these, as both might be considered suitable responses, the later requiring the use of `BroodSize` as an offset. Note that if one uses an offset in the model, then the corresponding variable needs to be transformed according to the link funtion later used. This is required irrespectively of one considering a GAM, GLM or GEE. ```{r} par(mfrow=c(1,2),mar=c(4,4,2.5,0.5)) with(owls,hist(NegPerChick,cex.main=0.7,main="Sounds per owl per 30 secs",)) with(owls,hist(SiblingNegotiation,cex.main=0.7,main="Sounds per 30 secs")) owls$LBroodSize=log(owls$BroodSize) ``` We can also look at brood sizes ```{r} par(mfrow=c(1,1),mar=c(4,4,0.5,0.5)) with(owls,hist(BroodSize)) barplot(table((with(owls, tapply(X=BroodSize,INDEX=Nest,mean))))) ``` We can see that we have `r length(unique(owls$Nest))` nests, and the number of observations per nest varied between `r min(table((owls$Nest)))` and `r max(table((owls$Nest)))`. We actually have repeated measurements in nests over two nights, one where the animals were starved, the other where the animals were fed. This is described by `FoodTreatment`. In that respect, we have a relatively balanced number of records per night. ```{r} with(owls,barplot(table(FoodTreatment))) ``` Since the chicks were tested in two different nights, one when they were food deprived and one when they were not food deprived, there are actually twice as many groups of observations than the number of nests, and building a covariate that represents this will be useful for later ```{r} #doing Zuurs suggestion in pag. 317 with horrible code #but with 1 line of code :) owls$NestNight=as.factor(paste0(owls$Nest,".",substr(owls$FoodTreatment,1,3))) ``` Finally, it was also recorded the sex of the parent that arrived to the next ```{r} with(owls,barplot(table(SexParent))) ``` and there again there's no large differences, although it was more often the male than the female. A final note to the variable `ArrivalTime` which describes the time of the start of each 30 second period ```{r} with(owls,hist(ArrivalTime,main="Time")) ``` Note it is not that clear what are the units, but we assume here that the larger the number the later in the night. # Modeling the data ## GEE's This is a GEE that retains the factors considered relevant from previous analysis. ```{r} fmodelo <- formula(SiblingNegotiation∼offset(LBroodSize)+FoodTreatment+ArrivalTime) mod.by.nest <- geeglm(formula=fmodelo,data=owls,family = poisson,id = Nest, corstr = "ar1") summary(mod.by.nest) ``` Note the numbers of groups and maximum number of observations per group add up to the data in `Nest` ```{r} max(table(owls$Nest)) length(unique(owls$Nest)) ``` However, when we consider the relevant variable for the grouping, which is the one we built ourselves, `NestNight`, we don't get that ```{r} mod.by.nestnight <- geeglm(formula=fmodelo,data=owls,family = poisson,id=NestNight,corstr="ar1") summary(mod.by.nestnight) ``` The numbers of groups and maximum observations per group do not add up..., as we can see below... why? What am I missing? ```{r} max(table(owls$NestNight)) length(unique(owls$NestNight)) ``` Given this, we were expecting that the output referred that Number of clusters: `r length(unique(owls$NestNight))` Maximum cluster size: `r max(table(owls$NestNight))`. Where do the 277 and 18 come from? After an e-mail exchange with Alain Zuur (who's made the same error in his book!) that tried to be helpful but did not allow me to move forward, I contacted the package maintainer, Søren Højsgaard, who provided and explanation to the issue and the correct way to deal with it. Actually, the answer is in the package vignette, almost front and center. As an aside, in my opinion, the way the function `geeglm` computes the different groups is not tyhe best, but hey, that's life and in fact other software (including Distance for Windows, which I happen to be a co-author on...) does this too, so I really can't complain. ```{r,eval=FALSE} #if you want to see the vignette, just run this code - page 1-3 solves the issue vignette("geepack-manual", package="geepack") ``` The problem is that the variable defining the `id` is not ordered, and that causes issues, as each time there is a line with a different value from the previous one it is considered a new group. Hence, the first thing to do is to order the variable we use to define the groups. Søren Hojsgaard actually provided the 1 line of code that allows to do this! Check the issue ```{r} library(tidyverse) owls[, c("Nest", "NestNight")] %>% head(20) ``` solve it ```{r} owls <- owls %>% group_by(NestNight) %>% arrange(NestNight) #owls$NestNight ``` Just a quick visual check that the data is now sorted correctly (first by `NestNight`, then by `ArrivalTime` withn `NestNight`) ```{r} plot(owls$ArrivalTime,col=owls$NestNight) ``` Now we can run the same model, and get the expected results ```{r} mod.by.nestnight <- geeglm(formula=fmodelo,data=owls,family = poisson,id=NestNight,corstr="ar1") summary(mod.by.nestnight) ``` Note that we now have the correct values for the number of groups and maximum observations per group. Just to check the maximum observation by group now adds up ```{r} max(table(owls$NestNight)) ``` Note also this means that the analysis in Zuur et al. is wrong, and in particular the correlation results are completely incorrect. Note that the misspecification of the groups did not have a strong influence on the fixed components, but it changed the correlation considerably. It also lowered the "Estimated Scale Parameter" from 6.64 to 6.21. We conclude that satiated animals beg less (as expected), and also as the night progresses, animals tend to beg less (perhaps because they have already been fed). We also conclude that there is considerable correlation in the response over time, within night, as expected. Just for comparison, we can compare to what might have been concluded with a GLM that ignores that correlation, i.e. that treats each observation as independent ```{r} Aglm <- glm(formula=fmodelo,data=owls,family = poisson) summary(Aglm) ``` Note that while the conclusions are about the same regarding the fixed component, the standard errors of the parameters would be much smaller in the GLM case. It is easy to imagine that we could have found some parameters to be considered important when ignoring the correlation structure while using a GLM, that would then otherwise be found not relevant when the correct variances (as obtained by the GEE) were considered. Hence, properly accounting for the correlation structure in the data should always be given proper consideration. # Aknowledgements Alain Zuur provided some feedback about the issue where the incorrect number of groups was being considered in the analysis. Søren Højsgaard provided very useful advice about how data is dealt with within his `geepack`, which allowed to understand and solve said issue.