Introduction

Here we look at the “Owls” dataset used in Zuur et al. to illustrate the use of GLMs and GEEs in a data set where there is a grouping in the data, which requires that within group correlation is dealt with.

Read data in

Reading the data in, obtained from the book webpage.

library(geepack)
owls <- read.delim("C:/Users/tam2/Dropbox/Trabalho/DBA/20192020/ME2019/aulas/A19 FT7b/Owls.txt")
str(owls)
## 'data.frame':    599 obs. of  7 variables:
##  $ Nest              : Factor w/ 27 levels "AutavauxTV","Bochet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ FoodTreatment     : Factor w/ 2 levels "Deprived","Satiated": 1 2 1 1 1 1 1 2 1 2 ...
##  $ SexParent         : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 1 2 1 ...
##  $ ArrivalTime       : num  22.2 22.4 22.5 22.6 22.6 ...
##  $ SiblingNegotiation: int  4 0 2 2 2 2 18 4 18 0 ...
##  $ BroodSize         : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ NegPerChick       : num  0.8 0 0.4 0.4 0.4 0.4 3.6 0.8 3.6 0 ...

EDA

A quick exploratory analysis reveals that we have two potential response variables, both related to the activity level of the chicks. These are:

par(mfrow=c(1,2),mar=c(4,4,2.5,0.5))
with(owls,hist(NegPerChick,cex.main=0.7,main="Sounds per owl per 30 secs",))
with(owls,hist(SiblingNegotiation,cex.main=0.7,main="Sounds per 30 secs"))

owls$LBroodSize=log(owls$BroodSize)

We can also look at brood sizes

par(mfrow=c(1,1),mar=c(4,4,0.5,0.5))
with(owls,hist(BroodSize))

barplot(table((with(owls,
tapply(X=BroodSize,INDEX=Nest,mean)))))

We can see that we have 27 nests, and the number of observations per nest varied between 4 and 52.

We actually have repeated measurements in nests over two nights, one where the animals were starved, the other where the animals were fed. This is described by FoodTreatment.

In that respect, we have a relatively balanced number of records per night.

with(owls,barplot(table(FoodTreatment)))

Since the chicks were tested in two different nights, one when they were food deprived and one when they were not food deprived, there are actually twice as many groups of observations than the number of nests, and building a covariate that represents this will be useful for later

#doing Zuurs suggestion in pag. 317 with horrible code
#but with 1 line of code :)
owls$NestNight=as.factor(paste0(owls$Nest,".",substr(owls$FoodTreatment,1,3)))

Finally, it was also recorded the sex of the parent that arrived to the next

with(owls,barplot(table(SexParent)))

and there again there’s no large differences, although it was more often the male than the female.

A final note to the variable ArrivalTime which describes the time of the start of each 30 second period

with(owls,hist(ArrivalTime,main="Time"))

Note it is not that clear what are the units, but we assume here that the larger the number the later in the night.

Modeling the data

GEE’s

This is a GEE that retains the factors considered relevant from previous analysis.

fmodelo <- formula(SiblingNegotiation∼offset(LBroodSize)+FoodTreatment+ArrivalTime)
mod.by.nest <- geeglm(formula=fmodelo,data=owls,family = poisson,id = Nest, corstr = "ar1")
summary(mod.by.nest)
## 
## Call:
## geeglm(formula = fmodelo, family = poisson, data = owls, id = Nest, 
##     corstr = "ar1")
## 
##  Coefficients:
##                       Estimate  Std.err  Wald Pr(>|W|)    
## (Intercept)            3.70322  0.66935 30.61 3.16e-08 ***
## FoodTreatmentSatiated -0.56417  0.12254 21.20 4.15e-06 ***
## ArrivalTime           -0.12418  0.02691 21.30 3.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)    6.242   0.387
## 
## Correlation: Structure = ar1  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha   0.3854  0.0582
## Number of clusters:   27   Maximum cluster size: 52

Note the numbers of groups and maximum number of observations per group add up to the data in Nest

max(table(owls$Nest))
## [1] 52
length(unique(owls$Nest))
## [1] 27

However, when we consider the relevant variable for the grouping, which is the one we built ourselves, NestNight, we don’t get that

mod.by.nestnight <- geeglm(formula=fmodelo,data=owls,family = poisson,id=NestNight,corstr="ar1")
summary(mod.by.nestnight)
## 
## Call:
## geeglm(formula = fmodelo, family = poisson, data = owls, id = NestNight, 
##     corstr = "ar1")
## 
##  Coefficients:
##                       Estimate Std.err Wald Pr(>|W|)    
## (Intercept)              3.593   0.668 28.9  7.6e-08 ***
## FoodTreatmentSatiated   -0.578   0.115 25.4  4.6e-07 ***
## ArrivalTime             -0.122   0.027 20.3  6.6e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     6.64   0.524
## 
## Correlation: Structure = ar1  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.517  0.0676
## Number of clusters:   277   Maximum cluster size: 18

The numbers of groups and maximum observations per group do not add up…, as we can see below… why? What am I missing?

max(table(owls$NestNight))
## [1] 28
length(unique(owls$NestNight))
## [1] 54

Given this, we were expecting that the output referred that Number of clusters: 54 Maximum cluster size: 28. Where do the 277 and 18 come from?

After an e-mail exchange with Alain Zuur (who’s made the same error in his book!) that tried to be helpful but did not allow me to move forward, I contacted the package maintainer, Søren Højsgaard, who provided and explanation to the issue and the correct way to deal with it. Actually, the answer is in the package vignette, almost front and center. As an aside, in my opinion, the way the function geeglm computes the different groups is not tyhe best, but hey, that’s life and in fact other software (including Distance for Windows, which I happen to be a co-author on…) does this too, so I really can’t complain.

#if you want to see the vignette, just run this code - page 1-3 solves the issue
vignette("geepack-manual", package="geepack")

The problem is that the variable defining the id is not ordered, and that causes issues, as each time there is a line with a different value from the previous one it is considered a new group. Hence, the first thing to do is to order the variable we use to define the groups. Søren Hojsgaard actually provided the 1 line of code that allows to do this!

Check the issue

library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
owls[, c("Nest", "NestNight")]  %>% head(20)
##          Nest      NestNight
## 1  AutavauxTV AutavauxTV.Dep
## 2  AutavauxTV AutavauxTV.Sat
## 3  AutavauxTV AutavauxTV.Dep
## 4  AutavauxTV AutavauxTV.Dep
## 5  AutavauxTV AutavauxTV.Dep
## 6  AutavauxTV AutavauxTV.Dep
## 7  AutavauxTV AutavauxTV.Dep
## 8  AutavauxTV AutavauxTV.Sat
## 9  AutavauxTV AutavauxTV.Dep
## 10 AutavauxTV AutavauxTV.Sat
## 11 AutavauxTV AutavauxTV.Sat
## 12 AutavauxTV AutavauxTV.Dep
## 13 AutavauxTV AutavauxTV.Sat
## 14 AutavauxTV AutavauxTV.Dep
## 15 AutavauxTV AutavauxTV.Dep
## 16 AutavauxTV AutavauxTV.Dep
## 17 AutavauxTV AutavauxTV.Dep
## 18 AutavauxTV AutavauxTV.Sat
## 19 AutavauxTV AutavauxTV.Dep
## 20 AutavauxTV AutavauxTV.Dep

solve it

owls <- owls  %>% group_by(NestNight)  %>% arrange(NestNight) 
#owls$NestNight

Just a quick visual check that the data is now sorted correctly (first by NestNight, then by ArrivalTime withn NestNight)

plot(owls$ArrivalTime,col=owls$NestNight)

Now we can run the same model, and get the expected results

mod.by.nestnight <- geeglm(formula=fmodelo,data=owls,family = poisson,id=NestNight,corstr="ar1")
summary(mod.by.nestnight)
## 
## Call:
## geeglm(formula = fmodelo, family = poisson, data = owls, id = NestNight, 
##     corstr = "ar1")
## 
##  Coefficients:
##                       Estimate Std.err Wald Pr(>|W|)    
## (Intercept)             3.7119  0.6867 29.2  6.5e-08 ***
## FoodTreatmentSatiated  -0.5409  0.1375 15.5  8.3e-05 ***
## ArrivalTime            -0.1249  0.0272 21.2  4.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Estimated Scale Parameters:
##             Estimate Std.err
## (Intercept)     6.21   0.535
## 
## Correlation: Structure = ar1  Link = identity 
## 
## Estimated Correlation Parameters:
##       Estimate Std.err
## alpha    0.382  0.0541
## Number of clusters:   54   Maximum cluster size: 28

Note that we now have the correct values for the number of groups and maximum observations per group. Just to check the maximum observation by group now adds up

max(table(owls$NestNight))
## [1] 28

Note also this means that the analysis in Zuur et al. is wrong, and in particular the correlation results are completely incorrect. Note that the misspecification of the groups did not have a strong influence on the fixed components, but it changed the correlation considerably. It also lowered the “Estimated Scale Parameter” from 6.64 to 6.21.

We conclude that satiated animals beg less (as expected), and also as the night progresses, animals tend to beg less (perhaps because they have already been fed). We also conclude that there is considerable correlation in the response over time, within night, as expected.

Just for comparison, we can compare to what might have been concluded with a GLM that ignores that correlation, i.e. that treats each observation as independent

Aglm <- glm(formula=fmodelo,data=owls,family = poisson)
summary(Aglm)
## 
## Call:
## glm(formula = fmodelo, family = poisson, data = owls)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -4.51   -2.64   -0.81    1.28    7.92  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            3.81333    0.21585    17.7   <2e-16 ***
## FoodTreatmentSatiated -0.53230    0.03305   -16.1   <2e-16 ***
## ArrivalTime           -0.12924    0.00882   -14.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4128.3  on 598  degrees of freedom
## Residual deviance: 3652.6  on 596  degrees of freedom
## AIC: 5339
## 
## Number of Fisher Scoring iterations: 6

Note that while the conclusions are about the same regarding the fixed component, the standard errors of the parameters would be much smaller in the GLM case. It is easy to imagine that we could have found some parameters to be considered important when ignoring the correlation structure while using a GLM, that would then otherwise be found not relevant when the correct variances (as obtained by the GEE) were considered. Hence, properly accounting for the correlation structure in the data should always be given proper consideration.

Aknowledgements

Alain Zuur provided some feedback about the issue where the incorrect number of groups was being considered in the analysis. Søren Højsgaard provided very useful advice about how data is dealt with within his geepack, which allowed to understand and solve said issue.