--- title: "About Regression Models" author: "Tiago A. Marques" date: "\today" output: pdf_document: default html_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Introduction This document is an attempt to provide a coherent framework for some of the examples about regression models that were presented in the "Modelação Ecológica" course on the 1st, 8th and 9th of October 2019. It was created for those students, but if you are reading it and your are not one of those students that's fine too. We started looking into the simplest of models, a standard linear regression with a Gaussian error structure $$y_i=a+b_x+e_i$$ where the assumption is that the $e_i$ are Gaussian independent random deviates with a constant variance $sigma^2$. # Simulating regression data This was part of what we did in class 6, on the 01 10 2019. ## Simulate model It was suggested that you would simulate data that considered that the TRUE relation between the weight and length of a lizard was given by $$ y=12+1.2*length $$ You were also told that the usual length of a lizard was between 5 and 20 cm. We will have 97 lizards Then you were told to create the lengths: ```{r} set.seed(121) n=97 #lengths xs=runif(n,5,20) hist(xs,main="Lengths (cm)") ``` and then to create weights of lizards ```{r} a=12 b=1.2 ys=a+b*xs ``` If we plot the data, all points are in a single line. Why, because there is no randomness. ```{r} plot(xs,ys) ``` This means that if you try to run a model, it gives you a warning that the model might be unreliable ```{r} summary(lm(ys~xs)) ``` So... , we add some variance, and plot the data: ```{r} ys=a+b*xs+rnorm(n,0,4) plot(xs,ys) ``` Now, let's consider there's more and less variance. We also add to each plot the real line (that with the true parameter values) and the one with the estimated parameter values. ```{r} par(mfrow=c(2,3)) ys=a+b*xs+rnorm(n,0,1) plot(xs,ys) mod1=lm(ys~xs) abline(mod1,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,2) plot(xs,ys) mod2=lm(ys~xs) abline(mod2,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,4) plot(xs,ys) mod4=lm(ys~xs) abline(mod4,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,10) plot(xs,ys) mod10=lm(ys~xs) abline(mod10,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,20) plot(xs,ys) mod20=lm(ys~xs) abline(mod20,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,100) plot(xs,ys) mod100=lm(ys~xs) abline(mod100,col="red") abline(a,b,col="green") ``` Not surprisingly, as the variance increases, we get data that more and more looks like it is not coming from a real linear process. You can also look at the model summaries, and there you can see that, in fact, the models become essentially useless as the variance increases! You can see that both from the correlation, but also by the predictions generated from the model (comparing to the truth), and also the significance of the coefficients associated with the regression parameters. Make no mistake, the reality is always the same, in terms of the fixed part of the model, it's just the variance that increases. Also, don't get confused, the different green lines might look different, but they are always exactly the same line! You can check that by forcing the y axis to span the same limits. 
You can check that the green lines are indeed always the same by forcing the y axis to span the same limits in every panel.

```{r}
par(mfrow=c(2,3))
ys=a+b*xs+rnorm(n,0,1)
plot(xs,ys,ylim=c(-400,400))
mod1=lm(ys~xs)
abline(mod1,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,2)
plot(xs,ys,ylim=c(-400,400))
mod2=lm(ys~xs)
abline(mod2,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,4)
plot(xs,ys,ylim=c(-400,400))
mod4=lm(ys~xs)
abline(mod4,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,10)
plot(xs,ys,ylim=c(-400,400))
mod10=lm(ys~xs)
abline(mod10,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,20)
plot(xs,ys,ylim=c(-400,400))
mod20=lm(ys~xs)
abline(mod20,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,100)
plot(xs,ys,ylim=c(-400,400))
mod100=lm(ys~xs)
abline(mod100,col="red")
abline(a,b,col="green")
```

but since you then lose the ability to actually see the data in some of the plots, that is not really that useful!

Below I look at the summary of each model. Look at the estimated values of the parameters, their corresponding standard errors, and the $R^2$.

```{r}
summary(mod1)
summary(mod2)
summary(mod4)
summary(mod10)
summary(mod20)
summary(mod100)
```

As an example, we can plot the $R^2$ as a function of the error standard deviation

```{r}
plot(c(1,2,4,10,20,100),c(summary(mod1)$r.squared,summary(mod2)$r.squared,
  summary(mod4)$r.squared,summary(mod10)$r.squared,summary(mod20)$r.squared,
  summary(mod100)$r.squared))
```

That is quite interesting actually... There seems to be a nonlinear relationship, but we only have a sample size of six (six different standard deviations, and hence variances, since the variance is the standard deviation squared), so it is hard to tell... Let's show off in R...

```{r}
sds=seq(0.5,100,by=0.5)
nsds=length(sds)
#an object to hold the R-squared values
Rsqs=numeric(nsds)
for (i in 1:nsds){
  #create data
  ys=a+b*xs+rnorm(n,0,sds[i])
  #estimate model
  modi=lm(ys~xs)
  #get R-squared
  Rsqs[i]=summary(modi)$r.squared
}
#and at the end... plot results
plot(sds,Rsqs)
```

How cool is that!! Actually, this means we can model the $R^2$ as a function of the original error standard deviation! But we would not be able to model it using a linear model... You are not supposed to know about this yet, but I'll continue to show off. Let's use a GAM

```{r}
library(mgcv)
gam1=gam(Rsqs~s(sds),link=log)
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")
```

Aha... remember what we talked about in class today? It seems like we have over-fitted. So I constrain the GAM.

```{r}
library(mgcv)
gam1=gam(Rsqs~s(sds,k=3),link=log)
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")
```

That was too much...

```{r}
library(mgcv)
gam1=gam(Rsqs~s(sds,k=6),link=log)
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")
```

That is already over-fitting again... Conclusion: the GAM is not the right tool here :) What is...? Well, stay tuned and one day you'll learn!

## Drawing data

This was part of what we did in class 7, on the 8th of October 2019.

Here the idea was to explore different regression models. To illustrate the models, the students were invited to go to drawdata.xyz and to create a dataset with regression data such that there was an increasing trend, a decreasing trend, no trend, and data that was not really a line. Then the dataset should be downloaded, imported into R, and regression models fit to it.
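If you did not create (or cannot find) your own file from drawdata.xyz, the sketch below, which is not run by default, creates a `data.csv` with the same structure (an `x`, a `y`, and a group label `z` with four levels) so that the code that follows can still be reproduced. The generating functions and numbers here are entirely made up for illustration.

```{r,eval=FALSE}
#only run this if you do not have your own data.csv from drawdata.xyz
set.seed(42)
x=runif(400,0,100)
z=rep(c("a","b","c","d"),each=100)
#a: increasing trend, b: decreasing trend, c: no trend, d: not a line (a parabola)
y=ifelse(z=="a",10+0.8*x,
  ifelse(z=="b",90-0.8*x,
  ifelse(z=="c",50+0*x,20+0.03*(x-50)^2)))+rnorm(400,0,8)
write.csv(data.frame(x=x,y=y,z=z),file="data.csv",row.names=FALSE)
```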
Here we have such a dataset, named "data.csv", and we read it in and look at it

```{r}
#TAM's path
#C:/Users/tam2/Dropbox/Trabalho/DBA/20192020/ME2019/aulas/OnRegressionModels/
#this works on any machine provided the file data.csv exists in the working folder
data <- read.csv("data.csv")
summary(data)
```

We can see we have a dataset with 3 variables: x, the independent variable; y, the response; and z, a label identifying each of the drawn datasets. In my case, a was the increasing slope, b the decreasing slope, c a slope of approximately 0, and d was not a line. The data look like this

```{r}
with(data,plot(y~x,col=as.numeric(z)))
```

Not surprisingly, it looks quite messy, but if we separate the data into the 4 different subsets things become clearer.

```{r}
par(mfrow=c(2,2),mar=c(4,4,0.5,0.5))
dataA=data[data$z=="a",]
with(dataA,plot(y~x))
dataB=data[data$z=="b",]
with(dataB,plot(y~x,col=2))
dataC=data[data$z=="c",]
with(dataC,plot(y~x,col=3))
dataD=data[data$z=="d",]
with(dataD,plot(y~x,col=4))
```

and of course now we can fit a line to each dataset

```{r}
par(mfrow=c(2,2),mar=c(4,4,0.5,0.5))
with(dataA,plot(y~x))
lm1=with(dataA,lm(y~x))
abline(lm1,col=1,lty=2)
with(dataB,plot(y~x,col=2))
lm2=with(dataB,lm(y~x))
abline(lm2,col=2,lty=2)
with(dataC,plot(y~x,col=3))
lm3=with(dataC,lm(y~x))
abline(lm3,col=3,lty=2)
with(dataD,plot(y~x,col=4))
lm4=with(dataD,lm(y~x))
abline(lm4,col=4,lty=2)
```

and we can look at the output of the models. For the increasing slope

```{r}
summary(lm1)
```

we have an intercept estimate of `r round(summary(lm1)$coefficients[1,1],2)` (with corresponding standard error of `r round(summary(lm1)$coefficients[1,2],2)`) and a slope of `r round(summary(lm1)$coefficients[2,1],2)` (with corresponding standard error of `r round(summary(lm1)$coefficients[2,2],2)`). The regression $R^2$ is `r round(summary(lm1)$r.squared,3)`.

Note these objects can be easily manipulated to obtain parameters and statistics of interest. That was how I wrote them dynamically in the text above

```{r}
#intercept
round(summary(lm1)$coefficients[1,1],2)
#intercept se
round(summary(lm1)$coefficients[1,2],2)
#slope
round(summary(lm1)$coefficients[2,1],2)
#slope se
round(summary(lm1)$coefficients[2,2],2)
#R-squared
round(summary(lm1)$r.squared,3)
```

but there's much more info in a regression object. Feel free to explore it

```{r}
names(summary(lm1))
```

For the decreasing slope we got

```{r}
summary(lm2)
```

for what was supposed to be a 0 slope (proving I am terrible at drawing!)

```{r}
summary(lm3)
```

and for what was not a line at all

```{r}
summary(lm4)
```

Notice in particular that, while this was not a line, the computer is agnostic to that: you are telling it to fit a line to the data, and it will. And interestingly, the regression is highly significant. This should warn you about the dangers of fitting wrong models to data: you can end up with the wrong conclusions.
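A quick way to spot this kind of problem in practice is to look at the standard residual diagnostic plots that R produces for any `lm` object. A minimal check on `lm4`, the model fitted to the data that was not a line; a clear pattern in the residuals versus fitted values plot is a strong hint that the linear model is not appropriate:

```{r}
par(mfrow=c(2,2))
#the four default diagnostic plots for a fitted linear model
plot(lm4)
```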
The students should by now understand every aspect of the outputs above. If you do not, come and talk to me.

# ANOVA, ANCOVA and the likes

This was part of what we did in class 8, on the 9th of October 2019. Here the idea was to explore different regression models and to see how they relate to statistical procedures one might not usually associate with a regression when, in fact, they are special cases of a regression.

## The t-test

While we did not do the t-test in class, it is useful to see how a simple t-test is just a linear model too, and it acts as a building block for the next examples.

The t-test allows us to test the null hypothesis that two samples have the same mean. Create some data

```{r}
#Making up a t-test
#making sure everyone gets the same results
set.seed(980)
```

Then we define the sample size and the number of treatments

```{r}
#define sample size
n=100
#define treatments
tr=c("a","b")
#how many treatments - 2 for a t-test
ntr=length(tr)
#balanced design
n.by.tr=n/ntr
```

Now we can simulate some data. First, the treatments

```{r}
type=as.factor(rep(tr,each=n.by.tr))
cores=rep(1:ntr,each=n.by.tr)
```

Then we define the means by treatment - note that they are different, so the null hypothesis of the t-test, that the mean of a is equal to the mean of b, is known to be false in this case.

```{r}
#define 2 means
ms=c(3,4)
```

Then, the key part, the response variable, with a different mean by treatment. Note the use of the `ifelse` function, which evaluates its first argument and returns its second argument if the first is true, or its third argument if it is false. An example

```{r}
ifelse(3>4,55,77)
ifelse(3<4,55,77)
```

So now, generate the response data

```{r}
ys=ifelse(type=="a",ms[1],ms[2])+rnorm(n,0,1.5)
```

Look at the data

```{r}
plot(ys~type)
```

Now, we can run the usual t-test

```{r}
t.test(ys~type)
```

and now we can do it the linear regression way

```{r}
lm0=lm(ys~type)
summary(lm0)
```

and as you can see, we get the same result for the test statistic. It is the same thing! And we can naturally get the estimated means per group. The mean for group a is just the intercept of the model. To get the mean of group b we add the coefficient for group b to the intercept, as

```{r}
#mean of ys under treatment a
summary(lm0)$coefficients[1]
#mean of ys under treatment b
summary(lm0)$coefficients[1]+lm0$coefficients[2]
```

This is required because, in a linear model, the parameters associated with the levels of a factor are estimated as differences from a reference level, which is absorbed into the intercept - and here the intercept happens to be the mean under treatment a. Below you will see more examples of this.

Note we were able to detect that the null was false, but this was because we had a decent sample size compared to the variance of the measurements and the magnitude of the true effect (the difference of the means). If we keep the sample size constant but increase the noise or decrease the magnitude of the difference, we might not get the same result, and make a type II error!

```{r}
#define 2 means
ms=c(3,4)
#increase the variance of the process
ys=ifelse(type=="a",ms[1],ms[2])+rnorm(n,0,5)
```

Look at the data; we can see much more variation

```{r}
plot(ys~type)
```

Now, we can run the usual t-test

```{r}
t.test(ys~type)
```

and now we can do it the linear regression way

```{r}
lm0=lm(ys~type)
summary(lm0)
```

and as you can see, we get the same result for the test statistic, but now with a non-significant test. The same would have happened if we decreased the true difference while keeping the original magnitude of the error

```{r}
#define 2 means, now very close to each other
ms=c(3,3.1)
#back to the original error standard deviation
ys=ifelse(type=="a",ms[1],ms[2])+rnorm(n,0,1.5)
```

Look at the data; the variation is again lower, but the difference across treatments is very small (so, hard to detect!)

```{r}
plot(ys~type)
```

Now, we can run the usual t-test

```{r}
t.test(ys~type)
```

and now we can do it the linear regression way

```{r}
lm0=lm(ys~type)
summary(lm0)
```
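How often would we actually detect such a small difference with this design? A purely illustrative simulation, repeating the last scenario (means of 3 and 3.1, error standard deviation 1.5, and the same `n` and `type` as above) many times and recording how often the t-test rejects the null at the 5% level; the number of replicates (500) is arbitrary and kept small so it runs quickly:

```{r}
set.seed(123)
nsim=500
pvals=numeric(nsim)
for(i in 1:nsim){
  #simulate a new response under the small-difference scenario
  ysi=ifelse(type=="a",3,3.1)+rnorm(n,0,1.5)
  pvals[i]=t.test(ysi~type)$p.value
}
#estimated power: proportion of simulations in which the null was (correctly) rejected
mean(pvals<0.05)
```

The resulting proportion is the estimated power of the test under this scenario, which for such a small effect is expected to be low.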
## ANOVA

We move on to perhaps the most famous example of a statistical test/procedure, the ANOVA. An ANOVA is nothing but a linear model, where we have a continuous response variable which we want to explain as a function of a factor (with several levels, or treatments).

We simulate a dataset, beginning by making sure everyone gets the same results by using `set.seed`

```{r}
#Making up an ANOVA
#making sure everyone gets the same results
set.seed(12345)
```

Then we define the sample size and the number of treatments

```{r}
#define sample size
n=2000
#define treatments
tr=c("a","b","c","d")
#how many treatments
ntr=length(tr)
#balanced design
n.by.tr=n/ntr
```

Now we can simulate some data. First the treatments, but we also generate an independent variable (`xs`) that is not really used for now.

```{r}
#generate data
xs=runif(n,10,20)
type=as.factor(rep(tr,each=n.by.tr))
#if I wanted to recode the levels such that c was the baseline
#type=factor(type,levels = c("c","a","b","d"))
#get colors for plotting
cores=rep(1:ntr,each=n.by.tr)
```

Then we define the means by treatment - note that they are different, so the null hypothesis of the ANOVA, that all the means are the same, is false.

```{r}
#define 4 means
ms=c(3,5,6,2)
```

Then, the key part, the response variable, with a different mean by treatment. Recall that the `ifelse` function evaluates its first argument and returns its second argument if the first is true, or its third argument if it is false. An example

```{r}
ifelse(3>4,55,77)
ifelse(3<4,55,77)
```

Note that these calls can be nested, allowing more than two possible outcomes, and I use that below to define 4 different means depending on the treatment of the observation

```{r}
ifelse(3<4,55,ifelse(3>2,55,68))
ifelse(3>4,55,ifelse(3>2,666,68))
ifelse(3>4,55,ifelse(3<2,666,68))
```

So now, generate the data

```{r}
#ys, not a function of the xs!!!
ys=ifelse(type=="a",ms[1],ifelse(type=="b",ms[2],ifelse(type=="c",ms[3],ms[4])))+rnorm(n,0,3)
```

We can actually look at the simulated data

```{r}
par(mfrow=c(1,1),mar=c(4,4,0.5,0.5))
plot(ys~type,col=1:4)
#abline(h=ms,col=1:4)
```

Finally, we can fit the linear model and look at its summary

```{r}
lm.anova=lm(ys~type)
summary(lm.anova)
```

Note that, again, we can manipulate any sub-components of the created objects

```{r}
#see the parameters
lm.anova$coefficients
#see the third parameter
lm.anova$coefficients[3]
```

Not surprisingly, because the means were different and we had a large sample size, everything is highly significant. Note that the ANOVA test is actually presented in the regression output; it is the corresponding F-test

```{r}
summary(lm.anova)$fstatistic
```

and we can use the F distribution to calculate the corresponding P-value (which is already in the output above)

```{r}
ftest=summary(lm.anova)$fstatistic[1]
df1=summary(lm.anova)$fstatistic[2]
df2=summary(lm.anova)$fstatistic[3]
#P-value: probability of an F value this large or larger under the null
pf(ftest,df1,df2,lower.tail=FALSE)
```

This is (up to numerical precision) the same value that was reported above simply as a very small number (< 2.2 $\times$ 10$^{-16}$), believe me!

Finally, to show (by example) that this is just what the ANOVA does, we have the ANOVA itself

```{r}
summary(aov(lm.anova))
```

where everything is the same (test statistic, degrees of freedom and p-values).

Conclusion: an ANOVA is just a special case of a linear model, one where we have a continuous response variable and a factor explanatory covariate. In fact, a two way ANOVA is just the extension where we have a continuous response variable and 2 factor explanatory covariates, and, you guessed it, a three way ANOVA means we have a continuous response variable and 3 factor explanatory covariates.
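Here is a minimal sketch of the two way case: we add a second, made up, factor to the simulation (call it `sex`, with a made up effect of 1.5 for males; everything about this second factor is invented purely for illustration), and check that the linear model and the ANOVA table are again two views of the same fit.

```{r}
set.seed(6789)
#a second, made up, factor with two levels, balanced across observations
sex=as.factor(rep(c("f","m"),times=n/2))
#a new response: the treatment means as before, plus a made up effect of sex
ys2=ifelse(type=="a",ms[1],ifelse(type=="b",ms[2],ifelse(type=="c",ms[3],ms[4])))+
  ifelse(sex=="f",0,1.5)+rnorm(n,0,3)
#compare the linear model summary with the two way ANOVA table
summary(lm(ys2~type+sex))
summary(aov(ys2~type+sex))
```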
Just to finish up this example, we could now plot, for each treatment, the true mean, the estimated mean, and the sample mean of the data

```{r}
par(mfrow=c(1,1),mar=c(4,4,0.5,0.5))
plot(as.numeric(type),ys,col=as.numeric(type),xlab="Treatment",xaxt="n")
axis(1,at=1:4,letters[1:4])
#plot the estimated mean for type a
abline(h=lm.anova$coefficients[1],lwd=3,col=1)
#plot the sample mean for type a
abline(h=mean(ys[type=="a"]),lwd=1,col=1,lty=2)
#plot the real mean for type a
abline(h=ms[1],lwd=2,col=1,lty=3)
#and now for the other types
abline(h=lm.anova$coefficients[1]+lm.anova$coefficients[2],lwd=3,col=2)
abline(h=mean(ys[type=="b"]),lwd=1,col=2,lty=2)
#plot the real mean for type b
abline(h=ms[2],lwd=2,col=2,lty=3)
abline(h=lm.anova$coefficients[1]+lm.anova$coefficients[3],lwd=3,col=3)
abline(h=mean(ys[type=="c"]),lwd=1,col=3,lty=2)
#plot the real mean for type c
abline(h=ms[3],lwd=2,col=3,lty=3)
abline(h=lm.anova$coefficients[1]+lm.anova$coefficients[4],lwd=3,col=4)
abline(h=mean(ys[type=="d"]),lwd=1,col=4,lty=2)
#plot the real mean for type d
abline(h=ms[4],lwd=2,col=4,lty=3)
legend("topright",c("Estimated","Mean of data","True"),lwd=c(3,1,2),lty=c(1,2,3),inset=0.03)
```

It's not easy to see because the lines overlap (large sample size, high precision), but the estimated means are really close to the real means.
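A more formal check than eyeballing lines on a plot: we can ask for confidence intervals for the coefficients and for the estimated mean of each treatment, and see whether the true values are plausible under them. This just uses the standard `confint` and `predict` functions on the fitted `lm` object:

```{r}
#95% confidence intervals for the model coefficients
confint(lm.anova)
#95% confidence intervals for the estimated mean of each treatment
predict(lm.anova,newdata=data.frame(type=c("a","b","c","d")),interval="confidence")
```

The true means (3, 5, 6 and 2) should typically fall inside, or very close to, the corresponding intervals.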
Visually, it's a bit easier to see if we separate into 4 plots and zoom in on the mean of each treatment, but still the lines are essentially on top of each other; for instance, for treatment d (the blue lines) the mean was estimated really close to the truth (truth = 2, estimated = `r round(mean(ys[type=="d"]),3)`).

```{r}
#see this in 4 plots, less blur
par(mfrow=c(2,2),mar=c(4,4,0.5,0.5))
plot(as.numeric(type),ys,col=as.numeric(type),xlab="Treatment",xaxt="n",ylim=mean(ys[type=="a"])+c(-0.5,0.5))
axis(1,at=1:4,letters[1:4])
#plot the estimated mean for type a
abline(h=lm.anova$coefficients[1],lwd=3,col=1)
#plot the sample mean for type a
abline(h=mean(ys[type=="a"]),lwd=1,col=1,lty=2)
#plot the real mean for type a
abline(h=ms[1],lwd=2,col=1,lty=3)
#and now for the other types
plot(as.numeric(type),ys,col=as.numeric(type),xlab="Treatment",xaxt="n",ylim=mean(ys[type=="b"])+c(-0.5,0.5))
axis(1,at=1:4,letters[1:4])
abline(h=lm.anova$coefficients[1]+lm.anova$coefficients[2],lwd=3,col=2)
abline(h=mean(ys[type=="b"]),lwd=1,col=2,lty=2)
#plot the real mean for type b
abline(h=ms[2],lwd=2,col=2,lty=3)
plot(as.numeric(type),ys,col=as.numeric(type),xlab="Treatment",xaxt="n",ylim=mean(ys[type=="c"])+c(-0.5,0.5))
axis(1,at=1:4,letters[1:4])
abline(h=lm.anova$coefficients[1]+lm.anova$coefficients[3],lwd=3,col=3)
abline(h=mean(ys[type=="c"]),lwd=1,col=3,lty=2)
#plot the real mean for type c
abline(h=ms[3],lwd=2,col=3,lty=3)
plot(as.numeric(type),ys,col=as.numeric(type),xlab="Treatment",xaxt="n",ylim=mean(ys[type=="d"])+c(-0.5,0.5))
axis(1,at=1:4,letters[1:4])
abline(h=lm.anova$coefficients[1]+lm.anova$coefficients[4],lwd=3,col=4)
abline(h=mean(ys[type=="d"]),lwd=1,col=4,lty=2)
#plot the real mean for type d
abline(h=ms[4],lwd=2,col=4,lty=3)
#legend("bottomright",c("Estimated","Mean of data","True"),lwd=c(3,1,2),lty=c(1,2,3),inset=0.05)
```

Now we can check how to obtain the estimated means from the actual parameters of the regression model (yes, that is what the regression does: it estimates the expected mean of the response, conditional on the treatment). This is the estimated mean per treatment, using the function `tapply` (a very useful function to compute any statistic of a variable inside groups defined by a second variable, here the treatment)

```{r}
tapply(X=ys,INDEX=type,FUN=mean)
```

and here we check that these are obtained from the regression coefficients. An important note: when you fit models with factors (like here), the intercept term corresponds to the mean of the reference level of the factor(s). Hence, to get the other means, you always have to add the parameter of the corresponding level to the intercept. So we do it below

```{r}
#check that the ANOVA is just computing the mean in each group
lm.anova$coefficients[1]
lm.anova$coefficients[1]+lm.anova$coefficients[2]
lm.anova$coefficients[1]+lm.anova$coefficients[3]
lm.anova$coefficients[1]+lm.anova$coefficients[4]
```

and we can see these are exactly the same values.
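As a side note, there is an equivalent parametrization of the same model in which R reports the group means directly: removing the intercept (the `-1` in the formula) makes each coefficient the mean of the corresponding treatment. The fit is the same; only the way the parameters are reported changes.

```{r}
#same model, different parametrization: coefficients are now the group means
lm.anova.means=lm(ys~type-1)
lm.anova.means$coefficients
```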
## ANCOVA

We move on to the ANCOVA, which is like an ANOVA to which we add a continuous explanatory covariate. This is an extremely common situation in biology/ecology data. Consider, as an example, that you are trying to explain how the weight of a fish depends on its length, but you want to see if that relationship changes per year or site.

Let's simulate some relevant data and fit the models.

### Common slope, different intercepts per treatment

We begin with a situation where there are different intercepts per group, but a common slope across all groups

```{r}
#all slopes the same, different intercepts - no interactions
set.seed(1234)
xs=runif(20000,10,20)
type=rep(c("a","b","c","d"),each=5000)
cores=rep(1:4,each=5000)
ys=3+4*xs+
  ifelse(type=="a",5,ifelse(type=="b",8,ifelse(type=="c",10,12)))+rnorm(20000,0,4)
```

We plot the data: all together, per group, and at the end with the generating line added to each plot. It's not easy to make sense of it all at once!

```{r}
par(mfrow=c(2,3),mar=c(4,4,0.5,0.5))
#all the data - a big mess!
plot(xs,ys,col=cores,cex=0.2)
#plot the data per group
plot(xs[type=="a"],ys[type=="a"],col=cores[type=="a"])
abline(3+5,4,lwd=3,col=1)
plot(xs[type=="b"],ys[type=="b"],col=cores[type=="b"])
abline(3+8,4,lwd=3,col=2)
plot(xs[type=="c"],ys[type=="c"],col=cores[type=="c"])
abline(3+10,4,lwd=3,col=3)
plot(xs[type=="d"],ys[type=="d"],col=cores[type=="d"])
abline(3+12,4,lwd=3,col=4)
#all the data with each generating line added to it
plot(xs,ys,col=cores,cex=0.2)
abline(3+5,4,lwd=3,col=1)
abline(3+8,4,lwd=3,col=2)
abline(3+10,4,lwd=3,col=3)
abline(3+12,4,lwd=3,col=4)
```

Now we fit the linear model

```{r}
#fit the model
lm.ancova1=summary(lm(ys~xs+type))
lm.ancova1
```

We can check the estimated intercept for each group

```{r}
#estimated values of each intercept
lm.ancova1$coefficients[1]
lm.ancova1$coefficients[1]+lm.ancova1$coefficients[3]
lm.ancova1$coefficients[1]+lm.ancova1$coefficients[4]
lm.ancova1$coefficients[1]+lm.ancova1$coefficients[5]
```

and the common slope

```{r}
lm.ancova1$coefficients[2]
```

Check how close these values are (they are estimates) to those we simulated above: the slope was 4, and the intercepts were respectively 3+5, 3+8, 3+10 and 3+12. We can plot the estimated regression lines

```{r}
par(mfrow=c(1,1),mar=c(4,4,2.5,0.5))
plot(xs,ys,col=cores,pch=".",main="Estimated regression lines")
abline(lm.ancova1$coefficients[1],lm.ancova1$coefficients[2],col=1,lwd=2)
abline(lm.ancova1$coefficients[1]+lm.ancova1$coefficients[3],lm.ancova1$coefficients[2],col=2,lwd=2)
abline(lm.ancova1$coefficients[1]+lm.ancova1$coefficients[4],lm.ancova1$coefficients[2],col=3,lwd=2)
abline(lm.ancova1$coefficients[1]+lm.ancova1$coefficients[5],lm.ancova1$coefficients[2],col=4,lwd=2)
legend("topleft",legend = tr,lwd=2,col=1:4,inset=0.05)
```
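One thing we often want from such a model is a prediction. Because `lm.ancova1` above stores the `summary` rather than the fitted model itself, the sketch below refits the model (the object name `lm.ancova1.fit` and the example length of 15 are arbitrary choices) and predicts the response at that value of `xs` for each group:

```{r}
#refit, keeping the lm object itself so we can use predict()
lm.ancova1.fit=lm(ys~xs+type)
#predicted response at xs=15 for each of the four groups
predict(lm.ancova1.fit,newdata=data.frame(xs=15,type=c("a","b","c","d")))
```

Given the true parameters used in the simulation, these predictions should be close to 3+5+4*15 = 68, and to 71, 73 and 75 for the other groups.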
### Different intercepts and slopes per treatment

Now we extend the previous case to one where the slope of the relationship is also different per treatment. We simulate the treatments, same as before, but doing it again gives us the option to change things separately later if we want.

```{r}
#----------------------------------------------------------------
#all slopes different
set.seed(1234)
xs=runif(200,10,20)
type=rep(c("a","b","c","d"),each=50)
cores=rep(1:4,each=50)
```

Now we simulate the response

```{r}
ys=3+
  ifelse(type=="a",5,ifelse(type=="b",8,ifelse(type=="c",10,12)))+
  4*xs+ifelse(type=="a",0.2,ifelse(type=="b",0.5,ifelse(type=="c",1,2)))*xs+
  rnorm(200,0,4)
```

note this is the same as the code below, where it may be easier to see that these terms do correspond to a different intercept and a different slope per treatment

```{r}
#same as above, written out in two pieces
#the intercept per treatment
intercept=3+ifelse(type=="a",5,ifelse(type=="b",8,ifelse(type=="c",10,12)))
#the slope per treatment, multiplied by xs
slope.times.xs=xs*(4+ifelse(type=="a",0.2,ifelse(type=="b",0.5,ifelse(type=="c",1,2))))
ys=intercept+slope.times.xs+rnorm(200,0,4)
```

We can look at the data

```{r}
par(mfrow=c(1,2),mar=c(4,4,0.5,0.5))
plot(xs,ys,col=cores)
abline(3+5,4+0.2,lwd=3,col=1)
abline(3+8,4+0.5,lwd=3,col=2)
abline(3+10,4+1,lwd=3,col=3)
abline(3+12,4+2,lwd=3,col=4)
```

It's actually not that easy to confirm that the slopes and intercepts are different, as the intercept (the value at x = 0) is not shown in the above plot. We can zoom out the plot to force that

```{r}
plot(xs,ys,col=cores,xlim=c(0,20),ylim=c(0,150))
abline(3+5,4+0.2,lwd=3,col=1)
abline(3+8,4+0.5,lwd=3,col=2)
abline(3+10,4+1,lwd=3,col=3)
abline(3+12,4+2,lwd=3,col=4)
abline(h=c(3+5,3+8,3+10,3+12),v=0,col=c(1,2,3,4,1),lty=2)
```

Now we fit the ANCOVA linear model

```{r}
lm.ancova2=lm(ys~xs+type+xs*type)
sum.lm.ancova2=summary(lm.ancova2)
```

and we look at the output of the model

```{r}
sum.lm.ancova2
```

Check how this output relates to that of the ANOVA (implemented via `aov`, the R function that produces ANOVA tables from expressions akin to linear models)

```{r}
summary(aov(ys~xs+type+xs*type))
```

Note that the regression model has an overall F-statistic of `r round(sum.lm.ancova2$fstatistic[1],1)`, with `r round(sum.lm.ancova2$fstatistic[2],1)` and `r round(sum.lm.ancova2$fstatistic[3],1)` degrees of freedom. That corresponds to the composite test of the null hypothesis that all parameters (other than the intercept) are equal to 0, which in the ANOVA table is separated into 3 tests, one for each model term (xs, type and their interaction), with 1, 3 and 3 degrees of freedom respectively. The residual degrees of freedom are naturally the same in all these tests.
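If the question is specifically whether the slopes differ across treatments, a natural way to address it is to compare the models with and without the interaction term, using an F-test for nested models via `anova`. A minimal sketch (the no-interaction model is refitted here just for the comparison):

```{r}
#the model without the interaction (common slope), and the comparison
lm.ancova2.nointer=lm(ys~xs+type)
anova(lm.ancova2.nointer,lm.ancova2)
```

Since the data were simulated with clearly different slopes, this test should strongly favour the model with the interaction.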
The most interesting aspect is that, naturally, we can check the values of the estimated coefficients, and in particular how to obtain the corresponding regression line (intercept and slope) for each group

```{r}
#type a
lm.ancova2$coefficients[1]
lm.ancova2$coefficients[2]
#type b
lm.ancova2$coefficients[1]+lm.ancova2$coefficients[3]
lm.ancova2$coefficients[2]+lm.ancova2$coefficients[6]
#type c
lm.ancova2$coefficients[1]+lm.ancova2$coefficients[4]
lm.ancova2$coefficients[2]+lm.ancova2$coefficients[7]
#type d
lm.ancova2$coefficients[1]+lm.ancova2$coefficients[5]
lm.ancova2$coefficients[2]+lm.ancova2$coefficients[8]
```

We can now add these to the earlier plot, to see how well we have estimated the different lines per treatment

```{r}
#real lines
par(mfrow=c(1,1),mar=c(4,4,0.5,0.5))
plot(xs,ys,col=cores)
abline(3+5,4+0.2,lwd=3,col=1)
abline(3+8,4+0.5,lwd=3,col=2)
abline(3+10,4+1,lwd=3,col=3)
abline(3+12,4+2,lwd=3,col=4)
#estimated lines
#type a
abline(lm.ancova2$coefficients[1],lm.ancova2$coefficients[2],lty=2,col=1)
#type b
abline(lm.ancova2$coefficients[1]+lm.ancova2$coefficients[3],
       lm.ancova2$coefficients[2]+lm.ancova2$coefficients[6],lty=2,col=2)
#type c
abline(lm.ancova2$coefficients[1]+lm.ancova2$coefficients[4],
       lm.ancova2$coefficients[2]+lm.ancova2$coefficients[7],lty=2,col=3)
#type d
abline(lm.ancova2$coefficients[1]+lm.ancova2$coefficients[5],
       lm.ancova2$coefficients[2]+lm.ancova2$coefficients[8],lty=2,col=4)
legend("topleft",legend = tr,lwd=2,col=1:4,inset=0.05)
legend("bottomright",legend = paste("Estimated",tr),lwd=1,lty=2,col=1:4,inset=0.05)
```

# Conclusion

The material above allows you to fully understand the outputs of simple regression models, and to see how some statistical procedures that you know by other names are just linear models. It also helps you understand how the parameter values represent features of the data and of its generating process, and how we can recover estimates of the original relationships between the variables from those parameters.

I recommend that you explore the code and output above, and in particular that you experiment with changing the means (the parameter values of the true models), the variances (the precision with which you would measure the variables) and the sample sizes (which determine how much information you have to estimate the underlying reality). Understanding the outputs under such new scenarios is fundamental for progressing towards more complex regression models, like GLMs or GAMs, of which the models above are just special cases.
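To make the experimentation recommended above (changing means, variances and sample sizes) easier, here is a small helper function in the spirit of the simulations in this document; the function name, argument names and defaults are all made up, and you should feel free to change them.

```{r}
#simulate a simple linear model with chosen intercept (a), slope (b),
#error standard deviation (sigma) and sample size (n), and return the fitted lm
sim.and.fit=function(a=12,b=1.2,sigma=4,n=97,xmin=5,xmax=20){
  xs=runif(n,xmin,xmax)
  ys=a+b*xs+rnorm(n,0,sigma)
  lm(ys~xs)
}
#for example, a noisier and smaller-sample version of the lizard example
summary(sim.and.fit(sigma=10,n=30))
```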