--- title: "Simulating data from a regression model and all that" author: "Tiago A. Marques" date: "10/2/2019" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Simulate model It was suggested that you would simulate data that considered that the TRUE relation between the weight and length of a lizzard was given by $$ y=12+1.2*length $$ You were also told that the usual length of a lizard was between 5 and 20 cm. We will have 97 lizards Then you were told to create the lengths: ```{r} set.seed(121) n=97 #lengths xs=runif(n,5,20) hist(xs,main="Lenths (cm)") ``` and then to create weights of lizards ```{r} a=12 b=1.2 ys=a+b*xs ``` If we plot the data, all points are in a single line. Why, because there is no randomness. ```{r} plot(xs,ys) ``` This means that if you try to run a model, it gives you a warning that the model might be unreliable ```{r} summary(lm(ys~xs)) ``` So... , we add some variance, and plot the data: ```{r} ys=a+b*xs+rnorm(n,0,4) plot(xs,ys) ``` Now, let's consider there's more and less variance. We also add to each plot the real line (that with the true parameter values) and the one with the estimated parameter values. ```{r} par(mfrow=c(2,3)) ys=a+b*xs+rnorm(n,0,1) plot(xs,ys) mod1=lm(ys~xs) abline(mod1,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,2) plot(xs,ys) mod2=lm(ys~xs) abline(mod2,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,4) plot(xs,ys) mod4=lm(ys~xs) abline(mod4,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,10) plot(xs,ys) mod10=lm(ys~xs) abline(mod10,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,20) plot(xs,ys) mod20=lm(ys~xs) abline(mod20,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,100) plot(xs,ys) mod100=lm(ys~xs) abline(mod100,col="red") abline(a,b,col="green") ``` Not surprisingly, as the variance increases, we get data that more and more looks like it is not coming from a real linear process. You can also look at the model summaries, and there you can see that, in fact, the models become essentially useless as the variance increases! You can see that both from the correlation, but also by the predictions generated from the model (comparing to the truth), and also the significance of the coefficients associated with the regression parameters. Make no mistake, the reality is always the same, in terms of the fixed part of the model, it's just the variance that increases. Also, don't get confused, the different green lines might look different, but they are always exactly the same line! You can check that by forcing the y axis to span the same limits. ```{r} par(mfrow=c(2,3)) ys=a+b*xs+rnorm(n,0,1) plot(xs,ys,ylim=c(-400,400)) mod1=lm(ys~xs) abline(mod1,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,2) plot(xs,ys,ylim=c(-400,400)) mod2=lm(ys~xs) abline(mod2,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,4) plot(xs,ys,ylim=c(-400,400)) mod4=lm(ys~xs) abline(mod4,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,10) plot(xs,ys,ylim=c(-400,400)) mod10=lm(ys~xs) abline(mod10,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,20) plot(xs,ys,ylim=c(-400,400)) mod20=lm(ys~xs) abline(mod20,col="red") abline(a,b,col="green") ys=a+b*xs+rnorm(n,0,100) plot(xs,ys,ylim=c(-400,400)) mod100=lm(ys~xs) abline(mod100,col="red") abline(a,b,col="green") ``` but since then you loose all the ability to look at the actual data in some of the plots, that is not really that useful! Below I look at the summary of each model. 
Below I look at the summary of each model. Look at the correlations, the estimated values of the parameters, their corresponding variances, and the $R^2$.

```{r}
summary(mod1)
summary(mod2)
summary(mod4)
summary(mod10)
summary(mod20)
summary(mod100)
```

As an example, we can plot the $R^2$ as a function of the error standard deviation

```{r}
plot(c(1,2,4,10,20,100),c(summary(mod1)$r.squared,summary(mod2)$r.squared,summary(mod4)$r.squared,summary(mod10)$r.squared,summary(mod20)$r.squared,summary(mod100)$r.squared))
```

That is quite interesting actually... There seems to be a nonlinear relationship, but we only have a sample size of six (six different standard deviations, and hence variances, since the variance is just the standard deviation squared), so it is hard to tell... Let's show off in R...

```{r}
sds=seq(0.5,100,by=0.5)
nsds=length(sds)
#an object to hold the R-squared values
Rsqs=numeric(nsds)
for (i in 1:nsds){
  #create data
  ys=a+b*xs+rnorm(n,0,sds[i])
  #estimate model
  modi=lm(ys~xs)
  #get R-squared
  Rsqs[i]=summary(modi)$r.squared
}
#and at the end... plot results
plot(sds,Rsqs)
```

How cool is that!! Actually, this means we can model the $R^2$ as a function of the original error standard deviation! But we would not be able to model it using a linear model... You are not supposed to know about this yet, but I'll continue to show off. Let's use a GAM

```{r}
library(mgcv)
gam1=gam(Rsqs~s(sds))
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")
```

Aha... remember what we talked about in class today? It looks like we have overfitted. So I constrain the GAM.

```{r}
library(mgcv)
gam1=gam(Rsqs~s(sds,k=3))
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")
```

That was too much...

```{r}
library(mgcv)
gam1=gam(Rsqs~s(sds,k=6))
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")
```

That is already overfitting... Conclusion: the GAM is not the right tool here :)

What is...? Well, stay tuned and one day you'll learn!
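One last aside: part of the wiggliness the GAM is chasing above is just simulation noise, since each standard deviation was used to simulate a single dataset. A minimal sketch (not part of the original exercise; the number of replicates `nreps` is an arbitrary choice) that averages the $R^2$ over a few replicate datasets per standard deviation should smooth out much of that noise and make the underlying shape easier to see.

```{r}
# sketch, not part of the original exercise: average R^2 over several
# replicate datasets per standard deviation, so the curve reflects the
# underlying relationship rather than the noise of a single simulation
nreps=20
meanRsqs=numeric(nsds)
for (i in 1:nsds){
  reps=numeric(nreps)
  for (j in 1:nreps){
    ys=a+b*xs+rnorm(n,0,sds[i])
    reps[j]=summary(lm(ys~xs))$r.squared
  }
  meanRsqs[i]=mean(reps)
}
plot(sds,meanRsqs)
```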