Simulating data from a model

You were asked to simulate data assuming that the TRUE relationship between the weight and the length of a lizard was given by

\[ weight = 12 + 1.2 \times length \]

You were also told that the usual length of a lizard was between 5 and 20 cm.

We will have 97 lizards.

Then you were told to create the lengths:

set.seed(121)
n=97
#lengths
xs=runif(n,5,20)
hist(xs,main="Lengths (cm)")

and then to create the weights of the lizards:

a=12
b=1.2
ys=a+b*xs

If we plot the data, all the points fall on a single line. Why? Because there is no randomness.

plot(xs,ys)
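Since there is no randomness, the weights are an exact linear function of the lengths, so (as a quick check, and up to numerical precision) the correlation between the two is 1:

cor(xs,ys)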

This means that if you try to fit a model, R warns you that the fit is essentially perfect and the summary might be unreliable:

summary(lm(ys~xs))
## Warning in summary.lm(lm(ys ~ xs)): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = ys ~ xs)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.595e-15 -2.460e-15 -1.878e-15 -1.422e-15  1.873e-13 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 1.200e+01  6.050e-15 1.983e+15   <2e-16 ***
## xs          1.200e+00  4.611e-16 2.603e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.934e-14 on 95 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 6.773e+30 on 1 and 95 DF,  p-value: < 2.2e-16

So…, we add some variance (random noise) and plot the data:

ys=a+b*xs+rnorm(n,0,4)
plot(xs,ys)

Now, let’s consider smaller and larger amounts of variance. We also add to each plot the true line (the one with the true parameter values, in green) and the fitted line (the one with the estimated parameter values, in red).

par(mfrow=c(2,3))
#standard deviation 1
ys=a+b*xs+rnorm(n,0,1)
plot(xs,ys)
mod1=lm(ys~xs)
abline(mod1,col="red")
abline(a,b,col="green")
#standard deviation 2
ys=a+b*xs+rnorm(n,0,2)
plot(xs,ys)
mod2=lm(ys~xs)
abline(mod2,col="red")
abline(a,b,col="green")
#standard deviation 4
ys=a+b*xs+rnorm(n,0,4)
plot(xs,ys)
mod4=lm(ys~xs)
abline(mod4,col="red")
abline(a,b,col="green")
#standard deviation 10
ys=a+b*xs+rnorm(n,0,10)
plot(xs,ys)
mod10=lm(ys~xs)
abline(mod10,col="red")
abline(a,b,col="green")
#standard deviation 20
ys=a+b*xs+rnorm(n,0,20)
plot(xs,ys)
mod20=lm(ys~xs)
abline(mod20,col="red")
abline(a,b,col="green")
#standard deviation 100
ys=a+b*xs+rnorm(n,0,100)
plot(xs,ys)
mod100=lm(ys~xs)
abline(mod100,col="red")
abline(a,b,col="green")

Not surprisingly, as the variance increases, the data look less and less like they come from a truly linear process.

You can also look at the model summaries, where you can see that the models become essentially useless as the variance increases! You can see that in the correlation (and hence the \(R^2\)), in the predictions generated by the model (compared with the truth), and in the significance of the coefficients associated with the regression parameters.
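For instance, a quick way to put the estimated coefficients side by side with the true values (this simply reuses the a, b and mod1 to mod100 objects created above; the row labels sd1 to sd100 are just for illustration):

#true parameter values versus the estimates from each fitted model
rbind(truth=c(a,b),
      sd1=coef(mod1),sd2=coef(mod2),sd4=coef(mod4),
      sd10=coef(mod10),sd20=coef(mod20),sd100=coef(mod100))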

Make no mistake: the reality is always the same in terms of the fixed part of the model; it is only the variance that increases.

Also, don’t get confused: the green lines might look different across the plots, but they are always exactly the same line! You can check that by forcing the y-axis to span the same limits.

par(mfrow=c(2,3))
ys=a+b*xs+rnorm(n,0,1)
plot(xs,ys,ylim=c(-400,400))
mod1=lm(ys~xs)
abline(mod1,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,2)
plot(xs,ys,ylim=c(-400,400))
mod2=lm(ys~xs)
abline(mod2,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,4)
plot(xs,ys,ylim=c(-400,400))
mod4=lm(ys~xs)
abline(mod4,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,10)
plot(xs,ys,ylim=c(-400,400))
mod10=lm(ys~xs)
abline(mod10,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,20)
plot(xs,ys,ylim=c(-400,400))
mod20=lm(ys~xs)
abline(mod20,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,100)
plot(xs,ys,ylim=c(-400,400))
mod100=lm(ys~xs)
abline(mod100,col="red")
abline(a,b,col="green")

But since you then lose the ability to actually see the data in some of the plots, that is not really that useful!

Below I look at the summary of each model. Look at the estimated values of the parameters, their corresponding standard errors, and the \(R^2\).

summary(mod1)
## 
## Call:
## lm(formula = ys ~ xs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2297 -0.5755 -0.0093  0.5758  3.1122 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.13296    0.31728   38.24   <2e-16 ***
## xs           1.18951    0.02418   49.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.014 on 95 degrees of freedom
## Multiple R-squared:  0.9622, Adjusted R-squared:  0.9618 
## F-statistic:  2420 on 1 and 95 DF,  p-value: < 2.2e-16
summary(mod2)
## 
## Call:
## lm(formula = ys ~ xs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0496 -1.0818  0.0395  1.3671  4.7968 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.78179    0.57907   22.07   <2e-16 ***
## xs           1.14051    0.04413   25.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.851 on 95 degrees of freedom
## Multiple R-squared:  0.8755, Adjusted R-squared:  0.8742 
## F-statistic: 667.9 on 1 and 95 DF,  p-value: < 2.2e-16
summary(mod4)
## 
## Call:
## lm(formula = ys ~ xs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3229 -2.7291 -0.4323  2.3578 12.5807 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.47511    1.14862   10.86   <2e-16 ***
## xs           1.12565    0.08754   12.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.671 on 95 degrees of freedom
## Multiple R-squared:  0.6351, Adjusted R-squared:  0.6313 
## F-statistic: 165.4 on 1 and 95 DF,  p-value: < 2.2e-16
summary(mod10)
## 
## Call:
## lm(formula = ys ~ xs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.332  -7.700   1.157   6.395  25.753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.4371     3.1514   2.360   0.0203 *  
## xs            1.4793     0.2402   6.159 1.75e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.07 on 95 degrees of freedom
## Multiple R-squared:  0.2854, Adjusted R-squared:  0.2779 
## F-statistic: 37.94 on 1 and 95 DF,  p-value: 1.746e-08
summary(mod20)
## 
## Call:
## lm(formula = ys ~ xs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.36 -14.88   0.09  12.18  56.18 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  11.6794     6.6835   1.747   0.0838 .
## xs            1.2592     0.5094   2.472   0.0152 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.36 on 95 degrees of freedom
## Multiple R-squared:  0.06044,    Adjusted R-squared:  0.05055 
## F-statistic: 6.111 on 1 and 95 DF,  p-value: 0.01521
summary(mod100)
## 
## Call:
## lm(formula = ys ~ xs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -212.693  -56.234    2.761   69.244  184.494 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -4.037     29.856  -0.135    0.893
## xs             1.545      2.275   0.679    0.499
## 
## Residual standard error: 95.42 on 95 degrees of freedom
## Multiple R-squared:  0.004829,   Adjusted R-squared:  -0.005646 
## F-statistic: 0.461 on 1 and 95 DF,  p-value: 0.4988

As an example, we can plot the \(R^2\) as a function of the standard deviation of the added noise:

plot(c(1,2,4,10,20,100),
     c(summary(mod1)$r.squared,summary(mod2)$r.squared,summary(mod4)$r.squared,
       summary(mod10)$r.squared,summary(mod20)$r.squared,summary(mod100)$r.squared))

That is quite interesting, actually… There seems to be a nonlinear relationship, but we only have a sample size of six (six different standard deviations; remember that the variance is just the standard deviation squared), so it is hard to tell…

Let’s show off in R…

sds=seq(0.5,100,by=0.5)
nsds=length(sds)
#an object to hold the R-squared values
Rsqs=numeric(nsds)
for (i in 1:nsds){
  #create data
  ys=a+b*xs+rnorm(n,0,sds[i])
  #estimate model
  modi=lm(ys~xs)
  #get R-squared
  Rsqs[i]=summary(modi)$r.squared
}
#and at the end... plot results
plot(sds,Rsqs)

How cool is that!! Actually, this means we can model the \(R^2\) as a function of the noise standard deviation! But we would not be able to model it using a linear model…
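Just to illustrate that last point, here is a quick sketch of how badly a straight line describes this relationship (it reuses the sds and Rsqs objects from above; the object name lmR is just for this example):

#a straight line clearly misses the curvature in the R-squared values
lmR=lm(Rsqs~sds)
plot(sds,Rsqs)
abline(lmR,col="blue")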

You are not supposed to know about this yet, but I’ll continue to show off. Let’s use a GAM:

library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
gam1=gam(Rsqs~s(sds))
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")

Aha… remember what we talked about in class today? It seems like we have overfitted. So I constrain the GAM.

library(mgcv)
gam1=gam(Rsqs~s(sds,k=3))
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")

That was constraining it too much…

library(mgcv)
gam1=gam(Rsqs~s(sds,k=6))
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")

That is already overfitting… Conclusion: the GAM is not the right tool here :)

What is…? Well, stay tuned and one day you’ll learn!