You were asked to simulate data assuming that the TRUE relation between the weight and the length of a lizard is given by
\[ y = 12 + 1.2 \times length \]
You were also told that the usual length of a lizard was between 5 and 20 cm.
We will have 97 lizards.
Then you were told to create the lengths:
set.seed(121)
n=97
#lengths
xs=runif(n,5,20)
hist(xs,main="Lengths (cm)")
and then to create the weights of the lizards:
a=12
b=1.2
ys=a+b*xs
If we plot the data, all the points fall on a single line. Why? Because there is no randomness.
plot(xs,ys)
This means that if you try to fit a model, you get a warning that the result might be unreliable:
summary(lm(ys~xs))
## Warning in summary.lm(lm(ys ~ xs)): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = ys ~ xs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.595e-15 -2.460e-15 -1.878e-15 -1.422e-15 1.873e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.200e+01 6.050e-15 1.983e+15 <2e-16 ***
## xs 1.200e+00 4.611e-16 2.603e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.934e-14 on 95 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.773e+30 on 1 and 95 DF, p-value: < 2.2e-16
So… we add some variance and plot the data:
ys=a+b*xs+rnorm(n,0,4)
plot(xs,ys)
Now, let’s consider different amounts of variance. To each plot we also add the real line (the one with the true parameter values, in green) and the line with the estimated parameter values (in red).
par(mfrow=c(2,3))
ys=a+b*xs+rnorm(n,0,1)
plot(xs,ys)
mod1=lm(ys~xs)
abline(mod1,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,2)
plot(xs,ys)
mod2=lm(ys~xs)
abline(mod2,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,4)
plot(xs,ys)
mod4=lm(ys~xs)
abline(mod4,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,10)
plot(xs,ys)
mod10=lm(ys~xs)
abline(mod10,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,20)
plot(xs,ys)
mod20=lm(ys~xs)
abline(mod20,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,100)
plot(xs,ys)
mod100=lm(ys~xs)
abline(mod100,col="red")
abline(a,b,col="green")
Not surprisingly, as the variance increases, the data look less and less like they come from a truly linear process.
You can also look at the model summaries, where you can see that the models become essentially useless as the variance increases! You can see that in the correlation, in the predictions generated from the model (compared to the truth), and in the significance of the coefficients associated with the regression parameters.
Make no mistake: the reality is always the same in terms of the fixed part of the model; it is only the variance that increases.
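To make the point about predictions concrete, here is a minimal sketch comparing the true mean weight of a 10 cm lizard with the predictions from a low-noise and a high-noise fit (the object names `mod.low`, `mod.high`, etc. are just for illustration, and the data are re-simulated here, so the exact numbers will not match the panels above):

```r
set.seed(121)
n=97
xs=runif(n,5,20)
a=12
b=1.2
#the same lengths, with little and with a lot of noise
ys.low=a+b*xs+rnorm(n,0,1)
ys.high=a+b*xs+rnorm(n,0,100)
mod.low=lm(ys.low~xs)
mod.high=lm(ys.high~xs)
#true mean weight of a 10 cm lizard vs. the two model predictions
truth=a+b*10
pred.low=as.numeric(predict(mod.low,newdata=data.frame(xs=10)))
pred.high=as.numeric(predict(mod.high,newdata=data.frame(xs=10)))
c(truth=truth,low=pred.low,high=pred.high)
```

The low-noise prediction lands very close to the truth (24 g), while the high-noise one can be far off, even though the fixed part of both models is the same.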
Also, don’t get confused: the green lines might look different across panels, but they are always exactly the same line! You can check that by forcing the y axis to span the same limits:
par(mfrow=c(2,3))
ys=a+b*xs+rnorm(n,0,1)
plot(xs,ys,ylim=c(-400,400))
mod1=lm(ys~xs)
abline(mod1,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,2)
plot(xs,ys,ylim=c(-400,400))
mod2=lm(ys~xs)
abline(mod2,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,4)
plot(xs,ys,ylim=c(-400,400))
mod4=lm(ys~xs)
abline(mod4,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,10)
plot(xs,ys,ylim=c(-400,400))
mod10=lm(ys~xs)
abline(mod10,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,20)
plot(xs,ys,ylim=c(-400,400))
mod20=lm(ys~xs)
abline(mod20,col="red")
abline(a,b,col="green")
ys=a+b*xs+rnorm(n,0,100)
plot(xs,ys,ylim=c(-400,400))
mod100=lm(ys~xs)
abline(mod100,col="red")
abline(a,b,col="green")
But then you lose the ability to see the actual data in some of the plots, so that is not really that useful!
Below I look at the summary of each model. Look at the estimated values of the parameters, their corresponding standard errors, and the \(R^2\).
summary(mod1)
##
## Call:
## lm(formula = ys ~ xs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2297 -0.5755 -0.0093 0.5758 3.1122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.13296 0.31728 38.24 <2e-16 ***
## xs 1.18951 0.02418 49.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 95 degrees of freedom
## Multiple R-squared: 0.9622, Adjusted R-squared: 0.9618
## F-statistic: 2420 on 1 and 95 DF, p-value: < 2.2e-16
summary(mod2)
##
## Call:
## lm(formula = ys ~ xs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0496 -1.0818 0.0395 1.3671 4.7968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.78179 0.57907 22.07 <2e-16 ***
## xs 1.14051 0.04413 25.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.851 on 95 degrees of freedom
## Multiple R-squared: 0.8755, Adjusted R-squared: 0.8742
## F-statistic: 667.9 on 1 and 95 DF, p-value: < 2.2e-16
summary(mod4)
##
## Call:
## lm(formula = ys ~ xs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3229 -2.7291 -0.4323 2.3578 12.5807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.47511 1.14862 10.86 <2e-16 ***
## xs 1.12565 0.08754 12.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.671 on 95 degrees of freedom
## Multiple R-squared: 0.6351, Adjusted R-squared: 0.6313
## F-statistic: 165.4 on 1 and 95 DF, p-value: < 2.2e-16
summary(mod10)
##
## Call:
## lm(formula = ys ~ xs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.332 -7.700 1.157 6.395 25.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.4371 3.1514 2.360 0.0203 *
## xs 1.4793 0.2402 6.159 1.75e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.07 on 95 degrees of freedom
## Multiple R-squared: 0.2854, Adjusted R-squared: 0.2779
## F-statistic: 37.94 on 1 and 95 DF, p-value: 1.746e-08
summary(mod20)
##
## Call:
## lm(formula = ys ~ xs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.36 -14.88 0.09 12.18 56.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6794 6.6835 1.747 0.0838 .
## xs 1.2592 0.5094 2.472 0.0152 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.36 on 95 degrees of freedom
## Multiple R-squared: 0.06044, Adjusted R-squared: 0.05055
## F-statistic: 6.111 on 1 and 95 DF, p-value: 0.01521
summary(mod100)
##
## Call:
## lm(formula = ys ~ xs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -212.693 -56.234 2.761 69.244 184.494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.037 29.856 -0.135 0.893
## xs 1.545 2.275 0.679 0.499
##
## Residual standard error: 95.42 on 95 degrees of freedom
## Multiple R-squared: 0.004829, Adjusted R-squared: -0.005646
## F-statistic: 0.461 on 1 and 95 DF, p-value: 0.4988
As an example, we can plot the \(R^2\) as a function of the standard deviation of the error:
plot(c(1,2,4,10,20,100),
     c(summary(mod1)$r.squared,summary(mod2)$r.squared,summary(mod4)$r.squared,
       summary(mod10)$r.squared,summary(mod20)$r.squared,summary(mod100)$r.squared))
That is quite interesting, actually… There seems to be a nonlinear relationship, but we only have a sample of six different standard deviations (recall that the variance is just the standard deviation squared), so it is hard to tell…
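As a side note (not something you are expected to know yet), for a simple linear regression with random \(x\) one can work out the value the \(R^2\) should be close to, as a function of the error standard deviation \(\sigma\):

\[ R^2 \approx \frac{b^2\,\mathrm{Var}(x)}{b^2\,\mathrm{Var}(x)+\sigma^2} \]

Since the lengths come from a Uniform(5, 20), \(\mathrm{Var}(x)=(20-5)^2/12=18.75\), so with \(b=1.2\) this gives \(R^2 \approx 27/(27+\sigma^2)\): about 0.96 for \(\sigma=1\) and about 0.06 for \(\sigma=20\), in good agreement with the summaries above.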
Let’s show off in R…
sds=seq(0.5,100,by=0.5)
nsds=length(sds)
#an object to hold the R-squared values
Rsqs=numeric(nsds)
for (i in 1:nsds){
#create data
ys=a+b*xs+rnorm(n,0,sds[i])
#estimate model
modi=lm(ys~xs)
#get R-squared
Rsqs[i]=summary(modi)$r.squared
}
#and at the end... plot results
plot(sds,Rsqs)
How cool is that!! Actually, this means we can model the \(R^2\) as a function of the error standard deviation! But we would not be able to model it with a straight-line linear model…
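To see why a straight line would not do, here is a quick sketch (re-simulating the data, so the values differ slightly from the plot above) that fits a straight line to the \(R^2\) values:

```r
set.seed(121)
n=97
xs=runif(n,5,20)
a=12
b=1.2
sds=seq(0.5,100,by=0.5)
Rsqs=numeric(length(sds))
for (i in seq_along(sds)){
  ys=a+b*xs+rnorm(n,0,sds[i])
  Rsqs[i]=summary(lm(ys~xs))$r.squared
}
#a straight line cannot follow the sharp drop and then the long flat tail
lmR=lm(Rsqs~sds)
plot(sds,Rsqs)
abline(lmR,col="blue")
```

The residuals of this fit would show a clear pattern, a telltale sign that a straight line is the wrong model here.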
You are not supposed to know about this yet, but I’ll continue to show off. Let’s use a GAM:
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
gam1=gam(Rsqs~s(sds))
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")
Aha… remember what we talked about in class today? It seems like we have overfitted. So I constrain the GAM:
library(mgcv)
gam1=gam(Rsqs~s(sds,k=3))
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")
That was too much…
library(mgcv)
gam1=gam(Rsqs~s(sds,k=6))
#make predictions to plot the estimated GAM model
predRsqs=predict.gam(gam1,newdata = list(sds=sds),type="response")
plot(sds,Rsqs)
lines(sds,predRsqs,col="red")
That is already overfitting… Conclusion: the GAM is not the right tool here :)
What is…? Well, stay tuned and one day you’ll learn!