We load up some useful libraries first
library(vegan)
#for whatever might be required!
library(MASS)
#Mixture and Flexible Discriminant Analysis
library(mda)
#for the confusion matrix
library (class)
Recolheram-se dados de abundância de várias espécies de anfíbios em vários distritos de Portugal (DataTP11anfibios.csv). Efectue um escalamento multidimensional com base em distâncias euclideanas e retire as principais conclusões.
We read in the data (note this data was also used in FT10!)
anfi <- read.csv2("DataTP11anfibios.csv")
and check it looks OK
str(anfi)
## 'data.frame': 15 obs. of 7 variables:
## $ Distritos : Factor w/ 15 levels "AVEI","BEJA",..: 7 2 8 13 11 10 5 14 9 1 ...
## $ Rana : num 4.44 2.18 0.47 1.22 4.66 2.68 1.45 0 1 2.35 ...
## $ Triturus : num 0 0 0 2.45 0 0 1.88 0 0 0 ...
## $ Salamandra : num 35.96 41.1 35.38 20.92 5.12 ...
## $ Hyla : num 3.2 0 0.71 0 1.82 0 0 1.5 1.71 6.47 ...
## $ Bufo : num 4.8 16.4 14.8 44.1 23.6 ...
## $ Pleurodeles: num 16 0.48 9.3 11.46 20.76 ...
We have 15 observations for 7 variables, which correspond respectively to locations and taxa (genera of amphibians). The first column does not represent a genera, but it contains the labels of the locations
anfi[,1]
## [1] EVOR BEJA FARO SETU LISB LEIR CBRA VISE GUAR AVEI COIM PORT BRAG BRCA
## [15] VREA
## 15 Levels: AVEI BEJA BRAG BRCA CBRA COIM EVOR FARO GUAR LEIR LISB ... VREA
The names of the genera are
names(anfi)[2:7]
## [1] "Rana" "Triturus" "Salamandra" "Hyla" "Bufo"
## [6] "Pleurodeles"
We can also see and if some locations have much more amphibians than others
nbyloc=data.frame(loc=anfi[,1],n=rowSums(anfi[,2:7]))
nbyloc
## loc n
## 1 EVOR 64.40
## 2 BEJA 60.12
## 3 FARO 60.68
## 4 SETU 80.13
## 5 LISB 56.00
## 6 LEIR 64.44
## 7 CBRA 55.34
## 8 VISE 59.52
## 9 GUAR 44.71
## 10 AVEI 61.30
## 11 COIM 67.15
## 12 PORT 72.92
## 13 BRAG 63.86
## 14 BRCA 58.43
## 15 VREA 0.00
with Vila Real representing a really weird location, with no data (might that be a mistake?).
We can also evaluate what are the most abundant taxa,
colSums(anfi[,2:7])
## Rana Triturus Salamandra Hyla Bufo Pleurodeles
## 21.99 5.44 146.68 36.63 471.83 186.43
We can calculate a distance matrix between locations (Euclidean distances)
row.names(anfi)=anfi[,1]
deucanfi<-dist(anfi[,-1])
dmananfi2<-dist(anfi[,-1],method="manhattan")
print(deucanfi,digits=2)
## EVOR BEJA FARO SETU LISB LEIR CBRA VISE GUAR AVEI COIM PORT BRAG BRCA
## BEJA 20.4
## FARO 12.9 10.8
## SETU 42.6 36.1 32.8
## LISB 36.5 42.1 33.8 27.8
## LEIR 44.5 54.9 45.1 39.8 18.1
## CBRA 40.5 47.4 38.5 31.5 7.5 12.3
## VISE 41.8 50.2 40.7 34.0 12.0 7.7 6.0
## GUAR 42.0 43.8 37.0 26.3 9.8 25.2 13.4 18.2
## AVEI 51.5 48.7 44.1 21.2 22.6 35.8 25.8 29.1 16.0
## COIM 67.0 59.9 57.6 28.1 40.9 52.1 43.3 46.0 34.1 19.0
## PORT 64.6 58.9 55.9 26.3 37.5 47.9 39.6 42.0 31.3 15.8 5.3
## BRAG 68.2 60.2 58.4 29.0 42.7 54.3 45.1 48.1 35.6 21.1 3.9 8.9
## BRCA 61.5 54.7 52.3 26.1 35.9 49.0 39.1 42.4 28.4 13.7 8.3 9.2 9.6
## VREA 40.0 44.3 39.5 50.2 32.3 44.2 35.5 39.7 30.7 43.6 59.8 58.4 60.3 52.2
and finally implement the multidimensional scaling (Note: this analysis, classical metric multidimensional scaling (MDS), is sometimes also referred to as principal coordinates analysis")
mdsanfi<- cmdscale(deucanfi, eig = TRUE)
#mdsanfi2<- cmdscale(dmananfi2, eig = TRUE)
#test how close these two analyisismatch up, i.e.
#considering the euclidean and manhattan distances, say
mdsanfi
## $points
## [,1] [,2]
## EVOR -35.9372068 -9.447106
## BEJA -25.9322705 -26.132662
## FARO -25.9645802 -15.938199
## SETU 5.6126292 -11.697835
## LISB -6.1061455 10.813623
## LEIR -8.9848858 25.213559
## CBRA -5.5366438 16.597692
## VISE -6.2936806 19.570766
## GUAR 0.4735304 8.222119
## AVEI 13.9550085 1.242643
## COIM 30.6908939 -6.783789
## PORT 28.3591494 -3.467991
## BRAG 31.5838299 -8.565286
## BRCA 24.5776694 -6.527845
## VREA -20.4972974 6.900310
##
## $eig
## [1] 6.820783e+03 2.864971e+03 1.171632e+03 6.150470e+01 1.776065e+01
## [6] 4.014496e+00 9.548624e-13 1.591834e-13 1.492092e-13 1.479654e-13
## [11] 5.156001e-14 -1.019713e-14 -4.809168e-14 -1.853059e-13 -6.608920e-13
##
## $x
## NULL
##
## $ac
## [1] 0
##
## $GOF
## [1] 0.8852984 0.8852984
and look at the results
somavalprop<-sum(mdsanfi$eig)
mdsanfi$eig[2]
## [1] 2864.971
percent2d<-(sum(mdsanfi$eig[1],mdsanfi$eig[2]))/somavalprop*100
percent2d
## [1] 88.52984
x <- mdsanfi$points[,1]
y <- mdsanfi$points[,2]
plot(x,y, xlab="Coordinate 1",ylab="Coordinate 2",type="n")
text(x,y, labels= anfi[,1],cex=0.6)
#adding a polygon to a plot is easy
#to find the required coordinates consider function
#locator()
#use locator(k) to find coordinates of k points
#us mouse and click over desired points
polygon(x=c(20,34,34,20),y=c(-10,-10,0,0),lty=2)
polygon(x=c(-12.4916579,-0.2896308,6.0951509,3.1155861,-10.9309335,-15.3293387),y=c(26.992671,26.992671,17.763229,2.523919,5.314215,16.690038),lty=2)
polygon(x=c(-32.92296,-23.41673,-20.15340,-27.53137,-37.60513,-38.74020),y=c(-0.4810154,-9.9250950,-27.0961487,-27.7400632,-24.0912143,-8.2079896),lty=2)
As in the previous exercises with this data set, there is a good latitudinal separation between the districts on the basis of the amphibian data. We can see 3 or 4 groups of locations.
We could see if the analysis seems to be efficient at describing the information contained in the original distances
par(mfrow=c(1,2),mar=c(4,4,0.5,0.5))
plot(1:15,mdsanfi$eig/sum(mdsanfi$eig),xlab="Dimension",ylab="Variance explained")
plot(1:15,cumsum(mdsanfi$eig/sum(mdsanfi$eig)),xlab="Dimension",ylab="Comulative variance explained")
Efectue um escalamento multidimensional não-métrico aos dados da alínea anterior e comente os resultados obtidos.
A Non-Metric Multidimensional Scaling “…chooses a k-dimensional (default k = 2) configuration to minimize the stress, the square root of the ratio of the sum of squared differences between the input distances and those of the configuration to the sum of configuration distances squared.” (stolen from ?isoMDS
)".
library(MASS)
nmmdsanfi<- isoMDS(deucanfi)
## initial value 11.014400
## iter 5 value 7.298933
## iter 10 value 6.345802
## final value 6.261120
## converged
nmmdsanfi
## $points
## [,1] [,2]
## EVOR -34.11519011 -12.716941
## BEJA -25.49278236 -25.989141
## FARO -24.15893226 -16.309516
## SETU 5.81708937 -11.759853
## LISB -5.10029304 8.858192
## LEIR -4.75141021 25.687765
## CBRA -3.48416531 14.291690
## VISE -5.22080235 19.179415
## GUAR 0.05379224 5.977481
## AVEI 13.13845563 1.171623
## COIM 30.54658981 -6.067311
## PORT 27.70084748 -2.225513
## BRAG 31.34666571 -7.962404
## BRCA 25.28447070 -5.484271
## VREA -31.56433532 13.348786
##
## $stress
## [1] 6.26112
x2<-nmmdsanfi$points[,1]
y2<-nmmdsanfi$points[,2]
plot(x2,y2,xlab="Coordinate 1",ylab="Coordinate 2",type="n")
text(x2,y2, labels= anfi[,1],cex=0.7)
There does not seem to be any major differences compared to what was obtained with the original analysis.
We could compare both analysis outputs directly by overlaying them on a single plot
plot(x2,y2,xlab="Coordinate 1",ylab="Coordinate 2",type="n")
text(x,y, labels= anfi[,1],cex=0.7,col=2)
text(x2,y2, labels= anfi[,1],cex=0.7,col=3)
Interestingly, every single district has about the same position on the biplot, except for Vila Real
, which was the district which we suspect had a record that was deleted (as it presented no amphibia at all, when all others had a reasonably large number of amphibia). The non-metric analysis is less sensible to such data, and hence places the location closer to the center of the plot.
We could evaluate how reasonable was the representation in the 2 axis
nmmdsanfiB<- metaMDS(deucanfi)
## Run 0 stress 0.06206213
## Run 1 stress 0.1011647
## Run 2 stress 0.06206213
## ... Procrustes: rmse 2.630554e-06 max resid 4.482184e-06
## ... Similar to previous best
## Run 3 stress 0.1011647
## Run 4 stress 0.06206213
## ... New best solution
## ... Procrustes: rmse 2.033655e-05 max resid 6.395341e-05
## ... Similar to previous best
## Run 5 stress 0.1011647
## Run 6 stress 0.1188388
## Run 7 stress 0.06206214
## ... Procrustes: rmse 4.61858e-05 max resid 0.0001325306
## ... Similar to previous best
## Run 8 stress 0.06206227
## ... Procrustes: rmse 0.0001470754 max resid 0.000449502
## ... Similar to previous best
## Run 9 stress 0.1011647
## Run 10 stress 0.1011647
## Run 11 stress 0.06206213
## ... Procrustes: rmse 1.621825e-05 max resid 3.687544e-05
## ... Similar to previous best
## Run 12 stress 0.1011647
## Run 13 stress 0.1011647
## Run 14 stress 0.06206214
## ... Procrustes: rmse 3.65225e-05 max resid 8.587905e-05
## ... Similar to previous best
## Run 15 stress 0.06206214
## ... Procrustes: rmse 2.77913e-05 max resid 6.040747e-05
## ... Similar to previous best
## Run 16 stress 0.1011647
## Run 17 stress 0.06206214
## ... Procrustes: rmse 5.009538e-05 max resid 0.0001271436
## ... Similar to previous best
## Run 18 stress 0.06206213
## ... New best solution
## ... Procrustes: rmse 7.932605e-06 max resid 2.236813e-05
## ... Similar to previous best
## Run 19 stress 0.06206218
## ... Procrustes: rmse 2.578792e-05 max resid 4.027577e-05
## ... Similar to previous best
## Run 20 stress 0.06206225
## ... Procrustes: rmse 8.420252e-05 max resid 0.000248178
## ... Similar to previous best
## *** Solution reached
par(mfrow=c(1,2),mar=c(4,4,2.5,0.5))
stressplot(nmmdsanfiB,main="Sheppard plot")
gof=goodness(nmmdsanfiB)
plot(nmmdsanfiB,type="t",main="goodness of fit")
## species scores not available
points(nmmdsanfiB,display="sites",cex=gof*200)
Recolheu dados relativos a densidades de várias espécies de peixes em duas áreas distintas (1 e 2) (DataTP11peixesgrupos.csv). Realize uma análise discriminante e comente os resultados obtidos.
Read the data in
pg <- read.csv("DataTP11peixesgrupos.csv", sep=";")
and we take a peak at the data
str(pg)
## 'data.frame': 20 obs. of 11 variables:
## $ Group: int 1 1 1 1 1 1 1 1 1 2 ...
## $ Re_hi: int 0 0 0 0 1 0 0 0 0 3 ...
## $ An_de: int 0 0 3 0 0 0 0 0 2 0 ...
## $ An_mi: int 0 0 0 1 0 0 0 0 1 0 ...
## $ Hi_pl: int 5 6 7 8 9 8 7 8 9 12 ...
## $ An_lu: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Me_ae: int 23 24 25 24 23 24 25 26 27 0 ...
## $ Ra_ra: int 0 0 1 0 0 0 0 2 0 0 ...
## $ Mi_po: int 3 2 0 2 0 6 3 2 0 4 ...
## $ Ma_vi: int 1 0 0 3 0 0 3 16 0 0 ...
## $ Ga_mo: int 12 14 15 21 23 25 26 24 25 1 ...
We have 20 observations for 11 variables, which correspond respectively to locations and species of fish. The first column does not represent a genera, but it contains the habitat each location corresponds to.
pg[,1]
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
The names of the genera are
names(pg)[-1]
## [1] "Re_hi" "An_de" "An_mi" "Hi_pl" "An_lu" "Me_ae" "Ra_ra" "Mi_po"
## [9] "Ma_vi" "Ga_mo"
We can also see and if some locations have much more fish than others
nbyloc=data.frame(loc=pg[,1],n=rowSums(pg[,-1]))
nbyloc
## loc n
## 1 1 44
## 2 1 46
## 3 1 51
## 4 1 59
## 5 1 56
## 6 1 63
## 7 1 64
## 8 1 78
## 9 1 64
## 10 2 20
## 11 2 63
## 12 2 65
## 13 2 62
## 14 2 136
## 15 2 122
## 16 2 79
## 17 2 50
## 18 2 50
## 19 2 41
## 20 2 38
We can also evaluate what are the most abundant taxa,
sort(colSums(pg[,-1]))
## An_lu An_mi An_de Mi_po Ra_ra Re_hi Ga_mo Me_ae Hi_pl Ma_vi
## 2 16 19 23 45 80 238 248 282 298
Of course, what will be interesting is to separate the analysis per habitat. As an example, we might see what are the largest differences per habitat in terms of fish abundances
sort(round(colMeans(pg[pg$Group==1,-1])-colMeans(pg[pg$Group==2,-1]),1))
## Ma_vi Hi_pl Re_hi Ra_ra An_mi An_de An_lu Mi_po Ga_mo Me_ae
## -22.4 -12.1 -7.1 -3.5 -1.1 -0.7 -0.2 1.5 15.7 22.1
to conclude that Ma_vi
is much more abundant in habitat 2 than 1, while the contrary happens for say Me_ae
. These will likely be determinant species to separate the two groups!
Finally, we implement the descriminant analysis
dapg <- fda(Group~.,data=pg)
dapg
## Call:
## fda(formula = Group ~ ., data = pg)
##
## Dimension: 1
##
## Percent Between-Group Variance Explained:
## v1
## 100
##
## Degrees of Freedom (per dimension): 11
##
## Training Misclassification Error: 0 ( N = 20 )
we can see the estimated coefficients
coef(dapg)
## [,1]
## Intercept 8.51363455
## Re_hi 0.24295777
## An_de 0.20372710
## An_mi -0.32843407
## Hi_pl 0.13629716
## An_lu 1.53500253
## Me_ae -0.91703765
## Ra_ra -0.35050178
## Mi_po -0.09375134
## Ma_vi 0.04371099
## Ga_mo 0.01049302
and the means per group on the discriminant space - the larger the differences between these means the easier it should be to discriminate the groups (do not confuse with means of variables per group, note we had already obtained these manually above!)
dapg$means
## v1
## 1 -12.89983
## 2 10.55440
## attr(,"scaled:scale")
## [1] 0.08570205
With a single discriminant function the plot of the result cannot be obtained (and it would not be that informative, but contrast with plot from Exercício 4 below)
#this code chunck has eval=FALSE
#this returns an error if you try to run it
plot(dapg)
Finally, we could predict what would be the group assignment for the data we have based on the estimated model
prev<-predict(dapg,pg)
prev
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
## Levels: 1 2
For some reason the code for the plot of the object resulting from thelda call, as below, returns an error message regarding a “subscript out of bounds”. There is nothing incorrect about the code being used (analogous code isused below without any problem). I suspect this is a problem related with the acoordinateof one of the locations being infinity in a given discriminant function, but there is nothing obvious that might becausing it.
Any way, the best way to evaluate the performance of the analysis is to build a confusion matrix, that presents correct decisions and incorrect decisions. We want most of the decisions to be on the main diagonal (correct decisions!)
confusion(prev,pg$Group)
## true
## predicted 1 2
## 1 9 0
## 2 0 11
Note that the object resulting from the discriminant analysis itself already contains this information
dapg$confusion
## true
## predicted 1 2
## 1 9 0
## 2 0 11
Based on this data set, we would be able to perfectly separate each and every site according to two groups. Nonetheless, what should be noted is that we would like to be able to separate new samples from locations with unknown habitat to the right habitat, just based on their fish community.
As a note, we could have made also a logistic regression, a PCA, a cluster analysis over these data. You can do that to practice for the exam, comparing the results across analysis.
Efectue uma análise discriminante aos dados DataTP11solea.csv, referentes a variáveis morfométricas de 3 espécies de linguado: Solea solea, Solea senegalensis e Solea lascaris. Discuta os resultados obtidos.
We read in the data
lingua <- read.csv("DataTP11solea.csv", header=TRUE, sep=";")
and as usual conduct a brief EDA
str(lingua)
## 'data.frame': 121 obs. of 5 variables:
## $ especie: Factor w/ 3 levels "lascaris","senegalensis",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ m1 : num 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 ...
## $ m2 : num 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 ...
## $ m3 : num 1.5 1.3 1.4 1.7 1.5 1.7 1.5 1 1.7 1.9 ...
## $ m4 : num 0.4 0.4 0.3 0.3 0.3 0.2 0.4 0.2 0.5 0.2 ...
We have 121 observations for 5 variables, which correspond respectively to individuals of 3 species of linguado and their characteristics. The first column does not represent a characteristic, but it contains the species each individual belongs to. There are 3 species with a similar number of observations per species
table(lingua[,1])
##
## lascaris senegalensis solea
## 35 36 50
The variables considered are
names(lingua)[-1]
## [1] "m1" "m2" "m3" "m4"
It will be interesting is to separate the data per species, to see if any variables allow clear segregation across species. A set of boxplots is helpful for that
par(mfrow=c(2,2))
with(lingua,boxplot(m1~especie))
with(lingua,boxplot(m2~especie))
with(lingua,boxplot(m3~especie))
with(lingua,boxplot(m4~especie))
All variables show some apparent differences, so we should be able to separate the species efficiently based on these.
We now implement a discriminat analysis using again function fda
in package mda
.
linguada<-fda(especie~.,data=lingua)
linguada
## Call:
## fda(formula = especie ~ ., data = lingua)
##
## Dimension: 2
##
## Percent Between-Group Variance Explained:
## v1 v2
## 99.1 100.0
##
## Degrees of Freedom (per dimension): 5
##
## Training Misclassification Error: 0.02479 ( N = 121 )
The summary of the object is not very helpful
summary(linguada)
## Length Class Mode
## percent.explained 2 -none- numeric
## values 2 -none- numeric
## means 6 -none- numeric
## theta.mod 4 -none- numeric
## dimension 1 -none- numeric
## prior 3 table numeric
## fit 5 polyreg list
## call 3 -none- call
## terms 3 terms call
## confusion 9 table numeric
The default plot is actually very informative, and shows us that the analysis is very efficient at segregating the species
plot(linguada)
We can see the parameter values
coef.fda(linguada)
## [,1] [,2]
## Intercept -1.6277078 7.9937687
## m1 -0.9909587 -0.2178406
## m2 -1.5048351 -2.2048785
## m3 2.3325292 0.8028833
## m4 2.4613798 -2.5948115
coef(linguada)
## [,1] [,2]
## Intercept -1.6277078 7.9937687
## m1 -0.9909587 -0.2178406
## m2 -1.5048351 -2.2048785
## m3 2.3325292 0.8028833
## m4 2.4613798 -2.5948115
the percentage of variance explained by each linear discriminant function
linguada$percent.explained
## v1 v2
## 99.09563 100.00000
and the confusion matrix
linguada$confusion
## true
## predicted lascaris senegalensis solea
## lascaris 35 0 0
## senegalensis 0 35 2
## solea 0 1 48
which shows us that of 121 only 3 are not correctly classified.
Aplique técnicas de classificação (classificação hierárquica, com base na distância euclideana e no método da mínima variância) e de ordenação (análise de componentes principais) aos dados da alínea anterior (DataTP11solea2.csv). Compare os resultados e comente as principais diferenças obtidas.
OK, so here we can do anything we’d like really, just use the previous data for other type of analysis and compare results. Here are some possibilities
pcalingua<-princomp(lingua[,-1])
summary(pcalingua)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.9593246 0.51074904 0.27618854 0.146300909
## Proportion of Variance 0.9145805 0.06214759 0.01817274 0.005099213
## Cumulative Proportion 0.9145805 0.97672804 0.99490079 1.000000000
biplot(pcalingua)
## Hierarchical Cluster
lingueucD=dist(lingua[,-1])
hclingua<-hclust(lingueucD,method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
plot(hclingua)
Despite not requested, given we have 3 species, a non-hierachical cluster with 3 clusters might also be interesting
nhclingua<-kmeans(lingueucD,3)
We can check how well it does using again a confusion matrix
species=as.character(unique(lingua$especie))
#awkward tweaking required to lign up predictions with species
#because kmeans only produces a clustering, it does not
#associate labels to clusters
preds=ifelse(nhclingua$cluster==1,species[3],ifelse(nhclingua$cluster==2,species[2],species[1]))
confusion(preds,lingua$especie)
## true
## predicted lascaris senegalensis solea
## lascaris 0 36 49
## senegalensis 22 0 1
## solea 13 0 0
and while we do well at correctly predicting solea and lascaris, many of the senegalensis are mistaken as solea.
Note that while this could seem to imply that the Discriminant Analysis (DA) is better than the Non-Hierarchical Cluster (NHC) analysis, we should note these are intrinsically different approaches. While the latter is a unsupervised learning technique (the labels are not provided for the observations), the former is a supervised learning technique, and therefore not surprisingly is better at discriminating the groups we know exist, given the differences present.
Therefore, in general, if you know the groups a priori, distinguishing these based on a multivariate analysis is more sensible using the DA than the NHC analysis.