Preliminary steps

We load up some useful libraries first

library(vegan)
#for whatever might be required!
library(MASS)
#Mixture and Flexible Discriminant Analysis 
library(mda)
#for the confusion matrix
library (class)

Exercício 1

Recolheram-se dados de abundância de várias espécies de anfíbios em vários distritos de Portugal (DataTP11anfibios.csv). Efectue um escalamento multidimensional com base em distâncias euclideanas e retire as principais conclusões.

We read in the data (note this data was also used in FT10!)

anfi <- read.csv2("DataTP11anfibios.csv")

and check it looks OK

str(anfi)
## 'data.frame':    15 obs. of  7 variables:
##  $ Distritos  : Factor w/ 15 levels "AVEI","BEJA",..: 7 2 8 13 11 10 5 14 9 1 ...
##  $ Rana       : num  4.44 2.18 0.47 1.22 4.66 2.68 1.45 0 1 2.35 ...
##  $ Triturus   : num  0 0 0 2.45 0 0 1.88 0 0 0 ...
##  $ Salamandra : num  35.96 41.1 35.38 20.92 5.12 ...
##  $ Hyla       : num  3.2 0 0.71 0 1.82 0 0 1.5 1.71 6.47 ...
##  $ Bufo       : num  4.8 16.4 14.8 44.1 23.6 ...
##  $ Pleurodeles: num  16 0.48 9.3 11.46 20.76 ...

We have 15 observations for 7 variables, which correspond respectively to locations and taxa (genera of amphibians). The first column does not represent a genera, but it contains the labels of the locations

anfi[,1]
##  [1] EVOR BEJA FARO SETU LISB LEIR CBRA VISE GUAR AVEI COIM PORT BRAG BRCA
## [15] VREA
## 15 Levels: AVEI BEJA BRAG BRCA CBRA COIM EVOR FARO GUAR LEIR LISB ... VREA

The names of the genera are

names(anfi)[2:7]
## [1] "Rana"        "Triturus"    "Salamandra"  "Hyla"        "Bufo"       
## [6] "Pleurodeles"

We can also see and if some locations have much more amphibians than others

nbyloc=data.frame(loc=anfi[,1],n=rowSums(anfi[,2:7]))
nbyloc
##     loc     n
## 1  EVOR 64.40
## 2  BEJA 60.12
## 3  FARO 60.68
## 4  SETU 80.13
## 5  LISB 56.00
## 6  LEIR 64.44
## 7  CBRA 55.34
## 8  VISE 59.52
## 9  GUAR 44.71
## 10 AVEI 61.30
## 11 COIM 67.15
## 12 PORT 72.92
## 13 BRAG 63.86
## 14 BRCA 58.43
## 15 VREA  0.00

with Vila Real representing a really weird location, with no data (might that be a mistake?).

We can also evaluate what are the most abundant taxa,

colSums(anfi[,2:7])
##        Rana    Triturus  Salamandra        Hyla        Bufo Pleurodeles 
##       21.99        5.44      146.68       36.63      471.83      186.43

We can calculate a distance matrix between locations (Euclidean distances)

row.names(anfi)=anfi[,1]
deucanfi<-dist(anfi[,-1])
dmananfi2<-dist(anfi[,-1],method="manhattan")
print(deucanfi,digits=2)
##      EVOR BEJA FARO SETU LISB LEIR CBRA VISE GUAR AVEI COIM PORT BRAG BRCA
## BEJA 20.4                                                                 
## FARO 12.9 10.8                                                            
## SETU 42.6 36.1 32.8                                                       
## LISB 36.5 42.1 33.8 27.8                                                  
## LEIR 44.5 54.9 45.1 39.8 18.1                                             
## CBRA 40.5 47.4 38.5 31.5  7.5 12.3                                        
## VISE 41.8 50.2 40.7 34.0 12.0  7.7  6.0                                   
## GUAR 42.0 43.8 37.0 26.3  9.8 25.2 13.4 18.2                              
## AVEI 51.5 48.7 44.1 21.2 22.6 35.8 25.8 29.1 16.0                         
## COIM 67.0 59.9 57.6 28.1 40.9 52.1 43.3 46.0 34.1 19.0                    
## PORT 64.6 58.9 55.9 26.3 37.5 47.9 39.6 42.0 31.3 15.8  5.3               
## BRAG 68.2 60.2 58.4 29.0 42.7 54.3 45.1 48.1 35.6 21.1  3.9  8.9          
## BRCA 61.5 54.7 52.3 26.1 35.9 49.0 39.1 42.4 28.4 13.7  8.3  9.2  9.6     
## VREA 40.0 44.3 39.5 50.2 32.3 44.2 35.5 39.7 30.7 43.6 59.8 58.4 60.3 52.2

and finally implement the multidimensional scaling (Note: this analysis, classical metric multidimensional scaling (MDS), is sometimes also referred to as principal coordinates analysis")

mdsanfi<- cmdscale(deucanfi, eig = TRUE)
#mdsanfi2<- cmdscale(dmananfi2, eig = TRUE)
#test how close these two analyisismatch up, i.e.
#considering the euclidean and manhattan distances, say
mdsanfi
## $points
##             [,1]       [,2]
## EVOR -35.9372068  -9.447106
## BEJA -25.9322705 -26.132662
## FARO -25.9645802 -15.938199
## SETU   5.6126292 -11.697835
## LISB  -6.1061455  10.813623
## LEIR  -8.9848858  25.213559
## CBRA  -5.5366438  16.597692
## VISE  -6.2936806  19.570766
## GUAR   0.4735304   8.222119
## AVEI  13.9550085   1.242643
## COIM  30.6908939  -6.783789
## PORT  28.3591494  -3.467991
## BRAG  31.5838299  -8.565286
## BRCA  24.5776694  -6.527845
## VREA -20.4972974   6.900310
## 
## $eig
##  [1]  6.820783e+03  2.864971e+03  1.171632e+03  6.150470e+01  1.776065e+01
##  [6]  4.014496e+00  9.548624e-13  1.591834e-13  1.492092e-13  1.479654e-13
## [11]  5.156001e-14 -1.019713e-14 -4.809168e-14 -1.853059e-13 -6.608920e-13
## 
## $x
## NULL
## 
## $ac
## [1] 0
## 
## $GOF
## [1] 0.8852984 0.8852984

and look at the results

somavalprop<-sum(mdsanfi$eig)
mdsanfi$eig[2]
## [1] 2864.971
percent2d<-(sum(mdsanfi$eig[1],mdsanfi$eig[2]))/somavalprop*100
percent2d
## [1] 88.52984
x <- mdsanfi$points[,1]
y <- mdsanfi$points[,2]
plot(x,y, xlab="Coordinate 1",ylab="Coordinate 2",type="n")
text(x,y, labels= anfi[,1],cex=0.6)
#adding a polygon to a plot is easy
#to find the required coordinates consider function
#locator()
#use locator(k) to find coordinates of k points
#us mouse and click over desired points
polygon(x=c(20,34,34,20),y=c(-10,-10,0,0),lty=2)
polygon(x=c(-12.4916579,-0.2896308,6.0951509,3.1155861,-10.9309335,-15.3293387),y=c(26.992671,26.992671,17.763229,2.523919,5.314215,16.690038),lty=2)
polygon(x=c(-32.92296,-23.41673,-20.15340,-27.53137,-37.60513,-38.74020),y=c(-0.4810154,-9.9250950,-27.0961487,-27.7400632,-24.0912143,-8.2079896),lty=2)

As in the previous exercises with this data set, there is a good latitudinal separation between the districts on the basis of the amphibian data. We can see 3 or 4 groups of locations.

We could see if the analysis seems to be efficient at describing the information contained in the original distances

par(mfrow=c(1,2),mar=c(4,4,0.5,0.5))
plot(1:15,mdsanfi$eig/sum(mdsanfi$eig),xlab="Dimension",ylab="Variance explained")
plot(1:15,cumsum(mdsanfi$eig/sum(mdsanfi$eig)),xlab="Dimension",ylab="Comulative variance explained")

Exercício 2

Efectue um escalamento multidimensional não-métrico aos dados da alínea anterior e comente os resultados obtidos.

A Non-Metric Multidimensional Scaling “…chooses a k-dimensional (default k = 2) configuration to minimize the stress, the square root of the ratio of the sum of squared differences between the input distances and those of the configuration to the sum of configuration distances squared.” (stolen from ?isoMDS)".

library(MASS)
nmmdsanfi<- isoMDS(deucanfi)
## initial  value 11.014400 
## iter   5 value 7.298933
## iter  10 value 6.345802
## final  value 6.261120 
## converged
nmmdsanfi
## $points
##              [,1]       [,2]
## EVOR -34.11519011 -12.716941
## BEJA -25.49278236 -25.989141
## FARO -24.15893226 -16.309516
## SETU   5.81708937 -11.759853
## LISB  -5.10029304   8.858192
## LEIR  -4.75141021  25.687765
## CBRA  -3.48416531  14.291690
## VISE  -5.22080235  19.179415
## GUAR   0.05379224   5.977481
## AVEI  13.13845563   1.171623
## COIM  30.54658981  -6.067311
## PORT  27.70084748  -2.225513
## BRAG  31.34666571  -7.962404
## BRCA  25.28447070  -5.484271
## VREA -31.56433532  13.348786
## 
## $stress
## [1] 6.26112
x2<-nmmdsanfi$points[,1]
y2<-nmmdsanfi$points[,2]
plot(x2,y2,xlab="Coordinate 1",ylab="Coordinate 2",type="n")
text(x2,y2, labels= anfi[,1],cex=0.7)

There does not seem to be any major differences compared to what was obtained with the original analysis.

We could compare both analysis outputs directly by overlaying them on a single plot

plot(x2,y2,xlab="Coordinate 1",ylab="Coordinate 2",type="n")
text(x,y, labels= anfi[,1],cex=0.7,col=2)
text(x2,y2, labels= anfi[,1],cex=0.7,col=3)

Interestingly, every single district has about the same position on the biplot, except for Vila Real, which was the district which we suspect had a record that was deleted (as it presented no amphibia at all, when all others had a reasonably large number of amphibia). The non-metric analysis is less sensible to such data, and hence places the location closer to the center of the plot.

We could evaluate how reasonable was the representation in the 2 axis

nmmdsanfiB<- metaMDS(deucanfi)
## Run 0 stress 0.06206213 
## Run 1 stress 0.1011647 
## Run 2 stress 0.06206213 
## ... Procrustes: rmse 2.630554e-06  max resid 4.482184e-06 
## ... Similar to previous best
## Run 3 stress 0.1011647 
## Run 4 stress 0.06206213 
## ... New best solution
## ... Procrustes: rmse 2.033655e-05  max resid 6.395341e-05 
## ... Similar to previous best
## Run 5 stress 0.1011647 
## Run 6 stress 0.1188388 
## Run 7 stress 0.06206214 
## ... Procrustes: rmse 4.61858e-05  max resid 0.0001325306 
## ... Similar to previous best
## Run 8 stress 0.06206227 
## ... Procrustes: rmse 0.0001470754  max resid 0.000449502 
## ... Similar to previous best
## Run 9 stress 0.1011647 
## Run 10 stress 0.1011647 
## Run 11 stress 0.06206213 
## ... Procrustes: rmse 1.621825e-05  max resid 3.687544e-05 
## ... Similar to previous best
## Run 12 stress 0.1011647 
## Run 13 stress 0.1011647 
## Run 14 stress 0.06206214 
## ... Procrustes: rmse 3.65225e-05  max resid 8.587905e-05 
## ... Similar to previous best
## Run 15 stress 0.06206214 
## ... Procrustes: rmse 2.77913e-05  max resid 6.040747e-05 
## ... Similar to previous best
## Run 16 stress 0.1011647 
## Run 17 stress 0.06206214 
## ... Procrustes: rmse 5.009538e-05  max resid 0.0001271436 
## ... Similar to previous best
## Run 18 stress 0.06206213 
## ... New best solution
## ... Procrustes: rmse 7.932605e-06  max resid 2.236813e-05 
## ... Similar to previous best
## Run 19 stress 0.06206218 
## ... Procrustes: rmse 2.578792e-05  max resid 4.027577e-05 
## ... Similar to previous best
## Run 20 stress 0.06206225 
## ... Procrustes: rmse 8.420252e-05  max resid 0.000248178 
## ... Similar to previous best
## *** Solution reached
par(mfrow=c(1,2),mar=c(4,4,2.5,0.5))
stressplot(nmmdsanfiB,main="Sheppard plot")
gof=goodness(nmmdsanfiB)
plot(nmmdsanfiB,type="t",main="goodness of fit")
## species scores not available
points(nmmdsanfiB,display="sites",cex=gof*200)

Exercício 3

Recolheu dados relativos a densidades de várias espécies de peixes em duas áreas distintas (1 e 2) (DataTP11peixesgrupos.csv). Realize uma análise discriminante e comente os resultados obtidos.

Read the data in

pg <- read.csv("DataTP11peixesgrupos.csv", sep=";")

and we take a peak at the data

str(pg)
## 'data.frame':    20 obs. of  11 variables:
##  $ Group: int  1 1 1 1 1 1 1 1 1 2 ...
##  $ Re_hi: int  0 0 0 0 1 0 0 0 0 3 ...
##  $ An_de: int  0 0 3 0 0 0 0 0 2 0 ...
##  $ An_mi: int  0 0 0 1 0 0 0 0 1 0 ...
##  $ Hi_pl: int  5 6 7 8 9 8 7 8 9 12 ...
##  $ An_lu: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Me_ae: int  23 24 25 24 23 24 25 26 27 0 ...
##  $ Ra_ra: int  0 0 1 0 0 0 0 2 0 0 ...
##  $ Mi_po: int  3 2 0 2 0 6 3 2 0 4 ...
##  $ Ma_vi: int  1 0 0 3 0 0 3 16 0 0 ...
##  $ Ga_mo: int  12 14 15 21 23 25 26 24 25 1 ...

We have 20 observations for 11 variables, which correspond respectively to locations and species of fish. The first column does not represent a genera, but it contains the habitat each location corresponds to.

pg[,1]
##  [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2

The names of the genera are

names(pg)[-1]
##  [1] "Re_hi" "An_de" "An_mi" "Hi_pl" "An_lu" "Me_ae" "Ra_ra" "Mi_po"
##  [9] "Ma_vi" "Ga_mo"

We can also see and if some locations have much more fish than others

nbyloc=data.frame(loc=pg[,1],n=rowSums(pg[,-1]))
nbyloc
##    loc   n
## 1    1  44
## 2    1  46
## 3    1  51
## 4    1  59
## 5    1  56
## 6    1  63
## 7    1  64
## 8    1  78
## 9    1  64
## 10   2  20
## 11   2  63
## 12   2  65
## 13   2  62
## 14   2 136
## 15   2 122
## 16   2  79
## 17   2  50
## 18   2  50
## 19   2  41
## 20   2  38

We can also evaluate what are the most abundant taxa,

sort(colSums(pg[,-1]))
## An_lu An_mi An_de Mi_po Ra_ra Re_hi Ga_mo Me_ae Hi_pl Ma_vi 
##     2    16    19    23    45    80   238   248   282   298

Of course, what will be interesting is to separate the analysis per habitat. As an example, we might see what are the largest differences per habitat in terms of fish abundances

sort(round(colMeans(pg[pg$Group==1,-1])-colMeans(pg[pg$Group==2,-1]),1))
## Ma_vi Hi_pl Re_hi Ra_ra An_mi An_de An_lu Mi_po Ga_mo Me_ae 
## -22.4 -12.1  -7.1  -3.5  -1.1  -0.7  -0.2   1.5  15.7  22.1

to conclude that Ma_vi is much more abundant in habitat 2 than 1, while the contrary happens for say Me_ae. These will likely be determinant species to separate the two groups!

Finally, we implement the descriminant analysis

dapg <- fda(Group~.,data=pg)
dapg
## Call:
## fda(formula = Group ~ ., data = pg)
## 
## Dimension: 1 
## 
## Percent Between-Group Variance Explained:
##  v1 
## 100 
## 
## Degrees of Freedom (per dimension): 11 
## 
## Training Misclassification Error: 0 ( N = 20 )

we can see the estimated coefficients

coef(dapg)
##                  [,1]
## Intercept  8.51363455
## Re_hi      0.24295777
## An_de      0.20372710
## An_mi     -0.32843407
## Hi_pl      0.13629716
## An_lu      1.53500253
## Me_ae     -0.91703765
## Ra_ra     -0.35050178
## Mi_po     -0.09375134
## Ma_vi      0.04371099
## Ga_mo      0.01049302

and the means per group on the discriminant space - the larger the differences between these means the easier it should be to discriminate the groups (do not confuse with means of variables per group, note we had already obtained these manually above!)

dapg$means
##          v1
## 1 -12.89983
## 2  10.55440
## attr(,"scaled:scale")
## [1] 0.08570205

With a single discriminant function the plot of the result cannot be obtained (and it would not be that informative, but contrast with plot from Exercício 4 below)

#this code chunck has eval=FALSE
#this returns an error if you try to run it
plot(dapg)

Finally, we could predict what would be the group assignment for the data we have based on the estimated model

prev<-predict(dapg,pg)
prev
##  [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
## Levels: 1 2

For some reason the code for the plot of the object resulting from thelda call, as below, returns an error message regarding a “subscript out of bounds”. There is nothing incorrect about the code being used (analogous code isused below without any problem). I suspect this is a problem related with the acoordinateof one of the locations being infinity in a given discriminant function, but there is nothing obvious that might becausing it.

Any way, the best way to evaluate the performance of the analysis is to build a confusion matrix, that presents correct decisions and incorrect decisions. We want most of the decisions to be on the main diagonal (correct decisions!)

confusion(prev,pg$Group)
##          true
## predicted  1  2
##         1  9  0
##         2  0 11

Note that the object resulting from the discriminant analysis itself already contains this information

dapg$confusion
##          true
## predicted  1  2
##         1  9  0
##         2  0 11

Based on this data set, we would be able to perfectly separate each and every site according to two groups. Nonetheless, what should be noted is that we would like to be able to separate new samples from locations with unknown habitat to the right habitat, just based on their fish community.

As a note, we could have made also a logistic regression, a PCA, a cluster analysis over these data. You can do that to practice for the exam, comparing the results across analysis.

Exercício 4

Efectue uma análise discriminante aos dados DataTP11solea.csv, referentes a variáveis morfométricas de 3 espécies de linguado: Solea solea, Solea senegalensis e Solea lascaris. Discuta os resultados obtidos.

We read in the data

lingua <- read.csv("DataTP11solea.csv", header=TRUE, sep=";")

and as usual conduct a brief EDA

str(lingua)
## 'data.frame':    121 obs. of  5 variables:
##  $ especie: Factor w/ 3 levels "lascaris","senegalensis",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ m1     : num  5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 ...
##  $ m2     : num  4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 ...
##  $ m3     : num  1.5 1.3 1.4 1.7 1.5 1.7 1.5 1 1.7 1.9 ...
##  $ m4     : num  0.4 0.4 0.3 0.3 0.3 0.2 0.4 0.2 0.5 0.2 ...

We have 121 observations for 5 variables, which correspond respectively to individuals of 3 species of linguado and their characteristics. The first column does not represent a characteristic, but it contains the species each individual belongs to. There are 3 species with a similar number of observations per species

table(lingua[,1])
## 
##     lascaris senegalensis        solea 
##           35           36           50

The variables considered are

names(lingua)[-1]
## [1] "m1" "m2" "m3" "m4"

It will be interesting is to separate the data per species, to see if any variables allow clear segregation across species. A set of boxplots is helpful for that

par(mfrow=c(2,2))
with(lingua,boxplot(m1~especie))
with(lingua,boxplot(m2~especie))
with(lingua,boxplot(m3~especie))
with(lingua,boxplot(m4~especie))

All variables show some apparent differences, so we should be able to separate the species efficiently based on these.

We now implement a discriminat analysis using again function fda in package mda.

linguada<-fda(especie~.,data=lingua)
linguada
## Call:
## fda(formula = especie ~ ., data = lingua)
## 
## Dimension: 2 
## 
## Percent Between-Group Variance Explained:
##    v1    v2 
##  99.1 100.0 
## 
## Degrees of Freedom (per dimension): 5 
## 
## Training Misclassification Error: 0.02479 ( N = 121 )

The summary of the object is not very helpful

summary(linguada)
##                   Length Class   Mode   
## percent.explained 2      -none-  numeric
## values            2      -none-  numeric
## means             6      -none-  numeric
## theta.mod         4      -none-  numeric
## dimension         1      -none-  numeric
## prior             3      table   numeric
## fit               5      polyreg list   
## call              3      -none-  call   
## terms             3      terms   call   
## confusion         9      table   numeric

The default plot is actually very informative, and shows us that the analysis is very efficient at segregating the species

plot(linguada)

We can see the parameter values

coef.fda(linguada)
##                 [,1]       [,2]
## Intercept -1.6277078  7.9937687
## m1        -0.9909587 -0.2178406
## m2        -1.5048351 -2.2048785
## m3         2.3325292  0.8028833
## m4         2.4613798 -2.5948115
coef(linguada)
##                 [,1]       [,2]
## Intercept -1.6277078  7.9937687
## m1        -0.9909587 -0.2178406
## m2        -1.5048351 -2.2048785
## m3         2.3325292  0.8028833
## m4         2.4613798 -2.5948115

the percentage of variance explained by each linear discriminant function

linguada$percent.explained
##        v1        v2 
##  99.09563 100.00000

and the confusion matrix

linguada$confusion
##               true
## predicted      lascaris senegalensis solea
##   lascaris           35            0     0
##   senegalensis        0           35     2
##   solea               0            1    48

which shows us that of 121 only 3 are not correctly classified.

Exercício 5

Aplique técnicas de classificação (classificação hierárquica, com base na distância euclideana e no método da mínima variância) e de ordenação (análise de componentes principais) aos dados da alínea anterior (DataTP11solea2.csv). Compare os resultados e comente as principais diferenças obtidas.

OK, so here we can do anything we’d like really, just use the previous data for other type of analysis and compare results. Here are some possibilities

Principal Component Analysis

pcalingua<-princomp(lingua[,-1])
summary(pcalingua)
## Importance of components:
##                           Comp.1     Comp.2     Comp.3      Comp.4
## Standard deviation     1.9593246 0.51074904 0.27618854 0.146300909
## Proportion of Variance 0.9145805 0.06214759 0.01817274 0.005099213
## Cumulative Proportion  0.9145805 0.97672804 0.99490079 1.000000000
biplot(pcalingua)

## Hierarchical Cluster

lingueucD=dist(lingua[,-1])
hclingua<-hclust(lingueucD,method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
plot(hclingua)

Non-Hierarchical Cluster

Despite not requested, given we have 3 species, a non-hierachical cluster with 3 clusters might also be interesting

nhclingua<-kmeans(lingueucD,3)

We can check how well it does using again a confusion matrix

species=as.character(unique(lingua$especie))
#awkward tweaking required to lign up predictions with species
#because kmeans only produces a clustering, it does not
#associate labels to clusters
preds=ifelse(nhclingua$cluster==1,species[3],ifelse(nhclingua$cluster==2,species[2],species[1]))
confusion(preds,lingua$especie)
##               true
## predicted      lascaris senegalensis solea
##   lascaris            0           36    49
##   senegalensis       22            0     1
##   solea              13            0     0

and while we do well at correctly predicting solea and lascaris, many of the senegalensis are mistaken as solea.

Note that while this could seem to imply that the Discriminant Analysis (DA) is better than the Non-Hierarchical Cluster (NHC) analysis, we should note these are intrinsically different approaches. While the latter is a unsupervised learning technique (the labels are not provided for the observations), the former is a supervised learning technique, and therefore not surprisingly is better at discriminating the groups we know exist, given the differences present.

Therefore, in general, if you know the groups a priori, distinguishing these based on a multivariate analysis is more sensible using the DA than the NHC analysis.