Before we begin, we load the packages cluster and vegan, which provide the clustering methods used below.

library(cluster)
library(vegan)

Exercise 1

  1. Abundance data for several species (A to J) were collected at different locations corresponding to 3 habitat types (H1, H2 and H3) (DataTP9habitats123.csv). Perform a hierarchical classification analysis on the data. Select different distance measures and agglomeration (linkage) methods and evaluate their impact on the groups formed.

As usual, begin by reading the data in

#path="C:/Users/tam2/Dropbox/Trabalho/DBA/EcologiaNumerica/2018/Aulas/TPs/TP11/"
#file=paste0(path,"DataTP9habitats123.csv")
habs <- read.csv("DataTP9habitats123.csv", sep=";")

and as usual, check all is OK

str(habs)
## 'data.frame':    22 obs. of  11 variables:
##  $ Habitat: Factor w/ 22 levels "H1_1","H1_2",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ A      : num  0.05 0.02 0.01 0.01 0.02 0.08 0.03 0.02 0 0.01 ...
##  $ B      : num  0.22 0.18 0.18 0.15 0.12 0.03 0 0.02 0.03 0 ...
##  $ C      : num  0 0.04 0.05 0.01 0.04 0 0 0 0.01 0.01 ...
##  $ D      : num  0.2 0.13 0.11 0.1 0.02 0.02 0.02 0.02 0.01 0 ...
##  $ E      : num  0.03 0 0 0.01 0 0.02 0 0 0.01 0.01 ...
##  $ F      : num  0.06 0.15 0.13 0.17 0.37 0.21 0.19 0.19 0.2 0.23 ...
##  $ G      : num  0.16 0 0.09 0.1 0.12 0.12 0.34 0.2 0.26 0.11 ...
##  $ H      : num  0.03 0.05 0.08 0.04 0.08 0.16 0.11 0.31 0.25 0.21 ...
##  $ I      : num  0.06 0.04 0.06 0.05 0.02 0.05 0.1 0.03 0.08 0.1 ...
##  $ J      : num  0.16 0.36 0.32 0.33 0.16 0.32 0.17 0.2 0.14 0.27 ...

We have 22 observations, i.e. the number of locations, and 11 variables. The first variable is different from the others, as it is a factor representing the labels of the locations.

We can check how many sites were sampled in each habitat (note the use of the function substr to extract part of a string at predefined positions)

barplot(table(substr(habs$Habitat,2,2)),xlab="Habitat")

We use here the function dist to get the association (distance) measure between locations, but recall the Aula Teórica: there are several functions that can do the same thing (e.g. daisy from package cluster, shown commented out below).

We compute both the Euclidean and the Manhattan distances (note that we remove the label column to avoid errors).

dEUC<-dist(habs[,-1], method = "euclidean")
dMAN<-dist(habs[,-1], method = "manhattan")
## the same thing, using daisy from library cluster
#dEUC<-daisy(habs[,-1], metric = "euclidean")
#dMAN<-daisy(habs[,-1], metric = "manhattan")

More as a curiosity than anything else (although this is akin to what is required to calculate a cophenetic correlation coefficient), we can obtain the correlation between the two distance matrices. Naturally, these are strongly positively correlated.

cor(dEUC,dMAN)
## [1] 0.9666873

In fact, if we plot these two distances across sites, the relationship is extremely obvious (this is expected, as the measures are closely related - compare their mathematical expressions and think about what they mean in practice to understand why).
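For reference, writing \(x_{ik}\) for the abundance of species \(k\) at site \(i\) (our notation), the Euclidean distance is \(d_{ij}^{EUC}=\sqrt{\sum_k (x_{ik}-x_{jk})^2}\) while the Manhattan distance is \(d_{ij}^{MAN}=\sum_k |x_{ik}-x_{jk}|\); both accumulate the same per-species differences, one as squares (then square-rooted), the other as absolute values, so large values of one naturally go with large values of the other.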

plot(dEUC,dMAN)

Now we can perform the clustering, with the average linkage method

clustEUCave<-hclust(dEUC,method="average")
clustMANave<-hclust(dMAN,method="average")

We can see that the object resulting from a clustering algorithm is not that interesting per se.

clustEUCave
## 
## Call:
## hclust(d = dEUC, method = "average")
## 
## Cluster method   : average 
## Distance         : euclidean 
## Number of objects: 22

nor is the summary

summary(clustEUCave)
##             Length Class  Mode     
## merge       42     -none- numeric  
## height      21     -none- numeric  
## order       22     -none- numeric  
## labels       0     -none- NULL     
## method       1     -none- character
## call         3     -none- call     
## dist.method  1     -none- character

so essentially we can plot the objects to see the resulting grouping.

par(mfrow=c(1,2))
plot(clustEUCave,labels=habs[,1],cex=0.6)
plot(clustMANave,labels=habs[,1],cex=0.6)

The results are easier to interpret if we use only the habitat number as the label. To do so we begin by extracting the habitat number (1, 2 or 3) from the original data, using a string-manipulation function (substr).

habitat=substr(habs[,1],2,2)
habitat=as.numeric(as.character(habitat))
habitat
##  [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3

and then plot the same plots as above, with the newly created labels

par(mfrow=c(1,2))
plot(clustEUCave,labels=substr(habs[,1],2,2),cex=0.6)
plot(clustMANave,labels=substr(habs[,1],2,2),cex=0.6)

Next we also consider the single and complete linkage methods

par(mfrow=c(1,2))
clustEUCsin<-hclust(dEUC,method="single")
plot(clustEUCsin,labels=habs[,1],cex=0.6)
clustMANsin<-hclust(dMAN,method="single")
plot(clustMANsin,labels=habs[,1],cex=0.6)

par(mfrow=c(1,2))
clustEUCcom<-hclust(dEUC,method="complete")
plot(clustEUCcom,labels=habs[,1],cex=0.6)
clustMANcom<-hclust(dMAN,method="complete")
plot(clustMANcom,labels=habs[,1],cex=0.6)

Were the locations from each habitat better segregated with any of the above options (i.e. considering the different distances and linkage methods)?
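One way to check, sketched below, is to cut each of the dendrograms computed above into k = 3 groups with cutree and cross-tabulate the result against the known habitat (the list name arvores is ours); the closer a table is to having a single non-zero cell per habitat row, the better that combination segregates the habitats.

# compare the 3-group cut of each dendrogram with the known habitats
arvores <- list(EUCave=clustEUCave, MANave=clustMANave,
                EUCsin=clustEUCsin, MANsin=clustMANsin,
                EUCcom=clustEUCcom, MANcom=clustMANcom)
lapply(arvores, function(tree) table(habitat, grupo=cutree(tree, k=3)))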

We can explore the cophenetic correlation between the object resulting from the grouping procedure and the original distance matrix.

dcop=cophenetic(clustEUCave)
cor(as.numeric(dEUC),as.numeric(dcop))
## [1] 0.7818695

We can inspect the cophenetic distances and relate them to the dendrogram (note the severe rounding so that the matrix prints nicely).

round(dcop,1)
##      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
## 2  0.3                                                                    
## 3  0.3 0.1                                                                
## 4  0.3 0.1 0.1                                                            
## 5  0.3 0.3 0.3 0.3                                                        
## 6  0.3 0.3 0.3 0.3 0.3                                                    
## 7  0.3 0.3 0.3 0.3 0.3 0.2                                                
## 8  0.3 0.3 0.3 0.3 0.3 0.2 0.2                                            
## 9  0.3 0.3 0.3 0.3 0.3 0.2 0.2 0.1                                        
## 10 0.3 0.3 0.3 0.3 0.3 0.1 0.2 0.2 0.2                                    
## 11 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5                                
## 12 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.2                            
## 13 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.2 0.1                        
## 14 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.2 0.1 0.1                    
## 15 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.2 0.2 0.2 0.2                
## 16 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.2 0.1 0.1 0.1 0.2            
## 17 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.3 0.3 0.3 0.3 0.3 0.3        
## 18 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.2 0.2 0.2 0.2 0.2 0.2 0.3    
## 19 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.2 0.2 0.2 0.2 0.2 0.2 0.3 0.1
## 20 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.3 0.3 0.3 0.3 0.3 0.3 0.1 0.3
## 21 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.3 0.3 0.3 0.3 0.3 0.3 0.1 0.3
## 22 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.3 0.3 0.3 0.3 0.3 0.3 0.2 0.3
##     19  20  21
## 2             
## 3             
## 4             
## 5             
## 6             
## 7             
## 8             
## 9             
## 10            
## 11            
## 12            
## 13            
## 14            
## 15            
## 16            
## 17            
## 18            
## 19            
## 20 0.3        
## 21 0.3 0.1    
## 22 0.3 0.2 0.2
plot(clustEUCave)

From the R help on cophenetic, “the cophenetic distance between two observations that have been clustered is defined to be the intergroup dissimilarity at which the two observations are first combined into a single cluster. Note that this distance has many ties and restrictions. It can be argued that a dendrogram is an appropriate summary of some data if the correlation between the original distances and the cophenetic distances is high. Otherwise, it should simply be viewed as the description of the output of the clustering algorithm.”
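As a small check along those lines (a sketch using the trees created above), we can compare the cophenetic correlation across the three linkage methods applied to the Euclidean distance; higher values indicate dendrograms that better preserve the original distances.

# cophenetic correlation for each linkage method (Euclidean distance)
sapply(list(average=clustEUCave, single=clustEUCsin, complete=clustEUCcom),
       function(tree) cor(dEUC, cophenetic(tree)))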

Note also that we can use a function to draw, say, \(k\) groups on the dendrogram

plot(clustEUCave,labels=habitat)
rect.hclust(clustEUCave,k=3)

Exercise 2

  1. Perform a hierarchical classification analysis on the data DataTP9bentos.csv, concerning densities of benthic invertebrate species at several locations, and indicate which distance measures and agglomeration methods seem most appropriate to you.

We begin by reading the data in

#path="C:/Users/tam2/Dropbox/Trabalho/DBA/EcologiaNumerica/2018/Aulas/TPs/TP11/"
#file=paste0(path,"DataTP9bentos.csv")
bent <- read.csv2("DataTP9bentos.csv")

and check all is OK

str(bent)
## 'data.frame':    15 obs. of  7 variables:
##  $ Local   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Eurydice: num  4.4 2.2 0.5 1.2 4.7 2.7 1.5 0 1 2.4 ...
##  $ Idotea  : num  0.1 0.2 0.3 0.1 0.3 0.5 0.6 0.2 0.3 0.5 ...
##  $ Gammarus: num  36 41 36 21 5.2 1.6 2.1 2.6 0.8 1.2 ...
##  $ Crangon : num  3.2 0 0.7 0 1.8 0 0 1.5 1.8 6.5 ...
##  $ Cancer  : num  4.8 16.4 14.8 44.1 23.6 21.9 23.7 24 27.4 42 ...
##  $ Carcinus: num  16 0.5 9.3 11.4 20.7 38.2 26.2 31.5 13.7 9.3 ...

We have 15 observations, i.e. the number of locations, and 7 variables. As in exercise 1, the first variable is different from the others: it is an identifier (Local) labelling the locations.

We can perform a hierarchical cluster analysis using the Euclidean distance with Ward's linkage method (ward.D; Ward is a linkage criterion, not a distance). We can plot the result vertically or horizontally; for the latter, we need to transform the object into the class dendrogram (see code below). Note that we again drop the label column.

clustbent<-hclust(dist(bent[,-1],method="euclidean"),method="ward.D")
plot(clustbent)

plot(as.dendrogram(clustbent),horiz=TRUE)

The truth is that, given the data at hand, we need a measure for continuous values, but besides that there is little to go on to choose an appropriate method. Nonetheless, we could go wild and just try many combinations of distances and linkage methods

distancias=c("euclidean", "maximum", "manhattan", "canberra", "binary","minkowski")
nd=length(distancias)
aglomeracoes=c("ward.D", "ward.D2", "single", "complete","average","mcquitty","median","centroid")
na=length(aglomeracoes)
# loop over all combinations of distance measure and linkage method,
# plotting the resulting dendrogram for each (note we drop the label column)
for(i in 1:nd){
  for(j in 1:na){
    distancia=distancias[i]
    aglomeracao=aglomeracoes[j]
    clustbent<-hclust(dist(bent[,-1],method=distancia),method=aglomeracao)
    plot(clustbent,main=paste0("D=",distancia,", A=",aglomeracao))
  }
}

Exercise 3

  1. Perform a non-hierarchical classification analysis on the data from the two previous exercises, forming 2 to 3 groups. Explore the results, compare the different analyses and indicate which one(s) seem most appropriate to you.

Data Habitats

cnh2<-kmeans(habs[,-1],2)
cnh3<-kmeans(habs[,-1],3)

We can look at the resulting objects. First the group assignments

cnh2$cluster
##  [1] 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1
cnh3$cluster
##  [1] 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 2 1 1 2 2 2

and then the total sum of squares which, since the data are the same, is equal irrespective of the number of groups selected

cnh2$totss
## [1] 1.722877
cnh3$totss
## [1] 1.722877

Then, the within group sum of squares (we want this small!)

cnh2$withinss
## [1] 0.4011667 0.4121300
cnh3$withinss
## [1] 0.1096125 0.0587250 0.4121300

and the between group sum of squares (we want this large!)

cnh2$betweenss
## [1] 0.9095806
cnh3$betweenss
## [1] 1.14241

and finally the number of locations per group

cnh2$size
## [1] 12 10
cnh3$size
## [1]  8  4 10
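These three quantities are tied together by the identity total SS = between-group SS + sum of the within-group SS, which we can verify directly (a quick sanity check on the objects above):

# sanity check: totss equals betweenss plus the summed withinss
cnh3$totss
cnh3$betweenss + sum(cnh3$withinss)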

Note that we can also check how well the 3 original habitats were segregated by the groups formed. A simple way to compare the habitats with the 3-group non-hierarchical (k-means) solution is

plot(jitter(habitat),jitter(cnh3$cluster))

table(habitat,cnh3$cluster)
##        
## habitat 1 2 3
##       1 0 0 8
##       2 6 0 2
##       3 2 4 0

Looking at the table, all the habitat 1 locations fall into a single cluster (cluster 3); habitat 2 is split between clusters 1 and 3, so a couple of its locations have communities similar to those of habitat 1; and habitat 3 is split between cluster 1 (shared with habitat 2) and cluster 2, which contains only habitat 3 locations, so some habitat 3 locations host a rather particular community that does not group with locations from the other habitats. Note that k-means starts from random centres, so without setting a seed (set.seed) the cluster labels, and sometimes the grouping itself, can change between runs.

What might be the best number of groups to partition the data? That depends on the clustering structure created and whether we can interpret it. Nonetheless, there are automated criteria that we can use to choose an optimal number of groups.

Therefore, we could actually check what might happen if we considered, say, 2 to 8 groups, using the function cascadeKM. The plot of the resulting object is very useful, as it provides an indication of what might be the best number of groups according to a number of possible criteria. The most common are the ssi (“Simple Structure Index”) and the calinski (the Calinski-Harabasz (1974) criterion). From the cascadeKM function help: “In a simulation study, Milligan and Cooper (1985) found that the Calinski-Harabasz criterion recovered the correct number of groups the most often. We recommend this criterion because, if the groups are of equal sizes, the maximum value of ‘calinski’ usually indicates the correct number of groups. Another available index is the simple structure index ‘ssi’. Users should not take the indications of these indices literally when the groups are not equal in size and explore the groups corresponding to other values of K.”

km2to8=cascadeKM(habs[,-1],inf.gr = 2,sup.gr = 8,criterion = "ssi")
plot(km2to8,sortg=TRUE)

Note that here we used the ssi criterion to evaluate the best number of groups, but there are others one might choose, e.g. the Calinski criterion (below). The best partition would be 4 groups for the first criterion but only 2 for the second. Yes, it’s true, cluster analysis is somewhat arbitrary, and it is hard to know what the best procedure is. I would say that the best one is the one that produces an output you can use to make an ecological interpretation that is sensible and useful.

km2to8=cascadeKM(habs[,-1],inf.gr = 2,sup.gr = 8,criterion = "calinski")
plot(km2to8,sortg=TRUE)

Data Benthos

set.seed(123)
bentcnh2<-kmeans(bent[,-1],2)
bentcnh3<-kmeans(bent[,-1],3)

We can look at the resulting objects.

bentcnh2
## K-means clustering with 2 clusters of sizes 9, 6
## 
## Cluster means:
##   Eurydice    Idotea Gammarus  Crangon   Cancer Carcinus
## 1 2.133333 0.4111111 14.17778 1.133333 17.47778 17.43333
## 2 0.850000 0.3833333  3.70000 4.633333 52.55000  5.30000
## 
## Clustering vector:
##  [1] 1 1 1 2 1 1 1 1 1 2 2 2 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 4591.3000  833.9717
##  (between_SS / total_SS =  49.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
bentcnh3
## K-means clustering with 3 clusters of sizes 6, 3, 6
## 
## Cluster means:
##   Eurydice    Idotea  Gammarus  Crangon   Cancer Carcinus
## 1 0.850000 0.3833333  3.700000 4.633333 52.55000     5.30
## 2 2.366667 0.2000000 37.666667 1.300000 12.00000     8.60
## 3 2.016667 0.5166667  2.433333 1.050000 20.21667    21.85
## 
## Clustering vector:
##  [1] 2 2 2 1 3 3 3 3 3 1 1 1 1 1 3
## 
## Within cluster sum of squares by cluster:
## [1]  833.9717  229.8933 1391.9083
##  (between_SS / total_SS =  77.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

We can look at the objects subcomponents too. First the group assignments

bentcnh2$cluster
##  [1] 1 1 1 2 1 1 1 1 1 2 2 2 2 2 1
bentcnh3$cluster
##  [1] 2 2 2 1 3 3 3 3 3 1 1 1 1 1 3

Then, the within group sum of squares (we want this small!)

bentcnh2$withinss
## [1] 4591.3000  833.9717
bentcnh3$withinss
## [1]  833.9717  229.8933 1391.9083

and the between group sum of squares (we want this large!)

bentcnh2$betweenss
## [1] 5403.456
bentcnh3$betweenss
## [1] 8372.955

and finally the number of locations per group

bentcnh2$size
## [1] 9 6
bentcnh3$size
## [1] 6 3 6

Since here we were not given any indication of what clustering patterns to expect, nor any details about the data, we cannot make strong claims about which number of groups is more sensible to try.
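Nonetheless, as a sketch, we could screen a range of candidate numbers of groups with cascadeKM, exactly as we did for the habitats data (bentk2to8 is just an illustrative object name):

# screen 2 to 8 groups for the benthos data using the Calinski-Harabasz criterion
bentk2to8 <- cascadeKM(bent[,-1], inf.gr = 2, sup.gr = 8, criterion = "calinski")
plot(bentk2to8, sortg = TRUE)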

We can check what seem to be the major differences between the groups created. It seems to be Cancer (more abundant in group 2) vs. Carcinus and Gammarus (more abundant in group 1).

medbygroup2<- aggregate(bent[,-1],by=list(bentcnh2$cluster), FUN=mean)
medbygroup2=medbygroup2[,-1]
plot(t(medbygroup2),xlab="Grupo 1",ylab="Grupo 2",type="n")
text(t(medbygroup2), colnames(medbygroup2), cex = 0.5)
abline(a=0,b=1)

plot(bent,col=bentcnh2$cluster)

Exercise 4

  1. Perform a hierarchical classification analysis on the data DataTP9bentos.csv, using the Jaccard association measure and the minimum-variance (Ward) agglomeration method.

There’s nothing to think about here, we just do what we are told!

dJAC<-vegdist(bent[,-1],method="jaccard")
round(dJAC,2)
##       1    2    3    4    5    6    7    8    9   10   11   12   13   14
## 2  0.46                                                                 
## 3  0.31 0.26                                                            
## 4  0.63 0.60 0.51                                                       
## 5  0.64 0.73 0.65 0.55                                                  
## 6  0.76 0.80 0.74 0.66 0.36                                             
## 7  0.74 0.78 0.70 0.59 0.23 0.23                                        
## 8  0.75 0.80 0.71 0.62 0.28 0.21 0.15                                   
## 9  0.75 0.78 0.67 0.50 0.31 0.48 0.34 0.38                              
## 10 0.80 0.80 0.72 0.37 0.52 0.61 0.55 0.58 0.39                         
## 11 0.91 0.85 0.83 0.52 0.71 0.77 0.72 0.71 0.60 0.37                    
## 12 0.86 0.84 0.79 0.47 0.64 0.72 0.66 0.68 0.54 0.27 0.12               
## 13 0.93 0.84 0.85 0.54 0.72 0.78 0.73 0.73 0.62 0.43 0.09 0.20          
## 14 0.92 0.83 0.84 0.51 0.70 0.78 0.72 0.71 0.59 0.31 0.17 0.20 0.21     
## 15 0.89 0.91 0.92 0.94 0.87 0.91 0.90 0.92 0.90 0.90 0.96 0.94 0.94 0.96
clustJACwar<-hclust(dJAC,method="ward.D")
plot(clustJACwar)

plot(as.dendrogram(clustJACwar),horiz=TRUE)

Exercise 5

  1. The lipid profiles of organisms belonging to 3 distinct groups were characterized. Explore the data in DataTP9lipidos.csv using several classification (clustering) methodologies. Interpret and comment on the results.

We begin by reading the data in
lipidos <- read.csv2("DataTP9lipidos.csv")

and as usual, check all is OK

str(lipidos)
## 'data.frame':    15 obs. of  9 variables:
##  $ lipids: Factor w/ 15 levels "B1","B2","B3",..: 1 2 3 4 11 12 13 14 15 5 ...
##  $ a     : num  13.85 11.83 6.46 12.01 6.76 ...
##  $ b     : num  0.203 0.148 0 0.148 0 ...
##  $ c     : num  1.19 1.236 0.768 1.13 0.59 ...
##  $ d     : num  0.446 0.46 0.248 0.411 0.186 ...
##  $ e     : num  0.847 1.056 0.514 0.796 0.45 ...
##  $ f     : num  6.39 7.21 6.62 7.42 21.64 ...
##  $ g     : num  1.174 0.952 0.331 0.894 0.465 ...
##  $ h     : num  0 0 0 0.175 0.217 ...

We have 15 observations, i.e. the number of samples, and 9 variables. The first variable is different from the others, as it is a factor representing the sample labels. There are 3 types of samples: B, E and H.

type=substr(lipidos[,1],1,1)
type
##  [1] "B" "B" "B" "B" "H" "H" "H" "H" "H" "E" "E" "E" "E" "E" "E"
# Euclidean and Manhattan distances between samples (daisy from library cluster)
dais1<-daisy(lipidos[,-1], metric = "euclidean")
dais2<-daisy(lipidos[,-1], metric = "manhattan")
# hierarchical clustering with average linkage on the Manhattan distance
ca.lipidos<-hclust(dais2,method="average")
# the same result, using dist instead of daisy
ca.lipidos<-hclust(dist(lipidos[,-1],method="manhattan"),method="average")
ca.lipidos
## 
## Call:
## hclust(d = dist(lipidos[, -1], method = "manhattan"), method = "average")
## 
## Cluster method   : average 
## Distance         : manhattan 
## Number of objects: 15
plot(ca.lipidos,labels=type)

klip3<-kmeans(lipidos[,-1],3)
klip3
## K-means clustering with 3 clusters of sizes 3, 4, 8
## 
## Cluster means:
##           a         b         c         d         e         f         g
## 1 12.564049 0.1662921 1.1852793 0.4392386 0.8996581  7.007679 1.0064306
## 2  6.899518 0.0000000 0.5126613 0.1292619 0.4929220 23.350286 0.5044076
## 3  7.905915 0.1102234 0.7270442 0.2635101 0.5338341  7.703745 0.6655320
##            h
## 1 0.05844995
## 2 0.22905134
## 3 0.28772963
## 
## Clustering vector:
##  [1] 1 1 3 1 2 2 2 2 3 3 3 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1]  3.212695 42.757374 20.417411
##  (between_SS / total_SS =  92.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
klip2<-kmeans (lipidos[,-1],2)
klip2
## K-means clustering with 2 clusters of sizes 4, 11
## 
## Cluster means:
##          a         b         c         d         e         f         g
## 1 6.899518 0.0000000 0.5126613 0.1292619 0.4929220 23.350286 0.5044076
## 2 9.176315 0.1255149 0.8520174 0.3114360 0.6336043  7.513909 0.7585044
##           h
## 1 0.2290513
## 2 0.2251988
## 
## Clustering vector:
##  [1] 2 2 2 2 1 1 1 1 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 42.75737 73.22139
##  (between_SS / total_SS =  86.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
dist.eu.lip<-daisy(lipidos[,-1], metric = "euclidean")
klip3d<-kmeans(dist.eu.lip,3)
klip3d
## K-means clustering with 3 clusters of sizes 3, 4, 8
## 
## Cluster means:
##           1          2         3          4         5         6         7
## 1  1.448368  0.8831022  6.209846  0.8602341 15.787807 12.853377 19.680480
## 2 18.368778 16.9193199 16.779101 16.7550789  3.113878  4.521659  3.070447
## 3  6.317331  4.3038122  2.201622  4.4060971 14.008818 11.451626 18.039067
##           8         9        10        11        12        13        14
## 1 21.069239  5.468862  4.474243  4.100940  5.019769  5.351457  4.476256
## 2  3.757358 15.304123 15.617365 17.501876 17.589953 13.236646 15.175391
## 3 19.391460  1.655913  1.428601  2.200137  2.157734  2.640055  1.483465
##          15
## 1  4.971267
## 2 14.577488
## 3  1.655162
## 
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 
##  1  1  3  1  2  2  2  2  3  3  3  3  3  3  3 
## 
## Within cluster sum of squares by cluster:
## [1]  35.51305 573.18035 151.12105
##  (between_SS / total_SS =  91.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
klip3$cluster
##  [1] 1 1 3 1 2 2 2 2 3 3 3 3 3 3 3
klip3$withinss
## [1]  3.212695 42.757374 20.417411
klip3$betweenss
## [1] 801.1791
klip3$size
## [1] 3 4 8
plot(lipidos,col=klip3$cluster)

We can see that there is one B and one H that end up in the “wrong” group (the one dominated by the E samples). Besides that, the groups separate neatly. That seems to happen for both the hierarchical and the non-hierarchical analyses.

table(type,klip3$cluster)
##     
## type 1 2 3
##    B 3 0 1
##    E 0 0 6
##    H 0 4 1
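To check that the hierarchical analysis tells the same story, a minimal sketch: cut the tree ca.lipidos into 3 groups with cutree and cross-tabulate against the sample type.

# compare a 3-group cut of the hierarchical tree with the sample types
table(type, cutree(ca.lipidos, k = 3))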

Exercise 6

  1. You want to carry out a study on tree growth in high-altitude areas. You want to select two types of mountains (DataTP9montanhas.csv) and then take 5 replicates in each of these types. What analysis seems appropriate for this situation? Perform it in R and draw the main conclusions. Based on the results, what might weaken the experimental setting of the intended study?

We begin by reading the data in
mont=read.csv("DataTP9montanhas.csv", sep=";")
str(mont)
## 'data.frame':    15 obs. of  6 variables:
##  $ Serra        : Factor w/ 15 levels "Acor","Alvao",..: 5 8 13 1 15 11 12 3 2 7 ...
##  $ Altitude     : int  1993 1548 1486 1418 1416 1415 1382 1362 1283 1227 ...
##  $ Perc.Floresta: int  20 21 18 17 16 15 25 32 45 54 ...
##  $ Dispon.agua  : int  5 6 7 8 9 5 4 10 12 14 ...
##  $ Matos        : int  75 48 76 98 56 47 56 35 34 54 ...
##  $ Incendios    : int  3 4 6 3 6 5 4 12 15 14 ...

We have 15 observations, i.e. the number of mountains, and 6 variables. The first variable is different from the others, as it is a factor with the names of the different mountains.

mont[,1]
##  [1] Estrela    Geres      Montesinho Acor       Soajo      Marao     
##  [7] Montemuro  Amarela    Alvao      Gardunha   Lousa      Bornes    
## [13] Padrela    Falperra   Gralheira 
## 15 Levels: Acor Alvao Amarela Bornes Estrela Falperra Gardunha ... Soajo

We now perform a non-hierarchical (k-means) cluster analysis that produces two groups

mont.nhca<-kmeans(mont[,-1],2)
mont.nhca
## K-means clustering with 2 clusters of sizes 8, 7
## 
## Cluster means:
##   Altitude Perc.Floresta Dispon.agua    Matos Incendios
## 1 1502.500      20.50000     6.75000 61.37500   5.37500
## 2 1187.286      53.57143    14.14286 43.57143  16.14286
## 
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 302665.25  21694.57
##  (between_SS / total_SS =  53.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

There are several reasons why selecting replicates from two groups of locations based on a cluster analysis might not be reasonable. Even if the groups are distinct with respect to the 5 available variables, they might not differ in terms of other variables that are important for the response we want to investigate. Further, with 15 mountains and only 5 replicates in each of the 2 groups, at least 5 mountains will never be sampled, irrespective of the groups found!

Exercise 7

  1. Perform a hierarchical classification analysis on the data from the previous exercise, varying the distance measures and agglomeration methods, and state whether the results corroborate those obtained in the previous exercise.

We consider the Manhattan distance and the ward.D linkage method, and the two groups that were formed above seem to show up again (see the cross-tabulation sketched below).

ca.mont<-hclust(dist(mont[,-1],method="manhattan"),method="ward.D")
plot(ca.mont,labels=mont[,1])
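We can make that claim explicit with a small sketch, cross-tabulating a 2-group cut of this tree (via cutree) against the k-means groups obtained in exercise 6:

# compare a 2-group cut of the dendrogram with the k-means classification
table(hierarchical = cutree(ca.mont, k = 2), kmeans = mont.nhca$cluster)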

The same does not happen, say, for the Euclidean distance with complete linkage, where Estrela shows up separate from the other members of its group (assuming the grouping above was the more sensible one). Nonetheless, Estrela is the highest mountain in mainland Portugal, so it might indeed deserve a group of its own.

ca.mont2<-hclust(dist(mont[,-1],method="euclidean"),method="complete")
plot(ca.mont2,labels=mont[,1])

Exercise 8

  1. Carry out both analyses from exercises 6 and 7, but based on a data matrix transformed by centring and scaling the variables.

Standardize (centre and scale) the variables

mont.scale<-scale(mont[,-1])

Then implement the required analysis. First the hierarchical analysis

ca.mont.scale<-hclust(dist(mont.scale,method="manhattan"),method="ward.D")
plot(ca.mont.scale,labels=mont[,1])

Interestingly, Estrela no longer appears separate.

We then implement the non-hierarchical analysis

mont.scale.nhca<-kmeans(mont.scale,2)
mont.scale.nhca
## K-means clustering with 2 clusters of sizes 8, 7
## 
## Cluster means:
##     Altitude Perc.Floresta Dispon.agua      Matos  Incendios
## 1 -0.6582863     0.8048545   0.8164023 -0.5995296  0.8479258
## 2  0.7523272    -0.9198338  -0.9330312  0.6851767 -0.9690580
## 
## Clustering vector:
##  [1] 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1]  7.668136 13.885150
##  (between_SS / total_SS =  69.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Note that while the analysis seems a bit better (a higher between_SS / total_SS), the classification (i.e. which mountain gets assigned to which group) is almost the same as before: only one mountain (the 8th, Amarela) changes group, and recall that the k-means group labels themselves are arbitrary.
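A quick sketch makes the comparison explicit; because the labels are arbitrary, agreement between the two partitions shows up as counts concentrated in one cell per row.

# cross-tabulate the unscaled and scaled k-means classifications
table(unscaled = mont.nhca$cluster, scaled = mont.scale.nhca$cluster)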