---
title: "Aula 11 - Ficha de trabalho 9"
author: "Tiago A. Marques"
date: "November 26, 2018"
output:
  html_document: default
  pdf_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Before we begin we load the packages `cluster` and `vegan`, required for implementing the required clustering methods.

```{r,message=FALSE,warning=FALSE}
library(cluster)
library(vegan)
```


# Exercício 1

1. Recolheram-se dados de abundância de várias espécies (A a J) em 3 diferentes locais correspondentes a 3 tipos de habitat (H1, H2 e H3) (DataTP9habitats123.csv). Efectue uma análise classificativa hierárquica aos dados. Seleccione diferentes medidas de distância e métodos de aglomeração e avalie o impacto nos agrupamentos formados.

As usual, begin by reading the data in 

```{r,cache=TRUE}
#path="C:/Users/tam2/Dropbox/Trabalho/DBA/EcologiaNumerica/2018/Aulas/TPs/TP11/"
#file=paste0(path,"DataTP9habitats123.csv")
habs <- read.csv("DataTP9habitats123.csv", sep=";")
```

and as usual, check all is OK

```{r,cache=TRUE}
str(habs)
```

We have `r nrow(habs)` observations, i.e number of locations, for `r ncol(habs)` variables. The first variable is different from the others, as it is a factor representing the labels of the locations.

We can check how many sites were sampled in each habitat (note the use of function `substr` to cut a string at predefined places)

```{r}
barplot(table(substr(habs$Habitat,2,2)),xlab="Habitat")
```

We use here the function `daisy` to get the association measure between locations, but recall the Aula Teórica, as there are many functions that can do the same thing.

We compute both the Euclidean and the Manhattan distances (note removing the label column to avoid errors).

```{r}
dEUC<-dist(habs[,-1], method = "euclidean")
dMAN<-dist(habs[,-1], method = "manhattan")
##same same - from library cluster
#dEUC<-daisy(habs[,-1], metric = "euclidean")
#dMAN<-daisy(habs[,-1], metric = "manhattan")
```
More as curiosity than anything else, although this will be akin to what might be required to calculate a cophenetic correlation coeficient) we can obtain the correlation between the two distances. Naturally, these are stongly positively correlated

```{r}
cor(dEUC,dMAN)
```

In fact, if we plot these two distances distances across sites, the relationship is extremely obvious (note this is expected, as they are closely related - compare their mathematical expressions and imagine what these meen in practice to understand why)

```{r}
plot(dEUC,dMAN)
```


Now we can perform the clustering, with  the average linkage method

```{r}
clustEUCave<-hclust(dEUC,method="average")
clustMANave<-hclust(dMAN,method="average")
```

We can see that the object resulting from a clustering algorithm is not that interesting per se.

```{r}
clustEUCave
```

nor is the summary

```{r}
summary(clustEUCave)
```

so essentially we can plot the objects to see the resulting grouping.

```{r}
par(mfrow=c(1,2))
plot(clustEUCave,labels=habs[,1],cex=0.6)
plot(clustMANave,labels=habs[,1],cex=0.6)
```

An easier to interpret results if we only use as label the corresponding habitat letter. To do so we begin by extracting the habitat label (1, 2 or 3) from the original data. We can use a function for sting manipulation to do so (`substr`)
 
```{r}
habitat=substr(habs[,1],2,2)
habitat=as.numeric(as.character(habitat))
habitat
```

and then plot the same plots as above, with the newly created labels

```{r}
par(mfrow=c(1,2))
plot(clustEUCave,labels=substr(habs[,1],2,2),cex=0.6)
plot(clustMANave,labels=substr(habs[,1],2,2),cex=0.6)
```

Next we consider also the single and complete methods of linkage

```{r}
par(mfrow=c(1,2))
clustEUCsin<-hclust(dEUC,method="single")
plot(clustEUCsin,labels=habs[,1],cex=0.6)
clustMANsin<-hclust(dMAN,method="single")
plot(clustMANsin,labels=habs[,1],cex=0.6)
```

```{r}
par(mfrow=c(1,2))
clustEUCcom<-hclust(dEUC,method="complete")
plot(clustEUCcom,labels=habs[,1],cex=0.6)
clustMANcom<-hclust(dMAN,method="complete")
plot(clustMANcom,labels=habs[,1],cex=0.6)
```

Where the locations from each habitat better segregated with any of the above options (i.e. considering the different distances and linkage methods)?

We can explore the cophenetic correlation between the object resulting from the grouping procedure and the original distance matrix.

```{r}
dcop=cophenetic(clustEUCave)
cor(as.numeric(dEUC),as.numeric(dcop))
```

We can see what is the cophenetic distance and relate it to the dendogram (note severe rounding to have it printing nicely).

```{r}
round(dcop,1)
plot(clustEUCave)
```

From the R help on `cophenetic`, "the cophenetic distance between two observations that have been clustered is defined to be the intergroup dissimilarity at which the two observations are first combined into a single cluster. Note that this distance has many ties and restrictions. It can be argued that a dendrogram is an appropriate summary of some data if the correlation between the original distances and the cophenetic distances is high. Otherwise, it should simply be viewed as the description of the output of the clustering algorithm."

Note also that we can use a function to draw say $k$ groups on the dendogram

```{r}
plot(clustEUCave,labels=habitat)
rect.hclust(clustEUCave,k=3)
```


# Exercício 2

2. Efectue uma análise classificativa hierárquica aos dados DataTP9bentos.csv, relativos a densidades de espécies de invertebrados bentónicos em vários locais, e indique que medidas de distância e métodos de aglomeração lhe parecem mais adequados.

We begin by reading the data in 

```{r,cache=TRUE}
#path="C:/Users/tam2/Dropbox/Trabalho/DBA/EcologiaNumerica/2018/Aulas/TPs/TP11/"
#file=paste0(path,"DataTP9bentos.csv")
bent <- read.csv2("DataTP9bentos.csv")
```

and check all is OK

```{r}
str(bent)
```

We have `r nrow(bent)` observations, i.e number of locations, for `r ncol(bent)` variables. As in exercise 1, the first variable is different from the others, as it is a factor representing the labels of the locations.

We can perform a cluster analysis using the ward distance. We can plot the result vertically or horizontally. For the latter, we need to transform the object into the class dendrogram (see code below).

```{r}
clustbent<-hclust(dist(bent,method="euclidean"),method="ward.D")
plot(clustbent)
plot(as.dendrogram(clustbent),horiz=TRUE)
```

The truth is that given the data at hand, we need a measure for continuous values, but besides that, there is little to go on to choose an appropriate method. Noneteless, we could go wild and just do many combinations of distances and methods of clustering


```{r}
distancias=c("euclidean", "maximum", "manhattan", "canberra", "binary","minkowski")
nd=length(distancias)
aglomeracoes=c("ward.D", "ward.D2", "single", "complete","average","mcquitty","median","centroid")
na=length(aglomeracoes)
for(i in 1:nd){
for(j in 1:na){  
distancia=distancias[i]
aglomeracao=aglomeracoes[j] 
clustbent<-hclust(dist(bent,method=distancia),method=aglomeracao)
plot(clustbent,main=paste0("D=",distancia,", A=",aglomeracao))
}}
```


# Exercício 3

3. Realize uma análise classificativa não hierárquica aos dados das duas alíneas anteriores, constituindo 2 a 3 grupos. Explore os resultados, compare as diferentes análises e indique qual ou quais lhe parecem mais adequadas.

## Data Habitats

```{r}
cnh2<-kmeans(habs[,-1],2)
cnh3<-kmeans(habs[,-1],3)
```
 
We can look at the resulting objects. First the group assignments

```{r}
cnh2$cluster
cnh3$cluster
```

and then the total sum of squares, which since the data is the same, is equal irrespectively of the number of groups selected

```{r}
cnh2$totss
cnh3$totss

```

Then, the within group sum of squares (we want this small!)

```{r}
cnh2$withinss
cnh3$withinss
```

and the between group sum of squares (we want this large!)

```{r}
cnh2$betweenss
cnh3$betweenss
```

and finally the number of locations per group

```{r}
cnh2$size
cnh3$size
```

Note that we could see say how well the 3 original habitats had been segregated by the different groups. A simple way to compare the results of the 3 group hierarchical cluster is

```{r}
plot(jitter(habitat),jitter(cnh3$cluster))
table(habitat,cnh3$cluster)
```

And we can see that the elements of habitat 1 were separated into 2 of the clusters (1 and 3) and most elements of habitat 2 and 3 were placed into group 2. Therefore, it seems like habitat 2 shares locations with comunities similar to 1 and 3, and that habitat 1 has some locations with a  very particular community that gets separated from the remaining of habitat 1 and does not group with any of the others (Group 3).

What might be the best number of groups to partition the data? That depends on the clustering structure created and whether we can interpret it. Nonetheless, there are automated criteria that we can use to choose an optimal number of groups.

Therefore, we could actually check what might happen if we considered say from 2 to 8 groups, using the function `cascadeKM`. The plot of the resulting objects is very useful, and it provides an indication of what might be the best number of groups given a number of possible criteria. The most common are the `ssi` (“Simple Structure Index”) and the `kalinski` (the Calinski-Harabasz (1974) criterion). From the `kmeans` function help, "In a simulation study, Milligan and Cooper (1985) found that the Calinski-Harabasz criterion recovered the correct number of groups the most often. We recommend this criterion because, if the groups are of equal sizes, the maximum value of "calinski" usually indicates the correct number of groups. Another available index is the simple structure index "ssi". Users should not take the indications of these indices literally when the groups are not equal in size and explore the groups corresponding to other values of K.".

```{r,cache=TRUE}
km2to8=cascadeKM(habs[,-1],inf.gr = 2,sup.gr = 8,criterion = "ssi")
plot(km2to8,sortg=TRUE)
```

Note that here we used the sum of squares criterion to evaluate the best number of groups, but there are others one might choose, e.g. the Kalinski criterion. The best partitio would be 4 groups for the first criterion but only 2 for the second. Yes, it's true, cluster analysis is somewaht arbitrary, and it is hard to know what is the best procedure. I would say that the best is the one that produces an output that you can use to make an ecological interpretation that is sensible and useful.


```{r,cache=TRUE}
km2to8=cascadeKM(habs[,-1],inf.gr = 2,sup.gr = 8,criterion = "calinski")
plot(km2to8,sortg=TRUE)
```


## Data Benthos


```{r}
set.seed(123)
bentcnh2<-kmeans(bent[,-1],2)
bentcnh3<-kmeans(bent[,-1],3)
```
 
We can look at the resulting objects. 

```{r}
bentcnh2
bentcnh3
```


We can look at the objects subcomponents too. First the group assignments

```{r}
bentcnh2$cluster
bentcnh3$cluster
```

Then, the within group sum of squares (we want this small!)

```{r}
bentcnh2$withinss
bentcnh3$withinss
```

and the between group sum of squares (we want this large!)

```{r}
bentcnh2$betweenss
bentcnh3$betweenss
```

and finally the number of locations per group

```{r}
bentcnh2$size
bentcnh3$size
```

Since here we were not given any indications of what clustering patterns to expect nor were we given detailsabout the data, we can't make any strong claims about which number of groups might be more sensible to try.

We could see what seem to be the major differences between the groups created. It seems to be `Cancer` (more abundant on group 2) vs. `Carcinus` and `Gammarus` (more abundant on group 1).

```{r}
medbygroup2<- aggregate(bent[,-1],by=list(bentcnh2$cluster), FUN=mean)
medbygroup2=medbygroup2[,-1]
plot(t(medbygroup2),xlab="Grupo 1",ylab="Grupo 2",type="n")
text(t(medbygroup2), colnames(medbygroup2), cex = 0.5)
abline(a=0,b=1)
plot(bent,col=bentcnh2$cluster)
```

# Exercício 4

4. Efectue uma análise classificativa hierárquica aos dados DataTP9bentos.csv, utilizando a medida de associação de Jaccard e o método de aglomeração da mínima variância.
 
There's nothing to think about here, we just do what we are told! 
 
```{r}
dJAC<-vegdist(bent[,-1],method="jaccard")
round(dJAC,2)
clustEUCwar<-hclust(dJAC,method="ward.D")
plot(clustEUCwar)
plot(as.dendrogram(clustEUCwar),horiz=TRUE)
```
 
# Exercício 5

5. Foram caracterizados os perfis lípidos de organismos pertencentes a 3 grupos distintos. Explore os dados DataTP9lipidos.csv recorrendo a várias metodologias de análise classificativa. Interprete e comente os resultados.

```{r}
lipidos <- read.csv2("DataTP9lipidos.csv")
```

and as usual, check all is OK

```{r,cache=TRUE}
str(lipidos)
```

We have `r nrow(lipidos)` observations, i.e number of locations, for `r ncol(lipidos)` variables. The first variable is different from the others, as it is a factor representing the samples. There are 3 types of samples, H, E, e B.

```{r}
type=substr(lipidos[,1],1,1)
type
```


```{r}
dais1<-daisy(lipidos[,-1], metric = "euclidean")


dais2<-daisy(lipidos[,-1], metric = "manhattan")

ca.lipidos<-hclust(dais2,method="average")

ca.lipidos<-hclust(dist(lipidos[,-1],method="manhattan"),method="average")
ca.lipidos
plot(ca.lipidos,labels=type)



klip3<-kmeans(lipidos[,-1],3)
klip3

klip2<-kmeans (lipidos[,-1],2)
klip2

dist.eu.lip<-daisy(lipidos[,-1], metric = "euclidean")
klip3d<-kmeans(dist.eu.lip,3)
klip3d


klip3$cluster
klip3$withinss
klip3$betweenss
klip3$size

plot(lipidos,col=klip3$cluster)
```

We can see that there is a B and an H that get into the "wrong" groups. Besides that, the groups separate neatly. That seems to happen for both the hierarchical and non-hierarchical analysis.

```{r}
table(type,klip3$cluster)
```


# Exercício 6

6. Pretende efectuar um estudo relativo ao crescimento de árvores em zonas de altitude. Pretende seleccionar dois tipos de montanhas (DataTP9montanhas.csv), para depois efectuar 5 replicados em cada um destes tipos. Que análise lhe parece adequada a esta situação? Efectue-a no R e retire as principais conclusões. Com base nos resultados, indique o que poderá enfraquecer o contexto experimental do estudo que se pretende efectuar?

```{r}
mont=read.csv("DataTP9montanhas.csv", sep=";")
str(mont)
```

We have `r nrow(mont)` observations, i.e number of locations, for `r ncol(mont)` variables. The first variable is different from the others, as it is a factor representing the different montains.

```{r}
mont[,1]
```

We now fit a cluster analysis that produces two groups

```{r}
mont.nhca<-kmeans(mont[,-1],2)
mont.nhca
```

There are several reasons for why selecting replicates from two groups of locations based on a cluster analysis might not be reasonable. Even if the groups might be distinct with respect to all 5 available variables, they might not be different in terms of other variables that might be important for the response we want to investigate. Further, with 15 mountains and only 5 replicates for each of the 2 groups to test, there will be at the very least 5 mountains which will not be sampled, irrespective of the groups found!

# Exercício 7

7. Realize uma análise classificativa hierárquica aos dados da alínea anterior, variando medidas de distância e métodos de aglomeração, e refira se os resultados corroboram os obtidos no exercício anterior. 

We consider the Manhatan distance and the `ward.D` linkage method, and the two groups that were formed above seem to show up again.

```{r}
ca.mont<-hclust(dist(mont[,-1],method="manhattan"),method="ward.D")
plot(ca.mont,labels=mont[,1])
```

The same does not happen say for Euclidean Distance and complete linkage, where Estrela shows up separate from the other members of its (assuming the above was more sensible) group. Nonetheless, Estrela is the largest mountain in Portugal, so it might be indeed a group on its own.

```{r}
ca.mont2<-hclust(dist(mont[,-1],method="euclidean"),method="complete")
plot(ca.mont2,labels=mont[,1])
```

# Exercício 8


8. Efectue ambas as análises das alíneas 6 e 7, mas com base numa matriz de dados transformados através de redução e centragem das variáveis. 

Transform the variables

```{r}
mont.scale<-scale(mont[,-1])
```

Then implement the required analysis. First the hierarchical analysis

```{r}
ca.mont.scale<-hclust(dist(mont.scale,method="manhattan"),method="ward.D")
plot(ca.mont.scale,labels=mont[,1])
```

interestingly, `Estrela` does not appear separate anymore.

We then implement then the non-hierarchical analysis

```{r}
mont.scale.nhca<-kmeans(mont.scale,2)
mont.scale.nhca
```

Note that while the analysis seems a bit better, the classification (i.e. which mountain get's assigned to which group) is exactly the same as before.