--- title: "Aula 13 - Ficha de Trabalho 11" author: "Tiago A. Marques" date: "December 29, 2019" output: html_document: default pdf_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Preliminary steps We load up some useful libraries first ```{r,message=FALSE,warning=FALSE} library(vegan) #for whatever might be required! library(MASS) #Mixture and Flexible Discriminant Analysis library(mda) #for the confusion matrix library (class) ``` # Exercício 1 Recolheram-se dados de abundância de várias espécies de anfíbios em vários distritos de Portugal (DataTP11anfibios.csv). Efectue um escalamento multidimensional com base em distâncias euclideanas e retire as principais conclusões. We read in the data (note this data was also used in FT10!) ```{r,cache=TRUE} anfi <- read.csv2("DataTP11anfibios.csv") ``` and check it looks OK ```{r,cache=TRUE} str(anfi) ``` We have `r nrow(anfi)` observations for `r ncol(anfi)` variables, which correspond respectively to locations and taxa (genera of amphibians). The first column does not represent a genera, but it contains the labels of the locations ```{r,cache=TRUE} anfi[,1] ``` The names of the genera are ```{r,cache=TRUE} names(anfi)[2:7] ``` We can also see and if some locations have much more amphibians than others ```{r,cache=TRUE} nbyloc=data.frame(loc=anfi[,1],n=rowSums(anfi[,2:7])) nbyloc ``` with Vila Real representing a really weird location, with no data (might that be a mistake?). We can also evaluate what are the most abundant taxa, ```{r,cache=TRUE} colSums(anfi[,2:7]) ``` We can calculate a distance matrix between locations (Euclidean distances) ```{r,cache=TRUE} row.names(anfi)=anfi[,1] deucanfi<-dist(anfi[,-1]) dmananfi2<-dist(anfi[,-1],method="manhattan") print(deucanfi,digits=2) ``` and finally implement the multidimensional scaling (Note: this analysis, classical metric multidimensional scaling (MDS), is sometimes also referred to as principal coordinates analysis") ```{r,cache=TRUE} mdsanfi<- cmdscale(deucanfi, eig = TRUE) #mdsanfi2<- cmdscale(dmananfi2, eig = TRUE) #test how close these two analyisismatch up, i.e. #considering the euclidean and manhattan distances, say mdsanfi ``` and look at the results ```{r,cache=TRUE} somavalprop<-sum(mdsanfi$eig) mdsanfi$eig[2] percent2d<-(sum(mdsanfi$eig[1],mdsanfi$eig[2]))/somavalprop*100 percent2d x <- mdsanfi$points[,1] y <- mdsanfi$points[,2] plot(x,y, xlab="Coordinate 1",ylab="Coordinate 2",type="n") text(x,y, labels= anfi[,1],cex=0.6) #adding a polygon to a plot is easy #to find the required coordinates consider function #locator() #use locator(k) to find coordinates of k points #us mouse and click over desired points polygon(x=c(20,34,34,20),y=c(-10,-10,0,0),lty=2) polygon(x=c(-12.4916579,-0.2896308,6.0951509,3.1155861,-10.9309335,-15.3293387),y=c(26.992671,26.992671,17.763229,2.523919,5.314215,16.690038),lty=2) polygon(x=c(-32.92296,-23.41673,-20.15340,-27.53137,-37.60513,-38.74020),y=c(-0.4810154,-9.9250950,-27.0961487,-27.7400632,-24.0912143,-8.2079896),lty=2) ``` As in the previous exercises with this data set, there is a good latitudinal separation between the districts on the basis of the amphibian data. We can see 3 or 4 groups of locations. We could see if the analysis seems to be efficient at describing the information contained in the original distances ```{r,cache=TRUE} par(mfrow=c(1,2),mar=c(4,4,0.5,0.5)) plot(1:15,mdsanfi$eig/sum(mdsanfi$eig),xlab="Dimension",ylab="Variance explained") plot(1:15,cumsum(mdsanfi$eig/sum(mdsanfi$eig)),xlab="Dimension",ylab="Comulative variance explained") ``` # Exercício 2 Efectue um escalamento multidimensional não-métrico aos dados da alínea anterior e comente os resultados obtidos. A Non-Metric Multidimensional Scaling "...chooses a k-dimensional (default k = 2) configuration to minimize the stress, the square root of the ratio of the sum of squared differences between the input distances and those of the configuration to the sum of configuration distances squared." (stolen from `?isoMDS`)". ```{r,eval=TRUE,cache=TRUE} library(MASS) nmmdsanfi<- isoMDS(deucanfi) nmmdsanfi x2<-nmmdsanfi$points[,1] y2<-nmmdsanfi$points[,2] plot(x2,y2,xlab="Coordinate 1",ylab="Coordinate 2",type="n") text(x2,y2, labels= anfi[,1],cex=0.7) ``` There does not seem to be any major differences compared to what was obtained with the original analysis. We could compare both analysis outputs directly by overlaying them on a single plot ```{r,cache=TRUE} plot(x2,y2,xlab="Coordinate 1",ylab="Coordinate 2",type="n") text(x,y, labels= anfi[,1],cex=0.7,col=2) text(x2,y2, labels= anfi[,1],cex=0.7,col=3) ``` Interestingly, every single district has about the same position on the biplot, except for `Vila Real`, which was the district which we suspect had a record that was deleted (as it presented no amphibia at all, when all others had a reasonably large number of amphibia). The non-metric analysis is less sensible to such data, and hence places the location closer to the center of the plot. We could evaluate how reasonable was the representation in the 2 axis ```{r} nmmdsanfiB<- metaMDS(deucanfi) par(mfrow=c(1,2),mar=c(4,4,2.5,0.5)) stressplot(nmmdsanfiB,main="Sheppard plot") gof=goodness(nmmdsanfiB) plot(nmmdsanfiB,type="t",main="goodness of fit") points(nmmdsanfiB,display="sites",cex=gof*200) ``` # Exercício 3 Recolheu dados relativos a densidades de várias espécies de peixes em duas áreas distintas (1 e 2) (DataTP11peixesgrupos.csv). Realize uma análise discriminante e comente os resultados obtidos. Read the data in ```{r,cache=TRUE} pg <- read.csv("DataTP11peixesgrupos.csv", sep=";") ``` and we take a peak at the data ```{r,cache=TRUE} str(pg) ``` We have `r nrow(pg)` observations for `r ncol(pg)` variables, which correspond respectively to locations and species of fish. The first column does not represent a genera, but it contains the habitat each location corresponds to. ```{r,cache=TRUE} pg[,1] ``` The names of the genera are ```{r,cache=TRUE} names(pg)[-1] ``` We can also see and if some locations have much more fish than others ```{r,cache=TRUE} nbyloc=data.frame(loc=pg[,1],n=rowSums(pg[,-1])) nbyloc ``` We can also evaluate what are the most abundant taxa, ```{r,cache=TRUE} sort(colSums(pg[,-1])) ``` Of course, what will be interesting is to separate the analysis per habitat. As an example, we might see what are the largest differences per habitat in terms of fish abundances ```{r,cache=TRUE} sort(round(colMeans(pg[pg$Group==1,-1])-colMeans(pg[pg$Group==2,-1]),1)) ``` to conclude that `Ma_vi` is much more abundant in habitat 2 than 1, while the contrary happens for say `Me_ae`. These will likely be determinant species to separate the two groups! Finally, we implement the descriminant analysis ```{r,eval=TRUE,cache=TRUE} dapg <- fda(Group~.,data=pg) dapg ``` we can see the estimated coefficients ```{r,cache=TRUE} coef(dapg) ``` and the means per group on the discriminant space - the larger the differences between these means the easier it should be to discriminate the groups (do not confuse with means of variables per group, note we had already obtained these manually above!) ```{r,cache=TRUE} dapg$means ``` With a single discriminant function the plot of the result cannot be obtained (and it would not be that informative, but contrast with plot from Exercício 4 below) ```{r,eval=FALSE} #this code chunck has eval=FALSE #this returns an error if you try to run it plot(dapg) ``` Finally, we could predict what would be the group assignment for the data we have based on the estimated model ```{r,cache=TRUE} prev<-predict(dapg,pg) prev ``` For some reason the code for the plot of the object resulting from thelda call, as below, returns an error message regarding a "subscript out of bounds". There is nothing incorrect about the code being used (analogous code isused below without any problem). I suspect this is a problem related with the acoordinateof one of the locations being infinity in a given discriminant function, but there is nothing obvious that might becausing it. ```{r,eval=FALSE,echo=FALSE} plot(dapg) ``` Any way, the best way to evaluate the performance of the analysis is to build a confusion matrix, that presents correct decisions and incorrect decisions. We want most of the decisions to be on the main diagonal (correct decisions!) ```{r,cache=TRUE} confusion(prev,pg$Group) ``` Note that the object resulting from the discriminant analysis itself already contains this information ```{r} dapg$confusion ``` Based on this data set, we would be able to perfectly separate each and every site according to two groups. Nonetheless, what should be noted is that we would like to be able to separate new samples from locations with unknown habitat to the right habitat, just based on their fish community. As a note, we could have made also a logistic regression, a PCA, a cluster analysis over these data. You can do that to practice for the exam, comparing the results across analysis. # Exercício 4 Efectue uma análise discriminante aos dados DataTP11solea.csv, referentes a variáveis morfométricas de 3 espécies de linguado: *Solea solea*, *Solea senegalensis* e *Solea lascaris*. Discuta os resultados obtidos. We read in the data ```{r,cache=TRUE} lingua <- read.csv("DataTP11solea.csv", header=TRUE, sep=";") ``` and as usual conduct a brief EDA ```{r,cache=TRUE} str(lingua) ``` We have `r nrow(lingua)` observations for `r ncol(lingua)` variables, which correspond respectively to individuals of 3 species of linguado and their characteristics. The first column does not represent a characteristic, but it contains the species each individual belongs to. There are 3 species with a similar number of observations per species ```{r,cache=TRUE} table(lingua[,1]) ``` The variables considered are ```{r,cache=TRUE} names(lingua)[-1] ``` It will be interesting is to separate the data per species, to see if any variables allow clear segregation across species. A set of boxplots is helpful for that ```{r,cache=TRUE} par(mfrow=c(2,2)) with(lingua,boxplot(m1~especie)) with(lingua,boxplot(m2~especie)) with(lingua,boxplot(m3~especie)) with(lingua,boxplot(m4~especie)) ``` All variables show some apparent differences, so we should be able to separate the species efficiently based on these. We now implement a discriminat analysis using again function `fda` in package `mda`. ```{r,cache=TRUE} linguada<-fda(especie~.,data=lingua) linguada ``` The summary of the object is not very helpful ```{r} summary(linguada) ``` The default plot is actually very informative, and shows us that the analysis is very efficient at segregating the species ```{r,cache=TRUE} plot(linguada) ``` We can see the parameter values ```{r,cache=TRUE} coef.fda(linguada) coef(linguada) ``` the percentage of variance explained by each linear discriminant function ```{r,cache=TRUE} linguada$percent.explained ``` and the confusion matrix ```{r,cache=TRUE} linguada$confusion ``` which shows us that of `r nrow(lingua)` only `r nrow(lingua)-sum(diag(linguada$confusion))` are not correctly classified. # Exercício 5 Aplique técnicas de classificação (classificação hierárquica, com base na distância euclideana e no método da mínima variância) e de ordenação (análise de componentes principais) aos dados da alínea anterior (DataTP11solea2.csv). Compare os resultados e comente as principais diferenças obtidas. OK, so here we can do anything we'd like really, just use the previous data for other type of analysis and compare results. Here are some possibilities ## Principal Component Analysis ```{r,cache=TRUE} pcalingua<-princomp(lingua[,-1]) summary(pcalingua) biplot(pcalingua) ``` ## Hierarchical Cluster ```{r,cache=TRUE} lingueucD=dist(lingua[,-1]) hclingua<-hclust(lingueucD,method="ward") plot(hclingua) ``` ## Non-Hierarchical Cluster Despite not requested, given we have 3 species, a non-hierachical cluster with 3 clusters might also be interesting ```{r,cache=TRUE} nhclingua<-kmeans(lingueucD,3) ``` We can check how well it does using again a confusion matrix ```{r,cache=TRUE} species=as.character(unique(lingua$especie)) #awkward tweaking required to lign up predictions with species #because kmeans only produces a clustering, it does not #associate labels to clusters preds=ifelse(nhclingua$cluster==1,species[3],ifelse(nhclingua$cluster==2,species[2],species[1])) confusion(preds,lingua$especie) ``` and while we do well at correctly predicting *solea* and *lascaris*, many of the *senegalensis* are mistaken as *solea*. Note that while this could seem to imply that the Discriminant Analysis (DA) is better than the Non-Hierarchical Cluster (NHC) analysis, we should note these are intrinsically different approaches. While the latter is a unsupervised learning technique (the labels are not provided for the observations), the former is a supervised learning technique, and therefore not surprisingly is better at discriminating the groups we know exist, given the differences present. Therefore, in general, if you know the groups a priori, distinguishing these based on a multivariate analysis is more sensible using the DA than the NHC analysis.