--- title: "Aula 12 - Ficha de Trabalho 10" author: "Tiago A. Marques" date: "December 3, 2018" output: html_document: default pdf_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Preliminary steps We load up library `vegan` and we source (i.e. read in) a file that has some useful functions for exploring PCA outputs (this file is in the folder for Aula T23). ```{r,message=FALSE,warning=FALSE} library(vegan) source("brocardfunctions.R") ``` # Exercício 1 Recolheram-se dados de abundância de várias espécies de anfíbios em vários distritos de Portugal (DataTP10anfibios.csv). Efectue uma análise de componentes principais aos dados e retire as principais conclusões. As usual, we begin by reading the data ```{r,cache=TRUE} anfibios<-read.csv2("DataTP10anfibios.csv") ``` and we check it all looks OK ```{r,cache=TRUE} str(anfibios) ``` We have `r nrow(anfibios)` observations for `r ncol(anfibios)` variables, which correspond respectively to locations and taxa (genera of amphibians). The first column does not represent a genera, but it contains the labels of the locations ```{r,cache=TRUE} anfibios[,1] ``` The names of the genera are ```{r,cache=TRUE} names(anfibios)[2:7] ``` We can also see and if some locations have much more amphibians than others ```{r,cache=TRUE} nbyloc=data.frame(loc=anfibios[,1],n=rowSums(anfibios[,2:7])) nbyloc ``` with Vila Real representing a really weird location, with no data (might that be a mistake?). We can also evaluate what are the most abundant taxa, ```{r,cache=TRUE} colSums(anfibios[,2:7]) ``` By now, on lecture TP13, we should be fast at doing all the above. Given that a PCA is a dimension reduction technique, we can see if the genera are highly correlated or not. The analysis will be more efficient if there are high correlations to explore (e.g. two variables with perfect correlation can be summarized in a single component!) ```{r,cache=TRUE} panel.hist <- function(x, ...) { usr <- par("usr"); on.exit(par(usr)) par(usr = c(usr[1:2], 0, 1.5) ) h <- hist(x, plot = FALSE) breaks <- h$breaks; nB <- length(breaks) y <- h$counts; y <- y/max(y) rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...) } panel.corP <- function(x, y, digits = 2, prefix = "", cex.cor, ...) { usr <- par("usr"); on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r <- abs(cor(x, y)) txt <- format(c(r, 0.123456789), digits = digits)[1] txt <- paste0(prefix, txt) if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) text(0.5, 0.5, txt, cex = cex.cor * r) } panel.corS <- function(x, y, digits = 2, prefix = "", cex.cor, ...) { usr <- par("usr"); on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r <- abs(cor(x, y,method="spearman")) txt <- format(c(r, 0.123456789), digits = digits)[1] txt <- paste0(prefix, txt) if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) text(0.5, 0.5, txt, cex = cex.cor * r) } pairs(anfibios[,2:7], lower.panel = panel.smooth, upper.panel = panel.corP, gap=0, row1attop=FALSE,diag.panel = panel.hist) ``` As it turns out, there is actually no correlation between any two variables larger than `r round(max(cor(anfibios[,2:7])[cor(anfibios[,2:7])!=1]),2)`, and this is hinting to the fact that the PCA might not be very efficient to reduce the dimensionality of this data set. For the PCA plots it might be useful to define `row.names` in the object `anfibios`, as these will be used by default in the plotting of biplots, hence making the plots easier to interpret. ```{r,cache=TRUE} row.names(anfibios)=anfibios[,1] ``` Now, we implement the PCA itself. There are several possible functions to do, and we begin by using `princomp`. Note that we must remove the first column, as otherwise the labels would be used as a (factor) covariate in the PCA, which would be unreasonable. ```{r,cache=TRUE} acp1<-princomp(anfibios[,-1]) acp1 ``` and we visualize the summary of the analysis ```{r,cache=TRUE} summary(acp1) ``` As we can see, we were able to extract 6 principal components, which is the number of genera available in the dataset. This is always the case, at most we extract as many PCs as original variables. We can also explore some of the components of the PCA output, namely the loadings (i.e. the correlations between the PC's and the original variables) ```{r,cache=TRUE} acp1$loadings ``` Note there is an argument to the function that prints the loadings that can be used to force all loadings to appear, since by default only those above 0.1 are displayed ```{r,cache=TRUE} print(acp1$loadings,digits = 2, cutoff = 0.05) ``` the scores (the coordinates of the locations in the principal components space) - note one could use these to manually plot the results of a PCA ```{r,cache=TRUE} acp1$scores ``` and the standard deviation associated with each PC ```{r,cache=TRUE} acp1$sdev ``` We can plot the objects produced by a PCA in two different ways. The default `plot` function presents a display of the amount of variation explained by each principal component. remember that ideally, the first few components will explain most of the variability! ```{r,cache=TRUE} par(mfrow=c(1,1),mar=c(4,4,0.5,0.5)) plot(acp1) ``` A second plot, representing sites and species in a biplot, is obtained via function `biplot` ```{r} biplot(acp1,cex=0.7) ``` Finally, we can use one of the bespoke functions in file `brocardfunctions.R` to plot the variances associated with each PC and evaluate two criteria to choose the number of PCs to interpret ```{r,cache=TRUE} evplot(acp1$sdev^2) ``` Both criteria seem to agree that the more sensible would be to interpret just the first two PCs. Note that we could have implemented the same analysis using function `rda` in library `vegan`, which produces slightly different output, but the same results ```{r,cache=TRUE} acpbyrda1<-rda(anfibios[,-1]) acpbyrda1 ``` The summary of this object gives us direct access to something that was hidden above, namely the coordinates of the covariates in the PC space ```{r,cache=TRUE} summary(acpbyrda1) ``` We can use a function in `brocardfunctions.R` to make a clean plot of this object ```{r,cache=TRUE} cleanplot.pca(acpbyrda1) ``` and this representation is slightly more useful as it shows us that the variables most important (those which the arrows go outside the "control" circle) are `Pleurodeles`, `Salamandra` and `Bufo` (question: what do these 3 taxa share? They were the most abundant! This is not good.). This last genera name is not printed in the plot, for reasons which are unclear (a possible bug in the function). Nonetheless, the `rda` function creates an object that can help us, because it provides the coordinates of the variables in the PC space. We could therefore plot these in PC space if we so wanted and check that the variable that was not printed above was `Bufo` ```{r,cache=TRUE} plot(summary(acpbyrda1)$species[,1:2],type="n",xlim=c(-5,7)) text(summary(acpbyrda1)$species[,1:2],names(anfibios[,-1])) ``` Note that the results of the `princomp` and of `rda` might look different, for example, the sites scores. The sites scores in the first component are ```{r,cache=TRUE} #princomp acp1$scores[,1] #rda summary(acpbyrda1)$sites[,1] ``` but what we should keep in mind is that signs in scores are not possible to interpret, and in fact magnitudes are not easily interpreted either. It is only relative to each other that they can be interpreted. And relative to each other, the correlations between the sites scores across methods is perfect - as we can see below all 1 or -1 - so the results just look different at first, but are exactly the same! ```{r,cache=TRUE} cor(acp1$scores[,1],summary(acpbyrda1)$sites[,1]) cor(acp1$scores[,2],summary(acpbyrda1)$sites[,2]) cor(acp1$scores[,3],summary(acpbyrda1)$sites[,3]) cor(acp1$scores[,4],summary(acpbyrda1)$sites[,4]) cor(acp1$scores[,5],summary(acpbyrda1)$sites[,5]) cor(acp1$scores[,6],summary(acpbyrda1)$sites[,6]) ``` Finally, we should note that it might be more sensible, given the data were abundances, to scale the data prior to the analysis. If we do that, say with the `rda` function, this is what we get ```{r,cache=TRUE} acpbyrda2<-rda(scale(anfibios[,-1])) acpbyrda2 ``` As noted above, the summary of this object gives us direct access to the coordinates of the covariates in the PC space ```{r,cache=TRUE} summary(acpbyrda2) ``` Make a clean plot of the output ```{r,cache=TRUE} cleanplot.pca(acpbyrda2) ``` The results are now considerably different, and this means that prior results were (inadvertently) dominated by the most abundant species. Typically, that is not desired, and therefore this might be a better analysis to interpret. In this case we might need more than 2 PCs to interpret. Can you explain why this is a better analysis and yet it does not manage to represent as well the original information in just 2 PC's? ```{r,cache=TRUE} evplot(acpbyrda2$CA$eig) ``` # Exercício 2 Efectue uma análise de componentes principais aos dados das características de diferentes serras portuguesas (DataTP10serras.csv), com base na matriz de variância/covariância e na matriz de correlação. Comente os principais resultados obtidos em cada uma das análises e refira em que circunstâncias cada uma destas abordagens seria mais adequada. We read in the data ```{r,cache=TRUE} serras<-read.csv("DataTP10serras.csv", sep=";") ``` as before, check it is all OK ```{r,cache=TRUE} str(serras) ``` We have `r nrow(serras)` observations for `r ncol(serras)` variables, which correspond respectively to serras and their characteristics. The first column does not represent a a characteristic, but it contains the labels of the serras. As above, we begin with exploring the data ```{r,cache=TRUE} serras[,1] ``` The names of the serras are ```{r,cache=TRUE} names(serras)[2:6] ``` What are the correlations between descriptors? ```{r,cache=TRUE} pairs(serras[,2:6], lower.panel = panel.smooth, upper.panel = panel.corP,gap=0, row1attop=FALSE,diag.panel = panel.hist) ``` Some of the correlations are quite high, as high as `r round(max(cor(serras[,2:6])[cor(serras[,2:6])!=1]),2)`, so the PCA technique might be efficient. As before, we add the right labels to the `row.names` ```{r,cache=TRUE} row.names(serras)=serras[,1] ``` Now, we implement the PCA itself, first on the covariance matrix ```{r,cache=TRUE} acp.serras<-princomp(serras[,-1]) #this is the same as #acp.serras<-princomp(serras[,-1],cor=FALSE) acp.serras ``` We can look at the summary of the resulting object ```{r,cache=TRUE} summary(acp.serras) ``` We can also look at some of the components of the object created ```{r,cache=TRUE} acp.serras$loadings ``` Apparently, the main driver of the variability would be `Altitude`. But in fact, the truth is that the altitude is measured with a higher order of magnitude (1000's) than the other variables (10's or 100's). This suggests that the analysis based on the correlation matrix is much more sensible than one based on the covariances. We can look at the scores of the analysis ```{r,cache=TRUE} acp.serras$scores ``` and the plot it ```{r,cache=TRUE} par(mfrow=c(1,2),mar=c(4,4,0.5,0.5)) plot(acp.serras) biplot(acp.serras,cex=0.7) ``` No, we implement the PCA on the correlation matrix, which will decrease the influence of altitude in the analysis ```{r,cache=TRUE} acp.serras2<-princomp(serras[,-1],cor=TRUE) ``` which we note in passing is equivalent with an analysis on the covariance matrix of the scaled data (scaled data have, by definition, variances of 1, and hence the covariances and correlations are the same) ```{r,eval=FALSE} #which is the same as acp.serras2<-princomp(scale(serras[,-1])) ``` Then we can look at the object resulting ```{r,cache=TRUE} acp.serras2 summary(acp.serras2) acp.serras2$loadings ``` the summary of the object ```{r,cache=TRUE} summary(acp.serras2) ``` and some of its components, as before, the loadings ```{r,cache=TRUE} acp.serras2$loadings ``` and the scores ```{r,cache=TRUE} acp.serras2$scores ``` Now, all variables have a reasonable influence on the first component. We can plot the resulting object ```{r,cache=TRUE} par(mfrow=c(1,2),mar=c(4,4,0.5,0.5)) plot(acp.serras2) biplot(acp.serras2,cex=0.7) ``` In this case, given all the variables are measured in different units, it would only make sense to interpret the analysis over the correlation matrix, since the analysis based on the covariances suffers a strong impact from the measuring units of each variable (in other words, the analysis would return different results if say altitude was measured in km and in meters!). Leveraging on the nice clean biplot function (`cleanplot.pca`), for the scaled analysis we get the following biplots ```{r,cache=TRUE} rda.serras2s<-rda(scale(serras[,-1])) cleanplot.pca(rda.serras2s) ``` showing that `matos` seems to be the main driver of the variability in Serras. Notice that the conclusions would be extremely different if interpreting the analysis based on the covariance and the correlation matrix. Just to finish, note how the different principal components are indeed independent (i.e. uncorrelated, and while not shown here for both, this obviously happens for both covariance and correlation based analysis) ```{r,cache=TRUE} round(cor(acp.serras$scores),4) ``` This happens by construction, i.e. a PCA is the mathematical procedure that obtains the linear combinations of the original covariates which simultaneous respect two conditions: * the variance is largest for the 1st component, then the 2nd,then the 3rd, and so on * and all the components are orthogonal (i.e. all independent of each other). # Exercício 3 Recolheu dados relativos a densidades de várias espécies de anfíbios e répteis em três tipos florestais (pinhal, eucaliptal e sobro) (DataTP10herpetofauna.csv). Realize uma análise de correspondências e comente os resultados obtidos. Efectue uma análise de componentes principais aos mesmos dados compare os resultados com os obtidos na análise de correspondências. We read the data in ```{r,cache=TRUE} herpeto <- read.csv2("DataTP10herpetofauna.csv") ``` and we look at the data ```{r,cache=TRUE} str(herpeto) ``` We can check upfront how many observations (replicates) exist in each habitat by using function `substr` (subset a string, i.e. a word), then `table` to count how many different habitats, then `barplot` to make a suitable plot of the result ```{r} barplot(table(substr(herpeto$Habitat,1,4))) ``` ## PCA Given the way the material was presented in the lectures, we first implement the PCA. The labels of the locations are ```{r,cache=TRUE} herpeto[,1] ``` and this seems to imply there are 3 types of habitats, Eucaliptal, Pinhal e Sobro. We can create a separate variable to hold the type of habitat - this could be useful for plotting say points with different colors per habitat ```{r,cache=TRUE} herpeto.hab=substr(herpeto[,1],1,4) herpeto.hab ``` The names of the genera are ```{r,cache=TRUE} names(herpeto)[-1] ``` We can also see and if some locations have much more animals than others ```{r,cache=TRUE} nbyloc=data.frame(loc=herpeto[,1],n=rowSums(herpeto[,-1])) nbyloc ``` and in fact these numbers are extremely suspicious to me, as they are way too close to 1 to being true... Not sure what happened here, but I suspect some scaling must have happened at some level. We can also evaluate what are the most abundant taxa, and show them in increasing order ```{r,cache=TRUE} sort(colSums(herpeto[,-1])) ``` Given that a PCA is a dimension reduction technique, we can see if the genera are highly correlated or not. The analysis will be more efficient if there are high correlations to explore (e.g. two variables with perfect correlation can be summarized in a single component!) ```{r,cache=TRUE} pairs(herpeto[,-1], lower.panel = panel.smooth, upper.panel = panel.corP, gap=0, row1attop=FALSE,diag.panel = panel.hist) ``` As it turns out, there are a couple of large correlations, but many low correlations too, so it's unclear how the PCA might perform. For the PCA plots it might be useful to define `row.names` in the object `herpeto`, as these will be used by default in the plotting of biplots, hence making the plots easier to interpret. ```{r,cache=TRUE} row.names(herpeto)=herpeto[,1] ``` Now, we implement the PCA itself ```{r,cache=TRUE} acph1<-princomp(herpeto[,-1]) acph1 ``` and we visualize the summary ```{r,cache=TRUE} sumacp1=summary(acph1) sumacp1 ``` We can also explore some of the components of the PCA output, namely the loadings (i.e. the correlations between the PC's and the original variables) ```{r,cache=TRUE} print(acph1$loadings,digits = 2, cutoff = 0.2) ``` the scores (the coordinates of the locations in the principal components space) - note one could use these to manually plot the results of a PCA ```{r,cache=TRUE} acph1$scores ``` We can plot the objects produced by the PCA ```{r,cache=TRUE} par(mfrow=c(1,1),mar=c(4,4,0.5,0.5)) plot(acph1) biplot(acph1,cex=0.7) ``` as for previous exercises, and not surprisingly given the unscaled analysis (i.e. using the covariance matrix), `Lacerta` and `Coluber`, the most abundant species, are the ones driving most of the variability on the first PC. Finally, we can use one of the bespoke functions in file `brocardfunctions.R` to plot the variances associated with each PC and two criteria to choose the number of PCs to interpret ```{r,cache=TRUE} evplot(acph1$sdev^2) ``` Both criteria seem to agree that the more sensible would be to interpret just the first two PCs. Note that we could have implemented the same analysis using function `rda` in library `vegan`, which produces slightly different output, but the same results ```{r,cache=TRUE} acphbyrda1<-rda(herpeto[,-1]) acphbyrda1 ``` The summary of this object gives us direct access to something that was hidden above, namely the coordinates of the covariates in the PC space ```{r,cache=TRUE} summary(acphbyrda1) ``` We can make a clean plot of this object ```{r,cache=TRUE} cleanplot.pca(acphbyrda1) ``` The corresponding analysis based on the correlation matrix is perhaps more sensible ```{r,cache=TRUE} sacphbyrda1<-rda(scale(herpeto[,-1])) sacphbyrda1 ``` The summary of this object gives us direct access to something that was hidden above, namely the coordinates of the covariates in the PC space ```{r,cache=TRUE} summary(sacphbyrda1) ``` We can make a clean plot of this object ```{r,cache=TRUE} cleanplot.pca(sacphbyrda1) ``` and we can see clearly that the separation on the first PC is between sites from Eucaliptal versus Pinhal versus Sobro, without any clear separation regarding habitat on the second component. Making a biplot by hand has the advantage of letting us use bespoke colors as well as fine tune the limits of the plot. Check where are the species and locations scores ```{r,cache=TRUE} names(summary(sacphbyrda1)) ``` and then use those to make the biplot ```{r,cache=TRUE} #a couple of useful objects to make the code more readable myacp=summary(sacphbyrda1) plot(myacp$sites[,1:2],type="n",ylim=c(-2,2),xlim=c(-2,2)) text(myacp$sites[,1:2],labels=substr(herpeto[,1],1,4),col=as.numeric(as.factor(substr(herpeto[,1],1,4)))) #species arrows(x0=0,y0=0,x1=myacp$species[,1],y1=myacp$species[,2],col=2) text(myacp$species[,1:2],labels=names(herpeto[-1]),cex=0.5) ``` ## CA We now implement the CA ```{r,cache=TRUE} ca.herp<-cca(herpeto[,-1]) summary(ca.herp) plot(ca.herp) ``` we could have used package `ca`, but we note these analysis are exactly the same, and as before for different PCA implementations, only the structure of the output objects is slightly different ```{r,cache=TRUE} library(ca) caca.herp<-ca(herpeto[-1]) summary(caca.herp) plot(caca.herp) ``` We note that the separation per habitat is perhaps not as clean as for the PCA. Again, we can make our own biplot ```{r,cache=TRUE} #a couple of useful objects to make the code more readable names(summary(ca.herp)) mycas=summary(ca.herp) #the biplot plot(mycas$sites[,1:2],type="n",ylim=c(-2,2),xlim=c(-2,2)) text(mycas$sites[,1:2],labels=substr(herpeto[,1],1,4),col=as.numeric(as.factor(substr(herpeto[,1],1,4)))) #species arrows(x0=0,y0=0,x1=mycas$species[,1],y1=mycas$species[,2],col=2) text(mycas$species[,1:2],labels=names(herpeto[-1]),cex=0.5) ``` ## Comparison CA vs. PCA The results from both analysis are not that different in this case. While this is not a standard procedure (i.e. I have never seen anyone else doing it explicitly), we could compare the relations between the scores in the PCA and the CA, for the first and second axis. ```{r,cache=TRUE} par(mfrow=c(1,2),mar=c(4,4,0.5,0.5)) plot(myacp$sites[,1],mycas$sites[,1],xlab="PCA score 1st CP",ylab="CA score 1st CP") abline(0,1) text(-1,1,paste0("R= ",round(cor(myacp$sites[,1],mycas$sites[,1]),2))) plot(myacp$sites[,2],mycas$sites[,2],xlab="PCA score 2nd CP",ylab="CA score 2nd CP") text(-1,1,paste0("R= ",round(cor(myacp$sites[,2],mycas$sites[,2]),2))) abline(0,1) ``` With a correlation of over 99% in the first axis and over 80% in the second, there is nothing structurally different in both analysis. In general, a CA might be more sensible to use directly over count data. Actually, in this case, similar plots to the pair above comparing the PCA with and without scaling (i.e., over the correlation and over the covariance matrices) and between the covariance-based PCA and the CA would also be informative. ```{r,cache=TRUE} par(mfrow=c(2,2),mar=c(4,4,0.5,0.5)) plot(myacp$sites[,1],sumacp1$scores[,1],xlab="PCA score 1st CP",ylab="PCA unscaled score 1st CP") abline(0,1) text(-1,1,paste0("R= ",round(cor(myacp$sites[,1],sumacp1$scores[,1]),2))) plot(myacp$sites[,2],sumacp1$scores[,2],xlab="PCA score 2nd CP",ylab="PCA unscaled score 2nd CP") text(-1,1,paste0("R= ",round(cor(myacp$sites[,2],sumacp1$scores[,2]),2))) abline(0,1) # PCA unscaled vs CA plot(mycas$sites[,1],sumacp1$scores[,1],ylab="PCA unscaled score 1st CP",xlab="CA score 1st CP") abline(0,1) text(-1,1,paste0("R= ",round(cor(sumacp1$scores[,1],mycas$sites[,1]),2))) plot(mycas$sites[,2],sumacp1$scores[,2],ylab="PCA unscaled score 2nd CP",xlab="CA score 2nd CP") text(-1,1,paste0("R= ",round(cor(sumacp1$scores[,2],mycas$sites[,2]),2))) abline(0,1) ``` These plots show us what we already knew. The results of the unscaled PCA would not be sensible, being very different from those of the CA and the scaled PCA. This was as expected, as both the CA and the unscaled PCA are not dominated by the raw abundances, unlike the unscaled PCA (covariance matrix based). # Exercício 4 Efectue uma análise canónica de correspondências aos dados da alínea anterior, incluindo também as características dos locais (DataTP10herpamb.csv). Discuta os resultados. Em que situações esta análise é preferível em relação às anteriores técnicas? First we read the environmental variables ```{r,cache=TRUE} herpamb <- read.csv("DataTP10herpamb.csv", sep=";") ``` and take a quick peak at it to make sure all is good ```{r,cache=TRUE} str(herpamb) ``` Let's look at the different variables as a function of habitat ```{r} hab=substr(herpamb$Habitat,1,4) par(mfrow=c(3,2)) boxplot(herpamb$Perc.floresta~hab) boxplot(herpamb$Agua~hab) boxplot(herpamb$Altura~hab) boxplot(herpamb$Idade~hab) boxplot(herpamb$Activ.agricola~hab) ``` Then, we implement the CCA using function `cca` from package `vegan` ```{r,cache=TRUE} cca.herp<-cca(herpeto[,-1]~.,data=herpamb[,-1]) ``` We can look at the summary of the results ```{r,cache=TRUE} summary(cca.herp) ``` and finally we plot the triplot that contains the information on the sites and the species information, as the previous analysis did, but now it also contains directly the information about the environmental variables. The ecological interpretation would now have to be done considering the 3 sets of variables, as the environmental variables were used to constrain the axis. ```{r,cache=TRUE} plot(cca.herp) ``` # Exercício 5 Obteve dados relativos à abundância de diversas espécies de aves em três tipos de habitat. Efectue várias análises de ordenação (análise de componentes principais, análise de correspondências e análise canónica de correspondências). A análise canónica de correspondências permite integrar informação de uma segunda matriz de dados. Indique em que situações esta análise poderá ser mais informativa e como poderia tentar obter informação análoga nas outras duas análises efectuadas. We read the data in, both the species matrix and the environmental variables matrix ```{r,cache=TRUE} aves <- read.csv("DataTP10aves.csv", sep=";") avesamb <- read.csv("DataTP10avesamb.csv", sep=";") row.names(aves)=row.names(avesamb)=aves[,1] ``` We look at both of these, first the species data ```{r,cache=TRUE} str(aves) ``` We have `r nrow(aves)` observations for `r ncol(aves)` variables, which correspond respectively to locations and taxa (genera of amphibians). The first column does not represent a genera, but it contains the labels of the locations, which are separated by habitats, of which there are the following 3 ```{r,cache=TRUE} unique(substr(aves[,1],1,4)) ``` which we assume correspond to "Bosque", "Pradaria", and "Ripícola". We can see if there are locations with much more abundance than others ```{r,cache=TRUE} sort(rowSums(aves[,-1])) ``` (as before, some standardization must have been made before analysis) or if there are species which dominate the communities. ```{r,cache=TRUE} sort(colSums(aves[,-1])) ``` In fact, *Anas* and *Carduelis* seem to dominate the community in terms of abundance. Now we look at environmental variables ```{r,cache=TRUE} str(avesamb) ``` We have `r nrow(avesamb)` observations for `r ncol(avesamb)` variables, which correspond respectively to locations and environmental variables (genera of amphibians). The first column does not represent a variable, but it contains the labels of the locations. A number of different analysis could be attempted on this data, as shown below. We use the PCA unscaled, but this is wrong given the raw abundances (for the environmental variables analysis might be equivalent scaled or unscaled, as they seem to be measured in similar scales). Your task is to repeat the analysis with the correlation matrix and compare the results. ## Environmental variables We could see how the sites would group as a function of the environmental variables, both considering a PCA and a CA ### PCA ```{r,cache=TRUE} pca.avesamb<-princomp(avesamb[,-1]) summary(pca.avesamb) biplot(pca.avesamb,cex=0.7,xlabs=substr(avesamb[,1],1,4)) ``` We can see a very clear separation of the 3 habitat types based on the sites environmental variables. ### CA ```{r,cache=TRUE} ca.avesamb<-ca(avesamb[,-1]) ca.avesamb plot(ca.avesamb,labels=2) ``` Again, the separation across habitats is very clear. ## Species data ### PCA ```{r,cache=TRUE} pca.aves<-princomp(aves[,-1]) summary(pca.aves) biplot(pca.aves,cex=0.7,xlabs=substr(avesamb[,1],1,4)) ``` ### CA ```{r,cache=TRUE} ca.aves<-ca(aves[,-1]) ca.aves par(mfrow=c(1,2),mar=c(4,4,0.5,0.5)) plot(ca.aves,labels=2) ``` Note that in a CA we can also overlay on the ordination biplot, if available, a set of environmental covariates. We do this using function `envfit` (to do so we need to fit the CA with function `cca` in package `vegan`) ```{r} ca.aves2<-cca(aves[,-1]) ca.aves2env<-envfit(ca.aves2,avesamb[,-1]) par(mfrow=c(1,1),mar=c(4,4,0.5,0.5)) plot(ca.aves2) plot(ca.aves2env) ``` ### CCA ```{r,cache=TRUE} cca.aves<-cca(aves[,-1]~., data=avesamb[,-1]) summary(cca.aves) plot(cca.aves) ``` We conclude that with the direct gradient analysis (i.e. the CCA) the same separation across habitats is present, and we can relate the environmental variables to the abundances. Naturally, the percentage of vegetation is larger on Bosques than on the Pradaria or Ripicola areas, and there is more water near rivers (Ripicola). It is however not easy to see in the CCA output what are the species contributing more to this segregation. That is simpler to do in the CA/PCA plots, where it seems clear that *Turdus* and *Parus* are characteristic of Bosques, as is *Sylvia* (segregating sites on the second component), and *Athene* of Ripicola. Prados are harder to characterize in terms of species in the CA, but the PCA shows that *Anas* seems frequent. # Final thoughts All of these multivariate techniques reduce the information in a large matrix (or sets of matrices) to explore the relationships between the sites and the species, or in the case of the CCA, to see how the species distribute themselves over the sites using linear combinations of the environmental variables to explain that variation. This process of dimension reduction inevitably leads to distortions. Try to find examples of points that were well represented in the resulting reduced space. A good starting point might be e.g. site Eucaliptal 7 in exercise 3. Look at its position in the biplot and try to imagine what might be the values of the different genera in it. What do you expect to be the most abundant genera say? Then go to the original data frame and check if your predictions were correct. Do the same with other sites to check the distortion that some sites suffered in the reduced space.