Methods and Plots

We implemented pagoo to provide basic but fundamental statistical analyses and visualizations out-of-the-box. Some of them are just wrappers of other packages functions (see vegan and micropan R packages; cite them also if you use pagoo), and graphics are made by ggplot2 and plotly packages. All these are very useful for initial data exploration, but also provide a lot of flexibility to interact with other R packages in order to make more complex plots and more robust analyses. Also an interactive Shiny App can be easily deployed with a single function call, useful specially for those users that not very used to work with R code and prefer a UI, but also powerful for experienced users since it gives a general view of the data.

All this methods are embebbed to the pagoo object, so is very easy to start working with pangenome data.

library(pagoo) # Load package
toy_rds <- system.file('extdata', 'campylobacter.RDS', package = 'pagoo')
p <- load_pangenomeRDS(toy_rds)

As you may have already noted, methods and visualization are different than data fields (see previous tutorials) in the way of calling them. As any R function, you must use parentheses to run them.

Statistical methods

In this tutorial some, but not all, methods will be covered. Once you get the idea, just browse the documentation to see other available methods and what arguments they take.

Gene Abundance Distance

Pairwise distances between gene abundance can be retrieved by using $dist() method. By default, Bray-Curtis distance is computed. This is just a wrapper of vegdist() from vegan package. p$dist()
##             16244_6_6 16244_6_18 17059_2_16 17059_2_23 17059_2_27 17150_1_73
## 16244_6_18 0.06594656
## 17059_2_16 0.12122816 0.12500000
## 17059_2_23 0.09622745 0.09632517 0.07632399
## 17059_2_27 0.09245937 0.11310008 0.10311629 0.08230990
## 17150_1_73 0.08203991 0.09034444 0.13624408 0.12275937 0.12999735
## 17059_2_42 0.08920705 0.09927089 0.14532148 0.12682137 0.13706919 0.09518600

Other distances available include, for instance, Jaccard distance (see documentation). As this method require presence/absence data, and not abundance, you should use it with binary = TRUE argument set.

p$dist(method = "jaccard", binary = TRUE) Principal Component Analysis A method to compute prcomp (stats package) over the panmatrix is provided. You can use generic functions to further analyze the results. pca <- p$pan_pca()
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6       PC7
## Standard deviation     8.1626 6.1103 5.5458 5.2924 4.6981 4.29398 6.693e-15
## Proportion of Variance 0.3278 0.1837 0.1513 0.1378 0.1086 0.09072 0.000e+00
## Cumulative Proportion  0.3278 0.5115 0.6629 0.8007 0.9093 1.00000 1.000e+00

Visualization methods

pagoo uses ggplot2 package to produce customizable visualizations. While is not necessary to load ggplot2 to plot pangenome data, is better since you can fully customize them. Also we will be using patchwork to arrange plots in some cases to show, side by side, the default plot and a customized one.

library(ggplot2)
library(patchwork) # To arrange plots

Summary Pie Chart

The most basic plot is a pie chart to show core, shell, and cloud genome proportions.

# Basic

Gene Presence/Absence BinMap

A binary map show gene presence/absence patterns per genome. Each column is a cluster of orthologous genes, and each row an organism. The columns (clusters) are sorted by column sums, so core genes appear to the left, cloud genes to the right, and shell genes in the middle. This plot is useful, for example, to identify strains with abnormal accessory gene patterns which can represent true biological signatures like the presence of extrachromosomal elements or artifacts product of contamination during sequencing.

p$gg_binmap() Rarefaction Curves Pangenome curves typically illustrate the number of gene clusters that are subsequently discovered as more genomes are added to the dataset. If the pangenome is open, more and more accessory genes will be discovered as new genomes are added to the analysis and the size of the core genome will tend to decrease. pagoo applies the the Power-law distribution to fit the pangenome size and the Exponential decay function to fit the core genome size. Rarefaction curves are computed by first performing permutations with $rarefact(), and then fitting a Power Law to pangenome counts, and an Exponential Decay Law to coregenome counts. The latter two operations are done by using $pg_power_law_fit() and $cg_exp_decay_fit() methods, respectively (see documentation for more details). $gg_curves() combines this methods to facilitate plotting. p$gg_curves()

You can add points and facet data by category, and add any other customization:

p$gg_curves() + ggtitle("Pangenome and Coregenome curves") + geom_point() + facet_wrap(~Category, scales = 'free_y') + theme_bw(base_size = 15) + scale_color_brewer(palette = "Accent") PCA Biplot A biplot for visualizing the first 2 principal components in a PCA is always useful at early stages of analysis to explore possible association of genomes based on gene content. This method uses $pan_pca() method to perform the PCA, and allows you to use organism metadata to color the points.

p$organisms ## DataFrame with 7 rows and 8 columns ## org id strain year country host ## <factor> <character> <character> <integer> <character> <character> ## 1 16244_6_6 FR15 2008/170h 2008 France Human ## 2 16244_6_18 FR27 2012/185h 2012 France Human ## 3 17059_2_16 AR1 99/801 1999 Argentina Bovine ## 4 17059_2_23 AR8 04/875 2004 Argentina Bovine ## 5 17059_2_27 AR12 06/195 2006 Argentina Bovine ## 6 17150_1_73 CA1 001A-0374 2005 Canada Human ## 7 17059_2_42 TW6 1830 2008 Taiwan Human ## source accession ## <character> <character> ## 1 Feces ERS672247 ## 2 Blood ERS672259 ## 3 Prepuce ERS739235 ## 4 Fetus ERS739242 ## 5 VM ERS739246 ## 6 Blood ERS686652 ## 7 Blood ERS739261 Since we have metadata associated to each organism, we will use it to color the organisms according to the country variable. p$gg_pca(colour = 'host', size = 4) +
theme_bw(base_size = 15) +
scale_color_brewer(palette = "Set2")

Shiny App

Last but not least, you can deploy a local shiny app by just one line of code.

p\$runShinyApp()

You can explore an online example of pagoo shiny app for a complete set of 69 Campylobacter fetus genomes at the following url: https://microgenlab.shinyapps.io/pagoo_campylobacter/ . This method is currently not intended for very large and complex datasets, as it may render slow. To work with big pangenomes we recommend the use of the R command line, but for tens or few hundred of genomes it works fairly good if you have patience.