Welcome & Disclaimer

This site contains the materials for the Coding tools for Biochemistry & Molecular Biology (Herramientas de Programación para Bioquímica y Biología Molecular) course of fall 2022 in the Bachelor’s Degree in Biochemistry @UAM. These materials are the basis for a GitHub Pages-based website that can be accessed here. Detailed academic information about the course contents, dates and assessment can be found only at the UAM Moodle site.

All this material is open access and is shared under a CC BY-NC license.

Plot your data in R - Episode II

As we discussed in Lesson 11, exploratory data visualization is one of the greatest advantages of R. One can quickly go from idea to data to graph with a unique balance of flexibility and ease.

There are many graphing options available in R. As you already know, the graphing capabilities that come with a basic installation of R are quite useful. There are also a number of packages for creating advanced graphs, like grid, plotly, or lattice. In this course we chose ggplot2 because it is widely used and is also the basis of many derivative packages for specific advanced plots. The main advantage of ggplot2 is that it breaks plots into components in a way that allows beginners to create relatively complex and aesthetically pleasing plots using an intuitive and relatively easy-to-remember syntax.

One reason why ggplot2 is generally more intuitive for beginners is that it uses a “graphing grammar” (see Wilkinson et al. 2000), the gg of ggplot2. This is analogous to the way that learning a language's grammar can help you construct hundreds of different sentences out of a small number of verbs, nouns, and adjectives, rather than memorizing each specific sentence. Similarly, by learning a small number of the basic components of ggplot2 and the elements of its grammar, you will be able to create hundreds of different plots and render the data exactly the way you think is best.
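To make the grammar analogy concrete, here is a minimal sketch that reuses one data + aesthetic "sentence" with three different geoms. It assumes ggplot2 is installed and uses the built-in `mtcars` dataset rather than the course data; the object names (`base`, `boxes`, `violins`, `dots`) are only illustrative.

```r
library(ggplot2)

# Data and aesthetic mapping are defined once...
base <- ggplot(mtcars, aes(x = factor(cyl), y = mpg))

# ...and combined with different geoms to obtain different plots
boxes   <- base + geom_boxplot()
violins <- base + geom_violin()
dots    <- base + geom_jitter(width = 0.1)

boxes  # printing the object draws the plot
```

Only the last "word" of each sentence changes; the data and the mapping are written a single time.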

Example 1: Advanced plotting with ggplot in one line

# load the data
library(data.table)
vaccines <- fread("data/vaccines_EU_22oct2022.csv")
# install & load ggplot2
if (!require(ggplot2)) install.packages("ggplot2")
## Loading required package: ggplot2
library(ggplot2)
# plot
ggplot(vaccines, aes(x = YearWeekISO, y = log10(FirstDose), col = TargetGroup)) +
    geom_point() + facet_wrap(~ReportingCountry)

Although it may have taken a while to render, the code above generated a complex graphic in only one line. This example contains four basic elements:

  1. Data (in this case vaccines)

  2. Aesthetic: aes()

  3. Layer: geom_point()

  4. Facet: facet_wrap()

However, these are not all the possible elements of a ggplot call. As with any language, the grammar of graphics is flexible: we may omit some elements or add more elements of the same type, just like we can add diverse kinds of complements (place, time…) to a sentence.
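As a small illustration of that flexibility, the sketch below omits the facet altogether and adds two layers instead of one. It assumes ggplot2 is installed and uses the built-in `mtcars` dataset; `p_layers` is just an illustrative name.

```r
library(ggplot2)

# No facet here (an element can be omitted), but two layers:
# the raw points plus a fitted straight line
p_layers <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)

p_layers
```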

From a general perspective (see ref. 4), plots are composed of the data, the information you want to visualize, and a mapping, the description of how the data’s variables are mapped to aesthetic attributes. There are five mapping components (again from ref. 4):

  • A layer is a collection of geometric elements and statistical transformations. Geometric elements, geoms for short, represent what you actually see in the plot: points, lines, polygons, maps, etc. Statistical transformations, stats for short, summarize the data: for example, binning and counting observations to create a histogram, or fitting a linear model.

  • Scales map values in the data space to values in the aesthetic space. This includes the use of colour, shape or size. Scales also draw the legend and axes, which make it possible to read the original data values from the plot (an inverse mapping).

  • A coord, or coordinate system, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to help read the graph. We normally use the Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.

  • A facet specifies how to break up and display subsets of data as small multiples (usually according to a factor-type vector). This is also known as conditioning or latticing.

  • A theme controls the finer points of display, like the font size and background colour. While the defaults in ggplot2 have been chosen with care, you may need to consult other references to create an attractive plot.
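The five components listed above can be spelled out explicitly in a single call. This is only a sketch, assuming ggplot2 is installed and using the built-in `mtcars` dataset; in everyday code most of these elements are left at their defaults.

```r
library(ggplot2)

p_full <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(gear))) +
    geom_point() +                                    # layer: a geom
    stat_smooth(method = "lm", formula = y ~ x,
        se = FALSE) +                                 # layer: a stat
    scale_colour_brewer(palette = "Dark2",
        name = "Gears") +                             # scale
    coord_cartesian(xlim = c(1, 6)) +                 # coord
    facet_wrap(~am) +                                 # facet
    theme_minimal(base_size = 12)                     # theme

p_full
```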

Let’s see all those elements through more examples.

Example 2: From R base plot to ggplot

The use of ggplot is usually associated with great, gorgeous charts and plots. However, once you learn how to use it and how to adapt and readapt code to your data, you will probably use ggplot for every graph.

In the following code, we are going to use ggplot to solve exercise 1 from Lesson 11.

# create the same dataset
set.seed(2013)
(GeneA <- rnorm(50))
##  [1] -0.09202453  0.78901912 -0.66744232  1.36061149  1.50768816 -2.60754997
##  [7]  0.68727212  0.31557476  2.02027688 -1.42361769  0.12517209 -1.44320134
## [13] -0.60880250 -1.00991165  0.87674367 -0.12930641 -2.50635437 -1.40955606
## [19] -1.94699972 -1.26446327 -0.43884950 -0.25605301 -0.26469887 -0.10566397
## [25] -0.82539261  0.49729533 -0.67084289 -0.01708127  1.15749514  0.80898869
## [31]  0.23217880 -1.31806388 -0.41527722  0.49687828 -0.08924889 -0.53341712
## [37] -0.25110368 -1.92372936  0.79160297 -1.24688229 -1.75004602 -0.07749612
## [43]  1.03417653 -0.25473888  0.30563785  2.23979118 -0.09413188  0.56692140
## [49] -0.43870799  1.04595609
(GeneB <- c(rep(-1, 30), rep(2, 20)) + rnorm(50))
##  [1] -0.70688901 -1.11270134  0.04748857 -0.29431894 -0.46466621 -2.49286126
##  [7] -1.82383399 -1.83096793 -2.57442803  0.67223483  0.92282863 -0.99496690
## [13] -2.53926614 -1.57654868 -1.32482996 -1.61520806 -1.04619868 -2.20795455
## [19] -0.99922529 -2.17361066 -1.03934427 -1.93229960 -1.90894245 -2.30262035
## [25] -0.34950473 -0.98683383 -0.60318761  0.15508957 -1.93541211 -0.37318571
## [31]  2.16081491  1.41453392  1.89025500  2.23165814  2.48152152  0.89572125
## [37]  0.87921210  0.54220152  1.93609920  2.47722667  1.84504925  1.25617301
## [43]  2.07154338  0.65515636  3.07255634  3.68094033  3.61132686  0.34453072
## [49]  2.75151317  2.29373155
(tumor <- factor(c(rep("Colon", 30), rep("Lung", 20))))
##  [1] Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon
## [13] Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon Colon
## [25] Colon Colon Colon Colon Colon Colon Lung  Lung  Lung  Lung  Lung  Lung 
## [37] Lung  Lung  Lung  Lung  Lung  Lung  Lung  Lung  Lung  Lung  Lung  Lung 
## [49] Lung  Lung 
## Levels: Colon Lung
# for ggplot it is more convenient to work with dataframes
genes <- data.frame(tumor, GeneA, GeneB)

# basic plots

# geneA
ggplot(genes, aes(x = tumor, y = GeneA, color = tumor)) + geom_boxplot()

# geneB with custom colors
ggplot(genes, aes(x = tumor, y = GeneB, fill = tumor)) +
    scale_fill_manual(values = c("#999999", "#E69F00")) + geom_boxplot()

# to plot both genes together, we need to reshape the dataset with stack()
genes2 <- cbind(stack(genes[, 2:3]), tumor)
names(genes2) <- c("Expression", "Gene", "Tumor")
# generate the plot
p <- ggplot(genes2, aes(x = Gene, y = Expression, fill = Gene)) +
    scale_fill_manual(values = c("#999999", "#E69F00")) + geom_boxplot()
p  #see the plot

p + facet_grid(. ~ Tumor)  #new version
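If it is not obvious what stack() does to the data, the following toy example may help. The values are hypothetical and only base R is needed.

```r
# A tiny wide table: one column per gene (hypothetical values)
wide <- data.frame(GeneA = c(1.2, 0.4), GeneB = c(-0.7, 2.1))

# stack() returns a long table with the columns `values` and `ind`
long <- stack(wide)
long
##   values   ind
## 1    1.2 GeneA
## 2    0.4 GeneA
## 3   -0.7 GeneB
## 4    2.1 GeneB
```

In the lesson code, cbind() then recycles the 50-element tumor factor across the 100 stacked rows, which gives the right labels because both genes were measured in the same samples, in the same order.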

For these simple plots, making them with base R plot functions like boxplot() or stripchart() or with ggplot() takes a very similar amount of effort and time, but this is only the very tip of the ggplot iceberg.
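For comparison, a base R version of the first boxplot might look like the sketch below. It is one possible way to draw it, not necessarily the solution given in Lesson 11, and it recreates the data so that it runs on its own.

```r
# Recreate the data frame from the example above
set.seed(2013)
GeneA <- rnorm(50)
GeneB <- c(rep(-1, 30), rep(2, 20)) + rnorm(50)
tumor <- factor(c(rep("Colon", 30), rep("Lung", 20)))
genes <- data.frame(tumor, GeneA, GeneB)

# Base R: boxplot with the individual points overlaid
boxplot(GeneA ~ tumor, data = genes, col = c("#999999", "#E69F00"))
stripchart(GeneA ~ tumor, data = genes, vertical = TRUE, method = "jitter",
    pch = 21, add = TRUE)
```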

Plot customization

Customization of your plot is very easy thanks to the themes and other options, as in the examples below. Check the built-in ggplot themes: https://ggplot2.tidyverse.org/reference/ggtheme.html. You can also find packages with custom themes, and you can create your own theme (check this article by Emanuela Furfaro).

(q <- p + facet_grid(. ~ Tumor) + theme_light())

(q2 <- p + facet_grid(. ~ Tumor) + theme_dark())

(q3 <- p + facet_grid(. ~ Tumor) + theme_linedraw())

q + stat_boxplot(geom = "errorbar", width = 0.5, alpha = 0.5)

q + theme(legend.position = "top")

q + theme(legend.position = "bottom")

q + theme(legend.position = "none")
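Creating your own theme usually amounts to wrapping a built-in theme plus a few theme() overrides in a function. The following is a minimal sketch, assuming ggplot2 is installed; `theme_course()` is a hypothetical name and the chosen settings are arbitrary.

```r
library(ggplot2)

# Hypothetical custom theme: start from a built-in one and override elements
theme_course <- function(base_size = 12) {
    theme_light(base_size = base_size) +
        theme(legend.position = "bottom",
            strip.text = element_text(face = "bold", colour = "black"),
            panel.grid.minor = element_blank())
}

# It is then used like any other theme
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
    geom_boxplot() +
    facet_grid(. ~ am) +
    theme_course()
```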

Now, let’s try some more cool example plots with these same data.

r <- ggplot(genes2, aes(x = Gene, y = Expression, fill = Tumor)) +
    scale_fill_manual(values = c("#999999", "#E69F00")) + geom_boxplot()

r + geom_dotplot(binaxis = "y", stackdir = "center", position = position_dodge(0.75))
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

r + geom_point(pch = 21, position = position_jitterdodge())

# more examples
a <- ggplot(genes2, aes(x = Expression))
a + geom_density(aes(fill = Gene), alpha = 0.4) +
    scale_fill_manual(values = c("#868686FF", "#EFC000FF")) +
    theme_classic() + facet_grid(. ~ Tumor)

library(RColorBrewer)
a <- ggplot(genes2, aes(x = Gene, y = Expression, fill = Tumor)) +
    scale_fill_brewer(palette = "Pastel1") + geom_violin(alpha = 0.4,
    position = "dodge")
a

Color palettes

When you have large datasets or several factors in your data, selecting the colors is not trivial. In R, there are predefined color combinations, or palettes, that you can select for your plot. Moreover, there are also several packages that contain custom color palettes suitable for base plots and/or ggplots, like viridis or RColorBrewer. You may also find the package ggsci interesting, as it contains palettes with colors used in scientific journals, data visualization libraries, science fiction movies…
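As a quick sketch of how a palette is swapped, the code below applies two different palettes to the same plot. It assumes ggplot2 is installed and uses its bundled `mpg` dataset; both scale functions ship with ggplot2 itself, so no extra package is loaded here.

```r
library(ggplot2)

# One bar per vehicle class (7 classes), filled by class
p_pal <- ggplot(mpg, aes(x = class, fill = class)) + geom_bar()

p_pal + scale_fill_viridis_d()                 # viridis palette, discrete version
p_pal + scale_fill_brewer(palette = "Set2")    # a ColorBrewer palette
```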

To obtain the desired plot, you sometimes need to rotate the x-axis labels, which is done with an argument to theme(). You can customize the label rotation, justification, font…

coli_genomes <- read.csv2(file = "data/coli_genomes_renamed.csv",
    strip.white = TRUE, stringsAsFactors = TRUE)
p <- ggplot(coli_genomes, aes(x = Strain, y = contigs1kb, fill = Source)) +
    geom_bar(stat = "identity", position = "dodge", alpha = 0.6)
p

p2 <- p + scale_fill_brewer(palette = "Dark2") + theme_linedraw()
p2

p2b <- p + scale_fill_brewer(palette = "Pastel2") + theme_linedraw()
p2b

p3 <- p2 + theme(axis.text.x = element_text(angle = 45, hjust = 1,
    face = "bold"))
p3

In the example above, we generated plots with some custom palettes. These palettes are available in ggplot2, but in order to explore and edit them, you need to install the RColorBrewer package.

if (!require(RColorBrewer)) install.packages("RColorBrewer")
library(RColorBrewer)
display.brewer.all()

display.brewer.all(colorblindFriendly = TRUE)  #only colorblind-friendly!

# display only a given number of colors from a specific palette
display.brewer.pal(n = 3, name = "Dark2")

Saving your plot

Finally, saving plots is also very easy with ggplot2; check the function ggsave().

# save
ggsave(filename = "plot_p3.svg", plot = p3, width = 10, height = 6)

Of course, you can also do it the same way as with base plots:

svg("plot_p3b.svg")
print(p3)
dev.off()
## quartz_off_screen 
##                 2
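As a complement, here is a short sketch of other ggsave() options. It assumes ggplot2 is installed, writes to a temporary folder, and uses a throwaway plot (`p_save`) instead of the course data. Note that ggsave() relies on the svglite package for .svg files, whereas the formats below need no extra package.

```r
library(ggplot2)

p_save <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# The file extension selects the graphics device; sizes are in inches by default
out_png <- file.path(tempdir(), "plot_example.png")
ggsave(filename = out_png, plot = p_save, width = 10, height = 6, dpi = 300)

# Other units can be requested explicitly
out_pdf <- file.path(tempdir(), "plot_example.pdf")
ggsave(filename = out_pdf, plot = p_save, width = 16, height = 10, units = "cm")

file.exists(c(out_png, out_pdf))
```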

Example 3: Plotting multiple variables

One of the nice ways to summarize data is plotting more than one variable in the same plot. However, different variables may need different scales to be rendered in the same plot. Thus, you should scale the values of one variable up or down to the range of the second variable. Then, for the secondary axis, you just need to apply the same scaling factor in the opposite direction.

In the following example, we plot the number of Contigs and the Assembly length from our coli_genomes.csv.

# barplot
plots2 <- ggplot(data = coli_genomes) + geom_bar(aes(x = Strain,
    y = Contigs, color = Source, fill = Source), stat = "identity")
# add the points and adjust the scale in the right axis
plots2b <- plots2 + geom_point(aes(x = Strain, y = Assembly_length/15000,
    fill = Source, alpha = 0.8), col = "black", shape = 21, size = 3)
plots2b

plots2c <- plots2b + scale_y_continuous(name = "Number of Contigs",
    limits = c(0, 400), expand = c(0, 0), sec.axis = sec_axis(~15000 *
        ., name = "Assembly length"))
plots2c

# add colors and more customization (built on plots2c to keep the
# secondary axis)
plots2custom <- plots2c + scale_fill_brewer(palette = "Set1") +
    scale_color_brewer(palette = "Set1") + theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold")) +
    guides(alpha = "none", color = "none")
plots2custom

We built the plot in stages, to check each of the variables and the scale of the secondary axis before applying the customization.

As you noticed, axis customization may entail adjusting the limits with the argument limits, the axis expansion below and above those limits with the argument expand, and other aspects such as the axis ticks with the argument breaks. We can also add or remove a legend with the function guides().
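Rather than hard-coding the scaling factor, it can be computed from the data. The sketch below uses a small hypothetical table (the values are made up and `k`, `d` and `p_axes` are illustrative names) and assumes ggplot2 is installed. It also shows the limits, breaks and expand arguments together.

```r
library(ggplot2)

# Hypothetical data with two variables on very different scales
d <- data.frame(sample = c("A", "B", "C"),
    contigs = c(100, 250, 400),
    length = c(4.6e6, 5.1e6, 5.4e6))

# Scaling factor that brings `length` into the range of `contigs`
k <- max(d$length) / max(d$contigs)

p_axes <- ggplot(d, aes(x = sample)) +
    geom_col(aes(y = contigs), fill = "grey70") +
    geom_point(aes(y = length / k), size = 3) +   # scaled down for plotting
    scale_y_continuous(name = "Number of contigs",
        limits = c(0, 450),                       # axis range
        breaks = seq(0, 400, by = 100),           # tick positions
        expand = c(0, 0),                         # no padding beyond the limits
        sec.axis = sec_axis(~ . * k, name = "Assembly length"))  # scaled back up
p_axes
```

Dividing by `k` in the layer and multiplying by `k` in sec_axis() are the two halves of the same transformation, so the right-hand axis reads in the original units.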

ggplot and beyond

As mentioned above, ggplot is already a standard and the basis of many derivative packages. We are going to see a couple of examples.

Interactive plots

Another interesting application of ggplot is the generation of interactive plots to be published on websites. One package that you might find of interest is heatmaply, which generates interactive heatmaps. Further, I find the function ggplotly() from the package plotly awesome for quickly upgrading any ggplot to an interactive plot.

See the examples:

# install.packages('plotly')
# install.packages('heatmaply')

library(heatmaply)
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: viridis
## Loading required package: viridisLite
## 
## ======================
## Welcome to heatmaply version 1.4.0
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags: 
##   https://stackoverflow.com/questions/tagged/heatmaply
## ======================
# aggregate the data with xtabs
counts <- xtabs(~coli_genomes[, 4] + coli_genomes[, 5])
# as.data.frame() would return a long table; as.data.frame.matrix()
# keeps the wide, matrix-like layout that heatmaply needs
heatmaply(as.data.frame.matrix(counts))
library(plotly)
ggplotly(plots2custom)
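The conversion step before heatmaply() can be inspected on a toy table. The column names and values below are hypothetical (they are not the actual columns 4 and 5 of coli_genomes), and only base R is needed.

```r
# Hypothetical table with two categorical columns
d <- data.frame(Source = c("Avian", "Human", "Avian", "Porcine"),
    Phylogroup = c("A", "B2", "A", "A"))

# Contingency table of counts (class "xtabs")
tab <- xtabs(~Source + Phylogroup, data = d)

# as.data.frame(tab) would give a long table, one row per combination;
# as.data.frame.matrix() keeps the wide layout, one row per Source
wide <- as.data.frame.matrix(tab)
wide
##         A B2
## Avian   2  0
## Human   0  1
## Porcine 1  0
```

A call such as heatmaply(wide) would then draw the interactive heatmap, as in the lesson code.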

References

  1. R in Action. Robert I. Kabacoff. March 2022. ISBN 9781617296055

  2. R Graphics Cookbook: https://r-graphics.org/ (I recommend Appendix A: Understanding ggplot).

  3. ggplot2 cheatsheet: https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf

  4. ggplot2: Elegant Graphics for Data Analysis: https://ggplot2-book.org/index.html

  5. ggplot in “Introducción a la ciencia de datos” (in Spanish!): http://rafalab.dfci.harvard.edu/dslibro/ggplot2.html

  6. Color palettes in R: https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/

Exercises

1. The table coli_curve.csv contains the growth curves of three E. coli strains. Plot the curves as a scatterplot with lines showing the 90% confidence interval for the three strains.

Tips.

  • Check the ggplot geom geom_smooth() for the confidence interval.

  • Also, it would be better to color the data by strain, while keeping the information about each sample measurement using, for instance, the point shape.

2. The table microbe_download.csv contains data on worldwide human deaths associated with antimicrobial resistance, as indicated in the column Counterfactual (data from https://vizhub.healthdata.org/microbe/). Explore the data and use it to reproduce the following barplots.

Tips.

  • You will need to check geom_errorbar() for the first plot.

  • For the second plot, you will need to order the data by the total number of Deaths.

  • Also, adjust the plot margins in order to show all the x-axis labels.

Session Info

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggseqlogo_0.1      heatmaply_1.4.0    viridis_0.6.2      viridisLite_0.4.1 
##  [5] plotly_4.10.1      RColorBrewer_1.1-3 ggplot2_3.4.0      data.table_1.14.6 
##  [9] formatR_1.12       knitr_1.41        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.9        svglite_2.1.0     tidyr_1.2.1       assertthat_0.2.1 
##  [5] digest_0.6.30     foreach_1.5.2     utf8_1.2.2        plyr_1.8.8       
##  [9] R6_2.5.1          evaluate_0.18     httr_1.4.4        highr_0.9        
## [13] pillar_1.8.1      rlang_1.0.6       lazyeval_0.2.2    rstudioapi_0.14  
## [17] jquerylib_0.1.4   rmarkdown_2.18    textshaping_0.3.6 labeling_0.4.2   
## [21] webshot_0.5.4     stringr_1.4.1     htmlwidgets_1.5.4 munsell_0.5.0    
## [25] compiler_4.2.2    xfun_0.35         pkgconfig_2.0.3   systemfonts_1.0.4
## [29] htmltools_0.5.3   tidyselect_1.2.0  tibble_3.1.8      gridExtra_2.3    
## [33] seriation_1.4.0   codetools_0.2-18  dendextend_1.16.0 fansi_1.0.3      
## [37] dplyr_1.0.10      withr_2.5.0       grid_4.2.2        registry_0.5-1   
## [41] jsonlite_1.8.3    gtable_0.3.1      lifecycle_1.0.3   DBI_1.1.3        
## [45] magrittr_2.0.3    scales_1.2.1      cli_3.4.1         stringi_1.7.8    
## [49] cachem_1.0.6      reshape2_1.4.4    farver_2.1.1      ca_0.71.1        
## [53] bslib_0.4.1       ragg_1.2.4        generics_0.1.3    vctrs_0.5.1      
## [57] iterators_1.0.14  tools_4.2.2       glue_1.6.2        purrr_0.3.5      
## [61] crosstalk_1.2.0   fastmap_1.1.0     yaml_2.3.6        colorspace_2.0-3 
## [65] TSP_1.2-1         sass_0.4.4