Wellcome & Disclaimer

This site contains the materials for the Coding tools for Biochemistry & Molecular Biology (Herramientas de Programación para Bioquímica y Biología Molecular) course of fall 2022 in the Bachelor’s Degree in Biochemistry @UAM. This materials are the basis for GitHub-pages-based website that can be accessed here. Detailed academic information about the course contents, dates and assessment only can be found at the UAM Moodle site.

All this material is open access and it is shared under CC BY-NC license.

Plot your data in R

Quick Plotting in R

R has a number of built-in tools for diverse graph types such as histograms, scatter plots, stricharts, bar charts, boxplots, and more. Indeed, there are many functions in R to produce plots ranging from the very basic to the highly complex. We will show a few examples and use the plots as an excuse to add some new tricks for data analysis.

In Lesson 13 we will use the ggplot2 package for efficient generation of complex and cool plots. However, you will see that it’s sometimes useful to use the plotting functions in base R. These are installed by default with R and do not require any additional packages to be installed. They’re quick to type, straightforward to use in simple cases, and run very quickly. Then, if you want to do anything beyond very simple plots, though, it’s generally better to switch to ggplot2. In fact, once you already know how to use R’s base graphics, having these examples side by side will help you transition to using ggplot2 for when you want to make more sophisticated graphics.

The function plot() can be used to generate different types of plots, as detailed in the table and examples below.

Use of plot() function. Source: https://r-coder.com/plot-r/

For the first examples, we are going to review also some R functions to make up data following certain distributions, using functions like rnorm() or dnorm() for generation of normal data or normal density curves. The same can be obtained for Poisson distribution with dpois() and rpois().

#generate some sample data with normal distribution
x <- rnorm(500)
y <- x + rnorm(500)
plot(x)

hist(x)

plot(x, y)

plot(dnorm(0:100,mean=50,sd=5))

#customized Example
plot(x, y, pch = 21,
     bg = "red",   # Fill color
     col = "blue", # Border color
     cex = 3,      # Symbol size
     lwd = 3)      # Border width

Example Plot 1: coli_genomes_renamed.csv

Now let’s plot our data.

coli_genomes <- read.csv2(file = "data/coli_genomes_renamed.csv",
    strip.white = TRUE, stringsAsFactors = TRUE)
attach(coli_genomes)
# one variable
plot(Year)

# a factor
plot(Source)

# two variables
plot(Contigs, kmer)

# factor + numeric variable and saving the plot
png(file = "plot1.png")  #give it a file name
plot(Source, Contigs)  #construct the plot
dev.off()  #save
## quartz_off_screen 
##                 2

As shown in the example, in order to save a plot, we must follow three steps:

  1. Open the file indicating the format: png(), jpg(), svg(), postscript(), bmp(), win.metafile(), or pdf().

  2. Plot the data.

  3. Close the file: dev.off().

Alternatively, you can save the plot using the Plot menu or the Plots panel: Export –> Save as Image or Save as PDF.

As you may know, the R function cor() calculate the correlation coefficient between two or more vectors and cor.test() allow us to quickly perform a correlation test between two variables to know if the correlation is statistically significant. However, a quick plot can be also very useful.

In our table, we have some features of a list of E. coli isolates and the basic stats of the genome sequencing. Regarding this data, do you think that the number of contigs > 1kb in the genome assemblies (contigs1kb) correlates with the total number of contigs (Contigs), the average contig length (average_contig) or the number of contigs that contain 50% of the genome (N50). Let’s check the data using simple plots with the plot() function.

# correlation plots
par(mfrow = c(3, 1))
plot(coli_genomes$contigs1kb ~ coli_genomes$Contigs, xlab = "Contigs",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$Contigs), col = "red")
plot(coli_genomes$contigs1kb ~ coli_genomes$N50, xlab = "N50",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$N50), col = "red")
plot(coli_genomes$contigs1kb ~ coli_genomes$average_contig, xlab = "Average contig",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$average_contig),
    col = "red")

# now we will check with a cor.test and add some text to
# the plot
(test1 <- cor.test(coli_genomes$contigs1kb, coli_genomes$Contigs))
## 
##  Pearson's product-moment correlation
## 
## data:  coli_genomes$contigs1kb and coli_genomes$Contigs
## t = 7.651, df = 20, p-value = 2.306e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6945248 0.9420476
## sample estimates:
##      cor 
## 0.863334
(test2 <- cor.test(coli_genomes$contigs1kb, coli_genomes$N50))
## 
##  Pearson's product-moment correlation
## 
## data:  coli_genomes$contigs1kb and coli_genomes$N50
## t = -4.8281, df = 20, p-value = 0.0001021
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8823317 -0.4517569
## sample estimates:
##        cor 
## -0.7336341
(test3 <- cor.test(coli_genomes$contigs1kb, coli_genomes$average_contig))
## 
##  Pearson's product-moment correlation
## 
## data:  coli_genomes$contigs1kb and coli_genomes$average_contig
## t = -7.8453, df = 20, p-value = 1.574e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9444427 -0.7055990
## sample estimates:
##        cor 
## -0.8687624
str(test1)
## List of 9
##  $ statistic  : Named num 7.65
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named int 20
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 2.31e-07
##  $ estimate   : Named num 0.863
##   ..- attr(*, "names")= chr "cor"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "correlation"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Pearson's product-moment correlation"
##  $ data.name  : chr "coli_genomes$contigs1kb and coli_genomes$Contigs"
##  $ conf.int   : num [1:2] 0.695 0.942
##   ..- attr(*, "conf.level")= num 0.95
##  - attr(*, "class")= chr "htest"
plot(coli_genomes$contigs1kb ~ coli_genomes$Contigs, xlab = "Contigs",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$Contigs), col = "red")
text(200, 200, paste("Pearson r2=", round(test1$estimate, 2)))
plot(coli_genomes$contigs1kb ~ coli_genomes$N50, xlab = "N50",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$N50), col = "red")
text(250000, 200, paste("Pearson r2=", round(test2$estimate,
    2)))
plot(coli_genomes$contigs1kb ~ coli_genomes$average_contig, xlab = "Average contig",
    ylab = "Contigs > 1kb")
abline(lm(coli_genomes$contigs1kb ~ coli_genomes$average_contig),
    col = "red")
text(40000, 200, paste("Pearson r2=", round(test3$estimate, 2)))

# oh wait! why only one-vs-one?
plot(coli_genomes[, c("VF", "Plasmids", "kmer", "Contigs", "N50",
    "longest.contig..bp.", "Assembly_length", "contigs1kb", "average_contig")],
    main = "Multiple Correlation plot")

Selecting the right plot

Data visualizations are a vital component of a data analysis, as they have the capability of summarizing large amounts of data efficiently in a graphical format and bring out new insights that can be difficult to understand from raw data. There are many chart types available, each with its own strengths and use cases. One of the trickiest parts of the analysis process is choosing the right way to represent your data using one of these visualizations. A wrong plot can lead to confusion or data misinterpretation (see some examples of bad graphs here).

From the perspective of classical data analysis, the first consideration to select the right plot is the nature of your variables. Some plots are recommended for numerical data (quantitative variables) and other for categorical data (qualitative variables). Then, you should considering the role for your data visualization, i.e, question you want to address or the message you want to transmit. That will depend on how many variables, as well as their distribution, grouping or correlation.

You can see a description of some of the more common plots, with examples here or here. Rather than going through all of them, we are going to use some examples to learn some basic plotting and see the effect of plot selection in order to understand your data and use them to answer questions.

Example Plot 2: Zebrafish_data.csv

We are loading again the table described in Lesson 12 and we will try out some plots.

# read data
ZFdata <- read.csv("data/Zebrafish_data.csv")

# some plots
barplot(ZFdata)
## Error in barplot.default(ZFdata): 'height' must be a vector or a matrix
barplot(ZFdata$pLoC)

par(mfrow = c(1, 3))  #arrange the three plots in a row 
barplot(ZFdata$pLoC, col = "red", main = "pLoc")
barplot(ZFdata$EFNA3, col = "green", main = "EFNA3")
barplot(ZFdata$NC1s, col = "blue", main = "NC1s")

par(mfrow = c(2, 3))  #arrange the three plots in a row 
stripchart(ZFdata$pLoC, main = "pLoc")
stripchart(ZFdata$pLoC, method = "stack", col = "red", main = "pLoc")
stripchart(ZFdata$pLoC, method = "overplot", col = "red", main = "pLoc")
stripchart(ZFdata$pLoC, method = "jitter", col = "red", main = "pLoc")
stripchart(ZFdata$EFNA3, method = "stack", col = "green", main = "EFNA3")
stripchart(ZFdata$NC1s, method = "stack", col = "blue", main = "N1cs")

par(mfrow = c(1, 3))  #arrange the three plots in a row 
hist(ZFdata$pLoC, col = "red", main = "pLoc")
hist(ZFdata$EFNA3, col = "green", main = "EFNA3")
hist(ZFdata$NC1s, col = "blue", main = "N1cs")

par(mfrow = c(1, 3))  #arrange the three plots in a row 
boxplot(ZFdata$pLoC, col = "red", main = "pLoc")
boxplot(ZFdata$EFNA3, col = "green", main = "EFNA3")
boxplot(ZFdata$NC1s, col = "blue", main = "N1cs")

Now try to answer some questions about your data and obtained plots:

  1. Which construct has the strongest impact on the disemination of metastatic cells?

  2. Which plot represent best the data?

  3. Does a barplot represent the same information than a boxplot?

Now, we are going to represent the same data again.

oldpar <- par()  #save original settings (optional)
par(mfrow = c(1, 3), cex.lab = 1, cex = 1, lwd = 2)  #new settings

hist(ZFdata$pLoC, col = rgb(1, 0, 0, 0.5), main = "pLoc", ylim = c(0,
    20), xlim = c(0, 120), xlab = "Number of Metastatic cells")
hist(ZFdata$EFNA3, col = rgb(0, 1, 0, 0.5), main = "EFNA3", ylim = c(0,
    20), xlim = c(0, 120), xlab = "Number of Metastatic cells")
hist(ZFdata$NC1s, col = rgb(0, 0, 1, 0.5), main = "NC1s", ylim = c(0,
    20), xlim = c(0, 120), xlab = "Number of Metastatic cells")

stripchart(ZFdata$pLoC, method = "jitter", pch = 19, col = rgb(1,
    0, 0, 0.5), vertical = TRUE, main = "pLoc", ylim = c(0, 120),
    xlab = "Number of Metastatic cells")
stripchart(ZFdata$EFNA3, method = "jitter", pch = 19, col = rgb(0,
    1, 0, 0.5), vertical = TRUE, main = "EFNA3", ylim = c(0,
    120), xlab = "Number of Metastatic cells")
stripchart(ZFdata$EFNA3, method = "jitter", pch = 19, col = rgb(0,
    1, 0, 0.5), vertical = TRUE, main = "EFNA3", ylim = c(0,
    120), xlab = "Number of Metastatic cells")

boxplot(ZFdata$pLoC, col = rgb(1, 0, 0, 0.5), ylim = c(0, 120),
    xlab = "Number of Metastatic cells", main = "pLoc")
boxplot(ZFdata$EFNA3, col = rgb(0, 1, 0, 0.5), ylim = c(0, 120),
    xlab = "Number of Metastatic cells", main = "EFNA3")
boxplot(ZFdata$NC1s, col = rgb(0, 0, 1, 0.5), ylim = c(0, 120),
    xlab = "Number of Metastatic cells", main = "N1cs")

par(oldpar)  #restore (optional)
## Warning in par(oldpar): el parámetro del gráfico "cin" no puede ser especificado
## Warning in par(oldpar): el parámetro del gráfico "cra" no puede ser especificado
## Warning in par(oldpar): el parámetro del gráfico "csi" no puede ser especificado
## Warning in par(oldpar): el parámetro del gráfico "cxy" no puede ser especificado
## Warning in par(oldpar): el parámetro del gráfico "din" no puede ser especificado
## Warning in par(oldpar): el parámetro del gráfico "page" no puede ser
## especificado
# alternatively you can reset par with: par(no.readonly =
# TRUE)

In these plots above, you can see how plots can be deeply customized, and you can select colors (by name, RGB, hex or HSV code), size and shape of the points (cex and pch), plot and axis titles (main, xlab and ylab) and some tricks to avoid points overlapping (method = "jitter" or method="stacked"). Also, we have used the function par() to arrange the plots integrated as a single plot. This is a very useful function that can be used to set or query many graphical parameters (see below). The arguments mfrow and mfcol are very useful to place graphs indicating the row and cols (or cols and row in mfcol). However, any other graph customization can be included here to avoid reapeat it during the plot.

Do you think, you can answer better the questions now?

Beyond some basic examples of plotting in R, the take-home message of this example is that the type of plot and the plot parameters (in this case the scale) can be essential for correct interpretation of the data and if they are not properly adjusted the plot can be strongly misleading.

Further, to compare three conditions, we needed to make three independent plots. However, in the table, there are actually six conditions, and it is easy to imagine a table with more conditions. How could you plot that? Here is where the generation of a datamatrix table, as shown in Lesson 12 is very important.

ZF_stacked <- stack(ZFdata)

boxplot(ZF_stacked$values ~ ZF_stacked$ind, col = c(rgb(1, 0,
    0, 0.5), rgb(0, 1, 0, 0.5), rgb(0, 0, 1, 0.5), rgb(1, 0.5,
    0, 0.5), rgb(0.5, 0.5, 0.5, 0.5), rgb(0, 1, 1, 0.5)))
colorines = c(rgb(1, 0, 0, 0.5), rgb(0, 1, 0, 0.5), rgb(0, 0,
    1, 0.5), rgb(1, 0.5, 0, 0.5), rgb(0.5, 0.5, 0.5, 0.5), rgb(0,
    1, 1, 0.5))  #define the colors for the whole analysis
boxplot(ZF_stacked$values ~ ZF_stacked$ind, col = colorines)

stripchart(ZF_stacked$values ~ ZF_stacked$ind, vertical = TRUE,
    method = "jitter", col = colorines, pch = 19, cex = 1, ylab = "Number of cells",
    xlab = "Plasmid")

# what about barplot?
means <- aggregate(values ~ ind, data = ZF_stacked, mean)  #you can use by() or tapply()
barplot(means$values, ylim = c(0, 120), col = colorines, ylab = "Number of cells",
    xlab = "Plasmid")

The stripchart and the boxplot strongly suggest that NC2s is probably the strongest transcript, but that is not shown in the barplot. These plots clearly show that barplots are intended for single values (categorical data) and can mislead your conclusions.

Plot customization

There are many options to customize your plots, including font type and size, point shape, line type… You can see more info on the References section below. Let’s see some examples using code I borrow from https://r-coder.com/plot-r/

# point shape with 'pch'
r <- c(sapply(seq(5, 25, 5), function(i) rep(i, 5)))
t <- rep(seq(25, 5, -5), 5)

plot(r, t, pch = 1:25, cex = 3, yaxt = "n", xaxt = "n", ann = FALSE,
    xlim = c(3, 27), lwd = 1:3)
text(r - 1.5, t, 1:25)

# line type with 'lty'
M <- matrix(1:36, ncol = 6)
matplot(M, type = c("l"), lty = 1:6, col = "black", lwd = 3)

# Just to indicate the line types in the plot
j <- 0
invisible(sapply(seq(4, 40, by = 6), function(i) {
    j <<- j + 1
    text(2, i, paste("lty =", j))
}))

# plot box
par(mfrow = c(2, 3))

# plots
plot(x, y, bty = "o", main = "Default")
plot(x, y, bty = "7", main = "bty = '7'")
plot(x, y, bty = "L", main = "bty = 'L'")
plot(x, y, bty = "U", main = "bty = 'U'")
plot(x, y, bty = "C", main = "bty = 'C'")
plot(x, y, bty = "n", main = "bty = 'n'")

par(mfrow = c(1, 1))

Example Plot 3: more customization

Now let’s think again in our E. coli genomes. How would you add more layers of information to the plot, like labels of specific points.

# point labels

# Create the plot
plot(coli_genomes$contigs1kb ~ coli_genomes$Contigs, pch = 19,
    col = rgb(1, 0, 0, 0.5), xlab = "Contigs", ylab = "Contigs > 1kb")

# Plot the labels
text(coli_genomes$contigs1kb ~ coli_genomes$Contigs, labels = coli_genomes$Strain,
    cex = 0.6, pos = 4, col = "red")

# only some points
selected <- c(7, 17, 18)
plot(coli_genomes$contigs1kb ~ coli_genomes$Contigs, pch = 19,
    col = rgb(1, 0, 0, 0.5), xlab = "Contigs", ylab = "Contigs > 1kb")

# Index the elements with the vector
text(coli_genomes$contigs1kb[selected] ~ coli_genomes$Contigs[selected],
    labels = coli_genomes$Strain[selected], cex = 0.6, pos = 4,
    col = "red")

# another different example
attach(USJudgeRatings)
plot(FAMI, INTG, main = "Familiarity with law vs Judicial integrity",
    xlab = "Familiarity", ylab = "Integrity", pch = 18, col = "blue")
text(FAMI, INTG, labels = row.names(USJudgeRatings), cex = 0.6,
    pos = 4, col = "red")

detach(USJudgeRatings)

You can look for more custom options because there are a lot. I also suggest looking at the identify() function, which allows the quick interactive identification and labeling of selected points.

How would you plot several variables by groups? For this, we can use a loop or an apply family trick (see Lesson 12).

# grouping
par(mfrow = c(1, 3))
# how can you generate several plots for different variable
# by factors??

# option 1: looping
system.time(for (i in 9:11) {
    boxplot(by(coli_genomes[, i], coli_genomes$Source, summary),
        main = names(coli_genomes[i]))
})

##    user  system elapsed 
##   0.023   0.004   0.035
# option2: lapply
ploteando <- function(x) {
    boxplot(by(x, coli_genomes$Source, summary))
}
system.time(lapply(coli_genomes[, 9:11], FUN = ploteando))

##    user  system elapsed 
##   0.035   0.008   0.050

In this example lapply() returns a list of plots but (in this case with only 3 plots) you can notice that it is not faster than a loop. Finally, loops are not so slow on R, at least small loops.

Exercise

Repeat the scatterplot above of contigs1kb vs. Contigs, but coloring the points by Source as in the plot below. You just need to assign the colors in the plot to a new vector that encodes the levels of the factor as colors.

Finally, use the function legend() to add a legend.

Here is the code for the plot, with some more options:

coli_genomes <- read.csv2(file = "data/coli_genomes_renamed.csv",
    strip.white = TRUE, stringsAsFactors = TRUE)
colorines <- sapply(coli_genomes$Source, FUN = function(x) switch(x,
    Avian = rgb(1, 0, 0, 0.5), Human = rgb(0, 1, 0, 0.5), Porcine = rgb(0,
        0, 1, 0.5)))
names(colorines) <- coli_genomes$Source

# option2: swicth in character vector will keep the
# original as names
colorines2 <- sapply(as.character(coli_genomes$Source), FUN = function(x) switch(x,
    Avian = rgb(1, 0, 0, 0.5), Human = rgb(0, 1, 0, 0.5), Porcine = rgb(0,
        0, 1, 0.5)))
all.equal(colorines, colorines2)  #is the same?

# plot
plot(coli_genomes$contigs1kb ~ coli_genomes$Contigs, pch = 19,
    col = colorines2, xlab = "Contigs", ylab = "Contigs > 1kb")

# legend
legend(100, 200, fill = unique(colorines2), legend = unique(names(colorines2)),
    ncol = 2)
legend("top", fill = unique(colorines2), legend = unique(names(colorines2)),
    ncol = 2)
legend("bottom", fill = unique(colorines2), legend = unique(names(colorines2)),
    ncol = 2)
legend("topright", fill = unique(colorines2), legend = unique(names(colorines2)),
    ncol = 2)

Exercises

1. Let’s suppose we have the expression data for two genes (GeneA and GeneB) in 50 patients, 30 with colon cancer and 20 with lung cancer.

For the exercise, we are using random data, generated as follows:

(GeneA <- rnorm(50))
(GeneB <- c(rep(-1, 30), rep(2, 20)) + rnorm(50))
(tumor <- factor(c(rep("Colon", 30), rep("Lung", 20))))

a. Before plotting the data, how would you test if there is difference in the expression level by the type of tumor?

b. Plot boxplots and density curves of the expression of both genes by type of tumor. Do the plots agree with your answer to the question a?

Tips. Check out the functions density() and lines() in order to plot the density curves and polygon() if you want to have filled the area below the curves. Also, note that when plotting multiple lines, you must set the axis limits with the xlim and ylim arguments when plotting the first curve.

2. Replica plot.

The file aapc.csv is a plain text file that contains information amino acid composition in all the sequences deposited in the UniProtKB/Swiss-Prot protein data base. Use that file to reproduce the following barplot.

3. A classic in biochemistry lab: Protein quantification (with special thanks to Carmen Mayoral). You have probably done that with Excel or GraphPad, but after this exercise I’m sure you’ll consider a definitive upgrade in your protocols.

In our lab, we often purify recombinant DNA polymerases. We have recently purify a WT and mutant proteins, and we want to quantify the amount of pure proteins using a SDS-PAGE. To do so, we loaded decreasing volumes of the protein preparation (serial dilutions, but expresed as volume of the original sample in µL) along with known amounts (ng) of bovine seroalbumin (BSA) as a pattern, as in the image below. Bands were densitometer using ImageJ opensource software (arbitrary units). The data are in three tables containing the BSA pattern (triplicates in bsa_pattern.csv), the results for the wt protein (curve_wt.csv) and for the mutant (curve_mut.csv).

  1. Plot the correlation between BSA amounts and the bands density. Color the points by the replicate and include the linear model line, as well as the Pearson R2.

  2. Use the linear model to obtain the concentration (µM) of wt and mutant protein preparations. Consider the same molecular weight of 99 kDa for both proteins.

Session Info

sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] formatR_1.12 knitr_1.41  
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.30   R6_2.5.1        jsonlite_1.8.3  magrittr_2.0.3 
##  [5] evaluate_0.18   highr_0.9       stringi_1.7.8   cachem_1.0.6   
##  [9] rlang_1.0.6     cli_3.4.1       rstudioapi_0.14 jquerylib_0.1.4
## [13] bslib_0.4.1     rmarkdown_2.18  tools_4.2.2     stringr_1.4.1  
## [17] xfun_0.35       yaml_2.3.6      fastmap_1.1.0   compiler_4.2.2 
## [21] htmltools_0.5.3 sass_0.4.4