Comparison of MDA methods

Author

Modesto

Published

March 17, 2023


1 Disclaimer

This data analysis is preliminar and will be soon published in a manuscript authored by Carlos D. Ordóñez, Conceçao Egas and Modesto Redrejo-Rodríguez. The GitHub repo contains the original result files from all the analyses. Raw sequencing data has been deposited in the e-cienciaDatos repository with DOI: 10.21950/HCNDGF

Data is shared under creative common license (CC BY-NC 3.0 ES). Contact modesto.redrejo@uam.es for further info.

2 Input sample

The input DNA to be amplified is a mini mock metagenome made up of equal mass of purified genomes from 4 different bacteria, spanning Gram+ and Gram- species, and genomes with high and low GC, as detailed in the Table 11. Note that the used reference genomes were selected later on, using top hits of BlastN searches performed with a subset longest contigs from each assembly as query.

Show the code
# Define color palette for the whole report to facilitate future modifications
colorines <- c(`NA` = "grey", NA2 = "grey", `RP-MDA` = "palegreen4", `RP-MDA2` = "palegreen4",
    `PrimPol-MDA` = "orangered4", piPolB = "gold", `piPolB+D` = "darkgoldenrod1", piMDA = "deepskyblue3",
    piMDA2 = "deepskyblue3", `piMDA+D` = "blue4", `piMDA+D2` = "blue4", `B. subtilis` = "orange2",
    `E. coli` = "gold1", `K. rhizophila` = "skyblue1", `P. aeruginosa` = "slateblue2")

# load data
samples <- read.table("amplification_yield/input_sample.csv", header = TRUE, sep = ";")
names(samples) = c("Strain", "Model strain", "Reference genome", "Length (MB)", "GC content")
kbl(samples, align = "c", caption = "Table 1. Composition of the amplified mock-metagenome") %>%
    kable_styling(bootstrap_options = "striped", full_width = F, position = "center") %>%
    column_spec(1, italic = T)
Table 1. Composition of the amplified mock-metagenome
Strain Model strain Reference genome Length (MB) GC content
E.coli E. coli K12 MG1655 NC_000913.3 4.64 50.79
B. subtilis 110NA B. subtilis 168 NC_000964.3 4.21 43.51
P. aeruginosa P. aeruginosa PAO1 NC_002516.2 6.50 66.37
Micrococcus luteus K. rhizophila ATCC 9341 NC_010617.1 2.70 70.56
Show the code
# %>% save_kable('figures/samples.pdf')

3 DNA amplification assays (MDA)

DNA amplification was carried out by six different protocols. (1) Random-primer hexamers-based MDA (RP-MDA) was carried out with REPLIg (Qiagen) and (2) TruePrime (4basebio) was used for a control primer-less MDA (PrimPol-MDA). In both cases, amplification protocols followed the manufacturer’s instructions, including the routine alkaline denaturation step. Recombinant piPolB (Redrejo-Rodríguez et al. 2017) was used for a single-enzyme, primer-free DNA amplification in a single step MDA protocol (3) or an alternative protocol (4) with a previous alkaline denaturation step (piPolB+D). Additionally, a new method based on the combination of piPolB and Φ29 DNA polymerases was developed. Two alternative protocols were used, (5) a single-step primer-independent MDA (piMDA) and a modification (6) with a previous alkaline denaturation of input DNA (piMDA+D).

Amplification yield was assessed by absolute measure of DNA amplification product was performed by fluorescence quantitation of dilutions from each sample (triplicates) with the AccuBlue High Sensitivity dsDNA Quantitation Kit (Biotum, #31006) using a FLUOstar Omega (BMG Labtech) fluorimeter. AccuBlue High Sensitivity dsDNA Standards (Biotum, #31006C) were used to do the range standard curve in each experiment.

3.1 Methods comparison

Show the code
# read GraphPad Prism file from Carlos experiment
df <- read_pzfx("amplification_yield/Exp34_Datos.pzfx", table = 4)
free <- reshape2::melt(df[, -2], id.vars = "ROWTITLE")
free$variable <- as.character(free$variable)
# recode variable factor for consistent nomenclature
for (i in 1:nrow(free)) {
    free$variable[i] <- substr(free$variable[i], 1, (nchar(free$variable[i]) - 2))
    free$variable[i] <- switch(free$variable[i], `REPLI-g` = "RP-MDA", TruePrime = "PrimPol-MDA",
        `pi-mgDNA one-step` = "piMDA", `pi-mgDNA alk. denat.` = "piMDA+D")
}
# change decimal separator
free$value <- as.numeric(gsub(",", ".", free$value))
# reorder
free$variable <- factor(free$variable, levels = c("RP-MDA", "PrimPol-MDA", "piMDA", "piMDA+D"))

We plot first the DNA input vs. the product yield for the four methods.

Show the code
p <- ggplot(free, aes(x = as.factor(ROWTITLE), y = value, color = variable, fill = variable,
    alpha = 0.12)) + geom_boxplot() + scale_y_continuous(limits = c(0, 30), expand = c(0, 0)) +
    guides(alpha = "none", color = "none") + geom_point(pch = 21, position = position_jitterdodge(),
    alpha = 0.5) + facet_grid(~variable) + theme_bw() + theme(text = element_text(size = 15)) +
    xlab("Input DNA (ng)") + ylab("DNA product (µg)") + labs(fill = "MDA protocol") + scale_fill_manual(values = colorines[c(3,
    5, 8, 10)]) + scale_color_manual(values = colorines)
ggplotly(p)

Figure 1. DNA amplification yield comparison. All reactions were carried out in 50 µL and incubated at 30ºC for 3 h, except those of RP-MDA protocol that were incubated for 16 h. Shown are results from 4 independent experiments.

Show the code
# ggsave('figures/yieldplot.svg',p,width=8,height=5)

In order to analyze statistically the differences among samples, we will first test normality and homogeneous variance to decide if we should use parametric tests. As shown below, the p-values <0.05 indicate that data are not suitable for parametric tests, leading us to the use of non-parametric pairwise Wilcox comparisons.

Show the code
# statistics comparison of yield/input
res_aov <- aov(value/ROWTITLE ~ variable, data = free)
shapiro.test(res_aov$residuals)  #normality test

    Shapiro-Wilk normality test

data:  res_aov$residuals
W = 0.90279, p-value = 1.6e-05
Show the code
leveneTest(value/ROWTITLE ~ variable, data = free)  #NO homogeneus variance
Levene's Test for Homogeneity of Variance (center = median)
      Df F value   Pr(>F)   
group  3  5.6983 0.001423 **
      76                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
kruskal.test((free$value/free$ROWTITLE), free$variable, paired = FALSE, p.adjust.method = "BH")  #non-parametric multiple groups comparison

    Kruskal-Wallis rank sum test

data:  (free$value/free$ROWTITLE) and free$variable
Kruskal-Wallis chi-squared = 15.869, df = 3, p-value = 0.001207
Show the code
pairwise.wilcox.test((free$value/free$ROWTITLE), free$variable, paired = FALSE, p.adjust.method = "BH")  #non-parametric pairwise

    Pairwise comparisons using Wilcoxon rank sum exact test 

data:  (free$value/free$ROWTITLE) and free$variable 

            RP-MDA  PrimPol-MDA piMDA  
PrimPol-MDA 0.03348 -           -      
piMDA       0.44496 0.00031     -      
piMDA+D     0.18105 0.44496     0.01843

P value adjustment method: BH 

Statistical analysis indicate that yield differences among MDA methods are statistically significant, particularly the piMDA+D protocol respect to RP-MDA and piPolB. Let’s see about time-course experiments:

Show the code
df <- read_pzfx("amplification_yield/Exp34_Datos.pzfx", table = 1)
free <- melt(df[, -2], id.vars = "ROWTITLE")
free$variable <- as.character(free$variable)
# recode variable factor
for (i in 1:nrow(free)) {
    free$variable[i] <- substr(free$variable[i], 1, (nchar(free$variable[i]) - 2))
    free$variable[i] <- switch(free$variable[i], `piDNA Denat.` = "piMDA+D", `piDNA Not denat.` = "piMDA",
        `Repli-g` = "RP-MDA", TruePrime = "PrimPol-MDA")
}

# add results from exp42
df42 <- read_pzfx("amplification_yield/Exp42_Datos.pzfx", table = 3)
df42 <- df42[-c(1, 4, 7), ]  #remove -DNA controls from plot
df42 <- df42[, -c(3)]  #remove NAs

# rename
for (i in 1:nrow(df42)) {
    df42$ROWTITLE[i] <- switch(df42$ROWTITLE[i], `piPolB D` = "piPolB+D", `piPolB ND` = "piPolB",
        `pi-gAmp D` = "piMDA+D", `pi-gAmp ND` = "piMDA", TruePrime = "PrimPol-MDA")
}
free42 <- melt(df42, id.vars = "ROWTITLE")
free42 <- free42[, c(2, 1, 3)]
free42 <- free42[-31, ]
free42$variable <- gsub(" h_1", "", free42$variable)  #change format
free42$variable <- gsub(" h_2", "", free42$variable)  #change format
free42$variable <- gsub(" h_3", "", free42$variable)  #change format
free42$variable <- gsub(" h_4", "", free42$variable)  #change format
free42$variable <- as.numeric(free42$variable)
names(free42) <- names(free)

free$value <- as.numeric(gsub(",", ".", free$value))  #change decimal separator
free$variable <- factor(free$variable, levels = c("RP-MDA", "PrimPol-MDA", "piPolB", "piPolB+D",
    "piMDA", "piMDA+D"))
free42$value <- as.numeric(gsub(",", ".", free42$value))  #change decimal separator
free42$variable <- factor(free42$variable, levels = c("RP-MDA", "PrimPol-MDA", "piPolB", "piPolB+D",
    "piMDA", "piMDA+D"))

free <- rbind(free, free42)
# plot

q <- ggplot(free, aes(x = as.factor(ROWTITLE), y = value, colour = variable, fill = variable,
    alpha = 0.12)) + geom_boxplot() + scale_y_continuous(limits = c(-0.8, 40), expand = c(0,
    0)) + guides(alpha = "none", color = "none") + geom_point(pch = 21, position = position_jitterdodge(),
    alpha = 0.5) + facet_grid(~variable, scales = "free") + theme_bw() + force_panelsizes(cols = c(1.5,
    3, 1.5, 1.5, 3, 3), TRUE) + theme(text = element_text(size = 15)) + theme(strip.text.x = element_text(size = 11)) +
    xlab("Reaction time (h)") + ylab("DNA product (µg)") + labs(fill = "MDA protocol") + scale_fill_manual(values = colorines[c(3,
    5, 6, 7, 8, 10)]) + scale_color_manual(values = colorines)
ggplotly(q)
Warning: Removed 27 rows containing non-finite values (`stat_boxplot()`).

Figure 2. DNA amplification time-course. All the amplification assays contained 10 ng of input DNA sample. Shown is the total DNA product in 50 µL reactions. Amplification yield from 2-4 experiments are shown.

Show the code
# ggsave('figures/timecourse.svg',q,width=9,height=5)
Show the code
# statistics comparison of yield/hour
res_aov <- aov(value/ROWTITLE ~ variable, data = free)
shapiro.test(res_aov$residuals)  #normality test - NOPE

    Shapiro-Wilk normality test

data:  res_aov$residuals
W = 0.82386, p-value = 5.609e-09
Show the code
leveneTest(value/ROWTITLE ~ variable, data = free)  #NO homogeneus variance
Levene's Test for Homogeneity of Variance (center = median)
      Df F value    Pr(>F)    
group  5  5.1655 0.0003512 ***
      84                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
kruskal.test((free$value/free$ROWTITLE), free$variable, paired = FALSE, p.adjust.method = "BH")  #non-parametric multiple groups comparison

    Kruskal-Wallis rank sum test

data:  (free$value/free$ROWTITLE) and free$variable
Kruskal-Wallis chi-squared = 35.508, df = 5, p-value = 1.191e-06
Show the code
pp <- pairwise.wilcox.test((free$value/free$ROWTITLE), free$variable, paired = FALSE, p.adjust.method = "BH",
    exact = FALSE)  #non-parametric pairwise

as.data.frame(format(pp$p.value, scientific = FALSE, digits = 1)) %>%
    replace(., . < 0, "") %>%
    mutate_all(~cell_spec(.x, color = ifelse(.x < 0.01, "firebrick", ifelse(.x < 0.05, "steelblue",
        "black")))) %>%
    kable(escape = F, align = "c", caption = "Table 2. P-value of pairwise comparison of time course assays") %>%
    kable_styling(bootstrap_options = "striped", full_width = F, position = "center") %>%
    column_spec(1, bold = T)
Table 2. P-value of pairwise comparison of time course assays
RP-MDA PrimPol-MDA piPolB piPolB+D piMDA
PrimPol-MDA 0.0003
piPolB 0.1850 0.0004
piPolB+D 1.0000 0.3112 0.1483
piMDA 0.1370 0.1483 0.0156 0.6127
piMDA+D 0.0003 0.0730 0.0004 0.0411 0.0196

These results show that the piMDA method is comparable with other MDA alternatives in terms of sensitivity and DNA production. Moreover, its yield can be increased with the addition of a DNA denaturation step, obtaining statistically significant more amount of DNA per time unit than all the other protocols, except PrimPol-MDA.

It should be also considered that these results apply specifically to this particular sample, and the reduced yield of RP-MDA respect to the other methods is likely by the presence of high-GC content sequences, as deeply discussed in the alongside manuscript..

4 Illumina raw data processing

Samples from independent amplification experiments, as well as the control non-amplified metagenome (NA), were sequenced by Illumina NextSeq 500 V2.5 High Output, 2x75 bp paired ends sequencing. To reduce the sequencing-derived bias, the sequencing libraries were prepared using TruSeq DNA PCR-Free Low Throughput Library Prep kit (Illumina), at Conceição Egas’ lab (GENOINSEQ - BIOCANT, Portugal). As shown below, NA, RP-MDA, piMDA and piMDA+D were sequenced from duplicated independent experiments.

4.1 QC stats and trimming

The raw files were first analyzed for basic statistics with Seqkit v.2.3 (Table 3).

Show the code
samples <- read.table("qc/raw_stats.csv", header = TRUE, sep = ";", na.strings = "")
for (i in 1:nrow(samples)) {
    samples$file[i] <- gsub("freeDNA_data/", "", samples$file[i])
}
kbl(samples, align = "c", caption = "Table 3. Stats of raw Illumina data") %>%
    kable_styling(bootstrap_options = "striped", full_width = F, position = "center") %>%
    column_spec(1, bold = T)
Table 3. Stats of raw Illumina data
Sample file format type num_seqs sum_len min_len avg_len
piMDA+D2 A2_R1.fastq.gz FASTQ DNA 81,578,075 6,172,839,185 35 75.7
piMDA+D2 A2_R2.fastq.gz FASTQ DNA 81,578,075 6,172,767,989 35 75.7
piMDA2 B2_R1.fastq.gz FASTQ DNA 75,890,757 5,742,701,890 35 75.7
piMDA2 B2_R2.fastq.gz FASTQ DNA 75,890,757 5,742,589,467 35 75.7
Primpol-MDA C3_R1.fastq.gz FASTQ DNA 57,026,214 4,309,432,584 35 75.6
Primpol-MDA C3_R2.fastq.gz FASTQ DNA 57,026,214 4,309,441,640 35 75.6
NA2 Ctrl2_R1.fastq.gz FASTQ DNA 76,474,759 5,784,350,652 35 75.6
NA2 Ctrl2_R2.fastq.gz FASTQ DNA 76,474,759 5,784,333,531 35 75.6
NA2 Ctrl_R1.fastq.gz FASTQ DNA 71,838,581 5,432,947,576 35 75.6
NA2 Ctrl_R2.fastq.gz FASTQ DNA 71,838,581 5,432,944,267 35 75.6
RP-MDA D3_R1.fastq.gz FASTQ DNA 67,415,579 5,093,810,770 35 75.6
RP-MDA D3_R2.fastq.gz FASTQ DNA 67,415,579 5,093,837,342 35 75.6
piPolB+D F2_R1.fastq.gz FASTQ DNA 69,333,871 5,181,529,472 35 74.7
piPolB+D F2_R2.fastq.gz FASTQ DNA 69,333,871 5,188,090,985 35 74.8
piPolB N1_R1.fastq.gz FASTQ DNA 66,003,225 4,946,827,786 35 74.9
piPolB N1_R2.fastq.gz FASTQ DNA 66,003,225 4,953,211,772 35 75.0
piMDA N2_R1.fastq.gz FASTQ DNA 69,872,142 5,287,166,316 35 75.7
piMDA N2_R2.fastq.gz FASTQ DNA 69,872,142 5,287,208,173 35 75.7
RP-MDA N4_R1.fastq.gz FASTQ DNA 66,722,442 5,040,940,307 35 75.6
RP-MDA N4_R2.fastq.gz FASTQ DNA 66,722,442 5,041,277,271 35 75.6
piMDA+D N6_R1.fastq.gz FASTQ DNA 64,762,150 4,901,577,518 35 75.7
piMDA+D N6_R2.fastq.gz FASTQ DNA 64,762,150 4,901,629,402 35 75.7

Then, a detailed quality check (QC) was performed with FastQC v0.11.9 for each sample, and reports aggregated with MultiQC 1.11. Reads were subsequently trimmed with Trimmomatic 0.39, using a customized script, based on I-Shell-Do-This Github Repo (parameters: LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:35). A post-trimming QC report was obtained as above. Due to size limitations, Trimmomatic output files are not available in this repository, they’ll be deposited along with the raw reads. Individual QC reports as well as MultiQC-merged reports before and after Trimmomatic are available in the qc folder of the GitHub repository. We include in Figures 4-6 some of the QC results. First, we will have a look to the duplication level of the reads, which is a good indicative of overamplification of some sequences, a common bias source in exponential DNA amplification.

Show the code
# Duplicated reads
duplication <- read.table("qc/fastqc_sequence_duplication_levels_plot.tsv", header = TRUE,
    sep = "\t", na.strings = "")
# reorder
duplication <- duplication[c(1:11, 20, 21, 22, 23, 12:19)]
dupli_table <- melt(duplication)
Using Sequence.Duplication.Level as id variables
Show the code
dupli_table[, 2] <- sub("RepliG", "RP-MDA", dupli_table[, 2])
dupli_table[, 2] <- sub("TruePrime", "PrimPol-MDA", dupli_table[, 2])
dupli_table[, 2] <- sub("piDNA", "piMDA", dupli_table[, 2])
dupli_table[, 2] <- sub("piMDA.D", "piMDA+D", dupli_table[, 2])
dupli_table[, 2] <- sub("piPolB.D", "piPolB+D", dupli_table[, 2])
dupli_table$Sequence.Duplication.Level <- factor(dupli_table$Sequence.Duplication.Level, levels = c("1",
    "2", "3", "4", "5", "6", "7", "8", "9", ">10", ">50", ">100", ">500", ">1k", ">5k", ">10k+"))

# plot

p <- ggplot(dupli_table, aes(x = Sequence.Duplication.Level, y = value, colour = variable,
    fill = variable, label = variable, alpha = 0.8)) + geom_point(pch = 21, size = 3, position = position_dodge(width = 0.5),
    alpha = 0.8) + geom_text_repel(max.overlaps = 15) + theme(text = element_text(size = 13)) +
    theme_bw() + ggtitle("Duplication level in Raw Reads") + xlab("Duplication level") + ylab("Reads (%)") +
    scale_fill_manual(values = rep(unname(colorines[c(1, 2, 8, 10, 11, 9, 6, 7, 5, 3, 4)]),
        each = 2)) + scale_color_manual(values = rep(unname(colorines[c(1, 2, 8, 10, 11, 9,
    6, 7, 5, 3, 4)]), each = 2)) + guides(alpha = FALSE, color = FALSE, fill = guide_legend(ncol = 2,
    title = "MDA Method"))
# ggsave('figures/duplication.svg',p,width=9,heigh=5)

# Duplication level post-trimming
post_duplication <- read.table("qc/post_fastqc_sequence_duplication_levels_final.tsv", header = TRUE,
    sep = "\t", na.strings = "")
post_dupli_table <- melt(post_duplication)
Using Sequence.Duplication.Level as id variables
Show the code
post_dupli_table[, 2] <- sub("piPolB.D", "piPolB+D", post_dupli_table[, 2])
post_dupli_table[, 2] <- sub("piMDA.D", "piMDA+D", post_dupli_table[, 2])
post_dupli_table$Sequence.Duplication.Level <- factor(post_dupli_table$Sequence.Duplication.Level,
    levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", ">10", ">50", ">100", ">500", ">1k",
        ">5k", ">10k+"))
q <- ggplot(post_dupli_table, aes(x = Sequence.Duplication.Level, y = value, colour = variable,
    fill = variable, label = variable, alpha = 0.8)) + geom_point(pch = 21, size = 3, position = position_dodge(width = 0.5),
    alpha = 0.8) + scale_fill_manual(values = rep(unname(colorines[c(1, 2, 8, 9, 10, 11, 6,
    7, 5, 3, 4)]), each = 4)) + scale_color_manual(values = rep(unname(colorines[c(1, 2, 8,
    9, 10, 11, 6, 7, 5, 3, 4)]), each = 4)) + geom_text_repel(max.overlaps = 20, fontface = "bold") +
    theme_bw() + theme(text = element_text(size = 13, face = "bold")) + ggtitle("Duplication level post-trimming") +
    xlab("Duplication level") + ylab("Reads (%)") + guides(alpha = FALSE, color = FALSE, fill = guide_legend(ncol = 2,
    title = "MDA Method"))

# ggsave('figures/duplication_post.svg',q,width=12,heigh=5)

p/q

Figure 3. Reads duplication level before (top) and after (bottom) trimming

Random-primer based MDA (RP-MDA) gave rise to a higher level of duplicated reads than PrimPol-MDA and both piMDA protocols. That suggests a greater level of overamplification of the same sequences. The piPolB solo MDA gave rise to a great level of highly duplicated reads, many of them removed after trimming. However, these overrepresented reads might be of interest as they could represent regions of favored initiation (i.e. DNA priming) events.

Show the code
trimmo <- read.csv2("trimmo/trimmo_stats_percentage.csv", header = TRUE, na.strings = "")
trimmo$sample <- factor(trimmo$sample, levels = c("NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA",
    "piPolB", "piPolB+D", "piMDA", "piMDA2", "piMDA+D", "piMDA+D2"))
g_t <- ggplot(trimmo) + geom_point(aes(x = sample, y = as.numeric(Both), color = sample, fill = sample),
    size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) + scale_shape_manual(values = c(22,
    23, 24, 25)) + geom_point(aes(x = sample, y = as.numeric(Dropped), fill = sample), shape = 21,
    color = "black", size = 2, position = position_dodge(width = 0.5)) + scale_y_continuous(name = "Both reads surviving (%)",
    limits = c(0, 100), sec.axis = sec_axis(~., name = "Dropped reads (%)")) + theme_bw() +
    scale_fill_manual(values = colorines[1:11]) + scale_color_manual(values = colorines[1:11]) +
    theme(text = element_text(size = 15)) + ggtitle("Reads pairing stats") + xlab("Sample") +
    scale_shape_manual(values = c(21, 22, 23, 24)) + guides(shape = FALSE, fill = FALSE, color = FALSE) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 16)) + theme(legend.text = element_text(face = "italic")) +
    theme(plot.title = element_text(hjust = 0.5))
Scale for shape is already present.
Adding another scale for shape, which will replace the existing scale.
Show the code
g_t

Figure 4. Paired reads after trimming

Show the code
# table
tablita_trimmo <- trimmo[order(trimmo$sample), -2]
kbl(tablita_trimmo, align = "c", row.names = F, caption = "Table 4. Trimmomatic stats (%)") %>%
    kable_styling(bootstrap_options = "striped", full_width = F, position = "center") %>%
    column_spec(1, italic = T)  #  %>% save_kable('figures/trimmo_table.pdf')
Table 4. Trimmomatic stats (%)
sample Both R1_only R2_only Dropped
NA 91.41 5.32 1.87 1.40
NA2 91.33 5.20 1.53 1.94
RP-MDA 91.68 5.52 1.34 1.46
RP-MDA2 92.05 4.82 1.39 1.74
PrimPol-MDA 91.13 5.66 1.32 1.89
piPolB 66.79 15.89 9.06 8.26
piPolB+D 46.13 26.41 11.12 16.33
piMDA 90.03 6.07 2.33 1.57
piMDA2 91.03 5.28 1.84 1.88
piMDA+D 90.30 5.80 2.45 1.45
piMDA+D2 91.56 5.04 1.64 1.75

Quality filtering of the samples show that piPolB and piPolB+D samples contain lower quality reads, with less paired reads survival and more dropped reads. Also (Table 4), as expected only-forward survival reads were more (15,89% and 26,41%, respectively) than only-reverse reads (9,06% and 11,12%, respectively). This results indicate a possible library prep or sequencing problem that hinders the analysis of those samples, likely due to the hyperbranched structure of the piPolB-synthesized DNA.

4.2 Reads per GC content

DNA amplification as well as library generation and the sequencing itself may give rise to uneven or poor sequence coverage of GC-poor or rich sequences (see for instance Marine et al. 2014, Jamal et al. 2021, Benjamini et al. 2022). Thus, the ideal MDA method should reflex the real GC pattern of the sample or at least do not neglect DNA sequences with extreme GC composition. GC-bias will be analyzed also in terms of impact on the assembly, but this preliminar analysis is also very informative from a general point of view.

Show the code
# GC content

gc_reads <- read.table("qc/fastqc_per_sequence_gc_content_plot.tsv", header = TRUE, sep = "\t",
    na.strings = "")
gc_table <- melt(gc_reads, id.vars = "X..GC")
gc_table[, 2] <- sub("RepliG", "RP-MDA", gc_table[, 2])
gc_table[, 2] <- sub("Ctrl", "NA", gc_table[, 2])
gc_table[, 2] <- sub("TruePrime", "PrimPol-MDA", gc_table[, 2])
gc_table[, 2] <- sub("piDNA", "piMDA", gc_table[, 2])
gc_table[, 2] <- sub("piMDA.D", "piMDA+D", gc_table[, 2])
gc_table[, 2] <- sub("piPolB.D", "piPolB+D", gc_table[, 2])

post_gc_reads <- read.table("qc/post_fastqc_per_sequence_gc_content_plot.tsv", header = TRUE,
    sep = "\t", na.strings = "")
post_gc_table <- melt(post_gc_reads, id.vars = "X..GC")
post_gc_table[, 2] <- sub("RepliG", "RP-MDA", post_gc_table[, 2])
post_gc_table[, 2] <- sub("Ctrl", "NA", post_gc_table[, 2])
post_gc_table[, 2] <- sub("TruePrime", "PrimPol-MDA", post_gc_table[, 2])

p <- ggplot(gc_table, aes(x = X..GC, y = value, colour = variable, fill = variable, label = variable)) +
    scale_color_manual(values = rep(unname(colorines[c(1, 2, 8, 9, 10, 11, 6, 7, 5, 3, 4)]),
        each = 2)) + geom_path(linejoin = "round", size = 1) + theme(text = element_text(size = 15)) +
    theme_bw() + ggtitle("GC content in Raw Reads") + xlab("GC") + ylab("Reads (%)") + labs(color = "Method")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Show the code
p <- p + guides(color = guide_legend(ncol = 2))

# ggsave('figures/gc_profile.svg', p, width=9,heigh=5)

# And the post-trimming plot add the average GC%
post_gc_table$p <- post_gc_table$X..GC * post_gc_table$value
gc_sample <- by(post_gc_table$p, INDICES = post_gc_table$variable, FUN = mean)
leyenda <- c()
for (i in 1:length(levels(as.factor(post_gc_table$variable)))) {
    leyenda[i] <- paste0(levels(as.factor(post_gc_table$variable))[i], " (", round(gc_sample[i],
        1), "%)")
}
q <- ggplot(post_gc_table, aes(x = X..GC, y = value, colour = variable, fill = variable, label = variable)) +
    scale_color_manual(values = rep(unname(colorines[c(1, 2, 8, 9, 10, 11, 6, 7, 5, 3, 4)]),
        each = 4), labels = leyenda) + geom_path(linejoin = "round", size = 1) + theme_bw() +
    theme(text = element_text(size = 15)) + ggtitle("GC content per sample (post-trimming)") +
    xlab("GC") + ylab("Reads (%)") + labs(color = "Sample (Average GC %)") + guides(color = guide_legend(ncol = 2)) +
    theme(text = element_text(size = 17), legend.text = element_text(size = 9, face = "bold"),
        legend.key.height = unit(0.4, "cm"))


# ggsave('figures/gc_profile_post.svg', q, width=12,heigh=6)
p/q

Figure 5. Reads’ GC content plot. GC content is indicated in the plot leyend for the post-trimming reads

As shown in Figure 6, the control non-amplified (NA) samples show a bimodal curve respect to the GC content, in agreement with the selection of the species in the mock-metagenome (see Disclaimer and Table 1). None of the MDA protocols achieve a similar curve. The samples from RP-MDA and Primpol-MDA show a similar curve, centered in ~44% and ~48% GC content, respectively, and very few reads (<1%) in regions of GC content above 60%.

The piPolB protocols show a slim peak of about 10% of reads around 40-45% GC that increases to ~47% in the piPolB+D protocol, although it also shows a “shoulder” at 38-40%. Given the limited processivity of piPolB, this pattern suggests that low GC content DNA sequences facilitate DNA priming by piPolB.

On the contrary, the piMDA and piMDA+D methods gave rise to a peak around 68-70% GC, although the raise of the curve is very slow with a good number of reads in the range of the first peak of the NA samples, reaching a 1% at GC content of 40%. Thus, piMDA protocols provide most of their reads of higher GC-content sequences, but that bias is less severe than in the case of the other MDA methods, which centered on 40-50% GC content and seem to neglect high GC sequences.

4.3 Metagenomes profiling

Metagenomes profiling is usually the first approach to measure alpha diversity. In our case, profiling prior to assembly can be very useful to detect contamination. In this case, the trimmed reads were profiled using Metaphlan v.4.0.0 installed via Conda. Note that we used the --unclassified_estimation to include the estimation of genomic sequences not identified in the database.

Show the code
# read tsv, select and transform table
meta <- read.table(file = "metaphlan_results/merge_profiled_4.tsv", sep = "\t", header = TRUE)

short_meta <- meta[c(1, 21, 22, 23, 20), ]
short_meta[, 1] <- c("Unclassified", "Kokuria rhizophila", "Pseudomonas aeruginosa", "Escherichia coli",
    "Bacillus sp")
short_meta <- short_meta[, c(1, 6, 5, 11, 7, 4, 9, 8, 10, 3, 12, 2)]

colnames(short_meta) <- c("Genome", "NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA", "piPolB",
    "piPolB+D", "piMDA", "piMDA2", "piMDA+D", "piMDA+D2")


meta2 <- melt(short_meta, id.vars = "Genome", na.rm = F)
meta2[, 4] <- grepl(2, meta2$variable)
meta2$V4[meta2$V4 == "TRUE"] <- "2"
meta2$V4[meta2$V4 == "FALSE"] <- "1"

# reorder variables
meta2$variable <- factor(meta2$variable, levels = c("NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA",
    "piPolB", "piPolB+D", "piMDA", "piMDA2", "piMDA+D", "piMDA+D2"))
meta2$Species <- factor(meta2$Genome, levels = c("Unclassified", "Kokuria rhizophila", "Pseudomonas aeruginosa",
    "Escherichia coli", "Bacillus sp"))
# plot

profiling <- ggplot(meta2, aes(y = value, x = Species, group = variable, fill = variable)) +
    geom_bar(stat = "identity", position = "dodge", alpha = 0.8) + ylab("Reads (%)") + geom_text(aes(label = round(value,
    digits = 2), group = variable), position = position_dodge(width = 0.9), angle = 90, hjust = 0,
    vjust = 0.5) + theme_bw() + theme(axis.text.x = element_text(face = "bold.italic", size = 11,
    angle = 45, hjust = 1)) + theme(text = element_text(size = 19)) + labs(color = "Sample") +
    theme(axis.text.y = element_text(size = 17)) + scale_fill_manual(values = colorines[1:11],
    name = "") + scale_y_continuous(limits = c(0, 100), expand = c(0.08, -0.5))


# ggsave('figures/profiling_v4.svg', profiling, width=9,heigh=5)

# same plot in a different display
profilin2 <- ggplot(meta2, aes(y = value, x = variable, group = Species, fill = Species, label = paste(round(value,
    1), "%"))) + geom_bar(stat = "identity", position = "stack", alpha = 0.8) + ylab("Reads (%)") +
    xlab("Sample") + theme_bw() + theme(axis.text.x = element_text(face = "bold", size = 13,
    angle = 45, vjust = 1, hjust = 1)) + theme(text = element_text(size = 15)) + theme(axis.text.y = element_text(size = 15)) +
    scale_fill_manual(values = c(Unclassified = "grey80", `Kokuria rhizophila` = "skyblue1",
        `Pseudomonas aeruginosa` = "slateblue2", `Escherichia coli` = "gold1", `Bacillus sp` = "orange2"),
        name = "Species") + scale_y_continuous(limits = c(0, 100), expand = c(0.08, -0.5))
profiling/profilin2

Figure 6. Metagenomic samples reads profiling of amplified DNA by different MDA protocols. Note that both plots show the same data, with different representation

As shown in Figure 7, the amount of unknown (or junk) DNA is very high in piPolB samples, particularly without denaturation step, but not in piMDA. However, in any of the samples we detected sequences from other organisms than the expected bacteria, downplaying the presence of contaminant DNA. This suggest that, rather than contaminant DNA, piPolB might be performing ab initio DNA synthesis.

Show the code
res_aov <- aov(value ~ variable, data = meta2)
shapiro.test(res_aov$residuals)  #normality test

    Shapiro-Wilk normality test

data:  res_aov$residuals
W = 0.80281, p-value = 3.944e-07
Show the code
leveneTest(value ~ variable, data = meta2)  #homogeneus variance
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group 10  0.1652 0.9978
      44               
Show the code
kruskal.test(value ~ variable, data = meta2)

    Kruskal-Wallis rank sum test

data:  value by variable
Kruskal-Wallis chi-squared = 3.7369, df = 10, p-value = 0.9584

Although due to the limited number of samples, the differences are not statistically significant, we represent these results as an interactive heatmap.

Show the code
short <- as.matrix(short_meta[, -1])
row.names(short) <- short_meta[, 1]

heatmaply(t(short), k_row = 3, k_col = 3, fontsize_row = 14)

Figure 7. Clustering of samples by the metagenomic profiling results

Interestingly, in terms of relative representativity of reads from each reference genome, both NA samples cluster with the piMDA/piMDA+D amplification products. As mentioned above, we found an increased amount of unknown reads in the piPolB(+D) samples. As samples have been quality-trimmed, we only can suggest that those sequences arose from ab initio DNA synthesis.

On the other hand, RP-MDA and PrimPol-MDA contain an overrepresentation of B. subtilis reads, as compared with the NA reads. The piMDA amplified samples, particularly the piMDA+D reads form the first experiment, show a moderate overrepresentation of high GC sequences (K. rhizophyla), in agreement with the GC content profile of these samples (see Figure 6).

5 Assembly-independent coverage of reference genomes

5.1 Reference genomes breadth and coverage

Reads were mapped against the reference genomes with Bowtie2 v2.3.5.1 and further analyzed with Weesam v1.6 and Alfred v0.1.16. The Weesam reports can be accessed in a merged webpage. Alfred reports are accesible as a csv table. Mean values per sample are shown below in Table 5.

Show the code
# WeeSam html files were merged by copy/paste from the source file by hand and then
# exported into csv
weesam <- read.table("weeSam/weesam_merged.csv", header = TRUE, na.strings = "", sep = ";")
# change names
weesam <- data.frame(lapply(weesam, function(x) {
    gsub("RepliG", "RP-MDA", x)
}))
weesam <- data.frame(lapply(weesam, function(x) {
    gsub("TruePrime", "PrimPol-MDA", x)
}))
# fix variable types
weesam$Bam_File <- factor(weesam$Bam_File, levels = c("NA", "RP-MDA", "PrimPol-MDA", "piPolB",
    "piPolB+D", "piMDA", "piMDA+D"))
weesam <- weesam %>%
    mutate(across(c(Ref_Genome, Ref_Name), as.factor))
weesam <- weesam %>%
    mutate(across(c(Avg_Depth, Experiment, Mapped_Reads, Breadth, Std_Dev, Variation_Coefficient,
        covered), as.numeric))

# add Bowtie2 overall alignment rate values
Bam_File <- c("NA", "NA", "RP-MDA", "RP-MDA", "PrimPol-MDA", "piPolB", "piPolB+D", "piMDA",
    "piMDA", "piMDA+D", "piMDA+D")
alignment <- c(97.85, 97.99, 99.31, 99.24, 99.25, 7.67, 73.15, 94.22, 93.76, 97.68, 97.18)
bowtie <- data.frame(Bam_File, alignment)
weesam <- merge(weesam, bowtie, by = "Bam_File")

Among the different parameters that WeeSAM provide, we are going to represent the Alignment Rate, Breadth, Coverage and the Variation Coeficient (Figures 9 and 10). The overall alignment rate is clearly better using selected reference genomes than in the profiling with a reference database (Figures 7 & 8). However, piPolB and piPolB+D samples still show lower alignment rates, suggesting unfaithful DNA amplification (see below).

Breadth and Coverage are good general indicators of very poor amplification of some of the sequences, usually with extreme high or low GC content. On the other hand, the Variation Coeficient indicates variability in coverage. A coefficient of variation < 1 would suggest that the coverage has low-variance, which is good, while a coefficient > 1 would be considered high-variance.

Show the code
p1 <- ggplot(weesam, aes(x = Bam_File, y = log(Avg_Depth, 10), group = Ref_Name, color = Ref_Name,
    fill = Ref_Name, shape = Ref_Name)) + geom_pointrange(aes(ymin = log(Avg_Depth, 10), ymax = log(Avg_Depth +
    Std_Dev, 10)), size = 1, alpha = 0.7, position = position_dodge(width = 0.5)) + theme_bw() +
    scale_fill_manual(values = colorines[12:15]) + scale_color_manual(values = colorines[12:15]) +
    theme(text = element_text(size = 15)) + ggtitle("Coverage Depth") + scale_shape_manual(values = c(21,
    22, 23, 24)) + xlab("Sample") + ylab("Depth (Log10)") + labs(color = "Reference genome") +
    guides(shape = FALSE, fill = FALSE, scale = "none") + theme(axis.text.x = element_text(angle = 45,
    hjust = 1, size = 16)) + theme(legend.text = element_text(face = "italic")) + theme(plot.title = element_text(hjust = 0.5))
p2 <- ggplot(weesam) + geom_point(aes(x = Bam_File, y = as.numeric(covered), group = Ref_Name,
    color = Ref_Name, fill = Ref_Name, shape = Ref_Name), size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) +
    theme_bw() + scale_fill_manual(values = colorines[12:15]) + scale_color_manual(values = colorines[12:15]) +
    scale_shape_manual(values = c(21, 22, 23, 24)) + theme(text = element_text(size = 15)) +
    ggtitle("Coverage Breadth") + xlab("Sample") + labs(color = "Reference genome") + guides(shape = FALSE,
    fill = FALSE, scale = "none") + theme(axis.text.x = element_text(angle = 45, hjust = 1,
    size = 16)) + theme(legend.text = element_text(face = "italic")) + theme(plot.title = element_text(hjust = 0.5)) +
    scale_y_continuous(name = "Coverage (%)", breaks = seq(80, 100, 5), limit = c(80, 100))

p3 <- ggplot(weesam, aes(x = Bam_File, y = Variation_Coefficient, group = Ref_Name, color = Ref_Name,
    fill = Ref_Name, shape = Ref_Name)) + geom_point(size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) +
    theme_bw() + scale_fill_manual(values = colorines[12:15]) + scale_color_manual(values = colorines[12:15]) +
    scale_shape_manual(values = c(21, 22, 23, 24)) + theme(text = element_text(size = 15)) +
    ggtitle("Variation Coefficient") + xlab("Sample") + ylab("Coefficient") + labs(color = "Reference genome") +
    guides(shape = FALSE, fill = FALSE, scale = "none") + theme(axis.text.x = element_text(angle = 45,
    hjust = 1, size = 16)) + theme(legend.text = element_text(face = "italic")) + theme(plot.title = element_text(hjust = 0.5))
p4 <- ggplot(weesam) + geom_point(aes(x = Bam_File, y = as.numeric(alignment), color = Bam_File),
    alpha = 0.4, size = 4, shape = 19) + theme_bw() + scale_color_manual(values = colorines[1:11]) +
    theme(text = element_text(size = 15)) + ggtitle("Overall alignment rate") + xlab("Sample") +
    labs(color = "Method") + guides(shape = FALSE, fill = FALSE, scale = "none") + theme(axis.text.x = element_text(angle = 45,
    hjust = 1, size = 16)) + theme(legend.text = element_text(face = "italic")) + theme(plot.title = element_text(hjust = 0.5)) +
    scale_y_continuous(name = "Alignment rate (%)", breaks = seq(0, 100, 20), limit = c(0,
        100))

grid <- grid.arrange(p4, p2, p1, p3, nrow = 2)

Figure 8. Reference genomes coverage depth by diverse MDA methods

Show the code
grid
TableGrob (2 x 2) "arrange": 4 grobs
  z     cells    name           grob
1 1 (1-1,1-1) arrange gtable[layout]
2 2 (1-1,2-2) arrange gtable[layout]
3 3 (2-2,1-1) arrange gtable[layout]
4 4 (2-2,2-2) arrange gtable[layout]
Show the code
# version to export ggsave('figures/depth_plot.svg',grid,width=12,height=12)

The samples of RP-MDA and, to a lesser extent the PrimPol-MDA, show a clear decrease in the sequences of high GC in the profiling (Figures 7 & 8). In agreement with that, they show lower coverage and breadth and higher variation coefficient than all the other samples for those species and higher depth for B. subtilis and E. coli. Interestingly, the piMDA and piMDA+D reads contain a somewhat higher coverage for high GC genomes but lower for B. subtilis and E. coli. In the case of the samples from the piMDA protocol, this lower coverage depth of moderate GC genomes is consequence of a lower alignment rate and it is also in agreement with a higher variation coefficient in the case of B. subtilis, although the coverage depth for this genome is the same for all the samples.

In conclusion, the RP-MDA and, to a lesser extent, the PrimPol-MDA amplification methods show a strong bias against very high GC-content sequences and, on the contrary samples from piMDA methods show an overrepresentation of high GC sequences which, in the case of the piMDA protocol (not in the piMDA+D), may give rise to an overall uneven depth.

Another version in a single plot with two axis in the Figure 10 that allow us to see that the very high coverage depth in RP and PrimPol-MDA for intermediate-low GC content genomes correlates with a very high variation coefficient in high GC content genomes, which is not the case for piMDA and specially piMDA+D reads.

Show the code
pp <- ggplot(weesam, aes(x = Bam_File, y = log(Avg_Depth, 10), group = Ref_Name, color = Ref_Name,
    fill = Ref_Name, shape = Ref_Name)) + geom_pointrange(aes(ymin = log(Avg_Depth, 10), ymax = log(Avg_Depth +
    Std_Dev, 10)), size = 2, alpha = 0.7, position = position_dodge(width = 0.5)) + geom_point(aes(x = Bam_File,
    y = Variation_Coefficient/10, fill = Ref_Name), shape = 21, color = "black", size = 2,
    position = position_dodge(width = 0.5)) + scale_y_continuous(name = "Average Depth (Log10)",
    breaks = seq(0, 5, 1), limits = c(0, 4), sec.axis = sec_axis(~10 * ., name = "Variation Coefficient",
        breaks = seq(0, 125, 25))) + theme_bw() + scale_fill_manual(values = colorines[12:15]) +
    scale_color_manual(values = colorines[12:15]) + theme(text = element_text(size = 15)) +
    ggtitle("Coverage Depth") + scale_shape_manual(values = c(21, 22, 23, 24)) + xlab("Sample") +
    labs(color = "Reference genome") + guides(shape = FALSE, fill = FALSE, scale = "none") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 16)) + theme(legend.text = element_text(face = "italic")) +
    theme(plot.title = element_text(hjust = 0.5)) + geom_hline(yintercept = 1, linetype = "dashed",
    color = "black")
ggplotly(pp)

Figure 9. Reference genomes coverage depth (left axis) and variation coefficient (small points, right axis) by MDA methods

Show the code
# ggsave('figures/depth.svg',pp,width=8,height=5)

Alfred stats from Bowtie alignment files allowed us to analyze the mismatch rate in the raw reads mapping (Figure 11). Individual merged tables were re-merged using the script merge_alfred_samples.R..

Show the code
alfred <- fread("alfred/merged/merged_alfred.csv", header = TRUE, na.strings = "")
# order samples
alfred$sample <- factor(alfred$sample, levels = c("NA", "RP-MDA", "PrimPol-MDA", "piPolB",
    "piPolB+D", "piMDA", "piMDA+D"))

# plot
g1 <- ggplot(alfred, aes(x = sample, y = as.numeric(MismatchRate), group = ref, color = ref,
    fill = ref, shape = ref)) + geom_point(size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) +
    theme_bw() + scale_fill_manual(values = colorines[12:15]) + scale_color_manual(values = colorines[12:15]) +
    scale_y_continuous(breaks = seq(0, 0.05, 0.01), minor_breaks = seq(0, 0.05, by = 0.002)) +
    theme(text = element_text(size = 15)) + ggtitle("Mismatch rate") + xlab("Sample") + ylab("Mismatch Rate") +
    labs(color = "Reference genome") + scale_shape_manual(values = c(21, 22, 23, 24)) + guides(shape = FALSE,
    fill = FALSE, scale = "none") + theme(axis.text.x = element_text(angle = 45, hjust = 1,
    size = 16)) + theme(legend.text = element_text(face = "italic")) + theme(plot.title = element_text(hjust = 0.5)) +
    theme(panel.grid.minor.y = element_line(linetype = 2))
g2 <- ggplot(alfred) + geom_point(aes(x = sample, y = as.numeric(InsertionRate), color = ref,
    fill = ref), size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) + scale_shape_manual(values = c(22,
    23, 24, 25)) + geom_point(aes(x = sample, y = as.numeric(HomopolymerContextIns)/500, fill = ref),
    shape = 21, color = "black", size = 2, position = position_dodge(width = 0.5)) + scale_y_continuous(name = "Insertions",
    breaks = seq(0, 0.002, 5e-04), sec.axis = sec_axis(~500 * ., name = "Homopolymeric context insertions",
        breaks = seq(0, 1, 0.2))) + theme_bw() + scale_fill_manual(values = colorines[12:15]) +
    scale_color_manual(values = colorines[12:15]) + theme(text = element_text(size = 15)) +
    ggtitle("Insertions rate") + xlab("Sample") + ylab("Insertions Rate") + scale_shape_manual(values = c(21,
    22, 23, 24)) + guides(shape = FALSE, fill = FALSE, color = FALSE, scale = "none") + theme(axis.text.x = element_text(angle = 45,
    hjust = 1, size = 16)) + theme(legend.text = element_text(face = "italic")) + theme(plot.title = element_text(hjust = 0.5))
Scale for shape is already present.
Adding another scale for shape, which will replace the existing scale.
Show the code
g3 <- ggplot(alfred) + geom_point(aes(x = sample, y = as.numeric(DeletionRate), color = ref,
    fill = ref), size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) + scale_shape_manual(values = c(22,
    23, 24, 25)) + geom_point(aes(x = sample, y = as.numeric(HomopolymerContextDel)/500, fill = ref),
    shape = 21, color = "black", size = 2, position = position_dodge(width = 0.5)) + scale_y_continuous(name = "Deletions",
    breaks = seq(0, 0.001, 2e-04), sec.axis = sec_axis(~500 * ., name = "Homopolymeric context deletions",
        breaks = seq(0, 0.5, 0.2))) + theme_bw() + scale_fill_manual(values = colorines[12:15]) +
    scale_color_manual(values = colorines[12:15]) + theme(text = element_text(size = 15)) +
    ggtitle("Deletion rate") + xlab("Sample") + ylab("Insertions Rate") + scale_shape_manual(values = c(21,
    22, 23, 24)) + guides(shape = FALSE, fill = FALSE, color = FALSE, scale = "none") + theme(axis.text.x = element_text(angle = 45,
    hjust = 1, size = 16)) + theme(legend.text = element_text(face = "italic")) + theme(plot.title = element_text(hjust = 0.5))
Scale for shape is already present.
Adding another scale for shape, which will replace the existing scale.
Show the code
(grid3 <- g1/(g2 + g3))

Figure 10. Reads Mismatch rate

Show the code
# ggsave('figures/mismtaches.svg',grid3,width=10,height=11)
Show the code
# Table
mismas <- alfred[, c(18, 37)]
colnames(mismas) <- c("MismatchRate", "Bam_File")
weesam <- merge(weesam, mismas, by = "Bam_File")
weesam$MismatchRate <- weesam$MismatchRate * 100
tablita <- aggregate(cbind(weesam$Mapped_Reads, weesam$Breadth, weesam$covered, weesam$Avg_Depth,
    weesam$Variation_Coefficient, weesam$MismatchRate), by = list(Sample = weesam$Bam_File,
    Experiment = weesam$Experiment), FUN = mean)
colnames(tablita) <- c("Sample", "Experiment", "Mapped reads", "Breadth", "Reference covered (%)",
    "Average Depth", "Variation coefficient", "Mismatch Rate (%)")
tablita <- tablita[order(tablita$Sample), ]

tablita[, 5:7] <- round(tablita[, 5:7], digits = 2)
tablita[, 8] <- round(tablita[, 8], digits = 4)
tablita <- tablita %>%
    mutate(`Mapped reads` = format(`Mapped reads`, scientific = FALSE, big.mark = ","))
tablita <- tablita %>%
    mutate(Breadth = format(Breadth, scientific = FALSE, big.mark = ","))

kbl(tablita, align = "c", row.names = F, caption = "Table 5. Mean mapping and coverage parameters sample.") %>%
    kable_styling(bootstrap_options = "striped", full_width = F, position = "center") %>%
    column_spec(1, italic = T)  #  %>% save_kable('figures/aligned2.pdf')
Table 5. Mean mapping and coverage parameters sample.
Sample Experiment Mapped reads Breadth Reference covered (%) Average Depth Variation coefficient Mismatch Rate (%)
NA 1 33,406,640 4,438,281 98.54 571.22 0.21 0.1077
NA 2 35,680,689 4,440,442 98.58 608.48 0.21 0.1077
RP-MDA 1 31,360,430 4,270,082 93.67 546.25 4.93 1.2791
RP-MDA 2 32,030,479 4,434,691 98.38 549.34 3.12 1.2791
PrimPol-MDA 2 26,951,539 4,440,242 98.58 457.24 0.85 0.3000
piPolB 2 2,009,164 4,381,966 97.51 16.36 31.77 0.2673
piPolB+D 2 16,484,064 4,439,396 98.56 91.06 17.93 0.3443
piMDA 1 30,765,774 4,438,131 98.54 525.24 0.45 0.1137
piMDA 2 33,761,506 4,439,853 98.57 586.57 0.61 0.1137
piMDA+D 1 29,696,732 4,437,996 98.54 572.54 0.38 0.1071
piMDA+D 2 37,756,486 4,440,189 98.58 689.94 0.36 0.1071

Similarly to the previous results, Figures 11 and 12 show that the RP-MDA and, to a lesser extent the PrimPol-MDA protocols gave rise to a higher mismatch rate in the P. aeruginosa and K. rhizophila genomes. This is mostly due to misinsertions and deletions. Thus, in our metagenome sample, the RP-MDA protocol generate an amplification product with uneven coverage breadth and higher mismatch rate for high GC-sequences.

The samples generated by the piPolB and piPolB+D protocols also contain a higher mismatch rate, but in this case, the mistmaches are mostly in the B. subtilis sequences and contain a high rate of insertions.

Interestingly, the piMDA and piMDA+D protocols gave rise to similar mismatch rates than NA samples, only with a minor increase in the insertions in moderate GC genomes, mostly in homopolymeric sequence contexts (as expected).

5.2 Analysis of unclassfied reads

The piPolB and specially piPolB+D samples showed a higher rate of unknown sequences in the profiling (Figure 7) and lower rate of reads mapping against the reference genomes (Figure 9A). After the trimming step, those unmapped reads may correspond either to contaminant amplified DNA or ab initio DNA synthesis (see Zyrina et al. 2013){target=“_blank”}. Since the profiling with Metaphlan only considers mapped reads, as well as our Bowtie2 mapping against the reference, we decided to map the trimmed reads without the –-no-unal option against the Metaphlan database (vJan21), and then extract the unaligned reads an convert to fasta using samtools f -4 for further analysis. We used the following bash script:

for sample in A2 B2 F2 C3 D3 Ctrl Ctrl2 N1 N2 N4 N6;

do bowtie2 --very-sensitive -p 32 -x /home/modesto/anaconda3/envs/mpa/lib/python3.7/site-packages/metaphlan/metaphlan_databases/mpa_vJan21_CHOCOPhlAnSGB_202103 -1 ../trimmo/"$sample"_R1_pair.fastq.gz -2 ../trimmo/"$sample"_R2_pair.fastq.gz -U ../trimmo/"$sample"_R1_unpair.fastq.gz,../trimmo/"$sample"_R2_unpair.fastq.gz | samtools view -bS - | samtools sort -@ 32 -o "$sample"_metaphlan_sorted.bam;

samtools view -u -f 4 "$sample"_metaphlan_sorted.bam > "$sample"_metaphlan_unmapped.sam;

samtools fasta "$sample"_metaphlan_unmapped.sam > "$sample"_metaphlan_unmapped.fasta; rm "$sample"_metaphlan_unmapped.sam;

rm "$sample"_metaphlan_sorted.bam;

done

We then analyzed the presence of low complexity sequences using BBduk (from BBtools) to detect sequences with entropy <0.5. The proportion of filtered low entropy reads was vary low (<0.25%) for all the samples, except for the piPolB-MDA sample (12.58%), supporting the presence of ab initio DNA synthesis (see Zyrina et al., 2014). Moreover, the presence of ab initio DNA synthesis capacity in piPolB could be confirmed in long incubation MDA assays (≥ 9h), as shown in the figure below.

Ab initio DNA synthesis by piPolB

External Figure 2. Ab initio DNA synthesis by piPolB. Time-course experiments were carried out in the standard conditions in the presence of 0, 2 and 20 ng of metagenomic DNA.

5.3 GC content bias

GC content bias of sequenced reads was analyzed in detail with Picard v.2.25.0 (CollectGcBiasMetrics). This allow us to see, for the whole metagenome and for each reference genome, the frequency of sequences at each GC content window (100 bp). All data is available in the folder gc in the GitHub repo.

Note that, contrary to the previous result in Figure 4, the coverage is normalized by the frequency of each GC window in the reference metagenome. According to the Picard definition:

NORMALIZED_COVERAGE is a relative measure of sequence coverage by the reads at a particular GC content. For each run, the corresponding reference sequence is divided into bins or windows based on the percentage of G + C content ranging from 0 - 100%. The percentages of G + C are determined from a defined length of sequence; in this case we used the default value (100 bases).

Therefore, if the reference metagenome have higher number of sequences at a particular GC content, the normalized coverage should be also higher at that GC content value.

5.3.1 GC bias per MDA sample

Data from Picard output was merged to generate a combined plot with homogeneous scale. Original individual data tables are available in the gc folder of the GitHub repository.

Show the code
# load all the files as a list of dataframes
gc_picard = c("ctrl", "ctrl2", "N4", "D3", "C3", "N1", "F2", "N2", "B2", "N6", "A2")
gc <- lapply(gc_picard, function(x) read.csv2(paste("gc/", x, "_gc_bias_metagenome.txt", sep = ""),
    skip = 6, header = TRUE, sep = "\t", colClasses = c("NULL", "NULL", "numeric", "numeric",
        "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric")))


cor_matrix <- data.frame(11, 3)
names(gc) <- c("NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA", "piPolB", "piPolB+D", "piMDA",
    "piMDA2", "piMDA+D", "piMDA+D2")
# create all the plots in a list

gc_cov <- gc[[1]][, 1:2]

for (i in 1:length(gc)) {
    gc_cov <- cbind(gc_cov, gc[[i]]$NORMALIZED_COVERAGE)
}
colnames(gc_cov) <- c("GC", "W", "NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA", "piPolB",
    "piPolB+D", "piMDA", "piMDA2", "piMDA+D", "piMDA+D2")
gc_cov <- as.data.frame(gc_cov)
gc_cov_melted <- data.table::melt(gc_cov, id.vars = c("GC", "W"), na.rm = FALSE)
colnames(gc_cov_melted) <- c("GC", "W", "Sample", "Nor_cov")
# mean GC content of mapped reads
gc_mapped <- list()
gc_mapped <- lapply(1:length(gc), FUN = function(i) {
    mean(gc[[i]]$NORMALIZED_COVERAGE[gc[[i]]$NORMALIZED_COVERAGE > 0] * gc[[i]]$GC[gc[[i]]$NORMALIZED_COVERAGE >
        0])
})
# paste in legend
leyenda <- c()
for (i in 1:length(levels(gc_cov_melted$Sample))) {
    leyenda[i] <- paste0(levels(gc_cov_melted$Sample)[i], " (", round(gc_mapped[[i]], 1), "%)")
}

p <- ggplot(gc_cov_melted, aes(x = GC)) + geom_bar(aes(y = W/1e+06), stat = "identity", fill = "steelblue",
    alpha = 0.5) + geom_line(aes(y = Nor_cov, color = Sample), size = 1.5) + theme_bw(base_size = 14) +
    scale_y_continuous("Normalized coverage", sec.axis = sec_axis(~. * 1e+06, name = "Frequency at GC")) +
    theme(axis.text.y.right = element_text(colour = "steelblue"), axis.title.y.right = element_text(colour = "steelblue"),
        axis.line.y.right = element_line(color = rgb(0.275, 0.51, 0.706, 0.55), linewidth = 2)) +
    scale_color_manual(values = colorines[1:11], labels = leyenda) + geom_hline(yintercept = 1,
    linetype = "dashed", color = "black")
p

Figure 11. Comparative Normalized coverage by GC distribution of sequence and reads. Sequence coverage (red) is normalized by the GC range frequency in 100 bp windows (blue). The average GC content of each sample mapped reads is also indicated in the leyend

Show the code
# ggsave('figures/normalized_gc_all_2.svg', p, width=8,heigh=5)

And the interactive plot for detail analysis:

Show the code
ggplotly(p)

Notwithstanding the difference in the method, this results are overall in agreement with the Figure 2. The control samples show a mild decrease in the coverage at GC values over 50%, with values always near to 1 (right Y axis). However, the normalized coverage of RP-MDA and PrimPol-MDA samples span a wide range at different GC content, with a very high normalized coverage of low GC sequences (>2.5, with peaks up to 4-5) sharply decreasing the coverage for GC content values over 50-60%, down to near negligible coverage. Strikingly, the piPolB and piPolB+D reads show a lower coverage, with some scattered peaks within the range of 30-50%. This may be due to the preference for that GC content in the piPolB initiation events, which would be in agreement with a higher rate of over-represented sequences.

However, the piMDA samples show less variability in the coverage, with a low coverage (around 0.3-0.5) at GC<50 and a strong increase in the coverage at values over 50-60%, up to 1-2. The piMDA and piMDA+D protocols show a very similar profile, although the piMDA+D reaches higher coverage levels for extremely high GC content values( >75%)

5.3.1.1 Statistic comparisons of GC bias

Normality test prompted us to non-parametrics group difference tests. Here we are removing outliers from each genome, using the same criteria: Picard variable WINDOWS that represent the number of reference metagenome sequences of a 100 bp window length in that GC value. We use a threshold of WINDOWS >5. Then, we use the function cor.mtest() to obtain the R2 and the p-value for the correlations within a confidence interval of 95%, thus avoiding the noise from outliers.

Show the code
# gc_cov_melted2 <- gc_cov_melted[gc_cov_melted$GC>19 & gc_cov_melted$GC<81,]
gc_cov_melted2 <- gc_cov_melted[gc_cov_melted$W > 5, ]
# test for normality and homogeneous variance intra groups
res_aov <- aov(Nor_cov ~ Sample, data = gc_cov_melted2)
shapiro.test(res_aov$residuals)  #NO normality 

    Shapiro-Wilk normality test

data:  res_aov$residuals
W = 0.90583, p-value < 2.2e-16
Show the code
leveneTest(Nor_cov ~ Sample, data = gc_cov_melted2)  #NO homogeneus variance
Levene's Test for Homogeneity of Variance (center = median)
       Df F value    Pr(>F)    
group  10  34.747 < 2.2e-16 ***
      869                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
# non parametric, pairwise (paired) tests
kk <- pairwise.wilcox.test(x = gc_cov_melted2$Nor_cov, g = gc_cov_melted2$Sample, paired = TRUE)

as.data.frame(format(kk$p.value, scientific = FALSE, digits = 1)) %>%
    replace(., . < 0, "") %>%
    mutate_all(~cell_spec(.x, color = ifelse(.x < 0.01, "firebrick", ifelse(.x < 0.05, "steelblue",
        "black")))) %>%
    kable(escape = F, align = "c", caption = "Table 6. P-value of pairwise comparison of GC bias in the reference metagenome") %>%
    kable_styling(bootstrap_options = "striped", full_width = F, position = "center") %>%
    column_spec(1, bold = T)
Table 6. P-value of pairwise comparison of GC bias in the reference metagenome
NA NA2 RP-MDA RP-MDA2 PrimPol-MDA piPolB piPolB+D piMDA piMDA2 piMDA+D
NA2 1.000000
RP-MDA 0.028756 0.030977
RP-MDA2 0.027509 0.030977 1.000000
PrimPol-MDA 0.027509 0.030977 1.000000 0.757072
piPolB 0.000217 0.000514 0.195928 0.164302 0.070335
piPolB+D 0.001396 0.002213 0.039915 0.050675 0.047076 1.000000
piMDA 0.013655 0.005165 0.024453 0.023303 0.021405 0.028571 0.170378
piMDA2 0.030977 0.017934 0.027509 0.027509 0.027509 0.027509 0.123799 0.123327
piMDA+D 1.000000 1.000000 0.047076 0.039915 0.054922 0.247848 1.000000 1.000000 1.000000
piMDA+D2 1.000000 0.375833 0.030977 0.030977 0.030977 0.007189 0.032833 0.000002 0.000040 1.000000

Only piMDA+D samples are not significantly different than controls samples.

5.3.1.2 Correlations: GC content vs. Normalized Coverage

Now we are going to analyze the correlation between the GC-bias in each sample (Normalized coverage by Picard) and the actual GC-bias in the reference metagenome (WINDOWS variable in the Picard output). Again, we use a threshold of WINDOWS >5. Then, we use the function cor.mtest() to obtain the R2 and the p-value for the correlations within a confidence interval of 95%, thus avoiding the noise from outliers.

Show the code
# multiple correlation

cor_matrix2 <- gc[[1]]$GC
for (i in 1:length(gc)) {
    gc[[i]] <- gc[[i]][gc[[i]]$WINDOWS > 5, ]  #remove outliers
    cor_matrix2 <- cbindX(as.data.frame(cor_matrix2), as.data.frame(gc[[i]]$NORMALIZED_COVERAGE))
}
colnames(cor_matrix2) <- c("GC", "NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA", "piPolB",
    "piPolB+D", "piMDA", "piMDA2", "piMDA+D", "piMDA+D2")
testRes <- cor.mtest(cor_matrix2, conf.level = 0.99)

corrplot(cor(cor_matrix2, use = "complete.obs"), tl.col = "black", p.mat = testRes$p, number.cex = 1,
    addCoef.col = "grey60", type = "upper", col = viridis(100))

Figure 12. Correlations between GC content and normalized coverage with all data

In agreement with Figure 12, the GC content bias shows an moderate negative correlation with the NA samples. However, there is a clear difference between the two experiments. Thus, the coverage from one of the samples (from the first batch sequenced) show a strong negative correlation with the GC content (first row, R2=-0.93), whereas in the other sample the bias was less clear (R2=-0.53). This difference could be an artifact from the libraries preparation, as has been already pointed out (Benjamini et al. 2022).

Regarding the amplified samples, which is stronger with the RP and PrimPol-MDA reads (R2=-0.91 to -0.95), whereas piMDA samples overall show positive correlation (R2≃0.9). In line with this, there is a significant and positive correlation among NA, RP and PrimPol samples and among piMDA samples.

5.3.2 GC bias detailed per reference genome

Show the code
# load all the files as a list of dataframes
gc_picard = c("ctrl_gc_bias_coli", "ctrl_gc_bias_subtilis", "ctrl_gc_bias_PAE", "ctrl_gc_bias_kocuria",
    "ctrl2_gc_bias_coli", "ctrl2_gc_bias_subtilis", "ctrl2_gc_bias_PAE", "ctrl2_gc_bias_kocuria",
    "N4_gc_bias_coli", "N4_gc_bias_subtilis", "N4_gc_bias_PAE", "N4_gc_bias_kocuria", "D3_gc_bias_coli",
    "D3_gc_bias_subtilis", "D3_gc_bias_PAE", "D3_gc_bias_kocuria", "C3_gc_bias_coli", "C3_gc_bias_subtilis",
    "C3_gc_bias_PAE", "C3_gc_bias_kocuria", "N1_gc_bias_coli", "N1_gc_bias_subtilis", "N1_gc_bias_PAE",
    "N1_gc_bias_kocuria", "F2_gc_bias_coli", "F2_gc_bias_subtilis", "F2_gc_bias_PAE", "F2_gc_bias_kocuria",
    "N2_gc_bias_coli", "N2_gc_bias_subtilis", "N2_gc_bias_PAE", "N2_gc_bias_kocuria", "B2_gc_bias_coli",
    "B2_gc_bias_subtilis", "B2_gc_bias_PAE", "B2_gc_bias_kocuria", "N6_gc_bias_coli", "N6_gc_bias_subtilis",
    "N6_gc_bias_PAE", "N6_gc_bias_kocuria", "A2_gc_bias_coli", "A2_gc_bias_subtilis", "A2_gc_bias_PAE",
    "A2_gc_bias_kocuria")
gc <- lapply(gc_picard, function(x) read.csv2(paste("gc/", x, ".txt", sep = ""), skip = 6,
    header = TRUE, sep = "\t", colClasses = c("NULL", "NULL", "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric", "numeric", "numeric", "numeric")))

# plot all the samples using a loop

cor_matrix <- data.frame(44, 3)

samples <- c("NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA", "piPolB", "piPolB+D", "piMDA",
    "piMDA2", "piMDA+D", "piMDA+D2")
templates <- c("E. coli", "B. subtilis 110NA", "P. aeruginosa PAER4", "K. rhizophila")
genomas <- merge(templates, samples, all = TRUE)
# subset color palette for genomes and replicate for each sample (subtitles of each plot)
col_genomes <- rep(colorines[12:15], times = 11)
# create the plot list
plot_list2 <- list()
plot_list2 <- lapply(1:length(gc), FUN = function(i) {
    plot_list2[[i]] <- ggplot(data = gc[[i]], aes(x = gc[[i]]$GC, y = gc[[i]]$WINDOWS)) + geom_bar(stat = "identity",
        fill = "steelblue") + geom_line(aes(y = gc[[i]]$MEAN_BASE_QUALITY * 15000), color = "grey") +
        geom_point(aes(y = gc[[i]]$NORMALIZED_COVERAGE * 2e+05), color = "firebrick") + scale_y_continuous("Frequency at GC",
        sec.axis = sec_axis(~./2e+05, name = "Normalized coverage")) + scale_color_manual(name = "",
        values = colors) + xlab("GC content") + ggtitle(paste(genomas[i, 2]), subtitle = paste(genomas[i,
        1])) + theme_bw(base_size = 14) + theme(plot.title = element_text(color = colorines[ceiling(i/4)],
        hjust = 0.5, size = 15, face = "bold"), plot.subtitle = element_text(color = col_genomes[i],
        hjust = 0.5, face = "italic"))
})
do.call("grid.arrange", c(plot_list2, ncol = 2))

Figure 13. GC distribution of sequence and reads for each reference genome. Sequence coverage (red) is normalized by the GC range frequency in 100 bp windows (blue). The average base quality is also shown (grey)

5.3.2.1 Correlations: GC content vs. Normalized Coverage per reference genome

Here we are removing outliers from each genome, using the same criteria: Picard variable WINDOWS that represent the number of reference sequences of a 100 bp window length in that GC value. We use a threshold of WINDOWS >5. Then, we use the function cor.mtest() to obtain the R2 and the p-value for the correlations within a confidence interval of 95%, thus avoiding the noise from outliers.

Show the code
cor_matrix <- data.frame(44, 3)

for (i in 1:length(gc)) {
    gc[[i]] <- gc[[i]][gc[[i]]$WINDOWS > 5, ]  #remove <5 values 
    cor_matrix[i, 1] <- gc_picard[i]
    tmp <- cor.test(gc[[i]]$GC, gc[[i]]$NORMALIZED_COVERAGE)
    cor_matrix[i, 2] <- tmp$estimate
    cor_matrix[i, 3] <- tmp$p.value
}

corr_frame <- data.frame(matrix(ncol = 11, nrow = 4))

# split the data to genome vs sample GC vs Normalized Coverage
for (i in 1:ncol(corr_frame)) {
    if (i == 1) {
        corr_frame[1:4, i] <- cor_matrix[1:4, 2]
    } else {
        corr_frame[1:4, i] <- cor_matrix[((4 * i) - 3):(4 * i), 2]
    }
}
corr_frame <- sapply(corr_frame, as.numeric)
colnames(corr_frame) <- c("NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA", "piPolB", "piPolB+D",
    "piMDA", "piMDA2", "piMDA+D", "piMDA+D2")
row.names(corr_frame) <- c("E. coli", "B. subtilis 110NA", "P. aeruginosa PAER4", "K. rhizophila")
# plot

# include the p-values
corr_frame2 <- data.frame(matrix(ncol = 11, nrow = 4))

for (i in 1:ncol(corr_frame2)) {
    if (i == 1) {
        corr_frame2[1:4, i] <- cor_matrix[1:4, 3]
    } else {
        corr_frame2[1:4, i] <- cor_matrix[((4 * i) - 3):(4 * i), 3]
    }
}
corr_frame2 <- sapply(corr_frame2, as.numeric)
colnames(corr_frame2) <- colnames(corr_frame)
row.names(corr_frame2) <- row.names(corr_frame)

corrplot(as.matrix(corr_frame), tl.col = "black", p.mat = corr_frame2, number.cex = 1.6, addCoef.col = "grey60",
    col = viridis(100))

Figure 14. Correlations among normalized coverage by GC content per reference genome

6 Assemblies

6.1 Assembly basic stats

Assemblies were obtained with MetaSpades. The piPolB sample could not be assembled due to a high proportion of low complexity reads, particularly in the R2 file (see https://www.seqme.eu/en/magazine/sequencing-low-diversity-libraries-on-illumina). We tried to bypass this problem using different trimming alternatives, like PRINSEQ++, FastP or Atropos. Only after using FastP for trimming out reads with complexity <70 (option -Y 70), we could assembly the piPolB sample (N1 raw reads), but the assembly was poor and non comparable with the other samples. Therefore, the piPolB sample was not included in the following analysis.

The assemblies were evaluated with MetaQuast v. 5.1.0rc1. A comparative analysis of all the samples was performed without the reads to save computational resources (check full report here). A full analysis including the reads were performed independently for each sample but due to size limitations it will be available along with the raw reads. All data is available in the repo folder quast_results and individual tables from detailed reports were merged using the script merge_quast_reports.R.

Show the code
quast <- fread("quast_results/merged_quast.csv", header = TRUE, na.strings = "", strip.white = TRUE)
tquast <- t(quast)
colnames(tquast) <- tquast[1, ]
tquast <- tquast[-1, ]
muestras <- rownames(tquast)
tquast <- cbind(muestras, tquast)
tquast <- as.data.frame(tquast)
# order samples
tquast[5, 1] <- "piPolB+D"
tquast$muestras <- factor(tquast$muestras, levels = c("NA", "RP-MDA", "PrimPol-MDA", "piPolB+D",
    "piMDA", "piMDA+D"))

# plots
contigs <- ggplot(tquast) + geom_point(aes(x = muestras, y = as.numeric(tquast[, 4]), color = muestras,
    fill = muestras, group = Experiment), size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) +
    geom_point(aes(x = muestras, y = as.numeric(tquast[, 21]) * 30, fill = muestras), shape = 21,
        color = "black", size = 2, position = position_dodge(width = 0.5)) + scale_y_continuous(name = "Contigs >1kb",
    breaks = seq(0, 2000, 250), sec.axis = sec_axis(~./30, name = "L50", breaks = seq(0, 50,
        5))) + theme_bw() + scale_color_manual(values = colorines) + scale_fill_manual(values = colorines) +
    theme(text = element_text(size = 15)) + ggtitle("Assembly contigs") + xlab("Sample") +
    scale_shape_manual(values = c(21, 22, 23, 24)) + guides(shape = FALSE, fill = FALSE, color = FALSE,
    scale = "none") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 16)) +
    theme(plot.title = element_text(hjust = 0.5))

mismatches <- ggplot(tquast) + geom_point(aes(x = muestras, y = as.numeric(tquast[, 47]), color = muestras,
    fill = muestras, group = Experiment), size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) +
    geom_point(aes(x = muestras, y = as.numeric(tquast[, 48]) * 16, fill = muestras), shape = 21,
        color = "black", size = 2, position = position_dodge(width = 0.5)) + scale_y_continuous(name = "Mismatches per 100 kbp",
    breaks = seq(0, 225, 25), limit = c(0, 230), sec.axis = sec_axis(~./16, name = "InDels per 100 kbp",
        breaks = seq(0, 20, 5))) + theme_bw() + scale_color_manual(values = colorines) + scale_fill_manual(values = colorines) +
    theme(text = element_text(size = 15)) + ggtitle("Assembly mismatches") + xlab("Sample") +
    scale_shape_manual(values = c(21, 22, 23, 24)) + guides(shape = FALSE, fill = FALSE, color = FALSE,
    scale = "none") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 16)) +
    theme(plot.title = element_text(hjust = 0.5))

# misassemblies report
misas <- read.table("quast_results/transposed_report_misassemblies.tsv", head = TRUE, comment.char = "@",
    sep = "\t", na.strings = "")
for (i in 1:nrow(misas)) {
    misas$Assembly[i] <- gsub("_2", "", misas$Assembly[i])
}
misas$Assembly[3:4] <- "RP-MDA"
misas$Assembly[5] <- "PrimPol-MDA"
misas$Assembly <- factor(misas$Assembly, levels = c("NA", "RP-MDA", "PrimPol-MDA", "piPolB+D",
    "piMDA", "piMDA+D"))

misas_plot <- ggplot(misas) + geom_point(aes(x = Assembly, y = as.numeric(misas[, 2]), color = Assembly,
    fill = Assembly), size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) + geom_point(aes(x = Assembly,
    y = as.numeric(misas[, 17])/4, fill = Assembly), shape = 21, color = "black", size = 2,
    position = position_dodge(width = 0.8)) + scale_y_continuous(name = "Contig misassemblies",
    breaks = seq(0, 150, 20), limit = c(0, 100), sec.axis = sec_axis(~. * 4, name = "Local misassemblies",
        breaks = seq(0, 400, 50))) + theme_bw() + scale_color_manual(values = colorines) +
    scale_fill_manual(values = colorines) + theme(text = element_text(size = 15)) + ggtitle("Misassemblies") +
    xlab("Sample") + scale_shape_manual(values = c(21, 22, 23, 24)) + guides(shape = FALSE,
    fill = FALSE, color = FALSE, scale = "none") + theme(axis.text.x = element_text(angle = 45,
    hjust = 1, size = 16)) + theme(plot.title = element_text(hjust = 0.5))



grid_contigs <- grid.arrange(contigs, mismatches, misas_plot, nrow = 1)

Figure 15. Assembly basic stats in samples amplified by all the different MDA methods tested

Show the code
grid_contigs
TableGrob (1 x 3) "arrange": 3 grobs
  z     cells    name           grob
1 1 (1-1,1-1) arrange gtable[layout]
2 2 (1-1,2-2) arrange gtable[layout]
3 3 (1-1,3-3) arrange gtable[layout]
Show the code
# ggsave('figures/contigs.svg',grid_contigs,width=12,height=5)

6.2 Stats per individual reference genome

Given the high background of mismatches in the P aeruginosa reference, we decided to normalize the number of mismatches to the first NA sample.

Show the code
quast_ref <- read.table("quast_results/num_contigs.tsv", header = TRUE, na.strings = "", sep = "\t",
    strip.white = TRUE)
names(quast_ref) <- c("Reference", "NA", "NA", "RP-MDA", "RP-MDA", "PrimPol-MDA", "piPolB+D",
    "piMDA", "piMDA", "piMDA+D", "piMDA+D")
quast_ref[, 1] <- c("B. subtilis", "E. coli", "K. rhizophila", "P. aeruginosa", "Not_aligned")

quast_ref_melted <- cbind(Category = quast_ref[[1]], stack(quast_ref[-1]))
# fix names
quast_ref_melted <- data.frame(lapply(quast_ref_melted, function(x) {
    gsub("A.1", "A", x)
}))
quast_ref_melted <- data.frame(lapply(quast_ref_melted, function(x) {
    gsub("D.1", "D", x)
}))
color_ref <- c(colorines[12:15], Not_aligned = "grey")
quast_ref_melted$ind <- factor(quast_ref_melted$ind, levels = c("NA", "RP-MDA", "PrimPol-MDA",
    "piPolB+D", "piMDA", "piMDA+D"))
quast_ref_melted$Category <- factor(quast_ref_melted$Category, levels = c("B. subtilis", "E. coli",
    "K. rhizophila", "P. aeruginosa", "Not_aligned"))
plot <- ggplot(quast_ref_melted) + geom_point(aes(x = ind, y = as.numeric(values), group = Category,
    color = Category, fill = Category, shape = Category), size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) +
    theme_bw() + scale_fill_manual(values = color_ref) + scale_color_manual(values = color_ref) +
    scale_shape_manual(values = c(21, 22, 23, 24, 20)) + theme(text = element_text(size = 15)) +
    ggtitle("Number of contigs per reference") + xlab("Sample") + labs(color = "Reference genome") +
    guides(shape = FALSE, color = FALSE, fill = FALSE, scale = "none") + theme(axis.text.x = element_text(angle = 45,
    hjust = 1, size = 16)) + theme(legend.text = element_text(face = "italic")) + theme(plot.title = element_text(hjust = 0.5)) +
    scale_y_continuous(name = "Contigs")

quast_ref_mism <- read.table("quast_results/num_mismatches_per_100_kbp.tsv", header = TRUE,
    na.strings = "", sep = "\t", strip.white = TRUE)
names(quast_ref_mism) <- c("Reference", "NA", "NA", "RP-MDA", "RP-MDA", "PrimPol-MDA", "piPolB+D",
    "piMDA", "piMDA", "piMDA+D", "piMDA+D")
quast_ref_mism[, 1] <- c("B. subtilis", "E. coli", "K. rhizophila", "P. aeruginosa")
# normalize data
quast_ref_mism[, 2:11] <- quast_ref_mism[, 2:11]/quast_ref_mism[, 2]
# melt
quast_ref_mism_melted <- cbind(Category = quast_ref_mism[[1]], stack(quast_ref_mism[-1]))
# fix names
quast_ref_mism_melted <- data.frame(lapply(quast_ref_mism_melted, function(x) {
    gsub("A.1", "A", x)
}))
quast_ref_mism_melted <- data.frame(lapply(quast_ref_mism_melted, function(x) {
    gsub("D.1", "D", x)
}))
quast_ref_mism_melted$ind <- factor(quast_ref_mism_melted$ind, levels = c("NA", "RP-MDA", "PrimPol-MDA",
    "piPolB+D", "piMDA", "piMDA+D"))
quast_ref_mism_melted$Category <- factor(quast_ref_mism_melted$Category, levels = c("B. subtilis",
    "E. coli", "K. rhizophila", "P. aeruginosa", "Not_aligned"))
plot2 <- ggplot(quast_ref_mism_melted) + geom_point(aes(x = ind, y = as.numeric(values), group = Category,
    color = Category, fill = Category, shape = Category), size = 4, alpha = 0.7, position = position_dodge(width = 0.5)) +
    theme_bw() + scale_fill_manual(values = colorines[12:15]) + scale_color_manual(values = colorines[12:15]) +
    scale_shape_manual(values = c(21, 22, 23, 24)) + theme(text = element_text(size = 14)) +
    ggtitle("Normalized mismatches per sample") + xlab("Sample") + labs(color = "Reference genome") +
    guides(shape = FALSE, fill = FALSE, color = FALSE, scale = "none") + theme(axis.text.x = element_text(angle = 45,
    hjust = 1, size = 16)) + theme(legend.text = element_text(face = "italic")) + theme(plot.title = element_text(hjust = 0.5)) +
    scale_y_continuous(name = "Normalized mismatches per 100 kbp")
grid_contigs_refs <- grid.arrange(plot, plot2, nrow = 1)

Figure 16. Assembly basic stats per reference genome

Show the code
grid_contigs_refs
TableGrob (1 x 2) "arrange": 2 grobs
  z     cells    name           grob
1 1 (1-1,1-1) arrange gtable[layout]
2 2 (1-1,2-2) arrange gtable[layout]
Show the code
# ggsave('figures/quast_refs.svg',grid_contigs_refs,width=8,height=5)

6.3 Number of contigs by GC-content

Show the code
gc_contigs <- read.table("quast_results/gc_contigs_distributions.csv", head = FALSE, sep = ",",
    na.strings = "")
gc_contigs <- t(gc_contigs)
colnames(gc_contigs) <- gc_contigs[1, ]
gc_contigs <- gc_contigs[-1, ]
gc_contigs <- as.data.frame(gc_contigs, row.names = FALSE)
gc_melt <- melt(gc_contigs, id.vars = "GC", na.rm = F)

gc_melt$value <- as.numeric(gc_melt$value)
gc_melt$GC <- as.numeric(gc_melt$GC)
# full version
gc_melt$variable <- factor(gc_melt$variable, levels = c("NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA",
    "piPolB+D", "piMDA", "piMDA2", "piMDA+D", "piMDA+D2"))

contigs_gc2 <- ggplot(gc_melt, aes(y = value, x = GC, fill = variable, label = variable)) +
    scale_y_continuous(limits = c(0, 1500)) + geom_bar(stat = "identity", position = "dodge",
    alpha = 0.8) + scale_fill_manual(values = colorines[1:11]) + scale_color_manual(values = colorines[1:11]) +
    ylab("Number of contigs") + xlab("GC content") + theme_bw() + scale_y_break(c(75, 200),
    scales = 0.5, ticklabels = c(200, 600, 1000, 1400), space = 0) + scale_x_continuous(limit = c(30,
    80), expand = c(0, 0.1)) + theme(text = element_text(size = 18)) + labs(fill = "Sample")
contigs_gc2

Figure 17. Number of contigs by GC content in all samples

Show the code
# ggsave('figures/GC_contigs2.svg', contigs_gc2, width=10, height=6)

# stats test for normality and homogeneous variance intra groups
res_aov <- aov(value ~ variable, data = gc_melt)
shapiro.test(res_aov$residuals)  #NO normality 

    Shapiro-Wilk normality test

data:  res_aov$residuals
W = 0.32817, p-value < 2.2e-16
Show the code
leveneTest(value ~ variable, data = gc_melt)  #NO homogeneus variance
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)  
group   9  2.1723 0.0254 *
      200                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
# non parametric, pairwise (paired) tests
pairwise.wilcox.test(x = gc_melt$value, g = gc_melt$variable, paired = TRUE, p.adjust.method = "BH")

    Pairwise comparisons using Wilcoxon signed rank test with continuity correction 

data:  gc_melt$value and gc_melt$variable 

            NA   NA2  RP-MDA RP-MDA2 PrimPol-MDA piPolB+D piMDA piMDA2 piMDA+D
NA2         0.34 -    -      -       -           -        -     -      -      
RP-MDA      1.00 0.17 -      -       -           -        -     -      -      
RP-MDA2     0.12 0.12 0.12   -       -           -        -     -      -      
PrimPol-MDA 0.13 0.14 0.14   0.34    -           -        -     -      -      
piPolB+D    0.34 0.34 0.30   0.12    0.17        -        -     -      -      
piMDA       0.34 0.14 0.34   0.12    0.20        0.30     -     -      -      
piMDA2      0.17 0.13 0.27   0.14    0.25        0.20     0.24  -      -      
piMDA+D     0.45 0.65 0.20   0.12    0.12        0.34     0.23  0.17   -      
piMDA+D2    1.00 0.14 0.83   0.12    0.14        0.34     0.23  0.17   0.47   

P value adjustment method: BH 
Show the code
# no significant differences

The coverage by the GC content can be also represented by Circos plots. The outer circle represents reference sequences with GC (%) heatmap, from 25% (white) to 79% (black). Color bars help to distinguish between different references. Assembly tracks are combined with mismatches visualization: green contigs indicate perfect assembly, whereas red contigs contains misassemblies, higher black columns indicate larger mismatch rate. Finally, the inner circle represents read coverage histogram.

External Figure 3. Circos plots of all samples by the four reference genomes.

The analysis of assemblies shows that for all the samples P. aeruginosa genome seems different to the reference genome. On the other hand, B. subtilis 110NA is a 168 derivative with a deletion spanning Spo0A gene (see Michel, J.F. y Cami, B., Ann. Inst. Pasteur 116(N1), 3-18 (1969)). On the contrary the assemblies of E. coli and K. rhizophila fit with the reference genomes in most samples.

6.4 Assembly annotation

Bakta v.1.5.1 with Bakta Database v.4.0. (Schwengers et al, 2021). Results in bakta_results folder in the repo.

Show the code
samples = c("ctrl", "ctrl2", "N4", "D3", "C3", "F2", "N2", "B2", "N6", "A2")
bakta <- lapply(samples, function(x) read.csv(file = paste0("bakta_stats/", x, ".txt"), sep = ":",
    header = F, skip = 1, col.names = c("V1", paste(x))))
bakta_merged <- Reduce(function(x, y) merge(x, y, by = "V1", all = TRUE), bakta)
bakta_final <- bakta_merged[c(3, 4, 15, 16, 20, 21, 24, 25, 26), ]
bakta_final = data.frame(bakta_final, row.names = NULL)
names(bakta_final) <- c("Parameter", "NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA", "piPolB+D",
    "piMDA", "piMDA2", "piMDA+D", "piMDA+D2")
kbl(bakta_final, align = "c", caption = "Table 7. Bakta annotation stats") %>%
    kable_styling(bootstrap_options = "striped", full_width = F, position = "center") %>%
    column_spec(1, italic = T)
Table 7. Bakta annotation stats
Parameter NA NA2 RP-MDA RP-MDA2 PrimPol-MDA piPolB+D piMDA piMDA2 piMDA+D piMDA+D2
CDSs 16612 16609 14502 16713 16609 17326 16671 16697 16627 16627
coding density 89.3 89.3 79.9 87.3 89.3 74.2 89.2 89.1 89.2 89.3
ncRNA regions 161 161 149 159 160 160 160 162 160 160
ncRNAs 292 286 287 287 291 285 293 289 291 289
pseudogenes 45 44 125 76 43 120 44 46 45 45
rRNAs 12 14 12 12 13 12 12 12 12 13
sORFs 90 91 84 89 92 84 89 89 91 92
tmRNAs 5 5 3 5 5 5 5 5 5 5
tRNAs 254 248 210 244 246 237 248 251 242 239

The annotation of all the metagenomes was very similar, except for those from RP-MDA and piPolB+D amplified DNA, which contain a high number of pseudogenes, most likely as consequence of poor assembly

6.5 Variant calling

We performed short nucleotide variant (SNV) detection with Snippy 4.6.0, which has been benchmarked as the top variant caller for microbial genomes,showing both strong and uniform performance across different species (Bush et al. 2020).

Show the code
samples=c("Ctrl","Ctrl2","N4","D3","C3","N1","F2","N2","B2","N6","A2")
snps <- lapply(samples, function(x) read.csv(file=paste0("snippy_results/",x,"_snps.csv"),sep=",", header=T))
shared_samples_NA <- list()
samples_NA <- list()
NA_samples <- list()
snps_table <- data.frame(matrix(NA,    # Create empty data frame
                          nrow = 11,
                          ncol = 9))
snps_table[,1] <-  c("NA","NA2","RP-MDA","RP-MDA2","PrimPol-MDA","piPolB","piPolB+D","piMDA","piMDA2","piMDA+D","piMDA+D2")
`%notin%` <- Negate(`%in%`)
for (i in 1:length(samples)){
  shared_samples_NA[[i]] <- snps[[i]][snps[[i]]$POS %in% snps[[1]]$POS,]
  samples_NA[[i]] <- snps[[i]][snps[[i]]$POS %notin% snps[[1]]$POS,]
  NA_samples[[i]] <- snps[[1]][snps[[1]]$POS %notin% snps[[i]]$POS,]
  snps_table[i,2] <- nrow(snps[[i]])
  snps_table[i,3] <- c(nrow(shared_samples_NA[[i]]))
  snps_table[i,4] <- c(nrow(samples_NA[[i]]))
  snps_table[i,5] <- table(samples_NA[[i]]$TYPE=="snp")["TRUE"]
  snps_table[i,6] <- table(samples_NA[[i]]$TYPE=="complex")["TRUE"]
  snps_table[i,7] <- c(nrow(NA_samples[[i]]))
  snps_table[i,8] <- table(NA_samples[[i]]$TYPE=="snp")["TRUE"]
  snps_table[i,9] <- table(NA_samples[[i]]$TYPE=="complex")["TRUE"]
}
names(snps_table) <- c("Sample","Variants","Variants in NA","New Variants","New SNPS","New Complex","Missing Variants","Missing SNPS","Missing Complex")


kbl(snps_table, align="c",caption="Table 8. Variant detection") %>%
    kable_styling(bootstrap_options = "striped", full_width = F,
        position = "center") %>%
     column_spec(1, italic = T) 
Table 8. Variant detection
Sample Variants Variants in NA New Variants New SNPS New Complex Missing Variants Missing SNPS Missing Complex
NA 61 61 0 NA NA 0 NA NA
NA2 61 61 0 NA NA 0 NA NA
RP-MDA 26 19 7 6 1 42 32 10
RP-MDA2 36 31 5 4 1 30 21 9
PrimPol-MDA 46 40 6 5 1 21 14 7
piPolB 1 0 1 NA 1 61 45 16
piPolB+D 12 11 1 1 NA 50 36 14
piMDA 47 41 6 5 1 20 16 4
piMDA2 60 49 11 7 3 12 10 2
piMDA+D 40 36 4 4 NA 25 19 6
piMDA+D2 61 61 0 NA NA 0 NA NA

As shown in Table 8, variant calling capacity varies greatly samples. For example, RP-MDA and piPolB(+D) amplification result in poor variant detection of both single nucleotide polymorphisms (snps) and complex variants. It should be noted that with the exception of samples NA and NA2, which called the same SNVs, there are differences between the duplicates and the second batch of samples from RP-MDA. piMDA and especially piMDA+D allow better variant detection, with piMDA+D being identical to the control samples. Nevertheless, the PrimPol-MDA and piMDA(+D) methods show similar and good performance in variant detection, with up to 60-80% of variants overlapping with the non-amplified samples.

6.6 Assembly k-mer content

We used KAT v.2.4.1 to compare the Illumina reads with their corresponding Spades assembly.

Show the code
reads <- c("Ctrl", "Ctrl2", "N4", "D3", "C3", "F2", "N2", "B2", "N6", "A2")
samples <- c("NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA", "piPolB+D", "piMDA", "piMDA2",
    "piMDA+D", "piMDA+D2")
kat <- list()
p_kat <- list()

kat <- lapply(reads, function(x) read.csv(paste0("kat/sample_vs_assembly/", x, "_m16_pe_v_asm_test-main.mx"),
    sep = " ", header = FALSE, skip = 12))


p_kat <- lapply(1:length(reads), FUN = function(i) {
    p_kat[[i]] <- ggplot(data = kat[[i]]) + geom_point(aes_(x = 1:length(kat[[i]][, 5]), y = (kat[[i]][,
        5])), col = "cyan", alpha = 0.5) + geom_point(aes_(x = 1:length(kat[[i]][, 4]), y = (kat[[i]][,
        4])), col = "green", alpha = 0.5) + geom_point(aes_(x = 1:length(kat[[i]][, 3]), y = (kat[[i]][,
        3])), col = "orange", alpha = 0.5) + geom_point(aes_(x = 1:length(kat[[i]][, 2]), y = (kat[[i]][,
        2])), col = "red", alpha = 0.5) + geom_point(aes_(x = 1:length(kat[[i]][, 1]), y = (kat[[i]][,
        1])), alpha = 0.5) + ylim(0, 2e+05) + theme_bw() + theme(text = element_text(size = 13)) +
        xlab("k-mer frequency") + ylab("Number of distinct k-mers") + ggtitle(samples[i])
})
do.call("grid.arrange", c(p_kat, ncol = 5))

Figure 18. KAT k-mer plots comparing k-mer content of Illumina PCR-free reads with the corresponding assembly. The black dots of the graph represents the distribution of k-mers present in the reads but not in the assembly, and the red area represents the distribution of k-mers present in the reads and once in the assembly

The black line in piPolB+D has a larger extension along the x-axis, indicating a higher number of k-mers excluded from the assembly. The red area in all graphs represents a normal distribution of k-mers found in the reads and occurring once in the assembly. The presence of orange dots in some samples, namely RP-MDA, RP-MDA2, PrimPol-MDA and piPolB+D, reflects the presence of repetitive content throughout the assembly, with k-mers in the reads occurring >2 times in the assembly, in agreement with a higher level of biased overamplification (Figure 4).

6.7 Assembly binning

With MetacoAG 1.0. MetacoAG uses the information of assembly graphs as well as composition (k-mers), gene markers and coverage information. Obtained bins were then assigned to the reference genomes by the mash distances, selecting those with coverage >5% and p-value < 10-10. Results are available in the folder binning in the repo.

Show the code
# read data
dist <- read.csv(file = "binning/all_distances.csv", sep = ";", header = T)
# rename references and samples
dist$reference <- gsub("../refs/NC_000913.3.fasta", "E. coli", dist$reference)
dist$reference <- gsub("../refs/NC_000964.3.fasta", "B. subtilis", dist$reference)
dist$reference <- gsub("../refs/CP013113.1_PAER4.fasta", "P. aeruginosa", dist$reference)
dist$reference <- gsub("../refs/Kocuria_rhizophila_ATCC_9341.fasta", "K. rhizophila", dist$reference)


dist$sample <- substr(dist$bin, 1, 5)
dist$sample <- gsub("Ctrl/", "NA", dist$sample)
dist$sample <- gsub("Ctrl2", "NA2", dist$sample)
dist$sample <- gsub("A2/bi", "piMDA+D2", dist$sample)
dist$sample <- gsub("B2/bi", "piMDA2", dist$sample)
dist$sample <- gsub("F2/bi", "piPolB+D", dist$sample)
dist$sample <- gsub("C3/bi", "PrimPol-MDA", dist$sample)
dist$sample <- gsub("D3/bi", "RP-MDA2", dist$sample)
dist$sample <- gsub("N2/bi", "piMDA", dist$sample)
dist$sample <- gsub("N4/bi", "RP-MDA", dist$sample)
dist$sample <- gsub("N6/bi", "piMDA+D", dist$sample)

dist$bin <- gsub(".fasta", "", dist$bin)
dist$bin <- substr(dist$bin, nchar(dist$bin) - 5, nchar(dist$bin))
dist$bin <- gsub("/", "", dist$bin)
dist$length <- substr(dist$length, 1, nchar(dist$length) - 5)
dist$length <- as.integer(dist$length)/10

# select the bins
selection <- dist[dist$length > 5, ]
selection$sample <- factor(selection$sample, levels = c("NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA",
    "piPolB+D", "piMDA", "piMDA2", "piMDA+D", "piMDA+D2"))

g <- ggplot(selection, aes(fill = reference, y = length, x = sample, label = paste(length,
    "%"))) + geom_bar(position = "stack", stat = "identity", colour = "black") + scale_fill_manual(values = colorines[12:15]) +
    ylab("Cumulative bin length (%)") + theme_bw()
g + geom_text(size = 3, position = position_stack(vjust = 0.5)) + theme(axis.text.x = element_text(angle = 45,
    hjust = 1, size = 14), axis.text.y = element_text(size = 14))

Figure 19. Metagenome Binning

Try the same but with different MetaCoAG parameters: –mg_threshold 0.4 –bin_mg_threshold 0.2

Show the code
dist <- read.csv(file = "binning/all_distances_final.csv", sep = ";", header = T)

dist$reference <- gsub("../refs/NC_000913.3.fasta", "E. coli", dist$reference)
dist$reference <- gsub("../refs/NC_000964.3.fasta", "B. subtilis", dist$reference)
dist$reference <- gsub("../refs/CP013113.1_PAER4.fasta", "P. aeruginosa", dist$reference)
dist$reference <- gsub("../refs/Kocuria_rhizophila_ATCC_9341.fasta", "K. rhizophila", dist$reference)



dist$sample <- substr(dist$bin, 1, 5)
dist$sample <- gsub("Ctrlb", "NA", dist$sample)
dist$sample <- gsub("Ctrl2", "NA2", dist$sample)
dist$sample <- gsub("A2b/b", "piMDA+D2", dist$sample)
dist$sample <- gsub("B2b/b", "piMDA2", dist$sample)
dist$sample <- gsub("F2b/b", "piPolB+D", dist$sample)
dist$sample <- gsub("C3b/b", "PrimPol-MDA", dist$sample)
dist$sample <- gsub("D3b/b", "RP-MDA2", dist$sample)
dist$sample <- gsub("N2b/b", "piMDA", dist$sample)
dist$sample <- gsub("N4b/b", "RP-MDA", dist$sample)
dist$sample <- gsub("N6b/b", "piMDA+D", dist$sample)

dist$bin <- gsub(".fasta", "", dist$bin)
dist$bin <- substr(dist$bin, nchar(dist$bin) - 5, nchar(dist$bin))
dist$bin <- gsub("/", "", dist$bin)
dist$length <- substr(dist$length, 1, nchar(dist$length) - 5)
dist$length <- as.integer(dist$length)/10

# select the bins
dist <- dist[-184, ]
selection <- dist[dist$length > 5, ]
selection$sample <- factor(selection$sample, levels = c("NA", "NA2", "RP-MDA", "RP-MDA2", "PrimPol-MDA",
    "piPolB+D", "piMDA", "piMDA2", "piMDA+D", "piMDA+D2"))
# plot

g <- ggplot(selection, aes(fill = reference, y = length, x = sample, label = paste(length,
    "%"))) + geom_bar(position = "stack", stat = "identity", colour = "black", alpha = 0.8) +
    ggtitle("Assembly binning by reference") + labs(fill = "Reference genome") + ylab("Cumulative bin length (%)") +
    scale_fill_manual(values = colorines[12:15]) + geom_text(size = 3, position = position_stack(vjust = 0.5)) +
    theme_bw() + theme(axis.title = element_text(size = 15), axis.text.x = element_text(angle = 45,
    hjust = 1, size = 14), axis.text.y = element_text(size = 14))
g + theme(legend.text = element_text(face = "italic"))

Figure 20. Assembly bins extracted and color-coded by the reference genome

Show the code
# ggsave('figures/binning.svg', g, width=8, height=5)

Despite the low coverage, the piPolB+D assembly surprisingly good in the binning analysis and extraction.

Interestingly, despite containing more contigs in the assemblies (Figures 16 and 18), piMDA outperforms the piMDA+D protocol in the binning of B. subtilis sequences, suggesting that the alkaline denaturation step may somewhat reduce the overall quality of the MAGs from (pi)MDA samples.

7 Index of Figures

Show the code
names(figuritas) <- "List of Plots captions"
kbl(figuritas, align = "l", caption = "") %>%
    kable_styling(bootstrap_options = "striped", full_width = F, font_size = 12, position = "left")
List of Plots captions
Figure 1. DNA amplification yield comparison. All reactions were carried out in 50 µL and incubated at 30ºC for 3 h, except those of RP-MDA protocol that were incubated for 16 h. Shown are results from 4 independent experiments.
Figure 2. DNA amplification time-course. All the amplification assays contained 10 ng of input DNA sample. Shown is the total DNA product in 50 µL reactions. Amplification yield from 2-4 experiments are shown.
Figure 3. Reads duplication level before (top) and after (bottom) trimming
Figure 4. Paired reads after trimming
Figure 5. Reads' GC content plot. GC content is indicated in the plot leyend for the post-trimming reads
Figure 6. Metagenomic samples reads profiling of amplified DNA by different MDA protocols. Note that both plots show the same data, with different representation
Figure 7. Clustering of samples by the metagenomic profiling results
Figure 8. Reference genomes coverage depth by diverse MDA methods
Figure 9. Reference genomes coverage depth (left axis) and variation coefficient (small points, right axis) by MDA methods
Figure 10. Reads Mismatch rate
Figure 11. Comparative Normalized coverage by GC distribution of sequence and reads. Sequence coverage (red) is normalized by the GC range frequency in 100 bp windows (blue). The average GC content of each sample mapped reads is also indicated in the leyend
Figure 12. Correlations between GC content and normalized coverage with all data
Figure 13. GC distribution of sequence and reads for each reference genome. Sequence coverage (red) is normalized by the GC range frequency in 100 bp windows (blue). The average base quality is also shown (grey)
Figure 14. Correlations among normalized coverage by GC content per reference genome
Figure 15. Assembly basic stats in samples amplified by all the different MDA methods tested
Figure 16. Assembly basic stats per reference genome
Figure 17. Number of contigs by GC content in all samples
Figure 18. KAT k-mer plots comparing k-mer content of Illumina PCR-free reads with the corresponding assembly. The black dots of the graph represents the distribution of k-mers present in the reads but not in the assembly, and the red area represents the distribution of k-mers present in the reads and once in the assembly
Figure 19. Metagenome Binning
Figure 20. Assembly bins extracted and color-coded by the reference genome

8 Acknowledments and Funding

We are indebted to Professor Margarita Salas for her longstanding support and contributions to the discovery of piPolB and to the early development of this project. We also thank Dr. Miguel de Vega for valuable suggestions and critical reading of the manuscript. This work was supported by a grants PGC2018-093723-A-100 and PID2021-123403NB-I00 funded by MCIN/AEI/10.13039/501100011033/FEDER A way to make Europe to M.R.R. C. Egas’ laboratory was funded by the European Union´s Horizon 2020 Research and Innovation Program under Grant Agreement No. 685474, the METAFLUIDICS project. CDO was holder of a Fellowship from the Spanish Ministry of University FPU16/02665.

9 Session Info

Show the code
sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_1.1.0        geomtextpath_0.1.1 patchwork_1.1.2    reshape2_1.4.4    
 [5] heatmaply_1.4.2    viridis_0.6.2      viridisLite_0.4.1  plotly_4.10.1     
 [9] corrplot_0.92      kableExtra_1.3.4   gridExtra_2.3      pzfx_0.3.0        
[13] ggrepel_0.9.2      gdata_2.18.0.1     stringr_1.5.0      car_3.1-1         
[17] carData_3.0-5      ggh4x_0.2.3        ggbreak_0.1.1      data.table_1.14.6 
[21] ggplot2_3.4.0      formatR_1.14       knitr_1.42        

loaded via a namespace (and not attached):
 [1] httr_1.4.4         tidyr_1.3.0        jsonlite_1.8.4     foreach_1.5.2     
 [5] gtools_3.9.4       assertthat_0.2.1   highr_0.10         yulab.utils_0.0.6 
 [9] yaml_2.3.7         pillar_1.8.1       glue_1.6.2         digest_0.6.31     
[13] RColorBrewer_1.1-3 rvest_1.0.3        colorspace_2.1-0   ggfun_0.0.9       
[17] plyr_1.8.8         htmltools_0.5.4    pkgconfig_2.0.3    purrr_1.0.1       
[21] scales_1.2.1       webshot_0.5.4      ggplotify_0.1.0    svglite_2.1.1     
[25] tibble_3.1.8       farver_2.1.1       generics_0.1.3     ellipsis_0.3.2    
[29] withr_2.5.0        lazyeval_0.2.2     cli_3.6.0          magrittr_2.0.3    
[33] evaluate_0.20      fansi_1.0.4        xml2_1.3.3         textshaping_0.3.6 
[37] tools_4.2.2        registry_0.5-1     lifecycle_1.0.3    aplot_0.1.9       
[41] munsell_0.5.0      compiler_4.2.2     ca_0.71.1          gridGraphics_0.5-1
[45] systemfonts_1.0.4  rlang_1.0.6        grid_4.2.2         iterators_1.0.14  
[49] rstudioapi_0.14    htmlwidgets_1.6.1  crosstalk_1.2.0    labeling_0.4.2    
[53] rmarkdown_2.20     gtable_0.3.1       codetools_0.2-18   abind_1.4-5       
[57] TSP_1.2-2          R6_2.5.1           seriation_1.4.1    fastmap_1.1.0     
[61] utf8_1.2.2         dendextend_1.16.0  stringi_1.7.12     Rcpp_1.0.10       
[65] vctrs_0.5.2        tidyselect_1.2.0   xfun_0.36         

Footnotes

  1. JM83 assembly was published in July 2022, but it shares an ANI of 99.98 with MG1655, so we decided not to change the reference genome.↩︎