Consistent with Figure 8, we can see that most genes are clustered at a Z-score <0, with a peak around -0.5 (grey line) and a shoulder centered at 1 (orange line). This indicates that most genes have a low number of insertions and that there is a gradient of genes up to a very high number of insertions.
In previous bacteria, the TnSeq data follow a bimodal distribution consisting of an exponential distribution of essential genes and a gamma family curve for the non-essential genes, with an inflection point in between (Goodall et al. 2018.; Moule et al. 2014; Valentino et al. 2014; Higgins et al. 2020; Ramsey et al. 2020; A. Ghomi et al. 2024), even in polyploid bacteria (Rubin et al. 2015). However, there is no inflection point in our data, so we confirm that there are no clear groups of essential/non-essential genes and only a gamma distribution can be fitted (p-value 2x10-16).
To make an approximate classification of the genes, we will perform an unsupervised cluster analysis. A similar approach was also used in a recent work (A. Ghomi et al. 2024), using DBSCAN algorithm. However, in that work they also have bimodal curve and more clear density separation of the score index. With our data, we obtained two large groups with wide range of overlapping index that was no very informative.
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.
Thus, we decided to use a more classical approach and compare K-means and PAM clustering. First, we will test the number of clusters that allow us to better classify the genes.
The Kmeans clustering is slightly better as its clusters area are closer to the silhouette mean, thus we will keep this clusters and name them as “Highly permissive” and “Intermediate” genes. As shown in the following plot, both groups are well separated.
Show the code
#full table#subset the ~5% genes with less insertionsdatos.j =cbind(scores, cluster= kmclust2$cluster)ggplot(data=cbind(datos.j$cluster,stack(datos.j[,2:5]))) +xlab("Z-Score (log2)") +ylab("Gene count") +geom_density(aes(x=as.numeric(values),group=as.factor(`datos.j$cluster`), color=`datos.j$cluster`, fill=`datos.j$cluster`), alpha=0.7)+theme_classic() +facet_grid(~ind) +labs(fill="Cluster",color="Cluster")
The first group contains the genes with higher Z-score mean and the second large group contains the genes with less insertions. To highlight the top genes with less insertions, we empirically select the ~5% genes with less insertions as “Less permissive” genes, corresponding to 321 genes with a Z-score < -0.5.
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(...)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
The separation between the “Intermediate” and “Highly Permissive” cluster is not very clear, as expected. Thus, for some analysis, we will consider these groups as only one same cluster. The following table contain the genes Z-score and classification in three groups.
Although there is overall no differences between strains, we are going to check in detail if there is any gene with statistically significant more/less Tn insertions.
Show the code
#meethod from https://lashlock.github.io/compbio/R_presentation.html#we need to use raw counts as integers, no Z-scorescountData <-read.csv('tnseq_dd_counts_scores.csv', header =TRUE, sep =";")countData <- countData[,c(1,7,9,11,13)]countData[,2:5] <-lapply(countData[,2:5], as.integer)metaData <-data.frame(names(countData[,2:5]),c("HB27","HB27","Ppol","Ppol"),c("Exp1","Exp2","Exp1","Exp2"))names(metaData) <-c("Sample","Strain","Experiment")dds <-DESeqDataSetFromMatrix(countData=countData, colData=metaData, design=~Strain, tidy =TRUE)
Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors
Show the code
dds <-DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
function: y = a/x + b, and a local regression fit was automatically substituted.
specify fitType='local' or 'mean' to avoid this message next time.
Warning in lfproc(x, y, weights = weights, cens = cens, base = base, geth =
geth, : Estimated rdf < 1.0; not estimating variance
final dispersion estimates
fitting model and testing
Show the code
res <-results(dds)res <- res[order(res$padj),]res$diffexpressed <-"NO"res$diffexpressed[res$log2FoldChange >2& res$padj <0.01] <-"UP"res$diffexpressed[res$log2FoldChange <-2& res$padj <0.01] <-"DOWN"library(ggrepel)# plot adding up all layers we have seen so farggplot(data=res, aes(x=log2FoldChange, y=-log10(pvalue), col=diffexpressed, label=row.names(res))) +geom_point() +theme_minimal() +geom_text_repel() +scale_color_manual(values=c("blue", "black", "red")) #+
Warning: Removed 671 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 671 rows containing missing values or values outside the scale range
(`geom_text_repel()`).
Warning: ggrepel: 898 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
As we can see there are genes with significant differences between samples.
Genes functional classification
COG
Show the code
nog <-read_xlsx("00_raw/refs/MM_x4qgl_mt.emapper.annotations.xlsx",sheet=1,skip=2)nog$seed_ortholog <-gsub("262724.","",nog$seed_ortholog)names(nog)[2] <-"Genes"
After calculating the Z-scores, we are going to group the genes by their functions, using COG and KEGG functional categories. Functional annotation was obtained at EggNog mapper website Cantalapiedra et al. (2021). We could annotate 2006 (92.4%) of the HB27 proteins.
Show the code
#load COG listcog <-read.table("scripts/cog_list.csv",sep=";",header=TRUE)cog <- cog[-26,]#summarize and split categoriescategories <-c()for (i in1:nrow(cog)){ categories <-grep(cog$category[i],nog$COG_category) cog$sum[i] <-length(categories)}#plot freqsggplot(cog,aes(y=as.numeric(sum), x=definition,color=group,fill=group)) +geom_bar(stat="identity", position="stack", alpha=0.8) +ylab("Number of genes") +xlab("") +coord_flip() +scale_y_continuous(breaks =seq(0, 500, by =100))+theme_classic() +theme(text=element_text(size=15),legend.text=element_text(size=9))
After the incorporation of COG and KEGG annotations, the final TnSeq table is exported as tnSeq_dd_full.csv.
EggNog mapper could assign KEGG Ontology (KO) group to 1352 (62.28%) of the HB27 proteins. After revision of the KO annotation, we manually incorporated several missing tRNA genes to improve the statistics, using K14228 (tRNA-leu).
Show the code
#parse data and countko <-fread("scripts/ko.csv",sep=";",header=TRUE)trna <-read.csv2("scripts/tRNAs.csv",header =FALSE)final$KEGG_ko <-gsub("ko:","",final$KEGG_ko)#tRNAs annotation is not very good, so we corrected for statistics#all tRNAs not annotated were masked as K14228 (tRNA-leu)final$KEGG_ko[final$Genes %in% trna$V1] <-"K14228"duplicate_ko_rows <-function(data) { data %>%mutate(reaction_ko =strsplit(KEGG_ko, ",")) %>%# Split ko annotationsunnest(reaction_ko) %>%# Unnest each annotation into separate rowsselect(-KEGG_ko) # Keep original columns except split annotations}datos_unnested <-duplicate_ko_rows(final)stats <-summarise(group_by(datos_unnested,reaction_ko,cluster),count =n())
`summarise()` has grouped output by 'reaction_ko'. You can override using the
`.groups` argument.
A. Ghomi, Fatemeh, Jakob J. Jung, Gemma C. Langridge, Amy K. Cain, Christine J. Boinett, Moataz Abd El Ghany, Derek J. Pickard, et al. 2024. “High-Throughput Transposon Mutagenesis in the Family Enterobacteriaceae Reveals Core Essential Genes and Rapid Turnover of Essentiality.”mBio 0 (0): e01798–24. https://doi.org/10.1128/mbio.01798-24.
Cantalapiedra, Carlos P., Ana Hernández-Plaza, Ivica Letunic, Peer Bork, and Jaime Huerta-Cepas. 2021. “eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale.”Molecular Biology and Evolution 38 (12): 5825–29. https://doi.org/10.1093/molbev/msab293.
Goodall, Emily C. A., Ashley Robinson, Iain G. Johnston, Sara Jabbari, Keith A. Turner, Adam F. Cunningham, Peter A. Lund, Jeffrey A. Cole, and Ian R. Henderson. 2018. “The Essential Genome of Escherichia Coli k-12.”mBio 9 (1): 10.1128/mbio.02096–17. https://doi.org/10.1128/mbio.02096-17.
Higgins, Steven, Stefano Gualdi, Marta Pinto-Carbó, and Leo Eberl. 2020. “Copper Resistance Genes of Burkholderia Cenocepacia H111 Identified by Transposon Sequencing.”Environmental Microbiology Reports 12 (2): 241–49. https://doi.org/10.1111/1758-2229.12828.
Moule, Madeleine G., Claudia M. Hemsley, Qihui Seet, José Afonso Guerra-Assunção, Jiali Lim, Mitali Sarkar-Tyson, Taane G. Clark, et al. 2014. “Genome-wide saturation mutagenesis of Burkholderia pseudomallei K96243 predicts essential genes and novel targets for antimicrobial development.”mBio 5 (1): e00926–00913. https://doi.org/10.1128/mBio.00926-13.
Ramsey, Kathryn M., Hannah E. Ledvina, Tenayaann M. Tresko, Jamie M. Wandzilak, Catherine A. Tower, Thomas Tallo, Caroline E. Schramm, et al. 2020. “Tn-Seq reveals hidden complexity in the utilization of host-derived glutathione in Francisella tularensis.”PLoS pathogens 16 (6): e1008566. https://doi.org/10.1371/journal.ppat.1008566.
Rubin, Benjamin E., Kelly M. Wetmore, Morgan N. Price, Spencer Diamond, Ryan K. Shultzaberger, Laura C. Lowe, Genevieve Curtin, Adam P. Arkin, Adam Deutschbauer, and Susan S. Golden. 2015. “The Essential Gene Set of a Photosynthetic Organism.”Proceedings of the National Academy of Sciences of the United States of America 112 (48): E6634–43. https://doi.org/10.1073/pnas.1519220112.
Valentino, Michael D., Lucy Foulston, Ama Sadaka, Veronica N. Kos, Regis A. Villet, John Santa Maria, David W. Lazinski, et al. 2014. “Genes contributing to Staphylococcus aureus fitness in abscess- and infection-related ecologies.”mBio 5 (5): e01729–01714. https://doi.org/10.1128/mBio.01729-14.