Thermus TnSeq
  • Home
  • Contents
    • TnSeq Illumina Data Processing
    • TnSeq Genes classification
    • Thermus Pangenome

On this page

  • 1 Genomes dataset and pangenome construction
  • 2 Pangenomes stats
  • 3 Integration of TnSeq and pangenomes
    • Correlation between TnSeq and Pangenome
    • Enrichment analysis
  • 4 Compare Tn-seq with prokaryotic “essentialtome” and HB27 transcription
    • Essential genes in other species
    • Gene expression in HB27

Thermus Pangenome

Author
Affiliation

Modesto

Department of Biochemistry, UAM

Published

April 26, 2024

Modified

March 14, 2025

1 Genomes dataset and pangenome construction

We downloaded (May 27th, 2024) all the Phylum Deinococcota assemblies from Refseq database using the NCBI tool datasets (v.16.5.0) installed in Conda.

Show code
conda activate ncbi_datasets
conda update -c conda-forge ncbi-datasets-cli
cd thermaceae/thermus_refseq
datasets download genome taxon Thermaceae --filename thermaceae_dataset.zip --exclude-atypical --assembly-source 'refseq'

To obtain an index of the genome files, we converted the dataset_catalog.json to CSV (https://www.convertcsv.com/json-to-csv.htm) to obtain the file fasta.csv.

To work with homogeneous and updated genome annotations, we subsequently re-annotated all genome assemblies with Bakta (from Conda, v. 1.9.2) using the full database (DDBB v. 5.1) and the options --skip-crispr --force.

Show code
cd pangenomes/deinococcota/ncbi_dataset/data
conda activate bakta
conda update bakta
bakta_db download --output /Volumes/Trastero4/ddbb --type full 
#database is in external HD
while IFS=, read -r col1 col2 col3
    do
        bakta --db /Volumes/Trastero4/ddbb/bakta-db --verbose --output ../../../bakta_results/$col3 --prefix $col3 --locus-tag $col3 --threads 16 $col1
    done < <(tail -n +2 fasta.csv)

The pangenomes were constructed with PPanGGOLIN (v. 2.0.5) installed in Conda. You can see the whole documentation about PPanGGOLIN and the output files here. In order to construct different pangenomes at species, genus, family and order levels, we parse the taxonomy from NCBI datasets using dataformat tool and then use a short R script (parse_taxonomy.R) to generate the annotated genomes table lists.

After testing different alternative datasets and clustering combinations, we empirically set the MMSeqs clustering sequence identity and coverage parameters to 0.4 and 0.5, respectively.

Show code
#STEP 2
#Run ppanggolin and write extra output files
conda activate bioconda
cd pangenomes
ppanggolin workflow --anno deinococcota.gbff.list --basename deinococcota --identity 0.4 --coverage 0.5 -o deinococcota_i4c5 -c 16 -f
ppanggolin workflow --anno thermaceae.gbff.list --basename thermaceae --identity 0.4 --coverage 0.5 -o thermaceae_i4c5 -c 16 -f
ppanggolin workflow --anno thermus.gbff.list --basename thermus --identity 0.4 --coverage 0.5 -o thermus_i4c5 -c 16 -f
ppanggolin workflow --anno tt.gbff.list --basename tthermophilus --identity 0.4 --coverage 0.5 -o tt_i4c5 -c 16 -f

Additionally, before moving forward with the pangenome, as reference, we are going to incorporate the gene names from the Thermus thermophilus strain HB27, as annotated in the NCBI Refseq assembly (GCF_000008125.1).

2 Pangenomes stats

Now, we are going to have a look to the main pangenome stats.

Show code
pangenomes <- c("tt_i4c5","thermus_i4c5","thermaceae_i4c5","deinococcota_i4c5")
names(pangenomes) <- c("Thermus thermophilus","Thermus","Thermaceae","Deinococcota")
#stats from all metagenomes
cloud <- c()
shell <- c()
persistent <- c()
for (i in 1:length(pangenomes)){
  cloud[i] <- length(read_lines(paste0("pangenomes/",pangenomes[i],"/partitions/cloud.txt")))
  shell[i] <- length(read_lines(paste0("pangenomes/",pangenomes[i],"/partitions/shell.txt")))
  persistent[i] <- length(read_lines(paste0("pangenomes/",pangenomes[i],"/partitions/persistent.txt")))
}
stats <- data.frame(cloud,shell,persistent)
row.names(stats) <- names(pangenomes)
stats <- cbind(row.names(stats),stack(stats))
names(stats) <- c("level","numbers","partition")
#plot
ggplot(stats,aes(x=level, y=numbers,group=partition)) + 
         geom_bar(aes(fill=partition),stat = "identity",position="fill",color="grey40", alpha=0.8) + ylab("Gene families (%)")+ geom_text(aes(label=numbers),size=5, position = position_fill(vjust=0.5) , col = "black")+xlab("Pangenome level") +scale_y_continuous(labels = scales::percent) + theme_bw()+scale_fill_manual(values=c("#bbe59f","#abc837","#339933") )

Figure 23. Statistics of PPanGGOLIN pangenomes obtained with 40% identity and 50% coverage. The number of gene families in each partition is indicated

The number of clusters or gene families conserved (Shell+Persistent) increases as we go down in the taxonomical level. Note that the Shell partition shrinks at the species level, suggesting that the identity threshold is probably low at this level for a detailed pangenome analysis. However, as we will focus in the Persistent partition, the calculated pangenomes would be enough informative.

3 Integration of TnSeq and pangenomes

As mentioned above, in order to compute the pangenomes, we carried out a new annotation of the genes in all the assemblies. This is required to have homogeneous gene definitions, but that also gives rise to some differences between the classical NCBI Genbank annotation that we used for the pangenome and the new annotations. Thus, in order to integrate the HB27 genes from TnSeq (TT_CXXXX) and pangenome genes (GCF_000008125.1_XXXXX) in the same table, we used the package fuzzyjoin that allows a flexible merge of the data, considering the gene start and stop as an interval rather than fixed coordinates. However, this flexibility also has a price, as we have some pangenome genes that matched with more than one gene in the classical annotation. We tested different levels of flexibility in the coordinates to maximize the number of genes in the table and at the same time minimize the number of duplicates.

Show code
#cross GCA_000008125.1 in all pangenomes
matrix <- list()
HB27 <- data.frame()
for (i in 1:length(pangenomes)){
  matrix[[i]]<- read.table(paste0("pangenomes/",pangenomes[i],"/table/GCF_000008125.1.tsv"),header=TRUE,sep="\t")
}
HB27 <- merge(matrix[[1]][,c(1:4,8)],matrix[[2]][,c(1:4,8)],by=c("gene","contig","start","stop"),all=TRUE)
colnames(HB27)[5:6] <- pangenomes[1:2]
HB27 <- merge(HB27,matrix[[3]][,c(1:4,8)],by=c("gene","contig","start","stop"),all=TRUE)
HB27 <- merge(HB27,matrix[[4]][,c(1:4,8)],by=c("gene","contig","start","stop"),all=TRUE)
colnames(HB27)[7:8] <- pangenomes[3:4]

#cross gene names 
tnseq <- read.csv2("tnSeq_dd_full.csv")
HB27_genome <- read.table("00_raw/refs/GCA_000008125.1.gtf",sep="\t",header=FALSE)
HB27_genome$Chr <- "AE017221.1"
HB27_genome$Chr[8088:nrow(HB27_genome)] <- "AE017222.1"
HB27_genome$V9 <-  substr(HB27_genome$V9,9,16)
colnames(HB27_genome)[c(4:5,9)] <- c("start","stop","Genes")
#to reduce differnces in gene annotation, we use fuzzyjoin
#we use the highest max_dist value that don't give rise to duplicate gene names 
tmp <- fuzzyjoin::difference_inner_join(HB27[HB27$contig=="contig_1",],HB27_genome[HB27_genome$V3=="CDS" & HB27_genome$V1=="AE017221.1" ,c(4:5,9)], by =c("start", "stop"), max_dist = 210 )
tmp2 <- fuzzyjoin::difference_inner_join(HB27[HB27$contig=="contig_2",],HB27_genome[HB27_genome$V3=="CDS" & HB27_genome$V1=="AE017222.1" ,c(4:5,9)], by =c("start", "stop"), max_dist = 210 )
HB27_annotated <- rbind(tmp,tmp2)

#now we fetch the number of genomes with members in each family and paralogs
#load pangenome matrix
matrix <- list()
for (i in 1:length(pangenomes)){
  matrix[[i]]<- read.table(paste0("pangenomes/",pangenomes[i],"/matrix.csv"),sep=",",header=TRUE)
  matrix[[i]] <- matrix[[i]][,c(4,6,match("GCF_000008125.1",names(matrix[[i]])))]
  colnames(matrix[[i]])[3] <- "gene"

  #for (j in 1:nrow(HB27_annotated)){
  #  k <- grep(HB27_annotated$gene[j],matrix[[i]]$gene)
  #  HB27_annotated$isolates[j] <- matrix[[i]]$No..isolates[k] 
  #  HB27_annotated$seqs[j] <- matrix[[i]]$Avg.sequences.per.isolate[k] 
  #}
  HB27_annotated$isolates <- sapply(1:nrow(HB27_annotated), function(j) {
  k <- grep(HB27_annotated$gene[j], matrix[[i]]$gene)
  matrix[[i]]$No..isolates[k]
  })
  HB27_annotated$seqs <- sapply(1:nrow(HB27_annotated), function(j) {
  k <- grep(HB27_annotated$gene[j], matrix[[i]]$gene)
  matrix[[i]]$Avg.sequences.per.isolate[k] 
  })
  colnames(HB27_annotated)[c(length(HB27_annotated)-1,length(HB27_annotated))] <- c(paste0(pangenomes[i],"_isolates"),paste0(pangenomes[i],"_seqs.per.isolate"))
}

#final-final table
panTnseq <- merge(tnseq,HB27_annotated[,c(11,5:8,12:19)])
panTnseq_Full <-  merge(tnseq,HB27_annotated[,c(11,5:8,12:19)],by="Genes",all.x=TRUE)
#reorder columns and save
setcolorder(panTnseq_Full,"cluster",before ="Ppol_1")
#combine description fields
panTnseq_Full$Description.y[panTnseq_Full$Description.y=="-"] <- panTnseq_Full$Description.x[panTnseq_Full$Description.y=="-"]
setcolorder(panTnseq_Full,"Description.y",before ="Description.x")
panTnseq_Full <- panTnseq_Full[,-5]
names(panTnseq_Full)[4] <- "Description"
#save
#fwrite(panTnseq_Full,"tnseq_pangenome_25feb2025.csv",sep=";", row.names = FALSE)
#writexl::write_xlsx(panTnseq_Full,"tnseq_pangenome_25feb2025.xlsx")

We have been able to match in the pangenome 2190 genes out of 2275, with only 3 genes of the newly annotated HB27 (GCF_000008125.1_07995, GCF_000008125.1_08920, and GCF_000008125.1_08925) that matched with two genes in the classical annotation (TT_CXXXX).

The final table is saved as tnseq_pangenome_25feb2025.xlsx

Correlation between TnSeq and Pangenome

There’s no overall correlation between the TnSeq group classification of the HB27 genes, with the number of conserved genes in each pangenome and the pangenome cluster.

Show code
#recuento

high <- c()
intermediate <- c()
less <- c()
for (i in 1:4){
  high[i] <- nrow(subset(panTnseq,panTnseq[,17+i]=="persistent"  & panTnseq$cluster=="Highly Permissive" ))
   intermediate[i] <- nrow(subset(panTnseq,panTnseq[,17+i]=="persistent" & panTnseq$cluster=="Intermediate" ))
   less[i] <- nrow(subset(panTnseq,panTnseq[,17+i]=="persistent" & panTnseq$cluster=="Less Permissive" ))
}
stats <- data.frame(high,intermediate,less)
row.names(stats) <- names(pangenomes)
stats <- cbind(row.names(stats),stack(stats))
names(stats) <- c("level","ratio","cluster")
#plot
ggplot(stats,aes(x=level, y=ratio,group=cluster)) + 
    geom_bar(aes(fill=cluster),stat = "identity",position="fill",color="grey40", alpha=0.8) + ylab("Pangenome Persistent Genes  (%)")+ geom_text(aes(label=ratio),size=5, position = position_fill(vjust=0.5) , col = "black")+xlab("Pangenome level") +scale_y_continuous(labels = scales::percent) +theme_bw()+ theme(axis.text.x = element_text(angle = 45,vjust=1,hjust=1),text=element_text(face="bold"))+labs(fill="TnSeq gene group")+scale_fill_manual(labels=c("Highly permissive","Intermediate","Less Permissive"),values=c("#FF9966", "#619CFF" ,"#3366CC")) 

Figure 24. Persistent (core) genes in each pangenome.
Show code
ggsave("fig4b.pdf",width=5,height=5)

Now, let’s see genes distribution in the different groups using Venn diagrams.

Show code
       x <- list(
  Intermediate = panTnseq_Full$Genes[panTnseq_Full$cluster=="Intermediate"], 
  Highly_permissive = panTnseq_Full$Genes[panTnseq_Full$cluster=="Highly Permissive"], 
  Less_permissive = panTnseq_Full$Genes[panTnseq_Full$cluster=="Less Permissive"], 
  Persistent = panTnseq_Full$Genes[panTnseq_Full$tt_i4c5=="persistent"]
  )

library(ggvenn)
Loading required package: grid
Show code
tt <- ggvenn(
  x, 
  fill_color = c("#EFC000FF","#0073C2FF","cornflowerblue", "#CD534CFF"),
  stroke_size = 0.5, set_name_size = 4
  ) + ggtitle("Thermus thermophilus") + theme(plot.title = element_text(hjust = 0.5))+
   scale_x_continuous(expand = expansion(mult = c(0.15, 0.15)))

x[[4]] <- panTnseq_Full$Genes[panTnseq_Full$thermus_i4c5=="persistent"]

thermus <- ggvenn(
  x, 
  fill_color = c("#EFC000FF","#0073C2FF","cornflowerblue", "#CD534CFF"),
  stroke_size = 0.5, set_name_size = 4
  ) + ggtitle("Thermus") + theme(plot.title = element_text(hjust = 0.5))+
   scale_x_continuous(expand = expansion(mult = c(0.15, 0.15)))

x[[4]] <- panTnseq_Full$Genes[panTnseq_Full$thermaceae_i4c5=="persistent"]

thermaceae <- ggvenn(
  x, 
  fill_color = c("#EFC000FF","#0073C2FF","cornflowerblue", "#CD534CFF"),
  stroke_size = 0.5, set_name_size = 4
  ) + ggtitle("Thermaceae") + theme(plot.title = element_text(hjust = 0.5))+
   scale_x_continuous(expand = expansion(mult = c(0.15, 0.15)))

x[[4]] <- panTnseq_Full$Genes[panTnseq_Full$deinococcota_i4c5=="persistent"]

deinococcota <- ggvenn(
  x, 
  fill_color = c("#EFC000FF","#0073C2FF","cornflowerblue", "#CD534CFF"),
  stroke_size = 0.5, set_name_size = 4
  ) + ggtitle("Deinococcota") + theme(plot.title = element_text(hjust = 0.5))+
   scale_x_continuous(expand = expansion(mult = c(0.15, 0.15)))


ggarrange(tt, thermus, thermaceae, deinococcota, 
          labels = c("A", "B","C","D"), ncol=2,nrow = 2)

Figure 25. Venn diagram of TnSeq genes categories and Persistant genes in each pangenome).
Show code
#names(panTnseq_Full)

Upset Diagrams allow more detailed comparisons:

Show code
colorscale <- c("white","aquamarine","aquamarine1","aquamarine2","aquamarine3","aquamarine4")
colorines <- c("cloud"="#bbe59f","shell"="#abc837","persistent"="#339933","Highly permissive"="#FF9966","Intermediate"="#619CFF" ,"Less permissive"= "#3366CC")
panTnseq_Full[17:20] <- lapply(panTnseq_Full[17:20],factor, levels=c("cloud","shell","persistent"))
library(ggVennDiagram)

Attaching package: 'ggVennDiagram'
The following object is masked from 'package:tidyr':

    unite
Show code
#t thermophilus
Tnlist <- append(unstack(panTnseq_Full[,c(1,17)]),unstack(panTnseq_Full[,c(1,10)]))
v1 <- ggVennDiagram(Tnlist,set_color=colorines,label="count",label_alpha=0,show_intersect = FALSE) +
    scale_fill_gradientn(colours=colorscale)
u1 <- ggVennDiagram(Tnlist,force_upset = TRUE,order.set.by = "none", order.intersect.by = "none")
#thermus
Tnlist <- append(unstack(panTnseq_Full[,c(1,18)]),unstack(panTnseq_Full[,c(1,10)]))
v2 <- ggVennDiagram(Tnlist,set_color=colorines,label="count",label_alpha=0,show_intersect = FALSE)+
         scale_fill_gradientn(colours=colorscale)
u2 <- ggVennDiagram(Tnlist,force_upset = TRUE,order.set.by = "none", order.intersect.by = "none")
#thermaceae
Tnlist <- append(unstack(panTnseq_Full[,c(1,19)]),unstack(panTnseq_Full[,c(1,10)]))
v3 <- ggVennDiagram(Tnlist,set_color=colorines,label="count",label_alpha=0,show_intersect = FALSE)+
         scale_fill_gradientn(colours=colorscale)
u3 <- ggVennDiagram(Tnlist,force_upset = TRUE,order.set.by = "none", order.intersect.by = "none")
#deino
Tnlist <- append(unstack(panTnseq_Full[,c(1,20)]),unstack(panTnseq_Full[,c(1,10)]))
v4 <- ggVennDiagram(Tnlist,set_color=colorines,label="count",label_alpha=0,show_intersect = FALSE)+
         scale_fill_gradientn(colours=colorscale)
u4 <- ggVennDiagram(Tnlist,force_upset = TRUE,order.set.by = "none", order.intersect.by = "none")

#ggarrange(u1+labs(title=expression(italic("T. thermophilus"))), v2+labs(title=expression(italic("Thermus"))), v3+labs(title=expression(italic("Thermaceae"))), v4+labs(title=expression(italic("Deinococcaceae"))), ncol=2,nrow=2,common.legend=TRUE,legend = "bottom")
paste("T. thermophilus")
[1] "T. thermophilus"
Show code
u1

Show code
ggsave("Tth_upset.svg",width=4,height=4)
paste("Thermus")
[1] "Thermus"
Show code
u2

Show code
paste("Thermaceae")
[1] "Thermaceae"
Show code
u3

Show code
paste("Deinococcota")
[1] "Deinococcota"
Show code
u4

Show code
#svg("figure4c.svg")
par(mfrow = c(2, 2),mar=c(1,1,1,1))

#thermus termophilus
Tnmat <- rbind(as.data.frame(as.list(panTnseq_Full[,c(1,10)]),col.names=c("Genes","group")),
               as.data.frame(as.list(panTnseq_Full[,c(1,17)]),col.names=c("Genes","group")))
plot(venneuler(na.omit(Tnmat)),col=c("#FF9966","#619CFF" , "#3366CC","#bbe59f","#abc837","#339933"),alpha=0.7,main=expression(italic("T. thermophilus")))

#thermus
Tnmat <- rbind(as.data.frame(as.list(panTnseq_Full[,c(1,10)]),col.names=c("Genes","group")),
               as.data.frame(as.list(panTnseq_Full[,c(1,18)]),col.names=c("Genes","group")))
plot(venneuler(na.omit(Tnmat)),col=c("#FF9966","#619CFF" , "#3366CC","#bbe59f","#abc837","#339933"),alpha=0.7,main=expression(italic("Thermus")))

#termaceae
Tnmat <- rbind(as.data.frame(as.list(panTnseq_Full[,c(1,10)]),col.names=c("Genes","group")),
               as.data.frame(as.list(panTnseq_Full[,c(1,19)]),col.names=c("Genes","group")))
plot(venneuler(na.omit(Tnmat)),col=c("#FF9966","#619CFF" , "#3366CC","#bbe59f","#abc837","#339933"),alpha=0.7,main=expression(italic("Thermaceae")))

#deinococcota
Tnmat <- rbind(as.data.frame(as.list(panTnseq_Full[,c(1,10)]),col.names=c("Genes","group")),
               as.data.frame(as.list(panTnseq_Full[,c(1,20)]),col.names=c("Genes","group")))
plot(venneuler(na.omit(Tnmat)),col=c("#FF9966","#619CFF" , "#3366CC","#bbe59f","#abc837","#339933"),alpha=0.7,main=expression(italic("Deinococcota")))


dev.off()

Enrichment analysis

Show code
kk <- enrichKEGG(gene         = na.omit(panTnseq_Full[panTnseq_Full$cluster!="Highly Permissive"  & panTnseq_Full$thermaceae_i4c5=="persistent",1]),
                 organism     = 'tth',
                 pvalueCutoff = 0.05)
Reading KEGG annotation online: "https://rest.kegg.jp/link/tth/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/tth"...
Show code
p1 <- dotplot(kk, title="Thermaceae core genome")

kk_D <- enrichKEGG(gene         = na.omit(panTnseq_Full[panTnseq_Full$cluster!="Highly Permissive"  & panTnseq_Full$deinococcota_i4c5=="persistent",1]),
                 organism     = 'tth',
                 pvalueCutoff = 0.05)
p2 <- dotplot(kk_D, title="Deinococcota core genome")

ggarrange(p1, p2, 
          labels = c("A", "B"), ncol=1,nrow = 2, heights=c(1,1))

Figure 26. Significantly enriched KEGG Ontology groups made up of Intermediate/Less permissive genes within the Core genome (persistant genes) of Thermaceae (A) or Deinococcota (B).
Show code
uniprot <- bitr_kegg(panTnseq_Full$Genes, fromType="kegg", toType='uniprot', organism='tth')
panTnseq_Full <- panTnseq_Full %>%
  mutate(uni = replace(Genes, Genes %in% uniprot$kegg, uniprot$uniprot))

kk <- enrichKEGG(gene         = na.omit(panTnseq_Full[panTnseq_Full$cluster!="Highly Permissive"  & panTnseq_Full$thermus_i4c5=="persistent",30]),
           organism     = 'tth', keyType = 'uniprot',
           pvalueCutoff = 0.05)
if (is.null(kk)){
  p1 <- NULL
} else {
  if (any(kk@result$p.adjust<0.05)){
  p1 <- dotplot(kk, title="Thermus core genome")
  } else {
    p1 <- NULL
  }
}

kk_t <- enrichKEGG(gene         = na.omit(panTnseq_Full[panTnseq_Full$cluster!="Highly Permissive"  & panTnseq_Full$thermaceae_i4c5=="persistent",30]),
           organism     = 'tth', keyType = 'uniprot',
           pvalueCutoff = 0.05)
if (is.null(kk_t)){
  p2 <- NULL
} else {
  if (any(kk@result$p.adjust<0.05)){
  p2 <- dotplot(kk_t, title="Thermaceae core genome")
  } else {
    p2 <- NULL
  }
}

kk_D <- enrichKEGG(gene         = na.omit(panTnseq_Full[panTnseq_Full$cluster!="Highly Permissive"  & panTnseq_Full$deinococcota_i4c5=="persistent",30]),
           organism     = 'tth', keyType = 'uniprot',
           pvalueCutoff = 0.05)
if (is.null(kk_D)){
  p3 <- NULL
} else {
  if (any(kk@result$p.adjust<0.05)){
  p3 <- dotplot(kk_D, title="Deinococcota core genome")
  } else {
    p3 <- NULL
  }
}

ggarrange(p1, p2, p3,
          labels = c("A", "B", "C"), ncol=1,nrow = 3,heights=c(2,3,3))

4 Compare Tn-seq with prokaryotic “essentialtome” and HB27 transcription

Essential genes in other species

Since there is no clear correlation between gene conservation and the permissivity of Tn5 insertions, we decided to compare our results with previously published results. We used the data from Zhang et al. (2018) (SUPPLEMENTARY DATA 4), which includes the essential/non-essential genes of a wide range of prokaryotic organisms comprising extremophilic archaea (Sulfolobus islandicus and Methanococcus maripaludis) and various mesophilic bacteria (Bacillus subtilis, Bacteroides fragilis, Escherichia coli) as well as the minimal essential genome of Micoplasma JVCI Syn3.0.

Show code
#data from https://www.nature.com/articles/s41467-018-07379-4
essentiality <- read_xlsx("bacteria_essential_genes/41467_2018_7379_MOESM6_ESM.xlsx",sheet=1)
colnames(essentiality)[2] <- "eggNOG_OGs"
panTnseq_Full$eggNOG_OGs <- sub("\\@.*", "", panTnseq_Full$eggNOG_OGs)
world_essential <- merge(panTnseq_Full,essentiality)
#names(world_essential)
 prok <- subset(world_essential,world_essential$`Sulfolobus islandicus M.16.4`=="essential" & world_essential$`Methanococcus maripaludis S2`=="essential" & world_essential$`Bacillus subtilis`=="essential" & world_essential$`Bacteroides fragilis`=="essential" & world_essential$`Escherichia coli`=="essential" & world_essential$`Mycoplasma JCVI Syn3.0`=="essential")

The following table contains the essential genes in all prokaryotic genomes and their HB27 orthologs as annotated by eggNOG. As you can see, out of 53 genes annotated as essentia in all thouse prokaryotic organisms, 6 were Highly Permissive and 27 (50.9%) Intermediate in HB27 TnSeq, but they are again highly conserved as all of them are persistent genes in the T. thermophilus pangenome and near 92.5% are still members of the core genome of Deinococcota pangenome.

Show code
datatable(prok[,c(1,2,10,17:20)],rownames = FALSE,  escape = FALSE, filter="top", extensions = 'Responsive',options = list( pageLength = 25, autoWidth = TRUE ))
Show code
#table to save
library(kableExtra)

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

    group_rows
Show code
kbl(prok[,c(1,4,10,17:20)], align = "c", row.names = FALSE,col.names= c("eggNOG","Description","TnSeq","T. thermophilus","Thermus","Thermaceae","Deinococcota"), caption = "Table 2. Tth TnSeq and Pangenome analysis of selected prokaryotic essential genes from diverse previous works (Zhang et al. 2018). See Methods for details.", "html") %>%
    kable_styling(bootstrap_options = "condensed", full_width = F, position = "center") %>%
    column_spec(1, bold = T, italic=F) #%>% save_kable("table2_new.pdf",self_contained=TRUE)
Table 2. Tth TnSeq and Pangenome analysis of selected prokaryotic essential genes from diverse previous works (Zhang et al. 2018). See Methods for details.
eggNOG Description TnSeq T. thermophilus Thermus Thermaceae Deinococcota
COG0008 Glutamate--tRNA ligase Intermediate persistent persistent persistent persistent
COG0008 Glutamine--tRNA ligase Intermediate persistent persistent persistent persistent
COG0013 Alanine--tRNA ligase Intermediate persistent shell shell shell
COG0016 Phenylalanine--tRNA ligase alpha subunit Highly Permissive persistent persistent persistent persistent
COG0017 Aspartate--tRNA(Asp/Asn) ligase Intermediate persistent persistent persistent persistent
COG0017 Asparagine--tRNA ligase Less Permissive persistent persistent persistent persistent
COG0018 Arginine--tRNA ligase Intermediate persistent persistent persistent persistent
COG0024 Methionine aminopeptidase Less Permissive persistent persistent persistent shell
COG0048 30S ribosomal protein S12 Less Permissive persistent persistent persistent persistent
COG0049 30S ribosomal protein S7 Less Permissive persistent persistent persistent persistent
COG0051 30S ribosomal protein S10 Less Permissive persistent persistent persistent persistent
COG0052 30S ribosomal protein S2 Intermediate persistent persistent persistent persistent
COG0060 Isoleucine--tRNA ligase Intermediate persistent persistent persistent persistent
COG0081 50S ribosomal protein L1 Intermediate persistent persistent persistent persistent
COG0087 50S ribosomal protein L3 Intermediate persistent persistent persistent persistent
COG0088 50S ribosomal protein L4 Less Permissive persistent persistent persistent persistent
COG0089 50S ribosomal protein L23 Intermediate persistent persistent persistent persistent
COG0090 50S ribosomal protein L2 Less Permissive persistent persistent persistent persistent
COG0091 50S ribosomal protein L22 Less Permissive persistent persistent persistent persistent
COG0094 50S ribosomal protein L5 Less Permissive persistent persistent persistent persistent
COG0096 30S ribosomal protein S8 Less Permissive persistent persistent persistent persistent
COG0097 50S ribosomal protein L6 Intermediate persistent persistent persistent persistent
COG0098 30S ribosomal protein S5 Less Permissive persistent persistent persistent persistent
COG0099 30S ribosomal protein S13 Less Permissive persistent persistent persistent persistent
COG0100 30S ribosomal protein S11 Less Permissive persistent persistent persistent persistent
COG0102 50S ribosomal protein L13 Intermediate persistent persistent persistent persistent
COG0103 30S ribosomal protein S9 Less Permissive persistent persistent persistent persistent
COG0124 Histidine--tRNA ligase Less Permissive persistent shell persistent persistent
COG0162 Tyrosine--tRNA ligase Highly Permissive persistent persistent persistent persistent
COG0172 Serine--tRNA ligase Less Permissive persistent persistent persistent persistent
COG0180 Tryptophan--tRNA ligase Intermediate persistent persistent persistent persistent
COG0186 30S ribosomal protein S17 Less Permissive persistent persistent persistent persistent
COG0195 Transcription termination/antitermination protein NusA Intermediate persistent persistent persistent persistent
COG0197 50S ribosomal protein L16 Less Permissive persistent persistent persistent persistent
COG0198 50S ribosomal protein L24 Intermediate persistent persistent persistent persistent
COG0199 30S ribosomal protein S14 type Z Less Permissive persistent persistent persistent persistent
COG0201 Protein translocase subunit SecY Intermediate persistent persistent persistent persistent
COG0202 DNA-directed RNA polymerase subunit alpha Intermediate persistent persistent persistent persistent
COG0256 50S ribosomal protein L18 Intermediate persistent persistent persistent persistent
COG0361 Translation initiation factor IF-1 Intermediate persistent persistent persistent persistent
COG0442 Proline--tRNA ligase Highly Permissive persistent persistent persistent persistent
COG0462 Ribose-phosphate pyrophosphokinase Highly Permissive persistent persistent persistent persistent
COG0462 Ribose-phosphate pyrophosphokinase Intermediate persistent persistent persistent persistent
COG0470 DNA polymerase III subunit delta' Intermediate persistent persistent persistent shell
COG0470 DNA polymerase III Intermediate persistent shell shell shell
COG0480 Elongation factor G Highly Permissive persistent persistent persistent persistent
COG0480 Elongation factor G Intermediate persistent persistent persistent persistent
COG0495 Leucine--tRNA ligase Highly Permissive persistent persistent persistent persistent
COG0504 CTP synthase Intermediate persistent persistent persistent persistent
COG0522 30S ribosomal protein S4 Less Permissive persistent persistent persistent persistent
COG0525 Valine--tRNA ligase Intermediate persistent persistent persistent persistent
COG0532 Translation initiation factor IF-2 Intermediate persistent persistent persistent persistent
COG0592 Beta sliding clamp Intermediate persistent persistent persistent persistent

To quantify the correlation between Tn-seq, gene conservation in pangenomes and gene essentiality, we used the data from Rancati et al. (2018), which compares the gene essentiality in diverse bacteria from diverse works. Thus, we downloaded the data from the papers indicated below and parsed the essential genes. To unify nomenclature, we tried to obtain the EggNOG COG code for each gene and used it to merge genes with different names across the different species. All data was parsed with an external R script to generate a common table (all_essentials.csv).

Strain Ref. PMID Assembly FileName
E. coli K12 29463657 GCF_000800765.1 mbo001183726st1.xlsx
M. tuberculosis H37Rv 28096490 GCF_000195955.2 mbo002173137st3.xlsx
M. genitalum G37 16407165 GCA_000027325.1 pnas_0510013103_10013Table2.xls
P. aerugionsa PAO1 25848053 GCF_000006765.1 pnas.1422186112.sd01.xlsx
S. aureus HG003 25888466 GCF_000013425.1 12864_2015_1361_MOESM2_ESM.xlsx
S. pyogenes (5448 & NZ131) 25996237 GCA_000011765.2 srep09838-s1.pdf (Table S3!)
Synechococcus elongatus PCC 7942 26508635 GCF_000012525.1 pnas.1519220112.sd03.xlsx

First, we evaluate the correlation between the essential/indispensable genes with our HB27 TnSeq and pangenome data. As shown below, despite the fact that the bacterial data come from different sources, we see some negative correlation between the number of genomes with essential and dispensable genes. In addition, there is also a moderate positive correlation between gene essentiality and conservation in our pangenome. However, consistent with previous results, the TnSeq results do not correlate with other analyses, suggesting that T. thermophilus is reluctant to reveal its essential genes.

Gene expression in HB27

Finally, we compared Tn-seq, pangenome and transcriptome of HB27 with the abovementioned prokaryotic “essentialtome”. We used the data from Swarts et al. (2015) that contains triplicate data of gene expression in HB27. The expression level was also used for correlation analysis.

Show code
#data from https://www.nature.com/articles/s41467-018-07379-4

essential_mix <- read.csv2("bacteria_essential_genes/all_essentials.csv")
world_essential <- merge(world_essential,essential_mix,by="eggNOG_OGs")
fwrite(world_essential,"bacteria_essential_genes/all_essentials_panTnseq_25feb25.csv", sep=";")


#add rnaseq in HB27
rnaseq <- read_xlsx("bacteria_essential_genes/pone.0124880.s002.xlsx",sheet=1)
names(rnaseq)[2] <- "Genes"
panTnseq_Full2 <-  merge(world_essential,rnaseq[,c(2,7:9)],by="Genes",all.x=TRUE)
panTnseq_Full2[,c(11:16,21:28,50:54)] <- lapply(panTnseq_Full2[,c(11:16,21:28,50:54)], as.numeric)

M <-cor(panTnseq_Full2[,c(11:14,22,24,26,28,50:54)], method = "spearman", use = "pairwise.complete.obs")
testRes <-  cor.mtest(panTnseq_Full2[,c(11:14,22,24,26,28,50:54)], conf.level = 0.99)
corrplot(M, type="lower", p.mat = testRes$p, method = 'circle', insig='blank',
         tl.col="black",tl.srt = 45, addCoef.col ='black', number.cex = 0.5,tl.cex=0.5, col=brewer.pal(n=8, name="Spectral"))

Figure 27. Correlation between esentiality results in diverse bacteria, the HB27 TnSeq and Pangenome variables. Significant (p<0.05) spearman r² are shown.

The detailed table of bacteria essential genes and HB27 TnSeq+Pangenome data can be found in file all_essentials_panTnseq.csv.

Show code
ggplot(as.data.frame(table(world_essential$essential,world_essential$cluster)),aes(x=Var1, y=Freq,group=Var2)) + 
         geom_bar(aes(fill=Var2),stat = "identity",position="fill",color="grey40", alpha=0.8) +  geom_text(aes(label=Freq),size=4, position = position_fill(vjust=0.5) , col = "black")+ xlab("Number genomes with each essential gene") + ylab("TnSeq group (%)") +scale_y_continuous(labels = scales::percent) + theme_bw()+ labs(fill = "TnSeq group")

Overall, multiple correlation analysis shows a moderate correlation between expression levels and ortholog gene essentiality, but slight negative correlation with Tn-seq Z-score. Thus, highly transcribed genes in HB27 are abundant within the highly conserved (persistent) group and overall they would get somewhat less hits in our Tn-seq. Altogether, these results indicate that despite the different lifestyles and phylogenetic distances among the genomes examined in prior essentiality studies, the expected higher conservation of universally essential genes is applicable to Tth, even though, in this case, the Tn-seq approach may not readily disclose most of essential genes.

Show code
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] kableExtra_1.4.0       ggVennDiagram_1.5.2    ggvenn_0.1.10         
 [4] venneuler_1.1-4        rJava_1.0-11           readxl_1.4.3          
 [7] ggpubr_0.6.0           clusterProfiler_4.13.4 RColorBrewer_1.1-3    
[10] corrplot_0.95          ggstats_0.7.0          patchwork_1.3.0       
[13] lubridate_1.9.3        forcats_1.0.0          stringr_1.5.1         
[16] purrr_1.0.2            readr_2.1.5            tidyr_1.3.1           
[19] tibble_3.2.1           tidyverse_2.0.0        dplyr_1.1.4           
[22] DT_0.33                data.table_1.16.2      ggplot2_3.5.1         
[25] formatR_1.14           knitr_1.48            

loaded via a namespace (and not attached):
  [1] rstudioapi_0.17.1       jsonlite_1.8.9          magrittr_2.0.3         
  [4] ggtangle_0.0.3          farver_2.1.2            rmarkdown_2.28         
  [7] ragg_1.3.3              fs_1.6.5                zlibbioc_1.51.2        
 [10] vctrs_0.6.5             memoise_2.0.1           ggtree_3.13.2          
 [13] rstatix_0.7.2           htmltools_0.5.8.1       broom_1.0.7            
 [16] cellranger_1.1.0        Formula_1.2-5           gridGraphics_0.5-1     
 [19] sass_0.4.9              bslib_0.8.0             htmlwidgets_1.6.4      
 [22] plyr_1.8.9              cachem_1.1.0            igraph_2.1.1           
 [25] lifecycle_1.0.4         pkgconfig_2.0.3         fuzzyjoin_0.1.6        
 [28] Matrix_1.7-1            R6_2.5.1                fastmap_1.2.0          
 [31] gson_0.1.0              GenomeInfoDbData_1.2.13 digest_0.6.37          
 [34] aplot_0.2.3             enrichplot_1.25.5       colorspace_2.1-1       
 [37] AnnotationDbi_1.67.0    S4Vectors_0.43.2        crosstalk_1.2.1        
 [40] textshaping_0.4.0       RSQLite_2.3.7           labeling_0.4.3         
 [43] fansi_1.0.6             timechange_0.3.0        httr_1.4.7             
 [46] abind_1.4-8             compiler_4.4.1          bit64_4.5.2            
 [49] withr_3.0.2             backports_1.5.0         BiocParallel_1.39.0    
 [52] carData_3.0-5           DBI_1.2.3               highr_0.11             
 [55] R.utils_2.12.3          ggsignif_0.6.4          tools_4.4.1            
 [58] ape_5.8                 R.oo_1.26.0             glue_1.8.0             
 [61] nlme_3.1-166            GOSemSim_2.31.2         reshape2_1.4.4         
 [64] fgsea_1.31.6            generics_0.1.3          gtable_0.3.6           
 [67] tzdb_0.4.0              R.methodsS3_1.8.2       hms_1.1.3              
 [70] xml2_1.3.6              car_3.1-3               utf8_1.2.4             
 [73] XVector_0.45.0          BiocGenerics_0.51.3     ggrepel_0.9.6          
 [76] pillar_1.9.0            yulab.utils_0.1.7       vroom_1.6.5            
 [79] splines_4.4.1           treeio_1.29.2           lattice_0.22-6         
 [82] bit_4.5.0               tidyselect_1.2.1        GO.db_3.20.0           
 [85] Biostrings_2.73.2       IRanges_2.39.2          svglite_2.1.3          
 [88] stats4_4.4.1            xfun_0.48               Biobase_2.65.1         
 [91] stringi_1.8.4           UCSC.utils_1.1.0        lazyeval_0.2.2         
 [94] ggfun_0.1.7             yaml_2.3.10             evaluate_1.0.1         
 [97] codetools_0.2-20        qvalue_2.37.0           ggplotify_0.1.2        
[100] cli_3.6.3               systemfonts_1.1.0       jquerylib_0.1.4        
[103] munsell_0.5.1           Rcpp_1.0.13             GenomeInfoDb_1.41.2    
[106] png_0.1-8               parallel_4.4.1          blob_1.2.4             
[109] DOSE_3.99.1             viridisLite_0.4.2       tidytree_0.4.6         
[112] scales_1.3.0            crayon_1.5.3            rlang_1.1.4            
[115] cowplot_1.1.3           fastmatch_1.1-4         KEGGREST_1.45.1        
Back to top

References

Rancati, Giulia, Jason Moffat, Athanasios Typas, and Norman Pavelka. 2018. “Emerging and evolving concepts in gene essentiality.” Nature Reviews. Genetics 19 (1): 34–49. https://doi.org/10.1038/nrg.2017.74.
Swarts, Daan C., Jasper J. Koehorst, Edze R. Westra, Peter J. Schaap, and John van der Oost. 2015. “Effects of Argonaute on Gene Expression in Thermus thermophilus.” PloS One 10 (4): e0124880. https://doi.org/10.1371/journal.pone.0124880.
Zhang, Changyi, Alex P. R. Phillips, Rebecca L. Wipfler, Gary J. Olsen, and Rachel J. Whitaker. 2018. “The Essential Genome of the Crenarchaeal Model Sulfolobus Islandicus.” Nature Communications 9 (1): 4908. https://doi.org/10.1038/s41467-018-07379-4.

MRR, 2024