TnSeq Genes classification

Author

Affiliation

Modesto

Department of Biochemistry, UAM

Published

April 26, 2024

Modified

March 28, 2025

Classification of genes by Z-scores

First, let’s check the density plots, to see how the data are grouped.

Show code

#read data
scores <- read.csv2("scores80_11nov2024.csv")
scores <- cbind(rownames(scores),scores[,1:4])
samples <- c("Ppol_1"="TNB01", "Ppol_2"="TNB09","HB27_1"="TNB03", "HB27_2"="TNB07")
names(scores) <- c("Genes",names(samples))
scores[,2:5] <- lapply(scores[,2:5],as.numeric)
scores$mean <- apply(scores[,2:5], 1, mean,na.rm=TRUE)
#replace NAs for the mean
#for (i in 2:5){
#  for (j in 1:nrow(scores)){
#    scores[j,i][is.na(scores[j,i])] <- scores[j,6]
#  }
#}
my.cols <- brewer.pal(5, "Paired")
#density plot
den <- ggplot(data=stack(scores[,2:6]))+geom_density(aes(x=as.numeric(values),fill=ind,color=ind), alpha=0.4)+
    scale_color_manual(values=c(my.cols)) + scale_fill_manual(values=c(my.cols)) +
    geom_vline(aes(xintercept=-0.65), color="grey", linetype="dashed")+
    geom_vline(aes(xintercept=0.55), color="orange", linetype="dashed") + 
  xlab("Z-scores") + theme_bw()+ylab("Gene density")+ guides(alpha="none",color="none")+labs(fill="Sample")
#Hist
hist <- ggplot(data=stack(scores[,2:5]))+geom_histogram(aes(x=as.numeric(values),fill=ind,color=ind, alpha=0.2),binwidth=0.2)+
scale_color_manual(values=c(my.cols)) + scale_fill_manual(values=c(my.cols)) +
#  scale_color_brewer(palette="Paired") + scale_fill_brewer(palette="Paired")+   
  geom_vline(aes(xintercept=-0.65), color="grey", linetype="dashed")+
     geom_vline(aes(xintercept=0.55),
                color="orange", linetype="dashed") + xlab("Z-scores") + facet_grid(~ind) +
     theme_bw()+ylab("Gene count")+guides(alpha="none",color="none")+labs(fill="Sample")

ggarrange(den, hist, 
          labels = c("A", "B"), ncol=1,nrow = 2)

Figure 9. Distribution of Z-scores in density (A) and histogram (B) plots.

Show code

ggsave("fig2b_new.svg",hist,width=6,height=3.6)

Consistent with Figure 8, we can see that most genes are clustered at a Z-score <0, with a peak around -0.5 (grey line) and a shoulder centered at 1 (orange line). This indicates that most genes have a low number of insertions and that there is a gradient of genes up to a very high number of insertions.

In previous bacteria, the TnSeq data follow a bimodal distribution consisting of an exponential distribution of essential genes and a gamma family curve for the non-essential genes, with an inflection point in between (Goodall et al. 2018.; Moule et al. 2014; Valentino et al. 2014; Higgins et al. 2020; Ramsey et al. 2020; A. Ghomi et al. 2024), even in polyploid bacteria (Rubin et al. 2015). However, there is no inflection point in our data, so we confirm that there are no clear groups of essential/non-essential genes and only a gamma distribution can be fitted (p-value 2x10^-16).

To make an approximate classification of the genes, we will perform an unsupervised cluster analysis. A similar approach was also used in a recent work (A. Ghomi et al. 2024), using DBSCAN algorithm. However, in that work they also have bimodal curve and more clear density separation of the score index. With our data, we obtained two large groups with wide range of overlapping index that was no very informative.

Show code

cl<-dbscan(scores[,c(2:5)],eps=0.65,MinPts = 200)

Warning in dbscan(scores[, c(2:5)], eps = 0.65, MinPts = 200): converting
argument MinPts (fpc) to minPts (dbscan)!

Show code

hullplot(scores[,c(2:5)],cl$cluster)

Show code

tmp <- cbind(stack(scores[,2:5]),scores[,6],cl$cluster)
names(tmp) <- c("values","ind","mean","cluster")
ggplot(data=tmp) +
  geom_point(aes(x=reorder(1:nrow(tmp),tmp[,3],decreasing=TRUE),y=values,color=as.factor(cluster)), alpha=0.7)+
  theme_classic() +
  scale_x_discrete(expand = c(0.01, 0)) + 
  theme(axis.text.x = element_blank(),axis.text.y=element_text(size=12,face="bold")) +
  ylab("Z-Score (log2)") + xlab("Gene") + facet_grid(~tmp[,2]) +
  scale_color_brewer(palette="Paired") + theme(legend.position = c(0.9, 0.7)) +
  labs(color="")

Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.

Thus, we decided to use a more classical approach and compare K-means and PAM clustering. First, we will test the number of clusters that allow us to better classify the genes.

Show code

#wss o silhoutte?
wss <- fviz_nbclust(scores[,2:5], FUN = hcut, method = "wss" , linecolor = "darkblue") + labs(title= "Elbow (WSS)") 
si <- fviz_nbclust(scores[,2:5], FUN = hcut, method = "silhouette", linecolor = "darkblue")+ labs(title= "SILHOUETTE") 

#kmeans, pam 
pam <- fviz_nbclust(x = scores[,2:5], FUNcluster = cluster::pam, method = "silhouette", k.max = 10) + labs(title= "PAM clustering") 
k <- fviz_nbclust(x = scores[,2:5], FUNcluster = kmeans, method = "silhouette", k.max = 10) + labs(title= "K-means clustering")



ggarrange(wss, si, pam,k,
          labels = c("A", "B","C","D"), ncol=2,nrow = 2)

Figure 12. Determination of best number of clusters for the TnSeq data, using the elbow (A) or silhouette (B) method. Panels (C) and (D) show the number of clusters by PAM and K-means, respectively.)

Now, we will compare all the clustering methods, DBSCAN and K-means and PAM with two clusters. We will select the method that yield a higher average silhouette value.

Show code

#DBSCAN silhouette
sil.dbscan <- silhouette(cl$cluster, dist(scores[,c(2:5)]))
s1 <- fviz_cluster(object = cl, data = scores[,c(2:5)], ellipse.type = "t",labelsize = 0,
              repel = FALSE) +
     theme_bw() + scale_colour_manual(values = c("#0073c2", "#efc000"))+ scale_fill_manual(values = c("#0073c2", "#efc000")) +
     labs(title = "DBSCAN clustering") +
     theme(legend.position = "none")
s2 <- fviz_silhouette(sil.dbscan, palette="jco")+  theme(    panel.background = element_rect(fill = "white"))

  cluster size ave.sil.width
0       0  755          0.16
1       1 1508          0.74

Show code

#kmeans
kmclust2 <- eclust(scores[,2:5], k=2, FUNcluster="kmeans", hc_metric="manhattan",graph=F)
k1 <- fviz_cluster(object = kmclust2, data = datos, ellipse.type = "t",labelsize = 0,
             repel = FALSE) + scale_colour_manual(values = c("#0073c2", "#efc000"))+ scale_fill_manual(values = c("#0073c2", "#efc000")) +
  theme_bw() +
  labs(title = "Kmeans clustering") +
  theme(legend.position = "none")

k2 <- fviz_silhouette(kmclust2, palette="jco")+  theme(
  panel.background = element_rect(fill = "white"))

  cluster size ave.sil.width
1       1 1797          0.69
2       2  466          0.37

Show code

#pams
set.seed(123)
pam_clusters <- pam(x= scores[,2:5], k=2, metric = "manhattan")


#visualization
p1 <- fviz_cluster(object = pam_clusters, data = scores[,2:5], ellipse.type = "t",labelsize = 0,  repel = FALSE) + scale_colour_manual(values = c("#0073c2", "#efc000"))+ scale_fill_manual(values = c("#0073c2", "#efc000")) +
  theme_bw() +
  labs(title = "PAM clustering") +
  theme(legend.position = "none")


p2 <- fviz_silhouette(pam_clusters, palette="jco")+    theme(
  panel.background = element_rect(fill = "white"))

  cluster size ave.sil.width
1       1 1598          0.76
2       2  665          0.26

Show code

ggarrange(s1,s2,k1, k2, p1,p2,
          labels = c("A", "","B","","C",""), ncol=2,nrow = 3)

Figure 13. Comparison of clustering method. Cluster results and silhouette plots for DBSCAN (A), K-Means (B) and PAM (C) clustering method are shown.

The Kmeans clustering is slightly better as its clusters area are closer to the silhouette mean, thus we will keep this clusters and name them as “Highly Permissive” and “Intermediate” genes. As shown in the following plot, both groups are well separated.

Show code

#full table
#subset the ~10% genes with less insertions
datos.j = cbind(scores, cluster= kmclust2$cluster)

ggplot(data=cbind(datos.j$cluster,stack(datos.j[,2:5]))) +xlab("Z-Score (log2)") + ylab("Gene density") + 
     geom_density(aes(x=as.numeric(values),group=as.factor(`datos.j$cluster`), color=as.factor(`datos.j$cluster`), fill=as.factor(`datos.j$cluster`)), alpha=0.7)+
  scale_fill_manual(values=c( "#619CFF" ,"#3366CC"))+
  scale_color_manual(values=c( "#619CFF" ,"#3366CC"))+
    theme_classic() + facet_grid(~ind)  +labs(fill="Cluster",color="Cluster")

Figure 14. Density plot of genes by cluster in each sample.

The first group contains the genes with higher Z-score mean and the second large group contains the genes with less insertions. To highlight the top genes with less insertions, we empirically select the top 20% genes with less insertions as “Less Permissive” genes, corresponding to 0 genes with a Z-score < -0.6. This subset is an artificial creation within the “Intermediate” cluster, not distinctly separate from it (Figure 16), but useful for data representation and discussion.

Show code

#top 20% of genes with lower z-score
less <- head(datos.j[order(datos.j$mean, decreasing = FALSE),1],round(nrow(datos.j)*.20,0))
datos.j$cluster[datos.j$Genes %in% less] <- 3

datos.j$cluster <- as.factor(datos.j$cluster)

datos.j <- datos.j %>% 
    mutate(across(c('Ppol_1', 'Ppol_2',"HB27_1","HB27_2","mean"), round, 3)) %>% 
mutate(cluster = case_when(cluster == 1 ~ "Intermediate", cluster == 2 ~ "Highly Permissive",cluster==3 ~ "Less Permissive"))

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(...)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.

  # Previously
  across(a:b, mean, na.rm = TRUE)

  # Now
  across(a:b, \(x) mean(x, na.rm = TRUE))

The separation between the “Intermediate” and “Highly Permissive” cluster is not very clear, as expected. Thus, for some analysis, we will consider these groups as only one same cluster. The following table contain the genes Z-score and classification in three groups.

Show code

datatable(datos.j,rownames = FALSE,  escape = FALSE, filter="top", extensions = 'Responsive',options = list( pageLength = 25, autoWidth = TRUE ))

Show code

#save final table
write.table(datos.j,"scores_clusters_24feb2025.csv",sep=";",row.names=FALSE)

Plot Z-scores per gene and groups

Let’s see how are the groups in each sample. Note that the genes are ordered by their mean score.

Show code

tmp <- cbind(stack(datos.j[,2:5]),datos.j[,6:7])

Warning in data.frame(..., check.names = FALSE): row names were found from a
short variable and have been discarded

Show code

p1 <- ggplot(data=tmp) +
  geom_point(aes(x=reorder(1:nrow(tmp),tmp[,3],decreasing=TRUE),y=values,color=cluster), alpha=0.7)+
  theme_classic() +
  scale_x_discrete(expand = c(0.01, 0)) + 
  theme(axis.text.x = element_blank(),axis.text.y=element_text(size=12,face="bold")) +
  ylab("Z-Score (log2)") + xlab("Gene") + facet_grid(~tmp[,2]) +
  scale_color_manual(values=c("#FF9966", "#619CFF" ,"#3366CC")) + theme(legend.position = c(0.9, 0.7)) +
  labs(color="")
#histogram
#ggplot(data=tmp) +xlab("Z-Score (log2)") + ylab("Gene density") + 
 #    geom_histogram(aes(x=as.numeric(values),color=cluster, fill=cluster), alpha=0.7, binwidth=0.2)+
  #   theme_classic() + facet_grid(~ind) 
#final plot for paper
p2 <- ggplot(data=datos.j[,6:7]) +xlab("Z-Score") + ylab("Gene density") +
     geom_histogram(aes(x=as.numeric(mean),group=as.factor(cluster), color=cluster, fill=cluster),binwidth=0.1, alpha=0.7)+
     theme_classic()   +labs(fill="Group",color="Group")+scale_fill_manual(values=c("#FF9966","#619CFF" , "#3366CC"))+scale_color_manual(values=c("#FF9966", "#619CFF" ,"#3366CC"))+guides(fill=guide_legend(position = "inside"))+theme(legend.position.inside = c(0.8,0.7))
ggsave("fig3a.pdf",p2,width=4,height=2.5)
ggarrange(p1, p2,
          labels = c("A", "B"), ncol=1)

Figure 16. Plot of Tn insertion Z-scores per gene and group. Density plot in all samples (A) and mean values (B). Values were sorted by the average of all samples.

Comparing WT vs Ppol strains

Although there is overall no differences between strains, we are going to check in detail if there is any gene with statistically significant more/less Tn insertions.

Show code

#meethod from https://lashlock.github.io/compbio/R_presentation.html
#we need to use raw counts as integers, no Z-scores and only complete cases
countData <- read.csv('tnseq_dd_counts_scores_all.csv', header = TRUE, sep = ";")
countData <- na.omit(countData[,c(1,7,9,11,13)])
countData[,2:5] <- lapply(countData[,2:5], as.integer)

metaData <- data.frame(names(countData[,2:5]),c("Ppol","Ppol","HB27","HB27"),c("Exp1","Exp2","Exp1","Exp2"))

names(metaData) <- c("Sample","Strain","Experiment")

dds <- DESeqDataSetFromMatrix(countData=countData, 
                              colData=metaData, 
                              design=~Strain, tidy = TRUE)

Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors

Show code

dds <- DESeq(dds)

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

-- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

final dispersion estimates

fitting model and testing

Show code

res <- results(dds)

res <- res[order(res$padj),]

res$diffexpressed <- "NO"
res$diffexpressed[res$log2FoldChange > 2 & res$padj < 0.01] <- "UP"
res$diffexpressed[res$log2FoldChange < -2 & res$padj < 0.01] <- "DOWN"

library(ggrepel)
# plot adding up all layers we have seen so far
ggplot(data=res, aes(x=log2FoldChange, y=-log10(pvalue), col=diffexpressed, label=row.names(res))) +
        geom_point() + 
        theme_minimal() +
        geom_text_repel() +
        scale_color_manual(values=c("blue", "black", "red")) #+

Warning: Removed 1315 rows containing missing values or values outside the scale range
(`geom_point()`).

Warning: Removed 1315 rows containing missing values or values outside the scale range
(`geom_text_repel()`).

Warning: ggrepel: 935 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Figure 17. Volcano plot of differential Tn5 target genes in HB27 vs. Ppol strains (fold change >4, p.adj<0.01).

Show code

        #geom_vline(xintercept=c(-2, 2), col="red") +
        #geom_hline(yintercept=-log10(0.05), col="red")

As we can see there are genes with significant differences between samples.

Genes functional classification

COG

Show code

nog <- read_xlsx("00_raw/refs/MM_x4qgl_mt.emapper.annotations.xlsx",sheet=1,skip=2)
nog$seed_ortholog <- gsub("262724.","",nog$seed_ortholog)
names(nog)[2] <- "Genes"

After calculating the Z-scores, we are going to group the genes by their functions, using COG and KEGG functional categories. Functional annotation was obtained at EggNog mapper website Cantalapiedra et al. (2021). We could annotate 2006 (92.4%) of the HB27 proteins.

Show code

#load COG list
cog <- read.table("scripts/cog_list.csv",sep=";",header=TRUE)
cog <- cog[-26,]

#summarize and split categories
categories <- c()
for (i in 1:nrow(cog)){
  categories <- grep(cog$category[i],nog$COG_category)
  cog$sum[i] <- length(categories)
}

#plot freqs
ggplot(cog,aes(y=as.numeric(sum), x=definition,color=group,fill=group)) + 
    geom_bar(stat="identity", position="stack", alpha=0.8) + ylab("Number of genes") + xlab("") + coord_flip() + scale_y_continuous(breaks = seq(0, 500, by = 100))+
  theme_classic() + theme(text=element_text(size=15),legend.text=element_text(size=9))

Figure 18. COG groups annotated in HB27 genes

After the incorporation of COG and KEGG annotations, the final TnSeq table is exported as tnSeq_dd_full.csv.

Show code

#merge
final <- merge(datos.j,nog[,c(2,5,7,8,12)], all.x=TRUE)

final$COG_category <- final$COG_category %>% replace_na('-')
for (i in 1:nrow(final)){
  if (final$COG_category[i]!='-'){
      final$COG_group[i] <- cog$group[cog$category == substr(final$COG_category,1,1)[i]]  } else{
        final$COG_group[i]=='-'
    }
}
final$COG_group <- as.factor(final$COG_group)

#add Mario gene annotation
mario <- readxl::read_excel("00_raw/refs/HB8_HB27_final.xlsx")

New names:
• `` -> `...8`
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`

Show code

mario <- subset(mario,!is.na(mario$HB27))
mario <- mario[,c(7,4,3,6,5,2)]
colnames(mario) <- c("Genes","Genes_HB8","locus_tag","name","Description","aa")
final <- merge(final,mario,by="Genes", all.x=TRUE)
final$mean <- apply(final[,2:5],1,mean)
final$sd <- apply(final[,2:5],1,sd)
final[grep("tRNA-",final$name),9] <- "J"
final[grep("tRNA-",final$name),11] <- "ko:K14228"
final[grep("rRNA",final$name),11] <- "ko:K01980"
final <- final %>% replace(is.na(.), '-')
final[,8] <- substr(final[,8],1,7)
final <- final[,c(1,7:15,2:6,18,16)]
write.table(final[!duplicated(final),],"tnSeq_dd_full.csv", sep=";", row.names=FALSE)

#correlation
wt <- ggplot(data = final, aes(x=HB27_1,y=HB27_2,color=COG_group))+geom_point(size=3,alpha=0.6) +  theme_linedraw() + 
  xlab("HB27_1") + ylab("HB27_2") + stat_poly_line(color="#1F78B4",fill="#1F78B4") +
  stat_poly_eq(use_label(c("adj.R2", "p"))) +guides(color = guide_legend(nrow = 2))


ppol <- ggplot(data = final, aes(x=Ppol_1,y=Ppol_2,color=COG_group))+geom_point(size=3,alpha=0.6)  + theme_linedraw() +
  xlab("Ppol_1") + ylab("Ppol_2") +  stat_poly_line(color="#33A02C",fill="#33A02C") +
  stat_poly_eq(use_label(c("adj.R2", "p"))) 


ggarrange(wt, ppol,
          labels = c("A", "B"), ncol=1,common.legend = TRUE,legend="bottom")

Figure 19. Tn insertion scores per gene. Correlation between samples in WT (A) and Ppol (B) strains. Genes are colored by COG group

Now, we are going to plot the genes by groups and COG

Show code

stats <- as.data.frame(table(final$COG_category,final$cluster))
categories <- c()
for (i in 1:nrow(cog)){
  categories <- grep(cog$category[i],stats$Var1)
  categories_D <- subset(stats[categories,], Var2=="Highly Permissive")
  categories_R <- subset(stats[categories,], Var2=="Intermediate")
  categories_T <- subset(stats[categories,], Var2=="Less Permissive")
  cog$Less_permissive[i] <- sum(categories_T$Freq)
  cog$Intermediate[i] <- sum(categories_R$Freq)
  cog$Highly_permissive[i] <- sum(categories_D$Freq)
}
cog2 <- cbind(cog[,1:3],stack(cog[,5:7]))

Warning in data.frame(..., check.names = FALSE): row names were found from a
short variable and have been discarded

Show code

cog2 <- cog2 %>% replace(is.na(.), 0)
for (i in 1:nrow(cog2)){
  cog2$ratio[i] <- cog2$values[i]*100/sum(cog2$values[cog2$category==cog2$category[i]])
}
cog2$ind <- factor(cog2$ind,levels=c("Highly_permissive","Intermediate","Less_permissive"))


#plot 
ggplot(cog2[!cog2$values==0,],aes(y=ratio, x=definition, group=ind, fill=ind)) + 
    geom_bar(stat="identity", position="stack",color="grey40", alpha=0.8) + ylab("Gene category (%)") + xlab("") + coord_flip() + scale_y_continuous(breaks = seq(0, 100, by = 10)) +   theme_bw() + theme(text=element_text(size=15))+scale_fill_manual(values=c("#FF9966", "#619CFF" ,"#3366CC"))+labs(fill="Group")

Figure 20. Distribution of gene groups by COG functions

Show code

#ggsave("figs5a.pdf",width=10,height=3.5)

Show code

ggplot(cog2[!cog2$values==0 & cog2$ind=="Less_permissive",],aes(y=ratio, x=definition, group=ind)) + 
     geom_bar(stat="identity", position="stack", fill="steelblue", alpha=0.8) + ylab("Gene category (%)") + xlab("") + coord_flip() + scale_y_continuous(breaks = seq(0, 100, by = 10)) +   theme_bw() + theme(text=element_text(size=15))

Figure 21. Abundance of Less Permissive genes amongst COG functions

Show code

#ggsave("figs5b.pdf",width=9,height=4)

KEGG

EggNog mapper could assign KEGG Ontology (KO) group to 1352 (62.28%) of the HB27 proteins. After revision of the KO annotation, we manually incorporated several missing tRNA genes to improve the statistics, using K14228 (tRNA-leu).

Show code

#parse data and count
ko <- fread("scripts/ko.csv",sep=";",header=TRUE)
trna <- read.csv2("scripts/tRNAs.csv",header = FALSE)
final$KEGG_ko <- gsub("ko:","",final$KEGG_ko)
#tRNAs annotation is not very good, so we corrected for statistics
#all tRNAs not annotated were masked as K14228 (tRNA-leu)
final$KEGG_ko[final$Genes %in% trna$V1] <- "K14228"

duplicate_ko_rows <- function(data) {
  data %>% 
    mutate(reaction_ko = strsplit(KEGG_ko, ",")) %>%  # Split ko annotations
    unnest(reaction_ko) %>%                         # Unnest each annotation into separate rows
    select(-KEGG_ko)                            # Keep original columns except split annotations
}
datos_unnested <- duplicate_ko_rows(final)

stats <- summarise(group_by(datos_unnested,reaction_ko,cluster),count =n())

`summarise()` has grouped output by 'reaction_ko'. You can override using the
`.groups` argument.

Show code

stats <- plyr::join(stats,ko[,c(1:3,5)])

Joining by: reaction_ko

Show code

#xtabs
xkegg <- as.data.frame(xtabs(count~a_class+b_class+pathway+cluster,data=stats))
xkegg <- xkegg[!xkegg$Freq==0,]

for (i in 1:nrow(xkegg)){
  xkegg$ratio_b[i] <- xkegg$Freq[i]*100/sum(xkegg$Freq[xkegg$b_class==xkegg$b_class[i]])
}


xkegg$ratio_b <- as.numeric(xkegg$ratio_b)

xkegg$a_class <- as.factor(xkegg$a_class)
xkegg$b_class <- as.factor(xkegg$b_class)
xkegg$cluster <- factor(xkegg$cluster,levels=c("Less Permissive","Intermediate","Highly Permissive"))

#plot main kegg categories
#ggplot(xkegg,aes(y=ratio_b, x=b_class, group=factor(ind), fill=ind)) +   geom_bar(stat="identity", position="stack",color="grey40",linewidth=0.2, alpha=0.7) + ylab("KO group (%)") + xlab("") + theme_bw() + theme(text=element_text(size=15), axis.text.x = element_text( angle = 45,  hjust = 1, size = 12),plot.margin =margin(l=100,b=10,t=5,r=5) )   + facet_grid(~factor(a_class),scales="free",space="free",drop=TRUE,labeller = as_labeller(factor(xkegg$a_class), default=label_wrap_gen(14)))

#detailed kegg plot
ggplot(xkegg, aes(y=Freq, x=pathway, group=cluster, fill=cluster)) + 
     geom_bar(stat="identity", position="stack",color="grey40",linewidth=0.2, alpha=0.7) + ylab("KO group members") + xlab("") + theme_bw() + theme(text=element_text(size=15), axis.text.x = element_text( angle = 45,  hjust = 1, size = 12),plot.margin =margin(l=100,b=10,t=5,r=5) )  +
     facet_wrap(~a_class+b_class,scales="free",drop=TRUE,ncol=3, labeller = label_wrap_gen(30))  + theme(legend.position = "none")

Figure 22. Ratio of Highly Permissive (blue), Intermediate (green) and Less Permissive (red) genes in each KEGG Ontology group

Show code

#ggsave("figS5c.pdf",width=9,height=4)

KEGG Ontology groups with more Less Permissive genes

Now we, will plot only the “Less Permissive” genes.

Show code

ggplot(xkegg[xkegg$cluster=="Less Permissive" & xkegg$Freq>5,], aes(y=Freq, x=pathway, group=cluster)) +     geom_bar(stat="identity", position="stack",linewidth=0.2, alpha=0.7, fill="steelblue") + ylab("KO group members") + xlab("") + theme_bw()+ theme(text=element_text(size=15),plot.margin =margin(l=100,b=10,t=5,r=5) ) + coord_flip()

Figure 23. KEGG ontology categories for Less Permissive genes

Ribosome, transporters and some energy metabolism proteins are the categories with more Less Permissive genes.

KEGG pathways with Intermediate/Less Permissive genes

Show code

repli <- nog$Genes[grep("map03030",nog$KEGG_Pathway)]
repli <- datos.j[datos.j$Genes %in% repli,]
repli_link <- paste0("https://www.kegg.jp/kegg-bin/show_pathway?tth03030/",paste0(repli[repli$cluster!="Highly Permissive",1],collapse="/"))


ber <- nog$Genes[grep("map03410",nog$KEGG_Pathway)]
ber <- datos.j[datos.j$Genes %in% ber,]
ber_link <- paste0("https://www.kegg.jp/kegg-bin/show_pathway?tth03410/",paste0(ber[ber$cluster!="Highly Permissive",1],collapse="/"))

ner <- nog$Genes[grep("map03420",nog$KEGG_Pathway)]
ner <- datos.j[datos.j$Genes %in% ner,]
ner_link <- paste0("https://www.kegg.jp/kegg-bin/show_pathway?tth03420/",paste0(ner[ner$cluster!="Highly Permissive",1],collapse="/"))

mmr <- nog$Genes[grep("map03430",nog$KEGG_Pathway)]
mmr <- datos.j[datos.j$Genes %in% mmr,]
mmr_link <- paste0("https://www.kegg.jp/kegg-bin/show_pathway?tth03430/",paste0(mmr[mmr$cluster!="Highly Permissive",1],collapse="/"))

hr <- nog$Genes[grep("map03440",nog$KEGG_Pathway)]
hr <- datos.j[datos.j$Genes %in% hr,]
hr_link <- paste0("https://www.kegg.jp/kegg-bin/show_pathway?tth03430/",paste0(hr[hr$cluster!="Highly Permissive",1],collapse="/"))

rpo <- nog$Genes[grep("map03020",nog$KEGG_Pathway)]
rpo <- datos.j[datos.j$Genes %in% rpo,]
rpo_link <- paste0("https://www.kegg.jp/kegg-bin/show_pathway?tth03020/",paste0(rpo[rpo$cluster!="Highly Permissive",1],collapse="/"))

ribo <- nog$Genes[grep("map03010",nog$KEGG_Pathway)]
ribo <- datos.j[datos.j$Genes %in% ribo,]
ribo_link <- paste0("https://www.kegg.jp/kegg-bin/show_pathway?tth03010/",paste0(ribo[ribo$cluster!="Highly Permissive",1],collapse="/"))
ribo_link <- paste0(ribo_link,paste0("/",paste0(c("TT_C3035","TT_C3056","TT_C3036","TT_C3055","TT_C3024","TT_C3048"),collapse ="/")))

As an example, you can see some of the HB27 RNA transcrition & translation and DNA replication & repair pathways in the following links: Ribosome, RNA Polymerase, DNA Replication, Base Excision Repair, Mismatch Repair, Nucleotide Excision Repair, and Homologous Recombination. The genes with orthologs in HB27 are depicted with green background and the Intermediate/Less Permissive genes are highlighted in red.

Enrichment analysis

Show code

kk <- enrichKEGG(gene         = datos.j[datos.j$cluster=="Less Permissive",1],
                 organism     = 'tth',
                 pvalueCutoff = 0.05)

Reading KEGG annotation online: "https://rest.kegg.jp/link/tth/pathway"...

Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/tth"...

Show code

if (sum(kk@result$p.adjust<0.05)!=0){
  p1 <- dotplot(kk, title="Significant enriched KO")
} else {
  p1 <- c()
}


mkk <- enrichMKEGG(gene         = datos.j[datos.j$cluster=="Less Permissive",1],
                 organism     = 'tth',
                   pvalueCutoff = 0.05,
                   qvalueCutoff = 0.05)

Reading KEGG annotation online: "https://rest.kegg.jp/link/tth/module"...

Reading KEGG annotation online: "https://rest.kegg.jp/list/module"...

Show code

if (sum(mkk@result$p.adjust<0.05)!=0){
p2 <- dotplot(mkk, title="Significant enriched KEGG modules")
}else {
  p2 <- c()
}

p1

Figure 24. Significantly enriched KEGG Ontology groups made up of Less Permissive genes by KEGG ontology groups. There is no enriched KEGG module in this group

Show code

#ggarrange(p1, p2, 
 #         labels = c("A", "B"), ncol=1,nrow = 2,heights=c(1,3))

Show code

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ggrepel_0.9.6               dbscan_1.2.2               
 [3] DESeq2_1.46.0               SummarizedExperiment_1.36.0
 [5] Biobase_2.66.0              MatrixGenerics_1.18.1      
 [7] matrixStats_1.5.0           GenomicRanges_1.58.0       
 [9] GenomeInfoDb_1.42.3         IRanges_2.40.1             
[11] S4Vectors_0.44.0            BiocGenerics_0.52.0        
[13] clusterProfiler_4.14.6      ggpmisc_0.6.1              
[15] ggpp_0.5.8-1                cluster_2.1.8.1            
[17] factoextra_1.0.7            details_0.4.0              
[19] ggpubr_0.6.0                lubridate_1.9.4            
[21] forcats_1.0.0               stringr_1.5.1              
[23] purrr_1.0.4                 readr_2.1.5                
[25] tidyr_1.3.1                 tibble_3.2.1               
[27] tidyverse_2.0.0             dplyr_1.1.4                
[29] RColorBrewer_1.1-3          readxl_1.4.5               
[31] DT_0.33                     data.table_1.17.0          
[33] ggplot2_3.5.1               formatR_1.14               
[35] knitr_1.50                  BiocManager_1.30.25        

loaded via a namespace (and not attached):
  [1] splines_4.4.1           ggplotify_0.1.2         R.oo_1.27.0            
  [4] cellranger_1.1.0        confintr_1.0.2          lifecycle_1.0.4        
  [7] rstatix_0.7.2           lattice_0.22-6          MASS_7.3-65            
 [10] crosstalk_1.2.1         backports_1.5.0         magrittr_2.0.3         
 [13] sass_0.4.9              rmarkdown_2.29          jquerylib_0.1.4        
 [16] yaml_2.3.10             ggtangle_0.0.6          cowplot_1.1.3          
 [19] DBI_1.2.3               abind_1.4-8             zlibbioc_1.52.0        
 [22] R.utils_2.13.0          yulab.utils_0.2.0       GenomeInfoDbData_1.2.13
 [25] enrichplot_1.26.6       tidytree_0.4.6          MatrixModels_0.5-3     
 [28] svglite_2.1.3           codetools_0.2-20        DelayedArray_0.32.0    
 [31] DOSE_4.0.0              tidyselect_1.2.1        aplot_0.2.5            
 [34] UCSC.utils_1.2.0        farver_2.1.2            jsonlite_1.9.1         
 [37] Formula_1.2-5           survival_3.8-3          systemfonts_1.2.1      
 [40] tools_4.4.1             treeio_1.30.0           ragg_1.3.3             
 [43] Rcpp_1.0.14             glue_1.8.0              gridExtra_2.3          
 [46] SparseArray_1.6.2       xfun_0.51               qvalue_2.38.0          
 [49] withr_3.0.2             fastmap_1.2.0           SparseM_1.84-2         
 [52] digest_0.6.37           timechange_0.3.0        R6_2.6.1               
 [55] gridGraphics_0.5-1      textshaping_1.0.0       colorspace_2.1-1       
 [58] GO.db_3.20.0            RSQLite_2.3.9           R.methodsS3_1.8.2      
 [61] generics_0.1.3          ggsci_3.2.0             httr_1.4.7             
 [64] htmlwidgets_1.6.4       S4Arrays_1.6.0          pkgconfig_2.0.3        
 [67] gtable_0.3.6            blob_1.2.4              XVector_0.46.0         
 [70] htmltools_0.5.8.1       carData_3.0-5           fgsea_1.32.0           
 [73] scales_1.3.0            png_0.1-8               ggfun_0.1.8            
 [76] rstudioapi_0.17.1       tzdb_0.4.0              reshape2_1.4.4         
 [79] nlme_3.1-167            cachem_1.1.0            parallel_4.4.1         
 [82] AnnotationDbi_1.68.0    desc_1.4.3              pillar_1.10.1          
 [85] grid_4.4.1              vctrs_0.6.5             car_3.1-3              
 [88] evaluate_1.0.3          cli_3.6.4               locfit_1.5-9.12        
 [91] compiler_4.4.1          rlang_1.1.5             crayon_1.5.3           
 [94] ggsignif_0.6.4          labeling_0.4.3          plyr_1.8.9             
 [97] fs_1.6.5                stringi_1.8.4           BiocParallel_1.40.0    
[100] munsell_0.5.1           Biostrings_2.74.1       lazyeval_0.2.2         
[103] GOSemSim_2.32.0         quantreg_6.1            Matrix_1.7-3           
[106] hms_1.1.3               patchwork_1.3.0         bit64_4.6.0-1          
[109] KEGGREST_1.46.0         clipr_0.8.0             igraph_2.1.4           
[112] broom_1.0.7             memoise_2.0.1           bslib_0.9.0            
[115] ggtree_3.14.0           fastmatch_1.1-6         bit_4.6.0              
[118] ape_5.8-1               gson_0.1.0              polynom_1.4-1

References

A. Ghomi, Fatemeh, Jakob J. Jung, Gemma C. Langridge, Amy K. Cain, Christine J. Boinett, Moataz Abd El Ghany, Derek J. Pickard, et al. 2024. “High-Throughput Transposon Mutagenesis in the Family Enterobacteriaceae Reveals Core Essential Genes and Rapid Turnover of Essentiality.” mBio 0 (0): e01798–24. https://doi.org/10.1128/mbio.01798-24.

Cantalapiedra, Carlos P., Ana Hernández-Plaza, Ivica Letunic, Peer Bork, and Jaime Huerta-Cepas. 2021. “eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale.” Molecular Biology and Evolution 38 (12): 5825–29. https://doi.org/10.1093/molbev/msab293.

Goodall, Emily C. A., Ashley Robinson, Iain G. Johnston, Sara Jabbari, Keith A. Turner, Adam F. Cunningham, Peter A. Lund, Jeffrey A. Cole, and Ian R. Henderson. 2018. “The Essential Genome of Escherichia Coli k-12.” mBio 9 (1): 10.1128/mbio.02096–17. https://doi.org/10.1128/mbio.02096-17.

Higgins, Steven, Stefano Gualdi, Marta Pinto-Carbó, and Leo Eberl. 2020. “Copper Resistance Genes of Burkholderia Cenocepacia H111 Identified by Transposon Sequencing.” Environmental Microbiology Reports 12 (2): 241–49. https://doi.org/10.1111/1758-2229.12828.

Moule, Madeleine G., Claudia M. Hemsley, Qihui Seet, José Afonso Guerra-Assunção, Jiali Lim, Mitali Sarkar-Tyson, Taane G. Clark, et al. 2014. “Genome-wide saturation mutagenesis of Burkholderia pseudomallei K96243 predicts essential genes and novel targets for antimicrobial development.” mBio 5 (1): e00926–00913. https://doi.org/10.1128/mBio.00926-13.

Ramsey, Kathryn M., Hannah E. Ledvina, Tenayaann M. Tresko, Jamie M. Wandzilak, Catherine A. Tower, Thomas Tallo, Caroline E. Schramm, et al. 2020. “Tn-Seq reveals hidden complexity in the utilization of host-derived glutathione in Francisella tularensis.” PLoS pathogens 16 (6): e1008566. https://doi.org/10.1371/journal.ppat.1008566.

Rubin, Benjamin E., Kelly M. Wetmore, Morgan N. Price, Spencer Diamond, Ryan K. Shultzaberger, Laura C. Lowe, Genevieve Curtin, Adam P. Arkin, Adam Deutschbauer, and Susan S. Golden. 2015. “The Essential Gene Set of a Photosynthetic Organism.” Proceedings of the National Academy of Sciences of the United States of America 112 (48): E6634–43. https://doi.org/10.1073/pnas.1519220112.

Valentino, Michael D., Lucy Foulston, Ama Sadaka, Veronica N. Kos, Regis A. Villet, John Santa Maria, David W. Lazinski, et al. 2014. “Genes contributing to Staphylococcus aureus fitness in abscess- and infection-related ecologies.” mBio 5 (5): e01729–01714. https://doi.org/10.1128/mBio.01729-14.