Research Project
Full description
Bacteria and archaea can exchange genetic material across lineages through processes of lateral genetic transfer (LGT). Collectively, these exchange relationships can be modeled as a network and analysed using concepts from graph theory. In particular, densely connected regions within an LGT network have been defined as genetic exchange communities (GECs). Here we apply TF-IDF, an alignment-free method originating from document analysis, to infer networks of LGT among bacterial genomes. We examine four empirical datasets (and selected variants of one of them) that differ in size (number of genomes) and phyletic breadth, varying a key parameter (word length k) within bounds established in previous work. We map the inferred lateral regions to genes in recipient genomes, and construct networks in which the nodes are groups of genomes and the edges represent LGT. We then extract maximum and maximal cliques (i.e. GECs) from these graphs, and identify nodes that retain membership as k is varied. In these datasets, most surviving lateral transfer has happened within these GECs. Using Gene Ontology enrichment tests, we demonstrate that biological processes associated with metabolism, regulation, and transport are often over-represented among the genes affected by LGT within these communities. These enrichments are largely robust to changes in k.
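
The clique step described above can be illustrated with a minimal sketch. It is not the project's pipeline: the edge lists, the node labels (A to D), the values of k, and all function names are hypothetical, and the LGT edges are assumed to have been inferred already. The sketch enumerates maximal cliques with the standard Bron-Kerbosch algorithm (with pivoting), keeps the largest ones as maximum cliques, and reports the nodes that remain in a maximum clique for every k.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency map from (node, node) pairs."""
    graph: Graph = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph: Graph) -> List[FrozenSet[str]]:
    """Enumerate all maximal cliques (Bron-Kerbosch with pivoting)."""
    cliques: List[FrozenSet[str]] = []
    if not graph:
        return cliques

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended further
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(graph[v] & p))
        for v in list(p - graph[pivot]):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def maximum_cliques(graph: Graph) -> List[FrozenSet[str]]:
    """Return the maximal cliques of largest size."""
    cliques = maximal_cliques(graph)
    if not cliques:
        return []
    largest = max(len(c) for c in cliques)
    return [c for c in cliques if len(c) == largest]


def persistent_members(edges_by_k: Dict[int, List[Tuple[str, str]]]) -> Set[str]:
    """Nodes that belong to some maximum clique at every value of k."""
    member_sets = []
    for k in sorted(edges_by_k):
        members: Set[str] = set()
        for clique in maximum_cliques(build_graph(edges_by_k[k])):
            members |= clique
        member_sets.append(members)
    return set.intersection(*member_sets) if member_sets else set()


# Hypothetical LGT edges between genome groups, one edge list per word length k.
edges_by_k = {
    20: [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")],
    25: [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")],
    30: [("A", "B"), ("A", "C"), ("B", "C")],
}

persistent = persistent_members(edges_by_k)
print(sorted(persistent))
```

In this toy input, node D joins a maximum clique only at k = 25, so it is not counted as a persistent member. Defining persistence through maximum cliques is one possible choice; using all maximal cliques above a size threshold would be an equally reasonable assumption.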