Brief description
This database was built to identify taxa in metagenome samples with the CCMetagen pipeline. Using the whole NCBI nt collection gives a complete taxonomic overview, including any microbial eukaryotes that may be present in the dataset. The database is already indexed and ready to use with KMA and CCMetagen.
A manual describing how to use this dataset can be found at: https://github.com/vrmarcelino/CCMetagen
Additionally, a tutorial on the whole analysis of a set of metatranscriptome samples can be found at: https://github.com/vrmarcelino/CCMetagen/tree/master/tutorial
The database was built as follows:
The partially non-redundant nucleotide database was downloaded from the NCBI website (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz) in January 2018. This database was formatted to include taxids in sequence headers.
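As an illustration of the header-formatting step, here is a minimal Python sketch, not the script actually used to build this database. It assumes a hypothetical in-memory accession-to-taxid mapping (`acc2taxid`) and a hypothetical `taxid|original header` layout; the function names and the choice to drop records without a known taxid are also assumptions. Consult the CCMetagen manual for the header format the pipeline expects.

```python
from typing import Dict, Iterable, Iterator, Optional


def add_taxid_to_header(header: str, acc2taxid: Dict[str, int]) -> Optional[str]:
    """Prefix a FASTA header with its taxid, or return None if the accession is unmapped.

    The '>taxid|rest-of-header' layout is an assumption made for this sketch.
    """
    # The accession is taken to be the first whitespace-delimited token after '>'.
    accession = header[1:].split()[0]
    taxid = acc2taxid.get(accession)
    if taxid is None:
        return None
    return f">{taxid}|{header[1:]}"


def tag_fasta(lines: Iterable[str], acc2taxid: Dict[str, int]) -> Iterator[str]:
    """Yield FASTA lines with taxids added to headers.

    Records whose accession has no taxid are skipped (an assumption of this sketch).
    """
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            new_header = add_taxid_to_header(line, acc2taxid)
            keep = new_header is not None
            if keep:
                yield new_header
        elif keep:
            # Sequence lines are passed through only for records that were kept.
            yield line


if __name__ == "__main__":
    # Toy input: accessions and taxids are made up for illustration.
    toy = [">AB000001.1 toy sequence", "ACGT", ">ZZ999999.1 unmapped", "TTTT"]
    for out_line in tag_fasta(toy, {"AB000001.1": 562}):
        print(out_line)
```

For a file the size of nt, the mapping would not fit comfortably in a plain dictionary on most machines; an on-disk lookup would be a more realistic choice.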
Indexing was then performed with KMA using the command:
kma_index -i nt_taxid.fas -o ncbi_nt -NI -Sparse TG
Three indexed databases are provided:
- NCBI nucleotide collection
- RefSeq database of bacterial and fungal genomes
Notes
Update to dataset:
The NCBI nucleotide collection contains many environmental and artificial sequence entries that lack taxonomic information (e.g. uncultured marine bacteria). We therefore compiled a second database that leaves them out.
The file ncbi_nt_no_env_11jun2019.zip contains all NCBI nt entries except the descendants of environmental eukaryotes (taxid 61964), environmental prokaryotes (48479), unclassified sequences (12908) and artificial sequences (28384).
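The exclusion rule above can be sketched in Python as a walk up the taxonomy from each entry's taxid. This is an illustration only, not the procedure used to build the file: the `parent_of` mapping, the function name, and the toy tree are hypothetical, and the sketch treats the four excluded taxids themselves as excluded along with their descendants. In practice the parent mapping would come from the NCBI taxonomy dump.

```python
from typing import Dict, Set

# Excluded lineage roots, as listed in the dataset notes.
EXCLUDED_ROOTS: Set[int] = {61964, 48479, 12908, 28384}


def is_excluded(taxid: int, parent_of: Dict[int, int],
                excluded: Set[int] = EXCLUDED_ROOTS) -> bool:
    """Return True if taxid is one of the excluded roots or descends from one.

    parent_of maps each taxid to its parent taxid (hypothetical input).
    """
    seen = set()
    current = taxid
    # Walk towards the root; 'seen' stops the loop at a self-parented root or a cycle.
    while current not in seen:
        if current in excluded:
            return True
        seen.add(current)
        parent = parent_of.get(current)
        if parent is None:
            # Unknown taxid or broken lineage: treated as not excluded in this sketch.
            return False
        current = parent
    return False


if __name__ == "__main__":
    # Toy, heavily simplified tree; 900001 is a made-up taxid placed under 48479.
    toy_parents = {1: 1, 2: 1, 48479: 2, 900001: 48479, 562: 2}
    for t in (900001, 562, 48479):
        print(t, is_excluded(t, toy_parents))
```

Entries for which `is_excluded` returns True would be left out of the filtered FASTA file before indexing.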
Issued: 30 04 2019
Data time period: 09 04 2019 to 30 04 2019
- DOI : 10.25910/5cc7cd40fca8e
- Handle : http://hdl.handle.net/2123/20336