Full description
This dataset contains the raw data associated with the study 'Incomplete sterol biosynthesis pathways and highly duplicated haem peroxidases revealed in Rhodophyte algae using multi-omics resource', submitted to Marine Drugs for review. In this study, Rhodophyte genome and transcriptome assemblies were functionally annotated and their metabolic pathways reconstructed, while phylogeny inferred with OrthoFinder was used to correlate the abundance of specific functional annotations with gene duplication analysis.
Description of the data and file structure
This dataset contains four main directories, labelled D1, D2, D3, and D4. Major subdirectories are labelled with letters (for example, D1a and D1b).
Assemblies referred to in this dataset can be divided into two broad categories: those with prior protein annotations available, and those without. Those which had prior protein annotations available for download are classed as 'pre-annotated', while those without will be referred to as 'unannotated'.
D1: BUSCO data
Directory D1 contains the BUSCO results for this study. There are two subdirectories: D1a and D1b. D1a contains the BUSCO data for all the assemblies used in the main part of this study, including all the red algae and a small outgroup of green algae and a Glaucophyte. D1b contains the BUSCO results for 41 pre-annotated green algal protein assemblies, which were used as a comparison for the BUSCO results and other general statistics and were not otherwise used in the study.
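To compare completeness across the assemblies in D1a and D1b, the one-line score notation that BUSCO writes in its summary output can be parsed into numbers. The following is a minimal sketch, assuming the D1 files include BUSCO's usual `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` line; the exact file names and layout inside D1 are not described here, and the example string is made up for illustration.

```python
import re

# Pattern for BUSCO's one-line score notation, e.g.
# C:95.3%[S:90.2%,D:5.1%],F:2.0%,M:2.7%,n:255
BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Return the BUSCO percentages and BUSCO count found in `text`."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO score line found")
    scores = {
        name: float(value)
        for name, value in match.groupdict().items()
        if name != "total"
    }
    scores["total"] = int(match.group("total"))
    return scores


# Illustrative input only; these numbers are not taken from the dataset.
example = "\tC:95.3%[S:90.2%,D:5.1%],F:2.0%,M:2.7%,n:255\t"
print(parse_busco_summary(example))
```

The `duplicated` value is the one most relevant when comparing BUSCO results against the gene duplication analysis.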
D2: Protein sequence data
D2 contains protein assemblies that were annotated as part of this dataset. Protein annotations were predicted using AUGUSTUS trained on BUSCO training data, which was generated by running BUSCO in genome mode with AUGUSTUS as the prediction algorithm. Sequences are in FASTA format, with an organism prefix before the gene number.
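Because each sequence identifier carries an organism prefix, sequences from combined FASTA files can be tallied per organism. The sketch below is illustrative only: the separator between prefix and gene number is not specified in this description, so the underscore separator and the `ORGA`/`ORGB` identifiers are assumptions that should be adjusted to match the actual headers.

```python
from collections import Counter
import io


def count_by_prefix(handle, sep="_"):
    """Count FASTA records per organism prefix.

    The prefix is taken as everything before the LAST occurrence of `sep`
    in the sequence identifier (assumed header layout: >PREFIX_geneNumber).
    """
    counts = Counter()
    for line in handle:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # header with no identifier
        seq_id = fields[0]
        prefix, found, _gene = seq_id.rpartition(sep)
        counts[prefix if found else seq_id] += 1
    return counts


# Hypothetical records standing in for a D2 FASTA file.
demo = io.StringIO(">ORGA_g1.t1\nMKT\n>ORGA_g2.t1\nMLL\n>ORGB_g1.t1\nMAA\n")
counts = count_by_prefix(demo)
print(counts)
```

For a real file, pass an open file handle instead of the `io.StringIO` stand-in.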
D3: Repeat data
Repeat identification and masking results are included in directory D3. Repeats were identified only for genomes. Both pre-annotated and unannotated genome assemblies were masked, but only the unannotated assemblies had proteins predicted from the masked assemblies; for the pre-annotated assemblies, repeat identification was done only as a comparison. The results include both summary tables (D3a) and full results tables (D3b). The summaries are short tables detailing the total percentage of the assembly corresponding to each type of repeat element. The results tables detail each individual repeat element for each assembly. Both sets of results are in .txt format.
D4: OrthoFinder data
Gene orthologue and phylogenetic data inferred by OrthoFinder is contained in directory D4. The OrthoFinder directory has two subdirectories: D4a and D4b. D4a contains the results from an OrthoFinder run on 64 functionally annotated Rhodophyte genome and transcriptome assemblies with an outgroup of Chlorophyte and Glaucophyte genomes. This run used default settings, with DendroBLAST for phylogenetic inference. D4b contains the results of a smaller OrthoFinder run on 32 genomes, using the multiple sequence alignment option with default parameters. Both subdirectories contain the results as output by default by OrthoFinder. The WorkingDirectory data was not included in this dataset.
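One way to explore the OrthoFinder results for highly duplicated gene families is to rank orthogroups by copy number in a given assembly. This is a rough sketch, assuming the run wrote the tab-separated `Orthogroups/Orthogroups.GeneCount.tsv` table that recent OrthoFinder versions produce by default (an `Orthogroup` column, one column per species, and a `Total` column). The species names and counts below are invented for illustration.

```python
import csv
import io


def top_orthogroups(handle, species, n=3):
    """Return the `n` orthogroups with the most genes in `species`.

    `handle` is an open gene-count table; `species` must match one of
    its column headers exactly.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(row["Orthogroup"], int(row[species])) for row in reader]
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:n]


# Hypothetical table standing in for Orthogroups.GeneCount.tsv.
demo = io.StringIO(
    "Orthogroup\tspeciesA\tspeciesB\tTotal\n"
    "OG0000000\t12\t3\t15\n"
    "OG0000001\t1\t9\t10\n"
    "OG0000002\t4\t4\t8\n"
)
top = top_orthogroups(demo, "speciesA", n=2)
print(top)
```

The orthogroup identifiers returned this way can then be looked up in the main orthogroups table to retrieve the member gene names.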
Sharing/Access information
Data was derived from the following sources:
Code/Software
This dataset was created using the following software packages:
Issued: 2023
Created: 20210513 to 20220331
Subjects
Biological Sciences
Bioinformatics and Computational Biology
Metabolism
Multi-omics
Rhodophyta
Identifiers
- usc : 11267943120002621