Data from: Supervised Learning for Detection of Duplicates in Genomic Sequence Databases

RMIT University, Australia
Assoc Professor Xiuzhen Zhang (Associated with, Aggregated by)
Access the data: https://figshare.com/articles/Supervised_Learning_for_Detection_of_Duplicates_in_Genomic_Sequence_Databases/3871353

Related publication: https://dx.doi.org/10.1371/journal.pone.0159644
Date: 2018
Type: dataset
Language: English
Subjects: CD-HIT; Protein; Integration; Records; Pattern Recognition and Data Mining; INFORMATION AND COMPUTING SCIENCES; ARTIFICIAL INTELLIGENCE AND IMAGE PROCESSING

Licence & Rights:

CC BY-NC: Attribution-Noncommercial 3.0 AU
http://creativecommons.org/licenses/by-nc/3.0/au

All rights reserved

Access:

Data available in link

Contact Information


Figshare

Full description

The attached file provides supplementary data for the linked article. First identified as an issue in 1996, duplication in biological databases introduces redundancy and even leads to inconsistency when contradictory information appears. The amount of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as experts can. Supervised learning has the potential to address such problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases. We developed and evaluated a supervised duplicate detection method based on an expert-curated dataset of duplicates, containing over one million pairs across five organisms derived from genomic sequence databases. We selected 22 features to represent distinct attributes of the database records, and developed a binary model and a multi-class model. Both models achieved promising performance; under cross-validation, the binary model had over 90% accuracy in each of the five organisms, while the multi-class model maintained high accuracy and was more robust in generalisation. We performed an ablation study to quantify the impact of different sequence record features, finding that features derived from metadata, sequence identity, and alignment quality affect performance most strongly. The study demonstrates that machine learning can be an effective additional tool for de-duplication of genomic sequence databases. All data are available as described in the supplementary material.

This dataset is part of a larger collection

Identifiers
  • Local : 9fa5c80cdeeb179d9d7b18d9e724dae7