MedRedQA

Commonwealth Scientific and Industrial Research Organisation
Nguyen, Vincent ; Karimi, Sarvnaz ; Rybinski, Maciek ; Xing, Zhenchang
DOI: https://doi.org/10.25919/yn7x-9148
Version: v1
Related publication: https://aclanthology.org/2023.ijcnlp-main.42/
Subjects: medredqa; aacl; dataset; reddit; consumer question answering; consumer; question answering; Applications in health; Applied computing; INFORMATION AND COMPUTING SCIENCES; Natural language processing; Artificial intelligence; Information retrieval and web search; Data management and data science
Language: English

Licence & Rights:

Non-Commercial Licence
CC-BY-NC-SA

Creative Commons Attribution Noncommercial-Share Alike 4.0 Licence
https://creativecommons.org/licenses/by-nc-sa/4.0/

Data is accessible online and may be reused in accordance with licence conditions

All Rights (including copyright) CSIRO, Australian National University 2023.

Access:

Open

Accessible for free

Brief description

A large non-factoid English consumer question answering (QA) dataset containing 51,000 pairs of consumer questions and their corresponding expert answers. The dataset is useful for benchmarking or training systems on difficult real-world questions and responses, which may contain spelling or formatting errors or lexical gaps between consumer and expert vocabularies.

By downloading this dataset, you confirm that you have obtained ethics approval from your institution.

Lineage: We collected posts and comments from the subreddit /r/askdocs published between July 10, 2013, and April 2, 2022, totalling 600,000 submissions (original posts) and 1,700,000 comments (replies). We generated question-answer pairs by taking, for each Reddit question, the highest-scoring answer from a verified medical expert. Questions containing only images were removed, as were all links and author names.
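
The pairing step in the lineage can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the Comment class, its field names (post_id, score, verified) and the tie-breaking rule (the first comment seen wins on equal scores) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    post_id: str    # hypothetical identifier of the parent submission
    text: str
    score: int      # comment score
    verified: bool  # assumed flag: commenter is a verified medical expert


def best_verified_answers(comments):
    """Keep, for each post, the highest-scoring comment from a verified expert."""
    best = {}
    for c in comments:
        if not c.verified:
            continue  # unverified replies never become answers
        current = best.get(c.post_id)
        if current is None or c.score > current.score:
            best[c.post_id] = c
    return best


# Synthetic placeholder data, not taken from the dataset.
comments = [
    Comment("p1", "verified reply A", 12, True),
    Comment("p1", "unverified reply", 40, False),
    Comment("p1", "verified reply B", 5, True),
    Comment("p2", "unverified reply", 3, False),
]
pairs = best_verified_answers(comments)
```

In this sketch, a post with no verified reply (p2 here) yields no question-answer pair, and a higher-scoring unverified reply does not displace a verified one.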

This collection provides two datasets, with the following schemas.
MedRedQA - Reddit medical question-answer pairs from /r/askdocs. CSV format.
i. The poster's question (Body)
ii. The title of the post
iii. The filtered answer from a verified physician's comment (Response)
iv. The occupation indicated for verification status
v. Any PMCIDs found in the post
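
A short Python sketch of reading question-answer pairs from a CSV with this schema is given below. Only the Body and Response column names come from the description above; the other headers in the sample (Title, Occupation, PMCID) are assumptions, and the rows are synthetic placeholders rather than dataset content.

```python
import csv
import io


def read_pairs(handle):
    """Yield (question, answer) tuples, skipping rows missing either field."""
    for row in csv.DictReader(handle):
        body = (row.get("Body") or "").strip()
        response = (row.get("Response") or "").strip()
        if body and response:
            yield body, response


# Synthetic sample; a real file would be opened with
# open(path, newline="", encoding="utf-8") instead of io.StringIO.
sample = io.StringIO(
    "Title,Body,Response,Occupation,PMCID\n"
    'Post one,"Question text, with a comma",Answer text.,Physician,\n'
    "Post two,,Answer without a question.,Physician,\n"
)
pairs = list(read_pairs(sample))
```

Using csv.DictReader rather than splitting on commas matters here because free-text questions and answers commonly contain commas and line breaks inside quoted fields.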

MedRedQA+PubMed - PubMed-enriched subset of MedRedQA. JSON format.
i. Question: The user's original question. This is equivalent to the Body field in MedRedQA.
ii. Document: The abstract of the PubMed document (if it exists and contains an abstract) for that particular post. Note: the answer itself does not necessarily reference this document, but at least one other verified physician in the responses has mentioned it.
iii. The filtered response. This is equivalent to the Response field in MedRedQA.
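
As a rough Python illustration of working with the enriched subset, the sketch below keeps only records that carry a non-empty abstract. It assumes the file is a JSON array of objects and that the keys are named Question, Document and Response; the description names the first two fields but not the third, so the Response key and the array layout are guesses, and the records are synthetic placeholders.

```python
import json

# Synthetic stand-in for the file content; a real file would be read with
# json.load(open(path, encoding="utf-8")).
raw = json.dumps([
    {"Question": "question text 1",
     "Document": "abstract text of a PubMed article",
     "Response": "answer text 1"},
    {"Question": "question text 2",
     "Document": "",
     "Response": "answer text 2"},
])

records = json.loads(raw)


def with_abstract(records):
    """Return records whose Document field holds a non-empty abstract."""
    return [r for r in records if (r.get("Document") or "").strip()]


grounded = with_abstract(records)
```

Because the abstract may have been cited by a different physician than the one who wrote the selected answer, a filtered record pairs a question with a related document, not necessarily with the source of the answer.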

Available: 2024-05-01

Data time period: 2013-07-10 to 2022-04-02