
Probing Datasets for Noisy Texts

Federation University Australia
Kasthuriarachchy, Buddhika ; Chetty, Madhu ; Shatte, Adrian

Licence & Rights:

CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/)

Full description

Context

Probing tasks are popular among NLP researchers for assessing how much linguistic information is captured in an encoded representation. Each probing task is a classification problem, and the model's performance will vary with the richness of the linguistic properties encoded in the representation.



This dataset contains five new probing datasets consisting of noisy texts (Tweets), which can serve as a benchmark for researchers studying the linguistic characteristics of unstructured and noisy text.


File Structure

Format: A tab-separated text file



Column 1: train/test/validation split (tr-train, te-test, va-validation)

Column 2: class label (refer to the Content section for the class labels of each task file)

Column 3: Tweet message (text)

Column 4: a unique ID
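Given the four-column layout above, a task file can be loaded, for example, with pandas. The column names and the two sample rows below are our own illustrative choices, not part of the dataset.

```python
import csv
import io

import pandas as pd

# Two made-up rows in the documented four-column, tab-separated layout.
sample = (
    "tr\t3\tjust landed and the city looks amazing\t1001\n"
    "te\tO\tsee you at the game tonight\t1002\n"
)

# The files carry no header row; the names below are our own labels
# for the documented columns. QUOTE_NONE keeps stray quote characters
# inside Tweets from being treated as field delimiters.
df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["split", "label", "text", "tweet_id"],
    dtype=str,
    quoting=csv.QUOTE_NONE,
)

train = df[df["split"] == "tr"]  # tr-train, te-test, va-validation
```

To read an actual task file, replace the `io.StringIO(sample)` argument with its path.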


Content

sent_len.tsv

In this classification task, the goal is to predict a sentence's length, grouped into 8 bins (0-7): 0: (5-8), 1: (9-12), 2: (13-16), 3: (17-20), 4: (21-25), 5: (26-29), 6: (30-33), 7: (34-70). This task is called “SentLen” in the paper.
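The binning above can be expressed as a small lookup. Counting sentence length as whitespace tokens is our assumption for illustration; the exact length measure is not stated here.

```python
# SentLen bin edges as quoted above (inclusive ranges).
BINS = [(5, 8), (9, 12), (13, 16), (17, 20),
        (21, 25), (26, 29), (30, 33), (34, 70)]

def sentlen_bin(n_tokens):
    """Return the bin label 0-7 for a sentence length, or None if the
    length falls outside the covered 5-70 range."""
    for label, (lo, hi) in enumerate(BINS):
        if lo <= n_tokens <= hi:
            return label
    return None

# Whitespace tokenisation is an assumption, not the paper's definition.
example = "just landed and the city looks amazing tonight"
bin_label = sentlen_bin(len(example.split()))  # 8 tokens -> bin 0
```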

word_content.tsv

We consider a 10-way classification task with 10 target words, chosen from the available manually annotated instances. The task is to predict which of the target words appears in the given sentence. Only words that appear in the BERT vocabulary were considered as target words. We constructed the data by picking the first 10 lower-cased words in the corpus vocabulary, ordered by frequency and having a length of at least 4 characters (to remove noise). Each sentence contains a single target word, and the word occurs exactly once in the sentence. The task is referred to as “WC” in the paper.
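The target-word selection described above can be sketched as follows. Here `bert_vocab` stands in for the real BERT vocabulary (not included here), passed as a plain set of strings.

```python
from collections import Counter

def pick_targets(sentences, bert_vocab, k=10, min_len=4):
    """Pick the k most frequent lower-cased words of length >= min_len
    that also appear in the vocabulary (a set of strings)."""
    counts = Counter(
        w
        for s in sentences
        for w in s.lower().split()
        if len(w) >= min_len and w in bert_vocab
    )
    return [w for w, _ in counts.most_common(k)]
```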

bigram_shift.tsv

The Bigram Shift task tests whether an encoder is sensitive to legal word order. Two adjacent words in a Tweet are inverted, and the model performs a binary classification to distinguish inverted (I) from non-inverted/original (O) Tweets. The task is referred to as “BShift” in the paper.
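A minimal sketch of the inversion step, assuming whitespace tokens; choosing the adjacent pair uniformly at random is our own illustrative detail.

```python
import random

def bigram_shift(tokens, rng=None):
    """Swap one randomly chosen pair of adjacent tokens.
    Returns (new_tokens, label): 'I' for inverted, 'O' for original."""
    rng = rng or random.Random()
    if len(tokens) < 2:
        return list(tokens), "O"
    i = rng.randrange(len(tokens) - 1)
    shifted = tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
    return shifted, "I"
```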

tree_depth.tsv

The Tree Depth task evaluates whether the encoded sentence captures hierarchical structure, by asking the classification model to predict the depth of the longest path from the root to any leaf in the Tweet's parse tree. The task is referred to as “TreeDepth” in the paper.
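The depth being predicted can be computed directly from a head-index encoding of the parse. Representing the tree as a list of head indices (with the root pointing to itself) and counting the root as depth 0 are our own conventions, not the dataset's format.

```python
def tree_depth(heads):
    """Longest root-to-leaf path length in a dependency tree, where
    heads[i] is the index of token i's head and the root points to
    itself. The root counts as depth 0 (a convention choice)."""
    def depth(i):
        return 0 if heads[i] == i else 1 + depth(heads[i])
    return max(depth(i) for i in range(len(heads)))
```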

odd_man_out.tsv

The Tweets are modified by replacing a random noun or verb with another noun or verb. The classifier's task is to identify whether the sentence was modified by this change. Class label O refers to unmodified sentences, while C refers to modified sentences. The task is called “SOMO” in the paper.
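The replacement step can be sketched as below. Here the POS tags and candidate replacement words are supplied by the caller, whereas building the real dataset would involve a POS tagger and corpus statistics not reproduced here.

```python
import random

def somo_replace(tokens, pos_tags, candidates, rng=None):
    """Replace one random noun or verb with another word of the same POS.
    `candidates` maps a POS tag (e.g. 'NOUN', 'VERB') to replacement
    words. Returns (new_tokens, label): 'C' if changed, 'O' if left as-is."""
    rng = rng or random.Random()
    slots = [i for i, t in enumerate(pos_tags) if t in candidates]
    if not slots:
        return list(tokens), "O"
    i = rng.choice(slots)
    pool = [w for w in candidates[pos_tags[i]] if w != tokens[i]]
    if not pool:
        return list(tokens), "O"
    out = list(tokens)
    out[i] = rng.choice(pool)
    return out, "C"
```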


Issued: 14 03 2021

Created: 14 03 2021

Modified: 14 03 2021

This dataset is part of a larger collection

Subjects: Natural language processing; probing dataset; sentence length; bigram shift; semantic odd man out; tree depth; word content; sentence vector; sentence representation; sentence embeddings

Identifiers: DOI 10.25955/604c5307db043