huggingface/datasets

[Dataset requests] New datasets for Text Classification

Open

#353 opened on Jul 8, 2020

(12 comments) (5 reactions) (0 assignees) Python (18,313 stars) (2,496 forks) batch import
dataset request, help wanted

Description

We are missing a few datasets for Text Classification, which is an important field.

Namely, it would be really nice to add:

  • TREC-6 dataset (see here for instance: https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html#torchnlp.datasets.trec_dataset) [done]
    • #386
  • Yelp-5
    • #1315
  • Movie Review (MR) dataset [156] [done (same as rotten_tomatoes)]
  • SST (Stanford Sentiment Treebank) [included in glue]
    • #1934
  • Multi-Perspective Question Answering (MPQA) dataset [requires authentication (i.e. manual download)]
  • Amazon. This is a popular corpus of product reviews collected from the Amazon website [159]. It contains labels for both binary classification and multi-class (5-class) classification
    • #791
    • #1389
  • 20 Newsgroups. The 20 Newsgroups dataset [done]
    • #410
  • Sogou News dataset [done]
    • #450
  • Reuters news. The Reuters-21578 dataset [165] [done]
    • #471
  • DBpedia. The DBpedia dataset [170]
    • #1116
  • Ohsumed. The Ohsumed collection [171] is a subset of the MEDLINE database
  • EUR-Lex. The EUR-Lex dataset
  • WOS. The Web Of Science (WOS) dataset [done]
    • #424
  • PubMed. PubMed [173]
  • TREC-QA: TREC-6 + TREC-50
    • See above: TREC-6 dataset
  • Quora. The Quora dataset [180]
    • #366

All these datasets are cited in https://arxiv.org/abs/2004.03705
