huggingface/datasets

[Dataset requests] New datasets for Text Classification

Open

#353 opened on Jul 8, 2020

Labels: batch import, dataset request, help wanted

Description

We are missing a few datasets for text classification, which is an important field.

Namely, it would be really nice to add:

  • TREC-6 dataset (see here for instance: https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html#torchnlp.datasets.trec_dataset) [done]
    • #386
  • Yelp-5
    • #1315
  • Movie Review (MR) dataset [156] [done (same as rotten_tomatoes)]
  • SST (Stanford Sentiment Treebank) [included in glue]
    • #1934
  • Multi-Perspective Question Answering (MPQA) dataset [requires authentication (i.e., manual download)]
  • Amazon. This is a popular corpus of product reviews collected from the Amazon website [159]. It contains labels for both binary classification and multi-class (5-class) classification.
    • #791
    • #1389
  • 20 Newsgroups. The 20 Newsgroups dataset [done]
    • #410
  • Sogou News dataset [done]
    • #450
  • Reuters news. The Reuters-21578 dataset [165] [done]
    • #471
  • DBpedia. The DBpedia dataset [170]
    • #1116
  • Ohsumed. The Ohsumed collection [171] is a subset of the MEDLINE database
  • EUR-Lex. The EUR-Lex dataset
  • WOS. The Web Of Science (WOS) dataset [done]
    • #424
  • PubMed. PubMed [173]
  • TREC-QA: TREC-6 + TREC-50
    • See above: TREC-6 dataset
  • Quora. The Quora dataset [180]
    • #366

All these datasets are cited in https://arxiv.org/abs/2004.03705
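
For the entries marked as done or included, loading might look roughly like the sketch below. It assumes the `datasets` library is installed and that the Hub is reachable. The identifiers `rotten_tomatoes` and `glue` come from the list above; the `sst2` config name and the helper `load_done_dataset` are assumptions introduced for illustration, not something this issue specifies.

```python
# Map a few entries from the list above to (dataset path, config name).
# "rotten_tomatoes" and "glue" are named in the issue; "sst2" is an assumed config name.
DONE_DATASETS = {
    "Movie Review (MR)": ("rotten_tomatoes", None),
    "SST": ("glue", "sst2"),
}


def load_done_dataset(name, split="train"):
    """Hypothetical helper: load one of the entries above via datasets.load_dataset."""
    # Imported lazily so the mapping can be inspected without the library installed.
    from datasets import load_dataset

    path, config = DONE_DATASETS[name]
    # load_dataset(path, name=None, split=None, ...): config is passed as the second argument.
    return load_dataset(path, config, split=split)


if __name__ == "__main__":
    # Requires network access to download the data.
    mr_train = load_done_dataset("Movie Review (MR)")
    print(mr_train[0])
```

Keeping the mapping separate from the loader makes it easy to extend as more of the requested datasets get merged.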
