huggingface/datasets

[Dataset requests] New datasets for Text Classification

Open

#353 opened on Jul 8, 2020

Labels: batch import, dataset request, help wanted


Description

We are missing a few datasets for text classification, which is an important field.

Namely, it would be really nice to add:

  • TREC-6 dataset (see here for instance: https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html#torchnlp.datasets.trec_dataset) [done]
    • #386
  • Yelp-5
    • #1315
  • Movie Review (MR) dataset [156] [done (same as rotten_tomatoes)]
  • SST (Stanford Sentiment Treebank) [included in glue]
    • #1934
  • Multi-Perspective Question Answering (MPQA) dataset [requires authentication (i.e. manual download)]
  • Amazon. This is a popular corpus of product reviews collected from the Amazon website [159]. It contains labels for both binary classification and multi-class (5-class) classification.
    • #791
    • #1389
  • 20 Newsgroups. The 20 Newsgroups dataset [done]
    • #410
  • Sogou News dataset [done]
    • #450
  • Reuters news. The Reuters-21578 dataset [165] [done]
    • #471
  • DBpedia. The DBpedia dataset [170]
    • #1116
  • Ohsumed. The Ohsumed collection [171] is a subset of the MEDLINE database
  • EUR-Lex. The EUR-Lex dataset
  • WOS. The Web Of Science (WOS) dataset [done]
    • #424
  • PubMed. PubMed [173]
  • TREC-QA: TREC-6 + TREC-50
    • See above: TREC-6 dataset
  • Quora. The Quora dataset [180]
    • #366

All these datasets are cited in https://arxiv.org/abs/2004.03705
