huggingface/datasets

[Dataset requests] New datasets for Text Classification

Open

#353 opened on July 8, 2020

12 comments · 5 reactions · 0 assignees
Labels: batch import, dataset request, help wanted

Description

We are missing a few datasets for text classification, which is an important field.

Namely, it would be really nice to add:

  • TREC-6 dataset (see here for instance: https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html#torchnlp.datasets.trec_dataset) [done]
    • #386
  • Yelp-5
    • #1315
  • Movie Review (MR) dataset [156] [done (same as rotten_tomatoes)]
  • SST (Stanford Sentiment Treebank) [included in glue]
    • #1934
  • Multi-Perspective Question Answering (MPQA) dataset [requires authentication (in practice, a manual download)]
  • Amazon. This is a popular corpus of product reviews collected from the Amazon website [159]. It contains labels for both binary and multi-class (5-class) classification.
    • #791
    • #1389
  • 20 Newsgroups. The 20 Newsgroups dataset [done]
    • #410
  • Sogou News dataset [done]
    • #450
  • Reuters news. The Reuters-21578 dataset [165] [done]
    • #471
  • DBpedia. The DBpedia dataset [170]
    • #1116
  • Ohsumed. The Ohsumed collection [171] is a subset of the MEDLINE database
  • EUR-Lex. The EUR-Lex dataset
  • WOS. The Web Of Science (WOS) dataset [done]
    • #424
  • PubMed. PubMed [173]
  • TREC-QA: TREC-6 + TREC-50
    • See above: TREC-6 dataset
  • Quora. The Quora dataset [180]
    • #366

All these datasets are cited in https://arxiv.org/abs/2004.03705
