huggingface/datasets

[Dataset requests] New datasets for Text Classification

Open

#353 opened on Jul 8, 2020

Description

We are missing a few datasets for Text Classification, which is an important field.

Namely, it would be really nice to add:

  • TREC-6 dataset (see here for instance: https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html#torchnlp.datasets.trec_dataset) [done]
    • #386
  • Yelp-5
    • #1315
  • Movie Review (MR) dataset [156] [done (same as rotten_tomatoes)]
  • SST (Stanford Sentiment Treebank) [included in glue]
    • #1934
  • Multi-Perspective Question Answering (MPQA) dataset [requires authentication (i.e. manual download)]
  • Amazon. This is a popular corpus of product reviews collected from the Amazon website [159]. It contains labels for both binary classification and multi-class (5-class) classification.
    • #791
    • #1389
  • 20 Newsgroups. The 20 Newsgroups dataset [done]
    • #410
  • Sogou News dataset [done]
    • #450
  • Reuters news. The Reuters-21578 dataset [165] [done]
    • #471
  • DBpedia. The DBpedia dataset [170]
    • #1116
  • Ohsumed. The Ohsumed collection [171] is a subset of the MEDLINE database
  • EUR-Lex. The EUR-Lex dataset
  • WOS. The Web Of Science (WOS) dataset [done]
    • #424
  • PubMed. PubMed [173]
  • TREC-QA: TREC-6 + TREC-50
    • See above: TREC-6 dataset
  • Quora. The Quora dataset [180]
    • #366

All these datasets are cited in https://arxiv.org/abs/2004.03705
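
For the two entries whose status note names an existing dataset (`rotten_tomatoes` for MR, `glue` for SST), loading could look roughly like the sketch below. The helper `load_done_dataset`, the `DONE_DATASETS` mapping, and the `sst2` config name are assumptions made for illustration; the issue itself only mentions `rotten_tomatoes` and `glue`.

```python
# Minimal sketch, not an official recipe.
# Maps list entries to (dataset name, config name). The "sst2" config is an
# assumption: the issue only says SST is included in glue.
DONE_DATASETS = {
    "Movie Review (MR)": ("rotten_tomatoes", None),
    "SST": ("glue", "sst2"),
}


def load_done_dataset(name, split="train"):
    """Load one of the entries above with the `datasets` library.

    Hypothetical helper: `datasets` is imported lazily, so defining this
    function needs neither the package nor network access.
    """
    from datasets import load_dataset  # requires `pip install datasets`

    path, config = DONE_DATASETS[name]
    if config is None:
        return load_dataset(path, split=split)
    return load_dataset(path, config, split=split)
```

Calling `load_done_dataset("SST")` would download the data on first use, so it needs network access and an installed `datasets` package.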
