huggingface/datasets
[Dataset requests] New datasets for Text Classification
Open
#353 opened on Jul 8, 2020
dataset request, help wanted
Description
We are missing a few datasets for text classification, which is an important field.
Namely, it would be really nice to add:
- TREC-6 dataset (see here for instance: https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html#torchnlp.datasets.trec_dataset) [done]
- #386
- Yelp-5
- #1315
- Movie Review (MR) dataset [156] [done (same as rotten_tomatoes)]
- SST (Stanford Sentiment Treebank) [included in glue]
- #1934
- Multi-Perspective Question Answering (MPQA) dataset [requires authentication (i.e., manual download)]
- Amazon. This is a popular corpus of product reviews collected from the Amazon website [159]. It contains labels for both binary and multi-class (5-class) classification.
- #791
- #1389
- 20 Newsgroups. The 20 Newsgroups dataset [done]
- #410
- Sogou News dataset [done]
- #450
- Reuters news. The Reuters-21578 dataset [165] [done]
- #471
- DBpedia. The DBpedia dataset [170]
- #1116
- Ohsumed. The Ohsumed collection [171] is a subset of the MEDLINE database
- EUR-Lex. The EUR-Lex dataset
- WOS. The Web Of Science (WOS) dataset [done]
- #424
- PubMed. PubMed [173]
- TREC-QA: TREC-6 + TREC-50
- See above: TREC-6 dataset
- Quora. The Quora dataset [180]
- #366
All these datasets are cited in https://arxiv.org/abs/2004.03705; the bracketed numbers above appear to be that paper's reference numbers.