[Dataset requests] New datasets for Text Classification · huggingface/datasets#353

We are missing a few datasets for Text Classification, which is an important field.

Namely, it would be really nice to add:

TREC-6 dataset (see here for instance: https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html#torchnlp.datasets.trec_dataset) [done]
- #386
Yelp-5
- #1315
Movie Review (MR) dataset [156] [done (same as rotten_tomatoes)]
SST (Stanford Sentiment Treebank) [included in glue]
- #1934
Multi-Perspective Question Answering (MPQA) dataset [requires authentication (i.e., manual download)]
Amazon. A popular corpus of product reviews collected from the Amazon website [159]. It contains labels for both binary and multi-class (5-class) classification.
- #791
- #1389
20 Newsgroups. The 20 Newsgroups dataset [done]
- #410
Sogou News dataset [done]
- #450
Reuters news. The Reuters-21578 dataset [165] [done]
- #471
DBpedia. The DBpedia dataset [170]
- #1116
Ohsumed. The Ohsumed collection [171] is a subset of the MEDLINE database
EUR-Lex. The EUR-Lex dataset
WOS. The Web Of Science (WOS) dataset [done]
- #424
PubMed. PubMed [173]
TREC-QA: TREC-6 + TREC-50
- See above: TREC-6 dataset
Quora. The Quora dataset [180]
- #366
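
For the TREC-QA entry above, a short sketch may help show why TREC-6 and TREC-50 can be served by a single dataset: in the commonly distributed files each line starts with a `COARSE:fine` tag, so both label sets come from the same record. The function name, the dictionary keys, and the sample question below are hypothetical and only for illustration; this is not the loader added in #386.

```python
from typing import Dict

# The six coarse classes that give TREC-6 its name.
COARSE_LABELS = ("ABBR", "DESC", "ENTY", "HUM", "LOC", "NUM")


def parse_trec_line(line: str) -> Dict[str, str]:
    """Split one 'COARSE:fine question text' line into its parts.

    Hypothetical helper: the output keys are chosen for this sketch only.
    """
    # The tag is everything before the first space; the rest is the question.
    tag, _, text = line.strip().partition(" ")
    coarse, _, fine = tag.partition(":")
    if coarse not in COARSE_LABELS or not fine or not text:
        raise ValueError(f"not a TREC-style line: {line!r}")
    return {
        "coarse_label": coarse,            # TREC-6 target
        "fine_label": f"{coarse}:{fine}",  # TREC-50 target
        "text": text,
    }


# Illustrative input, not copied from the dataset files.
example = parse_trec_line("NUM:date When was the telephone invented ?")
print(example)
```

Training on `coarse_label` gives the 6-class task, and training on `fine_label` gives the fine-grained one, with no second download needed.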

Research direction: Choose a remaining dataset from the list (e.g., MPQA, Ohsumed, EUR-Lex, PubMed) and follow the pattern of previously added datasets in the repository. Look at examples like #386 or #1315 to understand how to add a dataset. If the dataset requires authentication or a manual download, make sure you have the necessary access first.
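
As a rough illustration of the pattern mentioned above: dataset loading scripts are typically built around a generator that yields `(key, example)` pairs. The sketch below shows only that core idea in plain Python and does not import `datasets`. The two-column `"label","text"` CSV layout with 1-5 star labels is an assumption made for this example, as are the function name and the sample rows; check #386 or #1315 for the actual structure used in the repository.

```python
import csv
import io
from typing import Dict, Iterator, TextIO, Tuple


def generate_examples(f: TextIO) -> Iterator[Tuple[int, Dict[str, object]]]:
    """Yield (key, example) pairs from a two-column CSV of label,text.

    Hypothetical stand-in for the example generator of a loading script.
    """
    for key, row in enumerate(csv.reader(f)):
        if len(row) != 2:
            raise ValueError(f"row {key}: expected 2 columns, got {len(row)}")
        label, text = row
        # Assumed labels are 1-5 stars; shift them to 0-4 class ids.
        yield key, {"text": text, "label": int(label) - 1}


# Illustrative rows, not taken from any real dataset.
sample = io.StringIO('"5","Great food."\n"1","Never again, ""slow"" service."\n')
examples = list(generate_examples(sample))
for key, example in examples:
    print(key, example)
```

Keeping the parsing in a generator like this means the file is read row by row, which matters for the larger corpora on the list.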
Tech stack: python
Domain: machine learning, data
Issue type: Feature
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: GitPython
Newbie friendliness: 70