[Dataset requests] New datasets for Text Classification · huggingface/datasets#353


We are missing a few datasets for text classification, which is an important field.

Namely, it would be really nice to add:

TREC-6 dataset (see here for instance: https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html#torchnlp.datasets.trec_dataset) [done]
- #386
Yelp-5
- #1315
Movie review (Movie Review (MR) dataset [156]) [done (same as rotten_tomatoes)]
SST (Stanford Sentiment Treebank) [included in glue]
- #1934
Multi-Perspective Question Answering (MPQA) dataset [requires authentication (i.e., manual download)]
Amazon. This is a popular corpus of product reviews collected from the Amazon website [159]. It contains labels for both binary and multi-class (5-class) classification.
- #791
- #1389
20 Newsgroups. The 20 Newsgroups dataset [done]
- #410
Sogou News dataset [done]
- #450
Reuters news. The Reuters-21578 dataset [165] [done]
- #471
DBpedia. The DBpedia dataset [170]
- #1116
Ohsumed. The Ohsumed collection [171] is a subset of the MEDLINE database
EUR-Lex. The EUR-Lex dataset
WOS. The Web Of Science (WOS) dataset [done]
- #424
PubMed. PubMed [173]
TREC-QA: TREC-6 + TREC-50
- See above: TREC-6 dataset
Quora. The Quora dataset [180]
- #366
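
For anyone picking up one of the remaining entries, most of the work in a dataset script is turning raw files into (key, example) pairs. The sketch below illustrates that step using the TREC entries above, where TREC-6 and TREC-50 are the same questions labeled at coarse (6-class) and fine (50-class) granularity. It assumes each raw line has the layout `COARSE:fine question text`; the function name `generate_trec_examples` and the field names are hypothetical and are not taken from the repository's actual script.

```python
from typing import Dict, Iterable, Iterator, Tuple


def generate_trec_examples(
    lines: Iterable[str],
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs from TREC-style lines.

    Assumed line layout (an assumption, not verified against the
    repository script): "COARSE:fine question text".
    """
    key = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # The label is everything before the first space.
        label, _, text = line.partition(" ")
        # The coarse class is the part of the label before the colon.
        coarse, _, _ = label.partition(":")
        yield key, {
            "coarse_label": coarse,  # TREC-6 style label
            "fine_label": label,     # TREC-50 style label, kept in full
            "text": text,
        }
        key += 1


if __name__ == "__main__":
    sample = [
        "DESC:manner How did serfdom develop in and then leave Russia ?",
        "ENTY:cremat What films featured the character Popeye Doyle ?",
    ]
    for key, example in generate_trec_examples(sample):
        print(key, example)
```

In a real contribution this logic would sit inside the dataset builder class used by the already-merged scripts (see #386 for TREC), alongside the download and split definitions.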

Research direction: Pick one of the remaining datasets from the list (e.g. MPQA, Ohsumed, EUR-Lex, PubMed) and follow the pattern of the datasets already added to the repository. Look at examples such as #386 or #1315 to see how a dataset is added. Make sure you have the necessary permissions or authentication where required.
Tech stack: Python
Domain: machine learning, data
Issue type: Feature
Difficulty: 3
Estimated time: Half a day
Task status: Active
Clarity: Clear
Prerequisites: GitPython
Beginner-friendly: 70