[Dataset requests] New datasets for Text Classification · huggingface/datasets#353

Repository metrics

We are missing a few datasets for Text Classification, which is an important field.

Namely, it would be really nice to add:

TREC-6 dataset (see here for instance: https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html#torchnlp.datasets.trec_dataset) [done]
- #386
Yelp-5
- #1315
Movie Review (MR) dataset [156] [done (same as rotten_tomatoes)]
SST (Stanford Sentiment Treebank) [included in glue]
- #1934
Multi-Perspective Question Answering (MPQA) dataset [requires authentication (i.e., manual download)]
Amazon. This is a popular corpus of product reviews collected from the Amazon website [159]. It contains labels for both binary classification and multi-class (5-class) classification.
- #791
- #1389
20 Newsgroups. The 20 Newsgroups dataset [done]
- #410
Sogou News dataset [done]
- #450
Reuters news. The Reuters-21578 dataset [165] [done]
- #471
DBpedia. The DBpedia dataset [170]
- #1116
Ohsumed. The Ohsumed collection [171] is a subset of the MEDLINE database
EUR-Lex. The EUR-Lex dataset
WOS. The Web Of Science (WOS) dataset [done]
- #424
PubMed. PubMed [173]
TREC-QA: TREC-6 + TREC-50
- See above: TREC-6 dataset
Quora. The Quora dataset [180]
- #366
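
The TREC-QA entry above groups TREC-6 and TREC-50 because both label sets can be read from the same files. The sketch below assumes each line has the form `COARSE:fine question text`, with the coarse part as the TREC-6 label and the full `COARSE:fine` string as the TREC-50 label. `TrecExample` and `parse_trec_line` are hypothetical names, and the sample lines are made up for illustration rather than copied from the corpus.

```python
from typing import NamedTuple


class TrecExample(NamedTuple):
    coarse_label: str  # TREC-6 target, e.g. "LOC"
    fine_label: str    # TREC-50 target, e.g. "LOC:city"
    text: str


def parse_trec_line(line: str) -> TrecExample:
    """Split one 'COARSE:fine question ...' line into labels and text."""
    # The label is everything before the first space; the rest is the question.
    label, _, text = line.strip().partition(" ")
    coarse, sep, fine = label.partition(":")
    if not sep or not coarse or not fine or not text:
        raise ValueError(f"not a TREC-style line: {line!r}")
    return TrecExample(coarse_label=coarse, fine_label=label, text=text)


# Made-up lines in the assumed format (not taken from the real dataset).
sample = [
    "LOC:city What is the capital of France ?",
    "NUM:date When was the telephone invented ?",
]
parsed = [parse_trec_line(s) for s in sample]
```

Keeping both labels on every example means a single loader can serve either task, which is presumably why the list points TREC-QA back to the TREC-6 entry.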

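Research direction: Pick one of the remaining datasets from the list (e.g., MPQA, Ohsumed, EUR-Lex, PubMed) and follow the pattern of the datasets previously added to the repository. See examples such as #386 or #1315 to learn how a dataset is added. If needed, make sure you have the necessary permissions or authentication.
Tech stack: Python
Domain: machine learning, data
Issue type: feature
Difficulty: 3
Estimated time: half a day
Activity status: active
Clarity: clear
Prerequisites: GitPython
Beginner friendliness: 70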
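
As a rough illustration of what adding a text classification dataset involves, loading scripts in the library conventionally expose a `_generate_examples` method that yields `(key, example)` pairs. The standalone sketch below mimics that shape with the standard library only. The CSV layout (`label,text` columns), the `LABEL_NAMES` list, and the function name `generate_examples` are assumptions for illustration, not the format of any dataset listed in this issue.

```python
import csv
import io
from typing import Dict, Iterator, Tuple

# Hypothetical label set; a real dataset defines its own class names.
LABEL_NAMES = ["negative", "positive"]


def generate_examples(csv_file) -> Iterator[Tuple[int, Dict[str, object]]]:
    """Yield (key, example) pairs from a CSV with 'label' and 'text' columns."""
    reader = csv.DictReader(csv_file)
    for key, row in enumerate(reader):
        yield key, {
            "text": row["text"],
            # Store the class as an integer index into LABEL_NAMES.
            "label": LABEL_NAMES.index(row["label"]),
        }


# Made-up rows standing in for a downloaded data file.
raw = io.StringIO(
    "label,text\n"
    'positive,"Great product, works well."\n'
    "negative,Stopped working after a week.\n"
)
examples = list(generate_examples(raw))
```

In an actual contribution, this logic would sit inside a builder class as shown in the referenced pull requests; the sketch only shows the example-generation step.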