piskvorky/gensim

WikiCorpus.filter_wiki/remove_markup don't remove heading-markup

Open

#2561, created on July 26, 2019

8 comments · 0 reactions · 0 assignees · Python · 15,144 stars · 4,349 forks
Labels: Hacktoberfest, bug, difficulty medium, good first issue, help wanted, impact LOW, reach LOW

Description

Problem description

I am trying to get clean wiki texts, but the output still contains heading markup.

Steps/code/corpus to reproduce

Create a WikiCorpus and call get_texts. Some of the returned texts still contain ==headings== (with varying numbers of = signs and varying heading text). A minimal test case:

>>> import gensim.corpora.wikicorpus
>>> print(gensim.corpora.wikicorpus.filter_wiki('===heading==='))
===heading===
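
For illustration, here is a minimal sketch of one way the heading markup could be stripped with a regular expression. It is an assumption about a possible approach, not gensim's actual implementation or the fix adopted for this issue; the names HEADING_RE and strip_heading_markup are hypothetical and do not exist in gensim.

```python
import re

# Hypothetical helper, not part of gensim. Matches a whole line of the form
# ==Heading== where the same run of 1 to 6 '=' signs opens and closes it.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def strip_heading_markup(text):
    """Replace each wiki heading line with its bare heading text."""
    return HEADING_RE.sub(r"\2", text)


print(strip_heading_markup("===heading==="))  # prints: heading
```

This keeps the heading text and drops only the surrounding = signs. Whether the heading text itself should be kept or removed from the cleaned output is a design choice the issue does not settle.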

Versions

Python 3.7.4 (default, Jul 13 2019, 14:04:11) 
[GCC 8.3.0]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 1
