piskvorky/gensim
WikiCorpus.filter_wiki/remove_markup don't remove heading-markup
Open
#2561 opened on Jul 26, 2019
Labels: Hacktoberfest, bug, difficulty medium, good first issue, help wanted, impact LOW, reach LOW
Description
Problem description
I am trying to get clean wiki texts, but the output still contains heading markup.
Steps/code/corpus to reproduce
Create a WikiCorpus and call get_texts. Some of the returned texts still contain ==headings== (with varying numbers of = and different heading text, of course).
Simple test-case:
>>> import gensim.corpora.wikicorpus
>>> print(gensim.corpora.wikicorpus.filter_wiki('===heading==='))
===heading===
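Until this is handled inside gensim, one possible workaround is to post-process the text with a regular expression. The sketch below is only an illustration: `strip_heading_markup` and `HEADING_RE` are hypothetical names, not part of gensim, and the pattern assumes a heading occupies a whole line with the same number of = (1 to 6) on both sides.

```python
import re

# Hypothetical helper, not part of gensim. Matches a full line of the form
# "== title ==" with the same run of 1 to 6 '=' characters on both sides.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def strip_heading_markup(text):
    """Replace each wiki heading line with its bare title text."""
    return HEADING_RE.sub(r"\2", text)


print(strip_heading_markup("===heading==="))  # prints: heading
```

The title text is kept rather than dropped, on the assumption that heading words are still useful tokens; substituting an empty string instead would remove the heading lines entirely.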
Versions
Python 3.7.4 (default, Jul 13 2019, 14:04:11)
[GCC 8.3.0]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 1