piskvorky/gensim

WikiCorpus.filter_wiki/remove_markup don't remove heading-markup

Open

#2561 opened on Jul 26, 2019

Labels: Hacktoberfest, bug, difficulty medium, good first issue, help wanted, impact LOW, reach LOW

Description

Problem description

I am trying to get clean wiki texts, but the output still contains heading markup.

Steps/code/corpus to reproduce

Create a WikiCorpus and call get_texts. Some of the returned texts still contain ==headings== (with varying numbers of = and different heading text, of course). A simple test case:

>>> import gensim.corpora.wikicorpus
>>> print(gensim.corpora.wikicorpus.filter_wiki('===heading==='))
===heading===
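
A possible workaround until this is fixed, as a minimal sketch only: post-process the text with a regular expression that unwraps heading lines. The helper name `strip_heading_markup` and the pattern `HEADING_RE` are hypothetical and not part of gensim, and the sketch assumes headings sit on their own line with the same number of = (1 to 6) on both sides.

```python
import re

# Hypothetical helper, not part of gensim.
# Matches a whole line of the form "== title ==", with 1 to 6 "=" characters
# and the same count on both sides (enforced by the \1 backreference).
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def strip_heading_markup(text):
    """Replace each wiki heading line with its bare title text."""
    return HEADING_RE.sub(r"\2", text)


print(strip_heading_markup("===heading==="))
```

This could be applied to the output of filter_wiki. Lines that merely contain an = sign, such as "a = b", are left unchanged because the pattern must match the entire line.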

Versions

Python 3.7.4 (default, Jul 13 2019, 14:04:11) 
[GCC 8.3.0]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 1
