piskvorky/gensim

WikiCorpus.filter_wiki/remove_markup don't remove heading-markup

Open

#2561 opened on July 26, 2019

8 comments, 0 reactions, 0 assignees. Python. 15,144 stars, 4,349 forks.
Labels: Hacktoberfest, bug, difficulty medium, good first issue, help wanted, impact LOW, reach LOW

Description

Problem description

I am trying to get clean wiki texts, but the output still contains heading markup.

Steps/code/corpus to reproduce

Create a WikiCorpus and call get_texts. Some of the returned texts still contain ==headings== (with varying numbers of = signs and varying heading text, of course). A simple test case:

>>> import gensim.corpora.wikicorpus
>>> print(gensim.corpora.wikicorpus.filter_wiki('===heading==='))
===heading===
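
As a possible stopgap, the heading markers could be stripped in a separate pass before or after calling filter_wiki. The following is only a minimal sketch: strip_heading_markup and RE_HEADING are hypothetical names that are not part of gensim, and the pattern assumes each heading sits on its own line with the same number of = signs on both sides.

```python
import re

# Hypothetical helper, NOT part of gensim: matches a line of the form
# "== Title ==" with 1 to 6 "=" signs on each side (same count on both sides).
RE_HEADING = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def strip_heading_markup(text):
    """Replace each heading line with its bare title text."""
    return RE_HEADING.sub(r"\2", text)


print(strip_heading_markup("===heading==="))
print(strip_heading_markup("Intro\n== History ==\nBody text"))
```

This keeps the heading title as ordinary text. Dropping the heading line entirely would be an equally reasonable choice, depending on whether titles are wanted in the cleaned corpus.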

Versions

Python 3.7.4 (default, Jul 13 2019, 14:04:11) 
[GCC 8.3.0]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 1
