TfidfVectorizer ngrams does not work when vocabulary provided · scikit-learn/scikit-learn#16017


Labels: Bug, help wanted, module:feature_extraction

Description

The TfidfVectorizer does not honor the ngram_range argument when the vocabulary is provided.

Steps/Code to Reproduce

Example 1: no vocabulary is provided. This works as expected:

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc',
     'bcd',
     'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        ngram_range=(2, 2))

sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['ab', 'bc', 'cd', 'de']

Example 2: a vocabulary is provided. This does not work as expected:

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc',
     'bcd',
     'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary={'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4},
                        ngram_range=(2, 2))

sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['a', 'b', 'c', 'd', 'e']
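
A minimal sketch of what appears to be going on in Example 2, assuming a fixed vocabulary only acts as a filter on the tokens the analyzer emits. The analyzer returned by `build_analyzer()` still produces bigrams, but none of them is a key in the single-character vocabulary, so the resulting matrix has no stored entries:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc', 'bcd', 'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary={'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4},
                        ngram_range=(2, 2))

# The analyzer is configured by ngram_range, so it still emits bigrams.
analyzer = tfidf.build_analyzer()
bigrams = analyzer('abc')
print(bigrams)
# ['ab', 'bc']

# None of these bigrams is in the fixed vocabulary, so nothing is counted:
# one column per vocabulary entry, but no non-zero values.
sps = tfidf.fit_transform(X)
print(sps.shape, sps.nnz)
# (3, 5) 0
```

So ngram_range is applied when tokenizing, while the feature names come straight from the supplied vocabulary.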

Note that it works if the vocabulary I provide consists of the ngrams themselves:

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc',
     'bcd',
     'cde']

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary=['ab', 'bc', 'cd', 'de'],
                        ngram_range=(2, 2))

sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['ab', 'bc', 'cd', 'de']

But that seems kind of silly, since I can't possibly know all of the ngrams a priori for a large dataset.
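
One possible workaround, sketched under the assumption that the goal is to restrict features to ngrams over a known alphabet: enumerate every candidate bigram with `itertools.product` and pass that list as the vocabulary. The names `alphabet` and `bigram_vocabulary` are illustrative only, and the list grows as the alphabet size raised to the ngram length, so this is only practical for small alphabets and short ngrams.

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer

X = ['abc', 'bcd', 'cde']

alphabet = 'abcde'  # assumption: the set of allowed characters is known up front
# Every 2-character string over the alphabet: 5 * 5 = 25 candidates.
bigram_vocabulary = [''.join(pair) for pair in product(alphabet, repeat=2)]

tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary=bigram_vocabulary,
                        ngram_range=(2, 2))
sps = tfidf.fit_transform(X)

# Map column indices back to terms and keep only the columns that occur in X.
index_to_term = {index: term for term, index in tfidf.vocabulary_.items()}
observed = sorted({index_to_term[int(j)] for j in sps.nonzero()[1]})
print(observed)
# ['ab', 'bc', 'cd', 'de']
```

The matrix keeps all 25 columns; the bigrams that never occur are simply all-zero columns.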

Expected Results

I expected the features to still be ngrams when a vocabulary is provided, but they were not.

Actual Results

See steps to reproduce above.

Versions

System:
    python: 3.7.5 (default, Oct 25 2019, 10:52:18)  [Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /anaconda3/envs/myenv/bin/python
   machine: Darwin-18.6.0-x86_64-i386-64bit

Python dependencies:
       pip: 19.3.1
setuptools: 42.0.2.post20191203
   sklearn: 0.22
     numpy: 1.17.4
     scipy: 1.4.0
    Cython: 0.29.14
    pandas: 0.25.3
matplotlib: 3.1.2
    joblib: 0.14.1

Built with OpenMP: True
