scikit-learn/scikit-learn
TfidfVectorizer ngrams does not work when vocabulary provided
Closed
#16017, created on January 3, 2020
Labels: Bug, help wanted, module:feature_extraction
Description
The TfidfVectorizer does not honor the ngram_range argument when the vocabulary is provided.
Steps/Code to Reproduce
Example 1: no vocabulary is provided. This works as expected:
from sklearn.feature_extraction.text import TfidfVectorizer
X = ['abc',
     'bcd',
     'cde']
tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        ngram_range=(2, 2))
sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['ab', 'bc', 'cd', 'de']
Example 2: a vocabulary is provided. This does not work as expected:
from sklearn.feature_extraction.text import TfidfVectorizer
X = ['abc',
     'bcd',
     'cde']
tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary={'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4},
                        ngram_range=(2, 2))
sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['a', 'b', 'c', 'd', 'e']
Note that it works if the vocabulary I provide consists of the n-grams themselves:
from sklearn.feature_extraction.text import TfidfVectorizer
X = ['abc',
     'bcd',
     'cde']
tfidf = TfidfVectorizer(stop_words=None,
                        analyzer='char',
                        vocabulary=['ab', 'bc', 'cd', 'de'],
                        ngram_range=(2, 2))
sps = tfidf.fit_transform(X)
print(tfidf.get_feature_names())
# ['ab', 'bc', 'cd', 'de']
But that seems impractical, since I can't possibly know all of the n-grams a priori for a large dataset.
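If the goal is to restrict features to n-grams built from a known set of characters, one possible workaround might be to derive the n-gram vocabulary from the data in a first pass and then pass that list as `vocabulary`, as in the third example. The following is only a sketch in plain Python: the helper `ngram_vocabulary` and its `allowed_chars` parameter are hypothetical names for illustration, not part of scikit-learn.

```python
def ngram_vocabulary(docs, allowed_chars, n):
    """Collect the character n-grams seen in `docs` whose characters
    are all in `allowed_chars`, returned as a sorted list."""
    allowed = set(allowed_chars)
    seen = set()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            gram = doc[i:i + n]
            # Keep the n-gram only if every character is allowed.
            if set(gram) <= allowed:
                seen.add(gram)
    return sorted(seen)


X = ['abc', 'bcd', 'cde']
# Hypothetical restriction: only the characters a-d are of interest.
vocab = ngram_vocabulary(X, allowed_chars='abcd', n=2)
print(vocab)
# ['ab', 'bc', 'cd']
```

The resulting list could then be passed as `vocabulary=vocab` together with `ngram_range=(2, 2)`. This costs one extra pass over the data, which may or may not be acceptable for a large dataset.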
Expected Results
I expected to still get n-grams when a vocabulary is provided, but did not.
Actual Results
See steps to reproduce above.
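One possible reading of this result is that the analyzer still produces 2-grams, but each 2-gram is then looked up in the fixed vocabulary and dropped when it is not found, so a vocabulary of single characters never matches. This is an assumption about the behavior, not something verified against the scikit-learn source here. The sketch below imitates that lookup in plain Python; `char_ngrams` and `count_with_fixed_vocabulary` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def count_with_fixed_vocabulary(docs, vocabulary, n):
    """Count n-grams per document, keeping only those found in `vocabulary`.

    `vocabulary` maps each term to its column index.
    """
    # Order the columns by the index stored in the vocabulary.
    terms = sorted(vocabulary, key=vocabulary.get)
    rows = []
    for doc in docs:
        counts = Counter(g for g in char_ngrams(doc, n) if g in vocabulary)
        rows.append([counts.get(term, 0) for term in terms])
    return rows


X = ['abc', 'bcd', 'cde']
unigram_vocab = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
bigram_vocab = {'ab': 0, 'bc': 1, 'cd': 2, 'de': 3}

# 2-grams looked up in a vocabulary of single characters: nothing matches.
rows_unigram = count_with_fixed_vocabulary(X, unigram_vocab, n=2)
print(rows_unigram)
# [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]

# 2-grams looked up in a vocabulary of 2-grams: counts appear.
rows_bigram = count_with_fixed_vocabulary(X, bigram_vocab, n=2)
print(rows_bigram)
# [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
```

If this reading is right, the feature names in Example 2 would simply echo the supplied vocabulary while the matrix itself stays empty, which could be checked by inspecting `sps.nnz`.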
Versions
System:
    python: 3.7.5 (default, Oct 25 2019, 10:52:18) [Clang 4.0.1 (tags/RELEASE_401/final)]
    executable: /anaconda3/envs/myenv/bin/python
    machine: Darwin-18.6.0-x86_64-i386-64bit

Python dependencies:
    pip: 19.3.1
    setuptools: 42.0.2.post20191203
    sklearn: 0.22
    numpy: 1.17.4
    scipy: 1.4.0
    Cython: 0.29.14
    pandas: 0.25.3
    matplotlib: 3.1.2
    joblib: 0.14.1

Built with OpenMP: True