Fail index analyzer that contains a graph token filter · elastic/elasticsearch#24396

(3 comments) (0 reactions) (0 assignees) Java (76,700 stars) (25,882 forks) batch import

:Search Relevance/Analysis, >enhancement, Team:Search Relevance, help wanted

Description

Currently it is possible to set a synonym_graph or a word_delimiter_graph token filter in an analyzer that is used at index time. However, these filters can produce side paths that break the positions in the index and make phrase query matching impossible on the field. The flatten_graph token filter is supposed to handle this situation, but it can only flatten the graph, which is itself a lossy operation. So whether or not the user adds a flatten_graph filter at the end of the analyzer, the positions of the terms in the index will not be accurate.

Instead, we could try to detect these situations and fail the mapping if a graph filter is used in an index analyzer. This would allow us to remove the flatten_graph filter and would also help users avoid shooting themselves in the foot. Here is a hopefully exhaustive list of token filters that should be impacted by this:

synonym_graph
word_delimiter_graph
shingle (only when output_unigrams: true or min_shingle_size < max_shingle_size)
cjk_bigram (only when output_unigrams: true)
ngram tokenizer (when min_gram < max_gram)
common_grams
kuromoji_tokenizer (when nbest_cost or nbest_examples > 1)
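
To make the position problem from the description concrete, here is a minimal sketch of what happens when a token graph is written to a positional index. It is a simplified model, not Lucene code: the `Token` class, the `index` and `phraseMatches` helpers, and the synonym rule (dns expands to "domain name system", original kept) are all assumptions made for illustration. The only behaviour it relies on is that the index stores absolute positions and drops the position length of each token.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of a token stream and a positional index.
// None of these types are Lucene or Elasticsearch classes.
final class GraphPositionsDemo {

    static final class Token {
        final String term;
        final int positionIncrement; // distance from the previous token's position
        final int positionLength;    // number of positions the token spans in the graph

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    // Builds term -> absolute positions. positionLength is never read here,
    // which is the lossy step: the index has no place to store it.
    static Map<String, List<Integer>> index(List<Token> tokens) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            postings.computeIfAbsent(t.term, k -> new ArrayList<>()).add(position);
        }
        return postings;
    }

    // Exact phrase match: the terms must occupy consecutive positions.
    static boolean phraseMatches(Map<String, List<Integer>> postings, String... phrase) {
        for (int start : postings.getOrDefault(phrase[0], List.of())) {
            boolean ok = true;
            for (int i = 1; i < phrase.length && ok; i++) {
                ok = postings.getOrDefault(phrase[i], List.of()).contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // Token graph for the text "dns is down" under the assumed synonym rule.
    static List<Token> graphTokens() {
        return List.of(
            new Token("dns", 1, 3),    // original term, spans the whole side path
            new Token("domain", 0, 1), // side path starts at the same position
            new Token("name", 1, 1),
            new Token("system", 1, 1),
            new Token("is", 1, 1),
            new Token("down", 1, 1));
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = index(graphTokens());
        System.out.println("\"system is\": " + phraseMatches(postings, "system", "is"));
        System.out.println("\"dns is\": " + phraseMatches(postings, "dns", "is"));
        System.out.println("\"dns name\": " + phraseMatches(postings, "dns", "name"));
    }
}
```

In this model the phrase "dns is" does not match even though the original text contains it, while "dns name" matches even though it never occurred. Both errors come from the lost position length, and flattening the graph before indexing does not bring that information back.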

Contributor guide

Tech stack: Java
Domain: backend, search
Issue type: feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: stale
Clarity: mostly clear
Prerequisites: Understanding of Elasticsearch analyzers, basic Java knowledge
Beginner friendliness: 35
Research direction: The issue proposes detecting graph token filters in index analyzers and failing the mapping when one is found. Currently, the flatten graph filter is used as a workaround. The codebase likely validates analyzer chains in the analysis module. Start by examining the token filter registration and analyzer building logic, e.g., in `AnalysisRegistry` or `AnalyzerProvider`. Add a validation step that checks for the components listed in the issue: synonym graph filter, word delimiter graph filter, shingles (when output unigrams is true or min size < max size), cjk (when output unigrams is true), ngram tokenizer (when min gram < max gram), common grams, and kuromoji tokenizer (when nbest cost or nbest examples > 1). When one is detected, throw an error. Also consider removing the flatten graph filter.
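
The validation step suggested above could look roughly like the following sketch. It is not based on the actual Elasticsearch code: `Component`, `mayProduceGraph`, and `validateIndexAnalyzer` are hypothetical names, the setting keys and their defaults are assumptions that should be checked against the real filter factories, and the kuromoji_tokenizer case is left out because its condition needs the plugin's own settings parsing.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for analysis components; not Elasticsearch classes.
final class GraphFilterValidation {

    // One tokenizer or token filter in an analyzer chain: its type and settings.
    static final class Component {
        final String type;
        final Map<String, Object> settings;

        Component(String type, Map<String, Object> settings) {
            this.type = type;
            this.settings = settings;
        }
    }

    static boolean flag(Component c, String key, boolean defaultValue) {
        Object v = c.settings.get(key);
        return v == null ? defaultValue : Boolean.parseBoolean(v.toString());
    }

    static int number(Component c, String key, int defaultValue) {
        Object v = c.settings.get(key);
        return v == null ? defaultValue : Integer.parseInt(v.toString());
    }

    // Encodes the conditions listed in the issue. Setting names and defaults
    // are assumptions made for this sketch.
    static boolean mayProduceGraph(Component c) {
        switch (c.type) {
            case "synonym_graph":
            case "word_delimiter_graph":
            case "common_grams":
                return true;
            case "shingle":
                return flag(c, "output_unigrams", true)
                    || number(c, "min_shingle_size", 2) < number(c, "max_shingle_size", 2);
            case "cjk_bigram":
                return flag(c, "output_unigrams", false);
            case "ngram":
                return number(c, "min_gram", 1) < number(c, "max_gram", 2);
            default:
                return false;
        }
    }

    // Rejects an index-time analyzer whose chain contains a graph-producing component.
    static void validateIndexAnalyzer(String analyzerName, List<Component> chain) {
        for (Component c : chain) {
            if (mayProduceGraph(c)) {
                throw new IllegalArgumentException("analyzer [" + analyzerName
                    + "] cannot be used at index time: [" + c.type
                    + "] may produce a token graph");
            }
        }
    }

    public static void main(String[] args) {
        List<Component> chain = List.of(
            new Component("lowercase", Map.of()),
            new Component("synonym_graph", Map.of()));
        try {
            validateIndexAnalyzer("my_index_analyzer", chain);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A real implementation would presumably hook into the place where mappings resolve their index analyzer, so that search-time analyzers using the same filters stay valid.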