Move analysis components to a module · elastic/elasticsearch#23658

2017-03-20T17:24:39.000Z

We'd like to move the analyzers from Elasticsearch core into a module. They would still ship with Elasticsearch, just not with the Elasticsearch jar. We like this for a few reasons:

1. It reduces the size of the high-level REST client and the transport client. They don't need to reference analyzers.
2. It proves that analysis plugins are first-class citizens by consuming the plugin API for setting up the analyzers.
3. It forces us to develop features a little more generically, not relying on specific analyzers, which is a good thing if you are going to have a first-class plugin API.

At this point I propose we move analysis components a few at a time. Claim the components you'd like to move, using the list below, before doing the move. We're doing this directly in master and 5.x. There is no need for a long-running branch for this. Keep in mind when claiming components that moving the code is not time consuming, but fixing tests that rely on the components might be.

Misc
-------

* [x] Allow plugins to build "pre-built" analysis components. This blocks a number of the analyzers below.
  * [x] Token filters #24223 #24572
  * [x] Analyzers #31095
  * [x] Tokenizers #24751 #24863
  * [x] Char filters #25000
* [ ] Remove core's dependency on `lucene-analyzers-common.jar`

Analyzers
--------

* [x] Standard Analyzer (This one will stay in core. It isn't part of `lucene-analyzers-common.jar`, and keeping it in core will keep testing easier.)
* [x] Simple Analyzer
* [x] Whitespace Analyzer
* [x] Stop Analyzer
* [x] Keyword Analyzer
* [x] Pattern Analyzer #31095
* [x] Language Analyzers #31143 #31300
* [x] Fingerprint Analyzer #31095
* [x] Standard HTML Strip Analyzer #31095

Tokenizers
--------

* [x] Standard Tokenizer (I believe this one will also stay in core for the same reasons Standard Analyzer is staying.)
* [x] Letter Tokenizer (#30538)
* [x] Lowercase Tokenizer (#30538)
* [x] Whitespace Tokenizer (#30538)
* [x] UAX URL Email Tokenizer (#30538)
* [x] Classic Tokenizer (#30538)
* [x] Thai Tokenizer (#30538)
* [x] N-Gram Tokenizer (#30538)
* [x] Edge N-Gram Tokenizer (#30538)
* [x] Keyword Tokenizer (#30642)
* [x] Pattern Tokenizer (#30538)
* [x] Path Tokenizer (#30538)

Token Filters
--------

* [x] Standard Token Filter (This will stay in core, because `StandardFilter` is part of lucene-core)
* [x] ASCII Folding Token Filter (#23614)
* [x] Flatten Graph Token Filter @martijnvg (#25214)
* [x] Length Token Filter @martijnvg (#25214)
* [x] Lowercase Token Filter @martijnvg (#25214)
* [x] Uppercase Token Filter @martijnvg (#25214)
* [x] NGram Token Filter @martijnvg (#25214)
* [x] Edge NGram Token Filter @martijnvg (#25214)
* [x] Porter Stem Token Filter @martijnvg (#24948)
* [ ] Shingle Token Filter (trickier: `PhraseSuggestionBuilder` uses `ShingleTokenFilterFactory`'s getters)
* [x] Stop Token Filter (can remain in core as it uses classes from lucene-core and lucene-suggest jars)
* [x] Word Delimiter Token Filter (#23614)
* [x] Stemmer Token Filter (#25384)
* [x] Stemmer Override Token Filter (#25384)
* [x] Keyword Marker Token Filter @martijnvg (#24948)
* [x] Keyword Repeat Token Filter (Hasn't been exposed yet as a token filter)
* [x] KStem Token Filter (#25384)
* [x] Snowball Token Filter @martijnvg (#24948)
* [x] Phonetic Token Filter (Is already in its own module `analysis-phonetic`)
* [x] Synonym Token Filter (trickier: `CustomAnalyzerProvider`, which is used by the analyze API, depends on it, and this token filter relies on `AnalysisRegistry`) #33868
* [x] Synonym Graph Token Filter (trickier: `CustomAnalyzerProvider`, which is used by the analyze API, depends on it, and this token filter relies on `AnalysisRegistry`) #33868
* [x] Compound Word Token Filter (#25384)
* [x] Reverse Token Filter (#25384)
* [x] Elision Token Filter (#25384)
* [x] Truncate Token Filter (#25384)
* [x] Unique Token Filter @martijnvg (#25214)
* [x] Pattern Capture Token Filter (#25580)
* [x] Pattern Replace Token Filter (#25580)
* [x] Trim Token Filter @martijnvg (#24948)
* [x] Limit Token Count Token Filter (#25580)
* [ ] Hunspell Token Filter (trickier: because of its infrastructure, `AnalysisPlugin#getHunspellDictionaries()`)
* [x] Common Grams Token Filter (#25580)
* [x] Normalization Token Filter (#25715)
* [x] CJK Width Token Filter (#25715)
* [x] CJK Bigram Token Filter (#25715)
* [x] Delimited Payload Token Filter (#25784)
* [x] Keep Words Token Filter (#25784)
* [x] Keep Types Token Filter (#25784)
* [x] Classic Token Filter (#25784)
* [x] Apostrophe Token Filter (#25784)
* [x] Decimal Digit Token Filter (#25784)
* [x] Fingerprint Token Filter (#25784)
* [x] MinHash Token Filter (#25784)
* [x] Scandinavian Folding Token Filter (#25784)
* [x] Language Stem Token Filters (Arabic, Brazilian, Czech, Dutch, French, German, Russian) (#26042)

Character Filters
---------

* [x] HTML Strip Char Filter #24261
* [x] Mapping Char Filter #24261
* [x] Pattern Replace Char Filter #24261
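
The idea behind reasons 2 and 3 above (core only collects whatever analysis components plugins register, and never names a specific one) can be illustrated with a minimal, self-contained Java sketch. This is not the real Elasticsearch plugin API: `TokenFilterSketch`, `AnalysisPluginSketch`, `AnalysisRegistrySketch`, and `CommonAnalysisModuleSketch` are hypothetical names invented for illustration, and token filters are modelled as plain functions over lists of strings rather than Lucene token streams.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class AnalysisModuleSketch {

    /** Hypothetical stand-in for a token filter: maps a token list to a token list. */
    interface TokenFilterSketch extends Function<List<String>, List<String>> {}

    /** Hypothetical stand-in for the plugin extension point (not the real AnalysisPlugin). */
    interface AnalysisPluginSketch {
        Map<String, TokenFilterSketch> getTokenFilters();
    }

    /** "Core": knows no concrete filter, only what plugins hand it. */
    static final class AnalysisRegistrySketch {
        private final Map<String, TokenFilterSketch> filters = new LinkedHashMap<>();

        AnalysisRegistrySketch(List<AnalysisPluginSketch> plugins) {
            for (AnalysisPluginSketch plugin : plugins) {
                for (Map.Entry<String, TokenFilterSketch> e : plugin.getTokenFilters().entrySet()) {
                    // Reject duplicate names so two modules cannot silently shadow each other.
                    if (filters.putIfAbsent(e.getKey(), e.getValue()) != null) {
                        throw new IllegalArgumentException(
                            "token filter [" + e.getKey() + "] already registered");
                    }
                }
            }
        }

        /** Applies the named filters in order; fails on names no plugin registered. */
        List<String> analyze(List<String> tokens, String... filterNames) {
            List<String> current = tokens;
            for (String name : filterNames) {
                TokenFilterSketch filter = filters.get(name);
                if (filter == null) {
                    throw new IllegalArgumentException("unknown token filter [" + name + "]");
                }
                current = filter.apply(current);
            }
            return current;
        }
    }

    /** "Module": ships alongside core but registers its filters through the plugin interface. */
    static final class CommonAnalysisModuleSketch implements AnalysisPluginSketch {
        @Override
        public Map<String, TokenFilterSketch> getTokenFilters() {
            Map<String, TokenFilterSketch> filters = new LinkedHashMap<>();
            filters.put("lowercase", tokens -> tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList()));
            filters.put("reverse", tokens -> tokens.stream()
                .map(t -> new StringBuilder(t).reverse().toString())
                .collect(Collectors.toList()));
            return filters;
        }
    }

    static List<String> demo() {
        AnalysisRegistrySketch registry = new AnalysisRegistrySketch(
            Collections.singletonList(new CommonAnalysisModuleSketch()));
        return registry.analyze(Arrays.asList("Quick", "Fox"), "lowercase", "reverse");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [kciuq, xof]
    }
}
```

In this sketch, removing `CommonAnalysisModuleSketch` from the plugin list makes `analyze` fail on `"lowercase"`, which mirrors the point made above: tests in core that rely on a moved component are what take time to fix, not the move itself.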

Contributor Guide

Tech stack: java
Domain: backend
Issue type: refactor
Difficulty: 4
Estimated time: over 1 week
Activity status: stale
Clarity: clear
Prerequisites: Java development, Elasticsearch plugin API, Maven/Gradle build system
Beginner friendliness: 10
Research direction: The remaining work involves removing core's dependency on `lucene-analyzers-common.jar`, moving the Shingle Token Filter (tricky because `PhraseSuggestionBuilder` uses `ShingleTokenFilterFactory`'s getters), and moving the Hunspell Token Filter (tricky because of `AnalysisPlugin#getHunspellDictionaries()`). Review the linked PRs for previous moves, such as #24223, #24572, #31095, and others. Examine the current state of these components in the Elasticsearch source to understand the required changes.
