Move analysis components to a module · elastic/elasticsearch#23658

2017-03-20T17:24:39.000Z

We'd like to move the analyzers from Elasticsearch core into a module. They would still ship with Elasticsearch, just not with the Elasticsearch jar. We like this for a few reasons:

1. It reduces the size of the high-level REST client and the transport client. They don't need to reference analyzers.
2. It proves that analysis plugins are first-class citizens by consuming the plugin API for setting up the analyzers.
3. It forces us to develop features a little more generically, not relying on specific analyzers, which is a good thing if you are going to have a first-class plugin API.

At this point I propose we move analysis components a few at a time. Claim the components you'd like to move before doing the move using the list below. We're doing this directly in master and 5.x. There is no need for a long-running branch for this. Keep in mind when claiming components that moving the code is not time consuming but fixing tests that rely on the components might be.

Misc
-------
* [x] Allow plugins to build "pre-built" analysis components. This blocks a number of the analyzers below.
  * [x] Token filters #24223 #24572 #24223
  * [x] Analyzers #31095
  * [x] Tokenizers #24751 #24863
  * [x] Char filters #25000
* [ ] Remove core's dependency on `lucene-analyzers-common.jar`

Analyzers
--------
* [x] Standard Analyzer (This one will stay in core. It isn't part of the `lucene-analyzers-common.jar`, and keeping it in core makes testing easier.)
* [x] Simple Analyzer
* [x] Whitespace Analyzer
* [x] Stop Analyzer
* [x] Keyword Analyzer
* [x] Pattern Analyzer #31095
* [x] Language Analyzers #31143 #31300
* [x] Fingerprint Analyzer #31095
* [x] Standard html strip Analyzer #31095

Tokenizers
--------
* [x] Standard Tokenizer (I believe this one will also stay in core for the same reasons Standard Analyzer is staying.)
* [x] Letter Tokenizer (#30538)
* [x] Lowercase Tokenizer (#30538)
* [x] Whitespace Tokenizer (#30538)
* [x] UAX URL Email Tokenizer (#30538)
* [x] Classic Tokenizer (#30538)
* [x] Thai Tokenizer (#30538)
* [x] N-Gram Tokenizer (#30538)
* [x] Edge N-Gram Tokenizer (#30538)
* [x] Keyword Tokenizer (#30642)
* [x] Pattern Tokenizer (#30538)
* [x] Path Tokenizer (#30538)

Token Filters
--------
* [x] Standard Token Filter (This will stay in core, because `StandardFilter` is part of lucene-core)
* [x] ASCII Folding Token Filter (#23614)
* [x] Flatten Graph Token Filter @martijnvg (#25214)
* [x] Length Token Filter @martijnvg (#25214)
* [x] Lowercase Token Filter @martijnvg (#25214)
* [x] Uppercase Token Filter @martijnvg (#25214)
* [x] NGram Token Filter @martijnvg (#25214)
* [x] Edge NGram Token Filter @martijnvg (#25214)
* [x] Porter Stem Token Filter @martijnvg (#24948)
* [ ] Shingle Token Filter (trickier: `PhraseSuggestionBuilder` uses `ShingleTokenFilterFactory`'s getters)
* [x] Stop Token Filter (can remain in core as it uses classes from lucene-core and lucene-suggest jars)
* [x] Word Delimiter Token Filter (#23614)
* [x] Stemmer Token Filter (#25384)
* [x] Stemmer Override Token Filter (#25384)
* [x] Keyword Marker Token Filter @martijnvg (#24948)
* [x] Keyword Repeat Token Filter (Hasn't been exposed yet as a token filter)
* [x] KStem Token Filter (#25384)
* [x] Snowball Token Filter @martijnvg (#24948)
* [x] Phonetic Token Filter (Is already in its own module `analysis-phonetic`)
* [x] Synonym Token Filter (trickier: `CustomAnalyzerProvider`, which is used by the analyze API, depends on it, and this token filter relies on `AnalysisRegistry`) #33868
* [x] Synonym Graph Token Filter (trickier: `CustomAnalyzerProvider`, which is used by the analyze API, depends on it, and this token filter relies on `AnalysisRegistry`) #33868
* [x] Compound Word Token Filter (#25384)
* [x] Reverse Token Filter (#25384)
* [x] Elision Token Filter (#25384)
* [x] Truncate Token Filter (#25384)
* [x] Unique Token Filter @martijnvg (#25214)
* [x] Pattern Capture Token Filter (#25580)
* [x] Pattern Replace Token Filter (#25580)
* [x] Trim Token Filter @martijnvg (#24948)
* [x] Limit Token Count Token Filter (#25580)
* [ ] Hunspell Token Filter (trickier: because of its infra `AnalysisPlugin#getHunspellDictionaries()`)
* [x] Common Grams Token Filter (#25580)
* [x] Normalization Token Filter (#25715)
* [x] CJK Width Token Filter (#25715)
* [x] CJK Bigram Token Filter (#25715)
* [x] Delimited Payload Token Filter (#25784)
* [x] Keep Words Token Filter (#25784)
* [x] Keep Types Token Filter (#25784)
* [x] Classic Token Filter (#25784)
* [x] Apostrophe Token Filter (#25784)
* [x] Decimal Digit Token Filter (#25784)
* [x] Fingerprint Token Filter (#25784)
* [x] Minhash Token Filter (#25784)
* [x] Scandinavian folding token filter (#25784)
* [x] Language stem token filters (arabic, brazilian, czech, dutch, french, german, russian) (#26042)

Character Filters
---------
* [x] HTML Strip Char Filter #24261
* [x] Mapping Char Filter #24261
* [x] Pattern Replace Char Filter #24261
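To make the motivation above concrete, here is a minimal, self-contained sketch of the general pattern: core keeps only a registry, and a module contributes the concrete analysis components through a plugin interface. Every type name below (`SketchAnalysisPlugin`, `SketchCommonAnalysisPlugin`, `SketchAnalysisRegistry`) is hypothetical and invented for illustration. None of them is the real Elasticsearch `AnalysisPlugin` API, and a token stream is modelled as a plain list of strings rather than a Lucene `TokenStream`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class AnalysisModuleSketch {
    public static void main(String[] args) {
        // "Core" is wired up with whatever plugins are installed.
        SketchAnalysisRegistry registry =
            new SketchAnalysisRegistry(Arrays.asList(new SketchCommonAnalysisPlugin()));
        // Prints [quick, brown, fox]
        System.out.println(registry.apply("lowercase", Arrays.asList("Quick", "BROWN", "Fox")));
    }
}

// Hypothetical extension point (assumption: NOT the real Elasticsearch interface).
interface SketchAnalysisPlugin {
    // Maps a token filter name to a function over a simplified token stream.
    Map<String, Function<List<String>, List<String>>> getTokenFilters();
}

// Plays the role of the module that ships next to core and owns the concrete filters.
class SketchCommonAnalysisPlugin implements SketchAnalysisPlugin {
    @Override
    public Map<String, Function<List<String>, List<String>>> getTokenFilters() {
        Map<String, Function<List<String>, List<String>>> filters = new LinkedHashMap<>();
        filters.put("lowercase", tokens -> {
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
            return out;
        });
        filters.put("reverse", tokens -> {
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                out.add(new StringBuilder(token).reverse().toString());
            }
            return out;
        });
        return filters;
    }
}

// Plays the role of core: it references no concrete filter, only what plugins register.
class SketchAnalysisRegistry {
    private final Map<String, Function<List<String>, List<String>>> tokenFilters = new LinkedHashMap<>();

    SketchAnalysisRegistry(List<? extends SketchAnalysisPlugin> plugins) {
        for (SketchAnalysisPlugin plugin : plugins) {
            for (Map.Entry<String, Function<List<String>, List<String>>> entry
                    : plugin.getTokenFilters().entrySet()) {
                // Reject duplicate names so two plugins cannot silently shadow each other.
                if (tokenFilters.putIfAbsent(entry.getKey(), entry.getValue()) != null) {
                    throw new IllegalArgumentException(
                        "token filter [" + entry.getKey() + "] registered twice");
                }
            }
        }
    }

    List<String> apply(String filterName, List<String> tokens) {
        Function<List<String>, List<String>> filter = tokenFilters.get(filterName);
        if (filter == null) {
            throw new IllegalArgumentException("unknown token filter [" + filterName + "]");
        }
        return filter.apply(tokens);
    }
}
```

In this sketch the registry compiles without any reference to the filter implementations. That mirrors the reason given above for wanting clients to stop depending on analyzers, and it also shows why tests in core that assume a specific filter exists would need fixing after a move.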

3 comments · 2 reactions · 0 assignees · Java · 25,882 forks

Labels: `:Search Relevance/Analysis`, `>refactoring`, `Meta`, `Team:Search Relevance`, `help wanted`

Repository metrics

Stars: 76,700
PR merge metrics: average merge time 2d, 1,000 merged PRs in the last 30d

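Contributor guide

Investigation approach: Focus on the remaining unchecked items: Shingle Token Filter, Hunspell Token Filter, and removing core's dependency on `lucene-analyzers-common.jar`. Look at how the other analyzers were moved to understand the pattern for building the analysis module.
Tech stack: Java
Area: backend
Issue type: refactoring
Difficulty: 4
Estimated time: more than 1 week
Activity: active
Clarity: clear
Prerequisites: Java, Elasticsearch plugin API
Beginner friendliness: 30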
