Description
We'd like to move the analyzers from Elasticsearch core into a module. They would still ship with Elasticsearch, just not with the Elasticsearch jar. We like this for a few reasons:
- It reduces the size of the high-level REST client and the transport client. They don't need to reference analyzers.
- It proves that analysis plugins are first-class citizens, because the analyzers themselves are set up through the plugin API.
- It forces us to develop features a little more generically, without relying on specific analyzers, which is a good thing if you are going to have a first-class plugin API.
At this point I propose we move analysis components a few at a time. Before moving any components, claim them in the list below. We're doing this directly in master and 5.x. There is no need for a long-running branch for this.
When claiming components, keep in mind that moving the code is not time-consuming, but fixing the tests that rely on those components might be.
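To illustrate what "consuming the plugin API" means for a module, here is a minimal, self-contained sketch. The types in it (`TokenFilterFactory`, `AnalysisPlugin`, `AnalysisRegistry`, `CommonAnalysisPlugin`) are simplified stand-ins written for this example, not the real Elasticsearch classes, and the real signatures differ. The idea it tries to show is that core only knows a registry keyed by name, and the components themselves arrive from a plugin:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class AnalysisModuleSketch {

    // Stand-in for a token filter: transforms a list of tokens. (Hypothetical, simplified.)
    interface TokenFilterFactory {
        List<String> apply(List<String> tokens);
    }

    // Stand-in for the plugin extension point: a plugin contributes named providers.
    interface AnalysisPlugin {
        Map<String, Supplier<TokenFilterFactory>> getTokenFilters();
    }

    // Stand-in for the module that would own the filters moved out of core.
    static class CommonAnalysisPlugin implements AnalysisPlugin {
        @Override
        public Map<String, Supplier<TokenFilterFactory>> getTokenFilters() {
            Map<String, Supplier<TokenFilterFactory>> filters = new HashMap<>();
            filters.put("lowercase", () -> tokens -> tokens.stream()
                    .map(t -> t.toLowerCase(Locale.ROOT))
                    .collect(Collectors.toList()));
            filters.put("reverse", () -> tokens -> tokens.stream()
                    .map(t -> new StringBuilder(t).reverse().toString())
                    .collect(Collectors.toList()));
            return filters;
        }
    }

    // Stand-in for core: it has no compile-time reference to any specific filter.
    static class AnalysisRegistry {
        private final Map<String, Supplier<TokenFilterFactory>> tokenFilters = new HashMap<>();

        AnalysisRegistry(List<AnalysisPlugin> plugins) {
            for (AnalysisPlugin plugin : plugins) {
                plugin.getTokenFilters().forEach((name, provider) -> {
                    // Reject two plugins registering the same name.
                    if (tokenFilters.putIfAbsent(name, provider) != null) {
                        throw new IllegalArgumentException("token filter [" + name + "] already registered");
                    }
                });
            }
        }

        TokenFilterFactory tokenFilter(String name) {
            Supplier<TokenFilterFactory> provider = tokenFilters.get(name);
            if (provider == null) {
                throw new IllegalArgumentException("unknown token filter [" + name + "]");
            }
            return provider.get();
        }
    }

    public static void main(String[] args) {
        AnalysisRegistry registry = new AnalysisRegistry(List.of(new CommonAnalysisPlugin()));
        // prints "[quick, fox]"
        System.out.println(registry.tokenFilter("lowercase").apply(List.of("Quick", "FOX")));
    }
}
```

In this sketch, a test in core that asks the registry for `"lowercase"` without loading the plugin fails with an "unknown token filter" error, which is the kind of test breakage mentioned above.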
Misc
- Allow plugins to build "pre-built" analysis components. This blocks a number of the analyzers below.
- Token filters #24223 #24572 #24223
- Analyzers #31095
- Tokenizers #24751 #24863
- Char filters #25000
- Remove core's dependency on `lucene-analyzers-common.jar`
Analyzers
- Standard Analyzer (This one will stay in core. It isn't part of `lucene-analyzers-common.jar`, and keeping it in core will keep testing easier.)
- Simple Analyzer
- Whitespace Analyzer
- Stop Analyzer
- Keyword Analyzer
- Pattern Analyzer #31095
- Language Analyzers #31143 #31300
- Fingerprint Analyzer #31095
- Standard html strip Analyzer #31095
Tokenizers
- Standard Tokenizer (I believe this one will also stay in core for the same reasons Standard Analyzer is staying.)
- Letter Tokenizer (#30538)
- Lowercase Tokenizer (#30538)
- Whitespace Tokenizer (#30538)
- UAX URL Email Tokenizer (#30538)
- Classic Tokenizer (#30538)
- Thai Tokenizer (#30538)
- N-Gram Tokenizer (#30538)
- Edge N-Gram Tokenizer (#30538)
- Keyword Tokenizer (#30642)
- Pattern Tokenizer (#30538)
- Path Tokenizer (#30538)
Token Filters
- Standard Token Filter (This will stay in core, because `StandardFilter` is part of lucene-core)
- ASCII Folding Token Filter (#23614)
- Flatten Graph Token Filter @martijnvg (#25214)
- Length Token Filter @martijnvg (#25214)
- Lowercase Token Filter @martijnvg (#25214)
- Uppercase Token Filter @martijnvg (#25214)
- NGram Token Filter @martijnvg (#25214)
- Edge NGram Token Filter @martijnvg (#25214)
- Porter Stem Token Filter @martijnvg (#24948)
- Shingle Token Filter (trickier: `PhraseSuggestionBuilder` uses `ShingleTokenFilterFactory`'s getters)
- Stop Token Filter (can remain in core as it uses classes from lucene-core and lucene-suggest jars)
- Word Delimiter Token Filter (#23614)
- Stemmer Token Filter (#25384)
- Stemmer Override Token Filter (#25384)
- Keyword Marker Token Filter @martijnvg (#24948)
- Keyword Repeat Token Filter (Hasn't been exposed yet as a token filter)
- KStem Token Filter (#25384)
- Snowball Token Filter @martijnvg (#24948)
- Phonetic Token Filter (Is already in its own module `analysis-phonetic`)
- Synonym Token Filter (trickier: `CustomAnalyzerProvider`, which is used by the analyze API, depends on it, and this token filter relies on `AnalysisRegistry`) #33868
- Synonym Graph Token Filter (trickier: `CustomAnalyzerProvider`, which is used by the analyze API, depends on it, and this token filter relies on `AnalysisRegistry`) #33868
- Compound Word Token Filter (#25384)
- Reverse Token Filter (#25384)
- Elision Token Filter (#25384)
- Truncate Token Filter (#25384)
- Unique Token Filter @martijnvg (#25214)
- Pattern Capture Token Filter (#25580)
- Pattern Replace Token Filter (#25580)
- Trim Token Filter @martijnvg (#24948)
- Limit Token Count Token Filter (#25580)
- Hunspell Token Filter (trickier: because of its infra `AnalysisPlugin#getHunspellDictionaries()`)
- Common Grams Token Filter (#25580)
- Normalization Token Filter (#25715)
- CJK Width Token Filter (#25715)
- CJK Bigram Token Filter (#25715)
- Delimited Payload Token Filter (#25784)
- Keep Words Token Filter (#25784)
- Keep Types Token Filter (#25784)
- Classic Token Filter (#25784)
- Apostrophe Token Filter (#25784)
- Decimal Digit Token Filter (#25784)
- Fingerprint Token Filter (#25784)
- Minhash Token Filter (#25784)
- Scandinavian Folding Token Filter (#25784)
- Language stem token filters (arabic, brazilian, czech, dutch, french, german, russian) (#26042)
Character Filters
- HTML Strip Char Filter #24261
- Mapping Char Filter #24261
- Pattern Replace Char Filter #24261