elastic/elasticsearch

Move analysis components to a module

Open

#23658 opened on March 20, 2017

Labels: :Search Relevance/Analysis, >refactoring, Meta, Team:Search Relevance, help wanted

Description

We'd like to move the analyzers from Elasticsearch core into a module. They would still ship with Elasticsearch, just not with the Elasticsearch jar. We like this for a few reasons:

  1. It reduces the size of the high-level REST client and the transport client. They don't need to reference analyzers.
  2. It proves that analysis plugins are first-class citizens, because the analyzers themselves are set up through the plugin API.
  3. It forces us to develop features a little more generically, without relying on specific analyzers, which is a good thing if you are going to have a first-class plugin API.
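
To make points 2 and 3 concrete, here is a minimal sketch of the shape this takes. Every type below (`TokenFilter`, `AnalysisPluginSketch`, `CommonAnalysisSketch`, `AnalysisRegistrySketch`) is a hypothetical, simplified stand-in invented for illustration. None of it is the real Elasticsearch plugin API, and real token filters operate on Lucene token streams rather than lists of strings. The only point is that "core" sees the plugin interface and never a concrete analysis component.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a token filter: transforms a list of tokens.
interface TokenFilter {
    List<String> apply(List<String> tokens);
}

// Hypothetical stand-in for the plugin extension point.
interface AnalysisPluginSketch {
    Map<String, TokenFilter> getTokenFilters();
}

// Would live in the module: the concrete components are defined here only.
class CommonAnalysisSketch implements AnalysisPluginSketch {
    @Override
    public Map<String, TokenFilter> getTokenFilters() {
        Map<String, TokenFilter> filters = new LinkedHashMap<>();
        filters.put("lowercase", tokens -> {
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
            return out;
        });
        filters.put("reverse", tokens -> {
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                out.add(new StringBuilder(token).reverse().toString());
            }
            return out;
        });
        return filters;
    }
}

// Would live in core: depends on the plugin interface, not on any filter class.
class AnalysisRegistrySketch {
    private final Map<String, TokenFilter> filters = new LinkedHashMap<>();

    AnalysisRegistrySketch(List<AnalysisPluginSketch> plugins) {
        for (AnalysisPluginSketch plugin : plugins) {
            for (Map.Entry<String, TokenFilter> entry : plugin.getTokenFilters().entrySet()) {
                // Reject two plugins registering the same component name.
                if (filters.putIfAbsent(entry.getKey(), entry.getValue()) != null) {
                    throw new IllegalArgumentException("duplicate token filter [" + entry.getKey() + "]");
                }
            }
        }
    }

    // Splits on whitespace, then applies the named filters in order.
    List<String> analyze(String text, String... filterNames) {
        List<String> tokens = new ArrayList<>(Arrays.asList(text.trim().split("\\s+")));
        for (String name : filterNames) {
            TokenFilter filter = filters.get(name);
            if (filter == null) {
                throw new IllegalArgumentException("unknown token filter [" + name + "]");
            }
            tokens = filter.apply(tokens);
        }
        return tokens;
    }
}

class AnalysisSketchDemo {
    public static void main(String[] args) {
        AnalysisRegistrySketch registry =
            new AnalysisRegistrySketch(Arrays.asList(new CommonAnalysisSketch()));
        System.out.println(registry.analyze("Quick Brown FOX", "lowercase", "reverse"));
    }
}
```

In this sketch, a test in core that wants `lowercase` has to load a plugin that provides it. That is the kind of test fallout the note below about fixing tests refers to.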

At this point I propose we move analysis components a few at a time. Before you move any components, claim them in the list below. We're doing this directly in master and 5.x. There is no need for a long-running branch for this.

Keep in mind when claiming components that moving the code is not time-consuming, but fixing tests that rely on the components might be.

Misc

  • Allow plugins to build "pre-built" analysis components. This blocks a number of the analyzers below.
    • Token filters #24223 #24572
    • Analyzers #31095
    • Tokenizers #24751 #24863
    • Char filters #25000
  • Remove core's dependency on lucene-analyzers-common.jar

Analyzers

  • Standard Analyzer (This one will stay in core. It isn't part of lucene-analyzers-common.jar, and keeping it in core makes testing easier.)
  • Simple Analyzer
  • Whitespace Analyzer
  • Stop Analyzer
  • Keyword Analyzer
  • Pattern Analyzer #31095
  • Language Analyzers #31143 #31300
  • Fingerprint Analyzer #31095
  • Standard html strip Analyzer #31095

Tokenizers

  • Standard Tokenizer (I believe this one will also stay in core for the same reasons Standard Analyzer is staying.)
  • Letter Tokenizer (#30538)
  • Lowercase Tokenizer (#30538)
  • Whitespace Tokenizer (#30538)
  • UAX URL Email Tokenizer (#30538)
  • Classic Tokenizer (#30538)
  • Thai Tokenizer (#30538)
  • N-Gram Tokenizer (#30538)
  • Edge N-Gram Tokenizer (#30538)
  • Keyword Tokenizer (#30642)
  • Pattern Tokenizer (#30538)
  • Path Tokenizer (#30538)

Token Filters

  • Standard Token Filter (This will stay in core, because StandardFilter is part of lucene-core)
  • ASCII Folding Token Filter (#23614)
  • Flatten Graph Token Filter @martijnvg (#25214)
  • Length Token Filter @martijnvg (#25214)
  • Lowercase Token Filter @martijnvg (#25214)
  • Uppercase Token Filter @martijnvg (#25214)
  • NGram Token Filter @martijnvg (#25214)
  • Edge NGram Token Filter @martijnvg (#25214)
  • Porter Stem Token Filter @martijnvg (#24948)
  • Shingle Token Filter (trickier: PhraseSuggestionBuilder uses ShingleTokenFilterFactory's getters)
  • Stop Token Filter (can remain in core as it uses classes from lucene-core and lucene-suggest jars)
  • Word Delimiter Token Filter (#23614)
  • Stemmer Token Filter (#25384)
  • Stemmer Override Token Filter (#25384)
  • Keyword Marker Token Filter @martijnvg (#24948)
  • Keyword Repeat Token Filter (Hasn't been exposed yet as a token filter)
  • KStem Token Filter (#25384)
  • Snowball Token Filter @martijnvg (#24948)
  • Phonetic Token Filter (Is already in its own module, analysis-phonetic)
  • Synonym Token Filter (trickier: CustomAnalyzerProvider, which is used by the analyze API, depends on it, and this token filter relies on AnalysisRegistry) #33868
  • Synonym Graph Token Filter (trickier: CustomAnalyzerProvider, which is used by the analyze API, depends on it, and this token filter relies on AnalysisRegistry) #33868
  • Compound Word Token Filter (#25384)
  • Reverse Token Filter (#25384)
  • Elision Token Filter (#25384)
  • Truncate Token Filter (#25384)
  • Unique Token Filter @martijnvg (#25214)
  • Pattern Capture Token Filter (#25580)
  • Pattern Replace Token Filter (#25580)
  • Trim Token Filter @martijnvg (#24948)
  • Limit Token Count Token Filter (#25580)
  • Hunspell Token Filter (trickier: because of its infra, AnalysisPlugin#getHunspellDictionaries())
  • Common Grams Token Filter (#25580)
  • Normalization Token Filter (#25715)
  • CJK Width Token Filter (#25715)
  • CJK Bigram Token Filter (#25715)
  • Delimited Payload Token Filter (#25784)
  • Keep Words Token Filter (#25784)
  • Keep Types Token Filter (#25784)
  • Classic Token Filter (#25784)
  • Apostrophe Token Filter (#25784)
  • Decimal Digit Token Filter (#25784)
  • Fingerprint Token Filter (#25784)
  • Minhash Token Filter (#25784)
  • Scandinavian Folding Token Filter (#25784)
  • Language stem token filters (arabic, brazilian, czech, dutch, french, german, russian) (#26042)

Character Filters

  • HTML Strip Char Filter #24261
  • Mapping Char Filter #24261
  • Pattern Replace Char Filter #24261
