During shard relocation some requests might fail to be sent to a shard · elastic/elasticsearch#13719

(2 comments) (0 reactions) (0 assignees) Java (25,882 forks) batch import

Labels: :Distributed/Distributed, >bug, Team:Distributed, help wanted, resiliency

Repository metrics

Stars: (76,700 stars)
PR merge metrics: (Avg merge 2d) (1,000 merged PRs in 30d)

Description

When shards relocate, there can be a small window of time in which requests fail to reach the relocating shard. This happens when a node that lags one cluster state behind has not yet realized that a shard has relocated, while the relocation source has already been removed. Below is a graphical representation of an example course of events. This has already caused us some trouble in tests because results were unexpected, see https://github.com/elastic/elasticsearch/issues/13266. It affects all actions that inherit from TransportBroadcastAction or TransportBroadcastByNodeAction and might also be problematic for others. For example, an optimize request might never reach a shard that is relocating, and indices stats may report wrong statistics, see https://github.com/elastic/elasticsearch/issues/13266#issuecomment-138470051.

We should check whether, for the affected actions, we can get away with also sending requests to relocation targets, or whether we need to implement these kinds of requests as replication actions, as we did for refresh and flush.
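The first option, also sending requests to relocation targets, could look roughly like the sketch below. It is only an illustration under simplifying assumptions: `RelocationAwareTargets`, `ShardCopy` and `targetNodes` are hypothetical names invented here and are not Elasticsearch classes or methods.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model; these are NOT Elasticsearch classes.
final class RelocationAwareTargets {

    /** One shard copy as seen in the sender's cluster state. */
    static final class ShardCopy {
        final String currentNode;
        final String relocationTarget; // null when the copy is not relocating

        ShardCopy(String currentNode, String relocationTarget) {
            this.currentNode = currentNode;
            this.relocationTarget = relocationTarget;
        }
    }

    /** Nodes to contact: every current holder plus every relocation target. */
    static Set<String> targetNodes(List<ShardCopy> copies) {
        Set<String> nodes = new LinkedHashSet<>();
        for (ShardCopy copy : copies) {
            nodes.add(copy.currentNode);
            if (copy.relocationTarget != null) {
                // Also contact the node the shard is moving to, so the request
                // still lands somewhere if the source copy is already gone.
                nodes.add(copy.relocationTarget);
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        List<ShardCopy> copies = new ArrayList<>();
        copies.add(new ShardCopy("n2", "n3")); // shard relocating from n2 to n3
        System.out.println(targetNodes(copies));
    }
}
```

With this approach both the source and the target may answer for the same shard, so a caller would presumably have to deduplicate per-shard results, for example when aggregating statistics.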

[Image: recovery-issues]

[Image: finally]

I: The shard is relocating from n2 to n3.
II: CS2 signals that n3 has started its shard and n2 can remove its own copy. The shard on n2 is therefore closed, but n1 lags one cluster state behind and still expects an up-and-running primary on n2.
III: If n1 now sends an optimize request, an indices stats request or the like, it sends the request to n2 (based on CS1), but n2 no longer has the shard.
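The course of events in I to III can be replayed with a toy model. This is a sketch under stated assumptions, not Elasticsearch code: `StaleRoutingDemo`, `Node`, `send` and `replay` are hypothetical names, and a cluster state is reduced to a map from shard id to node name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the race; all names are hypothetical, not Elasticsearch APIs.
final class StaleRoutingDemo {

    /** A data node that knows which shards it currently holds. */
    static final class Node {
        final Set<Integer> localShards = new HashSet<>();

        /** Returns true if the shard-level request could be executed here. */
        boolean handle(int shardId) {
            return localShards.contains(shardId);
        }
    }

    /** Routes a request using the sender's (possibly stale) routing table. */
    static boolean send(Map<Integer, String> senderRouting, Map<String, Node> nodes, int shardId) {
        Node target = nodes.get(senderRouting.get(shardId));
        return target != null && target.handle(shardId);
    }

    /** Replays steps I to III; returns whether n1's request reached the shard. */
    static boolean replay(boolean senderIsStale) {
        final Integer shard = 0;
        Map<String, Node> nodes = new HashMap<>();
        nodes.put("n1", new Node());
        nodes.put("n2", new Node());
        nodes.put("n3", new Node());

        // I: in CS1 the shard lives on n2 and is relocating to n3.
        Map<Integer, String> cs1 = new HashMap<>();
        cs1.put(shard, "n2");
        nodes.get("n2").localShards.add(shard);

        // II: CS2 says n3 has started the shard; n2 applies CS2 and closes its copy.
        Map<Integer, String> cs2 = new HashMap<>();
        cs2.put(shard, "n3");
        nodes.get("n3").localShards.add(shard);
        nodes.get("n2").localShards.remove(shard);

        // III: n1 sends the request based on whichever cluster state it has applied.
        return send(senderIsStale ? cs1 : cs2, nodes, shard);
    }

    public static void main(String[] args) {
        System.out.println("n1 on CS1 (stale): reached shard = " + replay(true));
        System.out.println("n1 on CS2 (current): reached shard = " + replay(false));
    }
}
```

In this model the request only fails when the sender still routes by CS1, which matches the one-cluster-state lag described above.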

Contributor guide

Research direction: Inspect the shard relocation logic in TransportBroadcastAction and TransportBroadcastByNodeAction to understand why requests fail during relocation. Look at how cluster state updates are propagated and consider adding logic to also forward requests to relocation targets.
Tech stack: java
Domain: backend
Issue type: Bug
Difficulty: 3
Estimated time: 1-2 days
Activity status: Stale
Clarity: Clear
Prerequisites: Java, Elasticsearch internals
Newbie friendliness: 30
