Netflix/eureka

eureka-client: Connections to all servers fail if one is not reachable

Open

#1250 opened on October 9, 2019

help wanted

Description

  • eureka-client version: 1.9.12
  • JDK version: 11.0.4
  • spring-cloud-netflix-eureka-client version: 2.1.2.RELEASE

On the Eureka server side, we have 4 servers in 2 zones:

zone 1 zone 2
server 1 discovery-1
server 2 discovery-2
server 3 discovery-3
server 4 discovery-4

After server 1 crashed, it was unreachable (NoRouteToHostException). The other servers responded without problems when we contacted them manually over HTTP.

However, cache-refresh requests from the eureka-clients in our services produce errors for all server instances:

  • The first request made by the RetryableEurekaHttpClient tries to contact discovery-1 and fails with the NoRouteToHostException.
  • Afterwards, the RetryableEurekaHttpClient tries discovery-2. Unfortunately, this fails with com.fasterxml.jackson.core.JsonParseException: processing aborted at [Source: (GZIPInputStream); line: 1, column: 18]
  • As a last step, it tries to contact discovery-3, which results in the same exception as the request to discovery-2. Afterwards, because the number of retries has reached the maximum numberOfRetries (3), it throws TransportException("Retry limit reached; giving up on completing the request").

The requests to discovery-2 and discovery-3 fail with a JsonParseException because the thread was interrupted: https://github.com/Netflix/eureka/blob/743af8be0fa37118a3a9ee0d39f3ba8a89621119/eureka-client/src/main/java/com/netflix/discovery/converters/EurekaJacksonCodec.java#L500. I think the failed request to discovery-1 leads to the thread interruption. While remote debugging, I could see that the non-completed future was cancelled in https://github.com/Netflix/eureka/blob/743af8be0fa37118a3a9ee0d39f3ba8a89621119/eureka-client/src/main/java/com/netflix/discovery/TimedSupervisorTask.java#L96, which interrupts the thread in line 173 of FutureTask. The interrupted thread is the same thread that then tries to contact discovery-2 and discovery-3.
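To make the suspected mechanism easier to discuss, here is a minimal, self-contained sketch in plain JDK code. It is not Eureka code: the class name InterruptedRetryDemo, the run() method, the 100 ms timeout, and the outcome strings are all made up for illustration. It only assumes what is described above: a supervisor waits on a future with a timeout and then calls cancel(true), while the worker thread is stuck in a call that does not react to interruption and afterwards keeps retrying on the same thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class, not part of eureka-client.
public class InterruptedRetryDemo {

    /** Simulates three attempts on one worker thread and returns what each attempt observed. */
    public static List<String> run() {
        List<String> outcomes = new CopyOnWriteArrayList<>();
        AtomicBoolean cancelIssued = new AtomicBoolean(false);
        CountDownLatch workerDone = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();

        Future<?> future = executor.submit(() -> {
            try {
                // Attempt 1: stand-in for a blocking connect that ignores interruption.
                // It spins until the supervisor has given up and cancelled the future.
                while (!cancelIssued.get()) {
                    Thread.yield();
                }
                outcomes.add("discovery-1: connect failed");

                // Attempts 2 and 3 run on the same thread. The interrupt flag set by
                // cancel(true) is still pending, so interrupt-aware code aborts.
                for (String server : new String[] {"discovery-2", "discovery-3"}) {
                    if (Thread.currentThread().isInterrupted()) {
                        outcomes.add(server + ": processing aborted");
                    } else {
                        outcomes.add(server + ": ok");
                    }
                }
            } finally {
                workerDone.countDown();
            }
        });

        try {
            // Supervisor side: wait with a timeout (the value is an assumption).
            future.get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Cancelling with mayInterruptIfRunning=true interrupts the worker thread.
            future.cancel(true);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            cancelIssued.set(true);
        }

        try {
            workerDone.await();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
        return new ArrayList<>(outcomes);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this sketch the first attempt fails for its own reason, and the two following attempts abort only because the interrupt flag from cancel(true) is still set on the shared thread. That matches the order of exceptions reported above, but whether eureka-client behaves the same way for the same reason still needs to be confirmed by the maintainers.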

Can you please advise: are we doing something wrong, or could this be a bug?

The full stack traces follow in order of occurrence; everything happened in thread "DiscoveryClient-CacheRefreshExecutor-0":

https://pastebin.com/RkuvPgJH https://pastebin.com/Qyf1pHjF (2 times) https://pastebin.com/Jp5u3e5p
