eureka-client: Connections to all servers fail if one is not reachable
#1,250 created on October 9, 2019
Description
eureka-client version: 1.9.12, JDK version: 11.0.4, spring-cloud-netflix-eureka-client version: 2.1.2.RELEASE
On the Eureka server side we have 4 servers in 2 zones:
| zone 1 | zone 2 | |
|---|---|---|
| server 1 | discovery-1 | |
| server 2 | discovery-2 | |
| server 3 | discovery-3 | |
| server 4 | discovery-4 | |
After server 1 crashed, it was unreachable (NoRouteToHostException). The other servers had no problems when we contacted them manually via HTTP.
However, cache-refresh requests from the eureka-clients in our services produce errors for all server instances:
- The first request made by the RetryableEurekaHttpClient tries to contact discovery-1 and fails with the NoRouteToHostException.
- Afterwards the RetryableEurekaHttpClient tries discovery-2. Unfortunately it fails with
com.fasterxml.jackson.core.JsonParseException: processing aborted at [Source: (GZIPInputStream); line: 1, column: 18]
- As a last step it tries to contact discovery-3, which results in the same exception as the request to discovery-2. Afterwards, because the retries reach the maximum numberOfRetries (3), it throws
TransportException("Retry limit reached; giving up on completing the request").
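
The sequence above can be sketched as a simplified retry loop. This is only an illustration of the observed behaviour, not the actual RetryableEurekaHttpClient code: the class RetryLimitSketch, the Request interface, and the local TransportException are hypothetical stand-ins, and the real client's host selection is likely more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RetryLimitSketch {

    /** Hypothetical stand-in for the client's TransportException. */
    static class TransportException extends RuntimeException {
        TransportException(String message) {
            super(message);
        }
    }

    /** Hypothetical request abstraction: returns a response body or throws. */
    interface Request {
        String send(String host) throws Exception;
    }

    /**
     * Tries one host per attempt, in list order, and gives up after
     * numberOfRetries failed attempts. Every host that was contacted
     * is recorded in 'tried'.
     */
    static String execute(List<String> hosts, int numberOfRetries,
                          Request request, List<String> tried) {
        for (int attempt = 0; attempt < numberOfRetries && attempt < hosts.size(); attempt++) {
            String host = hosts.get(attempt);
            tried.add(host);
            try {
                return request.send(host);
            } catch (Exception e) {
                // Any failure moves on to the next candidate host.
            }
        }
        throw new TransportException("Retry limit reached; giving up on completing the request");
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("discovery-1", "discovery-2", "discovery-3", "discovery-4");
        List<String> tried = new ArrayList<>();
        try {
            // Every attempt fails here, as it did in the reported incident.
            execute(hosts, 3, host -> {
                throw new java.io.IOException("simulated failure for " + host);
            }, tried);
        } catch (TransportException e) {
            System.out.println(e.getMessage() + " after trying " + tried);
        }
    }
}
```

With a limit of 3 and four configured servers, a loop of this shape never reaches discovery-4 once the first three attempts have failed.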
The requests to discovery-2 and discovery-3 result in a JsonParseException because the thread was interrupted: https://github.com/Netflix/eureka/blob/743af8be0fa37118a3a9ee0d39f3ba8a89621119/eureka-client/src/main/java/com/netflix/discovery/converters/EurekaJacksonCodec.java#L500. I think that the failed request to discovery-1 leads to the thread interruption. Through remote debugging I was able to see that the non-completed future was cancelled in https://github.com/Netflix/eureka/blob/743af8be0fa37118a3a9ee0d39f3ba8a89621119/eureka-client/src/main/java/com/netflix/discovery/TimedSupervisorTask.java#L96, which interrupts the thread in line 173 of FutureTask. The interrupted thread is the same thread that then tries to contact discovery-2 and discovery-3.
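
The suspected mechanism can be reproduced in isolation with plain java.util.concurrent. The following is a minimal sketch, not Eureka code: the class InterruptedRetryDemo is hypothetical, the spin loop stands in for a blocking connect that does not react to interrupts, and the isInterrupted() check is an assumed approximation of what the codec does before parsing. It shows that once Future.cancel(true) has set the interrupt flag, every later attempt on the same thread sees it.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class InterruptedRetryDemo {

    /** Runs a three-attempt task, cancels it on timeout, and returns what each attempt saw. */
    static List<String> run() throws Exception {
        List<String> log = new CopyOnWriteArrayList<>();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch cancelled = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();

        Future<?> future = executor.submit(() -> {
            started.countDown();
            // Attempt 1: stand-in for a blocking connect that ignores interrupts
            // and only returns after the supervisor has already given up.
            while (cancelled.getCount() > 0) {
                // busy-wait; the interrupt flag is neither checked nor cleared here
            }
            log.add("discovery-1: unreachable");

            // Attempts 2 and 3 run on the same thread, so the flag is still set.
            String[] remaining = {"discovery-2", "discovery-3"};
            for (String host : remaining) {
                if (Thread.currentThread().isInterrupted()) {
                    log.add(host + ": processing aborted");
                } else {
                    log.add(host + ": ok");
                }
            }
        });

        started.await();
        try {
            // Supervisor side: wait for the task with a timeout.
            future.get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Cancelling with 'true' interrupts the worker thread.
            future.cancel(true);
        }
        cancelled.countDown();

        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return log;
    }

    public static void main(String[] args) throws Exception {
        run().forEach(System.out::println);
    }
}
```

In this sketch the first attempt fails on its own, and the two following attempts are aborted purely because of the leftover interrupt flag, which matches the order of exceptions described above.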
Can you please advise: are we doing something wrong, or could this be a bug?
Full stack traces follow in order of occurrence; everything happens in thread "DiscoveryClient-CacheRefreshExecutor-0":
https://pastebin.com/RkuvPgJH https://pastebin.com/Qyf1pHjF (2 times) https://pastebin.com/Jp5u3e5p