Instance status keep STARTING in eureka server even it is UP on Local
#1,174 创建于 2019年1月16日
描述
At first, it is very difficult to reproduce. Even, this case is very rare but fatal. Currently, we met 3 times of 60+ services and 150+ instances in past two months.
eureka-client: 1.7.2 spring-boot: 1.5.12.RELEASE spring-cloud: Edgware.SR3
What we get
Instance info on remote

Status info on local
heap dump

Instance info in Heartbeat request
tcp dump

| Scope | Status | lastDirtyTimestamp | lastUpdatedTimestamp | registrationTimestamp |
|---|---|---|---|---|
| Request | UP | 1545039481813 | ||
| Local | UP | 1545039481813 | 1545039472888 | |
| Remote | STARTING | 1545039481813 | 1545039481899 | 1545039481898 |
Source Code Analysis
InstanceInfo initialization
The initialization is via InstanceInfoFactory#create() and status is STARTING
Instance registration
The real logic of registration is in DiscoveryClient#register(). But this method has two invokers.
I. Main flow registration
EurekaAutoServiceRegistration implements SmartLifecycle interface, its instance is initialized in EurekaClientAutoConfiguration#eurekaAutoServiceRegistration().
EurekaAutoServiceRegistration#start() method is invoker after all spring bean initialized and arranges EurekaServiceRegistry to register EurekaRegistration:
public void start() {
...
if (!this.running.get() && this.registration.getNonSecurePort() > 0) {
this.serviceRegistry.register(this.registration);
this.context.publishEvent(
new InstanceRegisteredEvent<>(this, this.registration.getInstanceConfig()));
this.running.set(true);
}
}
EurekaRegistration#register(): it set the instance status to initial stats UP which can be configured by eureka.instance.initial-status. During status updating, stats changed and timestamp marked down in lastDirtyTimestamp. Once status changed, StatusChangeListener#notify() is triggerred
public void register(EurekaRegistration reg) {
maybeInitializeClient(reg);
if (log.isInfoEnabled()) {
log.info("Registering application " + reg.getInstanceConfig().getAppname()
+ " with eureka with status "
+ reg.getInstanceConfig().getInitialStatus());
}
//1
reg.getApplicationInfoManager()
.setInstanceStatus(reg.getInstanceConfig().getInitialStatus());
if (reg.getHealthCheckHandler() != null) {
//2
reg.getEurekaClient().registerHealthCheck(reg.getHealthCheckHandler());
}
}
DiscoveryClient has a inner anonymous class implementing StatusChangeListener which will invoke InstanceInfoReplicator#onDemandUpdate()
statusChangeListener = new ApplicationInfoManager.StatusChangeListener() {
@Override
public String getId() {
return "statusChangeListener";
}
@Override
public void notify(StatusChangeEvent statusChangeEvent) {
if (InstanceStatus.DOWN == statusChangeEvent.getStatus() ||
InstanceStatus.DOWN == statusChangeEvent.getPreviousStatus()) {
// log at warn level if DOWN was involved
logger.warn("Saw local status change event {}", statusChangeEvent);
} else {
logger.info("Saw local status change event {}", statusChangeEvent);
}
instanceInfoReplicator.onDemandUpdate();
}
};
InstanceInfoReplicator is initialized during DiscoveryClient#initScheduledTasks() and it an implementation of Runnable. After initialized, it is started by ScheduledThreadPoolExecutor. Likewise, #onDemandUpdate() will start new thread with ScheduledThreadPoolExecutor.
In #run() method, it invokes DiscoveryClient#refreshInstanceInfo() to update instance status
vi HealthCheckHandler. Then, DiscoveryClient#register() invoked.
II. Registeration in Heartbeat thread
DiscoveryClient#initScheduledTasks() mentioned above. It starts other threads except InstanceInfoReplicator. Such as HeartbeatThread. The HeartbeatThread will report instance status and its lastDirtyTimestamp to remote periodically via PUT type HTTP request.
The request may have two response status: 404 and 200. The first one indicates there is no such instance registered on remote; the 2nd one means status report successfully.
If 404 received, the thread will invoke DiscoveryClient#register() to register instance.
Inference
The status on remote is STARTING, we can be sure it is registered by heartbeat thread.
There are serval time points:
1545039472888: this is whenInstanceInfoobject created. Because#lastUpdatedTimestampof local instance object is set when constructed. And never updated, this can be confirmed in heap dump above1545039481813: this is the timestamp when status change toUPfromSTARTING. This is also the last status modified time. Since then, heartbeat thread reports latest statsUPand its last modified timestamp1545039481813. Refer to tcp dump above1545039481898: this is the timestamp when remote received registration request. Refer to instance info on remote above1545039481899: this is the timestamp recored by remote. It is stored inlastUpdatedTimestampin remote instance info. It is only 1 millisecond later than registration time. Refer to instance info on remote above.
On remote, InstanceResource#renewLease() -> InstanceResource#validateDirtyTimestamp(): if the lastDirtyTimestamp in request is same as the one stored in registry. It WILL NOT update status even the stats in request is UP and the stored is STARTING.
Why this happens
Both DiscoveryClient and EurekaAutoServiceRegistration are initialized in EurekaClientAutoConfiguration. The latter depends on the former.
HeartbeatThread thread is started delay: eureka.instance.lease-renewal-interval-in-seconds can be used to configure delay time. Default is 30s, In our case it is configured as 10s.
This means there is race condition between Main flow and Heartbeat thread. Duration registration in heartbeat thread, it serializes instance info to JSON which gets its status first, then lastDirtyTimestamp. Between them, Main flow does status change in EurekaRegistration#register().
Further, EurekaAutoServiceRegistration#start() is invoked after all bean initialized. This means the cost is up to 10s for all beans' initialization.
Solution
In EurekaJacksonCodec$InstanceInfoSerializer#serialize(), move #autoMarshalEligible() invocation after line of jgen.writeStartObject(). This results gain of lastDirtyTimestamp is earlier than status. With this, it can be sure the status in first registration request align to real, even the origin lastDirtyTimestamp is earlier than real one. But this can be fix in later heartbeat reqeust.
