Netflix/eureka

Instance status keep STARTING in eureka server even it is UP on Local

Open

#1,174 opened on Jan 16, 2019

View on GitHub
 (10 comments) (0 reactions) (0 assignees)Java (12,190 stars) (3,712 forks)batch import
bughelp wanted

Description

At first, it is very difficult to reproduce. Even, this case is very rare but fatal. Currently, we met 3 times of 60+ services and 150+ instances in past two months.

eureka-client: 1.7.2 spring-boot: 1.5.12.RELEASE spring-cloud: Edgware.SR3

What we get

Instance info on remote

image

Status info on local

heap dump

image

Instance info in Heartbeat request

tcp dump

image

Scope Status lastDirtyTimestamp lastUpdatedTimestamp registrationTimestamp
Request UP 1545039481813
Local UP 1545039481813 1545039472888
Remote STARTING 1545039481813 1545039481899 1545039481898

Source Code Analysis

InstanceInfo initialization

The initialization is via InstanceInfoFactory#create() and status is STARTING

Instance registration

The real logic of registration is in DiscoveryClient#register(). But this method has two invokers.

I. Main flow registration

EurekaAutoServiceRegistration implements SmartLifecycle interface, its instance is initialized in EurekaClientAutoConfiguration#eurekaAutoServiceRegistration().

EurekaAutoServiceRegistration#start() method is invoker after all spring bean initialized and arranges EurekaServiceRegistry to register EurekaRegistration:

public void start() {
    ...
    if (!this.running.get() && this.registration.getNonSecurePort() > 0) {
    	this.serviceRegistry.register(this.registration);
    	this.context.publishEvent(
    			new InstanceRegisteredEvent<>(this, this.registration.getInstanceConfig()));
    	this.running.set(true);
    }
}

EurekaRegistration#register(): it set the instance status to initial stats UP which can be configured by eureka.instance.initial-status. During status updating, stats changed and timestamp marked down in lastDirtyTimestamp. Once status changed, StatusChangeListener#notify() is triggerred


public void register(EurekaRegistration reg) {
	maybeInitializeClient(reg);
	if (log.isInfoEnabled()) {
		log.info("Registering application " + reg.getInstanceConfig().getAppname()
				+ " with eureka with status "
				+ reg.getInstanceConfig().getInitialStatus());
	}
	//1
	reg.getApplicationInfoManager()
			.setInstanceStatus(reg.getInstanceConfig().getInitialStatus());
	if (reg.getHealthCheckHandler() != null) {
	   //2
		reg.getEurekaClient().registerHealthCheck(reg.getHealthCheckHandler());
	}
}

DiscoveryClient has a inner anonymous class implementing StatusChangeListener which will invoke InstanceInfoReplicator#onDemandUpdate()

statusChangeListener = new ApplicationInfoManager.StatusChangeListener() {
    @Override
    public String getId() {
        return "statusChangeListener";
    }
    @Override
    public void notify(StatusChangeEvent statusChangeEvent) {
        if (InstanceStatus.DOWN == statusChangeEvent.getStatus() ||
                InstanceStatus.DOWN == statusChangeEvent.getPreviousStatus()) {
            // log at warn level if DOWN was involved
            logger.warn("Saw local status change event {}", statusChangeEvent);
        } else {
            logger.info("Saw local status change event {}", statusChangeEvent);
        }
        instanceInfoReplicator.onDemandUpdate(); 
    }
};

InstanceInfoReplicator is initialized during DiscoveryClient#initScheduledTasks() and it an implementation of Runnable. After initialized, it is started by ScheduledThreadPoolExecutor. Likewise, #onDemandUpdate() will start new thread with ScheduledThreadPoolExecutor.

In #run() method, it invokes DiscoveryClient#refreshInstanceInfo() to update instance status vi HealthCheckHandler. Then, DiscoveryClient#register() invoked.

II. Registeration in Heartbeat thread

DiscoveryClient#initScheduledTasks() mentioned above. It starts other threads except InstanceInfoReplicator. Such as HeartbeatThread. The HeartbeatThread will report instance status and its lastDirtyTimestamp to remote periodically via PUT type HTTP request.

The request may have two response status: 404 and 200. The first one indicates there is no such instance registered on remote; the 2nd one means status report successfully.

If 404 received, the thread will invoke DiscoveryClient#register() to register instance.

Inference

The status on remote is STARTING, we can be sure it is registered by heartbeat thread.

There are serval time points:

  • 1545039472888: this is when InstanceInfo object created. Because #lastUpdatedTimestamp of local instance object is set when constructed. And never updated, this can be confirmed in heap dump above
  • 1545039481813: this is the timestamp when status change to UP from STARTING. This is also the last status modified time. Since then, heartbeat thread reports latest stats UP and its last modified timestamp 1545039481813. Refer to tcp dump above
  • 1545039481898: this is the timestamp when remote received registration request. Refer to instance info on remote above
  • 1545039481899: this is the timestamp recored by remote. It is stored in lastUpdatedTimestamp in remote instance info. It is only 1 millisecond later than registration time. Refer to instance info on remote above.

On remote, InstanceResource#renewLease() -> InstanceResource#validateDirtyTimestamp(): if the lastDirtyTimestamp in request is same as the one stored in registry. It WILL NOT update status even the stats in request is UP and the stored is STARTING.

Why this happens

Both DiscoveryClient and EurekaAutoServiceRegistration are initialized in EurekaClientAutoConfiguration. The latter depends on the former.

HeartbeatThread thread is started delay: eureka.instance.lease-renewal-interval-in-seconds can be used to configure delay time. Default is 30s, In our case it is configured as 10s.

This means there is race condition between Main flow and Heartbeat thread. Duration registration in heartbeat thread, it serializes instance info to JSON which gets its status first, then lastDirtyTimestamp. Between them, Main flow does status change in EurekaRegistration#register().

Further, EurekaAutoServiceRegistration#start() is invoked after all bean initialized. This means the cost is up to 10s for all beans' initialization.

Solution

In EurekaJacksonCodec$InstanceInfoSerializer#serialize(), move #autoMarshalEligible() invocation after line of jgen.writeStartObject(). This results gain of lastDirtyTimestamp is earlier than status. With this, it can be sure the status in first registration request align to real, even the origin lastDirtyTimestamp is earlier than real one. But this can be fix in later heartbeat reqeust.

image

Contributor guide