Load balancer check results in "[ERROR] epollEventLoopGroup-3-1 org.pytorch.serve.http.HttpRequestHandler - io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer" · pytorch/serve#2201

Description

🐛 Describe the bug

We are using TorchServe to serve a yolox_x model trained with mmdet. We built a customized TorchServe Docker image and wrote a simple docker-compose.yml file, which runs on a Debian 11 host with:

Docker version 23.0.1, build a5ee5b1
Docker Compose version v2.15.1

In our current deployment, an HAProxy-based load balancer sits in front of the TorchServe hosts. Every second, HAProxy checks that each TorchServe host is up by calling GET $HOSTNAME:8080/ping. A host is considered healthy if the response has status code 200 and the response body contains the word Healthy.
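The health rule described above can be sketched in Python for testing a host by hand. This is only an illustration of the rule, not the HAProxy configuration itself; the function names `is_healthy` and `check_ping` are hypothetical, and the one-second timeout is an assumption.

```python
import http.client


def is_healthy(status, body):
    """Apply the rule from the report: status 200 and the word Healthy in the body."""
    return status == 200 and "Healthy" in body


def check_ping(host, port=8080, timeout=1.0):
    """Send GET /ping and evaluate the response; any network error counts as unhealthy."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/ping")
        response = conn.getresponse()
        body = response.read().decode("utf-8", errors="replace")
        return is_healthy(response.status, body)
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

Calling `check_ping("localhost")` against a running container should return True when the ping endpoint answers as described.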

Unfortunately, when we inspect the TorchServe logs with docker logs -f torchserve (where torchserve is the container name), we see an error logged each time HAProxy checks that the TorchServe host is up.

Here is an example of the errors.

2023-03-27T07:53:33,604 [INFO ] pool-2-thread-2 ACCESS_LOG - /LOADBLANCER_IP:55612 "GET /ping HTTP/1.0" 200 0
2023-03-27T07:53:33,604 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:33,606 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

where LOADBLANCER_IP is the anonymized IP address of the load balancer host.

When we stop the HAProxy service, the TorchServe instance stops logging the error. The error does not appear to block requests, but it is logged on every health check, which is not healthy behavior.

We can also reproduce the same issue with the default TorchServe Docker image and a different custom model.

A similar issue was opened in 2021. However, disabling or avoiding the health check is not a valid option in our scenario.
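One possible explanation, which this report does not confirm, is that the health-check client closes its TCP connection abortively (with a RST instead of a FIN) right after reading the response, so the server's next read fails with "Connection reset by peer". The sketch below reproduces that mechanism with plain sockets on localhost. It does not involve TorchServe or HAProxy, all function names are hypothetical, and the SO_LINGER struct layout assumes a POSIX system such as Linux.

```python
import socket
import struct
import threading


def serve_one_ping(listener, outcome):
    """Answer one request, then try to read again as a keep-alive server would."""
    conn, _ = listener.accept()
    with conn:
        conn.recv(4096)  # the probe's request line and headers
        conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: 7\r\n\r\nHealthy")
        try:
            data = conn.recv(4096)  # the read that follows the response
            outcome.append("closed cleanly" if data == b"" else "more data")
        except ConnectionResetError:
            outcome.append("connection reset by peer")


def probe_with_reset(port):
    """Send a ping, read the response, then close abortively."""
    client = socket.create_connection(("127.0.0.1", port))
    client.sendall(b"GET /ping HTTP/1.0\r\n\r\n")
    body = b""
    while b"Healthy" not in body:
        chunk = client.recv(4096)
        if not chunk:
            break
        body += chunk
    # l_onoff=1, l_linger=0: close() discards the socket and sends RST, not FIN.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    client.close()
    return body


def demo():
    """Run the stub server and the probe once; return what the server observed."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    outcome = []
    server = threading.Thread(target=serve_one_ping, args=(listener, outcome))
    server.start()
    probe_with_reset(port)
    server.join(timeout=5)
    listener.close()
    return outcome[0]
```

If the stub server reports a reset here, a packet capture on the real deployment (looking for a RST from the load balancer after each /ping response) would be a reasonable way to check whether the same thing happens with HAProxy.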

Error logs

WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-03-27T07:53:16,338 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-03-27T07:53:16,425 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.7.1
TS Home: /usr/local/lib/python3.9/site-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Metrics config path: /usr/local/lib/python3.9/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 4
Max heap size: 960 M
Python executable: /usr/local/bin/python
Config file: model-store/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://0.0.0.0:8082
Model Store: /home/model-server/model-store
Initial Models: yoloxx-coco=yoloxx-coco.mar
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 4
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/model-server/model-store
Model config: {"yoloxx-coco": {"1.0": {"defaultVersion": true,"marName": "yoloxx-coco.mar","minWorkers": 1,"maxWorkers": 10,"batchSize": 1,"maxBatchDelay": 100,"responseTimeout": 100}}}
2023-03-27T07:53:16,432 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2023-03-27T07:53:16,435 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: yoloxx-coco.mar
2023-03-27T07:53:20,703 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model yoloxx-coco
2023-03-27T07:53:20,703 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model yoloxx-coco
2023-03-27T07:53:20,703 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model yoloxx-coco loaded.
2023-03-27T07:53:20,703 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: yoloxx-coco, count: 1
2023-03-27T07:53:20,722 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-03-27T07:53:20,722 [DEBUG] W-9000-yoloxx-coco_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/local/bin/python, /usr/local/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /usr/local/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2023-03-27T07:53:20,785 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2023-03-27T07:53:20,786 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-03-27T07:53:20,787 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2023-03-27T07:53:20,787 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-03-27T07:53:20,790 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2023-03-27T07:53:20,989 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2023-03-27T07:53:21,044 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:21,044 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:23.057697296142578|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:21,044 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:5.087863922119141|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:21,044 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:18.1|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:21,044 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:3169.96875|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:21,044 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:418.5859375|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:21,044 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:17.3|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:21,541 [INFO ] pool-2-thread-2 ACCESS_LOG - /LOADBLANCER_IP:55582 "GET /ping HTTP/1.0" 200 9
2023-03-27T07:53:21,542 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:21,549 [ERROR] epollEventLoopGroup-3-1 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2023-03-27T07:53:21,916 [INFO ] W-9000-yoloxx-coco_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2023-03-27T07:53:21,921 [INFO ] W-9000-yoloxx-coco_1.0-stdout MODEL_LOG - Successfully loaded /usr/local/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2023-03-27T07:53:21,921 [INFO ] W-9000-yoloxx-coco_1.0-stdout MODEL_LOG - [PID]32
2023-03-27T07:53:21,922 [INFO ] W-9000-yoloxx-coco_1.0-stdout MODEL_LOG - Torch worker started.
2023-03-27T07:53:21,922 [INFO ] W-9000-yoloxx-coco_1.0-stdout MODEL_LOG - Python runtime: 3.9.16
2023-03-27T07:53:21,922 [DEBUG] W-9000-yoloxx-coco_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-yoloxx-coco_1.0 State change null -> WORKER_STARTED
2023-03-27T07:53:21,924 [INFO ] W-9000-yoloxx-coco_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2023-03-27T07:53:21,931 [INFO ] W-9000-yoloxx-coco_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2023-03-27T07:53:21,933 [INFO ] W-9000-yoloxx-coco_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1679903601933
2023-03-27T07:53:21,942 [INFO ] W-9000-yoloxx-coco_1.0-stdout MODEL_LOG - model_name: yoloxx-coco, batchSize: 1
2023-03-27T07:53:22,582 [WARN ] W-9000-yoloxx-coco_1.0-stderr MODEL_LOG - /usr/local/lib/python3.9/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
2023-03-27T07:53:22,582 [WARN ] W-9000-yoloxx-coco_1.0-stderr MODEL_LOG -   warnings.warn(
2023-03-27T07:53:23,554 [INFO ] pool-2-thread-2 ACCESS_LOG - /LOADBLANCER_IP:55586 "GET /ping HTTP/1.0" 200 0
2023-03-27T07:53:23,554 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:23,556 [ERROR] epollEventLoopGroup-3-2 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2023-03-27T07:53:23,672 [INFO ] W-9000-yoloxx-coco_1.0-stdout MODEL_LOG - generated new fontManager
2023-03-27T07:53:25,246 [INFO ] W-9000-yoloxx-coco_1.0-stdout MODEL_LOG - load checkpoint from local path: /home/model-server/tmp/models/dc79daf5690e45868dec208935c74c17/yolox_x_8x8_300e_coco_20211126_140254-1ef88d67.pth
2023-03-27T07:53:25,540 [INFO ] W-9000-yoloxx-coco_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 3595
2023-03-27T07:53:25,540 [DEBUG] W-9000-yoloxx-coco_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-yoloxx-coco_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-03-27T07:53:25,540 [INFO ] W-9000-yoloxx-coco_1.0 TS_METRICS - W-9000-yoloxx-coco_1.0.ms:4831|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903605
2023-03-27T07:53:25,540 [INFO ] W-9000-yoloxx-coco_1.0 TS_METRICS - WorkerThreadTime.ms:12|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903605
2023-03-27T07:53:25,564 [INFO ] pool-2-thread-2 ACCESS_LOG - /LOADBLANCER_IP:55596 "GET /ping HTTP/1.0" 200 0
2023-03-27T07:53:25,564 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:25,565 [ERROR] epollEventLoopGroup-3-3 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2023-03-27T07:53:27,573 [INFO ] pool-2-thread-2 ACCESS_LOG - /LOADBLANCER_IP:55600 "GET /ping HTTP/1.0" 200 1
2023-03-27T07:53:27,573 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:27,574 [ERROR] epollEventLoopGroup-3-4 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2023-03-27T07:53:29,584 [INFO ] pool-2-thread-2 ACCESS_LOG - /LOADBLANCER_IP:55604 "GET /ping HTTP/1.0" 200 0
2023-03-27T07:53:29,584 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:29,586 [ERROR] epollEventLoopGroup-3-5 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2023-03-27T07:53:31,594 [INFO ] pool-2-thread-2 ACCESS_LOG - /LOADBLANCER_IP:55608 "GET /ping HTTP/1.0" 200 0
2023-03-27T07:53:31,594 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:31,596 [ERROR] epollEventLoopGroup-3-6 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2023-03-27T07:53:33,604 [INFO ] pool-2-thread-2 ACCESS_LOG - /LOADBLANCER_IP:55612 "GET /ping HTTP/1.0" 200 0
2023-03-27T07:53:33,604 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601
2023-03-27T07:53:33,606 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler -
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

where LOADBLANCER_IP is the anonymized IP address of the load balancer host.

Installation instructions

I am using Docker with docker-compose and a custom image for running mmdetection object detection models (following the official docs).

Below are the files I defined:

Dockerfile with entrypoint.sh and config.properties
docker-compose.yml

The Dockerfile is defined as follows.

FROM python:3.9-slim-buster

ARG PYTORCH="1.13.1"
ARG TORCHVISION="0.14.1"
ARG TORCHAUDIO="0.13.1"
RUN pip install torch==${PYTORCH}+cpu torchvision==${TORCHVISION}+cpu torchaudio==${TORCHAUDIO}+cpu --extra-index-url https://download.pytorch.org/whl/cpu

ARG MMCV="1.7.0"
ARG MMDET="2.28.1"

ENV PYTHONUNBUFFERED TRUE

RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
    ca-certificates \
    g++ \
    openjdk-11-jre-headless \
    # MMDet Requirements
    ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 \
    && rm -rf /var/lib/apt/lists/*

ENV PATH="/opt/conda/bin:$PATH"
RUN export FORCE_CUDA=1

# TORCHSERVE
RUN pip install torchserve torch-model-archiver

# MMLAB
RUN ["/bin/bash", "-c", "pip install mmcv-full==${MMCV} -f https://download.openmmlab.com/mmcv/dist/cpu/torch${PYTORCH}/index.html"]
RUN pip install mmdet==${MMDET}

RUN useradd -m model-server \
    && mkdir -p /home/model-server/tmp

COPY entrypoint.sh /usr/local/bin/entrypoint.sh

RUN chmod +x /usr/local/bin/entrypoint.sh \
    && chown -R model-server /home/model-server

COPY config.properties /home/model-server/config.properties
RUN mkdir /home/model-server/model-store && chown -R model-server /home/model-server/model-store

EXPOSE 8080 8081 8082

USER model-server
WORKDIR /home/model-server
ENV TEMP=/home/model-server/tmp
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
CMD ["serve"]

The config.properties is available here

The entrypoint.sh is available here

To build the image: docker build --pull -t mmdet-torchserve-cpu:2.28.1 .

Finally, the docker-compose.yml is defined as follows.

version: '3.8'

services:
  torchserve:
    image: 'mmdet-torchserve-cpu:2.28.1'
    ports:
      - '8080:8080'
      - '8081:8081'
      - '8082:8082'
    container_name: 'torchserve'
    volumes:
      - '/home/torchserve/model-store:/home/model-server/model-store'
    command:
      - 'torchserve --start'
      - '--ncs'
      - '--model-store model-store'
      - '--models yoloxx-coco=yoloxx-coco.mar'
      - '--ts-config model-store/config.properties'
    networks:
      - torchserve_net

networks:
  torchserve_net:

Model Packaging

The handler.py is based on the one proposed by OpenMMLab. We only changed the input/output format and added some basic error propagation.

# Copyright (c) OpenMMLab. All rights reserved.
import base64
import os

import mmcv
import torch
from ts.torch_handler.base_handler import BaseHandler

from mmdet.apis import inference_detector, init_detector

from ts.utils.util import PredictionException
import time

class MMdetHandler(BaseHandler):
    threshold = 0.4

    def initialize(self, context):
        properties = context.system_properties
        self.map_location = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.device = torch.device(self.map_location + ':' +
                                   str(properties.get('gpu_id')) if torch.cuda.
                                   is_available() else self.map_location)
        self.manifest = context.manifest

        model_dir = properties.get('model_dir')
        serialized_file = self.manifest['model']['serializedFile']
        checkpoint = os.path.join(model_dir, serialized_file)
        self.config_file = os.path.join(model_dir, 'config.py')

        self.model = init_detector(self.config_file, checkpoint, self.device)
        self.initialized = True

    def preprocess(self, data):
        images_batches = []
        
        for req in data:
            images_batch=[]
            data_loaded = req.get("data") if req.get("data") is not None else req.get("body", {})
            
            if len(data_loaded['instances'])<1:
                raise ValueError("empty instances list")
            
            for image in data_loaded['instances']:
                image = base64.urlsafe_b64decode(image)
                image = mmcv.imfrombytes(image)
                images_batch.append(image)
            images_batches.append(images_batch)
        return images_batches

    def inference(self, images_batches, *args, **kwargs):
        model_results=[]
        shapes=[]
        for image_batch in images_batches:
            model_results.append(inference_detector(self.model, image_batch))
            shapes.append([image.shape for image in image_batch])
        
        results=(model_results, shapes)
        return results

    def postprocess(self, results):
        output_batches = []
        model_results, shapes = results
        
        for i in range(len(model_results)):
            post_processed_batch=[]
            
            for j in range(len(model_results[i])):
                image_result=model_results[i][j]
                shape=shapes[i][j]
                h, w, c= shape
                
                image_result_clean=[]
                
                if isinstance(image_result, tuple):
                    bbox_result, segm_result = image_result
                    if isinstance(segm_result, tuple):
                        segm_result = segm_result[0]  # ms rcnn
                else:
                    bbox_result, segm_result = image_result, None

                for class_index, class_result in enumerate(bbox_result):
                    class_name = self.model.CLASSES[class_index]
                    for bbox in class_result:
                        bbox_coords = bbox[:-1].tolist()
                        y1,x1, y2,x2 =bbox_coords
                        relative_bbox_coords = y1/h, x1/w, y2/h, x2/w
                        score = float(bbox[-1])
                        if score >= self.threshold:
                            image_result_clean.append({
                                'class_name': class_name,
                                'bbox': bbox_coords,
                                'relative_bbox': relative_bbox_coords,
                                'score': score  
                            })
                post_processed_batch.append({'predictions': image_result_clean,
                                             'img_shape': shape})
            output_batches.append(post_processed_batch)
        return output_batches
    
    def handle(self, data, context):
        """Entry point for default handler. It takes the data from the input request and returns
           the predicted outcome for the input.
        Args:
            data (list): The input data that needs to be made a prediction request on.
            context (Context): It is a JSON Object containing information pertaining to
                               the model artefacts parameters.
        Returns:
            list : Returns a list of dictionary with the predicted response.
        """

        # It can be used for pre or post processing if needed as additional request
        # information is available in context
        start_time = time.time()

        self.context = context
        metrics = self.context.metrics

        is_profiler_enabled = os.environ.get("ENABLE_TORCH_PROFILER", None)
        if is_profiler_enabled:
            if PROFILER_AVAILABLE:
                output, _ = self._infer_with_profiler(data=data)
            else:
                raise RuntimeError(
                    "Profiler is enabled but current version of torch does not support."
                    "Install torch>=1.8.1 to use profiler."
                )
        else:
            try:
                data_preprocess = self.preprocess(data)
            except Exception as e:
                raise PredictionException(
                    f"{type(e)} Error during data preprocessing: {str(e)}",
                    400)

            try:
                output = self.inference(data_preprocess)
            except Exception as e:
                raise PredictionException(
                        f"{type(e)} Error during inference: {str(e)}",
                        503)
            try:
                output = self.postprocess(output)
            except Exception as e:
                raise PredictionException(
                        f"{type(e)} Error during data postprocessing: {str(e)}",
                        503)
                
        stop_time = time.time()
        metrics.add_time(
            "HandlerTime", round((stop_time - start_time) * 1000, 2), None, "ms"
        )
        return output

To package the model as a .mar file, please refer to the following mmdet2torchserve.py file and the official mmdet documentation.

config.properties

This config.properties is specific to the model and is located at /home/model-server/model-store/config.properties in the container.

We do not override the default config.properties file at /home/model-server/config.properties.

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store

async_logging=true
default_response_timeout=120
enable_metrics_api=true
max_request_size = 6553500
max_response_size = 6553500

models={\
  "yoloxx-coco": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "yoloxx-coco.mar",\
        "minWorkers": 1,\
        "maxWorkers": 10,\
        "batchSize": 10,\
        "maxBatchDelay": 100,\
        "responseTimeout": 100\
    }\
  }\
}

Versions

The script python serve/ts_scripts/print_env_info.py does not work in the Docker container.

The installed torch* libraries are:

torch==1.13.1+cpu
torch-model-archiver==0.7.1
torchaudio==0.13.1+cpu
torchserve==0.7.1
torchvision==0.14.1+cpu

The full list of installed pip packages is the following.

model-server@efa77cf4af16:~$ pip freeze
addict==2.4.0
certifi==2022.12.7
charset-normalizer==3.1.0
contourpy==1.0.7
cycler==0.11.0
enum-compat==0.0.3
fonttools==4.39.1
idna==3.4
importlib-resources==5.12.0
kiwisolver==1.4.4
matplotlib==3.7.1
mmcv-full==1.7.0
mmdet==2.28.1
numpy==1.24.2
opencv-python==4.7.0.72
packaging==23.0
Pillow==9.4.0
psutil==5.9.4
pycocotools==2.0.6
pyparsing==3.0.9
python-dateutil==2.8.2
PyYAML==6.0
requests==2.28.2
scipy==1.10.1
six==1.16.0
terminaltables==3.1.10
torch==1.13.1+cpu
torch-model-archiver==0.7.1
torchaudio==0.13.1+cpu
torchserve==0.7.1
torchvision==0.14.1+cpu
typing_extensions==4.5.0
urllib3==1.26.15
yapf==0.32.0
zipp==3.15.0

Repro instructions

Create an infrastructure with a load balancer that defines a health check on the route GET $HOSTNAME:8080/ping, where $HOSTNAME is the host running TorchServe under docker-compose.
Create the TorchServe container by running docker-compose up -d --build against the docker-compose file.
Run docker logs -f torchserve to see the error.
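To quantify the behavior rather than watch the log scroll by, the log lines can be counted. The following is a minimal sketch that assumes the line formats shown in the logs above; the `summarize` helper and its regular expression are hypothetical and not part of TorchServe.

```python
import re

# Matches access-log entries for the health check, e.g.
# ... ACCESS_LOG - /LOADBLANCER_IP:55612 "GET /ping HTTP/1.0" 200 0
PING_RE = re.compile(r'ACCESS_LOG - \S+ "GET /ping HTTP/1\.\d" (\d{3})')
RESET_MARKER = "Connection reset by peer"


def summarize(lines):
    """Count ping requests and connection-reset errors in an iterable of log lines."""
    pings = 0
    resets = 0
    for line in lines:
        if PING_RE.search(line):
            pings += 1
        elif RESET_MARKER in line:
            resets += 1
    return {"pings": pings, "resets": resets}


sample = [
    '2023-03-27T07:53:33,604 [INFO ] pool-2-thread-2 ACCESS_LOG - /LOADBLANCER_IP:55612 "GET /ping HTTP/1.0" 200 0',
    "2023-03-27T07:53:33,604 [INFO ] pool-2-thread-2 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:efa77cf4af16,timestamp:1679903601",
    "2023-03-27T07:53:33,606 [ERROR] epollEventLoopGroup-3-7 org.pytorch.serve.http.HttpRequestHandler -",
    "io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer",
]
print(summarize(sample))
```

The same function can be fed the saved output of docker logs torchserve, one line per element. Equal ping and reset counts would support the observation that every health check triggers the error.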

Possible Solution

No response

Contributor guide

Research direction: Investigate the Netty HttpRequestHandler to determine why a connection reset by peer occurs after a successful ping response. Check if the client (HAProxy) closes the connection after receiving the response, causing the server to attempt a read on a closed channel. Consider adding proper handling for such cases or investigate the Netty configuration.
Tech stack: Java, Python
Domain: backend
Issue type: Bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Mostly clear
Prerequisites: Java, Netty
Newbie friendliness: 65