Exception when using torchserve to deploy hugging face model: java.lang.InterruptedException: null
#3026 opened on Mar 14, 2024
Description
🐛 Describe the bug
I followed the tutorial at https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers
First,
python Download_Transformer_models.py
Then,
torch-model-archiver --model-name BERTSeqClassification --version 1.0 --serialized-file Transformer_model/pytorch_model.bin --handler ./Transformer_handler_generalized.py --extra-files "Transformer_model/config.json,./setup_config.json,./Seq_classification_artifacts/index_to_name.json"
Finally,
torchserve --start --model-store model_store --models my_tc=BERTSeqClassification.mar --ncs
The server does not start normally. The log shows the following exception:
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:367) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:36) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:194) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
I then used curl to check which models are registered, and the list is empty:
root@0510f3693f42:/home/model-server# curl http://127.0.0.1:8081/models
{
"models": []
}
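The same check can be scripted. This is a minimal Python sketch, assuming the management API is reachable at http://127.0.0.1:8081 as in the curl call above and that each entry in the "models" list carries a "modelName" key. The helper names parse_model_names and fetch_model_names are hypothetical and not part of TorchServe.

```python
import json
import urllib.request

# Same endpoint as the curl call above.
MANAGEMENT_URL = "http://127.0.0.1:8081/models"


def parse_model_names(payload: str) -> list[str]:
    """Return the registered model names from a /models JSON response."""
    data = json.loads(payload)
    # Assumption: each entry looks like {"modelName": ..., "modelUrl": ...}.
    # Entries without "modelName" are skipped rather than raising.
    return [m["modelName"] for m in data.get("models", []) if "modelName" in m]


def fetch_model_names(url: str = MANAGEMENT_URL, timeout: float = 5.0) -> list[str]:
    """Query the management API and return the registered model names."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_model_names(resp.read().decode("utf-8"))
```

Calling fetch_model_names() against the server above would return an empty list, matching the curl output.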
Error logs
2024-03-14T07:34:24,938 [INFO ] epollEventLoopGroup-5-17 org.pytorch.serve.wlm.WorkerThread - 9015 Worker disconnected. WORKER_STARTED
2024-03-14T07:34:24,938 [INFO ] W-9015-my_tc_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9015.
2024-03-14T07:34:24,938 [DEBUG] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2024-03-14T07:34:24,938 [DEBUG] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1679) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:515) ~[?:?]
at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:677) ~[?:?]
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:367) ~[model-server.jar:?]
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:36) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:194) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2024-03-14T07:34:24,938 [DEBUG] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - W-9015-my_tc_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-03-14T07:34:24,938 [WARN ] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-03-14T07:34:24,939 [WARN ] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9015-my_tc_1.0-stderr
2024-03-14T07:34:24,939 [WARN ] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9015-my_tc_1.0-stdout
2024-03-14T07:34:24,939 [INFO ] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9015 in 3 seconds.
2024-03-14T07:34:24,946 [INFO ] W-9015-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9015-my_tc_1.0-stdout
2024-03-14T07:34:24,946 [INFO ] W-9015-my_tc_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9015-my_tc_1.0-stderr
2024-03-14T07:34:27,207 [DEBUG] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9010, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,489 [DEBUG] W-9012-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9012, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,579 [DEBUG] W-9000-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,669 [DEBUG] W-9011-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9011, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,704 [DEBUG] W-9006-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9006, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,707 [DEBUG] W-9008-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9008, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,734 [DEBUG] W-9017-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9017, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,751 [DEBUG] W-9003-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9003, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,776 [DEBUG] W-9001-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9001, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,804 [DEBUG] W-9005-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9005, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,815 [DEBUG] W-9009-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9009, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,844 [DEBUG] W-9013-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9013, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,848 [DEBUG] W-9004-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9004, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,853 [DEBUG] W-9007-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9007, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,857 [DEBUG] W-9019-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9019, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,871 [DEBUG] W-9002-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9002, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,904 [DEBUG] W-9014-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9014, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,904 [DEBUG] W-9018-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9018, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,927 [DEBUG] W-9016-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9016, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:27,939 [DEBUG] W-9015-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/home/venv/bin/python, /home/venv/lib/python3.9/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9015, --metrics-config, /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml]
2024-03-14T07:34:28,642 [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - s_name_part0=/home/model-server/tmp/.ts.sock, s_name_part1=9010, pid=8906
2024-03-14T07:34:28,644 [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9010
2024-03-14T07:34:28,657 [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - Successfully loaded /home/venv/lib/python3.9/site-packages/ts/configs/metrics.yaml.
2024-03-14T07:34:28,658 [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - [PID]8906
2024-03-14T07:34:28,658 [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - Torch worker started.
2024-03-14T07:34:28,659 [DEBUG] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - W-9010-my_tc_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2024-03-14T07:34:28,659 [INFO ] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9010
2024-03-14T07:34:28,660 [INFO ] epollEventLoopGroup-5-6 org.pytorch.serve.wlm.WorkerThread - 9010 Worker disconnected. WORKER_STARTED
2024-03-14T07:34:28,661 [DEBUG] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2024-03-14T07:34:28,661 [DEBUG] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081) ~[?:?]
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:276) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:424) ~[model-server.jar:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:191) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:833) [?:?]
2024-03-14T07:34:28,661 [DEBUG] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - W-9010-my_tc_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2024-03-14T07:34:28,662 [WARN ] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Auto recovery failed again
2024-03-14T07:34:28,663 [WARN ] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9010-my_tc_1.0-stderr
2024-03-14T07:34:28,663 [WARN ] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9010-my_tc_1.0-stdout
2024-03-14T07:34:28,664 [INFO ] W-9010-my_tc_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9010 in 5 seconds.
2024-03-14T07:34:28,692 [ERROR] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - Unknown exception
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
2024-03-14T07:34:28,698 [INFO ] W-9010-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9010-my_tc_1.0-stdout
2024-03-14T07:34:28,698 [INFO ] W-9010-my_tc_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9010-my_tc_1.0-stderr
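Since the Java stack traces repeat for every worker, it may help to condense the log. The following is a rough Python sketch, assuming the raw log file has one entry per line and uses the same "Retry worker: N in M seconds" and "MODEL_LOG - ..." message shapes seen in the excerpt above. The function name summarize_log is hypothetical.

```python
import re
from collections import Counter

# e.g. "... WorkerThread - Retry worker: 9015 in 3 seconds."
RETRY_RE = re.compile(r"Retry worker: (\d+) in (\d+) seconds")
# e.g. "... [INFO ] W-9010-my_tc_1.0-stdout MODEL_LOG - Torch worker started."
MODEL_LOG_RE = re.compile(r"\[\w+\s*\]\s+(W-\d+-\S+)\s+MODEL_LOG - (.*)")


def summarize_log(log_text: str):
    """Return (retry count per worker port, list of (thread, message) MODEL_LOG pairs)."""
    retries = Counter()
    model_log = []
    for line in log_text.splitlines():
        retry = RETRY_RE.search(line)
        if retry:
            retries[int(retry.group(1))] += 1
        entry = MODEL_LOG_RE.search(line)
        if entry:
            # MODEL_LOG lines carry the Python worker's own output, which is
            # where a handler-side traceback would appear if one was logged.
            model_log.append((entry.group(1), entry.group(2)))
    return retries, model_log
```

The MODEL_LOG entries are collected separately because they come from the Python backend worker rather than from the Java frontend that prints the InterruptedException.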
Installation instructions
pip install torchserve.
Yes, I am using docker image pytorch/torchserve:latest
Model Packaging
I used transformers==3.4.0 to save the pretrained model into Transformer_model/:
root@1796bda67dbf:~/Huggingface_Transformers# ll Transformer_model/
total 428008
drwxr-xr-x 2 root root 4096 Mar 14 07:05 ./
drwxr-xr-x 9 root root 4096 Mar 14 07:13 ../
-rw-r--r-- 1 root root 522 Mar 14 07:05 config.json
-rw-r--r-- 1 root root 438019213 Mar 14 07:05 pytorch_model.bin
-rw-r--r-- 1 root root 112 Mar 14 07:05 special_tokens_map.json
-rw-r--r-- 1 root root 174 Mar 14 07:05 tokenizer_config.json
-rw-r--r-- 1 root root 231508 Mar 14 07:05 vocab.txt
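The listing above only shows the inputs to the archiver. To see what ended up inside the archive itself, something like the Python sketch below could be used. It assumes the .mar file produced by torch-model-archiver in its default format is a regular zip archive containing a MAR-INF/MANIFEST.json entry; the function name inspect_mar is hypothetical.

```python
import json
import zipfile

MANIFEST_PATH = "MAR-INF/MANIFEST.json"  # assumed manifest location inside the archive


def inspect_mar(mar_path: str):
    """Return (list of file names, manifest dict or None) for a .mar archive."""
    with zipfile.ZipFile(mar_path) as archive:
        names = archive.namelist()
        manifest = None
        if MANIFEST_PATH in names:
            manifest = json.loads(archive.read(MANIFEST_PATH))
    return names, manifest
```

Running inspect_mar on BERTSeqClassification.mar would show whether the handler, pytorch_model.bin, config.json, setup_config.json and index_to_name.json were all packaged.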
config.properties
No response
Versions
torchserve==0.9.0
torch-model-archiver==0.9.0
Python version: 3.9 (64-bit runtime)
Python executable: /home/venv/bin/python
Versions of relevant python libraries:
captum==0.6.0
numpy==1.24.3
psutil==5.9.5
requests==2.31.0
torch==2.1.0+cpu
torch-model-archiver==0.9.0
torch-workflow-archiver==0.2.11
torchaudio==2.1.0+cpu
torchdata==0.7.0
torchserve==0.9.0
torchtext==0.16.0+cpu
torchvision==0.16.0+cpu
wheel==0.40.0
torch==2.1.0+cpu
torchtext==0.16.0+cpu
torchvision==0.16.0+cpu
torchaudio==2.1.0+cpu
Java Version:
OS: Ubuntu 20.04.6 LTS
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: N/A
CMake version: N/A
Environment: library_path (LD_/DYLD_):
Repro instructions
As described above
Possible Solution
No response