pytorch/serve

The memory occupied by the model becomes larger after it is loaded into the GPU

Open

#2022 opened on Dec 5, 2022


Description

🐛 Describe the bug

We have two GPT2 models. Model 1 has only 110 million parameters, stored as 16-bit floating-point numbers. Model 2 has 3.5 billion parameters, also stored as 16-bit floating-point numbers.

However, when both are loaded with the same handler into the pytorch/torchserve:latest-gpu container, Model 2 occupies about 7 GB of GPU memory, which is consistent with our calculation, while Model 1 occupies about 1 GB, far more than we expected.
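The calculation mentioned above is parameter count times 2 bytes per 16-bit value. A minimal sketch of that arithmetic, assuming the quoted parameter counts are exact and counting only the weights (no activations or runtime overhead); the helper names are made up for illustration:

```python
BYTES_PER_FP16 = 2  # a 16-bit float occupies 2 bytes


def expected_weight_bytes(num_params: int, bytes_per_param: int = BYTES_PER_FP16) -> int:
    """Memory needed for the weights alone."""
    return num_params * bytes_per_param


def to_mib(num_bytes: int) -> float:
    """Convert bytes to MiB, the unit nvidia-smi reports."""
    return num_bytes / (1024 ** 2)


model_1 = expected_weight_bytes(110_000_000)      # Model 1: 110 million parameters
model_2 = expected_weight_bytes(3_500_000_000)    # Model 2: 3.5 billion parameters

print(f"Model 1: {to_mib(model_1):.0f} MiB")  # Model 1: 210 MiB
print(f"Model 2: {to_mib(model_2):.0f} MiB")  # Model 2: 6676 MiB
```

By this estimate Model 1 should need roughly 210 MiB and Model 2 roughly 6.5 GiB for weights.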

Error logs

Model 1 size:

$ ll -h conversation/
total 250M
-rw-rw-r-- 1 alex alex  848 Dec  5 12:45 config.json
-rw-rw-r-- 1 alex alex 250M Dec  5 12:45 pytorch_model.bin

Model 2 size:

$ ll -h gpt2/
total 6.7G
-rw-rw-r-- 1 alex alex  858 Nov 14 18:33 config.json
-rw-rw-r-- 1 alex alex 6.7G Nov 14 18:34 pytorch_model.bin

watch -n 1 nvidia-smi

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    155043      C   /home/venv/bin/python            1227MiB |
|    4   N/A  N/A    396432      C   /home/venv/bin/python            7507MiB |
|    6   N/A  N/A    155043      C   /home/venv/bin/python            1211MiB |
|    7   N/A  N/A    514629      C   /home/venv/bin/python             945MiB |
+-----------------------------------------------------------------------------+

Model 1 is on GPU 7 and Model 2 is on GPU 4. Please ignore the other two models on GPU 0 and GPU 6.
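To put numbers on the discrepancy, the process table can be parsed and compared against the weight-only estimate (parameters times 2 bytes). This is a rough sketch: the regular expression assumes the exact row layout shown above, and the function and variable names are hypothetical.

```python
import re
from typing import Dict

# Process rows copied from the nvidia-smi output above.
NVIDIA_SMI_ROWS = """\
|    0   N/A  N/A    155043      C   /home/venv/bin/python            1227MiB |
|    4   N/A  N/A    396432      C   /home/venv/bin/python            7507MiB |
|    6   N/A  N/A    155043      C   /home/venv/bin/python            1211MiB |
|    7   N/A  N/A    514629      C   /home/venv/bin/python             945MiB |
"""

# Columns: GPU, GI, CI, PID, Type, Process name, GPU memory usage.
ROW = re.compile(r"^\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+\S+\s+(\d+)MiB\s+\|$")


def usage_by_gpu(table: str) -> Dict[int, int]:
    """Return {gpu_index: MiB used}, summing all processes on each GPU."""
    usage: Dict[int, int] = {}
    for line in table.splitlines():
        match = ROW.match(line)
        if match:
            gpu, _pid, mib = match.groups()
            usage[int(gpu)] = usage.get(int(gpu), 0) + int(mib)
    return usage


def fp16_weights_mib(num_params: int) -> float:
    """Weight-only estimate: 2 bytes per parameter, expressed in MiB."""
    return num_params * 2 / (1024 ** 2)


observed = usage_by_gpu(NVIDIA_SMI_ROWS)
gap_model_1 = observed[7] - fp16_weights_mib(110_000_000)     # Model 1 on GPU 7
gap_model_2 = observed[4] - fp16_weights_mib(3_500_000_000)   # Model 2 on GPU 4

print(f"Model 1 uses {gap_model_1:.0f} MiB more than its weights")  # 735
print(f"Model 2 uses {gap_model_2:.0f} MiB more than its weights")  # 831
```

On these figures both processes use several hundred MiB beyond the weight-only estimate, and the gap is of similar size for both models even though the weights differ by a factor of about 30.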

Installation instructions

Using the command:

docker run -it -d --name torchserve --gpus '"device=0,1,2,3,4,5,6,7"' -p 18080:8080 -p 18081:8081 -p 18082:8082 -p 17070:7070 -p 17071:7071 -v ./config.properties:/home/model-server/config.properties -v ./torchserve/model-store:/home/model-server/model-store pytorch/torchserve:latest-gpu

Model Packaging

Model 1: https://huggingface.co/IDEA-CCNL/Wenzhong-GPT2-110M
Model 2: https://huggingface.co/IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese

Both models were fine-tuned before packaging.

Packaging model 1:

torch-model-archiver --model-name insurance_chat    --force --version 1.0 --serialized-file /home/AppealGenerate/saved_model/conversation/pytorch_model.bin  --handler /data/nlg_pipeline/gpt2/dialog/handler.py    --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/home/AppealGenerate/saved_model/conversation/config.json"

Packaging model 2:

torch-model-archiver --model-name insurance_chat_3.5B    --force --version 1.0 --serialized-file /data/nlg_pipeline/gpt2/dialog/models/finetune/gpt2/pytorch_model.bin  --handler /data/nlg_pipeline/gpt2/dialog/handler.py    --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/data/nlg_pipeline/gpt2/dialog/models/finetune/gpt2/config.json"

The two commands are very similar. They differ only in the model name and the paths to the model files.

config.properties

No response

Versions

$ docker exec torchserve  pip list
Package                Version
---------------------- ------------
accelerate             0.15.0
captum                 0.5.0
diffusers              0.9.0
huggingface-hub        0.11.1
jieba                  0.42.1
matplotlib             3.5.2
matplotlib-inline      0.1.6
numpy                  1.22.4
oss2                   2.16.0
packaging              21.3
pandas                 1.4.2
Pillow                 9.0.1
pip                    22.3.1
pycryptodome           3.16.0
requests               2.28.0
rouge                  1.0.1
scikit-learn           1.1.3
scipy                  1.9.3
sentence-transformers  2.2.2
sentencepiece          0.1.97
tokenizers             0.13.2
torch                  1.11.0+cu102
torch-model-archiver   0.6.0
torchserve             0.6.0
torchtext              0.12.0
torchvision            0.12.0+cu102
transformers           4.25.1

Repro instructions

The steps are the same for both models. For Model 1:

git pull
curl -X DELETE http://localhost:18081/models/insurance_chat
rm -rf /data/liuzhaofeng/torchserve/model-store/insurance_chat.mar

torch-model-archiver --model-name insurance_chat    --force --version 1.0 --serialized-file /home/AppealGenerate/saved_model/conversation/pytorch_model.bin  --handler /data/nlg_pipeline/gpt2/dialog/handler.py    --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/home/AppealGenerate/saved_model/conversation/config.json"

curl -X POST "http://localhost:18081/models?url=insurance_chat.mar"
curl -X PUT "http://localhost:18081/models/insurance_chat?min_worker=1"
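The three curl calls above can also be scripted. A sketch using only the Python standard library, assuming the management API is reachable on localhost:18081 as mapped in the docker command; the helper names are hypothetical, and the archive still has to be rebuilt with torch-model-archiver between unregistering and registering.

```python
import urllib.request
from urllib.parse import urlencode

# Host port 18081 is mapped to the container's management port 8081.
MANAGEMENT_URL = "http://localhost:18081"


def build_request(method: str, path: str, **params) -> urllib.request.Request:
    """Build (but do not send) a management API request."""
    url = f"{MANAGEMENT_URL}{path}"
    if params:
        url += "?" + urlencode(params)
    return urllib.request.Request(url, method=method)


def unregister_request(model_name: str) -> urllib.request.Request:
    # Equivalent of: curl -X DELETE .../models/<model_name>
    return build_request("DELETE", f"/models/{model_name}")


def register_request(mar_file: str) -> urllib.request.Request:
    # Equivalent of: curl -X POST ".../models?url=<mar_file>"
    return build_request("POST", "/models", url=mar_file)


def scale_request(model_name: str, min_worker: int = 1) -> urllib.request.Request:
    # Equivalent of: curl -X PUT ".../models/<model_name>?min_worker=1"
    return build_request("PUT", f"/models/{model_name}", min_worker=min_worker)


def send(request: urllib.request.Request) -> str:
    """Send a request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode()


if __name__ == "__main__":
    # Print the calls instead of sending them; pass each to send() to execute.
    for req in (
        unregister_request("insurance_chat"),
        register_request("insurance_chat.mar"),
        scale_request("insurance_chat"),
    ):
        print(req.get_method(), req.full_url)
```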

Possible Solution

Shouldn't Model 1 use only a little over 200 MB of GPU memory?
