The memory occupied by the model becomes larger after it is loaded into the GPU · pytorch/serve#2022


Labels: help wanted, triaged


Description

🐛 Describe the bug

We have two GPT2 models. Model 1 has only 110 million parameters, stored as 16-bit floating-point numbers. Model 2 has 3.5 billion parameters, also stored as 16-bit floating-point numbers.

However, after both are loaded with the same handler into the pytorch/torchserve:latest-gpu container, Model 2 occupies about 7 GB of GPU memory, which matches our calculation, while Model 1 occupies about 1 GB, far more than we expected.
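For reference, the expected weight footprint can be estimated from the parameter counts quoted above. This is only a rough sketch: the parameter counts are the nominal figures from the model names, `weight_mib` is a hypothetical helper, and it counts weights only (no activations, buffers, or runtime overhead).

```python
MIB = 1024 ** 2  # bytes per MiB, the unit nvidia-smi reports


def weight_mib(num_params, bytes_per_param=2):
    """Approximate size of the weights alone, in MiB.

    bytes_per_param defaults to 2 because both models are stored in fp16.
    """
    return num_params * bytes_per_param / MIB


# Nominal parameter counts taken from the description above.
model_1 = weight_mib(110_000_000)
model_2 = weight_mib(3_500_000_000)

print(f"Model 1 weights: {model_1:.0f} MiB")
print(f"Model 2 weights: {model_2:.0f} MiB")
```

Under these assumptions Model 1 should need roughly 210 MiB and Model 2 roughly 6.5 GiB for weights alone.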

Error logs

Model 1 size:

$ ll -h conversation/
total 250M
-rw-rw-r-- 1 alex alex  848 Dec  5 12:45 config.json
-rw-rw-r-- 1 alex alex 250M Dec  5 12:45 pytorch_model.bin

Model 2 size:

$ ll -h gpt2/
total 6.7G
-rw-rw-r-- 1 alex alex  858 Nov 14 18:33 config.json
-rw-rw-r-- 1 alex alex 6.7G Nov 14 18:34 pytorch_model.bin

watch -n 1 nvidia-smi

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    155043      C   /home/venv/bin/python            1227MiB |
|    4   N/A  N/A    396432      C   /home/venv/bin/python            7507MiB |
|    6   N/A  N/A    155043      C   /home/venv/bin/python            1211MiB |
|    7   N/A  N/A    514629      C   /home/venv/bin/python             945MiB |
+-----------------------------------------------------------------------------+

Model 1 is on GPU 7 and Model 2 is on GPU 4. Ignore the other two processes on GPU 0 and GPU 6.
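One way to read these numbers is to subtract each checkpoint's on-disk size from the usage nvidia-smi reports for its GPU. The sketch below does that for the two relevant rows. It assumes `ls -h` reports 1024-based units (250M = 250 MiB, 6.7G = 6.7 GiB), that process names contain no spaces, and `parse_processes` is a hypothetical helper, not part of any tool.

```python
import re

# Matches one row of the nvidia-smi "Processes" table:
# | GPU  GI  CI  PID  Type  Process name  <N>MiB |
ROW = re.compile(
    r"^\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+(\S+)\s+(\d+)MiB\s+\|$"
)


def parse_processes(table):
    """Return {gpu_index: total MiB used by the listed processes}."""
    usage = {}
    for line in table.splitlines():
        match = ROW.match(line.strip())
        if match:
            gpu, mib = int(match.group(1)), int(match.group(4))
            usage[gpu] = usage.get(gpu, 0) + mib
    return usage


# The two rows of interest, copied from the output above.
SAMPLE = """\
|    4   N/A  N/A    396432      C   /home/venv/bin/python            7507MiB |
|    7   N/A  N/A    514629      C   /home/venv/bin/python             945MiB |
"""

# Checkpoint sizes from `ll -h`, converted to MiB (assumed 1024-based).
CHECKPOINT_MIB = {7: 250.0, 4: 6.7 * 1024}

usage = parse_processes(SAMPLE)
overhead = {gpu: usage[gpu] - size for gpu, size in CHECKPOINT_MIB.items()}

for gpu in sorted(overhead):
    print(f"GPU {gpu}: {usage[gpu]} MiB used, "
          f"{overhead[gpu]:.0f} MiB beyond the checkpoint size")
```

Both GPUs come out with a similar amount beyond the checkpoint size (roughly 650 to 700 MiB), which would be consistent with a fixed per-process cost rather than something that scales with the model. That interpretation is a guess from these two data points, not something the logs confirm.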

Installation instructions

The container was started with the following command:

docker run -it -d --name torchserve --gpus '"device=0,1,2,3,4,5,6,7"' -p 18080:8080 -p 18081:8081 -p 18082:8082 -p 17070:7070 -p 17071:7071 -v ./config.properties:/home/model-server/config.properties -v ./torchserve/model-store:/home/model-server/model-store pytorch/torchserve:latest-gpu

Model Packaging

Model 1: https://huggingface.co/IDEA-CCNL/Wenzhong-GPT2-110M
Model 2: https://huggingface.co/IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese

Both models were fine-tuned before packaging.

Packaging Model 1:

torch-model-archiver --model-name insurance_chat    --force --version 1.0 --serialized-file /home/AppealGenerate/saved_model/conversation/pytorch_model.bin  --handler /data/nlg_pipeline/gpt2/dialog/handler.py    --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/home/AppealGenerate/saved_model/conversation/config.json"

Packaging Model 2:

torch-model-archiver --model-name insurance_chat_3.5B    --force --version 1.0 --serialized-file /data/nlg_pipeline/gpt2/dialog/models/finetune/gpt2/pytorch_model.bin  --handler /data/nlg_pipeline/gpt2/dialog/handler.py    --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/data/nlg_pipeline/gpt2/dialog/models/finetune/gpt2/config.json"

The two commands are nearly identical. They differ only in the model name and the paths to the model files.

config.properties

No response

Versions

$ docker exec torchserve  pip list
Package                Version
---------------------- ------------
accelerate             0.15.0
captum                 0.5.0
diffusers              0.9.0
huggingface-hub        0.11.1
jieba                  0.42.1
matplotlib             3.5.2
matplotlib-inline      0.1.6
numpy                  1.22.4
oss2                   2.16.0
packaging              21.3
pandas                 1.4.2
Pillow                 9.0.1
pip                    22.3.1
pycryptodome           3.16.0
requests               2.28.0
rouge                  1.0.1
scikit-learn           1.1.3
scipy                  1.9.3
sentence-transformers  2.2.2
sentencepiece          0.1.97
tokenizers             0.13.2
torch                  1.11.0+cu102
torch-model-archiver   0.6.0
torchserve             0.6.0
torchtext              0.12.0
torchvision            0.12.0+cu102
transformers           4.25.1

Repro instructions

Both models are deployed with the same steps. The commands for Model 1 are:

git pull
curl -X DELETE http://localhost:18081/models/insurance_chat
rm -rf /data/liuzhaofeng/torchserve/model-store/insurance_chat.mar

torch-model-archiver --model-name insurance_chat    --force --version 1.0 --serialized-file /home/AppealGenerate/saved_model/conversation/pytorch_model.bin  --handler /data/nlg_pipeline/gpt2/dialog/handler.py    --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/home/AppealGenerate/saved_model/conversation/config.json"

curl -X POST "http://localhost:18081/models?url=insurance_chat.mar"
curl -X PUT "http://localhost:18081/models/insurance_chat?min_worker=1"

Possible Solution

Shouldn't Model 1 use only a little over 200 MB of GPU memory?
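To narrow this down, it may help to compare what PyTorch's caching allocator reports with the per-process figure from nvidia-smi. Memory that nvidia-smi counts but the allocator does not (for example the CUDA context) never shows up in `torch.cuda.memory_allocated` or `torch.cuda.memory_reserved`. The sketch below is only an illustration: `gpu_memory_report` and `non_allocator_mib` are hypothetical helper names, and the handler in this issue is not shown, so where to call them is an assumption.

```python
MIB = 1024 ** 2


def to_mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / MIB


def gpu_memory_report(device=0):
    """Return PyTorch allocator statistics for one GPU, in MiB.

    torch is imported lazily so the pure helpers in this file can be
    used on a machine without PyTorch installed.
    """
    import torch

    return {
        # Bytes currently occupied by live tensors (weights, buffers, ...).
        "allocated_mib": to_mib(torch.cuda.memory_allocated(device)),
        # Bytes held by the caching allocator, including cached free blocks.
        "reserved_mib": to_mib(torch.cuda.memory_reserved(device)),
    }


def non_allocator_mib(nvidia_smi_mib, reserved_bytes):
    """Memory nvidia-smi attributes to the process but that the PyTorch
    caching allocator does not hold (CUDA context and similar)."""
    return nvidia_smi_mib - to_mib(reserved_bytes)
```

Calling `gpu_memory_report` right after the handler moves the model to the GPU, and then passing the nvidia-smi figure for that worker to `non_allocator_mib`, would show how much of the roughly 1 GB is actual tensors and how much is runtime overhead.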

Contributor guide

Research direction: Examine memory allocation and overhead when TorchServe loads a small model. Compare total GPU memory usage to model parameter size. Check for any additional memory reserved by TorchServe or PyTorch for intermediate buffers, CUDA context, or model replication across workers.
Tech stack: python, pytorch
Domain: backend, api
Issue type: Bug
Difficulty: 2
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: PyTorch, TorchServe
Newbie friendliness: 50