The memory occupied by the model becomes larger after it is loaded into the GPU
#2022 opened on Dec 5, 2022
Description
🐛 Describe the bug
We have two GPT2 models. Model 1 has only 110 million parameters, stored as 16-bit floating-point numbers. Model 2 has 3.5 billion parameters, also stored as 16-bit floating-point numbers.
However, after both are loaded with the same handler into the pytorch/torchserve:latest-gpu container, Model 2 occupies about 7 GB of GPU memory, which is consistent with our calculation, while Model 1 occupies about 1 GB, far more than we expected.
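The calculation referred to above can be sketched as follows. This is a rough estimate only: it assumes 2 bytes per parameter (16-bit floats) and the nominal parameter counts of 110 million and 3.5 billion, and it counts the weights alone, not anything else the process may allocate on the GPU. The function name is made up for illustration.

```python
MIB = 1024 ** 2  # bytes per MiB, the unit nvidia-smi reports


def expected_weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Raw size of the weights: one stored value per parameter."""
    return num_params * bytes_per_param


# Nominal parameter counts taken from the model names (assumption).
for name, num_params in [("Model 1", 110_000_000), ("Model 2", 3_500_000_000)]:
    size = expected_weight_bytes(num_params)
    print(f"{name}: about {size / MIB:.0f} MiB of 16-bit weights")
```

Under these assumptions the weights come to roughly 210 MiB for Model 1 and roughly 6676 MiB for Model 2.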
Error logs
Model 1 size:
$ ll -h conversation/
total 250M
-rw-rw-r-- 1 alex alex 848 Dec 5 12:45 config.json
-rw-rw-r-- 1 alex alex 250M Dec 5 12:45 pytorch_model.bin
Model 2 size:
$ ll -h gpt2/
total 6.7G
-rw-rw-r-- 1 alex alex 858 Nov 14 18:33 config.json
-rw-rw-r-- 1 alex alex 6.7G Nov 14 18:34 pytorch_model.bin
watch -n 1 nvidia-smi
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 155043 C /home/venv/bin/python 1227MiB |
| 4 N/A N/A 396432 C /home/venv/bin/python 7507MiB |
| 6 N/A N/A 155043 C /home/venv/bin/python 1211MiB |
| 7 N/A N/A 514629 C /home/venv/bin/python 945MiB |
+-----------------------------------------------------------------------------+
Model 1 is on GPU 7 and Model 2 is on GPU 4. Please ignore the other two processes on GPU 0 and GPU 6.
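To make the mismatch concrete, the gap between the nvidia-smi figures above and the raw size of the 16-bit weights can be worked out as below. This is only a sketch: the parameter counts are the nominal 110 million and 3.5 billion, the helper name is hypothetical, and the result says nothing about where the extra memory comes from.

```python
MIB = 1024 ** 2  # bytes per MiB, the unit nvidia-smi reports


def gap_mib(observed_mib: float, num_params: int, bytes_per_param: int = 2) -> float:
    """Observed GPU memory minus the raw weight size, both in MiB."""
    return observed_mib - num_params * bytes_per_param / MIB


# (GPU memory from the nvidia-smi output in MiB, nominal parameter count)
models = {
    "Model 1 (GPU 7)": (945, 110_000_000),
    "Model 2 (GPU 4)": (7507, 3_500_000_000),
}

for name, (observed, num_params) in models.items():
    print(f"{name}: {gap_mib(observed, num_params):.0f} MiB beyond the weights")
```

With these numbers the gap is about 735 MiB for Model 1 and about 831 MiB for Model 2, so the extra memory is of similar absolute size in both cases; it is simply far more visible relative to the small model.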
Installation instructions
Using the following command:
docker run -it -d --name torchserve --gpus '"device=0,1,2,3,4,5,6,7"' -p 18080:8080 -p 18081:8081 -p 18082:8082 -p 17070:7070 -p 17071:7071 -v ./config.properties:/home/model-server/config.properties -v ./torchserve/model-store:/home/model-server/model-store pytorch/torchserve:latest-gpu
Model Packaging
Model 1: https://huggingface.co/IDEA-CCNL/Wenzhong-GPT2-110M
Model 2: https://huggingface.co/IDEA-CCNL/Wenzhong2.0-GPT2-3.5B-chinese
Both models were fine-tuned before packaging.
Packaging Model 1:
torch-model-archiver --model-name insurance_chat --force --version 1.0 --serialized-file /home/AppealGenerate/saved_model/conversation/pytorch_model.bin --handler /data/nlg_pipeline/gpt2/dialog/handler.py --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/home/AppealGenerate/saved_model/conversation/config.json"
Packaging Model 2:
torch-model-archiver --model-name insurance_chat_3.5B --force --version 1.0 --serialized-file /data/nlg_pipeline/gpt2/dialog/models/finetune/gpt2/pytorch_model.bin --handler /data/nlg_pipeline/gpt2/dialog/handler.py --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/data/nlg_pipeline/gpt2/dialog/models/finetune/gpt2/config.json"
The two commands are nearly identical. They differ only in the model name and the paths to the model files.
config.properties
No response
Versions
$ docker exec torchserve pip list
Package Version
---------------------- ------------
accelerate 0.15.0
captum 0.5.0
diffusers 0.9.0
huggingface-hub 0.11.1
jieba 0.42.1
matplotlib 3.5.2
matplotlib-inline 0.1.6
numpy 1.22.4
oss2 2.16.0
packaging 21.3
pandas 1.4.2
Pillow 9.0.1
pip 22.3.1
pycryptodome 3.16.0
requests 2.28.0
rouge 1.0.1
scikit-learn 1.1.3
scipy 1.9.3
sentence-transformers 2.2.2
sentencepiece 0.1.97
tokenizers 0.13.2
torch 1.11.0+cu102
torch-model-archiver 0.6.0
torchserve 0.6.0
torchtext 0.12.0
torchvision 0.12.0+cu102
transformers 4.25.1
Repro instructions
The same steps are used for both models. The steps for Model 1 are:
git pull
curl -X DELETE http://localhost:18081/models/insurance_chat
rm -rf /data/liuzhaofeng/torchserve/model-store/insurance_chat.mar
torch-model-archiver --model-name insurance_chat --force --version 1.0 --serialized-file /home/AppealGenerate/saved_model/conversation/pytorch_model.bin --handler /data/nlg_pipeline/gpt2/dialog/handler.py --export-path /data/liuzhaofeng/torchserve/model-store/ --extra-files "/home/AppealGenerate/saved_model/conversation/config.json"
curl -X POST "http://localhost:18081/models?url=insurance_chat.mar"
curl -X PUT "http://localhost:18081/models/insurance_chat?min_worker=1"
Possible Solution
Shouldn't Model 1 use only a little over 200 MB of GPU memory?