hiyouga/LlamaFactory

Summary of and questions about training models on NPU

Open

#4388 opened on Jun 20, 2024

good first issue, npu, pending

Description

Reminder

  • I have read the README and searched the existing issues.

System Info

QWEN2-1.5B(0.5B)

Works.

QWEN2-7B(MoE)

Requires bf16 (see #4278). Works.

QWEN2-72B

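Works, with one minor issue: it only launches on 8 cards (stage3). On 16 cards it runs out of memory (OOM); the cause still needs to be investigated.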
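For readers who want to reproduce the stage3 setup, here is a minimal sketch of a configuration, assuming "stage3" refers to DeepSpeed ZeRO stage 3 used through the Hugging Face Trainer integration. The function name `build_zero3_config` and the output file name are hypothetical, and the issue does not state which configuration was actually used.

```python
import json


def build_zero3_config() -> dict:
    """Return a minimal DeepSpeed ZeRO stage 3 config (illustrative assumption).

    The "auto" values are placeholders that the Hugging Face Trainer
    integration fills in from its own training arguments.
    """
    return {
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            # Stage 3 shards parameters, gradients and optimizer state
            # across all participating devices.
            "stage": 3,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }


def write_config(path: str) -> None:
    """Serialize the config to a JSON file that a launcher can point at."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_zero3_config(), f, indent=2)
```

Calling `write_config("ds_z3_config.json")` produces a file that can be referenced from the training arguments. This does not explain the 16-card OOM; it only pins down what the 8-card run is assumed to look like.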

glm4

Comment out the torch.jit lines and use bf16. See #4339 and #3788.

chatglm3

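Same approach as above. In addition, after merging the model, copy every file from the original model folder except *bin and pytorch_model.bin.index.json into the merged model folder. See #1307.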
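The copy step for chatglm3 can be sketched as follows. This is an illustration rather than part of LlamaFactory: the helper name `copy_aux_files` and the directory arguments are hypothetical, and the exclusion patterns are taken literally from the description above.

```python
import fnmatch
import shutil
from pathlib import Path

# Patterns named in the issue: weight shards and the weight index are skipped.
EXCLUDE_PATTERNS = ("*bin", "pytorch_model.bin.index.json")


def copy_aux_files(original_dir, merged_dir, exclude=EXCLUDE_PATTERNS):
    """Copy non-weight files (tokenizer, config, custom code) into merged_dir.

    Returns the sorted list of copied file names. Files with the same name
    that already exist in merged_dir are overwritten.
    """
    src, dst = Path(original_dir), Path(merged_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue  # only top-level files are considered in this sketch
        if any(fnmatch.fnmatch(path.name, pattern) for pattern in exclude):
            continue  # skip the original weights and their index
        shutil.copy2(path, dst / path.name)
        copied.append(path.name)
    return copied
```

Because same-named files are overwritten, extend `exclude` if the merged folder contains files (for example other weight formats) that must not be replaced.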

DeepSeek (MoE)

Fails. The model's operators need to be converted first. See: https://www.hiascend.com/document/detail/zh/Pytorch/60RC1/ptmoddevg/trainingmigrguide/performance_tuning_0027.html#ZH-CN_TOPIC_0000001889766765__section132951137183219

gemma

Works.

LLaMA-3

Works.

Baichuan-2

Works.

PHI3

Fails with the following error:

  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect
    contents = read_file_cached(tiktoken_bpe_file, expected_hash)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 64, in read_file_cached
    contents = read_file(blobpath)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/tiktoken/load.py", line 25, in read_file
    resp = requests.get(blobpath)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/api.py", line 73, in get
    self.sock = sock = self._new_conn()
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 203, in _new_conn
    return request("get", url, params=params, **kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/api.py", line 59, in request
    conn.connect()
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect
    self._validate_conn(conn)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    return session.request(method=method, url=url, **kwargs)
  File "/home/hadoop-friday-llm/.local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    return tokenizer_class.from_pretrained(
  File "/home/hadoop-friday-llm/.cache/huggingface/modules/transformers_modules/Phi-3-small-8k-instruct/tokenization_phi3_small.py", line 190, in from_pretrained
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f4053c11070>: Failed to resolve 'openaipublic.blob.core.windows.net' ([Errno -2] Name or service not known)
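The traceback passes through tiktoken's `read_file_cached`, which suggests the tokenizer is fetching a BPE vocabulary file over the network even though the model weights are local. One possible workaround on a machine without internet access is to pre-seed tiktoken's on-disk cache. The sketch below rests on two assumptions: that tiktoken looks for cached files in the directory named by the `TIKTOKEN_CACHE_DIR` environment variable, and that it names each cached file after the SHA-1 hex digest of the requested URL. The exact URL requested by `tokenization_phi3_small.py` is not shown in the traceback, so `ASSUMED_BLOB_URL` is only a guess, and `tiktoken_cache_path` is a hypothetical helper.

```python
import hashlib
import os

# Assumption: the tokenizer requests the cl100k_base vocabulary; verify this
# against tokenization_phi3_small.py before relying on it.
ASSUMED_BLOB_URL = (
    "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
)


def tiktoken_cache_path(blob_url: str, cache_dir: str) -> str:
    """Return the path where tiktoken is assumed to look for a cached blob.

    The file name is the SHA-1 hex digest of the URL string.
    """
    cache_key = hashlib.sha1(blob_url.encode()).hexdigest()
    return os.path.join(cache_dir, cache_key)


def describe_offline_setup(cache_dir: str, blob_url: str = ASSUMED_BLOB_URL) -> str:
    """Build a short instruction string for seeding the cache by hand."""
    target = tiktoken_cache_path(blob_url, cache_dir)
    return (
        f"1. On a machine with network access, download {blob_url}\n"
        f"2. Copy the downloaded file to {target}\n"
        f"3. Set TIKTOKEN_CACHE_DIR={cache_dir} before starting training"
    )
```

Printing `describe_offline_setup("/path/to/tiktoken_cache")` lists the manual steps. If the assumptions hold, the tokenizer should then load the vocabulary from disk instead of resolving `openaipublic.blob.core.windows.net`.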

Mistral-7B-v0.1

Works.

Mixtral-8x7B-v0.1

On 8 cards (64G), stage3 is required.
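A back-of-the-envelope calculation shows why a sharded setup is needed here. It is a rough sketch under stated assumptions: Mixtral-8x7B has about 46.7 billion parameters in total, weights are stored in bf16 (2 bytes each), "64G" means 64 GB of device memory per card, and stage3 shards the weights evenly across cards. Only weights are counted; gradients, optimizer state and activations add to the totals.

```python
ASSUMED_NUM_PARAMS = 46.7e9   # assumption: total Mixtral-8x7B parameter count
BYTES_PER_PARAM_BF16 = 2      # bf16 stores each weight in 2 bytes
CARD_MEMORY_GB = 64           # assumption: "64G" is per-card memory in GB
NUM_CARDS = 8


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the raw weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def per_card_weight_gb(num_params: float, bytes_per_param: int, num_cards: int) -> float:
    """Per-card weight memory when weights are sharded evenly across cards."""
    return weight_memory_gb(num_params, bytes_per_param) / num_cards


full_copy_gb = weight_memory_gb(ASSUMED_NUM_PARAMS, BYTES_PER_PARAM_BF16)
sharded_gb = per_card_weight_gb(ASSUMED_NUM_PARAMS, BYTES_PER_PARAM_BF16, NUM_CARDS)

# A full replica per card does not fit, while an even 8-way shard does.
fits_unsharded = full_copy_gb <= CARD_MEMORY_GB
fits_sharded = sharded_gb <= CARD_MEMORY_GB
```

Under these assumptions a full bf16 copy of the weights is roughly 93 GB, which exceeds a single 64 GB card, while an 8-way shard is under 12 GB per card. That is consistent with the observation that stage3 is required.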

CodeLlama-7b-hf(13B)

Works.

Yi1.5

Works.

Reproduction

llamafactory

Expected behavior

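I picked a number of representative models and re-ran the experiments on NPU, hoping that all of them would succeed. I would appreciate help with the Phi-3 failure: the model is confirmed to be stored locally and is loaded via an absolute path.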

Others

No response
