Using the officially provided sample code:
from transformers import AutoTokenizer, AutoModel

# Raw strings so the Windows backslashes are not treated as escape sequences
tokenizer = AutoTokenizer.from_pretrained(r"D:\data\llm\GLM", trust_remote_code=True, revision="")
model = AutoModel.from_pretrained(r"D:\data\llm\GLM", trust_remote_code=True, revision="").half().cuda()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
After stepping through countless pitfalls, I finally got question answering working, but the model stops responding after roughly 20 questions. It hangs on this line:

response, history = model.chat(tokenizer, "你好", history=[])

The call never returns and never raises an error; it just sits there. The same thing happens if I leave the session idle for a while and then ask another question. It feels as if GLM had a ten-minute usage limit.
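To narrow down where it stalls, one option is to switch to the streaming interface and watch whether tokens are still being produced when the hang starts. Below is a minimal diagnostic sketch; it assumes the checkpoint's modeling_chatglm.py exposes stream_chat (ChatGLM/ChatGLM2 checkpoints do), and the query string is only an example:

import time

history = []
query = "你好"

# stream_chat yields partial responses as tokens are generated; if the call
# stalls mid-generation, the last timestamped line shows how far it got.
start = time.time()
for response, history in model.stream_chat(tokenizer, query, history=history):
    print(f"[{time.time() - start:6.1f}s] {response!r}")
print("generation finished")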
My environment:
Windows 10
NVIDIA A40 (47 GB VRAM)
Python 3.10.11
CUDA 11.3
torch 1.12
transformers 4.28.1
After forcing a manual interrupt, the output is as follows:
File "D:\Program Files\Python310\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "C:\Users\CCSTC/.cache\huggingface\modules\transformers_modules\GLM2\modeling_chatglm.py", line 1028, in chat
outputs = self.generate(**inputs, **gen_kwargs)
File "D:\Program Files\Python310\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "D:\Program Files\Python310\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "D:\Program Files\Python310\lib\site-packages\transformers\generation\utils.py", line 2560, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
KeyboardInterrupt
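The interrupt lands in torch.multinomial inside the sampling loop, so generation appears to be still running (or stuck) rather than crashed. One cheap check is whether greedy decoding hangs too, since it skips multinomial sampling entirely. This is just a diagnostic sketch; do_sample is a standard generation flag that ChatGLM's chat() forwards to generate():

# Greedy decoding bypasses torch.multinomial; if the hang persists with
# do_sample=False, the problem is not the sampling step (e.g. NaN/Inf
# probabilities from fp16 overflow) but something else in the generation loop.
response, history = model.chat(tokenizer, "你好", history=[], do_sample=False)
print(response)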
Any idea where the problem might be?