RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument mat2 in method wrapper_CUDA_bmm) #322

Open
Docnoah opened this issue Dec 21, 2024 · 4 comments


Docnoah commented Dec 21, 2024

I ran api.py from the 01-LLaMA3-8B-Instruct FastApi deployment tutorial. The server starts normally and the code raises no errors, but when I send a request with curl -X POST "http://127.0.0.1:6006/" -H "Content-Type: application/json" -d '{"prompt": "你好!", "history": []}', I get RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:7! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
What I have tried so far:
1. Checked that both the model and input_ids are placed on the device: no effect.
2. Used the accelerate library, accelerator = Accelerator() ; model, tokenizer = accelerator.prepare(model, tokenizer): no effect.
3. Upgraded the transformers library, following other issues: no effect.
Any suggestions for a fix would be appreciated.

Author

Docnoah commented Dec 21, 2024

PixPin_2024-12-21_17-07-07
PixPin_2024-12-21_17-06-56
The problem is shown in the screenshots above.

@KMnO4-zx
Contributor

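Starting the FastAPI service and sending the request have to be done in two separate terminals. You can also send the request with the requests library.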
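For readers who prefer Python to curl, here is a minimal client sketch, meant to be run from a second terminal while api.py is serving. It uses `urllib.request` from the standard library rather than the requests package, so nothing extra needs installing. The helper names `build_request` and `ask` and the 120-second timeout are assumptions for illustration, not part of the tutorial; the URL and JSON body mirror the curl command in the first comment.

```python
import json
import urllib.request


def build_request(prompt, history=None, url="http://127.0.0.1:6006/"):
    """Build the same POST request as the curl command (hypothetical helper)."""
    payload = {"prompt": prompt, "history": history or []}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, history=None, url="http://127.0.0.1:6006/", timeout=120):
    """Send the prompt to the running service and return the decoded JSON reply."""
    request = build_request(prompt, history, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# With the server already running in another terminal:
# print(ask("你好!"))
```

The call is left commented out because it only succeeds once the service is listening on port 6006.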

Author

Docnoah commented Dec 21, 2024

PixPin_2024-12-21_17-15-57
PixPin_2024-12-21_17-15-24
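I am running them in two terminals. Starting the service works fine, but the request from the other terminal fails; what the server terminal shows in response is in the screenshots I posted 8 minutes ago. Below is the code (the tutorial source, unmodified). I don't know why the tensors are not on the same device. If you can share some ideas or experience, I will follow them to investigate and fix this.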
from fastapi import FastAPI, Request
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import uvicorn
import json
import datetime
import torch

# Device settings
DEVICE = "cuda"  # use CUDA
DEVICE_ID = "0"  # CUDA device ID; leave empty if not set
CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE  # full CUDA device string

# Free cached GPU memory
def torch_gc():
    if torch.cuda.is_available():  # check whether CUDA is available
        with torch.cuda.device(CUDA_DEVICE):  # select the CUDA device
            torch.cuda.empty_cache()  # clear the CUDA cache
            torch.cuda.ipc_collect()  # release unused CUDA IPC memory

# Build the chat template
def bulid_input(prompt, history=[]):
    system_format='<|start_header_id|>system<|end_header_id|>\n\n{content}<|eot_id|>'
    user_format='<|start_header_id|>user<|end_header_id|>\n\n{content}<|eot_id|>'
    assistant_format='<|start_header_id|>assistant<|end_header_id|>\n\n{content}<|eot_id|>\n'
    history.append({'role':'user','content':prompt})
    prompt_str = ''
    # Concatenate the conversation history
    for item in history:
        if item['role']=='user':
            prompt_str+=user_format.format(content=item['content'])
        else:
            prompt_str+=assistant_format.format(content=item['content'])
    return prompt_str

# Create the FastAPI app
app = FastAPI()

# Endpoint that handles POST requests
@app.post("/")
async def create_item(request: Request):
    global model, tokenizer  # use the globally loaded model and tokenizer
    json_post_raw = await request.json()  # read the JSON body of the POST request
    json_post = json.dumps(json_post_raw)  # serialize the JSON data to a string
    json_post_list = json.loads(json_post)  # parse the string back into a Python object
    prompt = json_post_list.get('prompt')  # prompt from the request
    history = json_post_list.get('history', [])  # conversation history from the request

    messages = [
            # {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
    ]

    # Run the model to generate a reply
    input_str = bulid_input(prompt=prompt, history=history)
    input_ids = tokenizer.encode(input_str, add_special_tokens=False, return_tensors='pt').cuda()

    generated_ids = model.generate(
        input_ids=input_ids, max_new_tokens=512, do_sample=True,
        top_p=0.9, temperature=0.5, repetition_penalty=1.1, eos_token_id=tokenizer.encode('<|eot_id|>')[0]
    )
    outputs = generated_ids.tolist()[0][len(input_ids[0]):]
    response = tokenizer.decode(outputs)
    response = response.strip().replace('<|eot_id|>', "").replace('<|start_header_id|>assistant<|end_header_id|>\n\n', '').strip()  # strip the chat template markers

    now = datetime.datetime.now()  # current time
    time = now.strftime("%Y-%m-%d %H:%M:%S")  # format the time as a string
    # Build the response JSON
    answer = {
        "response": response,
        "status": 200,
        "time": time
    }
    # Build the log message
    log = "[" + time + "] " + '", prompt:"' + prompt + '", response:"' + repr(response) + '"'
    print(log)  # print the log
    torch_gc()  # free GPU memory
    return answer  # return the response

# Main entry point
if __name__ == '__main__':
    # Load the pretrained tokenizer and model
    model_name_or_path = '/home/zhangwenxuan/project_todo/model_LLM/LLM-Research/Meta-Llama-3-8B-Instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", torch_dtype=torch.bfloat16).cuda()

    # Start the FastAPI app
    # Port 6006 can be mapped from autodl to the local machine, so the API can be used locally
    uvicorn.run(app, host='0.0.0.0', port=6006, workers=1)  # serve on the given host and port

@KMnO4-zx
Contributor

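You probably have multiple GPUs. Our tutorial was tested on a single GPU, so every .cuda() call in the code can be pinned to one card with .to('cuda:0') (provided the model loads successfully on a single card).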
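A minimal sketch of that suggestion, assuming the model fits on one GPU. The helper names `load_on_single_gpu` and `devices_in_use` are hypothetical and not part of the tutorial. The sketch drops `device_map="auto"`, which on a multi-GPU machine lets accelerate spread layers across every visible card and is a likely source of the cuda:0 / cuda:7 mismatch, and moves the whole model to one device instead. `devices_in_use` is a small diagnostic for a mapping shaped like `model.hf_device_map`, the attribute transformers sets when a model is loaded with a `device_map`; the example map values are invented for illustration.

```python
def devices_in_use(device_map):
    """Distinct devices named in a mapping shaped like `model.hf_device_map`."""
    return {str(device) for device in device_map.values()}


def load_on_single_gpu(model_path, device="cuda:0"):
    """Load tokenizer and model onto one device (hypothetical helper)."""
    # Imported lazily so the diagnostic above works without a GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
    # No device_map="auto" here: the whole model is moved to a single card.
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16
    ).to(device)
    return tokenizer, model


# Invented example of a sharded map: more than one device means the model is split.
example_map = {"model.embed_tokens": 0, "model.layers.31": 7, "lm_head": 7}
print(sorted(devices_in_use(example_map)))  # ['0', '7']
```

In the endpoint, the inputs would then follow the model: `tokenizer.encode(..., return_tensors='pt').to(model.device)` in place of the bare `.cuda()` call.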
