[Bug] OOM when setting return_logprob=True #2607

Open
CSammyfd opened this issue Dec 27, 2024 · 1 comment
Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.
  3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  5. Please use English, otherwise it will be closed.

Describe the bug

  • Background: for each text, I only need a single forward pass so I can read the log probabilities of part of the prompt.
  • Action: I set return_logprob=True and logprob_start_len=2000 (and max_new_tokens=1).
  • Result:
      1. When sending a single request, it works correctly and I get the log probabilities of the last 2000 prompt tokens.
      2. When sending multiple requests (50 requests within 1 second), the server runs out of memory (OOM).
      3. When sending the same 50 requests with return_logprob removed, the server works normally.

I wonder whether the problem is caused by return_logprob. I ran into a similar problem with vLLM and found related issues:
vllm-project/vllm#5067
vllm-project/vllm#1532
vllm-project/vllm#5907
https://github.com/vllm-project/vllm/pull/5355
So the cause may be the same: the CUDA memory used to compute prompt logprobs is not accounted for in the profiling run.
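
For a rough sense of scale (assuming Qwen2.5's ~152K-entry vocabulary, fp32 logits, roughly 2,000 logprob positions per request as in the description above, and that full-vocabulary logits are materialized for every prompt position whose logprob is requested; the actual intermediate shapes inside sglang are an assumption on my part), 50 concurrent requests could need tens of GiB of transient memory just for those logits:

# Back-of-envelope only; vocab size, dtype, and buffer layout are assumptions.
num_requests = 50          # concurrent requests in the reproduction
prompt_positions = 2000    # prompt positions with logprobs requested (from the description)
vocab_size = 151_936       # Qwen2.5-0.5B vocabulary size (config.json)
bytes_per_value = 4        # fp32 logits

transient_bytes = num_requests * prompt_positions * vocab_size * bytes_per_value
print(f"~{transient_bytes / 2**30:.1f} GiB of transient logits")  # roughly 57 GiB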

Reproduction

import json
import threading
import time
from queue import Queue

import requests
from transformers import AutoTokenizer

model_name = "./models/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    # Repeat the prompt to build a long input so logprob_start_len covers many tokens.
    {"role": "user", "content": prompt * 300},
    {"role": "assistant", "content": prompt * 600},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)


def send_request(index, text, queue):
    # One forward pass (max_new_tokens=1) that only asks for prompt logprobs.
    response = requests.post(
        "http://10.48.2.2:30000/generate",
        json={
            "text": text,
            "sampling_params": {
                "temperature": 0,
                "max_new_tokens": 1,
            },
            "return_logprob": True,
            "logprob_start_len": 3000,
        },
    )
    queue.put((index, response))


# Fire 50 requests concurrently.
start_time = time.time()
results_queue = Queue()
threads = []
for i in range(50):
    thread = threading.Thread(target=send_request, args=(i, text, results_queue))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
end_time = time.time()

print(end_time - start_time)

completion_list = [None for _ in range(50)]
for _ in range(50):
    index, result = results_queue.get()
    completion_list[index] = result

res_json = json.loads(completion_list[0].text)

Environment

2024-12-27 10:51:32,563 - modelscope - INFO - PyTorch version 2.4.0 Found.
2024-12-27 10:51:32,564 - modelscope - INFO - Loading ast index from /home/yanhui_he/.cache/modelscope/ast_indexer
2024-12-27 10:51:32,717 - modelscope - INFO - Loading done! Current index file version is 1.13.3, with md5 cac1c2695a261ce83ddea2be8560cb8b and a total number of 972 components indexed
WARNING 12-27 10:51:34 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7,8,9: NVIDIA A100 80GB PCIe
GPU 0,1,2,3,4,5,6,7,8,9 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda-12.1
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.54.15
PyTorch: 2.4.0+cu121
sglang: 0.4.1
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.44.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.9.5
fastapi: 0.115.0
hf_transfer: Module Not Found
huggingface_hub: 0.24.5
interegular: 0.3.3
modelscope: 1.13.3
orjson: 3.9.15
packaging: 23.2
psutil: 5.9.6
pydantic: 2.9.2
multipart: 0.0.9
zmq: 26.0.3
uvicorn: 0.29.0
uvloop: 0.19.0
vllm: 0.6.1.post2
xgrammar: Module Not Found
openai: 1.47.1
anthropic: 0.25.8
litellm: Module Not Found
decord: Module Not Found
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 GPU8 GPU9 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 PXB PXB PXB SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU1 NV12 X PIX PXB PXB SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU2 PXB PIX X PXB PXB SYS SYS NV12 SYS SYS 0-31,64-95 0 N/A
GPU3 PXB PXB PXB X NV12 SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU4 PXB PXB PXB NV12 X SYS SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU5 SYS SYS SYS SYS SYS X NV12 PXB PXB PXB 32-63,96-127 1 N/A
GPU6 SYS SYS SYS SYS SYS NV12 X PIX PXB PXB 32-63,96-127 1 N/A
GPU7 SYS SYS NV12 SYS SYS PXB PIX X PXB PXB 32-63,96-127 1 N/A
GPU8 SYS SYS SYS SYS SYS PXB PXB PXB X NV12 32-63,96-127 1 N/A
GPU9 SYS SYS SYS SYS SYS PXB PXB PXB NV12 X 32-63,96-127 1 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

ulimit soft: 655350

@merrymercy (Contributor)

reduce --mem-fraction-static
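
--mem-fraction-static controls the fraction of GPU memory that sglang reserves up front for model weights and the KV-cache pool; lowering it leaves more headroom for transient allocations such as prompt logprobs. A minimal launch sketch (the model path is the one from the reproduction, and 0.7 is only an illustrative value, not one suggested in this thread):

python -m sglang.launch_server --model-path ./models/Qwen2.5-0.5B-Instruct --mem-fraction-static 0.7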
