[Bug]: DeepSeek-R1-AWQ gets stuck with all tokens rejected when MTP is enabled. #13704

sgsdxzy opened this issue Feb 22, 2025 · 0 comments

Your current environment

The output of `python collect_env.py`

[Screenshots of `python collect_env.py` output]

🐛 Describe the bug

Run command:

```
vllm serve --enable-chunked-prefill --enable-prefix-caching -tp 8 cognitivecomputations/DeepSeek-R1-AWQ --dtype float16 --trust-remote-code --max-model-len 131072 --max-seq-len-to-capture 131072 --num-speculative-tokens 1
```
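For reference, a minimal streaming client to trigger the hang. This is illustrative only (not from the original report); it assumes the server above is reachable at vLLM's default OpenAI-compatible endpoint on `localhost:8000`.

```python
# Minimal streaming client to reproduce (illustrative; any OpenAI-compatible
# client works). Assumes the server started above listens on localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "cognitivecomputations/DeepSeek-R1-AWQ",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if line:
        # With --num-speculative-tokens 1, only the first chunk ever arrives;
        # the stream then stalls with no further tokens.
        print(line.decode())
```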

Symptom:
Upon receiving a request, only the first word (for example "Okay") is generated; generation then stalls and no new tokens are streamed.

[Screenshot of the console log]

As can be seen from the console log, the number of accepted tokens stays at 0 while the number of draft tokens keeps increasing.
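For context, a simplified sketch of the standard rejection-sampling acceptance rule behind speculative decoding (illustrative only, not vLLM's actual implementation): a draft token `t` is accepted with probability `min(1, p_target(t) / p_draft(t))`, so a target model that assigns near-zero probability to every MTP draft token would keep the accepted count at 0, consistent with the counters above.

```python
# Sketch of the rejection-sampling acceptance rule used in speculative
# decoding (illustrative only; not vLLM's code). A draft token t is accepted
# with probability min(1, p_target(t) / p_draft(t)); the first rejection
# discards that token and all later draft tokens in the window.
import random

def accept_draft_tokens(draft_tokens, p_draft, p_target, rng=random):
    accepted = []
    for tok, q, p in zip(draft_tokens, p_draft, p_target):
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break
    return accepted

# If the target model assigns ~0 probability to the drafted token (e.g. a
# mismatched MTP head), the acceptance probability is ~0 and the accepted
# count stays at 0, as in the log above.
print(accept_draft_tokens([42], p_draft=[0.9], p_target=[1e-9]))  # almost surely []
```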

After removing `--num-speculative-tokens 1`, vLLM works fine.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.