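# QwQ-32B: How to Run Effectively

Qwen released QwQ-32B, a reasoning model whose performance is comparable to DeepSeek-R1 on many [benchmarks](https://qwenlm.github.io/blog/qwq-32b/). However, people have been running into **infinite generations**, **many repetitions**, `<think>` token issues, and fine-tuning issues. We hope this guide helps you debug and fix most of them!

{% hint style="info" %}
Our model uploads with bug fixes work well for fine-tuning, vLLM, and Transformers. If you are using llama.cpp or an engine that uses llama.cpp as its backend, follow our [instructions here](#tutorial-how-to-run-qwq-32b) to fix infinite generations.
{% endhint %}

**Unsloth QwQ-32B uploads with our bug fixes:**

| [GGUF](https://huggingface.co/unsloth/QwQ-32B-GGUF) | [Dynamic 4-bit](https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit) | [BnB 4-bit](https://huggingface.co/unsloth/QwQ-32B-bnb-4bit) | [16-bit](https://huggingface.co/unsloth/QwQ-32B) |
| --------------------------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------ |

## :gear: Official Recommended Settings

According to [Qwen](https://huggingface.co/Qwen/QwQ-32B), these are the recommended settings for inference:

* Temperature of 0.6
* Top\_K of 40 (or 20 to 40)
* Min\_P of 0.00 (optional, but 0.01 also works well; the llama.cpp default is 0.1)
* Top\_P of 0.95
* Repetition Penalty of 1.0 (1.0 means disabled in llama.cpp and transformers)
* Chat template: `<|im_start|>user\nCreate a Flappy Bird game in Python.<|im_end|>\n<|im_start|>assistant\n<think>\n`

{% hint style="warning" %}
`llama.cpp` uses `min_p = 0.1` by default, which might cause issues. Force it to 0.0.
{% endhint %}

## :thumbsup: Recommended settings for llama.cpp

We noticed that many people use a `Repetition Penalty` greater than 1.0, for example 1.1 to 1.5. This actually interferes with llama.cpp's sampling mechanisms. The goal of a repetition penalty is to penalize repeated generations, but we found that it does not work as expected.

Turning off `Repetition Penalty` (setting it to 1.0) also works, but we found that keeping it is useful for penalizing endless generations. To use it, you must also change the order of the samplers in llama.cpp so that the other samplers run before `Repetition Penalty` is applied; otherwise there will be endless generations. So add this:

```bash
--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
```

By default, llama.cpp uses this ordering:

```bash
--samplers "dry;top_k;typ_p;top_p;min_p;xtc;temperature"
```

We essentially swap temperature and dry, and move min\_p forward. This means we apply the samplers in this order:

```bash
top_k=40
top_p=0.95
min_p=0.0
temperature=0.6
dry
typ_p
xtc
```

If you still encounter issues, you can increase `--repeat-penalty` from 1.0 to 1.2 or 1.3.

Thanks to [@krist486](https://x.com/krist486/status/1897885598196654180) for bringing llama.cpp's sampler ordering to our attention.

## :sunny: Dry Repetition Penalty

We investigated using `dry penalty` with a value of 0.8, as has been suggested, but we found that it **tends to cause syntax issues, especially for coding**. If you still encounter issues, you can increase the `dry penalty` to 0.8.

If you decide to use `dry penalty`, our swapped sampling order can also help.

## :llama: Tutorial: How to Run QwQ-32B in Ollama

1. Install `ollama` if you haven't already!

```bash
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
```

2. Run the model! If it fails, you can call `ollama serve` in another terminal. We include all of our fixes and suggested parameters (temperature, min\_p, and so on) in `params`!

```bash
ollama run hf.co/unsloth/QwQ-32B-GGUF:Q4_K_M
```

## 📖 Tutorial: How to Run QwQ-32B in llama.cpp

1. Get the latest `llama.cpp` from [GitHub here](https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or only want CPU inference. **For Apple Mac / Metal devices**, set `-DGGML_CUDA=OFF` and continue as usual; Metal support is on by default.

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
```

2. Download the model (after running `pip install huggingface_hub hf_transfer`). You can choose Q4\_K\_M or another quantized version (such as BF16 full precision). Other versions are listed in the `unsloth/QwQ-32B-GGUF` repository.

```python
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/QwQ-32B-GGUF",
    local_dir = "unsloth-QwQ-32B-GGUF",
    allow_patterns = ["*Q4_K_M*"], # For Q4_K_M
)
```

3. Run Unsloth's Flappy Bird test, which saves the output to `Q4_K_M_yes_samplers.txt`.
4. Edit `--threads 32` for the number of CPU threads, `--ctx-size 16384` for the context length, and `--n-gpu-layers 99` for how many layers to offload to the GPU. Try lowering it if your GPU runs out of memory, and remove it for CPU-only inference.
5. We use `--repeat-penalty 1.1` and `--dry-multiplier 0.5`, which you can adjust.

```bash
./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.6 \
    --repeat-penalty 1.1 \
    --dry-multiplier 0.5 \
    --min-p 0.01 \
    --top-k 40 \
    --top-p 0.95 \
    -no-cnv \
    --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n" \
    2>&1 | tee Q4_K_M_yes_samplers.txt
```

The full input from our 1.58bit blog is:

```
<|im_start|>user
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>
<|im_start|>assistant
<think>
```

The beginning and the end of the final Python output, with the thinking part removed:

```python
import pygame
import random
import sys

pygame.init()
### Continues

class Bird:
    def __init__(self):
        ### Continues

def main():
    best_score = 0
    current_score = 0
    game_over = False
    pipes = []
    first_time = True  # Track first game play

    # Initial setup
    background_color = (173, 216, 230)  # Light blue initially
    land_color = random.choice(land_colors)
    bird = Bird()

    while True:
        for event in pygame.event.get():
            ### Continues

        if not game_over:
            # Update bird and pipes
            bird.update()
            ### Continues

        # Drawing
        ### Continues
        pygame.display.flip()
        clock.tick(60)

if __name__ == "__main__":
    main()
```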
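The excerpts shown here have the thinking part stripped out. A minimal sketch of how that could be done on a saved transcript follows. It assumes the model closes its reasoning with a `</think>` tag and wraps the final answer in a fenced `python` block; the helper name `extract_final_code` is hypothetical and is not part of llama.cpp or Unsloth.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly so this snippet can sit inside one
THINK_END = "</think>"  # assumed closing tag of the thinking section

_CODE_BLOCK = re.compile(
    re.escape(FENCE) + r"python\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_final_code(raw_output: str) -> str:
    """Return the last fenced Python block that appears after the thinking part."""
    # rpartition keeps the whole text when the closing tag is absent.
    answer = raw_output.rpartition(THINK_END)[2]
    blocks = _CODE_BLOCK.findall(answer)
    if not blocks:
        raise ValueError("no fenced Python block found after the thinking part")
    return blocks[-1].strip()


# Tiny stand-in for a real transcript: a draft inside the thinking part, then the answer.
sample = (
    "<think>\nDraft:\n" + FENCE + "python\nprint('draft')\n" + FENCE + "\n</think>\n\n"
    "Final game:\n" + FENCE + "python\nprint('final')\n" + FENCE + "\n"
)
print(extract_final_code(sample))
```

For a real run, the text of `Q4_K_M_yes_samplers.txt` could be passed in place of `sample`. Taking the last block after the closing tag avoids picking up drafts the model wrote while reasoning.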

Full final Python output (with the thinking part removed):

{% code fullWidth="false" %}
```python
import pygame
import random
import sys

pygame.init()
WIDTH, HEIGHT = 800, 600
screen = pygame.display.set_mode((WIDTH, HEIGHT))
pygame.display.set_caption("Flappy Bird Clone")
clock = pygame.time.Clock()

# Colors
pipe_colors = [(0, 100, 0), (210, 180, 140), (50, 50, 50)]
land_colors = [(139, 69, 19), (255, 255, 0)]

# Game constants
GRAVITY = 0.5
PIPE_SPEED = 5
BIRD_SIZE = 30
LAND_HEIGHT = 50
PIPE_WIDTH = 50
PIPE_GAP = 150

class Bird:
    def __init__(self):
        self.x = WIDTH // 2
        self.y = HEIGHT // 2
        self.velocity = 0
        self.shape = random.choice(['square', 'circle', 'triangle'])
        self.color = (random.randint(0, 100), random.randint(0, 100), random.randint(0, 100))
        self.rect = pygame.Rect(self.x - BIRD_SIZE//2, self.y - BIRD_SIZE//2, BIRD_SIZE, BIRD_SIZE)

    def update(self):
        self.velocity += GRAVITY
        self.y += self.velocity
        self.rect.y = self.y - BIRD_SIZE//2
        self.rect.x = self.x - BIRD_SIZE//2  # Keep x centered

    def draw(self):
        if self.shape == 'square':
            pygame.draw.rect(screen, self.color, self.rect)
        elif self.shape == 'circle':
            pygame.draw.circle(screen, self.color, (self.rect.centerx, self.rect.centery), BIRD_SIZE//2)
        elif self.shape == 'triangle':
            points = [
                (self.rect.centerx, self.rect.top),
                (self.rect.left, self.rect.bottom),
                (self.rect.right, self.rect.bottom)
            ]
            pygame.draw.polygon(screen, self.color, points)

def spawn_pipe():
    pipe_x = WIDTH
    top_height = random.randint(50, HEIGHT - PIPE_GAP - LAND_HEIGHT)
    rect_top = pygame.Rect(pipe_x, 0, PIPE_WIDTH, top_height)
    bottom_y = top_height + PIPE_GAP
    bottom_height = (HEIGHT - LAND_HEIGHT) - bottom_y
    rect_bottom = pygame.Rect(pipe_x, bottom_y, PIPE_WIDTH, bottom_height)
    color = random.choice(pipe_colors)
    return {
        'rect_top': rect_top,
        'rect_bottom': rect_bottom,
        'color': color,
        'scored': False
    }

def main():
    best_score = 0
    current_score = 0
    game_over = False
    pipes = []
    first_time = True  # Track first game play

    # Initial setup
    background_color = (173, 216, 230)  # Light blue initially
    land_color = random.choice(land_colors)
    bird = Bird()

    while True:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                sys.exit()
            if event.type == pygame.KEYDOWN:
                if event.key == pygame.K_ESCAPE or event.key == pygame.K_q:
                    pygame.quit()
                    sys.exit()
                if event.key == pygame.K_SPACE:
                    if game_over:
                        # Reset the game
                        bird = Bird()
                        pipes.clear()
                        current_score = 0
                        if first_time:
                            # First restart after the initial game over
                            background_color = (random.randint(200, 255), random.randint(200, 255), random.randint(200, 255))
                            first_time = False
                        else:
                            background_color = (random.randint(200, 255), random.randint(200, 255), random.randint(200, 255))
                        land_color = random.choice(land_colors)
                        game_over = False
                    else:
                        # Jump the bird
                        bird.velocity = -15  # Initial upward velocity

        if not game_over:
            # Update bird and pipes
            bird.update()

            # Move pipes left
            remove_pipes = []
            for pipe in pipes:
                pipe['rect_top'].x -= PIPE_SPEED
                pipe['rect_bottom'].x -= PIPE_SPEED
                # Check if the bird passed the pipe
                if not pipe['scored'] and bird.rect.x > pipe['rect_top'].right:
                    current_score += 1
                    pipe['scored'] = True
                # Check if the pipe is offscreen
                if pipe['rect_top'].right < 0:
                    remove_pipes.append(pipe)
            # Remove offscreen pipes
            for p in remove_pipes:
                pipes.remove(p)

            # Spawn a new pipe if needed
            if not pipes or pipes[-1]['rect_top'].x < WIDTH - 200:
                pipes.append(spawn_pipe())

            # Check collisions
            land_rect = pygame.Rect(0, HEIGHT - LAND_HEIGHT, WIDTH, LAND_HEIGHT)
            bird_rect = bird.rect
            # Check pipes
            for pipe in pipes:
                if bird_rect.colliderect(pipe['rect_top']) or bird_rect.colliderect(pipe['rect_bottom']):
                    game_over = True
                    break
            # Check land and top
            if bird_rect.bottom >= land_rect.top or bird_rect.top <= 0:
                game_over = True

            if game_over:
                if current_score > best_score:
                    best_score = current_score

        # Drawing
        screen.fill(background_color)
        # Draw pipes
        for pipe in pipes:
            pygame.draw.rect(screen, pipe['color'], pipe['rect_top'])
            pygame.draw.rect(screen, pipe['color'], pipe['rect_bottom'])
        # Draw land
        pygame.draw.rect(screen, land_color, (0, HEIGHT - LAND_HEIGHT, WIDTH, LAND_HEIGHT))
        # Draw bird
        bird.draw()
        # Draw score
        font = pygame.font.SysFont(None, 36)
        score_text = font.render(f'Score: {current_score}', True, (0, 0, 0))
        screen.blit(score_text, (WIDTH - 150, 10))
        # Game over screen
        if game_over:
            over_text = font.render('Game Over!', True, (255, 0, 0))
            best_text = font.render(f'Best: {best_score}', True, (255, 0, 0))
            restart_text = font.render('Press SPACE to restart', True, (255, 0, 0))
            screen.blit(over_text, (WIDTH//2 - 70, HEIGHT//2 - 30))
            screen.blit(best_text, (WIDTH//2 - 50, HEIGHT//2 + 10))
            screen.blit(restart_text, (WIDTH//2 - 100, HEIGHT//2 + 50))

        pygame.display.flip()
        clock.tick(60)

if __name__ == "__main__":
    main()
```
{% endcode %}

6. When we run it, we get a working game!
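Before launching generated code, a quick parse check can flag outright broken syntax. The sketch below uses only the standard library `ast` module; the helper name `check_python_syntax` is hypothetical, and the second sample is a made-up line in the style of a broken generation.

```python
import ast
from typing import Optional


def check_python_syntax(source: str) -> Optional[str]:
    """Return None if `source` parses as Python, else a short error description."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"
    return None


# A valid statement parses cleanly; a garbled one is reported with its line number.
print(check_python_syntax("GRAVITY = 0.5"))
print(check_python_syntax("PIPE_SPEED=- ( ) ?"))
```

This only confirms that the text parses. It would not catch runtime failures such as a `NameError`, so running the game remains the real test.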

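7. Now try the same without our fixes! Remove `--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"`. This saves the output to `Q4_K_M_no_samplers.txt`.

```bash
./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.6 \
    --repeat-penalty 1.1 \
    --dry-multiplier 0.5 \
    --min-p 0.01 \
    --top-k 40 \
    --top-p 0.95 \
    -no-cnv \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n" \
    2>&1 | tee Q4_K_M_no_samplers.txt
```

You will get some looping, but also **problematic Python syntax errors** and many other issues. For example, the code below looks correct but is wrong! Line 39 reads `pipes.clear() ### <<< NameError: name 'pipes' is not defined. Did you forget to import 'pipes'?`

{% code overflow="wrap" lineNumbers="true" %}
```python
import pygame
import random

pygame.init()

# Constants
WIDTH, HEIGHT = 800, 600
GROUND_HEIGHT = 20
GRAVITY = 0.7
PIPE_SPEED = -3
BIRD_SIZE = 45
MIN_GAP = 130
MAX_GAP = 200
PIPE_COLORS = [(0, 96, 0), (205, 133, 63), (89, 97, 107)]
DARK_BROWN = (94, 72, 4)
YELLOW = (252, 228, 6)

screen = pygame.display.set_mode((WIDTH, HEIGHT))
clock = pygame.time.Clock()

def random_light_color():
    return (
        random.randint(180, 230),
        random.randint(190, 300),
        random.randint(250, 255)
    )

def reset_game():
    global bird_x, bird_y
    global pipes, score
    global background_color, land_color
    global bird_shape, bird_color

    # Bird properties
    bird_x = WIDTH * 0.3
    bird_y = HEIGHT // 2
    bird_vel = -5  # Initial upward thrust

    pipes.clear() ### <<< NameError: name 'pipes' is not defined. Did you forget to import 'pipes'?
```
{% endcode %}

8. If you use `--repeat-penalty 1.5`, it gets worse and more obvious: the output contains outright invalid syntax.

```python
import pygame
from random import randint  # For generating colors/shapes/positions randomly
pygame.init()

# Constants:
WIDTH, HEIGHT =456 ,702   #
BACKGROUND_COLOR_LIGHTS=['lightskyblue']
GAP_SIZE=189           #

BIRD_RADIUS=3.
PIPE_SPEED=- ( )    ?
class Game():
def __init__(self):
        self.screen_size=( )

def reset_game_vars():
    global current_scor e
   # set to zero and other initial states.

# Main game loop:
while running :
     for event in pygame.event.get() :
        if quit ... etc

pygame.quit()
print("Code is simplified. Due to time constraints, a full working version requires further implementation.")
```

9. You might wonder whether Q4\_K\_M is to blame. BF16, i.e. full precision, should work fine, right? No: the outputs fail again if we do not use our fix of `--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"` when using a repetition penalty.

## :sunrise\_over\_mountains: Still doesn't work? Try Min\_p = 0.1, Temperature = 1.5

According to the Min\_p paper, for more creative and diverse outputs, and if you still see repetitions, try disabling top\_p and top\_k!

```bash
./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --temp 1.5 \
    --min-p 0.1 \
    --top-k 0 \
    --top-p 1.0 \
    -no-cnv \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n"
```

Another approach is to disable `min_p` directly, since llama.cpp uses `min_p = 0.1` by default!

```bash
./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --temp 0.6 \
    --min-p 0.0 \
    --top-k 40 \
    --top-p 0.95 \
    -no-cnv \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n"
```

## :thinking: `<think>` token not shown?

Some people report that because `<think>` is added by default in the chat template, some systems do not output the thinking traces correctly. You will need to manually edit the Jinja template from:

{% code overflow="wrap" %}
```
```
{% endcode %}

to another by removing the `<think>\n` at the end. The model then has to add `<think>\n` itself during inference, which might not always succeed. DeepSeek also edited all of its models to add a `<think>` token by default, to force the model into reasoning mode.

So change `{%- if add_generation_prompt %} {{- '<|im_start|>assistant\n<think>\n' }} {%- endif %}` to `{%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %}`, i.e. remove `<think>\n`.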

Full Jinja template with the `<think>\n` part removed:

{% code overflow="wrap" %} ``` ``` {% endcode %}
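The template edit described above can also be applied programmatically. The following is a minimal sketch that treats the chat template as plain text and assumes the generation prompt contains the literal Jinja string `<|im_start|>assistant\n<think>\n` (with `\n` as two characters, as it appears in the template source); the helper name `remove_forced_think` is hypothetical.

```python
# Raw strings: inside the Jinja source, "\n" is a backslash followed by "n".
THINK_PROMPT = r"<|im_start|>assistant\n<think>\n"
PLAIN_PROMPT = r"<|im_start|>assistant\n"


def remove_forced_think(template: str) -> str:
    """Drop the forced <think> opener from a chat template's generation prompt."""
    if THINK_PROMPT not in template:
        raise ValueError("template does not force <think>; nothing to change")
    return template.replace(THINK_PROMPT, PLAIN_PROMPT)


# Only the generation-prompt fragment of the template is used as a stand-in here.
fragment = r"{%- if add_generation_prompt %} {{- '<|im_start|>assistant\n<think>\n' }} {%- endif %}"
print(remove_forced_think(fragment))
```

With Transformers, the same function could in principle be applied to the string stored in `tokenizer.chat_template`. Raising when the marker is missing makes it obvious if the template has already been edited or uses a different layout.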

## Additional notes

We first thought that:

1. QwQ's context length was not natively 128K, but rather 32K with a YaRN extension. For example, the readme file shows:

```json
{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
```

We tried overriding llama.cpp's YaRN handling, but nothing changed.

{% code overflow="wrap" %}
```bash
--override-kv qwen2.context_length=int:131072 \
--override-kv qwen2.rope.scaling.type=str:yarn \
--override-kv qwen2.rope.scaling.factor=float:4 \
--override-kv qwen2.rope.scaling.original_context_length=int:32768 \
--override-kv qwen2.rope.scaling.attn_factor=float:1.13862943649292 \
```
{% endcode %}

2. We also thought the RMS Layernorm epsilon might be wrong: not 1e-5 but maybe 1e-6. For example, [this](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct/blob/main/config.json) has `rms_norm_eps=1e-06`, whilst [this](https://huggingface.co/Qwen/Qwen2.5-32B/blob/main/config.json) has `rms_norm_eps=1e-05`. We overrode it as well, but it did not help:

{% code overflow="wrap" %}
```bash
--override-kv qwen2.attention.layer_norm_rms_epsilon=float:0.000001 \
```
{% endcode %}

3. We also tested whether the tokenizer IDs match between llama.cpp and normal Transformers, courtesy of [@kalomaze](https://x.com/kalomaze/status/1897875332230779138). They match, so this was not the culprit.

We provide our experimental results below:

{% file src="/files/364d8eaed794b43f67464314e70d58724a14192d" %}
BF16 full precision without the sampling fix
{% endfile %}

{% file src="/files/d9b57be22c049ee9e78293072ae85f8f7e297dc7" %}
BF16 full precision with the sampling fix
{% endfile %}

{% file src="/files/cf21edac67682548497331e4c92eff9f98bad55a" %}
Q4\_K\_M precision without the sampling fix
{% endfile %}

{% file src="/files/e6e3d29b62c3a2565e7a6d5fd6d41a1b90a30e22" %}
Q4\_K\_M precision with the sampling fix
{% endfile %}

## :pencil2: Tokenizer Bug Fixes

* We also found a few issues that specifically affect fine-tuning! The EOS token is correct, but the PAD token should probably rather be `"<|vision_pad|>"`. We have updated it in our uploads; the affected tokenizer entries are:

```
"eos_token": "<|im_end|>",
"pad_token": "<|endoftext|>",
```

## :tools: Dynamic 4-bit Quants

We also uploaded dynamic 4-bit quants, which increase accuracy compared with naive 4-bit quantization! We also analyzed QwQ's quantization error plots for both activation and weight quantization errors.

Since vLLM 0.7.3 (February 20, 2025), vLLM supports loading Unsloth dynamic 4-bit quants! The dynamic 4-bit and GGUF uploads are linked in the table at the top of this page.