GLM-4.7-Flash: How to Run Locally

Run and fine-tune GLM-4.7-Flash locally on your own device!

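GLM-4.7-Flash is Z.ai's new 30B MoE reasoning model built for local deployment, delivering best-in-class performance for coding, agentic workflows, and chat. It uses about 3.6B parameters, supports a 200K context, and leads on SWE-Bench, GPQA, and reasoning/chat benchmarks.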

GLM-4.7-Flash runs on 24GB of RAM/VRAM/unified memory (32GB for full precision), and you can now fine-tune it with Unsloth. To run GLM 4.7 Flash with vLLM, see the vLLM section of this guide.

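Jan 21 update: llama.cpp fixed a bug where scoring_func was wrongly specified as "softmax" (it should be "sigmoid"). The bug caused looping and poor outputs. We have updated the GGUFs, so please re-download the model for better outputs.

You can now use Z.ai's recommended parameters and get great results:

For general use: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
Repeat penalty: disable it, or set --repeat-penalty 1.0

Jan 22: the FA fix for CUDA is now merged, so inference is faster.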


GLM-4.7-Flash GGUF to run: unsloth/GLM-4.7-Flash-GGUF

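⚙️ Usage Guide

For best performance, make sure your total available memory (VRAM + system RAM) exceeds the size of the quantized model file you are downloading. If it does not, llama.cpp can still run by offloading to SSD/HDD, but inference will be slower.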
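The memory rule above can be checked with a few lines of Python. This is a minimal sketch: the helper name fits_in_memory and the sample machine sizes are illustrative assumptions, not part of llama.cpp or Unsloth.

```python
def fits_in_memory(model_file_gb: float, vram_gb: float, ram_gb: float) -> bool:
    """Return True when VRAM + system RAM exceeds the quantized file size."""
    return (vram_gb + ram_gb) > model_file_gb


# Hypothetical machines checked against an 18 GB quantized file.
print(fits_in_memory(model_file_gb=18.0, vram_gb=8.0, ram_gb=16.0))  # True
print(fits_in_memory(model_file_gb=18.0, vram_gb=8.0, ram_gb=8.0))   # False
```

When the check fails, the model can still load through disk offloading, just more slowly.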

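After speaking with the Z.ai team, they recommend these GLM-4.7 sampling parameters:

| Default settings (most tasks) | Terminal Bench, SWE Bench Verified |
| --- | --- |
| temperature = 1.0 | temperature = 0.7 |
| top_p = 0.95 | top_p = 1.0 |

Repeat penalty = disabled or 1.0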

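For general use: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
If using llama.cpp, set --min-p 0.01, because llama.cpp's default is 0.05
You may sometimes need to experiment to find which values work best for your use case.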
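To keep the two presets from drifting apart across scripts, they can live in one table that is expanded into llama.cpp flags. The sketch below is only an illustration: SAMPLING_PRESETS and llama_cli_sampling_args are hypothetical names, while the flags and values are the ones listed above.

```python
# Sampling presets from this guide, keyed by use case.
SAMPLING_PRESETS = {
    "general": {"temp": 1.0, "top-p": 0.95},
    "tool_calling": {"temp": 0.7, "top-p": 1.0},
}


def llama_cli_sampling_args(use_case: str) -> list[str]:
    """Expand a preset into llama.cpp command-line sampling flags."""
    args: list[str] = []
    for flag, value in SAMPLING_PRESETS[use_case].items():
        args += [f"--{flag}", str(value)]
    # Shared settings: lower min-p than the llama.cpp default, repeat penalty off.
    args += ["--min-p", "0.01", "--repeat-penalty", "1.0"]
    return args


print(" ".join(llama_cli_sampling_args("general")))
# --temp 1.0 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.0
```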

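We currently do not recommend running this GGUF with Ollama because of potential chat template compatibility issues. The GGUF works well on llama.cpp (or backends such as LM Studio and Jan).

Remember to disable repeat penalty, or set --repeat-penalty 1.0

Maximum context window: 202,752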

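🖥️ Run GLM-4.7-Flash

Depending on your use case, you will need different settings. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions that are not divisible by 128, so some parts cannot be quantized to lower bits.

Because this guide uses 4-bit, you will need about 18GB of RAM/unified memory. We recommend at least 4-bit precision for best performance.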
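As a rough sanity check on that figure, a weight-only estimate multiplies the parameter count by the bits per weight. This is a back-of-the-envelope sketch, and estimate_weights_gb is a hypothetical helper; actual GGUF files and memory use are typically higher than the estimate, since some tensors stay at higher precision and the runtime also needs room for the KV cache.

```python
def estimate_weights_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight-only size in GB (1 GB = 1e9 bytes); ignores KV cache and overhead."""
    return num_params * bits_per_weight / 8 / 1e9


# 30B parameters stored at exactly 4 bits each.
print(estimate_weights_gb(30e9, 4))  # 15.0
```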

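Llama.cpp Tutorial (GGUF):

Instructions for running in llama.cpp (note that we use 4-bit to fit most devices):

Obtain the latest llama.cpp from GitHub (https://github.com/ggml-org/llama.cpp). You can also follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you do not have a GPU or only want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF and continue as usual; Metal support is on by default.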

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

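You can pull the model directly from Hugging Face. You can increase the context up to 200K as your RAM/VRAM allows.

You can also try Z.ai's recommended GLM-4.7 sampling parameters:

For general use: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
Remember to disable repeat penalty!

For general instruction use cases: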

./llama.cpp/llama-cli \
    -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --min-p 0.01

For tool-calling use cases:

./llama.cpp/llama-cli \
    -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
    --ctx-size 16384 \
    --temp 0.7 --top-p 1.0 --min-p 0.01

Download the model as follows (after running pip install huggingface_hub). You can choose UD-Q4_K_XL or another quantized version. If downloads get stuck, see Hugging Face Hub, XET debugging.

pip install -U huggingface_hub
hf download unsloth/GLM-4.7-Flash-GGUF \
    --local-dir unsloth/GLM-4.7-Flash-GGUF \
    --include "*UD-Q4_K_XL*"

Then run the model in conversation mode:

./llama.cpp/llama-cli \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --ctx-size 16384 \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01

Also adjust the context window (--ctx-size) as needed, up to 202752.

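➿ Reducing repetition and looping

Jan 21 update: llama.cpp fixed a bug where "scoring_func" was wrongly specified as "softmax" (it should be "sigmoid"), which caused looping and poor outputs. We have updated the GGUFs. Please re-download the model for better outputs.

This means you can now use Z.ai's recommended parameters and get great results:

For general use: --temp 1.0 --top-p 0.95
For tool calling: --temp 0.7 --top-p 1.0
If using llama.cpp, set --min-p 0.01, because llama.cpp's default is 0.05
Remember to disable repeat penalty, or set --repeat-penalty 1.0

We added "scoring_func": "sigmoid" to config.json for the main model.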
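If you keep a local copy of the main model's config.json, a quick check can confirm the corrected value is present. This is a minimal sketch that assumes "scoring_func" sits at the top level of the JSON; uses_sigmoid_scoring is a hypothetical helper, and the sample string stands in for the real file contents.

```python
import json


def uses_sigmoid_scoring(config_text: str) -> bool:
    """Return True when the config sets scoring_func to "sigmoid"."""
    config = json.loads(config_text)
    # Assumption: the key lives at the top level of config.json.
    return config.get("scoring_func") == "sigmoid"


# Stand-in snippets; in practice, read the text from your local config.json.
print(uses_sigmoid_scoring('{"scoring_func": "sigmoid"}'))  # True
print(uses_sigmoid_scoring('{"scoring_func": "softmax"}'))  # False
```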

We currently do not recommend running this GGUF with Ollama because of potential chat template compatibility issues. The GGUF works well on llama.cpp (or backends such as LM Studio and Jan).

🐦 Flappy Bird example with UD-Q4_K_XL

As an example, we held the following long conversation with UD-Q4_K_XL via ./llama.cpp/llama-cli --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf --fit on --temp 1.0 --top-p 0.95 --min-p 0.01 :

Hi
What is 2+2
Create a Python Flappy Bird game
Create a totally different game in Rust
Find bugs in both
Make the first game I mentioned a standalone HTML file
Find bugs and show the fixed game

This produced the following Flappy Bird game as HTML:

Flappy Bird game in HTML:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
    <title>Flappy Bird Fixed</title>
    <style>
        body {
            margin: 0;
            display: flex;
            justify-content: center;
            align-items: center;
            height: 100vh;
            background-color: #222;
            font-family: 'Arial', sans-serif;
            overflow: hidden;
            user-select: none;
            -webkit-user-select: none;
            touch-action: none; /* prevent zooming on mobile */
        }

        #game-container {
            position: relative;
            box-shadow: 0 0 20px rgba(0,0,0,0.5);
        }

        canvas {
            background-color: #87CEEB;
            display: block;
            border-radius: 4px;
        }

        /* UI overlay */
        #ui-layer {
            position: absolute;
            top: 0;
            left: 0;
            width: 100%;
            height: 100%;
            pointer-events: none;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            text-align: center;
        }

        #score-display {
            position: absolute;
            top: 40px;
            left: 50%;
            transform: translateX(-50%);
            font-size: 48px;
            font-weight: bold;
            color: white;
            text-shadow: 3px 3px 0 #000;
            z-index: 10;
            font-family: 'Courier New', Courier, monospace;
        }

        #start-screen, #game-over-screen {
            background: rgba(0, 0, 0, 0.7);
            width: 100%;
            height: 100%;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            color: white;
            pointer-events: auto; /* allow clicks */
            cursor: pointer;
        }

        h1 { margin: 0 0 10px 0; font-size: 60px; text-shadow: 4px 4px 0 #000; line-height: 1; }
        p { font-size: 22px; margin: 10px 0; color: #ddd; }
        
        .btn {
            background: linear-gradient(to bottom, #ffeb3b, #fbc02d);
            border: 3px solid #fff;
            color: #333;
            padding: 15px 40px;
            font-size: 28px;
            font-weight: bold;
            cursor: pointer;
            border-radius: 8px;
            box-shadow: 0 6px 0 #c49000, 0 10px 10px rgba(0,0,0,0.3);
            text-transform: uppercase;
            transition: all 0.1s;
            margin-top: 10px;
        }

        .btn:active {
            transform: translateY(4px);
            box-shadow: 0 2px 0 #c49000, 0 4px 4px rgba(0,0,0,0.3);
        }

        .score-board {
            background: #ded895;
            border: 2px solid #543847;
            padding: 20px 40px;
            border-radius: 10px;
            box-shadow: 4px 4px 0 #543847;
            margin-bottom: 30px;
            display: none;
            border: 4px solid #543847;
        }
        
        .score-board h2 { margin: 0 0 5px 0; color: #e86101; font-size: 40px; }
        .score-board span { font-size: 20px; color: #543847; display: block; text-align: center; }

    </style>
</head>
<body>

    <div id="game-container">
        <canvas id="gameCanvas" width="400" height="600"></canvas>
        
        <div id="score-display">0</div>

        <div id="ui-layer">
            <div id="start-screen">
                <h1>FLAPPY<br>BIRD</h1>
                <p>Tap or press Space to start</p>
                <button class="btn" style="display:none;" id="touch-instruction">Tap to Start</button>
            </div>

            <div id="game-over-screen">
                <h1>Game Over</h1>
                <div class="score-board" id="score-board">
                    <h2>Score: <span id="final-score">0</span></h2>
                </div>
                <button class="btn" id="restart-btn">Try Again</button>
            </div>
        </div>
    </div>

<script>
    const canvas = document.getElementById('gameCanvas');
    const ctx = canvas.getContext('2d');

    // --- Constants ---
    const GRAVITY = 0.35; // slightly stronger gravity for a better feel
    const JUMP_STRENGTH = -6.5;
    const PIPE_GAP = 180;
    const PIPE_WIDTH = 60;
    const PIPE_SPEED = 2.5;
    const PIPE_SPAWN_RATE = 100;

    // --- State ---
    let frames = 0;
    let score = 0;
    let isGameOver = false;
    let isPlaying = false;
    let gameLoopId;

    const ui = {
        startScreen: document.getElementById('start-screen'),
        gameOverScreen: document.getElementById('game-over-screen'),
        scoreDisplay: document.getElementById('score-display'),
        scoreBoard: document.getElementById('score-board'),
        finalScore: document.getElementById('final-score'),
        restartBtn: document.getElementById('restart-btn')
    };

    const bird = {
        x: 80,
        y: 150,
        radius: 12, // fixed radius
        velocity: 0,
        
        draw: function() {
            // Rotate the bird based on its velocity for visual effect
            let angle = Math.min(Math.PI / 4, Math.max(-Math.PI / 4, (this.velocity * 0.1)));
            
            ctx.save();
            ctx.translate(this.x, this.y);
            ctx.rotate(angle);
            
            // Draw the body
            ctx.fillStyle = '#FFD700';
            ctx.beginPath();
            ctx.arc(0, 0, this.radius, 0, Math.PI * 2);
            ctx.fill();
            
            // Eye
            ctx.fillStyle = 'white';
            ctx.beginPath();
            ctx.arc(4, -4, 4, 0, Math.PI * 2);
            ctx.fill();
            ctx.fillStyle = 'black';
            ctx.beginPath();
            ctx.arc(6, -4, 2, 0, Math.PI * 2);
            ctx.fill();
            
            // Wing
            ctx.fillStyle = '#FFA500';
            ctx.beginPath();
            ctx.arc(-4, 4, 5, 0, Math.PI * 2);
            ctx.fill();

            ctx.restore();
        },

        update: function() {
            this.velocity += GRAVITY;
            this.y += this.velocity;
        },

        jump: function() {
            this.velocity = JUMP_STRENGTH;
        },

        reset: function() {
            this.y = 150;
            this.velocity = 0;
        }
    };

    let pipes = [];

    function createPipe() {
        const minHeight = 50;
        const maxPos = canvas.height - PIPE_GAP - minHeight;
        const topHeight = Math.floor(Math.random() * (maxPos - minHeight + 1)) + minHeight;
        
        pipes.push({
            x: canvas.width,
            topHeight: topHeight,
            bottomY: topHeight + PIPE_GAP,
            width: PIPE_WIDTH,
            passed: false
        });
    }

    function drawPipes() {
        ctx.fillStyle = '#2ecc71';
        ctx.strokeStyle = '#27ae60';
        ctx.lineWidth = 2;
        
        pipes.forEach(pipe => {
            // Top pipe
            ctx.fillRect(pipe.x, 0, pipe.width, pipe.topHeight);
            ctx.strokeRect(pipe.x, 0, pipe.width, pipe.topHeight);
            
            // Bottom pipe
            ctx.fillRect(pipe.x, pipe.bottomY, pipe.width, canvas.height - pipe.bottomY);
            ctx.strokeRect(pipe.x, pipe.bottomY, pipe.width, canvas.height - pipe.bottomY);

            // Pipe caps
            const capH = 20;
            ctx.fillStyle = '#27ae60'; 
            ctx.fillRect(pipe.x - 2, pipe.topHeight - capH, pipe.width + 4, capH);
            ctx.fillRect(pipe.x - 2, pipe.bottomY, pipe.width + 4, capH);
        });
    }

    function updatePipes() {
        if (frames % PIPE_SPAWN_RATE === 0) createPipe();

        for (let i = 0; i < pipes.length; i++) {
            let p = pipes[i];
            p.x -= PIPE_SPEED;

            // --- Fixed collision detection ---
            // Treat the bird as a circle of radius 'bird.radius'
            // A pipe is a rectangle: x, x+w, y_top, y_bottom
            let birdLeft = bird.x - bird.radius;
            let birdRight = bird.x + bird.radius;
            let birdTop = bird.y - bird.radius;
            let birdBottom = bird.y + bird.radius;

            // Horizontal overlap
            if (birdRight > p.x && birdLeft < p.x + p.width) {
                // Vertical overlap (hit the top pipe or the bottom pipe)
                if (birdTop < p.topHeight || birdBottom > p.bottomY) {
                    gameOver();
                }
            }

            // --- Fixed scoring ---
            // If the pipe has moved off the left edge of the screen and has not been scored yet
            if (p.x + p.width < 0 && !p.passed) {
                score++;
                p.passed = true;
                ui.scoreDisplay.innerText = score;
            }

            if (p.x < -60) {
                pipes.shift();
                i--;
            }
        }
    }

    function checkCollisions() {
        // Ground
        if (bird.y + bird.radius >= canvas.height) {
            gameOver();
        }
        // Ceiling
        if (bird.y - bird.radius <= 0) {
            bird.y = bird.radius;
            bird.velocity = 0;
        }
    }

    function drawBackground() {
        // Clear
        ctx.clearRect(0, 0, canvas.width, canvas.height);
        
        // Ground
        ctx.fillStyle = '#654321';
        ctx.fillRect(0, canvas.height - 10, canvas.width, 10);
        
        // Clouds
        ctx.fillStyle = "rgba(255, 255, 255, 0.6)";
        for(let i=0; i<4; i++) {
            let x = (frames * 0.5 + i * 150) % (canvas.width + 100) - 50;
            let y = (i * 40) + 20;
            let scale = 1 + (Math.sin(frames * 0.02 + i) * 0.1);
            let size = 30 * scale;
            ctx.beginPath();
            ctx.arc(x, y, size, 0, Math.PI * 2);
            ctx.arc(x + 20*scale, y - 10*scale, size * 1.2, 0, Math.PI * 2);
            ctx.arc(x + 40*scale, y, size, 0, Math.PI * 2);
            ctx.fill();
        }
    }

    function update() {
        if (!isPlaying) return;
        bird.update();
        updatePipes();
        checkCollisions();
        frames++;
    }

    function draw() {
        drawBackground();
        drawPipes();
        bird.draw();
    }

    function loop() {
        update();
        draw();
        if (isPlaying || !isGameOver) {
            gameLoopId = requestAnimationFrame(loop);
        }
    }

    function startGame() {
        isPlaying = true;
        isGameOver = false;
        
        // UI
        ui.startScreen.style.display = 'none';
        ui.gameOverScreen.style.display = 'none';
        ui.scoreBoard.style.display = 'none';
        
        // Logic
        bird.reset();
        pipes = [];
        score = 0;
        frames = 0;
        ui.scoreDisplay.innerText = '0';
        
        loop();
    }

    function gameOver() {
        isPlaying = false;
        isGameOver = true;
        cancelAnimationFrame(gameLoopId);
        
        ui.finalScore.innerText = score;
        ui.gameOverScreen.style.display = 'flex';
        ui.scoreBoard.style.display = 'block';
    }

    // --- Input handling ---

    function handleInput(e) {
        if (e.type === 'keydown' && e.code === 'Space') e.preventDefault();

        if (isPlaying) {
            bird.jump();
        } else if (!isGameOver) {
            // Click on the start screen (or any click before the game has started)
            startGame();
        }
    }

    // Keyboard
    window.addEventListener('keydown', (e) => {
        if (e.code === 'Space') handleInput(e);
    });

    // Mouse / touch
    window.addEventListener('mousedown', handleInput);
    window.addEventListener('touchstart', (e) => {
        // Prevent zooming/scrolling
        // e.preventDefault(); 
        handleInput(e);
    }, {passive: false});

    // UI interactions
    ui.restartBtn.addEventListener('click', (e) => {
        e.stopPropagation();
        startGame();
    });
    
    // Allow clicking the Game Over overlay to restart
    ui.gameOverScreen.addEventListener('mousedown', (e) => {
        if(e.target === ui.gameOverScreen) startGame();
    });
    ui.gameOverScreen.addEventListener('touchstart', (e) => {
        if(e.target === ui.gameOverScreen) {
            e.preventDefault();
            startGame();
        }
    });

    // Initial draw
    drawBackground();
    bird.reset();
    bird.draw();

</script>
</body>
</html>

We also captured screenshots of the result (the 4-bit quant works).

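🦥 Fine-tuning GLM-4.7-Flash

Unsloth now supports fine-tuning GLM-4.7-Flash, but you will need transformers v5. The 30B model does not fit on a free Colab GPU; however, you can use our notebook. 16-bit LoRA fine-tuning of GLM-4.7-Flash uses around 60GB of VRAM:

GLM-4.7-Flash SFT LoRA notebook

You may sometimes run out of memory on an A100 with 40GB of VRAM. For smoother runs, use an H100 or an A100 with 80GB of VRAM.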


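When fine-tuning MoEs, it is best not to fine-tune the router layer, so we disable it by default. If you want the model to keep its reasoning ability (optional), train on a mix of direct-answer and chain-of-thought examples. Use at least 75% reasoning and at most 25% non-reasoning examples.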
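One way to enforce that ratio when assembling a dataset is to keep every reasoning example and cap the number of direct-answer examples. The sketch below is an illustration under assumptions: mix_examples is a hypothetical helper, the records are toy dictionaries, and it is not part of the Unsloth API.

```python
import random


def mix_examples(reasoning, direct, reasoning_percent=75, seed=3407):
    """Return a shuffled mix in which reasoning examples make up at least `reasoning_percent`."""
    if not 0 < reasoning_percent <= 100:
        raise ValueError("reasoning_percent must be in (0, 100]")
    # Largest number of direct examples that keeps the reasoning share at or above the target.
    max_direct = len(reasoning) * (100 - reasoning_percent) // reasoning_percent
    rng = random.Random(seed)
    chosen_direct = rng.sample(direct, min(max_direct, len(direct)))
    mixed = list(reasoning) + chosen_direct
    rng.shuffle(mixed)
    return mixed


# Toy data: 75 reasoning examples and a larger pool of direct answers.
reasoning = [{"kind": "reasoning", "id": i} for i in range(75)]
direct = [{"kind": "direct", "id": i} for i in range(100)]
mixed = mix_examples(reasoning, direct)
share = sum(ex["kind"] == "reasoning" for ex in mixed) / len(mixed)
print(len(mixed), share)  # 100 0.75
```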

🦙 Llama-server serving and deployment

To deploy GLM-4.7-Flash for production, we use llama-server. In a new terminal (for example, via tmux), deploy the model with:

./llama.cpp/llama-server \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --alias "unsloth/GLM-4.7-Flash" \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --ctx-size 16384 \
    --port 8001

Then, in a new terminal, after running pip install openai, run:

from openai import OpenAI
import json
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/GLM-4.7-Flash",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)

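This will print:

User asks a simple question: "What is 2+2?" The answer is 4. Provide answer.

2 + 2 = 4.

💻 GLM-4.7-Flash in vLLM

You can now use our new FP8 Dynamic quant of the model for high-performance, fast inference. First, install vLLM from the nightly build: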

uv pip install --upgrade --force-reinstall vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly/cu130
uv pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers.git
uv pip install --force-reinstall numba

Then serve Unsloth's dynamic FP8 version of the model. We use an FP8 KV cache (--kv-cache-dtype fp8) to reduce KV cache memory usage by 50%, and we run on 4 GPUs. If you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1, or remove that argument. To disable FP8, remove --kv-cache-dtype fp8 (and --quantization fp8, if you pass it).

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
    --served-model-name unsloth/GLM-4.7-Flash \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --dtype bfloat16 \
    --seed 3407 \
    --max-model-len 200000 \
    --gpu-memory-utilization 0.95 \
    --max_num_batched_tokens 16384 \
    --port 8001 \
    --kv-cache-dtype fp8
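The 50% saving follows from element width: a bfloat16 value takes 2 bytes and an FP8 value takes 1. The sketch below shows only that arithmetic. kv_cache_gb is a hypothetical helper, and ELEMENTS_PER_TOKEN is a made-up placeholder rather than the real per-token cache size of GLM-4.7-Flash.

```python
# Storage width of one cached element for each dtype.
BYTES_PER_ELEMENT = {"bfloat16": 2, "fp8": 1}


def kv_cache_gb(num_tokens: int, elements_per_token: int, dtype: str) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes) for a given number of cached tokens."""
    return num_tokens * elements_per_token * BYTES_PER_ELEMENT[dtype] / 1e9


ELEMENTS_PER_TOKEN = 50_000  # hypothetical placeholder, not the real model value

bf16_gb = kv_cache_gb(200_000, ELEMENTS_PER_TOKEN, "bfloat16")
fp8_gb = kv_cache_gb(200_000, ELEMENTS_PER_TOKEN, "fp8")
print(bf16_gb, fp8_gb, fp8_gb / bf16_gb)  # 20.0 10.0 0.5
```

Whatever the true per-token size is, the ratio between the two dtypes stays at one half.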

You can then call the served model through the OpenAI API:

from openai import AsyncOpenAI, OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"
client = OpenAI( # or AsyncOpenAI
    api_key=openai_api_key,
    base_url=openai_api_base,
)

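⭐ vLLM GLM-4.7-Flash speculative decoding

We found that using the MTP (multi-token prediction) module from GLM 4.7 Flash makes generation throughput drop from 13,000 tokens/s to 1,300 tokens/s on 1 B200 (10x slower)! On Hopper, it should hopefully be fine.

MTP is enabled with these vllm serve flags:

    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1

With MTP: only 1,300 tokens/s throughput on 1xB200 (130 tokens/s decoding per user).

Without MTP: 13,000 tokens/s throughput on 1xB200 (still 130 tokens/s decoding per user).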
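The two measurements can be compared with simple arithmetic. The slowdown factor comes straight from the numbers above; the "implied streams" figure assumes that aggregate throughput is just the per-user decoding speed times the number of concurrent streams, which is our reading and not something the measurements state. Both helper names are hypothetical.

```python
def slowdown(baseline_tps: float, observed_tps: float) -> float:
    """How many times slower the observed throughput is than the baseline."""
    return baseline_tps / observed_tps


def implied_streams(total_tps: float, per_user_tps: float) -> float:
    """Concurrent streams implied if total throughput = per-user speed x streams (assumption)."""
    return total_tps / per_user_tps


print(slowdown(13_000, 1_300))         # 10.0
print(implied_streams(13_000, 130))    # 100.0
print(implied_streams(1_300, 130))     # 10.0
```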

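🔨 Tool calling with GLM-4.7-Flash

See the Tool Calling Guide for more details on how to do tool calling. In a new terminal (if using tmux, press CTRL+B then D to detach first), we create some tools, such as adding two numbers, executing Python code, executing Linux commands, and more: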

import json, subprocess, random
from typing import Any
def add_number(a: float | str, b: float | str) -> float:
    return float(a) + float(b)
def multiply_number(a: float | str, b: float | str) -> float:
    return float(a) * float(b)
def substract_number(a: float | str, b: float | str) -> float:
    return float(a) - float(b)
def write_a_story() -> str:
    return random.choice([
        "Once upon a time, in a galaxy far, far away...",
        "There were two friends who loved sloths and code...",
        "The world was ending because every sloth had evolved superhuman intelligence...",
        "Unbeknownst to one friend, the other had accidentally written a program that evolved sloths...",
    ])
def terminal(command: str) -> str:
    if "rm" in command or "sudo" in command or "dd" in command or "chmod" in command:
        msg = "Cannot execute 'rm, sudo, dd, chmod' commands since they are dangerous"
        print(msg); return msg
    print(f"Executing terminal command `{command}`")
    try:
        return str(subprocess.run(command, capture_output = True, text = True, shell = True, check = True).stdout)
    except subprocess.CalledProcessError as e:
        return f"Command failed: {e.stderr}"
def python(code: str) -> str:
    data = {}
    exec(code, data)
    del data["__builtins__"]
    return str(data)
MAP_FN = {
    "add_number": add_number,
    "multiply_number": multiply_number,
    "substract_number": substract_number,
    "write_a_story": write_a_story,
    "terminal": terminal,
    "python": python,
}
tools = [
    {
        "type": "function",
        "function": {
            "name": "add_number",
            "description": "Add two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "multiply_number",
            "description": "Multiply two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "substract_number",
            "description": "Subtract two numbers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "string",
                        "description": "The first number.",
                    },
                    "b": {
                        "type": "string",
                        "description": "The second number.",
                    },
                },
                "required": ["a", "b"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_a_story",
            "description": "Write a random story.",
            "parameters": {
                "type": "object",
                "properties": {},
                "required": [],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "terminal",
            "description": "Perform operations in the terminal.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "The command you wish to launch, e.g. `ls`, `rm`, ...",
                    },
                },
                "required": ["command"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Call a Python interpreter with some Python code that will be run.",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "The Python code to run",
                    },
                },
                "required": ["code"],
            },
        },
    },
]

We then use the function below (copy, paste, and execute), which parses the function calls automatically and calls the OpenAI endpoint for any model:

import json
from openai import OpenAI
def unsloth_inference(
    messages,
    temperature = 0.7,
    top_p = 1.0,
    top_k = -1,
    min_p = 0.01,
    repetition_penalty = 0.0,
):
    # Relies on `tools` and `MAP_FN` defined in the previous block
    messages = messages.copy()
    openai_client = OpenAI(
        base_url = "http://127.0.0.1:8001/v1",
        api_key = "sk-no-key-required",
    )
    model_name = next(iter(openai_client.models.list())).id
    print(f"Using model = {model_name}")
    has_tool_calls = True
    while has_tool_calls:
        print(f"Current messages = {messages}")
        response = openai_client.chat.completions.create(
            model = model_name,
            messages = messages,
            temperature = temperature,
            top_p = top_p,
            tools = tools if tools else None,
            tool_choice = "auto" if tools else None,
            extra_body = {"top_k": top_k, "min_p": min_p, "dry_multiplier" :repetition_penalty,}
        )
        tool_calls = response.choices[0].message.tool_calls or []
        content = response.choices[0].message.content or ""
        tool_calls_dict = [tc.to_dict() for tc in tool_calls] if tool_calls else tool_calls
        messages.append({"role": "assistant", "tool_calls": tool_calls_dict, "content": content,})
        # Run every requested tool and feed its output back to the model
        for tool_call in tool_calls:
            fx, args, _id = tool_call.function.name, tool_call.function.arguments, tool_call.id
            out = MAP_FN[fx](**json.loads(args))
            messages.append({"role": "tool", "tool_call_id": _id, "name": fx, "content": str(out),})
        # Keep looping only while the model is still requesting tools
        has_tool_calls = bool(tool_calls)
    return messages
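The dispatch step inside that loop can be tried without a running server. The sketch below is a standalone illustration: dispatch_tool_call and the small registry are hypothetical, and the JSON string mimics the arguments field that an OpenAI-style tool call carries.

```python
import json


def add_number(a, b) -> float:
    """Same behavior as the add_number tool above, repeated so this sketch runs on its own."""
    return float(a) + float(b)


def dispatch_tool_call(name: str, arguments_json: str, registry: dict) -> str:
    """Look up a tool by name, decode its JSON arguments, and return the result as text."""
    if name not in registry:
        return f"Unknown tool: {name}"
    kwargs = json.loads(arguments_json) if arguments_json else {}
    return str(registry[name](**kwargs))


registry = {"add_number": add_number}
print(dispatch_tool_call("add_number", '{"a": "2", "b": "3"}', registry))  # 5.0
print(dispatch_tool_call("missing_tool", "{}", registry))  # Unknown tool: missing_tool
```

Returning an error string for unknown tool names, instead of raising, lets the model see the failure and retry with a different call.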

After launching GLM-4.7-Flash via llama-server as in the Llama-server section above (or see the Tool Calling Guide for more details), we can make some tool calls:

Tool call for mathematical operations with GLM 4.7

messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "What is today's date plus 3 days?"}],
}]
unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = -1, min_p = 0.01)

Tool call to execute generated Python code with GLM-4.7-Flash

messages = [{
    "role": "user",
    "content": [{"type": "text", "text": "Create a Fibonacci function in Python and find fib(20)."}],
}]
unsloth_inference(messages, temperature = 1.0, top_p = 0.95, top_k = -1, min_p = 0.01)

Benchmarks

GLM-4.7-Flash is the best-performing 30B model on every benchmark below except AIME 25 and LCB v6.

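| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
| --- | --- | --- | --- |
| AIME 25 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| HLE | 14.4 | 9.8 | 10.9 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |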
