【LLM】Deploying Qwen3-4B on a Single GPU with vLLM

An attempt at deploying the Qwen3-4B model on a consumer GPU.

1. Deployment

1.1 Hardware and Software Versions

  • GPU: RTX 4070
  • OS: Ubuntu 23.10
  • Python: 3.12
  • vLLM: 0.8.5.post1
  • CUDA: 12.2

1.2 Environment Setup

# Download the model weights
$ git clone https://www.modelscope.cn/Qwen/Qwen3-4B.git
# Create a Python environment
$ conda create --name llm_hello_world python=3.12
$ conda activate llm_hello_world
# Install vLLM
$ pip install -U "vllm"
# Pull the Open WebUI image through a GHCR mirror
$ docker pull ghcr.dockerproxy.com/open-webui/open-webui:main

1.3 Deployment Test

Start the vLLM server:

$ vllm serve /home/zhuxingda/Projets/Qwen3-4B --quantization fp8 --gpu-memory-utilization 0.6 --max-num-seqs 1 --max-model-len 4K --api-key 123456

Once it is up, query the API to confirm the deployment succeeded:

$ curl http://localhost:8000/v1/models -H "Authorization: Bearer 123456" | jq
{
  "object": "list",
  "data": [
    {
      "id": "/home/zhuxingda/Projets/Qwen3-4B",
      "object": "model",
      "created": 1748317788,
      "owned_by": "vllm",
      "root": "/home/zhuxingda/Projets/Qwen3-4B",
      "parent": null,
      "max_model_len": 4096,
      "permission": [
       ...
      ]
    }
  ]
}
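Beyond listing models, it is worth sanity-checking generation end to end with a chat completion. A minimal sketch using only the standard library against vLLM's OpenAI-compatible endpoint (the model id and API key mirror the serve command above; the prompt is arbitrary):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def send(payload: dict,
         base: str = "http://localhost:8000/v1",
         api_key: str = "123456") -> dict:
    """POST the payload to the local vLLM server and return the parsed reply."""
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example call (requires the server started above to be running):
# reply = send(build_chat_request("/home/zhuxingda/Projets/Qwen3-4B", "Hello!"))
# print(reply["choices"][0]["message"]["content"])
```

Note that vLLM uses the model path as the model id here, matching the `id` field returned by `/v1/models`.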

Start the Open WebUI container:

$ sudo docker run -d \
   --restart unless-stopped \
   --name open-webui \
   -p 0.0.0.0:8080:8080 \
   -v $(pwd)/data:/app/backend/data \
   -e WEBUI_SECRET_KEY=123456 \
   -e HF_ENDPOINT=https://hf-mirror.com \
   ghcr.nju.edu.cn/open-webui/open-webui:main

Once it is running, open http://host-name:8080, add a new OpenAI-compatible connection in the settings (base URL http://<host>:8000/v1, API key 123456), then add the model in the workspace and it is ready to use.

Watching the GPU monitor, however, shows that even though this is only a 4B model with FP8 quantization, the vLLM process still occupies more than 9 GB of VRAM:

$ nvidia-smi 
Tue May 27 12:32:42 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4070        Off | 00000000:01:00.0  On |                  N/A |
|  0%   37C    P8               7W / 200W |   9858MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    213900      C   ...envs/llm_hello_world/bin/python3.12     9852MiB |
+---------------------------------------------------------------------------------------+

Yet the vLLM startup log shows that loading the model weights took only 4.1 GiB — the difference is largely the KV-cache block pool that vLLM pre-allocates up front, sized to fill the --gpu-memory-utilization budget, plus CUDA context and CUDA-graph overhead.

INFO 05-25 23:37:44 [gpu_model_runner.py:1347] Model loading took 4.1402 GiB and 0.962489 seconds
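A back-of-the-envelope sketch of where the memory goes, under the flags used above. The Qwen3-4B architecture numbers (36 layers, 8 KV heads, head_dim 128, fp16 KV cache) are assumptions for illustration, not read from the model config:

```python
# Rough VRAM accounting for the serve command above.
# Architecture numbers for Qwen3-4B are assumed, not read from its config.
MIB = 1024 * 1024

total_vram_mib = 12282   # from nvidia-smi
util = 0.6               # --gpu-memory-utilization
weights_gib = 4.1402     # from the vLLM startup log

budget_mib = total_vram_mib * util        # what vLLM is allowed to claim
weights_mib = weights_gib * 1024
kv_budget_mib = budget_mib - weights_mib  # roughly: KV cache + activations

# Per-token KV-cache size: 2 tensors (K and V)
# x layers x kv_heads x head_dim x bytes per element (assumed fp16 cache)
layers, kv_heads, head_dim, dtype_bytes = 36, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # bytes
tokens_in_budget = int(kv_budget_mib * MIB / kv_per_token)

print(f"vLLM budget:       {budget_mib:.0f} MiB")
print(f"weights:           {weights_mib:.0f} MiB")
print(f"left for KV cache: {kv_budget_mib:.0f} MiB")
print(f"~tokens of KV cache that fit: {tokens_in_budget}")
```

Under these assumptions the budget is about 7.4 GiB, of which roughly 3 GiB is pre-claimed for KV-cache blocks on top of the 4.1 GiB of weights. The remaining gap up to the ~9.8 GiB reported by nvidia-smi is likely CUDA context, CUDA-graph capture, and allocator overhead that sits outside vLLM's profiled budget.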