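Given your hardware (RTX 3070 with 16GB VRAM, 48GB RAM, 5800X), llama.cpp's CPU offload lets you run a far wider range of models than a GPU-only setup. The recommendations below are ordered from strongest to weakest.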

***

## Hardware capability analysis

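Your key advantage is the **combination of 16GB VRAM and 48GB RAM**: the less speed-critical layers of large models (the FFN expert tensors, `exps`) can be offloaded to the CPU, which is exactly what `-ot "\.ffn_(up|down|gate)_exps\.=CPU"` in your current launch command does  [huggingface](https://huggingface.co/ubergarm/Qwen3-30B-A3B-GGUF). Your current Qwen3-30B-A3B Q4_K_M run uses only about 4.8GB of VRAM at 30 t/s, so the GPU still has plenty of headroom  [dev](https://dev.to/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg).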
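To make the effect of that `-ot` pattern concrete, here is a minimal sketch that applies the same regular expression to a handful of tensor names. The tensor names are illustrative examples in the usual GGUF naming style, not read from an actual model file, and the "everything else goes to the GPU" fallback is an assumption that holds only when all layers are otherwise offloaded to the GPU.

```python
import re

# Override string copied from the -ot flag: "<regex>=<device>".
OVERRIDE = r"\.ffn_(up|down|gate)_exps\.=CPU"
pattern_text, device = OVERRIDE.rsplit("=", 1)
pattern = re.compile(pattern_text)

# Illustrative tensor names (assumed, not extracted from a real GGUF file).
tensor_names = [
    "blk.0.attn_q.weight",
    "blk.0.attn_output.weight",
    "blk.0.ffn_gate_inp.weight",   # router weights, not an expert tensor
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "blk.0.ffn_gate_exps.weight",
]

def placement(name: str) -> str:
    """Return the override device if the pattern matches anywhere in the name.

    Falling back to "GPU" is an assumption for this sketch.
    """
    return device if pattern.search(name) else "GPU"

placements = {name: placement(name) for name in tensor_names}

for name, where in placements.items():
    print(f"{name:30s} -> {where}")
```

Only the three `*_exps` tensors match, so attention and router weights stay on the GPU while the large expert weights live in system RAM. That is consistent with the low VRAM usage reported for the current setup.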

***

## Recommended runnable models

Ranked by **overall capability** (either full-GPU inference or hybrid GPU+CPU inference):

### Flagship MoE models (CPU offload required)

| Model | Parameters | Suggested quantization | VRAM usage | Inference speed | Notes |
|------|--------|----------|---------|---------|------|
| **Qwen3.5-122B-A10B** | 122B / 10B active | UD-IQ3_XXS (~44.7GB) | ~14.7GB GPU | ~20-22 t/s | Strongest MoE for 16GB VRAM; quality far ahead of 27B models  [dev](https://dev.to/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg) |
| **Mistral-Small-4-119B** | 119B MoE | UD-IQ3_XXS (~42.8GB) | ~14.8GB GPU | ~28-30 t/s | Non-Qwen alternative with strong coding ability  [dev](https://dev.to/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg) |
| **Nemotron Super 120B** | 120B | IQ3_XXS (~56.2GB) | ~15GB GPU | ~17 t/s | Needs about 56GB across VRAM and RAM; with ~15GB on the GPU, roughly 41GB must fit in your 48GB of RAM, which is tight  [dev](https://dev.to/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg) |
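The RAM fit for these three models can be checked with simple arithmetic. The sketch below uses only the file sizes and GPU figures from the table and treats `file size - GPU usage` as a lower bound on the RAM needed for weights. This is an assumption: the reported GPU usage also includes KV cache and compute buffers, and the operating system needs RAM of its own, so real headroom will be smaller.

```python
from dataclasses import dataclass

RAM_GB = 48.0  # system RAM from the hardware description

@dataclass
class Candidate:
    name: str
    file_gb: float  # GGUF file size from the table
    gpu_gb: float   # reported GPU memory usage from the table

    @property
    def min_ram_gb(self) -> float:
        """Lower bound on RAM for weights: whatever does not sit on the GPU."""
        return round(self.file_gb - self.gpu_gb, 1)

    @property
    def headroom_gb(self) -> float:
        """RAM left over for the OS, caches, and everything else."""
        return round(RAM_GB - self.min_ram_gb, 1)

candidates = [
    Candidate("Qwen3.5-122B-A10B", 44.7, 14.7),
    Candidate("Mistral-Small-4-119B", 42.8, 14.8),
    Candidate("Nemotron Super 120B", 56.2, 15.0),
]

for c in candidates:
    print(f"{c.name:22s} RAM >= {c.min_ram_gb:5.1f} GB, headroom <= {c.headroom_gb:5.1f} GB")
```

By this estimate the first two models leave a comfortable margin, while Nemotron Super 120B leaves only a few gigabytes, which matches the "tight" note in the table.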

### High-performance MoE models with few active parameters (currently in use, upgradable)

| Model | Parameters | Suggested quantization | VRAM usage | Inference speed | Notes |
|------|--------|----------|---------|---------|------|
| **Qwen3.6-35B-A3B** | 35B / 3B active | UD-IQ3_XXS (~13.2GB) | ~14.7GB GPU | ~145 t/s | Extremely fast, close to full-GPU speed  [dev](https://dev.to/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg) |
| **Qwen3.5-35B-A3B** | 35B / 3B active | UD-IQ3_S (~13.6GB) | ~14.9GB GPU | ~136 t/s | Can be pushed to 100k context without slowing down  [dev](https://dev.to/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg) |
| **Qwen3-30B-A3B** ✅ | 30B / 3B active | Q4_K_M (current) | ~4.8GB GPU | ~30 t/s | Current setup; try Q6_K for higher quality |
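As a rough guide to what the quantization names in this table mean in practice, the average number of bits stored per weight can be estimated from file size and parameter count. The sketch below assumes that "GB" means 10^9 bytes and that the nominal parameter counts are exact. Both are simplifications, and the result also includes metadata and any tensors kept at higher precision.

```python
def bits_per_weight(file_gb: float, params_billion: float) -> float:
    """Average bits per parameter, assuming GB = 1e9 bytes."""
    return file_gb * 1e9 * 8 / (params_billion * 1e9)

# (label, file size in GB, nominal parameters in billions), all from the table above.
rows = [
    ("Qwen3.6-35B-A3B UD-IQ3_XXS", 13.2, 35),
    ("Qwen3.5-35B-A3B UD-IQ3_S", 13.6, 35),
]

estimates = {label: bits_per_weight(size, params) for label, size, params in rows}

for label, bpw in estimates.items():
    print(f"{label:28s} ~{bpw:.2f} bits/weight")
```

Both 35B files come out at a little over 3 bits per weight, which is why they fit in 16GB of VRAM while a 4-bit quantization of a model this size generally would not.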

### High-quality dense and mid-size models (full GPU or light offload)

| Model | Parameters | Suggested quantization | VRAM usage | Inference speed | Notes |
|------|--------|----------|---------|---------|------|
| **Qwen3-14B** | 14B dense | Q8_0 (~15GB) | ~15GB GPU | ~40-50 t/s | Fully on GPU; best all-round dense option  [biton.co](https://www.biton.co.jp/blog_71.html) |
| **Gemma 4-26B-A4B** | 26B MoE | UD-IQ4_XS (~13.4GB) | ~14.7GB GPU | ~120 t/s | Multimodal lineage from Google; excellent speed  [dev](https://dev.to/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg) |
| **GLM-4.7-Flash REAP 23B** | 23B | IQ4_XS (~12.6GB) | ~13.7GB GPU | ~122 t/s | Strong Chinese-language ability; fits your use case  [dev](https://dev.to/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg) |
| **Qwen3-8B** | 8B dense | Q8_0 (~9GB) | ~9GB GPU | ~80+ t/s | Fallback for fast-response scenarios |

***

## Practical top picks

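**Best overall quality** (48GB RAM is enough): → **Qwen3.5-122B-A10B-UD-IQ3_XXS**. The model file is about 44.7GB, with about 14.7GB on the GPU and the rest held in system RAM; measured speed is about 20-22 t/s, and it is far more capable than the 30B model. [dev](https://dev.to/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg)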

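**Best balance of speed and quality**: → **Qwen3.6-35B-A3B-UD-IQ3_XXS** at ~145 t/s, running entirely on the GPU in about 13.8GB of VRAM. It is a direct upgrade from your current 30B setup. [glukhov](https://www.glukhov.org/llm-performance/benchmarks/best-llm-on-16gb-vram-gpu/)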

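Your current 30B run uses only 4.8GB of the 16GB of VRAM, which indicates that most of the FFN expert weights are offloaded to the CPU. I recommend trying the IQ3_XXS quantization of Qwen3.6-35B-A3B first: nearly all layers fit on the GPU, which gives the highest speed. [dev](https://dev.to/rosgluk/16-gb-vram-llm-benchmarks-with-llamacpp-speed-and-context-3hgg)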
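The difference between the current expert-offload launch and a full-GPU launch comes down to whether the `-ot` override is passed. The following is a minimal sketch under stated assumptions: that the server binary is `llama-server`, that the model file is named as shown (a hypothetical file name), and that a context size of 8192 is acceptable. Your actual launch command is not shown here, so adjust accordingly.

```python
import shlex

EXPERT_OFFLOAD = r"\.ffn_(up|down|gate)_exps\.=CPU"  # pattern from the current setup

def build_command(model_path: str, ctx: int = 8192, offload_experts: bool = False) -> list[str]:
    """Assemble a llama-server argument list.

    -m   model file
    -ngl number of layers to place on the GPU (99 = effectively all)
    -c   context size in tokens (8192 is an illustrative default)
    -ot  tensor override; only added when experts should stay on the CPU
    """
    cmd = ["llama-server", "-m", model_path, "-ngl", "99", "-c", str(ctx)]
    if offload_experts:
        cmd += ["-ot", EXPERT_OFFLOAD]
    return cmd

# Hypothetical file name; use the name of the GGUF file you actually downloaded.
MODEL = "Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf"

full_gpu = build_command(MODEL)
hybrid = build_command(MODEL, offload_experts=True)

print(shlex.join(full_gpu))
print(shlex.join(hybrid))
```

If the full-GPU variant runs out of VRAM at the context size you need, falling back to the hybrid variant trades speed for memory without changing the model file.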

Please show how to run Qwen3.6-35B-A3B. What is IQ3_XXS quantization, and how does its performance loss compare with the current