Installing the NVIDIA Driver and Using Ollama

This post was last updated in the early hours of January 2, 2025.

According to feedback from colleagues, newer NVIDIA driver versions have compatibility problems, so NVIDIA driver 525.147.05 needs to be installed; the process may also require upgrading the kernel.

Installing the NVIDIA Driver

Check which graphics card is present on the Debian machine.

lspci -nn | egrep -i "3d|display|vga"  
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)

Check the details of the current driver installation.

lsmod | grep nouveau  
nouveau              2433024  0
mxm_wmi                16384  1 nouveau
i2c_algo_bit           16384  1 nouveau
drm_display_helper    184320  1 nouveau
drm_ttm_helper         16384  1 nouveau
ttm                    94208  2 drm_ttm_helper,nouveau
drm_kms_helper        204800  2 drm_display_helper,nouveau
drm                   614400  5 drm_kms_helper,drm_display_helper,drm_ttm_helper,ttm,nouveau
video                  65536  2 asus_wmi,nouveau
wmi                    36864  5 video,asus_wmi,wmi_bmof,mxm_wmi,nouveau
button                 24576  1 nouveau

So the open-source nouveau driver is loaded; it needs to be disabled first.

echo "blacklist nouveau" | sudo tee /etc/modprobe.d/nouveau-blacklist.conf
sudo update-initramfs -u
sudo update-grub
sudo reboot

After rebooting, lsmod | grep nouveau returns nothing, so nouveau has been disabled successfully.

Installing the NVIDIA proprietary driver with sudo apt install nvidia-driver firmware-misc-nonfree failed with an error.

Consult /var/lib/dkms/nvidia-current/525.147.05/build/make.log for more information.  
dpkg: error processing package nvidia-kernel-dkms (--configure):
installed nvidia-kernel-dkms package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of nvidia-driver:
nvidia-driver depends on nvidia-kernel-dkms (= 525.147.05-4~deb12u1) | nvidia-kernel-525.147.05 | nvidia-open-kernel-525.147.05 | nvidia-open-kernel-525.147.05; however:
 Package nvidia-kernel-dkms is not configured yet.
 Package nvidia-kernel-525.147.05 is not installed.
 Package nvidia-kernel-dkms which provides nvidia-kernel-525.147.05 is not configured yet.
 Package nvidia-open-kernel-525.147.05 is not installed.
 Package nvidia-open-kernel-525.147.05 is not installed.

dpkg: error processing package nvidia-driver (--configure):
dependency problems - leaving unconfigured
Processing triggers for libc-bin (2.36-9+deb12u4) ...
Processing triggers for initramfs-tools (0.142) ...
update-initramfs: Generating /boot/initrd.img-6.1.0-18-amd64
Processing triggers for update-glx (1.2.2) ...
Processing triggers for glx-alternative-nvidia (1.2.2) ...
update-alternatives: using /usr/lib/nvidia to provide /usr/lib/glx (glx) in auto mode
Processing triggers for glx-alternative-mesa (1.2.2) ...
Processing triggers for libc-bin (2.36-9+deb12u4) ...
Processing triggers for initramfs-tools (0.142) ...
update-initramfs: Generating /boot/initrd.img-6.1.0-18-amd64
Errors were encountered while processing:
nvidia-kernel-dkms
nvidia-driver
E: Sub-process /usr/bin/dpkg returned an error code (1)

Confirm the Debian version with lsb_release -a.

No LSB modules are available.  
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm

According to an answer on Stack Exchange [3], the safe way to upgrade the Debian kernel is to install it from backports.

echo "deb http://deb.debian.org/debian bookworm-backports main" | sudo tee /etc/apt/sources.list.d/debian-backports.list
sudo apt update
sudo apt install -t bookworm-backports linux-image-amd64
sudo reboot
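
To see which kernel version the backports pocket actually provides, apt policy is handy:

apt policy linux-image-amd64   # shows installed and candidate versions per repository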

After rebooting, uname -a shows the kernel was upgraded successfully.

uname -a                                                                          
Linux debian 6.7.12+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.7.12-1~bpo12+1 (2024-05-06) x86_64 GNU/Linux

Reinstalling the NVIDIA proprietary driver with sudo apt install nvidia-driver firmware-misc-nonfree now succeeds without errors.

nvidia-smi    

NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0

Proprietary driver 525 seemed unstable: after running for a while, the machine would reboot on its own, and dmesg showed ACPI BIOS Error (bug).

Searching for the error, some users blamed the 525 driver (unconfirmed). Debian ships an updated NVIDIA driver, and running apt upgrade successfully moved the system to 535.
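
The sequence, roughly (the dmesg grep pattern is just an illustration of what to look for):

# look for the ACPI errors logged before the spontaneous reboots
sudo dmesg | grep -i "ACPI BIOS Error"

# pull in the newer driver packaged by Debian (535 here)
sudo apt update
sudo apt upgrade
sudo reboot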

# nvidia-smi

NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2

Since upgrading the NVIDIA driver to 535, the spontaneous reboots have not recurred.

Installing Ollama

Install Ollama with the following command:

curl -fsSL https://ollama.com/install.sh | sh

The download was very slow, so I routed it through a proxy.

export https_proxy=http://127.0.0.1:7890
export http_proxy=http://127.0.0.1:7890
curl -fsSL https://ollama.com/install.sh | sh

With the proxy in place, Ollama installed quickly.

>>> Downloading ollama...  
######################################################################## 100.0%
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.

Pull the llama3 8b and qwen2 7b models with the following commands:

ollama pull llama3
ollama pull qwen2:7b
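
To confirm the downloads, ollama list prints everything in the local model store:

ollama list   # shows each model's name, ID, size, and modification time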

Testing the llama3 model: it runs fine.

ollama run llama3  
>>> hi
Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?

Upgrading Ollama

Ollama 0.3.0 supports tool calling with llama3.1, which makes the upgrade worthwhile. See [4].

sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama

After the upgrade, restart the Ollama service.

sudo systemctl daemon-reload
sudo systemctl restart ollama
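
With the service back up, the new tool-calling support can be smoke-tested against the chat API. This is a minimal sketch based on the v0.3.0 release notes [4]; it assumes llama3.1 has been pulled, and the get_weather function here is made up for illustration:

curl http://127.0.0.1:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'

If the model decides to call the tool, the response message carries a tool_calls field instead of plain text.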

Newer Ollama releases are no longer a single file but a tar.gz archive.
The tarball bundles the shared libraries Ollama needs at runtime, so delete the old libraries before upgrading.

sudo rm -rf /usr/lib/ollama
sudo rm -rf /usr/local/lib/ollama

curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
sudo tar -C /usr/local -xzf ollama-linux-amd64.tgz

After running the commands above, start Ollama and check the version to confirm the upgrade succeeded.

ollama serve
ollama -v

Configuring Ollama

To call the Ollama API from a browser extension (such as Immersive Translate), cross-origin requests are involved, so Ollama's configuration must be changed.

The official docs cover the relevant settings [5]. Edit /etc/systemd/system/ollama.service directly with vim and add the following:

Environment="OLLAMA_HOST=*"

Restart the Ollama service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

On the remote host, check which port Ollama is listening on:

apt install net-tools
netstat -antp | grep -i ollama

By default, Ollama listens on 127.0.0.1:11434.

tcp        0      0 127.0.0.1:11434         0.0.0.0:*               LISTEN      50508/ollama

Use SSH to forward the remote host's Ollama port 11434 to 127.0.0.1:11434 locally.

ssh -N -g -L 127.0.0.1:11434:127.0.0.1:11434 root@1.1.1.1  # replace 1.1.1.1 with your IP
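
With the tunnel up, the API can be verified end-to-end from the local machine:

curl http://127.0.0.1:11434/api/version   # returns the server version as JSON
curl http://127.0.0.1:11434/api/tags      # lists the models available on the remote host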

Uninstalling Ollama

Stop the Ollama service:

sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service

Remove the binary:

sudo rm $(which ollama)

Remove the Ollama user and group:

sudo rm -r /usr/share/ollama
sudo userdel ollama
sudo groupdel ollama

Using Ollama

Ollama's runtime behavior is controlled through environment variables, for example:

OLLAMA_ORIGINS=* OLLAMA_NUM_PARALLEL=16 ollama serve
  • OLLAMA_ORIGINS — cross-origin (CORS) settings
  • OLLAMA_NUM_PARALLEL — number of parallel requests supported
  • OLLAMA_DEBUG — print debug information
  • OLLAMA_LLM_LIBRARY — supports the following options: rocm_v6 cpu cpu_avx cpu_avx2 cuda_v11 rocm_v5
  • OLLAMA_KEEP_ALIVE — how long a model stays loaded in VRAM, 5 minutes by default
  • OLLAMA_GPU_OVERHEAD — VRAM reserved separately on each GPU, in bytes

https://github.com/ollama/ollama/blob/main/envconfig/config.go

func AsMap() map[string]EnvVar {
	ret := map[string]EnvVar{
		"OLLAMA_DEBUG":             {"OLLAMA_DEBUG", Debug(), "Show additional debug information (e.g. OLLAMA_DEBUG=1)"},
		"OLLAMA_FLASH_ATTENTION":   {"OLLAMA_FLASH_ATTENTION", FlashAttention(), "Enabled flash attention"},
		"OLLAMA_HOST":              {"OLLAMA_HOST", Host(), "IP Address for the ollama server (default 127.0.0.1:11434)"},
		"OLLAMA_KEEP_ALIVE":        {"OLLAMA_KEEP_ALIVE", KeepAlive(), "The duration that models stay loaded in memory (default \"5m\")"},
		"OLLAMA_LLM_LIBRARY":       {"OLLAMA_LLM_LIBRARY", LLMLibrary(), "Set LLM library to bypass autodetection"},
		"OLLAMA_MAX_LOADED_MODELS": {"OLLAMA_MAX_LOADED_MODELS", MaxRunners(), "Maximum number of loaded models per GPU"},
		"OLLAMA_MAX_QUEUE":         {"OLLAMA_MAX_QUEUE", MaxQueue(), "Maximum number of queued requests"},
		"OLLAMA_MODELS":            {"OLLAMA_MODELS", Models(), "The path to the models directory"},
		"OLLAMA_NOHISTORY":         {"OLLAMA_NOHISTORY", NoHistory(), "Do not preserve readline history"},
		"OLLAMA_NOPRUNE":           {"OLLAMA_NOPRUNE", NoPrune(), "Do not prune model blobs on startup"},
		"OLLAMA_NUM_PARALLEL":      {"OLLAMA_NUM_PARALLEL", NumParallel(), "Maximum number of parallel requests"},
		"OLLAMA_ORIGINS":           {"OLLAMA_ORIGINS", Origins(), "A comma separated list of allowed origins"},
		"OLLAMA_RUNNERS_DIR":       {"OLLAMA_RUNNERS_DIR", RunnersDir(), "Location for runners"},
		"OLLAMA_SCHED_SPREAD":      {"OLLAMA_SCHED_SPREAD", SchedSpread(), "Always schedule model across all GPUs"},
		"OLLAMA_TMPDIR":            {"OLLAMA_TMPDIR", TmpDir(), "Location for temporary files"},
	}
	if runtime.GOOS != "darwin" {
		ret["CUDA_VISIBLE_DEVICES"] = EnvVar{"CUDA_VISIBLE_DEVICES", CudaVisibleDevices(), "Set which NVIDIA devices are visible"}
		ret["HIP_VISIBLE_DEVICES"] = EnvVar{"HIP_VISIBLE_DEVICES", HipVisibleDevices(), "Set which AMD devices are visible"}
		ret["ROCR_VISIBLE_DEVICES"] = EnvVar{"ROCR_VISIBLE_DEVICES", RocrVisibleDevices(), "Set which AMD devices are visible"}
		ret["GPU_DEVICE_ORDINAL"] = EnvVar{"GPU_DEVICE_ORDINAL", GpuDeviceOrdinal(), "Set which AMD devices are visible"}
		ret["HSA_OVERRIDE_GFX_VERSION"] = EnvVar{"HSA_OVERRIDE_GFX_VERSION", HsaOverrideGfxVersion(), "Override the gfx used for all detected AMD GPUs"}
		ret["OLLAMA_INTEL_GPU"] = EnvVar{"OLLAMA_INTEL_GPU", IntelGPU(), "Enable experimental Intel GPU detection"}
	}
	return ret
}

You can run ollama serve -h to list the officially supported environment variables.

ollama serve -h  
Start ollama

Usage:
 ollama serve [flags]

Aliases:
 serve, start

Flags:
 -h, --help   help for serve

Environment Variables:
     OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)
     OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
     OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")
     OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models per GPU
     OLLAMA_MAX_QUEUE           Maximum number of queued requests
     OLLAMA_MODELS              The path to the models directory
     OLLAMA_NUM_PARALLEL        Maximum number of parallel requests
     OLLAMA_NOPRUNE             Do not prune model blobs on startup
     OLLAMA_ORIGINS             A comma separated list of allowed origins
     OLLAMA_SCHED_SPREAD        Always schedule model across all GPUs
                                 
     OLLAMA_FLASH_ATTENTION     Enabled flash attention
     OLLAMA_KV_CACHE_TYPE       Quantization type for the K/V cache (default: f16)
     OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection
     OLLAMA_GPU_OVERHEAD        Reserve a portion of VRAM per GPU (bytes)
     OLLAMA_LOAD_TIMEOUT        How long to allow model loads to stall before giving up (default "5m")

Changing Where Ollama Stores Model Files

  • The OLLAMA_MODELS environment variable changes where model files are downloaded.

First move the already-downloaded model files to the newly created /mnt/disk. The default location is ~/.ollama/models
(when installed with the official script, the default path is /usr/share/ollama/.ollama/).

mv ~/.ollama/ /mnt/disk/
ln -s /mnt/disk/.ollama ~/.ollama
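
Alternatively, skip the symlink and point OLLAMA_MODELS directly at the new location when starting the server (the path matches the move above):

OLLAMA_MODELS=/mnt/disk/.ollama/models ollama serve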

Restart Ollama and test:

NVIDIA_VISIBLE_DEVICES=all CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OLLAMA_ORIGINS=* OLLAMA_NUM_PARALLEL=16 OLLAMA_KEEP_ALIVE=10m ollama serve

The test succeeds.

root@pm-65c50001:~# ollama run llama3:8b-instruct-fp16  
>>> hi
Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?

Investigating Ollama with Multiple GPUs

Some important environment variables turn up in https://github.com/ollama/ollama/issues/4198.

[Unit]  
Description=Ollama Service
After=network-online.target

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS='*'"
Environment="OLLAMA_MODELS=/ollama/ollama/models"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/root/.local/bin:/root/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"

[Install]
WantedBy=default.target

Putting it all together, I settled on the following command line:

NVIDIA_VISIBLE_DEVICES=all CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OLLAMA_ORIGINS=* OLLAMA_NUM_PARALLEL=16 OLLAMA_KEEP_ALIVE=10m ollama serve

How to Tell Whether a Model Is Loaded into the GPU

ollama ps
NAME          ID              SIZE     PROCESSOR    UNTIL
llama3:70b    bcfb190ca3a7    42 GB    100% GPU     4 minutes from now
  • A PROCESSOR value like 48%/52% CPU/GPU means the model is only partially loaded into the GPU; the rest sits in system RAM.
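
nvidia-smi offers a second opinion: if the model really is resident on the GPUs, VRAM usage rises accordingly while it loads:

watch -n 1 nvidia-smi   # refresh GPU memory usage once per second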

Kernel Setting: numa_balancing

Ollama 0.3.6 failed to load Llama 3.1 405b; the console hung indefinitely. A similar issue turned up on GitHub.

https://github.com/ollama/ollama/issues/6425

time=2024-08-29T10:50:24.720+08:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-29T10:50:24.720+08:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-29T10:50:24.720+08:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance

The issue above mentions that this setting has an impact; disabling it manually and using Ollama 0.3.8 resolved the problem.

echo 0 > /proc/sys/kernel/numa_balancing  # or
sysctl -w kernel.numa_balancing=0
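
To make the change survive a reboot, the standard sysctl drop-in mechanism works (the file name is arbitrary):

echo "kernel.numa_balancing = 0" | sudo tee /etc/sysctl.d/99-numa-balancing.conf
sudo sysctl --system   # reload all sysctl configuration files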

Looking into what numa_balancing does, it appears to be a memory-placement optimization:
https://docs.kernel.org/admin-guide/sysctl/kernel.html#numa-balancing

Enables/disables and configures automatic page fault based NUMA memory balancing

Starting with 0.3.7, Ollama is no longer a standalone x86_64 ELF but a tar.gz:
https://github.com/ollama/ollama/releases/download/v0.3.8/ollama-linux-amd64.tgz

# cd root
# tar zxvf ollama-linux-amd64.tgz  
./
./lib/
./lib/ollama/
./lib/ollama/libcublas.so.12.4.2.65
./lib/ollama/libcublasLt.so.11
./lib/ollama/libcublas.so.11.5.1.109
./lib/ollama/libcudart.so.11.3.109
./lib/ollama/libcublas.so.12
./lib/ollama/libcublasLt.so
./lib/ollama/libcublas.so.11
./lib/ollama/libcublas.so
./lib/ollama/libcudart.so
./lib/ollama/libcublasLt.so.12
./lib/ollama/libcublasLt.so.11.5.1.109
./lib/ollama/libcudart.so.11.0
./lib/ollama/libcudart.so.12.4.99
./lib/ollama/libcudart.so.12
./lib/ollama/libcublasLt.so.12.4.2.65
./bin/
./bin/ollama
#export LD_LIBRARY_PATH=/root/lib/ollama/
OLLAMA_DEBUG=1 NVIDIA_VISIBLE_DEVICES=all CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OLLAMA_ORIGINS=* OLLAMA_KEEP_ALIVE=10m /root/bin/ollama serve

34 seconds later, Llama 3.1 405b started successfully.

time=2024-08-29T15:07:02.113+08:00 level=INFO source=server.go:630 msg="llama runner started in 34.94 seconds"

References

[0] NvidiaGraphicsDrivers
https://wiki.debian.org/NvidiaGraphicsDrivers

[1] Debian backports instructions
https://backports.debian.org/Instructions/

[2] Ollama on Linux
https://github.com/ollama/ollama/blob/main/docs/linux.md

[3] Is it possible & safe to use latest kernel with Debian?
https://unix.stackexchange.com/questions/725783/is-it-possible-safe-to-use-latest-kernel-with-debian

[4] Ollama v0.3.0 release note
https://github.com/ollama/ollama/releases/tag/v0.3.0

[5] Ollama FAQ
https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server

