Nvidia Driver Installation and Using Ollama

This article was last updated in the early hours of January 2, 2025.

According to feedback from a colleague, newer NVIDIA driver versions have compatibility problems, so Nvidia driver 525.147.05 needs to be installed; the process may require upgrading the kernel.

Installing the Nvidia Driver

Check which graphics card is present on the Debian machine.

lspci -nn | egrep -i "3d|display|vga"  
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD102 [GeForce RTX 4090] [10de:2684] (rev a1)

Check the details of the current driver setup.

lsmod | grep nouveau  
nouveau              2433024  0  
mxm_wmi                16384  1 nouveau  
i2c_algo_bit           16384  1 nouveau  
drm_display_helper    184320  1 nouveau  
drm_ttm_helper         16384  1 nouveau  
ttm                    94208  2 drm_ttm_helper,nouveau  
drm_kms_helper        204800  2 drm_display_helper,nouveau  
drm                   614400  5 drm_kms_helper,drm_display_helper,drm_ttm_helper,ttm,nouveau  
video                  65536  2 asus_wmi,nouveau  
wmi                    36864  5 video,asus_wmi,wmi_bmof,mxm_wmi,nouveau  
button                 24576  1 nouveau

It turns out the open-source nouveau driver is in use, which needs to be disabled first.

echo "blacklist nouveau" | sudo tee /etc/modprobe.d/nouveau-blacklist.conf
sudo update-initramfs -u
sudo update-grub
sudo reboot

After the reboot, lsmod | grep nouveau returns nothing, so nouveau has been successfully disabled.

Running sudo apt install nvidia-driver firmware-misc-nonfree to install the NVIDIA proprietary driver fails with an error.

Consult /var/lib/dkms/nvidia-current/525.147.05/build/make.log for more information.  
dpkg: error processing package nvidia-kernel-dkms (--configure):  
installed nvidia-kernel-dkms package post-installation script subprocess returned error exit status 10  
dpkg: dependency problems prevent configuration of nvidia-driver:  
nvidia-driver depends on nvidia-kernel-dkms (= 525.147.05-4~deb12u1) | nvidia-kernel-525.147.05 | nvidia-open-kernel-525.147.05 | nvidia-open-kernel-525.147.05; however:  
 Package nvidia-kernel-dkms is not configured yet.  
 Package nvidia-kernel-525.147.05 is not installed.  
 Package nvidia-kernel-dkms which provides nvidia-kernel-525.147.05 is not configured yet.  
 Package nvidia-open-kernel-525.147.05 is not installed.  
 Package nvidia-open-kernel-525.147.05 is not installed.  
  
dpkg: error processing package nvidia-driver (--configure):  
dependency problems - leaving unconfigured  
Processing triggers for libc-bin (2.36-9+deb12u4) ...  
Processing triggers for initramfs-tools (0.142) ...  
update-initramfs: Generating /boot/initrd.img-6.1.0-18-amd64  
Processing triggers for update-glx (1.2.2) ...  
Processing triggers for glx-alternative-nvidia (1.2.2) ...  
update-alternatives: using /usr/lib/nvidia to provide /usr/lib/glx (glx) in auto mode  
Processing triggers for glx-alternative-mesa (1.2.2) ...  
Processing triggers for libc-bin (2.36-9+deb12u4) ...  
Processing triggers for initramfs-tools (0.142) ...  
update-initramfs: Generating /boot/initrd.img-6.1.0-18-amd64  
Errors were encountered while processing:  
nvidia-kernel-dkms  
nvidia-driver  
E: Sub-process /usr/bin/dpkg returned an error code (1)

Confirm the Debian version with lsb_release -a:

No LSB modules are available.  
Distributor ID: Debian  
Description:    Debian GNU/Linux 12 (bookworm)  
Release:        12  
Codename:       bookworm

According to an answer on Stack Exchange [3], the safe way to upgrade the Debian kernel is to install it from backports [1].

echo "deb http://deb.debian.org/debian bookworm-backports main" | sudo tee /etc/apt/sources.list.d/debian-backports.list
sudo apt update
sudo apt install -t bookworm-backports linux-image-amd64
sudo reboot

After rebooting, uname -a shows that the kernel has been upgraded successfully.

uname -a                                                                          
Linux debian 6.7.12+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.7.12-1~bpo12+1 (2024-05-06) x86_64 GNU/Linux

Reinstall the NVIDIA proprietary driver with sudo apt install nvidia-driver firmware-misc-nonfree; this time there are no errors.

nvidia-smi    

NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0

The 525 proprietary driver seems problematic: after a while the machine spontaneously reboots, and dmesg shows the error ACPI BIOS Error (bug).

Searching for the error online, some people attribute it to the 525 driver (unconfirmed). Debian has a newer Nvidia driver available, and running apt upgrade successfully brings it up to 535.

# nvidia-smi

NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2

After upgrading the Nvidia driver to 535, no spontaneous reboots have occurred so far.

Installing Ollama

Run the following command to install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

The download is very slow, so use a proxy instead.

export https_proxy=http://127.0.0.1:7890
export http_proxy=http://127.0.0.1:7890
curl -fsSL https://ollama.com/install.sh | sh

With the proxy in place, Ollama installs quickly.

>>> Downloading ollama...  
######################################################################## 100.0%#=#=-#  #                                                                        
>>> Installing ollama to /usr/local/bin...  
>>> Creating ollama user...  
>>> Adding ollama user to render group...  
>>> Adding ollama user to video group...  
>>> Adding current user to ollama group...  
>>> Creating ollama systemd service...  
>>> Enabling and starting ollama service...  
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.  
>>> NVIDIA GPU installed.
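Note that the proxy variables exported above only affect the install script in the current shell. Model downloads via ollama pull are performed by the ollama server process, which runs as a systemd service, so if pulls are also slow the proxy has to be passed to the service itself. A minimal sketch, assuming the same local proxy on 127.0.0.1:7890 (the Ollama FAQ describes this approach):

sudo systemctl edit ollama        # opens an override file for the ollama service
# in the editor that opens, add:
# [Service]
# Environment="HTTPS_PROXY=http://127.0.0.1:7890"
sudo systemctl restart ollama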

Download the llama3 8b and qwen2 7b models with Ollama by running the following commands:

ollama pull llama3
ollama pull qwen2:7b
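Once the pulls finish, ollama list shows the models available locally, one line per model with its name, ID, size, and modification time:

ollama list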

Test the llama3 model; it runs normally.

ollama run llama3  
>>> hi  
Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?

Upgrading Ollama

Ollama 0.3.0 supports tool calling with llama3.1, so the upgrade is worth doing. See [4].

sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/local/bin/ollama
sudo chmod +x /usr/local/bin/ollama

After the upgrade, restart the Ollama service.

sudo systemctl daemon-reload
sudo systemctl restart ollama

Newer Ollama releases are no longer a single file but a tar.gz archive.
The tar.gz bundles the shared libraries Ollama needs at runtime; the old libraries should be removed before upgrading.

sudo rm -rf /usr/lib/ollama
sudo rm -rf /usr/local/lib/ollama

curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
sudo tar -C /usr/local -xzf ollama-linux-amd64.tgz

After running the commands above, start ollama and check the version to confirm the upgrade succeeded.

ollama serve
ollama -v

Configuring Ollama

If you want to call the Ollama API from a browser extension (such as Immersive Translate), cross-origin access is involved and the Ollama configuration needs to be changed.

The official documentation mentions the relevant settings [5]. Edit /etc/systemd/system/ollama.service directly with vim and add the following:

Environment="OLLAMA_HOST=*"

Restart the Ollama service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

On the remote host, check which port Ollama is listening on:

apt install net-tools
netstat -antp | grep -i ollama

Ollama listens on 127.0.0.1:11434 by default.

tcp        0      0 127.0.0.1:11434         0.0.0.0:*               LISTEN      50508/ollama

Use SSH to forward port 11434, which Ollama listens on on the remote host, to the local 127.0.0.1:11434:

ssh -N -g -L 127.0.0.1:11434:127.0.0.1:11434 root@1.1.1.1  # replace 1.1.1.1 with your IP
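With the tunnel up, the forwarded API can be checked from the local machine with a quick curl call (this assumes the llama3 model pulled earlier is present on the remote side):

curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'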

Uninstalling Ollama

Stop the Ollama service:

sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service

Remove the binary:

sudo rm $(which ollama)

Remove the Ollama user, group, and data directory:

sudo rm -r /usr/share/ollama
sudo userdel ollama
sudo groupdel ollama
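If Ollama was installed or upgraded from the tar.gz (as in the upgrade section above), the bundled shared libraries should be removed as well:

sudo rm -rf /usr/local/lib/ollama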

Using Ollama

OLLAMA_ORIGINS=* OLLAMA_NUM_PARALLEL=16 ollama serve
  • OLLAMA_ORIGINS: cross-origin (CORS) setting
  • OLLAMA_NUM_PARALLEL: number of parallel requests to support
  • OLLAMA_DEBUG: print debug information
  • OLLAMA_LLM_LIBRARY: supports the options rocm_v6 cpu cpu_avx cpu_avx2 cuda_v11 rocm_v5
  • OLLAMA_KEEP_ALIVE: how long a model stays loaded in VRAM, 5 minutes by default
  • OLLAMA_GPU_OVERHEAD: VRAM reserved per GPU, in bytes

https://github.com/ollama/ollama/blob/main/envconfig/config.go

func AsMap() map[string]EnvVar {
	ret := map[string]EnvVar{
		"OLLAMA_DEBUG":             {"OLLAMA_DEBUG", Debug(), "Show additional debug information (e.g. OLLAMA_DEBUG=1)"},
		"OLLAMA_FLASH_ATTENTION":   {"OLLAMA_FLASH_ATTENTION", FlashAttention(), "Enabled flash attention"},
		"OLLAMA_HOST":              {"OLLAMA_HOST", Host(), "IP Address for the ollama server (default 127.0.0.1:11434)"},
		"OLLAMA_KEEP_ALIVE":        {"OLLAMA_KEEP_ALIVE", KeepAlive(), "The duration that models stay loaded in memory (default \"5m\")"},
		"OLLAMA_LLM_LIBRARY":       {"OLLAMA_LLM_LIBRARY", LLMLibrary(), "Set LLM library to bypass autodetection"},
		"OLLAMA_MAX_LOADED_MODELS": {"OLLAMA_MAX_LOADED_MODELS", MaxRunners(), "Maximum number of loaded models per GPU"},
		"OLLAMA_MAX_QUEUE":         {"OLLAMA_MAX_QUEUE", MaxQueue(), "Maximum number of queued requests"},
		"OLLAMA_MODELS":            {"OLLAMA_MODELS", Models(), "The path to the models directory"},
		"OLLAMA_NOHISTORY":         {"OLLAMA_NOHISTORY", NoHistory(), "Do not preserve readline history"},
		"OLLAMA_NOPRUNE":           {"OLLAMA_NOPRUNE", NoPrune(), "Do not prune model blobs on startup"},
		"OLLAMA_NUM_PARALLEL":      {"OLLAMA_NUM_PARALLEL", NumParallel(), "Maximum number of parallel requests"},
		"OLLAMA_ORIGINS":           {"OLLAMA_ORIGINS", Origins(), "A comma separated list of allowed origins"},
		"OLLAMA_RUNNERS_DIR":       {"OLLAMA_RUNNERS_DIR", RunnersDir(), "Location for runners"},
		"OLLAMA_SCHED_SPREAD":      {"OLLAMA_SCHED_SPREAD", SchedSpread(), "Always schedule model across all GPUs"},
		"OLLAMA_TMPDIR":            {"OLLAMA_TMPDIR", TmpDir(), "Location for temporary files"},
	}
	if runtime.GOOS != "darwin" {
		ret["CUDA_VISIBLE_DEVICES"] = EnvVar{"CUDA_VISIBLE_DEVICES", CudaVisibleDevices(), "Set which NVIDIA devices are visible"}
		ret["HIP_VISIBLE_DEVICES"] = EnvVar{"HIP_VISIBLE_DEVICES", HipVisibleDevices(), "Set which AMD devices are visible"}
		ret["ROCR_VISIBLE_DEVICES"] = EnvVar{"ROCR_VISIBLE_DEVICES", RocrVisibleDevices(), "Set which AMD devices are visible"}
		ret["GPU_DEVICE_ORDINAL"] = EnvVar{"GPU_DEVICE_ORDINAL", GpuDeviceOrdinal(), "Set which AMD devices are visible"}
		ret["HSA_OVERRIDE_GFX_VERSION"] = EnvVar{"HSA_OVERRIDE_GFX_VERSION", HsaOverrideGfxVersion(), "Override the gfx used for all detected AMD GPUs"}
		ret["OLLAMA_INTEL_GPU"] = EnvVar{"OLLAMA_INTEL_GPU", IntelGPU(), "Enable experimental Intel GPU detection"}
	}
	return ret
}

Use ollama serve -h to see the environment variables that are officially supported.

ollama serve -h  
Start ollama  
  
Usage:  
 ollama serve [flags]  
  
Aliases:  
 serve, start  
  
Flags:  
 -h, --help   help for serve  
  
Environment Variables:  
     OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)  
     OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)  
     OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")  
     OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models per GPU  
     OLLAMA_MAX_QUEUE           Maximum number of queued requests  
     OLLAMA_MODELS              The path to the models directory  
     OLLAMA_NUM_PARALLEL        Maximum number of parallel requests  
     OLLAMA_NOPRUNE             Do not prune model blobs on startup  
     OLLAMA_ORIGINS             A comma separated list of allowed origins  
     OLLAMA_SCHED_SPREAD        Always schedule model across all GPUs  
                                   
     OLLAMA_FLASH_ATTENTION     Enabled flash attention  
     OLLAMA_KV_CACHE_TYPE       Quantization type for the K/V cache (default: f16)  
     OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection  
     OLLAMA_GPU_OVERHEAD        Reserve a portion of VRAM per GPU (bytes)  
     OLLAMA_LOAD_TIMEOUT        How long to allow model loads to stall before giving up (default "5m")

Changing the Location of Ollama's Model Files

  • The OLLAMA_MODELS environment variable changes where model files are downloaded (see the sketch after the commands below).

First move the already-downloaded model files to the newly created /mnt/disk; by default they are stored in ~/.ollama/models
(with the official Ollama install script, the default path is /usr/share/ollama/.ollama/).

mv ~/.ollama/ /mnt/disk/
ln -s /mnt/disk/.ollama ~/.ollama
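The symlink keeps the default path working. An alternative, if you would rather avoid the symlink, is to point OLLAMA_MODELS at the new directory when starting the server (or via an Environment= line in the systemd unit); a sketch, assuming the models were moved as above and the directory is writable by the user running ollama:

OLLAMA_MODELS=/mnt/disk/.ollama/models ollama serve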

Restart Ollama and test:

NVIDIA_VISIBLE_DEVICES=all CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OLLAMA_ORIGINS=* OLLAMA_NUM_PARALLEL=16 OLLAMA_KEEP_ALIVE=10m ollama serve

The test succeeds.

root@pm-65c50001:~# ollama run llama3:8b-instruct-fp16  
>>> hi  
Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?

Investigating Ollama with Multiple GPUs

Some fairly important environment variables turn up in https://github.com/ollama/ollama/issues/4198.

[Unit]  
Description=Ollama Service  
After=network-online.target

[Service]  
Environment="OLLAMA_HOST=0.0.0.0:11434"  
Environment="OLLAMA_ORIGINS='*'"  
Environment="OLLAMA_MODELS=/ollama/ollama/models"  
Environment="OLLAMA_KEEP_ALIVE=10m"  
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"  
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"  
ExecStart=/usr/local/bin/ollama serve  
User=ollama  
Group=ollama  
Restart=always  
RestartSec=3  
Environment="PATH=/root/.local/bin:/root/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"

[Install]  
WantedBy=default.target

Putting it all together, I decided on the following command line:

NVIDIA_VISIBLE_DEVICES=all CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OLLAMA_ORIGINS=* OLLAMA_NUM_PARALLEL=16 OLLAMA_KEEP_ALIVE=10m ollama serve

How to Tell Whether a Model Is Loaded onto the GPU

ollama ps
NAME      	ID          	SIZE 	PROCESSOR	UNTIL
llama3:70b	bcfb190ca3a7	42 GB	100% GPU 	4 minutes from now
  • A readout like 48%/52% CPU/GPU means the model is only partially loaded onto the GPU, with the rest held in system memory; see the nvidia-smi check below.
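To cross-check from the GPU side, nvidia-smi shows how much VRAM the ollama runner process is holding on each card:

watch -n 1 nvidia-smi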

Kernel Setting: numa_balancing

Ollama 0.3.6 fails to load Llama 3.1 405b: the interface just hangs. A similar issue appears on GitHub.

https://github.com/ollama/ollama/issues/6425

time=2024-08-29T10:50:24.720+08:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-29T10:50:24.720+08:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-29T10:50:24.720+08:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance

The issue above mentions that numa_balancing has an impact; disabling it by hand and using Ollama 0.3.8 resolves the problem.

echo 0 > /proc/sys/kernel/numa_balancing  # or
sysctl -w kernel.numa_balancing=0
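Both commands only apply until the next reboot. To make the setting persistent, one option (assuming a standard sysctl setup; the file name below is arbitrary) is a drop-in under /etc/sysctl.d:

echo "kernel.numa_balancing = 0" | sudo tee /etc/sysctl.d/99-numa-balancing.conf
sudo sysctl --system    # reload sysctl settings immediately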

Reading up on what numa_balancing does, it appears to be a memory-placement optimization.
https://docs.kernel.org/admin-guide/sysctl/kernel.html#numa-balancing

Enables/disables and configures automatic page fault based NUMA memory balancing

Starting around 0.3.7, Ollama is no longer shipped as a standalone x86_64 ELF but as a tar.gz archive:
https://github.com/ollama/ollama/releases/download/v0.3.8/ollama-linux-amd64.tgz

# cd root
# tar zxvf ollama-linux-amd64.tgz    
./  
./lib/  
./lib/ollama/  
./lib/ollama/libcublas.so.12.4.2.65  
./lib/ollama/libcublasLt.so.11  
./lib/ollama/libcublas.so.11.5.1.109  
./lib/ollama/libcudart.so.11.3.109  
./lib/ollama/libcublas.so.12  
./lib/ollama/libcublasLt.so  
./lib/ollama/libcublas.so.11  
./lib/ollama/libcublas.so  
./lib/ollama/libcudart.so  
./lib/ollama/libcublasLt.so.12  
./lib/ollama/libcublasLt.so.11.5.1.109  
./lib/ollama/libcudart.so.11.0  
./lib/ollama/libcudart.so.12.4.99  
./lib/ollama/libcudart.so.12  
./lib/ollama/libcublasLt.so.12.4.2.65  
./bin/  
./bin/ollama
#export LD_LIBRARY_PATH=/root/lib/ollama/
OLLAMA_DEBUG=1 NVIDIA_VISIBLE_DEVICES=all CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OLLAMA_ORIGINS=* OLLAMA_KEEP_ALIVE=10m /root/bin/ollama serve

After 34 seconds, Llama 3.1 405b starts up successfully.

time=2024-08-29T15:07:02.113+08:00 level=INFO source=server.go:630 msg="llama runner started in 34.94 seconds"

References

[0] NvidiaGraphicsDrivers
https://wiki.debian.org/NvidiaGraphicsDrivers

[1] Debian backports Instructions
https://backports.debian.org/Instructions/

[2] Ollama on Linux
https://github.com/ollama/ollama/blob/main/docs/linux.md

[3] Is it possible & safe to use latest kernel with Debian?
https://unix.stackexchange.com/questions/725783/is-it-possible-safe-to-use-latest-kernel-with-debian

[4] Ollama v0.3.0 release note
https://github.com/ollama/ollama/releases/tag/v0.3.0

[5] Ollama FAQ
https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server

