I created a virtual machine on VMware ESXi 7.0.2 running CentOS 7 (kernel 3.10.0-1160.el7.x86_64). I need to install the NVIDIA GPU driver in this VM, but after running the .run installer, nvidia-smi reports:
> No devices were found
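I have gone through a lot of material online and changed many settings, including but not limited to:
- Installing the base dependencies
- Setting the GPUs to passthrough in the ESXi management page
- Disabling nouveau
- Trying several different driver versions
- Adding the following parameters under VM Options > Advanced > Configuration Parameters in the VM settings:
  - pciPassthru.64bitMMIOSizeGB = 64
  - pciPassthru.use64bitMMIO = TRUE
- Confirming that the driver module was built for the running kernel version: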
# uname -r
3.10.0-1160.el7.x86_64
# modinfo nvidia | grep vermagic
vermagic: 3.10.0-1160.el7.x86_64 SMP mod_unload modversions
However, the final check still reports: No devices were found
The individual status checks are as follows:
lsmod | grep nouveau
(no output)
lspci | grep -i nvidia
0b:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
13:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
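The lspci listing above only shows that both GPUs are enumerated, not whether their memory regions were assigned addresses. A further check that could be run is to read each device's `resource` file from sysfs. The following is a minimal sketch, assuming the usual Linux sysfs layout in which each line of that file holds a `start end flags` triple and the first six lines are generally the standard BARs. `show_bars` is a hypothetical helper name, and how its region indices line up with the BAR numbers printed by NVRM has not been verified here.

```shell
# show_bars FILE: for each "start end flags" line of a sysfs PCI resource
# file, print the region index and either its size and start address or
# "unassigned" when both start and end are zero.
# Example call (path assumed from the lspci output above):
#   show_bars /sys/bus/pci/devices/0000:0b:00.0/resource
show_bars() (
    idx=0
    while read -r start end _flags; do
        if [ "$(( $start ))" -eq 0 ] && [ "$(( $end ))" -eq 0 ]; then
            echo "region $idx: unassigned"
        else
            echo "region $idx: $(( $end - $start + 1 )) bytes @ $start"
        fi
        idx=$(( idx + 1 ))
    done < "$1"
)
```

A region reported as unassigned for a large memory BAR would be consistent with the "0M @ 0x0" messages in the dmesg output, and would point at the VM's MMIO configuration rather than at the driver itself.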
lsmod | grep nvidia
nvidia_uvm 1287347 0
nvidia_drm 58061 0
nvidia_modeset 1298897 1 nvidia_drm
nvidia 56742045 2 nvidia_modeset,nvidia_uvm
drm_kms_helper 186531 2 vmwgfx,nvidia_drm
drm 456166 6 ttm,drm_kms_helper,nvidia,vmwgfx,nvidia_drm
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.247.01 Wed Mar 26 11:50:32 UTC 2025
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)
ls -l /dev/nvidia*
crw-rw-rw-. 1 root root 195, 0 Jul 9 17:36 /dev/nvidia0
crw-rw-rw-. 1 root root 195, 1 Jul 9 17:36 /dev/nvidia1
crw-rw-rw-. 1 root root 195, 255 Jul 9 17:36 /dev/nvidiactl
crw-rw-rw-. 1 root root 236, 0 Jul 9 17:36 /dev/nvidia-uvm
crw-rw-rw-. 1 root root 236, 1 Jul 9 17:36 /dev/nvidia-uvm-tools
/dev/nvidia-caps:
total 0
cr--------. 1 root root 239, 1 Jul 9 17:36 nvidia-cap1
cr--r--r--. 1 root root 239, 2 Jul 9 17:36 nvidia-cap2
dmesg | grep -i iommu
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-1160.el7.x86_64 root=/dev/mapper/centos_ai-root ro crashkernel=auto rd.lvm.lv=centos_ai/root rd.lvm.lv=centos_ai/swap rhgb quiet intel_iommu=on iommu=pt rd.driver.blacklist=nouveau
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-1160.el7.x86_64 root=/dev/mapper/centos_ai-root ro crashkernel=auto rd.lvm.lv=centos_ai/root rd.lvm.lv=centos_ai/swap rhgb quiet intel_iommu=on iommu=pt rd.driver.blacklist=nouveau
[ 0.000000] DMAR: IOMMU enabled
dmesg | grep -i nvrm
[ 4.441939] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 4.442554] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 4.442602] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 4.442605] NVRM: None of the NVIDIA devices were initialized.
[ 4.570555] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 4.570583] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 4.570607] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 4.570608] NVRM: None of the NVIDIA devices were initialized.
[ 5.881807] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 5.881833] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 5.881854] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 5.881855] NVRM: None of the NVIDIA devices were initialized.
[ 6.108337] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 6.108399] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 6.108444] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 6.108447] NVRM: None of the NVIDIA devices were initialized.
[ 11.955199] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 11.955228] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 11.955251] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 11.955252] NVRM: None of the NVIDIA devices were initialized.
[ 12.458272] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 12.458303] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 12.458327] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 12.458329] NVRM: None of the NVIDIA devices were initialized.
[ 13.464136] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 13.464165] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 13.464187] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 13.464189] NVRM: None of the NVIDIA devices were initialized.
[ 14.067732] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 14.067758] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 14.067781] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 14.067782] NVRM: None of the NVIDIA devices were initialized.
[ 501.632900] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 501.632952] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 501.633008] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 501.633010] NVRM: None of the NVIDIA devices were initialized.
[ 606.357046] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 606.357105] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 606.357171] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 606.357174] NVRM: None of the NVIDIA devices were initialized.
[ 691.541885] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 691.541940] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 691.541994] NVRM: The NVIDIA probe routine failed for 2 device(s).
[ 691.541997] NVRM: None of the NVIDIA devices were initialized.
[ 781.468371] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 781.468377] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR2 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 781.468395] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR5 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 781.579751] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 781.579760] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR2 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 781.579785] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR5 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 781.690427] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.247.01 Wed Mar 26 11:50:32 UTC 2025
[ 816.478975] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 816.478985] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR2 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 816.479011] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR5 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 816.589390] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR1 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 816.589399] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR2 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 816.589425] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR5 is 0M @ 0x0 (PCI:0000:13:00.0)
[ 816.699807] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.247.01 Wed Mar 26 11:50:32 UTC 2025
[ 830.034843] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x24:0x72:1447)
[ 830.034927] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 830.345331] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x24:0x72:1447)
[ 830.345420] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 830.938662] NVRM: GPU 0000:13:00.0: RmInitAdapter failed! (0x24:0x72:1447)
[ 830.938735] NVRM: GPU 0000:13:00.0: rm_init_adapter failed, device minor number 1
[ 831.254334] NVRM: GPU 0000:13:00.0: RmInitAdapter failed! (0x24:0x72:1447)
[ 831.254413] NVRM: GPU 0000:13:00.0: rm_init_adapter failed, device minor number 1
[ 2885.867783] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x24:0x72:1447)
[ 2885.867875] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 2886.179362] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x24:0x72:1447)
[ 2886.179450] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[ 2886.496928] NVRM: GPU 0000:13:00.0: RmInitAdapter failed! (0x24:0x72:1447)
[ 2886.497021] NVRM: GPU 0000:13:00.0: rm_init_adapter failed, device minor number 1
[ 2886.813693] NVRM: GPU 0000:13:00.0: RmInitAdapter failed! (0x24:0x72:1447)
[ 2886.813773] NVRM: GPU 0000:13:00.0: rm_init_adapter failed, device minor number 1
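The NVRM output above repeats the same few messages on every load attempt, which makes it hard to scan. A rough sketch for condensing such a log into distinct messages with occurrence counts follows. It assumes the default dmesg format with a leading `[ seconds.micros]` field, and `nvrm_summary` is a hypothetical helper name, not part of any NVIDIA tooling.

```shell
# nvrm_summary: read dmesg text on stdin, keep only NVRM lines, strip the
# leading "[ timestamp]" field, and print each distinct line with a count,
# most frequent first.
# Example call:
#   dmesg | nvrm_summary
nvrm_summary() {
    grep -i 'nvrm' \
        | sed -E 's/^\[ *[0-9]+\.[0-9]+\] *//' \
        | sort | uniq -c | sort -rn
}
```

Run against the log above, this should reduce it to the handful of distinct messages: the invalid BAR reports for each device, the probe failures, the module load banner, and the RmInitAdapter failures.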
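It looks as though no memory addresses are being assigned to the GPUs' BARs? I have searched extensively without finding a lead, and asking AI assistants has not solved it either. What should I investigate next, and how might this be resolved?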