
Intro

Hi, I was recently given access to a remote machine with a GPU that I can use to accelerate my research, but I am having a lot of trouble setting it up. I went through the NVIDIA Linux install guide a couple of times, but I still cannot use CUDA with PyTorch (print(torch.cuda.is_available()) returns False).

Details

lspci | grep -i nvidia

00:0a.0 VGA compatible controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)

Linux distribution is Ubuntu 16.04.6 LTS

Issue

After installation of the toolkit, I got success messages and running nvcc --version results in

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

However, running nvidia-smi results in

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
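
This nvidia-smi message usually means the NVIDIA kernel module is not loaded. As a rough check, a minimal Python sketch can list the NVIDIA modules that the kernel reports in /proc/modules. The function name loaded_nvidia_modules is made up for this illustration and is not part of any NVIDIA tool.

```python
from pathlib import Path
from typing import List


def loaded_nvidia_modules(proc_modules_text: str) -> List[str]:
    """Return names of loaded kernel modules that start with 'nvidia'.

    Expects text in the /proc/modules layout, where the first
    whitespace-separated field of each line is the module name.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("nvidia"):
            names.append(fields[0])
    return names


if __name__ == "__main__":
    # /proc/modules only exists on Linux; fall back to empty text elsewhere.
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    modules = loaded_nvidia_modules(text)
    if modules:
        print("Loaded NVIDIA modules: " + ", ".join(modules))
    else:
        print("No NVIDIA kernel module is loaded.")
```

An empty result would be consistent with the nvidia-smi error above, since nvidia-smi needs the loaded driver to talk to the GPU.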

I then ran

  1. sudo /usr/bin/nvidia-uninstall (to uninstall the driver)
  2. sudo ./NVIDIA-Linux-x86_64-410.104.run --no-x-check

However, I got an error in the installation and these are the logs:

cat /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Jun  6 17:32:05 2019
installer version: 410.104

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer
    --no-x-check

Unable to load: nvidia-installer ncurses v6 user interface

Using: nvidia-installer ncurses user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
-> Installing NVIDIA driver version 410.104.
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: Yes)
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
-> Installing both new and classic TLS OpenGL libraries.
-> Installing classic TLS 32bit OpenGL libraries.
-> Install NVIDIA's 32-bit compatibility libraries? (Answer: Yes)
-> Will install GLVND GLX client libraries.
-> Will install GLVND EGL client libraries.
-> Skipping GLX non-GLVND file: "libGL.so.410.104"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.410.104"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
-> Skipping GLX non-GLVND file: "./32/libGL.so.410.104"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "./32/libEGL.so.410.104"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
Looking for install checker script at ./libglvnd_install_checker/check-libglvnd-install.sh
   executing: '/bin/sh ./libglvnd_install_checker/check-libglvnd-install.sh'...
   Checking for libglvnd installation.
   Checking libGLdispatch...
   Can't load library libGLdispatch.so.0: libGLdispatch.so.0: cannot open shared object file: No such file or directory
Will install libglvnd libraries.
Will install libEGL vendor library config file to /usr/share/glvnd/egl_vendor.d
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (410.104):
   executing: '/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
-> done.
ERROR: Unable to load the 'nvidia-drm' kernel module.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com
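
Most of the log above is routine output, and the lines that matter are the ones starting with ERROR: or WARNING:. A small Python sketch for pulling those out of the installer log is shown here. The helper name installer_problems is hypothetical, and the prefix matching is an assumption based only on the log shown in this question.

```python
from typing import List

LOG_PATH = "/var/log/nvidia-installer.log"  # path taken from the log header above


def installer_problems(log_text: str) -> List[str]:
    """Return the log lines that begin with 'ERROR:' or 'WARNING:'."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if line.lstrip().startswith(("ERROR:", "WARNING:"))
    ]


if __name__ == "__main__":
    try:
        with open(LOG_PATH, "r", errors="replace") as handle:
            text = handle.read()
    except OSError:
        # The file may be missing or unreadable without root privileges.
        text = ""
    for problem in installer_problems(text):
        print(problem)
```

For the log in this question, the filter would surface the X library path warning and the two errors, including the failure to load the 'nvidia-drm' kernel module.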

Does anyone have an idea how to fix my setup and install the driver properly so that PyTorch recognizes CUDA?

1 Answer


You do not need to install CUDA manually when you install PyTorch. See "Do I need to install cuda separately after installing the NVIDIA display driver?", in particular its first sentence and the bullet point on PyTorch:

You do not need the system "CUDA Toolkit"

You might ask why. The simple answer is that PyTorch installs its own CUDA binaries and does not care which CUDA Toolkit is installed on the system.
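
To see which CUDA version a given PyTorch build was compiled against, independent of what nvcc reports, something like the following sketch may help. The function name torch_cuda_report is invented for this example, and the import is guarded so the script also runs where PyTorch is not installed.

```python
import importlib.util
from typing import Any, Dict


def torch_cuda_report() -> Dict[str, Any]:
    """Collect basic facts about the installed PyTorch build, if any."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch  # imported lazily so the script works without PyTorch

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the PyTorch binaries were built with;
        # None for CPU-only builds.
        "bundled_cuda": torch.version.cuda,
        # True only if a usable GPU and a working driver are found.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print("{}: {}".format(key, value))
```

If bundled_cuda shows a version but cuda_available is False, the PyTorch build itself has CUDA support and the problem more likely lies with the driver than with the toolkit.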

If the options are available to you, use the official selector at https://pytorch.org/get-started/locally/ to find the right one-liner for installing PyTorch with CUDA support on Linux. If those options are not available on your system, you need to install from source. In that case, see "How to install pytorch FROM SOURCE (with cuda enabled for a deprecated CUDA cc 3.5 of an old gpu) using anaconda prompt on Windows 10?" as a starting point.

questionto42