
Tensorman container cannot access GPU - Pop!_OS 22.04 #40

Open
prina404 opened this issue Mar 28, 2023 · 1 comment
@prina404

After closely following the guide on the official System76 website for installing Tensorman, I can't get TensorFlow to recognize my GPU (RTX 2080).

Here is the code used to test it:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

and this is the output of $ tensorman run --gpu python3 -- ./test_tf.py:

"docker" "run" "-u" "1000:1000" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/matteo/Desktop/Tirocinio/HouseExpoSLAM/pseudoslam/personal_experiments:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "python3" "./test_tf.py"
2023-03-28 11:21:33.267329: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-28 11:21:34.303425: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-03-28 11:21:34.303451: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: 2ee87ebed972
2023-03-28 11:21:34.303459: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: 2ee87ebed972
2023-03-28 11:21:34.303486: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2023-03-28 11:21:34.303509: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 525.89.2
Num GPUs Available:  0

I've also tried launching the Python script from a container shell ($ tensorman run --gpu --python3 bash) but received the same error.


Some information for context:

Output of $ nvidia-smi :

Tue Mar 28 13:39:08 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   35C    P8    16W / 260W |    341MiB /  8192MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2509      G   /usr/lib/xorg/Xorg                 84MiB |
|    0   N/A  N/A      2640      G   /usr/bin/gnome-shell              126MiB |
|    0   N/A  N/A      4652      G   firefox                           127MiB |
+-----------------------------------------------------------------------------+

Output of $ tensorman run --gpu nvidia-smi:

Failed to initialize NVML: Unknown Error

I have a dual boot with GRUB as my bootloader. Knowing that GRUB may interfere with the cgroups kernel parameter, I have updated its config file to include the necessary option, as suggested in this comment. Here is the output of $ cat /proc/cmdline:

BOOT_IMAGE=/boot/vmlinuz-6.2.6-76060206-generic root=UUID=dfabc901-317f-4f76-a0b3-be039b32f5a6 ro systemd.unified_cgroup_hierarchy=0 quiet splash
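For reference, the kernel parameter above is typically added by editing /etc/default/grub and regenerating the GRUB config. A minimal sketch (the surrounding file contents are an assumption; adapt to your own defaults):

```shell
# /etc/default/grub (excerpt)
# Append systemd.unified_cgroup_hierarchy=0 to the default kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
```

After editing, run `sudo update-grub` and reboot, then verify the parameter took effect with `cat /proc/cmdline` as above.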

I have no previous experience installing TF / Tensorman or using containers, so I hope I didn't miss any crucial details in the process.

@prina404 (Author)

Update

Hope this gives some additional context to the issue.

After trying various fixes unsuccessfully, I managed to get it working by setting the --privileged Docker flag in ~/.config/tensorman/config.toml.

Since I'm working locally it's not a big deal to give the container privileged access, but it surely isn't an ideal fix.
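Roughly, the config change looks like the following sketch. Note that the exact key name is an assumption on my part; check the Tensorman README for the spelling your version supports:

```toml
# ~/.config/tensorman/config.toml (excerpt)
# Pass extra flags straight through to `docker run`.
# NOTE: the key name below is an assumption -- consult the Tensorman
# documentation for the exact option supported by your version.
docker_flags = ["--privileged"]
```

Running the container privileged bypasses the device-cgroup restrictions that appear to be blocking GPU access here, which is why it works but is not an ideal long-term fix.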

@n3m0-22 n3m0-22 assigned n3m0-22 and unassigned n3m0-22 Mar 28, 2023