
Tensorman container cannot access GPU - Pop!_OS 22.04 #40

Open
prina404 opened this issue Mar 28, 2023 · 1 comment
@prina404

After closely following the guide on the official System76 website for installing Tensorman, I can't get TensorFlow to recognize my GPU (RTX 2080).

Here is the code used to test it:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

and this is the output of $ tensorman run --gpu python3 -- ./test_tf.py:

"docker" "run" "-u" "1000:1000" "--gpus=all" "-e" "HOME=/project" "-it" "--rm" "-v" "/home/matteo/Desktop/Tirocinio/HouseExpoSLAM/pseudoslam/personal_experiments:/project" "-w" "/project" "tensorflow/tensorflow:latest-gpu" "python3" "./test_tf.py"
2023-03-28 11:21:33.267329: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-28 11:21:34.303425: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-03-28 11:21:34.303451: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:168] retrieving CUDA diagnostic information for host: 2ee87ebed972
2023-03-28 11:21:34.303459: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:175] hostname: 2ee87ebed972
2023-03-28 11:21:34.303486: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:199] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
2023-03-28 11:21:34.303509: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:203] kernel reported version is: 525.89.2
Num GPUs Available:  0

I've also tried launching the Python script from a container shell ($ tensorman run --gpu --python3 bash) but received the same error.


Some information for context:

Output of $ nvidia-smi :

Tue Mar 28 13:39:08 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   35C    P8    16W / 260W |    341MiB /  8192MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2509      G   /usr/lib/xorg/Xorg                 84MiB |
|    0   N/A  N/A      2640      G   /usr/bin/gnome-shell              126MiB |
|    0   N/A  N/A      4652      G   firefox                           127MiB |
+-----------------------------------------------------------------------------+

Output of $ tensorman run --gpu nvidia-smi:

Failed to initialize NVML: Unknown Error

I have a dual boot with GRUB as my bootloader. Knowing that GRUB may interfere with the cgroups kernel parameter, I have updated its config file to include the necessary option, as suggested in this comment. Here is the output of $ cat /proc/cmdline:

BOOT_IMAGE=/boot/vmlinuz-6.2.6-76060206-generic root=UUID=dfabc901-317f-4f76-a0b3-be039b32f5a6 ro systemd.unified_cgroup_hierarchy=0 quiet splash
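For reference, the kernel parameter above is typically added by editing /etc/default/grub and regenerating the GRUB config. A minimal sketch (the surrounding file contents are an assumption; adapt to your own defaults):

```shell
# /etc/default/grub (excerpt)
# Append systemd.unified_cgroup_hierarchy=0 to the default kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
```

After editing, run `sudo update-grub` and reboot, then verify the parameter took effect with `cat /proc/cmdline` as above.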

I have no previous experience installing TF / Tensorman or using containers, so I hope I didn't miss any crucial details in the process.

@prina404 (Author)

Update

Hope this gives some additional context to the issue.

After trying various fixes unsuccessfully, I managed to get it working by setting the --privileged Docker flag in ~/.config/tensorman/config.toml.

Since I'm working locally it's not a big deal to give the container privileged access, but it surely isn't an ideal fix.
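Roughly, the config change looks like the following sketch. Note that the exact key name is an assumption on my part; check the Tensorman README for the spelling your version supports:

```toml
# ~/.config/tensorman/config.toml (excerpt)
# Pass extra flags straight through to `docker run`.
# NOTE: the key name below is an assumption -- consult the Tensorman
# documentation for the exact option supported by your version.
docker_flags = ["--privileged"]
```

Running the container privileged bypasses the device-cgroup restrictions that appear to be blocking GPU access here, which is why it works but is not an ideal long-term fix.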

@n3m0-22 n3m0-22 assigned n3m0-22 and unassigned n3m0-22 Mar 28, 2023