monitor GPU ressources #785

AlexandreKempf · 2024-02-14T14:34:48Z

New feature: monitoring for hardware metrics

For an earlier discussion of this PR's content, you can look here. That was an outdated version of this PR that only contained CPU, RAM, and disk metrics.

Monitoring hardware

In this PR we add the possibility for the user to monitor the GPU, CPU, RAM, and disk during one experiment.
The GPU metrics are only collected if at least one GPU is detected.

To enable this feature, pass a single argument:

from dvclive import Live

with Live(monitor_system=True) as live:
    ...

If you want to use advanced features, you can specify each parameter this way:

from dvclive import Live
from dvclive.monitor_system import SystemMonitor

with Live() as live:
    live.system_monitor = SystemMonitor(
        interval=0.1,
        num_samples=15,
        directories_to_monitor={"data": "/path/to/data/directory", "home": "/home"},
    )

If you enable system monitoring, it will track the following:

system/cpu/count -> number of CPU cores
system/cpu/usage (%) -> the average usage of the CPUs.
system/cpu/parallelization (%) -> percentage of CPU cores running above 20% of their capacity. It is useful when you're looking to parallelize your code to train your model or process your data faster.
system/ram/usage (%) -> percentage of the RAM used. Useful when deciding whether to increase the batch size or the amount of data held in RAM at the same time.
system/ram/usage (GB) -> RAM used, in GB. Useful for the same reason.
system/ram/total (GB) -> Total RAM in your system
system/disk/usage (%) -> Amount of disk used by the partition that contains the given path, in %. By default uses "/". You can specify the paths to the partitions you want to monitor. For instance, the code example above monitors /path/to/data/directory and /home. Data and code often live in very different paths/volumes, so it is useful for users to be able to specify their own paths.
system/disk/usage (GB) -> Amount of disk used at a given path.
system/disk/total (GB) -> Total disk storage at a given path.
system/gpu/count -> Number of GPUs detected.
system/gpu/usage (%) -> Usage of each GPU in %.
system/vram/usage (%) -> Usage of each GPU's video memory (VRAM) in %.
system/vram/usage (GB) -> Usage of each GPU's video memory (VRAM) in GB.
system/vram/total (GB) -> Total amount of video memory (VRAM) in GB, for each GPU.
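
The `system/cpu/parallelization (%)` metric in the list above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `parallelization_percent` is hypothetical, the 20% threshold is taken from the description above, and the per-core readings are assumed to come from something like `psutil.cpu_percent(percpu=True)`.

```python
def parallelization_percent(per_core_usage, threshold=20.0):
    """Return the percentage of cores whose usage exceeds `threshold`.

    `per_core_usage` is a list of per-core usage values in percent, for
    example the list returned by psutil.cpu_percent(percpu=True).
    """
    if not per_core_usage:
        return 0.0
    busy = sum(1 for usage in per_core_usage if usage > threshold)
    return 100.0 * busy / len(per_core_usage)


# Hypothetical readings for a 4-core machine: two cores are busy.
print(parallelization_percent([95.0, 80.0, 5.0, 10.0]))  # 50.0
```

A low value while training is slow suggests that the workload is not spread across the available cores.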

Note that as several paths can be specified, the full metric name is system/disk/usage (%)/<user defined name>. For instance, it would be system/disk/usage (%)/data for /path/to/data/directory and system/disk/usage (%)/home for /home.

Note that as several GPUs can be detected, the full metric name for GPU metrics (except count) is suffixed with /<idx>, which indicates the index of the GPU. Example: system/gpu/usage (%)/0
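
The two naming rules above can be illustrated with a short sketch. The helper names `disk_metric_names` and `gpu_metric_name` are hypothetical and only mirror the naming scheme described in this PR; they are not functions from dvclive.

```python
def disk_metric_names(directories_to_monitor):
    """Build the full disk usage metric names, one per user-defined name."""
    return [f"system/disk/usage (%)/{name}" for name in directories_to_monitor]


def gpu_metric_name(base_name, gpu_index):
    """Suffix a per-GPU metric name with the index of the GPU."""
    return f"{base_name}/{gpu_index}"


directories = {"data": "/path/to/data/directory", "home": "/home"}
print(disk_metric_names(directories))
# ['system/disk/usage (%)/data', 'system/disk/usage (%)/home']
print(gpu_metric_name("system/gpu/usage (%)", 0))
# system/gpu/usage (%)/0
```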

All the values that can change during an experiment can be saved as plots. Timestamps are automatically recorded with the metrics. Other metrics (that don't change) such as GPU count, GPU vram total, CPU count, RAM total and disk total are saved as metrics but cannot be saved as plots.

I decided to report the usage in both % and GB. First, because it is more consistent with the other loggers out there. Second, both are extremely relevant depending on which cloud instance you run your experiment on. If you always run your experiments on the same hardware, the distinction is not really interesting.
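
To make the % versus GB distinction concrete, here is a small sketch. The helper `usage_percent_and_gb` is hypothetical, and treating 1 GB as 2**30 bytes is an assumption made for this example; the PR may use a different convention.

```python
import shutil

BYTES_PER_GB = 1024**3  # assumption: "GB" means 2**30 bytes here


def usage_percent_and_gb(used_bytes, total_bytes):
    """Return (usage in %, usage in GB, total in GB) for one resource."""
    percent = 100.0 * used_bytes / total_bytes if total_bytes else 0.0
    return percent, used_bytes / BYTES_PER_GB, total_bytes / BYTES_PER_GB


# 8 GB used out of 32 GB is 25% on this machine, but the same 8 GB would be
# 50% on a 16 GB instance, which is why both views are useful.
print(usage_percent_and_gb(8 * BYTES_PER_GB, 32 * BYTES_PER_GB))  # (25.0, 8.0, 32.0)

# The same helper applied to a real disk reading from the standard library.
disk = shutil.disk_usage(".")
print(usage_percent_and_gb(disk.used, disk.total))
```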

Files generated

The system metrics are stored with the log_metric function. It means that the .tsv files are stored in the dvclive/plots folder. A specific folder, system, contains all the system metrics to distinguish them from the user-defined metrics. The metrics are also saved in the dvclive/metrics.json file.
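
As a rough sketch of how one of these .tsv files could be read back, the example below parses tab-separated content with the standard library. The column layout (timestamp, step, then the metric value) and the sample values are assumptions for illustration, and `read_metric_values` is a hypothetical helper, not part of dvclive.

```python
import csv
import io

# Hypothetical content of one .tsv plot file; the column layout
# (timestamp, step, metric value) is an assumption for illustration.
SAMPLE_TSV = (
    "timestamp\tstep\tusage (%)\n"
    "1707921288000\t0\t12.5\n"
    "1707921289000\t1\t37.5\n"
)


def read_metric_values(tsv_file, column):
    """Return the values of `column` from a tab-separated plot file as floats."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [float(row[column]) for row in reader]


values = read_metric_values(io.StringIO(SAMPLE_TSV), "usage (%)")
print(values)  # [12.5, 37.5]
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="")`.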

Plot display

Here is what it looks like in the VS Code extension:

Here is what it looks like in Studio:
https://studio.iterative.ai/user/AlexandreKempf/projects/image_classification-imzssxc5ew

Note that the Studio live update is a little buggy, but this is fixed in this PR.

Note that we're calling `nvmlShutdown` and `nvmlInit` at each fetch of the GPU metrics like in labML code
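
The init/shutdown-per-fetch pattern mentioned above could look roughly like the sketch below. It is not the code from this PR: `fetch_gpu_metrics` is a hypothetical name, and returning an empty dict when pynvml or a GPU is unavailable is an assumption made so the example runs on any machine. The pynvml calls themselves (`nvmlInit`, `nvmlShutdown`, `nvmlDeviceGetCount`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetUtilizationRates`, `nvmlDeviceGetMemoryInfo`) are part of the library.

```python
try:
    import pynvml
except ImportError:  # pynvml not installed: GPU metrics are skipped
    pynvml = None


def fetch_gpu_metrics():
    """Return a dict of GPU metrics, or an empty dict if no GPU is usable."""
    if pynvml is None:
        return {}
    try:
        pynvml.nvmlInit()  # re-initialized at every fetch
    except pynvml.NVMLError:  # no NVIDIA driver or no GPU available
        return {}
    try:
        count = pynvml.nvmlDeviceGetCount()
        if count == 0:
            return {}
        metrics = {"system/gpu/count": count}
        for idx in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(idx)
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            metrics[f"system/gpu/usage (%)/{idx}"] = utilization.gpu
            metrics[f"system/vram/usage (%)/{idx}"] = 100.0 * memory.used / memory.total
        return metrics
    except pynvml.NVMLError:
        return {}
    finally:
        pynvml.nvmlShutdown()  # always released, even if a query failed


print(fetch_gpu_metrics())
```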

❗ I have followed the Contributing to DVCLive guide.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here: add documentation about DVClive monitoring system metrics dvc.org#5138

dberenbaum · 2024-02-20T19:22:32Z

pyproject.toml

+  "psutil",
+  "pynvml"


Should we include by default or make them optional?

We will include them if we decide to monitor the system's metrics by default :)
I created a new option called system.

I meant it as a genuine question, not a suggestion to make them optional. How lightweight are they? I think there's also downside to adding lots of options. We have discussed in the past making a lightweight version of dvclive for those who need it very lean. Totally up to you here.

Yep, my 2cs. I would prefer "system" to be enabled by default. Agreed to check on how big / complicated those deps are.

Both, when installed together, are quite light: ~500KB.

@shcheklein Then should I change the monitor_system default to True in the Live object?
Any idea on that topic @dberenbaum?

My 0.02$ is that it would be nice to have it by default, especially for new users. But for existing users, it can be annoying to have new plots all over the studio dashboard I spend days customizing. What should we prioritize?

off and consider making it a default in 4.0

sounds good. let's make a ticket for this? :) and do it soon I hope. (I just feel it's a really good and appealing feature - people will stick better with the extension, Studio, etc).

Let's wait for 4.0. Do we have a roadmap to 4.0 by any chance, so I can add that ticket?
Also this ticket should mention that psutil and pynvml should not be optional anymore.

Here it is: #746

If we make the deps optional, do we need to change the PR a bit to make sure we fail gracefully if they are not installed? It looks like loading dvclive.live now will try to import those deps. (I'm also fine to just start installing them by default now since I think it's less annoying and non-breaking)

Ok, I put them by default as they are quite light. Also, having them installed by default should simplify the documentation.

src/dvclive/monitor_system.py

shcheklein

Seems good to me. Thanks @AlexandreKempf !

dberenbaum · 2024-02-21T13:11:41Z

Hey @AlexandreKempf, take a look at my remaining questions/suggestions, but if @shcheklein is satisfied from a technical perspective, feel free to merge when you feel it's ready.

shcheklein · 2024-02-21T16:58:10Z

but if @shcheklein is satisfied from a technical perspective, feel free to merge when you feel it's ready.

I've approved it already!

AlexandreKempf · 2024-02-21T19:14:16Z

I added the doc here fyi.

AlexandreKempf · 2024-02-22T10:42:03Z

Added changes based on recommendations in the documentation PR

install psutil and pynvml by default
removed plot argument since we always want these metrics plotted
add method monitor_system to Live instead of a property

dberenbaum · 2024-02-22T15:47:18Z

src/dvclive/live.py

+
+        Args:
+            interval (float): the time interval between samples in seconds. To keep the
+                sampling interval small, the maximum value allowed is 0.1 seconds.


Not a blocker, but is the max value of 0.1 seconds needed?

I like that we guide users toward realistic values if they have no clues on good values for the sampling interval. The values are coming from W&B's code.

dberenbaum · 2024-02-22T15:47:41Z

src/dvclive/live.py

+                sampling interval small, the maximum value allowed is 0.1 seconds.
+                Default to 0.05.
+            num_samples (int): the number of samples to collect before the aggregation.
+                The value should be between 1 and 30 samples. Default to 20.


Why between 1 and 30?

Same reason as above
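
For readers following the two threads above, here is a hedged sketch of how the quoted bounds and the aggregation could fit together. It is not the code from this PR: the helper names are hypothetical, raising `ValueError` on out-of-range values and averaging the samples are assumptions, and the lower bound on `interval` (strictly positive) is also an assumption since the docstring only states the maximum.

```python
import time


def validate_sampling(interval, num_samples):
    """Check the bounds quoted in the docstring above."""
    if not 0 < interval <= 0.1:
        raise ValueError("interval must be in (0, 0.1] seconds")
    if not 1 <= num_samples <= 30:
        raise ValueError("num_samples must be between 1 and 30")


def collect_aggregated(read_sample, interval=0.05, num_samples=20):
    """Take `num_samples` readings `interval` seconds apart and average them."""
    validate_sampling(interval, num_samples)
    samples = []
    for _ in range(num_samples):
        samples.append(read_sample())
        time.sleep(interval)
    return sum(samples) / len(samples)


# With the defaults, one aggregated point covers about 0.05 * 20 = 1 second.
readings = iter([10.0, 20.0, 30.0])
print(collect_aggregated(lambda: next(readings), interval=0.01, num_samples=3))  # 20.0
```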

dberenbaum

I left a couple questions, but they aren't blockers, so feel free to merge when ready.

src/dvclive/live.py

mattseddon · 2024-02-26T03:08:21Z

Should this have closed #81?

AlexandreKempf and others added 22 commits February 6, 2024 12:06

add cpu monitoring

b5a8171

add unit tests and more cpu metrics

e3b654c

change default value for callback

e6cff32

uses a percentage value for cpu parallelism

8663011

add ram total

1346750

remove total ram measure from plots

5f90bea

update pyproject.toml

2ac50f2

[pre-commit.ci] auto fixes from pre-commit.com hooks

5c0a288

for more information, see https://pre-commit.ci

add tmpdir to metrics tests

30315bb

Merge branch 'main' into monitor-cpu-ressources

0f55943

default to no monitoring callbacks

40686c7

fix tmp_dir for test on windows and macos

4b98144

fix tmp_dir for test on windows and macos

36fd94d

fix update data to studio live experiment

59db522

fix studio update data problem

2393fb0

debug studio updates

3b327c5

improve code readability:

b0ac980

hide CPUMetrics properties change duration to nb_samples log only once if problem with psutil

remove hack lightning

5162ac2

fix lightning problem with steps in studio

9353645

simplify the metric names

871bebc

Merge branch 'main' into monitor-cpu-ressources

072252d

don't increment num_point_sent_to_studio if studio didn't received …

e169abc

…the data

AlexandreKempf force-pushed the monitor-cpu-ressources branch from 0ac38dc to e169abc Compare February 14, 2024 15:37

dberenbaum reviewed Feb 14, 2024

pyproject.toml (outdated)

AlexandreKempf added 3 commits February 15, 2024 10:46

add directory metrics to the list of metrics tracked + refacto

aa5b511

clean code and split features into several PRs

9d0e70d

cleaner user interface

ba72e01

AlexandreKempf mentioned this pull request Feb 15, 2024

Monitors CPU, RAM, and disk usage #773

Closed

2 tasks

AlexandreKempf added 2 commits February 15, 2024 12:30

add docstrings

6e29ff6

mypy conflicts

719087b

Merge branch 'monitor-cpu-ressources' into monitor-gpu-ressources

30446bc

AlexandreKempf force-pushed the monitor-gpu-ressources branch from 88f426a to 8d0d1b3 Compare February 20, 2024 15:07

AlexandreKempf changed the base branch from monitor-cpu-ressources to main February 20, 2024 17:09

merge cpu and gpu monitoring into a system monitor object

3f94e24

general improvements in the code as well

AlexandreKempf force-pushed the monitor-gpu-ressources branch from 8d0d1b3 to 3f94e24 Compare February 20, 2024 17:27

dberenbaum requested a review from shcheklein February 20, 2024 19:01

dberenbaum reviewed Feb 20, 2024

src/dvclive/monitor_system.py

dberenbaum reviewed Feb 20, 2024

src/dvclive/monitor_system.py (outdated)

dberenbaum reviewed Feb 20, 2024

src/dvclive/monitor_system.py

shcheklein approved these changes Feb 21, 2024

change pyproject dependencies and fix typo

506dd52

AlexandreKempf mentioned this pull request Feb 21, 2024

add documentation about DVClive monitoring system metrics iterative/dvc.org#5138

Merged

AlexandreKempf added 4 commits February 22, 2024 09:40

install psutil and pynvml by default

5df5d48

remove plot argument in SystemMonitor

29477ba

change call to the SystemMonitor object to be a Live method

53d83d5

Merge branch 'main' into monitor-gpu-ressources

38d5c72

AlexandreKempf requested a review from dberenbaum February 22, 2024 10:42

dberenbaum reviewed Feb 22, 2024

dberenbaum approved these changes Feb 22, 2024

dberenbaum reviewed Feb 22, 2024

src/dvclive/live.py

AlexandreKempf merged commit 2c7c378 into main Feb 22, 2024
14 checks passed

AlexandreKempf deleted the monitor-gpu-ressources branch February 22, 2024 18:59

AlexandreKempf mentioned this pull request Feb 26, 2024

plots: fix plots nested configuration iterative/dvc#10320

Merged

2 tasks
