Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

monitor GPU ressources #785

Merged
merged 44 commits into from
Feb 22, 2024
Merged

monitor GPU ressources #785

merged 44 commits into from
Feb 22, 2024

Conversation

AlexandreKempf
Copy link
Contributor

@AlexandreKempf AlexandreKempf commented Feb 14, 2024

New feature: monitoring for harware metrics

For a first discussion on this PR content, you can look at here. It was a outdated version of this PR that only contains CPU, ram and disk metrics.

Monitoring hardware

In this PR we add the possibility for the user to monitor the GPU, CPU, RAM, and disk during one experiment.
The GPU metrics are only collected if at least one GPU is detected.

To use this feature you can use it with a simple argument:

from dvclive import Live

with Live(monitor_system=True) as live:
    ...

If you want to use advance features you can specify each parameter this way:

from dvclive import Live
from dvclive.monitor_system import SystemMonitor

with Live() as live:
    live.system_monitor = SystemMonitor(interval = 0.1, num_samples=15, directories_to_monitor={"data": "/path/to/data/directory", "home": "/home"}))

If you allow the monitoring of your system, if will track the following:

  • system/cpu/count -> number of CPU cores
  • system/cpu/usage (%) -> the average usage of the CPUs.
  • system/cpu/parallelization (%) -> How many CPU cores use more than 20% of their possibilities? It is useful when you're looking to parallelize your code to train your model or process your data faster.
  • system/ram/usage (%) -> percentage of the RAM used. Useful to increase batch size or data processed at the same time in the RAM.
  • system/ram/usage (GB) -> RAM used. Useful to increase batch size or data processed at the same time.
  • system/ram/total (GB) -> Total RAM in your system
  • system/disk/usage (%) -> Amount of disk used by the partition that contain the given path, in %. By default uses "/". You can specify the paths to the partition you want to monitor. For instance, the code example above monitors /data and /home. Data and code often live in very different paths/volumes, so it is useful for the user to be able to specify its own path.
  • system/disk/usage (GB) -> Amount of disk used at a given path.
  • system/disk/total (GB) -> Total disk storage at a given path.
  • system/gpu/count -> Number of GPUs detected.
  • system/gpu/usage (%) -> Usage of each GPU in %.
  • system/vram/usage (%) -> Usage of each GPU virtual memory in %.
  • system/vram/usage (GB) -> Usage of each GPU virtual memory in GB.
  • system/vram/total (GB) -> total amount of GPU virtual memory in GB, for each GPU.

Note that as several paths can be specified, the full metric name is system/disk/usage (%)/<user defined name>. For instance it would be system/disk/usage (%)/data for the /path/to/data/disk and system/disk/usage (%)/home for /home.

Note that as several GPUs can be detected, the full metric name for GPU metrics (except count) is suffix with /<idx> that indicate the index of the GPU. Example: system/gpu/usage (%)/0

All the values that can change during an experiment can be saved as plots. Timestamps are automatically recorded with the metrics. Other metrics (that don't change) such as GPU count, GPU vram total, CPU count, RAM total and disk total are saved as metrics but cannot be saved as plots.

I decided to split the usage in % and GB. First, because it is more consistent with the other loggers out there. Second, both are extremely relevant based on which cloud instance you run your experiment. If you only run your experiment on the same hardware, the distinction is not really interesting.

Files generated

The metrics about the CPU are stored with the log_metric function. It means that the .tsv files are stored in the dvclive/plots folder. A specific folder, system, contains all the metrics about the CPU to distinguish them from the user-defined metrics. The metrics are also saved in the dvclive/metrics.json file.

Plot display

Here is what VScode extension looks like:
image

Here is what Studio looks like:
https://studio.iterative.ai/user/AlexandreKempf/projects/image_classification-imzssxc5ew

Note that studio live update is a little buggy, but it is fixed in this PR

Note that we're calling nvmlShutdown and nvmlInit at each fetch of the GPU metrics like in labML code

@AlexandreKempf AlexandreKempf changed the base branch from monitor-cpu-ressources to main February 20, 2024 17:09
general improvements in the code as well
Comment on lines +41 to +42
"psutil",
"pynvml"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include by default or make them optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will include them if we decide to monitor the system's metrics by default :)
I created a new option called system.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant it as a genuine question, not a suggestion to make them optional. How lightweight are they? I think there's also downside to adding lots of options. We have discussed in the past making a lightweight version of dvclive for those who need it very lean. Totally up to you here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, my 2cs. I would prefer "system" to be enabled by default. Agreed to check on how big / complicated those deps are.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both, when installed together, are quite light: ~500KB.

@shcheklein Then should I change the monitor_system default to True in the Live object?
Any idea on that topic @dberenbaum?

My 0.02$ is that it would be nice to have it by default, especially for new users. But for existing users, it can be annoying to have new plots all over the studio dashboard I spend days customizing. What should we prioritize?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

off and consider making it a default in 4.0

sounds good. let's make a ticket for this? :) and do it soon I hope. (I just feel it's a really good and appealing feature - people will stick better with the extension, Studio, etc).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wait for 4.0. Do we have a roadmap to 4.0 by any chance, so I can add that ticket?
Also this ticket should mention that psutil and pynvml should not be optional anymore.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it is: #746

Copy link
Collaborator

@dberenbaum dberenbaum Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make the deps optional, do we need to change the PR a bit to make sure we fail gracefully if they are not installed? It looks like loading dvclive.live now will try to import those deps. (I'm also fine to just start installing them by default now since I think it's less annoying and non-breaking)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I put them by default as they are quite light. Also, having them installed by default should simplify the documentation.

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me. Thanks @AlexandreKempf !

@dberenbaum
Copy link
Collaborator

Hey @AlexandreKempf, take a look at my remaining questions/suggestions, but if @shcheklein is satisfied from a technical perspective, feel free to merge when you feel it's ready.

@shcheklein
Copy link
Member

but if @shcheklein is satisfied from a technical perspective, feel free to merge when you feel it's ready.

I've approved it already!

@AlexandreKempf
Copy link
Contributor Author

I added the doc here fyi.

@AlexandreKempf
Copy link
Contributor Author

Added changes based on recommendations in the documentation PR

  • install psutil and pynvml by default
  • removed plot argument since we always want these metrics plotted
  • add method monitor_system to Live instead of a property|


Args:
interval (float): the time interval between samples in seconds. To keep the
sampling interval small, the maximum value allowed is 0.1 seconds.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, but is the max value of 0.1 seconds needed?

Copy link
Contributor Author

@AlexandreKempf AlexandreKempf Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that we guide users toward realistic values if they have no clues on good values for the sampling interval. The values are coming from W&B's code.

sampling interval small, the maximum value allowed is 0.1 seconds.
Default to 0.05.
num_samples (int): the number of samples to collect before the aggregation.
The value should be between 1 and 30 samples. Default to 20.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why between 1 and 30?

Copy link
Contributor Author

@AlexandreKempf AlexandreKempf Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason as above

Copy link
Collaborator

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple questions, but they aren't blockers, so feel free to merge when ready.

@AlexandreKempf AlexandreKempf merged commit 2c7c378 into main Feb 22, 2024
14 checks passed
@AlexandreKempf AlexandreKempf deleted the monitor-gpu-ressources branch February 22, 2024 18:59
@mattseddon
Copy link
Contributor

Should this have closed #81?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants