-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
monitor GPU ressources #785
Conversation
for more information, see https://pre-commit.ci
hide CPUMetrics properties change duration to nb_samples log only once if problem with psutil
0ac38dc
to
e169abc
Compare
88f426a
to
8d0d1b3
Compare
general improvements in the code as well
8d0d1b3
to
3f94e24
Compare
"psutil", | ||
"pynvml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include by default or make them optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will include them if we decide to monitor the system's metrics by default :)
I created a new option called system
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant it as a genuine question, not a suggestion to make them optional. How lightweight are they? I think there's also downside to adding lots of options. We have discussed in the past making a lightweight version of dvclive for those who need it very lean. Totally up to you here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, my 2cs. I would prefer "system" to be enabled by default. Agreed to check on how big / complicated those deps are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both, when installed together, are quite light: ~500KB.
@shcheklein Then should I change the monitor_system
default to True
in the Live
object?
Any idea on that topic @dberenbaum?
My 0.02$ is that it would be nice to have it by default, especially for new users. But for existing users, it can be annoying to have new plots all over the studio dashboard I spend days customizing. What should we prioritize?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
off and consider making it a default in 4.0
sounds good. let's make a ticket for this? :) and do it soon I hope. (I just feel it's a really good and appealing feature - people will stick better with the extension, Studio, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's wait for 4.0. Do we have a roadmap to 4.0 by any chance, so I can add that ticket?
Also this ticket should mention that psutil
and pynvml
should not be optional anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it is: #746
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make the deps optional, do we need to change the PR a bit to make sure we fail gracefully if they are not installed? It looks like loading dvclive.live
now will try to import those deps. (I'm also fine to just start installing them by default now since I think it's less annoying and non-breaking)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I put them by default as they are quite light. Also, having them installed by default should simplify the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good to me. Thanks @AlexandreKempf !
Hey @AlexandreKempf, take a look at my remaining questions/suggestions, but if @shcheklein is satisfied from a technical perspective, feel free to merge when you feel it's ready. |
I've approved it already! |
I added the doc here fyi. |
Added changes based on recommendations in the documentation PR
|
|
||
Args: | ||
interval (float): the time interval between samples in seconds. To keep the | ||
sampling interval small, the maximum value allowed is 0.1 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a blocker, but is the max value of 0.1 seconds needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that we guide users toward realistic values if they have no clues on good values for the sampling interval. The values are coming from W&B's code.
sampling interval small, the maximum value allowed is 0.1 seconds. | ||
Default to 0.05. | ||
num_samples (int): the number of samples to collect before the aggregation. | ||
The value should be between 1 and 30 samples. Default to 20. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why between 1 and 30?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple questions, but they aren't blockers, so feel free to merge when ready.
Should this have closed #81? |
New feature: monitoring for harware metrics
For a first discussion on this PR content, you can look at here. It was a outdated version of this PR that only contains CPU, ram and disk metrics.
Monitoring hardware
In this PR we add the possibility for the user to monitor the GPU, CPU, RAM, and disk during one experiment.
The GPU metrics are only collected if at least one GPU is detected.
To use this feature you can use it with a simple argument:
If you want to use advance features you can specify each parameter this way:
If you allow the monitoring of your system, if will track the following:
system/cpu/count
-> number of CPU coressystem/cpu/usage (%)
-> the average usage of the CPUs.system/cpu/parallelization (%)
-> How many CPU cores use more than 20% of their possibilities? It is useful when you're looking to parallelize your code to train your model or process your data faster.system/ram/usage (%)
-> percentage of the RAM used. Useful to increase batch size or data processed at the same time in the RAM.system/ram/usage (GB)
-> RAM used. Useful to increase batch size or data processed at the same time.system/ram/total (GB)
-> Total RAM in your systemsystem/disk/usage (%)
-> Amount of disk used by the partition that contain the given path, in %. By default uses "/". You can specify the paths to the partition you want to monitor. For instance, the code example above monitors/data
and/home
. Data and code often live in very different paths/volumes, so it is useful for the user to be able to specify its own path.system/disk/usage (GB)
-> Amount of disk used at a given path.system/disk/total (GB)
-> Total disk storage at a given path.system/gpu/count
-> Number of GPUs detected.system/gpu/usage (%)
-> Usage of each GPU in %.system/vram/usage (%)
-> Usage of each GPU virtual memory in %.system/vram/usage (GB)
-> Usage of each GPU virtual memory in GB.system/vram/total (GB)
-> total amount of GPU virtual memory in GB, for each GPU.Note that as several paths can be specified, the full metric name is
system/disk/usage (%)/<user defined name>
. For instance it would besystem/disk/usage (%)/data
for the/path/to/data/disk
andsystem/disk/usage (%)/home
for/home
.Note that as several GPUs can be detected, the full metric name for GPU metrics (except
count
) is suffix with/<idx>
that indicate the index of the GPU. Example:system/gpu/usage (%)/0
All the values that can change during an experiment can be saved as plots. Timestamps are automatically recorded with the metrics. Other metrics (that don't change) such as GPU count, GPU vram total, CPU count, RAM total and disk total are saved as metrics but cannot be saved as plots.
I decided to split the usage in % and GB. First, because it is more consistent with the other loggers out there. Second, both are extremely relevant based on which cloud instance you run your experiment. If you only run your experiment on the same hardware, the distinction is not really interesting.
Files generated
The metrics about the CPU are stored with the
log_metric
function. It means that the.tsv
files are stored in thedvclive/plots
folder. A specific folder,system
, contains all the metrics about the CPU to distinguish them from the user-defined metrics. The metrics are also saved in thedvclive/metrics.json
file.Plot display
Here is what VScode extension looks like:

Here is what Studio looks like:
https://studio.iterative.ai/user/AlexandreKempf/projects/image_classification-imzssxc5ew
Note that studio live update is a little buggy, but it is fixed in this PR
Note that we're calling
nvmlShutdown
andnvmlInit
at each fetch of the GPU metrics like in labML code