bugfix for parallel rand generator on multi-gpu #9300
: ctx(ctx), sampler(ncopy), resource(ncopy), curr_ptr(0) {
  for (size_t i = 0; i < sampler.size(); ++i) {
    const uint32_t seed = ctx.dev_id + i * kMaxNumGPUs + global_seed * kRandMagic;
    resource[i].var = Engine::Get()->NewVariable();
    common::random::RandGenerator<xpu> *r = new common::random::RandGenerator<xpu>();
    common::random::RandGenerator<xpu>::AllocState(r);
This is the line that caused the bug.
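For context, the seed expression in the snippet above can be worked through in isolation. The following is a minimal sketch, not the code from src/resource.cc: the values chosen for kMaxNumGPUs and kRandMagic are illustrative assumptions, and SamplerSeed and CountDistinctSeeds are hypothetical helper names. It shows that, as long as dev_id stays below kMaxNumGPUs, each (device, copy) pair gets its own seed for a fixed global seed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Illustrative values only (assumption); the real constants live in src/resource.cc.
constexpr uint32_t kMaxNumGPUs = 16;
constexpr uint32_t kRandMagic = 127;

// Mirrors the seed expression from the snippet:
//   dev_id + i * kMaxNumGPUs + global_seed * kRandMagic
// Hypothetical helper, not an MXNet function.
inline uint32_t SamplerSeed(int dev_id, size_t copy, uint32_t global_seed) {
  assert(dev_id >= 0 && static_cast<uint32_t>(dev_id) < kMaxNumGPUs);
  return static_cast<uint32_t>(dev_id) +
         static_cast<uint32_t>(copy) * kMaxNumGPUs +
         global_seed * kRandMagic;
}

// Counts how many distinct seeds are produced for num_devices x ncopy samplers.
inline size_t CountDistinctSeeds(int num_devices, size_t ncopy, uint32_t global_seed) {
  std::set<uint32_t> seeds;
  for (int dev = 0; dev < num_devices; ++dev) {
    for (size_t i = 0; i < ncopy; ++i) {
      seeds.insert(SamplerSeed(dev, i, global_seed));
    }
  }
  return seeds.size();
}
```

With these assumed constants, device 1, copy 2 and global seed 0 give 1 + 2 * 16 = 33, and four devices with four copies each yield 16 distinct seeds.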
Fix verified.
Feel free to create a test case, as CI has two GPUs available per slave.
Current test machines in CI have at least two GPUs. However, you will need to detect the GPU count in the unit test, because when someone runs the unit tests on a local machine (a Mac, for instance), it will likely have only one GPU.
For example, run nvidia-smi -L and count the number of returned lines that start with "GPU [0-9]+" or something like that.
Relying on nvidia-smi could cause issues, as its behaviour on Windows and Unix tends to differ. It should be possible to get the number of available GPUs within MXNet, right?
I am not aware of an API call for that from Python. One would need to be created, which would also force the CUDA library to initialize.
The CUDA library is going to be needed anyway, right? So that shouldn't be an issue. Couldn't this be achieved by trying to address the GPU in slot 0, 1, 2, etc. and then counting how often it fails? The problem is that nvidia-smi may or may not be present on Windows. While working on CI for Windows, I found that this tool may be in different locations that cannot be retrieved programmatically. The workaround was to use a static path for CI.
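The two detection ideas discussed above can be sketched as follows. This is only an illustration under stated assumptions, not an MXNet API: CountGpuLines and CountGpusByProbing are hypothetical names. The first function takes the text of an nvidia-smi -L listing as an argument, so locating and running the tool is left to the caller. The second takes a caller-supplied probe (for example, a wrapper that tries to allocate on a given device slot) and, as one possible reading of the suggestion, stops at the first slot that fails.

```cpp
#include <cassert>
#include <functional>
#include <regex>
#include <sstream>
#include <string>

// Counts the lines of an `nvidia-smi -L` listing that start with "GPU <index>".
// Hypothetical helper; the listing text is supplied by the caller.
inline int CountGpuLines(const std::string& listing) {
  static const std::regex kGpuLine("^GPU [0-9]+");
  std::istringstream in(listing);
  std::string line;
  int count = 0;
  while (std::getline(in, line)) {
    if (std::regex_search(line, kGpuLine)) ++count;
  }
  return count;
}

// Probes slots 0, 1, 2, ... and returns how many succeed before the first failure.
// `probe` is an assumption: any callable that reports whether a slot is usable.
inline int CountGpusByProbing(const std::function<bool(int)>& probe, int max_slots) {
  assert(max_slots >= 0);
  int count = 0;
  while (count < max_slots && probe(count)) ++count;
  return count;
}
```

A multi-GPU test could then skip itself when either count is below two, which keeps it runnable on single-GPU or CPU-only machines.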
Description
The GPU random state was not allocated on the correct device, which can cause an 'illegal memory access' error when running on multiple GPUs.
Line 281 in src/resource.cc causes the problem.
Can we test on multi-GPU? I mean, can I assume our CI instances have multiple GPUs enabled?
Checklist
Essentials
make lint
Changes