bugfix for parallel rand generator on multi-gpu #9300

yzhliu · 2018-01-04T02:07:48Z

Description

GPU rand state was not allocated on the right device, which can cause 'illegal memory access' when running with multi-GPUs.

It's line 281 in src/resource.cc causes the problem.

Can we do test on multi-gpu? I mean, can I assume our CI instances have multi-gpu enabled?

Checklist

Essentials

Passed code style checking (make lint)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Fix gpu rand state allocation problem
readability: NativeRandom -> ParallelRandom

yzhliu · 2018-01-04T02:11:24Z

src/resource.cc

        : ctx(ctx), sampler(ncopy), resource(ncopy), curr_ptr(0) {
      for (size_t i = 0; i < sampler.size(); ++i) {
        const uint32_t seed = ctx.dev_id + i * kMaxNumGPUs + global_seed * kRandMagic;
        resource[i].var = Engine::Get()->NewVariable();
        common::random::RandGenerator<xpu> *r = new common::random::RandGenerator<xpu>();
-        common::random::RandGenerator<xpu>::AllocState(r);


this is the line that caused the bug.

cjolivier01 · 2018-01-04T03:30:38Z

Fix verified

marcoabreu · 2018-01-04T10:33:03Z

Feel free to create a test case as CI has two GPUs available per slave

cjolivier01 · 2018-01-04T15:36:13Z

Current test machines in CI have at least two GPUs. However, you will need to detect GPU count in the unit test because when an individual runs unit test on his local machine (Mac, for instance), it likely will only have one GPU.

cjolivier01 · 2018-01-04T15:38:33Z

For example, run nvidia-smi -L and count number of lines returned which start with "GPU [0-9]+" or something like that

marcoabreu · 2018-01-04T17:58:42Z

Relying on nvidia-smi could cause issues as behaviour on windows and unix tend to differ. It should be possible to get the number of available GPUs within MXNet, right?

cjolivier01 · 2018-01-04T18:09:22Z

I am not aware of an API call for that from python. One would need to be created, which would also force the CUDA library to init.
Do you know the outputs for the same command to differ or you're theorizing? I don't care how it's done, really -- just trying to suggest the way with the least lifting.

marcoabreu · 2018-01-04T18:14:20Z

The CUDA library is going to be needed anyways, right? So that shouldn't be an issue. Couldn't this be achieved by trying to address the GPU on slot 0, 1, 2 etc and then just count how often it fails?

The problem is that nvidia-smi on windows may be present or not. While working on CI for windows, I've experienced that this tool may be at different locations which can not be retrieved programmatically. The workaround was to use a static path for CI.

bugfix for parallel rand generator on multi-gpu

ad531cb

yzhliu added the Bug label Jan 4, 2018

yzhliu requested review from piiswrong and szha January 4, 2018 02:07

yzhliu commented Jan 4, 2018

View reviewed changes

szha approved these changes Jan 4, 2018

View reviewed changes

cjolivier01 approved these changes Jan 4, 2018

View reviewed changes

cjolivier01 merged commit 661bcef into apache:master Jan 4, 2018

yzhliu deleted the fix-rand-multi-gpu branch January 4, 2018 18:06

yzhliu added a commit to yzhliu/mxnet that referenced this pull request Jan 4, 2018

bugfix for parallel rand generator on multi-gpu (apache#9300)

e4ff7c5

yuxiangw pushed a commit to yuxiangw/incubator-mxnet that referenced this pull request Jan 25, 2018

bugfix for parallel rand generator on multi-gpu (apache#9300)

cc3d15b

rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018

bugfix for parallel rand generator on multi-gpu (apache#9300)

e85cf65

zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018

bugfix for parallel rand generator on multi-gpu (apache#9300)

b658f21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bugfix for parallel rand generator on multi-gpu #9300

bugfix for parallel rand generator on multi-gpu #9300

yzhliu commented Jan 4, 2018

yzhliu Jan 4, 2018

cjolivier01 commented Jan 4, 2018

marcoabreu commented Jan 4, 2018

cjolivier01 commented Jan 4, 2018

cjolivier01 commented Jan 4, 2018

marcoabreu commented Jan 4, 2018

cjolivier01 commented Jan 4, 2018

marcoabreu commented Jan 4, 2018

bugfix for parallel rand generator on multi-gpu #9300

bugfix for parallel rand generator on multi-gpu #9300

Conversation

yzhliu commented Jan 4, 2018

Description

Checklist

Essentials

Changes

yzhliu Jan 4, 2018

Choose a reason for hiding this comment

cjolivier01 commented Jan 4, 2018

marcoabreu commented Jan 4, 2018

cjolivier01 commented Jan 4, 2018

cjolivier01 commented Jan 4, 2018

marcoabreu commented Jan 4, 2018

cjolivier01 commented Jan 4, 2018

marcoabreu commented Jan 4, 2018