
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found! #70

Closed
abdelhadi-azouni opened this issue Jul 25, 2023 · 15 comments

Comments

@abdelhadi-azouni

abdelhadi-azouni commented Jul 25, 2023

Cloned the llama repo, copied in the export_meta_llama_bin.py file, then ran:
torchrun --nproc_per_node 1 export_meta_llama_bin.py

I get
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

I'm on Linux with 16 vCPUs, 32 GB RAM, and no GPU.

Is the code supposed to work without a GPU?

@joey00072

In the export function, try adding

self = self.to(torch.device("cpu"))

at the top of

def export(self, filepath='model.bin'):

Tell me if this works.
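For illustration, a minimal, self-contained sketch of that idea. TinyModel and the torch.save() call are hypothetical stand-ins, not the repo's actual model class or export format:

import torch
import torch.nn as nn

# Hypothetical illustration of the suggestion above: move the module to CPU
# before touching its weights, so no GPU is required.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

    def export(self, filepath='model.bin'):
        cpu_model = self.to(torch.device("cpu"))  # works even if already on CPU
        torch.save(cpu_model.state_dict(), filepath)

TinyModel().export('model.bin')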

@sha0coder

It's not export() that claims the GPU, it's Llama.build().
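The traceback further down confirms this: build() calls torch.distributed.init_process_group("nccl") in generation.py. As a hypothetical patch sketch (not the repo's actual fix), the hard-coded "nccl" could fall back to the CPU-capable gloo backend when no GPU is visible, while still being launched via torchrun:

import torch
import torch.distributed as dist

# Hypothetical replacement for the hard-coded init_process_group("nccl"):
# gloo runs on CPU-only machines, nccl requires CUDA devices.
backend = "nccl" if torch.cuda.is_available() else "gloo"
if not dist.is_initialized():
    dist.init_process_group(backend)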

@abdelhadi-azouni
Author

abdelhadi-azouni commented Jul 25, 2023

Still have the same error. Full traceback:

Traceback (most recent call last):
  File "/home/dev1/llama2.c/llama/export_meta_llama_bin.py", line 85, in <module>
    generator = Llama.build(
  File "/home/dev1/llama2.c/llama/llama/generation.py", line 62, in build
    torch.distributed.init_process_group("nccl")
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1024, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1102) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/dev1/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
export_meta_llama_bin.py FAILED

@Foundation42

Foundation42 commented Jul 25, 2023

I'm having the same problem. I worked with ChatGPT to try to rewrite the parts of the dependent llama library that require the GPU and distributed support. We got some of the way there, but other issues started popping up.

Looking at export_meta_llama_bin.py, it seems like it could eventually be rewritten to not depend on the llama library at all, which might be a lot better.

FWIW, here is the change ChatGPT recommended for the build function in llama/generation.py:

@staticmethod
def build(
    ckpt_dir: str,
    tokenizer_path: str,
    max_seq_len: int,
    max_batch_size: int,
    model_parallel_size: Optional[int] = None,
) -> "Llama":
    if model_parallel_size is None:
        model_parallel_size = 1

    local_rank = 0
    torch.manual_seed(1)

    start_time = time.time()
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
    assert model_parallel_size == len(
        checkpoints
    ), f"Loading a checkpoint for MP={len(checkpoints)} but world size is {model_parallel_size}"
    ckpt_path = checkpoints[0]  # model parallel isn't initialized here, so take the only (MP=1) checkpoint
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    with open(Path(ckpt_dir) / "params.json", "r") as f:
        params = json.loads(f.read())

    model_args: ModelArgs = ModelArgs(
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
        **params,
    )
    tokenizer = Tokenizer(model_path=tokenizer_path)
    model_args.vocab_size = tokenizer.n_words
    torch.set_default_tensor_type(torch.FloatTensor)
    model = Transformer(model_args)
    model.load_state_dict(checkpoint, strict=False)
    print(f"Loaded in {time.time() - start_time:.2f} seconds")

    return Llama(model, tokenizer)

It adds checks for whether a GPU is available and uses different tensor types and schedulers. But, again, there is more to it than this, as other things break as well.
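If a rewrite along these lines worked, the export could presumably be run with plain python instead of torchrun. A hypothetical invocation; the paths and size limits below are placeholders, not something this thread confirms:

from llama import Llama  # the patched generation.py sketched above

# Hypothetical call: ckpt_dir, tokenizer_path and the limits are placeholders.
generator = Llama.build(
    ckpt_dir="llama-2-7b",
    tokenizer_path="tokenizer.model",
    max_seq_len=2048,
    max_batch_size=1,
)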

@sha0coder

That would be great. The other option is publishing the exported llama2_7b.bin.

@Foundation42

Foundation42 commented Jul 25, 2023

> That would be great. The other option is publishing the exported llama2_7b.bin.

That would be nice. It probably won't happen, though, because of the "bypass the license agreement" malarkey.

@karpathy
Owner

> That would be great. The other option is publishing the exported llama2_7b.bin.

> That would be nice. It probably won't happen, though, because of the "bypass the license agreement" malarkey.

Exactly why I didn't. (Also, it is 26GB.)
I'm sorry to see this causing issues; it is silly that we need torchrun to export the weights. I would happily accept a PR fixing that.
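For what it's worth, a very rough sketch of the direction such a PR could take: load the consolidated checkpoint straight onto the CPU, with no process group and no torchrun. The file names below follow the usual Meta download layout and are assumptions, not something this thread confirms.

import json
import torch

# Hypothetical torchrun-free loading sketch; both paths are placeholders.
checkpoint = torch.load("llama-2-7b/consolidated.00.pth", map_location="cpu")
with open("llama-2-7b/params.json") as f:
    params = json.load(f)
print(f"loaded {len(checkpoint)} tensors, params: {params}")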

@rlancemartin

> That would be great. The other option is publishing the exported llama2_7b.bin.

> That would be nice. It probably won't happen, though, because of the "bypass the license agreement" malarkey.

> Exactly why I didn't. (Also, it is 26GB.) I'm sorry to see this causing issues; it is silly that we need torchrun to export the weights. I would happily accept a PR fixing that.

Ah, I see. I also asked about a shared file, since export probably won't work (easily) on a Mac (NCCL dependency).

@karpathy
Owner

karpathy commented Jul 25, 2023

I have to go to work shortly; it should be possible to hack the llama export code to not require torchrun or a GPU. I'll look into it after work, or I'm happy to accept any PRs in the meantime.

@Foundation42

Maybe someone can take the guts of convert.py from the llama.cpp project for the .pth model loading and merge them with this conversion and export code. I would do it, but my Python is not that great; I'm a C/C++ guy.

@dontgetfoundout

Check out this repo I saw linked in another discussion. I don't have a Mac to try the code atm, but this may be a good starting point for the changes you need to get llama running there.

https://github.com/krychu/llama/blob/main/README.md

@sudara

sudara commented Jul 25, 2023

I used this llama fork successfully on my M1 mac:

https://github.com/aggiee/llama-v2-mps

like so:

PYTORCH_ENABLE_MPS_FALLBACK=1 torchrun --nproc_per_node 1 export_meta_llama_bin.py

If you've already downloaded the model, in addition to moving the model folder into the llama-v2-mps repo, be sure to move tokenizer.model and tokenizer_checklist.chk.

I initially just compiled with

clang -Ofast -march=native run.c -lm  -o run 

Note: the model was doing ~3 tokens/s on 96 threads on Linux. Out of the box I get about a token a minute on my M1 Max.

I then did the following:

brew install libomp # get OpenMP
brew install llvm # get latest clang, apple clang is old
echo 'export PATH="/opt/homebrew/opt/llvm/bin:$PATH"' >> ~/.zshrc # make it easy to find new clang
clang -Ofast -fopenmp -march=native run.c  -lm  -o run # recompile with OpenMP

and ran

OMP_NUM_THREADS=8 ./run llama2_7b.bin

which is... slightly faster, maybe? I know nothing about OMP, but without it CPU usage seems to sit around 33%; with threads at 8 (matching the number of cores on my M1 Max) it seems to settle around 120% CPU. With threads at 32 (just to see) the CPUs are more dominated at around 400%, but tokens don't necessarily come faster.

Anyway! A great starting place for my needs (I'm interested in taking a crack at optimizing for Apple/ARM), but yeah... Ctrl-C will be your best friend for now :)

(Screenshot: terminal output, 2023-07-25)

@joey00072

I've ported the code for loading llama in #71.
Can someone test it? (I don't have a beefy machine 🥲)
Note: I have disabled the MPS backend, since not everyone has a Mac and we aren't using it anyway.

CC: @sudara @abdelhadi-azouni @karpathy
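For reference, a minimal sketch of the device selection that note describes (illustrative only, not the exact code in #71):

import torch

# Prefer CUDA when available, otherwise plain CPU; MPS is deliberately skipped,
# matching the note above.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.zeros(1, device=device)
print(f"using {device}, tensor lives on {x.device}")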

@joey00072

updated #71 (comment)

@karpathy
Owner

Try with the updated code :)
