
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found! #70

Closed
abdelhadi-azouni opened this issue Jul 25, 2023 · 15 comments

Comments

@abdelhadi-azouni

abdelhadi-azouni commented Jul 25, 2023

Cloned the llama repo, copied in the export_meta_llama_bin.py file, then ran:
torchrun --nproc_per_node 1 export_meta_llama_bin.py

I get
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

I'm on Linux with 16 vCPUs, 32 GB RAM, and no GPU.

Is the code supposed to work without a GPU?

@joey00072

In the export function, try adding

self = self.to(torch.device("cpu"))

at the top of

def export(self, filepath='model.bin'):

Tell me if this works.
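For illustration, a minimal, self-contained sketch of that idea. TinyModel and the torch.save() call are hypothetical stand-ins, not the repo's actual model class or export format:

import torch
import torch.nn as nn

# Hypothetical illustration of the suggestion above: move the module to CPU
# before touching its weights, so no GPU is required.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

    def export(self, filepath='model.bin'):
        cpu_model = self.to(torch.device("cpu"))  # works even if already on CPU
        torch.save(cpu_model.state_dict(), filepath)

TinyModel().export('model.bin')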

@sha0coder

It's not export() that claims the GPU, it's Llama.build().
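The traceback further down confirms this: build() calls torch.distributed.init_process_group("nccl") in generation.py. As a hypothetical patch sketch (not the repo's actual fix), the hard-coded "nccl" could fall back to the CPU-capable gloo backend when no GPU is visible, while still being launched via torchrun:

import torch
import torch.distributed as dist

# Hypothetical replacement for the hard-coded init_process_group("nccl"):
# gloo runs on CPU-only machines, nccl requires CUDA devices.
backend = "nccl" if torch.cuda.is_available() else "gloo"
if not dist.is_initialized():
    dist.init_process_group(backend)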

@abdelhadi-azouni
Author

abdelhadi-azouni commented Jul 25, 2023

Still have the same error. Full traceback:

Traceback (most recent call last):
  File "/home/dev1/llama2.c/llama/export_meta_llama_bin.py", line 85, in <module>
    generator = Llama.build(
  File "/home/dev1/llama2.c/llama/llama/generation.py", line 62, in build
    torch.distributed.init_process_group("nccl")
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1024, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1102) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/dev1/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/dev1/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
export_meta_llama_bin.py FAILED

@Foundation42

Foundation42 commented Jul 25, 2023

I'm having the same problem. I worked with ChatGPT to try to rewrite the parts of the dependent llama library that require the GPU and distributed support. We got some of the way there, but other issues started popping up.

Looking at export_meta_llama_bin.py, it seems like it could eventually be rewritten to not depend on the llama library at all, which might be a lot better.

FWIW, here is the change ChatGPT recommended for the build function in llama/generation.py:

@staticmethod
def build(
    ckpt_dir: str,
    tokenizer_path: str,
    max_seq_len: int,
    max_batch_size: int,
    model_parallel_size: Optional[int] = None,
) -> "Llama":
    if model_parallel_size is None:
        model_parallel_size = 1

    local_rank = 0
    torch.manual_seed(1)

    start_time = time.time()
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
    assert model_parallel_size == len(
        checkpoints
    ), f"Loading a checkpoint for MP={len(checkpoints)} but world size is {model_parallel_size}"
    ckpt_path = checkpoints[0]  # model parallel isn't initialized here, so take the only (MP=1) checkpoint
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    with open(Path(ckpt_dir) / "params.json", "r") as f:
        params = json.loads(f.read())

    model_args: ModelArgs = ModelArgs(
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
        **params,
    )
    tokenizer = Tokenizer(model_path=tokenizer_path)
    model_args.vocab_size = tokenizer.n_words
    torch.set_default_tensor_type(torch.FloatTensor)
    model = Transformer(model_args)
    model.load_state_dict(checkpoint, strict=False)
    print(f"Loaded in {time.time() - start_time:.2f} seconds")

    return Llama(model, tokenizer)

It adds checks for whether a GPU is available and uses different tensor types and schedulers. But, again, there is more to it than this, as other things break as well.
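If a rewrite along these lines worked, the export could presumably be run with plain python instead of torchrun. A hypothetical invocation; the paths and size limits below are placeholders, not something this thread confirms:

from llama import Llama  # the patched generation.py sketched above

# Hypothetical call: ckpt_dir, tokenizer_path and the limits are placeholders.
generator = Llama.build(
    ckpt_dir="llama-2-7b",
    tokenizer_path="tokenizer.model",
    max_seq_len=2048,
    max_batch_size=1,
)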

@sha0coder

That would be great. The other option is publishing the exported llama2_7b.bin.

@Foundation42

Foundation42 commented Jul 25, 2023

> That would be great. The other option is publishing the exported llama2_7b.bin.

That would be nice. It probably won't happen, though, because of the "bypass the license agreement" malarkey.

@karpathy
Owner

> That would be great. The other option is publishing the exported llama2_7b.bin.

> That would be nice. It probably won't happen, though, because of the "bypass the license agreement" malarkey.

Exactly why I didn't. (Also, it is 26GB.)
I'm sorry to see this causing issues; it is silly that we need torchrun to export the weights. I would happily accept a PR fixing that.
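For what it's worth, a very rough sketch of the direction such a PR could take: load the consolidated checkpoint straight onto the CPU, with no process group and no torchrun. The file names below follow the usual Meta download layout and are assumptions, not something this thread confirms.

import json
import torch

# Hypothetical torchrun-free loading sketch; both paths are placeholders.
checkpoint = torch.load("llama-2-7b/consolidated.00.pth", map_location="cpu")
with open("llama-2-7b/params.json") as f:
    params = json.load(f)
print(f"loaded {len(checkpoint)} tensors, params: {params}")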

@rlancemartin

> That would be great. The other option is publishing the exported llama2_7b.bin.

> That would be nice. It probably won't happen, though, because of the "bypass the license agreement" malarkey.

> Exactly why I didn't. (Also, it is 26GB.) I'm sorry to see this causing issues; it is silly that we need torchrun to export the weights. I would happily accept a PR fixing that.

Ah, I see. I also asked about a shared file, since export probably won't work (easily) on a Mac (NCCL dependency).

@karpathy
Owner

karpathy commented Jul 25, 2023

I have to go to work shortly; it should be possible to hack the llama export code to not require torchrun or a GPU. I'll look into it after work, or I'm happy to accept any PRs in the meantime.

@Foundation42

Maybe someone can take the guts of convert.py from the llama.cpp project for the .pth model loading and merge them with this conversion and export code. I would do it, but my Python is not that great; I'm a C/C++ guy.

@dontgetfoundout

Check out this repo I saw linked in another discussion. I don't have a Mac to try the code atm, but this may be a good starting point for the changes you need to get llama running there.

https://github.com/krychu/llama/blob/main/README.md

@sudara

sudara commented Jul 25, 2023

I used this llama fork successfully on my M1 mac:

https://github.com/aggiee/llama-v2-mps

like so:

PYTORCH_ENABLE_MPS_FALLBACK=1 torchrun --nproc_per_node 1 export_meta_llama_bin.py

If you've already downloaded the model, in addition to moving the model folder into the llama-v2-mps repo, be sure to move tokenizer.model and tokenizer_checklist.chk.

I initially just compiled with

clang -Ofast -march=native run.c -lm  -o run 

Note: the model was doing ~3 tokens/s on 96 threads on Linux. Out of the box I get about a token a minute on my M1 Max.

I then did the following:

brew install libomp # get OpenMP
brew install llvm # get latest clang, apple clang is old
echo 'export PATH="/opt/homebrew/opt/llvm/bin:$PATH"' >> ~/.zshrc # make it easy to find new clang
clang -Ofast -fopenmp -march=native run.c  -lm  -o run # recompile with OpenMP

and ran

OMP_NUM_THREADS=8 ./run llama2_7b.bin

which is... slightly faster, maybe? I know nothing about OMP, but without it CPU usage seems to sit around 33%; with threads at 8 (matching the number of cores on my M1 Max) it seems to settle around 120% CPU. With threads at 32 (just to see) the CPUs are more dominated at around 400%, but tokens don't necessarily come faster.

Anyway! A great starting place for my needs (I'm interested in taking a crack at optimizing for Apple/ARM), but yeah... Ctrl-C will be your best friend for now :)

(Screenshot: terminal output, 2023-07-25)

@joey00072

I've ported the code for loading llama in #71.
Can someone test it? (I don't have a beefy machine 🥲)
Note: I have disabled the MPS backend, since not everyone has a Mac and we aren't using it anyway.

CC: @sudara @abdelhadi-azouni @karpathy
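For reference, a minimal sketch of the device selection that note describes (illustrative only, not the exact code in #71):

import torch

# Prefer CUDA when available, otherwise plain CPU; MPS is deliberately skipped,
# matching the note above.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.zeros(1, device=device)
print(f"using {device}, tensor lives on {x.device}")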

@joey00072

updated #71 (comment)

@karpathy
Owner

Try with the updated code :)
