RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found! #70
In the export function, try llama2.c/export_meta_llama_bin.py, line 14 at commit 98ec4ba.
Tell me if this works |
It's not export() that claims the GPU, it's Llama.build() |
I still have the same error: Traceback (most recent call last): |
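For context: the error is raised when Llama.build initializes torch.distributed with the NCCL backend, which needs at least one CUDA device. A minimal sketch of a CPU-only tweak, assuming the stock Meta llama code and still launching via torchrun so the rendezvous environment variables are set, is to switch that initialization to the gloo backend:

```python
import torch.distributed as dist

# Sketch only: inside Llama.build, initialize the process group with the
# CPU-capable gloo backend instead of nccl.
if not dist.is_initialized():
    dist.init_process_group(backend="gloo")
```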
I'm having the same problem. I worked with ChatGPT to try to rewrite the parts of the dependent llama library that require the GPU and distributed support. We got pretty far, but other issues started popping up. Looking at export_meta_llama_bin.py, it seems like it could maybe be rewritten to not depend on the llama library at all at some point, which might be a lot better. FWIW, here was a change recommended by ChatGPT to the build function in llama/generation.py:

```python
@staticmethod
def build(
    ckpt_dir: str,
    tokenizer_path: str,
    max_seq_len: int,
    max_batch_size: int,
    model_parallel_size: Optional[int] = None,
) -> "Llama":
    # CPU-only variant: no torch.distributed / NCCL initialization,
    # single (implicit) model-parallel rank.
    if model_parallel_size is None:
        model_parallel_size = 1
    local_rank = 0
    torch.manual_seed(1)
    start_time = time.time()
    checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
    assert len(checkpoints) > 0, f"no checkpoint files found in {ckpt_dir}"
    assert model_parallel_size == len(
        checkpoints
    ), f"Loading a checkpoint for MP={len(checkpoints)} but world size is {model_parallel_size}"
    # Index by local_rank instead of get_model_parallel_rank(), which would
    # require the model-parallel groups to be initialized.
    ckpt_path = checkpoints[local_rank]
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    with open(Path(ckpt_dir) / "params.json", "r") as f:
        params = json.loads(f.read())
    model_args: ModelArgs = ModelArgs(
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
        **params,
    )
    tokenizer = Tokenizer(model_path=tokenizer_path)
    model_args.vocab_size = tokenizer.n_words
    # Keep the model in plain float32 on the CPU.
    torch.set_default_tensor_type(torch.FloatTensor)
    model = Transformer(model_args)
    model.load_state_dict(checkpoint, strict=False)
    print(f"Loaded in {time.time() - start_time:.2f} seconds")
    return Llama(model, tokenizer)
```

It adds checks for whether a GPU is available and uses different tensor types and schedulers, but, again, there is more to it than this, as other stuff breaks as well |
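For anyone trying the patched build, a hypothetical CPU-only invocation could look like the sketch below; the checkpoint directory, tokenizer path, and sequence/batch sizes are placeholder values, not the exact arguments the export script uses.

```python
# Hypothetical call to the CPU-only Llama.build above; the paths are
# placeholders for wherever the Meta 7B checkpoint and tokenizer live.
llama = Llama.build(
    ckpt_dir="llama-2-7b",
    tokenizer_path="tokenizer.model",
    max_seq_len=4096,
    max_batch_size=1,
)
```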
That would be great. The other option is publishing the exported llama2_7b.bin |
That would be nice. It probably won't happen, though, because of the "bypass license agreement" malarkey |
Exactly why I didn't. (Also it is 26GB) |
Ah, I see. I also asked about a shared file, since the export probably won't work (easily) on a Mac (NCCL dependency). |
I have to go to work shortly; it should be possible to hack the llama export code to not require torchrun or a GPU. I'll look into it after work, or I'm happy to accept any PRs in the meantime. |
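One way the torchrun-free route could look, as a rough sketch (assuming the distributed init is kept but switched to the gloo backend): fill in the rendezvous environment variables that torchrun would normally set and start a single-process group on the CPU.

```python
import os
import torch.distributed as dist

# Sketch only: emulate a single-process torchrun launch so the llama code
# can initialize torch.distributed without a GPU. The address/port values
# are arbitrary local placeholders.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
dist.init_process_group(backend="gloo", rank=0, world_size=1)
```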
Maybe someone can use the guts of convert.py from the llama.cpp project for the .pth model loading and merge it with this conversion and export code. I would do it, but my Python is not that great; I'm a C/C++ guy. |
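A rough sketch of that direction, assuming the standard layout of the downloaded Meta checkpoint (a consolidated.00.pth next to params.json) and skipping the llama library entirely:

```python
import json
from pathlib import Path

import torch

# Sketch only: read the raw Llama 2 7B checkpoint without the llama
# library. "llama-2-7b" is a placeholder for the download directory.
ckpt_dir = Path("llama-2-7b")
state_dict = torch.load(ckpt_dir / "consolidated.00.pth", map_location="cpu")
params = json.loads((ckpt_dir / "params.json").read_text())

print(params)              # dim, n_layers, n_heads, ...
print(len(state_dict))     # number of weight tensors in the checkpoint
```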
Check out this repo I saw linked in another discussion. I don't have a Mac to try the code atm, but this may be a good starting point for the changes you need to get llama running there. |
I used this llama fork successfully on my M1 mac: https://github.com/aggiee/llama-v2-mps like so:
If you've already downloaded the model, in addition to moving the model folder into the … I initially just compiled with …
Note: the model was doing ~3 tokens/s on 96 threads on Linux. I get about a token a minute out of the box on my M1 Max. I then did the following:
and ran
which is... slightly faster... maybe? I know nothing about OMP, but without it, CPU usage seems to be around 33%; with threads set to 8 (matching the number of cores on my M1 Max) it seems to settle around 120% CPU. With threads at 32 (just to see), the CPUs are more dominated at 400%, but tokens don't necessarily come faster. Anyway! A great starting place for my needs (I'm interested in taking a crack at optimizing for Apple/ARM), but yeah... Ctrl-C will be your best friend for now :) |
I've ported code for loading llama, |
updated #71 (comment) |
try with updated code :) |
Cloned the llama repo, copied the export_meta_llama_bin.py file, then ran:
torchrun --nproc_per_node 1 export_meta_llama_bin.py
I get
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
I'm on Linux with 16 vCPUs, 32G RAM, and no GPU.
Is the code supposed to work without a GPU?