
HfHubHTTPError: 429 Client Error: Too Many Requests for URL when trying to access SlimPajama-627B or c4 on TPUs #7344

Open
clankur opened this issue Dec 22, 2024 · 1 comment


clankur commented Dec 22, 2024

Describe the bug

I am trying to run training jobs on Google TPUs using Hugging Face's DataLoader on SlimPajama-627B and c4, but I run into a 429 Client Error: Too Many Requests for URL error when I call load_dataset. Oddly, I am able to train successfully with the wikitext dataset. Is there something I need to set up specifically to train on SlimPajama or C4 with TPUs? It is not clear to me why I am getting these errors.

Steps to reproduce the bug

The commands below reproduce the error. You will need a ClearML account (you can create one here) with a queue set up to run on Google TPUs:

git clone https://github.com/clankur/muGPT.git
cd muGPT
python -m train --config-name=slim_v4-32_84m.yaml +training.queue={NAME_OF_CLEARML_QUEUE}

The error I see:

Traceback (most recent call last):
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/clearml/binding/hydra_bind.py", line 230, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/train.py", line 1037, in main
    main_contained(config, logger)
  File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/train.py", line 840, in main_contained
    loader = get_loader("train", config.training_data, config.training.tokens)
  File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/input_loader.py", line 549, in get_loader
    return HuggingFaceDataLoader(split, config, token_batch_params)
  File "/home/clankur/.clearml/venvs-builds/3.10/task_repository/muGPT.git/input_loader.py", line 395, in __init__
    self.dataset = load_dataset(
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 2112, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1798, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1495, in dataset_module_factory
    raise e1 from None
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1479, in dataset_module_factory
    ).get_module()
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/load.py", line 1034, in get_module
    else get_data_patterns(base_path, download_config=self.download_config)
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 457, in get_data_patterns
    return _get_data_files_patterns(resolver)
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 248, in _get_data_files_patterns
    data_files = pattern_resolver(pattern)
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/datasets/data_files.py", line 340, in resolve_pattern
    for filepath, info in fs.glob(pattern, detail=True).items()
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 409, in glob
    return super().glob(path, **kwargs)
  File "/home/clankur/.clearml/venvs-builds/3.10/lib/python3.10/site-packages/fsspec/spec.py", line 602, in glob
    allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 429, in find
    out = self._ls_tree(path, recursive=True, refresh=refresh, revision=resolved_path.revision, **kwargs)
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 358, in _ls_tree
    self._ls_tree(
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 375, in _ls_tree
    for path_info in tree:
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3080, in list_repo_tree
    for path_info in paginate(path=tree_url, headers=headers, params={"recursive": recursive, "expand": expand}):
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/utils/_pagination.py", line 46, in paginate
    hf_raise_for_status(r)
  File "/home/clankur/conda/envs/jax/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
    raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/cerebras/SlimPajama-627B/tree/2d0accdd58c5d5511943ca1f5ff0e3eb5e293543?recursive=True&expand=True&cursor=ZXlKbWFXeGxYMjVoYldVaU9pSjBaWE4wTDJOb2RXNXJNUzlsZUdGdGNHeGxYMmh2YkdSdmRYUmZPVFEzTG1wemIyNXNMbnB6ZENKOTo2MjUw (Request ID: Root=1-67673de9-1413900606ede7712b08ef2c;1304c09c-3e69-4222-be14-f10ee709d49c)
maximum queue size reached
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Expected behavior

I'd expect the DataLoader to load the SlimPajama-627B and c4 datasets without issue.

Environment info

  • datasets version: 2.14.4
  • Platform: Linux-5.8.0-1035-gcp-x86_64-with-glibc2.31
  • Python version: 3.10.16
  • Huggingface_hub version: 0.26.5
  • PyArrow version: 18.1.0
  • Pandas version: 2.2.3

lhoestq (Member) commented Jan 10, 2025

Hi! This is due to your old version of datasets, which calls HF with expand=True, an option that is strongly rate limited.

Recent versions of datasets no longer rely on this, so you can fix your issue by upgrading datasets :)

pip install -U datasets
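
To confirm the upgrade actually took effect in the environment that runs the job, a small check like the following may help. It is only a sketch: the helper names are hypothetical, and the comparison is against the version reported in this issue (2.14.4), not against a known first fixed release, which the thread does not state. Since the traceback shows packages loaded from both a conda environment and a ClearML-built venv, it is probably worth running this inside the job itself.

```python
import re
from importlib import metadata

# Version reported in the "Environment info" section of this issue.
REPORTED_DATASETS_VERSION = "2.14.4"


def version_tuple(version):
    """Turn '2.14.4' or '3.2.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_newer_than_reported(installed, reported=REPORTED_DATASETS_VERSION):
    """True if the installed version is strictly newer than the reported one."""
    return version_tuple(installed) > version_tuple(reported)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for package in ("datasets", "huggingface_hub"):
        print(package, installed_version(package))
    current = installed_version("datasets")
    if current is not None and not is_newer_than_reported(current):
        print("datasets is still at or below", REPORTED_DATASETS_VERSION)
```

The version parsing is deliberately simple and uses only the standard library; it ignores pre-release suffixes.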

You can also get maximum HF availability on your compute nodes with HF Enterprise (see network security features)
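
If upgrading is not immediately possible, a temporary workaround could be to retry the call with exponential backoff when the Hub answers 429. This is a generic sketch, not something proposed in this thread and not part of the datasets API: call_with_backoff and is_rate_limited are hypothetical helpers, and the check assumes the raised exception exposes response.status_code the way requests.HTTPError does. It may not help if the rate limit is hit consistently.

```python
import time


def is_rate_limited(exc):
    """True if the exception carries an HTTP response with status 429."""
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None) == 429


def call_with_backoff(fn, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff only on 429 errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            # Re-raise anything that is not a rate limit, or the final failure.
            if last_attempt or not is_rate_limited(exc):
                raise
            # Wait 2s, 4s, 8s, ... (with the default base_delay) before retrying.
            sleep(base_delay * 2 ** attempt)


# Illustrative usage (arguments are assumptions, adapt to your loader):
# from datasets import load_dataset
# dataset = call_with_backoff(
#     lambda: load_dataset("cerebras/SlimPajama-627B", split="train")
# )
```

Errors other than 429 are re-raised immediately, so genuine failures are not hidden behind retries.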
