
Unserializable Error when using the Energon Dataloader for NeVA (LLaVA) pretraining / fine-tuning with NeMo 2.0 #11931

Open
bernardhan33 opened this issue Jan 22, 2025 · 2 comments

@bernardhan33

Describe the bug

Please advise if I should move this bug to NeMo-Run or Megatron-Energon repositories.

I have been following the NeMo 2.0 official documentation to test LLaVA pretraining and fine-tuning with the Energon Dataloader. However, I keep running into an "Unserializable" error that the FAQ does not help resolve.

Steps/Code to reproduce bug

  1. Start the container on an a3-megagpu-8g node (on GCP) with
docker run --gpus all -it --rm -v /mnt/disks/ssd-array/nemo-data:/nemo-data -v /proc:/writable_proc --shm-size=32g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.12
  2. Create a Python script with the following code:
from nemo.collections.multimodal.data.energon import (
    ImageToken,
    MultiModalSampleConfig,
    EnergonMultiModalDataModule,
)
from transformers import AutoProcessor, ImageProcessingMixin
import nemo_run as run

from typing import Optional

import lightning.pytorch as pl
import torch

from nemo import lightning as nl
from nemo.collections import llm, vlm
from nemo.collections.llm.recipes.finetune_default import nemo_resume
from nemo.collections.llm.recipes.log.default import tensorboard_logger
from nemo.collections.llm.recipes.optim.adam import distributed_fused_adam_with_cosine_annealing
from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed
from nemo.utils.exp_manager import TimingCallback

# Load processor, tokenizer, and image processor from pre-trained model
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
tokenizer = processor.tokenizer
image_processor = processor.image_processor

# Define dataset path
dataset_path = "/nemo-data/wds"

name = "nemo-2.0-llava1.5-pretraining"

# Configure multimodal samples
config = MultiModalSampleConfig(
    image_token=ImageToken(token_str="<image>", token_id=-200),
    ignore_place_holder=-100
)

finetune = vlm.llava15_7b.finetune_recipe(
    name="llava15_7b_finetune",
    dir="/nemo-data/new-ckpts",
    num_nodes=1,
    num_gpus_per_node=8,
)
finetune.data = run.Config(
    EnergonMultiModalDataModule,
    path=dataset_path,
    tokenizer=tokenizer,
    image_processor=image_processor,
    micro_batch_size=8,
    global_batch_size=256,
    num_workers=4,
    multimodal_sample_config=config,
)

run.run(finetune, executor=run.LocalExecutor())

Note that all of the code is copied directly from the NeMo 2.0 documentation without modifications.

Running the code yields the following error:

fiddle._src.experimental.serialization.UnserializableValueError: Unserializable value <unk> of type <class 'tokenizers.AddedToken'>. Error occurred at path "<root>.kwargs['unk_token']".")
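For context, the error path "<root>.kwargs['unk_token']" seems to point at the tokenizer's init kwargs, where Hugging Face stores special tokens as tokenizers.AddedToken objects. A quick check I can run (just a hedged diagnostic of mine; init_kwargs is the standard Hugging Face attribute, though whether Fiddle walks exactly this dict is my assumption):

from transformers import AutoProcessor

# Inspect the value the error path appears to refer to.
tokenizer = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf").tokenizer
print(type(tokenizer.init_kwargs.get("unk_token")))  # expected: <class 'tokenizers.AddedToken'>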

Expected behavior

I should be able to start the training job just fine.

Environment overview (please complete the following information)

  • Environment location: GCP, a3-megagpu-8g node.
  • Method of NeMo install: used docker container nvcr.io/nvidia/nemo:24.12.
  • If method of install is [Docker], provide docker pull & docker run commands used: Yes, see above.

Additional context

I have seen this entry in the FAQ and tried a few alternatives, but none of them helped:

  1. Wrapping the tokenizer object with run.Config directly via run.Config(tokenizer) --> same error.
  2. Initializing a different tokenizer with:
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
tokenizer = run.Config(
    AutoTokenizer,
    "meta-llama/Meta-Llama-3-8B"
)

--> this gets past the serialization problem with the tokenizer, but I then ran into the same serialization problem with image_processor.
  3. Trying a different image_processor:

from transformers import CLIPImageProcessor
image_processor = run.Config(
    CLIPImageProcessor.from_pretrained,
    "openai/clip-vit-large-patch14-336",
    torch_dtype=torch.bfloat16
)

--> got a similar error (see the sketch after the error output below):

fiddle._src.experimental.serialization.UnserializableValueError: Unserializable value <bound method ImageProcessingMixin.from_pretrained of <class 'transformers.image_processing_base.ImageProcessingMixin'>> of type <class 'method'>. Error occurred at path '<root>.data.image_processor.<metadata>.fn_or_cls'.")
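One pattern I would expect to sidestep both failures (only a hedged sketch of mine, not something from the NeMo docs; load_tokenizer and load_image_processor are hypothetical helpers I'm defining here): keep every live Hugging Face object out of the config graph by giving run.Config plain module-level functions that build those objects inside the job, so Fiddle only has to serialize function references and string arguments.

import nemo_run as run
from transformers import AutoProcessor
from nemo.collections.multimodal.data.energon import EnergonMultiModalDataModule

# Hypothetical helpers; they need to live in an importable module (not a notebook cell)
# so that Fiddle can serialize a reference to them by module path and name.
def load_tokenizer(model_id: str = "llava-hf/llava-1.5-7b-hf"):
    # Built at job start, so the AddedToken objects never enter the config graph.
    return AutoProcessor.from_pretrained(model_id).tokenizer

def load_image_processor(model_id: str = "llava-hf/llava-1.5-7b-hf"):
    return AutoProcessor.from_pretrained(model_id).image_processor

# finetune, dataset_path, and config are the objects from the script above.
finetune.data = run.Config(
    EnergonMultiModalDataModule,
    path=dataset_path,
    tokenizer=run.Config(load_tokenizer),
    image_processor=run.Config(load_image_processor),
    micro_batch_size=8,
    global_batch_size=256,
    num_workers=4,
    multimodal_sample_config=config,
)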

bernardhan33 added the bug label on Jan 22, 2025
@yashaswikarnati (Collaborator)

Sorry you are running into this; it is a known issue with Fiddle serialization that will be fixed with this PR.

We have a workaround - https://github.com/NVIDIA/NeMo/pull/11655/files.

If it's easier, you could use the scripts here: https://github.com/NVIDIA/NeMo/blob/main/scripts/vlm/llava_next_pretrain.py. These do not rely on NeMo-Run to start the training.
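Roughly, those scripts just build plain Python objects and never go through Fiddle, so live tokenizers and image processors are fine. A minimal sketch of the data-module part, reusing the paths and token ids from your report (the model/trainer/optimizer setup is in the linked script):

from transformers import AutoProcessor
from nemo.collections.multimodal.data.energon import (
    ImageToken,
    MultiModalSampleConfig,
    EnergonMultiModalDataModule,
)

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Constructed directly, with no run.Config wrapper, so nothing gets serialized.
data = EnergonMultiModalDataModule(
    path="/nemo-data/wds",
    tokenizer=processor.tokenizer,
    image_processor=processor.image_processor,
    micro_batch_size=8,
    global_batch_size=256,
    num_workers=4,
    multimodal_sample_config=MultiModalSampleConfig(
        image_token=ImageToken(token_str="<image>", token_id=-200),
        ignore_place_holder=-100,
    ),
)
# ...then pass `data` to llm.pretrain(...) / llm.finetune(...) the way the script does.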

yashaswikarnati self-assigned this on Jan 23, 2025
@bernardhan33 (Author)

Thanks @yashaswikarnati, I was able to get llava_next_pretrain.py to work with the Energon data loader.

However, I'd also like to test the native data format with NevaLazyDataModule, so I replaced lines 70 to 79 with the following code:

data_config = vlm.ImageDataConfig(
    image_folder="/nemo-data/images_1344_1344",
    conv_template="v1",  # Customize based on your dataset needs
)

# Data module setup
data = vlm.NevaLazyDataModule(
    paths="/nemo-data/filtered_data.json",  # Path to your dataset
    data_config=data_config,
    seq_length=decoder_seq_length,
    global_batch_size=gbs,  # Global batch size
    micro_batch_size=mbs,  # Micro batch size
    tokenizer=tokenizer.tokenizer,  # Define your tokenizer if needed
    image_processor=processor.image_processor,  # Add an image processor if required
    num_workers=8,  # Number of workers for data loading
)

The code then fails with

Traceback (most recent call last):
  File "/opt/NeMo/test.py", line 242, in <module>
    main(args)
  File "/opt/NeMo/test.py", line 203, in main
    llm.pretrain(
  File "/opt/NeMo/nemo/collections/llm/api.py", line 149, in pretrain
    _validate_config(model, data, trainer, log=log, resume=resume, optim=optim)
  File "/opt/NeMo/nemo/collections/llm/api.py", line 910, in _validate_config
    assert data.micro_batch_size > 0
AttributeError: 'NevaLazyDataModule' object has no attribute 'micro_batch_size'

even though NevaLazyDataModule obviously has the micro_batch_size attribute. This reminds me of my comment in the other issue. Any suggestions here?
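In case it helps with debugging, this is what I plan to print right after constructing `data` above, before llm.pretrain() runs (just a diagnostic; I'm not assuming any attribute beyond the micro_batch_size constructor argument actually exists on the instance):

# List the instance attributes actually present on NevaLazyDataModule in the 24.12
# container, to see where micro_batch_size ends up before _validate_config asserts on it.
print(sorted(vars(data).keys()))
print(hasattr(data, "micro_batch_size"))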
