
Unserializable Error when using the Energon Dataloader for NeVA (LLaVA) pretraining / fine-tuning with NeMo 2.0 #11931

Open
bernardhan33 opened this issue Jan 22, 2025 · 2 comments

@bernardhan33

Describe the bug

Please advise if I should move this bug to NeMo-Run or Megatron-Energon repositories.

I have been following the NeMo 2.0 official documentation to test LLaVA pretraining and fine-tuning with the Energon Dataloader. However, I keep running into an "Unserializable" error that the FAQ does not help resolve.

Steps/Code to reproduce bug

  1. Start the container on an a3-megagpu-8g node (on GCP) with
docker run --gpus all -it --rm -v /mnt/disks/ssd-array/nemo-data:/nemo-data -v /proc:/writable_proc --shm-size=32g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.12
  2. Create a Python script with the following code:
from nemo.collections.multimodal.data.energon import (
    ImageToken,
    MultiModalSampleConfig,
    EnergonMultiModalDataModule,
)
from transformers import AutoProcessor, ImageProcessingMixin
import nemo_run as run

from typing import Optional

import lightning.pytorch as pl
import torch

from nemo import lightning as nl
from nemo.collections import llm, vlm
from nemo.collections.llm.recipes.finetune_default import nemo_resume
from nemo.collections.llm.recipes.log.default import tensorboard_logger
from nemo.collections.llm.recipes.optim.adam import distributed_fused_adam_with_cosine_annealing
from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed
from nemo.utils.exp_manager import TimingCallback

# Load processor, tokenizer, and image processor from pre-trained model
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
tokenizer = processor.tokenizer
image_processor = processor.image_processor

# Define dataset path
dataset_path = "/nemo-data/wds"

name = "nemo-2.0-llava1.5-pretraining"

# Configure multimodal samples
config = MultiModalSampleConfig(
    image_token=ImageToken(token_str="<image>", token_id=-200),
    ignore_place_holder=-100
)

finetune = vlm.llava15_7b.finetune_recipe(
    name="llava15_7b_finetune",
    dir="/nemo-data/new-ckpts",
    num_nodes=1,
    num_gpus_per_node=8,
)
finetune.data = run.Config(
    EnergonMultiModalDataModule,
    path=dataset_path,
    tokenizer=tokenizer,
    image_processor=image_processor,
    micro_batch_size=8,
    global_batch_size=256,
    num_workers=4,
    multimodal_sample_config=config,
)

run.run(finetune, executor=run.LocalExecutor())

Note that all of the code is copied directly from the NeMo 2.0 documentation without modifications.

Running the code yields the following error:

fiddle._src.experimental.serialization.UnserializableValueError: Unserializable value <unk> of type <class 'tokenizers.AddedToken'>. Error occurred at path "<root>.kwargs['unk_token']".")
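For context, the error path "<root>.kwargs['unk_token']" seems to point at the tokenizer's init kwargs, where Hugging Face stores special tokens as tokenizers.AddedToken objects. A quick check I can run (just a hedged diagnostic of mine; init_kwargs is the standard Hugging Face attribute, though whether Fiddle walks exactly this dict is my assumption):

from transformers import AutoProcessor

# Inspect the value the error path appears to refer to.
tokenizer = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf").tokenizer
print(type(tokenizer.init_kwargs.get("unk_token")))  # expected: <class 'tokenizers.AddedToken'>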

Expected behavior

I should be able to start the training job just fine.

Environment overview (please complete the following information)

  • Environment location: GCP, a3-megagpu-8g node.
  • Method of NeMo install: used docker container nvcr.io/nvidia/nemo:24.12.
  • If method of install is [Docker], provide docker pull & docker run commands used: Yes, see above.

Additional context

I have seen this entry in the FAQ and tried a few alternatives, but none of them helped:

  1. Wrapping the tokenizer object with run.Config directly via run.Config(tokenizer) --> same error.
  2. Initializing a different tokenizer with:
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
tokenizer = run.Config(
    AutoTokenizer,
    "meta-llama/Meta-Llama-3-8B"
)

--> this gets past the serialization problem with the tokenizer, but I then ran into the same serialization problem with image_processor.
  3. Trying a different image_processor:

from transformers import CLIPImageProcessor
image_processor = run.Config(
    CLIPImageProcessor.from_pretrained,
    "openai/clip-vit-large-patch14-336",
    torch_dtype=torch.bfloat16
)

--> got a similar error (see the sketch after the error output below):

fiddle._src.experimental.serialization.UnserializableValueError: Unserializable value <bound method ImageProcessingMixin.from_pretrained of <class 'transformers.image_processing_base.ImageProcessingMixin'>> of type <class 'method'>. Error occurred at path '<root>.data.image_processor.<metadata>.fn_or_cls'.")
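One pattern I would expect to sidestep both failures (only a hedged sketch of mine, not something from the NeMo docs; load_tokenizer and load_image_processor are hypothetical helpers I'm defining here): keep every live Hugging Face object out of the config graph by giving run.Config plain module-level functions that build those objects inside the job, so Fiddle only has to serialize function references and string arguments.

import nemo_run as run
from transformers import AutoProcessor
from nemo.collections.multimodal.data.energon import EnergonMultiModalDataModule

# Hypothetical helpers; they need to live in an importable module (not a notebook cell)
# so that Fiddle can serialize a reference to them by module path and name.
def load_tokenizer(model_id: str = "llava-hf/llava-1.5-7b-hf"):
    # Built at job start, so the AddedToken objects never enter the config graph.
    return AutoProcessor.from_pretrained(model_id).tokenizer

def load_image_processor(model_id: str = "llava-hf/llava-1.5-7b-hf"):
    return AutoProcessor.from_pretrained(model_id).image_processor

# finetune, dataset_path, and config are the objects from the script above.
finetune.data = run.Config(
    EnergonMultiModalDataModule,
    path=dataset_path,
    tokenizer=run.Config(load_tokenizer),
    image_processor=run.Config(load_image_processor),
    micro_batch_size=8,
    global_batch_size=256,
    num_workers=4,
    multimodal_sample_config=config,
)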

bernardhan33 added the bug label on Jan 22, 2025
@yashaswikarnati (Collaborator)

Sorry you are running into this; it is a known issue with Fiddle serialization that will be fixed with this PR.

We have a workaround - https://github.com/NVIDIA/NeMo/pull/11655/files.

If it's easier, you could use the scripts here: https://github.com/NVIDIA/NeMo/blob/main/scripts/vlm/llava_next_pretrain.py. These do not rely on NeMo-Run to start the training.
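Roughly, those scripts just build plain Python objects and never go through Fiddle, so live tokenizers and image processors are fine. A minimal sketch of the data-module part, reusing the paths and token ids from your report (the model/trainer/optimizer setup is in the linked script):

from transformers import AutoProcessor
from nemo.collections.multimodal.data.energon import (
    ImageToken,
    MultiModalSampleConfig,
    EnergonMultiModalDataModule,
)

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Constructed directly, with no run.Config wrapper, so nothing gets serialized.
data = EnergonMultiModalDataModule(
    path="/nemo-data/wds",
    tokenizer=processor.tokenizer,
    image_processor=processor.image_processor,
    micro_batch_size=8,
    global_batch_size=256,
    num_workers=4,
    multimodal_sample_config=MultiModalSampleConfig(
        image_token=ImageToken(token_str="<image>", token_id=-200),
        ignore_place_holder=-100,
    ),
)
# ...then pass `data` to llm.pretrain(...) / llm.finetune(...) the way the script does.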

yashaswikarnati self-assigned this on Jan 23, 2025
@bernardhan33 (Author)

Thanks @yashaswikarnati, I was able to get llava_next_pretrain.py to work with the Energon data loader.

However, I'd also like to test the native data format with NevaLazyDataModule, so I replaced lines 70 to 79 with the following code:

data_config = vlm.ImageDataConfig(
    image_folder="/nemo-data/images_1344_1344",
    conv_template="v1",  # Customize based on your dataset needs
)

# Data module setup
data = vlm.NevaLazyDataModule(
    paths="/nemo-data/filtered_data.json",  # Path to your dataset
    data_config=data_config,
    seq_length=decoder_seq_length,
    global_batch_size=gbs,  # Global batch size
    micro_batch_size=mbs,  # Micro batch size
    tokenizer=tokenizer.tokenizer,  # Define your tokenizer if needed
    image_processor=processor.image_processor,  # Add an image processor if required
    num_workers=8,  # Number of workers for data loading
)

The code then fails with

Traceback (most recent call last):
  File "/opt/NeMo/test.py", line 242, in <module>
    main(args)
  File "/opt/NeMo/test.py", line 203, in main
    llm.pretrain(
  File "/opt/NeMo/nemo/collections/llm/api.py", line 149, in pretrain
    _validate_config(model, data, trainer, log=log, resume=resume, optim=optim)
  File "/opt/NeMo/nemo/collections/llm/api.py", line 910, in _validate_config
    assert data.micro_batch_size > 0
AttributeError: 'NevaLazyDataModule' object has no attribute 'micro_batch_size'

even though NevaLazyDataModule obviously has the micro_batch_size attribute. This reminds me of my comment in the other issue. Any suggestions here?
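In case it helps with debugging, this is what I plan to print right after constructing `data` above, before llm.pretrain() runs (just a diagnostic; I'm not assuming any attribute beyond the micro_batch_size constructor argument actually exists on the instance):

# List the instance attributes actually present on NevaLazyDataModule in the 24.12
# container, to see where micro_batch_size ends up before _validate_config asserts on it.
print(sorted(vars(data).keys()))
print(hasattr(data, "micro_batch_size"))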
