Please advise if I should move this bug to NeMo-Run or Megatron-Energon repositories.
I have been following the NeMo 2.0 official documentation to test LLaVA pretraining and fine-tuning with the Energon dataloader. However, I keep running into an "Unserializable" error that the FAQ does not resolve.
Steps/Code to reproduce bug
Start the container on an a3-megagpu-8g node (on GCP) with:
docker run --gpus all -it --rm -v /mnt/disks/ssd-array/nemo-data:/nemo-data -v /proc:/writable_proc --shm-size=32g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.12
Note that all the code is copied from the NeMo 2.0 documentation directly, without modifications.
Running the code yields the following error:
fiddle._src.experimental.serialization.UnserializableValueError: Unserializable value <unk> of type <class 'tokenizers.AddedToken'>. Error occurred at path "<root>.kwargs['unk_token']".")
Expected behavior
I should be able to start the training job just fine.
Environment overview (please complete the following information)
Environment location: GCP, a3-megagpu-8g node.
Method of NeMo install: used docker container nvcr.io/nvidia/nemo:24.12.
If method of install is [Docker], provide docker pull & docker run commands used: Yes, see above.
Additional context
I have seen this in the FAQ and tried a couple of alternatives, but neither helped (see the sketch after this list):
1. Wrapping the tokenizer object with run.Config directly, i.e. run.Config(tokenizer) --> same error.
2. --> this gets past the serialization problem with the tokenizer, but I then ran into the same serialization problem with image_processor.
3. Trying a different image_processor --> got a similar error:
fiddle._src.experimental.serialization.UnserializableValueError: Unserializable value <bound method ImageProcessingMixin.from_pretrained of <class 'transformers.image_processing_base.ImageProcessingMixin'>> of type <class 'method'>. Error occurred at path '<root>.data.image_processor.<metadata>.fn_or_cls'.")
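For reference, this is roughly what the run.Config attempts looked like (just a sketch: the model name is a placeholder and I am not certain alternative 2 matches the documentation's exact construction):

# Sketch of the alternatives above; "llava-hf/llava-1.5-7b-hf" is a placeholder model name.
import nemo_run as run
from transformers import AutoTokenizer, AutoImageProcessor

tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Alternative 1: wrap the already-instantiated tokenizer object directly.
tok_cfg = run.Config(tokenizer)  # --> same UnserializableValueError on the AddedToken

# Alternative 2: wrap a factory method plus plain string arguments, so fiddle only has to
# serialize simple values. The tokenizer gets through this way, but the same pattern for
# the image processor fails because the bound from_pretrained method itself is
# unserializable (the error quoted under item 3).
tok_cfg = run.Config(AutoTokenizer.from_pretrained, pretrained_model_name_or_path="llava-hf/llava-1.5-7b-hf")
img_cfg = run.Config(AutoImageProcessor.from_pretrained, pretrained_model_name_or_path="llava-hf/llava-1.5-7b-hf")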
Thanks @yashaswikarnati, I was able to get llava_next_pretrain.py to work with the Energon data loader.
However, I'd also like to test the native data format with NevaLazyDataModule, so I replaced lines 70 to 79 with the following code:
data_config = vlm.ImageDataConfig(
    image_folder="/nemo-data/images_1344_1344",
    conv_template="v1",  # Customize based on your dataset needs
)

# Data module setup
data = vlm.NevaLazyDataModule(
    paths="/nemo-data/filtered_data.json",  # Path to your dataset
    data_config=data_config,
    seq_length=decoder_seq_length,
    global_batch_size=gbs,  # Global batch size
    micro_batch_size=mbs,  # Micro batch size
    tokenizer=tokenizer.tokenizer,  # Define your tokenizer if needed
    image_processor=processor.image_processor,  # Add an image processor if required
    num_workers=8,  # Number of workers for data loading
)
The code then fails with:
Traceback (most recent call last):
File "/opt/NeMo/test.py", line 242, in <module>
main(args)
File "/opt/NeMo/test.py", line 203, in main
llm.pretrain(
File "/opt/NeMo/nemo/collections/llm/api.py", line 149, in pretrain
_validate_config(model, data, trainer, log=log, resume=resume, optim=optim)
File "/opt/NeMo/nemo/collections/llm/api.py", line 910, in _validate_config
assert data.micro_batch_size > 0
AttributeError: 'NevaLazyDataModule' object has no attribute 'micro_batch_size'
even though NevaLazyDataModule clearly has the micro_batch_size attribute. This reminds me of my comment in the other issue. Any suggestions here?
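For what it's worth, a quick sanity check along these lines (just a sketch; I'm assuming the value is either stored on the module directly or only forwarded to an attached data sampler, which I haven't confirmed) should show where micro_batch_size actually ends up:

dm = vlm.NevaLazyDataModule(
    paths="/nemo-data/filtered_data.json",
    data_config=data_config,
    seq_length=decoder_seq_length,
    global_batch_size=gbs,
    micro_batch_size=mbs,
    tokenizer=tokenizer.tokenizer,
    image_processor=processor.image_processor,
    num_workers=8,
)
# _validate_config asserts on data.micro_batch_size, so check whether the instance exposes it...
print(hasattr(dm, "micro_batch_size"))
# ...or whether the value only lives on a sampler attached to the module (an assumption, not confirmed).
sampler = getattr(dm, "data_sampler", None)
print(getattr(sampler, "micro_batch_size", None) if sampler is not None else "no data_sampler attribute")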