
About camera_param embedding and "loss is NaN" #127

Open
Bowa529 opened this issue Dec 14, 2024 · 11 comments

Bowa529 commented Dec 14, 2024

Hello, I am replacing the nuScenes dataset with my own dataset for experiments. When I feed camera params into the model as conditions, after training for a few thousand steps I always hit "loss is NaN", and I noticed that model_pred contains many Inf values after the UNet. Could the embedding of the camera params be the cause? My camera parameters differ from those of nuScenes in focal length, principal point, rotation, and translation.

Bowa529 commented Dec 14, 2024

I also printed the inputs, and none of them contain Inf. Feeding only the text prompt and BEV map into the model works fine, but as soon as I also feed my camera params, the problem appears.

flymin (Member) commented Dec 14, 2024

The generalization ability of the camera-param embedding is limited; this is a known issue, as discussed in our work MagicDrive3D. That said, we have tried parameters different from nuScenes and did not observe NaN or Inf (although the results were not satisfactory).

NaN during training can have many causes. You may refer to some previous issues for solutions.

Besides, if you think the camera pose embedding is the key reason: in our latest work, we implemented a "base_token" + "zero_proj" module to mitigate such issues for any token in the sequence embeddings (we did not include this part in the paper, as it may not be useful when training from scratch). Please check
https://github.com/flymin/MagicDriveDiT/blob/d537ecfbf7d83af4518b6509c8b99c4c467c8264/magicdrivedit/models/magicdrive/magicdrive_stdit3.py#L999
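For intuition, the idea behind "base_token" + "zero_proj" can be sketched as below. This is an illustrative reconstruction, not the MagicDriveDiT implementation: the class name, wiring, and the use of numpy instead of torch are all my assumptions; refer to the linked code for the real module.

```python
import numpy as np

class BaseTokenZeroProj:
    # Illustrative sketch (hypothetical, not the repository's code): a learnable
    # base token plus a zero-initialized projection of the conditioning token.
    # At initialization the output equals base_token regardless of the camera
    # input, so out-of-distribution camera params cannot inject large values
    # early in training; the projection only contributes once its weights are
    # updated by gradient descent.
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.base_token = rng.standard_normal(dim).astype(np.float32)
        self.zero_proj = np.zeros((dim, dim), dtype=np.float32)  # zero init

    def __call__(self, cam_token):
        return self.base_token + cam_token @ self.zero_proj
```

Even a very large cam_token leaves the output finite at initialization, which is why zero init tends to stabilize early training with unfamiliar camera parameters.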


Bowa529 commented Dec 17, 2024

When handling two fields of view (FOV), I tried training with the camera parameters for each view separately. One view trains normally, but the other still encounters the "loss is NaN" issue. If the camera intrinsics (e.g., focal length) differ significantly, do you think the camera-parameter encoding process needs to be modified? Or do you have any other suggestions? Thanks for your help.

flymin commented Dec 17, 2024

Please consider this: zero init should help stabilize the training process.



Bowa529 commented Dec 17, 2024

Sure, thank you. I will try the methods you suggested.

Bowa529 closed this as completed Dec 17, 2024
Bowa529 reopened this Dec 17, 2024

Bowa529 commented Dec 17, 2024

Sorry, I still have a question: why do you concatenate the original input with the sine/cosine encodings to form camera_emb in your camera-parameter encoding process? Thanks for your help.


flymin commented Dec 17, 2024

I think you are talking about the Fourier Embedding, which we borrowed from NeRF. You can find the citation (Mildenhall et al., 2020) in our paper.
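For readers unfamiliar with that citation: a NeRF-style Fourier embedding typically looks like the sketch below (illustrative, not MagicDrive's exact code; the function name and frequency schedule are my assumptions). Concatenating the raw input alongside the sin/cos features preserves the exact low-frequency value, so the network is not limited to periodic features of the camera params.

```python
import numpy as np

def fourier_embed(x, num_freqs=4, include_input=True):
    # NeRF-style positional encoding: sin/cos of the input at geometrically
    # increasing frequencies 2^0 ... 2^(num_freqs - 1).
    freqs = 2.0 ** np.arange(num_freqs)       # (L,)
    angles = x[..., None] * freqs             # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    enc = enc.reshape(*x.shape[:-1], -1)      # (..., D * 2L)
    if include_input:
        # keep the raw value so the exact input survives the encoding,
        # not just its periodic features
        enc = np.concatenate([x, enc], axis=-1)
    return enc
```

With D input dimensions and L frequencies, the output has D + 2·D·L dimensions; the first D entries are the untouched input.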

github-actions bot commented Dec 24, 2024

This issue is stale because it has been open for 7 days with no activity. If you do not have any follow-ups, the issue will be closed soon.
3buffers commented

> I'm sorry that I still have a question: Why do you concatenate the original input with the results after sine and cosine encoding to form camera_emb in your camera parameter encoding process? Thanks for your help.

Did you solve that? Thanks.

3buffers commented

Actually, I think it is a data-type issue. If you replace nuScenes with your own dataset, check the dtype of your bboxes: if you keep float16, the multiplications inside fourier_embedder (specifically in embed_fns) can overflow, turning your box embeddings into Inf or NaN. Casting to a larger dtype such as float32 or float64 should fix it.
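A minimal numpy demonstration of that failure mode (the coordinate value and the 2^7 frequency here are example values I chose, not the actual fourier_embedder schedule):

```python
import numpy as np

# Half precision tops out at 65504; a large bbox coordinate times a high
# Fourier frequency overflows to Inf, and any sin/cos of Inf is then NaN.
coord16 = np.float16(512.0)
freqs16 = np.float16(2.0) ** np.arange(8, dtype=np.float16)  # 1, 2, ..., 128

angles16 = coord16 * freqs16
print(np.isinf(angles16[-1]))        # 512 * 128 = 65536 > 65504 -> True

# The same computation in float32 stays finite.
angles32 = np.float32(512.0) * freqs16.astype(np.float32)
print(np.isfinite(angles32).all())   # True
```

Once a single Inf enters the box or camera embedding, it propagates through the UNet and eventually surfaces as "loss is NaN", which matches the symptoms reported above.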
