
Regressing video quality on generated videos #54

Open · zlenyk opened this issue Jan 15, 2025 · 2 comments

zlenyk commented Jan 15, 2025

Hello, and thank you for open-sourcing such amazing work!
I wanted to check what output I should expect and whether I'm potentially doing something wrong. The quality of videos generated with the 5B autoregressive video2world model is always a little worse than that of its input. I was hoping to get a sort of "infinite generation" by using the output of one generation as the input to the next iteration in a loop.
As a result, after 10-20 generations I start getting complete gibberish. I thought running the diffusion decoder should prevent this effect. Am I misusing the model? Here is how I'm generating videos:

python cosmos1/models/autoregressive/inference/video2world.py \
    --ar_model_dir=Cosmos-1.0-Autoregressive-5B-Video2World \
    --top_p=0.7 \
    --temperature=1.0 \
    --offload_guardrail_models \
    --offload_diffusion_decoder \
    --offload_ar_model \
    --offload_tokenizer \
    --offload_text_encoder_model \
    ...

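For intuition, here is a toy numpy sketch of the failure mode, not the Cosmos pipeline: the blur below is just a hypothetical stand-in for tokenizer/decoder reconstruction loss, and it shows how looping the output back through a lossy round trip compounds error across iterations.

import numpy as np

# Toy stand-in for a lossy encode/decode round trip: a small spatial blur.
# Each pass loses a little detail, and the losses compound when the output
# of one iteration becomes the input of the next.
def lossy_roundtrip(frames):
    blurred = frames.copy()
    # average each pixel with its horizontal neighbours (cheap low-pass filter)
    blurred[:, :, 1:-1] = (frames[:, :, :-2] + frames[:, :, 1:-1] + frames[:, :, 2:]) / 3
    return blurred

rng = np.random.default_rng(0)
frames = rng.random((16, 32, 32))     # pretend "video": 16 frames of 32x32
original = frames.copy()

for i in range(20):
    frames = lossy_roundtrip(frames)  # one iteration's encode/decode
    drift = np.abs(frames - original).mean()
    print(f"iteration {i+1:2d}: mean drift from the original = {drift:.4f}")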

zlenyk commented Jan 17, 2025

My own update: to a large degree (but not 100%), the cause is that sending output videos as input to the next iteration goes through the encoding/decoding process every time.
We could bypass this by passing only tokens (not videos) between iterations and adding a new parameter to control the length of generation. It looks like that was the idea behind the "num_chunks_to_generate" parameter, but in the current implementation it's rather useless.
I do have an implementation that just passes tokens; is this something that would be useful for others?
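For illustration only, a minimal Python sketch of the token-passing idea, where encode_to_tokens, ar_extend, and decode_to_video are hypothetical stubs rather than the actual Cosmos APIs: the sequence is extended chunk by chunk in token space and decoded to pixels only once at the end, so the lossy round trip is not repeated every iteration.

# Hypothetical stubs standing in for the real tokenizer and AR model.
def encode_to_tokens(video):          # tokenizer encode (done once)
    return list(video)

def ar_extend(tokens, num_new=4):     # autoregressive extension in token space
    return tokens + [tokens[-1] + 1 + i for i in range(num_new)]

def decode_to_video(tokens):          # tokenizer/diffusion decode (done once)
    return tokens

tokens = encode_to_tokens(range(8))   # encode the conditioning video once
for _ in range(10):                   # analogue of num_chunks_to_generate
    tokens = ar_extend(tokens)        # keep extending tokens, never re-encode pixels
video = decode_to_video(tokens)       # single decode at the very end
print(len(video), "tokens decoded once at the end")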

@monko9j1

@zlenyk I would be interested to see your implementation, sounds useful!
