
Provide Training File #56

Open
ALEX13679173326 opened this issue Nov 20, 2024 · 10 comments

Comments

ALEX13679173326 commented Nov 20, 2024

Nice work!!
Could you please provide the file '/apdcephfs/share_1290939/0_public_datasets/WebVid/metadata/metadata_2048_val.csv' referenced in your training code? The WebVid dataset can no longer be found in its official GitHub repository.
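
(For context: the requested file is a WebVid-style metadata CSV listing clip IDs, captions, and source URLs. Below is a minimal sketch of loading such a file with pandas; the column names `videoid`, `contentUrl`, `duration`, `page_dir`, and `name` are assumptions based on the commonly circulated WebVid CSVs, not on the exact `metadata_2048_val.csv` referenced above.)

```python
# Minimal sketch: inspect a WebVid-style metadata CSV with pandas.
# The column names below are assumptions; check them against the actual file.
import pandas as pd

df = pd.read_csv("metadata_2048_val.csv")  # hypothetical local path
print(df.columns.tolist())  # expected something like ['videoid', 'contentUrl', 'duration', 'page_dir', 'name']
print(len(df), "clips listed")

# Pair each clip id with its caption ('name' is assumed to be the caption column).
samples = list(zip(df["videoid"].astype(str), df["name"]))
print(samples[:3])
```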

MyNiuuu (Owner) commented Nov 20, 2024

Sorry, but this work was done during my internship at Tencent AI Lab. Since I have left the company, I can no longer access the data files stored on their servers.

Nevertheless, I have found the training CSV files on the internet, such as:

However, I cannot guarantee the quality or authenticity of these links, as they are unofficial sources.

tyrink commented Dec 9, 2024

Hi, as described in the paper, the sparse2dense flow prediction and the ControlNet are trained together in stage 1. May I ask how many videos were used for training in that stage? And could you provide some guidelines for selecting the training videos from the original WebVid-10M?

MyNiuuu (Owner) commented Dec 9, 2024

We trained the model for approximately 100,000 iterations using the WebVid-10M dataset, with a batch size of 8 (one per A100 GPU). This means a total of about 800,000 video clips were used for training. No specific video selection was applied; the model was trained directly on the entire WebVid-10M dataset.
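
To spell out the arithmetic behind those numbers, here is a tiny sanity-check sketch (the per-GPU batch of 1 and the 8 A100 GPUs are taken from the reply above; everything else is generic):

```python
# Back-of-the-envelope check of the training scale quoted above.
iterations = 100_000   # ~100k optimization steps
gpus = 8               # A100 GPUs
batch_per_gpu = 1      # one clip per GPU, i.e. effective batch size 8
effective_batch = gpus * batch_per_gpu

clips_seen = iterations * effective_batch
print(f"effective batch size: {effective_batch}")
print(f"video clips sampled over training: {clips_seen:,}")  # 800,000

# WebVid-10M has ~10M clips, so ~800k samples corresponds to roughly 8% of the
# dataset being drawn (with random sampling, some clips may repeat or be unseen).
webvid_size = 10_000_000
print(f"fraction of WebVid-10M sampled: {clips_seen / webvid_size:.1%}")
```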

tyrink commented Dec 10, 2024

Thanks for your reply! I would also like to check whether the S2D module directly adopts the CMP pre-trained weights or is finetuned from those weights.

MyNiuuu (Owner) commented Dec 10, 2024

We observed no significant performance gap between the following two choices (a rough sketch of both is given below):

  1. initialize S2D with the CMP weights and finetune it
  2. directly use the CMP pre-trained weights without finetuning
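
A minimal PyTorch-style sketch of the two options follows. `S2DNet` is a toy placeholder class and the checkpoint path is hypothetical; the real repository may load the CMP weights differently.

```python
# Hedged sketch of the two S2D initialization choices discussed above.
# S2DNet is a toy placeholder, not the repository's actual class.
import torch
import torch.nn as nn

class S2DNet(nn.Module):
    """Placeholder sparse-to-dense flow network (the real one follows CMP)."""
    def __init__(self):
        super().__init__()
        # Toy stand-in: input = sparse flow (2) + mask (1), output = dense flow (2).
        self.backbone = nn.Conv2d(3, 2, kernel_size=3, padding=1)

    def forward(self, x):
        return self.backbone(x)

s2d = S2DNet()

# Both options start from CMP pre-trained weights (checkpoint path is hypothetical):
# cmp_state = torch.load("cmp_checkpoint.pth", map_location="cpu")
# s2d.load_state_dict(cmp_state, strict=False)

FINETUNE_S2D = True  # option 1: finetune S2D; set False for option 2 (frozen CMP weights)

for p in s2d.parameters():
    p.requires_grad = FINETUNE_S2D
if not FINETUNE_S2D:
    s2d.eval()  # option 2: S2D only runs forward to predict dense flow
```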

tyrink commented Dec 10, 2024

Got it! By the way, as shown in Figure 2, the feature interactions between the warped features and the denoising UNet features take place at the decoder part, which seems different from the feature interactions at the encoder part in Figure 3.

MyNiuuu (Owner) commented Dec 11, 2024

The encoder in Figure 3 is part of the ControlNet itself; it is the 'Fusion Encoder' illustrated in Figure 3 and described in the text. The arrow from the ControlNet to the SVD decoder depicted in Figure 2 corresponds to the part labeled in grey as 'To SVD Encoders' in Figure 3.

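Put differently, the fusion encoder inside the ControlNet branch produces multi-scale features that are injected into the frozen SVD UNet as additive residuals, in the usual ControlNet fashion. A very rough, generic sketch of that injection pattern is below; the tensor shapes and number of scales are placeholders, not the repository's actual configuration.

```python
# Generic ControlNet-style injection sketch: residual features produced by the
# adapter's fusion encoder are added to the matching blocks of the frozen SVD UNet.
# Shapes and scale counts are placeholders, not MOFA-Video's real configuration.
import torch

batch, base_ch, height, width = 1, 320, 32, 32
unet_features = [torch.randn(batch, base_ch * 2**i, height // 2**i, width // 2**i)
                 for i in range(3)]                                # stand-in SVD UNet activations
adapter_residuals = [torch.zeros_like(f) for f in unet_features]  # from the fusion encoder

# Each residual is summed with the corresponding UNet feature map, so the frozen
# SVD backbone is steered by the adapter without modifying its own weights.
conditioned = [f + r for f, r in zip(unet_features, adapter_residuals)]
print([tuple(t.shape) for t in conditioned])
```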

tyrink commented Dec 11, 2024

Sorry, I may have misunderstood the model structure before. That is, the 'warp' part of the MOFA-Adapter illustrated in Figure 2 mainly consists of two encoders (the reference encoder and the fusion encoder)?

MyNiuuu (Owner) commented Dec 11, 2024

Yes, the 'warp' part of the MOFA-Adapter illustrated in Figure 2 consists of two encoders: the reference encoder and the fusion encoder.
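
For illustration, the warping between those two encoders can be thought of as standard backward warping of the reference-encoder features with the dense flow predicted by S2D. A generic sketch using torch's `grid_sample` is below; this shows the common pattern, not necessarily the exact implementation in this repository.

```python
# Generic backward-warping sketch: warp reference features with a dense flow field.
# Shapes are placeholders; this illustrates the 'warp' step conceptually.
import torch
import torch.nn.functional as F

def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) reference-encoder features; flow: (B, 2, H, W) flow in pixels."""
    _, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(feat)  # (1, 2, H, W)
    coords = base + flow                                               # displaced coordinates
    # Normalize to [-1, 1]; grid_sample expects (B, H, W, 2) in (x, y) order.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                               # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

feat = torch.randn(1, 64, 32, 32)   # toy reference-encoder feature map
flow = torch.zeros(1, 2, 32, 32)    # zero flow: output should equal the input
print(torch.allclose(warp_features(feat, flow), feat, atol=1e-5))
```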

tyrink commented Dec 11, 2024

Thanks a lot!
