Applicability of RIFLEx to CogVideoX Image-to-Video Version #24

@fuyuchenIfyw

[Question] Applicability of RIFLEx to CogVideoX Image-to-Video Version

Problem Description

I encountered significant video quality degradation when attempting to apply the RIFLEx method to the Image-to-Video (I2V) version of CogVideoX-5b. When using RIFLEx to modify RoPE, the generated video frames exhibit blurring and distortion.

Environment Information

  • Model Version: CogVideoX-5b-I2V
  • Hardware: NVIDIA A100
  • Software: PyTorch 2.0, transformers 4.30.2

Reproduction Steps

I used the following code to test RIFLEx on the I2V version:

import argparse
from functools import partial
from types import MethodType

import torch
from diffusers import (
    CogVideoXDPMScheduler,
    CogVideoXImageToVideoPipeline,
    CogVideoXTransformer3DModel,
)
from diffusers.utils import export_to_video, load_image

# NOTE: _prepare_rotary_positional_embeddings_riflex is not shown in this post.
# It is assumed to be defined as in the RIFLEx CogVideoX inference script.

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int, help='Random seed',
                        default=1234)
    parser.add_argument('--k', type=int, help='Index of intrinsic frequency',
                        default=2)
    parser.add_argument('--N_k', type=int, help='The period of intrinsic frequency in latent space',
                        default=20)
    parser.add_argument('--num_frames', type=int, help='Number of frames for inference',
                        default=97)
    parser.add_argument('--finetune', help='Whether finetuned version', action='store_true')
    parser.add_argument('--model_id', type=str, help='huggingface path for models',
                        default="THUDM/CogVideoX-5b-I2V")
    parser.add_argument('--image', type=str, help='Image for generation',
                        default="CogKit/quickstart/data/i2v/train/images/1d50a3d9703f152758d5422c8b48010f.png")
    parser.add_argument('--prompt', type=str, help='Prompts for generation',
                        default="A dynamic sequence unfolds on the deck of a ship, where a small, mouse-like character with large ears and short pants enthusiastically steers the vessel using a wheel. A larger, bulky character with a long pole engages in a playful confrontation, asserting dominance or playfully provoking the smaller one. Expressive gestures and movements convey emotions and intentions, set against a nautical backdrop featuring a steering wheel, life preserver, and bell. The two characters interact in a lively, competitive, or friendly exchange.")
    args = parser.parse_args()

    assert (args.num_frames - 1) % 4 == 0, "num_frames should be 4 * k + 1"
    L_test = (args.num_frames - 1) // 4 + 1  # latent frames
    transformer = CogVideoXTransformer3DModel.from_pretrained(
        args.model_id,
        subfolder="transformer",
        torch_dtype=torch.bfloat16,
    )

    # Load the pipeline from the same model id as the transformer
    # (the default is still THUDM/CogVideoX-5b-I2V).
    pipe = CogVideoXImageToVideoPipeline.from_pretrained(
        args.model_id,
        transformer=transformer,
        torch_dtype=torch.bfloat16
    ).to("cuda")

    pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    generator = torch.Generator("cuda").manual_seed(args.seed)

    # For training-free, if extrapolate length exceeds the period of intrinsic frequency, modify RoPE
    if L_test > args.N_k and not args.finetune:
        pipe._prepare_rotary_positional_embeddings = MethodType(
            partial(_prepare_rotary_positional_embeddings_riflex, k=args.k, L_test=L_test), pipe)

    # We fine-tune the model on new theta_k and N_k, and thus modify RoPE to match the fine-tuning setting.
    if args.finetune:
        L_test = args.N_k  # the fine-tuning frequency setting
        pipe._prepare_rotary_positional_embeddings = MethodType(
            partial(_prepare_rotary_positional_embeddings_riflex, k=args.k, L_test=L_test), pipe)

    image = load_image(args.image)

    video = pipe(image=image, prompt=args.prompt, num_frames=args.num_frames, height=480, width=720, guidance_scale=6,
                 num_inference_steps=50, generator=generator).frames[0]
    export_to_video(video, f"seed_{args.seed}_{args.prompt[:20]}.mp4", fps=8)
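
For reference, here is a small sketch of the frame arithmetic the script relies on, to make explicit that the defaults above (num_frames=97, N_k=20) should take the training-free RIFLEx branch. The helper names `latent_frames` and `riflex_patch_applied` are mine and only for illustration; the factor of 4 is the temporal compression implied by the assert in the script.

```python
def latent_frames(num_frames: int, temporal_compression: int = 4) -> int:
    """Number of latent frames for a given pixel-space frame count."""
    if (num_frames - 1) % temporal_compression != 0:
        raise ValueError("num_frames should be 4 * k + 1")
    return (num_frames - 1) // temporal_compression + 1


def riflex_patch_applied(num_frames: int, n_k: int, finetune: bool = False) -> bool:
    """Mirror the two `if` branches in the script that replace the RoPE method."""
    return finetune or latent_frames(num_frames) > n_k


for frames in (49, 97):
    print(frames, latent_frames(frames), riflex_patch_applied(frames, n_k=20))
# 49 13 False
# 97 25 True
```

So with 97 frames the latent length is 25, which exceeds N_k=20, and the patched RoPE method should be installed.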

Expected Results

Applying RIFLEx should generate coherent, high-quality video sequences that maintain content consistency and temporal coherence with the input image.

Actual Results

  • When RIFLEx is enabled, the generated videos exhibit the blurring and distortion described above. (Two screenshots were attached to the original post.)

Questions

  1. Is RIFLEx designed to work with the I2V version of CogVideoX, or is it only applicable to the Text-to-Video version?
  2. Are there any special configurations or parameter adjustments required for using RIFLEx with the I2V version?

Additional Information

  • I also generated a 97-frame video with the unmodified CogVideoX-5b-I2V, and the result was exactly the same as the RIFLEx + CogVideoX-5b-I2V output. Does this suggest that RIFLEx has no effect on CogVideoX-5b-I2V?
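
One way I could check whether the patch is actually being used is to log calls to the replaced method. Below is a minimal sketch of that idea on a stand-in class, not the real pipeline: `DummyPipeline`, `_riflex_stub`, and `patch_with_call_log` are hypothetical names, and the `(height, width, num_frames, device)` argument order is my assumption about how the pipeline calls `_prepare_rotary_positional_embeddings`. The binding uses the same `MethodType(partial(...), pipe)` pattern as the script above.

```python
from functools import partial
from types import MethodType


class DummyPipeline:
    """Stand-in for the real pipeline; only the method name matches the script."""

    def _prepare_rotary_positional_embeddings(self, height, width, num_frames, device):
        return ("original", num_frames)


def _riflex_stub(self, height, width, num_frames, device, k=None, L_test=None):
    # Hypothetical replacement with the same call shape as the RIFLEx function.
    return ("riflex", num_frames, k, L_test)


def patch_with_call_log(pipe, func, **fixed_kwargs):
    """Bind func to pipe as the script does, recording every call's arguments."""
    calls = []

    def logged(self, *args, **kwargs):
        calls.append(args)
        return func(self, *args, **kwargs)

    pipe._prepare_rotary_positional_embeddings = MethodType(
        partial(logged, **fixed_kwargs), pipe
    )
    return calls


pipe = DummyPipeline()
before = pipe._prepare_rotary_positional_embeddings(480, 720, 25, "cpu")
calls = patch_with_call_log(pipe, _riflex_stub, k=2, L_test=25)
after = pipe._prepare_rotary_positional_embeddings(480, 720, 25, "cpu")
print(before != after, len(calls))
# True 1
```

If the same wrapper on the real pipeline leaves `calls` empty after a generation run, the replaced method is never invoked and the output would naturally match the unmodified model.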

Thank you for your assistance!
