Stars
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
📺 An End-to-End Solution for High-Resolution and Long Video Generation Based on Transformer Diffusion
The Dawn of Video Generation: Preliminary Explorations with SORA-like Models
Janus-Series: Unified Multimodal Understanding and Generation Models
The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.
A collection of papers on autoregressive models in vision.
Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple text input.
Movie Gen Bench - two media generation evaluation benchmarks released with Meta Movie Gen
MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
MoH: Multi-Head Attention as Mixture-of-Head Attention
Code for Pyramidal Flow Matching for Efficient Video Generative Modeling
Qwen2-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
A collection of awesome video generation studies.
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
Official code for VEnhancer: Generative Space-Time Enhancement for Video Generation
PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from Meta AI
Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
[ECCV 2024] FreeInit: Bridging Initialization Gap in Video Diffusion Models
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
Official Implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining"
✨✨ VITA: Towards Open-Source Interactive Omni Multimodal LLM
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
Official implementation of Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle
Video-LLaVA fine-tune for CinePile evaluation
Scaling Diffusion Transformers with Mixture of Experts