Kandinsky 5.0: Russia's Open-Source Answer to AI Video Generation
Kandinsky 5.0 brings 10-second video generation to consumer GPUs with Apache 2.0 licensing. We explore how NABLA attention and flow matching make this possible.

The Open-Source Video Landscape Shifts
When ByteDance open-sourced their video understanding model and Tencent released HunyuanVideo, we saw the first tremors of a shift. Now Kandinsky Lab, backed by Sberbank, has released a complete family of models that anyone can run, modify, and commercialize under the Apache 2.0 license.
This is not a research preview or a restricted API. The full weights, training code, and inference pipeline are available on GitHub and Hugging Face.
The Model Family
For context on diffusion architectures, see our deep dive on diffusion transformers.
Kandinsky 5.0 is not a single model but a family of three:
Video Lite (2B Parameters)
The lightweight option for consumer hardware. Generates 5- to 10-second videos at 768×512 resolution, 24 fps. Runs on 12GB of VRAM with memory offloading. The distilled 16-step variant produces a 5-second clip in 35 to 60 seconds on an H100.
Video Pro (19B Parameters)
The full model for maximum quality. Outputs HD video at 1280×768, 24 fps. Requires datacenter-class GPUs but delivers results competitive with closed-source alternatives.
A 6B parameter Image Lite model rounds out the family for still image generation at 1280×768 or 1024×1024 resolution.
Technical Architecture
The engineering decisions in Kandinsky 5.0 reveal a team focused on practical deployment rather than benchmark chasing.
Foundation: Flow Matching Over Diffusion
Traditional diffusion models learn to reverse a noise-adding process step by step. Flow matching takes a different approach: it learns a direct path from noise to image through a continuous flow field. The practical advantages are significant:
- Straighter trajectories from noise to data, which means far fewer sampling steps at inference
- A simple regression-style training objective, which tends to be more stable to train
- A natural fit for step distillation, which is how the 16-step Lite variant gets its speed
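To make that concrete, here is a minimal, hypothetical sketch of a rectified-flow-style training step in PyTorch. The `velocity_model` call and tensor shapes are illustrative assumptions, not Kandinsky's actual training code:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, x1, text_emb):
    """One (rectified) flow matching training step.

    x1: clean video latents, shape (B, C, T, H, W); text_emb: conditioning.
    The model regresses the constant velocity that carries noise to data
    along a straight interpolation path.
    """
    x0 = torch.randn_like(x1)                      # pure-noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # random time in [0, 1]
    t_ = t.view(-1, 1, 1, 1, 1)

    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight path
    target_v = x1 - x0                             # velocity along that path

    pred_v = velocity_model(xt, t, text_emb)       # hypothetical model signature
    return F.mse_loss(pred_v, target_v)
```

At inference, the learned velocity field is integrated with an ODE solver in a handful of steps, which is why a step-distilled 16-step variant is a natural extension of this setup.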
NABLA: Making Long Videos Possible
The real innovation is NABLA, short for Neighborhood Adaptive Block-Level Attention. Standard transformer attention scales quadratically with sequence length. For video, this is catastrophic. A 10-second clip at 24 fps contains 240 frames, each with thousands of spatial patches. Full attention across all of them is computationally intractable.
NABLA addresses this through sparse attention patterns. Rather than attending to every patch in every frame, it focuses computation on:
- Local spatial neighborhoods within each frame
- Temporal neighbors across adjacent frames
- Learned global anchors for long-range coherence
The result is near-linear scaling with video length instead of quadratic. This is what makes 10-second generation feasible on consumer hardware.
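The exact block-selection mechanism is described in the NABLA paper; the sketch below only illustrates the masking idea, using a flattened 1D spatial layout and fixed (rather than learned) anchor blocks for brevity:

```python
import torch

def nabla_style_block_mask(n_frames, blocks_per_frame,
                           temporal_window=1, spatial_window=1, n_anchors=4):
    """Illustrative block-level attention mask (not the official NABLA code).

    A block attends to blocks in nearby frames, spatially adjacent blocks,
    and a few globally visible anchor blocks. Returns a boolean (N, N) mask
    over N = n_frames * blocks_per_frame blocks.
    """
    n = n_frames * blocks_per_frame
    frame_idx = torch.arange(n) // blocks_per_frame
    block_idx = torch.arange(n) % blocks_per_frame

    # Local neighborhood: close in time AND close in (flattened) space.
    local = ((frame_idx[:, None] - frame_idx[None, :]).abs() <= temporal_window) & \
            ((block_idx[:, None] - block_idx[None, :]).abs() <= spatial_window)

    # A handful of anchor blocks stay globally visible for long-range coherence.
    anchors = torch.zeros(n, dtype=torch.bool)
    anchors[torch.linspace(0, n - 1, n_anchors).long()] = True
    global_links = anchors[None, :] | anchors[:, None]

    return local | global_links
```

Because each query block touches only a fixed-size neighborhood plus a few anchors, the number of attended pairs grows roughly linearly with the number of frames rather than quadratically.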
For comparison, most competing models struggle with videos longer than 5 seconds without specialized hardware.
Building on HunyuanVideo
Rather than training everything from scratch, Kandinsky 5.0 adopts the 3D VAE from Tencent's HunyuanVideo project. This encoder-decoder handles the translation between pixel space and the compact latent space where the diffusion process operates.
Text understanding comes from Qwen2.5-VL, a vision-language model, combined with CLIP embeddings for semantic grounding. This dual-encoder approach allows the model to understand both the literal meaning and the visual style implied by prompts.
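A rough sketch of how such dual conditioning could be wired into the transformer. The module name, projection layers, and dimensions are illustrative assumptions, not the actual Kandinsky implementation:

```python
import torch.nn as nn

class DualTextConditioner(nn.Module):
    """Illustrative dual text conditioning: token-level features from a
    vision-language model for fine-grained meaning, plus a pooled CLIP
    embedding for global semantic and style grounding."""

    def __init__(self, vlm_dim=3584, clip_dim=768, model_dim=2048):
        super().__init__()
        self.vlm_proj = nn.Linear(vlm_dim, model_dim)    # per-token features
        self.clip_proj = nn.Linear(clip_dim, model_dim)  # one global vector

    def forward(self, vlm_tokens, clip_pooled):
        # vlm_tokens: (B, L, vlm_dim) hidden states from the VLM text tower
        # clip_pooled: (B, clip_dim) pooled CLIP text embedding
        context = self.vlm_proj(vlm_tokens)        # cross-attention keys/values
        global_cond = self.clip_proj(clip_pooled)  # folded into timestep/AdaLN conditioning
        return context, global_cond
```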
Performance: Where It Stands
The team positions Video Lite as the top performer among open-source models in its parameter class. On paper, it stacks up well against other open models:
| Model | Parameters | Max Duration | VRAM (5s) |
|---|---|---|---|
| Kandinsky Video Lite | 2B | 10 seconds | 12GB |
| CogVideoX-2B | 2B | 6 seconds | 16GB |
| Open-Sora 1.2 | 1.1B | 16 seconds | 18GB |
The 12GB VRAM requirement opens the door to deployment on consumer RTX 3090 and 4090 cards, a significant accessibility milestone.
Quality comparisons are harder to quantify. User reports suggest Kandinsky produces more consistent motion than CogVideoX but lags behind HunyuanVideo in photorealism. The 16-step distilled model sacrifices some fine detail for speed, a trade-off that works well for prototyping but may not satisfy final production needs.
Running Kandinsky Locally
The project provides ComfyUI nodes and standalone scripts. A basic text-to-video workflow:
```python
from kandinsky5 import Kandinsky5VideoLite

# Load the 2B text-to-video model
model = Kandinsky5VideoLite.from_pretrained("kandinskylab/Kandinsky-5.0-T2V-Lite")
model.enable_model_cpu_offload()  # For 12GB cards

video = model.generate(
    prompt="A mountain lake at dawn, mist rising from still water",
    num_frames=120,           # 5 seconds at 24 fps
    guidance_scale=7.0,
    num_inference_steps=16,   # distilled variant
)
video.save("output.mp4")
```

Memory offloading moves model weights between CPU and GPU during inference. This trades speed for accessibility, allowing larger models to run on smaller cards.
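The general idea behind this kind of offloading can be sketched with plain PyTorch hooks. This is an illustration of the mechanism, not the library's actual implementation:

```python
import torch

def attach_sequential_offload(blocks, device="cuda"):
    """Keep transformer blocks on the CPU and move each one to the GPU
    only for its own forward pass (illustrative, not the real helper)."""
    def pre_hook(module, args):
        module.to(device)       # load weights onto the GPU just in time

    def post_hook(module, args, output):
        module.to("cpu")        # free VRAM as soon as the block is done
        torch.cuda.empty_cache()
        return output

    for block in blocks:
        block.register_forward_pre_hook(pre_hook)
        block.register_forward_hook(post_hook)
```

The cost is that every block's weights cross the PCIe bus on every denoising step, which is why offloaded runs are noticeably slower even though they fit in 12GB.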
The Sberbank Connection
Kandinsky Lab operates under Sber AI, the artificial intelligence division of Sberbank, Russia's largest bank. This backing explains the substantial resources behind the project: multi-stage training on proprietary data, reinforcement learning post-training, and the engineering effort to open-source a complete production pipeline.
The geopolitical context adds complexity. Western developers may face institutional pressure to avoid Russian-origin models. The Apache 2.0 license is legally clear, but organizational policies vary. For individual developers and smaller studios, the calculus is simpler: good technology is good technology.
Always verify licensing and export compliance for your specific jurisdiction and use case.
Practical Applications
The 10-second duration and consumer hardware requirements open specific use cases:
- Social Content: 5- to 10-second clips are a natural fit for short-form feeds
- Concept Visualization: quick previsualization of scenes and ideas before committing to full production
- Custom Training: open weights and training code allow fine-tuning on domain-specific footage
- Research: a complete, commercially usable pipeline for studying video generation at accessible scale
Looking Forward
Kandinsky 5.0 represents a broader trend: the gap between open and closed-source video generation is narrowing. A year ago, open models produced short, low-resolution clips with obvious artifacts. Today, a 2B-parameter model on consumer hardware generates 10-second, 768×512 video that would have seemed impossible in 2023.
The race is not over. Closed-source leaders like Sora 2 and Runway Gen-4.5 still lead in quality, duration, and controllability. But the floor is rising. For many applications, open-source is now good enough.
The Takeaway
Kandinsky 5.0 may not top every benchmark, but it succeeds where it matters most: running real video generation on hardware that real people own, under a license that allows real commercial use. In the race to democratize AI video, the Russian team has just moved the finish line closer.
For developers exploring open-source video generation, Kandinsky 5.0 deserves a place on your shortlist.