Most “unified” video models, which both generate and edit from mixed text, image, and video inputs, are heavy: 13B parameters or more. They also handle editing by concatenating the source video’s tokens onto the sequence, which doubles its length and, because self-attention cost grows quadratically with sequence length, roughly quadruples attention cost. LoomVideo, a new arXiv release from Peking University, aims for the same flexibility at 5B parameters.
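
The "double the length, quadruple the cost" arithmetic can be checked with a quick back-of-the-envelope sketch. It assumes full (dense) self-attention, where cost is proportional to the number of query-key pairs; the function names and the token count are hypothetical and chosen only for illustration, not taken from the paper.

```python
def attention_pair_count(seq_len: int) -> int:
    """Query-key pairs scored by dense self-attention over seq_len tokens."""
    return seq_len * seq_len


def concat_editing_overhead(target_tokens: int, source_tokens: int) -> float:
    """Attention cost with source tokens concatenated, relative to target-only."""
    base = attention_pair_count(target_tokens)
    concatenated = attention_pair_count(target_tokens + source_tokens)
    return concatenated / base


# Hypothetical token count for one video clip (an assumption, not a paper figure).
n_tokens = 10_000

# Editing by concatenation: source video contributes as many tokens as the target.
ratio = concat_editing_overhead(n_tokens, n_tokens)
print(f"sequence length x2 -> attention cost x{ratio:.0f}")
```

The ratio is independent of the clip size: whenever the source contributes as many tokens as the target, the pair count grows by a factor of four.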
## The architecture trick
Two moves do the work. First, LoomVideo swaps the standard text encoder for a multimodal LLM (MLLM), so the model reads interleaved image, text, and video instructions natively. Second, it uses a “Deepstack” injection that aligns the MLLM’s multi-layer features with the diffusion model, instead of bolting the source video on as extra sequence length.
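
To make the contrast with concatenation concrete, here is a toy sketch of layer-wise conditioning. It is not LoomVideo's actual mechanism, which this post does not detail: it assumes a simplified scheme in which each MLLM layer's features are mean-pooled, linearly projected, and added to every latent token of one diffusion block. All names, shapes, and the pooling step are hypothetical. The point it illustrates is only that conditioning injected this way leaves the latent sequence length unchanged.

```python
import random


def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def linear(vec, weight):
    """Multiply a vector by a weight matrix stored as a list of output rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def inject_layerwise(latents, mllm_layers, projections):
    """Toy layer-wise injection: one MLLM layer conditions one diffusion block.

    latents:     N latent token vectors of size d_model
    mllm_layers: L feature maps, each a list of token vectors of size d_mllm
    projections: L weight matrices, each d_model rows by d_mllm columns
    """
    for feats, weight in zip(mllm_layers, projections):
        cond = linear(mean_pool(feats), weight)
        # Stand-in for a diffusion block: add the condition to every latent token.
        latents = [[x + c for x, c in zip(tok, cond)] for tok in latents]
    return latents


random.seed(0)
# Hypothetical toy sizes (assumptions, far smaller than any real model).
N_LATENT, N_INSTR, D_MODEL, D_MLLM, N_LAYERS = 6, 4, 3, 5, 2

latents = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_LATENT)]
mllm_layers = [
    [[random.gauss(0, 1) for _ in range(D_MLLM)] for _ in range(N_INSTR)]
    for _ in range(N_LAYERS)
]
projections = [
    [[random.gauss(0, 0.1) for _ in range(D_MLLM)] for _ in range(D_MODEL)]
    for _ in range(N_LAYERS)
]

out = inject_layerwise(latents, mllm_layers, projections)
print("latent tokens before:", len(latents), "after:", len(out))
```

In this toy version the number of latent tokens going through the blocks stays at `N_LATENT` no matter how many instruction tokens the MLLM saw, whereas concatenation would grow the sequence by the source length.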
## Why it matters
Unified video generate-and-edit is where the field is heading, but compute cost is the wall. A 5B model that holds its own against much larger unified frameworks is the kind of efficiency result that decides whether this runs anywhere but a research cluster. The implementation is open-source, so the claims are checkable rather than just charted.
