xAI’s Grok Imagine 1.5 turns a still image or a text prompt into a short video clip with native, synchronized audio baked in: music, sound effects, even lip-synced dialogue. It shipped as an API preview on June 3.
## What’s new in 1.5
The headline is audio in every generation, with no separate pass and no extra charge. The model outputs 720p at 24fps in 6-to-15-second clips and adds multi-shot sequencing: stage a frame, animate it, then chain shots into longer scenes that hold a consistent look across a whole project. Video extension and reference-guided generation round out the feature set.
## Where it lands
On the Image-to-Video Arena, Grok Imagine 1.5 ranks #1, a +52 Elo jump over Grok Imagine 1.0, beating Seedance 2.0, HappyHorse 1.0, and Google Veo in blind testing. Pricing is usage-based: $0.08/sec at 480p, $0.14/sec at 720p, plus $0.01 per image input. For short-form video with sound handled inside one model, that’s an aggressive entry point.
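
To put those rates in concrete terms, here is a rough cost estimate in Python. It is a back-of-the-envelope sketch, not part of any xAI SDK: the `estimate_cost` helper and its parameter names are made up for illustration, and it assumes billing is simply the per-second rate times clip length plus the per-image fee, with no rounding or minimum charge.

```python
from decimal import Decimal

# Rates as quoted above, in USD. Decimal avoids float rounding on cents.
RATE_PER_SECOND = {"480p": Decimal("0.08"), "720p": Decimal("0.14")}
IMAGE_INPUT_FEE = Decimal("0.01")


def estimate_cost(seconds, resolution="720p", image_inputs=0):
    """Hypothetical helper: estimate the price of one generated clip.

    Assumes cost = per-second rate * seconds + per-image fee * image inputs.
    """
    if resolution not in RATE_PER_SECOND:
        raise ValueError(f"no quoted rate for resolution {resolution!r}")
    if seconds < 0 or image_inputs < 0:
        raise ValueError("seconds and image_inputs must be non-negative")
    video_cost = RATE_PER_SECOND[resolution] * Decimal(str(seconds))
    return video_cost + IMAGE_INPUT_FEE * image_inputs


# A 10-second 720p clip animated from one still image.
print(estimate_cost(10, "720p", image_inputs=1))  # 1.41
# A 6-second 480p clip from a text prompt only.
print(estimate_cost(6, "480p"))  # 0.48
```

Under those assumptions, a maximum-length 15-second clip at 720p from a single image comes to about $2.11.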
