OpenBMB open-sourced MiniCPM-V 4.6 on May 11: a 1.3B-parameter multimodal model built on SigLIP2-400M plus Qwen3.5-0.8B, aimed squarely at the edge — phones, laptops, consumer GPUs.
The trick is in the visual encoder. LLaVA-UHD v4 brings intra-ViT early compression with a hybrid 4x/16x token compression ratio, cutting vision encoding compute by more than 50% versus the previous generation.
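To make the compression idea concrete, here is a minimal sketch of what a hybrid 4x/16x token-compression scheme could look like. This is my own toy illustration in PyTorch, not the released LLaVA-UHD v4 code: it just average-pools the ViT patch-token grid at two ratios, so the language model sees far fewer vision tokens than the encoder produced.

```python
# Toy sketch of hybrid token compression (illustrative only, not the actual
# LLaVA-UHD v4 implementation): pool patch tokens in their 2D grid at a
# coarse 16x ratio and a finer 4x ratio.
import torch
import torch.nn.functional as F

def compress_tokens(patch_tokens: torch.Tensor, grid: int, ratio: int) -> torch.Tensor:
    """Pool a (B, grid*grid, D) token grid so each output token covers `ratio` inputs."""
    b, n, d = patch_tokens.shape
    assert n == grid * grid
    k = int(ratio ** 0.5)                       # ratio=16 -> 4x4 window, ratio=4 -> 2x2
    x = patch_tokens.transpose(1, 2).reshape(b, d, grid, grid)
    x = F.avg_pool2d(x, kernel_size=k)          # (B, D, grid/k, grid/k)
    return x.flatten(2).transpose(1, 2)         # back to (B, fewer_tokens, D)

tokens = torch.randn(1, 32 * 32, 1152)          # 1024 patch tokens, SigLIP-like width
coarse = compress_tokens(tokens, grid=32, ratio=16)  # 64 tokens
fine = compress_tokens(tokens, grid=32, ratio=4)     # 256 tokens
print(coarse.shape, fine.shape)
```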
A small model that punches up
On the Artificial Analysis Intelligence Index it scores 13, beating Qwen3.5-0.8B at 10 — while burning 1/19 the tokens. On OpenCompass, RefCOCO, HallusionBench, MUIRBench and OCRBench, it matches Qwen3.5-2B. A 1.3B model holding even with something roughly one and a half times its size on the multimodal benchmarks people actually cite.
How to run it
Apache 2.0 weights, full inference stack out of the box: vLLM, SGLang, llama.cpp and Ollama for serving; SWIFT and LLaMA-Factory for fine-tuning. A single consumer GPU is enough — production-grade VLM on a 4090 instead of a cluster. Typical use cases: on-device OCR, document understanding, visual chat for mobile apps, image-heavy RAG pipelines without per-call cloud bills.
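As a quick sketch of what local serving looks like: launch a vLLM OpenAI-compatible server and talk to it with the standard OpenAI client. The Hugging Face repo id below is an assumption on my part; check the model card for the exact name and any required flags.

```python
# Sketch: query a locally served MiniCPM-V 4.6 via vLLM's OpenAI-compatible API.
# Assumed launch command (repo id is hypothetical):
#   vllm serve openbmb/MiniCPM-V-4_6 --trust-remote-code
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local image as a data URL for the chat request.
with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="openbmb/MiniCPM-V-4_6",   # must match the name the server was launched with
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the line items and totals from this receipt."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```

The same client code works unchanged against SGLang or any other OpenAI-compatible endpoint; only the base URL and model name change.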
Why it matters
End-side multimodal has been waiting for a model that’s both genuinely capable and cheap to run. 1/19 the token cost at higher quality rewires the unit economics for any product processing images at scale.