Cactus Compute, the YC-backed on-device inference startup, open-sourced Needle today: a 26M-parameter Simple Attention Network that does single-shot function calling on phones and smartwatches. No MLPs, just attention and gating — the team argues FFN params are wasteful at this scale, and cross-attention is the right primitive for routing a query to the right tool.
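To make the architecture idea concrete, here is a minimal sketch in PyTorch of an attention-only block: self-attention over the query tokens, a learned gate in place of an FFN, and cross-attention against a table of tool embeddings whose attention weights act as routing scores. The layer sizes, gating form, and class name are illustrative assumptions, not Needle's actual design.

```python
# Minimal sketch of an attention-plus-gating block with no MLP, where
# cross-attention routes a query to a tool. Shapes, sizes, and the gating
# form are assumptions for illustration, not Needle's real architecture.
import torch
import torch.nn as nn

class AttentionOnlyRouter(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_tools=32):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)         # gating instead of an FFN
        self.tool_emb = nn.Embedding(n_tools, d_model)  # one learned vector per tool
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query_tokens):                    # (batch, seq, d_model)
        # Self-attention over the user query with a gated residual, no MLP.
        attn_out, _ = self.self_attn(query_tokens, query_tokens, query_tokens)
        x = self.norm(query_tokens + torch.sigmoid(self.gate(attn_out)) * attn_out)
        # Cross-attention from query tokens to the tool table; the averaged
        # attention weights double as routing scores over the tools.
        tools = self.tool_emb.weight.unsqueeze(0).expand(x.size(0), -1, -1)
        _, weights = self.cross_attn(x, tools, tools, need_weights=True)
        return weights.mean(dim=1)                      # (batch, n_tools)

router = AttentionOnlyRouter()
scores = router(torch.randn(2, 16, 256))
print(scores.shape)  # torch.Size([2, 32])
```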
The training story is the punchline
Pre-train: 200B tokens, 27 hours, 16 TPU v6e chips. Post-train on 2B function-call tokens: 45 minutes. Reported throughput hits 6,000 tokens/sec prefill and 1,200 tokens/sec decode. On single-shot function calling, Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LLaMA-2.5-350M, every one of them roughly 10x its size or more.
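For scale, the headline pre-training numbers work out to roughly 130K tokens per second per chip; the quick derivation below is mine, not a figure from the release.

```python
# Back-of-the-envelope check on the reported pre-training run
# (200B tokens, 27 hours, 16 TPU v6e chips). The per-chip rate is
# derived from those three numbers, not stated by Cactus.
tokens = 200e9
seconds = 27 * 3600
chips = 16
per_chip = tokens / seconds / chips
print(f"{per_chip:,.0f} tokens/sec per chip")  # ~128,600
```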
Open weights and a finetune-your-own-tools playground
MIT licensed, weights on Hugging Face. The repo ships a playground UI that runs the whole pipeline locally: synthesize tool-call data via Gemini, finetune Needle on your own tool schema, evaluate, and bundle the result. The Show HN post hit 174 points and #1 on the front page today.
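As a rough sketch of what the "synthesize data, then finetune on your own tool schema" loop consumes, here is one possible shape for a synthesized single-shot training record. The field names and JSONL layout are illustrative assumptions; the actual format is defined in the Cactus repo.

```python
# Hypothetical example of a tool schema plus one synthesized single-shot
# training record (user query paired with the exact call the model should
# emit). Field names and file layout are assumptions, not Needle's format.
import json

tool_schema = {
    "name": "set_timer",
    "description": "Start a countdown timer on the watch",
    "parameters": {
        "type": "object",
        "properties": {"minutes": {"type": "integer"}},
        "required": ["minutes"],
    },
}

record = {
    "tools": [tool_schema],
    "query": "set a timer for 15 minutes",
    "call": {"name": "set_timer", "arguments": {"minutes": 15}},
}

with open("toolcalls.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```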
The point isn’t replacing Gemini 3. It’s that if a 26M model can route tools well enough on a smartwatch, the agent loop stops needing the cloud. That’s the bet Cactus has been making since YC S25, and Needle is the most concrete proof yet.
You Might Also Like
- Ollama + MLX on Apple Silicon: 1,810 tokens/sec prefill and the end of llama.cpp on Mac
- ggml / llama.cpp joins Hugging Face, and honestly it was only a matter of time
- Moonshine open-weights STT: the tiny speech model that punches way above its weight
- OpenAI parameter golf: $1M in compute credits for squeezing a language model into 16MB
- Liquid AI LFM2.5-350M: how 350 million parameters trained on 28 trillion tokens outrun models twice its size