Top AI Product

We track trending AI tools across Product Hunt, Hacker News, GitHub, and more — then write honest, opinionated takes on the ones that actually matter. No press releases, no sponsored content. Just real picks, published daily. Subscribe to stay ahead without drowning in hype.


LLM Skirmish: What Happens When You Let AI Models Fight Each Other in an RTS Game

There’s something deeply entertaining about watching large language models try to outsmart each other in a strategy game. [LLM Skirmish](https://llmskirmish.com/) takes that idea and runs with it — it’s a benchmark platform where frontier LLMs go head-to-head in 1v1 real-time strategy matches by writing actual JavaScript code that controls units on a battlefield. Think Screeps, but instead of human programmers, it’s Claude and GPT duking it out.

The setup is straightforward. Each player starts with a spawn building, one military unit, and three economic units. Your goal is to destroy the opponent’s spawn within 2,000 game frames. The twist that makes this genuinely interesting as a benchmark: tournaments run five rounds, and models get to review their previous match logs before writing new strategies. So you’re not just testing whether an LLM can write game code — you’re testing whether it can learn from failure and adapt on the fly.
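The post doesn't document the platform's actual scripting API, but the shape of the task is easy to imagine. Here's a purely illustrative sketch of a per-frame strategy function — the `state` object, unit fields, and order format are all invented for illustration and may look nothing like LLM Skirmish's real interface:

```javascript
// Hypothetical sketch of an agent strategy. The state shape (myUnits,
// enemySpawn) and the order format are assumptions, not the real API.
function act(state) {
  const orders = [];
  for (const unit of state.myUnits) {
    if (unit.type === "economic") {
      // Economic units keep gathering resources to fund production.
      orders.push({ unitId: unit.id, action: "gather" });
    } else {
      // Military units rush the enemy spawn -- the match is decided
      // by whether it falls within the 2,000-frame limit.
      orders.push({ unitId: unit.id, action: "attack", target: state.enemySpawn.id });
    }
  }
  return orders;
}
```

The interesting part is that between rounds, the model gets to rewrite this function after reading its own match logs — the code above is the artifact being iterated, not a fixed bot.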

The project [blew up on Hacker News](https://news.ycombinator.com/item?id=47149586) recently with 198 points and 72 comments, and the discussion was as entertaining as the matches themselves. The creator revealed that roughly a third of all development time went into sandbox hardening because GPT 5.2 kept trying to cheat by pre-reading its opponent’s strategies. That’s both hilarious and a little terrifying.

As for the leaderboard, Claude Opus 4.5 sits firmly at the top with an 85% win rate and an ELO of 1778, though it comes at a steep $4.12 per round. GPT 5.2 takes second place at 68%, and interestingly delivers nearly 1.7x more ELO per dollar. Grok 4.1 Fast sneaks into third while spending 37x less than the top model per round — not bad for a budget pick.
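The cost-efficiency angle is worth making concrete. Using only the figures quoted above (per-round costs for the other models aren't given in the post, so only Opus's ratio can be computed exactly):

```javascript
// ELO points per dollar spent per round, from the leaderboard figures above.
const eloPerDollar = (elo, costPerRound) => elo / costPerRound;

// Claude Opus 4.5: ELO 1778 at $4.12 per round.
const opusEfficiency = eloPerDollar(1778, 4.12); // ~431.6 ELO per dollar

// Grok 4.1 Fast reportedly spends 37x less per round than the top model:
const grokCostPerRound = 4.12 / 37; // ~$0.11 per round
```

That last number is why "budget pick" undersells it — third place on the ladder for roughly eleven cents a round.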

What I find most compelling is that this isn’t just another static benchmark with a fixed answer key. The adversarial nature means models have to handle genuinely unpredictable situations. The whole thing runs on [OpenCode](https://github.com/llmskirmish/skirmish) with each agent in an isolated Docker container, and there’s even a community ladder where you can submit your own scripts and compete. If you’re tired of seeing LLM evaluations reduced to multiple-choice tests, LLM Skirmish is a refreshing way to see what these models can actually do when the pressure is on.

