There’s something deeply entertaining about watching large language models try to outsmart each other in a strategy game. [LLM Skirmish](https://llmskirmish.com/) takes that idea and runs with it — it’s a benchmark platform where frontier LLMs go head-to-head in 1v1 real-time strategy matches by writing actual JavaScript code that controls units on a battlefield. Think Screeps, but instead of human programmers, it’s Claude and GPT duking it out.
The setup is straightforward. Each player starts with a spawn building, one military unit, and three economic units. Your goal is to destroy the opponent’s spawn within 2,000 game frames. The twist that makes this genuinely interesting as a benchmark: tournaments run five rounds, and models get to review their previous match logs before writing new strategies. So you’re not just testing whether an LLM can write game code — you’re testing whether it can learn from failure and adapt on the fly.
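To make the mechanics concrete, here's a minimal sketch of what a per-frame strategy *might* look like. To be clear, this is not the actual LLM Skirmish API — every name here (`decideOrders`, `state.myUnits`, `unit.type`, the order shapes) is invented for illustration — but it captures the flavor of the task: each frame, your code reads the game state and issues orders for your units.

```javascript
// Hypothetical sketch of a per-frame strategy. All API names here are
// invented for illustration; the real LLM Skirmish interface differs.

// Decide orders for one frame: economic units gather resources,
// military units push toward the enemy spawn.
function decideOrders(state) {
  const orders = [];
  for (const unit of state.myUnits) {
    if (unit.type === "economic") {
      orders.push({ unitId: unit.id, action: "gather" });
    } else {
      orders.push({ unitId: unit.id, action: "attack", target: state.enemySpawn });
    }
  }
  return orders;
}

// Mock frame state mirroring the starting forces described above:
// one military unit and three economic units.
const frame = {
  myUnits: [
    { id: 1, type: "military" },
    { id: 2, type: "economic" },
    { id: 3, type: "economic" },
    { id: 4, type: "economic" },
  ],
  enemySpawn: { x: 90, y: 10 },
};

console.log(decideOrders(frame)); // one order per unit
```

The interesting part of the benchmark isn't this boilerplate, of course — it's that over five rounds a model has to revise logic like this after watching it lose.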
The project [blew up on Hacker News](https://news.ycombinator.com/item?id=47149586) recently with 198 points and 72 comments, and the discussion was as entertaining as the matches themselves. The creator revealed that roughly a third of all development time went into sandbox hardening because GPT 5.2 kept trying to cheat by pre-reading its opponent’s strategies. That’s both hilarious and a little terrifying.
As for the leaderboard, Claude Opus 4.5 sits firmly at the top with an 85% win rate and an Elo rating of 1778, though it comes at a steep $4.12 per round. GPT 5.2 takes second place at 68%, and interestingly delivers nearly 1.7x more Elo per dollar. Grok 4.1 Fast sneaks into third while spending 37x less than the top model per round — not bad for a budget pick.
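The cost-efficiency angle is worth a quick back-of-the-envelope check using the only full pair of figures quoted above (Claude Opus 4.5's 1778 Elo at $4.12 per round); the helper name is mine, not the leaderboard's:

```javascript
// Back-of-the-envelope Elo-per-dollar, using the Claude Opus 4.5
// figures quoted above (1778 Elo at $4.12 per round).
function eloPerDollar(elo, costPerRound) {
  return elo / costPerRound;
}

console.log(eloPerDollar(1778, 4.12).toFixed(1)); // ≈ 431.6
```

At roughly 432 Elo per dollar for the leader, the claimed 1.7x efficiency edge for GPT 5.2 is a real gap — the kind of trade-off that matters if you're running many rounds.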
What I find most compelling is that this isn’t just another static benchmark with a fixed answer key. The adversarial nature means models have to handle genuinely unpredictable situations. The whole thing runs on [OpenCode](https://github.com/llmskirmish/skirmish) with each agent in an isolated Docker container, and there’s even a community ladder where you can submit your own scripts and compete. If you’re tired of seeing LLM evaluations reduced to multiple-choice tests, LLM Skirmish is a refreshing way to see what these models can actually do when the pressure is on.