TL;DR
- LLM Skirmish is a benchmark where LLMs play 1v1 RTS (real-time strategy) games against each other
- LLMs write their battle strategies in code, which are then executed in the game environment.
- LLM Skirmish tests in-context learning: each tournament lasts five rounds, and LLMs can revise their strategies between rounds.
It was great to see the energy around using games to evaluate LLMs last year. Yet there’s a strange gap between frontier LLMs one-shotting complete coding projects and those same models struggling to get out of Mount Moon in Pokémon Red.
We wanted to create an LLM game benchmark that puts coding, the superpower of this generation of frontier LLMs, on full display. Ten years ago, a team released a game called Screeps, described as an “MMO RTS sandbox for programmers”. In Screeps, human players write JavaScript strategies that are executed in the game environment: players gather resources, capture territory, and build and destroy units. It is a traditional RTS, controlled entirely through code.
The Screeps paradigm of writing code that executes in a real-time game environment is well suited to an LLM benchmark. Built on a version of the open-source Screeps API, LLM Skirmish pits LLMs against each other in a series of 1v1 real-time strategy games.
In LLM Skirmish, each player starts with a “spawn” (a building that can produce units), one military unit, and three economic units. The objective of each match is to destroy your opponent’s spawn. If neither player is eliminated within 2,000 game frames (each player is allowed up to one second of computation per frame), the game ends and the winner is determined by score.
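The end-of-match rules above can be sketched as a small decision function. The shape of the match state here (`spawnAlive`, `score`) is an illustrative simplification, not the benchmark’s actual data model:

```javascript
// Decide the winner of a match, mirroring the rules above:
// destroying the enemy spawn wins outright; otherwise, after
// 2,000 frames the higher score wins. Field names are illustrative.
const FRAME_LIMIT = 2000;

function decideWinner(frame, p1, p2) {
  // Immediate elimination: a player loses the moment their spawn falls.
  if (!p1.spawnAlive && !p2.spawnAlive) return 'draw';
  if (!p1.spawnAlive) return 'p2';
  if (!p2.spawnAlive) return 'p1';
  // No elimination yet: the game only ends at the frame limit.
  if (frame < FRAME_LIMIT) return null;
  // Timeout: fall back to score comparison.
  if (p1.score === p2.score) return 'draw';
  return p1.score > p2.score ? 'p1' : 'p2';
}
```

For example, `decideWinner(2000, { spawnAlive: true, score: 12 }, { spawnAlive: true, score: 9 })` resolves the timeout by score and returns `'p1'`.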
Each LLM Skirmish tournament consists of five rounds. In each round, every LLM writes a script implementing its strategy. From Round 2 onward, each LLM can view the results of all its matches from the previous round and use that information to revise the script it submits next. Within a round, every player plays every other player once, so there are 10 matches per round and 50 matches per tournament.
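The match counts above are just a five-player round-robin; a quick sanity check (model names are placeholders):

```javascript
// Enumerate all unordered pairings for one round-robin round.
function roundRobinPairs(players) {
  const pairs = [];
  for (let i = 0; i < players.length; i++) {
    for (let j = i + 1; j < players.length; j++) {
      pairs.push([players[i], players[j]]);
    }
  }
  return pairs;
}

const models = ['A', 'B', 'C', 'D', 'E']; // placeholder names
const perRound = roundRobinPairs(models).length; // C(5, 2) = 10 matches per round
const perTournament = perRound * 5;              // 50 matches over 5 rounds
```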
LLM Skirmish was conducted using OpenCode, an open-source, general-purpose agentic coding harness. OpenCode was selected because it was not designed around any of the evaluated models and is completely open source, which aids replication.
Each LLM agent runs in a separate Docker container with OpenCode providing the coding environment. The orchestrator coordinates the tournament by sending signals to each agent, which then uses OpenCode’s tools (file editing, shell commands, etc.) to write and submit its game scripts.
quick structure
At the beginning of each round, agents receive OBJECTIVE.md (game rules, API documentation, and instructions for writing game scripts) and NEXT_ROUND.md (instructions for reviewing match logs from previous rounds, rounds 2-5 only). Two example strategies are also provided to agents as reference.
script verification
After each agent creates its strategy, the orchestrator validates the script. If verification fails, the agent receives an error message and has 3 attempts to fix the problem before the round can proceed.
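The verification flow above can be sketched as a simple retry loop. `validate` and `requestFix` are hypothetical stand-ins for the orchestrator’s actual checks and for the agent’s fix step:

```javascript
// Run script verification with up to 3 fix attempts, as described above.
// `validate(script)` returns null on success or an error message;
// `requestFix(script, error)` asks the agent for a revised script.
// Both are hypothetical stand-ins, not the orchestrator's real API.
function verifyWithRetries(script, validate, requestFix, maxAttempts = 3) {
  let current = script;
  let error = validate(current);
  let attempts = 0;
  while (error !== null && attempts < maxAttempts) {
    current = requestFix(current, error); // agent sees the error and retries
    error = validate(current);
    attempts++;
  }
  return { ok: error === null, script: current, attempts };
}
```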
LLM Skirmish tests in-context learning: each tournament lasts five rounds, and models can change strategies between rounds. If a model is learning successfully in context, we would expect scripts written after seeing previous results (Rounds 2–5) to be of higher quality than scripts written in Round 1.
Across all tournaments, each model submits 25 scripts, for a total of 250 matches. Within a tournament, each model is a player. But if we instead treat each script as a player and play every script against every other script, we can simulate 7,750 matches and compute a robust per-round average win rate (a proxy for script quality).
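The 7,750 figure is simply the number of unordered script pairs: 5 models × 5 tournaments × 5 rounds = 125 scripts, and C(125, 2) = 7,750.

```javascript
// Sanity-check the simulated match count: every script plays
// every other script exactly once, i.e. n-choose-2 pairings.
function allPairsCount(n) {
  return (n * (n - 1)) / 2;
}

const scripts = 5 * 5 * 5; // 5 models × 5 tournaments × 5 rounds = 125
const simulatedMatches = allPairsCount(scripts); // 7,750
```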
script round vs performance
We can see that four of the five models evaluated show a significant increase in average win rate between Round 1 and Round 5 (Claude Opus 4.5 +20%, GLM 4.7 +16%, GPT 5.2 +7%, Grok 4.1 Fast +6%).
gemini 3 pro performance
Gemini 3 Pro’s performance is an anomaly. Its Round 1 average win rate was 70%, higher than all four other evaluated models, while its Rounds 2–5 average win rate was 15%, lower than all four. Gemini 3 Pro’s Round 1 script is almost a quarter the size of those from the top-performing models, Claude Opus 4.5 and GPT 5.2. A qualitative review of its scripts shows that Gemini 3 Pro had success with simple strategies in Round 1. In Rounds 2–5, compared to the other four models, Gemini 3 Pro most aggressively populated its context with results from previous rounds before submitting its script for that round, suggesting that context rot was a notable contributor to the performance drop. Whether this context rot reflects other models being better at planning tool use than Gemini 3 Pro, or OpenCode being a uniquely poor fit as a harness for Gemini 3 Pro, is worth further investigation in future editions of LLM Skirmish.
API costs vary significantly across models. The chart below plots average cost per round against each model’s ELO rating. Claude Opus 4.5 achieved the highest ELO (1778), but at the highest cost ($4.12/round). GPT 5.2 delivers approximately 1.7 times more ELO per dollar than Claude Opus 4.5.
cost vs performance
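The ELO ratings above presumably follow the standard update rule; this sketch shows the textbook formula, not necessarily the exact K-factor or scale LLM Skirmish uses:

```javascript
// Standard Elo update: compute the expected score from the rating gap,
// then move each rating toward the actual result by a factor of k.
// scoreA is 1 for a win, 0.5 for a draw, 0 for a loss.
function eloUpdate(ratingA, ratingB, scoreA, k = 32) {
  const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
  return [
    ratingA + k * (scoreA - expectedA),
    ratingB + k * ((1 - scoreA) - (1 - expectedA)),
  ];
}
```

With equal 1500 ratings, a win moves the winner to 1516 and the loser to 1484 at k = 32; the sum of ratings is conserved.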
- With a 71% Round 1 win rate, Gemini 3 Pro leads all models in the early game with simple and aggressive strategies
- In later rounds, Gemini 3 Pro struggles to manage information from previous rounds
vs GLM 4.7: the head-to-head win rate is exactly 50%
- In Round 1 matches, Claude Opus 4.5 performed admirably, but its heavy focus on economy left it vulnerable to GPT 5.2.
- By Round 2, Claude Opus 4.5 is already dominant, and its script quality keeps improving in every subsequent round
vs GPT 5.2: by Round 5, GPT 5.2 is often the only model able to score a win against Claude Opus 4.5, preventing a complete sweep
- With a verbose coding style, GPT 5.2’s best scripts rank in the top decile, with Round 2 scripts achieving an 89% hypothetical win rate against the field
- But more code is not always better: a Round 5 script with 39 helper functions lands in the bottom decile, suggesting that GPT 5.2 sometimes overengineers when it should simplify.
vs Claude Opus 4.5: both models win tournaments outright, but Claude has a slight edge in head-to-head matches
- With a +16% win rate increase from Round 1 to Round 5, GLM 4.7 shows the second steepest learning curve of all models, but the improvement is inconsistent, with scripts ranging from top quartile to dead last across the field
- Unlike the top performers, it never applies kiting or formation logic, relying solely on consistent threat prioritization and focus fire to punch above its weight class.
vs Gemini 3 Pro: the head-to-head win rate is exactly 50%
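Focus fire, in its simplest form, means the whole army commits damage to one highest-priority target rather than spreading it. A minimal sketch of such target selection, with illustrative fields (`threat`, `hits`) rather than the benchmark’s actual API:

```javascript
// Pick one focus-fire target for the whole army: take the biggest
// threat, breaking ties by lowest remaining hits so damage is never
// spread across several enemies. Fields are illustrative.
function pickFocusTarget(enemies) {
  if (enemies.length === 0) return null;
  return enemies.reduce((best, e) => {
    if (e.threat !== best.threat) return e.threat > best.threat ? e : best;
    return e.hits < best.hits ? e : best; // finish wounded targets first
  });
}
```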
- Cheap tokens and concise logic allowed Grok 4.1 Fast to claim third place while spending 37 times less per round than the top model.
- But short scripts are fragile: its worst scripts collapse completely, falling from a 75% win rate to just 6.5%.
vs GLM 4.7: Grok 4.1 Fast trails GLM 4.7 by 15 points in average win rate against the other models
