We are designing games for our LMIQ Benchmark Challenge that are challenging for AI but relatively easy for humans. We've developed a variant of the classic battleship game that employs hidden knowledge and special rules to make the game more strategic and interesting. There is still a bit of random luck involved, but the rules certainly provide room for an optimal playing strategy.
In our initial testing, it seemed promising that this was a sufficiently novel design to challenge the reasoning capabilities of state-of-the-art AI models, which is exactly what we wanted. Our suspicion was that the game was out of distribution enough that models would lack any memorized intuition for how to play it effectively. We tried to assess just how "out of distribution" this game might be, and the signal was positive.
To dig deeper into how effectively models could tackle our new game design, we not only tested them directly by having them play the game, but also challenged them to design an optimal algorithm to play it.
We are still keeping the details of the game design secret, so the game will be unseen by models when it is first released with the LMIQ Benchmark.
Game Algorithm Design
For the algorithm design, we first set up a full programmatic environment with an example "player" class to serve as a reference implementation. This reference player essentially just makes random decisions during gameplay, but it demonstrates how to wire everything up.
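Since the game rules remain secret, we can only sketch the shape of that wiring. The snippet below is purely illustrative: `GamePlayer` is the class we ask models to implement, but the method name and arguments here are placeholders, not the real interface.

```python
import random

class GamePlayer:
    """Base interface the game engine calls on each turn.

    The real rules are under wraps, so `choose_action`, `game_state`, and
    `legal_actions` are illustrative placeholders, not the actual API.
    """

    def choose_action(self, game_state, legal_actions):
        raise NotImplementedError


class RandomReferencePlayer(GamePlayer):
    """Roughly the role our reference player serves: it encodes no strategy,
    it just shows how a player plugs into the engine."""

    def choose_action(self, game_state, legal_actions):
        # Pick uniformly among whatever moves are currently legal.
        return random.choice(legal_actions)
```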
We then provided that reference implementation and a thorough description of the game rules in a prompt, and instructed models as follows:
Analyze the above game description and reference code implementation. Your task is to devise an optimal playing algorithm for this game and then implement the solution in code. Design the optimal playing strategy and then implement a code solution which implements the GamePlayer class. Note that algorithmic time complexity and performance are key considerations in your solution, you must keep runtime performance in mind.
We took the initial response from each model and tested it, and we also prompted each model for a single revision to its response, instructing it:
Nice, now, review your proposed solution for any issues or improvements. You can provide a revised solution if you have any improvements and I will proceed to test — or, if you have no updates, just say so.
Only Grok declined to make improvements 😎.
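In code, this interaction is nothing more exotic than a two-turn conversation per model. The sketch below assumes a generic `complete(model, messages)` chat helper, which is our own placeholder rather than any particular provider's API.

```python
# Hypothetical two-turn flow: initial solution, then one self-revision pass.
# `complete(model, messages)` is a stand-in for whichever chat API each model uses.

REVISION_PROMPT = (
    "Nice, now, review your proposed solution for any issues or improvements. "
    "You can provide a revised solution if you have any improvements and I will "
    "proceed to test -- or, if you have no updates, just say so."
)

def collect_solutions(model, task_prompt, complete):
    messages = [{"role": "user", "content": task_prompt}]
    initial = complete(model, messages)                       # first attempt
    messages += [{"role": "assistant", "content": initial},
                 {"role": "user", "content": REVISION_PROMPT}]
    revised = complete(model, messages)                       # single revision pass
    return initial, revised                                   # both versions get tested
```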
Evaluation
To evaluate the algorithms, we tested them against three player variants we had previously developed:
- The random player (the previously mentioned reference implementation)
- A more intelligent player, which still has some random behavior (the "Optimal Player (Random)" baseline below)
- A more intelligent player with strategic behavior (the "Optimal Player (Strategic)" baseline below)
We tested each AI-designed implementation against each of the above programmatic players, for a total of 6,000 games:
- 2,000 games against each of the three human-designed players, with the first-mover advantage split 50/50 across all games
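The evaluation harness itself is simple; roughly the loop below, where `play_game` and the player objects stand in for our private engine and implementations.

```python
# Rough shape of the evaluation harness (the real engine and players are private).
# Each candidate plays 2,000 games per baseline opponent (6,000 total), alternating
# who moves first so the first-mover advantage is split 50/50.

GAMES_PER_OPPONENT = 2000

def evaluate(candidate, opponents, play_game):
    wins = total = 0
    for opponent in opponents:                    # the 3 human-designed baselines
        for i in range(GAMES_PER_OPPONENT):
            if i % 2 == 0:                        # alternate the first mover
                winner = play_game(first=candidate, second=opponent)
            else:
                winner = play_game(first=opponent, second=candidate)
            wins += int(winner is candidate)
            total += 1
    return wins, total, wins / total              # e.g. 2427/6000 -> ~40.5% win rate
```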
In addition, we produced and tested three more AI-designed solutions using our LLM Thinktank tool, which lets us run a problem statement through a pipeline workflow of multiple models. We used this workflow to have GPT 5.2, Gemini 3 Pro, Claude Opus 4.5, and Grok 4.1 each develop and refine a solution, then had Gemini 3 Pro synthesize the results and GPT 5.2 apply a final round of review/revision (which cost a total of $1.30 USD).
We repeated this process with GPT 4o as the only model, and we also tested a hybrid synthesis approach that combines multiple model outputs.
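Conceptually, a Thinktank run looks something like the sketch below; the staging is a simplification of the actual tool, and `complete` is again a placeholder chat helper rather than a real API.

```python
# Simplified view of the LLM Thinktank pipeline behind the synthesis players:
# each model drafts and refines its own solution, one model synthesizes the set,
# and a final model performs one more review/revision pass.

def thinktank(problem, models, synthesizer, reviewer, complete):
    drafts = []
    for model in models:                               # develop + refine per model
        draft = complete(model, problem)
        drafts.append(complete(model, f"{problem}\n\nRefine this solution:\n{draft}"))

    combined = "\n\n---\n\n".join(drafts)
    synthesis = complete(synthesizer,                  # e.g. Gemini 3 Pro
        f"{problem}\n\nSynthesize the best solution from these drafts:\n{combined}")

    return complete(reviewer,                          # e.g. GPT 5.2 final review pass
        f"{problem}\n\nReview and revise this solution:\n{synthesis}")
```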
Results
The results are as follows:
LLM Designed Algorithmic Player Results:
| Player | Games Won/Total | Win Rate |
|---|---|---|
| GPT 5.2 Player | 1876/6000 | 31.3% |
| GPT 5.2 Player Revised | 1704/6000 | 28.4% |
| Gemini 3 Pro Player | 1417/6000 | 23.6% |
| Gemini 3 Pro Player Revised | 1860/6000 | 31.0% |
| Claude Opus 4.5 Player | 2427/6000 | 40.5% |
| Claude Opus 4.5 Player Revised | 2443/6000 | 40.7% |
| Grok Player | 1843/6000 | 30.7% |
| SOTA Synthesis Player | 1968/6000 | 32.8% |
| GPT 4o Synthesis Player | 349/6000 | 5.8% |
| Hybrid Synthesis Player | 166/6000 | 2.8% |
Human Designed Algorithmic Player Baselines:
| Player | Games Won/Total | Win Rate |
|---|---|---|
| Random Player | 1370/6000 | 22.8% |
| Optimal Player (Random) | 3230/6000 | 53.8% |
| Optimal Player (Strategic) | 4405/6000 | 73.4% |
As you can see, the GPT 4o synthesis player won only 5.8% of its games, and the hybrid synthesis player performed even worse at 2.8%, highlighting how hard this task is for models. (It's not that the task is objectively hard... it's just a little outside the pre-training corpora.) The other results mostly converge around ~30%.
Some highlights worth pointing out:
- This task is very challenging for frontier models, in two ways:
- Inferring the optimal game playing strategy is difficult
- Implementing an optimal solution in code is difficult (although, not as hard as the strategy design)
- The baseline results from our programmatic players establish useful reference points for judging the LLM-produced solutions: ~23% is roughly what you get by playing randomly, while a win rate over 70% is quite good.
- The model-designed solutions are barely better than playing randomly. The Claude outlier is the result of a more aggressive, riskier playing strategy.
- The SOTA synthesis approach did not lift performance at all. It remained dragged down by the common denominator: none of the models could develop the right insight into the core game strategy.
As one of the models, which we asked to analyze these results, put it, the "Meta-Lesson" here is:
These models can implement complex algorithms but failed at strategic reasoning about tradeoffs.
Our optimal player algorithm is mostly programmed by an AI, but its design was arrived at through human reasoning and intuition.
Playing the Models
Of course, the real test here is how the models actually play the game. In our testing so far, the models generally struggle to play well. It's challenging for them to synthesize the optimal playing strategy on the fly and apply it correctly to the current game situation. For humans, by contrast, it is not that hard.
Testing the game with the models directly gets a little expensive... we simply can't afford to run 1,000 games. But we have run small samples with Grok 4.1 and Gemini 3 Flash against our optimal player (both random and strategic), and their win rate is 0%.
LLM: google/gemini-3-flash-preview
Opponent: OptimalPlayer (RANDOM)
LLM goes first
[Game 1] Finished in 5.2s - Opponent wins via detonation (6 turns) $0.0040
[Game 3] Finished in 6.9s - Opponent wins via detonation (6 turns) $0.0047
[Game 4] Finished in 8.9s - Opponent wins via detonation (8 turns) $0.0064
[Game 5] Finished in 18.8s - Opponent wins via detonation (14 turns) $0.012
[Game 6] Finished in 28.1s - Opponent wins via detonation (16 turns) $0.018
[Game 2] Finished in 30.1s - Opponent wins via detonation (18 turns) $0.019
[Game 9] Finished in 36.8s - Opponent wins via detonation (22 turns) $0.025
[Game 10] Finished in 40.4s - Opponent wins via detonation (20 turns) $0.026
[Game 8] Finished in 40.6s - Opponent wins via detonation (28 turns) $0.029
[Game 7] Finished in 53.2s - Opponent wins via failed detonation (27 turns) $0.035
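For the curious, the harness behind logs like the one above is essentially a move-by-move loop: prompt the model for its next move, apply it in the engine, and tally turns and API cost. The sketch below assumes a hypothetical `engine` interface and an `llm_move` helper; the real harness differs in its details.

```python
import time

# Hedged sketch of the direct-play harness: the engine interface and the
# `llm_move(description) -> (move, usd_cost)` helper are assumptions, not the
# actual code, but the overall loop and the logged fields match what we track.

def play_llm_vs_optimal(game_no, llm_move, engine):
    start, cost, turns = time.time(), 0.0, 0
    state = engine.new_game(llm_goes_first=True)
    while not engine.is_over(state):
        if engine.current_player(state) == "llm":
            move, usd = llm_move(engine.describe(state))   # one API call per move
            cost += usd
        else:
            move = engine.optimal_move(state)              # our optimal player
        state = engine.apply(state, move)
        turns += 1
    print(f"[Game {game_no}] Finished in {time.time() - start:.1f}s - "
          f"{engine.winner(state)} wins ({turns} turns) ${cost:.4f}")
```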
We are quite confident that none of the (current) frontier models will be able to beat our optimal player. As part of the LMIQ benchmark, the goal here is to see whether models can play as well (and as efficiently) as human players.
Takeaways
- It remains challenging for the models to generalize outside of their training data.
- Global risk calculations, tradeoffs, and optimization problems are especially hard.
- Frontier LLMs struggle to track relatively simple game state as it grows over a series of moves, and they become prone to mistakes.
Look forward to playing this game on the LMIQ platform in the future!