Introduction
The ability to identify novel directions of interest is one of the key aspects of human intelligence. Do large language models have a “sense” of when something is new?
We became interested in this idea after developing a simple variation of a well-known game, which leading frontier AI models struggled both to play and to develop an optimal strategy for. This lack of ability manifested itself along multiple dimensions:
- Inability to play the game effectively against a human player
- Inability to articulate the optimal game playing strategy
- Inability to design an optimal game playing algorithm
This made us suspect the game design is sufficiently out of distribution for these models. Does the LLM “know” this? Models will not hesitate to play the game, confidently articulate an optimal strategy, or write an optimal game-playing algorithm, but do they have some internal sense that they are actually not very good at this particular task?
Note: You might be able to approximate something similar using perplexity, but this would require access to logprobs for the prompt input tokens, which most frontier model providers do not currently expose. Moreover, here we were more interested in evaluating the models’ explicitly stated novelty perception rather than implicit internal uncertainty measures.
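For context, here is a minimal sketch of what that perplexity signal could look like if a provider did return per-token logprobs for the prompt. The prompt_logprobs input is hypothetical rather than any specific provider’s API field.

```python
import math

def prompt_perplexity(prompt_logprobs: list[float]) -> float:
    """Perplexity of a prompt given per-token natural-log probabilities.

    `prompt_logprobs` is a hypothetical input: one log probability per
    prompt token, which most frontier APIs do not currently expose.
    """
    avg_neg_logprob = -sum(prompt_logprobs) / len(prompt_logprobs)
    return math.exp(avg_neg_logprob)

# A more "surprising" prompt (lower token probabilities) yields a higher
# perplexity, hinting that the text is further out of distribution.
print(prompt_perplexity([-0.2, -1.5, -3.1, -0.8]))  # ≈ 4.06
```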
Experimental Setup
We developed a simple prompt for models to rate how interesting or novel a particular problem statement is:
You are a novelty expert, rating questions/topics/problems on a scale of 1 (mundane, boring) to 100 (extremely interesting, novel). The rating does not imply the difficulty of solving the problem, it is intended to measure how out of distribution the particular problem is compared to the corpus of all existing human knowledge. A score of 100 means the topic is entirely out of distribution.
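As a concrete illustration, the sketch below shows how a single scoring query could be issued with this prompt against an OpenAI-style chat completions API. The `ask_novelty_score` helper and the default model name are illustrative choices, not the exact code behind our results.

```python
from openai import OpenAI

NOVELTY_PROMPT = (
    "You are a novelty expert, rating questions/topics/problems on a scale of "
    "1 (mundane, boring) to 100 (extremely interesting, novel). The rating does "
    "not imply the difficulty of solving the problem, it is intended to measure "
    "how out of distribution the particular problem is compared to the corpus "
    "of all existing human knowledge. A score of 100 means the topic is "
    "entirely out of distribution."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_novelty_score(problem: str, model: str = "gpt-4o-mini") -> str:
    """Send one scoring query with no prior context and return the raw reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": NOVELTY_PROMPT},
            {"role": "user", "content": problem},
        ],
    )
    return response.choices[0].message.content

print(ask_novelty_score("What is 2 + 5 * 9?"))
```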
We then developed a set of baseline control problems of varying difficulty:
Easy, Trivial Problems
- How many 'r's are in the word 'strawberry'?
- What is 2 + 5 * 9?
- What is the optimal Tic-Tac-Toe strategy for the first player to win?
- What is the optimal strategy for the classic game of Battleship?
- What is the optimal path through a simple maze from start to finish?
Very Difficult Problems
For these, we took the example problems from the GPQA paper (GPQA: A Graduate-Level Google-Proof Q&A Benchmark) for the following subject areas:
- GPQA Chemistry General
- GPQA Organic Chemistry
- GPQA Genetics
- GPQA Molecular Biology
- GPQA Astrophysics
Including these GPQA questions is important because they contain more interesting knowledge than the easy questions and are very difficult, yet they are definitely included in the training data for current frontier models.
Novel Problems
To test for novelty, we used some of the problem statements we are developing for the LMIQ benchmark:
- Maze Path-Finding Variant
- Rover Game Challenge
- Novel Battleship Variant
For now, we are keeping the details of these secret. However, each is a relatively simple variation of an existing game or problem: easy for humans to understand and solve, yet significantly difficult for modern LLMs, which makes us suspect they fall outside the training data distribution. They are certainly not anything groundbreaking.
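To make the setup concrete, the sketch below shows one way the three problem groups can be organized for scoring. The easy problems are quoted from above; the GPQA and LMIQ entries are placeholders, since the former come verbatim from the benchmark paper and the latter are not yet public.

```python
# Illustrative grouping of the evaluation problems by category.
# The GPQA and LMIQ entries below are placeholders, not the real texts.
PROBLEM_SETS: dict[str, list[str]] = {
    "easy": [
        "How many 'r's are in the word 'strawberry'?",
        "What is 2 + 5 * 9?",
        "What is the optimal Tic-Tac-Toe strategy for the first player to win?",
        "What is the optimal strategy for the classic game of Battleship?",
        "What is the optimal path through a simple maze from start to finish?",
    ],
    "gpqa": [
        "<GPQA chemistry (general) example question>",
        "<GPQA organic chemistry example question>",
        "<GPQA genetics example question>",
        "<GPQA molecular biology example question>",
        "<GPQA astrophysics example question>",
    ],
    "novel": [
        "<LMIQ maze path-finding variant>",
        "<LMIQ rover game challenge>",
        "<LMIQ novel Battleship variant>",
    ],
}
```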
Results
To obtain novelty scores, we prompted each model with the novelty prompt and a single problem statement; each scoring query was submitted with no prior context. We tested two groups of models: a group of slightly older, fast, and affordable models (easier to test initially) and a group of current state-of-the-art frontier models. For each model and problem, we repeated the query 10 times to mitigate variance in the response scores and took the average of the 10 scores. This resulted in over 1,000 data points.
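A rough sketch of that scoring loop, reusing the `ask_novelty_score` helper and `PROBLEM_SETS` grouping from the sketches above; the regex-based score extraction is our own illustrative choice, not necessarily how the repo parses responses.

```python
import re
import statistics

def extract_score(reply: str) -> float | None:
    """Pull the first integer in the 1-100 range out of the model's reply."""
    match = re.search(r"\b(100|[1-9][0-9]?)\b", reply)
    return float(match.group(1)) if match else None

def average_novelty_score(problem: str, model: str, runs: int = 10) -> float:
    """Query the model `runs` times with no prior context and average the scores."""
    scores = []
    for _ in range(runs):
        reply = ask_novelty_score(problem, model=model)  # from the earlier sketch
        score = extract_score(reply)
        if score is not None:
            scores.append(score)
    return statistics.mean(scores)

# Score every problem in every category for one model.
results = {
    category: [average_novelty_score(p, model="gpt-4o-mini") for p in problems]
    for category, problems in PROBLEM_SETS.items()
}
```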
In general, all of the models rate our novel problems as somewhat more novel and interesting on this scale than the baseline controls.
Fast (Affordable) Models
We tested the following models:
- GPT 4o-mini
- Gemma 3 12B
- Olmo 3.1 32B
- Mimo V2 Flash
- Z.ai GLM 4.5 Air
Frontier Models
We tested the following models:
- ChatGPT 5.2
- Gemini 3 Pro
- Claude Opus 4.5
- Grok 4.1
The full prompts and result data are available in this GitHub repo.
Conclusion
This preliminary experimentation does indicate that modern LLMs have some internal novelty compass. It is interesting that LLMs exhibit a basic analogue of this human ability, which suggests some degree of “self-awareness” or meta-cognition, or at least an implicit understanding of the gap between the knowledge they have been trained on and new information that may be out of distribution. Given that the models have been trained on essentially all publicly available human data, it is not that surprising that they have some awareness of when a new data point is partially or completely out of distribution.
Now, we have been using the terms “novelty” and “interesting” a bit loosely. There is a distinction between:
- Some new factual knowledge that is novel but relatively uninteresting. For example, “some distant planet is composed of 35% iron” would be new information, but not especially interesting.
- Some new abstract idea that is both novel and very interesting. For example, a new scientific theory that explains an area of science that is otherwise poorly understood.
These are rich concepts, and humans have a very well-developed sense of what is novel and interesting. This internal compass is a fundamental component of our intelligence, directing us toward the specific problem areas (out of the very large problem space of reality) that are more likely to expand our collective knowledge and understanding of the world.
Admittedly, our prompting approach does mix together these concepts of “novelty”, “interestingness” and “out-of-distribution”. A useful follow-up would be to refine these definitions more rigorously, for instance:
- “Novelty” implies something completely out of distribution of existing human knowledge.
- “Interestingness” implies … well—we’ll do that experiment once we have a good definition of what interestingness is.
For now, this is a useful approach for us to validate how out of distribution a particular problem idea is for an LLM.