A Sense of Novelty in Large Language Models

December 17, 2025

Introduction

The ability to identify novel directions of interest is one of the key aspects of human intelligence. Do large language models have a “sense” of when something is new?

We became interested in this idea after developing a simple variation of a well-known game that leading frontier AI models struggled both to play and to develop an optimal strategy for. This lack of ability manifested itself along multiple dimensions:

This made us suspect that the game design was sufficiently out of distribution for these models. Do the models “know” this? They will not hesitate to play the game, confidently articulate an optimal strategy, or write an optimal game-playing algorithm. But do they have some internal sense that they are actually not very good at this particular task?

Note: You might be able to approximate something similar using perplexity, but this would require access to logprobs for the prompt input tokens, which most frontier model providers do not currently expose. Moreover, here we were more interested in the model’s explicitly stated perception of novelty than in implicit internal uncertainty measures.
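As a rough illustration, if a provider did expose per-token logprobs on the prompt, prompt perplexity could be computed along these lines. This is a minimal sketch; the `prompt_logprobs` list is a hypothetical input, not a field from any particular API.

```python
import math

def perplexity(prompt_logprobs: list[float]) -> float:
    """Perplexity of a prompt given per-token log-probabilities (natural log).

    A higher value means the model found the text more surprising,
    i.e. further from its training distribution.
    """
    avg_neg_logprob = -sum(prompt_logprobs) / len(prompt_logprobs)
    return math.exp(avg_neg_logprob)

# Hypothetical logprobs for a short prompt; real values would come from the provider.
print(perplexity([-0.2, -1.5, -0.7, -3.1]))  # ≈ 3.96
```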

Experimental Setup

We developed a simple prompt for models to rate how interesting or novel a particular problem statement is:

You are a novelty expert, rating questions/topics/problems on a scale of 1 (mundane, boring) to 100 (extremely interesting, novel). The rating does not imply the difficulty of solving the problem, it is intended to measure how out of distribution the particular problem is compared to the corpus of all existing human knowledge. A score of 100 means the topic is entirely out of distribution.
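To make the setup concrete, here is a minimal sketch of what a single scoring call might look like. The `complete()` argument is a hypothetical stand-in for whatever provider SDK is used; it is assumed to take a system prompt and a user message and return the model's text response.

```python
import re

NOVELTY_SYSTEM_PROMPT = (
    "You are a novelty expert, rating questions/topics/problems on a scale of "
    "1 (mundane, boring) to 100 (extremely interesting, novel). The rating does "
    "not imply the difficulty of solving the problem, it is intended to measure "
    "how out of distribution the particular problem is compared to the corpus of "
    "all existing human knowledge. A score of 100 means the topic is entirely "
    "out of distribution."
)

def score_novelty(problem_statement: str, complete) -> float:
    """Ask the model for a 1-100 novelty score and parse the first number it returns.

    `complete` is a hypothetical callable wrapping the provider's chat API.
    """
    response = complete(system=NOVELTY_SYSTEM_PROMPT, user=problem_statement)
    match = re.search(r"\d+(?:\.\d+)?", response)
    if match is None:
        raise ValueError(f"No numeric score found in response: {response!r}")
    return float(match.group())
```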

We then developed a set of baseline control problems of varying difficulty:

Easy, Trivial Problems

These include counting the ‘r’s in ‘strawberry’, simple arithmetic, Tic-Tac-Toe, classic Battleship, and maze path-finding.

Very Difficult Problems

For these we took each of the example problems from the GPQA: A Graduate-Level Google-Proof Q&A Benchmark paper for the following subject areas: chemistry (general), organic chemistry, genetics, molecular biology, and astrophysics.

Including these very difficult GPQA questions is important: they contain more interesting knowledge than the easy questions and are genuinely hard, yet they are definitely included in the training data for current frontier models.

Novel Problems

To test for novelty, we used some of the problem statements we are developing for the LMIQ benchmark: a maze path-finding variant, a rover game challenge, and a novel Battleship variant.

For now, we are keeping the details of these secret. However, each is a relatively simple variation of an existing game or problem: easy for humans to understand and solve, yet significantly difficult for modern LLMs, which makes us suspect they fall outside the training data distribution. They are certainly not anything groundbreaking.

Results

To obtain novelty scores, we prompted each model with the novelty prompt and a single problem statement; each score query was submitted with no prior context. We tested two groups of models: a group of slightly older, fast and affordable models (easier to test initially) and a group of current state-of-the-art frontier models. For each model and problem, we repeated the query 10 times to mitigate variance in the response scores and took the average of the 10 results. This produced over 1,000 data points.
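Sketched out, the scoring protocol looks roughly like this, reusing the hypothetical `score_novelty()` helper from above; each call is an independent, context-free query.

```python
from statistics import mean

def average_novelty(problem_statement: str, complete, repeats: int = 10) -> float:
    """Score a problem `repeats` times with no shared context and average the results."""
    scores = [score_novelty(problem_statement, complete) for _ in range(repeats)]
    return mean(scores)

# Hypothetical usage over a problem set for one model:
# results = {name: average_novelty(text, complete) for name, text in problems.items()}
```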

In general, all of the models rated our novel problems as somewhat more novel and interesting on this spectrum than the baseline controls.

Fast (Affordable) Models

We tested the following models:

Average novelty scores (0-100):

Easy, Trivial Problems
  Count 'r's in 'strawberry': 1.8
  Simple Arithmetic: 1.2
  Tic-Tac-Toe: 5.2
  Classic Battleship: 11.4
  Maze Path-Finding: 18.2

GPQA Problems (Hard)
  GPQA Chemistry General: 47.7
  GPQA Organic Chemistry: 61.8
  GPQA Genetics: 36.6
  GPQA Molecular Biology: 39.9
  GPQA Astrophysics: 31.7

Novel Problems
  Maze Path-Finding Variant: 76.9
  Rover Game Challenge: 76.8
  Novel Battleship Variant: 72.9

Frontier Models

We tested the following models:

Average novelty scores (0-100):

Easy, Trivial Problems
  Count 'r's in 'strawberry': 2.4
  Simple Arithmetic: 1.0
  Tic-Tac-Toe: 2.5
  Classic Battleship: 6.5
  Maze Path-Finding: 7.9

GPQA Problems (Hard)
  GPQA Chemistry General: 38.6
  GPQA Organic Chemistry: 50.7
  GPQA Genetics: 21.0
  GPQA Molecular Biology: 23.6
  GPQA Astrophysics: 17.9

Novel Problems
  Maze Path-Finding Variant: 42.0
  Rover Game Challenge: 48.2
  Novel Battleship Variant: 51.4

The full prompts and result data are available in this GitHub repo.

Conclusion

This preliminary experimentation does indicate that modern LLMs have some internal novelty compass. It is interesting that LLMs exhibit a basic analogue of this human ability, which suggests some degree of “self-awareness” or meta-cognition, or at least an implicit understanding of the gap between the knowledge they have been trained on and new information that may be out of distribution. Given that the models have been trained on essentially all human data, it is perhaps not surprising that they have some sense of when a new data point is partially or completely out of distribution.


Now, we have been using the terms “novelty” and “interesting” a bit loosely. There is a distinction between:

- something being genuinely novel, i.e. new relative to existing human knowledge,
- something being interesting, i.e. worth attention and pursuit, and
- something being out of distribution for a model, i.e. unlike the data it was trained on.

These are rich concepts, and humans have a very well-developed sense of what is novel and interesting. This internal compass is a fundamental component of our intelligence: it directs us to pursue the specific problem areas, out of the very large problem space of reality, that are most likely to expand our collective knowledge and understanding of the world.

Admittedly, our prompting approach does mix together these concepts of “novelty”, “interestingness” and “out-of-distribution”. A useful follow-up would be to refine these definitions more rigorously, for instance:

For now, this is a useful approach for us to validate how out of distribution a particular problem idea is for an LLM.