In my work with GriffinAI, I have been deeply involved in exploring and developing the nuanced capabilities of large language models. GriffinAI's mission is to push the boundaries of what AI can achieve in creative and logical domains, and the recent review I wrote on "Large Language Interpolators Can Learn Logical Reasoning" fits squarely into that context. At GriffinAI, we are constantly focused on enhancing our products and offerings, and we are always looking for ways to improve both the reasoning capabilities and the efficiency of the models we work with.
GriffinAI's approach to AI agent development
GriffinAI's emphasis on advancing AI reasoning capabilities is a key component of our development strategy. We are researching ways to refine the cognitive processes of the models we work with so that they go beyond simple pattern recognition. This aligns directly with the study "Large Language Interpolators Can Learn Logical Reasoning", which offers insights into how AI models may be exhibiting logical understanding instead of relying purely on memorization. Applying advanced optimization techniques, from gradient descent to reinforcement learning, may allow us to traverse more complex reasoning pathways.
Using Knights and Knaves to improve AI reasoning
The study I reviewed, centered on the Knights and Knaves puzzles, investigates whether large language models truly understand logical reasoning or whether their apparent success is merely due to memorization. GriffinAI's research aligns with this inquiry: determining the balance between memorization and true reasoning in our AI products. As we research more advanced reasoning in AI, we draw on studies like this to understand how changes in training can affect generalization and genuine problem-solving ability.
Investigating logical reasoning vs. memorization
The research highlights the importance of distinguishing between reasoning and memorization. At GriffinAI, this distinction matters a great deal to us, and we are actively researching approaches to transfer learning and domain-specific adaptation that let LLMs adjust to new scenarios while minimizing reliance on rote memorization. This study informs how we might fine-tune our models toward that goal.
Applying metrics for logical consistency at GriffinAI
In this research, the metric developed in the paper (Acc * (1 - CR)) serves as an insightful tool for assessing a model's tendency towards memorization. At GriffinAI we are looking at integrating such metrics into our iterative model development, so that our solutions improve their logical consistency without leaning too heavily on learned examples. The implications are significant, especially when developing AI for applications that require adaptability and flexible reasoning rather than rote responses. Leveraging such metrics during model validation, alongside hyperparameter tuning and cross-validation, may be a viable way to strengthen model robustness.
Enhancing model adaptability and reasoning at GriffinAI
Through my involvement in developing GriffinAI's offerings, I see how vital it is for us to navigate the fine line between effective learning and memorization. The findings in this review directly inform how computer scientists working with LLMs may choose to structure training data and the methodologies used to enhance reasoning capabilities across different domains. The study resonates with GriffinAI's broader mission to develop products that genuinely learn, adapt, and reason in a way closer to human cognitive processes. Advanced methodologies, including dropout regularization and refinements to attention mechanisms, may prove effective for preventing overfitting and fostering deeper reasoning pathways.
Investigating the core issues
The authors of the article cited above aimed to specifically investigate this issue. They devised a task for the study and a metric to assess the balance between memorization and understanding in the model. How can this balance be measured? Let's take an example: when someone is preparing for an interview or an exam, they might not fully grasp all the core principles but might memorize a few problems. And when they encounter one of those problems, they can solve it. However, if even a small change is made to one of the steps, they may struggle.
Characteristics of memorization
Two key characteristics of memorization, based on this example, are:
A) High accuracy on previously seen problems.
B) Low accuracy on new, very similar problems (due to a lack of understanding of the underlying principle).
Developing the metric for assessment
The authors created a formula that reflects both traits. First, they measure accuracy on a set of tasks; this is Acc (for Accuracy). Then they make a minor change to these tasks, one that doesn't affect the complexity but leads to a different answer, and check the responses again. The metric CR (Consistency Rating) is the proportion of tasks that were solved correctly both before and after the change. The higher the CR, the better the model handles slightly altered tasks.
The formula is: LiMem = Acc * (1 - CR). The higher the score, the more likely it is that the model relies on memorization rather than genuine understanding or reasoning. The higher the CR, the smaller the second factor and the smaller the overall score, which makes sense: if the model still solves the modified tasks, there's no reason to think it simply memorized them.
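As a small illustration, here is a minimal sketch of the metric as a Python function. This is not the authors' code; the function and argument names are mine.

```python
def limem(acc: float, cr: float) -> float:
    """Memorization score: Acc * (1 - CR).

    acc: accuracy on the original (unmodified) tasks.
    cr:  consistency rating, the share of originally solved tasks
         that are still solved correctly after the minor change.
    """
    if not (0.0 <= acc <= 1.0 and 0.0 <= cr <= 1.0):
        raise ValueError("acc and cr must lie in [0, 1]")
    return acc * (1.0 - cr)
```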
Real-world implications and testing
Let's look at an example. Suppose there are 10 tasks, and the model solves 9 of them without any modifications, so Acc = 0.9 and everything looks good. But after a slight change in conditions, it only solves 1 of those 9 tasks, so CR = 1/9 ≈ 0.11. The second factor is therefore quite large (0.89), and the final score is 0.9 * 0.89 ≈ 0.8, which is high: likely memorization. But if the model solved 8 out of 9 after the changes, CR = 8/9 ≈ 0.89, and the score would be 0.9 * (1 - 0.89) ≈ 0.1, which is very low. The authors treat anything above 0.1 as a sign of memorization.
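For readers who prefer code to mental arithmetic, here are the same two scenarios as a tiny self-contained check:

```python
acc = 9 / 10                # 9 of 10 original tasks solved, Acc = 0.9

cr_bad = 1 / 9              # only 1 of the 9 survives the change, CR ≈ 0.11
print(acc * (1 - cr_bad))   # ≈ 0.80, far above 0.1: likely memorization

cr_good = 8 / 9             # 8 of the 9 survive the change, CR ≈ 0.89
print(acc * (1 - cr_good))  # ≈ 0.10, right at the threshold: little sign of it
```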
The role of Knights and Knaves in assessing AI reasoning
To measure this, the authors needed a task whose conditions can be easily changed without altering the complexity of the solution, and for which new answers can be generated automatically. They recalled the game Knights and Knaves, which they encountered in school: knights always tell the truth, knaves always lie. The characters exchange a few statements, and you must use a chain of reasoning to figure out who is who. Each task is characterized by the number of people and statements.
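To make the puzzle format concrete, here is a minimal brute-force solver sketch. It is not the authors' generator; the two-person puzzle and the perturbation below are invented for illustration. The solver simply enumerates every knight/knave assignment and keeps the ones consistent with all statements.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# A claim pairs a speaker with a predicate over a candidate assignment
# (True = knight, False = knave).
Claim = Tuple[str, Callable[[Dict[str, bool]], bool]]

def solve(people: List[str], claims: List[Claim]) -> List[Dict[str, bool]]:
    """Return every knight/knave assignment consistent with the claims."""
    solutions = []
    for values in product([True, False], repeat=len(people)):
        world = dict(zip(people, values))
        # A knight's claim must be true in this world, a knave's must be false.
        if all(world[speaker] == claim(world) for speaker, claim in claims):
            solutions.append(world)
    return solutions

# Example: A says "B is a knave"; B says "A and I are of the same kind".
people = ["A", "B"]
claims = [("A", lambda w: not w["B"]), ("B", lambda w: w["A"] == w["B"])]
print(solve(people, claims))     # [{'A': True, 'B': False}]: A is a knight, B a knave

# A "minor change" in the spirit of the paper: flip B's statement to
# "A and I are of different kinds". Same difficulty, different answer.
perturbed = [("A", lambda w: not w["B"]), ("B", lambda w: w["A"] != w["B"])]
print(solve(people, perturbed))  # [{'A': False, 'B': True}]
```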
Questioning the reasoning capabilities
A separate question is whether such a task can be classified as "requiring reasoning." The important thing is to settle this before we see how LLMs perform on it; otherwise it's tempting to argue backwards: if accuracy is 5%, it must be reasoning, but if it's 95%, then maybe it isn't.
Observations from initial model responses
Here is a figure with the results of the models out of the box, using a simple prompt and no example solutions.
The leftmost image shows the accuracy of various LLMs depending on the number of people in the task. Even for two participants, the best of the tested models don't exceed 70% accuracy (and only 32% for 5 people). It's unfortunate that OpenAI's o1 models aren't included, but they probably hadn't been released at the time.
The other two images display the metric we discussed earlier, which is the product of two numbers. The higher the number in the cell, the more the model relies on memorization, and the worse it performs on modified tasks.
It's also clear that the best models have a high LiMem (greater than 0.1, the threshold the authors set between memorization and reasoning). On examples with 2-3 people this is plausible; maybe there really were similar examples on the internet. But when GPT-4o shows a metric of 0.14-0.15 on tasks with 6-7 people in the middle image, that raises doubts for me. I think it's highly unlikely that a significant portion of the 100 tasks randomly generated by the authors with their own program, involving 6-7 people(!), appeared on the internet and had been seen by the models. Or that someone at OpenAI/Anthropic happened to be doing the same thing and accidentally wrote a similar task generator.
Concluding the study with further research aims
And in general, the researchers note that, judging by how far the other models lag behind, there are very few texts with such tasks on the internet, and they rarely make it into the training data.
Next, the authors conduct several experiments fine-tuning the Llama-3-8B and GPT-4o-mini models and testing them on tasks different from the training ones. I'll save you some time and go straight to the conclusions:
The generalization ability of the models increases with their level of task memorization. This can be tracked by the increase (rather than decrease) in the metric they introduced (the formula we discussed), meaning the performance on modified tasks drops relative to the original ones. But at the same time, performance on previously unseen tasks also improves. So, there's no pure overfitting, but memorization is present. It seems impossible to separate it from reasoning.
Moreover, training on tasks involving N people also improves performance on tasks with M participants, regardless of whether M is greater than N. Such training makes the reasoning chains used to solve more complex tasks more reliable; they more often lead to correct answers.
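As a hedged sketch, this is how such a report could be computed at each fine-tuning checkpoint. The `answer` callable and the pairing of each puzzle with a perturbed twin are assumptions of mine, not the authors' evaluation harness.

```python
from typing import Callable, List, Tuple

def memorization_report(
    pairs: List[Tuple[str, str, str, str]],  # (puzzle, gold, perturbed_puzzle, perturbed_gold)
    answer: Callable[[str], str],            # stand-in for a call to the model under test
) -> Tuple[float, float, float]:
    """Return (Acc, CR, LiMem) for a set of paired original/perturbed puzzles."""
    solved = consistent = 0
    for puzzle, gold, perturbed, perturbed_gold in pairs:
        if answer(puzzle) == gold:
            solved += 1
            if answer(perturbed) == perturbed_gold:
                consistent += 1
    acc = solved / len(pairs)
    cr = consistent / solved if solved else 0.0
    return acc, cr, acc * (1 - cr)
```

Tracking this triple across checkpoints is one way to observe the pattern described above: accuracy on unseen tasks and LiMem rising together.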
Concluding thoughts and future directions
The latest research is unveiling the intricate relationship between memorization and reasoning in AI, highlighting both the opportunities and the challenges. As we continue to develop GriffinAI's offerings, we strive to refine our approach and strike a sound balance between these two elements. The insights gained from the wider community's research in this field continue to shape how we implement and develop AI agents at GriffinAI.
For those interested in experimenting and exploring more, we invite you to try GriffinAI's Playground, where our latest agents are available for testing and interaction. Your engagement and feedback are crucial to shaping the next evolution of AI reasoning. Visit the GriffinAI Playground to see for yourself.
Let’s advance AI together, pushing the boundaries of what’s possible with large language models to achieve truly adaptable and human-like reasoning capabilities.