Humans, LLMs & Intelligence: Model Evaluation Metrics

Griffin AI Team | Thursday, December 5, 2024

Every so often at Griffin AI, we like to bring you some real deep thinking on the latest in machine learning, straight from the Griffin AI labs and the mind of one of our top tech team thinkers.

Step inside the thoughts of our devs in the lab and the kind of issues they grapple with as they strive to develop the newest tech on the planet: AI agents designed for DeFi, the very first in the world, along with a decentralized infrastructure for those agents to be deployed on. You think your head is spinning? Check out what our tech team spend their time on, below.

 

Thinking from the Griffin AI lab: What’s in a benchmark? 

When the question arises of how to measure a model's "intelligence," the first instinct is to test it in a manner similar to evaluating humans - through standardized school or specialized professional examinations. This has led to the emergence of numerous benchmarks in the world of LLMs. The approach is to take questions from a specific domain with multiple-choice answers, have the model take the test, and quickly obtain an automated verdict to understand the level of intelligence of the model. 
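As a rough illustration of that automated verdict (not any particular benchmark's official harness), scoring a multiple-choice test can be as simple as the sketch below; the question format and the `ask_model` callable are assumptions made for the example.

```python
# A minimal sketch of automated multiple-choice scoring (illustrative only).
# `ask_model` stands in for whatever inference call is actually used; the
# question format below is an assumption, not a specific benchmark's schema.
from typing import Callable

def score_multiple_choice(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """Each question is assumed to look like:
    {"question": "...", "choices": ["...", "...", "...", "..."], "answer": "B"}
    """
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", q["choices"]))
        prompt = (f"{q['question']}\n{options}\n"
                  "Answer with a single letter: A, B, C or D.")
        prediction = ask_model(prompt).strip().upper()[:1]  # keep only the first letter
        correct += prediction == q["answer"]
    return correct / len(questions)  # accuracy in [0, 1]
```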

This article sets out to answer one question: is there a universal method for evaluating the performance of LLMs? To this end, I will discuss the existing benchmarks and why one cannot rely on them alone, how the Chatbot Arena LLM Leaderboard operates, who AI trainers are, and whether one model can accurately evaluate another.

 

Benchmarks for evaluating generative model capabilities 

To construct benchmarks for evaluating generative models, one can utilize both human educational tests such as the OGE, EGE, AP, GRE, SAT, and professional exams like the Uniform Bar Exam, USMLE, and Certified Sommelier, as well as tests specifically created for LLMs: MMLU, GPQA, ARC, and many others. 

Likely, no article or press release on a new LLM version today omits mentioning these tests: GSM8K and MATH to assess mathematical abilities, HumanEval to evaluate coding skills, and DROP and RACE to measure text comprehension and question-answering capabilities. However, perhaps the most popular benchmark over the past year has been MMLU as a measure of general knowledge. It covers a wide range of human knowledge domains, from school-level mathematics to chemistry, physics, marketing, jurisprudence, and much more. The test comprises 57 topics and 16,000 questions, with a well-educated expert in the relevant field scoring around 90% on average. 

Furthermore, it is essential to remember that unlike humans, LLMs lack an internal representation of the world, making it more challenging for them to handle questions that seem obvious to us. To evaluate this aspect, there are numerous "common sense" benchmarks, such as COPA, PIQA, OpenBookQA, WinoGrande, and many others.

[Image: sample questions from common-sense benchmarks]

If a human were asked any of the questions in the image above, they would likely respond with bewilderment: how could such nonsensical questions be asked? For a model, however, none of these answers is obvious, and such tests are needed to gauge how reasonable the model's judgments are.

In addition to the aforementioned general academic benchmarks, one can create a test for any important skill. Many IT companies create their own internal benchmarks for this purpose and test their models on them. Our company is no exception: we are in the process of creating a benchmark for each skill that might prove useful across our projects.

 

Example question categories from our internal benchmarks:

  1. Knowledge of Specialized Facts: Beyond answering commonly accepted tests, it is important for an LLM to know specialized facts, including those from the cryptocurrency domain.
  2. Provocations: LLMs are characterized by a tendency to try to be as helpful as possible and always answer the user's questions, which can lead to a situation where the model confidently responds to unanswerable "gotchas" and hallucinates. To assess how easily our models can be provoked, we have compiled a special benchmark with provocative questions. 
  3. Following Formatting: There is a well-known English-language test called IFEval that evaluates whether a model can follow instructions about the format of a response, such as writing the answer in N sentences or starting every word with a given letter; a good internal benchmark should include similar formatting constraints, modeled on IFEval (a minimal sketch of such a check is shown after this list).
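Here is a minimal sketch of what an IFEval-style formatting check might look like; the specific constraint (N sentences, every word starting with a given letter) and the naive sentence-splitting logic are assumptions for illustration, not our production checker.

```python
import re

def check_format(answer: str, n_sentences: int, letter: str) -> bool:
    """IFEval-style constraint check: exactly `n_sentences` sentences and
    every word starting with `letter`. The sentence splitting here is naive
    and purely illustrative."""
    sentences = [s for s in re.split(r"[.!?]+\s*", answer.strip()) if s]
    if len(sentences) != n_sentences:
        return False
    words = re.findall(r"[A-Za-z']+", answer)
    return all(w.lower().startswith(letter.lower()) for w in words)

# Example: a two-sentence answer in which every word starts with "s"
print(check_format("Silent stars shine. Soft shadows settle slowly.", 2, "s"))  # True
```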

A very important nuance: when we create custom benchmarks, we must validate them carefully.

So, benchmarks are easy to interpret, make it easy to track progress and to compare different models, and you can create one for any specific skill. What could be wrong? Unfortunately, quite a lot.

The first and biggest problem is data leakage, also known as benchmark contamination. Of course, we are primarily talking here not about the intentional transfer of the validation sample into the training set, but about unintentional leaks: as a rule, part of the data used for pre-training is collected by crawling web articles.

Therefore, it is quite possible that some information from the benchmark itself or some auxiliary information useful for answering the question may be present on the web pages that ended up in the pre-training data. There are different approaches to assessing the degree of benchmark contamination for a specific model: for example, one is described in the article "Benchmarking Benchmark Leakage in Large Language Models". 

But the fact remains: many LLMs are susceptible to leaks to varying degrees. We too try to monitor benchmark contamination carefully: we have a model that finds text fragments similar to known benchmarks and ruthlessly cuts them out of the pre-training dataset. As a final check on the degree of contamination, we spin up a search index over the pre-training dataset, and experts manually search it for all kinds of reformulations of benchmark questions and material useful for answering them, assessing the degree of leakage.
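Our contamination tooling is more involved than this, but the core idea can be sketched as an n-gram overlap check between a benchmark item and a pre-training document; the n-gram size and the threshold below are arbitrary illustrative values.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a lower-cased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_item: str, document: str,
                       n: int = 8, threshold: float = 0.3) -> bool:
    """Flag a pre-training document that shares a large fraction of the
    benchmark item's n-grams: a crude proxy for near-duplicate leakage."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return False  # item shorter than n words, nothing to compare
    overlap = len(item_grams & ngrams(document, n)) / len(item_grams)
    return overlap >= threshold
```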

Static, classical benchmarks have long been the gold standard for assessing model quality, but their limitations are becoming increasingly obvious. Beyond leaks, sensitivity to the measurement setup, and overfitting to the test format, benchmarks become obsolete quite quickly. If a couple of years ago the best LLMs solved no more than half of the complex mathematical problems in the MATH benchmark, now many models score 85+, and in another year the benchmark will probably lose its relevance altogether, since most models will score highly on it.

But perhaps the most important thing to remember is that the score on the benchmark is not equivalent to the intelligence of the model. For example, many models already outperform humans on the MMLU test. Does this mean that all physicists, mathematicians, lawyers, and other experts can already be replaced by a model? 

The problem is that most static benchmarks assess the model's ability to solve problems it has already been trained on, rather than its intelligence: the ability to generalize, to develop genuine understanding of a task rather than merely reproduce learned knowledge, and to adapt and acquire new skills without intensive prior fine-tuning. Such a skills assessment is only a distant proxy for how the model will perform in real-world business scenarios.

 

The Chatbot Arena LLM leaderboard: Evaluating generative models 

So, benchmarks have quite strong limitations. What if, instead, users were given the opportunity to interact freely with models on any topic and blindly vote for whichever answer is better? This is the idea behind the LMSYS Chatbot Arena.

The Arena Leaderboard is set up simply: there is a question and two answer options, and you vote for one of them. Then it works roughly like a chess rating system: a model gains points for a win and loses them for a loss.
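As a minimal sketch of that chess-style scoring, here is an Elo-like update for a single pairwise vote; the Arena nowadays fits a Bradley-Terry model over all votes at once, but the intuition is the same, and the K-factor and ratings below are illustrative.

```python
def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """One blind vote between model A and model B. `score_a` is 1.0 if A wins,
    0.0 if A loses, 0.5 for a tie. The winner gains points, the loser gives
    them up; K controls how much a single vote can move the ratings."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: the lower-rated model A (1000) beats B (1100) and gains ~20 points.
print(elo_update(1000, 1100, 1.0))  # (~1020.5, ~1079.5)
```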

It seems that now we objectively understand that the higher the score on the leaderboard, the better the model. What could go wrong? 

 

What's wrong with the Arena leaderboard? 

As everywhere, there are also "buts" here. 

First, the narrow range of query topics. The Chatbot Arena is primarily visited by users who are interested in LLMs, which means they more often ask IT-related questions about code. Such visitors are unlikely to submit something like a restaurant-review analysis task with specific requirements and real review examples. Exceptions happen, but IT-related topics predominate.

Second, the type of tasks people bring to the Arena Leaderboard. If a person is asked to give a task to an LLM, the first thing that comes to mind is probably to ask the model a question or ask it to come up with something. Few people will pose a task like: "I have such a data labeling task {2-page instruction}, please perform it on {a set of real data}."

Another problem is bias towards format. It was highlighted not long ago by a heated debate: GPT-4o mini beating Claude on the leaderboard. People on X were outraged: "Claude solves all my tasks, while GPT-4o mini can't do anything for me!" The creators of the Arena Leaderboard even had to publish a sample dataset so that people could see how this happened.

What happened is this: it's not only what you say that matters to people, but also how you say it. This same rule applies to LLM responses. 

People like long, neatly formatted answers with lots of references, even if there's a mistake somewhere in the answer, even if the answer is of lower quality. As an attempt to at least partially mitigate such subjectivity, the Arena Leaderboard introduced a style adjustment. It was just after this style-based controversy and attempts to separate content and form that we saw Claude suddenly rise to the top, while GPT-4o mini dramatically fell. 
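Conceptually, such a style adjustment fits a pairwise preference model in which style features (for example, relative answer length) sit alongside the model identities, so that the model coefficients reflect preference with style partially factored out. The sketch below is a simplified illustration under those assumptions, not the Arena's actual implementation; the battle record format and field names are hypothetical.

```python
# Hypothetical battle records; the field names and the single length feature
# are assumptions for illustration. Real style control also uses markdown
# features (headers, lists, bold) and a Bradley-Terry formulation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_style_controlled(battles: list[dict], models: list[str]):
    idx = {m: i for i, m in enumerate(models)}
    X, y = [], []
    for b in battles:
        row = np.zeros(len(models) + 1)
        row[idx[b["model_a"]]] = 1.0                    # model shown as A
        row[idx[b["model_b"]]] = -1.0                   # model shown as B
        row[-1] = np.log(b["len_a"] / b["len_b"])       # style covariate: relative length
        X.append(row)
        y.append(1 if b["winner"] == "model_a" else 0)  # 1 = A won the vote
    clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
    strengths = dict(zip(models, clf.coef_[0][:-1]))    # style-adjusted model strengths
    length_bias = clf.coef_[0][-1]                      # how much sheer length sways voters
    return strengths, length_bias
```

A large positive length-bias coefficient would mean voters systematically favor longer answers regardless of who wrote them, which is exactly the effect the adjustment tries to separate from the model strengths.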

So, Arena Leaderboard users are not required to conduct a comprehensive assessment: to choose topics evenly and ask questions covering all kinds of skills and tasks. Nor are they required to understand the topic they are voting on or to conduct a detailed fact-check. They are free to choose the answer they simply like. What can be done? You can try to solve these problems with externally controlled, quality-assured annotation of answers to a fixed set of questions.

It's worth mentioning here a rather popular direction: LLM-as-a-Judge. The principle is the same: you fix a set of requests and define the "rules of the game" for deciding which answer is better and which is worse, but instead of expert assessment you use a powerful LLM (for example, GPT-4o) or even an ensemble of high-quality models. This saves a lot of budget: evaluating 1,000 tasks costs less than $10, plus it is much faster and you don't have to build a complex system of quality control and operational management.

 

Prompt example for LLM-as-a-Judge

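A minimal sketch of what such a pairwise judge prompt can look like is below; the wording, the verdict format, and the helper function are illustrative assumptions rather than the exact MT-Bench or Arena-Hard prompts.

```python
# A hypothetical pairwise judge prompt, loosely in the spirit of MT-Bench and
# Arena-Hard; the production prompts and verdict formats differ.
JUDGE_PROMPT = """You are an impartial judge. You will be shown a user question
and two answers, A and B. Decide which answer is more helpful, accurate and
complete. Do not let answer length, formatting, or the order of presentation
sway your judgment.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}

Reply with exactly one verdict: "A", "B" or "tie"."""

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the template for a single pairwise comparison."""
    return JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
```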

The MT-Bench and Arena-Hard benchmarks are built on this principle. Moreover, this method has rather high agreement with human preferences (89.1% for Arena-Hard-Auto-v0.1 with a GPT-4-1106-Preview judge), and it separates models of similar quality quite well (87.4% separability for Arena-Hard-Auto). Compared to classical static benchmarks, which quickly become obsolete, "leak", or become too easy for modern models, this approach lets you evaluate a model's quality quickly and cheaply, and quite close to user preferences. But even here, not everything is unicorns and rainbows.

Firstly, judge models are prone to "narcissistic bias": they prefer their own answers, or the answers of models trained on the judge model's outputs, to the answers of competitor models. This holds true for virtually all judge models.

Secondly, judge models, like humans, are biased towards the style of the answer (the so-called verbosity bias). Models prefer longer, more structured, and more detailed answers, even if an unnoticed error may creep into them. 

Thirdly, and most importantly and painfully for me as an analyst, LLMs are not the best fact-checkers. If a model cannot confidently distinguish truth from fiction, its assessments risk being superficial or misleading. When we rely on LLMs as arbiters, it is important to remember that their "confidence" in the answers does not always equal actual accuracy. 

 

Final thoughts 

Unfortunately, there is no "silver bullet", no single "correct" solution for how to evaluate LLMs. You will have to constantly look at the data, doubt, research, and regularly annotate data yourself.
