Maybe Our Truths Are Not The Only Ones: Provocative Questions Around Benchmarking
If we are talking about transformative AI, then we should at least check what “transformative” means: “causing a marked change in someone or something.” If we are discussing transformation, we must be cautious with our assumptions: we need to revisit them repeatedly, slowing down and critically examining our perspectives. This is particularly important given the nature of the AI field, which has its roots in academia, within a scientific context. Science is cumulative; ideas build upon one another. Community participation, seeking diverse perspectives, and approaches such as community facilitation and deep democracy can be extremely helpful here. To practice opening our minds and hearts to different perspectives and truths, and to double-check our assumptions, I have prepared some provocative thoughts. Here are some questions to consider!
- The Donkey Work Dilemma: Rethinking AI Thinking Benchmarks
- What we are teaching machines to classify and categorize is a human way of thinking developed after “modernity.”
- Collaborative Benchmarks: A New Approach to AI Evaluation
- Rethinking Benchmarks: Evaluating AI Without Traditional Training Signals
- Whose Ideas Do We Follow?
- Conclusion
The Donkey Work Dilemma: Rethinking AI Thinking Benchmarks
Alan Turing once referred to thinking as “donkey work”:
«As soon as one can see the cause and effect working themselves out in the brain, one regards it as not being thinking but a sort of unimaginative donkey work.»
He went on to say:
«From this point of view, one might be tempted to define thinking as consisting of those mental processes that we don’t understand. If this is right, then to make a thinking machine is to make one which does interesting things without really understanding quite how it is done.»
Does this assumption by Turing hold true? Can we ignore how thinking works and simply evaluate a machine’s output by comparing it to human thinking? Consider a provocative analogy that questions Turing’s perspective on machine intelligence:
In 1967, psychologist Paul Ekman traveled to Papua New Guinea to test his theory that human emotions are universal by studying the isolated Fore people. Despite initial frustrations due to cultural and communication barriers, Ekman’s work laid the foundation for the modern affect recognition industry, which uses AI to analyze facial expressions to infer emotions. Ekman’s research suggested that specific facial expressions correspond to universal emotions, and his Facial Action Coding System (FACS) provided a standardized method for categorizing these expressions, which later became crucial for training AI systems.

However, subsequent research, including a comprehensive review in 2019, has shown that facial expressions are not reliable indicators of emotional states, challenging the foundational assumptions of affect recognition technology.

Despite these scientific doubts, the AI industry has embraced affect recognition, driven by military, security, and commercial interests; lacking robust scientific support, the industry has nonetheless grown significantly and is used in applications ranging from security to hiring. The approach fits well with the capabilities of emerging computer vision technologies and has been promoted by institutions with vested interests in its validity. Critics argue that this simplification of emotions for computational purposes overlooks the complexities and cultural nuances of human emotional expression. The persistence of this flawed approach raises ethical concerns and highlights the need for more nuanced and scientifically grounded methods in emotion recognition.
As with the example of emotion recognition systems, it is dangerous to make assumptions and build algorithms on these assumptions without scientifically grounded methods. The benchmarks we create to measure the intelligence of both humans and machines can fail if we simply equate intelligence with “donkey work.” How can we measure intelligence accurately without first understanding what intelligence truly is?
What we are teaching machines to classify and categorize is a human way of thinking developed after “modernity.”
In the late 1920s and early 1930s, the Soviet Union implemented rapid social and economic changes in remote regions like present-day Uzbekistan and Kyrgyzstan, transforming individual farming and herding into collective farms and promoting industrial development. This revolution led to an interconnected economy and educational reforms, introducing literacy and abstract thinking to previously illiterate populations. Psychologist Alexander Luria saw this transformation as a natural experiment to study cognitive changes.

Luria’s research revealed that exposure to modern education and collective work fostered abstract thinking in villagers. Those with some formal education could group colors and shapes by abstract categories, while premodern villagers described them in practical terms based on their daily experiences. For instance, they couldn’t conceptually group a map and a watch, seeing them as fundamentally different. Luria’s work showed that modernity enabled people to use «eduction»—drawing out principles from given facts—even without prior experience. Premodern individuals, however, relied on concrete experiences and struggled with abstract reasoning tasks, such as those on Raven’s Progressive Matrices tests.

Luria also found that most remote villagers were not subject to the same optical illusions as citizens of the industrialized world, like the Ebbinghaus illusion, in which two identical central circles are surrounded by rings of smaller or larger circles and one appears bigger. If one of them looks bigger to you, you’re probably a citizen of the industrialized world. The remote villagers saw, correctly, that the circles were the same size, while the collective farmers and the women in the teachers’ school fell for the illusion.
What we are teaching machines is primarily classification and categorization. This research shows us that categorization and classification are practices that human beings began to emphasize after modernity; it was not our only way to understand reality. With this perspective, I cannot help but ask, what kind of Artificial Intelligence systems would our ancestors have created? What would be the core functions of these systems other than classification and categorization? Would benchmarks exist in such a setting, and if so, how?
Collaborative Benchmarks: A New Approach to AI Evaluation
The article “Rethinking Competition in the AI Race” discusses the shift in tech competition from social networks to AI models and suggests that collaboration, rather than intense rivalry, might be more beneficial in the long run. The classic Prisoner’s Dilemma illustrates how rational players tend to defect, leading to suboptimal outcomes. However, Robert Axelrod’s variation, in which players interact repeatedly, showed that cooperation («tit for tat») is the best long-term strategy. Using this analogy, the article argues that the industry’s focus on rapid innovation and short-term gains can degrade user experience and ethical standards, while cooperation can build long-term trust and sustainability, especially for social media platforms facing «enshittification».
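Axelrod’s iterated setup is straightforward to simulate. The sketch below is a minimal illustration, not code from the article; the payoff values are the standard textbook ones (mutual cooperation 3, mutual defection 1, defecting against a cooperator 5):

```python
# Minimal sketch of an Axelrod-style iterated Prisoner's Dilemma.
# "Tit for tat" cooperates on the first round, then mirrors the
# opponent's previous move.

PAYOFFS = {  # (my move, their move) -> my payoff
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's last move."""
    return "C" if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    """The 'rational' single-shot strategy: defect every round."""
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    """Return the total payoffs of two strategies over repeated rounds."""
    seen_by_a, seen_by_b = [], []  # each side's record of opponent moves
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(seen_by_a)
        move_b = strategy_b(seen_by_b)
        score_a += PAYOFFS[(move_a, move_b)]
        score_b += PAYOFFS[(move_b, move_a)]
        seen_by_a.append(move_b)
        seen_by_b.append(move_a)
    return score_a, score_b
```

Two tit-for-tat players sustain cooperation at 3 points per round each, while two defectors lock into the worse 1-point-per-round outcome, which is the intuition behind cooperation as the best long-term strategy.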
Coming from the fields of alternative economies and gift culture, I can assure you that competition is not the only truth of human existence. However, it seems we often confine our thinking to competitive narratives. In OpenAI’s paper titled «AI Safety via Debate,» the researchers explore how to make AI systems effective for complex real-world tasks. They argue that AI must learn human goals and preferences, but direct human judgment can be challenging for intricate tasks. To address this, they propose training agents through self-play in a zero-sum debate game, where two agents present arguments, and a human judges which agent provides the most truthful and useful information. Once again, we see a leading company employing a zero-sum game perspective in the theoretical design of an automated evaluation mechanism.
Benchmarks in the AI field are not only used as evaluation methods; they also, with existing leaderboards, become symbols of fierce competition. With these analogies in mind, how can we create new benchmarks with cooperation at their core to help us build long-term trust and sustainability?
Rethinking Benchmarks: Evaluating AI Without Traditional Training Signals
Training a machine learning (ML) system for a given task requires evaluating its performance using a training signal, which can take the form of labels, rewards, or other feedback mechanisms. For tasks that can be evaluated automatically, such as winning a game of Go or chess, generating a training signal is straightforward and algorithmically determined. In these cases, the training signal is described as «algorithmic.» However, most real-world tasks lack such algorithmic training signals and require human input. Humans provide training signals by demonstrating tasks, labeling data, or assigning rewards based on their judgments, which is known as a «human» training signal. For more complex tasks, obtaining a meaningful training signal is particularly challenging. Tasks like making economic policy decisions, advancing scientific knowledge, or managing large computer networks are beyond the capabilities of a single human to perform or judge adequately due to their complexity and vast observation spaces. While long-term feedback, such as evaluating economic growth over several years, is possible, it is slow and impractical for efficient learning. Currently, there is no effective method to train ML systems to perform these tasks significantly better than humans.
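The distinction above can be made concrete with a toy sketch. The function names and scoring scheme here are hypothetical, purely for illustration of the two signal sources:

```python
# Illustrative contrast between the two training-signal sources
# discussed above; names and values are hypothetical.

def algorithmic_signal(game_outcome):
    """Tasks like Go or chess: the environment itself scores the
    outcome, so the training signal can be computed automatically."""
    if game_outcome == "win":
        return 1.0
    if game_outcome == "loss":
        return -1.0
    return 0.0  # draw or unfinished game

def human_signal(model_output, human_ratings):
    """Most real-world tasks: people demonstrate, label, or rate the
    output, and those judgments become the training signal."""
    # Here the signal is simply the mean of human ratings in [0, 1].
    return sum(human_ratings) / len(human_ratings)
```

For tasks such as economic policy, neither source works well: no algorithm can score the outcome, no single human can judge it adequately, and long-term feedback arrives too slowly to learn from.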
In David Epstein’s book, Range: Why Generalists Triumph in a Specialized World, he recounts how IBM’s supercomputer Deep Blue defeated chess grandmaster Garry Kasparov in 1997 by evaluating 200 million positions per second, demonstrating the tactical prowess of machines. Kasparov acknowledged that modern chess programs surpass his skills, emphasizing that machines can outperform humans in tasks that can be codified. This defeat led Kasparov to recognize Moravec’s paradox, which highlights the opposite strengths and weaknesses of machines and humans: while computers excel at tactics—short-term moves and combinations—humans are superior in strategic, big-picture thinking.

This realization inspired Kasparov to pioneer «advanced chess,» where humans teamed up with computers. In this collaboration, machines handled the tactical calculations, allowing humans to focus on strategy. This approach leveled the playing field and enhanced human creativity in the game. The concept evolved into «freestyle chess,» where teams of humans and multiple computers competed, nullifying the advantage of years of specialized practice. A duo of amateurs with standard computers even outperformed top grandmasters and supercomputers.

Kasparov concluded that the winning teams excelled at directing the computers to analyze specific aspects of the game, combining human strategic oversight with machine precision. This partnership, known as «centaur» teams, represents the highest level of chess play and emphasizes the unique human contribution to open-world strategy tasks. The success of these teams illustrates that as tasks shift towards broader strategic contexts, human insight and decision-making become increasingly valuable.
With these analogies in mind, how can we define new benchmarks when there is no training signal to evaluate the model’s performance?
Whose Ideas Do We Follow?
Turing was, according to Wikipedia, a very capable marathon runner. Could this ability of his have shaped the rules by which we envision and evaluate our AI systems?
“While working at Bletchley, Turing, who was a talented long-distance runner, occasionally ran the 40 miles (64 km) to London when he was needed for meetings, and he was capable of world-class marathon standards. Turing tried out for the 1948 British Olympic team, but he was hampered by an injury. His tryout time for the marathon was only 11 minutes slower than British silver medallist Thomas Richards’ Olympic race time of 2 hours 35 minutes. He was Walton Athletic Club’s best runner, a fact discovered when he passed the group while running alone. When asked why he ran so hard in training he replied: «I have such a stressful job that the only way I can get it out of my mind is by running hard; it’s the only way I can get some release.»”
As we see in the article The Olympics of AI: Benchmarking Machine Learning Systems, the marathon analogy continues to serve us in benchmarking AI systems. The article discusses how running a mile in under four minutes was once considered not just a daunting challenge but, by many, an impossible feat. In 1954, Sir Roger Bannister, at a track in Oxford, England, completed the mile in 3 minutes 59.4 seconds, shattering the threshold and making history. «Hence, while benchmarks are tools for comparison and competition, their true value lies in their ability to unite a community around a shared vision. Much like Bannister’s run didn’t just break a record but redefined athletic potential, a well-conceptualized benchmark can elevate an entire discipline, shifting paradigms and ushering in new eras of innovation.»
Human beings have an incredible capacity to accomplish goals. Especially when working towards the same goal together, we can create miracles, as we all know. For decades, human beings have been following the visions of pioneers such as John McCarthy and Alan Turing. It is also true that with the transition from rule-based models to statistical models, the value system of capitalist economics and extraction was already in place. As Kate Crawford says in Atlas of AI:
“The AI industry has fostered a kind of ruthless pragmatism, with minimal context, caution, or consent-driven data practices while promoting the idea that the mass harvesting of data is necessary and justified for creating systems of profitable computational ‘intelligence.’ This has resulted in a profound metamorphosis, where all forms of image, text, sound, and video are just raw data for AI systems and the ends are thought to justify the means. But we should ask: Who has benefited most from this transformation, and why have these dominant narratives of data persisted? …The logic of extraction that has shaped the relationship to the earth and human labor is also a defining feature of how data is used and understood in AI.”
Which values do we want to adopt as our new north stars, and what are our future visions? Across these many different future visions, what role can benchmarks play?
Conclusion
In the first article about the history of benchmarks, «The Benchmarking Paradox: Measuring Depth and Surface in AI,» we concluded by asking, «Can benchmarks embrace both depth and surface?» While finalizing this one, I pose more provocative questions: Do benchmarks prevent us from going deeper into meaning? How can we redesign benchmarks, create algorithms that serve everyone and help humanity flourish, convert big data into deep data, and support us in discovering the true meaning of being human and becoming our better selves? Finding answers to these questions could lead to AI systems that not only perform better but also contribute to a more meaningful and equitable human experience. I invite you to share your thoughts and answers to these questions as we collectively explore the future of AI and its impact on our world.