ChatGPT, Copilot, Grok, Mistral, NotebookLM: artificial intelligence tools are becoming widespread in schools, universities, and businesses. Until now, however, users have had no simple way to gauge how reliable these educational chatbots are. AI Score, a new tool developed by researchers at UNamur, fills this gap by measuring the educational reliability of chatbots. "AI Score is to chatbots what the speedometer was to cars," says Professor Michaël Lobet, one of the authors of the research. "The arrival of the automobile at the beginning of the 20th century revolutionized usage... but it was the invention of the speedometer that made it a controlled and reliable tool. Today, chatbots used in education and in business are at a similar stage: powerful and exciting, but without reliable control instruments. The AI Score aims to be that speedometer," he explains.

In the same way that NutriScore, EcoScore, and PEB energy-performance certification help citizens make informed choices, AI Score provides a simple, at-a-glance reading of the level of trust that can be placed in a chatbot. "At a time when trust in generative AI is becoming a societal issue, AI Score guides teachers and companies in their choice of tools to put in the hands of their students or customers," says Dr. Miguël Dhyne, scientific collaborator at UNamur, educator, and physics researcher. "It can also help institutions evaluate AI solutions before deployment or verify their reliability over time," he adds.

A scientific method that is rigorous and accessible to all

The AI Score evaluates four essential dimensions:

  • Initial performance: does the AI respond correctly the first time?
  • Robustness: does it maintain its response when questioned?
  • Self-correction ability: does it recognize and correct its mistakes?
  • Unreliability: does it contradict itself or lose track of the conversation?

To evaluate these four dimensions, each chatbot is subjected to a test carried out under identical conditions to ensure fairness and comparability. 

The chatbots are first given a set of 10 multiple-choice questions, carefully selected to expose any errors or hesitations in the AI. The questions are designed to be discriminating, with a balanced level of difficulty.

After each response, the chatbot is prompted again to check whether it maintains its position, admits an error, or contradicts itself.

Since a chatbot does not always respond in the same way from one attempt to the next, the full test is repeated five times, so that the result reflects the chatbot's actual capabilities rather than chance.

These criteria are based on the standards ISO/IEC TR 24028:2020 and ISO/IEC 42001:2023.

Each model tested is then given an overall score and a letter grade, similar to the scores used in food or energy ratings.
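As an illustration only, the protocol described above can be sketched in code. The function names, the equal weighting of the four dimensions, and the grade thresholds below are assumptions made for the sketch; the actual scoring formula is the subject of the team's submitted publication and is documented at aiscore.academy.

```python
import statistics

# Hypothetical sketch of the AI Score protocol; weights and thresholds are assumed.
QUESTIONS = [f"Q{i}" for i in range(1, 11)]  # 10 discriminating multiple-choice questions
RUNS = 5                                     # the full test is repeated five times

def score_run(ask, challenge):
    """One pass over the questionnaire.

    ask(q) -> (answer, correct: bool) is the chatbot's first response;
    challenge(q, answer) -> "maintained" | "corrected" | "contradicted"
    describes its behaviour when prompted again.
    """
    n = len(QUESTIONS)
    first = kept = kept_possible = fixed = errors = contradictions = 0
    for q in QUESTIONS:
        answer, correct = ask(q)
        behaviour = challenge(q, answer)
        if correct:
            first += 1
            kept_possible += 1
            if behaviour == "maintained":
                kept += 1            # robustness: a correct answer survives the challenge
        else:
            errors += 1
            if behaviour == "corrected":
                fixed += 1           # self-correction: an error is acknowledged and fixed
        if behaviour == "contradicted":
            contradictions += 1      # unreliability: the chatbot loses coherence
    performance = first / n
    robustness = kept / kept_possible if kept_possible else 1.0
    self_correction = fixed / errors if errors else 1.0
    reliability = 1 - contradictions / n
    # Equal weighting of the four dimensions is an assumption of this sketch.
    return 100 * (performance + robustness + self_correction + reliability) / 4

def ai_score(ask, challenge):
    """Average over five independent runs, as the protocol prescribes."""
    return statistics.mean(score_run(ask, challenge) for _ in range(RUNS))

def letter_grade(score):
    """Illustrative NutriScore-style bands; the real thresholds may differ."""
    for threshold, grade in ((90, "A"), (75, "B"), (60, "C"), (45, "D")):
        if score >= threshold:
            return grade
    return "E"
```

In this sketch, a chatbot that answers every question correctly and holds its position under challenge scores 100 and earns an A, while contradictions and unretracted errors pull the score down across the relevant dimensions.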


It has recently been demonstrated that methods for ranking large language models based on crowd preference votes (LLM leaderboards such as Chatbot Arena) are not robust: changing even a few votes can reorder the rankings. Malicious votes, evaluation biases, popularity effects, or data leaks can therefore distort the rankings on which business, investment, marketing, and educational and technical decisions are based. "In contrast, the AI Score offers a robust, reliable, and transparent method that anyone can apply independently to judge the relevance of the platform they test," add the Namur researchers.

An open, free tool that can be used today

The AI Score is available free of charge to the general public, teachers, journalists, institutions, and anyone wishing to objectively compare the performance of chatbots: https://aiscore.academy

The site offers:

  • free access to the methodology,
  • examples of scores,
  • educational resources,
  • and, soon, enhanced documentation based on feedback from early users.

Making the protocol available to the general public makes it easy to reproduce and apply to different models. The researchers therefore invite the community to familiarize themselves with the tool and contribute to its improvement. The AI Score is an open initiative, designed to evolve and improve continuously based on user feedback.

A 100% Belgian innovation, supported by the University of Namur

The AI Score was developed by a multidisciplinary team of researchers at UNamur: 

  • Prof. Michaël Lobet: F.R.S.-FNRS qualified researcher at the University of Namur and Professor in the Department of Physics. He is also affiliated with Harvard University.
  • Dr. Miguël Dhyne: scientific collaborator at UNamur, educator and researcher in physics, expert in educational innovation, EdTech, and educational AI. His role is to design practical solutions and train teachers in the use of digital tools.
  • Laurence Dumortier: holder of a PhD in Mathematical Sciences from the University of Namur, IT specialist at the TICE Unit (UNamur/FaSEF). She also supports teachers in mastering educational technologies.
  • Jean-Roch Meurisse: IT specialist at the TICE Unit (UNamur/FaSEF), where he is responsible for co-administering and developing the institutional LMS. He is responsible for supporting teachers and researchers in the selection, implementation, and development of digital teaching tools. 

The method has been submitted for scientific publication and is currently under peer review.

UNamur as a player in technological development

The University of Namur is establishing itself as a key player in artificial intelligence (AI) by integrating this technology into its teaching programs, conducting cutting-edge research on the subject, and putting its expertise at the service of society.