By: Justine Brooks
3 Dec, 2025
As AI models become more prevalent and capable, ensuring they are accurate, free from bias and aligned with human values is more important than ever. Organizations and governments lack standardized AI safety evaluation frameworks, and model capabilities are advancing by the day, making this a difficult task.
AI safety benchmarks are used to test whether frontier AI models (models that can perform a wide range of tasks, often at levels exceeding human performance) pose a risk to human beings, and whether they are helpful and accurate, before they are released to the public. This is an important step in AI development, as even a small failure in model performance can affect millions of people, leading to misinformation, bias and other harms. The recent International AI Safety Report update recognized improving evaluation methods as a key mechanism for addressing issues in AI safety.
Two Canada CIFAR AI Chairs at the Vector Institute, Wenhu Chen and Victor Zhong, are developing their own set of advanced evaluation methods, which are attracting significant industry attention for their adaptability and robustness. Through their commitment to developing better AI evaluation tools, they are helping to build a stronger AI ecosystem in Canada and establish the country as a leader in AI safety. This leadership is an important step for building public trust and driving the widespread adoption of AI.
For Wenhu Chen, an assistant professor at the University of Waterloo, one of the major problems with current benchmarking sets is their lack of diverse topics. Most benchmarks today focus on a few common subjects, like math and coding, leaving many other topics we expect AI to navigate underrepresented.
The solution, Chen says, is not only to pay more attention to these niche areas and integrate them into existing evaluations, but also to diversify the sets of evaluations used. The majority of benchmarks today are nearly identical, so testing a model on multiple similar sets becomes redundant. Chen’s work aims to help developers diversify their evaluation benchmarks to cover more capabilities, domains and skill sets.
Beyond this, Chen is addressing the possibility of models correctly guessing answers on evaluations. “There are cases where the model doesn’t know how to solve the problem, but it has the instinct that one option is better than the other, so it still picks the correct answer. We felt like this problem is pretty serious. It can lead to overestimation of the model’s capability,” Chen says.
This is an issue he addressed in his benchmark suite, MMLU-Pro. Previously, the benchmarks offered only four options to choose from; Chen and his team increased the difficulty by presenting AI models with ten options in total. “The likelihood of the model guessing it correctly is significantly lowered,” he explains. This approach improved the benchmarks considerably, and they are now widely adopted by major companies like OpenAI, Google and Anthropic.
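To see why the number of options matters, consider random guessing as a baseline: with four options, a model that guesses blindly still scores about 25 per cent, while with ten options that floor drops to about 10 per cent. The short Python sketch below simulates this baseline; it is an illustration of the arithmetic, not code from MMLU-Pro.

```python
import random

def random_guess_accuracy(num_options: int, num_questions: int = 100_000) -> float:
    """Simulate a model that guesses uniformly at random on multiple-choice questions."""
    correct = 0
    for _ in range(num_questions):
        answer = random.randrange(num_options)   # the ground-truth option
        guess = random.randrange(num_options)    # a blind, uniform guess
        correct += (guess == answer)
    return correct / num_questions

# Expected accuracy from guessing alone: ~25% with 4 options, ~10% with 10.
print(f"4 options:  {random_guess_accuracy(4):.1%}")
print(f"10 options: {random_guess_accuracy(10):.1%}")
```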
Improving evaluations will allow developers to make more informed decisions on the training and safety of their models, says Chen. This ultimately leads to the creation of better and safer products for the end user.
Modern foundation models quickly surpass static benchmarks, making it crucial for developers like Victor Zhong, an assistant professor at the University of Waterloo, to create dynamic benchmarks that evolve with the models’ increasing capabilities. While static benchmarks provide a fixed set of tasks to be evaluated on, dynamic benchmarks are designed to prevent models from memorizing or optimizing for those specific challenges.
This need for more robust evaluations led Zhong to create OSWorld, which uses a virtual machine to evaluate AI models on more general and realistic scenarios. “One great test bed for the capabilities of these models is to actually just use the computer as humans would,” Zhong explains. Within OSWorld, models perform open-ended tasks typically done on a computer, like browsing the internet, running software and creating documents.
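In broad terms, computer-use evaluations of this kind run an agent loop: the model observes the machine's current state (for example, a screenshot or an accessibility tree), proposes an action, the action is executed inside the virtual machine, and a task-specific check decides whether the goal was reached. The sketch below is a simplified, hypothetical version of such a loop; the class and function names are placeholders, not OSWorld's actual API.

```python
from dataclasses import dataclass

# Hypothetical interfaces for illustration only; the real benchmark's API differs.
@dataclass
class Observation:
    screenshot: bytes          # pixels of the current desktop
    accessibility_tree: str    # structured description of on-screen elements

class VirtualMachine:
    """Placeholder wrapper around a sandboxed desktop environment."""
    def reset(self, task_config: dict) -> Observation: ...
    def step(self, action: str) -> Observation: ...
    def evaluate(self, task_config: dict) -> bool:
        """Task-specific check, e.g. 'was the document saved with the right contents?'"""
        ...

def run_episode(agent, vm: VirtualMachine, task_config: dict, max_steps: int = 30) -> bool:
    """Let the agent act in the VM until it declares it is done or the step budget runs out."""
    obs = vm.reset(task_config)
    for _ in range(max_steps):
        action = agent.act(obs)        # e.g. a click, a keystroke, or a shell command
        if action == "DONE":
            break
        obs = vm.step(action)
    return vm.evaluate(task_config)    # success is judged by the environment, not the agent
```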
He and his team then developed what is called a ‘Computer Agent Arena,’ where one group of people submits instructions for an AI model and another group evaluates its performance. This creates a ‘living’ benchmark, giving OSWorld the capability to continuously adapt and improve.
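Arena-style leaderboards of this kind commonly turn human judgments of head-to-head model comparisons into a continuously updated rating, similar to Elo ratings in chess. Whether Computer Agent Arena uses exactly this scheme is an assumption here; the snippet below is a generic illustration of how one such rating update could work.

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """One Elo-style update after a human judges agent A vs. agent B on the same task."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: two agents start at 1000; A completes the task better, so A gains rating.
r_a, r_b = elo_update(1000.0, 1000.0, a_won=True)
print(round(r_a), round(r_b))   # 1016 984
```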
His approach is setting a new standard for benchmarking techniques, so much so that OSWorld is now the primary benchmark used by OpenAI and Anthropic. His contributions to the field have a global impact, creating a more effective framework for measuring the progress of advanced AI models.