AI and Society

Building safer AI with advanced evaluation methods

By: Justine Brooks
December 3, 2025
Wenhu Chen and Victor Zhong

How Canada CIFAR AI Chairs are creating a new barometer for AI safety

As AI models become more prevalent and capable, ensuring they are accurate, free from bias and aligned with human values is more important than ever. Organizations and governments lack standardized AI safety evaluation frameworks, and model capabilities are advancing by the day, making this a difficult task.

AI safety benchmarks are used to test whether frontier AI models (models that can perform a wide range of tasks that often exceed human performance) pose a risk to human beings and whether they are helpful and accurate before being released to the public. This is an important step for AI development, as even a small failure in model performance can affect millions of people, leading to misinformation, bias and more. The recent International AI Safety Report update recognized this need to improve evaluation methods as a key mechanism for addressing issues in AI safety.

Two Canada CIFAR AI Chairs at the Vector Institute, Wenhu Chen and Victor Zhong, are developing their own sets of advanced evaluation methods, which are attracting significant industry attention for their adaptability and robustness. Through their commitment to developing better AI evaluation tools, they are helping build a stronger AI ecosystem in Canada and establishing the country as a leader in AI safety. This leadership is an important step toward building public trust and driving the widespread adoption of AI.

Wenhu Chen, MMLU-Pro

For Wenhu Chen, an assistant professor at the University of Waterloo, one of the major problems with current benchmark sets is their lack of topic diversity. Most benchmarks today focus on a few common subjects, like math and coding, leaving many other areas we expect AI to navigate underrepresented.

The solution, Chen says, is not only to pay more attention to these niche areas and integrate them into existing evaluations, but also to diversify the sets of evaluations used. The majority of benchmarks today are nearly identical, so testing a model on multiple, similar sets becomes redundant. Chen’s work aims to help developers diversify their evaluation benchmarks to cover more capabilities, domains and skill sets.

Beyond this, Chen is addressing the possibility of models correctly guessing answers on evaluations. “There are cases where the model doesn’t know how to solve the problem, but it has the instinct that one option is better than the other, so it still picks the correct answer. We felt like this problem is pretty serious. It can lead to overestimation of the model’s capability,” Chen says.

This is an issue he addressed in his set of benchmarks, MMLU-Pro. Where previous benchmarks offered only four options to choose from, Chen and his team increased the difficulty by presenting AI models with ten options in total. “The likelihood of the model guessing it correctly is significantly lowered,” he explains. This approach improved the benchmarks considerably, and they are now widely adopted by major companies like OpenAI, Google and Anthropic.
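The arithmetic behind this change is straightforward: blind guessing succeeds one time in four with four options, but only one time in ten with ten. The short sketch below, using an illustrative (assumed) share of genuinely known questions rather than any real MMLU-Pro figure, shows how much guessing can inflate a score under each format.

```python
# Minimal sketch of how the number of answer options affects guess-inflated scores.
# The 60% "known questions" figure is an illustrative assumption, not an
# MMLU-Pro statistic.
def expected_score(known_fraction: float, num_options: int) -> float:
    """Accuracy if a model answers the questions it knows correctly and
    guesses uniformly at random on the rest."""
    guess_rate = 1.0 / num_options
    return known_fraction + (1.0 - known_fraction) * guess_rate

for options in (4, 10):
    print(f"{options} options: {expected_score(0.60, options):.1%}")

# 4 options:  70.0%  -> random guessing adds 10 points to the true 60%
# 10 options: 64.0%  -> the same guessing adds only 4 points
```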

Improving evaluations will allow developers to make more informed decisions on the training and safety of their models, says Chen. This ultimately leads to the creation of better and safer products for the end user.

Victor Zhong, OSWorld

Modern foundation AI models quickly surpass static benchmarks, making it crucial for developers like Victor Zhong, an assistant professor at the University of Waterloo, to create dynamic benchmarks that evolve alongside the models’ increasing capabilities. While static benchmarks provide a fixed set of tasks to be evaluated on, dynamic benchmarks are designed to prevent models from memorizing or optimizing for those specific challenges.

This need for more robust evaluations led Zhong to create OSWorld, which uses a virtual machine to evaluate AI models in more general and realistic scenarios. “One great test bed for the capabilities of these models is to actually just use the computer as humans would,” Zhong explains. Inside the virtual machine, a model can perform open-ended tasks typically done on a computer, like browsing the internet, running software and creating documents.
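In broad strokes, an evaluation of this kind runs an agent inside a sandboxed machine and then checks the machine’s final state rather than grading a text answer. The sketch below illustrates that loop; every class and method name is a hypothetical placeholder, not the actual OSWorld API.

```python
# Illustrative sketch of a VM-based agent evaluation loop in the style the
# article describes. All interfaces here are hypothetical stubs, not the
# real OSWorld code.
from dataclasses import dataclass
from typing import Callable, Protocol

class VirtualMachine(Protocol):
    def reset(self) -> None: ...
    def screenshot(self) -> bytes: ...           # observation the agent sees
    def execute(self, action: str) -> None: ...  # mouse/keyboard/shell action

class Agent(Protocol):
    def act(self, instruction: str, observation: bytes) -> str: ...

@dataclass
class Task:
    instruction: str                          # e.g. "rename the downloaded report"
    check: Callable[[VirtualMachine], bool]   # verifies the final machine state

def evaluate(agent: Agent, vm: VirtualMachine, tasks: list[Task], max_steps: int = 20) -> float:
    """Fraction of tasks whose end state passes the task's checker."""
    solved = 0
    for task in tasks:
        vm.reset()
        for _ in range(max_steps):
            action = agent.act(task.instruction, vm.screenshot())
            if action == "DONE":
                break
            vm.execute(action)
        solved += task.check(vm)   # score the resulting state, not the transcript
    return solved / len(tasks)
```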

He and his team then developed what is called a ‘Computer Agent Arena,’ where one group of people submits instructions for an AI model and another group evaluates its performance. This creates a ‘living’ benchmark, giving OSWorld the capability to continuously adapt and improve.
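One way to picture such a living benchmark is as a pool of crowd-submitted instructions that only enter the evaluation set once enough reviewers have scored the agent’s attempts. The data model below is an assumption for illustration, not Zhong’s actual implementation.

```python
# Hypothetical sketch of a crowd-driven, continuously growing benchmark.
# Field names and the rating scheme are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class ArenaItem:
    instruction: str                  # submitted by a task author
    agent_transcript: str = ""        # what the agent actually did
    ratings: list[int] = field(default_factory=list)  # reviewer scores, e.g. 1-5

class LivingBenchmark:
    def __init__(self) -> None:
        self.items: list[ArenaItem] = []

    def submit(self, instruction: str) -> ArenaItem:
        item = ArenaItem(instruction)
        self.items.append(item)
        return item

    def rate(self, item: ArenaItem, score: int) -> None:
        item.ratings.append(score)

    def evaluation_set(self, min_ratings: int = 3) -> list[ArenaItem]:
        # Only well-reviewed items graduate into the benchmark, so the task
        # pool keeps expanding as new instructions and ratings arrive.
        return [i for i in self.items if len(i.ratings) >= min_ratings]
```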

His approach is setting a new standard for benchmarking techniques, so much so that it is now the primary benchmark used by OpenAI and Anthropic. His contributions to the field have a global impact, creating a more effective framework for measuring the progress of advanced AI models.
