TORONTO — Major firms including Manulife and RBC are now using their own internal tests to better evaluate the performance of new AI models in an attempt to cut through the industry hype.
While AI developers regularly tout the performance of their models on popular third-party evaluations, executives say such assessments don’t reveal whether the systems are well suited to their business needs. “The benchmarks can actually lead you astray,” said Jodie Wallis, global chief AI officer at Manulife.
Manulife’s tech team has built its own set of tests based on 47 actual use cases in areas like customer service and risk assessment, Wallis said. The program lets the firm plug the latest releases from AI labs into its existing tools, using its own data. That cuts the time to evaluate a new model down from weeks to minutes, and allows it to more quickly adapt when better technology becomes available, she claimed.
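Manulife hasn’t published the details of that harness, but the pattern Wallis describes, a common interface that any new model can be plugged into and scored against a fixed set of in-house test cases, is straightforward to sketch. The Python snippet below is a minimal, hypothetical illustration; every class and function name is an assumption, not Manulife’s code.

```python
# Minimal sketch of an internal evaluation harness of the kind described
# above: candidate models are swapped in behind a common interface and
# scored against a fixed set of in-house test cases. All names here are
# illustrative; Manulife has not published its implementation.
from dataclasses import dataclass
from typing import Callable, Protocol


class ChatModel(Protocol):
    """Any candidate model exposes one method: prompt in, text out."""
    def complete(self, prompt: str) -> str: ...


@dataclass
class UseCase:
    name: str                      # e.g. "claims triage summary"
    prompt: str                    # built from the firm's own data
    grade: Callable[[str], bool]   # domain-specific pass/fail check


def evaluate(model: ChatModel, cases: list[UseCase]) -> float:
    """Return the fraction of in-house test cases the model passes."""
    passed = sum(case.grade(model.complete(case.prompt)) for case in cases)
    return passed / len(cases)


def compare(models: dict[str, ChatModel], cases: list[UseCase]) -> None:
    """Run every candidate through the same cases and print its score."""
    for name, model in models.items():
        print(f"{name}: {evaluate(model, cases):.1%} of cases passed")
```

Because the test cases and grading functions stay fixed, evaluating a newly released model only requires another complete() implementation, which is what turns a weeks-long assessment into a repeatable run measured in minutes.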
Talking Points
- Large firms like Manulife and RBC have developed their own benchmarks for testing AI models to work out how well suited they are to their businesses
- Tech firms often tout how well their models perform against popular benchmarks, but mathematical or knowledge exams don’t always reflect how well they do on real business tasks
Model makers typically cite the performance of their products on popular tests like MMLU, which measures multiple-choice accuracy across a wide range of academic subjects; MATH, which assesses mathematical problem-solving; or BIRD-SQL, which tests how well models translate natural-language questions into database queries. Developers also watch AI models’ scores on university admissions exams.
Manulife doesn’t need its tools to ace the LSAT or the MCAT, but it does, Wallis said, want models that deliver the most accurate results at the lowest cost for its in-house applications.
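That trade-off, accuracy on the firm’s own tasks versus the cost of running the model, lends itself to a simple selection rule. The sketch below is a hypothetical illustration rather than Manulife’s actual method; the model names, accuracy figures and per-request costs are placeholders.

```python
# Hypothetical selection rule: among candidate models, pick the cheapest
# one that clears an accuracy bar on in-house tests. Figures below are
# placeholders, not real results from any firm named in the article.

def pick_model(results: dict[str, dict[str, float]],
               min_accuracy: float = 0.90) -> str | None:
    """Return the cheapest model whose in-house accuracy clears the bar.

    `results` maps a model name to its measured accuracy on internal
    test cases and its estimated cost per 1,000 requests.
    """
    eligible = {name: r for name, r in results.items()
                if r["accuracy"] >= min_accuracy}
    if not eligible:
        return None
    return min(eligible, key=lambda name: eligible[name]["cost_per_1k"])


if __name__ == "__main__":
    measured = {
        "model_a": {"accuracy": 0.93, "cost_per_1k": 4.00},
        "model_b": {"accuracy": 0.95, "cost_per_1k": 9.00},
        "model_c": {"accuracy": 0.88, "cost_per_1k": 1.50},
    }
    # model_c is cheapest but misses the bar; model_a wins on cost.
    print(pick_model(measured))
```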
A new model outperforming its predecessor by a few percentage points isn’t that important to large firms adopting AI, said Foteini Agrafioti, senior vice-president of data and AI at RBC. Rather than raw power, the bank needs models that it can use securely and quickly scale across the company.
RBC plans to use Cohere’s technology for most of its generative tools. The bank wants to standardize and centralize its AI platform, the same way it has with software for other functions like cybersecurity or engineering, Agrafioti said. A single system makes it easier to keep data safe and transfer tools between different parts of the business, she added.
RBC tested technology from other firms, including OpenAI, Agrafioti said, but the bank wasn’t willing to send client data to a system hosted on another company’s cloud. So, RBC bought access to North, a Cohere product that lets users launch AI agents powered by the Toronto firm’s large language models. It will run Cohere’s technology on its own servers; the bank has built what it claims is Canada’s largest cluster of the graphics processing units used to power AI.
RBC and Cohere are co-developing a version of North that meets the bank’s security and regulatory requirements. RBC employees building new AI applications that touch sensitive data must use North, which is rolling out to developers at the company this summer. In March, Canada’s biggest bank announced it’s aiming to generate up to $1 billion in earnings using AI by 2027. “It’s graduating out of experiments,” Agrafioti said, to become “mainstream.”
Betting on Cohere as RBC’s main generative AI provider comes with risks, Agrafioti said, since other AI firms might advance the technology faster. The bank will monitor those alternatives and give staff access to the latest models, she said. Still, she’s keen to avoid the distraction of flashy new AI models. “We have something in our hands that does the job we need it to do really well, and we’re going to use that,” she said.
While big businesses may have the data, the technical staff and use cases to test AI models, smaller firms rely on third-party benchmarks such as LLM Arena and MMLU, including some that researchers have accused tech firms of trying to game.
Toronto’s Vector Institute is trying to provide a more independent analysis. In April, it released its first “state of evaluation” study, which ran 11 models through 16 tests. “Large companies have built this infrastructure to do evaluations internally,” said Deval Pandya, the institute’s vice-president of AI engineering. “But most of it is not openly accessible.”