On Monday, a group of technology specialists launched a worldwide appeal for the most challenging questions to pose to artificial intelligence systems, which increasingly handle popular benchmark tests with ease. The initiative, named ‘Humanity’s Last Exam,’ aims to determine when AI has reached expert-level proficiency. According to the organizers, the Center for AI Safety (CAIS) and the startup Scale AI, the project is designed to remain relevant even as AI capabilities continue to advance in the coming years.
This call for tougher questions comes just days after the creator of ChatGPT unveiled a new model, dubbed OpenAI o1, which ‘obliterated the most popular reasoning benchmarks,’ according to Dan Hendrycks, the executive director of CAIS and an advisor to Elon Musk’s xAI startup. Hendrycks co-authored two 2021 papers proposing tests of AI systems that are now widely used: one measuring undergraduate-level knowledge in areas such as US history, the other assessing models’ ability to reason through competition-level math. The undergraduate-style test has been downloaded from the online AI hub Hugging Face more than any other such dataset.
At the time of those papers, AI was providing almost random answers to exam questions. ‘They’re now crushed,’ Hendrycks told Reuters. For instance, the Claude models from the AI lab Anthropic have improved from scoring about 77% on the undergraduate-level test in 2023 to nearly 89% a year later, according to a leading capabilities leaderboard. These common benchmarks have consequently lost much of their significance.
AI has performed more weakly on less commonly used tests that involve formulating plans and solving visual pattern-recognition puzzles, as highlighted in Stanford University’s AI Index Report from April. For example, OpenAI o1 scored around 21% on one version of the pattern-recognition ARC-AGI test, the ARC organizers said on Friday. Some AI researchers argue that results like these show planning and abstract reasoning to be better measures of intelligence, though Hendrycks noted that the visual aspect of ARC makes it less suited to evaluating language models. ‘Humanity’s Last Exam’ will require abstract reasoning, he said.
Answers to questions from common benchmarks may have inadvertently ended up in the data used to train AI systems, according to industry observers. To prevent AI systems from relying on memorized answers, Hendrycks said that some questions on ‘Humanity’s Last Exam’ will remain confidential. The exam will feature at least 1,000 crowd-sourced questions, due by November 1, that are difficult for non-experts to answer. The questions will undergo peer review, and winning submissions will be eligible for co-authorship and prizes of up to $5,000 sponsored by Scale AI.
‘We urgently need more challenging tests for expert-level models to gauge the rapid advancements in AI,’ said Alexandr Wang, Scale’s CEO. One restriction: the organizers have requested that no questions about weapons be included, as some argue that such topics are too hazardous for AI to study.