
AI Agents Surpass Chatbots, But By How Much?

OpenAI's o1 and Anthropic's Claude 3.5 both score just 21% on the ARC reasoning benchmark, against 84% for humans, and a $1m prize awaits AI that can match humans. We wrapped Claude in an agentic framework that forms hypotheses, tests them and course-corrects. How far did the score jump?

Business automation needs AI that can reliably plan and reason. OpenAI recently released its 'o1' models for this, ahem, reason. Earlier today, Anthropic released an upgraded version of Claude 3.5, claiming even better reasoning ability.

Can We Make AI Even Better?

The o1 models are an incredible achievement for many reasons, but a contest between chatbots and AI agents is not really a fair one. An agent is just a framework that uses a common chatbot such as ChatGPT, Claude or o1-preview under the hood, but gives it the tools to reflect on and update its plans in a 'chain of thought' and, importantly, tools to access or analyse data.

Agents have all the advantages.

Of course, AI agents can be built with these new 'reasoning' models from OpenAI, but we were surprised to see how much they can surpass 'o1-preview' on its own.

The Test

The test is the ARC Prize (ARC = Abstract Reasoning Corpus), a set of pattern-prediction challenges similar to those found in IQ tests. Whilst other benchmarks have fallen to AI's progress, these have proven resilient.

The challenges are designed to test for human-level intelligence, which Francois Chollet, the Google engineer who proposed ARC, characterises as "what we do when we don't know what to do". They test the ability to reason by analogy, which Chollet covers in detail in his latest talk. The ARC Prize offers $1m in prize money for AI that can match humans.

For an example challenge, see the image below, which maps inputs to outputs.

Given these examples, can you guess what comes next? It's easy for a human, but AI can get lost.
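For readers who prefer code to pictures, here is a minimal sketch of what such a challenge looks like as data. The public ARC tasks are JSON files with 'train' and 'test' lists of input/output grids, where each cell is an integer colour code from 0 to 9. The tiny task below and the `mirror_left_right` rule are made up for illustration and are not taken from the benchmark.

```python
import json

# A made-up task in the public ARC JSON layout: "train" and "test" lists of
# input/output grids, each cell an integer colour code (0-9).
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 0]], "output": [[0, 3], [0, 0]]}
  ]
}
"""


def mirror_left_right(grid):
    """Hypothetical rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def fits_all_examples(rule, pairs):
    """A rule only counts if it reproduces every example output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in pairs)


task = json.loads(TASK_JSON)
rule_fits = fits_all_examples(mirror_left_right, task["train"])
prediction = mirror_left_right(task["test"][0]["input"])
print("rule fits training examples:", rule_fits)
print("prediction for test input:", prediction)
```

The check is all-or-nothing: a predicted grid is compared cell for cell with the expected one, which is part of why a nearly right answer earns no credit.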

Results

The current top models are Anthropic's Claude 3.5 and OpenAI's o1-preview. Both score 21%, whereas human performance is closer to 84%.

We also tested Anthropic's latest Claude 3.5 (released today, 22-Oct) and found it to be a little worse at the ARC challenges than the old Claude 3.5 (released 20-Jun). The easiest ARC challenges are the 'training set', which is not used for the leaderboard but makes for a quick and useful test nonetheless. The new Claude 3.5 scored 30% against the old Claude 3.5's 32.5% over the 403 challenges.

To test the advantage of the agentic approach, Agentico created AI agents using Claude 3.5 under the hood. The chatbot was given a notebook in which to investigate each challenge example by example, form hypotheses and test them. As a result...

...the chatbot's score rose by almost a quarter (24%) on the ARC training set when operating in an agentic framework

Good, but no $1m prize! Agents are a leap forward but do not match humans on this benchmark. Note that o1-preview was not available for integration into such agents at the time of this report, but we look forward to that being possible soon.

How They Do It

What are agents doing that straightforward chatbots with 'chain of thought' do not achieve? Agentic workflows allow the chatbot to test its ideas and 'course correct' as it goes. Even capable reasoning models like o1 need to test their hypotheses on real data. As you may have guessed, trial and error works. Of course, the successes and errors can be recorded and used as 'lessons learnt' for future trials; Stanford University proposed just such a memory for agents.
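The trial-and-error loop can be sketched in a few lines. This is an illustration under loud assumptions, not Agentico's implementation: in a real agent the candidate rules would be proposed by the chatbot and the 'lessons' fed back into its next prompt, whereas here the candidates (`identity`, `flip_vertical`, `transpose`) are hard-coded, hypothetical stand-ins and the memory is a plain list.

```python
def identity(grid):
    return [row[:] for row in grid]


def flip_vertical(grid):
    return [row[:] for row in reversed(grid)]


def transpose(grid):
    return [list(column) for column in zip(*grid)]


# Hypothetical stand-ins for hypotheses a chatbot might propose.
CANDIDATES = [
    ("identity", identity),
    ("flip_vertical", flip_vertical),
    ("transpose", transpose),
]


def first_failure(rule, pairs):
    """Index of the first example the rule gets wrong, or None if it fits all."""
    for index, (grid_in, grid_out) in enumerate(pairs):
        if rule(grid_in) != grid_out:
            return index
    return None


def solve(pairs, candidates, lessons):
    """Test each hypothesis on the examples, recording what was learnt."""
    for name, rule in candidates:
        failed_at = first_failure(rule, pairs)
        if failed_at is None:
            lessons.append(f"{name}: fits all {len(pairs)} examples")
            return name, rule
        # Course-correct: note the failure and move on to the next idea.
        lessons.append(f"{name}: wrong on example {failed_at}")
    return None, None


# Made-up training pairs (input grid, expected output grid).
train_pairs = [
    ([[1, 0], [0, 1]], [[1, 0], [0, 1]]),
    ([[0, 7], [0, 0]], [[0, 0], [7, 0]]),
]

lessons = []
chosen_name, chosen_rule = solve(train_pairs, CANDIDATES, lessons)
print("chosen rule:", chosen_name)
for lesson in lessons:
    print("lesson:", lesson)
```

Note that `identity` looks right on the first example and only fails on the second. A hypothesis that is never tested against the data would not be caught out in this way.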

We did this research as part of a wider piece on AI Safety (more another day). Details of this work can be found on GitHub.

