AI Agents Surpass Chatbots, But By How Much?

Business automation needs AI that can reliably plan and reason. OpenAI recently released its 'o1' models for this, ahem, reason. Earlier today, Anthropic released an upgraded version of Claude 3.5, claiming even better reasoning ability.

Can We Make AI Even Better?

The o1 models are an incredible achievement for many reasons, but it's not really a fair competition between chatbots and AI agents. Agents are just frameworks that use common chatbots like ChatGPT, Claude or o1-preview under the hood, but give them the tools to reflect on and update their plans in a 'chain of thought' and, importantly, tools to access or analyse data.

Agents have all the advantages.

Of course, AI agents can also be built on these new 'reasoning' models from OpenAI, but we were surprised to see by how much agents can surpass 'o1-preview' on its own.

The Test

The test is the ARC Prize (ARC = Abstraction and Reasoning Corpus), a set of pattern prediction challenges similar to those in IQ tests. Whilst other benchmarks have fallen to AI's progress, these have proven resilient.

The challenges are designed to test for human-level intelligence, which Francois Chollet, the Google engineer who proposed ARC, believes is "what we do when we don't know what to do". They test the ability to reason by analogy, which is described in detail in Chollet's latest talk. The ARC Prize offers $1m of prize money for AI that can match humans.

For an example challenge, see the image below, which maps inputs to outputs.

Given these examples, can you guess what comes next? It's easy for a human, but AI can get lost.
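For readers who prefer code to pictures, here is a minimal sketch of what one challenge looks like as data. It follows the public ARC JSON layout ("train" and "test" lists of input/output grids, each grid a list of rows of integers 0-9), but the tiny grids and the mirror_left_right rule are made up for illustration and are not taken from the benchmark.

```python
import json

# Toy challenge in the ARC JSON layout. The grids are invented for
# illustration; real challenges use larger grids and harder rules.
TASK_JSON = """
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
    {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 4]], "output": [[0, 3], [4, 0]]}
  ]
}
"""


def mirror_left_right(grid):
    """Hypothetical candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def fits_all(rule, pairs):
    """True if the rule reproduces the output of every example pair."""
    return all(rule(pair["input"]) == pair["output"] for pair in pairs)


task = json.loads(TASK_JSON)

# Step 1: does the guessed rule explain all the worked examples?
rule_fits_examples = fits_all(mirror_left_right, task["train"])

# Step 2: apply it to the unseen test input and compare with the answer.
prediction = mirror_left_right(task["test"][0]["input"])
solved = prediction == task["test"][0]["output"]
```

A challenge only counts as solved when the predicted grid matches the expected output exactly, which is part of why partial pattern-matching does not get a model very far.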

Results

The current top models are Anthropic's Claude 3.5 and OpenAI's o1-preview. Both score 21%, whereas human performance is closer to 84%.

We also tested Anthropic's latest Claude 3.5 (released today 22-Oct) and found it to be a little worse than the old Claude 3.5 (released 24-Jun) at the ARC challenges. The easiest ARC challenges are the 'training set', which is not used for the leaderboard but makes for a quick and useful test nonetheless. New Claude 3.5 scored 30% vs old Claude 3.5's 32.5% over the 403 challenges.

To test the advantage of the agentic way of doing things, Agentico created AI agents using Claude 3.5 under the hood. The chatbot was given a notebook to investigate each challenge example by example, make hypotheses and test them. As a result...

...the chatbot's score rose by almost a quarter (24%) on the ARC training set when operating in an agentic framework.

Good, but no $1m prize! Agents are a leap forward but do not match humans on this benchmark. Note that o1-preview was not available for integration into such agents at the time of this report, but we look forward to that being possible soon.

How They Do It

What are agents doing that straightforward chatbots with 'chain of thought' do not achieve? Agentic workflows allow the chatbot to test its ideas and 'course correct' as it goes. Even capable reasoning models like o1 need to test their hypotheses on real data. As you may have guessed, trial and error works. Of course, the successes and errors can be recorded and used as 'lessons learnt' for future trials; Stanford University proposed just such a memory for agents.
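The trial-and-error loop described above can be sketched in a few lines. This is an illustration under loud assumptions, not Agentico's agent and not the Stanford proposal: a fixed menu of hypothetical grid rules stands in for the ideas a chatbot would propose, and a simple success counter (the lessons dictionary) stands in for a 'lessons learnt' memory.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]
Rule = Callable[[Grid], Grid]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)

# Hypothetical hypotheses. In a real agent the chatbot would write these
# itself, in a notebook, after looking at the examples.
HYPOTHESES: Dict[str, Rule] = {
    "identity": lambda g: [row[:] for row in g],
    "mirror_left_right": lambda g: [row[::-1] for row in g],
    "flip_top_bottom": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def first_failure(rule: Rule, examples: List[Example]) -> Optional[int]:
    """Index of the first example the rule gets wrong, or None if it fits all."""
    for i, (grid_in, grid_out) in enumerate(examples):
        if rule(grid_in) != grid_out:
            return i
    return None


def solve(examples: List[Example], lessons: Dict[str, int]):
    """Test hypotheses on real data, course correct, and record what worked."""
    # Lessons learnt: try rules that succeeded on earlier challenges first.
    order = sorted(HYPOTHESES, key=lambda name: -lessons.get(name, 0))
    log = []
    for name in order:
        failed_at = first_failure(HYPOTHESES[name], examples)
        if failed_at is None:
            lessons[name] = lessons.get(name, 0) + 1
            log.append(f"{name}: fits all examples")
            return name, log
        # Course correct: note the failure and move to the next idea.
        log.append(f"{name}: fails on example {failed_at}")
    return None, log


# Two made-up examples whose hidden rule is a transpose.
examples = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[0, 5], [0, 0]], [[0, 0], [5, 0]]),
]
lessons: Dict[str, int] = {}
first_answer, first_log = solve(examples, lessons)    # tries four rules
second_answer, second_log = solve(examples, lessons)  # tries the winner first
```

On the first call the loop works through three wrong ideas before finding one that fits; on the second call the recorded lesson puts the winning rule at the front of the queue. A plain chatbot gets no such feedback: it has to commit to an answer without ever checking it against the examples.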

We did this research as part of a wider piece on AI Safety (more on that another day). Details of this work can be found on GitHub.

Talk to knowledgeable AI and Machine Learning practitioners:

Get in Touch

Agentico, AI with Humans in Mind.