Written on November 28, 2024

Understand, Edit & Steer AI - via API

At a Goodfire and Apart hackathon, our team built a tool to catch hallucinations in medical diagnostic AI, then steer the model around them neuron by neuron via a simple API. What does it mean when a business can edit a model's internals this easily?

agentic-ai ai-safety

We just finished a weekend hackathon organised by Goodfire and Apart Research! Our team built a tool to uncover hallucinations in medical diagnostic AI and 'steer' the AI to better performance on those diagnoses.

Goodfire advertise themselves as 'unlocking deep customisation and insights by examining and modifying the internals of generative AI models'. It's an entirely new field and a new world of possibilities.

We were amazed by how easy it was to observe the internal mechanics of an AI model and then steer it neuron by neuron - truly 'fine' tuning. We submitted our medical benchmark and quickly found the key features deep inside the model that were triggering its occasional hallucinations. We could then 'steer' the model around this block, reducing hallucinations. Only months ago this was an enormous amount of work; now it is a simple API.
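To give a feel for what 'steering' a feature means, here is a minimal toy sketch in plain Python. It is not Goodfire's API and not the code we used in the hackathon. It assumes a feature can be treated as a direction in the model's hidden state, and that steering means nudging the hidden state along that direction until the feature reads a chosen value. Every name and number in it (`hidden`, `hallucination_feature`, `steer`) is made up for illustration.

```python
# Toy illustration of feature steering. NOT Goodfire's API.
# All names and numbers below are hypothetical.
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def feature_activation(activation, direction):
    """How strongly the feature is present: projection onto its unit direction."""
    return dot(activation, normalise(direction))


def steer(activation, direction, target):
    """Shift the activation along the feature direction until the feature reads `target`."""
    unit = normalise(direction)
    delta = target - dot(activation, unit)
    return [a + delta * u for a, u in zip(activation, unit)]


# A made-up 4-dimensional hidden state and a made-up "hallucination" feature direction.
hidden = [0.5, 2.0, -1.0, 0.25]
hallucination_feature = [0.0, 3.0, 4.0, 0.0]

before = feature_activation(hidden, hallucination_feature)
steered = steer(hidden, hallucination_feature, target=0.0)
after = feature_activation(steered, hallucination_feature)

print(f"feature before steering: {before:.3f}")
print(f"feature after steering:  {after:.3f}")
```

Only the part of the hidden state that lies along the feature direction changes; everything else is left alone. That is the sense in which this kind of editing is finer-grained than retraining the whole model.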

Thanks to Apart Research for organising the hackathon kick-off talk with the legendary Neel Nanda of Google. We teamed up with the University of Buenos Aires for this project, calling ourselves 'Gradients Anatomy'. Thanks to the team for working into the late hours over three days on the hallucination detection and mechanistic interpretability:

Matías Zabaljauregui (Applied AI Labs, Univ. Buenos Aires)
Diego Sabajo (ZennoAI, Paramaribo)
Eitan Sprejer (GetGloby, Buenos Aires)
