Understand, Edit & Steer AI - via API

olivermorris83
Nov 28, 2024

We just finished a weekend hackathon organised by Goodfire and Apart Research! Our team built a tool to uncover hallucinations in medical diagnostic AI and 'steer' the AI to better performance on those diagnoses.

Goodfire advertise themselves as 'unlocking deep customisation and insights by examining and modifying the internals of generative AI models'. It's an entirely new field and a new world of possibilities.

We were amazed by how easy it was to observe the internal mechanics of an AI model and then steer it neuron by neuron - truly 'fine' tuning. We submitted our medical benchmark and quickly found the key features deep inside the model that were triggering it to occasionally hallucinate. We could then 'steer' the model around this block, reducing hallucinations. Only months ago this was an enormous amount of work; now it is a simple API.
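For readers who want a feel for what 'steering' means, here is a toy sketch in plain Python. It is not the Goodfire API and not our hackathon code. It assumes, purely for illustration, that a feature can be treated as a direction in the model's hidden activations, and that steering means moving the activations along that direction until the feature reads a chosen value. The names `feature_activation`, `steer` and `hallucination_direction` are made up for this example.

```python
# Toy illustration only: NOT the Goodfire SDK and not the team's code.
# Assumption: a "feature" is a direction in the model's hidden activation space.
from math import sqrt


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def feature_activation(hidden, direction):
    """How strongly `hidden` expresses the feature (projection onto the unit direction)."""
    return dot(hidden, direction) / sqrt(dot(direction, direction))


def steer(hidden, direction, target):
    """Return a copy of `hidden` whose activation along `direction` equals `target`."""
    norm = sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    shift = target - dot(hidden, unit)  # how far to move along the feature direction
    return [h + shift * u for h, u in zip(hidden, unit)]


# Hypothetical three-dimensional hidden state and feature direction.
hidden = [0.5, 2.0, -1.0]
hallucination_direction = [0.0, 1.0, 0.0]

before = feature_activation(hidden, hallucination_direction)
steered = steer(hidden, hallucination_direction, target=0.0)
after = feature_activation(steered, hallucination_direction)
print(before, after)
```

Only the component along the chosen feature changes; the rest of the hidden state is left alone, which is what makes this kind of edit so much more targeted than retraining.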

Thanks to Apart Research for organising the hackathon kick-off talk with the legendary Neel Nanda of Google. We teamed up with the University of Buenos Aires for this project, calling ourselves 'Gradients Anatomy'. Thanks to the team for their efforts, working into the late hours over three days on the hallucination detection and mechanistic interpretability: