
Researching Safe Medical AI
Teaming up with Apart Labs to reduce medical AI hallucinations by analysing an AI's internal processes.
Problem
AI systems are increasingly used in medical decision-making, but their tendency to hallucinate - confidently generating false information - poses significant risks. We teamed up with the research organisation Apart Labs Studio to investigate a new method of reducing hallucinations using a Sparse Autoencoder (SAE) service provided by the start-up Goodfire, which allowed us to examine the AI's inner processing. This was non-commercial work, contributed as part of a common effort to improve AI safety.
Solution
We sourced freely available test data from Huggingface, building 5,000 hallucination test cases for medical applications. Using Goodfire's SAE, we identified neural features (the model's inner processing circuits) associated with hallucinated responses to those medical questions.
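As a rough illustration, the sketch below collects SAE feature activations for a single question-and-answer pair. The SDK calls shown (goodfire.Client, Variant, features.inspect) reflect our reading of Goodfire's Python client and should be treated as assumptions rather than a verified recipe.

```python
import goodfire

# Assumed setup for Goodfire's SDK; verify method names against the current API docs.
client = goodfire.Client(api_key="YOUR_API_KEY")
variant = goodfire.Variant("meta-llama/Meta-Llama-3.1-8B-Instruct")

def sae_feature_vector(question: str, answer: str, top_k: int = 64):
    """Return the top-k SAE feature activations for a question/answer pair."""
    context = client.features.inspect(
        messages=[
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
        model=variant,
    )
    # Each item pairs a human-interpretable feature with its activation strength.
    return {item.feature.label: item.activation for item in context.top(k=top_k)}
```

Repeating this over the 5,000 test cases yields one activation vector per answer, which becomes the input to the classifiers described next.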
We built several machine learning classifiers to detect potential hallucinations from those features. We were then in a position to open up the model and use the same features to steer it away from hallucinating, so that it would instead refuse to answer when unsure.
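A minimal sketch of the classifier step, assuming the SAE activations and hallucination labels have already been saved to disk (the file names and shapes here are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X: one row of SAE feature activations per model answer, shape (n_answers, n_features).
# y: 1 if the answer was judged a hallucination, 0 otherwise. File names are assumed.
X = np.load("sae_activations.npy")
y = np.load("hallucination_labels.npy")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The three classifier families listed in the recipe below.
classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(kernel="rbf"),
    "decision_tree": DecisionTreeClassifier(max_depth=8),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", clf.score(X_test, y_test))
```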
The approach successfully reduced hallucination rates, but it also highlighted the complexity of deciding when the model should give advice and when it should refuse. Overall the method was powerful, but the Goodfire SAE would require refinement for this use case.
Recipe
MedHALT FCT medical hallucination dataset from Huggingface
Llama-3.1-8B-Instruct as base model
Goodfire's SAE API for feature extraction
Human Disease Ontology dataset for validation of the hallucination features found
3x Classification algorithms (SVM, Decision Tree, Logistic Regression)
Goodfire's feature steering tools, targeting key features that affect the model's willingness to respond to questions it has insufficient information for (see the sketch after this list)
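For illustration, a hedged sketch of the steering step: the feature search query, steering weights and SDK calls (features.search, Variant.set, chat.completions.create) are assumptions for this example, not the exact configuration used in the project.

```python
import goodfire

# Assumed Goodfire SDK setup; verify call names and signatures against the current docs.
client = goodfire.Client(api_key="YOUR_API_KEY")
base = goodfire.Variant("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Find features associated with admitting uncertainty or declining to answer.
refusal_features = client.features.search(
    "admitting uncertainty or refusing to answer", model=base, top_k=3
)

# Apply a small positive nudge to each feature so the model prefers a cautious refusal
# when it lacks the information to answer. The weight 0.4 is purely illustrative.
steered = goodfire.Variant("meta-llama/Meta-Llama-3.1-8B-Instruct")
for feature in refusal_features:
    steered.set(feature, 0.4)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is the recommended dose of drug X for condition Y?"}],
    model=steered,
)
print(response)
```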
