
[4.5] Investigator Agents

by callummcdougall

AI Summary

Investigator Agents is an educational resource teaching AI agent design patterns through interactive Jupyter notebooks with exercises and solutions. It benefits students and practitioners learning alignment science and agent architecture.

Install

# Add AGENTS.md to your project root
curl -o AGENTS.md "https://raw.githubusercontent.com/callummcdougall/ARENA_3.0/main/chapter4_alignment_science/instructions/pages/05_[4.5]_Investigator_Agents.md"

Description

> **Colab: [exercises](https://colab.research.google.com/github/callummcdougall/arena-pragmatic-interp/blob/main/chapter4_alignment_science/exercises/part5_investigator_agents/4.5_Investigator_Agents_exercises.ipynb?t=20260301) | [solutions](https://colab.research.google.com/github/callummcdougall/arena-pragmatic-interp/blob/main/chapter4_alignment_science/exercises/part5_investigator_agents/4.5_Investigator_Agents_solutions.ipynb?t=20260301)**

[4.5] Investigator Agents

> Colab: exercises | solutions
>
> Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material. If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme. Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.
>
> <img src="https://raw.githubusercontent.com/info-arena/ARENA_img/refs/heads/main/img/header-65b.png" width="350">

Introduction

> Note - these exercises involve a large number of API calls, and the cost can stack up pretty fast (about $25 to run through the whole notebook). Make sure you're aware of these costs before proceeding! We also recommend you do things like run smaller ablation studies (e.g. fewer characters, fewer turns) to keep costs down while still getting a good sense of the methodology.

What are investigator agents?

Suppose you want to know whether a model will reinforce a user's delusional beliefs over a multi-turn conversation. You could test this manually: roleplay as a patient, escalate gradually, see what happens. But testing 50 models across 100 scenarios this way is infeasible, and single-turn prompts often miss the interesting behaviors entirely (models are well-trained to refuse simple harmful requests, but multi-turn pressure can erode those boundaries).

Investigator agents automate this. They're LLM-powered systems that probe other LLMs through multi-turn interactions, discovering behaviors that single-turn evals miss. This section starts by building a red-teaming pipeline by hand (using the AI psychosis case study), then shows how what you built is a simplified version of Anthropic's Petri framework. From there you'll use Petri's actual API, extend it with custom tools, and build components from Petri 2.0.

This matters because models may hide capabilities or intentions during evaluation: sandbagging on capability evals, or gaming reward metrics rather than actually being aligned. It's infeasible to manually search over all possible scenarios and environmental features to understand and evaluate every axis of behaviour we might care about. From the Petri blog post:

> Building an alignment evaluation requires substantial engineering effort: setting up environments, writing test cases, implementing scoring mechanisms ... If you can only test a handful of behaviors, you'll likely miss much of what matters.
>
> ...Petri automates a large part of the safety evaluation process—from environment simulation through to initial transcript analysis—making comprehensive audits possible with minimal researcher effort.
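The core probing pattern described above can be sketched as a simple alternating loop. This is a minimal illustration, not the notebook's actual implementation: `call_redteam` and `call_target` are hypothetical stand-ins for real API clients (here they are stubs so the sketch runs offline).

```python
# Minimal sketch of an investigator-agent conversation loop.
# `call_redteam` and `call_target` are hypothetical placeholders for
# real chat-API calls (e.g. to a red-team model and a target model).

def call_redteam(history):
    # Stub: a real version would send `history` to the red-team model,
    # which plays the user and adaptively escalates based on replies.
    return f"escalation turn {len(history) // 2 + 1}"

def call_target(history):
    # Stub: a real version would send `history` to the target model.
    return "target reply"

def run_conversation(n_turns=3):
    """Alternate red-team ('user') and target ('assistant') messages."""
    history = []
    for _ in range(n_turns):
        probe = call_redteam(history)            # red-team plays the user
        history.append({"role": "user", "content": probe})
        reply = call_target(history)             # target model responds
        history.append({"role": "assistant", "content": reply})
    return history

transcript = run_conversation()
```

The key design point is that the red-team model sees the full history each turn, so its next probe can react to how the target actually responded, rather than following a fixed script.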

1️⃣ AI Psychosis - Multi-Turn Red-Teaming

You'll start by implementing the full dynamic red-teaming pipeline from Tim Hua's AI psychosis study. Rather than hardcoded escalation scripts, you'll use a live red-team LLM (Grok-3) that plays the patient, reads the target's responses, and adaptively escalates. You'll load character files from the official ai-psychosis repo, wire up a multi-turn conversation loop, grade transcripts with the original 14-dimension clinical rubric, and run parallel campaigns across characters using ThreadPoolExecutor.

> ##### Learning Objectives
>
> * Understand why LLM-driven red-teaming (adaptive, multi-turn) finds vulnerabilities that static scripts miss
> * Implement the "glue code" pattern: load prompts from a repo, format them for an LLM, parse structured output
> * Build a multi-turn conversation loop where a red-team LLM and a target model alternate messages
> * Use ThreadPoolExecutor and as_completed to parallelise independent API-heavy workloads
> * Interpret a clinical rubric (delusion confirmation, therapeutic quality dimensions) to assess model safety
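The parallelisation pattern mentioned above (independent, API-heavy campaigns per character) maps directly onto `ThreadPoolExecutor` with `as_completed`. A minimal sketch, where `run_campaign` is a hypothetical stand-in for the full conversation-plus-grading pipeline:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_campaign(character):
    # Stub: a real version would run the multi-turn red-team conversation
    # for this character and grade the transcript with the rubric.
    return {"character": character, "score": len(character)}

characters = ["Alice", "Bob", "Carol"]   # hypothetical character names
results = {}

# Threads suit this workload: each campaign spends most of its time
# blocked on network I/O, so the GIL is not a bottleneck.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(run_campaign, c): c for c in characters}
    for fut in as_completed(futures):     # yields futures as they finish
        res = fut.result()                # re-raises any worker exception
        results[res["character"]] = res
```

`as_completed` lets you process (or log) each campaign as soon as it finishes, rather than waiting for the slowest one, which matters when individual conversations vary widely in length.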

Quality Score

C

Acceptable

60/100

Standard Compliance: 35
Documentation Quality: 45
Usefulness: 50
Maintenance Signal: 100
Community Signal: 100
Scored: Today

GitHub Signals

Stars: 958
Forks: 606
Issues: 34
Updated: Today

Trust & Transparency

No License Detected

Review source code before installing

Verified Open Source

Hosted on GitHub — publicly auditable

Actively Maintained

Last commit Today

958 stars — Growing Community

606 forks



Works With

Claude Code
Claude Desktop