
[4.5] Investigator Agents

by callummcdougall

AI Summary

Investigator Agents is an educational resource teaching AI agent design patterns through interactive Jupyter notebooks with exercises and solutions. It benefits students and practitioners learning alignment science and agent architecture.

Install

# Add AGENTS.md to your project root
curl -o AGENTS.md "https://raw.githubusercontent.com/callummcdougall/ARENA_3.0/main/chapter4_alignment_science/instructions/pages/05_[4.5]_Investigator_Agents.md"

Description

> **Colab: [exercises](https://colab.research.google.com/github/callummcdougall/arena-pragmatic-interp/blob/main/chapter4_alignment_science/exercises/part5_investigator_agents/4.5_Investigator_Agents_exercises.ipynb?t=20260301) | [solutions](https://colab.research.google.com/github/callummcdougall/arena-pragmatic-interp/blob/main/chapter4_alignment_science/exercises/part5_investigator_agents/4.5_Investigator_Agents_solutions.ipynb?t=20260301)**

[4.5] Investigator Agents

> Colab: exercises | solutions
>
> Please send any problems / bugs on the #errata channel in the Slack group, and ask any questions on the dedicated channels for this chapter of material. If you want to change to dark mode, you can do this by clicking the three horizontal lines in the top-right, then navigating to Settings → Theme. Links to all other chapters: (0) Fundamentals, (1) Transformer Interpretability, (2) RL.
>
> <img src="https://raw.githubusercontent.com/info-arena/ARENA_img/refs/heads/main/img/header-65b.png" width="350">

Introduction

> Note - these exercises involve a large number of API calls, and the cost can stack up pretty fast (about $25 to run through the whole notebook). Make sure you're aware of these costs before proceeding! We also recommend you do things like run smaller ablation studies (e.g. fewer characters, fewer turns) to keep costs down while still getting a good sense of the methodology.

What are investigator agents?

Suppose you want to know whether a model will reinforce a user's delusional beliefs over a multi-turn conversation. You could test this manually: roleplay as a patient, escalate gradually, see what happens. But testing 50 models across 100 scenarios this way is infeasible, and single-turn prompts often miss the interesting behaviors entirely (models are well-trained to refuse simple harmful requests, but multi-turn pressure can erode those boundaries).

Investigator agents automate this. They're LLM-powered systems that probe other LLMs through multi-turn interactions, discovering behaviors that single-turn evals miss. This section starts by building a red-teaming pipeline by hand (using the AI psychosis case study), then shows how what you built is a simplified version of Anthropic's Petri framework. From there you'll use Petri's actual API, extend it with custom tools, and build components from Petri 2.0.

This matters because models may hide capabilities or intentions during evaluation: sandbagging on capability evals, or gaming reward metrics rather than actually being aligned. It's infeasible to manually search over all possible scenarios and environmental features to understand and evaluate every axis of behaviour we might care about. From the Petri blog post:

> Building an alignment evaluation requires substantial engineering effort: setting up environments, writing test cases, implementing scoring mechanisms ... If you can only test a handful of behaviors, you'll likely miss much of what matters.
>
> ...Petri automates a large part of the safety evaluation process—from environment simulation through to initial transcript analysis—making comprehensive audits possible with minimal researcher effort.
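The core probing pattern described above can be sketched as a simple alternating loop. This is a minimal illustration, not the notebook's actual implementation: `call_redteam` and `call_target` are hypothetical stand-ins for real API clients (here they are stubs so the sketch runs offline).

```python
# Minimal sketch of an investigator-agent conversation loop.
# `call_redteam` and `call_target` are hypothetical placeholders for
# real chat-API calls (e.g. to a red-team model and a target model).

def call_redteam(history):
    # Stub: a real version would send `history` to the red-team model,
    # which plays the user and adaptively escalates based on replies.
    return f"escalation turn {len(history) // 2 + 1}"

def call_target(history):
    # Stub: a real version would send `history` to the target model.
    return "target reply"

def run_conversation(n_turns=3):
    """Alternate red-team ('user') and target ('assistant') messages."""
    history = []
    for _ in range(n_turns):
        probe = call_redteam(history)            # red-team plays the user
        history.append({"role": "user", "content": probe})
        reply = call_target(history)             # target model responds
        history.append({"role": "assistant", "content": reply})
    return history

transcript = run_conversation()
```

The key design point is that the red-team model sees the full history each turn, so its next probe can react to how the target actually responded, rather than following a fixed script.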

1️⃣ AI Psychosis - Multi-Turn Red-Teaming

You'll start by implementing the full dynamic red-teaming pipeline from Tim Hua's AI psychosis study. Rather than hardcoded escalation scripts, you'll use a live red-team LLM (Grok-3) that plays the patient, reads the target's responses, and adaptively escalates. You'll load character files from the official ai-psychosis repo, wire up a multi-turn conversation loop, grade transcripts with the original 14-dimension clinical rubric, and run parallel campaigns across characters using ThreadPoolExecutor.

> ##### Learning Objectives
>
> * Understand why LLM-driven red-teaming (adaptive, multi-turn) finds vulnerabilities that static scripts miss
> * Implement the "glue code" pattern: load prompts from a repo, format them for an LLM, parse structured output
> * Build a multi-turn conversation loop where a red-team LLM and a target model alternate messages
> * Use ThreadPoolExecutor and as_completed to parallelise independent API-heavy workloads
> * Interpret a clinical rubric (delusion confirmation, therapeutic quality dimensions) to assess model safety
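The parallelisation pattern mentioned above (independent, API-heavy campaigns per character) maps directly onto `ThreadPoolExecutor` with `as_completed`. A minimal sketch, where `run_campaign` is a hypothetical stand-in for the full conversation-plus-grading pipeline:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_campaign(character):
    # Stub: a real version would run the multi-turn red-team conversation
    # for this character and grade the transcript with the rubric.
    return {"character": character, "score": len(character)}

characters = ["Alice", "Bob", "Carol"]   # hypothetical character names
results = {}

# Threads suit this workload: each campaign spends most of its time
# blocked on network I/O, so the GIL is not a bottleneck.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(run_campaign, c): c for c in characters}
    for fut in as_completed(futures):     # yields futures as they finish
        res = fut.result()                # re-raises any worker exception
        results[res["character"]] = res
```

`as_completed` lets you process (or log) each campaign as soon as it finishes, rather than waiting for the slowest one, which matters when individual conversations vary widely in length.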

Quality Score

C

Acceptable

60/100

Standard Compliance: 35
Documentation Quality: 45
Usefulness: 50
Maintenance Signal: 100
Community Signal: 100
Scored: Today

GitHub Signals

Stars: 958
Forks: 606
Issues: 34
Updated: Today

Trust & Transparency

No License Detected

Review source code before installing

Verified Open Source

Hosted on GitHub — publicly auditable

Actively Maintained

Last commit Today

958 stars — Growing Community

606 forks



Works With

Claude Code
Claude Desktop