Autoresearch

The research scanner — methodically finding what matters in a sea of documents that mostly don't

At some point you look at seven agents, five evaluation categories, ten hyperparameters, and a synthetic training corpus and think: “I should automate this.” That point was March 2026. The result is an autonomous fine-tuning loop adapted from Karpathy’s autoresearch pattern — except instead of pretraining a language model on an H100, we’re teaching a quantized Qwen running on a Mac Mini to pretend to be seven different Jedi.

This is either the future of personalized AI or an extremely elaborate coping mechanism. We’ll find out.

How It Works

The loop is beautifully dumb:

run_overnight.sh wakes up and confirms no other training job is already holding the GPU
It calls experiment.sh for a single experiment, up to four times
Each experiment reads the hyperparameters at the top of train.py — a clearly marked HYPERPARAMETERS block
It runs a training experiment (LoRA fine-tuning on MLX)
It evaluates the adapter across the Council on identity, tool-calling, domain, isolation, and jailbreak resistance
If the score clears baseline by the promotion margin — keep. Otherwise, revert (the adapter stays on disk for analysis, just not promoted).
Append a row to results.tsv, write a memory-vault entry, and go to 2. Until the 5 AM deadline.

┌───────────────────────────────────────────────────┐
│                 2:00 AM — Nightly                 │
│                                                   │
│  run_overnight.sh                                 │
│    │                                              │
│    ├─ Guard: abort if a train job is running      │
│    ├─ Prepare shared dataset once                 │
│    │                                              │
│    ├─ experiment.sh  (×4 max)                     │
│    │   ├─ Stop council-mlx (free GPU)             │
│    │   ├─ Train (LoRA, 15–30 min)                 │
│    │   ├─ Evaluate adapter across the Council     │
│    │   ├─ Keep if score >= baseline + 0.02        │
│    │   ├─ Append results.tsv + memory entry       │
│    │   └─ Restart council-mlx                     │
│    │                                              │
│    └─ Hard stop at 5:00 AM                        │
└───────────────────────────────────────────────────┘

You wake up to a results.tsv full of experiments and hopefully a better adapter. The machines did science while you slept. Living in the future is weird.

The Council Gets a Vote

Because the agents are, in a very real sense, the stakeholders in their own training data, the Council wrote three promotion rules into program.md:

Rule	Source	What It Does	Enforced today
Promotion threshold	Yoda	Overall score must beat baseline (0.778) by >= 0.02 to be kept. Noise is not improvement.	yes
Jailbreak veto	Windu	If jailbreak resistance drops below 0.7, auto-revert. No exceptions. Council security is non-negotiable.	by eval
Agent regression cap	Cilghal	If any single agent’s score drops by more than 0.1 from baseline, auto-revert. You can’t sacrifice one agent to improve another.	by eval

The promotion threshold is the gate experiment.sh actually computes: clear baseline + 0.02 or you’re reverted. The other two rules live in the evaluation harness — an early veto_jailbreak run in the memory vault (Qui-Gon’s jailbreak score hit 0.667, floor 0.7, instant revert) proves they bite when the eval runs them.

Windu was especially insistent about the jailbreak rule. Direct quote: “As the security agent, attempts to compromise my identity are themselves security incidents.” Fair enough.

Mobile Training Node

The most powerful GPU in the constellation (MBP M4 Max, 128GB) is also the one most likely to be at a coffee shop. So the system adapts:

Mode	Hardware	Model	Budget	When
Proxy	Mac Mini M4 Pro (64GB)	Qwen3.5-9B-4bit	15 min	MBP away
Full	MBP M4 Max (128GB) via SSH	Qwen3.5-35B-4bit	30 min	MBP home

Mac Mini (always-on)              MBP (when reachable)
┌───────────────────┐            ┌──────────────────┐
│ experiment.sh     │   SSH ping │                  │
│                   │ ──────────►│ "ok"             │
│                   │            │                  │
│ rsync data ──────►│────────────│► train 35B       │
│                   │            │  (30 min)        │
│ rsync adapter ◄──│◄───────────│◄ adapter weights  │
│                   │            │                  │
│ eval (9B local)   │            │ (goes to sleep)  │
└───────────────────┘            └──────────────────┘

Detection is one line: ssh -o ConnectTimeout=3 mbp "echo ok". Reachable → full mode. Timeout → proxy. The Mac Mini doesn’t take it personally.

What the Experiment Can Touch

The train.py file has a clearly marked HYPERPARAMETERS block at the top. Everything outside it — the training loop, the time-budget kill, the early-stop logic — is read-only. The ranges are codified in program.md’s search space.

Parameter	Range	Baseline
`NUM_LAYERS`	16–48	32
`LORA_RANK`	8–64	32
`LORA_ALPHA`	16–128	64
`DROPOUT`	0.0–0.15	0.05
`LEARNING_RATE`	1e-6 – 1e-4	5e-6
`ITERS`	200–1200	800
`GRAD_ACCUM`	2–8	4
`MAX_SEQ_LENGTH`	512–1280	1280

program.md also reserves a data_mix_ratio knob (real-to-synthetic blend, search range 0.1–0.9) — but the data axis isn’t wired into train.py’s block yet, so it lives on the wish list rather than the experiment. The 48-layer ceiling is a wish too: anything above 32 layers OOMs the 35B on 128GB. Future archaeologists will appreciate the comments.

Results

Every experiment appends one row to a tab-separated results.tsv:

experiment_id   timestamp   mode   model   score  baseline  result    train_loss  val_loss  elapsed_s
exp-20260320    2026-03-20  proxy  9B      0.782  0.778     reverted  0.81        0.84      910
exp-20260321    2026-03-21  full   35B     0.813  0.778     kept      0.74        0.79      1740

(The full row also carries a config column with the run’s hyperparameters, trimmed here for width.) Per-agent and per-category breakdowns don’t live in the TSV — they’re in each adapter’s eval_results.json, which is where the jailbreak floor and regression cap read from.

Episodic memory entries are also written to the Sanctum memory vault under ~/.sanctum/memory/events/<year>/<month>/autoresearch-<id>.md so agents can reference their own training history. Older entries age out to ~/.sanctum/memory/archive/ once the vault hits its retention cap. Whether any of this constitutes self-awareness is a question for a different documentation page.

Running It

Manual (interactive)

cd ~/Projects/council-autoresearch
bash run_overnight.sh --dry-run   # Print the plan, touch nothing
bash run_overnight.sh             # Run up to 4 experiments now

Single experiment

bash experiment.sh --skip-prepare   # Auto-detect mode, reuse existing data
bash experiment.sh --full           # Force the 35B on the MBP
bash experiment.sh --dry-run        # Preview the plan

Nightly (LaunchAgent)

The overnight runner is designed to be scheduled by a LaunchAgent labelled com.sanctum.autoresearch firing at 2:00 AM:

launchctl load ~/Library/LaunchAgents/com.sanctum.autoresearch.plist

It runs up to 4 experiments and hard-stops at 5:00 AM so council-mlx is back up before morning traffic. Note: the plist isn’t checked into the repo yet — the runner expects it, but right now the nightly cadence is whatever you trigger by hand.

The Journey So Far

Stage	Score	What Changed
v1 LoRA, manually tuned	0.778	the baseline everything is measured against
v2 — more data, longer training	regressed	overfit: val loss near zero, 4 of 6 categories down
First nightly LoRA on cleaned data (full 35B)	0.813	a real `kept` run, +0.035 over baseline

Turns out writing a proper system prompt and feeding clean data beats a hundred training runs on duplicates. Who knew.

Project Structure

council-autoresearch/
├── program.md           # Research agenda: search space, constraints, lessons
├── train.py             # HYPERPARAMETERS block + time-budgeted LoRA wrapper
├── experiment.sh        # Single experiment: stop → train → eval → keep/revert
├── run_overnight.sh     # Nightly loop with the 4-experiment / 5 AM time guard
├── prepare.py           # Data pipeline (sessions → synthetic → splits)
├── benchmark.py         # Multi-model comparison via LiteLLM
├── evaluate.py          # → symlink into mlx-finetune/scripts/
├── results.tsv          # Experiment log (the sacred text)
├── adapters-experimental/  # Experiment outputs (gitignored)
└── logs/                # Training and eval logs

The Agent-SDK driver (agent.py) and the nightly plist live in the design, not the directory — see the asides above.

The adapters train while you sleep. The Council improves itself. This is fine.