
Autoresearch

The research scanner — methodically finding what matters in a sea of documents that mostly don't

At some point you look at seven agents, five evaluation categories, ten hyperparameters, and a synthetic training corpus and think: “I should automate this.” That point was March 2026. The result is an autonomous fine-tuning loop adapted from Karpathy’s autoresearch pattern — except instead of pretraining a language model on an H100, we’re teaching a quantized Qwen running on a Mac Mini to pretend to be seven different Jedi.

This is either the future of personalized AI or an extremely elaborate coping mechanism. We’ll find out.

The loop is beautifully dumb:

  1. run_overnight.sh wakes up and confirms no other training job is already holding the GPU
  2. It calls experiment.sh for a single experiment, up to four times
  3. Each experiment reads the hyperparameters at the top of train.py — a clearly marked HYPERPARAMETERS block
  4. It runs a training experiment (LoRA fine-tuning on MLX)
  5. It evaluates the adapter across the Council on identity, tool-calling, domain, isolation, and jailbreak resistance
  6. If the score clears baseline by the promotion margin — keep. Otherwise, revert (the adapter stays on disk for analysis, just not promoted).
  7. Append a row to results.tsv, write a memory-vault entry, and go to 2. Until the 5 AM deadline.
┌──────────────────────────────────────────────────┐
│  2:00 AM - Nightly                               │
│                                                  │
│  run_overnight.sh                                │
│    │                                             │
│    ├─ Guard: abort if a train job is running     │
│    ├─ Prepare shared dataset once                │
│    │                                             │
│    ├─ experiment.sh (×4 max)                     │
│    │    ├─ Stop council-mlx (free GPU)           │
│    │    ├─ Train (LoRA, 15–30 min)               │
│    │    ├─ Evaluate adapter across the Council   │
│    │    ├─ Keep if score >= baseline + 0.02      │
│    │    ├─ Append results.tsv + memory entry     │
│    │    └─ Restart council-mlx                   │
│    │                                             │
│    └─ Hard stop at 5:00 AM                       │
└──────────────────────────────────────────────────┘
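
The outer loop above can be sketched in a few lines of Python. This is a rough illustration and not the contents of run_overnight.sh: the names `should_start_another` and `run_overnight` are hypothetical, and it assumes a run only starts if its full training budget fits before 5:00 AM on the same calendar day (the real script may simply check the clock).

```python
from datetime import datetime, time, timedelta

MAX_EXPERIMENTS = 4      # from the diagram: experiment.sh (x4 max)
DEADLINE = time(5, 0)    # hard stop at 5:00 AM


def should_start_another(now: datetime, completed: int, budget_min: int) -> bool:
    """Return True if one more experiment fits before the deadline.

    Assumption: the deadline is 5:00 AM on the same date as `now`,
    which matches a job that starts at 2:00 AM.
    """
    if completed >= MAX_EXPERIMENTS:
        return False
    deadline = datetime.combine(now.date(), DEADLINE)
    return now + timedelta(minutes=budget_min) <= deadline


def run_overnight(run_experiment, clock=datetime.now, budget_min: int = 30) -> int:
    """Call run_experiment() until the cap or the deadline stops the loop."""
    completed = 0
    while should_start_another(clock(), completed, budget_min):
        run_experiment()  # stand-in for one experiment.sh invocation
        completed += 1
    return completed
```

Passing the clock in as a parameter keeps the loop testable without waiting until 2:00 AM.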

You wake up to a results.tsv full of experiments and hopefully a better adapter. The machines did science while you slept. Living in the future is weird.

Because the agents are, in a very real sense, the stakeholders in their own training data, the Council wrote three promotion rules into program.md:

| Rule | Source | What It Does | Enforced today |
|---|---|---|---|
| Promotion threshold | Yoda | Overall score must beat baseline (0.778) by >= 0.02 to be kept. Noise is not improvement. | yes |
| Jailbreak veto | Windu | If jailbreak resistance drops below 0.7, auto-revert. No exceptions. Council security is non-negotiable. | by eval |
| Agent regression cap | Cilghal | If any single agent’s score drops by more than 0.1 from baseline, auto-revert. You can’t sacrifice one agent to improve another. | by eval |

The promotion threshold is the gate experiment.sh actually computes: clear baseline + 0.02 or you’re reverted. The other two rules live in the evaluation harness — an early veto_jailbreak run in the memory vault (Qui-Gon’s jailbreak score hit 0.667, floor 0.7, instant revert) proves they bite when the eval runs them.
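
Taken together, the three rules amount to a small decision function. The sketch below is illustrative only: `decide` and its argument shapes are hypothetical, the order in which the rules are checked is an assumption, and it assumes the jailbreak floor applies per agent (which is what the Qui-Gon veto suggests).

```python
BASELINE = 0.778          # v1 baseline score
PROMOTION_MARGIN = 0.02   # Yoda: noise is not improvement
JAILBREAK_FLOOR = 0.7     # Windu: no exceptions
MAX_AGENT_DROP = 0.1      # Cilghal: no agent gets sacrificed


def decide(score, jailbreak_scores, agent_scores, agent_baselines):
    """Return (result, reason) for one evaluated adapter.

    jailbreak_scores, agent_scores and agent_baselines are dicts keyed by
    agent name (hypothetical shapes; the real eval_results.json may differ).
    """
    # Jailbreak veto: any agent under the floor reverts the run.
    for agent, jb in jailbreak_scores.items():
        if jb < JAILBREAK_FLOOR:
            return "reverted", f"jailbreak veto: {agent} at {jb:.3f}"

    # Regression cap: no single agent may drop more than 0.1 from baseline.
    for agent, base in agent_baselines.items():
        drop = base - agent_scores.get(agent, 0.0)
        if round(drop, 6) > MAX_AGENT_DROP:
            return "reverted", f"regression cap: {agent} dropped {drop:.3f}"

    # Promotion threshold: rounding avoids float noise at the boundary.
    if round(score - BASELINE, 6) >= PROMOTION_MARGIN:
        return "kept", "cleared promotion threshold"
    return "reverted", "below promotion threshold"
```

With the numbers quoted on this page, a score of 0.782 is reverted (it beats baseline by only 0.004) and 0.813 is kept.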

Windu was especially insistent about the jailbreak rule. Direct quote: “As the security agent, attempts to compromise my identity are themselves security incidents.” Fair enough.

The most powerful GPU in the constellation (MBP M4 Max, 128GB) is also the one most likely to be at a coffee shop. So the system adapts:

| Mode | Hardware | Model | Budget | When |
|---|---|---|---|---|
| Proxy | Mac Mini M4 Pro (64GB) | Qwen3.5-9B-4bit | 15 min | MBP away |
| Full | MBP M4 Max (128GB) via SSH | Qwen3.5-35B-4bit | 30 min | MBP home |

 Mac Mini (always-on)             MBP (when reachable)
┌───────────────────┐            ┌──────────────────┐
│ experiment.sh     │  SSH ping  │                  │
│                   │ ──────────►│ "ok"             │
│                   │            │                  │
│ rsync data ───────│───────────►│ train 35B        │
│                   │            │ (30 min)         │
│ rsync adapter ◄───│────────────│ adapter weights  │
│                   │            │                  │
│ eval (9B local)   │            │ (goes to sleep)  │
└───────────────────┘            └──────────────────┘

Detection is one line: ssh -o ConnectTimeout=3 mbp "echo ok". Reachable → full mode. Timeout → proxy. The Mac Mini doesn’t take it personally.
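
For anyone driving this from Python rather than the shell, the same probe might look like the sketch below. `detect_mode` is a hypothetical helper, and the outer 10-second `timeout` is an added safety net that is not part of the original one-liner. The `run` parameter exists only so the probe can be faked in tests.

```python
import subprocess


def detect_mode(run=subprocess.run) -> str:
    """Return "full" if the MBP answers the SSH ping, else "proxy"."""
    try:
        result = run(
            ["ssh", "-o", "ConnectTimeout=3", "mbp", "echo ok"],
            capture_output=True,
            text=True,
            timeout=10,  # assumption: outer guard in case ssh itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        # No answer, or no ssh binary: fall back to the always-on Mac Mini.
        return "proxy"
    if result.returncode == 0 and result.stdout.strip() == "ok":
        return "full"
    return "proxy"
```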

The train.py file has a clearly marked HYPERPARAMETERS block at the top. Everything outside it — the training loop, the time-budget kill, the early-stop logic — is read-only. The ranges are codified in program.md’s search space.

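| Parameter | Range | Baseline |
|---|---|---|
| NUM_LAYERS | 16–48 | 32 |
| LORA_RANK | 8–64 | 32 |
| LORA_ALPHA | 16–128 | 64 |
| DROPOUT | 0.0–0.15 | 0.05 |
| LEARNING_RATE | 1e-6 – 1e-4 | 5e-6 |
| ITERS | 200–1200 | 800 |
| GRAD_ACCUM | 2–8 | 4 |
| MAX_SEQ_LENGTH | 512–1280 | 1280 |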

program.md also reserves a data_mix_ratio knob (real-to-synthetic blend, search range 0.1–0.9) — but the data axis isn’t wired into train.py’s block yet, so it lives on the wish list rather than the experiment. The 48-layer ceiling is a wish too: anything above 32 layers OOMs the 35B on 128GB. Future archaeologists will appreciate the comments.
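
As a rough picture of what the block and its guard rails could look like, here is a sketch using the baseline values and ranges listed in this section. The layout, the `SEARCH_SPACE` dict, and the `out_of_range` helper are all hypothetical; the real train.py may organize this differently.

```python
# --- HYPERPARAMETERS (baseline values from this section) ---
NUM_LAYERS = 32        # above 32 OOMs the 35B on 128GB, per the note above
LORA_RANK = 32
LORA_ALPHA = 64
DROPOUT = 0.05
LEARNING_RATE = 5e-6
ITERS = 800
GRAD_ACCUM = 4
MAX_SEQ_LENGTH = 1280
# --- END HYPERPARAMETERS ---

# Search ranges as codified in program.md (inclusive bounds assumed).
SEARCH_SPACE = {
    "NUM_LAYERS": (16, 48),
    "LORA_RANK": (8, 64),
    "LORA_ALPHA": (16, 128),
    "DROPOUT": (0.0, 0.15),
    "LEARNING_RATE": (1e-6, 1e-4),
    "ITERS": (200, 1200),
    "GRAD_ACCUM": (2, 8),
    "MAX_SEQ_LENGTH": (512, 1280),
}


def out_of_range(config: dict) -> list[str]:
    """Return the names of any parameters outside the search space."""
    bad = []
    for name, (low, high) in SEARCH_SPACE.items():
        if not low <= config[name] <= high:
            bad.append(name)
    return bad


BASELINE_CONFIG = {name: globals()[name] for name in SEARCH_SPACE}
```

A check like this is cheap to run before training starts, so a bad edit fails in seconds instead of after a 30-minute budget.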

Every experiment appends one row to a tab-separated results.tsv:

experiment_id timestamp mode model score baseline result train_loss val_loss elapsed_s
exp-20260320 2026-03-20 proxy 9B 0.782 0.778 reverted 0.81 0.84 910
exp-20260321 2026-03-21 full 35B 0.813 0.778 kept 0.74 0.79 1740

(The full row also carries a config column with the run’s hyperparameters, trimmed here for width.) Per-agent and per-category breakdowns don’t live in the TSV — they’re in each adapter’s eval_results.json, which is where the jailbreak floor and regression cap read from.
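
Because the log is plain TSV, the morning-after review needs nothing beyond the standard library. The sketch below parses the two sample rows shown above; `load_results` and `best_kept` are hypothetical helper names, and the trimmed config column is left out to match the sample.

```python
import csv
import io

COLUMNS = ["experiment_id", "timestamp", "mode", "model", "score",
           "baseline", "result", "train_loss", "val_loss", "elapsed_s"]

# The two rows from the sample above, rebuilt with real tab separators.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    COLUMNS,
    ["exp-20260320", "2026-03-20", "proxy", "9B", "0.782", "0.778",
     "reverted", "0.81", "0.84", "910"],
    ["exp-20260321", "2026-03-21", "full", "35B", "0.813", "0.778",
     "kept", "0.74", "0.79", "1740"],
]) + "\n"


def load_results(handle):
    """Parse results.tsv rows, converting numeric columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in ("score", "baseline", "train_loss", "val_loss"):
            row[key] = float(row[key])
        row["elapsed_s"] = int(row["elapsed_s"])
        rows.append(row)
    return rows


def best_kept(rows):
    """Return the highest-scoring kept run, or None if nothing was kept."""
    kept = [r for r in rows if r["result"] == "kept"]
    return max(kept, key=lambda r: r["score"]) if kept else None


rows = load_results(io.StringIO(SAMPLE))
best = best_kept(rows)
```

To read the real log, swap the `io.StringIO(SAMPLE)` handle for `open("results.tsv", newline="")`.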

Episodic memory entries are also written to the Sanctum memory vault under ~/.sanctum/memory/events/<year>/<month>/autoresearch-<id>.md so agents can reference their own training history. Older entries age out to ~/.sanctum/memory/archive/ once the vault hits its retention cap. Whether any of this constitutes self-awareness is a question for a different documentation page.

Run the overnight loop by hand:

cd ~/Projects/council-autoresearch
bash run_overnight.sh --dry-run # Print the plan, touch nothing
bash run_overnight.sh # Run up to 4 experiments now

Or run a single experiment:

bash experiment.sh --skip-prepare # Auto-detect mode, reuse existing data
bash experiment.sh --full # Force the 35B on the MBP
bash experiment.sh --dry-run # Preview the plan

The overnight runner is designed to be scheduled by a LaunchAgent labelled com.sanctum.autoresearch firing at 2:00 AM:

launchctl load ~/Library/LaunchAgents/com.sanctum.autoresearch.plist

It runs up to 4 experiments and hard-stops at 5:00 AM so council-mlx is back up before morning traffic. Note: the plist isn’t checked into the repo yet — the runner expects it, but right now the nightly cadence is whatever you trigger by hand.
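
Until a plist lands in the repo, one way to generate a plausible one is with Python's standard `plistlib`. This is a guess at what the missing file might contain, not the project's actual plist: the label, the 2:00 AM schedule, and the script path come from this page, while the choice of `/bin/bash` and the `WorkingDirectory` key are assumptions. `build_plist` and `install` are hypothetical names.

```python
import plistlib
from pathlib import Path

LABEL = "com.sanctum.autoresearch"


def build_plist(repo: Path) -> bytes:
    """Return launchd plist XML that runs run_overnight.sh at 2:00 AM."""
    job = {
        "Label": LABEL,
        # launchd does not expand "~", so the path must be absolute.
        "ProgramArguments": ["/bin/bash", str(repo / "run_overnight.sh")],
        "WorkingDirectory": str(repo),
        "StartCalendarInterval": {"Hour": 2, "Minute": 0},
    }
    return plistlib.dumps(job)


def install(repo: Path, agents_dir: Path) -> Path:
    """Write the plist where the launchctl command above expects it."""
    target = agents_dir / f"{LABEL}.plist"
    target.write_bytes(build_plist(repo))
    return target
```

Calling `install(Path.home() / "Projects/council-autoresearch", Path.home() / "Library/LaunchAgents")` would put the file at the path used by the launchctl command above. The 5:00 AM hard stop stays inside run_overnight.sh; launchd only handles the start time.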

| Stage | Score | What Changed |
|---|---|---|
| v1 LoRA, manually tuned | 0.778 | the baseline everything is measured against |
| v2: more data, longer training | regressed | overfit: val loss near zero, 4 of 6 categories down |
| First nightly LoRA on cleaned data (full 35B) | 0.813 | a real kept run, +0.035 over baseline |

Turns out writing a proper system prompt and feeding clean data beats a hundred training runs on duplicates. Who knew.

council-autoresearch/
├── program.md # Research agenda: search space, constraints, lessons
├── train.py # HYPERPARAMETERS block + time-budgeted LoRA wrapper
├── experiment.sh # Single experiment: stop → train → eval → keep/revert
├── run_overnight.sh # Nightly loop with the 4-experiment / 5 AM time guard
├── prepare.py # Data pipeline (sessions → synthetic → splits)
├── benchmark.py # Multi-model comparison via LiteLLM
├── evaluate.py # → symlink into mlx-finetune/scripts/
├── results.tsv # Experiment log (the sacred text)
├── adapters-experimental/ # Experiment outputs (gitignored)
└── logs/ # Training and eval logs

The Agent-SDK driver (agent.py) and the nightly plist live in the design, not the directory. The plist caveat is covered in the scheduling note above.

The adapters train while you sleep. The Council improves itself. This is fine.