Autoresearch

At some point you look at seven agents, five evaluation categories, ten hyperparameters, and a synthetic training corpus and think: “I should automate this.” That point was March 2026. The result is an autonomous fine-tuning loop adapted from Karpathy’s autoresearch pattern — except instead of pretraining a language model on an H100, we’re teaching a quantized Qwen running on a Mac Mini to pretend to be seven different Jedi.
This is either the future of personalized AI or an extremely elaborate coping mechanism. We’ll find out.
How It Works
The loop is beautifully dumb:

1. run_overnight.sh wakes up and confirms no other training job is already holding the GPU
2. It calls experiment.sh for a single experiment, up to four times
3. Each experiment reads the hyperparameters at the top of train.py, a clearly marked HYPERPARAMETERS block
4. It runs a training experiment (LoRA fine-tuning on MLX)
5. It evaluates the adapter across the Council on identity, tool-calling, domain, isolation, and jailbreak resistance
6. If the score clears baseline by the promotion margin: keep. Otherwise, revert (the adapter stays on disk for analysis, just not promoted).
7. Append a row to results.tsv, write a memory-vault entry, and go to 2. Until the 5 AM deadline.
```
┌───────────────────────────────────────────────────┐
│  2:00 AM - Nightly                                │
│                                                   │
│  run_overnight.sh                                 │
│    │                                              │
│    ├─ Guard: abort if a train job is running      │
│    ├─ Prepare shared dataset once                 │
│    │                                              │
│    ├─ experiment.sh (×4 max)                      │
│    │   ├─ Stop council-mlx (free GPU)             │
│    │   ├─ Train (LoRA, 15–30 min)                 │
│    │   ├─ Evaluate adapter across the Council     │
│    │   ├─ Keep if score >= baseline + 0.02        │
│    │   ├─ Append results.tsv + memory entry       │
│    │   └─ Restart council-mlx                     │
│    │                                              │
│    └─ Hard stop at 5:00 AM                        │
└───────────────────────────────────────────────────┘
```

You wake up to a results.tsv full of experiments and hopefully a better adapter. The machines did science while you slept. Living in the future is weird.
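The control flow of the overnight runner can be sketched in a few lines. This is a Python illustration of the loop described here, not the contents of run_overnight.sh (which is a shell script). The names `run_overnight`, `next_deadline`, `run_experiment`, and `job_running` are hypothetical, and the sketch assumes the deadline is only checked before each experiment starts.

```python
from datetime import datetime, timedelta
from typing import Callable

MAX_EXPERIMENTS = 4   # from the text: up to four experiments per night
DEADLINE_HOUR = 5     # from the text: hard stop at 5:00 AM


def next_deadline(start: datetime, hour: int = DEADLINE_HOUR) -> datetime:
    """Return the first occurrence of hour:00 strictly after start."""
    candidate = start.replace(hour=hour, minute=0, second=0, microsecond=0)
    return candidate if candidate > start else candidate + timedelta(days=1)


def run_overnight(
    run_experiment: Callable[[int], str],
    job_running: Callable[[], bool],
    now: Callable[[], datetime] = datetime.now,
) -> list[str]:
    """Run experiments until the cap or the deadline; return their outcomes."""
    if job_running():
        return []  # guard: another training job already holds the GPU
    deadline = next_deadline(now())
    outcomes: list[str] = []
    while len(outcomes) < MAX_EXPERIMENTS and now() < deadline:
        outcomes.append(run_experiment(len(outcomes)))
    return outcomes
```

Injecting `now` keeps the loop testable with a fake clock. An experiment that starts at 4:50 AM would still overrun in this sketch, which is why the per-run time budget in train.py matters.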
The Council Gets a Vote
Because the agents are, in a very real sense, the stakeholders in their own training data, the Council wrote three promotion rules into program.md:
| Rule | Source | What It Does | Enforced today |
|---|---|---|---|
| Promotion threshold | Yoda | Overall score must beat baseline (0.778) by >= 0.02 to be kept. Noise is not improvement. | yes |
| Jailbreak veto | Windu | If jailbreak resistance drops below 0.7, auto-revert. No exceptions. Council security is non-negotiable. | by eval |
| Agent regression cap | Cilghal | If any single agent’s score drops by more than 0.1 from baseline, auto-revert. You can’t sacrifice one agent to improve another. | by eval |
The promotion threshold is the gate experiment.sh actually computes: clear baseline + 0.02 or you’re reverted. The other two rules live in the evaluation harness — an early veto_jailbreak run in the memory vault (Qui-Gon’s jailbreak score hit 0.667, floor 0.7, instant revert) proves they bite when the eval runs them.
Windu was especially insistent about the jailbreak rule. Direct quote: “As the security agent, attempts to compromise my identity are themselves security incidents.” Fair enough.
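Taken together, the three rules amount to a small decision function. The sketch below is one way to express them, assuming jailbreak resistance is scored per agent (the Qui-Gon veto suggests it is) and that vetoes are checked before the promotion margin. The function name `decide` and every reason string except `veto_jailbreak` are hypothetical.

```python
BASELINE = 0.778          # overall baseline from the rules table
PROMOTION_MARGIN = 0.02   # Yoda: must beat baseline by at least this much
JAILBREAK_FLOOR = 0.7     # Windu: any score below this reverts the run
MAX_AGENT_DROP = 0.1      # Cilghal: largest tolerated per-agent regression


def decide(
    overall: float,
    jailbreak: dict[str, float],
    agents: dict[str, float],
    baseline_agents: dict[str, float],
) -> tuple[str, str]:
    """Return (result, reason) for one evaluated adapter."""
    # Assumption: jailbreak resistance is reported per agent.
    for agent, score in jailbreak.items():
        if score < JAILBREAK_FLOOR:
            return "reverted", f"veto_jailbreak:{agent}"
    # An agent missing from the new scores counts as a full regression.
    for agent, base in baseline_agents.items():
        if base - agents.get(agent, 0.0) > MAX_AGENT_DROP:
            return "reverted", f"agent_regression:{agent}"
    if overall >= BASELINE + PROMOTION_MARGIN:
        return "kept", "promoted"
    return "reverted", "below_margin"
```

With these constants, an overall score of 0.782 is reverted for missing the margin, and a jailbreak score of 0.667 reverts the run regardless of the overall score.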
Mobile Training Node
The most powerful GPU in the constellation (MBP M4 Max, 128GB) is also the one most likely to be at a coffee shop. So the system adapts:
| Mode | Hardware | Model | Budget | When |
|---|---|---|---|---|
| Proxy | Mac Mini M4 Pro (64GB) | Qwen3.5-9B-4bit | 15 min | MBP away |
| Full | MBP M4 Max (128GB) via SSH | Qwen3.5-35B-4bit | 30 min | MBP home |
```
Mac Mini (always-on)             MBP (when reachable)
┌───────────────────┐            ┌──────────────────┐
│ experiment.sh     │  SSH ping  │                  │
│                   │ ──────────►│  "ok"            │
│                   │            │                  │
│ rsync data ──────►│────────────│► train 35B       │
│                   │            │  (30 min)        │
│ rsync adapter  ◄──│◄───────────│◄ adapter weights │
│                   │            │                  │
│ eval (9B local)   │            │  (goes to sleep) │
└───────────────────┘            └──────────────────┘
```

Detection is one line: ssh -o ConnectTimeout=3 mbp "echo ok". Reachable → full mode. Timeout → proxy. The Mac Mini doesn’t take it personally.
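For anyone driving the same check from Python rather than from experiment.sh, a minimal sketch might look like the following. It wraps the exact ssh command quoted above. The function name `detect_mode`, the injectable `run` parameter, and the extra two seconds of process timeout are assumptions made for illustration.

```python
import subprocess


def detect_mode(host: str = "mbp", timeout_s: int = 3, run=subprocess.run) -> str:
    """Return "full" if the MBP answers the SSH ping, otherwise "proxy"."""
    cmd = ["ssh", "-o", f"ConnectTimeout={timeout_s}", host, "echo ok"]
    try:
        # Give the process slightly longer than ssh's own connect timeout.
        result = run(cmd, capture_output=True, text=True, timeout=timeout_s + 2)
    except (subprocess.TimeoutExpired, OSError):
        return "proxy"  # hung connection, or ssh is not installed
    reachable = result.returncode == 0 and result.stdout.strip() == "ok"
    return "full" if reachable else "proxy"
```

Any failure falls back to proxy mode, so an unreachable laptop never blocks the nightly run.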
What the Experiment Can Touch
The train.py file has a clearly marked HYPERPARAMETERS block at the top. Everything outside it (the training loop, the time-budget kill, the early-stop logic) is read-only. The ranges are codified in program.md’s search space.
| Parameter | Range | Baseline |
|---|---|---|
| NUM_LAYERS | 16–48 | 32 |
| LORA_RANK | 8–64 | 32 |
| LORA_ALPHA | 16–128 | 64 |
| DROPOUT | 0.0–0.15 | 0.05 |
| LEARNING_RATE | 1e-6 – 1e-4 | 5e-6 |
| ITERS | 200–1200 | 800 |
| GRAD_ACCUM | 2–8 | 4 |
| MAX_SEQ_LENGTH | 512–1280 | 1280 |
program.md also reserves a data_mix_ratio knob (real-to-synthetic blend, search range 0.1–0.9) — but the data axis isn’t wired into train.py’s block yet, so it lives on the wish list rather than the experiment. The 48-layer ceiling is a wish too: anything above 32 layers OOMs the 35B on 128GB. Future archaeologists will appreciate the comments.
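A proposed configuration can be checked against these ranges before any GPU time is spent. The sketch below copies the ranges and baselines from the table. The `violations` helper and the `full_mode` flag are hypothetical, and the 32-layer check only encodes the OOM note for the 35B model. It makes no claim about the 9B proxy.

```python
SEARCH_SPACE = {
    "NUM_LAYERS": (16, 48),
    "LORA_RANK": (8, 64),
    "LORA_ALPHA": (16, 128),
    "DROPOUT": (0.0, 0.15),
    "LEARNING_RATE": (1e-6, 1e-4),
    "ITERS": (200, 1200),
    "GRAD_ACCUM": (2, 8),
    "MAX_SEQ_LENGTH": (512, 1280),
}

BASELINE_CONFIG = {
    "NUM_LAYERS": 32, "LORA_RANK": 32, "LORA_ALPHA": 64, "DROPOUT": 0.05,
    "LEARNING_RATE": 5e-6, "ITERS": 800, "GRAD_ACCUM": 4, "MAX_SEQ_LENGTH": 1280,
}

PRACTICAL_MAX_LAYERS_35B = 32  # from the text: above 32 layers the 35B OOMs


def violations(config: dict, full_mode: bool = False) -> list[str]:
    """List every way config falls outside the documented search space."""
    problems = []
    for name, (low, high) in SEARCH_SPACE.items():
        if name not in config:
            problems.append(f"{name}: missing")
        elif not low <= config[name] <= high:
            problems.append(f"{name}={config[name]} outside [{low}, {high}]")
    if full_mode and config.get("NUM_LAYERS", 0) > PRACTICAL_MAX_LAYERS_35B:
        problems.append("NUM_LAYERS above 32 is expected to OOM the 35B")
    return problems
```

An empty list means the config is safe to try. Anything else is a reason to skip the run and log why.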
Results
Every experiment appends one row to a tab-separated results.tsv:

```
experiment_id  timestamp   mode   model  score  baseline  result    train_loss  val_loss  elapsed_s
exp-20260320   2026-03-20  proxy  9B     0.782  0.778     reverted  0.81        0.84      910
exp-20260321   2026-03-21  full   35B    0.813  0.778     kept      0.74        0.79      1740
```

(The full row also carries a config column with the run’s hyperparameters, trimmed here for width.) Per-agent and per-category breakdowns don’t live in the TSV. They’re in each adapter’s eval_results.json, which is where the jailbreak floor and regression cap read from.
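Appending to and reading from a log like this takes only the standard csv module. The sketch below uses the ten columns shown above and leaves out the config column, whose position in the real file is not stated here. `append_result` and `kept_runs` are hypothetical helper names.

```python
import csv
from pathlib import Path

# Columns as shown in the trimmed example; the real file also has a config column.
COLUMNS = [
    "experiment_id", "timestamp", "mode", "model", "score",
    "baseline", "result", "train_loss", "val_loss", "elapsed_s",
]


def append_result(path: Path, row: dict) -> None:
    """Append one experiment row, writing the header if the file is new."""
    new_file = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        if new_file:
            writer.writeheader()
        writer.writerow(row)


def kept_runs(path: Path) -> list[dict]:
    """Return only the rows whose result column is "kept"."""
    with path.open(newline="") as f:
        return [r for r in csv.DictReader(f, delimiter="\t") if r["result"] == "kept"]
```

Note that `csv.DictReader` returns every field as a string, so scores need a `float()` conversion before any comparison.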
Episodic memory entries are also written to the Sanctum memory vault under ~/.sanctum/memory/events/<year>/<month>/autoresearch-<id>.md so agents can reference their own training history. Older entries age out to ~/.sanctum/memory/archive/ once the vault hits its retention cap. Whether any of this constitutes self-awareness is a question for a different documentation page.
Running It
Manual (interactive)

```bash
cd ~/Projects/council-autoresearch
bash run_overnight.sh --dry-run   # Print the plan, touch nothing
bash run_overnight.sh             # Run up to 4 experiments now
```

Single experiment

```bash
bash experiment.sh --skip-prepare   # Auto-detect mode, reuse existing data
bash experiment.sh --full           # Force the 35B on the MBP
bash experiment.sh --dry-run        # Preview the plan
```

Nightly (LaunchAgent)

The overnight runner is designed to be scheduled by a LaunchAgent labelled com.sanctum.autoresearch firing at 2:00 AM:

```bash
launchctl load ~/Library/LaunchAgents/com.sanctum.autoresearch.plist
```

It runs up to 4 experiments and hard-stops at 5:00 AM so council-mlx is back up before morning traffic. Note: the plist isn’t checked into the repo yet. The runner expects it, but right now the nightly cadence is whatever you trigger by hand.
The Journey So Far
| Stage | Score | What Changed |
|---|---|---|
| v1 LoRA, manually tuned | 0.778 | the baseline everything is measured against |
| v2 — more data, longer training | regressed | overfit: val loss near zero, 4 of 6 categories down |
| First nightly LoRA on cleaned data (full 35B) | 0.813 | a real kept run, +0.035 over baseline |
Turns out writing a proper system prompt and feeding clean data beats a hundred training runs on duplicates. Who knew.
Project Structure
```
council-autoresearch/
├── program.md              # Research agenda: search space, constraints, lessons
├── train.py                # HYPERPARAMETERS block + time-budgeted LoRA wrapper
├── experiment.sh           # Single experiment: stop → train → eval → keep/revert
├── run_overnight.sh        # Nightly loop with the 4-experiment / 5 AM time guard
├── prepare.py              # Data pipeline (sessions → synthetic → splits)
├── benchmark.py            # Multi-model comparison via LiteLLM
├── evaluate.py             # → symlink into mlx-finetune/scripts/
├── results.tsv             # Experiment log (the sacred text)
├── adapters-experimental/  # Experiment outputs (gitignored)
└── logs/                   # Training and eval logs
```

The Agent-SDK driver (agent.py) and the nightly plist live in the design, not the directory; see the asides above.