EXP-0001 — Repeatability Band (Seed Variance)
experiment_id: EXP-0001 · status: draft
Status: Complete
Created: 2025-12-25 Last Updated: 2025-12-25
Hypothesis
If we train on the same baseline dataset (D0) with only the random seed changed, then the organism’s learning curves and early behavior snapshots will fall within a narrow variance band, because the training loop and data distribution are otherwise identical.
Setup / Test Plan
What stays fixed:
- Dataset: D0 only (fairy tales baseline).
  - Dataset file: `data/staging/phases/phase0a_early-childhood/fairy_tales.jsonl`
  - Manifest: `data/manifests/month1_manifest_v1.yaml`
- Model config: `organism/configs/phase0_organism.yaml` (unless superseded by a dedicated month-1 config later).
- Training budget: choose one budget and keep it identical across seeds.
  - Recommended week-1 budget: `--max-steps 20000`
  - Optional smoke budget: `--max-steps 2000` (not used for conclusions).
- Prompt suite: fixed, versioned prompts.
  - File: `organism/prompts/v1.json`
  - prompt_set_id: `month1_v1`
- Eval cadence: run evaluation at least at the end of training; optionally every 500 steps early.
- Resume behavior: for this experiment, runs start from scratch (no resume).
What changes:
- `seed` only (e.g., 3 runs).
Planned runs:
- `EXP-0001-seed-07`
- `EXP-0001-seed-11`
- `EXP-0001-seed-13`
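A minimal launch sketch for the three planned runs is below. Only the seed values and `--max-steps 20000` come from this plan; the training entrypoint (`organism.train`) and the `--config`, `--seed`, and `--run-id` flags are hypothetical placeholders for whatever the repo actually exposes.

```python
# Sketch only: "organism.train" and most flags below are hypothetical; the seeds
# and the --max-steps value are taken from the plan above.
import subprocess

SEEDS = [7, 11, 13]

for seed in SEEDS:
    run_id = f"EXP-0001-seed-{seed:02d}"
    cmd = [
        "python", "-m", "organism.train",                      # hypothetical entrypoint
        "--config", "organism/configs/phase0_organism.yaml",   # hypothetical flag
        "--max-steps", "20000",                                # recommended week-1 budget
        "--seed", str(seed),                                   # the only varying parameter
        "--run-id", run_id,                                    # hypothetical flag
    ]
    subprocess.run(cmd, check=True)
```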
Prior run (counts as a pilot baseline)
- Run: `phase0_organism` (already executed on D0 fairy tales) with prompt set `phase0a_fairy_tales_v1`.
- This run is useful as a “first footprint” and sanity check, but it is not directly comparable to the Month-1 prompt suite `month1_v1`.
- We keep it in the experiment record as a pilot and begin the repeatability band with the stabilized prompt suite.
Measurements (Pass/Fail)
Primary (quantitative):
- Loss curve variance band:
  - Compare loss vs `tokens_seen` for each run.
  - Compute the variance of loss over the last 20% of training steps.
- Plateau coefficient consistency (a computation sketch follows this list):
  - Define early slope as the linear-regression slope of loss over `tokens_seen` for steps in the 10–30% range.
  - Define late slope as the linear-regression slope of loss over `tokens_seen` for steps in the 70–90% range.
  - Plateau coefficient = `abs(late_slope) / max(abs(early_slope), eps)`.
  - Compare the distribution across seeds.
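To make the primary metrics concrete, here is a minimal sketch of the late-loss variance and plateau-coefficient computations, assuming the training loop logs `tokens_seen` and loss as parallel arrays (the actual logging format is not specified in this document).

```python
# Sketch of the primary metrics; window boundaries follow the definitions above.
import numpy as np

def late_loss_variance(loss: np.ndarray, frac: float = 0.2) -> float:
    """Variance of loss over the last `frac` of training steps."""
    start = int(len(loss) * (1.0 - frac))
    return float(np.var(loss[start:]))

def slope(tokens_seen: np.ndarray, loss: np.ndarray, lo: float, hi: float) -> float:
    """Linear-regression slope of loss vs tokens_seen over the [lo, hi) step fraction."""
    n = len(loss)
    i, j = int(n * lo), int(n * hi)
    return float(np.polyfit(tokens_seen[i:j], loss[i:j], deg=1)[0])

def plateau_coefficient(tokens_seen: np.ndarray, loss: np.ndarray, eps: float = 1e-12) -> float:
    """abs(late_slope) / max(abs(early_slope), eps), per the definition above."""
    early = slope(tokens_seen, loss, 0.10, 0.30)
    late = slope(tokens_seen, loss, 0.70, 0.90)
    return abs(late) / max(abs(early), eps)
```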
Secondary (behavior snapshots):
- Eval intelligibility + collapse checks from `organism/eval/eval.py` (illustrative sketches follow this list):
  - `intelligible`
  - `char_repetition_rate`
  - `ngram_repetition_rate`
- Qualitative tags (human-labeled): coherent / incoherent / style drift.
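The real implementations live in `organism/eval/eval.py` and are not reproduced here; the following are plausible stand-in definitions of the two repetition metrics, for illustration only.

```python
# Illustrative definitions only; the actual metrics in organism/eval/eval.py may differ.
from collections import Counter

def char_repetition_rate(text: str) -> float:
    """Fraction of characters accounted for by the single most frequent character."""
    if not text:
        return 0.0
    counts = Counter(text)
    return counts.most_common(1)[0][1] / len(text)

def ngram_repetition_rate(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an earlier n-gram in the same text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```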
Pass criteria (initial):
- Plateau coefficient and late-loss variance are “small enough” that week-2+ changes can be attributed to the intervention rather than to seed noise.
- Thresholds to set after first 3 runs; record the variance band as the baseline.
Results
Runs executed:
- Three runs were executed on the identical D0 dataset of children's fairy tales. Each run differed only in the seed value and ran for 20000 steps.
Observed:
- Initial high-level observation: loss falls predictably along a smooth decay curve (exact functional form still to be determined) and then plateaus. Prompt evaluation was consistent across seeds, with most responses consisting of a unique string followed by a repeated string.
Interpretation
- Nothing substantial yet: the training set was minimal, the runs were capped, and the prompts were generic. For now we can only say that the loss curve is acceptable and that the current dataset produces little additional value. Further experiments will clarify this.
Decision
- Adopt the observed variance band as the repeatability baseline.
- Next actions:
  - Experiment 2
Runs
| run_id | loss_best | plateau | tokens_seen | prompt_set |
|---|---|---|---|---|
| EXP-0001-seed-13 | 1.2741494178771973 | 0.050275210908983584 | 81920000 | month1_v1 |
| EXP-0001-seed-11 | 1.2917333841323853 | 0.0649510946667559 | 81920000 | month1_v1 |
| EXP-0001-seed-07 | 1.2775521278381348 | 0.09764066947888612 | 81920000 | month1_v1 |
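As a quick follow-up, the spread of `loss_best` and `plateau` across the three seeds can be summarized directly from the table above; this sketch simply restates the tabled values.

```python
# Values copied verbatim from the Runs table above.
import statistics

runs = {
    "EXP-0001-seed-13": (1.2741494178771973, 0.050275210908983584),
    "EXP-0001-seed-11": (1.2917333841323853, 0.0649510946667559),
    "EXP-0001-seed-07": (1.2775521278381348, 0.09764066947888612),
}

loss_best = [v[0] for v in runs.values()]
plateau = [v[1] for v in runs.values()]

print(f"loss_best: min={min(loss_best):.4f} max={max(loss_best):.4f} "
      f"spread={max(loss_best) - min(loss_best):.4f}")
print(f"plateau:   min={min(plateau):.4f} max={max(plateau):.4f} "
      f"stdev={statistics.stdev(plateau):.4f}")
```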
Training
Runs + metrics.
Eval
Prompt snapshots.
Insights
Notes + conclusions.