
EXP-0001 — Repeatability Band (Seed Variance)

experiment_id: EXP-0001 · status: complete

Status: Complete

Created: 2025-12-25 Last Updated: 2025-12-25

Hypothesis

If we train on the same baseline dataset (D0) with only the random seed changed, then the organism’s learning curves and early behavior snapshots will fall within a narrow variance band, because the training loop and data distribution are otherwise identical.

Setup / Test Plan

What stays fixed:

  • Dataset: D0 only (fairy tales baseline).
    • Dataset file: data/staging/phases/phase0a_early-childhood/fairy_tales.jsonl
    • Manifest: data/manifests/month1_manifest_v1.yaml
  • Model config: organism/configs/phase0_organism.yaml (unless superseded by a dedicated month-1 config later).
  • Training budget: choose one budget and keep it identical across seeds.
    • Recommended week-1 budget: --max-steps 20000
    • Optional smoke budget: --max-steps 2000 (not used for conclusions).
  • Prompt suite: fixed, versioned prompts.
    • File: organism/prompts/v1.json
    • prompt_set_id: month1_v1
  • Eval cadence: run evaluation at least at the end of training; optionally every 500 steps early.
  • Resume behavior: for this experiment, runs start from scratch (no resume).

What changes:

  • seed only (e.g., 3 runs).

Planned runs:

  • EXP-0001-seed-07
  • EXP-0001-seed-11
  • EXP-0001-seed-13
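The planned runs above could be launched with a loop like the following. This is a sketch only: the training entrypoint (`organism/train.py`) and all flag names other than `--max-steps` are assumptions, since the record does not name the actual CLI; only the config path, dataset path, and step budget come from this document. Commands are printed as a dry run rather than executed.

```python
# Hypothetical launch loop for the three planned seed runs (dry run).
# organism/train.py and the flag names besides --max-steps are assumptions.
import shlex

SEEDS = ["07", "11", "13"]

def launch_command(seed: str) -> list[str]:
    """Build the (assumed) training command for one seed run."""
    return [
        "python", "organism/train.py",
        "--config", "organism/configs/phase0_organism.yaml",
        "--data", "data/staging/phases/phase0a_early-childhood/fairy_tales.jsonl",
        "--max-steps", "20000",
        "--seed", seed,
        "--run-id", f"EXP-0001-seed-{seed}",
    ]

for seed in SEEDS:
    # Print instead of subprocess.run(...) so nothing is actually trained.
    print(shlex.join(launch_command(seed)))
```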

Prior run (counts as a pilot baseline)

  • phase0_organism (already executed on D0 fairy tales) with prompt set phase0a_fairy_tales_v1.
    • This run is useful as a “first footprint” and sanity check, but it is not directly comparable to Month‑1 prompt suite month1_v1.
    • We keep it in the experiment record as a pilot and begin the repeatability band with the stabilized prompt suite.

Measurements (Pass/Fail)

Primary (quantitative):

  • Loss curve variance band:
    • Compare loss vs tokens_seen for each run.
    • Compute variance of loss in the last 20% of training steps.
  • Plateau coefficient consistency:
    • Define early slope as linear regression slope of loss over tokens_seen for steps in 10–30%.
    • Define late slope as linear regression slope of loss over tokens_seen for steps in 70–90%.
    • Plateau coefficient = abs(late_slope) / max(abs(early_slope), eps).
    • Compare distribution across seeds.
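The primary measurements above can be sketched directly. This is a minimal illustration, assuming each run's log yields parallel lists of `tokens_seen` and `loss` values; it does not depend on the actual training code.

```python
# Sketch of the late-loss-variance and plateau-coefficient measurements,
# assuming parallel per-step lists of tokens_seen and loss.
import statistics

def slope(xs, ys):
    """Ordinary least-squares slope of ys over xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def window(xs, ys, lo, hi):
    """Slice the (lo, hi) fraction of the run by step index."""
    n = len(xs)
    i, j = int(n * lo), int(n * hi)
    return xs[i:j], ys[i:j]

def plateau_coefficient(tokens, loss, eps=1e-12):
    """abs(late_slope) / max(abs(early_slope), eps), per the definitions above."""
    early = slope(*window(tokens, loss, 0.10, 0.30))
    late = slope(*window(tokens, loss, 0.70, 0.90))
    return abs(late) / max(abs(early), eps)

def late_loss_variance(loss):
    """Variance of loss over the last 20% of steps."""
    tail = loss[int(len(loss) * 0.8):]
    return statistics.pvariance(tail)
```

For a typical decaying loss curve the early slope is steeper than the late slope, so the plateau coefficient falls between 0 and 1, with smaller values indicating a flatter tail.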

Secondary (behavior snapshots):

  • Eval intelligibility + collapse checks from organism/eval/eval.py:
    • intelligible
    • char_repetition_rate
    • ngram_repetition_rate
  • Qualitative tags (human-labeled): coherent / incoherent / style drift.
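The actual collapse checks live in `organism/eval/eval.py`, which is not reproduced here; the following is a hedged sketch of one plausible definition of the two repetition metrics, for illustration only, and the real implementation may differ.

```python
# Plausible definitions of the collapse-check metrics named above.
# The real definitions are in organism/eval/eval.py and may differ.
def char_repetition_rate(text: str) -> float:
    """Fraction of adjacent character pairs where the character repeats."""
    if len(text) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    return repeats / (len(text) - 1)

def ngram_repetition_rate(text: str, n: int = 3) -> float:
    """Fraction of character n-grams that duplicate an earlier n-gram."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)
```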

Pass criteria (initial):

  • Plateau coefficient and late-loss variance are “small enough” that week-2+ changes are attributable.
    • Thresholds to set after first 3 runs; record the variance band as the baseline.

Results

Runs executed:

  • Three runs were executed on the same D0 fairy-tales dataset; each run changed only the random seed and trained for 20,000 steps.

Observed:

  • High-level observation: loss fell predictably along a decaying curve before plateauing; the exact functional form of the curve has not yet been fitted. Prompt evaluation was consistent across seeds, with most responses consisting of a unique string followed by a repeated string.

Interpretation

  • Nothing substantial yet: the training set was minimal, the runs were capped, and the prompts were generic. For now we can only say that the loss curve is acceptable and that the current dataset yields little additional value late in training. Further experiments will elucidate this.

Decision

  • Adopt
  • Next actions: Experiment 2
| run_id | loss_best | plateau | tokens_seen | prompt_set |
| --- | --- | --- | --- | --- |
| EXP-0001-seed-13 | 1.2741494178771973 | 0.050275210908983584 | 81920000 | month1_v1 |
| EXP-0001-seed-11 | 1.2917333841323853 | 0.0649510946667559 | 81920000 | month1_v1 |
| EXP-0001-seed-07 | 1.2775521278381348 | 0.09764066947888612 | 81920000 | month1_v1 |
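The cross-seed variance band implied by the run metrics can be computed directly. The numbers below are transcribed from the run records; the aggregation itself is a sketch, not part of the repo.

```python
# Cross-seed variance band from the recorded run metrics
# (loss_best and plateau values transcribed from the results above).
import statistics

runs = {
    "EXP-0001-seed-13": {"loss_best": 1.2741494178771973, "plateau": 0.050275210908983584},
    "EXP-0001-seed-11": {"loss_best": 1.2917333841323853, "plateau": 0.0649510946667559},
    "EXP-0001-seed-07": {"loss_best": 1.2775521278381348, "plateau": 0.09764066947888612},
}

loss = [r["loss_best"] for r in runs.values()]
plat = [r["plateau"] for r in runs.values()]

band = {
    "loss_mean": statistics.mean(loss),
    "loss_stdev": statistics.stdev(loss),
    "plateau_mean": statistics.mean(plat),
    "plateau_stdev": statistics.stdev(plat),
}
print(band)
```

Note that 81,920,000 tokens over 20,000 steps works out to 4,096 tokens per optimizer step, though the batch/sequence configuration behind that figure is not recorded here.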