
EXP-0001 — Repeatability Band (Seed Variance)

experiment_id: EXP-0001 · status: complete

Status: Complete

Created: 2025-12-25 Last Updated: 2025-12-25

Hypothesis

If we train on the same baseline dataset (D0) with only the random seed changed, then the organism’s learning curves and early behavior snapshots will fall within a narrow variance band, because the training loop and data distribution are otherwise identical.

Setup / Test Plan

What stays fixed:

  • Dataset: D0 only (fairy tales baseline).
    • Dataset file: data/staging/phases/phase0a_early-childhood/fairy_tales.jsonl
    • Manifest: data/manifests/month1_manifest_v1.yaml
  • Model config: organism/configs/phase0_organism.yaml (unless superseded by a dedicated month-1 config later).
  • Training budget: choose one budget and keep it identical across seeds.
    • Recommended week-1 budget: --max-steps 20000
    • Optional smoke budget: --max-steps 2000 (not used for conclusions).
  • Prompt suite: fixed, versioned prompts.
    • File: organism/prompts/v1.json
    • prompt_set_id: month1_v1
  • Eval cadence: run evaluation at least at the end of training; optionally every 500 steps early.
  • Resume behavior: for this experiment, runs start from scratch (no resume).

What changes:

  • seed only (e.g., 3 runs).

Planned runs:

  • EXP-0001-seed-07
  • EXP-0001-seed-11
  • EXP-0001-seed-13
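The planned runs above could be launched with a loop like the following. This is a sketch only: the training entrypoint (`organism/train.py`) and all flag names other than `--max-steps` are assumptions, since the record does not name the actual CLI; only the config path, dataset path, and step budget come from this document. Commands are printed as a dry run rather than executed.

```python
# Hypothetical launch loop for the three planned seed runs (dry run).
# organism/train.py and the flag names besides --max-steps are assumptions.
import shlex

SEEDS = ["07", "11", "13"]

def launch_command(seed: str) -> list[str]:
    """Build the (assumed) training command for one seed run."""
    return [
        "python", "organism/train.py",
        "--config", "organism/configs/phase0_organism.yaml",
        "--data", "data/staging/phases/phase0a_early-childhood/fairy_tales.jsonl",
        "--max-steps", "20000",
        "--seed", seed,
        "--run-id", f"EXP-0001-seed-{seed}",
    ]

for seed in SEEDS:
    # Print instead of subprocess.run(...) so nothing is actually trained.
    print(shlex.join(launch_command(seed)))
```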

Prior run (counts as a pilot baseline)

  • phase0_organism (already executed on D0 fairy tales) with prompt set phase0a_fairy_tales_v1.
    • This run is useful as a “first footprint” and sanity check, but it is not directly comparable to Month‑1 prompt suite month1_v1.
    • We keep it in the experiment record as a pilot and begin the repeatability band with the stabilized prompt suite.

Measurements (Pass/Fail)

Primary (quantitative):

  • Loss curve variance band:
    • Compare loss vs tokens_seen for each run.
    • Compute variance of loss in the last 20% of training steps.
  • Plateau coefficient consistency:
    • Define early slope as linear regression slope of loss over tokens_seen for steps in 10–30%.
    • Define late slope as linear regression slope of loss over tokens_seen for steps in 70–90%.
    • Plateau coefficient = abs(late_slope) / max(abs(early_slope), eps).
    • Compare distribution across seeds.
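The primary measurements above can be sketched directly. This is a minimal illustration, assuming each run's log yields parallel lists of `tokens_seen` and `loss` values; it does not depend on the actual training code.

```python
# Sketch of the late-loss-variance and plateau-coefficient measurements,
# assuming parallel per-step lists of tokens_seen and loss.
import statistics

def slope(xs, ys):
    """Ordinary least-squares slope of ys over xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def window(xs, ys, lo, hi):
    """Slice the (lo, hi) fraction of the run by step index."""
    n = len(xs)
    i, j = int(n * lo), int(n * hi)
    return xs[i:j], ys[i:j]

def plateau_coefficient(tokens, loss, eps=1e-12):
    """abs(late_slope) / max(abs(early_slope), eps), per the definitions above."""
    early = slope(*window(tokens, loss, 0.10, 0.30))
    late = slope(*window(tokens, loss, 0.70, 0.90))
    return abs(late) / max(abs(early), eps)

def late_loss_variance(loss):
    """Variance of loss over the last 20% of steps."""
    tail = loss[int(len(loss) * 0.8):]
    return statistics.pvariance(tail)
```

For a typical decaying loss curve the early slope is steeper than the late slope, so the plateau coefficient falls between 0 and 1, with smaller values indicating a flatter tail.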

Secondary (behavior snapshots):

  • Eval intelligibility + collapse checks from organism/eval/eval.py:
    • intelligible
    • char_repetition_rate
    • ngram_repetition_rate
  • Qualitative tags (human-labeled): coherent / incoherent / style drift.
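The actual collapse checks live in `organism/eval/eval.py`, which is not reproduced here; the following is a hedged sketch of one plausible definition of the two repetition metrics, for illustration only, and the real implementation may differ.

```python
# Plausible definitions of the collapse-check metrics named above.
# The real definitions are in organism/eval/eval.py and may differ.
def char_repetition_rate(text: str) -> float:
    """Fraction of adjacent character pairs where the character repeats."""
    if len(text) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(text, text[1:]) if a == b)
    return repeats / (len(text) - 1)

def ngram_repetition_rate(text: str, n: int = 3) -> float:
    """Fraction of character n-grams that duplicate an earlier n-gram."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)
```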

Pass criteria (initial):

  • Plateau coefficient and late-loss variance are “small enough” that week-2+ changes are attributable.
    • Thresholds to set after first 3 runs; record the variance band as the baseline.

Results

Runs executed:

  • Three runs were executed on the same D0 fairy-tales dataset; each run changed only the random seed and trained for 20,000 steps.

Observed:

  • High-level observation: loss fell predictably along a decaying curve before plateauing; the exact functional form of the curve has not yet been fitted. Prompt evaluation was consistent across seeds, with most responses consisting of a unique string followed by a repeated string.

Interpretation

  • Nothing substantial yet: the training set was minimal, the runs were capped, and the prompts were generic. For now we can only say that the loss curve is acceptable and that the current dataset yields little additional value late in training. Further experiments will elucidate this.

Decision

  • Adopt
  • Next actions: Experiment 2
| run_id | loss_best | plateau | tokens_seen | prompt_set |
| --- | --- | --- | --- | --- |
| EXP-0001-seed-13 | 1.2741494178771973 | 0.050275210908983584 | 81920000 | month1_v1 |
| EXP-0001-seed-11 | 1.2917333841323853 | 0.0649510946667559 | 81920000 | month1_v1 |
| EXP-0001-seed-07 | 1.2775521278381348 | 0.09764066947888612 | 81920000 | month1_v1 |
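The cross-seed variance band implied by the run metrics can be computed directly. The numbers below are transcribed from the run records; the aggregation itself is a sketch, not part of the repo.

```python
# Cross-seed variance band from the recorded run metrics
# (loss_best and plateau values transcribed from the results above).
import statistics

runs = {
    "EXP-0001-seed-13": {"loss_best": 1.2741494178771973, "plateau": 0.050275210908983584},
    "EXP-0001-seed-11": {"loss_best": 1.2917333841323853, "plateau": 0.0649510946667559},
    "EXP-0001-seed-07": {"loss_best": 1.2775521278381348, "plateau": 0.09764066947888612},
}

loss = [r["loss_best"] for r in runs.values()]
plat = [r["plateau"] for r in runs.values()]

band = {
    "loss_mean": statistics.mean(loss),
    "loss_stdev": statistics.stdev(loss),
    "plateau_mean": statistics.mean(plat),
    "plateau_stdev": statistics.stdev(plat),
}
print(band)
```

Note that 81,920,000 tokens over 20,000 steps works out to 4,096 tokens per optimizer step, though the batch/sequence configuration behind that figure is not recorded here.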