# SKLD — Project Journal
## Entry #17: The Clean-Code Overhaul
**Date**: April 19, 2026
**Session Duration**: ~6 hours
**Participants**: Matt + Claude Code (Opus 4.7, 1M context)
---
### The Starting Point
The codebase worked — atomic evolution ran end-to-end, seven Elixir
families were seeded, the homepage was shipping. But it had been built
in bursts. Matt put it plainly: "we vibe coded this thing in a couple
days and i'm certain the code is probably a mess." Then the challenge:
"imagine thousands of people are reviewing our codebase and they will
nitpick all the details."
The request had two parts. Build a concise clean-code reference doc
that captures Python + React/TS best practices (with a functional
preference). Then refactor the codebase to meet that standard.
---
### Phase 1: Write the rubric first
Every PR is reviewed against something. Rather than refactor first and
document after, the first wave landed `docs/clean-code.md` — a
341-line scannable rubric with ten sections (among them naming,
functions, errors, data, async, functional idioms, React/TS, testing,
and comments) and a 15-item review checklist. The rubric was grounded
in actual anti-patterns
found during exploration: `breeder.py:153` for bare-except-plus-print,
`evolution.py:46` for mutable globals, `AtomicRunDetail.tsx:131` for
raw fetch chains.
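To make the first anti-pattern concrete, here is a minimal sketch of bare-except-plus-print next to a stricter alternative. The function names and the JSON payload shape are hypothetical; this is not the code at `breeder.py:153`.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_score_swallowing(raw: str) -> float:
    """Anti-pattern: a bare except plus print hides every failure."""
    try:
        return float(json.loads(raw)["score"])
    except:  # noqa: E722 - kept only to illustrate the problem
        print("could not parse score")
        return 0.0


def parse_score_strict(raw: str) -> float:
    """Catch only the expected failures, log them, and re-raise with context."""
    try:
        return float(json.loads(raw)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        logger.warning("score payload rejected: %s", exc)
        raise ValueError(f"invalid score payload: {raw!r}") from exc
```

The first version turns a malformed payload into a silent `0.0`; the second makes the caller decide what a failure means.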
Putting the standard in writing first meant every subsequent PR had a
contract — the review decision was "does this match the doc?" rather
than "do I like this?"
---
### Phase 2: Tooling before surgery
Wave 1 was the least glamorous and probably the most load-bearing.
The codebase had ruff but no `mypy`, no ESLint, no Prettier, no
pre-commit hooks, no CI. Adding them surfaced 45 pre-existing ruff
errors on the stricter baseline, plus a flaky test
(`test_run_variant_evolution_happy_path`) that had been failing
intermittently for weeks — the symptom was "invalid x-api-key,"
which sounded like an Anthropic problem until CI exposed the root
cause. A sibling test in `test_config.py` was reloading the config
module with `SKILLFORGE_COMPETITOR_BACKEND=managed` and relying on
monkeypatch teardown order that didn't work the way the author
thought. CI on a clean checkout revealed it because there was no
stale local DB to paper over the damage.
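One plausible shape of that leak, sketched with the standard library only. `FakeConfig` stands in for the real config module, the `"local"` default is a placeholder, and the actual test used pytest's monkeypatch instead of `mock.patch.dict`; treat this as an illustration of the failure mode, not a reproduction.

```python
import os
from unittest import mock

ENV_VAR = "SKILLFORGE_COMPETITOR_BACKEND"


class FakeConfig:
    """Stand-in for a config module that snapshots the environment on (re)load."""

    def __init__(self) -> None:
        self.reload()

    def reload(self) -> None:
        # "local" is a placeholder default, not the project's real one.
        self.backend = os.environ.get(ENV_VAR, "local")


def leaky_test(config: FakeConfig) -> None:
    """Reloads under a patched env but never reloads again after teardown."""
    with mock.patch.dict(os.environ, {ENV_VAR: "managed"}):
        config.reload()
        assert config.backend == "managed"
    # The env var is restored here, but config.backend still says "managed".


def isolated_test(config: FakeConfig) -> None:
    """Same assertion, with an explicit reload once the environment is restored."""
    try:
        with mock.patch.dict(os.environ, {ENV_VAR: "managed"}):
            config.reload()
            assert config.backend == "managed"
    finally:
        config.reload()
```

Restoring the environment variable is not enough when a module has already cached its value; the cached state has to be reset too.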
The `test_taxonomy_api.py` failures followed the same pattern —
tests assumed a populated DB from prior `uvicorn` runs. On CI's
fresh filesystem, `TestClient(app)` didn't even trigger the
lifespan (you have to use it as a context manager). Fixed both by
entering `TestClient` correctly and pointing the bootstrap at a
per-test temp DB.
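The shape of that fix can be sketched with the standard library alone. `bootstrap`, `started_app`, and the `families` table are hypothetical stand-ins; in the real tests the equivalent is entering `with TestClient(app) as client:` so the lifespan actually runs.

```python
import sqlite3
import tempfile
from contextlib import closing, contextmanager
from pathlib import Path
from typing import Iterator


def bootstrap(db_path: Path) -> None:
    """Hypothetical startup step: create the schema and seed one row."""
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute("CREATE TABLE IF NOT EXISTS families (slug TEXT PRIMARY KEY)")
        conn.execute("INSERT OR IGNORE INTO families VALUES ('phoenix-liveview')")


@contextmanager
def started_app() -> Iterator[Path]:
    """Startup runs on enter, against a throwaway DB that is deleted on exit."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "test.db"
        bootstrap(db_path)
        yield db_path


def count_families(db_path: Path) -> int:
    """Read back what startup seeded."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT COUNT(*) FROM families").fetchone()[0]
```

Each test that enters `started_app()` sees a freshly seeded database and nothing left behind by earlier runs.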
"CI surfaces latent bugs the local loop hides" is the cleanest way
to state the lesson.
---
### Phase 3: Cross-cutting hygiene
Wave 2 was the big one for exception discipline: 75 broad
`except Exception` catches across the codebase. Some were
unjustifiable — catch-and-print diagnostics that swallowed failures
into stdout. Some were legitimate boundaries — boot-time lifespan
steps that mu
# SKLD — Project Journal
## Entry #16: The Atomic Pipeline Comes to Life
**Date**: April 13, 2026
**Session Duration**: ~6 hours
**Participants**: Matt + Claude Opus 4.6 (1M context) + Sonnet subagents
---
### The Starting Point
Entry #15 closed with the frontend sprint shipped — SKLD-bench pages, taxonomy capabilities, homepage pipeline flow, journal browser, and the first round of scoring visibility. But the run detail page was still showing a molecular-era demo: generic "Variant A / Variant B" labels, an L1-L5 judging pipeline that no longer reflected reality, and no connection to the actual atomic evolution engine. Matt said "we need to redesign this for atomic evolution — it doesn't just need to be the demo, but the actual working pipeline."
---
### Phase 1: The Run Detail Redesign
The existing EvolutionArena was built for molecular evolution — spawn N variants, compete all on M challenges, judge, breed, repeat. Atomic evolution works fundamentally differently: decompose a skill into 12 dimensions, evolve each dimension independently (design 1 challenge, spawn 2 variants, compete, score, pick winner), then assemble the winners into one composite.
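The atomic loop described above can be sketched as follows. Every name here (`Variant`, `evolve_dimension`, `evolve_skill`, and the three callables) is hypothetical, and joining the winners with blank lines is a placeholder for the real assembly step.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Variant:
    dimension: str
    label: str
    body: str


def evolve_dimension(
    dimension: str,
    design_challenge: Callable[[str], str],
    spawn_variants: Callable[[str], Sequence[Variant]],
    score: Callable[[Variant, str], float],
) -> Variant:
    """One dimension: design one challenge, spawn variants, keep the top scorer."""
    challenge = design_challenge(dimension)
    variants = spawn_variants(dimension)
    return max(variants, key=lambda variant: score(variant, challenge))


def evolve_skill(
    dimensions: Sequence[str],
    design_challenge: Callable[[str], str],
    spawn_variants: Callable[[str], Sequence[Variant]],
    score: Callable[[Variant, str], float],
) -> str:
    """Evolve each dimension independently, then assemble the winners."""
    winners = [
        evolve_dimension(d, design_challenge, spawn_variants, score)
        for d in dimensions
    ]
    # Placeholder assembly: the real composite step is not shown in this entry.
    return "\n\n".join(winner.body for winner in winners)
```

The key structural difference from the molecular loop is that no dimension's result depends on another's, so each call to `evolve_dimension` is independent.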
We built three new components:
**AtomicSidebar** — replaces the molecular ProcessFlow sidebar with a dimension progress tracker. Foundation dimensions listed first, then capabilities. Each dimension shows pending/running/complete status with fitness scores. A progress bar at the top shows overall completion.
**DimensionsOverview** — new default tab on completed runs showing per-dimension fitness bars with raw Sonnet baseline comparison and lift percentages. Summary cards for dimension count, avg fitness, baseline, and skill lift. Bench tier breakdown table.
**New `/api/runs/{runId}/dimensions` endpoint** — returns variant_evolutions joined with winning variants, sorted foundation-first. This gives the frontend the per-dimension status, tier, fitness, and winner info that the old generation-based API didn't expose.
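A minimal sketch of the foundation-first ordering the endpoint returns. The `DimensionRow` fields and the tier strings `"foundation"` and `"capability"` are assumptions made for illustration, not the actual schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DimensionRow:
    dimension: str
    tier: str  # assumed values: "foundation" or "capability"
    status: str
    fitness: Optional[float]


def foundation_first(rows: Iterable[DimensionRow]) -> list[DimensionRow]:
    """Sort foundation rows ahead of the rest, then alphabetically by dimension."""
    # False sorts before True, so foundation rows come first.
    return sorted(rows, key=lambda row: (row.tier != "foundation", row.dimension))
```

Sorting on a tuple keeps the ordering stable and deterministic, which matters when the sidebar re-renders as statuses change.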
### Phase 2: Phase 6 Engine Integration
With the UI showing what composite scoring looks like, we wired it into the actual evolution engine.
Three new modules built with Sonnet subagents in parallel:
**`engine/scorer.py`** — async wrapper around the composite scorer scripts. Uses `asyncio.to_thread()` to run the sync compile/AST/behavioral checks without blocking the event loop. Handles in-memory challenges (writes temp files for the L0 scorer). Returns zero-fallback on any error so the engine never crashes.
**`engine/transcript_logger.py`** — saves every competitor dispatch to the `dispatch_transcripts` table. Extracts prompt from trace, serializes the full trace as raw_response, includes composite score breakdown. Best-effort — never raises.
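The pattern behind `engine/scorer.py` can be sketched like this: push the synchronous scorer onto a worker thread and fall back to a zero score on failure. `score_async`, `ZERO_SCORE`, and the `{"composite": ...}` result shape are hypothetical.

```python
import asyncio
import logging
from typing import Callable

logger = logging.getLogger(__name__)

# Assumed result shape; the real score breakdown has more fields.
ZERO_SCORE = {"composite": 0.0}


async def score_async(sync_scorer: Callable[[str], dict], code: str) -> dict:
    """Run a blocking scorer off the event loop; never let it crash the caller."""
    try:
        return await asyncio.to_thread(sync_scorer, code)
    except Exception:
        # Boundary catch: log the full traceback, then degrade to a zero score.
        logger.exception("scorer failed; returning zero score")
        return dict(ZERO_SCORE)
```

Logging before falling back is what separates a deliberate boundary from a swallowed failure: the run continues, but the error is still visible.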
**Wired into `variant_evolution.py`** — after each competitor runs, the composite scorer scores the output and merges results into `pareto_objectives`. The transcript logger records everything. The existing `run_judging_pipeli
# SKLD — Project Journal
## Entry #15: The Scoring Overhaul and the Frontend Sprint
**Date**: April 12, 2026
**Session Duration**: ~8 hours (continuation of Entry #14's marathon)
**Participants**: Matt + Claude Opus 4.6 (1M context)
---
### The Starting Point
Entry #14 ended with a crisis: raw Sonnet scored 93.3% on our 867 challenges using string matching alone, and 5 of 6 sources scored identically on the hardest challenge. We had a prototype composite scorer in `/tmp/scoring_test/` and a plan (`PLAN-V2.1.2`), but nothing was wired into the real pipeline. The `dispatch_transcripts` table didn't exist yet. Deep-dive outputs were still in `/tmp`.
Matt's directive was clear: fix scoring, classify challenges, validate skill lift, and do it properly — save everything, no shortcuts.
---
### Phase 1: Data Foundation (Phase 0 of PLAN-V2.1.3)
The plan crystallized into 6 phases (`plans/PLAN-V2.1.3.md`), superseding the earlier V2.1.2 draft. Phase 0 was about preventing future data loss.
Built the `dispatch_transcripts` table — 16 columns covering the full audit trail for every agent dispatch. Added a `scores` TEXT column to `benchmark_results` via additive migration to hold the multi-level breakdown JSON alongside the existing scalar `score` column.
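An additive migration of this kind can be sketched as below, assuming a SQLite database (the entry does not name the engine). `add_column_if_missing` is a hypothetical helper; the table and column names come from the paragraph above.

```python
import sqlite3


def add_column_if_missing(
    conn: sqlite3.Connection, table: str, column: str, decl: str
) -> bool:
    """Add a column only if it is absent. Returns True when a column was added.

    Identifiers are interpolated into SQL, so pass trusted constants only.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
    return True
```

Because the check makes the migration idempotent, it can run on every startup without harming existing rows or the scalar `score` column.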
Archived the 18 deep-dive outputs from `/tmp/skld-level-test/` into `dispatch_transcripts` before the OS could clean them up. Copied the `ast_quality.exs` prototype to its permanent home at `scripts/scoring/ast_quality.exs`.
This was the lesson from Entry #14 applied: "if you don't save it, it didn't happen."
---
### Phase 2: The Composite Scorer Goes Live (Phases 1-2)
**Phase 1 — Compilation + AST**: Built `compile_check.py` (namespace-aware, word-boundary regex for `MyApp` → `SkldBench` substitution), `ast_analyze.py` (shells out to the Elixir AST walker, falls back to Python regex analysis), and the main `composite_scorer.py` orchestrating all levels.
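A minimal sketch of the word-boundary substitution, assuming a plain `re.sub`; `rename_namespace` is a hypothetical name and the real `compile_check.py` may do more.

```python
import re

# \b on both sides keeps identifiers such as MyAppointment from being rewritten.
_MYAPP = re.compile(r"\bMyApp\b")


def rename_namespace(source: str, target: str = "SkldBench") -> str:
    """Replace the standalone MyApp namespace with the scaffold's namespace."""
    return _MYAPP.sub(target, source)
```

For example, `defmodule MyApp.PostLive do` becomes `defmodule SkldBench.PostLive do`. Note that `\b` also leaves `MyAppWeb` untouched; whether the real script handles that prefix is not stated here.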
Created 7 Mix scaffolds — one per family — with appropriate dependencies. Phoenix families got `mix phx.new`, Ecto families got `mix + ecto`, Oban got `mix + oban`. Each scaffold lives at `taxonomy/elixir/<family>/scaffold/skld_bench/`.
Fixed 3 real bugs in the phoenix-liveview `score.py`: pipe-operator blindness (`stream(socket, :posts` didn't match `|> stream(:posts`), variable-vs-literal (`limit: -50` didn't match `limit: -@page_size`), and cross-cutting weight inflation (7 "free" anti-pattern checks were adding 2.5 weight each to the absent score, inflating every output).
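The pipe-operator fix can be sketched as a pattern that makes the leading socket argument optional. `stream_call_pattern` is a hypothetical helper, not the code in `score.py`.

```python
import re


def stream_call_pattern(name: str) -> re.Pattern:
    """Match stream(socket, :name ...) and the piped form |> stream(:name ...)."""
    atom = re.escape(f":{name}")
    # The optional group consumes a leading variable argument such as "socket, ".
    return re.compile(rf"\bstream\(\s*(?:[a-z_]\w*\s*,\s*)?{atom}\b")
```

Requiring `(` directly after `stream` keeps `stream_insert(...)` from matching, and the trailing `\b` keeps `:posts` from matching `:posts_archive`.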
Re-scored all 18 deep-dive transcripts and 135 phoenix-liveview benchmark outputs. The baseline dropped from 0.855 to 0.684 — compilation alone caught 22 failures (16.3%) that string matching had missed.
**Phase 2 — Behavioral Tests**: Built `behavioral_test_runner.py` with generic LiveView tests using `live_isolated` — no router config needed. The runner extracts the module name and `handle_event` names from the code, generates ExUnit tests for mount + each event
# SKLD — Project Journal
## Entry #14: Seven Seed Runs and the Scoring Crisis
**Date**: April 12, 2026
**Session Duration**: ~14 hours (single marathon session)
**Participants**: Matt + Claude Opus 4.6 (1M context)
---
### The Starting Point
Entry #13 left us with 4 seed runs shipped (phoenix-liveview,
ecto-sandbox-test, security-linter, oban-worker) and 3 remaining
(ecto-schema-changeset, ecto-query-writer, pattern-match-refactor).
All the prep work was done — families seeded, variants spawned, 35/35
v2 variants confirmed. The task was straightforward: dispatch
competitors, score, persist winners, assemble composites, ship.
What actually happened was more interesting.
---
### Phase 1: Finishing the Seed Runs
The first half of the session was pure execution. Schema-changeset
went cleanly: 44 competitor dispatches across 12 dimensions, all
scored via score.py, winners persisted, Engineer assembled a 398-line
composite, fitness 0.987 — the highest of all 7 families. Shipped as
PR #33.
Query-writer was messier. The first batch of 8 dispatches scored
mostly 0.0 because I'd manually crafted prompts that didn't match the
actual challenge content. A lesson in "read the challenge JSON, don't
guess from truncated output." After re-dispatching with the correct
prompts, 48 competitors scored properly. 4 v1 wins, 8 v2 wins,
composite fitness 0.935. PR #34.
Pattern-match-refactor went fastest — I delegated 36 of 44 competitor
dispatches to a single background Opus agent that solved and scored
them all autonomously. 1 v1 win, 10 v2 wins, fitness 0.945. PR #35.
**All 7 seed runs shipped.** 83 dimensions evolved across 7 Elixir
families, ~300 real competitor dispatches, 7 composite skills
assembled. Total API cost across all 7 runs: $63.18 at current Opus
4.6 pricing ($5/$25 per M tokens), or about $9 per run on average,
dramatically lower than the $28-35 estimates we'd been showing on the
Registry, which were based on old Opus 4.1 pricing ($15/$75). Matt
noticed the $0.00 cost
display on the new runs and asked me to compute real costs from token
usage, which led to updating all 7 runs with accurate numbers.
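Computing cost from token usage can be sketched as below. It reads the quoted prices as input/output dollars per million tokens, which is an assumption about the notation; `Pricing` and `run_cost` are hypothetical names, and the token counts in the usage note are invented round numbers, not real run data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # dollars per million input tokens
    output_per_mtok: float  # dollars per million output tokens


OPUS_4_6 = Pricing(input_per_mtok=5.0, output_per_mtok=25.0)
OPUS_4_1 = Pricing(input_per_mtok=15.0, output_per_mtok=75.0)


def run_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Dollar cost of a run, rounded to cents."""
    cost = (
        input_tokens / 1_000_000 * pricing.input_per_mtok
        + output_tokens / 1_000_000 * pricing.output_per_mtok
    )
    return round(cost, 2)
```

With a made-up run of 1,000,000 input and 200,000 output tokens, `run_cost` gives 10.0 under the newer pricing and 30.0 under the older one, the same threefold gap that separates the two price lists.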
---
### Phase 2: SKLD-bench — The Baseline Nobody Expected
Matt asked a question that changed the trajectory of the session:
"Wouldn't running the ENTIRE challenge pool against raw Sonnet and
Opus tell us how capable the models are without any skill guidance?"
Yes. And the answer was uncomfortable.
We designed the benchmark infrastructure: a `benchmark_results` table
(15 columns, unique on challenge_id + model), a runner script that
dispatches the raw model against each challenge and scores via score.py, and
a report generator. Then launched 7 background Sonnet agents in
parallel — one per family — to process all 874 challenges.
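The uniqueness rule on `benchmark_results` can be sketched as follows, assuming SQLite and showing only four of the 15 columns. Overwriting on conflict is an assumption too; the real runner might skip duplicates instead.

```python
import sqlite3

# Trimmed, hypothetical schema: only the columns needed to show the constraint.
SCHEMA = """
CREATE TABLE IF NOT EXISTS benchmark_results (
    id INTEGER PRIMARY KEY,
    challenge_id TEXT NOT NULL,
    model TEXT NOT NULL,
    score REAL NOT NULL,
    UNIQUE (challenge_id, model)
)
"""


def record_result(
    conn: sqlite3.Connection, challenge_id: str, model: str, score: float
) -> None:
    """Insert a result, replacing the score if this pair was already recorded."""
    conn.execute(
        """
        INSERT INTO benchmark_results (challenge_id, model, score)
        VALUES (?, ?, ?)
        ON CONFLICT (challenge_id, model) DO UPDATE SET score = excluded.score
        """,
        (challenge_id, model, score),
    )
    conn.commit()
```

With the constraint in place, re-running one family's agent after a crash cannot double-count challenges in the averages.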
The results came back over about an hour:
```
ecto-schema-changeset: 100 challenges, avg 0.990, 100% pass
ecto-query-writer: 151 challenges, avg 0.980, 100% pass
ecto-sandbox-test: 151 challenges, avg 0.958, 100% pass
pattern-mat
# SKLD — Project Journal
## Entry #13: The Install Test — three quality gates, still broken
**Date**: April 11, 2026
**Session Duration**: ~12 hours across two long stretches
**Participants**: Matt + Claude Opus 4.6 (1M context)
---
### The Starting Point
Entry #12 closed with the SKLD-bench content layer shipped — 867 Elixir
challenges across 7 lighthouse families, all on main. This session
picked up from there with a specific goal that kept expanding: take the
phoenix-liveview mock pipeline run (shipped in PR #18 the previous
session) and turn the production Registry page for that run into
something people would actually want to click through.
What started as "polish the run detail page" ended up being:
1. A ground-up rebuild of `/runs/:runId` into a 7-tab rich showcase
2. A full `mock → seed` rebrand across every user-facing string
3. A second rebrand of the run_id itself because "mock" was still
visible in the URL bar
4. OG meta tags + a brand image + per-run server-side meta injection so
links would render rich previews when dropped in Discord
5. And then — finally, near the end — the install test that discovered
the whole thing was silently broken
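The per-run meta injection in item 4 can be sketched roughly like this. `og_tags` and `inject_meta` are hypothetical helpers, and injecting before `</head>` is one reasonable approach, not necessarily what the server does.

```python
from html import escape


def og_tags(title: str, description: str, image_url: str, page_url: str) -> str:
    """Render Open Graph meta tags with attribute-safe escaping."""
    properties = {
        "og:title": title,
        "og:description": description,
        "og:image": image_url,
        "og:url": page_url,
    }
    return "\n".join(
        f'<meta property="{name}" content="{escape(value, quote=True)}" />'
        for name, value in properties.items()
    )


def inject_meta(index_html: str, tags: str) -> str:
    """Insert tags before the first </head>; leave the page untouched if absent."""
    marker = "</head>"
    if marker not in index_html:
        return index_html
    return index_html.replace(marker, f"{tags}\n{marker}", 1)
```

Doing this on the server matters because link-preview crawlers generally read the initial HTML and do not run the client-side app.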
---
### Phase 1: The rich run detail page
The starting state: the Registry page for the phoenix-liveview mock run
loaded, but it was a "skill with no story." Raw SKILL.md preview,
synthetic Fitness Radar fed by hardcoded data, empty Growth Curve, 12
variant rows hidden behind a "Show Advanced" toggle, export buttons at
the bottom. Nothing explained what the 12 evolved capabilities did, how
the composite was assembled, which challenges it was tested against,
or why the composite scored 0.94. A first-time visitor would bounce.
Matt framed the problem plainly: "it's a skill with no story." The fix
was a complete restructure. Over several iterations with real-time
feedback:
- **7 tabs, sticky header**: Composite, Competition, Metrics, Tests,
Narrative, Lineage, Package. Sticky tab bar with the always-visible
header showing run title, Gen 1/1 pill, export buttons.
- **Plain-English first, metrics second**: an `OverallAssessment` prose
card at the top of the Composite tab, followed by a `PipelineOverview`
mini-diagram (24 challenges → 12 variants → 1 composite), then the
rendered composite SKILL.md with section anchors.
- **Per-challenge competition breakdown**: `CompetitionBracket` with 12
mini-brackets, each showing Variant 1 (seed) vs Variant 2 (spawn)
with per-challenge scores and a `buildRationale()` sentence
explaining why the winner won. Preempts questions like "did you run
baselines?" and "is this multi-gen?"
- **Master-detail lineage**: the first cut had 12 parent cards that
expanded in-place below a grid, but clicking parent #7 scrolled past
the click target. Fixed by moving to a sticky left rail that stays
visible while the right panel updates on selection.
- **Stacked parent→composite sections**: side-by-side comparisons
squeezed both into narr