# Book of Elixir
*What we learned from evolving 7 Elixir skill families across 867 challenges.*
---
## What This Is
This book documents the Elixir-specific findings from SKLD-bench — a controlled evaluation of AI-generated Elixir code across 7 domain families. Every finding is grounded in measured data from real model dispatches, scored through compilation, AST analysis, and behavioral testing.
If you're building Elixir skills for AI agents, training models on Elixir code, or evaluating AI-generated Elixir, this is the empirical reference.
---
## Chapter 1: The Seven Families
We chose 7 Elixir skill families that cover the most common areas where developers use AI assistance. Each family has its own challenges, scoring criteria, and distinct failure patterns.
| Family | Challenges | What It Tests | Compile Rate |
|--------|-----------|---------------|-------------|
| phoenix-liveview | 135 | Phoenix 1.7+ LiveView idioms, HEEx, streams, PubSub | 83.7% |
| ecto-sandbox-test | 151 | Test isolation, sandbox checkout, async safety | 94.0% |
| pattern-match-refactor | 130 | Idiomatic pattern matching, pipe chains, guards | 83.8% |
| security-linter | 100 | OWASP-style security patterns, Plug middleware | 58.0% |
| oban-worker | 100 | Background job workers, queue config, error handling | 32.0% |
| ecto-schema-changeset | 100 | Schema definitions, changeset validations, types | 13.0% |
| ecto-query-writer | 151 | Ecto query composition, preloads, dynamic queries | 0.7% |
**Source:** Phase 3 cross-family classification, 867 benchmark results with composite scoring, April 12 2026.
The compile rates track how much external context each domain requires, which Chapter 2 covers.
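As a quick consistency check on the table above, the per-family compile rates can be converted back into approximate counts. This is a minimal sketch, not part of the SKLD-bench tooling: the `FAMILIES` dictionary simply transcribes the table, and the counts it produces are rounded reconstructions rather than reported figures.

```python
# Transcribed from the table above: family -> (challenges, compile rate in %).
FAMILIES = {
    "phoenix-liveview": (135, 83.7),
    "ecto-sandbox-test": (151, 94.0),
    "pattern-match-refactor": (130, 83.8),
    "security-linter": (100, 58.0),
    "oban-worker": (100, 32.0),
    "ecto-schema-changeset": (100, 13.0),
    "ecto-query-writer": (151, 0.7),
}


def compiled_count(challenges: int, rate_pct: float) -> int:
    """Approximate number of compiling outputs implied by a rounded rate."""
    return round(challenges * rate_pct / 100)


def summarize(families: dict) -> tuple:
    """Return (total challenges, approximate total of compiling outputs)."""
    total = sum(n for n, _ in families.values())
    compiled = sum(compiled_count(n, rate) for n, rate in families.values())
    return total, compiled


if __name__ == "__main__":
    for name, (n, rate) in FAMILIES.items():
        ok = compiled_count(n, rate)
        print(f"{name:24s} {ok:3d}/{n} compile, {n - ok} fail")
    total, compiled = summarize(FAMILIES)
    print(f"all families: {compiled}/{total}")
```

The challenge counts sum to the 867 quoted in the source line, and for phoenix-liveview the reconstruction gives 22 failures out of 135, which matches the Phase 1 figure quoted in Chapter 2.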
---
## Chapter 2: The Context Dependency Spectrum
Elixir families fall on a spectrum from "self-contained" to "deeply embedded." This spectrum directly predicts compile rates and determines how skills need to be designed.
### Self-Contained Families (high compile rate)
**ecto-sandbox-test (94.0% compile)**, **pattern-match-refactor (83.8%)**, and **phoenix-liveview (83.7%)** produce code that mostly stands alone. A LiveView module needs `use MyAppWeb, :live_view` and a few callbacks; the scaffold provides the web framework. A pattern-match refactor needs only pure Elixir.
**What fails:** The 6-16% of outputs that don't compile typically reference application-specific modules (`MyApp.Accounts`, `MyApp.Blog`) in mount callbacks or event handlers, or contain invalid syntax.
**How we learned this:** Phase 1 compile check against the Phoenix scaffold. 22 of 135 LiveView outputs (16.3%) failed compilation. Most failures were in `mount/3`, where the model called `MyApp.Blog.list_posts/1`, a function on a context module that does not exist in the scaffold. *(Phase 1, compile_check.py results, April 12 2026)*
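This failure mode can be flagged with a rough static check before compilation is even attempted. The following is a minimal sketch under stated assumptions: it is not the project's `compile_check.py`, the function name `unresolved_app_modules` and the `known_modules` set are hypothetical, the sample LiveView is invented for illustration, and a regex is only an approximation of real Elixir parsing.

```python
import re


def unresolved_app_modules(source: str, known_modules: set,
                           app_prefix: str = "MyApp") -> list:
    """Return app modules called in `source` that are not in `known_modules`.

    Matches remote calls such as `MyApp.Blog.list_posts(`. Being regex-based,
    it misses aliased modules and does not understand comments or strings.
    """
    pattern = re.compile(
        rf"\b({re.escape(app_prefix)}(?:\.[A-Z]\w*)+)\.[a-z_]\w*[!?]?\("
    )
    called = {match.group(1) for match in pattern.finditer(source)}
    return sorted(called - known_modules)


# Hypothetical model output of the kind described above.
LIVEVIEW_OUTPUT = '''
defmodule MyAppWeb.PostLive.Index do
  use MyAppWeb, :live_view

  def mount(_params, _session, socket) do
    posts = MyApp.Blog.list_posts(limit: 50)
    {:ok, stream(socket, :posts, posts)}
  end
end
'''

if __name__ == "__main__":
    # Assumption: the scaffold defines only the repo module.
    print(unresolved_app_modules(LIVEVIEW_OUTPUT, {"MyApp.Repo"}))
```

A check like this only explains why a compile failed; the compiler remains the authority on whether the output is valid.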
### Context-Dependent Families (low compile rate)
**oban-worker (32.0%)**, **security-linter (58.0%)**, and especially **ecto-schema-changeset (13.0%)** and **ecto-query-writer (0.7%)** produce code t
# Book of Genesis
*Universal principles of AI skill engineering, discovered through empirical evolution.*
---
## What This Is
This book documents what we've learned about building skills for AI coding agents — not through theory, but through running 867 controlled experiments across 7 skill families and measuring what actually works.
Every finding here was discovered the hard way: by building something, testing it, finding out it was broken, and fixing it. Each section cites exactly where the learning came from so you can trace the evidence yourself.
These principles apply regardless of programming language or domain. The experiments happened to use Elixir, but the lessons are universal.
---
## Chapter 1: The Scoring Problem
**The single most important finding in this project: if your evaluation can't tell good code from bad code, nothing else matters.**
### 1.1 String Matching Is Nearly Worthless
We built 867 coding challenges and scored AI-generated solutions by checking whether the output contained expected strings, such as `stream(socket, :posts` or `limit: -50`. Many lightweight LLM evaluations work the same way: they check whether the output contains the right keywords.
Raw Sonnet (no skill guidance) scored **93.3% average** across all 867 challenges. It looked like the model already knew everything and skills had nothing to add.
Then we actually tried to compile the code.
**How we learned this:** We ran a deep-dive experiment on the 3 hardest Phoenix LiveView challenges, generating code from 6 different sources (2 models × 3 skill configurations). Five of the six sources scored identically at 0.636 on the hardest challenge. The string matcher couldn't tell them apart. *(Journal #14, "The Scoring Crisis", April 12 2026)*
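To see why a string matcher cannot separate such sources, consider a minimal sketch of one. The function name `string_match_score` and both sample outputs are hypothetical; only the two expected patterns come from the text above.

```python
def string_match_score(output: str, expected: list) -> float:
    """Fraction of expected substrings that appear anywhere in the output."""
    if not expected:
        return 0.0
    hits = sum(1 for pattern in expected if pattern in output)
    return hits / len(expected)


EXPECTED = ["stream(socket, :posts", "limit: -50"]

# A well-formed fragment (illustrative, not taken from the benchmark).
WELL_FORMED = """
def handle_info({:new_post, post}, socket) do
  {:noreply, stream(socket, :posts, [post], limit: -50)}
end
"""

# The same fragment with the closing `end` removed, so it cannot compile.
BROKEN = """
def handle_info({:new_post, post}, socket) do
  {:noreply, stream(socket, :posts, [post], limit: -50)}
"""

if __name__ == "__main__":
    print(string_match_score(WELL_FORMED, EXPECTED))
    print(string_match_score(BROKEN, EXPECTED))
```

Both fragments contain every expected substring, so they receive the same score even though only one of them is valid Elixir.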
**The evidence:**
| Scorer | Sonnet Baseline | Can It Rank? |
|--------|----------------|-------------|
| String match (L0) | 93.3% | No — 5/6 sources tied |
| + Compilation | 68.4% | Yes — catches broken code |
| + AST quality | 68.4% | Yes — rewards idiomatic style |
| + Behavioral tests | 51.1% | Yes — only 14% actually work |
**The principle:** Any evaluation that doesn't test whether code actually runs is measuring "does this look like code?" not "is this good code?" String matching is a necessary check (does the output contain the right concepts?) but should carry no more than 10% of the total score weight.
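One way to apply this principle is to cap the string-match weight when composing a score. The sketch below is illustrative only: the layer names mirror the table above, but every weight other than the 10% cap on string matching is an assumption, not the weighting SKLD-bench uses.

```python
import math

MAX_STRING_WEIGHT = 0.10  # cap stated in the principle above

# Hypothetical weights; only the string-match cap comes from the text.
WEIGHTS = {"string": 0.10, "compile": 0.30, "ast": 0.20, "behavior": 0.40}


def composite_score(layers: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-layer scores, each expected to lie in [0, 1]."""
    if weights["string"] > MAX_STRING_WEIGHT:
        raise ValueError("string matching must not exceed 10% of the score")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * layers[name] for name in weights)


if __name__ == "__main__":
    # An output that contains every keyword but fails every other layer.
    looks_right = {"string": 1.0, "compile": 0.0, "ast": 0.0, "behavior": 0.0}
    print(composite_score(looks_right))
```

With the cap in place, an output that merely looks right can earn at most a tenth of the total, however many keywords it contains.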
### 1.2 Compilation Is the Cheapest High-Value Gate
Adding a single binary check — does this code compile? — dropped the baseline from 93.3% to 68.4% and caught bugs that every other quality gate missed.
**How we learned this:** During the deep-dive experiment, one source (Opus + v1 skill) was ranked as the *best* solution by string matching. When we compiled it, it had a syntax error. The skill had guided the model toward an invalid capture pattern. String matching scored it highest; compilation scored it zero. *(Deep-dive experiment, source: opus-v1-skill-hard-07, Journal #14)*
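A binary compile gate can be as small as a subprocess call. The sketch below assumes the generated code has already been written into a project directory and that the compiler signals failure through a non-zero exit code. The helper names `compiles` and `gated_score` are hypothetical, and the command mentioned in the comment is an assumption about how a harness might invoke the Elixir toolchain, not the project's actual pipeline.

```python
import subprocess
import sys
from typing import Optional


def compiles(command: list, cwd: Optional[str] = None,
             timeout_s: float = 120.0) -> bool:
    """Return True if the compile command exits with status 0."""
    try:
        result = subprocess.run(
            command, cwd=cwd, capture_output=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing toolchain or a hung build both count as "did not compile".
        return False
    return result.returncode == 0


def gated_score(string_score: float, compiled: bool) -> float:
    """Zero out the score when the code does not compile."""
    return string_score if compiled else 0.0


if __name__ == "__main__":
    # For an Elixir project the command might be ["mix", "compile"].
    # A failing Python one-liner stands in here so the sketch runs anywhere.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(gated_score(1.0, compiles(failing)))
```

Treating compilation as a gate rather than one more weighted term reproduces the behavior described above: the top-ranked string-match output scores zero once it fails to compile.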
**The evidence:** Across all 867