03 / ENGINEERING CASE STUDY · Claude Code Framework · Persistent Agent · Maintenance Autopilot · PUBLIC PREVIEW

cortex-x

Discipline, not the model. How a framework that catches its own mistakes came together solo in seven weeks — and what that reveals about how AI research and implementation actually work.

BUILT BY DAVID RAJNOHA ·  APR–MAY 2026
00
Project Index
Role
Solo Builder & Architect
Design · implementation · operations
Industry
Developer Tools / Agentic AI
Infrastructure for AI-assisted work
Year
2026
Status · Public Preview ↗
Stack
Node.js ≥22 · CJS
Apache-2.0 · v0.4-pre
3,435
Tests Passing
17
Nightly cron workflows
7
Weeks · founding sprint
0
Runtime npm/pip deps
01
Thesis

Discipline, not the model. The bottleneck in AI-assisted engineering is not model capability — it is the operator's consistent operational discipline.

I built cortex-x solo over seven weeks, from nine files (2026-04-17) to its current state. The work produced one conclusion: the model is not the limit. The limit is whether the operator holds the same procedure at 3 a.m., at the sixtieth commit, and when the thing “almost works.” So cortex-x is not the subject of this case study — it is the evidence of a transferable method. And because a model's training data is frozen, the model doesn't even know its own current version: every claim about present state — framework versions, pricing, APIs, a11y standards — must be verified, not guessed. Even research is subject to the discipline.

  1. Training data is frozen, so every claim that depends on current state gets verified by web research before I write it as fact.
  2. Code does not review itself; any non-trivial diff passes an adversarial review pipeline before it merges.
  3. Discipline held in the head does not survive the night; it has to be externalized into a mechanism that does not forget.
  4. Documentation rots fastest precisely in AI projects, because state changes faster than the prose describing it.
  5. Context dies with every session; whatever must survive has to be explicitly persisted to disk.
  6. A flat list of standards fails under pressure; under pressure it is tiered precedence that decides, not good intentions.
  7. Real validation of a method is not one pretty project, but that the same patterns transfer across domains.
  8. Repeatability beats virtuosity: a procedure anyone can run a second time beats a one-off brilliant move.

The rest of this case study is the proof chain — for each of the eight beliefs there is a concrete file, commit, and sprint that exists precisely because the belief exists.

02
The Collision
Context
Five concurrent production agents
Surface area
5 CLAUDE.md · 5 memory layers · 5 postures
Cost of context loss
~10 hrs / month re-orientation

In April 2026 the operator was already shipping production AI agents — RELO, a back-office agent for Czech real estate (27 tools, 1,700+ tests, three-layer memory with autoDream consolidation), plus a multi-tenant chatbot platform serving production clients. Both ran in production, both passed audits, both were built solo.

By the second month a pattern was visible across them. The operator wasn't writing product code anymore. The work had drifted: safe-tool wrappers, three-layer memory scaffolds, cost guards, multi-agent review pipelines, OWASP Agentic Top 10 mappings, session-start hooks, recommendation queues. The scaffolding was the product. Every new project needed the same scaffolding rebuilt from memory.

“The scaffolding was the product. cortex-x is what fell out of three months of shipping with discipline before that discipline had a name.”
The Challenge

Stop rebuilding the scaffolding — externalize the discipline.

Five concurrent projects. Five CLAUDE.md files. Five distinct memory layers. Five separately-evolving security postures. One operator. One head.

A senior engineer reloading project context after a two-week gap loses 30–60 minutes per session — not coding, just re-orientation. Five projects × monthly drop-in cadence = ~10 hours of pure re-orientation per month, plus the silent tax of decisions made twice because last quarter's reasoning was no longer accessible.

The Solution

An institutional-wisdom layer for the operator's Claude Code installation — cortex-x.

CLAUDE.md holds current state that changes weekly — tech stack, sprint status, env.

cortex-x holds the standards, lessons, decisions, and agent runtime that do not change between projects. Zero overlap. The split is enforced by hook contract, not by memory.

03
Philosophy → architecture

Philosophy → architecture: eight pairs

Each of the eight beliefs below has a concrete artifact in the repository that exists precisely because that belief exists. Consistent philosophy produces consistent architecture — these are not arbitrary tools but conviction made into code you can open and read.

01

Research-before-assert → R1 mandatory dispatch

Principle

Training data is frozen; about the current state of the world — framework and model versions, pricing, a11y standards — I know nothing reliable. So any answer or implementation that depends on external state must be verified before I state it as fact. AI is a tool, not an oracle.

Architecture

R1 anchors verify-first as a mandatory step: the standard describes the protocol, cortex-goal.md embeds it into Phase 3 (Research) of the plan, and the sprint skill enforces it inside the pipeline. Findings are cited with URLs and cached, so research is reusable across sessions rather than one-shot.

02

Review-before-merge → R2 six-agent pipeline + Pass-2 skeptic

Principle

A diff that wrote itself cannot judge itself. A non-trivial change passes six independent reviewers before the operator is allowed to merge — and a second round, the Pass-2 skeptic, is tasked with refuting the first round's findings.

Architecture

r2-review.js is a dynamic workflow orchestrating six parallel agents; pre-commit-review-gate.cjs blocks the commit until a verdict arrives. Consensus HIGH findings are applied in-commit, not deferred to a backlog.

Empirical proof

Arc 1: 23 HIGH bugs caught, zero refuted by the Pass-2 skeptic, all fixed in-commit before push — nothing escaped to main. In Sprint 2.46 six independent reviewers converged on a single bug at confidence 99/98/96/96/95/92.
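The consensus step described above can be sketched as follows. This is a minimal illustration, not the contents of r2-review.js: the report shape, the `mergeFindings` and `decide` names, and the rule that any surviving HIGH finding yields FAIL are all assumptions. The reviewers and the skeptic are plain data and a callback here, so the sketch runs without any model call.

```javascript
"use strict";
// Hypothetical sketch: report shapes and function names are assumptions.

// Merge per-reviewer findings by id and record which reviewers raised each one.
function mergeFindings(reports) {
  const byId = new Map();
  for (const { reviewer, findings } of reports) {
    for (const f of findings) {
      const entry = byId.get(f.id) || { id: f.id, severity: f.severity, flaggedBy: [] };
      entry.flaggedBy.push(reviewer);
      byId.set(f.id, entry);
    }
  }
  return [...byId.values()];
}

// Pass 2: keep only findings the skeptic fails to refute, then decide.
function decide(reports, skepticRefutes) {
  const survivors = mergeFindings(reports).filter((f) => !skepticRefutes(f));
  const blocking = survivors.filter((f) => f.severity === "HIGH");
  return { decision: blocking.length === 0 ? "PASS" : "FAIL", survivors };
}

const reports = [
  { reviewer: "security-auditor", findings: [{ id: "path-drift", severity: "HIGH" }] },
  { reviewer: "ssot-enforcer", findings: [{ id: "path-drift", severity: "HIGH" }] },
  { reviewer: "blind-hunter", findings: [{ id: "typo", severity: "LOW" }] },
];
// In this example the skeptic refutes only the low-severity typo.
const result = decide(reports, (f) => f.id === "typo");
console.log(result.decision, result.survivors[0].flaggedBy);
```

Recording `flaggedBy` is what makes convergence visible: a finding raised independently by several reviewers is harder to dismiss than one raised once.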

03

Externalize discipline → Signed verdict v2

Principle

Discipline that relies on human memory fails exactly when fatigue is highest. A rule you can bypass by forgetting is not a rule — it belongs outside the head, in a cryptographic artifact.

Architecture

Signed verdict v2 carries an HMAC-SHA256 or Ed25519 signature over a payload containing commit_sha (cross-checked against HEAD), staged_tree (the index contents), workflow_run_id (a single-use nonce in a journal), and secret_tier (env > persisted random > host-derived). A replayed verdict is rejected, a stale verdict is rejected, and a host-derived secret is rejected under STRICT_SECRET=1. The signed verdict complements [skip-review] as a second unblock path beside the session marker — the session marker remains the highest-priority allow path.
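A minimal sketch of the HMAC path, assuming a JSON payload with a fixed key order and an in-memory set standing in for the nonce journal. The field names come from the paragraph above; `canonical`, `signVerdict`, `verifyVerdict`, the rejection labels, and the order of the checks are hypothetical.

```javascript
"use strict";
const crypto = require("node:crypto");

// Hypothetical sketch: encoding, function names, and the in-memory journal
// are assumptions; only the payload field names come from the text.
function canonical(payload) {
  // Fixed key order so signer and verifier hash identical bytes.
  const { commit_sha, staged_tree, workflow_run_id, secret_tier } = payload;
  return JSON.stringify({ commit_sha, staged_tree, workflow_run_id, secret_tier });
}

function signVerdict(payload, secret) {
  const sig = crypto.createHmac("sha256", secret).update(canonical(payload)).digest("hex");
  return { payload, sig };
}

function verifyVerdict(verdict, secret, headSha, usedNonces, strictSecret = false) {
  const expected = crypto.createHmac("sha256", secret).update(canonical(verdict.payload)).digest();
  const given = Buffer.from(verdict.sig, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (given.length !== expected.length || !crypto.timingSafeEqual(given, expected)) {
    return "bad-signature";
  }
  if (verdict.payload.commit_sha !== headSha) return "stale";
  if (usedNonces.has(verdict.payload.workflow_run_id)) return "replayed";
  if (strictSecret && verdict.payload.secret_tier === "host-derived") return "weak-secret";
  usedNonces.add(verdict.payload.workflow_run_id); // single use: burn the nonce
  return "ok";
}

const journal = new Set();
const v = signVerdict(
  { commit_sha: "abc123", staged_tree: "t1", workflow_run_id: "run-1", secret_tier: "env" },
  "s3cret"
);
const first = verifyVerdict(v, "s3cret", "abc123", journal);
const second = verifyVerdict(v, "s3cret", "abc123", journal);
console.log(first, second); // ok replayed
```

The nonce is burned only after every other check passes, so a rejected verdict cannot be used to exhaust a valid run id.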

04

Docs rot fastest in AI projects → Doc-currency lint

Principle

When an agent writes code faster than a human reads, documentation diverges from reality within a single sprint. Stale docs are worse than none — they impersonate a truth they no longer hold.

Architecture

cortex-doc-currency.cjs lints documentation against the measured repository state and flags stale numbers and claims; cortex-doc-regen.cjs regenerates derived passages from the source of truth. The documentation.md standard defines what is derived — and thus regenerable — versus hand-written.
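To make the idea concrete, here is a minimal sketch of a currency check, assuming the measured state arrives as a plain object and that claims look like "N tests" or "N standards". The patterns, the `lintDoc` name, and the result shape are hypothetical; the real cortex-doc-currency.cjs may work quite differently.

```javascript
"use strict";
// Hypothetical sketch: claim patterns and the shape of `measured` are assumptions.
const CLAIMS = [
  { key: "tests", pattern: /(\d[\d,]*)\s+tests\b/g },
  { key: "standards", pattern: /(\d[\d,]*)\s+standards\b/g },
];

// Return every numeric claim in `text` that disagrees with the measured value.
function lintDoc(text, measured) {
  const stale = [];
  for (const { key, pattern } of CLAIMS) {
    for (const match of text.matchAll(pattern)) {
      const claimed = Number(match[1].replace(/,/g, ""));
      if (claimed !== measured[key]) {
        stale.push({ key, claimed, actual: measured[key], at: match.index });
      }
    }
  }
  return stale;
}

const doc = "The suite holds 3,290 tests across 35 standards.";
const stale = lintDoc(doc, { tests: 3435, standards: 35 });
console.log(stale);
```

Here the test count is flagged because the prose says 3,290 while the measured value is 3,435; the standards claim matches and passes silently.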

05

Context dies each session → Wisdom-layer split (SSOT)

Principle

Every new session starts with empty memory. What changes in weeks — current state — must not live in the same place as what is stable for years — institutional wisdom — otherwise it duplicates and drifts apart.

Architecture

cortex-load.md is a mental cheat-sheet that brings a new session up to context; memory-decay.cjs governs memory aging so stale entries do not outweigh fresh ones; lessons.cjs manages the lessons-learned layer. One source of truth for each piece of information — never duplicate.
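One way to keep stale entries from outweighing fresh ones is to decay each entry's weight with age. The sketch below uses exponential decay with a 30-day half-life. Both the formula and the half-life are assumptions chosen for illustration, as are the entry fields; the source does not say which rule memory-decay.cjs applies.

```javascript
"use strict";
// Hypothetical sketch: exponential decay and the 30-day half-life are
// assumptions, not the documented behavior of memory-decay.cjs.
const DAY_MS = 24 * 60 * 60 * 1000;
const HALF_LIFE_DAYS = 30;

function decayedWeight(entry, now) {
  const ageDays = Math.max(0, (now - entry.lastUsedAt) / DAY_MS);
  return entry.baseWeight * Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
}

// Highest decayed weight first; the input array is left untouched.
function rank(entries, now) {
  return [...entries].sort((a, b) => decayedWeight(b, now) - decayedWeight(a, now));
}

const now = Date.UTC(2026, 5, 3);
const entries = [
  { id: "old-but-important", baseWeight: 1.0, lastUsedAt: now - 90 * DAY_MS },
  { id: "fresh-minor", baseWeight: 0.3, lastUsedAt: now },
];
console.log(rank(entries, now).map((e) => e.id));
```

With these numbers the 90-day-old entry decays to 0.125 and ranks below the fresh entry at 0.3, even though its base weight is higher.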

06

A flat standards list fails under pressure → Tier rules 0/1/1.5/2/3

Principle

When there are 35 standards and all carry equal weight, under time pressure none gets applied. Rules need an explicit priority order so you can decide what yields when two rules conflict.

Architecture

Tier 0 (Ship-Ready) → 1 (SSOT/modularity) → 1.5 (coding behavior) → 2 (security/testing/correctness) → 3 (process) gives the standards a lexicographic order. RULE-1.md codifies the SSOT layer; action-kinds.cjs maps the 21 action_kinds to rules; code-review.md applies that same order during review.
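A lexicographic order means the tier is compared first and anything else only breaks ties. A minimal sketch, assuming findings carry a numeric `tier` and a `severity`; the finding shape, the rule names, and the severity tie-break are hypothetical, only the tier numbers come from the text.

```javascript
"use strict";
// Hypothetical sketch: finding shape and the severity tie-break are assumptions.
const SEVERITY_RANK = { HIGH: 0, MEDIUM: 1, LOW: 2 };

function byPrecedence(a, b) {
  // Lower tier wins first; severity only breaks ties inside a tier.
  return a.tier - b.tier || SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity];
}

const findings = [
  { rule: "naming", tier: 3, severity: "HIGH" },
  { rule: "missing-test", tier: 2, severity: "MEDIUM" },
  { rule: "duplicated-config", tier: 1, severity: "LOW" },
  { rule: "over-engineering", tier: 1.5, severity: "HIGH" },
];
const ordered = [...findings].sort(byPrecedence).map((f) => f.rule);
console.log(ordered);
// [ 'duplicated-config', 'over-engineering', 'missing-test', 'naming' ]
```

Note that a LOW finding in Tier 1 still outranks a HIGH finding in Tier 3. That is the point of tiering: severity never lets a process rule jump ahead of a structural one.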

07

Real validation = patterns transfer cross-domain → Centralized ~/.claude/shared/

Principle

A pattern that works in one project alone may be coincidence. Only when the same standard improves both RELO and the multi-tenant chatbot platform is it a demonstrably transferable method, not a local trick.

Architecture

Everything shared lives in ~/.claude/shared/, so standards, hooks, and skills load into every project identically — README.md is the index of 35 standards and auto-orchestrate.cjs nudges parallel agents regardless of which project is running. Centralization is the mechanism of transferability.

08

Repeatability beats virtuosity → The /cortex-sprint Skill

Principle

One brilliant sprint proves nothing; a method you can run again and again at the same quality proves everything. The bottleneck is not model capability but the operator's consistent operational discipline.

Architecture

/cortex-sprint wraps the whole cycle — plan → R1 → implementation → R2 → verdict → capture — into a single repeatable skill; sprint-pipeline.md defines the phases and gates. A sprint plan like sprint-2-44-plan.md is a documented instance of that same process, not a unique performance.

04
What cortex-x actually does

A four-tier trajectory, two already shipped.

FIG. 01 — TRAJECTORY
Two AI surfaces, one runtime
Claude Code by day · Steward by night · ~$0.0008 / run
Figure: cortex-x four-tier trajectory and runtime architecture. cortex-x ships in four tiers. Tier 0 (foundation) and Tier 1 (verification + multi-agent) are shipped. Tier 2 (compound learners) is essentially closed (~80%). Tier 3 (productization) and Tier 4 (persistent entity) are planned. Two AI surfaces share one runtime: Claude Code interactively in the IDE, Steward autonomously overnight.
Day · interactive: Claude Code (IDE · session-start hook · feature work · review · refactor), reads ~/.claude/shared/.
Night · autonomous: Steward (nightly cron · draft PR only · recommendations.md → diff · ~$0.0008 / run · 6 safety layers).
Shared runtime in ~/.claude/shared/: 35 standards · 9 review agents · 8 hooks · 15 skills.
Steward pipeline: applyAction → spec-verifier → runNpmTest → gh pr create --draft.
Safety layers: draft-PR only · STEWARD_HALT killswitch · USD caps (d/w/m) · 3-failure circuit breaker · loop detection · atomic rollback.
Rule 1 · Invariants

SSOT · Modular · Scalable

Three architectural invariants non-negotiable across every scaffolded project. The structural floor — break one of these and the rest stops compounding.

Rule 2 · PR-blocking

Security · Testing · Observability · Correctness

Four critical standards block PRs in any scaffolded repo. Failures here surface as red CI lanes, not silent drift. The correctness floor.

Rule 3 · Process

Thirty-plus further standards (warnings, not blockers)

Code style, doc hygiene, dependency hygiene, naming. Surfaced as warnings so the operator can sequence work; not gated because they degrade gracefully.

Thirty-five standards across five rule tiers, ordered by a single mental model: structure first, then correctness, then polish.

Tier 2 (compound learners) is essentially closed (~80%): daily and weekly Dreaming consolidation crons have run on the cortex-x repo since 2026-05-09, alongside the AlphaEvolve A/B harness and FTS5-backed lesson retrieval. Tier 1 has shipped and holds, with the 2.3b runner and Stryker mutation testing still pending. Tier 3 and Tier 4 are stated commitments, not deliveries.

trajectory.txt
// Four-tier trajectory · two already shipped
Tier 0 · Foundation ─────────── shipped
  Scaffolds new projects · 11 stack profiles
  · 35 standards · 9 review agents · 8 hooks
  · install in ~3 minutes
Tier 1 · Verification + multi-agent ── shipped
  7-criterion spec verifier · Phoenix OTLP
  · autoresearch · senior-tester review
  · 6-agent parallel review pipeline
  · multi-window cost safety
Tier 2 · Compound learners ──────── ~80% done
  AlphaEvolve A/B harness v0
  · self-extending capabilities
  · FTS5 lessons
  · daily + weekly Dreaming consolidation
Tier 3 · Productization ──────────── planned
  Capability marketplace · WaaS template
  · voice → recommendation pipeline
Tier 4 · Persistent entity ───────── 2027+
  Self-hosted home server · soul abstraction
  · Obsidian SSOT · multi-source life ingest
bin/steward/_lib/spec-verifier.cjs
// Seven criterion kinds: sprint 1.9.0 + 2.18 + 2.3.1
// Verifier sits between applyAction and runNpmTest.
{
  kind: "shell",          // exit code + stdout match
  kind: "file_predicate", // exists · mtime · content hash
  kind: "regex",          // pattern match in named files
  kind: "ears_text",      // EARS-shape NL clause
  kind: "llm_judge",      // Sonnet-grade boolean verdict
  kind: "read_set",       // proof the LLM read claimed files
  kind: "mutation_score", // Stryker survival threshold (2.3.1)
}
// Verifier fails closed: any criterion that does
// not pass aborts the action, triggers atomic rollback,
// writes the failure to the journal, produces no PR.
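The fail-closed behavior in the listing above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in spec-verifier.cjs: the handler table, the in-memory `files` map, the snapshot-based rollback, and every function name are hypothetical. Only two of the seven kinds are implemented, which is deliberate: an unknown or unimplemented kind must fail, never pass.

```javascript
"use strict";
// Hypothetical sketch: handler table, context shape, and names are assumptions.
const HANDLERS = {
  regex: (c, ctx) => new RegExp(c.pattern).test(ctx.files[c.file] ?? ""),
  file_predicate: (c, ctx) => Object.hasOwn(ctx.files, c.file),
};

function verify(criteria, ctx) {
  for (const c of criteria) {
    const handler = Object.hasOwn(HANDLERS, c.kind) ? HANDLERS[c.kind] : null;
    let passed = false;
    try {
      passed = handler !== null && handler(c, ctx) === true;
    } catch {
      passed = false; // a crashing check counts as a failure
    }
    if (!passed) return { ok: false, failed: c.id };
  }
  return { ok: true, failed: null };
}

function applyWithVerification(action, ctx, journal) {
  const snapshot = { ...ctx.files }; // cheap stand-in for atomic rollback
  action.apply(ctx);
  const result = verify(action.criteria, ctx);
  if (!result.ok) {
    ctx.files = snapshot; // roll back the edit
    journal.push({ action: action.kind, failed: result.failed });
  }
  return result.ok; // a caller would open a draft PR only on true
}

const ctx = { files: { "README.md": "35 standards" } };
const journal = [];
const edit = {
  kind: "doc_fix",
  criteria: [{ id: "c1", kind: "regex", file: "README.md", pattern: "^36 " }],
  apply: (c) => { c.files["README.md"] = "36 standards"; },
};
console.log(applyWithVerification(edit, ctx, journal), journal.length); // true 0
```

The strict `=== true` comparison matters: a handler that returns a truthy non-boolean, or no handler at all, is treated as a failure.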
05
Hard-won decisions

The design choice worth defending — and the day green tests weren't enough.

“One incident class equals one defense layer plus one regression test. The rule that closes the gap between green tests and actual safety.”
DECISION · 01

Spec-driven verification — seven criterion kinds

In April the runtime had one hardcoded check before any Steward edit hit disk: a destructive-rewrite guard. By mid-May this had generalized into the spec-verifier — a runner sitting between applyAction and runNpmTest that gates every action against per-kind acceptance criteria.

Each of the 21 action_kinds in the Steward registry — autoresearch · evolve_daily · senior_tester_review · secret_history_sweep · workflow_hardener · wiki_consolidate · release_notes_drafter and fourteen others — declares its own criteria list. Verification runs as a layer independent of the model — it checks the output against executable criteria instead of relying on the model having been right. The mainstream agent runtime trusts the LLM's self-report; cortex-x writes the proof in code.
DECISION · 02

Cost as a first-class verifier output

Multi-window cost safety — daily, weekly, monthly USD caps plus a 50K-token-per-5-minute velocity cap plus a cross-session loop detector (five hits on the same criterion id across seven days triggers automatic STEWARD_HALT) — is the difference between an experimental autopilot and an autopilot the operator forgets to monitor.
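The cross-session loop detector can be sketched as a count over a sliding window. The threshold of five hits and the seven-day window come from the paragraph above; the journal entry shape and the `shouldHalt` name are assumptions.

```javascript
"use strict";
// Hypothetical sketch: the failure-entry shape and function name are assumptions.
const WINDOW_MS = 7 * 24 * 60 * 60 * 1000;
const MAX_HITS = 5;

// Returns the criterion id that should trigger a halt, or null.
function shouldHalt(failures, now) {
  const counts = new Map();
  for (const { criterionId, at } of failures) {
    if (at > now || now - at > WINDOW_MS) continue; // outside the window
    const n = (counts.get(criterionId) || 0) + 1;
    if (n >= MAX_HITS) return criterionId; // the same criterion keeps failing
    counts.set(criterionId, n);
  }
  return null;
}

const DAY = 24 * 60 * 60 * 1000;
const now = Date.UTC(2026, 5, 3);
const hits = [1, 2, 3, 4, 5].map((d) => ({ criterionId: "npm-test", at: now - d * DAY }));
console.log(shouldHalt(hits, now)); // npm-test
```

Counting per criterion id rather than per action is what makes the detector cross-session: five different sessions failing the same check look identical to one session failing it five times.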

The gap analysis was real: a steady $5/day burn for 30 days would clear a daily cap unflagged but blow an $80 monthly ceiling. Multi-window plus velocity catches the pattern, not the snapshot.
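The arithmetic can be replayed in a short simulation. The $5/day burn and the $80 monthly ceiling come from the text; the $6 daily cap, the $40 weekly cap, the spend-log shape, and the `breachedWindows` name are assumptions picked so that only the monthly window can trip.

```javascript
"use strict";
// Hypothetical sketch: the daily and weekly cap values and all names are assumed.
const DAY_MS = 24 * 60 * 60 * 1000;
const CAPS = [
  { name: "daily", days: 1, usd: 6 },
  { name: "weekly", days: 7, usd: 40 },
  { name: "monthly", days: 30, usd: 80 },
];

// Names of every window whose cap would be exceeded by spending `nextUsd` now.
function breachedWindows(spendLog, now, nextUsd) {
  return CAPS.filter(({ days, usd }) => {
    const since = now - days * DAY_MS;
    const spent = spendLog
      .filter((e) => e.at > since && e.at <= now)
      .reduce((sum, e) => sum + e.usd, 0);
    return spent + nextUsd > usd;
  }).map((c) => c.name);
}

// Simulate a steady $5/day burn and stop at the first blocked run.
const start = Date.UTC(2026, 4, 1);
const log = [];
let firstBreach = null;
for (let day = 0; day < 30 && !firstBreach; day++) {
  const now = start + day * DAY_MS;
  const breached = breachedWindows(log, now, 5);
  if (breached.length > 0) firstBreach = { day: day + 1, breached };
  else log.push({ at: now, usd: 5 });
}
console.log(firstBreach); // { day: 17, breached: [ 'monthly' ] }
```

Each single day stays under the daily cap and each rolling week peaks at $35, yet the seventeenth run would bring the month to $85, so the monthly window is the only one that catches the pattern.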
DECISION · 03

Signed verdict v2 — cryptographic, not remembered

The review gate stopped trusting memory. The verdict is now signed — HMAC-SHA256, or Ed25519 — over commit_sha, staged_tree, a workflow_run_id nonce, and a secret tier. A replayed, stale, or host-derived signature (under STRICT_SECRET=1) is rejected.

It does not replace [skip-review] — it complements it as a second unblock path; the session marker stays the highest-priority allow path. Discipline moved out of human memory and into a cryptographic artifact: the gate no longer trusts whoever says it passed, only what can be verified.
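For the Ed25519 option, Node's built-in crypto module is enough. The sketch below is an assumption about shape, not the project's code: the key pair is generated in-process for demonstration, and `sign`, `verify`, and the JSON encoding are hypothetical. How cortex-x stores or distributes keys is not stated in the source.

```javascript
"use strict";
const crypto = require("node:crypto");

// Hypothetical sketch: key handling and payload encoding are assumptions.
const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");

function sign(payload) {
  // Ed25519 hashes internally, so the algorithm argument is null.
  const data = Buffer.from(JSON.stringify(payload));
  return crypto.sign(null, data, privateKey).toString("base64");
}

function verify(payload, signature) {
  const data = Buffer.from(JSON.stringify(payload));
  return crypto.verify(null, data, publicKey, Buffer.from(signature, "base64"));
}

const payload = { commit_sha: "abc123", staged_tree: "t1", workflow_run_id: "run-1" };
const signature = sign(payload);
console.log(verify(payload, signature)); // true
console.log(verify({ ...payload, commit_sha: "other" }, signature)); // false
```

Unlike HMAC, the verifying side needs only the public key, so a gate built this way could check verdicts without being able to mint them.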
INCIDENT · 01

The Hermes rebrand

The runtime was originally named Hermes, a placeholder picked in early May. Two weeks later NousResearch published their open-source LLM family under the same name and accumulated 139,000 GitHub stars in a fortnight.

Sprint 4.7 (2026-05-08) was a hard pivot: every reference renamed to Steward, ten shim modules deleted in the same drop instead of carried as backward-compatibility debt, 115 test failures repaired same-day. Commit 8064b34. Lesson: when a public-facing name collides with a project in the same problem space, fix it before the public tag, not after. Search the namespace before committing the brand.
INCIDENT · 02

Sprint 1.6.18 — when green tests weren't enough

The day before, Steward v0.5b runtime had passed a full test suite and was a git push away from public-preview. Operator discipline ran a 6-agent parallel review pipeline against the diff anyway — acceptance-auditor · blind-hunter · correctness-auditor · security-auditor · ssot-enforcer · edge-case-hunter, each with differentiated context scope.

Eight ship-blockers came back the same day: the path-traversal check needed tightening with NUL-byte and flag-injection guards plus realpath containment; the editPlan needed an explicit shape gate; a data === null guard was missing; the default model had drifted from SSOT; CLI help text was stale; MIGRATIONS.md had not been backfilled. All eight were fixed the same day. Tests prove behavior, not architecture.
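The three path guards named above can be sketched with Node built-ins. This is a minimal illustration, not the fix that shipped: the `resolveInsideRoot` name and the error messages are hypothetical, and the sketch resolves symlinks only when the target already exists. A stricter version would also resolve the nearest existing parent of a not-yet-created file.

```javascript
"use strict";
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical sketch: function name and error messages are assumptions.
function resolveInsideRoot(root, candidate) {
  // NUL bytes can truncate paths in lower layers.
  if (candidate.includes("\0")) throw new Error("NUL byte in path");
  // A leading dash could be parsed as a flag by a downstream CLI.
  if (candidate.startsWith("-")) throw new Error("flag-like path");

  const realRoot = fs.realpathSync(root);
  const target = path.resolve(realRoot, candidate);
  // Resolve symlinks when the target exists; a new file keeps its lexical path.
  const real = fs.existsSync(target) ? fs.realpathSync(target) : target;

  // Containment: the relative path from root must not climb out of it.
  const rel = path.relative(realRoot, real);
  const escapes = rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (rel === "" || escapes) throw new Error("path escapes root");
  return real;
}
```

Comparing with `path.relative` rather than a string prefix avoids the classic mistake where `/srv/app-evil` passes a check for `/srv/app`.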
06
Arc 1 — self-correction

When a sprint catches its own bugs before they reach main

Figure: cortex-x verdict-gate flow. A non-trivial diff is dispatched to six parallel review agents (security-auditor, correctness-auditor, acceptance-auditor, ssot-enforcer, blind-hunter, edge-case-hunter). Their findings pass to a Pass-2 skeptic that re-derives each at 0 to 100 confidence. Survivors are written into a signed r2-verdict.json (HMAC-SHA256 or Ed25519, bound to commit_sha and a single-use nonce). The pre-commit gate verifies the signature and decision: PASS lets the commit land with the verdict hash in its body; FAIL sends the change back to the implement phase.

Verdict-gate flow: R2 dispatch → 6 agents → Pass-2 skeptic → signed verdict → pre-commit gate.

Between 30 May and 3 June 2026 four sprints shipped in a chain: 2.46 (signed verdict gate), 2.46.1 (Ed25519 v2 + nonce journal), 2.46.2 (doctor integration + qualified-prose tolerance), and 2.3.1 (mutation_score as the seventh criterion kind). The Arc 1 ledger is plain: +145 tests (3,290 → 3,435), 23 HIGH bugs caught and fixed before push, and zero of them on main. The Pass-2 skeptic refuted none of them.

Sprint 2.46 was the meta-recursive moment. The workflow whose job was to ship the signed-verdict gate produced structural defects in its own deliverables — a fictional gate-behavior table in standards/sprint-pipeline.md, an over-promised commitSha binding, and path drift across six reviewers. All six independent reviewers flagged the same bug (confidence 99/98/96/96/95/92). R2 caught them, the parent agent fixed them in-commit, the signed verdict regenerated, and the commit landed. None reached main.

This is the structural difference between “we have review” and “review is load-bearing.” When the pipeline finds a defect in an artifact the gate itself was meant to build, the discipline no longer lives in my head — it is externalized into code. pre-commit-review-gate.cjs does not wait for me to remember; the session marker stays the highest-priority allow path, but when I do not consciously choose a skip, the gate holds on its own.
This was the moment the project stopped being a set of tools and became a mechanism — the system corrected itself at a level I had not caught myself.
07
Founding sprint + Arc 1

9 files → a framework in 7 weeks. Solo.

The founding sprint began on 2026-04-17 from nine files and ran to a first public preview. Then came Arc 1 — hardening through 2026-06-03. Since 2026-05-09 the same engine opens real draft PRs straight on this repo: 17 active cron workflows, draft-PR only, atomic rollback on test failure.

01
WEEK 01 · FOUNDATION
Phase 1 init · 11 standards · 9 profiles
Apr 17 – Apr 23 · Phase 1 · 9 profiles

Cross-platform install (Bash + PowerShell 5.1 + 7). Three foundational hooks: session-start, block-destructive, pre-compact. Projects library, cortex-thinker agent, insights, journal, coding-behavior tier, ship-ready gate. The scaffolding floor.

02
WEEK 02 · RULE 2 + DETECTORS
Correctness pillar · agentic security · runtime SLOs
Apr 24 – Apr 30 · Rule 2 · 7 MUSTs

Agentic Security section (lethal trifecta, 7 MUSTs) plus runtime SLOs and circuit breakers. Deterministic profile + stage classifiers under 100ms in detectors/. agentskills.io spec, browser-agent profile, Tirith scanner integration.

03
WEEK 03 · STEWARD ENGINE
Steward v0 → v0.5b OpenRouter engine
May 01 – May 07 · Steward v0.5b · 0 runtime deps

Zero-deps preserved via native fetch. 8 distinct error codes. Pluggable engine seam (mock / openrouter / claude-cli). First real LLM call validated end-to-end. Atomic rollback on test failure. Journal cost capture. gh pr create --draft Phase 11 integration.

04
WEEK 04 · VERIFICATION + DOGFOOD
Spec-driven verification · v0.3.0 public preview
May 08 – May 14 · v0.3.0 · public preview

Sprint 1.9 spec-driven verification (5 criterion kinds initially, +1 in Sprint 2.18). Multi-window cost safety + cross-session loop detector. Sprint 2.0 Phoenix OTLP observability. Sprint 2.1 autoresearch. Sprint 2.3 Stryker mutation baseline. 2026-05-14 autonomous burst: 11 sprints + 4 R2 rounds shipped. The first nightly cron workflows go live.

05
ARC 1 · 2026-05-30 → 06-03 · SELF-CORRECTING
Signed verdict gate · mutation_score · R2 catches its own bugs
+145 tests · 23 HIGH caught · v0.4-pre

Four sprints in a chain — 2.46 (signed verdict gate), 2.46.1 (Ed25519 v2 + nonce journal), 2.46.2 (doctor integration + qualified-prose tolerance), 2.3.1 (mutation_score as the 7th criterion kind). The meta-recursive moment: the workflow meant to ship the signed-verdict gate itself produced structural defects in its own deliverables. All six independent reviewers flagged the same bug (confidence 99/98/96/96/95/92). R2 caught it, the parent agent fixed it in-commit, the signature regenerated, the commit landed. Nothing reached main.

Test count progression: 207 (Apr 17) → 600 (May 7) → 3,290 (Arc 1 start) → 3,435 (Jun 3). Five-lane CI matrix — Ubuntu bash, macOS bash, Windows Git Bash, Windows PowerShell 7, Windows PowerShell 5.1 — green throughout. The 17 active workflows include steward-harvest · steward-evolve-daily · steward-evolve-weekly · steward-flaky-test-repair · steward-secret-history-sweep · steward-doc-drift · steward-pr-review-responder · steward-senior-tester-review · steward-workflow-hardener and others.

08
Stack
PRIMARY RUNTIME · 01
Node.js ≥22
CJS · native fetch
Native fetch made the OpenRouter engine possible without node-fetch or axios. For every dependency considered, the first question was: can this be 200–400 lines of Node built-ins instead?
ENGINE · 02
Steward
Zero npm dependencies. Pluggable LLM seam: openrouter / claude-cli / mock. 21 action_kinds in the registry.
DEFAULT MODEL · 03
DeepSeek V4 Flash
Via OpenRouter at ~$0.0008/run. $0 marginal cost via Anthropic Max subscription on the claude-cli engine.
OBSERVABILITY · 04
Phoenix OTLP
Zero-deps protobuf encoder — ~370 lines of hand-written CJS rather than @opentelemetry/exporter-otlp-http.
DISTRIBUTION · 05
git clone + install.sh / .ps1
Syncs to ~/.claude/shared/. 600-line installer audited line-by-line.
LICENSE · 06
Apache 2.0
SPDX Apache-2.0 · relicensed 2026-05-12. Stryker · c8 · fast-check ship as dev deps only — not bundled with runtime.

Zero runtime npm or pip deps. Steward must be auditable, vendorable into client infrastructure, and runnable on hardened CI without supply-chain surface. The spec-verifier now ships seven criterion kinds including mutation_score, and a signed r2-verdict gates commits into the pipeline.

09
Numbers

Numbers (snapshot 2026-06-03).

Hand-verified against the live repo on 2026-06-03 — and the doc-currency lint guarantees these don't rot silently.

3,435
tests · 0 failing · 2 pre-existing skips
419
commits, founding sprint → Arc 1 close
35
standards (36 incl. the README index) across the 5-tier rule hierarchy
21
Steward action_kinds in the registry
7
spec-verifier criterion kinds (shell · file_predicate · regex · ears_text · llm_judge · read_set · mutation_score)
9
review agents (6 R2 auditors + Pass-2 skeptic + cortex-thinker · planner · synthesizer)
20
reusable prompts bound to slash commands
15
user-discoverable slash skills
17
nightly cron workflows live since 2026-05-09
11
project profiles (Next.js SaaS · ai-agent · chatbot · cli-tool · tauri-desktop · kiosek · qa-engineer · astro-static · waas-template · browser-agent · minimal)
5
CI lanes green (ubuntu-bash · macos-bash · win-gitbash · win-pwsh7 · win-ps5.1)
0
runtime npm or pip dependencies
10
Results & Learnings

Results

  • R/01 · A maintenance autopilot that actually runs unattended. Since 2026-05-09, 17 cron workflows have been opening draft PRs on the cortex-x repo nightly without manual intervention. Each PR carries a journal trailer with cost, phase timings, and rollback receipts. Real validation, not a screenshot.
  • R/02 · Seven independent enforcement layers, each able to stop a commit on its own. block-destructive intercepts a destructive shell command; a policy denylist refuses forbidden operations; multi-window USD caps cut off spend across several time windows; a loop detector catches runaway cycles; a circuit breaker halts a cascade of failures; atomic rollback returns the tree to a clean state; the signed R2 verdict is the last gate before a commit lands. Compromising one does not bypass the others: seven mutually independent locks, a structural property, not a config you can switch off by accident.
  • R/03 · The XDG separation. Personal data (project library entries, journal traces, research cache, insights) lives under $CORTEX_DATA_HOME (default ~/.cortex/). Framework code stays under ~/.claude/shared/. cortex-uninstall --purge requires a second confirmation step. The framework can be wiped clean without touching months of accumulated work.
  • R/04 · Public Apache-2.0 release with stranger-reproducible install. One-line install on five platforms. 600-line installer audited line-by-line. cortex-doctor validates the install end-to-end with drift detection and auto-fix prompts. The framework leaves the operator's laptop on terms a stranger could verify.
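The data and framework separation in R/03 can be sketched as two independent path resolvers plus a guarded purge. The two directories and the CORTEX_DATA_HOME default come from the text; the function names, the options object, and the error message are hypothetical, and the sketch only computes targets without deleting anything.

```javascript
"use strict";
const os = require("node:os");
const path = require("node:path");

// Hypothetical sketch: function names and the options shape are assumptions.
function dataHome(env = process.env) {
  return env.CORTEX_DATA_HOME || path.join(os.homedir(), ".cortex");
}

function frameworkHome() {
  return path.join(os.homedir(), ".claude", "shared");
}

// A plain uninstall targets framework code only. Purge adds personal data,
// and only after a second, explicit confirmation.
function uninstallTargets({ purge = false, confirmedTwice = false } = {}, env = process.env) {
  const targets = [frameworkHome()];
  if (purge) {
    if (!confirmedTwice) throw new Error("purge requires a second confirmation");
    targets.push(dataHome(env));
  }
  return targets;
}

console.log(uninstallTargets());
console.log(uninstallTargets({ purge: true, confirmedTwice: true }, { CORTEX_DATA_HOME: "/data/cortex" }));
```

Because the two roots never share a parent beyond the home directory, removing one cannot remove the other by accident.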

Learnings

  • L/01
    Build the product before the framework.

    RELO came first — an AI back-office agent in production. The framework is the distilled discipline that produced RELO, not the precondition for producing it. The order matters: extract the pattern from a working result, then formalize. The opposite order produces frameworks nobody uses.

  • L/02
    Tests prove behavior; multi-agent review proves architecture.

    Arc 1 showed it meta-recursively. Sprint 2.46 was meant to deliver the signed-verdict gate; yet the very workflow that was supposed to build the gate delivered structural defects in its own deliverables — a fictional gate-behavior table in standards/sprint-pipeline.md, an over-promised commitSha binding, path drift across six reviewers. All six independent reviewers flagged the same bug at confidence 99/98/96/96/95/92. R2 caught it, the parent agent fixed it in-commit, the signed verdict regenerated, the commit landed. None of it reached main. That is the difference between having a review process and review being load-bearing.

  • L/03
    Security mechanics is a structural gate, not documentation.

    The seven independent layers do not exist because they look good in a README. They exist because relying on operator discipline under fatigue and time pressure fails — not occasionally, but reliably, at the worst moment. An architectural gate is more expensive to write once, but cheaper to operate long-term than human vigilance. Discipline I have to remember, I will forget one night at three a.m. A gate wired into the code never forgets.

  • L/04
    Discipline externalized in code beats discipline held in memory.

    The signed verdict, the hooks, the acceptance_criteria on every action_kind — all of it survives model generations. When Claude 5 arrives, I don't have to teach the new model what done means; the definition of done is written into 35 standards and into the code, not into the model's head or mine. A rule that lives only in memory disappears with the context window. A rule written to a file stays.

  • L/05
    Repeatability beats virtuosity.

    Sprint 2.45 shipped the /cortex-sprint Skill encoding the canonical 8-step pipeline. On 2026-06-03 I forbade freestyle sprint dispatch — every sprint must go through that Skill. The result is an identical plan / R2-summary / verdict structure across sprints; not because I'm equally sharp every time, but because the pipeline replaces the need for that. And because ~/.claude/shared/ is globally available, the discipline transfers across projects. Three documented events confirm it: the RELO RLS-first multi-tenant pattern carried into lasertgame-funos (2026-05-01); the portfolio retrofit where the review pipeline autonomously caught 3 HIGH findings with no explicit invoke (2026-04-21); and news-bot, where agentic-workflow discipline applied in a project with no cortex sprint at all (2026-06-03).

11
Honest limits

What cortex-x doesn't claim — yet.

Let me be honest about the limits. I built this alone: one person, seven weeks from nine files (2026-04-17) to the current state. That is both the strength and the weakness: the coherence of a single mind, but also a bus factor of one. The framework is coupled to Claude Code; it is not a neutral tool that runs anywhere. Third-party validation is in progress, not done. It is partially closed by three documented cross-project transfers (lasertgame-funos, the portfolio retrofit, news-bot), but those are still my projects, not strangers' hands.

The deferred backlog: a multi-agent git-worktree spawner, an Anthropic Memory Tool wire-up, and graduating the Pass-2 skeptic from opt-in to default-on. The Tier 1 2.3b runner plus Stryker is underway, not finished. What is no longer deferred: mutation_score shipped on 2026-06-03 in Sprint 2.3.1 as the seventh spec-verifier criterion kind, so I stop promising it and start using it.

The framework is in public preview because the engine is real, not because everything on the roadmap is done.
12
Where this goes

A persistent agent on operator-owned infrastructure.

Instead of “an AI assistant in the cloud,” I am building a personal, personalized factory for the AI era — a second brain shaped like markdown that survives model generations (Claude 4.7 → 4.8 → 5 → …). Institutional wisdom as code that compounds across years: every lesson, every decision, every standard written once and available to every future session, in every project.

Tier 2 is essentially closed at roughly 80%. Tier 3 — productization — is next. Tier 4 is a persistent entity: a self-hosted home server on which the factory runs independently of the hardware under me. It is not tied to one machine, nor to one provider.

cortex-x is not for sale. It is open source under Apache 2.0. The work behind it is the work I want to do next.

Open source on GitHub
13
Closing

What working on this looks like

If you're reading this as a hiring manager or a client, this is what my work looks like — and this repository is the proof of it, not the claim.

  • 95% confidence baseline. Before I write the first line, I ask clarifying questions until I'm roughly 95% sure of both scope and acceptance criteria. One round of questions saves three to four rounds of corrections.
  • Autonomous overnight runs with checkpoint discipline. When the model's smart zone starts to degrade, I checkpoint and clear context — I don't push on blind. Quality over count: fewer commits that hold beats a pile that has to be reverted.
  • Cross-project transfer — measured, not declared. Three documented carries: the RLS-first multi-tenant pattern from RELO into lasertgame-funos; a review pipeline that caught 3 HIGH findings on the portfolio with no explicit invoke; and agentic-workflow discipline applied in news-bot with no cortex sprint at all.
  • Externalized discipline. What I do by hand today, I codify into a hook, a skill, or a verdict tomorrow. Discipline that lives only in your head is technical debt — it belongs in code.
On April 17 I started from nine files. By June 3 the repo holds 35 standards, 21 action_kinds, and 17 nightly workflows — and across all of Arc 1 not a single R2 HIGH finding shipped to main, with 23 caught before push. When the pipeline ships defects in its own deliverables and catches them before push, that's proof the discipline is externalized into code — not just in my head. That's what working with me looks like.