Most developers hit the same wall with Codex. The first week feels electric. You prompt, it codes, things work. Then the codebase grows. Fixes start landing in the wrong layer. Architecture quietly degrades. The tool that felt like a superpower starts feeling like a liability.
The gap between casual Codex use and productive Codex use comes down to five systems: AGENTS.md, skills, worktrees, plan mode, and verification gates. None of them are complicated. All of them change the quality of what Codex produces.
This tutorial assumes you already have Codex installed and have run a few sessions. You know the basic commands. What you need now is structure. The kind that turns a chatbot into something closer to a reliable engineering teammate.
Prerequisites: Codex CLI or the Codex app, an OpenAI Plus/Pro/Team subscription, a Git repo you actively work in, and terminal comfort.
What You'll Learn
- AGENTS.md gives Codex persistent project rules that survive across sessions and work in every major coding agent
- Skills package repeatable workflows into reusable bundles any team member can run without explanation
- Git worktrees let you run multiple Codex agents in parallel on the same repo without the agents touching each other's files
- Plan mode and verification gates prevent the architectural decay that casual AI-assisted coding produces
AI-generated summary, reviewed by an editor. More on our AI guidelines.
AGENTS.md: Teaching Codex How Your Project Works
Every Codex session starts the same way. The agent scans your repository, pokes through files, and assembles a mental model of your project. Without guidance, that model is a guess. Often a bad one.
AGENTS.md fixes this. Picture a README written for the agent, not for humans. Codex loads AGENTS.md files before doing any work. It walks from your current directory up to the project root, collecting instructions along the way. Closer files win when rules conflict.
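In a layered setup, that walk might look like this (paths are illustrative):
~/.codex/AGENTS.md                    ← personal defaults, loaded everywhere
my-project/AGENTS.md                  ← repo-wide rules
my-project/packages/api/AGENTS.md     ← closest to the working directory, wins on conflicts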
The fastest way to start is the scaffold command.
codex /init
That gives you a template. But templates are generic. The real value comes from making your AGENTS.md specific to how your project actually works.
A good AGENTS.md answers four questions: what is this project and how is it organized, which commands should Codex run (and which should it avoid), what coding standards must it follow, and what does "done" look like.
# AGENTS.md
## Project
Node.js REST API using Express + TypeScript. PostgreSQL via Prisma ORM.
Layout: `src/` (app code), `tests/` (Jest), `prisma/` (schema + migrations).
## Commands
- Run tests: `npm test`
- Run single test: `npm test -- --testPathPattern=<file>`
- Lint: `npm run lint`
- Type check: `npx tsc --noEmit`
- Dev server: `npm run dev`
- Do NOT run `npm run build` or `prisma migrate deploy` without asking.
## Standards
- Strict TypeScript. No `any` types unless explicitly justified.
- All new endpoints need request validation via Zod schemas.
- Error handling through centralized middleware, not per-route try/catch.
- Tests required for every new route and service function.
## Definition of Done
- All existing tests pass (`npm test`)
- Lint and type check pass
- New behavior has new tests
- No unrelated file changes in the diff
Two layers work together here. Your personal ~/.codex/AGENTS.md defines how you work across all projects: your preferred error handling style, typing strictness, how aggressive changes should be. Your repository's AGENTS.md defines how this specific codebase works: folder structure, test commands, fragile areas, team conventions.
Kirill Markin, who runs multiple projects with Codex, puts it simply: "Personal AGENTS.md says how I work. Repository AGENTS.md says how this codebase works."
One pattern that pays off fast: if Codex makes the same mistake twice, update AGENTS.md. The file should grow organically from real friction, not from imagined requirements. Start with five lines. Add rules only when something breaks.
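For example, if Codex keeps hand-editing generated files, one new rule ends the pattern. A hypothetical addition, in the same format as the Standards section above:
## Standards
- Never hand-edit generated files under `prisma/client/`; regenerate with `npx prisma generate`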
AGENTS.md also works across tools. The same file is read by Cursor, Copilot, Amp, Gemini CLI, Windsurf, and over 60,000 open source projects. Write once. Every coding agent in your stack follows the same rules.
Skills: Packaging Reusable Workflows
AGENTS.md handles project-wide rules. Skills handle specific, repeatable tasks.
A skill is a bundle of instructions, optional scripts, and reference docs packaged around a SKILL.md file. When you find yourself giving Codex the same multi-step instructions for the third time, that workflow belongs in a skill.
OpenAI's built-in skill ecosystem covers common integrations: Figma for pulling design specs into React components, Vercel and Cloudflare for deployment, Linear for issue tracking. The Codex app surfaces these in a dedicated Skills tab. But the real power comes from building your own.
Every skill lives in its own directory with a specific structure.
my-skill/
├── SKILL.md          # Required: instructions for the agent
├── scripts/          # Optional: executable scripts
│   └── deploy.sh
├── references/       # Optional: docs, examples, specs
│   └── api-docs.md
└── agents/
    └── openai.yaml   # Optional: metadata, UI config, dependencies
The SKILL.md file is the core. It covers preconditions (what must be true before the skill runs), the exact steps, what success looks like, and recovery when something chokes.
# Deploy to Staging
## Preconditions
- You are on a feature branch (not `main`)
- All tests pass locally (`npm test`)
- No uncommitted changes (`git status` is clean)
## Steps
1. Run `npm test` and confirm all tests pass
2. Run `npm run build` and confirm no errors
3. Run `npx vercel --env staging` to deploy
4. Wait for the deployment URL in the output
5. Open the deployment URL and verify:
   - Home page loads without errors
   - API health endpoint returns 200
   - New feature or fix is visible and working
6. Copy the deployment URL and print it for the user
## Success
- Build completes without warnings or errors
- Deployment URL is live and returns HTTP 200
- Manual check of the changed feature confirms expected behavior
## On Failure
- If tests fail: stop and report which tests failed
- If build fails: check for TypeScript errors first
- If deploy fails: check Vercel logs with `npx vercel logs`
Aman Mittal built a skill that syncs his Linear "Todo" tasks directly into his Obsidian vault using Playwright MCP to read the browser. The skill runs every time he types "sync linear todo" in the CLI. No API keys, no custom integrations, just the agent reading a web page and writing markdown.
Here is the test for whether a skill is ready: could a new team member run it cold, no 20-minute Zoom call required? Yes? Ship it. "Well, they'd need to know about this one edge case..." That edge case belongs in the SKILL.md.
You can create skills from scratch, but it is faster to ask Codex to scaffold one for you. The built-in $skill-creator generates the directory structure and a starter SKILL.md. Edit from there.
Check skills into your repository and they travel with the codebase. Everyone on the team, every Codex surface, picks them up automatically.
Worktrees and Parallel Agents: Running Multiple Tasks at Once
This is where Codex stops being a faster way to write code and starts being a different way to work.
Git worktrees are simpler than they sound. You get a second (or third, or fifth) checked-out copy of your repo in its own directory, but all copies share one .git history underneath. No duplicate .git folders bloating your disk. Each worktree has its own HEAD and index. Commits, branches, and remotes stay shared.
# Create a worktree on a new feature branch (-b creates the branch)
git worktree add -b feature/auth ../feature-auth
# Your filesystem now looks like this:
#   my-project/     ← main worktree (.git lives here)
#   feature-auth/   ← linked worktree (pointer back to main .git)
The Codex app automates this entirely. When you start a new task thread and select "Worktree" instead of "Local," Codex spins up an isolated worktree at $CODEX_HOME/worktrees/ in detached HEAD state. The agent works in that isolated directory. Your main checkout stays untouched.
The practical result: you can run three, five, even ten agents simultaneously on the same repository. Agent A refactors the auth module. Agent B fixes a payment bug. Agent C updates documentation. None of them touch each other's files. Zero conflict surface.
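A minimal version of that layout from the terminal, assuming the Codex CLI's non-interactive `codex exec` mode (branch names and prompts are illustrative):
git worktree add -b refactor/auth ../wt-auth
git worktree add -b fix/payments  ../wt-payments
git worktree add -b docs/api      ../wt-docs
# One agent per worktree, each in its own terminal tab or tmux pane
(cd ../wt-auth     && codex exec "Refactor the auth module to use the session service")
(cd ../wt-payments && codex exec "Fix the rounding bug in invoice totals")
(cd ../wt-docs     && codex exec "Update the API docs for the v2 endpoints")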
Verdent AI benchmarked this pattern on an 800-file Node.js monorepo. Five parallel worktrees cut total task time from 42 minutes (single agent) to 14 minutes. Worktree creation took under a second each because Git only checks out working files; the object store is already local.
But parallel execution only works when tasks are genuinely independent. Before spinning up worktrees, run a quick mental check.
| Task Relationship | Strategy |
|---|---|
| Completely independent modules | Full parallel, separate worktrees |
| Same module, different functions | Coordinate merge order carefully |
| Same file, overlapping lines | Run sequentially, or split the file first |
| One task depends on another's output | Sequential only, worktrees cannot see uncommitted work |
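Once each task has a branch with commits on it, you can also script a rough version of this check (branch names are illustrative):
# Files touched by both branches relative to main; any output means overlap
comm -12 \
  <(git diff --name-only main...refactor/auth | sort) \
  <(git diff --name-only main...fix/payments  | sort)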
The biggest gotcha with worktrees: dependencies. When Codex creates a worktree, it checks out tracked files only. Everything in .gitignore (your node_modules/, .env, dist/) does not exist in the new worktree. Your agent will try to run code in a directory with zero dependencies installed.
The fix is a setup script that runs automatically after worktree creation.
#!/bin/bash
# .codex/setup.sh — runs automatically after worktree creation
set -e
cd "$CODEX_WORKDIR"

# Install dependencies (offline cache makes this fast after first run)
if [ -f "package.json" ]; then
  npm ci --prefer-offline
fi

# Copy environment variables from the main worktree
# (the relative path assumes a sibling layout; adjust it for $CODEX_HOME/worktrees/)
if [ -f "../.env" ]; then
  cp "../.env" ".env"
fi

echo "Worktree ready: $CODEX_WORKDIR"
On macOS with APFS, you can also use copy-on-write clones for node_modules. Near-zero cost for read-heavy dependency trees.
cp -cR ../main/node_modules ./node_modules  # CoW clone on APFS — near-instant (-R is required for directories)
When a worktree task finishes, review diffs in the Codex app's built-in diff viewer rather than your editor. Stage or revert chunks before committing. This is where the human judgment lives in a multi-agent workflow: not in writing the code, but in approving it.
Clean up finished worktrees regularly. They do not auto-delete.
git worktree list # see what is sitting around
git worktree remove $CODEX_HOME/worktrees/thread-1 # remove a finished worktree
git worktree prune # clean up stale metadata
Plan Mode and Verification: Stopping Code Rot Before It Starts
Codex is extremely good at fixing the thing you show it. It is extremely bad at fixing the system you do not describe. Mohsen Nasiri documented this pattern after watching his cooking app accumulate one-off conditionals: a "bunch" unit that should not scale got a hardcoded if statement instead of a domain model change. The bug vanished. The architecture got worse.
Plan mode is the first line of defense. Enter it with /plan or by pressing Shift+Tab when Codex detects you are discussing strategy. In plan mode, Codex reads your codebase, proposes where changes should live, identifies affected modules, and builds a step-by-step execution plan, all without touching a single file.
Review the plan. Reject anything that starts with "add a conditional to this file." Approve, and Codex executes against the agreed structure.
For anything beyond a small UI tweak, OpenAI's own team recommends this approach. Alex, the Codex PM, described in a recent team interview how he uses plan mode even for features he will not personally own: "I go through the motions of a plan mode and exploring it. And then I just have a better mental model of what we need to do."
The second line of defense is verification gates. By default, Codex decides when it is finished. For real projects, you should not let it. Define explicit completion criteria.
## Definition of Done
A task is NOT complete until all of the following pass:
1. `npm test` — all existing and new tests pass
2. `npm run lint` — zero warnings, zero errors
3. `npx tsc --noEmit` — type check passes
4. `git diff --stat` — only files related to the task are changed
5. No hardcoded values, magic strings, or one-off conditionals added
6. If a new API endpoint was added: request validation exists via Zod
7. If UI was changed: visually verify in the browser using Playwright
Do not report the task as complete until every gate passes.
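To make the automated gates mechanical rather than aspirational, you can wrap them in a script and tell Codex (via AGENTS.md) to run it before reporting done. A sketch; the filename and the lint flag are assumptions:
#!/bin/bash
# verify.sh: run every automated gate; any failure stops the script
set -e
npm test                           # gate 1: full suite, existing and new tests
npm run lint -- --max-warnings=0   # gate 2: zero warnings (flag assumes ESLint)
npx tsc --noEmit                   # gate 3: type check
git diff --stat                    # gate 4: a human reviews this for unrelated files
echo "Automated gates passed. Gates 5-7 still need judgment."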
The third layer is what Nasiri calls an "AI contract," a persistent file that defines what Codex is forbidden from doing. No one-off fixes for specific values. Domain rules must live in domain models, not UI conditionals. Passing tests is necessary but not sufficient.
# AI_CONTRACT.md
- No one-off fixes for specific values (units, ingredients, titles, etc.)
- Domain rules live in domain models or services, never in UI components
- UI components must not contain business logic
- Passing tests is necessary, not sufficient — architecture must be correct
- No new dependencies without explicit approval
- If a fix requires more than one `if` statement, the abstraction is wrong
Gates do not make the model smarter. They make bad solutions expensive. Once the cheapest path aligns with correct design, Codex's first attempt improves dramatically.
One more pattern worth adopting: test-driven development. Without tests, Codex verifies its own work using its own judgment. Tests create an external source of truth. Write the test first, then let Codex implement against it. The agent loops: run test, read failure, adjust code, run test again. It keeps grinding until everything passes. GPT-5.4 handles this well because it can scale its thinking time mid-task when it hits a harder problem.
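In practice the loop is just the repo's own test commands run on repeat (the test name is illustrative):
# Red: write the failing test first, then hand it to Codex
npm test -- --testPathPattern=invoice.service   # fails: behavior not implemented yet
# Codex edits src/services/invoice.service.ts and reruns...
npm test -- --testPathPattern=invoice.service   # passes: targeted test is green
npm test                                        # final gate: nothing else broke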
Common Mistakes and Pitfalls
Treating AGENTS.md like a novel. Long, detailed instruction files sound like a good idea. They are not. Codex works best with concise, specific rules. Start with a handful of lines. If your AGENTS.md exceeds 50 lines, you are probably duplicating information that belongs in skills or code comments. Every rule should earn its place by preventing a real mistake.
Running parallel agents on overlapping files. Worktrees isolate the working directory, not the intent. If Agent A refactors the user model while Agent B adds a field to that same model, you will get merge conflicts, or worse, conflicting architectural decisions that compile fine individually but break together. Check task independence before parallelizing. Same module with different functions is a yellow flag. Same file with overlapping lines is a red one.
Never compacting long sessions. Codex carries context forward, but context windows are not infinite. GPT-5.4 offers 1M tokens, which sounds enormous until you realize the agent is also reading files, running commands, and tracking its own plan. Use /compact to compress earlier conversation context. Use /new to start fresh when shifting to an unrelated task. OpenAI explicitly warns against using one giant thread per project.
Skipping the setup script for worktrees. This is the number one cause of mysterious failures in parallel workflows. Your source code is there but node_modules is gone. So is .env. And your build cache. The agent tries running tests, they blow up because dependencies are missing, and it starts "fixing" things that were never broken. Write a .codex/setup.sh once and attach it to every worktree.
Accepting output without running verification. Codex will report a task as complete the moment the code compiles or the immediate test passes. That is not the same as "correct." Always have Codex run the full test suite, not just the tests related to the change. Always have it check linting and type safety. If you are building a web application, tell Codex to open it and verify the behavior visually using Playwright. Trust but verify.
Over-engineering MCP when shell tools already work. MCP servers are powerful for connecting to external services. But many developers wire up custom MCP integrations for things gh, npm, vercel, or aws already handle from the command line. If your workflow involves a CLI tool that Codex can run directly, skip the abstraction. Fewer tokens. Faster execution. Less to maintain.
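A few concrete stand-ins, using standard subcommands of those CLIs (the Vercel URL is a placeholder):
gh pr list --state open            # open pull requests, no GitHub MCP server needed
gh run view --log-failed           # failing CI logs straight from the terminal
npx vercel logs <deployment-url>   # deployment logs, same tool the deploy skill used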
What Comes Next
You now have the five systems that separate reliable Codex workflows from fragile ones: AGENTS.md for project rules, skills for reusable processes, worktrees for parallel execution, plan mode for architectural alignment, and verification gates for quality control.
Three directions worth exploring from here. First, automations: scheduled Codex tasks that run without prompting. OpenAI uses them internally for daily issue triage, CI failure summaries, and release briefs. The pattern is read-only by default, artifact-producing, and human-reviewed before changes ship.
Second, subagents: a manager agent that decomposes your task and spawns specialized worker agents in isolated cloud sandboxes. Each worker has full tool access but cannot communicate with other workers directly. All coordination flows through the manager. OpenAI shipped this to general availability in March 2026, running on GPT-5.4 with a 1M token context window.
Third, the Agents SDK integration. Codex CLI can run as an MCP server itself, exposing codex() and codex-reply() tools that the Agents SDK calls to orchestrate multi-agent workflows programmatically. A project manager agent creates requirements. A designer agent produces specs. Frontend and backend agents implement in parallel. A tester agent validates. Not a demo. A documented pattern in the OpenAI Cookbook.
The ceiling keeps rising. But the foundation stays the same: clear instructions, isolated execution, and verification that does not depend on the model's own judgment.
Frequently Asked Questions
What is AGENTS.md and how is it different from a README?
AGENTS.md is an instruction file that Codex reads automatically before every session. Unlike a README written for humans, it contains specific rules the agent follows: allowed commands, coding standards, file structure, and what 'done' means. It works across Codex, Cursor, Copilot, and other coding agents.
Do I need the Codex app to use worktrees?
No. You can create Git worktrees manually from the terminal with git worktree add and run separate Codex CLI sessions in each one. The Codex app automates the creation and cleanup, but the underlying mechanism is standard Git.
How many parallel agents can I run at once?
There is no hard limit from Git's side. Verdent AI tested five parallel worktrees on an 800-file monorepo without issues. The practical limit is your machine's resources and your OpenAI rate limits. Token consumption scales non-linearly with more agents.
What is the difference between skills and AGENTS.md?
AGENTS.md sets project-wide rules that apply to every session. Skills are task-specific workflow packages that Codex loads on demand. Use AGENTS.md for standards and conventions. Use skills for repeatable multi-step processes like deploying, reviewing, or generating reports.
How do I prevent Codex from making one-off hacks instead of proper fixes?
Three layers work together: plan mode forces Codex to propose where changes belong before writing code, verification gates define explicit completion criteria beyond just passing tests, and an AI contract file lists forbidden shortcuts like hardcoded conditionals or UI-layer business logic.