<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>Matt Rickard</title>
        <link>https://mattrickard.com/rss</link>
        <description>Essays about startups, engineering, and interesting things.</description>
        <lastBuildDate>Tue, 31 Mar 2026 14:30:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>mrick</generator>
        <ttl>60</ttl>
        <image>
            <title>Matt Rickard</title>
            <url>https://mattrickard.com/favicon-32x32.png</url>
            <link>https://mattrickard.com/rss</link>
        </image>
        <copyright>Matt Rickard</copyright>
        <item>
            <title><![CDATA[The Spec Layer]]></title>
            <link>https://mattrickard.com/the-spec-layer</link>
            <guid isPermaLink="false">5fee342093a2fd79e23fb186278d55f8</guid>
            <pubDate>Tue, 31 Mar 2026 14:30:00 GMT</pubDate>
            <description><![CDATA[An AI agent implements a feature. The code compiles. The tests pass. It still misses the point.

The wrong kind of correct.

Most of our software tooling is opt]]></description>
            <content:encoded><![CDATA[<p>An AI agent implements a feature. The code compiles. The tests pass. It still misses the point.</p>
<p>The wrong kind of correct.</p>
<p>Most of our software tooling is optimized for the failures humans used to make. Agents fail differently.</p>
<p>They usually don't break the build. They disable the failing test. They reuse the nearest pattern. They preserve the old path and add a new one beside it. Everything looks reasonable until the codebase starts filling with locally valid mistakes.</p>
<p>The failure modes are familiar:</p>
<ul>
<li><em>I just disabled the failing tests.</em></li>
<li><em>I just reused the existing service.</em></li>
<li><em>I did not change the existing behavior.</em></li>
<li><em>You're right. I assumed that…</em></li>
</ul>
<p>When a decision isn't written down, the agent has to decide it again. Context windows are finite, and recall within them is imperfect. The deeper issue is too much freedom at execution time.</p>
<p>Compilers, linters, and tests help. They catch syntax errors, broken imports, and failing behavior. They are worse at telling you whether the agent made the right call. Even a large test catalog is weak against additive change.</p>
<p>Code generation improved faster than the systems that constrain it. The problem is underconstrained execution: too much freedom at the point where the agent has to act. Written intent is one way to constrain that freedom. Specs are one layer that can provide it. The historical case for that layer is clearest in protocols.</p>
<p><a href="/why-do-protocols-win/">Protocol engineering</a> is the cleanest historical evidence. Not because protocols capture every rejected alternative, but because they define interfaces that many implementations can target. <a href="https://www.rfc-editor.org/rfc/rfc791.html">RFC 791</a> standardized Internet Protocol in 1981. <a href="https://www.rfc-editor.org/rfc/rfc9110.html">HTTP semantics</a> live in RFC 9110. <a href="https://www.rfc-editor.org/rfc/rfc8446.html">TLS 1.3</a> lives in RFC 8446. <a href="https://html.spec.whatwg.org/">HTML</a> is maintained as a living standard by <a href="https://whatwg.org/">WHATWG</a>. In each case, the spec lets many implementations evolve over time.</p>
<p>But specs do not remove the hard part. <a href="https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667.html">Dijkstra's narrow-interfaces critique</a> shows that precision work does not disappear when you move from code to prose. <a href="https://lamport.azurewebsites.net/tla/tla.html">Lamport</a> and <a href="https://lamport.azurewebsites.net/tla/high-level-view.html">TLA+</a> show why explicit invariants still matter before implementation. <a href="https://www.omg.org/mda/">Model-driven development</a> shows the risk of pushing the abstraction too far and turning the spec into the thing you have to edit.</p>
<p>So the goal is to reduce execution freedom.</p>
<blockquote>
  <p><strong>Spec-driven development means writing durable intent down before implementation, then using it to plan, build, check, and revise the work.</strong></p>
</blockquote>
<p>The word <em>spec</em> is overloaded. Separate what the system must do from how this codebase will do it, from the task list, and from the rules that should survive later changes.</p>
<p>Each one narrows a different choice. Specs constrain intent. Plans constrain approach. Tasks constrain sequencing. Tests, schemas, and lint constrain behavior. Harnesses constrain execution.</p>
<p>The real disagreement is where to put the constraint. <a href="https://github.com/github/spec-kit">GitHub Spec Kit</a> and <a href="https://kiro.dev/docs/specs/">Kiro</a> keep them near the change workflow: requirements, design, and tasks for one piece of work. <a href="https://github.com/Fission-AI/OpenSpec">OpenSpec</a> moves them into the repo as a decision record that survives the change.</p>
<p><a href="https://docs.tessl.io/use/make-your-agents-smarter-with-documentation">Tessl</a> pushes further and asks whether the spec itself should become the thing you edit, which is where the Dijkstra objection lands hardest: "a sufficiently detailed spec is code." <a href="https://www.augmentcode.com/product/intent">Intent</a> treats the spec as shared state. <a href="https://github.com/openai/symphony/blob/main/SPEC.md">Symphony</a> treats it as an orchestration contract for autonomous runs.</p>
<p>Each one tries to pin the agent down at a different point.</p>
<p>Underneath the product differences, they keep rebuilding the same skeleton: durable context, feature intent, a technical plan, explicit tasks, and verification. The goal is to give the agent less room to improvise.</p>
<p>So what would the ideal model look like today? Smaller than most current tools imply, with a cleaner handoff between intent and execution.</p>
<p>The spec should be <a href="/declarative-vs-imperative/">declarative</a>, so the agent matches the code to the intent instead of replaying a brittle patch script. It should be layered, so product requirements do not quietly turn into architecture and technical plans do not quietly add product scope. And it has to be cheap to revise. If a spec is expensive to update, replace, or delete, the process hardens into ceremony and the ceremony becomes the work.</p>
<p>Where a rule can be enforced mechanically, move it out of the spec and into lint, schemas, tests, or the harness. Use less prose. Enforce more. Specs matter, but they are only one layer. Full SDD should stay optional for small bug fixes, fast prototypes, and exploratory UX.</p>
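<p>For instance, a prose rule like "all timestamps are UTC ISO 8601" can move out of the spec and into a check the harness runs. A minimal sketch in Python; the rule, the function names, and the <code>created_at</code> field are hypothetical, not taken from any of the tools above:</p>

```python
from datetime import datetime, timedelta

# Hypothetical spec rule: "all timestamps are UTC ISO 8601."
# Rather than restating it in prose on every change, enforce it in a
# test or lint hook so the agent cannot quietly drift from it.

def is_utc_iso8601(value: str) -> bool:
    """True if value parses as an ISO 8601 timestamp with a UTC offset."""
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return False
    return parsed.utcoffset() == timedelta(0)

def check_timestamps(records: list[dict]) -> list[dict]:
    """Return the records that violate the rule (empty means compliant)."""
    return [r for r in records if not is_utc_iso8601(r.get("created_at", ""))]
```

<p>Once the rule lives in a check like this, the spec can shrink to the decisions that can't be enforced mechanically.</p>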
<p>The winning model puts a <a href="/programming-to-the-interface/">narrow interface</a> between human intent and machine execution: intent narrows the search space. Code, tests, and harnesses govern behavior. Smaller specs, harder checks, less guessing.</p>]]></content:encoded>
            <author>matt@mattrickard.com (Matt Rickard)</author>
        </item>
        <item>
            <title><![CDATA[Using Claude Code from Anywhere]]></title>
            <link>https://mattrickard.com/claude-code-anywhere</link>
            <guid isPermaLink="false">4b29f7166519cb773617517769f08d60</guid>
            <pubDate>Sat, 30 Aug 2025 14:30:00 GMT</pubDate>
            <description><![CDATA[I've been using multiple instances of Claude Code and Codex CLI almost every day. But I've gotten frustrated enough to build something that solidifies my workfl]]></description>
            <content:encoded><![CDATA[<p>I've been using multiple instances of Claude Code and Codex CLI almost every day. But I've gotten frustrated enough to build something that solidifies my workflow. Before, it looked something like this:</p>
<ul>
<li><code>git worktree</code> for parallel instances</li>
<li><code>docker</code> for sandboxing work and tooling</li>
<li><code>tmux</code> for automation and management of terminal emulator windows</li>
<li><code>ssh</code> to a cloud instance for managing work on-the-go.</li>
</ul>
<p>But I was frustrated by a few things:</p>
<ol>
<li><strong>Parallelism tax</strong>. Even with automation, the setup/clean-up grind is tedious. Worktrees share the same git object store, so you still need to be careful with operations and cleanup. Managing Claude Code in Docker means I need to mount files, move secrets around, and manage the environment. Remote instances need to be synced.</li>
<li><strong>Laptop-locked</strong>. SSH from mobile or an iPad will probably never be a good experience, especially with a long-running process like Claude Code. Laptops aren't made to be treated like servers.</li>
</ol>
<p>Current solutions are good, but have some shortcomings.</p>
<ol>
<li><strong>Unsupervised agents (Codex Web / Claude Code GitHub Actions).</strong> Short feedback loops make Claude Code great. If it makes a wrong turn, you can interrupt and get it back on the right path. Codex Web and Claude Code GitHub Actions are powerful, but they often spend 15 minutes on a technically correct but wrong implementation of a feature. Or they get blocked on something you could have fixed easily.</li>
<li><strong>SSH into a VM</strong>. You become the platform team: images, secrets, logs, UI, lifecycle. Not a bad choice, but lots of work.</li>
<li><strong>Desktop UI</strong>: Solves some of the terminal-bound issues: window management, worktree automation, syntax highlighting, patch management. However, still laptop bound.</li>
</ol>
<p>So my new workflow:</p>
<p><strong>Web UI → ephemeral sandbox per chat → live, interactive session → patch/PR</strong></p>
<ol>
<li><strong>On-demand sandbox execution</strong>: Ephemeral, quick to boot, isolated jobs per task with code, tools, and AI agents.</li>
<li><strong>Live, steerable session</strong>. Stdout/stderr stream in real time; I can interrupt/approve and keep the loop tight—same Claude Code behavior, just remote.</li>
<li><strong>Chat Management</strong>. Automated branch-per-chat and pull-request creation. Persistence for chats and code changes that isn't in your $HOME folder.</li>
</ol>
<p>I put up an early version on <a href="https://standard-input.com">standard-input.com</a>. Let me know what you think. I'll buy you a coffee if you break out of the sandbox. <code>dangerously-skip-permissions</code> has been renamed to <code>vibe</code>.</p>]]></content:encoded>
            <author>matt@mattrickard.com (Matt Rickard)</author>
        </item>
        <item>
            <title><![CDATA[Pseudonyms in American History]]></title>
            <link>https://mattrickard.com/pseudonyms-in-american-history</link>
            <guid isPermaLink="false">ea3089b3a04aa7c4bedec3f4fb0e3d91</guid>
            <pubDate>Tue, 05 Dec 2023 14:30:00 GMT</pubDate>
            <description><![CDATA[Debates around the ratification of the Constitution and the early formation of the United States happened through pseudonymous authors. They often used names bo]]></description>
            <content:encoded><![CDATA[<p>Debates around the ratification of the Constitution and the early formation of the United States happened through pseudonymous authors. They often used names borrowed from Greek or Roman History.</p>
<p>Why?</p>
<ul>
<li>Plausibly some protection against retaliation. However, most pseudonymous writing was quickly attributed to authors.</li>
<li>Power in names. The names weren’t chosen at random. Often, they called back to famous Romans who took part in the formation of the Roman Republic. Or others who were known for their virtue or principles.</li>
</ul>
<p>Alexander Hamilton might have written under the most pseudonyms (at least five). Benjamin Franklin used at least three. Here’s a list of some of the more popular ones around the time of the American Revolution.</p>
<p><strong>Phocion</strong> (Alexander Hamilton) — Essays defending the Jay Treaty with Great Britain. Phocion was an Athenian statesman known for his integrity and opposition to demagoguery.</p>
<p><strong>Columbus</strong> (Alexander Hamilton) — Defending the Continental Congress and criticizing British policies.</p>
<p><strong>Publius</strong> (Alexander Hamilton, James Madison, John Jay) — The authors of the Federalist Papers, which were a series of essays advocating for the ratification of the Constitution. Individual authorship wasn’t released until Hamilton’s death, and even then historians are still trying to match authors to text. It’s hypothesized that Hamilton wrote 51 essays, Madison 29, and Jay 5. Publius Valerius Poplicola was a Roman consul known for his role in founding the Roman Republic.</p>
<p><strong>Historicus</strong> (Alexander Hamilton) — Essays on various topics related to the Constitution and federalism.</p>
<p><strong>Pacificus</strong> (Alexander Hamilton) — Used to defend President George Washington's Neutrality Proclamation of 1793 (declared the U.S. neutral in the conflict between France and Great Britain). “Making peace” in Latin.</p>
<p><strong>Helvidius</strong> (James Madison) — Written in response to Pacificus (Hamilton), these essays defended the constitutional authority of Congress in foreign affairs. Helvidius Priscus was a Roman senator known for his defense of republicanism and freedom of speech.</p>
<p><strong>Americanus</strong> (John Jay, John Stevens, Jr.) — Federalist essays supporting the ratification of the U.S. Constitution.</p>
<p><strong>Candidus</strong> (Benjamin Franklin) — Writings advocating for various causes, including opposition to oppressive British policies.</p>
<p><a href="/silence-dogood-and-the-ben-franklin-effect"><strong>Silence Dogood</strong></a> (Benjamin Franklin) — A fictitious widow created by Franklin to offer social commentary.</p>
<p><strong>Richard Saunders “Poor Richard”</strong> (Benjamin Franklin) — Used to publish <em>Poor Richard’s Almanack</em>. The name comes from a popular London almanac, <em>Rider’s British Merlin</em>.</p>
<p><strong>“Common Sense”</strong> (Thomas Paine) — Paine’s pamphlet advocating for American independence was initially published anonymously.</p>
<p><a href="/cincinnatus"><strong>Cincinnatus</strong></a> (Arthur Lee) — Anti-federalist papers.</p>
<p><strong>A Farmer</strong> (John Dickinson) — Essays titled "Letters from a Farmer in Pennsylvania," which argued against the Townshend Acts imposed by the British.</p>
<p><strong>Cato</strong> (George Clinton) — Anti-federalist essays around the time of the ratification of the Constitution. Attributed to George Clinton, but not confirmed. Cato the Younger was a Roman statesman known for his staunch republicanism and opposition to Julius Caesar.</p>
<p><strong>Brutus</strong> (Robert Yates) — An ally of George Clinton’s who wrote more anti-federalist essays. Marcus Junius Brutus was a Roman senator famous for his role in the assassination of Julius Caesar, symbolizing resistance to tyranny.</p>
<p><strong>Centinel</strong> (Samuel Bryan) — A series of anti-federalist essays critical of the proposed U.S. Constitution's centralizing tendencies.</p>
<p><strong>Poplicola</strong> (John Adams) — Essays defending the British constitution and criticizing the Stamp Act. The same Publius Valerius Poplicola used by Hamilton.</p>
<p><strong>Novanglus</strong> (John Adams) — A series of essays written in response to Massachusettensis, defending colonial rights. Latinization of “New Englander”.</p>
<p><strong>A Citizen of New York</strong> (Martin Van Buren) — Political essays.</p>
            <author>matt@mattrickard.com (Matt Rickard)</author>
        </item>
        <item>
            <title><![CDATA[Fairchildren]]></title>
            <link>https://mattrickard.com/fairchildren</link>
            <guid isPermaLink="false">49b5c8338bcf4b85536b84c13b8e33df</guid>
            <pubDate>Mon, 04 Dec 2023 14:30:00 GMT</pubDate>
            <description><![CDATA[In 1956, William Shockley, Stanford professor and winner of the Nobel Prize in Physics for his work on semiconductors, recruited a team of young Ph.D. graduates]]></description>
            <content:encoded><![CDATA[<p>In 1956, William Shockley, Stanford professor and winner of the Nobel Prize in Physics for his work on semiconductors, recruited a team of young Ph.D. graduates to product a new company. The company would be called Shockley Semiconductor.</p>
<p>But Shockley was a terrible manager, and the recruits left to form their own company the next year, Fairchild Semiconductor. They would later be known as the “traitorous eight”.</p>
<p>The founders of Fairchild Semiconductor were: Gordon Moore, C. Sheldon Roberts, Eugene Kleiner, Robert Noyce, Victor Grinich, Julius Blank, Jean Hoerni, and Jay Last.</p>
<p>Fairchild Semiconductor became the proto-company of Silicon Valley. Many major technology companies can somehow trace their founding or story to Fairchild.</p>
<p><strong>Intel</strong> - Founded by Robert Noyce and Gordon Moore, both former employees of Fairchild Semiconductor.</p>
<p><strong>AMD (Advanced Micro Devices)</strong> - Founded by Jerry Sanders, another Fairchild alumnus.</p>
<p><strong>Kleiner Perkins</strong> - A venture capital firm co-founded by Eugene Kleiner, a former Fairchild employee.</p>
<p><strong>Sequoia Capital</strong> - Don Valentine worked at Fairchild Semiconductor for seven years before moving to National Semiconductor (another Fairchild offshoot). Then, he started Sequoia Capital.</p>
<p>Other companies founded by Fairchild employees: SanDisk, National Semiconductor, Altera, LSI Logic, Amelco, Applied Materials, and more.</p>]]></content:encoded>
            <author>matt@mattrickard.com (Matt Rickard)</author>
        </item>
        <item>
            <title><![CDATA[ChatGPT After One Year]]></title>
            <link>https://mattrickard.com/chatgpt-after-one-year</link>
            <guid isPermaLink="false">b5bce4f0c8176e5d7b233cd53f48cf86</guid>
            <pubDate>Sun, 03 Dec 2023 14:30:00 GMT</pubDate>
            <description><![CDATA[ChatGPT was released on November 30th 2022. What has changed since then?

*   **Hundreds of open-source models.** Varying sized models from small to very large.]]></description>
            <content:encoded><![CDATA[<p>ChatGPT was released on November 30th 2022. What has changed since then?</p>
<ul>
<li><strong>Hundreds of open-source models.</strong> Models of varying sizes, from small to very large. Many are chat-tuned like ChatGPT.</li>
<li><strong>Distilled models from ChatGPT.</strong> Academics and competitors both used data from ChatGPT conversations to train or fine-tune their own models.</li>
<li><strong>Competition.</strong> Microsoft launched Bing Chat. Google launched Bard. Poe, Pi, Perplexity. Claude by Anthropic. Not to mention self-hosted open-source chat UIs and other wrappers. There’s no shortage of competition (although ChatGPT still is the most popular).</li>
<li><strong>RAG is hard.</strong> “Browse with Bing” and Bing Chat launched, but hallucinations are still an issue. Browsing the internet doesn’t seem like the catch-all solution.</li>
<li><strong>Not every launch increased performance across the board.</strong> Each new iteration of ChatGPT changed the way the model behaved. Many queries got better. Some got worse. Google has always had this problem as well, but applications aren’t built on Google.</li>
<li><strong>A consumer subscription model.</strong> ChatGPT Plus was released in February 2023. The consumer product arguably competes with the developer and enterprise products (why not just use the API?).</li>
<li><strong>Multi-modal.</strong> ChatGPT started to accept images and files in the chat. DALL-E and the vision API became integrated into the chat window. There are open-source models that are multi-modal, but so far no experience is as sleek as OpenAI’s.</li>
<li><a href="/chatgpt-plugins-dont-have-pmf"><strong>Plugins launched but never found product-market fit</strong></a><strong>.</strong> Plugins launched but didn’t become the <a href="/necessary-conditions-for-an-app-store-monopoly">App Store</a> that OpenAI hoped. Custom GPTs seem to be the next strategy for extensibility, although they won’t launch until next year.</li>
<li><strong>Code Interpreter is getting better.</strong> Agents and tool use are still hard for LLMs, but Code Interpreter keeps improving and becoming more useful. Files can now be added directly to the UI to chat with.</li>
</ul>]]></content:encoded>
            <author>matt@mattrickard.com (Matt Rickard)</author>
        </item>
        <item>
            <title><![CDATA[McNamara Fallacy]]></title>
            <link>https://mattrickard.com/mcnamara-fallacy</link>
            <guid isPermaLink="false">c887c967b854d23fa71241f4b96a78b7</guid>
            <pubDate>Sat, 02 Dec 2023 14:30:00 GMT</pubDate>
            <description><![CDATA[The McNamara Fallacy is named after Robert McNamara, the US Secretary of Defense during the Vietnam War. The fallacy describes making decisions using only quant]]></description>
            <content:encoded><![CDATA[<p>The McNamara Fallacy is named after Robert McNamara, the US Secretary of Defense during the Vietnam War. The fallacy describes making decisions using only quantitative metrics and ignoring anything else.</p>
<p>The fallacy usually follows the same four steps.</p>
<ol>
<li>Measure what can easily be measured.</li>
<li>Dismiss what can’t be measured easily.</li>
<li>Presume what can’t be measured easily isn’t important.</li>
<li>Extrapolate and conclude that what can’t be measured doesn’t exist.</li>
</ol>
<p>You can find the McNamara Fallacy in all types of disciplines. The emphasis on standardized tests in education (at the expense of less quantifiable qualities and learning). Or when the success of treatments in medicine is based only on easy-to-measure outcomes (not quality of life, mental health, or overall well-being). Or optimizing for short-term financial metrics at the expense of brand reputation, employee satisfaction, or other intangibles.</p>
            <author>matt@mattrickard.com (Matt Rickard)</author>
        </item>
        <item>
            <title><![CDATA[Data Quality in LLMs]]></title>
            <link>https://mattrickard.com/data-quality-in-llms</link>
            <guid isPermaLink="false">763fdc427cb71d41c1a034fea242661c</guid>
            <pubDate>Fri, 01 Dec 2023 14:30:00 GMT</pubDate>
            <description><![CDATA[Good data is the difference between Mistral’s LLMs and Llama, which share similar architectures but different datasets.

To train LLMs, you need data that is:

]]></description>
            <content:encoded><![CDATA[<p>Good data is the difference between Mistral’s LLMs and Llama, which share similar architectures but different datasets.</p>
<p>To train LLMs, you need data that is:</p>
<ol>
<li><strong>Large</strong> — Sufficiently large LMs require trillions of tokens.</li>
<li><strong>Clean</strong> — Noisy data reduces performance.</li>
<li><strong>Diverse</strong> — Data should come from different sources and different knowledge bases.</li>
</ol>
<p><em>What does clean data look like?</em></p>
<p>You can de-duplicate data with simple heuristics. The most basic is removing exact duplicates at the document, paragraph, or line level. More advanced versions might look at the data semantically, figuring out what should be omitted because it’s better represented by higher-quality data.</p>
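<p>The exact-duplicate heuristic is short enough to sketch. A minimal Python version, assuming documents are plain strings and paragraphs are separated by blank lines (the granularity flag and function name are illustrative):</p>

```python
import hashlib

def dedupe(texts: list[str], granularity: str = "paragraph") -> list[str]:
    """Remove exact duplicates at the chosen granularity.

    Splits each document into units (whole document, paragraphs, or
    lines), keeps the first occurrence of each unit, and drops repeats.
    """
    seen: set[str] = set()
    out = []
    for text in texts:
        if granularity == "document":
            units = [text]
        elif granularity == "paragraph":
            units = text.split("\n\n")
        else:  # line level
            units = text.splitlines()
        kept = []
        for unit in units:
            key = hashlib.sha256(unit.strip().encode()).hexdigest()
            if unit.strip() and key not in seen:
                seen.add(key)
                kept.append(unit)
        sep = "\n\n" if granularity == "paragraph" else "\n"
        out.append(sep.join(kept))
    return out
```

<p>A semantic version would replace the exact hash with, say, embeddings and a similarity threshold, which is where the real engineering cost hides.</p>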
<p>The other dimension of clean data is converting various file types to <a href="/good-enough-abstractions">something easily consumed by the LLM, usually markdown</a>. That’s why we’ve seen projects like <a href="https://github.com/facebookresearch/nougat">nougat</a> and <a href="https://github.com/clovaai/donut">donut</a> convert PDFs, books, and LaTeX to better formats for LLMs. There’s a lot of training data that’s still stuck in PDFs and human-readable but not so easily machine-readable data.</p>
<p><em>Where does diverse data come from?</em></p>
<p>The surprising result of the success of the GPTs is that web text from the Internet is probably one of the most diverse datasets out there. It contains usage and data that aren’t found in many other corpora. That’s why models tend to perform so much better when they’re given more data from the web.</p>
            <author>matt@mattrickard.com (Matt Rickard)</author>
        </item>
        <item>
            <title><![CDATA[Discord and AI GTM]]></title>
            <link>https://mattrickard.com/discord-and-ai-gtm</link>
            <guid isPermaLink="false">d408cafa9ac37c614f5df1e3e6fae840</guid>
            <pubDate>Thu, 30 Nov 2023 14:30:00 GMT</pubDate>
            <description><![CDATA[Midjourney is the largest discord server, with 16.5 million total users. It accounts for 13% of total Discord invites. Midjourney launched in March 2022 and doe]]></description>
            <content:encoded><![CDATA[<p>Midjourney is the largest discord server, with 16.5 million total users. It accounts for 13% of total Discord invites. Midjourney launched in March 2022 and doesn’t have a web application. Many other AI apps (Leonardo, Pika, Suno, And AI Hub) are on Discord (or even Discord-only).</p>
<p>Why is Discord such a good GTM for AI applications?</p>
<ul>
<li><strong>Text interface.</strong> Most users are just generating images, videos, and audio in these Discord servers. Prompts are easily expressible in simple text commands. It’s why we’ve seen image generation strategies like Midjourney (all-in-one) flourish in Discord while more raw diffusion models haven’t grown as quickly (e.g., Stable Diffusion with many configurable parameters).</li>
<li><strong>Virality.</strong> Prompt engineering these models is difficult and more art than science (today). Users can see generations by other users and collectively see what’s working and what isn’t. This means these communities often have the most advanced prompts and best images.</li>
<li><strong>Low friction.</strong> Go to where your users already are. Most developers have Discord now. One fewer application to sign up for.</li>
<li><strong>Free hosting.</strong> Discord pays for the image hosting and bandwidth. At Midjourney scale, this is not negligible.</li>
</ul>
<p>But Discord has its risks as a platform to build on.</p>
<ul>
<li><strong>Platform risk.</strong> Discord could (easily?) build its own Midjourney-type application into the platform. Using all of the prompt-image pairs (along with reactions as an RLHF signal), it could probably distill a much better model from Midjourney (questionably legal but technically easy). This reminds me of the Zynga / Facebook relationship. <a href="/growth-hacking-platforms">Zynga accounted for 19% of Facebook’s revenue at one point.</a> Facebook reduced Zynga’s API access and launched its own gaming platform.</li>
<li><strong>Multi-modal.</strong> How does multi-modal fit into the Discord text-first interface? Sure, there are images and audio that can be uploaded via the interface, <a href="/multi-modal-ai-is-a-ux-problem">but it’s hard to imagine the UI that a multi-modal AI will need in the future.</a></li>
</ul>]]></content:encoded>
            <author>matt@mattrickard.com (Matt Rickard)</author>
        </item>
        <item>
            <title><![CDATA[Standard Causes of Human Misjudgment (Munger)]]></title>
            <link>https://mattrickard.com/standard-causes-of-human-misjudgment-munger</link>
            <guid isPermaLink="false">26e05d8cfc3e33f28d0d3cf84fbe32be</guid>
            <pubDate>Wed, 29 Nov 2023 14:30:00 GMT</pubDate>
            <description><![CDATA[In 1995, Charlie Munger gave a speech at Harvard on [_The Psychology of Human Misjudgment_](https://www.youtube.com/watch?v=Jv7sLrON7QY)_._ It was filled with t]]></description>
            <content:encoded><![CDATA[<p>In 1995, Charlie Munger gave a speech at Harvard on <a href="https://www.youtube.com/watch?v=Jv7sLrON7QY"><em>The Psychology of Human Misjudgment</em></a><em>.</em> It was filled with the research he had done later in life on human psychology, matched with real-life examples that he had observed in his work. The result was a succinct list of the top cognitive biases grounded in real-life experiences. I’ve summarized the biases here, but it’s worth giving the entire speech a listen to hear the stories behind each. I’ve tried to keep Charlie’s language and numbering when possible.</p>
<ol>
<li><strong>Underestimation of Incentives:</strong> Despite understanding the significant influence of incentives (reinforcement in psychology and incentives in economics), there's a tendency to consistently underestimate their power.</li>
<li><strong>Psychological Denial:</strong> This is the refusal to accept reality because it is too painful or difficult to bear.</li>
<li><strong>Incentive-Cause Bias:</strong> This occurs when personal incentives or those of a trusted advisor create a conflict of interest, leading to biased decisions.</li>
<li><strong>Bias from Consistency and Commitment:</strong> This involves a strong tendency to stick to pre-existing beliefs or commitments, even in the face of contradictory evidence.</li>
<li><strong>Bias from Pavlovian Association:</strong> This bias refers to the error of basing decisions on past associations or correlations without considering their current relevance or accuracy.</li>
<li><strong>Bias from Reciprocation Tendency:</strong> This bias involves a natural inclination to reciprocate actions and behaviors, including conforming to others' expectations, especially when one is experiencing success or is 'on a roll.'</li>
<li><strong>Bias from Over-Influence by Social Proof:</strong> This bias refers to the heavy reliance on the actions or decisions of others, especially in situations of uncertainty or stress.</li>
<li><strong>Bias from Favoring Elegance over Practicality in Theory:</strong> This bias involves a preference for theories or explanations that are mathematically elegant or intellectually satisfying, even if they are less accurate in practical terms. “Better to be roughly right than precisely wrong” — Keynes.</li>
<li><strong>Bias from Contrast-Induced Distortions:</strong> This bias refers to the way our perceptions, sensations, and cognition can be significantly altered by contrasts.</li>
<li><strong>Bias from Over-Influence by Authority:</strong> This bias involves the tendency to conform to instructions or opinions provided by an authority figure, even when these instructions conflict with one's own moral judgment or common sense.</li>
<li><strong>Bias from Deprival Super Reaction Syndrome:</strong> This bias is characterized by an intense reaction to losing or the threat of losing something, especially something that one perceives as almost possessed but never fully owned.</li>
<li><strong>Bias from Envy/Jealousy:</strong> This bias stems from feelings of envy or jealousy towards others.</li>
<li><strong>Bias from Chemical Dependency:</strong> This bias relates to the cognitive and behavioral changes that result from chemical dependency, such as addiction to drugs or alcohol.</li>
<li><strong>Bias from Gambling Compulsion:</strong> This bias refers to the compulsive urge to gamble, driven by the psychological principle of variable reinforcement.</li>
<li><strong>Bias from Liking Distortion:</strong> This bias involves a preference for things that are familiar or similar to oneself, including one's own ideas, kind, and identity.</li>
<li><strong>Bias from Disliking Distortion:</strong> This is the opposite of liking distortion, where there's a tendency to reject or not learn from sources that are disliked.</li>
<li><strong>Bias from the Non-Mathematical Nature of the Human Brain in Probability Assessment:</strong> This bias refers to the human brain's tendency to rely on crude heuristics and be easily misled by contrasts when dealing with probabilities, rather than using precise mathematical approaches.</li>
<li><strong>Bias from Over-Influence by Extra Vivid Evidence:</strong> This bias describes the tendency to give disproportionate weight to particularly vivid or emotionally striking information when making decisions.</li>
<li><strong>Bias from Stress-Induced Mental Changes:</strong> This covers mental changes, small and large, temporary and permanent, brought on by stress.</li>
<li><strong>Mental Confusion from Poorly Structured Information and Inadequate Explanations:</strong> This bias involves difficulties in understanding or decision-making due to information that is not well-organized or lacks a coherent theoretical framework.</li>
</ol>]]></content:encoded>
            <author>matt@mattrickard.com (Matt Rickard)</author>
        </item>
        <item>
            <title><![CDATA[The Unreasonable Effectiveness of Monte Carlo]]></title>
            <link>https://mattrickard.com/the-unreasonable-effectiveness-of-monte-carlo</link>
            <guid isPermaLink="false">9877ad634a2e7b3a13d218dc94699132</guid>
            <pubDate>Tue, 28 Nov 2023 14:30:00 GMT</pubDate>
            <description><![CDATA[Monte Carlo methods are used in almost every branch of science: to evaluate risk in finance, to generate realistic lighting and shadows in 3D graphics, to do re]]></description>
<content:encoded><![CDATA[<p>Monte Carlo methods are used in almost every branch of science: to evaluate risk in finance, to generate realistic lighting and shadows in 3D graphics, to do reinforcement learning, to forecast weather, and to solve complex games in game theory.</p>
<p>There are many types of Monte Carlo Methods, but they all follow a general pattern — using random sampling to model complex systems.</p>
<p><strong>A simple example:</strong> Imagine a complex shape you want to know the area of.</p>
<ol>
<li>Place the shape on a dartboard.</li>
<li>Randomly throw darts at the dartboard.</li>
<li>Count the number of darts that are inside the shape and outside.</li>
<li>The estimated area of the shape = (number of darts inside the shape / total number of darts thrown) * the area of the dartboard.</li>
</ol>
<p>(This is computing a definite integral numerically, with a method whose cost doesn’t depend on the number of dimensions! You can even estimate the error easily given the number of samples.)</p>
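<p>The dart-throwing recipe above fits in a few lines. A minimal sketch (the quarter-circle example and function names are illustrative, not from the post):</p>

```python
import random

def estimate_area(inside, bounding_area, n=100_000):
    """Estimate the area of a shape by throwing random darts.

    `inside(x, y)` reports whether a point in the unit square hits the shape.
    The estimate is (fraction of darts that land inside) * bounding area.
    """
    hits = sum(inside(random.random(), random.random()) for _ in range(n))
    return (hits / n) * bounding_area

# Quarter circle of radius 1 inside the unit square; true area is pi/4.
area = estimate_area(lambda x, y: x * x + y * y <= 1.0, bounding_area=1.0)
print(area)  # ≈ 0.785
```

<p>Note that nothing here cares how many dimensions the shape lives in; in 10 dimensions you would throw 10-coordinate darts, and the loop is unchanged.</p>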
<p><strong>Monte Carlo Tree Search (MCTS).</strong> Or use random sampling to play a game like Blackjack (or Chess, Go, Scrabble, and many other turn-based games) with Monte Carlo Tree Search. AlphaGo and its successors (AlphaGo Zero and AlphaZero) used versions of Monte Carlo Tree Search combined with reinforcement learning and deep learning.</p>
<p>The idea is fairly simple: add a policy (i.e., a strategy to follow) to the random sampling process. You might start with a simple one (play randomly, or hit on any hand under 18). For every move in a game, add that move to a tree that describes the game. For Blackjack, that might be a series of hits or stays. When a game is won or lost, go back and update all of the nodes in the tree for that game (the “backpropagation” step).</p>
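<p>The playout-and-update loop can be sketched for a toy version of the game. A hedged illustration, not real MCTS or real Blackjack rules: a solitaire game where you draw cards worth 1&ndash;10, bust above 21, and win by staying on 18&ndash;21, with a purely random rollout policy:</p>

```python
import random
from collections import defaultdict

# Toy solitaire "blackjack" (hypothetical rules for illustration):
# draw cards worth 1-10, bust above 21, win if you stay on 18-21.
def play_and_record(stats):
    total, trajectory = 0, []
    while total <= 21:
        action = random.choice(("hit", "stay"))  # random rollout policy
        trajectory.append((total, action))       # remember every move
        if action == "stay":
            break
        total += random.randint(1, 10)
    won = action == "stay" and 18 <= total <= 21
    # "Backpropagation": credit the result to every (state, action) visited.
    for state_action in trajectory:
        wins, visits = stats[state_action]
        stats[state_action] = (wins + won, visits + 1)

stats = defaultdict(lambda: (0, 0))
for _ in range(200_000):
    play_and_record(stats)

# Estimated utility of hitting vs. staying on a total of 16.
for action in ("hit", "stay"):
    wins, visits = stats[(16, action)]
    print(action, wins / visits)
```

<p>After many playouts, <code>stats</code> holds exactly the “given this hand and action, I won X% of the time” values described above (staying on 16 always loses in this toy game; hitting sometimes wins), and a smarter policy can sample high-value actions more often.</p>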
<p>After many games, you have a tree of expected utility for each move — that means you can sample the next move much more effectively. The value says something like — “given this current hand and set of actions, I won X% of the time”. You can get more advanced with the reward and update function — for example, you might discount wins that take many turns and prioritize quicker wins.</p>]]></content:encoded>
            <author>matt@mattrickard.com (Matt Rickard)</author>
        </item>
    </channel>
</rss>