Roadmap
April 25, 2025
Writing is an invaluable tool for the thinking process. Even if the only reader of a piece is the author, it has value as a forcing function for organizing, critiquing, and solidifying thoughts. It's much like the way developers talk about "rubber ducking" a problem: in trying to explain an intractable problem to a rubber duck, often just explaining it reveals the answer. And with AI agents, a common tool is giving an agent "memory", that is, having it write down and maintain a note file of what it's doing and how much it has done.
This document is mostly for myself to hone my plan for testing my Theory of DX. I have a rough idea but I want a clearer, well-thought-out picture.
I'm also a big fan of the idea that "planning is essential, but plans are useless". Plans are meant to be broken; in the middle of execution mode, it's much easier to ask "do I do the next thing on my list, or something else?" than "what do I do now?". If the plan doesn't change along the way, then most likely either the project was so trivial it didn't need to be planned, too much time was spent planning up-front, or critical thinking wasn't happening during execution.
So without further ado, here's my plan.
Working Backwards
My ultimate goal is to have:
- Statistics on how much various tools and assets help AI agents succeed.
- Correlation measurements between engineering metrics and AI agent success rates.
For "how effective" something is, I want to see statistically significant success rate improvements when the AI agent is given that thing, vs not having it.
Tools and repo assets include:
- Written documents
  - Project-level documentation
  - Package-level documentation
  - Product specifications
- Generated documents
  - API specifications
  - Library interfaces
  - Sitemaps
  - Dependency maps
- Static analysis
  - ESLint rules
  - TypeScript configurations
- System prompts
  - Problem solving guidance
  - Planning guidance
  - Product outline
  - Project architecture
That covers four of the five categories in Theory of DX. The fifth is "quality metrics", and that's where I need to measure correlation.
There are two groups of metrics: leading and live indicators. The leading indicators are engineering metrics such as test coverage and interface complexity. I'm interested in, for example, how much greater test coverage affects the success rate of agents making changes (fixes, feature additions, or refactors) to a piece of code. For interfaces, I might measure how difficult it is for an agent to consume or develop two packages that solve the same problem but with greater or lesser configurability and features. Each of these will require its own approach.
Live metrics are a bit trickier... and also out of scope. I'm particularly interested in how all this relates to AI agents, and I can link agent efficacy to engineering metrics, but linking engineering metrics to product, reliability, and performance metrics is a separate effort. That's important and useful information, but I believe it's generally accepted that these engineering metrics correlate with user experience and developer velocity, and ultimately long-term business success.
Prerequisites
Okay, then given the above is what I want to produce, what do I need to have or build to get these answers?
- A Baseline - A fairly complex codebase or set of codebases that collectively have all those things listed above. Written and generated documentation, linters, and reports on engineering metrics.
- A Workflow Tool - Some way to define workflows which can be given to an agent to execute.
- A Suite of Evals - A way to provide an agent with a workflow to execute on the baseline (and/or variants of it), and a way to measure whether it was successful.
Then beyond that, it's a matter of hooking these up with a variety of models, then gathering and crunching the data.
Let's look at each of these in turn, vetting whether they're truly necessary and what the minimum amount of work is.
Baseline
My approach to building the baseline is to create SAF and a series of sites built with it. This framework gives me a place to put all my shared logic and documentation across projects, and to develop solutions for gluing together my preferred libraries and services. I'll have a complete baseline once SAF supports all those things I want to test above.
Could I get away with just existing packages? Well, for some of the things probably. I could find some open source packages, then add or remove the things I want to test. I could probably even find some existing open source project in a monorepo architecture. But I have two reasons for building my own stack and testing with it:
- Familiarity. While I'm interested in learning more Rust, Go, and other things, I can move faster if I stick to the tools I already know. And if I build the repo, I'll know it inside and out.
- I'm going to build it anyway. I want/need to build a bunch of sites, so I want a stack that will support them and has a good DX. Getting the metrics this way overlaps with a bunch of other goals I'm pursuing and things I just want to do.
The way I'd like to do this is to develop evidence in this controlled environment of mine, then see if I can port some of the learnings elsewhere, such as to live open-source projects.
How developed does the baseline need to be to get started? Not much more than it is now. This week I did a general software design sweep from back to front, and I'm happy enough with the design of each layer that I think it's usable as a baseline. The types are fairly smooth, I'm 80% sure my usage of these tools is correct, and I've exercised them in enough cases.
The major thing I'm missing is really developed and integrated static analysis. I'm basically just making sure problems the IDE highlights are resolved, but I haven't integrated my linter into CI or put comprehensive type-checking in place at all. I think that's important for agents to really be "set up for success", given how often I see things go awry that could have been caught by one of those checks.
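For illustration, here's a hypothetical "check" script a CI job or an agent could run, assuming a Node/TypeScript repo that already has ESLint and tsc installed (the script itself doesn't exist yet):

```ts
// Hypothetical "check" script a CI job or an agent could run before a step
// is considered done: lint plus a full type-check, failing fast on either.
import { execSync } from "node:child_process";

for (const cmd of ["eslint .", "tsc --noEmit"]) {
  try {
    execSync(cmd, { stdio: "inherit" });
  } catch {
    console.error(`Check failed: ${cmd}`);
    process.exit(1);
  }
}
console.log("All static checks passed");
```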
Workflow Tool
Why is a "workflow tool" necessary anyway? Technically the workflow could be part of the eval. The eval could prompt the agent with a series of tasks or questions and then take the output and assess it. That would work just fine.
The thing is, I want to use these workflows, too. Embedding workflows in a testing tool doesn't make them very accessible, and I want to create some workflows, try to use them myself, and start testing/optimizing them only when I've got something I feel is good enough to get some signal.
How will this workflow tool work?
I'm thinking npm script or bash tool. I could build an MCP server but I kind of want to make this tool usable by humans, too, and anyway it seems fairly standard for most IDEs and other tools to give the agent a bash tool. I may eventually create a server which also has an MCP interface but I don't think that's necessary or prudent at this stage.
I expect the workflow tool to work like an iterator. You can "peek" at the current step and receive instructions, context, references, and a list of next-step options. You can then "pop" that step and it will move to the next state, possibly first running a check to make sure the agent is allowed to move forward. For example, if the step is to write tests for the implementation just produced, you can't pop to the next step unless the tests pass.
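A rough sketch of what that iterator-style interface could look like in TypeScript; all the names here are hypothetical, just to make the shape concrete:

```ts
// Hypothetical shape for the workflow iterator: "peek" shows the current
// step without changing state; "next" runs the step's check (e.g. tests
// pass) before advancing.
interface WorkflowStep {
  id: string;
  instructions: string;            // what the agent should do
  context: string[];               // documents or files to read first
  references: string[];            // links to relevant docs and specs
  nextOptions: string[];           // ids of the steps that may follow
  check?: () => Promise<boolean>;  // gate that must pass before advancing
}

interface WorkflowIterator {
  peek(): WorkflowStep;                 // current step, no state change
  next(): Promise<WorkflowStep | null>; // run check, advance, null when done
}
```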
The tool also needs a kickoff step. Some way to string together a series of workflows with some required and optional parameters specific to that plan. I imagine a JSON object whose structure is defined either with TypeScript or JSON Schema. Then the plan's structure can be validated.
The available workflows and their parameters should be defined by packages. For example, my vue-spa package could have workflows for creating a SPA, adding pages to it, adding TanStack queries, and creating form elements, whereas my drizzle-sqlite package could have workflows for adding a database package, adding to the schema, and adding queries. The "-dev" versions of these packages will have workflows to add tests for the implementations of their counterparts. This way, installing a package also installs the set of workflows the package offers.
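For example, a package might export its workflow definitions alongside its code. A hypothetical shape (the workflow names are just illustrative):

```ts
// Hypothetical: a package declares the workflows it offers and the
// parameters each one needs; the workflow tool aggregates these across
// installed packages.
export interface WorkflowDefinition {
  name: string;
  description: string;
  parameters: Record<string, { type: "string" | "boolean"; required: boolean }>;
}

// What a vue-spa package might export (illustrative only)
export const workflows: WorkflowDefinition[] = [
  {
    name: "create-spa",
    description: "Scaffold a new single-page app",
    parameters: { appName: { type: "string", required: true } },
  },
  {
    name: "add-page",
    description: "Add a routed page to an existing SPA",
    parameters: { pageName: { type: "string", required: true } },
  },
];
```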
Finally, it would be handy for the tool to list all available workflows and their parameters, so the agent can construct a plan array which is just a series of workflow objects. The tool takes this plan as part of the kickoff, and keeps track of which workflow the agent is on and what state they're in for that workflow.
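A plan could then be just an ordered array of workflow invocations, validated against those definitions. A hypothetical example:

```ts
// Hypothetical plan file: an ordered list of workflow invocations that the
// kickoff command validates against the installed workflow definitions and
// then tracks progress through.
const plan = [
  { workflow: "create-spa", params: { appName: "dashboard" } },
  { workflow: "add-page", params: { pageName: "settings" } },
  { workflow: "add-database-package", params: { schemaName: "app" } },
];

export default plan;
```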
Altogether then the tool needs the following commands:
- `workflow ls` - to know what workflows are available
- `workflow kickoff` - given a plan file, sets up current state
- `workflow status` - see what the current step in the plan is
- `workflow next` - runs checks and, if they pass, prints the next task
I think this will be a really powerful model. If you can break routine work down into a series of workflows, that work should be able to happen fairly reliably and quickly, and you can build from there:
- Async workflows where you need to wait days for a deployment to happen, or months for an API to be deprecated.
- Integration with central repositories, task tracking systems, package managers, and check-in rhythms, so things like setting up PRs, updating tasks, publishing changes, and status updates can be managed.
- Systems for automatically getting notified when a running workflow will edit the same package you're working on, or a package that you own.
- A "check" could either be an automated check and/or a human check. That is, part of the workflow could be a report on the key changes made, starting with for example the interface changes.
- Other kinds of workflows such as making the plan in the first place, refactoring to update to the latest best practices and helper methods, handling code review comments, migrating code to a new version or library, triaging issues, improving performance, or fixing bugs.
But I digress. Point is, a tool like this will enable evals to test workflows and workflow variants for efficacy and cost. And those same workflows that are tested and optimized can themselves be used directly by developers and by agents.
Eval Suite
There's not too much to say here. Basically I need a way to:
- Define a task, doable with a workflow.
- Create a scoring method, probably either an automated test or a written-response evaluation.
- Provide a variety of inputs, such as variants of the workflow to try (break into tiny steps or not?), context to provide (documents, system prompts), and guardrails to enable (linters, type checks).
- Run the task across a multitude of models.
- Receive stats on the results.
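To sketch the shape of a single eval case (this is not Inspect's API, just the information a case would need to capture; all names are hypothetical):

```ts
// Hypothetical description of a single eval case: the task, how it's scored,
// which variants to compare, and which models to run it on.
interface EvalCase {
  task: string;                 // e.g. "add a settings page via the add-page workflow"
  scorer: "automated-test" | "graded-response";
  variants: {
    workflow: string;           // which workflow variant to hand the agent
    context: string[];          // docs, system prompts, etc. to include
    guardrails: string[];       // e.g. ["eslint", "tsc --noEmit"]
  }[];
  models: string[];
  runsPerVariant: number;       // repetitions for statistical power
}
```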
From that I can discern which models are best for what workflows, what makes a good workflow, how much and what kind of context to provide for the task, what kinds of system prompts have what effects, and which guardrails help the most.
I'm familiar with the Inspect framework so I'll likely just go with that. It certainly has everything listed above!
Well, that's that. I'm excited to get started on the workflow tool; I suspect it'll really streamline things I've been doing. I'm also really curious to get some hard data on whether all these things that feel helpful actually are.
Thanks for reading!