Arthur Breitman. Machine learning, functional programming, applied cryptography, and these days mostly #tezos. Husband of @breitwoman, oligocoiner.

The Frayed Edge of Bespoke Software in the Age of Agents

The Frayed Edge

In 2021, I purchased a property in Tuscany, sight unseen. This was in the midst of COVID restrictions, and UK residents were not allowed to travel outside of the country. The realtor did a tour of the place with a phone camera, and the deal was struck. The property had been renovated completely in the early 2000s, but was mostly left abandoned. With just a bit of work, it would be ready next year. Oh, sweet summer child.

Renovation in Italy is not for the faint of heart. The pristine scenery of Tuscany, its authenticity, comes at a cost. Permitting is … difficult. I am now five years into this renovation project, and I think I see light at the end of the tunnel. This post is not about my woes as the owner of international real estate; it’s about using coding agents, like Claude Code, for project management, so bear with me.

The renovation work turned complex for a few reasons. The design was outdated and needed a refresh, but design plans (including interior design) must be filed with the municipality as part of an overall renovation plan. The pool turned out never to have been filled, and if it had, it would have slid down the hill, which is largely made of clay. This meant it had to be re-permitted and reinforced with dozens of deep concrete pillars. All in all, the work involves architects, contractors, lawyers, surveyors, banks, permits, land purchases, and a steady stream of email in English and Italian. For years, the actual state of the renovation ended up scattered across Gmail, attachments, shared folders, some calendar reminders, and whatever I happened to remember.

The amount of communication itself was quite manageable: on average, over the past five years, it’s amounted to about an email a day, with long periods of quiet and sudden bursts when permits, contracts, or financing were moving. The management challenge lay in how the project spanned years and involved many parties who were waiting on one another. A surveyor would promise something in October that became relevant again in February. A number buried in a PDF attachment would resurface months later in a discussion with a bank. A planning question that seemed settled would quietly reopen because nobody had a clean record of what had already been agreed.

When I figured out something was blocked, I could look things up in Gmail if I roughly remembered the date or the sender, or had a decent guess at the right keyword. Most of the time, the problem was broader than that. I needed to recover the current state of the project: what had happened, what was still open, who owed what, what had slipped, and which loose ends were likely to matter later. Reconstructing that by reopening threads and attachments each time was slow, and the result was never very satisfying because the useful context lived across too many places.

I have been an early adopter of AI for coding. I started using GitHub Copilot in 2021, invested in magic.dev’s pre-seed round… I was blown away by the step change when OpenAI’s o1 first came on the scene. This was the first model to pass one of my favorite interview questions, a disarmingly simple question that many CS graduates shockingly fail because they are trained to answer these questions with pattern recognition instead of reasoning. I first used Claude Code around May 2025.

My first reaction was that the interface was finally right: a CLI blends the conversational aspect of AI with the natural workflows of building software, and the model was finally decent at planning. Tool use had been around for a while – the Toolformer paper came out in February 2023 – but this was the first time I was really interacting with models that had significant RL training in tool use. I immediately cranked up the difficulty of what I was trying to accomplish with it and quickly noted limitations; the model would simply cheat… replace the unit test it had written with one it could pass, change the winning condition. It couldn’t admit defeat, didn’t know how to ask for help, and didn’t understand how to conduct experiments to root out bugs. Those are still issues today, but to a lesser extent. I didn’t play with coding agents again until later that year, when I caught word that they had finally gotten to the point where they were becoming useful. I started completing a host of small personal projects I had needed, and launched experiments to generalize their use across the software companies I founded. Towards the end of 2025, I started hearing about people using coding agents to manage their emails, and it got me curious. I didn’t want to give Claude Code unfettered access to my email, but the idea that the model’s planning and tool-use abilities could extend to tasks beyond programming felt intriguing.

In February 2026, I set out to use Claude Code to build a system to help me manage the renovation. The privacy concerns were modest: the emails involved are banal and do not contain particularly sensitive information, and I could easily filter them out from the rest of my inbox. The scope of the project was also small, and it would help me build up the experience to scale this approach to much larger project management tasks.

A mistake would have been to connect Claude or an OpenAI model via MCP to my inbox and ask it to “manage” the project or to suggest what to do. The models are not good enough for that. In many ways, they optimize to give you a satisfying-seeming answer right away, not to do the right thing. Ask an MCP-connected model “what’s urgent right now,” and it will do something like search for the keywords urgent | time sensitive | important in your inbox, look at a few messages, and hurry back with a polished but superficial answer.

I had to build a “harness”. Noam Brown, a researcher at OpenAI and an expert on planning, holds the view that all custom-built harnesses will become irrelevant and be replaced by better reinforcement learning training. He’s probably right, but that is at least months from now! This is the harness I had Claude Code build first:

The system starts by syncing project email from Gmail into a local archive of raw messages and attachments. Local files are easier to grep, cache, inspect, and process repeatedly than a live inbox reached through an API. It also means the agent doesn’t have unfettered access to the inbox, only what the script, which is run externally, is programmed to download. The pipeline then walks those emails for shared links and downloads the corresponding files from Google Drive, Dropbox, and WeTransfer, because a large share of the actual project lives in drawings, contracts, invoices, surveys, permits, and estimates rather than in the message body. It also extracts text from documents and useful descriptions from images, so information does not get trapped in attachments.
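The link-walking step can be sketched with nothing but the Python standard library. This is a minimal illustration, not the pipeline described above: the host list, the `shared_links` function name, and the assumption that each message is available as a raw RFC 822 string are all hypothetical, and the actual download step is left out.

```python
import re
from email import message_from_string
from email.message import Message
from urllib.parse import urlsplit

# Hypothetical host list covering the three services named in the text.
SHARE_HOSTS = ("drive.google.com", "docs.google.com", "dropbox.com",
               "wetransfer.com", "we.tl")

# Stop at whitespace, quotes, and brackets so URLs in HTML attributes end cleanly.
URL_RE = re.compile(r"https?://[^\s<>\"'()\[\]]+")


def body_text(msg: Message) -> str:
    """Concatenate every text part of a message (plain text and HTML)."""
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset label: fall back to UTF-8
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def shared_links(raw_email: str) -> list[str]:
    """Return file-sharing URLs found in a raw email, in order, without repeats."""
    found: list[str] = []
    for url in URL_RE.findall(body_text(message_from_string(raw_email))):
        url = url.rstrip(".,;")  # drop sentence punctuation glued to the URL
        host = urlsplit(url).hostname or ""
        is_share = any(host == h or host.endswith("." + h) for h in SHARE_HOSTS)
        if is_share and url not in found:
            found.append(url)
    return found
```

A downloader would then take each returned URL and fetch the file into the local archive next to the message it came from, so the later text-extraction step sees attachments and linked files the same way.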

Once that corpus is assembled, batches of email and attachment text are processed through Claude’s API to extract structured events: commitments, decisions, issues, financial facts, deliverables, deadlines, milestone changes, and anything else that deserves to be a first-class item in the record. These are appended as JSON lines to a log file. Why use the API when the agent environment uses the same models and offers cheaper tokens? Because long repetitive runs need software around the model: the program handles batching, retries, checkpointing, deduplication, validation, and recovery from malformed output. Claude provides the judgment inside a narrow task; the surrounding code keeps the process stable over thousands of messages. You could try asking your coding agent to go over 5 years of emails one by one and build a coherent timeline, but it will most likely veer off track, change its approach midway through, or delete the data. There’s still untapped potential in putting AI models in a well-controlled for-loop.
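To make the “well-controlled for-loop” concrete, here is a minimal sketch of what such a loop could look like. Everything in it is an assumption for illustration: the three-field schema, the file names, and above all the `extract` callable, which stands in for whatever function sends a batch to the model API and returns its raw text reply.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable, Sequence

REQUIRED = {"date", "type", "summary"}  # hypothetical minimal schema


def parse_events(raw: str) -> list[dict]:
    """Validate model output: a JSON array of objects carrying the required keys."""
    events = json.loads(raw)  # json.JSONDecodeError is a ValueError
    if not isinstance(events, list):
        raise ValueError("expected a JSON array of events")
    for ev in events:
        if not isinstance(ev, dict) or not REQUIRED.issubset(ev):
            raise ValueError(f"invalid event: {ev!r}")
    return events


def event_key(ev: dict) -> str:
    """Stable fingerprint used to drop events that were already logged."""
    canon = json.dumps({k: ev[k] for k in sorted(REQUIRED)}, sort_keys=True)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()


def run(batches: Sequence[Sequence[str]], extract: Callable[[Sequence[str]], str],
        log_path: Path, checkpoint_path: Path,
        max_retries: int = 3, backoff: float = 0.0) -> None:
    # Checkpoint: number of batches already completed in earlier runs.
    done = int(checkpoint_path.read_text()) if checkpoint_path.exists() else 0
    seen: set[str] = set()
    if log_path.exists():
        with log_path.open(encoding="utf-8") as f:
            seen = {event_key(json.loads(line)) for line in f if line.strip()}
    for i, batch in enumerate(batches):
        if i < done:
            continue  # finished in a previous run
        for attempt in range(1, max_retries + 1):
            try:
                events = parse_events(extract(batch))
                break
            except ValueError:  # malformed or invalid output: ask again
                if attempt == max_retries:
                    raise
                time.sleep(backoff * attempt)
        with log_path.open("a", encoding="utf-8") as f:  # append-only log
            for ev in events:
                key = event_key(ev)
                if key not in seen:
                    seen.add(key)
                    f.write(json.dumps(ev, ensure_ascii=False) + "\n")
        checkpoint_path.write_text(str(i + 1))
```

The model only ever sees one batch and one narrow question. Ordering, resumption, and the decision about what reaches the log stay in deterministic code, which is what keeps a run over thousands of messages from drifting.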

Those events feed a multi-scale timeline. At the bottom, there is the append-only event log. Above that sit monthly summaries and a current project context that can be queried or rendered into a readable timeline. A reminder layer compares open commitments against Google Calendar and proposes follow-ups.
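The two upper layers can be sketched over plain event dictionaries. The field names (`id`, `type`, `due`, `owner`, `resolves`) and the convention that a calendar reminder carries the event id in its title are assumptions made for this example, and the calendar is reduced to a list of titles rather than a real Google Calendar call.

```python
from collections import defaultdict
from datetime import date


def by_month(events: list[dict]) -> dict[str, list[dict]]:
    """Group event records by 'YYYY-MM', the unit a monthly summary is built from."""
    months: dict[str, list[dict]] = defaultdict(list)
    for ev in events:
        months[ev["date"][:7]].append(ev)
    return dict(sorted(months.items()))


def open_commitments(events: list[dict]) -> list[dict]:
    """Commitments that no later 'resolution' event points back to."""
    resolved = {ev["resolves"] for ev in events if ev["type"] == "resolution"}
    return [ev for ev in events
            if ev["type"] == "commitment" and ev["id"] not in resolved]


def propose_followups(events: list[dict], calendar_titles: list[str],
                      today: date) -> list[str]:
    """Suggest a follow-up for each overdue commitment with no reminder yet."""
    proposals = []
    for ev in open_commitments(events):
        due = date.fromisoformat(ev["due"])
        has_reminder = any(ev["id"] in title for title in calendar_titles)
        if due <= today and not has_reminder:
            days = (today - due).days
            proposals.append(
                f"Follow up with {ev['owner']}: {ev['summary']} ({days} days overdue)"
            )
    return proposals
```

Because the event log underneath is append-only, both functions are pure reads: a commitment is closed by appending a resolution event that references it, never by editing the original record.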

So far, it has processed about 2,000 emails spanning 2021 to 2026, extracted 1,338 structured events, identified 102 people, and organized the work across 10 workstreams over 62 months. The number of decisions and resolutions since February is five times the previous rate, indicating that the system has been quite effective at getting things moving. I attribute a big part of it to tracking small commitments, those little promises that go unfulfilled and later pile up as delays: revised drawings next week, a document to be sent tomorrow, a correction to a deed, a cost figure to be confirmed, a map someone is still waiting on. They rarely look important in isolation, but they add up and become blockers when nobody records them.

A key design choice was to represent the work as events on a timeline rather than ask a model to rummage through an inbox whenever I had a question. In a multi-year project, the email is usually the wrong unit of thought. What matters operationally is that a contractor promised revised drawings by Friday, a lawyer said a deed amendment was required, a bank requested supporting documentation, or a permit moved from draft to filing. Once those facts exist as dated records with people, workstreams, and links back to source material, the project becomes much easier to reason about.
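As an illustration of what “dated records with people, workstreams, and links back to source material” might look like, here is one possible shape for an event, serialized as a single JSON line. The field names and the example values are hypothetical; the real schema is not given in this post.

```python
import datetime as dt
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Event:
    date: str                       # ISO date, e.g. "2025-10-03"
    type: str                       # e.g. "commitment", "decision", "deadline"
    summary: str
    people: tuple[str, ...] = ()
    workstream: str = ""
    sources: tuple[str, ...] = ()   # paths or ids pointing back into the archive

    def __post_init__(self) -> None:
        # Reject malformed dates early; fromisoformat raises ValueError.
        dt.date.fromisoformat(self.date)


def to_json_line(ev: Event) -> str:
    """Serialize one event as a single line for an append-only log."""
    return json.dumps(asdict(ev), ensure_ascii=False)


def from_json_line(line: str) -> Event:
    """Parse one log line back into an Event (JSON lists become tuples again)."""
    d = json.loads(line)
    d["people"] = tuple(d.get("people", ()))
    d["sources"] = tuple(d.get("sources", ()))
    return Event(**d)


example = Event(
    date="2025-10-03",
    type="commitment",
    summary="Contractor to send revised drawings by Friday",
    people=("contractor",),
    workstream="design",
    sources=("archive/2025-10-03-drawings.eml",),
)
```

With records like this, a question such as “what is still open in the design workstream?” becomes a filter over a list rather than a search through threads, and each answer carries a pointer back to the email it came from.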

Search still plays an important role; I regularly search the timeline and the underlying archive. But the search now runs over a representation that already carries context. Gmail search can retrieve fragments if I remember enough to ask the right question. A timeline preserves the dependencies and unresolved issues that give those fragments meaning.

Claude Code was valuable because it gave me a working environment rather than a chat window. It could read and write files, run commands, inspect failures, patch code, rerun jobs, and keep iterating until the workflow behaved properly. That made it cheap to develop and iteratively test tooling around the project. The first version came together in just a few hours, but over the next two months, it steadily improved with use: better recovery from malformed model output, improved file handling, headless authentication for remote runs, a unified CLI, and prompts and schemas that more closely matched the project – normal product work, but in a much tighter loop than would have been practical before.

Humans use tools because they take away physical or cognitive load and provide a repeatable way to perform a task. It is natural for agents to use tools and even build them for their own purposes. So why not connect the agent to a project management tool, and have it file Jira tickets or track progress in Asana? Partly because the project was small enough that I did not want the added complexity of interacting with a project management tool. But more interestingly, the structure that worked well here looked slightly different from that of ordinary project-management software. Existing tools are built for people who stare at dashboards, drag cards across boards, and update fields by hand. The agent could work better with something plainer: files, markdown, JSON, directories, scripts, and a timeline with stable links back to source material. I suspect a lot of software built primarily for agents will end up looking like that. It may be less polished to the eye and more useful in practice. It’s also clear to me that the future of much enterprise software is to build software for agents, at least in the short run, before they just write what they need on the fly.

Today, I mostly “use” the system by talking to Claude Code, but a lot of the value now lives in code written earlier, in scripts that call the API, and in workflows that began as conversations and later hardened into commands. The boundary between software, agent behavior, and conversation has become fairly porous, and that has proven helpful. Useful workflows often begin as one-off interactions; once they prove valuable, I pin them down in code. When the project shifts, I can loosen part of the workflow again or add another layer around it. A packaged product would have forced more structure up front than this project wanted. I call this software with a “frayed edge”: a tightly woven core that, after a while, doesn’t change much; an interface at the periphery that gets rewritten as the workflow evolves; and a chat interface that can handle the unexpected. Packaging “project management with an AI agent” as a product would miss much of the point. Software crystallizes workflows into a polished but inflexible interface, whereas much of the value I’m getting comes from the system’s overall fluidity.

Whilst the agent conveniently did all the coding, I still had to understand what I was building from an architectural perspective. No, this doesn’t point to some long-lasting synergy between human and AI; the centaur era (where man + machine is greater than man or machine alone) will not last long, but we are in it, and it’s thus relevant to talk about it. Among the decisions I had to make were whether the timeline should be append-only, what belonged in the schema, how large a batch should be, which data should be cached locally, which tasks belong in deterministic code, and which ones benefit from model judgment. Those questions do not disappear because a model writes the code.

That should still be encouraging for non-programmers. The bar is no longer “can you implement the whole thing by hand?” It is closer to “do you understand the workflow well enough to shape the system, inspect its outputs, and notice when it is going wrong?” You still have to get your hands dirty, but you do not need to be a professional developer to build something useful.

Seen from a distance, a fair amount of middle management consists of maintaining institutional memory and doing follow-up. Who promised what and when is it due? What does it depend on, and who needs reminding? Which disagreement is being reopened because nobody remembers how it was settled? Strategy, judgment, negotiation, hiring, politics, and taste remain human problems for the foreseeable future. But the connective tissue of coordination is much more legible to software than many organizations admit.

The cost of bespoke internal software has now dropped sharply. Someone who understands a workflow well, working in an agentic coding environment and with a short iteration cycle, can assemble something that previously would have required a small internal tools team. This will also reduce the importance of middle management, as more leverage can be gained when information is surfaced and managed easily. This is not the first time this has happened; computers and IT have delivered on some of that promise and extended the scale at which individuals could operate, but this stretches it even further.

A new type of software is emerging, one that exists on a continuum between robust, repeatable tooling code, high-turnover features rewritten by an agent, and an agent orchestrating their use. The agent becomes useful not just as a developer but also as a somewhat competent planner, capable of using tools and making sense of structured information collated from disparate sources.

P.S. We should bear in mind AI will likely make the vast majority of intellectual labor obsolete in the coming years, then recursively self-improve at an increasing pace. This will not stop at the digital world; molecular nanotechnology, as envisioned by Drexler, offers the prospect of doubling the stock of productive capital every hour; the surface of the earth will be covered with solar panels, and planets will be disassembled to surround the sun with a Dyson sphere. The odds are stacked against humanity’s survival in this process, without a collective realization of what’s happening. Even if humans do survive, the world will be largely unrecognizable. In this context, the future of middle management and custom software is quaint, and the wisdom of pouring time and energy into a renovation project in its fifth year is questionable. But there is little sense in living our lives as if we were right on the cusp of the singularity, and at least we can finally enjoy getting a handle on things for a brief moment.

16 Apr 2026

ai
software
