KBS: Fast Forward
On today's episode: a lot of code. The previous work prepared the project codebase to guide agents toward generating good-quality code, and this session put that preparation to use.
Update, Resolve, Archive Commands
After the `Kb_service` module was broken apart, the coding agent easily generated full verticals for the update, resolve, and archive commands. The process I'm following:

- Work with the agent to peel off a chunk of functionality from `docs/product-requirements.md`.
- Write a `prompts/activities/implementation-plan.md` for building the functionality.
- After the code is generated, I review the code, and the agent reviews it as well (`prompts/activities/code-review.md`); I apply changes for all issues.
Related commit: 0eb5584 — feat: add update, resolve, and archive commands
Design Document
In anticipation of more complicated functionality, I wanted a step before the implementation plan, one not focused so closely on implementation but instead centered on weighing different approaches. Thus `prompts/activities/design-document.md`. For several of the next features I make use of the design document, but I am not happy with the results and plan to figure out a different process.
The issue with the design document part of the process is that it focuses the agent on producing a single punchline. Despite the instructions to consider alternative solutions and discuss design decisions, the agent doesn't research or explore nearly enough and just drills down on producing the overall document. In the future I think I will break the process into stages: distill the requirements and detail the background/status quo, then list research paths and open questions, then produce documents for each of those, and finally synthesize all of it into a single document.
Related commit: 15f876f — feat: add design-document activity prompt
Relations, Deferred Tests, Parsing Refactor
Relations between notes and todos are an even more complicated vertical, requiring domain models, repository logic, a service, and a command. The coding agent handled it with ease. We also worked to implement the integration tests that were imagined during the add subcommand but deferred, and then refactored the codebase to separate out the domain model parsing logic shared by a number of the services.
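The parsing refactor pulled the shared domain-model parsing out of the services into result-returning parsers. The real code is OCaml; as a rough illustration of the result-returning style (all names and relation kinds here are hypothetical, not taken from the project), a Python sketch:

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical relation kinds; the real project defines its own set.
RELATION_KINDS = {"blocks", "references", "duplicates"}

@dataclass
class Relation:
    source_id: str
    target_id: str
    kind: str

@dataclass
class ParseError:
    message: str

def parse_relation(source_id: str, target_id: str, kind: str) -> Union[Relation, ParseError]:
    """Result-returning parser: hand back a value or an error, never raise."""
    if kind not in RELATION_KINDS:
        return ParseError(f"unknown relation kind: {kind!r}")
    if not source_id or not target_id:
        return ParseError("source and target ids must be non-empty")
    if source_id == target_id:
        return ParseError("an item cannot relate to itself")
    return Relation(source_id, target_id, kind)
```

The appeal of this shape is that callers (services, commands) are forced to handle the error branch explicitly, which maps naturally onto OCaml's `result` type.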
Additionally, I made a documentation upkeep prompt to be run periodically to sweep the docs for broken references, missing concepts, and the like.
Related commits:
- 0b49409 — test: implement deferred list integration tests
- eef388a — feat: add relate command for creating typed relations between items
- a49137b — refactor: add input parsing layer with result-returning Data parsers and Service.Parse
SQLite ↔ JSONL Synchronization
This is probably the most complex change we've tackled so far: a full vertical down to a new serialization format, an intermediate snapshot representation of the database, code to rebuild the database, and two subcommands (`flush` and `rebuild`). The design document was handy here, helping to remove some of the complexity I had assumed needed to be built, but it was also not fully satisfactory, as discussed above, and I had to push for more research and investigation.
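The core shape of the two subcommands is simple even if the real vertical isn't. A minimal Python sketch of the idea, assuming a made-up single-table schema (the actual project's schema, snapshot format, and file names are not shown here):

```python
import json
import sqlite3

# Hypothetical schema standing in for the project's real tables.
SCHEMA = "CREATE TABLE items (id TEXT PRIMARY KEY, kind TEXT, body TEXT)"

def flush(conn: sqlite3.Connection, path: str) -> None:
    """Serialize every row as one JSON object per line (JSONL)."""
    with open(path, "w") as f:
        for id_, kind, body in conn.execute(
            "SELECT id, kind, body FROM items ORDER BY id"
        ):
            f.write(json.dumps({"id": id_, "kind": kind, "body": body}) + "\n")

def rebuild(path: str) -> sqlite3.Connection:
    """Recreate the database from scratch out of the JSONL snapshot."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            conn.execute(
                "INSERT INTO items VALUES (?, ?, ?)",
                (row["id"], row["kind"], row["body"]),
            )
    conn.commit()
    return conn
```

The nice property of JSONL as the durable format is that it diffs and merges cleanly in git, while SQLite remains a disposable, rebuildable index.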
Related commit: 81498af — feat: add SQLite-JSONL two-way synchronization
Unused Symbol Checking
Something I've observed a few times is that coding agents consider functions in the context of the agent's current semantic purpose. If the purpose of a change is to move logic around, the agent will delete the "old" function and create the "new" one. However, the agent is not always "thinking" this way; sometimes, due to the structure of the conversation, it considers the change to simply be an addition, adding the "new" function and leaving the "old" one around. To deal with this, and to start building out more automated guidance for the agents to follow, we built an unused symbol checker, a dead-code analyzer.
We explored a couple of other options and ended up settling on a Python script that starts up the OCaml language server, queries for symbols, queries for references to those symbols, and shuts the server back down. With this in hand we deleted ~5 functions that agents had forgotten about.
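The LSP plumbing (JSON-RPC over stdio against ocamllsp) is the bulk of the real script and is omitted here. The filtering step at the end is the interesting part; a sketch with hand-made data standing in for the server's replies (the symbol names below are invented for illustration):

```python
def find_unused(symbols: dict[str, list[tuple[str, int]]]) -> list[str]:
    """Given a map from symbol name to its reference locations (file, line),
    report symbols that are dead code.

    Assumption in this sketch: the language server reports the definition
    site itself as one of the references, so a symbol with one or zero
    locations has no callers.
    """
    return sorted(name for name, refs in symbols.items() if len(refs) <= 1)
```

In practice the script also needs an allowlist for intentionally exported entry points, since a public API function can legitimately have no in-repo callers.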
Related commit: f3cfc04 — feat: add unused-export checker and remove dead code
Fixes and Testing
Something I've noticed with the product requirements document, now that the agent drives what functionality to implement, is that it keys off the already existing structure in the document: the numbered use cases. It does not do a good job of seeing what exists in the system and adjusting the chosen functionality to cover the gaps. For example, when we implemented the `show` command, relations had not been implemented yet, so they were skipped. When relations were picked up to be implemented, the agent ignored `show` entirely, requiring a later fix to display relations when `show`ing. Similarly, the `--json` flag is mentioned a couple of times in the document, but, I'd guess due to its horizontal nature, it never got picked up, resulting in the commit below adding it to all commands. I've attempted to address this specific feature gap with guidance; we'll see if it's effective.
The TypeId bug is another interesting agent blind spot. Because of the random nature of the TypeId values, both the unit tests and the integration tests failed to identify that the suffixes of all the TypeIds were the same. The ids have a prefix derived from the current time, so they never collided during testing, and the assertions all needed to be generic enough to pass. I think some guidance work is needed here in the future. This bug was found because I started bootstrapping `bs` development with itself 🎉, working the `--json` feature as notes and todos. The resulting `.kbases.jsonl` showed weirdly similar ids.
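The fix this class of bug wants is seedable randomness: have the generator take its random source as a parameter, so production uses a fresh one while tests pass a seeded instance and can assert both reproducibility and actual variation. A Python sketch of the pattern (the id shape and alphabet here are invented, not the project's real typeid format):

```python
import random
import time

def make_typeid(prefix: str, rng: random.Random) -> str:
    """Hypothetical typeid: a time-derived prefix plus a random suffix.

    Taking the RNG as a parameter is the point of the pattern: tests pass
    a seeded random.Random, making the suffixes deterministic while still
    exercising the randomness of the generator.
    """
    suffix = "".join(rng.choices("0123456789abcdefghjkmnpqrstvwxyz", k=10))
    return f"{prefix}_{int(time.time()):08x}{suffix}"
```

A test against a seeded generator would have caught the original bug immediately: if every suffix came out identical, the distinctness assertion fails.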
I wanted some more integration tests focused on the whole lifecycle, to help cover cross-command defects. We implemented them, as well as an adjustment to the architecture guidance.
Related commits:
- e8c2b37 — fix: display outgoing and incoming relations in show command
- c862b5b — test: add workflow integration tests for multi-command scenarios
- 9f1ef2e — feat: add --json flag to all subcommands and git-excludes on init
- d637a13 — fix: typeid suffixes not random
Minor New Features
Last in this long coding session, I added some coding-agent-oriented functionality and worked an ergonomics issue I watched the coding agent hit. During `bs init`, we made the application install a short helper in the project's AGENTS.md; this will probably be tweaked over time, as I'm not sure how well the limited example performs. The second feature was to make `bs close` an alias for `bs resolve`, after I watched the coding agent "guess" the close command, fail, look up the `--help`, then succeed with resolve.
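The alias pattern is cheap to support in most CLI frameworks. The real `bs` CLI is OCaml; as an illustration of the same idea in Python's argparse, where `aliases` routes a second spelling to the same subcommand (the handler wiring below is invented for the sketch):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="bs")
    sub = parser.add_subparsers(dest="command")
    # `aliases` makes `bs close` parse as the resolve subcommand.
    resolve = sub.add_parser("resolve", aliases=["close"],
                             help="mark an item as resolved")
    resolve.add_argument("id")
    resolve.set_defaults(handler="resolve")
    return parser
```

Dispatching on the `handler` default rather than the literal command name means both spellings reach the same code path with no extra branching.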
Related commits:
- bba27ca — feat: install AGENTS.md on bs init and enrich --help output
- 30d81e7 — feat: bs close subcommand as alias for resolve
Collected Future Work
- In-depth design process. As discussed above, a more robust design process to follow.
- Random testing. I think there is some kind of `lib` testing principle needed for the typeid bug. I've gotten away with letting the ids be random in places and controlled in others, but I think a test is needed that demonstrates the randomness of the id generation while controlling it via the seed.
- More testing guidance. I think there may need to be an integration testing principle around adding to the `workflow_expect.ml` tests for new subcommands.
- More functionality. I'll likely go through another round of product requirement ideation, to document all the small things I've thought of that didn't make the basics of the first pass.
- Ergonomics. Now that I've started having the agent use `bs` in earnest, more hallucinations will happen and more shortcuts will be desired by the agent; I'll capture these and work on them.
- Performance and concurrency testing. I am not convinced that the `flush` on every write operation will scale to thousands of notes and todos, or that the auto-rebuild at any time will scale, so I will be thinking about how to test them. I also have not paid any attention to concurrent uses of `bs` within the same repository, so I'm going to investigate and see what failures pop out. I'm not sure if I want to support concurrent use in the same directory, as telling users to use `git worktree` would make this a non-issue, but part of me wants to see how hard it is to guard against.
Aside from these ideas for future work, I think we're nearing the end of this experiment. All of the basic functionality is implemented and the program is usable by coding agents. I will have to pause at some point and do an in-depth, by-hand code review to be able to come to any conclusions about automatic software development while maintaining quality.