What I Learned Building a Document Format from Scratch

I started thinking the hard problem was document parsing. Then format design. Then sync logic. The real challenge was building at the interface between two domains that don't talk to each other.

Two distinct platforms connected by a bridge under construction. One side represents AI tooling, the other document formatting. Flat technical illustration.
The hardest part wasn't either domain. It was the interface between them.

Part 3 of 3 in the Sidedoc series. Read Part 1: Why AI Document Workflows Are Broken | Read Part 2: The Design Decisions Behind an AI-Native Document Format

The idea didn't come from a technical project. It came from photography.

I was working with RAW image files and thinking about how non-destructive editing works. With proprietary RAW formats like Nikon's NEF, the original sensor data stays untouched. Your edits (exposure adjustments, color grading, crops) get stored separately in XMP sidecar files or in the editor's catalog. You never modify the source. The rendering engine combines the original data with your adjustments on output.

That pattern was in my head when I was using AI to edit a Word document and watching the formatting break for the third time. Every round-trip destroyed something. Font choices reverted, heading hierarchies flattened, list numbering broke.

And I kept thinking: the principle behind non-destructive editing applies here. When we feed a document to an AI, we're deconstructing it. What if there were a sidecar for documents? Content in one place, formatting metadata alongside it, recombined on rebuild.
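The sidecar idea fits in a few lines of code. This is a hypothetical illustration of the mental model, not Sidedoc's actual file layout or API: plain-markdown content on one side, formatting metadata keyed by block index on the other, recombined at rebuild time.

```python
# Hypothetical sidecar pair. The content side is plain markdown an AI
# can edit freely; the formatting side is metadata it never touches.
content_md = "# Quarterly Review\n\nRevenue grew 12% year over year.\n"

formatting = {
    "0": {"style": "Heading 1", "font": "Montserrat", "size_pt": 24},
    "1": {"style": "Normal", "font": "Calibri", "size_pt": 11},
}

def rebuild(content: str, fmt: dict) -> list[dict]:
    """Recombine content blocks with their stored formatting."""
    blocks = [b for b in content.split("\n\n") if b.strip()]
    return [
        {"text": block, **fmt.get(str(i), {})}
        for i, block in enumerate(blocks)
    ]

doc = rebuild(content_md, formatting)
# The heading gets its font and size back; the body paragraph gets its
# style back. Edit the markdown, rebuild, and the formatting survives.
```

The point of the sketch is the asymmetry: the AI only ever sees and edits `content_md`, while `formatting` rides alongside untouched, exactly like an XMP sidecar next to a RAW file.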

That's how Sidedoc started. Not with a product vision. With a mental model borrowed from image processing and the frustration of re-applying heading styles by hand.

The Gap I Kept Seeing

Before I wrote a line of code, I looked at who else was working on this problem.

Pandoc has been the workhorse for document conversion for years. I use it constantly. It's excellent at what it does: convert between formats. But it's designed for one-time conversion, not repeated round-trips. Format conversion and format preservation are different problems.

Microsoft's MarkItDown extracts content from docx into markdown. One direction only. Tools like Pactify and MassiveMark go the other direction, generating new Word documents from markdown. Also one direction. Nobody was building for the multi-pass AI editing loop that's becoming the default workflow for enterprise document teams.

That gap told me something. It wasn't that the problem was too hard for these teams to solve. It was that it sits at the intersection of two communities that don't overlap. AI tooling builders don't think about document formatting. Document processing builders don't think about AI editing loops. The problem lives in the seam between them, and seams don't have natural owners.

I'd spent twenty-plus years in enterprise integration, making systems work together that weren't designed to. This had the same shape as every integration gap I've seen. Two sides building excellent tools. The interface between them is where the value leaks out.

What I Underestimated

I thought it would take a few weekends.

Separate content from formatting, store them in parallel files, put them back together. The concept is simple. I'd worked with enterprise document systems long enough to know that OOXML (the XML spec behind .docx files) was complex, but I figured python-docx would handle the heavy lifting and I'd mostly be writing the sync logic.

I was wrong in the way that experienced engineers are often wrong: I understood the problem well enough to be confident but not well enough to see the depth underneath it.

The OOXML specification is not just complex. It's deceptively complex. A heading that renders as 24pt Montserrat Bold in Word might get those properties from three separate sources: the font from a named style, the size from a direct override, the color from the document theme. To capture the heading's actual appearance, you have to resolve all three layers.

Three stacked layers resolving into one heading. Named style provides the font, direct override sets the size, document theme controls the color.
One heading, three sources. This is what format preservation actually means.

This is true for every element in a document. Five or six layers of inheritance for something that looks simple in the rendered output.
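The layered resolution can be modeled as a cascade: each layer contributes properties, and later layers override earlier ones, property by property. Here is a minimal sketch with hypothetical layer data for the heading example above; Sidedoc's real resolver walks the OOXML style chain, but the merge logic has this shape.

```python
def resolve(*layers: dict) -> dict:
    """Merge formatting layers; later layers win, property by property."""
    effective = {}
    for layer in layers:
        effective.update({k: v for k, v in layer.items() if v is not None})
    return effective

# Hypothetical layers: each contributes part of the heading's appearance.
theme = {"color": "1F4E79", "font": "Calibri", "size_pt": 11}
named_style = {"font": "Montserrat", "bold": True}  # the "Heading 1" style
direct_override = {"size_pt": 24}                   # set directly on the run

heading = resolve(theme, named_style, direct_override)
# Font comes from the named style, size from the direct override,
# color from the theme: three sources, one rendered heading.
```

In real OOXML the cascade is longer (document defaults, theme, style inheritance chains, paragraph properties, run properties), but every layer is the same operation: a partial dict of properties merged over what came before.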

The lesson wasn't about OOXML specifically. It was about the cost of building at the boundary of a mature, complex system. OOXML has been accumulating features and edge cases since 2006. Any tool that operates at that boundary inherits a portion of that complexity whether it wants to or not.

I've seen this same dynamic in enterprise API integrations, legacy system migrations, and cloud platform abstractions. The complexity of the adjacent system always leaks across the interface.

Understanding that early would have changed my estimation approach. Instead of scoping from the concept down ("how hard could separation of concerns be?"), I should have scoped from the adjacent system up ("how much of OOXML's complexity will I have to absorb?"). That's a transferable lesson for anyone building at the boundary of a system they didn't design.

Defining "Good Enough" for a New Category

There's no existing standard for round-trip document fidelity. Nobody has published a benchmark for "did the formatting survive five AI editing passes?" I had to define what quality meant before I could test for it.

You can't byte-compare Word documents. Every save changes timestamps, unique identifiers, revision numbers. Two files that look identical in Word will differ on disk. So I built the test suite to compare at the semantic level: does a heading with specific font, size, and color properties come back intact? Does a merged table cell spanning three columns rebuild correctly?

That framing decision mattered more than the test count. Defining fidelity as "formatting properties survive at the semantic level" rather than "files match at the byte level" set the quality bar in a way that was achievable, testable, and aligned with what users actually care about. Nobody opens two Word documents and runs a binary diff. They look at the headings, tables, and fonts.
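A semantic comparison reduces to this: extract the properties users actually see from each document, then compare the extracted structures instead of the bytes. The property names below are hypothetical and the elements are hand-built for illustration; real extraction walks the OOXML.

```python
def semantic_fingerprint(elements: list[dict]) -> list[tuple]:
    """Reduce document elements to the visible formatting properties,
    ignoring byte-level noise like timestamps and revision IDs."""
    keep = ("type", "text", "font", "size_pt", "bold", "color")
    return [tuple(el.get(k) for k in keep) for el in elements]

# Hypothetical extracted elements before and after a round-trip.
original = [
    {"type": "heading", "text": "Summary", "font": "Montserrat",
     "size_pt": 24, "bold": True, "color": "1F4E79", "rev_id": 41},
]
rebuilt = [
    {"type": "heading", "text": "Summary", "font": "Montserrat",
     "size_pt": 24, "bold": True, "color": "1F4E79", "rev_id": 97},
]

# rev_id differs, as it would on any save, so a byte or dict comparison
# fails. The semantic fingerprints match: the formatting survived.
assert original != rebuilt
assert semantic_fingerprint(original) == semantic_fingerprint(rebuilt)
```

The whitelist in `keep` is the quality bar made explicit: anything listed there must survive a round-trip; anything not listed is free to churn.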

188 tests, 84% coverage. I invested heavily in the extraction and rebuild pipelines where bugs cause visible formatting loss. Some test documents exist purely for edge cases that nobody would create intentionally. But they catch interaction bugs that matter when real documents with messy formatting hit the pipeline.

What Early Conversations Taught Me

The hardest point wasn't a technical problem. It was a confidence problem.

About three weeks in, I had paragraph and heading extraction working, but every new element type had its own complexity. I seriously considered shipping a tool that only handled paragraphs and headings, just to get something out the door.

What stopped me was talking to people in the target audience. Consulting teams, legal teams, enterprise content teams. Those conversations taught me three things I couldn't have learned from the code.

First, tables are non-negotiable for credibility. Every person I talked to said some version of the same thing: "if it can't handle our tables, we can't use it." Not "it would be nice." Can't use it. That's a different signal than a feature request. It's a credibility threshold.

Second, the pain is worse than I assumed. I expected people to say formatting loss was annoying. What I heard was that it was costing real hours every week. One consulting team estimated their analysts spent 3-4 hours per engagement reformatting documents after AI edits. They'd accepted it as the cost of using AI tools. Nobody had told them it was fixable.

Third, the buying signal I didn't expect: track changes. Legal teams lit up when I described CriticMarkup flowing through the round-trip. Tracked changes in contracts aren't cosmetic. They're the audit trail. Losing them isn't an inconvenience. It's a compliance risk. That conversation moved track changes from "nice to have" to "ship with this."

None of these insights came from reading documentation or analyzing the competitive landscape. They came from putting an incomplete thing in front of real people and listening to what they reacted to. That's not a novel observation, but it's one I have to keep re-learning.

The Feature That Validated the Architecture

Tables looked straightforward until I opened a real consulting document.

Simple grids map cleanly to GFM pipe tables. But real documents have merged cells spanning three columns in header rows, per-cell background shading, border styles that vary between sections, and header rows formatted differently from data rows. The mapping between a simplified markdown grid and the full OOXML cell structure is where the complexity lives.
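One way to model that mapping, sketched here with hypothetical names: keep the simplified grid as content and record merges in a separate merge map, so the markdown table stays a plain rectangle while the spans survive in sidecar metadata.

```python
# Hypothetical representation. The content side is a plain rectangular
# grid (what a GFM pipe table can express); merge spans live in sidecar
# metadata keyed by the anchor cell's (row, col).
table_content = [
    ["Region", "", "", "Total"],   # "Region" visually spans 3 columns
    ["North", "East", "West", "4.2M"],
]
merge_map = {(0, 0): {"col_span": 3}}

def covered_cells(merges: dict) -> set[tuple[int, int]]:
    """Cells hidden under a horizontal merge (the anchor cell is not)."""
    covered = set()
    for (row, col), span in merges.items():
        for offset in range(1, span.get("col_span", 1)):
            covered.add((row, col + offset))
    return covered

hidden = covered_cells(merge_map)
# (0, 1) and (0, 2) sit under the "Region" merge; the anchor (0, 0)
# and everything in the data row rebuild as ordinary cells.
```

The hard part in practice is the reverse direction: when the AI edits the grid, the sync logic has to decide whether a changed cell belongs to an anchor or a covered position before the OOXML spans can be reapplied.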

I built the table implementation with Claude Code. Dozens of iterations: write the extraction logic, test it against a complex document, find a merge case that broke, refine, test again. Multiple complete rewrites of the merge reconciliation logic.

Tables mattered beyond their feature value because they were the hardest test of the core architecture. If the separation-of-concerns model could handle merged cells, per-cell formatting, and content changes flowing through the sync mechanism without losing fidelity, the architecture was sound. When the table tests finally all passed, that was the moment I stopped questioning whether the format would work and started thinking about where it should go.

The Real Lesson: Build at the Seam

Here's what I didn't expect to learn from this project.

I started thinking the hard problem was document parsing. Then format design. Then sync logic. None of those were the real challenge. The real challenge was building at the interface between two domains that don't talk to each other.

Both sides have good technology. The gap between them is where the value leaks out.

Two solid platforms representing AI tooling and document processing with a highlighted gap between them. The gap is labeled as the opportunity.
Two communities building excellent tools. The gap between them is where the value lives.

This pattern should be familiar to anyone who's done enterprise integration work. Two systems work well in isolation. The interface between them is where everything breaks. And the interface is where nobody wants to build, because it requires understanding both sides deeply enough to design something that serves each on its own terms.

That's where the highest-impact work happens. Not inside either domain, but at the seam. The hardest part of building Sidedoc wasn't learning OOXML or designing a sync algorithm. It was holding both the AI consumption model and the document formatting model in my head simultaneously and finding an architecture that respected the constraints of each without forcing either to compromise.

If you're looking for high-impact problems in any AI-adjacent space, look for these seams. Look for the places where AI tools stop and hand off to human workflows, where the interface is manual, lossy, or nonexistent. Documents are one example. There are dozens of others: CAD files, financial models, design systems, legal filings. Every domain where AI needs to consume and produce structured artifacts has the same gap. The tools that close those gaps will define the next layer of AI infrastructure.

Where This Goes

Sidedoc is open-source under MIT, and that's not a default choice. It's the strategy.

A document format only has value if people use it. The formats that become standards don't win by being technically superior. They win by being available everywhere. Markdown won because it's free, readable, and supported by every tool. PDF won because Adobe published the spec. The pattern is consistent: open the format, build the ecosystem, let adoption create the moat.

The near-term work is making Sidedoc available where AI editing actually happens. An MCP server that lets AI agents work with Sidedoc files natively. A LangChain document loader. An API service for teams that want the functionality without running a CLI. Each of these lowers the friction between "I want to use AI to edit this document" and "the formatting survived."

But the real goal is larger than any of those integrations. The goal is for Sidedoc to become a format that document editors and AI tools both understand natively. Not a conversion step in a pipeline, but a first-class format. An editor that opens a .sidedoc/ directory renders the formatted document from the metadata. An AI agent that receives a Sidedoc file edits the markdown and triggers a rebuild. Neither side needs to know about the other's concerns because the format handles the translation.

A completed bridge connecting AI tooling and document formatting platforms. Documents flow across in both directions without loss.
The separation of content and formatting should be invisible. It should just be how documents work.

That's the vision that drove the photography analogy from the beginning. RAW image editors don't make you think about sensor data versus editing metadata. The format handles that separation transparently. Non-destructive editing became invisible infrastructure. I want the same thing for AI document workflows: the separation of content and formatting should be invisible to both the AI and the human. It should just be how documents work.

I don't know if Sidedoc gets there. Building a format standard is a longer road than building a tool. But the gap between AI tooling and document processing isn't closing on its own. Someone has to build the bridge. I decided to start.

You can install Sidedoc with pip install sidedoc (Python 3.11+). The source is on GitHub. Documentation, format specification, and benchmark results are at sidedoc.io.


This is Part 3 of a three-part series. Part 1: Why AI Document Workflows Are Broken covers the problem. Part 2: The Design Decisions Behind an AI-Native Document Format covers the architecture and data.