Building Trafford

A Desktop Agent OS for Knowledge Work

The Setup

Background & hypothesis

8 years in product — Amazon, startups, co-founded an AI startup on LangGraph

Claude Code as my operating system for 6 months — glimpses of tool → teammate

7 weeks building, dogfooding, and testing

The hypothesis

The models are already capable enough. But building production agent systems carries a massive infrastructure tax — structured memory, tool orchestration, multi-model routing, agent coordination — that only well-resourced engineering teams can actually afford to pay. Non-technical knowledge workers and small teams are locked out entirely. The missing layer isn't better AI — it's the management infrastructure that makes agents actually useful for the people who need them most.

Five Questions

Before I started building, I wanted to clarify my goals. I was not aiming to build a set of features. I wanted to build and use a multi-agent system myself, to learn more about five questions.

1. How do you create agents that are actually good?

Today the outcome is binary: well-resourced engineering teams can build world-class agents, while non-technical users can build a bot that summarizes meeting notes into Notion.

2. How do you make agents remember like a team of infinite human specialists?

Most implementations recall conversations but lose what they learned from them. The hard part isn't forgetting — it's remembering the wrong things.

3. How do you mimic the magic of cross-functional collaboration with agents?

Different models seem to have genuinely different reasoning strengths — which makes me think there's something worth exploring in how they might complement each other.

4. How can AI make context switching seamless instead of painful?

Context switching isn't just friction — it's a tax that compounds through the day.

5. How do you build the right workspace for different types of knowledge work?

For most knowledge work that is not code, the workspace should not only be a terminal or a chat window.

How I Went About It

I'm not a developer by training — I'm a PM who had been building production systems with Claude Code and Codex for the better part of a year.

Before writing any code or requirements documentation, I spent a day setting up my product development system. It combined five well-tested Claude Code skills with instructions on how a feature should flow from an idea to deployed code: ticket types and structure in Linear, testing infrastructure, documentation templates, and coding agent roles (Claude Code, Codex). It also included a requirements-gathering protocol so that Claude Code could ask me a holistic set of questions about what needed to be built and then produce an implementation plan.

From there I worked in two-day sprints against a living strawman spec, with research spikes wherever uncertainty was high, revisiting the strawman at the end of each sprint based on what I'd learned, and moving towards target milestones with lots of intermediate testing.

What Trafford Is

Surfaces

Work with your agents across surfaces. Always in sync.

Desktop

Terminal

Voice

Web

Mobile

Your Team

Specialist agents that get better the more you work together.

Iris: Agent Architect

Specialist Agent

Scalable team

Work modes

1:1

Group

Council

Each agent has

System Prompt · Skills · Tool Access · Memory · Personality & Tone · Scope · Model · Anti-patterns · Workspace

Iris creates agents from your conversation + platform capabilities. Agents share context automatically. You review. You approve. Not a black box.
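As a rough sketch of how the properties listed above could be held in one reviewable object, assuming a plain TypeScript shape. The `AgentSpec` type, its field names, the `reviewDraft` and `approveAgent` helpers, and the sample values are all illustrative assumptions, not Trafford's actual schema.

```typescript
// Hypothetical agent definition. Field names mirror the "Each agent has" list;
// none of this is Trafford's real schema.
interface AgentSpec {
  name: string;
  systemPrompt: string;
  skills: string[];
  toolAccess: string[]; // tools this agent may call
  memoryNamespace: string;
  personalityAndTone: string;
  scope: string;
  model: string;
  antiPatterns: string[]; // behaviors the agent should avoid
  workspacePath: string;
  approved: boolean; // stays false until a human signs off
}

// Lists the problems a reviewer should see before approving a draft.
function reviewDraft(spec: AgentSpec): string[] {
  const problems: string[] = [];
  if (spec.systemPrompt.trim().length === 0) problems.push("system prompt is empty");
  if (spec.scope.trim().length === 0) problems.push("scope is empty");
  if (spec.toolAccess.length === 0) problems.push("no tools granted");
  return problems;
}

// Approval is an explicit step: a draft with open problems is rejected.
function approveAgent(spec: AgentSpec): AgentSpec {
  const problems = reviewDraft(spec);
  if (problems.length > 0) {
    throw new Error("cannot approve: " + problems.join("; "));
  }
  return { ...spec, approved: true };
}

// Example draft with placeholder values.
const draft: AgentSpec = {
  name: "Research Agent",
  systemPrompt: "You research people and companies before outreach.",
  skills: ["web-research"],
  toolAccess: ["gmail.read"],
  memoryNamespace: "research",
  personalityAndTone: "concise, skeptical",
  scope: "background research only",
  model: "placeholder-model",
  antiPatterns: ["inventing sources"],
  workspacePath: "workspaces/research",
  approved: false,
};
```

Keeping approval as a separate function that returns a new object is one way to make the "you review, you approve" step explicit rather than implicit in creation.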

The Plumbing

What powers the team.

Memory

5-stage pipeline: observe, consolidate, extract, organize, maintain

Background memory agent continuously processes and structures knowledge

Cross-agent propagation — agents share what they learn

Semantic search over everything

Models & Providers

8 curated models from 6 providers

Right model for the right task

Tools & Integrations

Connect to 250+ tools — Gmail, Slack, GitHub, Linear, etc.

Knowledge & Documents

Per-agent workspaces — knowledge, decisions, session history, reference docs

Import/export: Google Docs, Notion, local files

Private and local by default — cloud toggle when ready

Backend Infrastructure

WebSocket Bridge · Agent SDK · Cloud API · Auth (Clerk) · Database (Turso) · Vector Search (LanceDB) · Cloud Sync · Electron IPC

Three Deep Dives

Iris — Agent Architect

Trafford's agent architect. Describe what you need in conversation — system prompt, skills, tool access, personality, anti-patterns all come from that conversation, not a form.

Research-first · Adversarial testing · Model routing · Skill import

Memory Pipeline

5-stage pipeline adapted from Mastra (94.87% LongMemEval). Background processing on Qwen 3.5. Decisions supersede, observations propagate across agents.

5-stage pipeline · Cross-agent propagation · 143 tests · Agent workspaces
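To make the five stages concrete, here is a minimal sketch under stated assumptions. The real pipeline runs on a background model; these are plain synchronous functions, and the `SessionEvent` shape, the "chatter" category, and what each stage does are guesses made for illustration. Cross-agent propagation is left out. The one rule taken from the description above is that a newer decision supersedes an older one on the same topic.

```typescript
// Illustrative stand-in for the five stages, not the actual implementation.
type EventKind = "decision" | "observation" | "chatter";

interface SessionEvent {
  kind: EventKind;
  topic: string;
  text: string;
  at: number; // timestamp, larger means newer
}

type TopicIndex = Record<string, SessionEvent[]>;

// 1. Observe: capture every event from the session unchanged.
function observe(events: SessionEvent[]): SessionEvent[] {
  return events.slice();
}

// 2. Consolidate: drop exact repeats of the same kind, topic, and text.
function consolidate(events: SessionEvent[]): SessionEvent[] {
  const seen: Record<string, boolean> = {};
  return events.filter((e) => {
    const key = JSON.stringify([e.kind, e.topic, e.text]);
    if (seen[key]) return false;
    seen[key] = true;
    return true;
  });
}

// 3. Extract: keep only decisions and observations.
function extract(events: SessionEvent[]): SessionEvent[] {
  return events.filter((e) => e.kind !== "chatter");
}

// 4. Organize: group the extracted items by topic.
function organize(events: SessionEvent[]): TopicIndex {
  const index: TopicIndex = Object.create(null);
  for (const e of events) {
    (index[e.topic] = index[e.topic] || []).push(e);
  }
  return index;
}

// 5. Maintain: the newest decision on a topic supersedes older ones;
// all observations are kept.
function maintain(index: TopicIndex): TopicIndex {
  const out: TopicIndex = Object.create(null);
  for (const topic of Object.keys(index)) {
    const items = index[topic];
    const observations = items.filter((e) => e.kind === "observation");
    const decisions = items.filter((e) => e.kind === "decision");
    if (decisions.length === 0) {
      out[topic] = observations;
      continue;
    }
    const latest = decisions.reduce((a, b) => (b.at > a.at ? b : a));
    out[topic] = [latest].concat(observations);
  }
  return out;
}

function runPipeline(events: SessionEvent[]): TopicIndex {
  return maintain(organize(extract(consolidate(observe(events)))));
}
```

Keeping each stage a pure function over the previous stage's output makes it straightforward to test stages in isolation, which is one plausible reason to structure a memory pipeline this way.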

Multi-Agent Reasoning

Three modes of collaboration — 1:1, group threads, and council sessions. Testing where multi-model diversity genuinely adds value vs. where a single strong model is enough.

Group threads · Council sessions · 8 models, 6 providers · Brainstorming & synthetic audiences

Architecture Explorer

Those three systems sit inside a larger architecture. Here's the full picture, layer by layer.

Surfaces

Desktop App: Electron shell. Download, double-click, start working. No terminal, no setup. Security fuses prevent process hijacking.

Terminal (Claude Code): Same agents, same memory, same tools, reached through the terminal. Always in sync with desktop.

Future Surfaces: Voice, web, and mobile are planned. The architecture is surface-agnostic, so agents don't care how you reach them.

Agents & Interaction

Iris (Agent Architect): Builds agents through conversation. Research-first methodology, adversarial testing, and model routing for each step of the creation process.

12 Specialist Agents: Each has a domain, personality, behavioral rules, custom skills, and scoped tool access. Each gets its own workspace with knowledge, decisions, session history, and reference docs.

Multi-Profile System: Work and personal profiles with full isolation, meaning separate agents, knowledge, transcripts, and tool access per profile.

1:1, Group, and Council Modes: Work with one agent in a focused session, work with multiple agents in a thread with structured handoffs, or run a Council session (specialist panels, brainstorming, or synthetic focus groups).
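A group thread with structured handoffs could look roughly like the sketch below. Everything here is an assumption for illustration: the `AgentStep` and `Handoff` types, the `runThread` function, and the use of synchronous string-to-string functions in place of real (asynchronous, model-backed) agents.

```typescript
// Hypothetical handoff record: who passed work to whom, and what was passed.
interface Handoff {
  from: string;
  to: string;
  summary: string;
}

// Stand-in for an agent: a name plus a function that transforms the work so far.
interface AgentStep {
  name: string;
  run: (input: string) => string;
}

// Runs agents in order and records one handoff between each adjacent pair.
function runThread(
  steps: AgentStep[],
  request: string
): { output: string; handoffs: Handoff[] } {
  let current = request;
  const handoffs: Handoff[] = [];
  steps.forEach((step, i) => {
    current = step.run(current);
    const next = steps[i + 1];
    if (next !== undefined) {
      handoffs.push({ from: step.name, to: next.name, summary: current });
    }
  });
  return { output: current, handoffs };
}
```

Recording each handoff as data, rather than only passing text along, is what would let a user audit what one agent told the next.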

System Capabilities

Memory Pipeline: Five-stage pipeline adapted from Mastra: Observe, Consolidate, Extract, Organize, Maintain. The background memory agent runs on Qwen 3.5 (open-source) for near-zero processing cost.

Provider Registry: 8 curated models from 6 providers. Right model for the right task, routed by task type and complexity.
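Routing by task type and complexity can be sketched as a filter followed by a cheapest-first pick. This is an assumed policy, not the one Trafford uses. The model ids, providers, complexity scale (1 to 5), and cost figures below are placeholders invented for the example.

```typescript
type TaskType = "background-memory" | "drafting" | "reasoning";

// Hypothetical registry entry; real entries would carry provider credentials, limits, etc.
interface ModelEntry {
  id: string;
  provider: string;
  handles: TaskType[];
  maxComplexity: number; // highest task complexity (1-5) this model is trusted with
  costPerMTok: number; // placeholder relative cost, not real pricing
}

const registry: ModelEntry[] = [
  { id: "local-small", provider: "provider-a", handles: ["background-memory", "drafting"], maxComplexity: 2, costPerMTok: 0.1 },
  { id: "mid-general", provider: "provider-b", handles: ["drafting", "reasoning"], maxComplexity: 4, costPerMTok: 1 },
  { id: "frontier-large", provider: "provider-c", handles: ["drafting", "reasoning"], maxComplexity: 5, costPerMTok: 10 },
];

// Picks the cheapest model that handles the task type at the given complexity.
function routeModel(
  task: TaskType,
  complexity: number,
  models: ModelEntry[] = registry
): ModelEntry {
  const candidates = models.filter(
    (m) => m.handles.indexOf(task) !== -1 && m.maxComplexity >= complexity
  );
  if (candidates.length === 0) {
    throw new Error("no model for " + task + " at complexity " + complexity);
  }
  return candidates.reduce((cheapest, m) =>
    m.costPerMTok < cheapest.costPerMTok ? m : cheapest
  );
}
```

With this policy, simple drafting lands on the cheapest capable model, and only the hardest reasoning tasks reach the most expensive one.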

Tool Execution: 250+ integrations powered by Composio with managed OAuth, per-agent scoping, and runtime health checks.

Semantic Search: LanceDB vector search across all agent knowledge, including conversations, documents, and memory observations. Finds relevant context by meaning, not keywords.
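The idea of finding context "by meaning" usually comes down to comparing embedding vectors. The sketch below shows that comparison with an in-memory array and cosine similarity. It deliberately does not use LanceDB's API, and the `Doc` type, `cosine`, and `topK` are names made up for this example. It also assumes the embeddings already exist.

```typescript
// A stored item with a precomputed embedding (assumed to come from some embedding model).
interface Doc {
  id: string;
  vector: number[];
}

// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the k documents most similar to the query, best match first.
function topK(query: number[], docs: Doc[], k: number): Doc[] {
  return docs
    .map((doc) => ({ doc, score: cosine(query, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.doc);
}
```

A vector database does the same ranking with an index instead of a full scan, which is what makes it practical across a large store of conversations and documents.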

Infrastructure

WebSocket Bridge: Typed message protocol connecting the React dashboard to the Node.js runtime. Handles streaming, tool invocation, and session management. API keys never leave the server.
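A typed message protocol of this kind is commonly modeled as a discriminated union. The sketch below assumes that approach. The message names (`session.start`, `stream.delta`, and so on) and both helper functions are hypothetical, not Trafford's wire format. Note that no message type carries an API key, which is consistent with keys staying on the server.

```typescript
// Messages the dashboard may send to the runtime (hypothetical names).
type ClientMessage =
  | { type: "session.start"; agentId: string }
  | { type: "user.message"; sessionId: string; text: string };

// Messages the runtime may send back (hypothetical names).
type ServerMessage =
  | { type: "stream.delta"; sessionId: string; text: string }
  | { type: "tool.call"; sessionId: string; tool: string }
  | { type: "stream.end"; sessionId: string };

// Validates an incoming frame; anything unrecognized is rejected as null.
function parseClientMessage(raw: string): ClientMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { type, agentId, sessionId, text } = data as Record<string, unknown>;
  if (type === "session.start" && typeof agentId === "string") {
    return { type: "session.start", agentId };
  }
  if (
    type === "user.message" &&
    typeof sessionId === "string" &&
    typeof text === "string"
  ) {
    return { type: "user.message", sessionId, text };
  }
  return null;
}

// Folds server frames into the text shown in the dashboard.
function applyServerMessage(shown: string, msg: ServerMessage): string {
  switch (msg.type) {
    case "stream.delta":
      return shown + msg.text;
    case "tool.call":
      return shown + "[calling " + msg.tool + "]";
    case "stream.end":
      return shown;
  }
}
```

Because the `type` field discriminates the union, the compiler can check that every message kind is handled on both sides of the socket.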

Cloud API: Hono server on Fly.io. Handles sync, auth webhook processing, rate limiting, and user management.

Auth (Clerk): JWT-based authentication. User sync via webhooks, token validation on every request.

Database (Turso): On-device SQLite via Turso. Fast, private, edge-ready. Cloud sync toggle for when you're ready.

Vector Search (LanceDB): Per-agent vector databases for semantic search. Context injection at session start so agents draw on relevant knowledge.

Electron Packaging: Security fuses at binary level, auto-update, native OS integration. Download-and-go deployment.

How I Built It + What I Learned

Seven weeks, two interleaved pathways, and Claude Code with five custom skills.

The Build Process

Thinking Pathway

Living strawman spec

Research spikes on uncertainty

Test assumptions, evolve the spec

Agents as thinking partners

Action Pathway

Tickets with acceptance criteria

Two-day sprints

Intermediate testing

Deploy and validate

Building reveals things that change the spec — they feed each other

The Toolkit

Claude Code: Primary agent for planning, implementation, testing, and iteration

5 custom skills

Developer: Implementation

Frontend: UI & Components

Architect: Spec & Decisions

PM: Tickets & Criteria

QA: Testing

Codex: Peer coding agent for architecture reviews and debugging

Each skill has protocols for how it takes an idea to implementation-ready work. A lot of critical thinking happens before anyone writes code.

What I Learned

Vibe coding is a recipe for disaster.

The most instructive failure pattern: skipping the thinking pathway. Whenever I skipped architecture and planning, whenever acceptance criteria were loose and I hadn't thought through implications — that's where things broke. AI amplifies whatever you feed it. Feed it rigor, you get rigorous output. Feed it vibes, you get expensive garbage.

Building is cheap. Knowing what to build is expensive.

With AI agents, the cost of building a feature drops dramatically. What matters is vision discipline — knowing WHAT to build, having tested your assumptions, maintaining a spec that reflects real needs. Building the wrong thing fast is worse than building the right thing slowly. That inverts the traditional PM instinct to cut scope.

Infrastructure compounds.

Skills, wiki, ticket breakdown process, testing protocols — it all felt like overhead at week 1. Paid for itself by week 4. Essential by week 6. I spent about a day setting up skills and skill protocols. That day paid for itself many times over. Three days would have compounded even more.

Honest assessment

Trafford is not ready for external users yet. That's not the current priority. I'd rather be honest about where it is. If I rebuilt from scratch with everything I know now, the architecture would be tighter. But the point wasn't perfect code — it's validated decisions and a deep understanding of what matters.

146 CLI unit tests · 143 memory pipeline tests · 14 Playwright E2E · 15-test production WebSocket harness

A Glimpse Into How Trafford Works + Where It's Going

The ask

"Prepare outreach briefs for 15 people I want to reconnect with."

Research Agent → Content Agent

2 autonomous decisions

11 of 15 briefs included existing touchpoints. Four were flagged as former teammates I hadn't realized were in my network.

The ask

"Write a customer proposal for [prospect] based on our discovery call notes."

Strategy Agent → Research Agent → Content Agent

1 autonomous decision

Final draft used the prospect's own terminology and referenced their recent company moves. One round of my edits, then ready to send.

The ask

"I found this open-source n8n workflow for competitive analysis. Can we integrate it?"

Iris

1 autonomous decision

New competitive analysis skill running in under an hour. Existing skills tested and confirmed working.

What's occupying my mind

Voice AI — Newer interaction experiences beyond text — speech-to-text workflows, but also low-latency voice agents with strong contextual grounding. Eventually, something that plugs into Trafford natively.

Local models — Open-weight models running locally for speed, privacy, and cost.

Managed agents as MCP servers — Agents that expose their capabilities as tools for other agents — like SaaS tools, but for AI. Capabilities can be dynamic and might even improve as more people use the same agent.

SDK independence — Anthropic Agent SDK is the foundation now. Long-term: the architecture should be portable across SDK changes.

Cost as a design constraint — Cost-per-experiment determines how many hypotheses you can test. Model routing by complexity — not everything needs frontier models.

Autonomous lab — Trafford as a self-improving system — autonomously benchmarking new models, testing new skills, evaluating new agent configurations. A laboratory that runs experiments while I sleep.

Building Trafford has been a journey of discovery in more ways than one. I've discovered a lot about agent building, about AI-native product development, and quite frankly a lot about what constitutes work itself — something a lot of us might have taken for granted all these decades.