How we build Oda

The Skill Game

The hard optimization problem we play every day.

The fundamental insight

An AI agent is a while-true loop. Each iteration calls a model — and each call burns tokens, money, and time. The agent loops until it decides it's "done."

while not done:
    think()    # burns tokens. every call is real dollars.
    act()      # calls a tool: LLM, external model, script, API
    observe()  # reads the result, adds to context

The game: maximize output quality while minimizing total cost. This is a compression problem. We compress human expertise into things the agent reads before it starts: instructions, utilities, and exit criteria. The better the compression, the fewer tokens the agent wastes rediscovering what we already know.
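To make the loop and its cost concrete, here is a minimal sketch in Python. It is an illustration, not Oda's implementation: `AgentState`, `run_agent`, `fake_think`, `fake_act`, and the token budget are all hypothetical names introduced here, and the model call is replaced by a stub that charges a fixed number of tokens per thought.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AgentState:
    context: List[str] = field(default_factory=list)  # what the agent has read so far
    tokens_used: int = 0                               # running total across loops
    loops: int = 0


def run_agent(
    think: Callable[[List[str]], Tuple[str, int]],  # returns (next action, tokens spent)
    act: Callable[[str], str],                       # runs a tool, returns its result
    is_done: Callable[[List[str]], bool],            # the exit criterion
    instructions: List[str],                         # expertise loaded before loop one
    token_budget: int,
) -> AgentState:
    state = AgentState(context=list(instructions))
    while not is_done(state.context):
        if state.tokens_used >= token_budget:
            break  # hard stop: never pay for an unbounded loop
        action, spent = think(state.context)
        state.tokens_used += spent
        state.context.append(act(action))  # observe: the result joins the context
        state.loops += 1
    return state


# Stand-ins for real model and tool calls (hypothetical, fixed cost per thought).
def fake_think(context: List[str]) -> Tuple[str, int]:
    return f"step-{len(context)}", 100


def fake_act(action: str) -> str:
    return f"result of {action}"
```

Calling `run_agent(fake_think, fake_act, lambda ctx: len(ctx) >= 4, ["instructions"], token_budget=1_000)` exits after 3 loops and 300 tokens. Every loop the instructions make unnecessary is a `think` call that never gets billed.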

LLMs are inherently probabilistic: ask the same question twice and you may get two different answers. Our job is to make this process as deterministic as possible. Not by fighting the randomness, but by constraining it: proven templates, strict file structures, verification scripts that catch drift, and exit criteria that leave no room for interpretation. The creative parts stay free. The mechanical parts become predictable.

The line

There's a line that separates what any AI can do from what actually matters. Below it: generating a website, drafting an email, creating a logo. Anyone can build these. They're commodities.

Above it: sending the email, tracking the reply, following up in 3 days, and updating your pipeline. Posting on your Instagram, analyzing what works, adjusting the strategy. Booking the meeting, sending the invoice, handling the support ticket. Real actions, on real services, with real consequences.

ChatGPT can draft an email. It cannot send it. That gap is where Oda lives.

We optimize both sides — but with different weights.

Below the line — hooks

Brand kits, webapps, landing pages. These get users in the door. Our edge isn't quality alone; it's speed and cost. If we produce a brand kit in 3 loops where competitors take 12, we spend roughly 4x less per user, assuming each loop costs about the same. At scale, that's survival.

Optimize for: fewer tokens, lower cost, good-enough quality fast.

Above the line — product

Instagram management, email outreach, CRM, invoicing. These touch real services, real data, real money. A wrong Instagram post is worse than a slow one.

Optimize for: reliability, accuracy, human-in-the-loop at critical moments.

The framework is the same. The weights change.

The four levers

1. Instructions — compressed expertise

Everything the agent knows before its first loop. The difference between an expert and a beginner. An expert doesn't deliberate over which database to use — they already know. One decision eliminated, one loop saved.

The trap is over-specifying. 200 lines of instructions means the agent spends tokens reading and weighing them. Be prescriptive where fragile, free where creative, silent where obvious.

2. Utilities — crystallized loops

A utility is a loop that was expensive once, solved, and frozen into a script. Instead of the agent spending thousands of tokens debugging a setup across 5 loops, a script does it in one deterministic call, and the script itself spends zero LLM tokens.

The most undervalued form is verification. Without it, the agent has to reason about whether it's done — and reasoning is the expensive thing. A script that returns "pass" or "fail: these 3 specific issues" replaces thousands of reasoning tokens.
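A verification utility of this kind can be very small. The sketch below assumes a hypothetical skill that must produce an `index.html` with a title and a `manifest.json` with a `name` field. Those file names and rules are invented for illustration; the point is the shape of the return value, either "pass" or a list of specific issues.

```python
import json
from pathlib import Path
from typing import List


def verify_output(output_dir: Path) -> str:
    """Return "pass" or "fail: <issues>" for a hypothetical skill output folder."""
    issues: List[str] = []
    index = output_dir / "index.html"
    manifest = output_dir / "manifest.json"

    if not index.is_file():
        issues.append("index.html is missing")
    elif "<title>" not in index.read_text(encoding="utf-8"):
        issues.append("index.html has no <title>")

    if not manifest.is_file():
        issues.append("manifest.json is missing")
    else:
        try:
            data = json.loads(manifest.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            issues.append("manifest.json is not valid JSON")
        else:
            if not isinstance(data, dict) or "name" not in data:
                issues.append('manifest.json has no "name" field')

    # The agent reads one short string instead of reasoning about the files.
    return "fail: " + "; ".join(issues) if issues else "pass"
```

The agent still pays a few tokens to read the result, but the check itself is plain file I/O, so it returns the same answer every time it runs on the same folder.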

3. Model selection — the right brain for the job

Not every step needs the same model. A cheap model verifies. A mid-range model writes code. An expensive model makes critical judgment calls. A skill that uses the expensive model for everything can cost 5-30x more than one that uses it only for the step that truly needs it.
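The effect of routing is easy to check with arithmetic. In the sketch below, the tier names, the prices per 1,000 tokens, and the token counts per step are all made-up assumptions chosen for illustration, not real vendor pricing or measurements.

```python
from typing import Dict, List, Tuple

# Hypothetical prices in dollars per 1,000 tokens (assumed, not real pricing).
PRICE_PER_1K: Dict[str, float] = {"cheap": 0.0005, "mid": 0.003, "expensive": 0.06}

# Hypothetical skill: (step, cheapest tier that is good enough, tokens consumed).
STEPS: List[Tuple[str, str, int]] = [
    ("verify file structure", "cheap", 2_000),
    ("write code", "mid", 20_000),
    ("judge final quality", "expensive", 3_000),
]


def cost_routed(steps: List[Tuple[str, str, int]]) -> float:
    """Each step runs on the tier assigned to it."""
    return sum(PRICE_PER_1K[tier] * tokens / 1000 for _, tier, tokens in steps)


def cost_single_tier(steps: List[Tuple[str, str, int]], tier: str) -> float:
    """Every step runs on the same tier, ignoring the assignment."""
    return sum(PRICE_PER_1K[tier] * tokens / 1000 for _, _, tokens in steps)
```

With these assumed numbers, `cost_routed(STEPS)` is about $0.24 and `cost_single_tier(STEPS, "expensive")` is $1.50, roughly a 6x gap. The exact ratio depends entirely on the prices and the token mix.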

4. Definition of done — the exit condition

The most important lever and the hardest to get right. Without it, you're paying for an infinite loop — or the agent stops too early and delivers garbage.

Two layers. Structural correctness (does it compile? does it run?) is verified by scripts at zero token cost. Quality judgment is verified by calibrated AI judges with anchored criteria: not vibes, but "rate simplicity: 3 = cluttered, 6 = clean, 9 = iconic."
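The two layers can be sketched as a single exit check. This is an assumption-heavy illustration: the artifact is taken to be Python source so the structural layer can use the built-in `compile`, the judge is a hypothetical callable standing in for a model call, and the 1-to-10 scale and the threshold of 6 are invented here. Only the three anchors come from the text.

```python
from typing import Callable, Dict

# Anchors from the text: each score is tied to a concrete description.
SIMPLICITY_ANCHORS: Dict[int, str] = {3: "cluttered", 6: "clean", 9: "iconic"}


def build_judge_prompt(criterion: str, anchors: Dict[int, str]) -> str:
    """Turn anchored criteria into an unambiguous rating instruction."""
    scale = ", ".join(f"{score} = {label}" for score, label in sorted(anchors.items()))
    return f"Rate {criterion} from 1 to 10. Anchors: {scale}. Reply with one integer."


def compiles(source: str) -> bool:
    """Layer 1: structural check, no model involved."""
    try:
        compile(source, "<artifact>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def is_done(source: str, judge: Callable[[str, str], int], threshold: int = 6) -> bool:
    """Layer 2 runs only if layer 1 passes, so broken output never reaches the judge."""
    if not compiles(source):
        return False
    prompt = build_judge_prompt("simplicity", SIMPLICITY_ANCHORS)
    return judge(prompt, source) >= threshold
```

Ordering matters: the free check runs first, so the paid judgment is only spent on artifacts that are at least structurally sound.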

The real cost function

Cost(skill) = Σ (tokens × price_per_token)
            + Σ (external_api_calls × price_per_call)

This is not loops × flat rate. A verification loop costs ~$0.001. A reasoning loop with a powerful model costs ~$0.30. The cost is dominated by token volume and model choice, not loop count.
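The cost function translates directly into code. In this sketch, `ModelCall`, `ApiUsage`, and `skill_cost` are hypothetical names, and the per-token prices are assumptions picked so that the two example loops land on the ballpark figures quoted above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ModelCall:
    tokens: int
    price_per_token: float  # dollars


@dataclass(frozen=True)
class ApiUsage:
    calls: int
    price_per_call: float  # dollars


def skill_cost(model_calls: Iterable[ModelCall], api_usage: Iterable[ApiUsage]) -> float:
    """Cost(skill) = sum(tokens * price_per_token) + sum(calls * price_per_call)."""
    token_cost = sum(c.tokens * c.price_per_token for c in model_calls)
    api_cost = sum(u.calls * u.price_per_call for u in api_usage)
    return token_cost + api_cost


# Assumed figures: a cheap verification loop and a reasoning loop on a powerful model.
VERIFY_LOOP = ModelCall(tokens=1_000, price_per_token=0.000001)   # about $0.001
REASON_LOOP = ModelCall(tokens=10_000, price_per_token=0.00003)   # about $0.30
```

Under these assumptions, ten verification loops plus one reasoning loop cost about $0.31, and the single reasoning loop accounts for almost all of it. That is why counting loops alone says little about what a skill costs.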

The other hard problem: UX

Optimizing the agent is half the game. The other half is how you experience the result.

We're mobile-first. We genuinely believe the future of running a business is from your phone — checking your pipeline on the metro, approving an Instagram post between meetings, reviewing a lead while waiting for coffee. The web version exists for when you need more space, but the phone is the primary interface.

This creates its own set of hard problems. How do you preview a full webapp the agent just built — on a 6-inch screen? How do you let the user edit a generated brand kit without rebuilding Figma? How do you show a 20-step agent process without overwhelming someone who just wants to see the result?

These aren't solved problems. We iterate on them daily — testing on real devices, watching how founders actually use the app, adjusting. The UX is as much a craft as the AI.

What gets better over time

Three things improve the more Oda is used — and they feed each other.

Your connections. Every service you connect — Instagram, Gmail, calendar, Stripe — makes Oda more useful. Not just for that service, but across all of them. The lead who liked your Instagram post also emailed you last week and has a meeting on Tuesday. Oda sees the full picture because everything is wired together.

Your context. Oda learns your business — that croissant posts outperform bread 3:1 on Tuesdays, that Jean-Pierre said "maybe" last week and needs a follow-up, that your regulars prefer stories over reels. The longer you use it, the less you need to explain.

Everyone's feedback. Every time someone rejects a brand kit or tweaks a caption, that correction makes the skill better for the next person. Oda learns what "good" looks like from real usage — not from benchmarks or synthetic tests.

Each of these is its own hard problem. How does the agent remember what matters without drowning in context? How do corrections from one user improve things without breaking what works for another? We don't have all the answers. We work on it every day.

Models move fast. That's fine.

Here's the honest truth: some of what we spend weeks optimizing today will be trivial in six months. A single call to a future model might do what currently takes 12 loops and 3 utilities. We know this.

It doesn't change the strategy. When a better model drops, we plug it in. The skills get cheaper and faster overnight. But the definitions of done, the quality criteria, the verification scripts, the production-calibrated judges, and the accumulated per-business knowledge — none of that comes from a model upgrade. That's built over months by shipping, measuring, and iterating.

So we stay focused. Head down. Keep optimizing. The models get better, and so does everything we've built on top of them.

Where we are

We're at 0.1% of what we think is possible. And we're not alone in thinking the gap is enormous — Anthropic's Economic Index (January 2026) found that while 94% of tasks in some fields are theoretically exposed to AI, actual usage sits at around 33%.

We think business operations is where that gap is widest. The models are capable. The integrations exist. What's missing is the layer that ties it all together — instructions, verification, context, and the patience to optimize every skill until it's reliable enough to trust with your real business.

That's the game we play. Every day we ship new capabilities. Every day we tune the levers. The roadmap is infinite — every real task a business does is a potential skill. We find out what works by shipping, measuring, and iterating.
