ESSAY 7 Nov 2025

I Watched AI Generate a Perfect Todo App in 3 Minutes. Then I Spent 3 Days Fixing It.

Every AI coding tool demo starts the same way.

“Build me a todo app.”

Five words. Maybe ten seconds of typing. Then you sit back and watch the magic: files appear, databases materialize, endpoints generate themselves. The AI spins up authentication, adds a sleek frontend, writes tests. Three minutes later, you have a working application.

It’s impressive. It’s seductive. And for production software you’ll maintain for years, it’s a starting point at best—not a solution.

The Demo That Sells vs. The Code You Ship

I’ve spent five months deep in AI coding tools—Claude Code, claude-flow, and everything in between. I’ve watched hundreds of demos. I’ve read the marketing. And I’ve built actual production SaaS applications.

Here’s what the demos won’t tell you: that three-minute todo app works because it makes a thousand architectural decisions you never specified. And the moment your requirements diverge from those invisible assumptions, the whole thing falls apart.

Let me show you what I mean.

The Eight Decisions That Actually Matter

When you say “build me a todo app,” you think you’re giving clear instructions. But try building real production software and you’ll immediately hit these questions:

1. JWT Claims Structure

What exact fields go in your JWT payload?
Do you store roles as an array or a single string?
Where do permissions live? In the token? In the database?
Do you include user metadata or just an ID?

The demo picks one. It might not be the one you need. And changing it later? That’s not a refactor. That’s rearchitecting your entire auth system.
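
One way to force these answers into the open is to write the claims down as a type before any auth code exists. What follows is a minimal TypeScript sketch under one possible set of answers (roles as an array, permissions inside the token, only an ID and email as user data). The names AccessTokenClaims and isAccessTokenClaims are hypothetical; sub, iat, and exp are the registered claim names from the JWT spec.

```typescript
// Hypothetical claims shape: one possible set of answers to the questions above.
interface AccessTokenClaims {
  sub: string;                           // user ID (registered "subject" claim)
  email: string;                         // the only user metadata kept in the token
  roles: string[];                       // roles as an array, not a single string
  permissions: Record<string, string[]>; // e.g. { todos: ["read", "write"] }
  iat: number;                           // issued-at, seconds since epoch
  exp: number;                           // expiry, seconds since epoch
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((v) => typeof v === "string");
}

// Runtime guard for an already-decoded payload.
// It checks shape only; it does NOT verify the token's signature.
function isAccessTokenClaims(payload: unknown): payload is AccessTokenClaims {
  if (typeof payload !== "object" || payload === null) return false;
  const p = payload as Record<string, unknown>;
  const perms = p.permissions;
  return (
    typeof p.sub === "string" &&
    typeof p.email === "string" &&
    isStringArray(p.roles) &&
    typeof perms === "object" &&
    perms !== null &&
    !Array.isArray(perms) &&
    Object.values(perms as Record<string, unknown>).every(isStringArray) &&
    typeof p.iat === "number" &&
    typeof p.exp === "number"
  );
}
```

If the generated code stores roles as a single string, a guard like this rejects the payload at the boundary instead of letting the mismatch surface later as an authorization bug.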

2. Token Rotation

15-minute access tokens with 7-day refresh tokens?
Refresh token rotation on every use?
Where do you store refresh tokens—database, Redis, or in-memory?
httpOnly cookies or localStorage?

The demo makes a choice. You won’t know what it chose until you’re debugging your third session timeout bug in production.
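
To make the stakes concrete, here is a rough sketch of one combination of those answers: 7-day refresh tokens, rotated on every use. It is an in-memory stand-in, not a production store. RefreshTokenStore and its methods are hypothetical names, the token generator is injected (a real one should come from a cryptographically secure source), and the revoke-everything-on-reuse behavior is my own assumption rather than something the list above requires.

```typescript
// Hypothetical in-memory store; a real one would sit on a database or Redis.
interface RefreshRecord {
  userId: string;
  expiresAt: number; // milliseconds since epoch
  used: boolean;     // true once the token has been exchanged
}

const REFRESH_TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7-day refresh tokens

class RefreshTokenStore {
  private records = new Map<string, RefreshRecord>();
  private newToken: () => string;

  // newToken is injected; in production it should come from a CSPRNG.
  constructor(newToken: () => string) {
    this.newToken = newToken;
  }

  issue(userId: string, now: number): string {
    const token = this.newToken();
    this.records.set(token, { userId, expiresAt: now + REFRESH_TTL_MS, used: false });
    return token;
  }

  // Rotation on every use: the presented token is retired and a new one returned.
  // Returns null if the token is unknown, expired, or already used.
  rotate(token: string, now: number): string | null {
    const record = this.records.get(token);
    if (!record || record.expiresAt <= now) return null;
    if (record.used) {
      // Assumption: reuse of a retired token is treated as theft,
      // so every token belonging to that user is revoked.
      this.revokeAll(record.userId);
      return null;
    }
    record.used = true;
    return this.issue(record.userId, now);
  }

  private revokeAll(userId: string): void {
    this.records.forEach((record, token) => {
      if (record.userId === userId) this.records.delete(token);
    });
  }
}
```

Every branch in rotate is a policy decision. If the tool picked a different policy for you, this is the code you will be reading during that third session timeout bug.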

3. UI Library

shadcn/ui? Material-UI? Chakra? Ant Design? Headless UI?
Tailwind CSS or CSS-in-JS?
Which component patterns?

“Use a modern UI library” means nothing. I needed shadcn/ui specifically because it works with my design system, ships minimal JavaScript, and uses Tailwind. The demo gave me Material-UI. That’s not a theme change—that’s rebuilding the entire frontend.

4. Stripe Integration

Checkout flow or Payment Intents?
Subscription model or one-time payments?
Customer portal or custom UI?
Which webhooks do you handle?

The difference isn’t cosmetic. Checkout is a prebuilt payment flow that Stripe hosts for you; Payment Intents is the lower-level API you drive from your own payment UI. Choosing wrong means rewriting your entire billing integration.

5. Email Provider

SendGrid? Resend? Postmark? AWS SES?
Template system?
Transactional vs. marketing?

Each provider has different APIs, rate limits, pricing models, and deliverability characteristics. “Add email notifications” doesn’t specify any of this.

6. Database ORM

Prisma? Drizzle? TypeORM? Kysely?
Type generation approach?
Migration strategy?

Your ORM choice affects type safety, migration workflows, query performance, and deployment strategy. It’s not swappable. It’s foundational.

7. Testing Framework

Vitest? Jest? Mocha?
Supertest for integration tests?
What coverage target?

The testing framework dictates how you structure tests, handle mocks, and integrate with CI/CD. Changing it later means rewriting every test.

8. Deployment Target

Vercel? AWS? Docker Compose? Railway?
What platform-specific features do you need (on Vercel, for example)?
Environment variable strategy?
Database hosting (Neon? Supabase? RDS?)?

Deployment isn’t the last step. It shapes your entire architecture—serverless vs. long-running, filesystem access, background jobs, caching strategies.

The “Just Refactor It” Myth

When I point this out, the response is always: “Just refactor what the AI generated.”

Have you actually tried this?

Swapping Prisma for Drizzle isn’t a find-and-replace operation. It means:

Rewriting your schema in a different DSL
Changing how you handle migrations
Updating every database query
Modifying your type generation
Adjusting your seeding scripts
Updating your testing setup

We’re not talking about an afternoon. We’re talking about days of work. And that’s for ONE of these eight decisions.

Change the ORM, the UI library, and the auth token structure? You’re not refactoring. You’re rebuilding.

What “Build Me an App” Actually Produces

Here’s the brutal truth: autonomous AI tools generate generic boilerplate that matches their training data’s most common patterns.

They give you:

Whatever stack is most popular on GitHub
Whatever patterns appear most in tutorials
Whatever architecture is easiest to generate

They don’t give you:

Your company’s conventions
Your infrastructure constraints
Your team’s expertise
Your product’s specific requirements

The demo works because demos don’t have requirements. Real projects die in the gap between “an app” and “our app.”

Why This Matters for Production Code

If you’re at a big company with a team of 20 engineers, maybe you can absorb the rebuild cost. You have engineering hours to burn. You have people to maintain legacy code while others refactor.

Most of us don’t have that luxury.

Whether you’re building solo, on a small team, or shipping client work, you’re living with every architectural decision for years. You can’t afford to spend three days ripping out Material-UI because an autonomous tool decided that’s what “modern UI library” meant. You can’t afford to rebuild your auth system because the JWT structure doesn’t match your API contracts. You can’t afford to rewrite your billing integration because the tool guessed Checkout when you needed Payment Intents.

Wrong architectural decisions compound. When you’re responsible for maintaining the code, whether for yourself, a small team, or a client, you need to understand and own those decisions.

That’s why production code requires control, not autonomy.

The Interactive Alternative

Compare that to working with Claude Code:

Me: “Add authentication to this project.”

Claude Code: “I can help with that. A few questions:

JWT or session-based auth?
If JWT, what should the token payload include?
Where should refresh tokens be stored?
What’s your refresh token rotation strategy?”

Me: “JWT. Payload should have userId, email, roles as an array, and permissions as a nested object. Refresh tokens in database with rotation on every use. 15-minute access, 7-day refresh. httpOnly cookies.”

Claude Code: “Got it. I’ll implement that exactly.”

The specification happened through dialogue. I clarified the architectural decisions before any code was written. The AI generated exactly what I specified, not what it guessed I might want.
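
Written down, the answers from that exchange fit in a small typed object. This is a sketch of how I might record them, not what Claude Code produced. AuthSpec, authSpec, and refreshCookieHeader are hypothetical names, and the cookie name, Secure flag, SameSite value, and Path were not part of the dialogue; they are assumptions added so the example is complete.

```typescript
// Hypothetical record of the answers given in the dialogue above.
interface AuthSpec {
  accessTtlSeconds: number;
  refreshTtlSeconds: number;
  refreshStorage: "database" | "redis" | "memory";
  rotateOnEveryUse: boolean;
  cookie: {
    name: string;
    httpOnly: boolean;
    secure: boolean;
    sameSite: "Strict" | "Lax" | "None";
  };
}

const authSpec: AuthSpec = {
  accessTtlSeconds: 15 * 60,            // 15-minute access tokens
  refreshTtlSeconds: 7 * 24 * 60 * 60,  // 7-day refresh tokens
  refreshStorage: "database",
  rotateOnEveryUse: true,
  // name, secure, and sameSite are assumptions; only httpOnly came from the dialogue.
  cookie: { name: "refresh_token", httpOnly: true, secure: true, sameSite: "Strict" },
};

// Builds a Set-Cookie header value for the refresh token from the spec.
function refreshCookieHeader(spec: AuthSpec, token: string): string {
  const parts = [
    `${spec.cookie.name}=${encodeURIComponent(token)}`,
    `Max-Age=${spec.refreshTtlSeconds}`,
    "Path=/", // assumption: cookie is valid for the whole site
    `SameSite=${spec.cookie.sameSite}`,
  ];
  if (spec.cookie.httpOnly) parts.push("HttpOnly");
  if (spec.cookie.secure) parts.push("Secure");
  return parts.join("; ");
}
```

Calling refreshCookieHeader(authSpec, "abc") returns "refresh_token=abc; Max-Age=604800; Path=/; SameSite=Strict; HttpOnly; Secure". Six months later, that object is where I would look first when a token question comes up.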

When the auth system is running in production six months later and I need to debug a token issue, I understand every decision because I made every decision. I’m not reverse-engineering someone else’s assumptions. I’m working with my own architecture.

When Autonomy Actually Works

Autonomy isn’t wrong—it’s just context-dependent. There are places where “just handle it” is absolutely the right answer:

README generation: Standard Markdown structure is fine
ESLint configuration: Default configs work for most cases
.gitignore files: Use the templates
Boilerplate CRUD endpoints: If they follow established patterns exactly
Prototypes you’ll throw away: Exploration where decisions don’t matter yet

These are low-stakes decisions with high standardization. Getting them “wrong” doesn’t cascade. You can change them later without rebuilding your application. Single-prompt generation shines here.

But authentication? Database schema? Tech stack? These are high-stakes, foundational decisions with cascading effects. This is where precision matters and guesswork fails.

The Autonomy Illusion

Here’s what the AI tool marketing doesn’t tell you:

Adding more agents doesn’t mean better code. It means less control.

Sophisticated orchestration doesn’t mean better results. It means more complexity hiding the same specification problem.

“Just describe what you want” doesn’t work when architectural decisions require precision that natural language can’t provide.

I tested claude-flow—a sophisticated multi-agent system with 10+ agent templates, health monitoring, auto-scaling, 3-tier memory, and 60+ task types. Impressive infrastructure. But it still runs on string-based specifications. When I asked for shadcn/ui, there was no type safety, no validation, no guarantee the agent would interpret “shadcn/ui” as “shadcn/ui and absolutely nothing else.”

The specification layer is still natural language. And natural language is ambiguous.
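
For contrast, here is a rough idea of what a non-string specification layer could look like. This is not claude-flow’s API or any real tool’s; StackSpec, parseStackSpec, and the option lists are hypothetical, with the options taken from the lists earlier in this essay. The point is only that a closed set of values can be validated, while a free-form sentence cannot.

```typescript
// Hypothetical closed sets of allowed choices; anything else is rejected.
const UI_LIBRARIES = ["shadcn/ui", "material-ui", "chakra", "ant-design", "headless-ui"] as const;
const ORMS = ["prisma", "drizzle", "typeorm", "kysely"] as const;

type UiLibrary = (typeof UI_LIBRARIES)[number];
type Orm = (typeof ORMS)[number];

interface StackSpec {
  ui: UiLibrary;
  orm: Orm;
}

// Validates untrusted input (for example, a parsed JSON file) against the closed sets.
// Throws on anything that is not an exact match.
function parseStackSpec(input: { ui: string; orm: string }): StackSpec {
  const ui = UI_LIBRARIES.find((u) => u === input.ui);
  const orm = ORMS.find((o) => o === input.orm);
  if (ui === undefined) throw new Error(`Unknown UI library: ${input.ui}`);
  if (orm === undefined) throw new Error(`Unknown ORM: ${input.orm}`);
  return { ui, orm };
}
```

With this shape, parseStackSpec({ ui: "shadcn/ui", orm: "drizzle" }) succeeds, while parseStackSpec({ ui: "a modern UI library", orm: "drizzle" }) throws before any code is generated. Validating the spec does not guarantee an agent follows it, but it removes one layer of guessing.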

The Real Question

The question isn’t “Can AI build an app from a single prompt?”

The answer to that is yes. Absolutely. The demos prove it.

The real question is: “Can AI build YOUR app—with YOUR architecture, YOUR conventions, YOUR constraints—from a single prompt?”

The answer to that is no.

Not because the AI isn’t capable of generating code. It’s excellent at that.

But because “build me an app” leaves a thousand architectural decisions unspecified. And every one of those decisions matters when you’re shipping production software you’ll maintain for years.

What Works Instead

After five months of research, building real projects, and testing multiple tools, here’s what actually works:

Start with control:

Make architectural decisions consciously
Specify tech stack, libraries, patterns explicitly
Use interactive tools that let you clarify requirements
Review and understand what’s being generated

Move to autonomy for execution:

Once patterns are established, autonomous tools can replicate them
Use autonomy for boilerplate that follows decided patterns
Let AI handle repetition, not decision-making

Return to control for integration:

Debugging requires understanding
Maintenance requires ownership
Evolution requires knowing why decisions were made

The cycle is: design with control, execute with autonomy, integrate with control.

Not: autonomous generation followed by days of “just refactor it.”

The Real Power of AI Coding

The promise of AI coding tools isn’t “describe an app in five words and get perfect code.”

The promise is: “Make architectural decisions at the speed of thought, then have those decisions implemented flawlessly.”

Interactive AI tools let you think at the architecture level while the AI handles the implementation level. You make decisions. The AI writes code. You maintain control and understanding. The AI handles the tedious translation from intent to syntax.

That’s the real 10x improvement.

Not “build me an app” magic that produces generic boilerplate you’ll spend days rebuilding.

But the ability to say “JWT with these exact claims, refresh rotation with this lifecycle, stored in httpOnly cookies” and get exactly that. First try. No guessing. No rebuilding.

The Bottom Line

If you’re building serious software—production SaaS, client projects, anything you’ll maintain beyond next week—you need to understand what you’re building.

Autonomous tools that guess at your architecture don’t save time if you spend days fixing wrong assumptions.

Code you don’t understand becomes a liability the moment something breaks.

Decisions you never made can’t evolve with your requirements.

Control isn’t about micromanaging the AI. It’s about owning the architecture of software you’re responsible for maintaining.

The demos are impressive. The marketing is seductive. The promise of “just describe it” is tempting—and genuinely useful for the right contexts.

But for production software with real requirements, real constraints, and real consequences? Interactive tools that let you specify precisely what you need will outperform autonomous guesswork every time.


Be skeptical of demos. Demand control. Ship code you understand.