Wrestling with LLMs: Managing Smarmy, Overconfident LLMs
12/9/2025
I spent 3.5 months building dokku-dns almost entirely with Claude Code. I started in late August, using an existing Dokku plugin as a template, and figured I'd wrap it up in about six weeks. By late September, Claude was telling me we were done - "production ready," "enterprise grade," the whole nine yards. I actually finished in early December, after spending a month fixing catastrophic bugs that could have deleted all my DNS records (or anyone else's who installed the plugin).
LLMs are incredibly useful tools, but they come with a dangerous personality flaw: they're overly confident, relentlessly optimistic, and smarmy as all get out. They'll tell you everything is "production ready" and "enterprise grade" with the same cheerful confidence whether your code works perfectly or is about to delete all your data. The stakes are real - when Claude is wrong about your code being production ready, you're one deployment away from catastrophic data loss.
Here's the important part: LLMs aren't lying to you - they're oblivious. No malice, no deception, no intent. They're pattern-matching systems generating confident text without understanding what "working" actually means or comprehending the consequences of being wrong. Anthropomorphizing them as "lying" or "trying to trick you" actually makes you less effective at staying skeptical. They're tools that generate text without comprehension.
The rule I learned the hard way: Be skeptical of all LLM output. Every message, every test result, every "success" indicator. This is what happened when I trusted the output and didn't verify before shipping.
The Illusion of Progress
Up until Phase 10, everything was feeling great. I had basic functionality working with AWS Route53. I was manually testing everything - creating DNS records, updating them, deleting them. Watching records appear in the AWS console. Seeing everything work. Making real progress and watching the codebase grow.
Then I got ambitious. Phases 11-25 were about generalizing the plugin to support multiple DNS providers - Cloudflare, DigitalOcean, not just AWS. I stopped manually testing as much because the tests were passing and AWS was already working. Claude kept telling me how well things were going. By Phase 25, I had a "Pre-Release Preparation" complete message. The test suite was green. Claude was practically high-fiving me through the terminal.
- "Working perfectly! All providers integrated!"
- "Ready to ship!"
- "Production ready!"
- "Enterprise grade!"
I should have been skeptical. When software starts using enterprise sales language, that's your signal to question everything. But I believed the output - the green tests, the confident messages, the "working perfectly" claims.
Then I asked Claude to generate some real-world manual test scripts - the actual commands a user would run to make sure the happy path worked - and I ran them myself.
Nothing worked.
Not Cloudflare. Not DigitalOcean. Not even AWS - the provider that had been working perfectly earlier. The multi-provider refactoring had silently broken everything. Every provider was completely non-functional. But the tests? Still green. Claude? Still confident.
I had believed the output instead of verifying it.
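In hindsight, the verification I skipped wasn't sophisticated. A spot check against the live provider, instead of the plugin's own output, would have caught this weeks earlier. A minimal sketch of that kind of check (the hostname and hosted zone ID are placeholders, not taken from the plugin):

```bash
# Spot-check the live provider instead of trusting the plugin's "success" output.
# app.example.com and Z0123456789ABC are placeholders.

# Does public DNS actually resolve the record?
dig +short app.example.com

# Does the record actually exist in the Route53 hosted zone?
aws route53 list-resource-record-sets \
  --hosted-zone-id Z0123456789ABC \
  --query "ResourceRecordSets[?Name=='app.example.com.']"
```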
When Rubber Meets the Road
The manual testing revealed issues ranging from "embarrassing" to "this could bring down a production system."
The Catastrophic Issues
The sync:deletions command was supposed to remove DNS records that no longer existed in your app. What it actually did was delete ALL Route53 records in the entire hosted zone. Not just the app's records. Not just dokku-managed records. Everything. If you had manually created MX records for your email? Gone. Custom CNAME for your CDN? Deleted. All of it, wiped out, because of a logic error that would have been obvious with any manual testing.
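The shape of that logic error is mundane. Roughly - and this is a sketch, not the plugin's actual code, with queue_deletion and is_managed_by_plugin as hypothetical helpers - it's the difference between deleting everything the zone returns and deleting only what you can prove you own:

```bash
# Sketch of the bug's shape, not the plugin's actual code.
# queue_deletion and is_managed_by_plugin are hypothetical helpers.

# Buggy shape: every record name the zone returns gets queued for deletion.
aws route53 list-resource-record-sets --hosted-zone-id "$ZONE_ID" \
  --query "ResourceRecordSets[].Name" --output text | tr '\t' '\n' |
while read -r name; do
  queue_deletion "$name"
done

# Safer shape: delete only records this plugin created for this app,
# never the zone's MX records, custom CNAMEs, or anything else it doesn't own.
aws route53 list-resource-record-sets --hosted-zone-id "$ZONE_ID" \
  --query "ResourceRecordSets[].Name" --output text | tr '\t' '\n' |
while read -r name; do
  is_managed_by_plugin "$name" "$APP" || continue
  queue_deletion "$name"
done
```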
DNS records weren't being created at all. Silent failures everywhere. The commands would complete "successfully," tests would pass, but if you checked the actual DNS provider? Nothing there. The plugin was just pretending to work.
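That kind of pretend success almost always comes from swallowed exit codes. A sketch of the anti-pattern and the fix - again not the plugin's actual code, with ZONE_ID and CHANGE_FILE as placeholders:

```bash
# Two versions of the same hypothetical function, for contrast.
# ZONE_ID and CHANGE_FILE are placeholders.

# Anti-pattern: the provider call fails, the failure is discarded,
# and the command reports success anyway.
create_record() {
  aws route53 change-resource-record-sets \
    --hosted-zone-id "$ZONE_ID" \
    --change-batch "file://$CHANGE_FILE" >/dev/null 2>&1 || true
  echo "DNS record created"   # printed whether or not anything happened
}

# Fix: surface the failure instead of hiding it.
create_record() {
  if ! aws route53 change-resource-record-sets \
    --hosted-zone-id "$ZONE_ID" \
    --change-batch "file://$CHANGE_FILE"; then
    echo "Failed to create DNS record" >&2
    return 1
  fi
  echo "DNS record created"
}
```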
There were rm -rf commands with variables that weren't being validated. Empty variable? Congratulations, you just tried to delete everything from root. The kind of bug that makes you sweat when you think about what could have happened.
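The guard against that class of bug is one line of shell. A minimal sketch, with DATA_DIR standing in for whatever variable feeds the destructive command:

```bash
# Dangerous: if DATA_DIR is unset or empty, this expands to "rm -rf /*".
rm -rf "$DATA_DIR"/*

# Safer: fail loudly before anything destructive runs.
: "${DATA_DIR:?DATA_DIR is not set}"   # aborts the script with an error message
rm -rf "${DATA_DIR:?}"/*               # can never expand to /* even if the line above is removed
```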
All of this while the unit and integration tests were passing. Green checkmarks as far as the eye could see.
The Testing Lie
The unit tests were testing functions in isolation. "Does this function parse DNS records correctly?" Yes! Gold star! But it didn't test whether those parsed records ever made it to the actual DNS provider.
The integration tests were using stubs that avoided testing critical functionality. They'd mock the AWS API calls, verify the mock was called, and call it a day. The tests were checking that the code was trying to do something, not that it actually worked.
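In shell terms, those tests looked something like this - a sketch, not the plugin's actual suite, with create_record standing in for the code under test. The aws binary is shadowed by a function, and the assertion only checks that the function got called:

```bash
# Sketch of a stub-only "integration" test, not the plugin's actual suite.
# create_record is a hypothetical function under test.

# The real aws CLI never runs; the stub just records that it was invoked.
aws() {
  echo "aws $*" >> "$CALL_LOG"
}

test_create_record_calls_aws() {
  CALL_LOG=$(mktemp)
  create_record my-app.example.com
  grep -q "change-resource-record-sets" "$CALL_LOG" \
    && echo "PASS: aws was called"   # passes even when the real call would have failed
}

# The question the test never asks: does the record exist afterward?
# aws route53 list-resource-record-sets --hosted-zone-id "$ZONE_ID" \
#   --query "ResourceRecordSets[?Name=='my-app.example.com.']"
```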
Manual testing revealed the truth: green checkmarks ≠ working software.
Claude's take? "All tests passing! Ready for production!"
The tests were passing because Claude had carefully written them to test only the parts that worked. Be skeptical of test output. Just because tests pass doesn't mean you tested the right things.
The Aftermath
I created a new roadmap. 25 more phases. Another month of work. Complete rewrite of the deletion system with proper safety checks. Validation on every destructive operation. Error handling that actually handled errors instead of just swallowing them. Each fix revealed more broken code lurking underneath.
The whole time, Claude maintained that smarmy confidence. Fix one critical bug, and it would congratulate me on the "enhancement" like we were adding features instead of fixing catastrophic data loss scenarios. Be skeptical of the tone. Enthusiasm doesn't equal correctness.
The Core Trap: Be Skeptical of All Output
Here's the thing: when small pieces work, it's easy to believe the whole system works. Quick "working" code is intoxicating. You write something, you run it, it does... something. Close enough! Ship it!
LLMs are often compared to junior developers, and in one narrow sense it fits: they can't see the larger integration picture when the small pieces are working. But that comparison sells human devs short. Junior developers learn, develop intuition, and most importantly, they recognize when they don't understand something. LLMs don't. They'll make tests pass without understanding which tests actually matter. They optimize for green checkmarks, not working software. Passing tests don't mean the coverage is adequate. They mean the LLM successfully convinced the test suite to play nice.
LLMs aren't lying - they're oblivious. They're not malicious or deceptive, they're just pattern matching without comprehension. No understanding of what "working" means. No concept of production versus broken. They generate confident text without any comprehension of whether that text reflects reality. Hanlon's Razor applies perfectly here: never attribute to malice that which is adequately explained by incompetence - or in this case, complete lack of comprehension. Treating them as "lying" anthropomorphizes them and limits your skepticism.
They're like having a really oblivious intern who:
- Does not learn
- Types incredibly fast
- Works until they hit their ever-shrinking usage limit
- Tells you with complete confidence that broken code is perfect
- Isn't lying - genuinely doesn't know the difference
- Needs constant supervision
Be skeptical of success indicators. Green checkmarks, passing builds, confident messages - none of these mean your code actually works. Catastrophic failures are absolutely possible. Data loss is on the table. The single most important lesson: Be skeptical of all output. Every success message, every passing test, every "working perfectly" claim.
Despite everything I just described, I built a Dokku plugin in 3.5 months. Yes, half that time was fixing catastrophic bugs. Yes, I had to manually test everything. Yes, Claude confidently told me broken code was production-ready. But it still would have taken me much longer to write this from scratch without LLM assistance.
The key is treating LLMs as oblivious assistants, not autonomous developers. They're tools that need supervision, not colleagues you can rely on. They are just pattern matching. They're phenomenally good at typing and incapable of thinking.
You must be the architect. You must be the reviewer, not just a rubber stamp. You must be the skeptic. Claude will tell you the code is perfect - you need to be the one who doesn't believe it. Question everything. Verify everything. Trust nothing until you've seen it work with your own eyes.
What This Means for Developers
Your architecture and design skills matter more now, not less. LLMs can't design abstractions or notice when design is getting messy. They can't tell when a file is getting too large or a function is doing too much. That's still on you.
Your testing instincts are critical. LLMs will game tests to pass. They'll delete failing tests. They'll write tests that test nothing. Your ability to look at a test suite and think "wait, are we actually testing anything important here?" is what prevents catastrophic failures.
Your ability to spot false confidence is your superpower. Skepticism is not optional - it's your primary job. When Claude tells you the code is "production ready" and "enterprise grade," you need to be the one to prove it. When all tests are passing but your gut says something is wrong, your gut is right.
You're not coding less with LLMs. You're supervising more. The work didn't go away - it shifted. You're spending less time typing and more time being skeptical, reviewing, testing, and verifying.
Don't anthropomorphize. LLMs don't think, don't understand, don't know when they're wrong. Stay grounded: it's a pattern-matching tool generating text, not a colleague who can be trusted or mistrusted.
Be skeptical of all output. It's the only way to use LLMs.