Stop Letting Your AI Tool Make Commits You Can’t Explain

The data on AI code quality is out, and it is not flattering.

95% of engineers now use AI coding tools at least weekly. That figure is from Pragmatic Engineer’s 2026 AI tooling survey, and it is not surprising. What is surprising is the statistic that follows it: 59% of developers admit to shipping AI-generated code they don’t fully understand, according to a June 2025 Clutch survey of 800 software professionals.

Read that again. More than half of professional engineers are committing code they cannot explain. Not junior developers. Not boot camp graduates. Senior contributors, tech leads and people who are responsible for systems in production.

If that number doesn’t bother you, this article is not for you. If it does, keep reading, because the actual risk profile of AI-assisted development is more specific than the industry conversation suggests. The conversation tends to drift toward productivity gains, vague warnings about “code review” and feel-good takes about developers becoming “orchestrators.” What’s missing is a practitioner-level breakdown of where AI code fails, how it fails and what you can actually do about it.

The Trust Gap You Built Into Your Workflow

The 59% figure is uncomfortable, but it’s also understandable. AI coding tools are fast. They are confident. They produce code that looks correct at a glance. When you’re deep in a sprint and under delivery pressure, accepting a suggestion that passes unit tests and looks syntactically valid is a rational shortcut. It feels like a time-to-market decision, not a risk decision.

CodeRabbit’s December 2025 “State of AI vs Human Code Generation” report complicates that calculus. Analyzing 470 open-source GitHub pull requests (320 classified as AI-coauthored and 150 as human-only), they found that AI-generated PRs averaged 10.83 issues each. Human PRs averaged 6.45. That’s roughly 1.7x more issues per pull request.

The breakdown by category is more revealing than the headline number. AI-coauthored code was 1.75x more likely to have logic and correctness errors, 1.64x more likely to have code quality and maintainability issues, 1.57x more likely to have security findings and 1.42x more likely to have performance problems. Not one category. Every category.

This is not the profile of a tool that “mostly gets it right.” It’s the profile of a tool that introduces more problems across the board, consistently, and with enough competence-seeming output that the problems don’t always surface in code review. The speed gain is real. The quality cost is also real. Teams that pretend one exists without the other are accumulating debt they will spend months paying down.

Phantom Packages Are a Supply Chain Attack Waiting to Happen

The code quality numbers are concerning. The fabricated package problem is an active security risk, and most teams aren’t treating it as one.

Here are the mechanics: AI code generation models are trained on enormous volumes of public code, but they are probabilistic rather than deterministic. When generating import statements or dependency declarations, they sometimes suggest package names that do not exist on any registry. A paper presented at the USENIX Security Symposium 2025 analyzed 576,000 code samples generated by 16 popular AI models. The result: 19.7% of package dependencies were fabricated entirely. Open source models fabricated at nearly 22%; commercial models at around 5%.

What makes this more dangerous than a simple error is the consistency. When researchers re-ran the same prompts 10 times each, 43% of the fabricated packages were suggested every single time. In 58% of cases, a fabricated package name was suggested more than once. The model is not randomly guessing. It’s confidently inventing the same names on repeat.

Security researcher Seth Larson named the resulting attack vector “slopsquatting,” a riff on typosquatting. The attack is straightforward: an attacker identifies which package names AI models commonly invent, registers those names on PyPI, npm or another public registry and populates the packages with malicious code. Any developer who accepts the AI suggestion and runs an install command without checking is pulling the attacker’s payload into their environment. The consistency of the fabrications is what makes this viable at scale. Attackers don’t need to brute-force potential names. They can observe model behavior, identify the commonly invented names and register them before your team does.

This is a documented threat. Trend Micro published a detailed breakdown in 2025 and Socket.dev’s research team traced real-world slopsquatting attempts across public registries. The attack surface exists because the fabrication patterns are predictable and reproducible.

If your team accepts AI-generated dependency suggestions and runs an install command without cross-referencing official registry documentation, you have a gap in your supply chain posture. It takes about 30 seconds per package to verify. It takes considerably longer to respond to a compromised dependency that’s been in production for three months.

The Security Failures Are Specific, Not Random

The CodeRabbit report breaks down the security findings in enough detail to act on. AI-generated code was 2.74x more likely to introduce cross-site scripting (XSS) vulnerabilities than human-written code. It was 1.91x more likely to include insecure direct object references (IDOR), 1.88x more likely to have improper password handling and 1.82x more likely to include insecure deserialization.

These are not exotic or novel vulnerability classes. XSS and IDOR have appeared in the OWASP Top 10 for years, either as named entries or, in recent editions, folded into the broader injection and broken access control categories. Improper password handling has been a known failure mode for decades. The AI is not producing creative new security problems. It’s reproducing the same failure patterns that the security community has been trying to eliminate since the late 1990s, at higher rates than a developer who has been burned by them before.

The pattern makes sense once you think about what the models were trained on. They were trained on the full corpus of public code, which includes an enormous amount of insecure code. The models can write code that looks idiomatic, passes a linter and compiles cleanly while repeating patterns that any competent security reviewer would flag immediately. The code looks fine. It is not fine.

One example that comes up repeatedly in practice: AI tools frequently generate parameterized queries that are structurally correct but bypass the ORM in ways that reintroduce injection risk when input types are coerced at the application layer. The code compiles. It runs. It passes a naive code review. It fails a penetration test six months after it shipped.
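
The ORM-bypass case above depends on a specific stack, so here is a deliberately simpler illustration of the same underlying failure: a query that runs and returns sensible results for normal input, yet is injectable. It uses Python’s built-in `sqlite3` module; the table, the function names and the payload are assumptions made up for this sketch.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is spliced directly into the SQL text.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # The driver binds the value, so it is never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = make_db()
payload = "x' OR '1'='1"

print(len(find_user_unsafe(conn, payload)))  # 2 (every row comes back)
print(find_user_safe(conn, payload))         # []
```

Both functions return the same thing for `"alice"`, which is exactly why a happy-path test or a quick read will not separate them.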

The fix is not to treat every AI-generated security finding as a scandal. It’s to define specific review targets based on the failure modes the data shows, rather than doing generic “does this look right” passes that miss the patterns AI over-produces.

A Practical Review Protocol for AI-Assisted Code

Blanket advice like “always review AI-generated code” is not useful. Every team says they review AI-generated code. The problem is what that review actually consists of.

Here is a more specific approach, organized around the failure categories the data actually shows.

Dependency verification as a step, not an assumption. Every package name that comes from an AI suggestion should be verified against the official registry and the project’s existing dependency policy before installation. This is not a complex process. You check that the package exists, that it is maintained and that it matches the documentation you’d find if you searched for it independently. Existence alone is not enough, because a slopsquatted name exists on the registry by definition; the maintenance and documentation checks are what catch it. That step takes 30 seconds per package and eliminates the fabricated-dependency surface entirely. The teams skipping it are not saving meaningful time. They are trading 30 seconds per package for the potential cost of a supply chain incident.
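
For Python projects, part of that check can be scripted against PyPI’s public JSON endpoint (`https://pypi.org/pypi/<name>/json`). The following is a minimal sketch under stated assumptions: `check_package` and `fetch_pypi_metadata` are hypothetical helpers, the fields read from the response (`info.version`, `releases`) are the ones this sketch assumes are present, and the `fetch` parameter exists only so the logic can be exercised without network access.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

PYPI_URL = "https://pypi.org/pypi/{name}/json"


def fetch_pypi_metadata(name: str) -> Optional[dict]:
    """Return PyPI's JSON metadata for `name`, or None on a 404."""
    try:
        with urllib.request.urlopen(PYPI_URL.format(name=name), timeout=10) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no such project on the registry
        raise


def check_package(
    name: str,
    fetch: Callable[[str], Optional[dict]] = fetch_pypi_metadata,
) -> dict:
    """Summarize the signals a reviewer should look at for one package name."""
    meta = fetch(name)
    if meta is None:
        return {"name": name, "exists": False, "release_count": 0, "latest_version": None}
    return {
        "name": name,
        "exists": True,
        # A single release on a name nobody recognizes deserves a closer look.
        "release_count": len(meta.get("releases", {})),
        "latest_version": meta.get("info", {}).get("version"),
    }


def fake_fetch(name: str) -> Optional[dict]:
    # Stand-in registry for the example; calling check_package(name) without
    # `fetch` would query the real PyPI endpoint instead.
    known = {"known-lib": {"info": {"version": "2.1.0"}, "releases": {"2.0.0": [], "2.1.0": []}}}
    return known.get(name)


print(check_package("known-lib", fetch=fake_fetch))
print(check_package("ghost-lib", fetch=fake_fetch))
```

A script like this answers “does it exist and how much history does it have,” which is only the first third of the check. Whether the project is the one the official documentation points to still needs a human.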

Security-specific review for data handling code. Any code that touches user input, database queries, authentication or session management should get a targeted pass against the specific failure modes AI over-produces: XSS, IDOR, improper credential handling and deserialization issues. Not a general “does this look right” review. A review where someone is actively looking for those specific patterns, because the data shows those are where AI-generated code fails at elevated rates.
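
To make “actively looking for those specific patterns” concrete, here is what an IDOR looks like in its smallest form, next to the version a targeted review should insist on. Everything in this sketch is hypothetical: the `Invoice` type, the in-memory `INVOICES` store and both handler functions stand in for whatever your framework actually provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    id: int
    owner_id: int
    total_cents: int


# Stand-in for a database table.
INVOICES = {
    1: Invoice(1, owner_id=10, total_cents=5000),
    2: Invoice(2, owner_id=20, total_cents=7500),
}


class Forbidden(Exception):
    """Raised when the caller may not access the requested object."""


def get_invoice_unsafe(current_user_id: int, invoice_id: int) -> Invoice:
    # IDOR: the caller is authenticated, but ownership is never checked,
    # so any user can read any invoice by changing the id.
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES.get(invoice_id)
    # Same error for "missing" and "not yours", so ids cannot be probed.
    if invoice is None or invoice.owner_id != current_user_id:
        raise Forbidden(f"invoice {invoice_id} is not available")
    return invoice


print(get_invoice_unsafe(10, 2).owner_id)  # 20 (user 10 reads user 20's invoice)
```

The reviewer’s question for every handler that takes an identifier from the request is the same: where is the line that ties this object to the current user?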

Test coverage is not a proxy for correctness. A common shortcut is running AI-generated code through the test suite and treating a passing result as validation. This fails because the AI frequently writes both the implementation and the tests in the same session, and those tests may be perfectly correct relative to a flawed specification. Test coverage tells you the code does what the tests expect. It says nothing about whether the tests are testing the right things or whether the specification the AI used as context was accurate.
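
A small, contrived illustration of that failure: the implementation and its tests below share the same incomplete understanding of the leap year rule, so the tests pass. Comparing against an independent source of truth, here the standard library’s `calendar.isleap`, is what exposes the gap. The function names are hypothetical and the “generated” code is written by hand for this sketch.

```python
import calendar


def is_leap_year(year: int) -> bool:
    # Hypothetical generated implementation: encodes "divisible by 4" only.
    return year % 4 == 0


def generated_tests_pass() -> bool:
    # Tests derived from the same flawed rule. Every case agrees with it.
    return is_leap_year(2024) and not is_leap_year(2023) and is_leap_year(2000)


def disagreements(start: int, end: int) -> list[int]:
    """Years in [start, end) where the implementation differs from calendar.isleap."""
    return [y for y in range(start, end) if is_leap_year(y) != calendar.isleap(y)]


print(generated_tests_pass())     # True
print(disagreements(1890, 2110))  # [1900, 2100]
```

The suite is green and the function is wrong for century years. The review question is where the expected values in the tests came from, and whether that source is independent of the code under test.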

Here’s a simple checklist worth adding to your PR template for AI-assisted code:

## AI-Assisted Code Review Checklist
- [ ] All new dependencies verified against official registry documentation (no fabricated package names)
- [ ] Data handling code reviewed against OWASP Top 10 failure patterns
- [ ] Test cases reviewed for coverage of edge cases, not just happy paths
- [ ] Code author can explain the logic of every AI-suggested block

The item about explaining the logic is the most important. If the developer who submitted the PR cannot explain what the AI-generated code does and why it does it that way, the PR is not ready for review. The explanation requirement is not punitive. It’s the minimum bar for code that runs in production. If you can’t explain it, you can’t own it when it breaks.

The Confidence Problem Is the Core Problem

The underlying issue with AI coding tools is not that they write bad code sometimes. It’s that they write bad code with exactly the same confident tone they use when writing correct code. There is no signal that indicates which suggestions are trustworthy, because the model has no internal confidence calibration that it surfaces to you. It doesn’t know what it doesn’t know.

This is a qualitatively different failure mode from what you see in junior developers. A junior developer who isn’t sure about something usually shows it. They hedge. They ask questions. They leave comments that say “I’m not sure this handles the edge case.” AI tools don’t. They produce output that looks complete, authoritative and considered whether the underlying suggestion is a solid pattern or an invented package name.

The 59% of developers shipping code they don’t understand are not lazy. They’re responding rationally to a tool that presents everything with equal confidence. The work is to build team practices that don’t use the tool’s confidence as a quality signal, because that signal has no calibration.

The Takeaway

AI coding tools are here and they are useful. The productivity gains are real for teams that use them with appropriate discipline. The gap between teams that use them well and teams that use them carelessly will widen over the next two years, and it won’t show up in velocity metrics. It’ll show up in the security audit that finds IDOR patterns across a third of your endpoints, in the dependency graph that contains a package nobody can trace back to an official source and in the incident where a fabricated package name someone installed six months ago turns out to have been registered by someone with interests that differ from yours.

The question is not whether to use AI coding tools. It’s whether your review practices are designed for the specific failure modes these tools produce, or whether “always review AI-generated code” is as far as you’ve gotten.

The data on the specific failure modes is published and verifiable. Act on it specifically.