Code quality improvement without changing the team: Indpro's AI Code Factory took a client's codebase from quality score 62 to 91 in 30 days. Here's the exact methodology.
Author: Tom Bergström
Published: 21 May 2026
Reading time: 6 min read
Topics: nordic-tech, enterprise, scaling
When a CTO tells you the code quality is poor, the instinct is to hire better developers or invest in training. Both take months and neither addresses the root cause. The codebase this post is about had been written by competent developers — people who knew how to code — but had accumulated quality debt because the system they worked in had no automated quality floor.
Thirty days after we implemented the AI Code Factory guardrail stack, the SonarQube quality score had moved from 62 to 91. The team was the same. The difference was the system.
Quality score: 62 (day 1) → 91 (day 30)
SonarQube's quality score aggregates five dimensions: reliability (bugs that will cause failures), security (vulnerability patterns), maintainability (code smells, complexity), coverage (test coverage percentage), and duplications (copy-paste code that creates maintenance debt). A score of 62 typically indicates significant issues in at least 2–3 of these dimensions.
This codebase's breakdown at day 1: reliability at 58 (several critical bugs in production code), security at 71 (some vulnerability patterns, no critical exposures), maintainability at 55 (high complexity, many code smells), coverage at 61 (insufficient test coverage), duplications at 68. Every dimension below 80 required work. The guardrail approach addressed them systematically rather than ad hoc.
| Dimension | Day 1 | Day 30 | Primary Fix |
|---|---|---|---|
| Reliability | 58 | 89 | PR review agent catching bugs pre-merge |
| Security | 71 | 94 | Security patterns encoded in SKILL.md |
| Maintainability | 55 | 88 | Complexity guardrail + pattern standardization |
| Coverage | 61 | 92 | Coverage gate + testing skill file |
| Duplications | 68 | 87 | Component library + SKILL.md pattern reuse |
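The table can be read as a short calculation: which dimensions sat below the 80 mark, and how far each one moved. The sketch below only restates the numbers from the table; the type and function names are illustrative and not part of any SonarQube API.

```typescript
// Per-dimension scores copied from the table above.
type DimensionScore = { dimension: string; day1: number; day30: number };

const scores: DimensionScore[] = [
  { dimension: "Reliability", day1: 58, day30: 89 },
  { dimension: "Security", day1: 71, day30: 94 },
  { dimension: "Maintainability", day1: 55, day30: 88 },
  { dimension: "Coverage", day1: 61, day30: 92 },
  { dimension: "Duplications", day1: 68, day30: 87 },
];

// The post treats any dimension scoring below 80 as needing work.
const WORK_THRESHOLD = 80;

// Point gain per dimension, largest first.
function gains(rows: DimensionScore[]): { dimension: string; gain: number }[] {
  return rows
    .map((r) => ({ dimension: r.dimension, gain: r.day30 - r.day1 }))
    .sort((a, b) => b.gain - a.gain);
}

// Names of the dimensions still below the threshold at a given checkpoint.
function belowThreshold(rows: DimensionScore[], key: "day1" | "day30"): string[] {
  return rows.filter((r) => r[key] < WORK_THRESHOLD).map((r) => r.dimension);
}

console.log(gains(scores));
console.log(belowThreshold(scores, "day1"), belowThreshold(scores, "day30"));
```

On these numbers, maintainability moved the most (33 points) and duplications the least (19 points). All five dimensions were below 80 on day 1 and none were on day 30.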
Days 1–7: Lint cleanup sprint. The existing codebase had 1,847 ESLint violations and 234 TypeScript errors. We ran automated fixes for the straightforward issues (342 auto-fixable), then manually addressed the remaining 1,505 — prioritizing the ones that were blocking the pre-commit hook setup. By day 7, the codebase was clean enough to enable the blocking pre-commit hook without constant interruption.
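One way to plan a cleanup sprint like this is to split the lint report into auto-fixable and manual buckets before touching any code. The sketch below assumes the report comes from ESLint's JSON formatter (`eslint --format json`), where a message carries a `fix` object when ESLint can repair it automatically. The `triage` helper and the sample data are hypothetical illustrations, not the tooling used on the engagement.

```typescript
// Minimal subset of the shape emitted by ESLint's JSON formatter.
interface LintMessage {
  ruleId: string | null;
  severity: number; // 1 = warning, 2 = error
  fix?: unknown; // present when ESLint can apply the fix automatically
}

interface LintResult {
  filePath: string;
  messages: LintMessage[];
}

interface Triage {
  autoFixable: number;
  manual: number;
  manualByRule: Map<string, number>; // manual violations grouped by rule
}

// Hypothetical helper: count auto-fixable vs. manual violations.
function triage(results: LintResult[]): Triage {
  const out: Triage = { autoFixable: 0, manual: 0, manualByRule: new Map() };
  for (const result of results) {
    for (const msg of result.messages) {
      if (msg.fix !== undefined) {
        out.autoFixable += 1;
        continue;
      }
      out.manual += 1;
      const rule = msg.ruleId ?? "(no rule)";
      out.manualByRule.set(rule, (out.manualByRule.get(rule) ?? 0) + 1);
    }
  }
  return out;
}

// Illustrative input; file paths are made up.
const sample: LintResult[] = [
  {
    filePath: "src/a.ts",
    messages: [
      { ruleId: "semi", severity: 2, fix: { range: [0, 0], text: ";" } },
      { ruleId: "no-unused-vars", severity: 2 },
    ],
  },
  { filePath: "src/b.ts", messages: [{ ruleId: "no-unused-vars", severity: 1 }] },
];

console.log(triage(sample));
```

Grouping the manual bucket by rule makes it easier to prioritize the rules that block the pre-commit hook first.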
Days 8–14: SKILL.md library creation. We interviewed the two most senior developers on the client team to document the patterns they considered correct: the things they wished the whole team followed. Those patterns became the SKILL.md files. From day 8 forward, all new code the agent generated followed those patterns. The maintainability score started rising immediately.
Days 15–21: Coverage expansion. We identified the 40% of the codebase below the coverage threshold and used the testing skill file to generate tests for it. Not all generated tests were production-ready — approximately 15% needed manual adjustment. The other 85% were merged directly.
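Finding the part of the codebase below the threshold can be done from a coverage report. The sketch below assumes an Istanbul-style `json-summary` report (`coverage-summary.json`), which keys per-file entries by path alongside a `total` entry. The post does not state the threshold used, so the value of 80 and the file paths here are hypothetical.

```typescript
// Subset of a per-file entry in an Istanbul json-summary report.
interface CoverageMetric {
  total: number;
  covered: number;
  pct: number;
}

interface FileCoverage {
  lines: CoverageMetric;
}

type CoverageSummary = Record<string, FileCoverage>;

// Return files whose line coverage is below the threshold, worst first.
// The aggregate "total" entry is skipped because it is not a file.
function filesBelowThreshold(summary: CoverageSummary, thresholdPct: number): string[] {
  return Object.entries(summary)
    .filter(([path, cov]) => path !== "total" && cov.lines.pct < thresholdPct)
    .sort(([, a], [, b]) => a.lines.pct - b.lines.pct)
    .map(([path]) => path);
}

// Illustrative data only; the threshold of 80 is an assumption.
const summary: CoverageSummary = {
  total: { lines: { total: 300, covered: 183, pct: 61 } },
  "src/billing.ts": { lines: { total: 100, covered: 35, pct: 35 } },
  "src/auth.ts": { lines: { total: 100, covered: 90, pct: 90 } },
  "src/report.ts": { lines: { total: 100, covered: 58, pct: 58 } },
};

console.log(filesBelowThreshold(summary, 80));
```

The resulting list is the queue of files to hand to the test-generation step, starting with the least-covered file.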
Days 22–30: PR review agent deployment in advisory mode, followed by calibration. By day 30 the false positive rate was under 12% and the team trusted the output. Blocking mode was enabled in week 5.
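The calibration step can be pictured as a simple gate on reviewed findings. The sketch below assumes each finding raised by the review agent in advisory mode is labeled by a developer as valid or as a false positive, and that the false positive rate is the share of raised findings labeled false positive. The 12% figure comes from the post; the minimum sample size and all names are hypothetical.

```typescript
type Verdict = "valid" | "false_positive";

interface ReviewedFinding {
  id: string;
  verdict: Verdict; // label assigned by a developer during advisory mode
}

// From the post: the rate was under 12% by day 30.
const MAX_FALSE_POSITIVE_RATE = 0.12;

// Share of raised findings that developers marked as false positives.
function falsePositiveRate(findings: ReviewedFinding[]): number {
  if (findings.length === 0) return Number.NaN;
  const falsePositives = findings.filter((f) => f.verdict === "false_positive").length;
  return falsePositives / findings.length;
}

// Hypothetical gate: require enough labeled findings and a rate under the limit.
function readyForBlocking(findings: ReviewedFinding[], minSample = 50): boolean {
  if (findings.length < minSample) return false;
  return falsePositiveRate(findings) < MAX_FALSE_POSITIVE_RATE;
}

// Helper that builds illustrative data: `fp` false positives out of `n` findings.
function makeFindings(n: number, fp: number): ReviewedFinding[] {
  return Array.from({ length: n }, (_, i) => ({
    id: `finding-${i}`,
    verdict: i < fp ? "false_positive" : "valid",
  }));
}

console.log(readyForBlocking(makeFindings(100, 11)));
```

Keeping the agent in advisory mode until the gate passes means developers are not blocked by a reviewer they have not yet learned to trust.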
The most important insight from this engagement: the developers weren't writing low-quality code because they didn't know better. They were writing it because the system they worked in had no automated feedback loop, no pattern encoding, and no enforcement of the standards they knew were correct. When you remove "did I follow the pattern?" from the cognitive load of every coding decision, developers write better code. The mental energy that was going to "is this the right way?" goes to "is this the right feature?"
"A 62 quality score is almost always a systems problem, not a people problem. The developers on this team knew what good code looked like. They were producing inconsistent output because consistency requires system-level enforcement, not individual discipline. Every developer's discipline varies day to day. The guardrails don't." — Pavel Siddique, CEO, Indpro AB
At day 30, the score was 91. By day 90 (the end of the formal engagement), it had reached 94. The continued improvement came from the skill library maturing — each sprint that exposed a new pattern gap produced a new skill file that closed it. The rate of improvement slows as the score rises, but the compounding effect means the ceiling keeps moving.
More importantly, the velocity of the team at day 90 was materially higher than at day 1. A codebase with a quality score above 90 is faster to work in: less time debugging confusing legacy code, less time reviewing standards violations, less time firefighting production incidents. Quality improvement and velocity improvement are the same investment.
Q: What quality scoring tool do you use, and can this approach be applied with other tools?
The client in this case study used SonarQube. The AI Code Factory approach is tool-agnostic — we've seen similar improvements measured via CodeClimate, Codacy, and custom internal scoring systems. The underlying mechanism (enforcing patterns through skill files and guardrails) applies regardless of how quality is measured.
Q: How do you handle the quality of pre-existing code vs. new code going forward?
Two tracks: the guardrails apply to all new code immediately. Legacy code is addressed through a prioritized remediation backlog — we identify the highest-risk legacy issues and address them in dedicated cleanup sprints alongside feature work. We don't recommend stopping all feature work for a legacy cleanup sprint; the business cost is too high. The hybrid approach typically reaches 85+ quality scores within 60–90 days.
Q: Does improving code quality measurably reduce bug rates in production?
On this engagement, production incidents per sprint went from 2.1 (day 1 baseline) to 0.4 (day 90). That's an 81% reduction. We attribute it primarily to the reliability dimension improvement (bugs caught before production) and the coverage improvement (regressions caught by tests before deploy). The correlation is consistent across other engagements we've measured.
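The 81% figure follows directly from the two incident rates quoted above. A minimal check, with a hypothetical helper name:

```typescript
// Percentage reduction from a baseline value to a later value.
function percentReduction(before: number, after: number): number {
  if (before <= 0) throw new RangeError("baseline must be positive");
  return ((before - after) / before) * 100;
}

// Incidents per sprint from the post: 2.1 at day 1, 0.4 at day 90.
const reduction = percentReduction(2.1, 0.4);
console.log(`${reduction.toFixed(0)}%`); // prints "81%"
```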

CTO & Co-Founder
Tom leads Indpro's technology strategy and engineering standards. With 20+ years of experience building and leading engineering teams across the Nordic region, he ensures every engagement delivers at the highest technical level.