I’ve attempted to find studies detailing the cost of duplicate code within projects. We can all spout the problems these clones produce, but can we quantify the cost from actual studies of real projects? Apparently not.
One pundit suggests that if our code base has 20% duplicated code, then eliminating it would reduce maintenance costs by 20%.
This ignores multiple key points. Consider the cost of actually removing that code and routing every caller to the one remaining copy. Can you properly disentangle that code from its surrounding web without considerable effort? Will all those test cases work properly? How can you ensure you have not introduced dependency bugs? For the clones in multiple repositories, how should your build system handle that one and only one copy? Your QA should perform complete regression tests on this code. The overall effort involves a significant amount of work!
Don’t forget that removing the clones would frequently involve creating a one and only one “anti-clone” routine in a library or common repository. Thus, yet another dependency can appear for each repository in which the clones exist. And isolation certainly weakens.
The clones that this one “anti-clone” routine replaces lived in multiple repositories and so likely belonged to multiple groups. Now you have to convince these groups that the “anti-clone one-true-source” routine should be used in place of all those clones. This likely involves multiple meetings wherein each of the above issues gets discussed, timelines estimated, and resources allocated. How many managers will release resources for this?
If my system had a 20% duplicate code metric, I would literally leave it alone and simply attempt to stop new clone introduction.
I was at one place where the code base had a typical 20% ratio of clones. One of the managers decided to do something about this. He mandated that duplicated code was to be outlawed. His developers dutifully “removed” the clones by renaming variables in the code flagged by the clone reports and/or rearranging minor logic. The task was easily accomplished in a timely manner. Poof! No more clones!! The clone reports were, of course, created by straightforward string-matching algorithms. Since reported duplication was now minimal, he was happy. I was aghast, as clone detection had now become impossible with our elementary tools.
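To make the failure concrete, here is a minimal sketch of a string-matching clone report and of how renaming defeats it. The function `find_duplicate_windows`, the file names, and the sample snippets are all hypothetical and only stand in for whatever tool that team used.

```python
from collections import defaultdict


def find_duplicate_windows(sources, window=3):
    """Report every run of `window` consecutive lines that occurs more than once.

    `sources` maps a (hypothetical) file name to its text. Lines are compared
    as stripped strings, which is exactly the weakness this sketch shows.
    """
    seen = defaultdict(list)
    for name, text in sources.items():
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        for i in range(len(lines) - window + 1):
            seen[tuple(lines[i:i + window])].append((name, i))
    return {chunk: places for chunk, places in seen.items() if len(places) > 1}


ORIGINAL = """
total = 0
for item in items:
    total += item.price * item.qty
return total
"""

# The "fix" from the story: same logic, different names.
RENAMED = """
acc = 0
for entry in entries:
    acc += entry.price * entry.qty
return acc
"""

before = find_duplicate_windows({"a.py": ORIGINAL, "b.py": ORIGINAL})
after = find_duplicate_windows({"a.py": ORIGINAL, "b.py": RENAMED})
print(len(before), "duplicate windows before renaming")
print(len(after), "duplicate windows after renaming")
```

The logic in the two files is still identical after the renaming, yet the report goes quiet because no line matches as a string any more.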
But the long term impact of anti-cloning…
The above discussion ignores the long term impact. Everyone knows that maintenance, not design and development, accounts for most of the cost of code. So a legitimate argument can be made for doing the work to remove clones, placing the code into some library or repository, and requiring everyone to access that and only that code.
Sigh. Valid point.
Fixing clones could then make sense for the long term total project costs. I’m still hunting for studies that validate this scenario as well.
The real question then becomes:
How much does a clone really cost a project?
I wish I were in a group where we could do the right thing rather than constantly racing against a scheduled ship date. Unfortunately, almost every company’s motto applies: “No time to do it right but we have time to do it over.”
Code Duplication vs. Isolation
Another concern regards the practice of code duplication for achieving isolation among components. Suppose a team tries to maximize isolation of their components. However, in that process duplicate code appears everywhere. Will this practice of code duplication create more trouble later, as maintenance costs could become very high?
Does isolation really mean that components can’t share the same code? Could the cost of duplication be less than the benefits of isolation?
Avoiding duplicated code, “DRY” (Don’t Repeat Yourself), is considered to be one of the most ignored principles of clean code. If it were totally ignored, there would be no methods or functions! Nothing is easier than to repeat code by copy, paste and possibly hack. Then development can go quickly. This happens all too often.
A major problem of duplicated code is the propagation of bugs. That first piece of code will eventually contain a bug due to coding, changes, updates, etc. How many other duplicated pieces will be updated to maintain correctness?
And if that duplicated code was hacked, consider how tricky it is to remember the differences between the two pieces. Now multiply that by all the clones! How confusing it is to understand one clone, run into another that is slightly different, and think you understand it. The hacked clones are not likely to be discovered by a dup checker! Few dup checkers are that clever.
Let’s make an extreme example. Let’s start programming without methods in your app. Every single time you need to perform something that was formerly a method in your app, you copy the code that performed that logic. Your mainline becomes one large “method”. How difficult would it be to follow the logic? Now it’s tricky to read because your “paragraphs”, ex-methods, have become one huge paragraph. How easy is this to understand? How easy to test, debug and maintain?
There are portions of your app used repeatedly. Instead of calling them, you copy that code and hack the variable names.
When, not if, but when you find a bug in one of the duped portions, you should change it in all the hacked clones. How much more maintenance does that entail? Do you think you can actually find all the hacked clones and reliably rehack? How many more bugs do the clones produce because you must change code all over the place? How stable do you think your code would be? How could you test this mess? How much of a memory overload does this produce?
A solution to this duped code would entail creating a method or class that holds this logic. If minor variations are necessary, would a base class work better? Possibly. The base class must be, of course, at the proper level of abstraction. This is part of the design process that a craftsman should aspire towards.
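As a small sketch of that refactoring, assume two hacked clones that compute a total with tax. All function names and the tax example are hypothetical, chosen only to show the drift between clones and the single routine that replaces them.

```python
# Before: two hacked clones. The second copy was renamed by hand and silently
# lost the rounding step, which is the kind of drift described above.
def invoice_total_clone(prices, tax_rate):
    subtotal = 0
    for p in prices:
        subtotal += p
    return round(subtotal * (1 + tax_rate), 2)


def quote_total_clone(amounts, rate):
    s = 0
    for a in amounts:
        s += a
    return s * (1 + rate)  # rounding forgotten in this copy


# After: one shared routine. A fix made here reaches every caller.
def total_with_tax(amounts, tax_rate):
    return round(sum(amounts) * (1 + tax_rate), 2)


def invoice_total(prices, tax_rate):
    return total_with_tax(prices, tax_rate)


def quote_total(amounts, rate):
    return total_with_tax(amounts, rate)


sample = [0.1, 0.2]
print(invoice_total_clone(sample, 0.07) == quote_total_clone(sample, 0.07))
print(invoice_total(sample, 0.07) == quote_total(sample, 0.07))
```

The clones disagree on the same input, while the refactored callers cannot, because there is only one place where the calculation lives.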
In our code base I see lots and lots of dups. Even amongst multiple repos! Many of these dups should be in their own module with other refactored dups.
Isolation does not mean two components can’t share the same code. Would you want your own copy of console.log()? Of course not. Refactoring out the duped code for sharing would involve creating one or more libraries holding these new modules/packages. They would have their own module just as console.log lives in a module. Don’t get too fine-grained with something like placing each refactored package or module into its own module – this would again be gross overkill. (Don’t take these suggestions as absolutes. In clean code there are exceptions to everything!)
The advantages of this refactoring are numerous. When a bug is found, it gets corrected and all code using that method gets corrected automatically. You are thoroughly testing the code, aren’t you? And by thoroughly testing I mean the single instance has thorough tests and the users of the single instance have thorough tests as well. This significantly reduces maintenance time. Remember, the cost of software is the long term maintenance, not the development time. More time spent upfront in development getting it right means lower overall costs in terms of customer satisfaction, debugging, team morale, etc. (Your manager will not like this previous statement!)
Structured languages like Pascal, ALGOL and Modula were introduced to help folks break up monolithic programs, bringing in concepts like stack variables and dynamic memory allocation. But the main point was to use subroutines/functions and reuse code to produce better structured programs and libraries. Better structured programs are easier to understand and maintain.
OO languages like Smalltalk, C++, CLOS and Java set new programming paradigms to advance software constructs in multiple areas. One aspect was to better enable reuse through inheritance, which helps programmers write even more modular code.
If you think you are creating better isolated code by copy/paste, you might want to consider using OO design practices to override methods for specialization when needed and reuse base implementations whenever possible. Copy/paste of procedural code really is taking a step backwards from good design practices.
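A minimal sketch of that idea, assuming a hypothetical `Exporter` base class: the shared skeleton is written once, and a subclass overrides only the step that differs instead of copying the whole routine.

```python
class Exporter:
    """Base implementation shared by every (hypothetical) export format."""

    def export(self, rows):
        # Shared skeleton: written and tested once.
        return "\n".join(self.format_row(row) for row in rows)

    def format_row(self, row):
        # Default behaviour, reused unless a subclass specializes it.
        return ",".join(str(value) for value in row)


class TsvExporter(Exporter):
    def format_row(self, row):
        # Only the part that differs is overridden.
        return "\t".join(str(value) for value in row)


rows = [(1, "a"), (2, "b")]
print(Exporter().export(rows))
print(TsvExporter().export(rows))
```

Whether a base class is the right level of abstraction is a design call; for a difference this small, a plain function with a separator parameter would do just as well.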
Given the above discussion, how should the dreaded duplicates get handled in our applications? The statements below will, hopefully, invite discussions as I have not been able to find studies on this topic.
- Identify current clones within your application. This establishes a baseline used to monitor future code changes.
- Identify current clones between your application and related repositories. If possible, include all code in your company’s repository base. If not possible, use repositories related to your application’s domain. This creates another baseline of clones.
- Examine these clones for low hanging fruit – where can some of these clones be easily fixed? What would require the least effort? Under no circumstances should you simply rename variables or slightly reorder statements to “fix” these clones!
- Unit tests seem to breed clones like stagnant water breeds mosquitoes. These definitely need attention. Changing a feature could cause a significant amount of work because the replicated code in the multiple tests touches the same feature multiple times. Commented out tests commonly have too much functional duplication as the root cause.
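The baseline idea in the first two items could be sketched as follows, assuming your clone detector can hand you the text of each duplicated chunk it reports. The report contents, `fingerprint` and `new_clones` are hypothetical names for illustration, not part of any particular tool.

```python
import hashlib


def fingerprint(chunk):
    """Stable ID for a duplicated chunk of code (here, a hash of its text)."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:12]


def new_clones(baseline, current):
    """Return fingerprints present now that were not in the recorded baseline."""
    return sorted(set(current) - set(baseline))


# Hypothetical detector output: the text of each duplicated chunk it reported.
baseline_report = ["for x in xs:\n    total += x", "if user is None:\n    return"]
current_report = baseline_report + ["retry(3)\nsleep(1)"]

baseline = {fingerprint(c) for c in baseline_report}
current = {fingerprint(c) for c in current_report}

added = new_clones(baseline, current)
print(f"{len(added)} new clone(s) since baseline")
```

A check like this leaves the existing 20% alone and only complains when a change introduces a clone that was not in the baseline.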
Clone Detection Using Abstract Syntax Trees: This paper provides a sketch of a tool to detect cloned code using abstract syntax trees. This goes beyond the usual clone detectors that rely on simple text or token matching, such as PMD’s copy/paste detector.
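To give a feel for why syntax trees help, here is a rough sketch using Python’s standard `ast` module. It is not the paper’s algorithm: it only compares whole snippets after replacing every identifier with a placeholder, and the helper names (`_Normalize`, `shape`) and sample functions are made up for this example.

```python
import ast


class _Normalize(ast.NodeTransformer):
    """Replace function, argument and variable names with a placeholder."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_"
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node


def shape(source):
    """Return a string describing the syntax tree with identifiers erased."""
    tree = _Normalize().visit(ast.parse(source))
    return ast.dump(tree)


A = "def total(items):\n    t = 0\n    for item in items:\n        t += item.price * item.qty\n    return t\n"
# Same logic with every name changed, as in the mandated "clone removal".
B = "def sum_up(entries):\n    acc = 0\n    for entry in entries:\n        acc += entry.price * entry.qty\n    return acc\n"
# Genuinely different logic.
C = "def total(items):\n    t = 0\n    for item in items:\n        t += item.price\n    return t\n"

print(shape(A) == shape(B))
print(shape(A) == shape(C))
```

Erasing every name is deliberately crude and can report false matches (for example, `a - b` and `b - a` look identical), so treat it as a way to see the principle, not as a detector to deploy.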