Duplicated code must surely rank near the top for bad code. The problem is that a new feature or bug will surely cause changes to one clone. Then the other clones will be silently ignored. This will likely create incorrect behavior. The tests for the clones likely will not catch these bugs either.
The simplest duplicate code has the same statements in two or more places. The simple solution extracts that code and places it into a new class or method.
Another common situation is duplicated code in sibling subclasses. Hoist the duplicated code into the super class.
Duplicated code must be eliminated because:
- Duplicated code hampers the introduction of changes since every implemented variant will likely need changes. Missing some variants is easy which causes bugs to pop up in other places.
- Duplicate code scatters the logic of a system instead of grouping into identifiable artifacts – classes, methods, repositories.
- This leads to systems that are difficult to understand and change. Instead of a understanding just relationships between logical parts you must first identify and understand the duplicates and then the relationships.
- Complexity increases because duplicates are commonly variants of each other. Conceptually maintaining the differences between near identical code in various places can easily lead to bugs.
- Coding and testing times get multiplied by the number of duplications. Bugs thought to be fixed may persist in duplicated locations for months or years.
DRY: “Every piece of system knowledge should have one authoritative, unambiguous representation.”
Don’t Repeat Yourself – DRY is perhaps one of the most misunderstood principles of software craftsmanship. Most people think that DRY is about exact or near exact copies of code. DRY is far grander than that.
Every piece of knowledge in the development of something should have a single representation. A system’s knowledge is far broader than just its code. It refers to database schemas, test plans, the build system, even documentation.
Given all this knowledge, why should you find one way to represent each feature? If you have more than one way to express the same thing, at some point the two or more different representations will most likely fall out of step with each other. Even if they don’t, you’re guaranteeing yourself the headache of maintaining them in parallel whenever a change occurs. And change will occur. DRY is important if you want flexible, readable and maintainable software.
Representing Knowledge Once
The problem: how do you represent all these different pieces of knowledge only once? If it’s just code, then you can obviously organize your code so you don’t repeat things with the help of methods and subroutines. How do you handle things like database schemas? Data migration issues? Using code generation tools, automatic build systems, and scripting languages become necessary. These assist in generating a single, authoritative representations that then generates non-authoritative work products, like code or DDLs (data description languages).
Functionally Duplicated Code
Code does not have to be an exact copy. A developer may copy code, then make minor modifications. At other times the algorithm itself contains duplicated code:
var array1 = []; var array2 = []; var average1 = 0.0 var average2 = 0.0 var sum1 = 0; var sum2 = 0; var i; for (i = 0; i < 4; i++) { sum1 += array1[i]; } average1 = sum1/4; for (i = 0; i < 4; i++) { sum2 += array2[i]; } average2 = sum2/4;
Ignore other anti-patterns in the above code. This duplication will not be detected by common duplicate detection algorithms. Very few companies have more than the minimum character matching code detectors.
Nearly Duplicated Code – The WET Principle
Not all duplicated code is an exact duplicate. Sometimes the same code logic gets used over and over. This particular duplicated class of code will not get detected by normal copy/paste detectors.
The WET principle says to Write Everything Twice – WET. Examine the code below. Hopefully a solution will be obvious.
item1 = parseFloat(document.form1.Stitches.value); item1price = parseFloat(document.form1.StitchesPrice.value); item2 = parseFloat(document.form1.FirstAidKit.value); item2price = parseFloat(document.form1.FirstAidKitPrice.value); item3 = parseFloat(document.form1.Wristband.value); item3price = parseFloat(document.form1.WristbandPrice.value); // ... item76 = parseFloat(document.form1.HushCD10.value); item76price = parseFloat(document.form1.HushCD10Price.value); var subtotal = 0; if (item1 > 0) { var item1total = item1 * item1price; subtotal = subtotal + item1total; } if (item2 > 0) { var item2total = item2 * item2price; subtotal = subtotal + item2total; } if (item3 > 0) { var item3total = item3 * item3price; subtotal = subtotal + item3total; } // ...
Another Functionally Duplicated Code Example
Suppose we have an order system. An attribute of “payment” on the order indicates when the order was paid:
var Order = { paymentDate: null, getPaymentDate: function () { return paymentDate; } ... }
Assume orders are not allowed to be sent for shipping unless they are paid. There will be code somewhere checking this condition.
if(order.getPaymentDate() != null) { ... } // do shipping else { ... } // do not allow shipping
It is also probable that the same check is done in a lot of other places throughout the codebase. We now have the same technical construction order.getPaymentData() != null
used to express the same idea “order is payed” in a lot of places – a clear violation of DRY. Most likely we will also find a lot of places checking the negation order.getPaymentDate() == null
.
Suppose isPaid()
were created for this test:
var Order = { paymentDate: null, getPaymentDate: function () { return this.paymentDate; }, isPaid: function () { return order.getPaymentDate() != null; } }
Multiple benefits exist:
- Changing the technical representation of an unpaid order
becomes possible without having to run around the code-base. - All those places in the code now change at the same time –
because they conceptually become related things (actually they
become conceptually the same thing). - An even more important benefit: the API of the Order
class becomes more expressive, making the client code more
orthogonal and understandable.if(order.isPaid()) { ... } // do shipping else { ... } // do not allow shipping
Locating Duplicated Code in a Repo – CPD
The popular PMD bundles CPD (Copy Paste Detector) that uses string search algorithms to find duplicated code. By using CPD the most egregious violations of copy/paste can be quickly found in a repo. (PMD requires java 7+ to run.)
To install CPD you install all of PMD. PMD may already exist in your eclipse installation. The following instructions will install pmd on a hyper for batch processing.
Download PMD from sourceforge.net at http://sourceforge.net/projects/pmd/
Unzip the downloaded file in a convenient directory: unzip
pmd-bin-5.0.3.zip or the latest version.
Enter into the dir: cd pmd-bin-5.X.X
Get the script that starts pmd into your path: export PATH=$PATH:$(pwd)/bin
Assuming you have already downloaded your repo, cd to the top of that repo.
Assume the repo is mynodeweb. Now run:
PMD=$HOME/pmd-bin-5.0.3 export PATH=$PATH:$PMD/bin export CLASSPATH=$CPASSPATH:$PMD/lib run.sh cpd --minimum-tokens 50 --files mynodeweb --language ecmascript | tee myweb.dups
To detect duplicates between two or more repos:
run.sh cpd --minimum-tokens 50 --files mynodeweb --files fcxonode --language ecmascript | tee repo.dups
The “–files” is the name of the dir at the top of the repo.
O’Reilly has an excellent article on using CPD: “Detecting
Duplicate Code with PMD’s CPD”
Duplicated Code in Unit Tests
I’ve noticed that code in unit tests is often duplicated multiple times. A developer creates a single unit test and makes it work. Because a single unit test for a method almost never provides sufficient testing, additional unit tests that address the very same method must be created.
The logical and even reasonable approach copies, pastes and hacks the original unit test. A substantial portion of that original unit test now lives in the new test.
This process likely continues with many other tests for the original method.
The question: “What is wrong with this? It’s only a unit test!” A dedicated developer practicing Test Driven Development rightly wants to proceed with developing code as rapidly as possible instead of fumbling around with test code. “Why waste time with a unit test?” Why indeed.
Duplicated code is a smell in unit test code just as much as in other code. If you have duplicated code in tests, it makes it harder to refactor the implementation code because you have a disproportionate number of tests to update. And, more importantly, when the code evolves with a new feature or bug fix, all unit tests must be updated to properly test that new feature. How much work have you created with all that duplicated code in your tests? How discouraging does that mountain of work make you feel?
Tests should help you refactor with confidence, rather than create an overwhelming burden that impedes work on the code under tested.
It is important for test code to be readable. In particular, it is important that the intent of a test is clear. I find that if many tests look mostly the same, (e.g. three-quarters of the lines the same or virtually the same) it is hard to spot and recognize the significant differences without carefully reading and comparing them. So I find that refactoring to remove duplication helps readability, because every line of every test method is directly relevant to the purpose of the test. That’s much more helpful for the reader than a random combination of lines that are directly relevant, and lines that are just boilerplate.
Readability is more important for tests. If a test fails, you want the problem to be obvious. The developer shouldn’t have to wade through a lot of heavily factored test code to determine exactly what failed. You don’t want your test code to become so convoluted that you need to write unit-test-tests.
The Good and the Bad
Code copy, pasted and hacked from the internet can be good because:
- Code stored on blogs, forums and the web in general is very easy to find.
- Comments indicate this code has been reviewed. This improves the quality,
- If page ranking is used, the score suggests this is good stuff.
- Posted code generally has qualities of readibility and understandiblity that newly writen code does not.
- The higher quality of published code drives a developer’s ego to attribute that to their own brilliance.
- Code blindly copied and pasted really cuts down on time schedule pressure.
However, copy, pasted and hacked code may be bad because:
- If the author improves the code or fixes a bug, you’re not likely to receive the benefit.
- If you improve the code, it’s highly unlikely that you will redistribute it back to the author.
- Code may be blindly copied and pasted without understanding the various ramifications of what really happened inside that code.
- Page rank does not provide a metric on the quality of the code.
- Published code is frequently “demo code” wherein only the happy path gets listed and explained. The really industrial strength coding gets left as an exercise for the reader.
Links
Software Clone Detection and Refactoring
Clone Detection Using Abstract syntax Trees This paper provides a sketch of a tool to detect cloned code using abstract syntax trees. This goes beyond the normal clone detectors that use simple character-by-character matching such as pmd.