Over the past year we have reduced our website test suite build time by over 99.9%.
- Build time a year ago: 15 hours.
Across 15 EC2 build slaves it took “only” 1 hour of real time.
- Build time today: 15 seconds
On my laptop.
Having a build that took over an hour to run crippled the productivity of our team.
So, how did we make such a drastic improvement? There were no quick fixes, though Lord knows we tried to find them. Instead we have had to completely change the way we test.
We did not discover any brilliant new techniques. Instead, we found three big mistakes that had created such a monster build time. We had gone down the wrong path, and it took a lot of time and effort to fix it later.
Bad Practice #1: We favoured integration tests over unit tests
We used to be extremely thorough in our integration tests. We used them to test everything, usually instead of unit tests, which were comparatively thin on the ground. Since integration tests are far, far slower than unit tests, this caused a lot of unnecessary work.
To fix this we looked at each integration test in turn and either:
- ditched it (i.e. we increased our tolerance for broken things in exchange for having a faster build)
- rewrote it as a unit test on a specific class
- kept it, as we still needed a few integration tests for each component
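To make the second option concrete, here is a minimal sketch of the kind of unit test that can replace a browser-level scenario. The class EventDateFormatter is invented purely for illustration and is not taken from our codebase; the example assumes a modern Ruby with Minitest available.

```ruby
require "date"
require "minitest/autorun"

# Hypothetical class, invented for illustration: formats an event's
# date range for display on a page.
class EventDateFormatter
  def initialize(start_date, end_date = nil)
    @start_date = start_date
    @end_date = end_date
  end

  def to_s
    return long_date(@start_date) if @end_date.nil? || @end_date == @start_date

    "#{long_date(@start_date)} to #{long_date(@end_date)}"
  end

  private

  # "%-d" prints the day of the month without a leading zero.
  def long_date(date)
    date.strftime("%-d %B %Y")
  end
end

# The unit test exercises the class directly: no browser, no web server,
# no database.
class EventDateFormatterTest < Minitest::Test
  def test_single_day_event
    formatter = EventDateFormatter.new(Date.new(2012, 3, 5))
    assert_equal "5 March 2012", formatter.to_s
  end

  def test_multi_day_event
    formatter = EventDateFormatter.new(Date.new(2012, 3, 5), Date.new(2012, 3, 7))
    assert_equal "5 March 2012 to 7 March 2012", formatter.to_s
  end
end
```

A scenario that previously loaded a whole page just to check how a date was rendered can shrink to a few lines like these, and the integration suite only needs to confirm that the page uses the formatter at all.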
Bad Practice #2: We had many, many features that were relatively unimportant
Many of the less used or less strategic features on songkick.com have gone. This was an extremely painful decision to make, and we made it for bigger reasons than just improving our build time. But it certainly improved the build time a lot.
Fixing this and the previous point has turned a library of 1642 Cucumber scenarios into just 200.
Bad Practice #3: Our integration tests were actually acceptance tests
This test suite used to integrate over our website, domain library, database, message bus and background job workers. Each was spun up as a separate process in the test environment. We basically ran all our website tests against a working version of our entire system. Remember I said we tested virtually every code path? This added up to a lot of time.
Nowadays, our website integration tests are really only integration tests. They integrate over code inside a single project. Every interface to another project is stubbed.
All our database access code is isolated in a domain library behind a thin service layer and is stubbed in the website project.
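As a rough sketch of what stubbing the service layer can look like, the example below passes a stand-in service into a website-side class. The names UpcomingEventsPage, StubEventService and upcoming_events are hypothetical, chosen for illustration; they are not the real interfaces of our domain library.

```ruby
# Hypothetical website-side class: builds the data for a listing page.
# It only assumes that the service responds to `upcoming_events(city)`.
class UpcomingEventsPage
  def initialize(event_service)
    @event_service = event_service
  end

  def headlines(city)
    @event_service.upcoming_events(city).map do |event|
      "#{event[:artist]} at #{event[:venue]}"
    end
  end
end

# Stub standing in for the real service layer. It returns canned data and
# never touches a database, a message bus or another process.
class StubEventService
  def initialize(events_by_city)
    @events_by_city = events_by_city
  end

  def upcoming_events(city)
    @events_by_city.fetch(city, [])
  end
end

stub = StubEventService.new(
  "London" => [{ artist: "Artist A", venue: "Venue B" }]
)
page = UpcomingEventsPage.new(stub)
puts page.headlines("London").inspect
```

Because the website code only depends on the shape of the service's answers, the tests stay inside a single project and start instantly. The cost is that the stub can drift from the real service, which is what the handful of remaining acceptance tests are there to catch.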
Instead of over a thousand acceptance tests, we now have fewer than 10. They run against our staging and production environments, after deployment, instead of slowly booting up a test environment during the build.
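A post-deployment check of this kind can be very small. The sketch below is one possible shape, not our actual acceptance suite: the ACCEPTANCE_BASE_URL environment variable and the paths "/" and "/search" are assumptions made for the example.

```ruby
require "net/http"
require "uri"

# Default fetcher: performs a real GET and returns the HTTP status code.
# Net::HTTP.get_response does not follow redirects, so a 301 or 302
# counts as a failure in this sketch.
HTTP_STATUS = lambda do |url|
  Net::HTTP.get_response(URI(url)).code.to_i
end

# Returns the URLs that did not answer with 200. The fetcher is injectable
# so the function itself can be exercised without a network.
def failing_pages(base_url, paths, fetch_status = HTTP_STATUS)
  paths.map { |path| base_url + path }
       .reject { |url| fetch_status.call(url) == 200 }
end

# Only runs when executed directly with a target environment configured,
# e.g. ACCEPTANCE_BASE_URL pointing at staging (hypothetical variable name).
if __FILE__ == $PROGRAM_NAME && ENV["ACCEPTANCE_BASE_URL"]
  failures = failing_pages(ENV["ACCEPTANCE_BASE_URL"], ["/", "/search"])
  abort "Failing pages: #{failures.join(', ')}" unless failures.empty?
  puts "All acceptance checks passed"
end
```

Running something like this after each deployment checks the real system end to end, without the build ever having to boot a full test environment.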
Six months later
Productivity is up! Morale is up! It’s amazing just how much a faster build has improved our working experience.
Remember that the suite described above was only one of our builds. We had multiple projects with builds that took more than 30 minutes to run. Today none of our test builds takes longer than 5 minutes, and even that is now considered “slow”.
These mistakes are far clearer to us in hindsight than they were at the time, so I’d recommend looking carefully to make sure you are not infected by any of these bad practices. If you are, take the time to fix the problem. If you’re not, congratulations on avoiding the pitfalls we fell into!
PS: Bonus quick fixes for reading this far
Ok, so there were one or two things we did that probably count as Quick Fixes:
- tweaked our Ruby interpreter’s GC settings by trial and error (we use REE)
- ran the build entirely on the tmpfs in-memory file system
Both of these gave surprisingly significant improvements, and accounted for perhaps 10% of the speedup.
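Trial and error on GC settings goes faster with a number to compare between runs. The helper below is a minimal sketch for a modern Ruby, where GC.count is available; REE exposes different statistics, so treat it as an illustration of the idea rather than what we ran. The name measure_gc and the workload are made up for the example.

```ruby
require "benchmark"

# Runs the block and reports wall-clock time plus the number of GC runs
# that happened while it executed.
def measure_gc
  runs_before = GC.count
  seconds = Benchmark.realtime { yield }
  { seconds: seconds, gc_runs: GC.count - runs_before }
end

stats = measure_gc do
  # Stand-in workload; replace with loading and running part of a suite.
  200_000.times.map { |i| "object #{i}" }
end
puts "took #{stats[:seconds].round(3)}s with #{stats[:gc_runs]} GC runs"
```

Running the same workload under different values of an environment variable such as RUBY_GC_MALLOC_LIMIT, and comparing the two numbers, gives a rough signal of whether a setting is helping.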