Speed up your test suite by throwing computers at it

by Daniel Magliola

In the video titled "Speed up your test suite by throwing computers at it," speaker Daniel Magliola discusses strategies to improve continuous integration (CI) times, emphasizing the need to reduce the amount of waiting time associated with CI processes. Acknowledging the common frustration developers face when CI runs become lengthy, the speaker presents a variety of techniques aimed at optimizing CI without the need to rewrite tests. Key points include:

  • Rationale for Speeding Up CI: Daniel highlights the time wasted by developers while waiting for CI results and advocates for a systematic approach to leverage parallel processing by utilizing multiple machines.
  • Parallel Test Execution: The approach focuses on running tests in parallel rather than optimizing individual test performance. This can be achieved by splitting the test suite into manageable chunks, both manually and automatically, to facilitate running them on various servers simultaneously.
  • Managing CI Resources: By strategically managing CI resources, such as selecting appropriate instance sizes for different test types and adding a catch-all job, teams can control costs and ensure that newly added test files are never silently skipped.
  • Containerization Insights: The speaker explains how to optimize Docker container initialization times, which often constitute the majority of CI time. Suggestions include using popular and cached images and minimizing unnecessary layers in Docker images.
  • Reducing Setup Times: Magliola delves into specifics such as implementing gem caching, shallow clones for git repositories, and efficient Docker strategies to mitigate latency and enhance speeds.
  • Handling Flaky Tests: Addressing the issue of flaky tests, which exacerbate CI delays, Daniel proposes automating the detection of these tests to streamline reporting and tracking for resolution.
  • Critical Path Optimization: Finally, he emphasizes the importance of identifying critical paths in the CI workflow, advising that efforts should concentrate on optimizing these paths rather than peripheral processes to ensure a faster overall CI time.

In conclusion, the video encourages teams to adopt these approaches to better utilize CI resources, minimizing wasteful waiting and ultimately improving productivity in software development.

You've probably experienced this. CI times slowly creep up over time and now it takes 20, 30, 40 minutes to run your suite. Multi-hour runs are not uncommon.

All of a sudden, all you're doing is waiting for CI all day, trying to context switch between different PRs. And then, a flaky test hits, and you start waiting all over again.

It doesn't have to be like this. In this talk I'll cover several strategies for improving your CI times by making computers work harder for you, instead of having to painstakingly rewrite your tests.

RailsConf 2021

00:00:09.900 it was the best of times it was the worst of times it was the age of waiting
00:00:17.520 for CI which takes an age and it's my least favorite thing to do partly because I'm impatient and I'm bad at
00:00:24.000 multitasking but mainly because if you think about how much time you spend waiting on CI every day and you multiply
00:00:31.019 that by the number of people in your team that's a lot of time that we could be using better enough that it makes
00:00:36.960 sense to put in some real work to make it run faster and today I'd like to tell you about a few techniques that I've
00:00:42.660 used in the past that have given me great results first of all though hi my name is Daniel
00:00:47.820 and I learned all of this working for GoCardless a payments company based in London now as you probably noticed I'm
00:00:53.579 not originally from London I come from Argentina so in case you were wondering that's the accent
00:00:59.100 so I want to help you make your CI finish faster and as the title of the talk says I'm proposing that you do this
00:01:05.580 by throwing lots of computers at the problem so I'm not going to talk about how to make individual tests run faster
00:01:12.420 there are lots of resources out there you know using fixtures instead of factories mocking stuff lots of
00:01:18.240 techniques shared by people that can explain them way better than me the problem with these techniques is
00:01:24.240 that they normally involve rewriting your tests and they normally take an amount of time that is linear with the
00:01:30.720 number of tests that you have and your test Suite is probably huge or you wouldn't be looking at my face right now
00:01:36.420 so that's not very fun to do what I want to talk about is how to reduce the total
00:01:42.060 runtime of your CI test Suite with a focus on getting the most impact for the time that you invest and we do that not
00:01:48.960 by making your tests faster but by running them on lots of computers at the same time
00:01:54.180 what I want to focus on is making some systemic changes that are going to let your tests still run slowly and are going
00:02:00.960 to let your team still write tests the way that they used to but when you're running in CI you'll run massively in parallel so you can still finish quickly
00:02:08.160 now of course what I'm advocating for here is throwing money at the problem in exchange for saving developers time and
00:02:14.760 this is not always the appropriate way but for a lot of companies out there when you have an engineer with a typical
00:02:20.280 engineer salary waiting on CI and getting distracted by Hacker News there are lots of situations where it makes
00:02:26.520 sense to spend as much money as your CI provider will take from you and to be honest it's not even that much
00:02:32.940 I mean we're using tons of machines and spending on the order of less than a hundred dollars per developer per month
00:02:38.520 in total now before I start I want to make a couple of notes first of all you're going to see a bunch
00:02:45.420 of CircleCI on this talk mostly on all of my screenshots and I'll also mention a tool or two that they provide this is
00:02:51.900 because we happen to use Circle at GoCardless this is not an endorsement in any way I just happen to have the most
00:02:57.900 experience with them because of my day-to-day work and it was the easiest way to get screenshots of complex CI
00:03:03.360 setups and to be clear I also have my share of frustrations with them I'm not trying to recommend them particularly I just
00:03:10.080 happen to have used them a lot but more importantly this is not a circle specific talk everything that I'm going to talk about is around optimizing
00:03:17.099 things that every CI provider would have to do so it should be applicable to pretty much all platforms the same thing
00:03:24.000 goes for every time I say rspec I'm using rspec as an example out of habit but almost everything I'll be talking about
00:03:30.360 today will still work with minitest or any other test framework in Ruby or in other languages I've actually used a lot
00:03:37.200 of the same advice for a PHP project that we have and the specifics of the
00:03:42.299 projects could not have been more different but the thinking behind what I'm going to share applies to pretty much all languages and that's what's
00:03:49.319 going to help you speed things up another thing to keep in mind is that I'm going to show you a bunch of code to
00:03:55.440 explain these techniques this code is going to be oversimplified so I can explain the concepts quickly and some of
00:04:01.560 it is going to move pretty fast but the talk comes with a supplementary GitHub repo which you can find here and there
00:04:07.440 you're going to find fuller code examples more documentation on how they work and you can grab them from
00:04:12.659 there as a starting point for your project now unfortunately most of them you won't be able to just grab and use
00:04:18.000 you'll probably have to adapt them to your needs but I've tried to document what you need to adapt them
00:04:23.220 more importantly this is not a do these three things and you will get this exact result roadmap kind of talk your mileage
00:04:29.880 will vary your CI setup will be different from mine the specific things that are slow for you are different from
00:04:36.000 everyone else's I'm going to share a bunch of techniques some of which will be super relevant to you some maybe not
00:04:41.639 as much this talk is mostly a way to think about the problem of CI times and a bunch of
00:04:47.280 tools and techniques to help you improve those times but you will need to test these see which ones help which don't
00:04:52.620 and tailor them to your specific situations now I mentioned the PHP project that we had in addition to our Ruby ones and
00:04:59.340 we drastically improved the runtimes of both of them using these ideas but some things that made a massive difference to
00:05:04.919 one didn't move the needle at all on the other one and vice versa you will have to adapt these to your needs so it'll
00:05:10.259 take some work but it'll be worth it when you no longer need to wait on CI for ages so measure experiment see what
00:05:16.740 works for you all right with that out of the way let's get to it we want to make our CI suite
00:05:22.259 finish faster and we're going to do that by running things in parallel the first things first how do we even do this how
00:05:27.780 do we parallelize our tests and there are two main ways you can manually split off sections of your test Suite into
00:05:33.600 chunks that make logical sense and then have each of those chunks run in parallel and then you can take each
00:05:39.300 chunk and automatically split it into many machines to also run it in parallel now this is not an either or Choice
00:05:46.139 you'll most likely want to do both so let's start with the simplest one and it's very likely that you're already
00:05:51.900 doing this but it's still worth talking about if you can separate your test Suite into different pieces that make sense which
00:05:58.440 is going to likely be different subdirectories within your specs directory you can create different CI
00:06:03.960 jobs and call rspec on each of those different directories for those tests now Rails does this beautifully for
00:06:10.440 example if you look at their test Suite you'll see that they have separate jobs for active model active record action
00:06:15.539 cable Etc each of these is a coherent logical unit it's easy to understand when you're looking at the CI setup and
00:06:22.139 they all run in parallel so if your app is super modular like this this is a great win it's a
00:06:27.780 very easy starting point in most Rails apps you're going to have integration models controllers this is
00:06:33.780 not as granular and typically one of those is going to take a lot longer to run than the others but it's still
00:06:39.180 probably worth separating and it's a good starting point now two things that you should keep in mind one advantage of separating this is
00:06:46.680 that you get more granular control over the running of each job for example some CI providers will let you choose between
00:06:52.860 different instance sizes which obviously cost different amounts of money per minute now sometimes for certain kinds
00:06:58.560 of tests particularly integration ones you may need a bigger instance so you can fit all of your dependencies in now
00:07:04.620 if you take those tests and separate them out into their own job you can pay for larger machines only for the parts
00:07:10.860 of the test Suite that need them and you don't need to make the rest of the tests more expensive
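
As a sketch of what that kind of job split might look like in CI config, here is a CircleCI-flavored fragment. The job names, image tags, and `resource_class` values are illustrative assumptions, not taken from the talk; adapt them to your provider and app.

```yaml
# Hypothetical CircleCI-style config: one cheap job per logical chunk,
# and a bigger box only for the tests that actually need it.
jobs:
  models:
    docker:
      - image: cimg/ruby:2.7          # small, cheap instance is enough here
    resource_class: medium
    steps:
      - checkout
      - run: bundle exec rspec spec/models
  search:
    docker:
      - image: cimg/ruby:2.7
      - image: elasticsearch:7.10.1   # only this job waits for ES to boot
    resource_class: large             # pay for the big box only here
    steps:
      - checkout
      - run: bundle exec rspec spec/search
```

Because the `models` job declares no Elasticsearch container, it starts as soon as a box is available, while only `search` pays the boot-up and instance-size cost.
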
00:07:16.380 this is our setup for example and you can see we have separated search tests this would normally live under
00:07:22.680 integration but they had more dependencies than the others and they needed bigger machines so we split them up
00:07:28.500 the other thing you can do is control dependencies between jobs better for example your JavaScript tests will need
00:07:35.039 to wait until your node modules get installed but your model and unit tests probably don't so if you separate your
00:07:41.639 JS test from your unit test you don't need to wait for yarn to finish before your unit test can start and then they
00:07:47.759 can start sooner the same goes for our search example most of our tests don't need elasticsearch to be running so they
00:07:53.819 don't need to wait for that to boot up now the second thing you want to do is a
00:07:58.860 bit less obvious you want to have a catch-all job if you go very granular on
00:08:04.319 the splitting approach it's very easy to later add a new test subdirectory and forget to add the new CI job for it if
00:08:10.979 you do this and you later have a new directory spec/new_tests you'll
00:08:16.560 likely forget to add a new CI job for it your new test won't run in CI and you will never notice which is pretty sad
00:08:22.500 and it's dangerous what you want to do instead is have a final catch-all job and instead of
00:08:27.960 targeting a specific subdirectory to run you want to find all of your tests and filter out the ones that are already run
00:08:34.440 by other jobs and using find and grep -v lets you do this very easily
00:08:39.539 now I'm not going to lie this code is kind of ugly I get it but
00:08:45.060 it feels safe I mean the caveat is obviously that you need to remember to add an exception here if you later add a
00:08:51.120 new job for some other subdirectory but now the penalty for forgetting something is that you run some tests twice which
00:08:57.540 is way better than skipping an entire chunk of them so I think that safety justifies the yuckiness
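
A minimal sketch of such a catch-all filter, assuming a standard spec/ layout. The excluded subdirectories are examples; list whichever ones already have dedicated jobs in your setup.

```shell
# List every spec file, minus the subdirectories that already have
# their own dedicated CI jobs. If you forget to exclude a new job's
# directory, the worst case is a file runs twice -- never zero times.
leftover_specs() {
  find spec -name '*_spec.rb' \
    | grep -v -e '^spec/models/' -e '^spec/controllers/' -e '^spec/integration/' \
    || true  # grep exits non-zero when everything was filtered out
}

# In the actual catch-all CI job you would then run:
#   leftover_specs | xargs bundle exec rspec
```
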
00:09:05.339 so that's how you split your test suite manually which again I bet many of you are already doing but I think it's worth keeping these caveats in mind
00:09:11.940 the other way of parallelizing and the one we're going to focus the most on today is having multiple machines run a
00:09:17.580 single set of tests by automatically splitting the files between them now the key part of how you do this is
00:09:23.160 instead of giving your test Runner a directory to run you give it a list of all of the files in that directory and
00:09:29.220 you do this because once you have a list of many files you can split it into chunks with a little magic so you do
00:09:34.500 that in many machines each machine picks a different chunk of that list and runs only those and in aggregate you will
00:09:40.620 have run all of your tests but each of those chunks basically run in parallel now in order to split this so that no
00:09:46.980 two machines run the same file and that all files get run all right all right some of you already know how to do this
00:09:52.800 or you're doing this already and I can see you reaching for your phone right now to check Twitter stay with me for a
00:09:57.839 quick second because I got news as I said at the beginning your mileage will vary and not every section of this talk
00:10:03.060 will be relevant to everyone so I'm taking advantage of this new interactive RailsConf format to do some
00:10:08.580 unorthodox things that I couldn't do in a live talk now the good news is you can skip over things that you already know
00:10:14.279 if a section might not apply to some of you I'll make a comment about it and an icon like that is going to show up in
00:10:20.279 the corner with a timestamp so if a particular section doesn't apply to you just skip until the icon is gone and you
00:10:26.820 won't miss a beat so as I was saying earlier in order to split this list of files that you need
00:10:33.959 to run so that no two machines run the same file and that all the files get run each machine needs to know two things
00:10:39.480 which machine it is and how many total machines there are and then we can have each machine basically know
00:10:46.320 these two things through two environment variables and knowing this each machine can basically take every nth file with
00:10:52.140 an offset at the beginning and that basically does it now of all the things I'm going to talk
00:11:04.560 about today this is the one where different CI providers will vary the most CircleCI for example has a CLI tool
00:11:04.560 that is going to help you find your test files and it will split them for you now you specify a parallelism value for how
00:11:10.140 many boxes you want to run you run a command kind of like that and it pretty much just works
00:11:16.260 and if you put in a little bit more work you can also do smarter allocation based on historical times on each file which
00:11:22.079 is actually pretty cool now CodeShip has a less ideal approach it lets you specify a number of steps to
00:11:29.339 run in parallel and you can embed environment variables right there and this needs a little bit more work from
00:11:35.100 you I'll show you in a minute how to use this but that'll do it Buddy for example has something similar
00:11:41.459 to Circle where they will split the files for you ahead of time and put the list of files for each machine in these
00:11:46.500 env variables called buddy split 1 buddy split 2 etc and you can just use those directly pass them on to rspec
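
For reference, the CircleCI flow mentioned a moment ago is typically a pipeline like the one below (a sketch based on CircleCI's test-splitting CLI; the glob pattern is an assumption about your spec layout):

```shell
# CircleCI only: glob the spec files, let the CLI hand this container
# its share of them (optionally weighted by historical timings),
# then run just that share.
circleci tests glob "spec/**/*_spec.rb" \
  | circleci tests split --split-by=timings \
  | xargs bundle exec rspec
```

Each of the N parallel containers runs this same command and automatically receives a different, disjoint slice of the file list.
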
00:11:52.860 now sadly most providers don't have any tools to do this directly but you can do it yourself if you
00:11:59.519 slightly abuse a feature that almost all of them have the build matrix most providers will give
00:12:06.360 you a way to run the same job over and over with slightly different parameters and they call this a build matrix and the
00:12:12.540 idea is you can run the same test suite over and over with for example different versions of Ruby or different Gemfiles
00:12:19.019 pointing to different versions of rails and then you can make sure that your code is compatible with all of them
00:12:24.420 now this code above is from GitHub Actions but they're all very similar and some of those keywords that you can
00:12:29.700 see there mean something specific for different providers keywords like os and rvm or ruby but you can make
00:12:37.079 up your own keywords and then use those to set environment variables like you can see up there and that means that we
00:12:42.660 can abuse this feature and bend it to our purpose we can make up a box index in the
00:12:48.600 matrix and we use that to number all of our boxes and then we set these two environment variables the ones
00:12:54.899 I was talking about earlier and if we do this we now get four boxes each box knows which one it is and knows
00:13:01.620 how many there are and so with a little command line hackery you can get each box to pick up their part of
00:13:08.040 the split as I was showing you earlier now this script here is a simplified example but what we're doing is we're
00:13:14.160 taking all of the rspec files and passing them to awk line by line and in awk this variable NR will tell you what
00:13:21.360 row of the input you're in right now so you modulo that by the total count of machines you compare it to your machine
00:13:27.779 number and decide whether you want to proceed with this file or discard it and that gives you a split that you need
00:13:33.240 that's it now pay attention to that sort there that's important to keep things
00:13:38.820 consistent or you might end up with different orderings in different machines now again that code is very simplified
00:13:44.160 in reality it looks a bit more like this but doing this you can split your test between as many machines as you want and
00:13:50.279 parallelize like crazy now I admit it this does look a bit
00:13:55.380 ugly but if your CI platform doesn't help you split your tests this works and if you give it enough boxes it'll speed
00:14:01.980 up your test massively
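
A runnable sketch of that split, assuming two environment variables named BOX_INDEX (0-based) and BOX_TOTAL. Those names are mine, not from the talk; set them however your build matrix allows.

```shell
# Give each box a stable, disjoint slice of the spec files.
# The sort matters: every box must see the files in the SAME order,
# or the slices will overlap and leave gaps.
my_chunk() {
  find spec -name '*_spec.rb' \
    | sort \
    | awk -v idx="$BOX_INDEX" -v total="$BOX_TOTAL" 'NR % total == idx'
}

# On each CI box you would then run:
#   my_chunk | xargs bundle exec rspec
```

In aggregate the boxes cover every file exactly once, because each keeps only the rows whose position modulo the box count matches its own index.
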
00:14:08.160 okay so pretty much all of this so far has been an introduction I need to
00:14:13.260 explain how we run things in parallel so we can get into the main part of this talk
00:14:18.540 because now is when things get hard and interesting
00:14:23.639 because in theory there's no limit to how many boxes you could have right you can give it a thousand boxes and your test should run almost instantly but
00:14:30.540 obviously that's not how that works the truth is if you do just this you can improve your CI times a lot but it's not
00:14:37.380 going to be ideal there's a wall you're going to hit pretty quickly which is going to be imposed by your startup
00:14:42.779 times you see when you run one of these boxes it doesn't just start running instantly for most CI providers these
00:14:49.740 boxes run a container that has to be downloaded it has to be started then you need to run a bunch of setup tasks and only then you're going to run your tests
00:14:56.040 if you're not careful you can easily spend five minutes doing the setup and then you start getting very sharp
00:15:01.680 diminishing returns for each extra box I mean if you think about it even if you have infinite machines if you're doing
00:15:07.920 five minutes of setup it will still take at least five minutes to run your tests right
00:15:13.199 and also while I'm all for throwing money at the problem if you're waiting five minutes for each of those boxes you're
00:15:19.740 going to want a lot of boxes and that's a pretty big money bonfire for you so you want to focus on these setup times
00:15:25.440 and make them as small as you possibly can now a typical CI config looks a bit like
00:15:31.800 this and that last step I highlighted over there that's where we actually run
00:15:36.899 our tests but there's a lot of stuff that needs to happen before now at the beginning of the talk I said I wouldn't talk about how you make your
00:15:43.079 individual tests run faster all of that work would focus exclusively on that last step I
00:15:49.440 highlighted and that is what we normally look at to try and make faster I think it makes sense because it's what
00:15:55.740 takes the longest and it feels like we can control it but here's the thing here's how I see this list of steps
00:16:01.680 that last bit is what's actually doing the work that we want and is the part
00:16:07.680 that we can parallelize so if it's taking longer we can just throw more computers at it all the stuff that comes
00:16:13.320 before it is a necessary evil but it's waste and if you have more computers all of them need to do those
00:16:19.440 steps anyway so it doesn't parallelize at all now the problem with those is that they don't look like you can do anything
00:16:25.500 about it your tests that's your code you can change it you can optimize it you can do whatever you want with it but a container is going to take as
00:16:32.880 long to start as it takes to start right and bundler is just gonna take however long it takes to install those gems and
00:16:38.399 you do need those gems and all of those steps are necessary you can't get rid of any of them so it
00:16:43.500 really doesn't look like you can do much about it but in reality we can get clever here and there's a lot that we can do to
00:16:50.639 start chipping away at those startup times and to make this a whole lot better
00:16:55.860 so here's what you want to do your CI will probably show you something like this with all of the little steps and
00:17:01.680 again that's what they look like to me so what you want to do is look at all the steps that run in your CI job before
00:17:08.459 and after your actual tests and look at the running times and you want to focus on the ones that are taking a long time
00:17:14.220 and try to start chipping away now these times are not intended as a
00:17:19.380 weird Flex I just don't have a before screenshot but believe me they were a lot worse before we optimized them
00:17:26.280 an important point here is you only care about the slow ones if one of the things
00:17:31.320 that I'm going to talk about next only takes five seconds for you you don't care just move right along if your git
00:17:36.660 checkout takes two seconds there's not much point doing trickery to optimize it right if it's taking 30 seconds then it
00:17:43.200 may be worth it now at this point you might want to pause this talk and take a look at your actual
00:17:48.480 CI setup steps and the runtimes and that should put the rest of the talk into context and highlight what you want to
00:17:54.179 pay more attention to and with that in mind I'm going to talk about three things
00:18:00.000 installing your gems checking out your code and container spin up time
00:18:06.480 now depending on what CI provider you use some of these bits may involve swimming
00:18:11.520 against the current a little CI providers try to make it easy to get started with them and they do this by giving you a
00:18:17.340 sane default that is very easy to set up and does reasonable things and this is great for getting started
00:18:22.559 quickly but it's not so great if you're trying to squeeze out every last drop of performance from it because for that you
00:18:28.200 will want to customize things and to what extent you can customize things will depend a little bit on your
00:18:33.419 particular CI platform but from what I've seen most of them allow doing most of these things so they just sometimes
00:18:40.260 are going to involve going a little bit outside the beaten path so let's talk about bundle first in
00:18:47.280 order to run your tests you're going to need your gems installed and you either let your CI provider do this magically
00:18:52.620 for you or you install your gems doing something like this but as you know installing all of your
00:18:58.020 gems from scratch takes forever so in order to prevent that most CI providers give you caching capabilities some of
00:19:03.720 them do it automatically for you some let you do it yourself but basically what happens is this
00:19:08.820 before you install your gems you control where they get stored with a path setting and
00:19:14.580 then after you install your gems you save that directory to your CI's cache now on the next run before installing
00:19:20.400 you restore that cache which means bundle install is going to run almost instantly because all of your gems are already there unless your Gemfile changed
00:19:27.299 and you are probably already doing this caching and when you first set this up this works great the first run takes
00:19:33.600 a long time because it needs to install everything but then your gems are cached and future runs actually go through this pretty quickly and every time you change
00:19:40.140 the Gemfile you only need to install a gem or two that are new that gets cached and things continue to be fast
00:19:46.100 however over time you might find that restoring that cache starts taking a
00:19:51.360 longer and longer time in our case I've seen a restore time of about two minutes at its worst which is pretty bad and
00:19:57.360 this happens because as you upgrade gem versions the old versions are left behind in the cache and they blow the
00:20:03.059 size of the file that you're saving and restoring and moving around and to prevent this you want to tell bundler to
00:20:09.419 automatically clean out older versions when you do this bundler will delete all versions of gems that your Gemfile no
00:20:14.640 longer uses and that's going to make the CI cache smaller and faster to move around now unfortunately the specifics of how
00:20:20.640 you do this depend on your version of bundler in older versions for example you pass --clean to bundle install
00:20:26.780 you want to look at the documentation for your specific version of bundler and while you're in there look for other
00:20:32.520 flags that look like they can save you space and experiment with them bundler has a lot of configurability
00:20:38.760 I'm also going to talk about manually deleting unused files a bit later and that can save you even more space but it's a
00:20:45.179 bit harder to do and finally you want to look at the docs for your CI provider in detail all of the different
00:20:52.260 CI providers have slightly different features around this that you may be able to use and gain even more time so
00:20:57.480 keep an eye on those bundle install and cache restore times and if they get long this may help
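
Concretely, the bundler side of this might look like the following (Bundler 2.x flag names; using vendor/bundle as the cached directory is a common convention, not something mandated by the talk):

```shell
# Put gems in a directory the CI cache step can save/restore,
# and let bundler prune gem versions the Gemfile no longer references
# so the cache doesn't grow forever.
bundle config set --local path vendor/bundle
bundle config set --local clean true
bundle install --jobs 4
```

Your CI cache step then saves and restores vendor/bundle, keyed on a checksum of Gemfile.lock so the cache is rebuilt only when dependencies change.
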
00:21:04.200 now next step is git checkout or git clone this is one where you should only spend time on it if you see this step
00:21:10.320 take a while for most projects this is going to be pretty quick but as you accumulate history and your repo
00:21:16.919 gets big it's going to start taking longer and if it's taking 20 or 30 seconds it may be worth trying to do a
00:21:23.700 shallow checkout where you only get the last few commits instead of the entire history now this may or may not be faster
00:21:30.059 depending on a number of things but it is worth trying as a first approach if your checkout is taking a long time
00:21:36.299 now unfortunately I can't really tell you how to do this in your situation this is another one where CI platforms will give you very different options
00:21:42.179 either directly or using some library that somebody has made and you can import so you want to check the docs for
00:21:48.120 how to do this I mostly just wanted to call out that this is one to keep an eye on and that a shallow checkout might
00:21:54.299 help and if that doesn't help I'm going to be talking about how to do this in a different way a bit later with a
00:22:00.900 different approach all right so we've talked about two of the biggest
00:22:06.960 usual time wasters now let's look at probably the worst one container initialization time
00:22:12.720 now this section is probably the most bizarre one in my talk I know it will
00:22:17.760 sound like really weird advice but it can pay off massively if you're having this problem so give this a shot
00:22:24.419 generally all CI providers will run your stuff on Docker containers and if you've ever used Docker you're probably
00:22:29.460 familiar with this sight now unfortunately for this part of the talk to make sense I need to give you a
00:22:35.159 brief introduction to how Docker Works Docker containers are a sort of separate space in your machine with their own
00:22:40.559 little file system and processes that can't touch each other sort of like lightweight virtual machines and just
00:22:47.220 like a virtual machine they get booted up from an image which is basically a giant collection of files and these
00:22:54.539 images they get created from a set of instructions that you put in a Docker file which looks a little bit like this you may say start from a plain Linux
00:23:02.159 image install Ruby on it copy my Gemfile in it and run bundle install and copy all the files from my app and take
00:23:07.980 the result of all of that and that's my image and you can specify when that image runs run this command to start my app that's
00:23:14.400 the last unicorn command on there and just like you can create your own image there are lots of common tools
00:23:20.820 like postgres or redis which have pre-made images that you can just use these live in a central registry similar
00:23:28.080 to RubyGems and that means you can tell Docker hey run me a postgres and a redis and this custom image that I made
00:23:33.480 which is my app and the way this works is when you try to run an image if you don't have it Docker will go to the registry and download it
00:23:39.840 for you just like bundler does when you install a gem and when you're running in CI it's very
00:23:45.000 common to do this you tell your CI provider run me a postgres and redis and also run my test in a container image
00:23:50.340 that has Ruby 2.7 in it and most CI providers give you a bunch of Handy
00:23:56.039 images that they have pre-made for you so you may get an image that already has Ruby in it but also has node and chrome
00:24:01.620 in there so you can easily run your integration test without having to install everything yourself and understanding images is good but if
00:24:09.059 we want to be able to get those containers to start faster we need to go one level deeper we need to talk about layers
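The Dockerfile described a moment ago might look roughly like this (the base image, paths, and the unicorn command are illustrative stand-ins, not the speaker's actual file):

```dockerfile
# Sketch of the Dockerfile described above -- names are illustrative
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y ruby-full build-essential
WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN gem install bundler && bundle install
COPY . .
CMD ["bundle", "exec", "unicorn"]
```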
00:24:14.700 because one of the really interesting ideas in Docker is that when building this image of yours there are things
00:24:20.100 that take a lot of time to run but they don't change very often and there are things that do change very often but
00:24:27.360 they don't take as long to do so the way this works is when it's building your image at each step in the Dockerfile it
00:24:33.120 does what you asked but all of the writing that it does to the file system in that image gets stored separately from
00:24:38.820 what's already there from the previous steps it gets put in a layer and this layer depends on the previous existing
00:24:44.039 layer and adds or modifies files and Docker will also take a hash of what you're doing so that the next time you
00:24:50.100 try to build the same image if the step hasn't changed Docker knows that because the hash matches and it goes ah I'll
00:24:55.559 just use that layer I've got cached over there and saves you a lot of time now in this sample Docker file you'll be
00:25:02.039 building this part here over and over but this whole chunk at the beginning doesn't change you're always installing
00:25:07.980 the same version of the same thing so Docker just uses the layers it has cached now this part does change because your
00:25:14.520 app files probably changed and so for this it does need to make new layers but for everything else it can
00:25:19.740 save itself a lot of time so the result of this is that your image is not a single monolithic file it's a
00:25:26.279 stack of layers and each of them depends on the previous ones but they can be downloaded and cached independently and
00:25:32.820 they can also get shared so that image you started from it could be anything it can be your own image if
00:25:38.340 you want and it can already come with a lot of layers in it so if you have a bunch of different applications on Ruby
00:25:44.580 2.7 for example you may end up making your own base image that includes the installation of Ruby and all the stuff
00:25:50.580 that you have that's common to all of your apps and then you can reuse that in the docker file for each app now if you do that the layers in that
00:25:57.419 custom Ruby image that you made will get shared for all of the apps on the machine they will only be built and downloaded once
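As a sketch of that shared-base pattern (registry name and packages are hypothetical), the two Dockerfiles could look like:

```dockerfile
# base/Dockerfile -- built and pushed once, shared by all Ruby 2.7 apps
FROM ruby:2.7
RUN apt-get update && apt-get install -y libpq-dev nodejs

# app/Dockerfile -- each app starts from the shared base, so the big
# layers above are built and downloaded only once per machine
FROM registry.example.com/my-team/ruby-base:2.7
WORKDIR /app
COPY . .
```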
00:26:03.480 and that shared part is also probably the biggest part of your whole image whereas the bottom bit is going to be
00:26:10.380 different for each app and for each build of your app and so it will have to get downloaded every time but at least the
00:26:15.840 big ones at the top get reused and that is what you're seeing when you see this download screen this is Docker trying
00:26:22.740 to run an image and is downloading all of the layers that it doesn't already have and you've also probably seen
00:26:28.260 something like this where some layers already exist and those don't get downloaded and here for example we're
00:26:34.500 only downloading one layer the other ones are already cached now the reason I'm talking about all of
00:26:39.720 this is that depending on whether you get this or this is going to make a huge difference to
00:26:45.600 how long it takes to start your containers you really really want the machine that runs your test to already have downloaded the images that you're
00:26:51.960 going to use or at least a lot of its layers because if it has your containers will start almost immediately and if it
00:26:58.799 hasn't then you will first have to download probably about a gigabyte's worth and that can take a while and
00:27:04.500 it needs to extract those layers and only then it can start to spin them up so you really really want the machine
00:27:09.960 that's running your tests to already have the layers that you're about to use now unfortunately you generally have
00:27:16.440 absolutely no control of this your CI provider has a gigantic pile of computers each one of them is running
00:27:22.020 tons of containers for lots of people and you have absolutely no control over which machine your test lands on or what
00:27:27.960 layers that will have already cached and this is why I said earlier that this is a somewhat bizarre piece of advice
00:27:34.380 I'm basically saying try to make sure the machine you have no control over already has your layers that sounds kind
00:27:40.559 of nuts but what runs in these machines isn't random
00:27:46.260 you can't guarantee what a machine has already downloaded but you can try to influence the odds in your favor there
00:27:52.260 are loads of people that are running the stuff in these machines lots of these people are going to be using Ruby so they'll be using a Ruby image probably
00:27:59.100 the Ruby image provided by the CI platform so if you're running Ruby tests it's quite possible that your machine will
00:28:05.940 have a Ruby image cached because somebody probably ran Ruby tests there earlier and here's the bit you can control
00:28:13.020 not all of the images that you could use will be equally likely to be used by others
00:28:18.480 some versions are going to be more common than others so you're going to have for example a lot more people using
00:28:24.779 Ruby 2.7.2 than 2.7.0 for example just because it's a newer version
00:28:30.179 and both of them are going to be a lot more common than I don't know 2.3.5 just because it's old
00:28:36.240 the same goes for postgres if you're using postgres latest it is way more likely that somebody else has used that
00:28:42.539 recently than if you're using a random outdated version like I don't know 9.6.17 and this really bit us at one point
00:28:49.919 we were using a weird version for an image that was actually pretty large to download and our containers were taking
00:28:55.080 ages to start and just switching to a very similar but more popular version saved us about a minute of setup
00:29:03.120 so what you want to do is make sure you're using what looks like the most popular versions of your dependencies so
00:29:08.820 you can load the dice in your favor and increase the chances that the images that you use or at least a good chunk of
00:29:14.340 their layers are going to be cached now again this is still weird advice because you have no way of knowing what
00:29:20.820 other people are using but if you see that your container startup times are high you can look inside that container
00:29:26.700 spin up step in your CI platform and you can look at whether it is frequently downloading layers or if it's using the cache and if it's downloading things
00:29:33.480 very often you can start looking at what are the other available images that you could use and trying different ones to
00:29:39.779 see if they get cached more often you also want to point to less specific versions which are going to be more
00:29:46.559 commonly used by other people now in Docker you can have many different tags pointing to the same actual image and
00:29:52.020 these tags can get repointed over time so for Ruby for example you will have tags for 2.7.1 and 2.7.2 but there's also
00:29:59.220 just 2.7 and 2.7 points to the latest patch version and probably most people
00:30:04.860 will be using that one as it keeps getting repointed over time when 2.7.3 comes out 2.7 is going to point to that
00:30:10.919 one instead and the same happens if you have a latest tag like postgres latest or redis
00:30:16.260 latest that will generally be the default that most people use so it's going to be more likely to be cached
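In CircleCI-style configuration, for example (syntax varies by provider; `cimg/ruby` is CircleCI's convenience-image line, used here as an assumption), the advice about provider images and floating tags might look like:

```yaml
jobs:
  test:
    docker:
      - image: cimg/ruby:2.7-browsers  # provider image, tracks latest 2.7.x
      - image: postgres:latest         # floating tag, more likely cached
      - image: redis:latest
```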
00:30:22.799 now this advice it comes with a big disclaimer all of the things I'm proposing that you
00:30:28.620 do today like everything in life they have trade-offs but this one in particular has a massive one because one of the
00:30:35.279 Glorious things about Docker is that you can specify the exact version of everything and fully control your
00:30:40.380 environment so the usual best practice is choose the exact version of everything in CI that you're running in
00:30:46.860 production because then you're testing on the exact same environment or as close as you can get it
00:30:52.080 whereas I'm standing here suggesting that you do the exact opposite and that implies some risk
00:30:58.860 if you use the latest postgres now you're testing on a different version than the one you're actually running in production and that
00:31:05.279 could bite you on the flip side you can save hundreds of
00:31:12.419 hours of developer time collectively if you do that so which one matters the most will depend on your particular
00:31:18.600 situation and there's also some nuance here some middle ground for example you almost
00:31:24.120 definitely don't want Ruby latest because Ruby changes a lot between minor versions you may get hit by deprecations
00:31:29.880 backwards-incompatible changes or other problems like that but if you're using Ruby 2.7.2 in production pointing to the
00:31:36.899 latest 2.7 is actually probably safe enough and your CI will probably still be trustworthy and if you do that you
00:31:43.260 can get the startup speed benefits of having a more common more cached version now if you're running an old version of
00:31:49.200 elasticsearch or some software that has had major backwards incompatibilities then yeah you'll probably have to still
00:31:55.440 run that specific version and that sucks but for things like postgres which have really good backward compatibility for
00:32:01.320 example using the latest one is probably fine oh and make sure you're using the docker
00:32:07.080 images that your CI provider gives you not the normal official ones because those are the ones that everybody else is going to be using and they're going to
00:32:13.080 be more cached again it's weird advice you're making CI less deterministic which is normally the
00:32:19.260 opposite of what you want but speed so you're gonna have to weigh this one the other big disclaimer of course is
00:32:25.320 that this is also much less deterministic in terms of timings than anything I've talked about so far you're always literally rolling the dice on
00:32:32.399 whether your machine is going to have your layers or not now doing this lets you weight the dice a little bit in your advantage but it's still a crap
00:32:39.120 shoot and because of that it can be harder to gauge whether this helped or not because the runs that you're observing
00:32:45.480 when you're trying it may or may not be representative and I'm going to talk a bit more about that later but the general gist for this is try to
00:32:52.919 use common versions of things if you can because you can make your containers boot up a lot faster on average
00:33:03.840 okay so we've covered how to speed up bundler git checkout and container startup but now that I've talked about Docker
00:33:10.799 I'd like to take a second look at these same three topics
00:33:17.159 but for a more advanced scenario if you have the ability to build your own containers and push them to a registry
00:33:23.460 and if you're a CI platform lets you run your own containers there's more we can do to gain even more speed
00:33:29.460 now this takes some more work than the previous tips so maybe it only starts being economical when you start having larger dev teams which means you're now
00:33:35.880 wasting a ton more human time waiting on CI and you also probably have developed some tooling at that point that's going
00:33:41.220 to let you do this more easily so it takes a little bit more work but for us at least it was definitely worth it
00:33:47.039 now if you can't do this or you don't want to have to deal with making your own containers feel free to skip ahead
00:33:52.380 about eight minutes I'll go back to non-Docker topics then so skip ahead until you no longer see the little whale
00:33:57.899 in the corner but if you can build your images easily one thing has worked really well for us
00:34:03.360 is having a custom Docker image for your CI environment now this will not be the same Docker image that you run in
00:34:09.119 production it's going to be very very different so you're going to need two Docker files in your repo one for production one for CI and what you do is
00:34:16.020 you start from a Docker image that your CI provider offers so it's going to be heavily cached and you also try to start
00:34:21.480 from one that already has most of the stuff in it so Ruby Chrome whatever you need and then you add the stuff that you
00:34:27.119 need to run your tests which is basically the same setup you normally do in CI most of it you do it in the docker
00:34:33.060 file instead and the way you will work with this is you will automatically build this image every time you merge to your main branch
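The merge-to-main build just described could be sketched as a CI job like this (the job syntax and registry name are hypothetical; adapt to your provider):

```yaml
build_ci_image:  # runs on every merge to main; test runs never wait for it
  steps:
    - run: docker build -f Dockerfile.ci -t registry.example.com/app:ci .
    - run: docker push registry.example.com/app:ci
```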
00:34:39.540 and you will always tag it with CI and in your CI configuration you point to your registry and to that CI tag so
00:34:46.980 that you're always getting the latest one every single time now importantly this is not always going to be a reflection of what's in the
00:34:53.220 branch that you're testing right now these images may take a while to build and you don't want to wait for them
00:34:58.560 that's key if you have to wait for them to build before you can run your tests you're actually causing more harm than
00:35:03.599 good but since you're always using the same tag if the last merge to main hasn't finished building that's fine you'll be
00:35:09.180 using the one from the previous merge which is good enough and also most of your tests are going to run in other branches but you're using
00:35:15.119 the docker image from your main branch so you're also going to have to apply all the changes from your branch onto
00:35:20.820 that one so it's important to keep this in mind because there's a couple of things that you will still have to do in
00:35:25.980 CI again after the image starts so with that in mind I want to talk about how to do this same three things
00:35:32.339 again but with a Docker twist first bundler as we discussed earlier
00:35:37.680 your CI provider gives you a cache that you can use to save and restore your bundle install and this cache is really
00:35:43.500 useful but sometimes it can actually be quite slow to restore one thing that can
00:35:48.660 sometimes help is running bundle install on your Docker build and this means your gems will already be there when the
00:35:55.140 container boots up and then you don't need the cache anymore and you don't need to do that in CI because you
00:36:00.839 already have the gems in your image now there's two things to keep in mind first this image was built against your
00:36:06.180 main branch if you're testing in a branch where you've updated the gem file you won't have those latest gems so you
00:36:11.220 still need to run bundle install in CI the bundle install though most of the time is going to finish instantly and very
00:36:17.940 rarely it's just going to install one gem or two so it's going to be very quick now the other thing to note is that
00:36:23.280 you're trying to save time by not needing to do the cache restore but the layer where you installed
00:36:28.740 all of your gems still needs to be downloaded so you still care about how much space these gems take on disk or the
00:36:34.740 layer is going to be huge and it's going to take forever to download now earlier when we were using the CI cache the main Improvement for this was
00:36:41.160 deleting old gems now that doesn't apply here because you never have old gems if your Gemfile
00:36:46.980 changed your container build starts that layer from scratch so you don't have any of the old baggage however there is
00:36:54.300 still a lot of bloat when you're using bundle install because bundler keeps a bunch of caches that you don't need and it can be quite big so you want to get
00:37:00.359 rid of them now there's a couple of parameters that bundler gives you to prevent those caches but those have changed over time and I
00:37:06.780 may be doing something wrong here but it still seems to keep those caches around even if you put in all those parameters
00:37:12.720 however a thing you can do is you can explicitly delete those caches after you install your gems and that will end up
00:37:18.839 saving you the download time your bundle install command is going to end up looking a bit like this and
00:37:24.720 you're going to have to tweak those two paths because they will change with your particular system
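As a sketch, that command in the CI Dockerfile might look like the following; the two cache paths are typical of the official Ruby images but are an assumption here, so confirm them by inspecting your own image:

```dockerfile
COPY Gemfile Gemfile.lock ./
# Install gems and delete bundler's caches in the SAME run step,
# so the cache files never land in a layer (paths are assumptions)
RUN bundle install --jobs 4 \
    && rm -rf /usr/local/bundle/cache \
    && rm -rf ~/.bundle/cache
```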
00:37:29.820 now I'm going to talk a bit more in a minute about how to find those paths and how to figure out how big your layers are and how to make them smaller but as
00:37:36.780 an example in our particular Docker build deleting these files takes this layer from 500 megabytes to 400 we save
00:37:42.599 almost 100 megabytes which otherwise we will be downloading over and over into those containers so it's quite a big
00:37:48.540 reduction it really pays off to do this and by the way this is something you can also do if you're not making your own
00:37:55.079 images if you're doing a normal bundle install in your CI you can still delete these directories after the install and before you save the cache to make
00:38:01.380 that cache smaller the only issue is it's not going to be as obvious to find what those paths should be but if you
00:38:06.420 can find them that will also help so that's how we make bundler faster the same idea applies to git checkout
00:38:14.280 or git clone if you remember the problem cloning the whole repo may take a long time because all the history is coming
00:38:19.619 with it however when you're doing a Docker build you don't care how long it takes so you can just do a full git clone and
00:38:25.920 then when the container runs in CI most of your repo is already there all you need to do is git fetch for the branch
00:38:31.380 that you're testing and that's only going to pick up the few commits that are not already in the image and that's going to be way faster than even doing a
00:38:38.280 shallow clone so you might want to try that finally I want to talk about
00:38:44.220 optimizing your layers as I was mentioning earlier your container will be composed of many layers one for each
00:38:50.099 command that you run in your Docker file and you want to maximize the likelihood that as many as possible of those layers
00:38:55.380 are already cached and that's why we start from a base image that your CI platform gives you but there are layers that will almost
00:39:01.680 never be cached and these are the ones that you added to the base image and here there's a balancing act that you
00:39:07.140 want to play because Docker is going to download a bunch of those in parallel now how many depends on your CI platform's
00:39:12.900 configuration and you want to try to optimize for that number what you're trying to do here is balance the bandwidth that
00:39:19.560 you get for each parallel download stream and the latency of round trips let's say you have a gigabyte of layers
00:39:24.599 to download if you only have one layer that's one gigabyte you won't be parallelizing that download at all all of the megabytes that need to come
00:39:31.079 need to be downloaded serially basically and you won't be using as much
00:39:36.359 bandwidth as you could if you have more layers that download will happen more in parallel and it will download faster overall however if you have too many
00:39:44.520 layers a lot of them are going to be tiny and you're going to be doing lots of round trips between your CI machine and the Docker registry and you're going
00:39:51.060 to waste a lot of time there so you don't want to have too many layers and this is why it's very common
00:39:56.940 to see the double Ampersand in Docker files if each of these little instructions
00:40:02.400 here was a separate run command each of them would end up as an individual layer the double ampersand means you're
00:40:09.359 going to run all of the commands and you're going to get a single layer with what's left after doing all of those things
00:40:15.660 and so that's how you optimize for how many layers you have the other thing you're optimizing for is the size of those
00:40:21.420 layers in the first place you don't want it to be huge or they will be slow to download now you want to look at the size of your
00:40:26.820 layers and if they're big see if there is stuff that you can delete some examples of these are temp files
00:40:32.700 left behind from build processes or those bundler caches or tar.gz files that you may have downloaded and extracted
00:40:38.820 you want to remember to delete the tar file if it was big now it's very important that you delete
00:40:44.099 those files in the same run command where you create the stuff pay attention to those ampersands because if those
00:40:50.520 deletions were in a separate run command then you have one layer where you add all the useless crap that you don't want
00:40:55.619 and that's going to have to be downloaded later and then you have a second layer that deletes it that completely defeats the purpose
00:41:02.280 okay so to do this you need to see what layers you have in your image and you need to see how big they are and the
00:41:07.380 easiest way to inspect this is by using the docker history command that'll show you all the layers in an image and how
00:41:12.839 much space they take and if you want to look inside those layers there's this great tool called dive that's going to show you the file
00:41:18.839 system of that image and what each layer has in there and this is really useful to find stuff that was left behind by
00:41:24.420 build processes like those bundler caches that I mentioned earlier and this is also how I found those paths
00:41:31.020 to delete here's what that looks like this is the dive tool up here it's showing us all of the layers in the
00:41:36.480 image and we can pick which one to look at and over there is showing us the file system and in color the stuff that's new
00:41:41.820 on this layer and you can ask it to only see new stuff and you can start collapsing directories and you can
00:41:47.520 clearly see that 19-megabyte cache directory there and in a minute we are going to see
00:41:55.920 there's the other one another 77 megabytes so those are the two paths that you saw me delete in the previous command
00:42:01.500 that's how you find them and when you delete those two we save 100 megabytes so if your
00:42:06.720 container is taking a while to download it may be worth inspecting the layers and seeing if there's any low hanging
00:42:12.119 fruit there that may help all right we're done now with tricks to
00:42:17.700 optimize your startup times and helping your tests start faster is one of the most important things you can do
00:42:23.400 that's going to let you parallelize more aggressively and have faster overall times but I want to talk about a few more
00:42:29.520 things that are important to keep in mind first of all observability as I mentioned a number of times now
00:42:36.000 CI times are quite variable and that's because there's a lot of factors involved for starters how long the
00:42:41.400 container takes to start depends on whether the particular machine you're running on has the layers cached but also sometimes the network is running a
00:42:47.339 bit slow and things take longer other times the machine you're running on is just having a sad day
00:42:53.040 and because of this it's really hard sometimes to know whether you're actually making improvements you may do a little experiment in your CI config
00:42:59.880 you put it in a branch and it runs super fast and maybe that's because your experiment was successful
00:43:05.099 or maybe you just got lucky it's hard to know because it varies so much and it's also hard to stay on top of your setup
00:43:10.920 which over time is going to Trend towards taking longer and longer and you're probably not going to notice that drift unless you're watching it like a
00:43:17.700 hawk which you aren't so to combat this the very least you can do is sample when you're experimenting
00:43:25.079 before you try some change you're going to want to see how long it's taking now you want to get a Baseline and to do this don't look at just the last build
00:43:31.920 in main look at the last five ten builds over the last couple of days and notice not just the average time the different
00:43:38.460 steps take but also the variance get familiar with how things normally perform what steps are consistent and
00:43:44.460 which are all over the place because those are going to be the ones that lead to slow builds and the ones that you
00:43:49.560 want to focus on if you can and when you're running an experiment re-push to your branch three five times
00:43:55.380 so you can have more samples and that's going to give you an idea of whether you're actually changing things or just getting lucky
00:44:01.140 but ideally you can build some observability over this it's going to be hard to be specific on what's the best
00:44:06.599 way to do this because it'll depend a lot on your specific observability stack but as a general pointer GitHub will
00:44:12.900 send you web hooks when CI steps complete and you can use those web hooks to build information about timings and push those to your observability layer
00:44:19.500 of choice and then you can make yourself a beautiful dashboard that's going to let you see with more precision how your
00:44:25.440 changes affect CI timings and how they evolve over time now I admit this is quite a bit of work
00:44:31.020 but we've had great results with this because it's an early warning system that things are getting slower and it also
00:44:36.839 gives us much more confidence on the changes that we make
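As a minimal sketch of that pipeline, assuming GitHub's `check_run` webhook payload (which carries `started_at`/`completed_at` timestamps; the payload below is simplified and fabricated), you might compute a step duration to forward to your metrics backend like this:

```ruby
require "json"
require "time"

# Extract a CI step's duration from a (simplified) GitHub check_run
# webhook payload, so it can be pushed to an observability backend.
def step_duration_seconds(payload_json)
  run = JSON.parse(payload_json).fetch("check_run")
  (Time.parse(run["completed_at"]) - Time.parse(run["started_at"])).to_i
end

payload = '{"check_run": {"name": "rspec",
            "started_at": "2023-05-01T10:00:00Z",
            "completed_at": "2023-05-01T10:12:30Z"}}'
puts step_duration_seconds(payload)   # => 750
```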
00:44:45.300 flaky tests are the bane of our existence the worst scenario which a lot of us are
00:44:51.000 used to is when your tests take forever to run and then a flaky test fails so you need to rerun everything again and
00:44:57.180 wait for it again it just adds insult to injury and no matter how fast you make your CI Suite run if you need to run it
00:45:03.780 again often you're going to have a sad time and for this I have two suggestions first RSpec has a feature where it
00:45:11.099 will store in a file the tests that failed and then you can execute it again and run only the failed tests
00:45:17.460 and the way this works is RSpec is going to store the test failures in a file that looks a bit like this
00:45:23.220 and it's going to use that to know what to run next time so you can do this in
00:45:28.500 CI you basically run your test and then re-run only failures in case there were some flakies and this doesn't fix the
00:45:34.800 problem but it makes it less likely that it will make your CI red now the other thing you want to do is
00:45:40.619 fix your flakies and for this it's important to think about motivations when you get a flaky you obviously would
00:45:47.520 like to fix it like the good developer that you are but you were actually trying to get something done and you need to ship that thing and the flaky is getting in
00:45:53.819 your way and the thing is that test is actually owned by another team anyway so you don't really know what to do with it so you just retry the branch and move
00:45:59.819 on with the actual thing you are trying to achieve we all do it we don't like it but that's life right
00:46:06.780 and then every now and then we do a bug bash or a hackathon or whatever and we're gonna fix these but nobody's
00:46:12.060 actually keeping track of what tests were flaky so you can't fix them either like it's hard to actually get around to
00:46:17.520 fixing flaky tests but we can do better by having robots that help us keep track
00:46:23.220 of them and that gives us the right alignment what you want to do is automatically detect these flakies and create a ticket
00:46:29.280 in your bug tracker now the basic idea of how you do this is once you've run and had failures you
00:46:34.800 make a copy of the failure file and run again with only failures and now you have two failure files one for each run
00:46:40.859 and they should be identical and if they aren't then you have a flaky test and with a bit of bash hackery you can
00:46:48.240 find tests that have failed in the first run but succeeded in the second one now don't look at that code too hard you
00:46:53.640 can find the complete thing explained in the supplementary repo but you can find those flakies and pass their paths to a
00:47:00.839 utility that will create a ticket in Jira or whatever bug tracker you use in order for somebody to fix that
00:47:07.859 test now jira has a CLI that will do this for you which you can install in your CI Docker file other bug trackers
00:47:13.319 also have clis and worst case scenario you can just curl into that API and now there's an actual ticket opened
00:47:20.700 by a robot so it's not even you being annoying or anything there's a ticket which somebody needs to triage and assign to the right team and I mean it may
00:47:27.300 get prioritized for way later sure but now it's visible and it's trackable and trackable means that it's fixable
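A minimal Ruby sketch of the comparison step (the talk does this with bash; the file names here are illustrative stand-ins for RSpec's persistence files):

```ruby
require "set"

# Tests that failed on the first run but not on the second are flaky.
def flaky_examples(first_run_file, second_run_file)
  first  = File.readlines(first_run_file,  chomp: true).to_set
  second = File.readlines(second_run_file, chomp: true).to_set
  (first - second).to_a
end

# Tiny demo with fabricated failure lists:
File.write("run1.txt", "./spec/user_spec.rb[1:1]\n./spec/order_spec.rb[1:2]\n")
File.write("run2.txt", "./spec/order_spec.rb[1:2]\n")
puts flaky_examples("run1.txt", "run2.txt")   # => ./spec/user_spec.rb[1:1]
```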
00:47:34.859 we have done this and it has massively helped us reduce the number of flakies that we have because now it's a ticket
00:47:40.020 that is someone's problem and even if you batch them and solve them a lot later it is way more actionable than
00:47:46.260 having something that gets in your way at the worst possible time that you're just gonna retry and get on
00:47:51.720 with it and move on with your life
00:47:58.079 okay let's talk about uneven distributions I talked earlier about how container startup
00:48:04.140 times are the main barrier to lots of parallelism because if your tests take a long time to even start pretty quickly
00:48:09.839 you get to a point where adding more machines isn't really very helpful there's also another barrier you can hit
00:48:15.180 which is tests distributed unevenly between machines now the ideal of parallelizing is you're going to go from
00:48:21.420 this to something kind of like that now we've got 10 machines so after setup costs our tests should take a tenth of the
00:48:28.500 original time but this is not exactly what you get what you get is a bit more like this because some test files take longer than
00:48:34.980 others so not all of the machines finish at the same time and now your CI time is as long as the longest
00:48:40.740 running machine now what you see here is actually pretty good if you're getting a distribution like this you got pretty lucky that
00:48:47.160 little red line down there that's how much extra time you got over the ideal scenario so that's not bad however
00:48:53.819 sometimes you get into a pathological situation where this happens now this is kind of exaggerated for effect I mean
00:49:00.060 you would have to get really unlucky to get this but you kind of see the idea here if you have this distribution
00:49:05.280 adding more machines doesn't really help you that much unless you get lucky and that rejiggles the files in a more
00:49:11.339 favorable way but hoping for that kind of rejiggle is not a good situation to be in
00:49:16.740 and the way to work around this obviously is to distribute your files between your machines so that the times end up more even
00:49:22.380 but unfortunately this is hard this is a very annoying problem I know of only two
00:49:27.780 solutions and much to my dismay they both involve mentioning commercial vendors
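[Both solutions boil down to the same scheduling idea, so here is a minimal sketch of it: a greedy "longest file first" split that uses previously recorded timings. The spec file names and durations below are invented for illustration, not taken from the talk.]

```ruby
# Greedy "longest job first" split: sort files by recorded duration
# (descending), then always hand the next file to the machine with the
# least total work so far. Durations here are illustrative.
def split_by_timing(timings, machines)
  buckets = Array.new(machines) { { files: [], total: 0.0 } }
  timings.sort_by { |_, secs| -secs }.each do |file, secs|
    bucket = buckets.min_by { |b| b[:total] }
    bucket[:files] << file
    bucket[:total] += secs
  end
  buckets
end

timings = {
  "spec/models/user_spec.rb"    => 120.0,
  "spec/features/login_spec.rb" => 90.0,
  "spec/models/order_spec.rb"   => 60.0,
  "spec/lib/parser_spec.rb"     => 30.0,
}
buckets = split_by_timing(timings, 2)
# CI time is the *longest* bucket, not the average:
puts buckets.map { |b| b[:total] }.max  # => 150.0
```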
00:49:33.599 one of them I mentioned in passing earlier Circle CI has a CLI tool that helps to split the tests between machines now this little tool can split
00:49:40.920 things in many different ways and one of them involves storing in a file how long
00:49:47.880 each test took a file that is kind of similar to the RSpec one that I showed you for the flakies and it stores that
00:49:53.040 file centrally so it will persist between builds and then it uses those timings to try and split things more fairly and it
00:50:00.000 works pretty well to be honest so if you're in Circle already you can do that the other solution I know of is Knapsack
00:50:06.000 Knapsack is a commercial solution that acts as an external queue and each of your machines is going to talk to a queue and
00:50:12.060 pull tests from it the way it works is they run a server that knows all of the files that have to run and each machine is
00:50:17.700 repeatedly asking the server for more files to run so as a machine churns through files faster it will get more
00:50:24.060 files and if you have a machine that's running longer files it's going to end up getting fewer files and that evens
00:50:29.400 things up now supposedly they also store past test timings and they do a bunch of fancy
00:50:34.980 magic to distribute things better you know hence the name get it I'm not sure
00:50:40.319 how much that helps in my uninformed opinion just having the central queue and the gradual pull is doing most of
00:50:46.020 the heavy lifting now a side advantage of Knapsack is also that for those CI providers that don't
00:50:51.900 help you do parallelism you no longer have to keep track of which machine is which and split the files manually
00:50:57.420 you basically start as many machines as you want they all pull from the queue and that makes your life easier if your
00:51:02.940 CI provider doesn't it's not a super cheap solution but it may be worth trying them and seeing if it helps with your times
00:51:09.119 and if it's worth it now this is something I've thought about
00:51:15.180 but I've never actually tried it so take it with a pinch of salt but the RSpec failures file stores
00:51:21.300 how long each test took after running your tests you could do some pre-processing on it and make a little
00:51:26.880 file of your own of how long each file takes and you could persist that file between builds using your CI's caching
00:51:33.960 mechanism that you would normally use for gems and you could have a little Ruby script in the middle of that
00:51:40.200 command that calls awk that first sorts all of the files by time descending
00:51:46.140 and if you do that I think you've made yourself your own Poor Man's knapsack
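[As a sketch of what that little Ruby script could look like: this assumes RSpec's example status persistence file (the one `--only-failures` uses, often `spec/examples.txt`) in its default pipe-separated column layout, sums the recorded run times per spec file, and prints the files slowest first. The file path and format are assumptions, so check them against your own setup.]

```ruby
# "Poor Man's knapsack" preprocessing: read RSpec's example status
# persistence file, sum the recorded run_time per spec file, and
# return the files sorted by total time descending so each machine
# picks up big files before small ones.
def files_by_total_time(persistence_file)
  totals = Hash.new(0.0)
  File.readlines(persistence_file).each do |line|
    example_id, _status, run_time = line.split("|").map(&:strip)
    next unless example_id&.start_with?("./")  # skip header/separator rows
    file = example_id.sub(/\[.*\]\z/, "")      # "./spec/a_spec.rb[1:1]" -> "./spec/a_spec.rb"
    totals[file] += run_time.to_f              # "1.52 seconds" -> 1.52
  end
  totals.sort_by { |_, secs| -secs }.map(&:first)
end
```

[You could print this list and feed it to whatever split mechanism your CI gives you, persisting the timing file with the same cache you use for gems.]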
00:51:53.400 now again I haven't tried it so I may be missing something and from the perspective of throwing money at the
00:51:58.440 problem to save developers' time that's probably a bad idea but if uneven tests are killing you and
00:52:04.500 for some reason you can't use something like Knapsack it may be worth trying again it's a wild thought but under the
00:52:11.400 right circumstances maybe that helps okay so now we've turned our containers
00:52:16.980 into lean mean testing machines they start up super fast we're running them dozens at a time they finish evenly our
00:52:22.619 CI times are amazing which we can see because we have beautiful dashboards and all of our flakies are gone and
00:52:27.900 that's awesome right we win but here's the bad news
00:52:33.359 as I mentioned earlier execution time can vary a lot depending on how sad the machine that you're running in is and
00:52:40.200 that will always affect your CI times a little bit but sometimes you're going to get a
00:52:45.720 machine that is really really sad and it'll just take forever to run your tests or the network is going to fail
00:52:51.960 you a little bit and something that should be instant is going to take 10 minutes or literally forever you may get stuck
00:52:57.900 and it never actually finishes and you need to actually go and cancel it manually now this doesn't happen often but when
00:53:03.660 it happens it is really sad and it kind of negates a lot of the improvements that
00:53:08.819 we've made and so here's the bad news and the sad part
00:53:14.579 the more machines you run in parallel the faster your test will finish but also the more likely it is that one
00:53:21.180 of them is a sad one and that one's gonna take a really long time now again this happens very infrequently
00:53:28.380 but if you're running I don't know 64 machines it'll happen on many more pushes than if you're running four so
00:53:35.160 you need to keep an eye on that because the more machines you add the faster your things will run until you
00:53:40.920 hit a point where things actually start taking longer on average because of this problem and sadly there isn't a silver bullet
00:53:49.079 for this this is a classic trade-off the best we can do is to have good observability which is going to let us
00:53:55.020 figure out what the sweet spot is that is going to give us the lowest CI times on average
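[One way to reason about that sweet spot: if a single machine has some small probability p of being "sad" on a given run, the chance that a push hits at least one sad machine grows quickly with the machine count. The p value below is an assumed illustrative number, not a measured one.]

```ruby
# Back-of-the-envelope model for the parallelism trade-off: with n
# independent machines, each "sad" with probability p on a given run,
# the chance that at least one of them is sad is 1 - (1 - p)^n.
def chance_of_a_sad_machine(n, p = 0.01)
  1 - (1 - p)**n
end

[4, 16, 64].each do |n|
  printf("%2d machines -> %.0f%% of pushes hit a sad machine\n",
         n, 100 * chance_of_a_sad_machine(n))
end
```

[With these assumed numbers, going from 4 to 64 machines takes you from a few percent of pushes to nearly half, which is why adding machines eventually starts costing you average CI time.]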
00:54:01.980 and finally I want to make a quick note about critical paths depending on the complexity of your project and how much
00:54:07.920 tooling and automation you've developed you may have a CI workflow that is quite complicated this is ours
00:54:14.700 and one thing that is kind of obvious if you think about it but it's very easy to lose sight of is that the
00:54:22.559 only thing that you care about is how long it takes to get your branch to green you don't care how long it takes
00:54:27.660 each individual step to be green you only care about how long it takes for the last one to be green
00:54:34.020 so you should focus all of your efforts on the steps that are in the critical path for that
00:54:39.180 now in this particular workflow these three are the only steps that matter optimizing anything else is pretty much
00:54:45.960 a waste of effort now for example it's possible that we could make this secrets step go faster I mean it's taking quite a while but that's not
00:54:52.800 going to make the overall workflow faster so we shouldn't bother and this is especially true if you're
00:54:59.579 making steps faster by adding parallelism because now that makes it more likely that one of them will hit a
00:55:04.619 slow machine as I just mentioned and now you've shot yourself in the foot because for the steps in the critical path
00:55:10.680 there's a sweet spot of parallelism where you balance the time that you gain with the risk of a sad machine
00:55:15.900 but for the steps that are not in the critical path if they are using any more machines than they absolutely need to
00:55:21.300 you are getting all of the extra risk with no benefit to offset it because you're not going to get to Green faster
00:55:27.059 and you're also spending more money on those machines which again you don't get any actual benefit for that so focus on
00:55:33.720 the critical path ignore everything else now of course as you make a step faster you may remove it from the critical path
00:55:39.960 and that's amazing but make it no faster than that and focus on the new critical ones
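[The "time to green" the talk keeps coming back to can be modeled as the longest path through the job dependency graph; here is a toy sketch of that idea (the job names and durations are invented) showing why dropping an unneeded dependency shortens it:]

```ruby
# Time-to-green is the longest path through the dependency graph, not
# the sum of all jobs: a job can only start once its slowest
# dependency has finished.
def time_to_green(durations, deps, job)
  ready_at = deps.fetch(job, []).map { |d| time_to_green(durations, deps, d) }.max || 0
  ready_at + durations[job]
end

durations = { "bundle" => 2, "yarn" => 5, "unit_tests" => 10 }

# unit_tests waiting on both setup jobs:
puts time_to_green(durations, { "unit_tests" => %w[bundle yarn] }, "unit_tests")  # => 15
# after restructuring so unit_tests only needs bundle:
puts time_to_green(durations, { "unit_tests" => %w[bundle] }, "unit_tests")       # => 12
```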
00:55:45.599 and importantly this means not only making steps faster but also considering the dependencies between them there may
00:55:52.079 be ways of restructuring your tasks so that one of your steps no longer depends on another and you can gain a huge
00:55:57.720 amount of time from this and this is very typical if you have for example a setup bundle and yarn job and then you have a
00:56:04.319 bunch of other steps that depend on it if your unit test step is the slowest one and it is depending on setup bundle
00:56:10.859 and yarn that may be a bad idea because it probably doesn't need yarn and so you can make only the steps that
00:56:17.520 do need it depend on it and save yourself some serious time on the critical path which is the final time
00:56:23.099 to green and the same goes for jobs that use a lot of different containers for the dependencies it's common for
00:56:28.559 only a few tests to need all of those dependencies you can separate those tests out into their own job and you only add
00:56:34.559 those extra containers on that job like we did with search at the beginning of the talk so the rest of the tests don't
00:56:39.900 really need to wait in this case for Elasticsearch to boot up so that's pretty much all I wanted to cover
00:56:48.059 to do a quick recap you want to parallelize a lot but focus on optimizing your startup times and you
00:56:54.359 can save time on bundle install git checkout and your container spin-up time build your own images so you can have
00:57:01.200 absolute control over how all those things work keep those layers tight optimize your dependencies improving
00:57:07.680 your critical path keep an eye on your runtimes over time and get rid of flakies by putting them
00:57:13.920 in your backlog and that's it for me I've covered a lot of different techniques just now some of
00:57:19.500 those will hopefully help again your mileage will vary some of these will help you in your particular
00:57:24.900 scenario some won't but hopefully this will give you some ideas on how to approach the problem and some
00:57:30.599 combinations of these techniques are going to allow you to reach CI bliss
00:57:36.000 thank you