Speed up your test suite by throwing computers at it

by Daniel Magliola

In the video titled "Speed up your test suite by throwing computers at it," speaker Daniel Magliola discusses strategies to improve continuous integration (CI) times, emphasizing the need to reduce the amount of waiting time associated with CI processes. Acknowledging the common frustration developers face when CI runs become lengthy, the speaker presents a variety of techniques aimed at optimizing CI without the need to rewrite tests. Key points include:

  • Rationale for Speeding Up CI: Daniel highlights the time wasted by developers while waiting for CI results and advocates for a systematic approach to leverage parallel processing by utilizing multiple machines.
  • Parallel Test Execution: The approach focuses on running tests in parallel rather than optimizing individual test performance. This can be achieved by splitting the test suite into manageable chunks, both manually and automatically, to facilitate running them on various servers simultaneously.
  • Managing CI Resources: By strategically managing CI resources, such as selecting appropriate instance sizes for different test types and adding a catch-all job, teams can control costs and ensure that newly added test files are never silently skipped.
  • Containerization Insights: The speaker explains how to optimize Docker container initialization times, which often constitute the majority of CI time. Suggestions include using popular and cached images and minimizing unnecessary layers in Docker images.
  • Reducing Setup Times: Magliola delves into specifics such as implementing gem caching, shallow clones for git repositories, and efficient Docker strategies to mitigate latency and enhance speeds.
  • Handling Flaky Tests: Addressing the issue of flaky tests, which exacerbate CI delays, Daniel proposes automating the detection of these tests to streamline reporting and tracking for resolution.
  • Critical Path Optimization: Finally, he emphasizes the importance of identifying critical paths in the CI workflow, advising that efforts should concentrate on optimizing these paths rather than peripheral processes to ensure a faster overall CI time.

In conclusion, the video encourages teams to adopt these approaches to better utilize CI resources, minimizing wasteful waiting and ultimately improving productivity in software development.

You've probably experienced this. CI times slowly creep up over time and now it takes 20, 30, 40 minutes to run your suite. Multi-hour runs are not uncommon.

All of a sudden, all you're doing is waiting for CI all day, trying to context switch between different PRs. And then, a flaky test hits, and you start waiting all over again.

It doesn't have to be like this. In this talk I'll cover several strategies for improving your CI times by making computers work harder for you, instead of having to painstakingly rewrite your tests.

RailsConf 2021

00:00:09.900 it was the best of times it was the worst of times it was the age of waiting
00:00:17.520 for CI which takes an age and it's my least favorite thing to do partly because I'm impatient and I'm bad at
00:00:24.000 multitasking but mainly because if you think about how much time you spend waiting on CI every day and you multiply
00:00:31.019 that by the number of people in your team that's a lot of time that we could be using better enough that it makes
00:00:36.960 sense to put in some real work to make it run faster and today I'd like to tell you about a few techniques that I've
00:00:42.660 used in the past that have given me great results first of all though hi my name is Daniel
00:00:47.820 and I learned all of this working for GoCardless a payments company based in London now as you probably noticed I'm
00:00:53.579 not originally from London I come from Argentina so in case you were wondering that's the accent
00:00:59.100 so I want to help you make your CI finish faster and as the title of the talk says I'm proposing that you do this
00:01:05.580 by throwing lots of computers at the problem so I'm not going to talk about how to make individual tests run faster
00:01:12.420 there are lots of resources out there you know using fixtures instead of factories mocking stuff lots of
00:01:18.240 techniques shared by people that can explain them way better than me the problem with these techniques is
00:01:24.240 that they normally involve rewriting your tests and they normally take an amount of time that is linear with the
00:01:30.720 number of tests that you have and your test Suite is probably huge or you wouldn't be looking at my face right now
00:01:36.420 so that's not very fun to do what I want to talk about is how to reduce the total
00:01:42.060 runtime of your CI test Suite with a focus on getting the most impact for the time that you invest and we do that not
00:01:48.960 by making your tests faster but by running them on lots of computers at the same time
00:01:54.180 what I want to focus on is making some systemic changes that are going to let your tests still run slowly and are going
00:02:00.960 to let your team still write tests the way that they used to but when you're running in CI you'll run massively in parallel so you can still finish quickly
00:02:08.160 now of course what I'm advocating for here is throwing money at the problem in exchange for saving developers time and
00:02:14.760 this is not always the appropriate way but for a lot of companies out there when you have an engineer with a typical
00:02:20.280 engineer salary waiting on CI and getting distracted by Hacker News there are lots of situations where it makes
00:02:26.520 sense to spend as much money as your CI provider will take from you and to be honest it's not even that much
00:02:32.940 I mean we're using tons of machines and spending on the order of less than a hundred dollars per developer per month
00:02:38.520 in total now before I start I want to make a couple of notes first of all you're going to see a bunch
00:02:45.420 of CircleCI on this talk mostly on all of my screenshots and I'll also mention a tool or two that they provide this is
00:02:51.900 because we happen to use Circle at GoCardless this is not an endorsement in any way I just happen to have the most
00:02:57.900 experience with them because of my day-to-day work and it was the easiest way to get screenshots of complex CI
00:03:03.360 setups and to be clear I also have my share of frustrations with them I'm not trying to recommend them particularly I just
00:03:10.080 happen to have used them a lot but more importantly this is not a circle specific talk everything that I'm going to talk about is around optimizing
00:03:17.099 things that every CI provider would have to do so it should be applicable to pretty much all platforms the same thing
00:03:24.000 goes for every time I say rspec I'm using rspec as an example out of habit but almost everything I'll be talking about
00:03:30.360 today will still work with minitest or any other test framework in Ruby or in other languages I've actually used a lot
00:03:37.200 of the same advice for a PHP project that we have and the specifics of the
00:03:42.299 projects could not have been more different but the thinking behind what I'm going to share applies to pretty much all languages and that's what's
00:03:49.319 going to help you speed things up another thing to keep in mind is that I'm going to show you a bunch of code to
00:03:55.440 explain these techniques this code is going to be oversimplified so I can explain the concepts quickly and some of
00:04:01.560 it is going to move pretty fast but the talk comes with a supplementary GitHub repo which you can find here and there
00:04:07.440 you're going to find fuller code examples more documentation on how they work and you can grab them from
00:04:12.659 there as a starting point for your project now unfortunately most of them you won't be able to just grab and use
00:04:18.000 you'll probably have to adapt them to your needs but I've tried to document what you need to adapt them
00:04:23.220 more importantly this is not a do these three things and you will get this exact result roadmap kind of talk your mileage
00:04:29.880 will vary your CI setup will be different from mine the specific things that are slow for you are different from
00:04:36.000 everyone else's I'm going to share a bunch of techniques some of which will be super relevant to you some maybe not
00:04:41.639 as much this talk is mostly a way to think about the problem of CI times and a bunch of
00:04:47.280 tools and techniques to help you improve those times but you will need to test these see which ones help which don't
00:04:52.620 and tailor them to your specific situations now I mentioned the PHP project that we had in addition to our Ruby ones and
00:04:59.340 we drastically improved the runtimes of both of them using these ideas but some things that made a massive difference to
00:05:04.919 one didn't move the needle at all on the other one and vice versa you will have to adapt these to your needs so it'll
00:05:10.259 take some work but it'll be worth it when you no longer need to wait on CI for ages so measure experiment see what
00:05:16.740 works for you all right with that out of the way let's get to it we want to make our CI suite
00:05:22.259 finish faster and we're going to do that by running things in parallel the first things first how do we even do this how
00:05:27.780 do we parallelize our tests and there are two main ways you can manually split off sections of your test Suite into
00:05:33.600 chunks that make logical sense and then have each of those chunks run in parallel and then you can take each
00:05:39.300 chunk and automatically split it into many machines to also run it in parallel now this is not an either or Choice
00:05:46.139 you'll most likely want to do both so let's start with the simplest one and it's very likely that you're already
00:05:51.900 doing this but it's still worth talking about if you can separate your test Suite into different pieces that make sense which
00:05:58.440 is going to likely be different subdirectories within your specs directory you can create different CI
00:06:03.960 jobs and call rspec on each of those different directories for those tests now Rails does this beautifully for
00:06:10.440 example if you look at their test Suite you'll see that they have separate jobs for active model active record action
00:06:15.539 cable Etc each of these is a coherent logical unit it's easy to understand when you're looking at the CI setup and
00:06:22.139 they all run in parallel so if your app is super modular like this this is a great win it's a
00:06:27.780 very easy starting point in most Rails apps you're going to have integration models controllers this is
00:06:33.780 not as granular and typically one of those is going to take a lot longer to run than the others but it's still
00:06:39.180 probably worth separating and it's a good starting point now two things that you should keep in mind one advantage of separating this is
00:06:46.680 that you get more granular control over the running of each job for example some CI providers will let you choose between
00:06:52.860 different instance sizes which obviously cost different amounts of money per minute now sometimes for certain kinds
00:06:58.560 of tests particularly integration ones you may need a bigger instance so you can fit all of your dependencies in now
00:07:04.620 if you take those tests and separate them out into their own job you can pay for larger machines only for the parts
00:07:10.860 of the test Suite that need them and you don't need to make the rest of the tests more expensive
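
As a sketch of what that kind of job split might look like in CI config, here is a CircleCI-flavored fragment. The job names, image tags, and `resource_class` values are illustrative assumptions, not taken from the talk; adapt them to your provider and app.

```yaml
# Hypothetical CircleCI-style config: one cheap job per logical chunk,
# and a bigger box only for the tests that actually need it.
jobs:
  models:
    docker:
      - image: cimg/ruby:2.7          # small, cheap instance is enough here
    resource_class: medium
    steps:
      - checkout
      - run: bundle exec rspec spec/models
  search:
    docker:
      - image: cimg/ruby:2.7
      - image: elasticsearch:7.10.1   # only this job waits for ES to boot
    resource_class: large             # pay for the big box only here
    steps:
      - checkout
      - run: bundle exec rspec spec/search
```

Because the `models` job declares no Elasticsearch container, it starts as soon as a box is available, while only `search` pays the boot-up and instance-size cost.
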
00:07:16.380 this is our setup for example and you can see we have separated search tests this would normally live under
00:07:22.680 integration but they had more dependencies than the others and they needed bigger machines so we split them up
00:07:28.500 the other thing you can do is control dependencies between jobs better for example your JavaScript tests will need
00:07:35.039 to wait until your node modules get installed but your model and unit tests probably don't so if you separate your
00:07:41.639 JS test from your unit test you don't need to wait for yarn to finish before your unit test can start and then they
00:07:47.759 can start sooner the same goes for our search example most of our tests don't need elasticsearch to be running so they
00:07:53.819 don't need to wait for that to boot up now the second thing you want to do is a
00:07:58.860 bit less obvious you want to have a catch-all job if you go very granular on
00:08:04.319 the splitting approach it's very easy to later add a new test subdirectory and forget to add the new CI job for it if
00:08:10.979 you do this and you later have a new directory spec/new_tests you'll
00:08:16.560 likely forget to add a new CI job for it your new test won't run in CI and you will never notice which is pretty sad
00:08:22.500 and it's dangerous what you want to do instead is have a final catch-all job and instead of
00:08:27.960 targeting a specific subdirectory to run you want to find all of your tests and filter out the ones that are already run
00:08:34.440 by other jobs and using find and grep -v lets you do this very easily
00:08:39.539 now I'm not going to lie this code is kind of ugly I get it but
00:08:45.060 it feels safe I mean the caveat is obviously that you need to remember to add an exception here if you later add a
00:08:51.120 new job for some other subdirectory but now the penalty for forgetting something is that you run some tests twice which
00:08:57.540 is way better than skipping an entire chunk of them so I think that safety justifies the yuckiness
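
A minimal sketch of such a catch-all filter, assuming a standard spec/ layout. The excluded subdirectories are examples; list whichever ones already have dedicated jobs in your setup.

```shell
# List every spec file, minus the subdirectories that already have
# their own dedicated CI jobs. If you forget to exclude a new job's
# directory, the worst case is a file runs twice -- never zero times.
leftover_specs() {
  find spec -name '*_spec.rb' \
    | grep -v -e '^spec/models/' -e '^spec/controllers/' -e '^spec/integration/' \
    || true  # grep exits non-zero when everything was filtered out
}

# In the actual catch-all CI job you would then run:
#   leftover_specs | xargs bundle exec rspec
```
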
00:09:05.339 so that's how you split your test suite manually which again I bet many of you are already doing but I think it's worth keeping these caveats in mind
00:09:11.940 the other way of parallelizing and the one we're going to focus the most on today is having multiple machines run a
00:09:17.580 single set of tests by automatically splitting the files between them now the key part of how you do this is
00:09:23.160 instead of giving your test Runner a directory to run you give it a list of all of the files in that directory and
00:09:29.220 you do this because once you have a list of many files you can split it into chunks with a little magic so you do
00:09:34.500 that in many machines each machine picks a different chunk of that list and runs only those and in aggregate you will
00:09:40.620 have run all of your tests but each of those chunks basically run in parallel now in order to split this so that no
00:09:46.980 two machines run the same file and that all files get run all right all right some of you already know how to do this
00:09:52.800 or you're doing this already and I can see you reaching for your phone right now to check Twitter stay with me for a
00:09:57.839 quick second because I got news as I said at the beginning your mileage will vary and not every section of this talk
00:10:03.060 will be relevant to everyone so I'm taking advantage of this new interactive RailsConf format to do some
00:10:08.580 unorthodox things that I couldn't do in a live talk now the good news is you can skip over things that you already know
00:10:14.279 if a section might not apply to some of you I'll make a comment about it and an icon like that is going to show up in
00:10:20.279 the corner with a timestamp so if a particular section doesn't apply to you just skip until the icon is gone and you
00:10:26.820 won't miss a beat so as I was saying earlier in order to split this list of files that you need
00:10:33.959 to run so that no two machines run the same file and that all the files get run each machine needs to know two things
00:10:39.480 which machine it is and how many total machines there are and then we can have each machine basically know
00:10:46.320 these two things through two environment variables and knowing this each machine can basically take every nth file with
00:10:52.140 an offset at the beginning and that basically does it now of all the things I'm going to talk
00:11:04.560 about today this is the one where different CI providers will vary the most CircleCI for example has a CLI tool
00:11:04.560 that is going to help you find your test files and it will split them for you now you specify a parallelism value for how
00:11:10.140 many boxes you want to run you run a command kind of like that and it pretty much just works
00:11:16.260 and if you put in a little bit more work you can also do smarter allocation based on historical times on each file which
00:11:22.079 is actually pretty cool now CodeShip has a less ideal approach it lets you specify a number of steps to
00:11:29.339 run in parallel and you can embed environment variables right there and this needs a little bit more work from
00:11:35.100 you I'll show you in a minute how to use this but that'll do it Buddy for example has something similar
00:11:41.459 to Circle where they will split the files for you ahead of time and put the list of files for each machine in these
00:11:46.500 env variables called buddy split 1 buddy split 2 etc and you can just use those directly pass them on to rspec
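
For reference, the CircleCI flow mentioned a moment ago is typically a pipeline like the one below (a sketch based on CircleCI's test-splitting CLI; the glob pattern is an assumption about your spec layout):

```shell
# CircleCI only: glob the spec files, let the CLI hand this container
# its share of them (optionally weighted by historical timings),
# then run just that share.
circleci tests glob "spec/**/*_spec.rb" \
  | circleci tests split --split-by=timings \
  | xargs bundle exec rspec
```

Each of the N parallel containers runs this same command and automatically receives a different, disjoint slice of the file list.
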
00:11:52.860 now sadly most providers don't have any tools to do this directly but you can do it yourself if you
00:11:59.519 slightly abuse a feature that almost all of them have the build matrix most providers will give
00:12:06.360 you a way to run the same job over and over with slightly different parameters and they call this a build matrix and the
00:12:12.540 idea is you can run the same test suite over and over with for example different versions of Ruby or different Gemfiles
00:12:19.019 pointing to different versions of rails and then you can make sure that your code is compatible with all of them
00:12:24.420 now this code above is from GitHub Actions but they're all very similar and some of those keywords that you can
00:12:29.700 see there mean something specific for different providers keywords like os and rvm or ruby but you can make
00:12:37.079 up your own keywords and then use those to set environment variables like you can see up there and that means that we
00:12:42.660 can abuse this feature and bend it to our purpose we can make up a box index in the
00:12:48.600 matrix and we use that to number all of our boxes and then we set these two environment variables the ones
00:12:54.899 I was talking about earlier and if we do this we now get four boxes each box knows which one it is and knows
00:13:01.620 how many there are and so with a little command line hackery you can get each box to pick up their part of
00:13:08.040 the split as I was showing you earlier now this script here is a simplified example but what we're doing is we're
00:13:14.160 taking all of the rspec files and passing them to awk line by line and in awk this variable NR will tell you what
00:13:21.360 row of the input you're in right now so you modulo that by the total count of machines you compare it to your machine
00:13:27.779 number and decide whether you want to proceed with this file or discard it and that gives you a split that you need
00:13:33.240 that's it now pay attention to that sort there that's important to keep things
00:13:38.820 consistent or you might end up with different orderings in different machines now again that code is very simplified
00:13:44.160 in reality it looks a bit more like this but doing this you can split your test between as many machines as you want and
00:13:50.279 parallelize like crazy now I admit it this does look a bit
00:13:55.380 ugly but if your CI platform doesn't help you split your tests this works and if you give it enough boxes it'll speed
00:14:01.980 up your test massively
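
A runnable sketch of that split, assuming two environment variables named BOX_INDEX (0-based) and BOX_TOTAL. Those names are mine, not from the talk; set them however your build matrix allows.

```shell
# Give each box a stable, disjoint slice of the spec files.
# The sort matters: every box must see the files in the SAME order,
# or the slices will overlap and leave gaps.
my_chunk() {
  find spec -name '*_spec.rb' \
    | sort \
    | awk -v idx="$BOX_INDEX" -v total="$BOX_TOTAL" 'NR % total == idx'
}

# On each CI box you would then run:
#   my_chunk | xargs bundle exec rspec
```

In aggregate the boxes cover every file exactly once, because each keeps only the rows whose position modulo the box count matches its own index.
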
00:14:08.160 okay so pretty much all of this so far has been an introduction I need to
00:14:13.260 explain how we run things in parallel so we can get into the main part of this talk
00:14:18.540 because now is when things get hard and interesting
00:14:23.639 because in theory there's no limit to how many boxes you could have right you can give it a thousand boxes and your test should run almost instantly but
00:14:30.540 obviously that's not how that works the truth is if you do just this you can improve your CI times a lot but it's not
00:14:37.380 going to be ideal there's a wall you're going to hit pretty quickly which is going to be imposed by your startup
00:14:42.779 times you see when you run one of these boxes it doesn't just start running instantly for most CI providers these
00:14:49.740 boxes run a container that has to be downloaded it has to be started then you need to run a bunch of setup tasks and only then you're going to run your tests
00:14:56.040 if you're not careful you can easily spend five minutes doing the setup and then you start getting very sharp
00:15:01.680 diminishing returns for each extra box I mean if you think about it even if you have infinite machines if you're doing
00:15:07.920 five minutes of setup it will still take at least five minutes to run your tests right
00:15:13.199 and also while I'm all for throwing money at the problem if you're waiting five minutes for each of those boxes you're
00:15:19.740 going to want a lot of boxes and that's a pretty big money bonfire for you so you want to focus on these setup times
00:15:25.440 and make them as small as you possibly can now a typical CI config looks a bit like
00:15:31.800 this and that last step I highlighted over there that's where we actually run
00:15:36.899 our tests but there's a lot of stuff that needs to happen before now at the beginning of the talk I said I wouldn't talk about how you make your
00:15:43.079 individual tests run faster all of that work would focus exclusively on that last step I
00:15:49.440 highlighted and that is what we normally look at to try and make faster I think it makes sense because it's what
00:15:55.740 takes the longest and it feels like we can control it but here's the thing here's how I see this list of steps
00:16:01.680 that last bit is what's actually doing the work that we want and is the part
00:16:07.680 that we can parallelize so if it's taking longer we can just throw more computers at it all the stuff that comes
00:16:13.320 before it is a necessary evil but it's waste and if you have more computers all of them need to do those
00:16:19.440 steps anyway so it doesn't parallelize at all now the problem with those is that they don't look like you can do anything
00:16:25.500 about it your tests that's your code you can change it you can optimize it you can do whatever you want with it but a container is going to take as
00:16:32.880 long to start as it takes to start right and bundler is just gonna take however long it takes to install those gems and
00:16:38.399 you do need those gems and all of those steps are necessary you can't get rid of any of them so it
00:16:43.500 really doesn't look like you can do much about it but in reality we can get clever here and there's a lot that we can do to
00:16:50.639 start chipping away at those startup times and to make this a whole lot better
00:16:55.860 so here's what you want to do your CI will probably show you something like this with all of the little steps and
00:17:01.680 again that's what they look like to me so what you want to do is look at all the steps that run in your CI job before
00:17:08.459 and after your actual tests and look at the running times and you want to focus on the ones that are taking a long time
00:17:14.220 and try to start chipping away now these times are not intended as a
00:17:19.380 weird Flex I just don't have a before screenshot but believe me they were a lot worse before we optimized them
00:17:26.280 an important point here is you only care about the slow ones if one of the things
00:17:31.320 that I'm going to talk about next only takes five seconds for you you don't care just move right along if your git
00:17:36.660 checkout takes two seconds there's not much point doing trickery to optimize it right if it's taking 30 seconds then it
00:17:43.200 may be worth it now at this point you might want to pause this talk and take a look at your actual
00:17:48.480 CI setup steps and the runtimes and that should put the rest of the talk into context and highlight what you want to
00:17:54.179 pay more attention to and with that in mind I'm going to talk about three things
00:18:00.000 installing your gems checking out your code and container spin up time
00:18:06.480 now depending on what CI provider you use some of these bits may involve swimming
00:18:11.520 against the current a little CI providers try to make it easy to get started with them and they do this by giving you a
00:18:17.340 sane default that is very easy to set up and does reasonable things and this is great for getting started
00:18:22.559 quickly but it's not so great if you're trying to squeeze out every last drop of performance from it because for that you
00:18:28.200 will want to customize things and to what extent you can customize things will depend a little bit on your
00:18:33.419 particular CI platform but from what I've seen most of them allow doing most of these things so they just sometimes
00:18:40.260 are going to involve going a little bit outside the beaten path so let's talk about bundle first in
00:18:47.280 order to run your tests you're going to need your gems installed and you either let your CI provider do this magically
00:18:52.620 for you or you install your gems doing something like this but as you know installing all of your
00:18:58.020 gems from scratch takes forever so in order to prevent that most CI providers give you caching capabilities some of
00:19:03.720 them do it automatically for you some let you do it yourself but basically what happens is this
00:19:08.820 before you install your gems you control where they get stored with a path setting and
00:19:14.580 then after you install your gems you save that directory to your CI's cache now on the next run before installing
00:19:20.400 you restore that cache which means bundle install is going to run almost instantly because all of your gems are already there unless your Gemfile changed
00:19:27.299 and you are probably already doing this caching and when you first set this up this works great the first run takes
00:19:33.600 a long time because it needs to install everything but then your gems are cached and future runs actually go through this pretty quickly and every time you change
00:19:40.140 the Gemfile you only need to install a gem or two that are new that gets cached and things continue to be fast
00:19:46.100 however over time you might find that restoring that cache starts taking a
00:19:51.360 longer and longer time in our case I've seen a restore time of about two minutes at its worst which is pretty bad and
00:19:57.360 this happens because as you upgrade gem versions the old versions are left behind in the cache and they blow the
00:20:03.059 size of the file that you're saving and restoring and moving around and to prevent this you want to tell bundler to
00:20:09.419 automatically clean out older versions when you do this bundler will delete all versions of gems that your Gemfile no
00:20:14.640 longer uses and that's going to make the CI cache smaller and faster to move around now unfortunately the specifics of how
00:20:20.640 you do this depend on your version of bundler in older versions for example you pass --clean to bundle install
00:20:26.780 you want to look at the documentation for your specific version of bundler and while you're in there look for other
00:20:32.520 flags that look like they can save you space and experiment with them bundler has a lot of configurability
00:20:38.760 I'm also going to talk about manually deleting unused files a bit later and that can save you even more space but it's a
00:20:45.179 bit harder to do and finally you want to look at the docs for your CI provider in detail all of the different
00:20:52.260 CI providers have slightly different features around this that you may be able to use and gain even more time so
00:20:57.480 keep an eye on those bundle install and cache restore times and if they get long this may help
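
Concretely, the bundler side of this might look like the following (Bundler 2.x flag names; using vendor/bundle as the cached directory is a common convention, not something mandated by the talk):

```shell
# Put gems in a directory the CI cache step can save/restore,
# and let bundler prune gem versions the Gemfile no longer references
# so the cache doesn't grow forever.
bundle config set --local path vendor/bundle
bundle config set --local clean true
bundle install --jobs 4
```

Your CI cache step then saves and restores vendor/bundle, keyed on a checksum of Gemfile.lock so the cache is rebuilt only when dependencies change.
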
00:21:04.200 now next step is git checkout or git clone this is one where you should only spend time on it if you see this step
00:21:10.320 take a while for most projects this is going to be pretty quick but as you accumulate history and your repo
00:21:16.919 gets big it's going to start taking longer and if it's taking 20 or 30 seconds it may be worth trying to do a
00:21:23.700 shallow checkout where you only get the last few commits instead of the entire history now this may or may not be faster
00:21:30.059 depending on a number of things but it is worth trying as a first approach if your checkout is taking a long time
00:21:36.299 now unfortunately I can't really tell you how to do this in your situation this is another one where CI platforms will give you very different options
00:21:42.179 either directly or using some library that somebody has made and you can import so you want to check the docs for
00:21:48.120 how to do this I mostly just wanted to call out that this is one to keep an eye on and that a shallow checkout might
00:21:54.299 help and if that doesn't help I'm going to be talking about how to do this in a different way a bit later with a
00:22:00.900 different approach all right so we've talked about two of the biggest
00:22:06.960 usual time wasters now let's look at probably the worst one container initialization time
00:22:12.720 now this section is probably the most bizarre one in my talk I know it will
00:22:17.760 sound like really weird advice but it can pay off massively if you're having this problem so give this a shot
00:22:24.419 generally all CI providers will run your stuff on Docker containers and if you've ever used Docker you're probably
00:22:29.460 familiar with this sight now unfortunately for this part of the talk to make sense I need to give you a
00:22:35.159 brief introduction to how Docker Works Docker containers are a sort of separate space in your machine with their own
00:22:40.559 little file system and processes that can't touch each other sort of like lightweight virtual machines and just
00:22:47.220 like a virtual machine they get booted up from an image which is basically a giant collection of files and these
00:22:54.539 images they get created from a set of instructions that you put in a Docker file which looks a little bit like this you may say start from a plain Linux
00:23:02.159 image install Ruby on it copy my Gemfile in it and run bundle install and copy all the files from my app and take
00:23:07.980 the result of all of that and that's my image and you can specify when that image runs run this command to start my app that's
00:23:14.400 the last unicorn command on there and just like you can create your own image there are lots of common tools
00:23:20.820 like postgres or redis which have pre-made images that you can just use these live in a central registry similar
00:23:28.080 to RubyGems and that means you can tell Docker hey run me a postgres and a redis and this custom image that I made
00:23:33.480 which is my app and the way this works is when you try to run an image if you don't have it Docker will go to the registry and download it
00:23:39.840 for you just like bundler does when you install a gem and when you're running in CI it's very
00:23:45.000 common to do this you tell your CI provider run me a postgres and redis and also run my test in a container image
00:23:50.340 that has Ruby 2.7 in it and most CI providers give you a bunch of Handy
00:23:56.039 images that they have pre-made for you so you may get an image that already has Ruby in it but also has node and chrome
00:24:01.620 in there so you can easily run your integration test without having to install everything yourself and understanding images is good but if
00:24:09.059 we want to be able to get those containers to start faster we need to go one level deeper we need to talk about layers
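The Dockerfile described a moment ago might look roughly like this (the base image, paths, and the unicorn command are illustrative stand-ins, not the speaker's actual file):

```dockerfile
# Sketch of the Dockerfile described above -- names are illustrative
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y ruby-full build-essential
WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN gem install bundler && bundle install
COPY . .
CMD ["bundle", "exec", "unicorn"]
```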
00:24:14.700 because one of the really interesting ideas in Docker is that when building this image of yours there are things
00:24:20.100 that take a lot of time to run but they don't change very often and there are things that do change very often but
00:24:27.360 they don't take as long to do so the way this works is when it's building your image at each step in the Dockerfile it
00:24:33.120 does what you asked but all of the writing that it does to the file system in that image gets stored separately from
00:24:38.820 what's already there from the previous steps it gets put in a layer and this layer depends on the previous existing
00:24:44.039 layer and adds or modifies files and Docker will also take a hash of what you're doing so that the next time you
00:24:50.100 try to build the same image if the step hasn't changed Docker knows that because the hash matches and it goes ah I'll
00:24:55.559 just use that layer I've got cached over there and saves you a lot of time now in this sample Docker file you'll be
00:25:02.039 building this part here over and over but this whole chunk at the beginning doesn't change you're always installing
00:25:07.980 the same version of the same thing so Docker just uses the layers it has cached now this part does change because your
00:25:14.520 app files probably changed and so for this it does need to make new layers but for everything else it can
00:25:19.740 save itself a lot of time so the result of this is that your image is not a single monolithic file it's a
00:25:26.279 stack of layers and each of them depends on the previous ones but they can be downloaded and cached independently and
00:25:32.820 they can also get shared so that image you started from it could be anything it can be your own image if
00:25:38.340 you want and it can already come with a lot of layers in it so if you have a bunch of different applications on Ruby
00:25:44.580 2.7 for example you may end up making your own base image that includes the installation of Ruby and all the stuff
00:25:50.580 that you have that's common to all of your apps and then you can reuse that in the docker file for each app now if you do that the layers in that
00:25:57.419 custom Ruby image that you made will get shared for all of the apps on the machine they will only be built and downloaded once
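As a sketch of that shared-base pattern (registry name and packages are hypothetical), the two Dockerfiles could look like:

```dockerfile
# base/Dockerfile -- built and pushed once, shared by all Ruby 2.7 apps
FROM ruby:2.7
RUN apt-get update && apt-get install -y libpq-dev nodejs

# app/Dockerfile -- each app starts from the shared base, so the big
# layers above are built and downloaded only once per machine
FROM registry.example.com/my-team/ruby-base:2.7
WORKDIR /app
COPY . .
```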
00:26:03.480 and that shared part is also probably the biggest part of your whole image whereas the bottom bit is going to be
00:26:10.380 different for each app and for each build of your app and so it will have to get downloaded every time but at least the
00:26:15.840 big ones at the top get reused and that is what you're seeing when you see this download screen this is Docker trying
00:26:22.740 to run an image and is downloading all of the layers that it doesn't already have and you've also probably seen
00:26:28.260 something like this where some layers already exist and those don't get downloaded and here for example we're
00:26:34.500 only downloading one layer the other ones are already cached now the reason I'm talking about all of
00:26:39.720 this is that depending on whether you get this or this is going to make a huge difference to
00:26:45.600 how long it takes to start your containers you really really want the machine that runs your test to already have downloaded the images that you're
00:26:51.960 going to use or at least a lot of its layers because if it has your containers will start almost immediately and if it
00:26:58.799 hasn't then you will first have to download probably about a gigabyte's worth and that can take a while and
00:27:04.500 it needs to extract those layers and only then it can start to spin them up so you really really want the machine
00:27:09.960 that's running your tests to already have the layers that you're about to use now unfortunately you generally have
00:27:16.440 absolutely no control of this your CI provider has a gigantic pile of computers each one of them is running
00:27:22.020 tons of containers for lots of people and you have absolutely no control over which machine your test lands on or what
00:27:27.960 layers that will have already cached and this is why I said earlier that this is a somewhat bizarre piece of advice
00:27:34.380 I'm basically saying try to make sure the machine you have no control over already has your layers that sounds kind
00:27:40.559 of nuts but what runs in these machines isn't random
00:27:46.260 you can't guarantee what a machine has already downloaded but you can try to influence the odds in your favor there
00:27:52.260 are loads of people that are running the stuff in these machines lots of these people are going to be using Ruby so they'll be using a Ruby image probably
00:27:59.100 the Ruby image provided by the CI platform so if you're running Ruby tests it's quite possible that your machine will
00:28:05.940 have a Ruby image cached because somebody probably ran Ruby tests there earlier and here's the bit you can control
00:28:13.020 not all of the images that you could use will be equally likely to be used by others
00:28:18.480 some versions are going to be more common than others so you're going to have for example a lot more people using
00:28:24.779 Ruby 2.7.2 than 2.7.0 for example just because it's a newer version
00:28:30.179 and both of them are going to be a lot more common than I don't know 2.3.5 just because it's old
00:28:36.240 the same goes for postgres if you're using postgres latest it is way more likely that somebody else has used that
00:28:42.539 recently than if you're using a random outdated version like I don't know 9.6.17 and this really bit us at one point
00:28:49.919 we were using a weird version for an image that was actually pretty large to download and our containers were taking
00:28:55.080 ages to start and just switching to a very similar but more popular version saved us about a minute of setup
00:29:03.120 so what you want to do is make sure you're using what looks like the most popular versions of your dependencies so
00:29:08.820 you can load the dice in your favor and increase the chances that the images that you use or at least a good chunk of
00:29:14.340 their layers are going to be cached now again this is still weird advice because you have no way of knowing what
00:29:20.820 other people are using but if you see that your container startup times are high you can look inside that container
00:29:26.700 spin up step in your CI platform and you can look at whether it is frequently downloading layers or if it's using the cache and if it's downloading things
00:29:33.480 very often you can start looking at what are the other available images that you could use and trying different ones to
00:29:39.779 see if they get cached more often you also want to point to less specific versions which are going to be more
00:29:46.559 commonly used by other people now in Docker you can have many different tags pointing to the same actual image and
00:29:52.020 these tags can get repointed over time so for Ruby for example you will have tags for 2.7.1 and 2.7.2 but there's also
00:29:59.220 just 2.7 and 2.7 points to the latest patch version and probably most people
00:30:04.860 will be using that one as it keeps getting repointed over time when 2.7.3 comes out 2.7 is going to point to that
00:30:10.919 one instead and the same happens if you have a latest tag like postgres latest or redis
00:30:16.260 latest that will generally be the default that most people use so it's going to be more likely to be cached
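In CircleCI-style configuration, for example (syntax varies by provider; `cimg/ruby` is CircleCI's convenience-image line, used here as an assumption), the advice about provider images and floating tags might look like:

```yaml
jobs:
  test:
    docker:
      - image: cimg/ruby:2.7-browsers  # provider image, tracks latest 2.7.x
      - image: postgres:latest         # floating tag, more likely cached
      - image: redis:latest
```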
00:30:22.799 now this advice it comes with a big disclaimer all of the things I'm proposing that you
00:30:28.620 do today like everything in life they have trade-offs but this one in particular has a massive one because one of the
00:30:35.279 Glorious things about Docker is that you can specify the exact version of everything and fully control your
00:30:40.380 environment so the usual best practice is choose the exact version of everything in CI that you're running in
00:30:46.860 production because then you're testing on the exact same environment or as close as you can get it
00:30:52.080 whereas I'm standing here suggesting that you do the exact opposite and that implies some risk
00:30:58.860 if you use the latest postgres now you're testing on a different version than the one you're actually running in production and that
00:31:05.279 could bite you on the flip side you can save hundreds of
00:31:12.419 hours of developer time collectively if you do that so which one matters the most will depend on your particular
00:31:18.600 situation and there's also some nuance here some middle ground for example you almost
00:31:24.120 definitely don't want Ruby latest because Ruby changes a lot between minor versions you may get hit by deprecations
00:31:29.880 backwards-incompatible changes or other problems like that but if you're using Ruby 2.7.2 in production pointing to the
00:31:36.899 latest 2.7 is actually probably safe enough and your CI will probably still be trustworthy and if you do that you
00:31:43.260 can get the startup speed benefits of having a more common more cached version now if you're running an old version of
00:31:49.200 elasticsearch or some software that has had major backwards incompatibilities then yeah you'll probably have to still
00:31:55.440 run that specific version and that sucks but for things like postgres which have really good backward compatibility for
00:32:01.320 example using the latest one is probably fine oh and make sure you're using the docker
00:32:07.080 images that your CI provider gives you not the normal official ones because those are the ones that everybody else is going to be using and they're going to
00:32:13.080 be more cached again it's weird advice you're making CI less deterministic which is normally the
00:32:19.260 opposite of what you want but speed so you're gonna have to weigh this one the other big disclaimer of course is
00:32:25.320 that this is also much less deterministic in terms of timings than anything I've talked about so far you're always literally rolling the dice on
00:32:32.399 whether your machine is going to have your layers or not now doing this lets you weight the dice a little bit in your advantage but it's still a crap
00:32:39.120 shoot and because of that it can be harder to gauge whether this helped or not because the runs that you're observing
00:32:45.480 when you're trying it may or may not be representative and I'm going to talk a bit more about that later but the general gist for this is try to
00:32:52.919 use common versions of things if you can because you can make your containers boot up a lot faster on average
00:33:03.840 okay so we've covered how to speed up bundler git checkout and container startup but now that I've talked about Docker
00:33:10.799 I'd like to take a second look at these same three topics
00:33:17.159 but for a more advanced scenario if you have the ability to build your own containers and push them to a registry
00:33:23.460 and if you're a CI platform lets you run your own containers there's more we can do to gain even more speed
00:33:29.460 now this takes some more work than the previous tips so maybe it only starts being economical when you start having larger dev teams which means you're now
00:33:35.880 wasting a ton more human time waiting on CI and you also probably have developed some tooling at that point that's going
00:33:41.220 to let you do this more easily so it takes a little bit more work but for us at least it was definitely worth it
00:33:47.039 now if you can't do this or you don't want to have to deal with making your own containers feel free to skip ahead
00:33:52.380 about eight minutes I'll go back to non-Docker topics then so skip ahead until you no longer see the little whale
00:33:57.899 in the corner but if you can build your images easily one thing has worked really well for us
00:34:03.360 is having a custom Docker image for your CI environment now this will not be the same Docker image that you run in
00:34:09.119 production it's going to be very very different so you're going to need two Docker files in your repo one for production one for CI and what you do is
00:34:16.020 you start from a Docker image that your CI provider offers so it's going to be heavily cached and you also try to start
00:34:21.480 from one that already has most of the stuff in it so Ruby Chrome whatever you need and then you add the stuff that you
00:34:27.119 need to run your tests which is basically the same setup you normally do in CI most of it you do it in the docker
00:34:33.060 file instead and the way you will work with this is you will automatically build this image every time you merge to your main branch
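The merge-to-main build just described could be sketched as a CI job like this (the job syntax and registry name are hypothetical; adapt to your provider):

```yaml
build_ci_image:  # runs on every merge to main; test runs never wait for it
  steps:
    - run: docker build -f Dockerfile.ci -t registry.example.com/app:ci .
    - run: docker push registry.example.com/app:ci
```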
00:34:39.540 and you will always tag it with CI and in your CI configuration you point to your registry and to that CI tag so
00:34:46.980 that you're always getting the latest one every single time now importantly this is not always going to be a reflection of what's in the
00:34:53.220 branch that you're testing right now these images may take a while to build and you don't want to wait for them
00:34:58.560 that's key if you have to wait for them to build before you can run your tests you're actually causing more harm than
00:35:03.599 good but since you're always using the same tag if the last merge to main hasn't finished building that's fine you'll be
00:35:09.180 using the one from the previous merge which is good enough and also most of your tests are going to run in other branches but you're using
00:35:15.119 the docker image from your main branch so you're also going to have to apply all the changes from your branch onto
00:35:20.820 that one so it's important to keep this in mind because there's a couple of things that you will still have to do in
00:35:25.980 CI again after the image starts so with that in mind I want to talk about how to do this same three things
00:35:32.339 again but with a Docker twist first bundler as we discussed earlier
00:35:37.680 your CI provider gives you a cache that you can use to save and restore your bundle install and this cache is really
00:35:43.500 useful but sometimes it can actually be quite slow to restore one thing that can
00:35:48.660 sometimes help is running bundle install on your Docker build and this means your gems will already be there when the
00:35:55.140 container boots up and then you don't need the cache anymore and you don't need to do that in CI because you
00:36:00.839 already have the gems in your image now there's two things to keep in mind first this image was built against your
00:36:06.180 main branch if you're testing in a branch where you've updated the gem file you won't have those latest gems so you
00:36:11.220 still need to run bundle install in CI the bundle install though most of the time is going to finish instantly and very
00:36:17.940 rarely it's just going to install one gem or two so it's going to be very quick now the other thing to note is that
00:36:23.280 you're trying to save time by not needing to do the cache restore but the layer where you installed
00:36:28.740 all of your gems still needs to be downloaded so you still care about how much space these gems take on disk or the
00:36:34.740 layer is going to be huge and it's going to take forever to download now earlier when we were using the CI cache the main Improvement for this was
00:36:41.160 deleting old gems now that doesn't apply here because you never have old gems if your Gemfile
00:36:46.980 changed your container build starts that layer from scratch so you don't have any of the old baggage however there is
00:36:54.300 still a lot of bloat when you're using bundle install because bundler keeps a bunch of caches that you don't need and it can be quite big so you want to get
00:37:00.359 rid of them now there's a couple of parameters that bundler gives you to prevent those caches but those have changed over time and I
00:37:06.780 may be doing something wrong here but it still seems to keep those caches around even if you put in all those parameters
00:37:12.720 however a thing you can do is you can explicitly delete those caches after you install your gems and that will end up
00:37:18.839 saving you the download time your bundle install command is going to end up looking a bit like this and
00:37:24.720 you're going to have to tweak those two paths because they will change with your particular system
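As a sketch, that command in the CI Dockerfile might look like the following; the two cache paths are typical of the official Ruby images but are an assumption here, so confirm them by inspecting your own image:

```dockerfile
COPY Gemfile Gemfile.lock ./
# Install gems and delete bundler's caches in the SAME run step,
# so the cache files never land in a layer (paths are assumptions)
RUN bundle install --jobs 4 \
    && rm -rf /usr/local/bundle/cache \
    && rm -rf ~/.bundle/cache
```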
00:37:29.820 now I'm going to talk a bit more in a minute about how to find those paths and how to figure out how big your layers are and how to make them smaller but as
00:37:36.780 an example in our particular Docker build deleting these files takes this layer from 500 megabytes to 400 we save
00:37:42.599 almost 100 megabytes which otherwise we will be downloading over and over into those containers so it's quite a big
00:37:48.540 reduction it really pays off to do this and by the way this is something you can also do if you're not making your own
00:37:55.079 images if you're doing a normal bundle install in your CI you can still delete these directories after the install and before you save the cache to make
00:38:01.380 that cache smaller the only issue is it's not going to be as obvious to find what those paths should be but if you
00:38:06.420 can find them that will also help so that's how we make bundler faster the same idea applies to git checkout
00:38:14.280 or git clone if you remember the problem cloning the whole repo may take a long time because all the history is coming
00:38:19.619 with it however when you're doing a Docker build you don't care how long it takes so you can just do a full git clone and
00:38:25.920 then when the container runs in CI most of your repo is already there all you need to do is git fetch for the branch
00:38:31.380 that you're testing and that's only going to pick up the few commits that are not already in the image and that's going to be way faster than even doing a
00:38:38.280 shallow clone so you might want to try that finally I want to talk about
00:38:44.220 optimizing your layers as I was mentioning earlier your container will be composed of many layers one for each
00:38:50.099 command that you run in your Docker file and you want to maximize the likelihood that as many as possible of those layers
00:38:55.380 are already cached and that's why we start from a base image that your CI platform gives you but there are layers that will almost
00:39:01.680 never be cached and these are the ones that you added to the base image and here there's a balancing act that you
00:39:07.140 want to play because Docker is going to download a bunch of those in parallel now how many depends on your CI platform's
00:39:12.900 configuration and you want to try to optimize for that number what you're trying to do here is balance the bandwidth that
00:39:19.560 you get for each parallel download stream and the latency of round trips let's say you have a gigabyte of layers
00:39:24.599 to download if you only have one layer that's one gigabyte you won't be parallelizing that download at all all of the megabytes that need to come
00:39:31.079 need to be downloaded serially basically and you won't be using as much
00:39:36.359 bandwidth as you could if you have more layers that download will happen more in parallel and it will download faster overall however if you have too many
00:39:44.520 layers a lot of them are going to be tiny and you're going to be doing lots of round trips between your CI machine and the Docker registry and you're going
00:39:51.060 to waste a lot of time there so you don't want to have too many layers and this is why it's very common
00:39:56.940 to see the double Ampersand in Docker files if each of these little instructions
00:40:02.400 here was a separate run command each of them would end up as an individual layer the double ampersand means you're
00:40:09.359 going to run all of the commands and you're going to get a single layer with what's left after doing all of those things
00:40:15.660 and so that's how you optimize for how many layers you have the other thing you're optimizing for is the size of those
00:40:21.420 layers in the first place you don't want it to be huge or they will be slow to download now you want to look at the size of your
00:40:26.820 layers and if they're big see if there is stuff that you can delete some examples of these are temp files
00:40:32.700 left behind from build processes or those bundler caches or tar.gz files that you may have downloaded and extracted
00:40:38.820 you want to remember to delete the tar file if it was big now it's very important that you delete
00:40:44.099 those files in the same run command where you create the stuff pay attention to those ampersands because if those
00:40:50.520 deletions were in a separate run command then you have one layer where you add all the useless crap that you don't want
00:40:55.619 and that's going to have to be downloaded later and then you have a second layer that deletes it that completely defeats the purpose
00:41:02.280 okay so to do this you need to see what layers you have in your image and you need to see how big they are and the
00:41:07.380 easiest way to inspect this is by using the docker history command that'll show you all the layers in an image and how
00:41:12.839 much space they take and if you want to look inside those layers there's this great tool called dive that's going to show you the file
00:41:18.839 system of that image and what each layer has in there and this is really useful to find stuff that was left behind by
00:41:24.420 build processes like those bundler caches that I mentioned earlier and this is also how I found those paths
00:41:31.020 to delete here's what that looks like this is the dive tool up here it's showing us all of the layers in the
00:41:36.480 image and we can pick which one to look at and over there is showing us the file system and in color the stuff that's new
00:41:41.820 on this layer and you can ask it to only see new stuff and you can start collapsing directories and you can
00:41:47.520 clearly see that 19-megabyte cache directory there and in a minute we are going to see
00:41:55.920 there's the other one another 77 megabytes so those are the two paths that you saw me delete in the previous command
00:42:01.500 that's how you find them and when you delete those two we save 100 megabytes so if your
00:42:06.720 container is taking a while to download it may be worth inspecting the layers and seeing if there's any low hanging
00:42:12.119 fruit there that may help all right we're done now with tricks to
00:42:17.700 optimize your startup times and helping your tests start faster is one of the most important things you can do
00:42:23.400 that's going to let you parallelize more aggressively and have faster overall times but I want to talk about a few more
00:42:29.520 things that are important to keep in mind first of all observability as I mentioned a number of times now
00:42:36.000 CI times are quite variable and that's because there's a lot of factors involved for starters how long the
00:42:41.400 container takes to start depends on whether the particular machine you're running on has the layers cached but also sometimes the network is running a
00:42:47.339 bit slow and things take longer other times the machine you're running on is just having a sad day
00:42:53.040 and because of this it's really hard sometimes to know whether you're actually making improvements you may do a little experiment in your CI config
00:42:59.880 you put it in a branch and it runs super fast and maybe that's because your experiment was successful
00:43:05.099 or maybe you just got lucky it's hard to know because it varies so much and it's also hard to stay on top of your setup
00:43:10.920 which over time is going to Trend towards taking longer and longer and you're probably not going to notice that drift unless you're watching it like a
00:43:17.700 hawk which you aren't so to combat this the very least you can do is sample when you're experimenting
00:43:25.079 before you try some change you're going to want to see how long it's taking now you want to get a Baseline and to do this don't look at just the last build
00:43:31.920 in main look at the last five ten builds over the last couple of days and notice not just the average time the different
00:43:38.460 steps take but also the variance get familiar with how things normally perform what steps are consistent and
00:43:44.460 which are all over the place because those are going to be the ones that lead to slow builds and the ones that you
00:43:49.560 want to focus on if you can and when you're running an experiment re-push to your branch three five times
00:43:55.380 so you can have more samples and that's going to give you an idea of whether you're actually changing things or just getting lucky
00:44:01.140 but ideally you can build some observability over this it's going to be hard to be specific on what's the best
00:44:06.599 way to do this because it'll depend a lot on your specific observability stack but as a general pointer GitHub will
00:44:12.900 send you web hooks when CI steps complete and you can use those web hooks to build information about timings and push those to your observability layer
00:44:19.500 of choice and then you can make yourself a beautiful dashboard that's going to let you see with more precision how your
00:44:25.440 changes affect CI timings and how they evolve over time now I admit this is quite a bit of work
00:44:31.020 but we've had great results with this because it's an early warning system that things are getting slower and it also
00:44:36.839 gives us much more confidence on the changes that we make
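As a minimal sketch of that pipeline, assuming GitHub's `check_run` webhook payload (which carries `started_at`/`completed_at` timestamps; the payload below is simplified and fabricated), you might compute a step duration to forward to your metrics backend like this:

```ruby
require "json"
require "time"

# Extract a CI step's duration from a (simplified) GitHub check_run
# webhook payload, so it can be pushed to an observability backend.
def step_duration_seconds(payload_json)
  run = JSON.parse(payload_json).fetch("check_run")
  (Time.parse(run["completed_at"]) - Time.parse(run["started_at"])).to_i
end

payload = '{"check_run": {"name": "rspec",
            "started_at": "2023-05-01T10:00:00Z",
            "completed_at": "2023-05-01T10:12:30Z"}}'
puts step_duration_seconds(payload)   # => 750
```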
00:44:45.300 flaky tests are the bane of our existence the worst scenario which a lot of us are
00:44:51.000 used to is when your tests take forever to run and then a flaky test fails so you need to rerun everything again and
00:44:57.180 wait for it again it just adds insult to injury and no matter how fast you make your CI Suite run if you need to run it
00:45:03.780 again often you're going to have a sad time and for this I have two suggestions first RSpec has a feature where it
00:45:11.099 will store in a file the tests that failed and then you can execute it again and run only the failed tests
00:45:17.460 and the way this works is RSpec is going to store the test failures in a file that looks a bit like this
00:45:23.220 and it's going to use that to know what to run next time so you can do this in
00:45:28.500 CI you basically run your test and then re-run only failures in case there were some flakies and this doesn't fix the
00:45:34.800 problem but it makes it less likely that it will make your CI red now the other thing you want to do is
00:45:40.619 fix your flakies and for this it's important to think about motivations when you get a flaky you obviously would
00:45:47.520 like to fix it like the good developer that you are but you were actually trying to get something done and you need to ship that thing and the flaky is getting in
00:45:53.819 your way and the thing is that test is actually owned by another team anyway so you don't really know what to do with it so you just retry the branch and move
00:45:59.819 on with the actual thing you are trying to achieve we all do it we don't like it but that's life right
00:46:06.780 and then every now and then we do a bug bash or a hackathon or whatever and we're gonna fix these but nobody's
00:46:12.060 actually keeping track of what tests were flaky so you can't fix them either like it's hard to actually get around to
00:46:17.520 fixing flaky tests but we can do better by having robots that help us keep track
00:46:23.220 of them and that gives us the right alignment what you want to do is automatically detect these flakies and create a ticket
00:46:29.280 in your bug tracker now the basic idea of how you do this is once you've run and had failures you
00:46:34.800 make a copy of the failure file and run again with only failures and now you have two failure files one for each run
00:46:40.859 and they should be identical and if they aren't then you have a flaky test and with a bit of bash hackery you can
00:46:48.240 find tests that have failed in the first run but succeeded in the second one now don't look at that code too hard you
00:46:53.640 can find the complete thing explained in the supplementary repo but you can find those flakies and pass their paths to a
00:47:00.839 utility that will create a ticket in Jira or whatever bug tracker you use in order for somebody to fix that
00:47:07.859 test now jira has a CLI that will do this for you which you can install in your CI Docker file other bug trackers
00:47:13.319 also have clis and worst case scenario you can just curl into that API and now there's an actual ticket opened
00:47:20.700 by a robot so it's not even you being annoying or anything there's a ticket which somebody needs to triage and assign to the right team and I mean it may
00:47:27.300 get prioritized for way later sure but now it's visible and it's trackable and trackable means that it's fixable
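A minimal Ruby sketch of the comparison step (the talk does this with bash; the file names here are illustrative stand-ins for RSpec's persistence files):

```ruby
require "set"

# Tests that failed on the first run but not on the second are flaky.
def flaky_examples(first_run_file, second_run_file)
  first  = File.readlines(first_run_file,  chomp: true).to_set
  second = File.readlines(second_run_file, chomp: true).to_set
  (first - second).to_a
end

# Tiny demo with fabricated failure lists:
File.write("run1.txt", "./spec/user_spec.rb[1:1]\n./spec/order_spec.rb[1:2]\n")
File.write("run2.txt", "./spec/order_spec.rb[1:2]\n")
puts flaky_examples("run1.txt", "run2.txt")   # => ./spec/user_spec.rb[1:1]
```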
00:47:34.859 we have done this and it has massively helped us reduce the number of flakies that we have because now it's a ticket
00:47:40.020 that is someone's problem and even if you batch them and solve them a lot later it is way more actionable than
00:47:46.260 having something that gets in your way at the worst possible time that you're just gonna retry and get on
00:47:51.720 with it and move on with your life
00:47:58.079 okay let's talk about uneven distributions I talked earlier about how container startup
00:48:04.140 times are the main barrier to lots of parallelism because if your tests take a long time to even start pretty quickly
00:48:09.839 you get to a point where adding more machines isn't really very helpful there's also another barrier you can hit
00:48:15.180 which is tests distributed unevenly between machines now the ideal of parallelizing is you're going to go from
00:48:21.420 this to something kind of like that now we've got 10 machines so after setup costs our tests should take a tenth of the
00:48:28.500 original time but this is not exactly what you get what you get is a bit more like this because some test files take longer than
00:48:34.980 others so not all of the machines finish at the same time and now your CI time is as long as the longest
00:48:40.740 running machine now what you see here is actually pretty good if you're getting a distribution like this you got pretty lucky that
00:48:47.160 little red line down there that's how much extra time you got over the ideal scenario so that's not bad however
00:48:53.819 sometimes you get into a pathological situation where this happens now this is kind of exaggerated for effect I mean
00:49:00.060 you would have to get really unlucky to get this but you kind of see the idea here if you have this distribution
00:49:05.280 adding more machines doesn't really help you that much unless you get lucky and that rejiggles the files in a more
00:49:11.339 favorable way but hoping for that kind of rejiggle is not a good situation to be in
00:49:16.740 and the way to work around this obviously is to distribute your files between your machines so that the times end up more even
00:49:22.380 but unfortunately this is hard this is a very annoying problem I know of only two
00:49:27.780 solutions and much to my dismay they both involve mentioning commercial vendors
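[Both solutions boil down to the same scheduling idea, so here is a minimal sketch of it: a greedy "longest file first" split that uses previously recorded timings. The spec file names and durations below are invented for illustration, not taken from the talk.]

```ruby
# Greedy "longest job first" split: sort files by recorded duration
# (descending), then always hand the next file to the machine with the
# least total work so far. Durations here are illustrative.
def split_by_timing(timings, machines)
  buckets = Array.new(machines) { { files: [], total: 0.0 } }
  timings.sort_by { |_, secs| -secs }.each do |file, secs|
    bucket = buckets.min_by { |b| b[:total] }
    bucket[:files] << file
    bucket[:total] += secs
  end
  buckets
end

timings = {
  "spec/models/user_spec.rb"    => 120.0,
  "spec/features/login_spec.rb" => 90.0,
  "spec/models/order_spec.rb"   => 60.0,
  "spec/lib/parser_spec.rb"     => 30.0,
}
buckets = split_by_timing(timings, 2)
# CI time is the *longest* bucket, not the average:
puts buckets.map { |b| b[:total] }.max  # => 150.0
```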
00:49:33.599 one of them I mentioned in passing earlier Circle CI has a CLI tool that helps to split the tests between machines now this little tool can split
00:49:40.920 things in many different ways and one of them involves storing in a file how long
00:49:47.880 each test took a file that is kind of similar to the RSpec one that I showed you for the flakies and it stores that
00:49:53.040 file centrally so it will persist between builds and then it uses those timings to try and split things more fairly and it
00:50:00.000 works pretty well to be honest so if you're in Circle already you can do that the other solution I know of is Knapsack
00:50:06.000 Knapsack is a commercial solution that acts as an external queue and each of your machines is going to talk to a queue and
00:50:12.060 pull tests from it the way it works is they run a server that knows all of the files that have to run and each machine is
00:50:17.700 repeatedly asking the server for more files to run so as a machine churns through files faster it will get more
00:50:24.060 files and if you have a machine that's running longer files it's going to end up getting fewer files and that evens
00:50:29.400 things up now supposedly they also store past test timings and they do a bunch of fancy
00:50:34.980 magic to distribute things better you know hence the name get it I'm not sure
00:50:40.319 how much that helps in my uninformed opinion just having the central queue and the gradual pull is doing most of
00:50:46.020 the heavy lifting now a side advantage of Knapsack is also that for those CI providers that don't
00:50:51.900 help you do parallelism you no longer have to keep track of which machine is which and split the files manually
00:50:57.420 you basically start as many machines as you want they all pull from the queue and that makes your life easier if your
00:51:02.940 CI provider doesn't it's not a super cheap solution but it may be worth trying them and seeing if it helps with your times
00:51:09.119 and if it's worth it now this is something I've thought about
00:51:15.180 but I've never actually tried it so take it with a pinch of salt but the RSpec failures file stores
00:51:21.300 how long each test took after running your tests you could do some pre-processing on it and make a little
00:51:26.880 file of your own of how long each file takes and you could persist that file between builds using your CI's caching
00:51:33.960 mechanism that you would normally use for gems and you could have a little Ruby script in the middle of that
00:51:40.200 command that calls awk that first sorts all of the files by time descending
00:51:46.140 and if you do that I think you've made yourself your own Poor Man's knapsack
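[As a sketch of what that little Ruby script could look like: this assumes RSpec's example status persistence file (the one `--only-failures` uses, often `spec/examples.txt`) in its default pipe-separated column layout, sums the recorded run times per spec file, and prints the files slowest first. The file path and format are assumptions, so check them against your own setup.]

```ruby
# "Poor Man's knapsack" preprocessing: read RSpec's example status
# persistence file, sum the recorded run_time per spec file, and
# return the files sorted by total time descending so each machine
# picks up big files before small ones.
def files_by_total_time(persistence_file)
  totals = Hash.new(0.0)
  File.readlines(persistence_file).each do |line|
    example_id, _status, run_time = line.split("|").map(&:strip)
    next unless example_id&.start_with?("./")  # skip header/separator rows
    file = example_id.sub(/\[.*\]\z/, "")      # "./spec/a_spec.rb[1:1]" -> "./spec/a_spec.rb"
    totals[file] += run_time.to_f              # "1.52 seconds" -> 1.52
  end
  totals.sort_by { |_, secs| -secs }.map(&:first)
end
```

[You could print this list and feed it to whatever split mechanism your CI gives you, persisting the timing file with the same cache you use for gems.]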
00:51:53.400 now again I haven't tried it so I may be missing something and from the perspective of throwing money at the
00:51:58.440 problem to save developers' time that's probably a bad idea but if uneven tests are killing you and
00:52:04.500 for some reason you can't use something like Knapsack it may be worth trying again it's a wild thought but under the
00:52:11.400 right circumstances maybe that helps okay so now we've turned our containers
00:52:16.980 into lean mean testing machines they start up super fast we're running them dozens at a time they finish evenly our
00:52:22.619 CI times are amazing which we can see because we have beautiful dashboards and all of our flakies are gone and
00:52:27.900 that's awesome right we win but here's the bad news
00:52:33.359 as I mentioned earlier execution time can vary a lot depending on how sad the machine that you're running in is and
00:52:40.200 that will always affect your CI times a little bit but sometimes you're going to get a
00:52:45.720 machine that is really really sad and it'll just take forever to run your tests or the network is going to fail
00:52:51.960 you a little bit and something that should be instant is going to take 10 minutes or literally forever you may get stuck
00:52:57.900 and it never actually finishes and you need to actually go and cancel it manually now this doesn't happen often but when
00:53:03.660 it happens it is really sad and it kind of negates a lot of the improvements that
00:53:08.819 we've made and so here's the bad news and the sad part
00:53:14.579 the more machines you run in parallel the faster your test will finish but also the more likely it is that one
00:53:21.180 of them is a sad one and that one's gonna take a really long time now again this happens very infrequently
00:53:28.380 but if you're running I don't know 64 machines it'll happen on many more pushes than if you're running four so
00:53:35.160 you need to keep an eye on that because the more machines you add the faster your things will run until you
00:53:40.920 hit a point where things actually start taking longer on average because of this problem and sadly there isn't a silver bullet
00:53:49.079 for this this is a classic trade-off the best we can do is to have good observability which is going to let us
00:53:55.020 figure out what the sweet spot is that is going to give us the lowest CI times on average
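[One way to reason about that sweet spot: if a single machine has some small probability p of being "sad" on a given run, the chance that a push hits at least one sad machine grows quickly with the machine count. The p value below is an assumed illustrative number, not a measured one.]

```ruby
# Back-of-the-envelope model for the parallelism trade-off: with n
# independent machines, each "sad" with probability p on a given run,
# the chance that at least one of them is sad is 1 - (1 - p)^n.
def chance_of_a_sad_machine(n, p = 0.01)
  1 - (1 - p)**n
end

[4, 16, 64].each do |n|
  printf("%2d machines -> %.0f%% of pushes hit a sad machine\n",
         n, 100 * chance_of_a_sad_machine(n))
end
```

[With these assumed numbers, going from 4 to 64 machines takes you from a few percent of pushes to nearly half, which is why adding machines eventually starts costing you average CI time.]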
00:54:01.980 and finally I want to make a quick note about critical paths depending on the complexity of your project and how much
00:54:07.920 tooling and automation you've developed you may have a CI workflow that is quite complicated this is ours
00:54:14.700 and one thing that is kind of obvious if you think about it but it's very easy to lose sight of is that the
00:54:22.559 only thing that you care about is how long it takes to get your branch to green you don't care how long it takes
00:54:27.660 each individual step to be green you only care about how long it takes for the last one to be green
00:54:34.020 so you should focus all of your efforts on the steps that are in the critical path for that
00:54:39.180 now in this particular workflow these three are the only steps that matter optimizing anything else is pretty much
00:54:45.960 a waste of effort now for example it's possible that we could make this secrets step go faster I mean it's taking quite a while but that's not
00:54:52.800 going to make the overall workflow faster so we shouldn't bother and this is especially true if you're
00:54:59.579 making steps faster by adding parallelism because now that makes it more likely that one of them will hit a
00:55:04.619 slow machine as I just mentioned and now you've shot yourself in the foot because for the steps in the critical path
00:55:10.680 there's a sweet spot of parallelism where you balance the time that you gain with the risk of a sad machine
00:55:15.900 but for the steps that are not in the critical path if they are using any more machines than they absolutely need to
00:55:21.300 you are getting all of the extra risk with no benefit to offset it because you're not going to get to Green faster
00:55:27.059 and you're also spending more money on those machines which again you don't get any actual benefit for that so focus on
00:55:33.720 the critical path ignore everything else now of course as you make a step faster you may remove it from the critical path
00:55:39.960 and that's amazing but make it no faster than that and focus on the new critical ones
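[The "time to green" the talk keeps coming back to can be modeled as the longest path through the job dependency graph; here is a toy sketch of that idea (the job names and durations are invented) showing why dropping an unneeded dependency shortens it:]

```ruby
# Time-to-green is the longest path through the dependency graph, not
# the sum of all jobs: a job can only start once its slowest
# dependency has finished.
def time_to_green(durations, deps, job)
  ready_at = deps.fetch(job, []).map { |d| time_to_green(durations, deps, d) }.max || 0
  ready_at + durations[job]
end

durations = { "bundle" => 2, "yarn" => 5, "unit_tests" => 10 }

# unit_tests waiting on both setup jobs:
puts time_to_green(durations, { "unit_tests" => %w[bundle yarn] }, "unit_tests")  # => 15
# after restructuring so unit_tests only needs bundle:
puts time_to_green(durations, { "unit_tests" => %w[bundle] }, "unit_tests")       # => 12
```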
00:55:45.599 and importantly this means not only making steps faster but also considering the dependencies between them there may
00:55:52.079 be ways of restructuring your tasks so that one of your steps no longer depends on another and you can gain a huge
00:55:57.720 amount of time from this and this is very typical if you have for example a setup bundle and yarn job and then you have a
00:56:04.319 bunch of other steps that depend on it if your unit test step is the slowest one and it is depending on setup bundle
00:56:10.859 and yarn that may be a bad idea because it probably doesn't need yarn and so you can make only the steps that
00:56:17.520 do need it depend on it and save yourself some serious time on the critical path which is the final time
00:56:23.099 to green and the same goes for jobs that use a lot of different containers for the dependencies it's common for
00:56:28.559 only a few tests to need all of those dependencies you can separate those tests out into their own job and you only add
00:56:34.559 those extra containers on that job like we did with search at the beginning of the talk so the rest of the tests don't
00:56:39.900 really need to wait in this case for Elasticsearch to boot up so that's pretty much all I wanted to cover
00:56:48.059 to do a quick recap you want to parallelize a lot but focus on optimizing your startup times and you
00:56:54.359 can save time on bundle install git checkout and your container spin-up time build your own images so you can have
00:57:01.200 absolute control over how all those things work keep those layers tight optimize your dependencies improving
00:57:07.680 your critical path keep an eye on your runtimes over time and get rid of flakies by putting them
00:57:13.920 in your backlog and that's it for me I've covered a lot of different techniques just now some of
00:57:19.500 those will hopefully help again your mileage will vary some of these will help you in your particular
00:57:24.900 scenario some won't but hopefully this will give you some ideas on how to approach the problem and some
00:57:30.599 combinations of these techniques are going to allow you to reach CI bliss
00:57:36.000 thank you