Fixing Flaky Tests Like a Detective

by Sonja Peterson

In the talk "Fixing Flaky Tests Like a Detective," presented by Sonja Peterson at RailsConf 2019, the speaker delves into the pervasive issue of flaky tests in software development. Flaky tests are those that pass sometimes and fail at other times without any changes to the underlying code. Sonja emphasizes the importance of not only fixing these by identifying their root causes but also preventing them from being introduced in the first place. She shares a structured approach that parallels detective work to diagnose and resolve flaky tests efficiently.

Key Points Discussed:

  • Challenges of Flaky Tests: They can significantly impede the development process, leading to wasted time and lost trust in automated tests.
  • Categories of Flaky Tests: Sonja identifies typical culprits behind flaky tests:
    • Async Code: Tests influenced by the order of asynchronous events, particularly in feature tests using Capybara.
    • Order Dependency: This arises when tests' outcomes change based on the state influenced by previous tests, emphasizing the need to isolate test states.
    • Time Issues: Tests failing due to date and time calculations, potentially fixed by using libraries like Timecop.
    • Unordered Collections: Ensure that database queries used in assertions return results in a predictable order; ordering the results (or asserting without regard to order) avoids flaky failures, as shown in the sketch after this list.
    • Randomness: Reducing reliance on randomness in tests increases reliability.
  • Information Gathering: Before attempting fixes, developers should gather data on flaky tests such as error messages, timing of failures, and the running order of tests. This can be managed using a bug tracking system.
  • The Detective Method: Adopt a systematic method similar to investigating a crime: gather evidence, identify suspects, form hypotheses, and test fixes rather than relying on trial and error.
  • Team Collaboration: The collective responsibility is emphasized to handle flaky tests. Designated individuals should monitor and fix these tests while ensuring continuous communication within the team.
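
To make the unordered collections point concrete, here is a minimal RSpec sketch; the Post model, its "active" scope, and the factory_bot factory are illustrative assumptions rather than code from the talk.

```ruby
# Sketch: a query with no ORDER BY has no guaranteed order, so pin the order
# down or assert without depending on it.
RSpec.describe Post do
  it "returns the active posts" do
    older = create(:post, :active, title: "Alpha")   # factory_bot helpers assumed
    newer = create(:post, :active, title: "Beta")

    # Flaky: rows may come back in any order without an explicit ORDER BY.
    # expect(Post.active).to eq([older, newer])

    # Option 1: make the order explicit on both sides of the assertion.
    expect(Post.active.order(:title)).to eq([older, newer])

    # Option 2: assert on the contents without caring about order.
    expect(Post.active).to match_array([older, newer])
  end
end
```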

Conclusions and Takeaways:
- Flaky tests are an inherent part of a developer's journey, but they can also be opportunities for growth and learning. Fixing them leads to improved understanding of both code and testing frameworks.
- It is vital to maintain a healthy test suite with a focus on high reliability and effectiveness, which ultimately leads to better software quality and development speed. The overarching message is that addressing flaky tests is a valuable investment in the stability of software development processes.

RailsConf 2019 - Fixing Flaky Tests Like a Detective by Sonja Peterson
Every test suite has them: a few tests that usually pass but sometimes mysteriously fail when run on the same code. Since they can’t be reliably replicated, they can be tough to fix. The good news is there’s a set of usual suspects that cause them: test order, async code, time, sorting and randomness. While walking through examples of each type, I’ll show you methods for identifying a culprit that range from capturing screenshots to traveling through time. You’ll leave with the skills to fix any flaky test fast, and with strategies for monitoring and improving your test suite's reliability overall.

RailsConf 2019

00:00:20.689 all right so just to introduce myself I'm Sonja and I really appreciate y'all
00:00:28.140 coming to my talk and RailsConf for having me and today I'm gonna be talking about fixing flaky tests and also about
00:00:35.339 how reading a lot of mystery novels helped me learn how to do that better so
00:00:41.220 I want to start out by telling you a story and it's about the first flaky test that I ever had to deal with it was
00:00:47.400 back in my first year as a software engineer and I'd worked really hard building out this very complicated form
00:00:52.980 it was my first big front-end feature and so I wrote a lot of unit and feature tests to make sure that I didn't miss
00:00:58.829 any edge cases everything was working pretty well and we shipped it but then a
00:01:04.199 few days later we started to have an issue a test failed unexpectedly on our
00:01:09.330 master branch the failing test is one of the feature tests for my form but nothing related to the form had changed
00:01:15.270 and it went back to passing in the next build the first time it came up we all kind of ignored it tests fail randomly
00:01:23.009 once in a while and that's okay right yeah then it happened again and again
00:01:31.289 and so I said fine okay no problem I will spend an afternoon digging into it and I'll fix it and we'll all move on
00:01:37.340 the only problem was I had never fixed a flaky test before and I had no idea why a test would pass or fail on different
00:01:44.160 runs so I did what I often did when trying to debug problems that I didn't
00:01:50.640 really understand I started out by trying to use trial and error so I made a random change and then I ran the test
00:01:56.819 over and over again to see if it would still fail occasionally and that kind of trial and error approach
00:02:02.940 can work sometimes with normal bugs sometimes you even start using trial and error and that leads you to a solution that helps you better understand the
00:02:09.119 actual problem but that didn't work at all with this flaky test trying a random fix and running it 50 times didn't
00:02:15.930 actually prove to me that I had fixed it and then a few days later even with that fix it still failed again
00:02:21.150 so I needed another approach and that's exactly what makes fixing flaky tests so
00:02:26.640 challenging you really can't just try random fixes and test them by running the test over and
00:02:31.989 over again it's a very slow feedback loop we eventually figured out a fix for that flaky test but not until several
00:02:38.590 different people had tried random fixes that failed and it sucked up entire days of work and the other thing I learned
00:02:45.970 from this was that even just a few flaky tests can really slow down your team when a test fails without actually
00:02:52.390 signalling something wrong with the test suite you not only have to rerun all of your tests before you're ready to deploy
00:02:57.489 your code which slows down the whole development process you also lose a little bit of trust in your test suite
00:03:03.819 and eventually you might even start ignoring real failures because you assume they're just flaky tests so it's
00:03:10.450 super important to learn how to fix flaky tests efficiently and better yet avoid writing them in the first place
00:03:17.040 for me the real breakthrough in figuring out how to fix flaky tests was when I came up with a method instead of trying
00:03:25.420 things randomly I started by gathering all the information I could about the flaky tests and the times that had
00:03:31.060 failed then I used that information to try to fit it into one of the five main categories of flaky tests we'll talk
00:03:37.660 about what those are in a minute and then based on that I came up with a theory of what might be happening then
00:03:43.709 based on that theory I would implement a fix at the same time that I was figuring
00:03:52.090 this out I was on kind of a mystery novel binge and it struck me that every time I was working on fixing a flaky
00:03:57.310 test I felt kind of like a detective solving a mystery after all the steps to do that at least in the novels I read
00:04:03.160 which are probably very different from real life are basically starting with
00:04:08.680 gathering evidence then you identify suspects you come up with a theory of means and motive and then you can solve
00:04:15.669 it and so thinking about fixing flaky tests that way made it much more
00:04:21.070 enjoyable and actually became kind of a fun challenge for me instead of just a frustrating and tedious problem that I
00:04:26.440 had to deal with so that's the framework I'm going to use in this talk for explaining how to fix flaky tests let's
00:04:33.430 start with step 1 gathering evidence there are lots of pieces of information
00:04:39.130 that can be helpful to have when you're trying to diagnose and fix flaky tests some of those include error
00:04:45.659 messages and output for every time that you've seen it fail the time of day
00:04:50.819 those failures occurred how often the test failing is it failing every other time or just once in a blue moon and
00:04:57.469 which tests were run before the test when it failed and in what order so how
00:05:03.599 can you efficiently get all of this information a method that I've used in the past and that has worked well is to
00:05:09.599 have any failures on your master branch or whatever branch you would not expect to see failures on because tests
00:05:15.509 had to pass before merging into it automatically sent to a bug tracker with
00:05:21.300 all the metadata you'd need such as a link to the CI build where they failed I've had success doing this with rollbar
00:05:27.180 in the past but I'm sure other bug trackers would work for this as well and when doing that it's important to make
00:05:33.389 sure that the failures for the same test can generally be grouped together in the bug tracker it might take a little bit
00:05:38.400 of configuration or finessing to get this to work but it's really helpful because then you're able to cross-reference between different
00:05:43.830 occurrences of the same failure and figure out what they have in common which can help you understand why
00:05:50.669 they're happening all right so now that we have our evidence we can start
00:05:55.770 looking for suspects and with flaky tests the nice thing is that there is basically always the same set of usual
00:06:01.889 suspects to start with and then you can narrow down from there those suspects are async code order dependency time
00:06:11.150 unordered collections and randomness so I'm gonna go through each of these one by one I'm gonna talk through an example
00:06:17.610 or two how you might identify that a test fits into one of these particular categories and then how you would go
00:06:23.610 about fixing it based on that so let's start with async code which in my
00:06:29.610 experience is often one of the biggest categories of flaky tests when testing rails apps when I say async code I'm
00:06:36.990 talking about tests in which some code runs asynchronously which means that the events in the test can happen in more
00:06:42.990 than one order the most common way this comes up when you're testing rails apps
00:06:48.629 is in your system or feature tests so most rails apps use capybara either through rails built in system tests or
00:06:54.990 RSpec feature tests to write end-to-end tests for the application that spin up a rails server and a browser and then the test
00:07:01.310 interacts with the app similar to the way an actual user would and the reason
00:07:07.100 you're necessarily dealing with async code and concurrency when you write capybara tests is that there are at
00:07:12.229 least three different threads involved there's the main thread executing your test code there's another thread that
00:07:18.110 capybara spins off to run your rails server and then there's a separate process that's running the browser which
00:07:24.139 capybara controls via driver so to make this a little more concrete let's talk about a simple example imagine you have
00:07:31.280 a capybara test that clicks on a submit post button in a blog post form and then it checks that that post is created in
00:07:36.979 the database here's what the happy path for this test looks like in terms of the
00:07:42.260 order of the events that occur within it first in your test code we tell capybara we want to click on that button so in
00:07:49.010 the browser that triggers a click which sends off an ajax request to the rails
00:07:54.139 server which creates a blog post in the database when that request returns it updates the UI and then your test code
00:08:01.250 checks the database and sees that the post is there everything works great so the order of events in the browser and
00:08:07.310 server timeline here is pretty predictable provided you're not optimistically updating the UI before the requests that created the blog post
00:08:14.330 returns and that's one reason why you should avoid optimistic updates if you
00:08:19.910 can because they can create both a flaky test and a flaky user experience but the events in the test code timeline
00:08:26.630 on the top here are less predictable in terms of where they happen in relation to the other ones so one problematic
00:08:33.890 ordering would be if right after we click on submit post the test code can move right along to check the
00:08:39.950 database and it happens to get to the database before the browser and the test's rails server have finished going through
00:08:45.440 the process that creates that blog post so then we'll check the database we won't see anything there and the test
00:08:50.450 will fail the fix here is relatively simple we just need to make sure that we
00:08:56.540 wait until the request has finished before we try to check for anything in the database and we can do this by
00:09:01.640 adding one of capybaras waiting finders like have content which will look for something on the page and then retry
00:09:07.610 until it shows up up to a certain timeout so basically it'll check the page to see if "post created" is on it if it's not there it'll wait
00:09:14.720 for a second and then check again until it sees it there and only then will it be able to move on to the next line of
00:09:20.029 code where we check for the post in the database so with that code implemented
00:09:25.850 this is what the time line looks like have content will block us from moving forward until the rest of the process
00:09:32.420 has finished so that's a relatively simple async Flake and probably something that you've dealt with if
00:09:37.550 you've written some capybara tests but they can get a lot more complicated and sneaky so let's look at another example
00:09:44.740 here we have a test which goes to a page with a list of books clicks on a sort
00:09:49.819 button waits for the books to show up in that sorted order using one of capybaras waiting finders then clicks again to
00:09:57.800 reverse that order and waits for the order to show up again so provided
00:10:03.199 expect alphabetical order and expect reverse alphabetical order are both using those same waiting finders I was
00:10:08.389 talking about that will retry until things show up in the right place it seems like this should work well we're waiting in
00:10:13.730 between each thing that we do but it is possible for this to be flaky the way
00:10:20.329 that that could happen is if when we visit the books path the books happen to already be sorted so then when we click
00:10:27.709 on sort and expect the alphabetical order that expect alphabetical order line is no longer actually waiting or
00:10:33.050 blocking anything for us it passes immediately and we move on to the next click so both of those clicks
00:10:39.050 can actually happen before we reloaded the page the first time with the books in alphabetical order it just kind of
00:10:45.139 acts like a double click and as a result we can end up with the test never
00:10:51.860 getting to the state of being in a reverse alphabetical order the fix here is actually fairly similar to the last
00:10:58.069 one we just need to add some more specific waiting finders to make sure that we don't move on through our test
00:11:04.189 code too quickly so in this case we might look for something on the page that indicates the request has actually finished beyond the fact that the books
00:11:10.189 are in order then we can safely move on to the next step
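A rough sketch of that more specific wait, assuming the page renders some indicator only after the sort request completes; the labels here are illustrative, not from the talk.

```ruby
RSpec.describe "Sorting books", type: :system do
  it "sorts and then reverse-sorts the books" do
    visit books_path

    click_button "Sort"
    # Waiting only for alphabetical order can pass immediately if the books
    # already happen to be sorted, turning the two clicks into a double click.
    # Wait instead for something only the completed request puts on the page.
    expect(page).to have_content("Sorted A to Z")

    click_button "Sort"
    expect(page).to have_content("Sorted Z to A")
  end
end
```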
00:11:17.490 so if you're looking at a given flaky test and you're trying to figure out whether it might belong to this async code category the first question I
00:11:24.569 usually look at is is it a system or feature test something that uses capybara or some other way of interacting with the browser since
00:11:29.850 that's the number one place for these show up it is possible that you have other areas in your codebase where
00:11:35.129 you're dealing with async code but this is generally the biggest one and then within that does it trigger any events
00:11:41.699 without explicitly waiting for the results even in a place where it looks relatively innocent it's always a good
00:11:47.730 idea to make sure that you're behaving like a real user would and waiting in between each thing you do to see the result when you're trying to identify
00:11:56.550 whether the flake is due to some async code it can also be helpful to use capybaras ability to save screenshots
00:12:02.509 which you can use by just calling save screenshot directly provided you're using one of the drivers that supports
00:12:07.889 that or the capybara-screenshot gem which helps you wrap your tests
00:12:12.929 so that every time they fail you'll capture a screenshot of the end state of the test when you're looking to prevent
00:12:22.499 async flakes there's a few things to keep in mind first as I mentioned make sure your test is waiting for each
00:12:28.079 action within it to finish and when you're doing this make sure you're not using sleep or waiting for some
00:12:33.660 arbitrary amount of time it's important to wait for something specific and that's because if you wait for an
00:12:38.670 arbitrary amount of time at some point your code will just happen to be running slowly enough that that arbitrary amount
00:12:43.769 of time isn't long enough and it will flake again it also means that you might be waiting longer than you need to in a
00:12:49.619 lot of other cases because the process happened faster and so by waiting for something specific you can avoid both of
00:12:56.009 those pitfalls it's also important to understand capybaras api which methods
00:13:01.949 wait and which don't so everything based on find will generally wait but there are a few certain things like all that don't wait
00:13:08.699 in the same way and so it's just very important to be familiar with all of capybaras Docs and how to use its tools
00:13:17.040 correctly finally it's important to check that each assertion you're making in the test is working as you expect it
00:13:23.399 to it's very easy to write assertions that look like they're doing the correct
00:13:28.559 waiting behavior but actually don't as we saw in that double-click example sometimes content is already on the page in a
00:13:34.490 different place and it allows kind of accidental success all right so let's
00:13:41.569 move on to our next suspect order dependency I define this category
00:13:47.449 of tests as any that can pass or fail based on which tests ran before them
00:13:52.509 usually this is caused by some sort of state leaking between tests so
00:13:57.800 when the state another test creates is present or not present it can cause the flaky test
00:14:03.410 to fail and there are a few potential areas where a shared state can happen in
00:14:10.220 your tests one is the database another is global or class variables if those
00:14:16.970 are modified within your tests and then there's also the browser typically one
00:14:22.459 of the biggest issues with rails apps is database state so let's talk about that a little more in depth when you're
00:14:29.779 writing tests each test should start with a clean database that might not mean a fully empty database but if
00:14:36.620 anything is created updated or deleted in the database during a single test it should be put back the way it was at the
00:14:42.949 beginning I kind of think of it like Leave No Trace when you're camping so this is important because otherwise
00:14:48.290 those changes to the database could have unexpected impacts on later tests or create dependencies between tests so
00:14:53.929 that you can't remove or reorder tests without risking cascading failures there
00:14:59.269 are several different ways to handle clearing out database state wrapping your tests in a transaction and rolling
00:15:05.269 it back after the test is generally the fastest way to clear your database and it's the default for tests in rails but
00:15:12.170 in the past you couldn't use transactions with capybara because the test code and the test server didn't
00:15:17.540 share a database connection so they were running in separate transactions and couldn't see the data in each other's transactions rails 5 system tests
00:15:26.480 actually addressed this by allowing shared access to database connections in tests so they could look at data within
00:15:32.029 the same transaction however running in transactions can still have some subtle differences from
00:15:37.670 normal behavior of your app and so there may be reasons why you still don't want to use them as your clean up
00:15:43.250 for example if you have any after commit hooks set up on your models that only run when a transaction commits those
00:15:49.070 probably won't run if you're using transactional cleanup so if you're not using transactional cleanup another
00:15:55.100 option is the database cleaner gem which can clean with either truncating tables or using a delete from statement on them
00:16:02.530 and this is generally slower than transactional but it is a little bit more realistic in terms of your not
00:16:08.870 having an additional transaction wrapped around everything that's happening in your tests and the important thing to
00:16:14.660 make sure if you're using this method is that this database cleanup is running after capybaras cleanup so capybara does
00:16:21.110 some work to make sure that the browser state is cleared and settled between each test including waiting for
00:16:26.540 any Ajax requests to resolve and if you clean your database before that cleanup
00:16:32.060 and waiting happens those Ajax requests could create some data that doesn't get cleaned up so there's a bit
00:16:37.400 of an ordering issue here and you can avoid it if you're using rspec by putting your database cleaner call in an
00:16:43.250 append_after block so why do I tell you all of this
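A minimal sketch of that ordering, assuming RSpec and the database_cleaner gem with a non-transactional strategy; the exact setup will vary from app to app.

```ruby
RSpec.configure do |config|
  config.before(:suite) do
    DatabaseCleaner.clean_with(:truncation)   # start the suite from a clean slate
  end

  config.before(:each) do
    DatabaseCleaner.strategy = :truncation    # or :deletion, instead of transactions
    DatabaseCleaner.start
  end

  # append_after runs after Capybara's own per-test cleanup, which waits for
  # in-flight Ajax requests to settle, so data they create still gets cleaned.
  config.append_after(:each) do
    DatabaseCleaner.clean
  end
end
```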
00:16:49.010 the thing about database cleaning is it should just work and it often does especially if you're just using rails
00:16:55.220 basic built-in transactional cleaning but there are a lot of different ways that you could have your rails app and
00:17:00.920 test suite configured and it is possible to do it in such a way that certain gotchas are introduced so it's important
00:17:06.949 to know how your database cleaner works when it runs and if there's anything it's leaving behind especially if
00:17:12.770 you're starting to deal with flaky tests that seem to be order dependent let's look at an example of this let's say
00:17:19.370 we're using database cleaner with the truncation strategy maybe we started doing that back before rails 5 let us
00:17:25.610 share a database connection and it stuck or maybe we don't want any weirdness around transactions one of those reasons
00:17:31.580 but we notice this is slow so somebody comes in to optimize the test suite a little bit and they notice that we're
00:17:37.640 creating book genres in almost all of the tests they decide to create those
00:17:43.130 genres before the entire test suite runs and then exclude them from the database cleaner so this will speed up our tests
00:17:49.760 a bit but it does introduce a gap in our cleaning if we make any kind of
00:17:55.820 modification to book genre since we're using truncation to clean the database instead of transactions that update won't be undone
00:18:02.870 between tests and this could potentially affect later tests and show up as an order dependent Flake to be clear I'm
00:18:09.500 not picking on database cleaner here I just want to give an example of how a minor configuration change could allow
00:18:15.050 you to create more flakes and why it's important to have a good understanding of how cleaning is actually working in your test suite and the trade-offs you
00:18:21.470 might introduce depending on how you do it as I mentioned at the beginning there
00:18:27.380 are some other possible sources of order dependency via shared state one is the browser since tests run within the same
00:18:33.140 browser that can contain specific state depending on which test just ran capybara works pretty hard to clean all
00:18:39.170 of this up before it moves on to the next test so this should usually be taken care of for you but it is possible again depending on your configuration
00:18:45.590 how you have everything set up that maybe there's something that sneaks through and so it's good to be aware of that as a possible place where shared
00:18:51.320 state could be leaking through another is global and class variables as I mentioned if you
00:18:56.540 modify those they could persist from one test to the next normally Ruby will yell at you if you reassign a global variable
00:19:02.330 but one area where these can kind of sneak in is if you have a hash assigned to a global variable and you just change
00:19:08.000 one of the values within it since that isn't reassigning the entire variable it won't come up as a warning all
00:19:17.120 right so if you're looking at a particular test and you're trying to figure out whether it's being caused by order dependency there's a
00:19:23.660 couple different strategies you can use one is just to start out by trying to replicate the failure with the same set
00:19:30.050 of tests in the same order so if you can take a look at how it ran in your CI or wherever you saw it fail and run the
00:19:35.540 exact same set of tests together with the same seed value to put them in the same order and it fails every time you
00:19:41.660 do that then you have a sense that this is probably an order dependent test but at that point you still don't know which
00:19:46.940 tests are affecting each other so to figure that out you're probably going to want to cross-reference each time you've
00:19:52.250 seen it fail and see if the same tests were running before that failure RSpec
00:19:57.380 has a built-in bisect tool that you can also use to help narrow down the set of tests to the one that produced the
00:20:03.200 dependency however you may find that it can run a bit slowly depending on how fast your test suite runs so sometimes
00:20:09.200 it's easier to just look at things manually in order to prevent order
00:20:15.740 dependency you should make sure that you've configured your test suite to run in random order this might seem kind of
00:20:21.440 counterintuitive but the goal is to surface order dependent tests quickly not just when you add or remove or move
00:20:28.940 around a certain test running in random order is the default in minitest and is configurable in RSpec also make
00:20:36.800 sure you spend some time understanding your entire test setup and teardown process and work to close any gaps where
00:20:41.990 shared state might be leaking through from one test to another all right moving on to our next suspect time this
00:20:49.550 is probably the one that gives me the most headaches this category includes any tests that can pass or fail
00:20:55.280 depending on the time of day that it is run so let's start with an example here imagine we have this code that runs in a
00:21:02.420 before save hook on our task model it sets an automatic due date of the next day at the end of the day if a due date
00:21:08.210 isn't already specified then we write
00:21:13.309 this test we create a task with no due date specified and we check that it's what we expect it to be the current date
00:21:20.179 plus one at the end of the day seems like it should be fine but this test
00:21:25.910 actually starts failing after 7 p.m. every night very strangely and how could
00:21:32.120 that possibly be happening the trouble is we're using two slightly different
00:21:37.130 ways of calculating tomorrow here Date.tomorrow uses the time based on the time
00:21:42.170 zone we set for our rails app while Date.today plus one will be based on the system time so if the system time is in UTC
00:21:48.590 and our rails app's time zone is EST they'll be 5 hours apart and after 7:00 p.m.
00:21:53.600 they'll be different days which results in this failure so how can we avoid this
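For reference, a simplified sketch of the model and test being described; the Task model and its attributes are assumptions used only for illustration.

```ruby
class Task < ApplicationRecord
  before_save :set_default_due_date

  private

  def set_default_due_date
    # Date.tomorrow is based on the Rails time zone (for example EST).
    self.due_date ||= Date.tomorrow
  end
end

RSpec.describe Task do
  it "defaults the due date to tomorrow" do
    task = Task.create!(name: "Prepare slides")

    # Date.today uses the system time zone (for example UTC), so after 7 p.m.
    # EST the two calculations land on different days and this fails.
    # Fixes discussed next: compare against Date.current, or freeze time with Timecop.
    expect(task.due_date).to eq(Date.today + 1)
  end
end
```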
00:21:59.450 one easy fix would be just to use Date.current which respects the time zone instead of Date.today another option would be to
00:22:08.150 use the Timecop gem which basically allows you to freeze time by mocking out what Ruby's sense of time is and so with
00:22:15.050 Timecop we can freeze time here it would be January 1st at 10:00 a.m. and then our expected due date can just be a
00:22:22.370 static value January second at 11:59 p.m. and we can check
00:22:27.740 that the due date is that exact value this can be kind of helpful for making your test a little bit more explicit and
00:22:33.110 simpler so that they don't contain complicated logic that itself needs to be tested when you're trying to
00:22:41.299 determine whether a given flaky test is time based the first obvious thing to do is to look for any references to date or
00:22:46.850 time in the code under test if you have a record of past failures you can also
00:22:51.860 check whether they've all happened around the same time of day and finally if you suspect it's time based you can
00:22:58.700 add Timecop to that spec just temporarily to set it to the time of day where you've seen it fail before and see
00:23:05.000 if it fails every time when you do that as we saw in our example using Timecop
00:23:12.080 to freeze time can make it easier to write reliable tests that deal with time and also easier to understand exactly
00:23:17.899 what you're testing another strategy that you can use to surface time based
00:23:23.390 flakes is to set up your test suite so that it wraps every test in Timecop.travel mocking the time to a
00:23:29.809 different random time of day on each run of the suite that's printed out before the test runs so this might seem a
00:23:35.870 little crazy but it's actually very helpful for surfacing tests that would normally only fail after business hours
00:23:41.480 when nobody happens to be running the test suite so that you see them during the normal business day instead of at
00:23:47.350 midnight when you just got woken up on call and you're trying to desperately ship a deploy and the test suite keeps
00:23:54.049 failing unexpectedly it's just important to make sure that you're printing out the time of day that each
00:24:00.679 test is running at and that you're able to then rerun the test with that same time of day so that later if you're debugging a failure you can easily
00:24:07.279 replicate it all right our next suspect is unordered collections this is a
00:24:14.120 relatively simpler one this is just any test that can pass or fail depending on the order of a set of items that's
00:24:20.029 within it that doesn't have a pre-specified order so let's look at an
00:24:25.340 example here we have a test where we're looking at a set of active posts and we
00:24:30.710 expect them to equal some specific posts that perhaps we've created earlier in the test the issue with this test
00:24:37.570 is that the database query in the first line doesn't have a specific order so even though things will often be
00:24:43.090 returned from the database in the same order just by chance there's no guarantee that this will actually always happen and when it doesn't this test
00:24:50.350 will fail so the fix is just to make sure that we're specifying an order on
00:24:56.799 the items returned by the database and that also our expected posts are in that exact same order when trying to identify
00:25:05.919 whether a flaky test is being caused by unordered collections look for any assertions about the order of an array
00:25:12.369 the contents of an array or the first or last item in one if you're using RSpec
00:25:19.029 you can use the match_array expectation which allows you to basically just assert things about what's in an array
00:25:25.779 without caring about the order or you can just add an explicit sort to both the expectations and what you're looking
00:25:32.109 at all right so we've gotten to our last
00:25:37.749 possible suspect which is randomness and you might think that all of these different categories of flaky tests have
00:25:43.720 something to do with randomness since they're randomly failing but in this case I'm talking about tests that
00:25:48.879 actually explicitly invoke randomness via a random number generator so here's
00:25:55.389 an example of a test data factory that uses factory_bot to create an event if we have a validation that enforces start
00:26:02.109 date and suppose we might start out with just having start date and then
00:26:08.109 adding end date after that at some point and we decide okay start date will be some time five days from now end date
00:26:14.049 will be sometime ten days from now we could run into an issue where end date
00:26:19.359 actually ends up being lower than start date since they're both random values so if we add a validation to events that
00:26:26.710 enforces that then some percentage of the time our tests that deal with events will
00:26:32.559 fail because they'll have invalid data so in this case you're just better off
00:26:37.989 being explicit and creating the same dates every time and this might feel a little counterintuitive because
00:26:42.999 randomness can seem useful as a tool for testing a large spectrum of different
00:26:48.700 types of data and so on but there is a big downside in not being able to know what your tests are actually testing and having them be flaky and so
00:26:57.110 a better strategy is to actually just write tests for each of those
00:27:03.110 specific cases that you would like to test so if you're trying to identify
00:27:09.080 whether randomness is causing your flake the first thing to do obviously is to look for a random number generator
00:27:15.110 and often this will come up in your factories or fixtures but another thing
00:27:20.270 you can try is using the --seed option in either minitest or RSpec and that
00:27:27.409 will allow you to run the tests with the same seed value for randomness and generally the same random values
00:27:32.600 produced with RSpec you just want to make sure that you actually have Kernel.srand set to RSpec's config seed so
00:27:40.130 that passing the seed option will actually control the randomness to prevent randomness
00:27:49.340 based flakes as I mentioned the general strategy is to remove randomness from your tests and instead explicitly
00:27:56.390 test the boundaries and edge cases that you're interested in it's also generally
00:28:01.730 a good idea to avoid gems like Faker to generate data for tests they're very useful for generating realistic
00:28:07.580 seeming data in your dev environment but in your tests at least from my perspective it's more important to have
00:28:13.010 reliable behavior than random and realistic data all right so now we've
00:28:20.720 looked at all of the usual suspects so we can move on to forming a theory and actually solving a flaky test mystery my
00:28:29.090 first strategy tip when trying to find a fix to a flaky test and there isn't an obvious one popping out for you is just
00:28:35.480 to run through each of those categories that I've described and look for any connection or identifying signs that
00:28:40.610 could link this test to one of those so even if it looks perfectly fine if it is dealing with a date maybe try digging
00:28:46.460 down that particular path and again just resist the urge to use trial and error
00:28:52.880 to test fixes it's more important to form a strong theory about how this might be happening first even if you're
00:28:58.100 not 100% sure it's going to work a lot better than using trial and error
00:29:03.160 what you can do and what might involve a little bit of a different kind of trial and error is trying to find a way to
00:29:09.350 reliably replicate failures to prove your theory so this came up a little bit when I was talking about randomness
00:29:15.830 dates and order dependency because for those you have more control over the factors that might be producing the
00:29:22.250 flake you can freeze time you can run the tests in the same order you can use
00:29:27.710 the same random seed and then potentially be able to replicate the failure and since most flaky tests
00:29:34.010 typically are flaking very infrequently and passing most of the time if you're able to get them to fail two or three
00:29:40.880 times in a row you can be pretty confident that you've replicated it versus the other direction when you're using trial and error to test
00:29:46.940 a fix and you're seeing it pass it takes a lot of runs to be confident that that's actually what you're seeing so
00:29:55.400 you might try all those methods and still be stuck flaky tests are hard one strategy you can try if you get to that
00:30:01.550 situation is adding some code that will give you more information the next time it fails so if you've got like a hunch
00:30:07.070 that something's off like perhaps with what's in the database or you're curious about what the value of a certain
00:30:12.380 variable is add that to something that will be logged out in the test and then the next time that it fails in CI you
00:30:18.740 can take a look at that and factor that into your process of fixing it another
00:30:23.960 strategy that I really like is pairing with another developer since fixing flaky tests is so much about your having
00:30:30.710 a deep understanding of your testing tools your framework and your own code everybody is going to have some gaps but
00:30:36.920 when you have two people working together you can fill each other's gaps in a little bit and you can also help keep each other from going down rabbit
00:30:43.220 holes or getting too frustrated chasing down the same wrong theory another
00:30:50.750 question I see coming up a lot at this point is can I just delete it I can't fix it it keeps failing is it even worth
00:30:57.530 it anymore why did I become a developer that kind of thing and my first response
00:31:03.890 to this is that you have to accept that if you're writing tests at some point inevitably you are going to have to deal
00:31:10.310 with flaky ones you can't just delete any test that starts to be flaky because you'll end up making
00:31:15.650 significant compromises in the coverage that you have for your app and also learning to fix and avoid flaky tests is
00:31:21.890 a skill that you can develop over time and it's one that's really worth investing in even if that means
00:31:27.740 spending two days fixing one instead of just deleting it that being
00:31:33.710 said when I'm dealing with flaky tests I do like to take a step back and think about the test coverage I have for a
00:31:39.320 feature holistically what situations do I have coverage for which ones am i
00:31:44.420 maybe neglecting or ignoring and what are the stakes of having the kind of bug that might slip through the cracks in my
00:31:49.880 coverage if the flaky test I'm looking at is for a very small edge case with low stakes or it's something that's
00:31:55.670 actually well covered by other tests or could be covered by a different type of test maybe it does make sense to delete
00:32:01.220 it or replace it and this ties into a bigger picture idea which is that when
00:32:06.770 we're writing tests we're always making trade-offs between realism and maintainability using automated tests
00:32:13.520 instead of manual QA is itself a trade-off in terms of substituting in a machine to do the testing for us which
00:32:20.570 is going to behave differently than an actual user would but it's worth it in a lot of situations because we can get
00:32:26.660 results faster and consistently and we can add tests as we code so different
00:32:31.670 types of tests will go to different lengths to mimic real life and generally the most realistic ones are the ones
00:32:37.490 that are hardest to maintain and keep from getting flaky there's an idea of
00:32:43.010 the test pyramid which I think Mike Cohn first came up with I think there's been many other spins on it
00:32:48.500 since and this is my particular spin you should have a strong foundation of lots of unit tests on the bottom they're
00:32:54.980 simpler they're faster and they're less likely to be flaky and then as you go from less realistic tests to more
00:33:00.650 realistic tests you should have fewer of those types of tests because they are going to take more effort to maintain
00:33:05.720 and the tests themselves are coarser grained so they're testing a lot more covering a lot more situations and the
00:33:12.830 these more realistic tests are just in general more likely to become flaky because there's so many more moving parts involved so it's wise to keep the
00:33:12.830 number of them in your test suite in balance test the major happy paths the major problems but leave certain edge
00:33:28.040 cases and other types of testing for more specific and
00:33:33.320 isolated tests the last thing I want to
00:33:38.690 talk about is how to work with the rest of your team to fix flaky tests it shouldn't be just a solo effort since
00:33:45.470 flaky tests can slow everyone down and erode everyone's trust in your test suite they should be a really high priority to fix if you can manage it
00:33:52.460 they should potentially even be the next highest priority under production fires this needs to be something that you talk
00:33:58.070 about as a team that you communicate to your new hires and that you all agree it's worth investing time in to keep
00:34:03.710 each other moving quickly and trusting your test suite the next thing I
00:34:08.990 recommend is that making sure you have a specific person assigned to each active flake that person is in charge of
00:34:14.810 looking for a fix deciding whether maybe you need to temporarily disable the test while it's being worked on if it's frequently flaking that
00:34:22.460 person should reach out to others for help if they're stuck and so on and it's important to make sure that responsibility is spread out among your
00:34:29.179 entire team don't just let one person end up being the flake master and everybody else ignores them if you're
00:34:35.990 already sending flakes to a bug tracker as I suggested in the gathering evidence section you can use that as a place to
00:34:41.750 assign them to different people the next thing I recommend is setting a target
00:34:47.810 for your master branch pass rate and tracking it week over week so for example you could say that you want to
00:34:53.120 have builds on your master branch pass 90% of the time and then by tracking this that helps you keep an eye on
00:34:59.330 whether you're progressing towards that goal and course-correct if your efforts aren't working and you need to invest more in it or if you have kind of wider
00:35:05.630 issues with your test suite's reliability to wrap this all up if you remember just
00:35:13.340 one thing from my talk I hope it's that flaky tests don't have to just be an annoying and frustrating problem or
00:35:18.710 something you try to ignore as much as you can fixing them can actually be an opportunity to gain a deeper
00:35:23.750 understanding of your tools and your code and also to pretend you're a detective for a little while so hopefully this talk has made it
00:35:30.440 easier for you to do that thank you all for coming if you have any questions feel free to I'll be up here and you can