Fixing Flaky Tests Like a Detective

by Sonja Peterson

In the talk "Fixing Flaky Tests Like a Detective," presented by Sonja Peterson at RailsConf 2019, the speaker delves into the pervasive issue of flaky tests in software development. Flaky tests are those that pass sometimes and fail at other times without any changes to the underlying code. Sonja emphasizes the importance of not only fixing these by identifying their root causes but also preventing them from being introduced in the first place. She shares a structured approach that parallels detective work to diagnose and resolve flaky tests efficiently.

Key Points Discussed:

  • Challenges of Flaky Tests: They can significantly impede the development process, leading to wasted time and lost trust in automated tests.
  • Categories of Flaky Tests: Sonja identifies typical culprits behind flaky tests:
    • Async Code: Tests influenced by the order of asynchronous events, particularly in feature tests using Capybara.
    • Order Dependency: This arises when tests' outcomes change based on the state influenced by previous tests, emphasizing the need to isolate test states.
    • Time Issues: Tests failing due to date and time calculations, potentially fixed by using libraries like Timecop.
    • Unordered Collections: Ensure that database queries used in assertions return results in a predictable order; ordering the results (or asserting without regard to order) avoids flaky failures, as shown in the sketch after this list.
    • Randomness: Reducing reliance on randomness in tests increases reliability.
  • Information Gathering: Before attempting fixes, developers should gather data on flaky tests such as error messages, timing of failures, and the running order of tests. This can be managed using a bug tracking system.
  • The Detective Method: Adopt a systematic method similar to investigating a crime: gather evidence, identify suspects, form hypotheses, and test fixes rather than relying on trial and error.
  • Team Collaboration: The collective responsibility is emphasized to handle flaky tests. Designated individuals should monitor and fix these tests while ensuring continuous communication within the team.
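
To make the unordered collections point concrete, here is a minimal RSpec sketch; the Post model, its "active" scope, and the factory_bot factory are illustrative assumptions rather than code from the talk.

```ruby
# Sketch: a query with no ORDER BY has no guaranteed order, so pin the order
# down or assert without depending on it.
RSpec.describe Post do
  it "returns the active posts" do
    older = create(:post, :active, title: "Alpha")   # factory_bot helpers assumed
    newer = create(:post, :active, title: "Beta")

    # Flaky: rows may come back in any order without an explicit ORDER BY.
    # expect(Post.active).to eq([older, newer])

    # Option 1: make the order explicit on both sides of the assertion.
    expect(Post.active.order(:title)).to eq([older, newer])

    # Option 2: assert on the contents without caring about order.
    expect(Post.active).to match_array([older, newer])
  end
end
```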

Conclusions and Takeaways:
- Flaky tests are an inherent part of a developer's journey, but they can also be opportunities for growth and learning. Fixing them leads to improved understanding of both code and testing frameworks.
- It is vital to maintain a healthy test suite with a focus on high reliability and effectiveness, which ultimately leads to better software quality and development speed. The overarching message is that addressing flaky tests is a valuable investment in the stability of software development processes.

RailsConf 2019 - Fixing Flaky Tests Like a Detective by Sonja Peterson
Every test suite has them: a few tests that usually pass but sometimes mysteriously fail when run on the same code. Since they can’t be reliably replicated, they can be tough to fix. The good news is there’s a set of usual suspects that cause them: test order, async code, time, sorting and randomness. While walking through examples of each type, I’ll show you methods for identifying a culprit that range from capturing screenshots to traveling through time. You’ll leave with the skills to fix any flaky test fast, and with strategies for monitoring and improving your test suite's reliability overall.

RailsConf 2019

00:00:20.689 all right so just to introduce myself I'm Sonja and I really appreciate y'all
00:00:28.140 coming to my talk and RailsConf for having me and today I'm gonna be talking about fixing flaky tests and also about
00:00:35.339 how reading a lot of mystery novels helped me learn how to do that better so
00:00:41.220 I want to start out by telling you a story and it's about the first flaky test that I ever had to deal with it was
00:00:47.400 back in my first year as a software engineer and I'd worked really hard building out this very complicated form
00:00:52.980 it was my first big front-end feature and so I wrote a lot of unit and feature tests to make sure that I didn't miss
00:00:58.829 any edge cases everything was working pretty well and we shipped it but then a
00:01:04.199 few days later we started to have an issue a test failed unexpectedly on our
00:01:09.330 master branch the failing test is one of the feature tests for my form but nothing related to the form had changed
00:01:15.270 and it went back to passing in the next build the first time it came up we all kind of ignored it tests fail randomly
00:01:23.009 once in a while and that's okay right yeah then it happened again and again
00:01:31.289 and so I said fine okay no problem I will spend an afternoon digging into it and I'll fix it and we'll all move on
00:01:37.340 the only problem was I had never fixed a flaky test before and I had no idea why a test would pass or fail on different
00:01:44.160 runs so I did what I often did when trying to debug problems that I didn't
00:01:50.640 really understand I started out by trying to use trial and error so I made a random change and then I ran the test
00:01:56.819 over and over again to see if it would still fail occasionally and that kind of trial and error approach
00:02:02.940 can work sometimes with normal bugs sometimes you even start using trial and error and that leads you to a solution that helps you better understand the
00:02:09.119 actual problem but that didn't work at all with this flaky test trying a random fix and running it 50 times didn't
00:02:15.930 actually prove to me that I had fixed it and then a few days later even with that fix it still failed again
00:02:21.150 so I needed another approach and that's exactly what makes fixing flaky tests so
00:02:26.640 challenging you really can't just try random fixes and test them by running the test over and
00:02:31.989 over again it's a very slow feedback loop we eventually figured out a fix for that flaky test but not until several
00:02:38.590 different people had tried random fixes that failed and it sucked up entire days of work and the other thing I learned
00:02:45.970 from this was that even just a few flaky tests can really slow down your team when a test fails without actually
00:02:52.390 signalling something wrong with the test suite you not only have to rerun all of your tests before you're ready to deploy
00:02:57.489 your code which slows down the whole development process you also lose a little bit of trust in your test suite
00:03:03.819 and eventually you might even start ignoring real failures because you assume they're just flaky tests so it's
00:03:10.450 super important to learn how to fix flaky tests efficiently and better yet avoid writing them in the first place
00:03:17.040 for me the real breakthrough in figuring out how to fix flaky tests was when I came up with a method instead of trying
00:03:25.420 things randomly I started by gathering all the information I could about the flaky tests and the times that had
00:03:31.060 failed then I used that information to try to fit it into one of the five main categories of flaky tests we'll talk
00:03:37.660 about what those are in a minute and then based on that I came up with a theory of what might be happening then
00:03:43.709 based on that theory I would implement a fix at the same time that I was figuring
00:03:52.090 this out I was on kind of a mystery novel binge and it struck me that every time I was working on fixing a flaky
00:03:57.310 test I felt kind of like a detective solving a mystery after all the steps to do that at least in the novels I read
00:04:03.160 which are probably very different from real life are basically starting with
00:04:08.680 gathering evidence then you identify suspects you come up with a theory of means and motive and then you can solve
00:04:15.669 it and so thinking about fixing flaky tests that way made it much more
00:04:21.070 enjoyable and actually became kind of a fun challenge for me instead of just a frustrating and tedious problem that I
00:04:26.440 had to deal with so that's the framework I'm going to use in this talk for explaining how to fix flaky tests let's
00:04:33.430 start with step 1 gathering evidence there are lots of pieces of information
00:04:39.130 that can be helpful to have when you're trying to diagnose and fix flaky tests some of those include error
00:04:45.659 messages and output for every time that you've seen it fail the time of day
00:04:50.819 those failures occurred how often the test failing is it failing every other time or just once in a blue moon and
00:04:57.469 which tests were run before the test when it failed and in what order so how
00:05:03.599 can you efficiently get all of this information a method that I've used in the past and that has worked well is to
00:05:09.599 have any failures on your master branch or whatever branch you would not expect to see failures on because tests
00:05:15.509 had to pass before merging into it automatically sent to a bug tracker with
00:05:21.300 all the metadata you'd need such as a link to the CI build where they failed I've had success doing this with rollbar
00:05:27.180 in the past but I'm sure other bug trackers would work for this as well and when doing that it's important to make
00:05:33.389 sure that the failures for the same test can generally be grouped together in the bug tracker it might take a little bit
00:05:38.400 of configuration or finessing to get this to work but it's really helpful because then you're able to cross-reference between different
00:05:43.830 occurrences of the same failure and figure out what they have in common which can help you understand why
00:05:50.669 they're happening all right so now that we have our evidence we can start
00:05:55.770 looking for suspects and with flaky tests the nice thing is that there is basically always the same set of usual
00:06:01.889 suspects to start with and then you can narrow down from there those suspects are async code order dependency time
00:06:11.150 unordered collections and randomness so I'm gonna go through each of these one by one I'm gonna talk through an example
00:06:17.610 or two how you might identify that a test fits into one of these particular categories and then how you would go
00:06:23.610 about fixing it based on that so let's start with async code which in my
00:06:29.610 experience is often one of the biggest categories of flaky tests when testing rails apps when I say async code I'm
00:06:36.990 talking about tests in which some code runs asynchronously which means that the events in the test can happen in more
00:06:42.990 than one order the most common way this comes up when you're testing rails apps
00:06:48.629 is in your system or feature tests so most rails apps use capybara either through rails built in system tests or
00:06:54.990 RSpec feature tests to write end-to-end tests for the application that spin up a rails server and a browser and then the test
00:07:01.310 interacts with the app similar to the way an actual user would and the reason
00:07:07.100 you're necessarily dealing with async code and concurrency when you write capybara tests is that there are at
00:07:12.229 least three different threads involved there's the main thread executing your test code there's another thread that
00:07:18.110 capybara spins off to run your rails server and then there's a separate process that's running the browser which
00:07:24.139 capybara controls via driver so to make this a little more concrete let's talk about a simple example imagine you have
00:07:31.280 a capybara test that clicks on a submit post button in a blog post form and then it checks that that post is created in
00:07:36.979 the database here's what the happy path for this test looks like in terms of the
00:07:42.260 order of the events that occur within it first in your test code we tell capybara we want to click on that button so in
00:07:49.010 the browser that triggers a click which sends off an ajax request to the rails
00:07:54.139 server which creates a blog post in the database when that request returns it updates the UI and then your test code
00:08:01.250 checks the database and sees that the post is there everything works great so the order of events in the browser and
00:08:07.310 server timeline here is pretty predictable provided you're not optimistically updating the UI before the requests that created the blog post
00:08:14.330 returns and that's one reason why you should avoid optimistic updates if you
00:08:19.910 can because they can create both a flaky test and a flaky user experience but the events in the test code timeline
00:08:26.630 on the top here are less predictable in terms of where they happen in relation to the other ones so one problematic
00:08:33.890 ordering would be if right after we click on submit post the test code can move right along to check the
00:08:39.950 database and it happens to get to the database before the browser and the test's rails server have finished going through
00:08:45.440 the process that creates that blog post so then we'll check the database we won't see anything there and the test
00:08:50.450 will fail the fix here is relatively simple we just need to make sure that we
00:08:56.540 wait until the request has finished before we try to check for anything in the database and we can do this by
00:09:01.640 adding one of capybaras waiting finders like have content which will look for something on the page and then retry
00:09:07.610 until it shows up up to a certain timeout so basically it'll check the page to see if "post created" is on it if it's not there it'll wait
00:09:14.720 for a second and then check again until it sees it there and only then will it be able to move on to the next line of
00:09:20.029 code where we check for the post in the database so with that code implemented
00:09:25.850 this is what the time line looks like have content will block us from moving forward until the rest of the process
00:09:32.420 has finished so that's a relatively simple async Flake and probably something that you've dealt with if
00:09:37.550 you've written some capybara tests but they can get a lot more complicated and sneaky so let's look at another example
00:09:44.740 here we have a test which goes to a page with a list of books clicks on a sort
00:09:49.819 button waits for the books to show up in that sorted order using one of capybaras waiting finders then clicks again to
00:09:57.800 reverse that order and waits for the order to show up again so provided
00:10:03.199 expect alphabetical order and expect reverse alphabetical order are both using those same waiting finders I was
00:10:08.389 talking about that will retry until things show up in the right place it seems like this should work well we're waiting in
00:10:13.730 between each thing that we do but it is possible for this to be flaky the way
00:10:20.329 that that could happen is if when we visit the books path the books happen to already be sorted so then when we click
00:10:27.709 on sort and expect the alphabetical order that expect alphabetical order line is no longer actually waiting or
00:10:33.050 blocking anything for us it passes immediately and we move on to the next click so both of those clicks
00:10:39.050 can actually happen before we reloaded the page the first time with the books in alphabetical order it just kind of
00:10:45.139 acts like a double click and as a result we can end up with the test never
00:10:51.860 getting to the state of being in a reverse alphabetical order the fix here is actually fairly similar to the last
00:10:58.069 one we just need to add some more specific waiting finders to make sure that we don't move on through our test
00:11:04.189 code too quickly so in this case we might look for something on the page that indicates the request has actually finished beyond the fact that the books
00:11:10.189 are in order then we can safely move on to the next step
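A rough sketch of that more specific wait, assuming the page renders some indicator only after the sort request completes; the labels here are illustrative, not from the talk.

```ruby
RSpec.describe "Sorting books", type: :system do
  it "sorts and then reverse-sorts the books" do
    visit books_path

    click_button "Sort"
    # Waiting only for alphabetical order can pass immediately if the books
    # already happen to be sorted, turning the two clicks into a double click.
    # Wait instead for something only the completed request puts on the page.
    expect(page).to have_content("Sorted A to Z")

    click_button "Sort"
    expect(page).to have_content("Sorted Z to A")
  end
end
```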
00:11:17.490 so if you're looking at a given flaky test and you're trying to figure out whether it might belong to this async code category the first question I
00:11:24.569 usually look at is is it a system or feature test something that uses capybara or some other way of interacting with the browser since
00:11:29.850 that's the number one place for these show up it is possible that you have other areas in your codebase where
00:11:35.129 you're dealing with async code but this is generally the biggest one and then within that does it trigger any events
00:11:41.699 without explicitly waiting for the results even in a place where it looks relatively innocent it's always a good
00:11:47.730 idea to make sure that you're behaving like a real user would and waiting in between each thing you do to see the result when you're trying to identify
00:11:56.550 whether the flake is due to some async code it can also be helpful to use capybaras ability to save screenshots
00:12:02.509 which you can use by just calling save screenshot directly provided you're using one of the drivers that supports
00:12:07.889 that or the capybara-screenshot gem which helps you wrap your tests
00:12:12.929 so that every time they fail you'll capture a screenshot of the end state of the test when you're looking to prevent
00:12:22.499 async flakes there's a few things to keep in mind first as I mentioned make sure your test is waiting for each
00:12:28.079 action within it to finish and when you're doing this make sure you're not using sleep or waiting for some
00:12:33.660 arbitrary amount of time it's important to wait for something specific and that's because if you wait for an
00:12:38.670 arbitrary amount of time at some point your code will just happen to be running slowly enough that that arbitrary amount
00:12:43.769 of time isn't long enough and it will flake again it also means that you might be waiting longer than you need to in a
00:12:49.619 lot of other cases because the process happened faster and so by waiting for something specific you can avoid both of
00:12:56.009 those pitfalls it's also important to understand capybaras api which methods
00:13:01.949 wait and which don't so everything based on find will generally wait but there are a few certain things like all that don't wait
00:13:08.699 in the same way and so it's just very important to be familiar with all of capybaras Docs and how to use its tools
00:13:17.040 correctly finally it's important to check that each assertion you're making in the test is working as you expect it
00:13:23.399 to it's very easy to write assertions that look like they're doing the correct
00:13:28.559 waiting behavior but actually don't as we saw in that double-click example sometimes content is already on the page in a
00:13:34.490 different place and it allows kind of accidental success all right so let's
00:13:41.569 move on to our next suspect order dependency I define this category
00:13:47.449 of tests as any that can pass or fail based on which tests ran before them
00:13:52.509 usually this is caused by some sort of state leaking between tests so
00:13:57.800 when the state another test creates is present or not present it can cause the flaky test
00:14:03.410 to fail and there are a few potential areas where a shared state can happen in
00:14:10.220 your tests one is the database another is global or class variables if those
00:14:16.970 are modified within your tests and then there's also the browser typically one
00:14:22.459 of the biggest issues with rails apps is database state so let's talk about that a little more in depth when you're
00:14:29.779 writing tests each test should start with a clean database that might not mean a fully empty database but if
00:14:36.620 anything is created updated or deleted in the database during a single test it should be put back the way it was at the
00:14:42.949 beginning I kind of think of it like Leave No Trace when you're camping so this is important because otherwise
00:14:48.290 those changes to the database could have unexpected impacts on later tests or create dependencies between tests so
00:14:53.929 that you can't remove or reorder tests without risking cascading failures there
00:14:59.269 are several different ways to handle clearing out database state wrapping your tests in a transaction and rolling
00:15:05.269 it back after the test is generally the fastest way to clear your database and it's the default for tests in rails but
00:15:12.170 in the past you couldn't use transactions with capybara because the test code and the test server didn't
00:15:17.540 share a database connection so they were running in separate transactions and couldn't see the data in each other's transactions rails 5 system tests
00:15:26.480 actually addressed this by allowing shared access to database connections in tests so they could look at data within
00:15:32.029 the same transaction however running in transactions can still have some subtle differences from
00:15:37.670 normal behavior of your app and so there may be reasons why you still don't want to use them as your clean up
00:15:43.250 for example if you have any after commit hooks set up on your models that only run when a transaction commits those
00:15:49.070 probably won't run if you're using transactional cleanup so if you're not using transactional cleanup another
00:15:55.100 option is the database cleaner gem which can clean with either truncating tables or using a delete from statement on them
00:16:02.530 and this is generally slower than transactional but it is a little bit more realistic in terms of your not
00:16:08.870 having an additional transaction wrapped around everything that's happening in your tests and the important thing to
00:16:14.660 make sure if you're using this method is that this database cleanup is running after capybaras cleanup so capybara does
00:16:21.110 some work to make sure that the browser state is cleared and settled between each test including waiting for
00:16:26.540 any Ajax requests to resolve and if you clean your database before that cleanup
00:16:32.060 and waiting happens those Ajax requests could create some data that doesn't get cleaned up so there's a bit
00:16:37.400 of an ordering issue here and you can avoid it if you're using rspec by putting your database cleaner call in an
00:16:43.250 append_after block so why do I tell you all of this
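A minimal sketch of that ordering, assuming RSpec and the database_cleaner gem with a non-transactional strategy; the exact setup will vary from app to app.

```ruby
RSpec.configure do |config|
  config.before(:suite) do
    DatabaseCleaner.clean_with(:truncation)   # start the suite from a clean slate
  end

  config.before(:each) do
    DatabaseCleaner.strategy = :truncation    # or :deletion, instead of transactions
    DatabaseCleaner.start
  end

  # append_after runs after Capybara's own per-test cleanup, which waits for
  # in-flight Ajax requests to settle, so data they create still gets cleaned.
  config.append_after(:each) do
    DatabaseCleaner.clean
  end
end
```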
00:16:49.010 the thing about database cleaning is it should just work and it often does especially if you're just using rails
00:16:55.220 basic built-in transactional cleaning but there are a lot of different ways that you could have your rails app and
00:17:00.920 test suite configured and it is possible to do it in such a way that certain gotchas are introduced so it's important
00:17:06.949 to know how your database cleaner works when it runs and if there's anything it's leaving behind especially if
00:17:12.770 you're starting to deal with flaky tests that seem to be order dependent let's look at an example of this let's say
00:17:19.370 we're using database cleaner with the truncation strategy maybe we started doing that back before rails 5 let us
00:17:25.610 share a database connection and it stuck or maybe we don't want any weirdness around transactions one of those reasons
00:17:31.580 but we notice this is slow so somebody comes in to optimize the test suite a little bit and they notice that we're
00:17:37.640 creating book genres in almost all of the tests they decide to create those
00:17:43.130 genres before the entire test suite runs and then exclude them from the database cleaner so this will speed up our tests
00:17:49.760 a bit but it does introduce a gap in our cleaning if we make any kind of
00:17:55.820 modification to book genre since we're using truncation to clean the database instead of transactions that update won't be undone
00:18:02.870 between tests and this could potentially affect later tests and show up as an order dependent Flake to be clear I'm
00:18:09.500 not picking on database cleaner here I just want to give an example of how a minor configuration change could allow
00:18:15.050 you to create more flakes and why it's important to have a good understanding of how cleaning is actually working in your test suite and the trade-offs you
00:18:21.470 might introduce depending on how you do it as I mentioned at the beginning there
00:18:27.380 are some other possible sources of order dependency via shared state one is the browser since tests run within the same
00:18:33.140 browser that can contain specific state depending on which test just ran capybara works pretty hard to clean all
00:18:39.170 of this up before it moves on to the next test so this should usually be taken care of for you but it is possible again depending on your configuration
00:18:45.590 how you have everything set up that maybe there's something that sneaks through and so it's good to be aware of that as a possible place where shared
00:18:51.320 state could be leaking through another is global and class variables as I mentioned if you
00:18:56.540 modify those they could persist from one test to the next normally Ruby will yell at you if you reassign a global variable
00:19:02.330 but one area where these can kind of sneak in is if you have a hash assigned to a global variable and you just change
00:19:08.000 one of the values within it since that isn't reassigning the entire variable it won't come up as a warning all
00:19:17.120 right so if you're looking at a particular test and you're trying to figure out whether it's being caused by order dependency there's a
00:19:23.660 couple different strategies you can use one is just to start out by trying to replicate the failure with the same set
00:19:30.050 of tests in the same order so if you can take a look at how it ran in your CI or wherever you saw it fail and run the
00:19:35.540 exact same set of tests together with the same seed value to put them in the same order and it fails every time you
00:19:41.660 do that then you have a sense that this is probably an order dependent test but at that point you still don't know which
00:19:46.940 tests are affecting each other so to figure that out you're probably going to want to cross-reference each time you've
00:19:52.250 seen it fail and see if the same tests were running before that failure RSpec
00:19:57.380 has a built-in bisect tool that you can also use to help narrow down the set of tests to the one that produced the
00:20:03.200 dependency however you may find that it can run a bit slowly depending on how fast your test suite runs so sometimes
00:20:09.200 it's easier to just look at things manually in order to prevent order
00:20:15.740 dependency you should make sure that you've configured your test suite to run in random order this might seem kind of
00:20:21.440 counterintuitive but the goal is to surface order dependent tests quickly not just when you add or remove or move
00:20:28.940 around a certain test running in random order is the default in minitest and is configurable in RSpec also make
00:20:36.800 sure you spend some time understanding your entire test setup and teardown process and work to close any gaps where
00:20:41.990 shared state might be leaking through from one test to another all right moving on to our next suspect time this
00:20:49.550 is probably the one that gives me the most headaches this category includes any tests that can pass or fail
00:20:55.280 depending on the time of day that it is run so let's start with an example here imagine we have this code that runs in a
00:21:02.420 before save hook on our task model it sets an automatic due date of the next day at the end of the day if a due date
00:21:08.210 isn't already specified then we write
00:21:13.309 this test we create a task with no due date specified and we check that it's what we expect it to be the current date
00:21:20.179 plus one at the end of the day seems like it should be fine but this test
00:21:25.910 actually starts failing after 7 p.m. every night very strangely and how could
00:21:32.120 that possibly be happening the trouble is we're using two slightly different
00:21:37.130 ways of calculating tomorrow here Date.tomorrow uses the time based on the time
00:21:42.170 zone we set for our rails app while Date.today plus one will be based on the system time so if the system time is in UTC
00:21:48.590 and our rails app's time zone is EST they'll be 5 hours apart and after 7:00 p.m.
00:21:53.600 they'll be different days which results in this failure so how can we avoid this
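For reference, a simplified sketch of the model and test being described; the Task model and its attributes are assumptions used only for illustration.

```ruby
class Task < ApplicationRecord
  before_save :set_default_due_date

  private

  def set_default_due_date
    # Date.tomorrow is based on the Rails time zone (for example EST).
    self.due_date ||= Date.tomorrow
  end
end

RSpec.describe Task do
  it "defaults the due date to tomorrow" do
    task = Task.create!(name: "Prepare slides")

    # Date.today uses the system time zone (for example UTC), so after 7 p.m.
    # EST the two calculations land on different days and this fails.
    # Fixes discussed next: compare against Date.current, or freeze time with Timecop.
    expect(task.due_date).to eq(Date.today + 1)
  end
end
```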
00:21:59.450 one easy fix would be just to use Date.current which respects the time zone instead of Date.today another option would be to
00:22:08.150 use the Timecop gem which basically allows you to freeze time by mocking out what Ruby's sense of time is and so with
00:22:15.050 Timecop we can freeze time here it would be January 1st at 10:00 a.m. and then our expected due date can just be a
00:22:22.370 static value January second at 11:59 p.m. and we can check
00:22:27.740 that the due date is that exact value this can be kind of helpful for making your test a little bit more explicit and
00:22:33.110 simpler so that they don't contain complicated logic that itself needs to be tested when you're trying to
00:22:41.299 determine whether a given flaky test is time based the first obvious thing to do is to look for any references to date or
00:22:46.850 time in the code under test if you have a record of past failures you can also
00:22:51.860 check whether they've all happened around the same time of day and finally if you suspect it's time based you can
00:22:58.700 add Timecop to that spec just temporarily to set it to the time of day where you've seen it fail before and see
00:23:05.000 if it fails every time when you do that as we saw in our example using Timecop
00:23:12.080 to freeze time can make it easier to write reliable tests that deal with time and also easier to understand exactly
00:23:17.899 what you're testing another strategy that you can use to surface time based
00:23:23.390 flakes is to set up your test suite so that it wraps every test in Timecop.travel mocking the time to a
00:23:29.809 different random time of day on each run of the suite that's printed out before the test runs so this might seem a
00:23:35.870 little crazy but it's actually very helpful for surfacing tests that would normally only fail after business hours
00:23:41.480 when nobody happens to be running the test suite so that you see them during the normal business day instead of at
00:23:47.350 midnight when you just got woken up on call and you're trying to desperately ship a deploy and the test suite keeps
00:23:54.049 failing unexpectedly it's just important to make sure that you're printing out the time of day that each
00:24:00.679 test is running at and that you're able to then rerun the test with that same time of day so that later if you're debugging a failure you can easily
00:24:07.279 replicate it all right our next suspect is unordered collections this is a
00:24:14.120 relatively simpler one this is just any test that can pass or fail depending on the order of a set of items that's
00:24:20.029 within it that doesn't have a pre-specified order so let's look at an
00:24:25.340 example here we have a test where we're looking at a set of active posts and we
00:24:30.710 expect them to equal some specific posts that perhaps we've created earlier in the test the issue with this test
00:24:37.570 is that the database query in the first line doesn't have a specific order so even though things will often be
00:24:43.090 returned from the database in the same order just by chance there's no guarantee that this will actually always happen and when it doesn't this test
00:24:50.350 will fail so the fix is just to make sure that we're specifying an order on
00:24:56.799 the items returned by the database and that also our expected posts are in that exact same order when trying to identify
00:25:05.919 whether a flaky test is being caused by unordered collections look for any assertions about the order of an array
00:25:12.369 the contents of an array or the first or last item in one if you're using RSpec
00:25:19.029 you can use the match_array expectation which allows you to basically just assert things about what's in an array
00:25:25.779 without caring about the order or you can just add an explicit sort to both the expectations and what you're looking
00:25:32.109 at all right so we've gotten to our last
00:25:37.749 possible suspect which is randomness and you might think that all of these different categories of flaky tests have
00:25:43.720 something to do with randomness since they're randomly failing but in this case I'm talking about tests that
00:25:48.879 actually explicitly invoke randomness via a random number generator so here's
00:25:55.389 an example of a test data factory that uses factory_bot to create an event if we have a validation that enforces start
00:26:02.109 date and suppose we might start out with just having start date and then
00:26:08.109 adding end date after that at some point and we decide okay start date will be some time five days from now end date
00:26:14.049 will be sometime ten days from now we could run into an issue where end date
00:26:19.359 actually ends up being lower than start date since they're both random values so if we add a validation to events that
00:26:26.710 enforces that then some percentage of the time our tests that deal with events will
00:26:32.559 fail because they'll have invalid data so in this case you're just better off
00:26:37.989 being explicit and creating the same dates every time and this might feel a little counterintuitive because
00:26:42.999 randomness can seem useful as a tool for testing a large spectrum of different
00:26:48.700 types of data and so on but there is a big downside in not being able to know what your tests are actually testing and having them be flaky and so
00:26:57.110 a better strategy is to actually just write tests for each of those
00:27:03.110 specific cases that you would like to test so if you're trying to identify
00:27:09.080 whether randomness is causing your flake the first thing to do obviously is to look for a random number generator
00:27:15.110 and often this will come up in your factories or fixtures but another thing
00:27:20.270 you can try is using the --seed option in either minitest or RSpec and that
00:27:27.409 will allow you to run the tests with the same seed value for randomness and generally the same random values
00:27:32.600 produced with RSpec you just want to make sure that you actually have Kernel.srand set to RSpec's config seed so
00:27:40.130 that passing the seed option will actually control the randomness to prevent randomness
00:27:49.340 based flakes as I mentioned the general strategy is to remove randomness from your tests and instead explicitly
00:27:56.390 test the boundaries and edge cases that you're interested in it's also generally
00:28:01.730 a good idea to avoid gems like Faker to generate data for tests they're very useful for generating realistic
00:28:07.580 seeming data in your dev environment but in your tests at least from my perspective it's more important to have
00:28:13.010 reliable behavior than random and realistic data all right so now we've
00:28:20.720 looked at all of the usual suspects so we can move on to forming a theory and actually solving a flaky test mystery my
00:28:29.090 first strategy tip when trying to find a fix to a flaky test and there isn't an obvious one popping out for you is just
00:28:35.480 to run through each of those categories that I've described and look for any connection or identifying signs that
00:28:40.610 could link this test to one of those so even if it looks perfectly fine if it is dealing with a date maybe try digging
00:28:46.460 down that particular path and again just resist the urge to use trial and error
00:28:52.880 to test fixes it's more important to form a strong theory about how this might be happening first even if you're
00:28:58.100 not 100% sure it's going to work a lot better than using trial and error
00:29:03.160 what you can do and what might involve a little bit of a different kind of trial and error is trying to find a way to
00:29:09.350 reliably replicate failures to prove your theory so this came up a little bit when I was talking about randomness
00:29:15.830 dates and order dependency because for those you have more control over the factors that might be producing the
00:29:22.250 flake you can freeze time you can run the tests in the same order you can use
00:29:27.710 the same random seed and then potentially be able to replicate the failure and since most flaky tests
00:29:34.010 typically are flaking very infrequently and passing most of the time if you're able to get them to fail two or three
00:29:40.880 times in a row you can be pretty confident that you've replicated it versus the other direction when you're using trial and error to test
00:29:46.940 a fix and you're seeing it pass it takes a lot of runs to be confident that that's actually what you're seeing so
00:29:55.400 you might try all those methods and still be stuck flaky tests are hard one strategy you can try if you get to that
00:30:01.550 situation is adding some code that will give you more information the next time it fails so if you've got like a hunch
00:30:07.070 that something's off like perhaps with what's in the database or you're curious about what the value of a certain
00:30:12.380 variable is add that to something that will be logged out in the test and then the next time that it fails in CI you
00:30:18.740 can take a look at that and factor that into your process of fixing it another
00:30:23.960 strategy that I really like is pairing with another developer since fixing flaky tests is so much about your having
00:30:30.710 a deep understanding of your testing tools your framework and your own code everybody is going to have some gaps but
00:30:36.920 when you have two people working together you can fill each other's gaps in a little bit and you can also help keep each other from going down rabbit
00:30:43.220 holes or getting too frustrated chasing down the same wrong theory another
00:30:50.750 question I see coming up a lot at this point is can I just delete it I can't fix it it keeps failing is it even worth
00:30:57.530 it anymore why did I become a developer that kind of thing and my first response
00:31:03.890 to this is that you have to accept that if you're writing tests at some point inevitably you are going to have to deal
00:31:10.310 with flaky ones you can't just delete any test that starts to be flaky because you'll end up making
00:31:15.650 significant compromises in the coverage that you have for your app and also learning to fix and avoid flaky tests is
00:31:21.890 a skill that you can develop over time and it's one that's really worth investing in even if that means
00:31:27.740 spending two days fixing one instead of just deleting it that being
00:31:33.710 said when I'm dealing with flaky tests I do like to take a step back and think about the test coverage I have for a
00:31:39.320 feature holistically what situations do I have coverage for which ones am i
00:31:44.420 maybe neglecting or ignoring and what are the stakes of having the kind of bug that might slip through the cracks in my
00:31:49.880 coverage if the flaky test I'm looking at is for a very small edge case with low stakes or it's something that's
00:31:55.670 actually well covered by other tests or could be covered by a different type of test maybe it does make sense to delete
00:32:01.220 it or replace it and this ties into a bigger picture idea which is that when
00:32:06.770 we're writing tests we're always making trade-offs between realism and maintainability using automated tests
00:32:13.520 instead of manual QA is itself a trade-off in terms of substituting in a machine to do the testing for us which
00:32:20.570 is going to behave differently than an actual user would but it's worth it in a lot of situations because we can get
00:32:26.660 results faster and consistently and we can add tests as we code so different
00:32:31.670 types of tests will go to different lengths to mimic real life and generally the most realistic ones are the ones
00:32:37.490 that are hardest to maintain and keep from getting flaky there's an idea of
00:32:43.010 the test pyramid which I think Mike Cohn first came up with I think there's been many other spins on it
00:32:48.500 since and this is my particular spin you should have a strong foundation of lots of unit tests on the bottom they're
00:32:54.980 simpler they're faster and they're less likely to be flaky and then as you go from less realistic tests to more
00:33:00.650 realistic tests you should have fewer of those types of tests because they are going to take more effort to maintain
00:33:05.720 and the tests themselves are coarser grained so they're testing a lot more covering a lot more situations and the
00:33:12.830 these more realistic tests are just in general more likely to become flaky because there's so many more moving parts involved so it's wise to keep the
00:33:12.830 number of them in your test suite in balance test the major happy paths the major problems but leave certain edge
00:33:28.040 cases and other types of testing for more specific and
00:33:33.320 isolated tests the last thing I want to
00:33:38.690 talk about is how to work with the rest of your team to fix flaky tests it shouldn't be just a solo effort since
00:33:45.470 flaky tests can slow everyone down and erode everyone's trust in your test suite they should be a really high priority to fix if you can manage it
00:33:52.460 they should potentially even be the next highest priority under production fires this needs to be something that you talk
00:33:58.070 about as a team that you communicate to your new hires and that you all agree it's worth investing time in to keep
00:34:03.710 each other moving quickly and trusting your test suite the next thing I
00:34:08.990 recommend is that making sure you have a specific person assigned to each active flake that person is in charge of
00:34:14.810 looking for a fix deciding whether maybe you need to temporarily disable the test while it's being worked on if it's frequently flaking that
00:34:22.460 person should reach out to others for help if they're stuck and so on and it's important to make sure that responsibility is spread out among your
00:34:29.179 entire team don't just let one person end up being the flake master and everybody else ignores them if you're
00:34:35.990 already sending flakes to a bug tracker as I suggested in the gathering evidence section you can use that as a place to
00:34:41.750 assign them to different people the next thing I recommend is setting a target
00:34:47.810 for your master branch pass rate and tracking it week over week so for example you could say that you want to
00:34:53.120 have builds on your master branch pass 90% of the time and then by tracking this that helps you keep an eye on
00:34:59.330 whether you're progressing towards that goal and course-correct if your efforts aren't working and you need to invest more in it or if you have kind of wider
00:35:05.630 issues with your test suite's reliability to wrap this all up if you remember just
00:35:13.340 one thing from my talk I hope it's that flaky tests don't have to just be an annoying and frustrating problem or
00:35:18.710 something you try to ignore as much as you can fixing them can actually be an opportunity to gain a deeper
00:35:23.750 understanding of your tools and your code and also to pretend you're a detective for a little while so hopefully this talk has made it
00:35:30.440 easier for you to do that thank you all for coming if you have any questions feel free to I'll be up here and you can