Schrödinger's Error: Living in the Gray Area of Exceptions

by Sweta Sanghavi

In the talk "Schrödinger's Error: Living In the Gray Area of Exceptions" delivered by Sweta Sanghavi at RubyConf 2021, the focus is on the challenges developers face when managing exceptions in complex systems. The session acknowledges that while encountering exceptions is inevitable, the key lies in how effectively developers respond to them.

Key Points Discussed:

- Understanding Exceptions: Developers operate in systems where understanding every potential failure point is impractical. Exceptions serve as feedback mechanisms to reassess code assumptions.

- Initial Challenges: At BackerKit, the team initially faced disorganization in addressing exceptions, relying on a Slack integration that resulted in missed notifications due to lack of defined ownership and unclear priority among exceptions.

- Process Experiments: The team experimented with a structured approach to manage exceptions, initiating a weekly "Badger Duty" where team members took ownership of triaging exceptions, leading to better exposure and alignment on priorities.

- Goals for Improvement: Key goals identified by the team included filtering meaningful signals from the noise of exceptions, fostering collective ownership, and proactively addressing user-impacting exceptions.

- Daily Triage Duty: This included setting a clear rotation for exception triaging, defining tasks to categorize bugs, and promptly determining the urgency of issues, thus creating a continuous feedback loop.

- Learnings and Iterations: The team learned that continually iterating on the process helped maintain clarity and focus on actionable items within the backlog. For example, using dashboards for visualizing errors simplified the triaging process while providing data for future actions.

- Case Studies: The presentation included specific examples such as handling a "Faraday timeout error" and a "missing correct access error"—emphasizing the importance of understanding error contexts, prioritizing issues based on frequency and user impact, and determining effective responses.

- Concluding Thoughts: Sweta highlighted that creating a systematic approach to exception management not only clarified responsibilities but also encouraged team collaboration and knowledge sharing, ultimately driving the team towards a more disciplined, resilient, and responsive approach.

The talk concluded with an invitation for attendees to reach out for further discussion on exception management strategies and to connect on shared experiences, emphasizing the importance of community collaboration in improving processes.

ArgumentErrors, TimeOuts, TypeErrors… even scanning a monitoring dashboard can be overwhelming. Any complex system is likely swimming in exceptions. Some are high value signals. Some are red herrings. Resilient applications that live in the entropy of the web require developers to be experts at responding to exceptions. But which ones and how?

In this talk, we’ll discuss what makes exception management difficult, tools to triage and respond to exceptions, and processes for more collective and effective exception management. We'll also explore some related opinions from you, my dear colleagues.

RubyConf 2021

00:00:10.719 all right
00:00:12.000 hello everyone
00:00:15.280 our applications live in complex systems
00:00:17.920 with points of failure that span space
00:00:20.960 and time
00:00:22.480 as developers we write code that's
00:00:24.560 executed in systems that we rarely know
00:00:26.720 all the ins and outs of and it would be
00:00:29.039 a poor use of our time to try to
00:00:30.800 anticipate all the ways in which our
00:00:33.280 code can fail
00:00:34.960 and so the errors that our code raises
00:00:37.520 give us feedback when code
00:00:40.399 paths have some assumptions baked
00:00:42.719 into them that require a little bit of
00:00:44.960 re-examination
00:00:47.200 it's our superpower to make
00:00:49.360 reasonable assumptions let our tests
00:00:51.680 guide us but also ship fast and allow
00:00:54.399 our systems to tell us the ways in which
00:00:57.280 our system is falling down instead of
00:00:59.359 trying to predict those failures
00:01:03.039 this also means
00:01:05.519 that sometimes we must put on our gear
00:01:08.159 step into our spaceships and identify
00:01:10.799 which invaders are going to cause us
00:01:12.560 harm and which ones are okay to allow to
00:01:15.119 stick around
00:01:17.759 we may all share this understanding of
00:01:20.799 the importance of reacting to uncaught
00:01:22.799 exceptions
00:01:24.560 and yet some of our processes or lack
00:01:27.439 thereof can make this notification
00:01:30.320 still elicit some
00:01:32.159 dread
00:01:35.040 welcome to schrodinger's error living in
00:01:37.280 the gray area of exceptions i'm sweta
00:01:40.720 sanghavi i'm a developer at backer kit and
00:01:43.600 i've become really interested in why
00:01:45.680 exceptions are so difficult to manage
00:01:48.000 and what we can do about it
00:01:50.079 today i'm going to walk through some
00:01:51.280 process experiments we've tried on my
00:01:53.280 team at backer kit
00:01:55.040 and learnings we've surfaced and then
00:01:57.680 we're all going to go on to triage duty
00:01:59.680 together and see some tools we can use
00:02:01.920 to help us when we're managing
00:02:03.680 exceptions
00:02:05.439 all right put on your helmets and your
00:02:06.960 space boots
00:02:08.399 keep your arms inside the vehicle at all
00:02:11.120 times
00:02:12.879 so i'm going to lay some groundwork of
00:02:14.640 where we started at backer kit we
00:02:16.800 support creators in fulfilling their
00:02:18.720 crowdfunding projects
00:02:20.800 if you've backed a crowdfunding campaign
00:02:22.480 you may have gotten your survey through
00:02:23.840 backer kit take a look at that url
00:02:26.560 we're a pretty small lean team
00:02:28.800 our daily flow is to pair up in the
00:02:30.800 morning and work down a backlog or a
00:02:34.080 queue of features
00:02:35.840 chores and bugs
00:02:38.160 we surfaced a problem that though we had
00:02:40.480 a general idea there's utility to
00:02:42.959 reacting to impactful exceptions we
00:02:45.519 weren't super effective at doing so
00:02:49.599 when we started thinking more critically
00:02:51.920 about this we had a pretty light process
00:02:53.760 for addressing exceptions we had a slack
00:02:56.400 integration that surfaced any unresolved
00:02:58.640 or unignored errors that went to a
00:03:01.120 channel
00:03:02.080 and since we were usually pairing
00:03:03.519 someone who was either soloing or had a
00:03:05.200 spare moment may look at the channel
00:03:07.120 and notice when there was something worth
00:03:09.120 addressing
00:03:10.720 but one it required context switching
00:03:12.879 from our prioritized queue work and there
00:03:15.440 was no owner or designated dev and so it
00:03:18.239 ended up being a few folks who were in
00:03:20.000 there regularly and a lot of folks who
00:03:22.000 weren't really looking in that channel
00:03:24.879 we also had a pretty low alignment on
00:03:26.879 what process we wanted
00:03:29.120 to use for managing exceptions and not a
00:03:31.840 shared goal of what effective
00:03:33.840 management was and which exceptions were
00:03:36.000 high priority in addressing or what
00:03:38.239 actions to take in response
00:03:40.560 there was a pretty large backlog of
00:03:42.640 errors and so it was hard to scan it
00:03:45.200 quickly or really surface when there
00:03:47.040 was something that should be addressed
00:03:48.879 coming up
00:03:50.720 and so we started to experiment with
00:03:52.799 some process changes to get better at
00:03:55.280 exception management
00:03:57.439 our first experiment was badger duty
00:04:00.640 affectionately called we use honey
00:04:02.239 badger as our observability tool and so
00:04:05.360 like badger patrol badger duty was the
00:04:08.159 experiment where each of us owned
00:04:10.400 managing exceptions for the week and
00:04:12.720 you could either solo or
00:04:14.159 pair on it and we tracked how long it
00:04:16.000 took to triage and address any important
00:04:18.639 exceptions
00:04:20.000 this was our first stab at starting to
00:04:21.759 chip away at the errors and also to
00:04:23.600 de-silo exception management
00:04:27.280 as we were all becoming more exposed to
00:04:29.440 exception management we started to
00:04:31.199 surface what our pain points were and
00:04:33.040 why this was difficult for our team
00:04:37.520 one was the priority was unclear
00:04:40.080 it was unclear when something was
00:04:42.800 urgent when we were just scanning
00:04:44.240 exceptions
00:04:45.520 and when compared to our high value
00:04:47.440 feature work it could feel like low
00:04:49.600 value to be managing or triaging
00:04:52.080 exceptions
00:04:53.280 internally we weren't super aligned on
00:04:55.520 what those high priority exceptions were
00:04:59.440 there were also a lot of exceptions that
00:05:01.440 had low actionability
00:05:03.360 nor did we have a well-aligned path
00:05:05.840 to what we should be doing about them and
00:05:08.240 so it could lead to feeling a little
00:05:10.479 stuck when we came across an exception
00:05:14.240 we also started surfacing that we
00:05:16.080 weren't aligned on what the goal was
00:05:18.560 and so
00:05:19.520 kind of our takeaways from this
00:05:21.680 first
00:05:22.560 experiment were that having this process
00:05:25.360 helped us start to have a place to
00:05:27.120 iterate from
00:05:28.400 actually just having something that we
00:05:30.080 know we were doing did ease some of the
00:05:32.240 anxiety of not knowing who was tackling
00:05:34.720 this
00:05:36.479 and it gave us a process to start
00:05:38.880 unearthing where we weren't aligned
00:05:42.160 we started to realize that getting
00:05:44.080 aligned on our goals would be a useful
00:05:46.000 exercise as we kept iterating on this
00:05:49.120 um also badger duty kind of gave us a
00:05:51.759 place like when we were pairing or
00:05:53.199 working on an exception to start digging
00:05:55.360 for finding where we were disagreeing
00:05:57.280 which is actually the first piece to
00:05:59.199 starting to align
00:06:00.960 so i'm going to go through some of the
00:06:02.240 goals we landed on as a team
00:06:04.800 our first goal was that we should be
00:06:06.080 able to see signal through noise when we
00:06:08.400 are managing exceptions
00:06:10.160 it should be clear when there's
00:06:11.440 something that should be addressed and
00:06:13.520 it should surface through or surface
00:06:15.840 above the rest of maybe the more noisy
00:06:18.319 exceptions
00:06:19.520 a piece of this means tackling some of
00:06:21.600 that noise so that the signals can kind
00:06:24.479 of come through
00:06:26.080 our goal is not to get zero exceptions
00:06:28.080 or never raise exceptions or inbox zero
00:06:31.840 and it's also not to fight
00:06:34.319 fires
00:06:35.600 monitoring is where we expect those
00:06:37.759 kinds of things to come through
00:06:40.639 we also had the goal of having a
00:06:43.759 collective ownership of addressing
00:06:45.759 exceptions
00:06:46.880 we saw a lot of utility for all of us to
00:06:49.520 be exposed to the exceptions that
00:06:51.360 were happening in the app it allowed for
00:06:53.759 all of us to be involved in making the
00:06:56.160 solutions and um benefiting from other
00:06:58.960 people's perspectives or what they found
00:07:01.039 or perhaps if they were closer to that
00:07:02.800 piece of the code base
00:07:05.039 it was an opportunity for knowledge
00:07:06.400 share
00:07:08.960 and last our goal was to proactively
00:07:11.199 address exceptions this may be a shared
00:07:12.960 goal of
00:07:14.160 we should see when there's some
00:07:15.759 unfavorable user impact or if a user has
00:07:18.479 gotten themselves into a
00:07:20.720 state that they can't get themselves
00:07:22.479 out of before that we get those user
00:07:24.800 reports that's like the utility of our
00:07:26.720 exception management
00:07:28.560 and so we proposed another process
00:07:30.720 change
00:07:31.759 the next thing we tried is a rotating
00:07:33.919 pair triaging exceptions daily
00:07:37.199 this had the benefit again to having a
00:07:38.960 clear owner so it wasn't like all the
00:07:41.039 team was expected to look at the slack
00:07:43.120 or check in on what's going on
00:07:45.599 it also allowed us because this was a
00:07:47.440 rotating designation it allowed us to
00:07:50.080 make resource decisions at the top of
00:07:51.680 the day
00:07:53.440 looking at schedules or looking at high
00:07:55.039 priority work who is going to work on
00:07:56.879 bugs and also do we have the bandwidth
00:07:58.720 to do so
00:08:00.560 again everyone was exposed to nuances of
00:08:03.360 exception management and the exceptions
00:08:05.039 that were coming through our app which
00:08:06.960 allowed us to drive towards alignment
00:08:12.160 as different people came across the same
00:08:14.000 exceptions and why they were painful we
00:08:16.479 could discuss them and come up with
00:08:17.840 solutions to incorporate the team's
00:08:20.560 ideas
00:08:22.080 i'm going to go through what it looks
00:08:23.680 like to be on triage duty at backer kit
00:08:26.960 first
00:08:29.199 you start at the top of the backlog of
00:08:31.280 the unaddressed
00:08:32.959 errors
00:08:35.039 at the top of the backlog you get a 15
00:08:37.279 minute time box
00:08:38.560 your goal in that 15 minutes is to
00:08:40.320 determine the priority of that bug
00:08:43.279 it basically boils down to three options
00:08:45.760 this is something that should be fixed
00:08:47.920 but i'm not going to do it right now
00:08:49.839 because i have 15 minutes so i'm going
00:08:50.959 to write a bug ticket and put it in the
00:08:52.560 bug section of our
00:08:55.440 backlog
00:08:57.200 i'm going to acknowledge it so we sort
00:08:58.800 of came up with the idea of
00:09:00.000 acknowledgement but it's pretty similar
00:09:01.279 to resolve the idea being i've looked at
00:09:03.839 it this is not something we need to
00:09:05.200 really do anything about acknowledge and
00:09:07.279 move on i do want to know the next time
00:09:09.279 it happens
00:09:10.399 or
00:09:11.360 this is something that's worth fixing
00:09:12.880 and it's worth fixing now i'm going to
00:09:14.480 make a bug ticket and i'm going to work
00:09:16.000 on it now
00:09:17.680 for really high priority exceptions
00:09:20.959 a tool that we use to support um
00:09:24.000 bug duty is a dashboard
00:09:26.640 so what this helped us do is to get a
00:09:29.120 quick idea of what exceptions had not
00:09:31.440 been addressed
00:09:32.880 this is pretty much pulling those same
00:09:34.880 errors that we saw in our slack
00:09:36.080 integration into a more user-friendly ui
00:09:40.720 it also really supported this like going
00:09:42.800 down the line of triaging starting at
00:09:44.959 the top and going down and you get some
00:09:46.880 useful information just
00:09:49.120 at a glance
00:09:50.800 this also supported some of those
00:09:52.160 resource decisions of like how deep is
00:09:54.080 our backlog how important is it to make
00:09:56.640 sure someone's looking at
00:09:58.080 exceptions today
00:10:00.399 i also want to talk about the idea of
00:10:02.480 acknowledgement that you can see in
00:10:05.200 badger bot
00:10:06.560 um
00:10:07.680 so the purpose of acknowledge as i spoke
00:10:09.680 about before is to kind of say we're not
00:10:12.720 going to do something about this
00:10:14.399 as part of our process we also started
00:10:17.440 auto resolving weekly so basically an
00:10:20.480 acknowledgement would expire in a
00:10:23.040 week and so you would see that error
00:10:24.720 again
00:10:26.640 you can see the drop down that comes
00:10:28.079 down when you hit acknowledge and you
00:10:30.000 can see what this kind of represents is
00:10:32.640 us starting to bucket
00:10:34.240 why we might be acknowledging exceptions
00:10:36.399 which actually ends up being really
00:10:37.760 useful as you're having conversations
00:10:39.680 about the priority of different
00:10:40.959 exceptions
00:10:42.079 you can see like yeah there's this get
00:10:44.000 timeout but it's not high priority or
00:10:46.000 it's not high frequency i'm gonna
00:10:48.320 move on
00:10:49.440 um also some things about like i already
00:10:51.519 see a story about this i'm not gonna
00:10:52.959 make a new one
00:10:54.560 and also you could keep adding new
00:10:57.120 um
00:10:58.079 choices that you wish were here that are
00:10:59.760 not
00:11:01.440 so it kind of allows us to have a way
00:11:03.440 forward for what i think is starting to
00:11:06.079 talk about the gray area of like yes i
00:11:08.959 want to see this exception no i don't
00:11:11.120 want to do anything about it and like
00:11:13.279 being comfortable with that designation
00:11:16.000 and knowing that there are many
00:11:17.440 exceptions that are going to live in
00:11:18.640 that bucket
00:11:23.360 next i want to start talking about some
00:11:25.040 learnings from this next process
00:11:27.600 change that we made
00:11:30.000 one big one
00:11:31.360 especially for me was that exceptions
00:11:32.959 can be solved collectively over time
00:11:35.440 when you have a 15-minute time box
00:11:37.040 you're likely hitting it frequently
00:11:39.680 which means that you have to kind of
00:11:41.360 think about what can i do in 15 minutes
00:11:43.519 that's going to move debugging this
00:11:45.440 exception forward
00:11:47.360 it really reframed how i thought
00:11:49.360 about what it means to fix something
00:11:51.279 because sometimes it can mean that i
00:11:53.519 don't think this is high enough value
00:11:54.880 for me to spend time really digging into
00:11:56.720 this but maybe next time i wish that i
00:11:59.519 had this piece of information i'm going
00:12:01.200 to add it to the breadcrumbs or the
00:12:03.040 params
00:12:04.880 other kinds of
00:12:06.320 small things you can do like refactoring
00:12:08.560 out a law of demeter violation so that
00:12:10.399 next time the line from which the error
00:12:12.560 was thrown is more useful
00:12:14.720 um or just like starting a bug ticket
00:12:17.680 and having a place to log what happened
00:12:20.399 what's gonna happen next time and being
00:12:22.399 okay with allowing it to happen again
00:12:24.880 and letting all of that be more data
00:12:26.720 points to support a broader solution
00:12:29.680 it also allows you to try something
00:12:31.680 right now and push forward the
00:12:34.399 exception for the next person to take
00:12:36.079 over next time and start putting
00:12:38.320 patterns into our app for dealing with
00:12:40.560 exceptions that we were seeing a lot of
00:12:45.040 another learning is about prioritizing
00:12:47.040 and writing bug stories
00:12:49.440 writing bug stories that i myself would
00:12:51.279 actually want to pick up is quite the
00:12:53.200 art
00:12:54.399 providing enough context for the next
00:12:56.240 person about what you've discovered
00:12:59.200 but not allowing like any half-baked
00:13:01.440 theories to be in the ticket is
00:13:04.000 really useful and can be difficult
00:13:06.240 things that were useful for us are
00:13:07.680 links to faults how to reproduce that
00:13:10.000 same exception or other information you
00:13:12.480 uncovered
00:13:13.920 one way to really prove to yourself that
00:13:15.680 you understand
00:13:17.680 why an exception is being raised is to
00:13:19.920 write a test that exercises the same
00:13:22.720 system that you think is causing this
00:13:24.800 exception and seeing that failing test
00:13:26.720 which is a really great thing to link to
00:13:28.160 a bug ticket
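that tactic can be sketched in plain ruby like this -- the fetcher class and the ArgumentError here are hypothetical stand-ins, not BackerKit's actual code; the point is that the bug ticket links to a concrete, reproducible failure

```ruby
# Hypothetical reproduction sketch: ProjectFetcher stands in for whatever
# code is raising the real exception.
class ProjectFetcher
  def fetch(project_id)
    # the assumption baked into this code path: project_id is always present
    raise ArgumentError, "project_id is required" if project_id.nil?
    { id: project_id }
  end
end

# "Prove you understand why it's raised": exercise the same path and
# confirm the exact error class comes out.
def reproduces_error?
  ProjectFetcher.new.fetch(nil)
  false
rescue ArgumentError
  true
end
```

in a real suite this would live in a failing spec linked from the ticket, so the next person starts from a verified reproduction instead of a theory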
00:13:30.160 one other learning from our triage duty
00:13:33.200 kind of process is that we started
00:13:35.839 making a lot of bug tickets but weren't
00:13:38.000 always picking them up and so our bug
00:13:39.920 backlog also grew
00:13:41.680 and so another addition to this process
00:13:43.360 was pulling a set number of bugs over
00:13:46.320 each sprint this kind of alludes to like
00:13:49.199 when you want to tackle your backlog you
00:13:51.920 have to have that investment and you do
00:13:53.600 have to prioritize those bugs
00:13:56.079 and make room in your sprint for them
00:13:59.760 it also serves a purpose
00:14:02.720 if no one is advocating for certain bugs
00:14:04.880 to get into the sprint maybe they
00:14:06.480 weren't that important and letting them
00:14:08.000 fall off can also be a really useful
00:14:09.600 tool
00:14:12.000 another learning is if you look at your
00:14:14.160 top occurring errors you might find that
00:14:16.639 it usually follows a power law
00:14:18.720 and the top several are probably going
00:14:20.639 to take up a chunk of the exceptions and
00:14:23.120 so
00:14:24.160 making a deliberate effort to start
00:14:26.560 chipping away at those can be
00:14:28.880 really useful
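that power-law observation is easy to check with a few lines of ruby over an exported list of error events -- the class names and counts below are made up for illustration

```ruby
# Made-up export of exception events, one error class name per occurrence.
events = %w[Net::ReadTimeout MissingCorrectAccess Net::ReadTimeout
            TypeError Net::ReadTimeout MissingCorrectAccess ArgumentError]

# Count occurrences per error class and sort descending: the head of this
# list is where a deliberate chipping-away effort pays off most.
top_errors = events.tally.sort_by { |_klass, count| -count }

top_errors.first(2)
# => [["Net::ReadTimeout", 3], ["MissingCorrectAccess", 2]]
```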
00:14:30.560 and last learning is identifying the
00:14:32.880 levers for efficiency we kind of had
00:14:35.199 this general goal of triaging quicker
00:14:38.000 and moving faster and this process
00:14:40.480 started to show us what was taking a
00:14:42.720 long time exceptions require you
00:14:45.680 to
00:14:46.480 load up a lot of context really quickly
00:14:49.120 um
00:14:50.639 and
00:14:51.600 i think
00:14:53.440 exposing us more to triage duty
00:14:55.600 regularly and building that muscle was
00:14:57.680 really useful in helping us move faster
00:15:00.959 also having more paths to actionability
00:15:03.040 which ended up being one of the reasons
00:15:04.560 why it was going so slow
00:15:07.519 we are going to move on to triage duty
00:15:10.079 so i've pulled two errors out of our app
00:15:12.959 and i wanted to actually go through the
00:15:14.639 process of triaging them together
00:15:17.120 so the first is
00:15:19.600 this faraday timeout error that we see
00:15:22.000 is being thrown from the project lead
00:15:24.720 fetcher first step understanding the
00:15:27.519 error
00:15:28.880 we see here it's a net read timeout
00:15:32.079 if the error that you're seeing is not
00:15:34.000 something you're familiar with
00:15:35.440 familiarize yourself with the
00:15:37.920 error code or if the http status is new
00:15:40.240 to you what it really means and if it's
00:15:42.560 a custom error looking into the api and
00:15:44.800 understanding when that error is
00:15:47.920 thrown
00:15:50.160 in our case we see that we have a net
00:15:52.240 read timeout a subclass of the timeout
00:15:54.560 error which is raised when a chunk of
00:15:56.800 the response can't be read
00:15:58.880 within the read timeout that
00:16:01.519 you've set the default of which is 60
00:16:03.680 seconds
00:16:04.959 just for a little refresher we know when
00:16:06.480 we're making an http request we open up
00:16:08.959 our tcp connection send a request over
00:16:11.360 the wire and then read the response back
00:16:14.320 what our error is telling us is that
00:16:16.000 that last step is not happening i bring
00:16:18.480 that up because you might view it
00:16:20.160 differently if it's a get or a post
00:16:22.959 in a post request we really want to
00:16:24.480 know whether we sent something over
00:16:26.399 the wire or not and in this case we do
00:16:28.639 have to check versus if you were
00:16:30.720 unable to open the connection
00:16:32.880 at all
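the timeout anatomy described here can be sketched with ruby's standard Net::HTTP -- this is an illustrative helper, not BackerKit's Faraday client; the method name and the 5-second open timeout are assumptions

```ruby
require "net/http"

# Net::ReadTimeout is raised when a chunk of the response can't be read
# within read_timeout (60 seconds by default); failing to even open the
# TCP connection raises Net::OpenTimeout instead. The distinction matters
# for POSTs: a read timeout means the request may already have gone over
# the wire, while an open timeout means nothing was sent.
def get_with_timeout(uri, read_timeout: 60)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  read_timeout: read_timeout,
                  open_timeout: 5) do |http|
    http.get(uri.request_uri).body
  end
rescue Net::ReadTimeout
  :read_timed_out   # response never finished arriving
rescue Net::OpenTimeout
  :open_timed_out   # we never connected, so nothing was sent
end
```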
00:16:34.959 next step let's understand the call site
00:16:37.279 and what this piece of code was meant to
00:16:39.199 do that it was unsuccessful at doing
00:16:41.680 we can see that this error is being
00:16:43.199 thrown from a get method in our client
00:16:45.199 with our indiegogo integration if we
00:16:47.680 look at our back trace we can look down
00:16:51.199 and see okay there's a get campaign
00:16:52.880 method that was expecting to be able to
00:16:55.519 get some information
00:16:57.120 it's being thrown from our project lead
00:16:59.199 fetcher worker all right so we're
00:17:01.360 starting to get some information
00:17:04.000 whatever project lead fetcher needed
00:17:05.839 campaign information for was unsuccessful
00:17:08.079 at doing that
00:17:09.760 we might then look at where this worker
00:17:11.919 is being called from to understand more
00:17:13.760 about the purpose of this code
00:17:15.600 i've done some digging for us and we can
00:17:17.280 see that it's coming from a staff
00:17:18.640 controller action called sync from a
00:17:20.640 method sync and update
00:17:22.640 so again we're starting to get some more
00:17:24.000 information like oh this is a staff
00:17:26.160 action that is exposed this is a get
00:17:28.640 request so some information that we
00:17:30.799 expect to have we did not and we also
00:17:34.160 know it's coming from something that is
00:17:36.080 exposed to a user so theoretically they
00:17:38.559 could kick the worker off again
00:17:40.559 the next question is how often is this
00:17:42.240 happening so we see that it's 67
00:17:45.280 occurrences in the last four months this
00:17:47.520 is not new it's not super frequent so
00:17:50.000 it's not super noisy it can be useful
00:17:52.240 just to ask yourself is this something
00:17:53.840 that's
00:17:54.960 gone out in a recent deploy and maybe
00:17:56.640 that will change how likely you are to
00:17:58.799 um fix it etc and also just gives us an
00:18:01.919 idea of how noisy it is
00:18:04.640 some other tools for debugging that you
00:18:06.320 might use are to just kick off the
00:18:07.919 process again in our case i kicked this
00:18:10.720 off
00:18:11.600 when i was debugging and hey the next
00:18:13.440 time we did it we were successful
00:18:15.919 try to reproduce the error and also like
00:18:18.080 finding patterns and past occurrences of
00:18:20.320 the error can be really useful
00:18:24.160 so now let's think about priority
00:18:26.640 and this again there is no equation i
00:18:28.799 wish i could give one to you but it's an
00:18:30.720 art two things that you might consider
00:18:32.799 is the frequency
00:18:34.559 as we talked about we want to reduce the
00:18:36.559 noise and so things that have high
00:18:38.559 frequency that even if the user impact
00:18:40.960 is low it might change our priority
00:18:43.440 of addressing that exception or not
00:18:45.039 seeing it again
00:18:46.880 user impact is the other big one we want
00:18:48.880 our users to have really great
00:18:50.160 experiences and so considering what was
00:18:52.799 the impact for the downstream
00:18:55.039 user you might have a higher priority
00:18:57.600 for a page timing out and a user not
00:18:59.280 being able to do something if it's
00:19:01.760 affecting even more users that also
00:19:03.360 might change
00:19:04.880 your
00:19:05.919 appetite for making a change
00:19:08.799 so in our example we see pretty low
00:19:12.000 frequency
00:19:13.200 and we also know that the user has a way
00:19:15.919 to get themselves out of this
00:19:17.200 they're just missing
00:19:19.440 some
00:19:20.240 data they expect to have
00:19:22.720 all right so let's think about what we
00:19:24.080 could do in this case
00:19:25.840 we see that as we talked about the
00:19:27.840 read timeout is something we've configured
00:19:30.960 you could change
00:19:32.880 this though it would be a pretty global
00:19:34.960 change
00:19:35.919 that's an option though you can use a
00:19:38.080 retry you can rescue this error and
00:19:40.080 retry and be like i don't want to see
00:19:41.679 this try once more before you
00:19:44.000 throw an exception we can acknowledge it
00:19:46.320 we can say hey this is something that
00:19:47.600 happens in our app we are
00:19:49.600 working with web requests they are going
00:19:51.200 to be flaky
00:19:52.960 we could snooze we could say not only do
00:19:54.720 i not care about this time i don't want
00:19:56.480 to see it for another 100 times
00:19:59.200 or we could handle this case if it's
00:20:00.960 representing some specific user flow
00:20:03.840 in this case we've kind of determined
00:20:05.919 low user impact low frequency seems like
00:20:08.720 a really good candidate to dismiss and
00:20:10.480 move on
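the rescue-and-retry option from that list can be sketched as a small generic helper -- this is illustrative, not BackerKit's actual code

```ruby
require "net/http" # for Net::ReadTimeout

# Retry a flaky block once (or `retries` times) before letting the
# exception escape to the error tracker: a single flaky read never
# reaches the dashboard, while a persistent failure still does.
# Safe for GETs; for POSTs, check whether the request already went
# over the wire before retrying.
def with_retry(retries: 1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue Net::ReadTimeout
    retry if attempts <= retries
    raise
  end
end
```

wrapping the client call in something like `with_retry { client.get_campaign(id) }` (a hypothetical call site) is the "try once more before you throw an exception" option described above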
00:20:11.919 good job y'all we've solved our first
00:20:14.240 exception of triage duty let's move on
00:20:16.880 to the second one
00:20:19.440 all right next in our queue is the
00:20:21.520 missing correct access error from the
00:20:23.200 project fetcher worker we know our first
00:20:25.440 step is understanding the error so
00:20:27.200 missing correct access that's not an
00:20:29.360 error i really know off the bat
00:20:31.360 so i'm going to look at the call site
00:20:32.880 and be like what can you tell me more
00:20:34.720 about this error and i see this is
00:20:36.720 actually a custom error that we've
00:20:38.240 thrown
00:20:39.120 um
00:20:40.320 someone at one point was like i don't
00:20:42.000 want to see 403s i want to raise this
00:20:44.320 error that tells you more about what's
00:20:46.320 going on that's missing correct access
00:20:49.280 another
00:20:50.960 utility of having a custom error is that you
00:20:52.880 can kind of throw it from different
00:20:54.080 places in the app so we can build an
00:20:55.760 understanding when it's the same root
00:20:57.440 cause
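the custom-error pattern described here can be sketched like this -- the class name matches the talk, but the message and the guard method are illustrative assumptions

```ruby
# A named error instead of a bare 403: the name carries the diagnosis,
# and raising the same class from several call sites lets reports group
# by root cause.
class MissingCorrectAccess < StandardError
  def initialize(msg = "user is not correctly authed into the external service")
    super
  end
end

# Illustrative guard at the HTTP-client boundary.
def check_access!(status)
  raise MissingCorrectAccess if status == 403
  status
end
```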
00:20:59.600 let's move on to what the purpose of
00:21:00.960 this code is okay we have this id we
00:21:02.559 have this custom error missing correct
00:21:04.080 access maybe we are starting to
00:21:05.200 understand what that means
00:21:07.200 we can look at the stack trace and see
00:21:09.120 again there's a worker the
00:21:11.440 project fetcher worker it is trying to
00:21:14.320 get project information and we're
00:21:16.799 seeing this error that we're raising
00:21:19.440 we might need some more information
00:21:21.120 about what the purpose of this worker is
00:21:22.880 like where was it called
00:21:24.559 and our root call site here is just
00:21:27.360 sidekiq
00:21:28.640 not super useful
00:21:30.960 in the case of workers because
00:21:32.960 it is
00:21:34.559 being thrown from sidekiq
00:21:36.880 it can be difficult to know
00:21:39.280 where this is called from so one tool to help
00:21:41.600 you do that is just putting
00:21:43.679 what is calling the worker in the params
00:21:47.120 so breadcrumbs and params are a really
00:21:48.640 good tool another thing that could be
00:21:50.720 useful is to save queries for post
00:21:52.480 requests to help you debug
00:21:55.120 or any other information that you wanted
00:21:57.120 when you were debugging
00:21:58.400 throwing it in the breadcrumbs or params
00:22:00.880 could be really useful
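that tip can be sketched in plain ruby -- sidekiq itself is elided here, and `ProjectFetcherWorker` plus the `enqueued_by` key are illustrative names; the idea is to pass the enqueuing call site along as a job argument so the exception report shows who scheduled the work

```ruby
# Plain-Ruby stand-in for a Sidekiq worker; with Sidekiq you would pass
# these args through perform_async, and they would then appear in the
# exception report's params when the job blows up.
class ProjectFetcherWorker
  def perform(project_id, enqueued_by: "unknown")
    # ... fetch the project from the external API ...
    { project_id: project_id, enqueued_by: enqueued_by }
  end
end

# At the enqueue site (e.g. a daily scheduled task), name yourself:
job = ProjectFetcherWorker.new.perform(42, enqueued_by: "ProjectUpdateTask")
```

with that breadcrumb in place, a root call site of "sidekiq" is no longer a dead end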
00:22:02.720 so someone has already done us a favor
00:22:04.559 we can look at params and we can see
00:22:06.640 okay there's a project update task that
00:22:09.120 is calling this worker that is causing
00:22:11.039 this exception
00:22:12.559 so i look in the heroku scheduler and i
00:22:14.559 see oh cool this update task runs every
00:22:17.280 day so
00:22:19.200 it'll just fix itself tomorrow right
00:22:22.080 the thing is if it's gonna run again
00:22:24.400 it's also going to throw that exception
00:22:26.000 so let's try kicking it off again i see
00:22:28.159 that it's idempotent i try it again and
00:22:31.280 okay this error is still happening i
00:22:32.640 still get a missing correct access
00:22:34.799 and what that means for our frequency is
00:22:36.799 it's a pretty noisy error we see that
00:22:39.280 it's being raised most days and if we
00:22:41.440 think about triage duty that means every
00:22:43.600 day someone's going to be triaging this
00:22:45.760 exception
00:22:46.960 which is painful
00:22:49.120 and so when we're thinking about this
00:22:51.039 error another part that i'm going to
00:22:53.440 tell you is that this is raised when we
00:22:55.919 integrate with kickstarter and when our
00:22:57.919 users are incorrectly authed into this
00:22:59.679 external service this is where we get
00:23:01.360 this error
00:23:03.120 which also means there's not much as a
00:23:04.799 dev i can do i can't really auth for
00:23:07.039 them i don't have their credentials
00:23:10.640 but it's really noisy which is maybe the
00:23:12.799 worst of both worlds this is an
00:23:14.400 exception that i'm just gonna sit there
00:23:15.919 and be like cool here it is again
00:23:18.400 and we integrate with kickstarter a lot
00:23:20.559 so you're going to see this error and
00:23:22.480 any other action this user is trying to
00:23:24.240 do on our app
00:23:26.640 so what can we do
00:23:28.799 how do we reduce the noise of low value
00:23:30.960 exceptions to support our goal of only
00:23:33.600 high-value actionable errors in
00:23:36.000 our backlog
00:23:38.240 one tool we can do is just error
00:23:39.679 grouping
00:23:41.120 Honeybadger by default groups
00:23:44.080 exceptions by the type of error and
00:23:45.679 where it's thrown from but you can
00:23:47.360 redefine a fingerprint to group the same
00:23:50.880 error thrown from two different places when you
00:23:52.400 know it's the same root cause there's
00:23:54.480 a gotcha there where it can
00:23:55.919 affect some statistics or comments but
00:23:57.840 it can be a useful tool
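As a sketch of that regrouping, Honeybadger's Ruby agent lets you set a custom fingerprint in a `before_notify` hook (the error class name here is illustrative; check the agent docs for your version):

```ruby
# Hedged sketch: override Honeybadger's default grouping (error class +
# raise location) so every occurrence of a known root cause lands in one
# error group, regardless of which call site raised it.
Honeybadger.configure do |config|
  config.before_notify do |notice|
    if notice.error_class == "MissingCorrectAccess"
      # same fingerprint => same group in the Honeybadger UI
      notice.fingerprint = "missing-correct-access"
    end
  end
end
```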
00:24:00.000 another thing we can do is the rescue
00:24:02.080 retry pattern we know in this case it's
00:24:04.240 not really going to help us because we
00:24:05.679 know we're going to keep seeing it
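A generic rescue/retry sketch (the helper is hypothetical; note that Sidekiq jobs also get automatic retries, which is part of why this pattern doesn't help for the error above):

```ruby
require 'timeout'

# Retry a block a bounded number of times when a transient error
# (e.g. a timeout) is raised, then re-raise if it keeps failing.
def with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue Timeout::Error, IOError
    retry if attempts < max_attempts
    raise
  end
end
```

This helps for flaky, transient failures; for an error that is raised deterministically (like a permanently invalid credential), retrying just repeats the noise.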
00:24:07.679 if it has low
00:24:10.000 actionability and you kind of question
00:24:11.679 whether this is even an exception
00:24:13.760 silencing it can be a really useful tool
00:24:15.679 actually we don't want things in our
00:24:18.240 observability tool that we can't do much
00:24:20.240 about or it doesn't help to know that
00:24:21.679 it's happening
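Silencing can be a one-line configuration change. With Honeybadger, something like the following (a hedged sketch; the same can be done via `exceptions.ignore` in honeybadger.yml, and the error class is illustrative):

```ruby
# Keep known low-actionability errors out of the tracker entirely,
# so triage duty only sees errors someone can actually act on.
Honeybadger.configure do |config|
  config.exceptions.ignore += [MissingCorrectAccess]
end
```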
00:24:22.880 we can surface errors to users and allow
00:24:24.640 them to be self-serviceable or
00:24:26.960 train teams and processes with a
00:24:28.559 manual workaround remember the staff and
00:24:32.400 people who work with us are part of our
00:24:34.159 system
00:24:35.200 or if it's widespread and affecting a
00:24:37.520 specific sliver maybe an overlooked
00:24:40.400 data consideration you might actually
00:24:42.000 fix that data
00:24:44.159 i want to show one thing that we did um
00:24:46.320 for this case as i said
00:24:48.640 you can't do a lot of things in our app
00:24:50.320 if you're not authed correctly and so we
00:24:52.799 introduced this idea of a valid platform
00:24:55.279 credential which gives you an
00:24:57.600 object to check whether
00:25:00.400 something's valid and it allows you to
00:25:02.559 have a check before you even run a
00:25:04.400 process that you know is just going to
00:25:06.080 throw an exception down the road
00:25:09.200 we also complemented this with a worker
00:25:11.200 that would update valid platform
00:25:13.440 credential
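The valid-platform-credential idea can be sketched like this (class and method names are hypothetical, not BackerKit's actual code):

```ruby
# A stored validity flag, refreshed periodically by a background worker,
# consulted as a guard before starting work that would otherwise raise
# MissingCorrectAccess downstream.
class PlatformCredential
  def initialize(valid:)
    @valid = valid
  end

  def valid?
    @valid
  end
end

def fetch_project_data(credential)
  # guard: skip work we already know will fail without valid auth
  return :skipped unless credential.valid?
  :fetched
end
```

The guard turns a noisy, recurring exception into an explicit, silent skip, while the complementary worker keeps the validity flag up to date.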
00:25:16.559 i was so interested in exceptions and
00:25:18.720 how other people handled them that i sent
00:25:20.880 out a survey before my talk to
00:25:22.880 understand some other pain points et
00:25:24.400 cetera and i was going to share them
00:25:26.480 with you here but as you can tell we're
00:25:28.480 pretty close to time and i'm kind of
00:25:30.320 about out of time what i'm going to do
00:25:32.400 is have an exception tip of the
00:25:35.200 week on my twitter my handle is down
00:25:37.760 there look out for that
00:25:40.880 if you're interested i actually also
00:25:42.880 learned a bunch of things just from
00:25:44.480 folks at rubyconf so i'm super excited
00:25:47.760 to dive deeper into those
00:25:50.559 thank you all for joining me on this
00:25:52.640 voyage please collect all your
00:25:54.240 belongings on your way out
00:25:56.559 if you have any questions please come
00:25:57.919 find me outside or ask me on discord and
00:26:00.640 if you try anything please at me on
00:26:02.320 twitter i'd love to know how it went for
00:26:03.840 you
00:26:04.799 i want to thank my coworkers who
00:26:06.720 continuously inspire me to keep
00:26:09.039 experimenting
00:26:10.559 ian and lindsey are here right now ian
00:26:12.880 was the originator of badgerbot so
00:26:16.240 if you want to talk to him you should
00:26:18.320 i also want to give a shout out to the
00:26:19.919 wmbcfp working group who gave me a ton
00:26:22.400 of feedback and helped me on this whole
00:26:23.840 process would not be here without you
00:26:26.320 and my parents who flew here my dad
00:26:28.880 who's in the crowd right now i would
00:26:31.440 also not be here without them thanks everyone
00:26:35.679 last slide i had to shout out BackerKit
00:26:37.840 if you're interested in tdd
00:26:39.240 experimentation crowdfunding come talk
00:26:41.760 to us we're hiring
00:26:43.440 thanks everyone