Schrödinger's Error: Living in the Gray Area of Exceptions

by Sweta Sanghavi

In the talk "Schrödinger's Error: Living In the Gray Area of Exceptions" delivered by Sweta Sanghavi at RubyConf 2021, the focus is on the challenges developers face when managing exceptions in complex systems. The session acknowledges that while encountering exceptions is inevitable, the key lies in how effectively developers respond to them.

Key Points Discussed:

- Understanding Exceptions: Developers operate in systems where understanding every potential failure point is impractical. Exceptions serve as feedback mechanisms to reassess code assumptions.

- Initial Challenges: At BackerKit, the team initially faced disorganization in addressing exceptions, relying on a Slack integration that resulted in missed notifications due to lack of defined ownership and unclear priority among exceptions.

- Process Experiments: The team experimented with a structured approach to manage exceptions, initiating a weekly "Badger Duty" where team members took ownership of triaging exceptions, leading to better exposure and alignment on priorities.

- Goals for Improvement: Key goals identified by the team included filtering meaningful signals from the noise of exceptions, fostering collective ownership, and proactively addressing user-impacting exceptions.

- Daily Triage Duty: This included setting a clear rotation for exception triaging, defining tasks to categorize bugs, and promptly determining the urgency of issues, thus creating a continuous feedback loop.

- Learnings and Iterations: The team learned that continually iterating on the process helped maintain clarity and focus on actionable items within the backlog. For example, using dashboards for visualizing errors simplified the triaging process while providing data for future actions.

- Case Studies: The presentation included specific examples such as handling a "Faraday timeout error" and a "missing correct access error"—emphasizing the importance of understanding error contexts, prioritizing issues based on frequency and user impact, and determining effective responses.

- Concluding Thoughts: Sweta highlighted that creating a systematic approach to exception management not only clarified responsibilities but also encouraged team collaboration and knowledge sharing, ultimately driving the team towards a more disciplined, resilient, and responsive approach.

The talk concluded with an invitation for attendees to reach out for further discussion on exception management strategies and to connect on shared experiences, emphasizing the importance of community collaboration in improving processes.

ArgumentErrors, TimeOuts, TypeErrors… even scanning a monitoring dashboard can be overwhelming. Any complex system is likely swimming in exceptions. Some are high value signals. Some are red herrings. Resilient applications that live in the entropy of the web require developers to be experts at responding to exceptions. But which ones and how?

In this talk, we’ll discuss what makes exception management difficult, tools to triage and respond to exceptions, and processes for more collective and effective exception management. We'll also explore some related opinions from you, my dear colleagues.

RubyConf 2021

00:00:10.719 all right
00:00:12.000 hello everyone
00:00:15.280 our applications live in complex systems
00:00:17.920 with points of failure that span space
00:00:20.960 and time
00:00:22.480 as developers we write code that's
00:00:24.560 executed in systems that we rarely know
00:00:26.720 all the ins and outs of and it would be
00:00:29.039 a poor use of our time to try to
00:00:30.800 anticipate all the ways in which our
00:00:33.280 code can fail
00:00:34.960 and so the errors that our code raises
00:00:37.520 give us feedback when code
00:00:40.399 paths have some assumptions baked
00:00:42.719 into them that require a little bit of
00:00:44.960 re-examination
00:00:47.200 it's our superpower to make
00:00:49.360 reasonable assumptions let our tests
00:00:51.680 guide us but also ship fast and allow
00:00:54.399 our systems to tell us the ways in which
00:00:57.280 our system is falling down instead of
00:00:59.359 trying to predict those failures
00:01:03.039 this also means
00:01:05.519 that sometimes we must put on our gear
00:01:08.159 step into our spaceships and identify
00:01:10.799 which invaders are going to cause us
00:01:12.560 harm and which ones are okay to allow to
00:01:15.119 stick around
00:01:17.759 we may all share this understanding of
00:01:20.799 the importance of reacting to uncaught
00:01:22.799 exceptions
00:01:24.560 and yet some of our processes or lack
00:01:27.439 thereof can make this notification
00:01:30.320 still elicit some
00:01:32.159 dread
00:01:35.040 welcome to schrodinger's error living in
00:01:37.280 the gray area of exceptions i'm sweta
00:01:40.720 sanghavi i'm a developer at backer kit and
00:01:43.600 i've become really interested in why
00:01:45.680 exceptions are so difficult to manage
00:01:48.000 and what we can do about it
00:01:50.079 today i'm going to walk through some
00:01:51.280 process experiments we've tried on my
00:01:53.280 team at backer kit
00:01:55.040 and learnings we've surfaced and then
00:01:57.680 we're all going to go on to triage duty
00:01:59.680 together and see some tools we can use
00:02:01.920 to help us when we're managing
00:02:03.680 exceptions
00:02:05.439 all right put on your helmets and your
00:02:06.960 space boots
00:02:08.399 keep your arms inside the vehicle at all
00:02:11.120 times
00:02:12.879 so i'm going to lay some groundwork of
00:02:14.640 where we started at backer kit we
00:02:16.800 support creators in fulfilling their
00:02:18.720 crowdfunding projects
00:02:20.800 if you've backed a crowdfunding campaign
00:02:22.480 you may have gotten your survey through
00:02:23.840 backer kit take a look at that url
00:02:26.560 we're a pretty small lean team
00:02:28.800 our daily flow is to pair up in the
00:02:30.800 morning and work down a backlog or a
00:02:34.080 queue of features
00:02:35.840 chores and bugs
00:02:38.160 we surfaced a problem that though we had
00:02:40.480 a general idea there's utility to
00:02:42.959 reacting to impactful exceptions we
00:02:45.519 weren't super effective at doing so
00:02:49.599 when we started thinking more critically
00:02:51.920 about this we had a pretty light process
00:02:53.760 for addressing exceptions we had a slack
00:02:56.400 integration that surfaced any unresolved
00:02:58.640 or unignored errors that went to a
00:03:01.120 channel
00:03:02.080 and since we were usually pairing
00:03:03.519 someone who was either soloing or had a
00:03:05.200 spare moment may look at the channel
00:03:07.120 and notice when there was something worth
00:03:09.120 addressing
00:03:10.720 but one it required context switching
00:03:12.879 from our prioritized queue work and there
00:03:15.440 was no owner or designated dev and so it
00:03:18.239 ended up being a few folks who were in
00:03:20.000 there regularly and a lot of folks who
00:03:22.000 weren't really looking in that channel
00:03:24.879 we also had a pretty low alignment on
00:03:26.879 what process we wanted
00:03:29.120 to use for managing exceptions and not a
00:03:31.840 shared goal of what effective
00:03:33.840 management was and which exceptions were
00:03:36.000 high priority in addressing or what
00:03:38.239 actions to take in response
00:03:40.560 there was a pretty large backlog of
00:03:42.640 errors and so it was hard to scan it
00:03:45.200 quickly or really surface when there
00:03:47.040 was something that should be addressed
00:03:48.879 coming up
00:03:50.720 and so we started to experiment with
00:03:52.799 some process changes to get better at
00:03:55.280 exception management
00:03:57.439 our first experiment was badger duty
00:04:00.640 affectionately called we use honey
00:04:02.239 badger as our observability tool and so
00:04:05.360 like badger patrol badger duty was the
00:04:08.159 experiment where each of us owned
00:04:10.400 managing exceptions for the week and
00:04:12.720 you could either solo or
00:04:14.159 pair on it and we tracked how long it
00:04:16.000 took to triage and address any important
00:04:18.639 exceptions
00:04:20.000 this was our first stab at starting to
00:04:21.759 chip away at the errors and also to
00:04:23.600 de-silo exception management
00:04:27.280 as we were all becoming more exposed to
00:04:29.440 exception management we started to
00:04:31.199 surface what our pain points were and
00:04:33.040 why this was difficult for our team
00:04:37.520 one was the priority was unclear
00:04:40.080 it was unclear when something was
00:04:42.800 urgent when we were just scanning
00:04:44.240 exceptions
00:04:45.520 and when compared to our high value
00:04:47.440 feature work it could feel like low
00:04:49.600 value to be managing or triaging
00:04:52.080 exceptions
00:04:53.280 internally we weren't super aligned on
00:04:55.520 what those high priority exceptions were
00:04:59.440 there were also a lot of exceptions that
00:05:01.440 had low actionability
00:05:03.360 nor did we have a well-aligned path
00:05:05.840 to what we should be doing about them and
00:05:08.240 so it could lead to feeling a little
00:05:10.479 stuck when we came across an exception
00:05:14.240 we also started surfacing that we
00:05:16.080 weren't aligned on what the goal was
00:05:18.560 and so
00:05:19.520 kind of our takeaways from this
00:05:21.680 first
00:05:22.560 experiment were that having this process
00:05:25.360 helped us start to have a place to
00:05:27.120 iterate from
00:05:28.400 actually just having something that we
00:05:30.080 know we were doing did ease some of the
00:05:32.240 anxiety of not knowing who was tackling
00:05:34.720 this
00:05:36.479 and it gave us a process to start
00:05:38.880 unearthing where we weren't aligned
00:05:42.160 we started to realize that getting
00:05:44.080 aligned on our goals would be a useful
00:05:46.000 exercise as we kept iterating on this
00:05:49.120 um also badger duty kind of gave us a
00:05:51.759 place like when we were pairing or
00:05:53.199 working on an exception to start digging
00:05:55.360 for finding where we were disagreeing
00:05:57.280 which is actually the first piece to
00:05:59.199 starting to align
00:06:00.960 so i'm going to go through some of the
00:06:02.240 goals we landed on as a team
00:06:04.800 our first goal was that we should be
00:06:06.080 able to see signal through noise when we
00:06:08.400 are managing exceptions
00:06:10.160 it should be clear when there's
00:06:11.440 something that should be addressed and
00:06:13.520 it should surface through or surface
00:06:15.840 above the rest of maybe the more noisy
00:06:18.319 exceptions
00:06:19.520 a piece of this means tackling some of
00:06:21.600 that noise so that the signals can kind
00:06:24.479 of come through
00:06:26.080 our goal is not to get zero exceptions
00:06:28.080 or never raise exceptions or inbox zero
00:06:31.840 and it's also not to fight
00:06:34.319 fires
00:06:35.600 monitoring is where we expect those
00:06:37.759 kinds of things to come through
00:06:40.639 we also had the goal of having a
00:06:43.759 collective ownership of addressing
00:06:45.759 exceptions
00:06:46.880 we saw a lot of utility for all of us to
00:06:49.520 be exposed to the exceptions that
00:06:51.360 were happening in the app it allowed for
00:06:53.759 all of us to be involved in making the
00:06:56.160 solutions and um benefiting from other
00:06:58.960 people's perspectives or what they found
00:07:01.039 or perhaps if they were closer to that
00:07:02.800 piece of the code base
00:07:05.039 it was an opportunity for knowledge
00:07:06.400 share
00:07:08.960 and last our goal was to proactively
00:07:11.199 address exceptions this may be a shared
00:07:12.960 goal of
00:07:14.160 we should see when there's some
00:07:15.759 unfavorable user impact or if a user has
00:07:18.479 gotten themselves into a
00:07:20.720 state that they can't get themselves
00:07:22.479 out of before that we get those user
00:07:24.800 reports that's like the utility of our
00:07:26.720 exception management
00:07:28.560 and so we proposed another process
00:07:30.720 change
00:07:31.759 the next thing we tried is a rotating
00:07:33.919 pair triaging exceptions daily
00:07:37.199 this had the benefit again to having a
00:07:38.960 clear owner so it wasn't like all the
00:07:41.039 team was expected to look at the slack
00:07:43.120 or check in on what's going on
00:07:45.599 it also allowed us because this was a
00:07:47.440 rotating designation it allowed us to
00:07:50.080 make resource decisions at the top of
00:07:51.680 the day
00:07:53.440 looking at schedules or looking at high
00:07:55.039 priority work who is going to work on
00:07:56.879 bugs and also do we have the bandwidth
00:07:58.720 to do so
00:08:00.560 again everyone was exposed to nuances of
00:08:03.360 exception management and the exceptions
00:08:05.039 that were coming through our app which
00:08:06.960 allowed us to drive towards alignment
00:08:12.160 as different people came across the same
00:08:14.000 exceptions and why they were painful we
00:08:16.479 could discuss them and come up with
00:08:17.840 solutions to incorporate the team's
00:08:20.560 ideas
00:08:22.080 i'm going to go through what it looks
00:08:23.680 like to be on triage duty at backer kit
00:08:26.960 first
00:08:29.199 you start at the top of the backlog of
00:08:31.280 the unaddressed
00:08:32.959 errors
00:08:35.039 at the top of the backlog you get a 15
00:08:37.279 minute time box
00:08:38.560 your goal in that 15 minutes is to
00:08:40.320 determine the priority of that bug
00:08:43.279 it basically boils down to three options
00:08:45.760 this is something that should be fixed
00:08:47.920 but i'm not going to do it right now
00:08:49.839 because i have 15 minutes so i'm going
00:08:50.959 to write a bug ticket and put it in the
00:08:52.560 bug section of our
00:08:55.440 backlog
00:08:57.200 i'm going to acknowledge it so we sort
00:08:58.800 of came up with the idea of
00:09:00.000 acknowledgement but it's pretty similar
00:09:01.279 to resolve the idea being i've looked at
00:09:03.839 it this is not something we need to
00:09:05.200 really do anything about acknowledge and
00:09:07.279 move on i do want to know the next time
00:09:09.279 it happens
00:09:10.399 or
00:09:11.360 this is something that's worth fixing
00:09:12.880 and it's worth fixing now i'm going to
00:09:14.480 make a bug ticket and i'm going to work
00:09:16.000 on it now
00:09:17.680 for really high priority exceptions
00:09:20.959 a tool that we use to support um
00:09:24.000 bug duty is a dashboard
00:09:26.640 so what this helped us do is to get a
00:09:29.120 quick idea of what exceptions had not
00:09:31.440 been addressed
00:09:32.880 this is pretty much pulling those same
00:09:34.880 errors that we saw in our slack
00:09:36.080 integration into a more user-friendly ui
00:09:40.720 it also really supported this like going
00:09:42.800 down the line of triaging starting at
00:09:44.959 the top and going down and you get some
00:09:46.880 useful information just
00:09:49.120 at a glance
00:09:50.800 this also supported some of those
00:09:52.160 resource decisions of like how deep is
00:09:54.080 our backlog how important is it to make
00:09:56.640 sure someone's looking at
00:09:58.080 exceptions today
00:10:00.399 i also want to talk about the idea of
00:10:02.480 acknowledgement that you can see in
00:10:05.200 badger bot
00:10:06.560 um
00:10:07.680 so the purpose of acknowledge as i spoke
00:10:09.680 about before is to kind of say we're not
00:10:12.720 going to do something about this
00:10:14.399 as part of our process we also started
00:10:17.440 auto resolving weekly so basically an
00:10:20.480 acknowledgement would expire in a
00:10:23.040 week and so you would see that error
00:10:24.720 again
00:10:26.640 you can see the drop down that comes
00:10:28.079 down when you hit acknowledge and you
00:10:30.000 can see what this kind of represents is
00:10:32.640 us starting to bucket
00:10:34.240 why we might be acknowledging exceptions
00:10:36.399 which actually ends up being really
00:10:37.760 useful as you're having conversations
00:10:39.680 about the priority of different
00:10:40.959 exceptions
00:10:42.079 you can see like yeah there's this get
00:10:44.000 timeout but it's not high priority or
00:10:46.000 it's not high frequency i'm gonna
00:10:48.320 move on
00:10:49.440 um also some things about like i already
00:10:51.519 see a story about this i'm not gonna
00:10:52.959 make a new one
00:10:54.560 and also you could keep adding new
00:10:57.120 um
00:10:58.079 choices that you wish were here that are
00:10:59.760 not
00:11:01.440 so it kind of allows us to have a way
00:11:03.440 forward for what i think is starting to
00:11:06.079 talk about the gray area of like yes i
00:11:08.959 want to see this exception no i don't
00:11:11.120 want to do anything about it and like
00:11:13.279 being comfortable with that designation
00:11:16.000 and knowing that there are many
00:11:17.440 exceptions that are going to live in
00:11:18.640 that bucket
00:11:23.360 next i want to start talking about some
00:11:25.040 learnings from this next process
00:11:27.600 change that we made
00:11:30.000 one big one
00:11:31.360 especially for me was that exceptions
00:11:32.959 can be solved collectively over time
00:11:35.440 when you have a 15-minute time box
00:11:37.040 you're likely hitting it frequently
00:11:39.680 which means that you have to kind of
00:11:41.360 think about what can i do in 15 minutes
00:11:43.519 that's going to move debugging this
00:11:45.440 exception forward
00:11:47.360 it really reframed how i thought
00:11:49.360 about what it means to fix something
00:11:51.279 because sometimes it can mean that i
00:11:53.519 don't think this is high enough value
00:11:54.880 for me to spend time really digging into
00:11:56.720 this but maybe next time i wish that i
00:11:59.519 had this piece of information i'm going
00:12:01.200 to add it to the breadcrumbs or the
00:12:03.040 params
00:12:04.880 other kinds of
00:12:06.320 small things you can do like refactoring
00:12:08.560 out a law of demeter violation so that
00:12:10.399 next time the line from which the error
00:12:12.560 was thrown is more useful
00:12:14.720 um or just like starting a bug ticket
00:12:17.680 and having a place to log what happened
00:12:20.399 what's gonna happen next time and being
00:12:22.399 okay with allowing it to happen again
00:12:24.880 and letting all of that be more data
00:12:26.720 points to support a broader solution
00:12:29.680 it also allows you to try something
00:12:31.680 right now and push forward the
00:12:34.399 exception for the next person to take
00:12:36.079 over next time and start putting
00:12:38.320 patterns into our app for dealing with
00:12:40.560 exceptions that we were seeing a lot of
00:12:45.040 another learning is about prioritizing
00:12:47.040 and writing bug stories
00:12:49.440 writing bug stories that i myself would
00:12:51.279 actually want to pick up is quite the
00:12:53.200 art
00:12:54.399 providing enough context for the next
00:12:56.240 person about what you've discovered
00:12:59.200 but not allowing like any half-baked
00:13:01.440 theories to be in the ticket is
00:13:04.000 really useful and can be difficult
00:13:06.240 things that were useful for us are
00:13:07.680 links to faults how to reproduce that
00:13:10.000 same exception or other information you
00:13:12.480 uncovered
00:13:13.920 one way to really prove to yourself that
00:13:15.680 you understand
00:13:17.680 why an exception is being raised is to
00:13:19.920 write a test that exercises the same
00:13:22.720 system that you think is causing this
00:13:24.800 exception and seeing that failing test
00:13:26.720 which is a really great thing to link to
00:13:28.160 a bug ticket
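that tactic can be sketched in plain ruby like this -- the fetcher class and the ArgumentError here are hypothetical stand-ins, not BackerKit's actual code; the point is that the bug ticket links to a concrete, reproducible failure

```ruby
# Hypothetical reproduction sketch: ProjectFetcher stands in for whatever
# code is raising the real exception.
class ProjectFetcher
  def fetch(project_id)
    # the assumption baked into this code path: project_id is always present
    raise ArgumentError, "project_id is required" if project_id.nil?
    { id: project_id }
  end
end

# "Prove you understand why it's raised": exercise the same path and
# confirm the exact error class comes out.
def reproduces_error?
  ProjectFetcher.new.fetch(nil)
  false
rescue ArgumentError
  true
end
```

in a real suite this would live in a failing spec linked from the ticket, so the next person starts from a verified reproduction instead of a theory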
00:13:30.160 one other learning from our triage duty
00:13:33.200 kind of process is that we started
00:13:35.839 making a lot of bug tickets but weren't
00:13:38.000 always picking them up and so our bug
00:13:39.920 backlog also grew
00:13:41.680 and so another addition to this process
00:13:43.360 was pulling a set number of bugs over
00:13:46.320 each sprint this kind of alludes to like
00:13:49.199 when you want to tackle your backlog you
00:13:51.920 have to have that investment and you do
00:13:53.600 have to prioritize those bugs
00:13:56.079 and make room in your sprint for them
00:13:59.760 it also serves a purpose
00:14:02.720 if no one is advocating for certain bugs
00:14:04.880 to get into the sprint maybe they
00:14:06.480 weren't that important and letting them
00:14:08.000 fall off can also be a really useful
00:14:09.600 tool
00:14:12.000 another learning is if you look at your
00:14:14.160 top occurring errors you might find that
00:14:16.639 it usually follows a power law
00:14:18.720 and the top several are probably going
00:14:20.639 to take up a chunk of the exceptions and
00:14:23.120 so
00:14:24.160 making a deliberate effort to start
00:14:26.560 chipping away at those can be
00:14:28.880 really useful
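that power-law observation is easy to check with a few lines of ruby over an exported list of error events -- the class names and counts below are made up for illustration

```ruby
# Made-up export of exception events, one error class name per occurrence.
events = %w[Net::ReadTimeout MissingCorrectAccess Net::ReadTimeout
            TypeError Net::ReadTimeout MissingCorrectAccess ArgumentError]

# Count occurrences per error class and sort descending: the head of this
# list is where a deliberate chipping-away effort pays off most.
top_errors = events.tally.sort_by { |_klass, count| -count }

top_errors.first(2)
# => [["Net::ReadTimeout", 3], ["MissingCorrectAccess", 2]]
```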
00:14:30.560 and last learning is identifying the
00:14:32.880 levers for efficiency we kind of had
00:14:35.199 this general goal of triaging quicker
00:14:38.000 and moving faster and this process
00:14:40.480 started to show us what was taking a
00:14:42.720 long time exceptions require you
00:14:45.680 to
00:14:46.480 load up a lot of context really quickly
00:14:49.120 um
00:14:50.639 and
00:14:51.600 i think
00:14:53.440 exposing us more to triage duty
00:14:55.600 regularly and building that muscle was
00:14:57.680 really useful in helping us move faster
00:15:00.959 also having more paths to actionability
00:15:03.040 which ended up being one of the reasons
00:15:04.560 why it was going so slow
00:15:07.519 we are going to move on to triage duty
00:15:10.079 so i've pulled two errors out of our app
00:15:12.959 and i wanted to actually go through the
00:15:14.639 process of triaging them together
00:15:17.120 so the first is
00:15:19.600 this faraday timeout error that we see
00:15:22.000 is being thrown from the project lead
00:15:24.720 fetcher first step understanding the
00:15:27.519 error
00:15:28.880 we see here it's a net read timeout
00:15:32.079 if the error that you're seeing is not
00:15:34.000 something you're familiar with
00:15:35.440 familiarize yourself with the
00:15:37.920 error code or if the http status is new
00:15:40.240 to you what it really means and if it's
00:15:42.560 a custom error looking into the api and
00:15:44.800 understanding when that error is
00:15:47.920 thrown
00:15:50.160 in our case we see that we have a net
00:15:52.240 read timeout a subclass of the timeout
00:15:54.560 error which is raised when a chunk of
00:15:56.800 the response can't be read
00:15:58.880 within the read timeout that
00:16:01.519 you've set the default of which is 60
00:16:03.680 seconds
00:16:04.959 just for a little refresher we know when
00:16:06.480 we're making an http request we open up
00:16:08.959 our tcp connection send a request over
00:16:11.360 the wire and then read the response back
00:16:14.320 what our error is telling us is that
00:16:16.000 that last step is not happening i bring
00:16:18.480 that up because you might view it
00:16:20.160 differently if it's a get or a post
00:16:22.959 in a post request we really want to
00:16:24.480 know whether we sent something over
00:16:26.399 the wire or not and in this case we do
00:16:28.639 have to check versus if you were
00:16:30.720 unable to open the connection
00:16:32.880 at all
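the timeout anatomy described here can be sketched with ruby's standard Net::HTTP -- this is an illustrative helper, not BackerKit's Faraday client; the method name and the 5-second open timeout are assumptions

```ruby
require "net/http"

# Net::ReadTimeout is raised when a chunk of the response can't be read
# within read_timeout (60 seconds by default); failing to even open the
# TCP connection raises Net::OpenTimeout instead. The distinction matters
# for POSTs: a read timeout means the request may already have gone over
# the wire, while an open timeout means nothing was sent.
def get_with_timeout(uri, read_timeout: 60)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  read_timeout: read_timeout,
                  open_timeout: 5) do |http|
    http.get(uri.request_uri).body
  end
rescue Net::ReadTimeout
  :read_timed_out   # response never finished arriving
rescue Net::OpenTimeout
  :open_timed_out   # we never connected, so nothing was sent
end
```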
00:16:34.959 next step let's understand the call site
00:16:37.279 and what this piece of code was meant to
00:16:39.199 do that it was unsuccessful at doing
00:16:41.680 we can see that this error is being
00:16:43.199 thrown from a get method in our client
00:16:45.199 with our indiegogo integration if we
00:16:47.680 look at our back trace we can look down
00:16:51.199 and see okay there's a get campaign
00:16:52.880 method that was expecting to be able to
00:16:55.519 get some information
00:16:57.120 it's being thrown from our project lead
00:16:59.199 fetcher worker all right so we're
00:17:01.360 starting to get some information
00:17:04.000 whatever project lead fetcher needed
00:17:05.839 campaign information for was unsuccessful
00:17:08.079 at doing that
00:17:09.760 we might then look at where this worker
00:17:11.919 is being called from to understand more
00:17:13.760 about the purpose of this code
00:17:15.600 i've done some digging for us and we can
00:17:17.280 see that it's coming from a staff
00:17:18.640 controller action called sync from a
00:17:20.640 method sync and update
00:17:22.640 so again we're starting to get some more
00:17:24.000 information like oh this is a staff
00:17:26.160 action that is exposed this is a get
00:17:28.640 request so some information that we
00:17:30.799 expect to have we did not and we also
00:17:34.160 know it's coming from something that is
00:17:36.080 exposed to a user so theoretically they
00:17:38.559 could kick the worker off again
00:17:40.559 the next question is how often is this
00:17:42.240 happening so we see that it's 67
00:17:45.280 occurrences in the last four months this
00:17:47.520 is not new it's not super frequent so
00:17:50.000 it's not super noisy it can be useful
00:17:52.240 just to ask yourself is this something
00:17:53.840 that's
00:17:54.960 gone out in a recent deploy and maybe
00:17:56.640 that will change how likely you are to
00:17:58.799 um fix it etc and also just gives us an
00:18:01.919 idea of how noisy it is
00:18:04.640 some other tools for debugging that you
00:18:06.320 might use are to just kick off the
00:18:07.919 process again in our case i kicked this
00:18:10.720 off
00:18:11.600 when i was debugging and hey the next
00:18:13.440 time we did it we were successful
00:18:15.919 try to reproduce the error and also like
00:18:18.080 finding patterns and past occurrences of
00:18:20.320 the error can be really useful
00:18:24.160 so now let's think about priority
00:18:26.640 and this again there is no equation i
00:18:28.799 wish i could give one to you but it's an
00:18:30.720 art two things that you might consider
00:18:32.799 is the frequency
00:18:34.559 as we talked about we want to reduce the
00:18:36.559 noise and so things that have high
00:18:38.559 frequency that even if the user impact
00:18:40.960 is low it might change our priority
00:18:43.440 of addressing that exception or not
00:18:45.039 seeing it again
00:18:46.880 user impact is the other big one we want
00:18:48.880 our users to have really great
00:18:50.160 experiences and so considering what was
00:18:52.799 the impact for the downstream
00:18:55.039 user you might have a higher priority
00:18:57.600 for a page timing out and a user not
00:18:59.280 being able to do something if it's
00:19:01.760 affecting even more users that also
00:19:03.360 might change
00:19:04.880 your
00:19:05.919 appetite for making a change
00:19:08.799 so in our example we see pretty low
00:19:12.000 frequency
00:19:13.200 and we also know that the user has a way
00:19:15.919 to get themselves out of this
00:19:17.200 they're just missing
00:19:19.440 some
00:19:20.240 data they expect to have
00:19:22.720 all right so let's think about what we
00:19:24.080 could do in this case
00:19:25.840 we see that as we talked about the
00:19:27.840 read timeout is something we've configured
00:19:30.960 you could change
00:19:32.880 this though it would be a pretty global
00:19:34.960 change
00:19:35.919 that's an option though you can use a
00:19:38.080 retry you can rescue this error and
00:19:40.080 retry and be like i don't want to see
00:19:41.679 this try once more before you
00:19:44.000 throw an exception we can acknowledge it
00:19:46.320 we can say hey this is something that
00:19:47.600 happens in our app we are
00:19:49.600 working with web requests they are going
00:19:51.200 to be flaky
00:19:52.960 we could snooze we could say not only do
00:19:54.720 i not care about this time i don't want
00:19:56.480 to see it for another 100 times
00:19:59.200 or we could handle this case if it's
00:20:00.960 representing some specific user flow
00:20:03.840 in this case we've kind of determined
00:20:05.919 low user impact low frequency seems like
00:20:08.720 a really good candidate to dismiss and
00:20:10.480 move on
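the rescue-and-retry option from that list can be sketched as a small generic helper -- this is illustrative, not BackerKit's actual code

```ruby
require "net/http" # for Net::ReadTimeout

# Retry a flaky block once (or `retries` times) before letting the
# exception escape to the error tracker: a single flaky read never
# reaches the dashboard, while a persistent failure still does.
# Safe for GETs; for POSTs, check whether the request already went
# over the wire before retrying.
def with_retry(retries: 1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue Net::ReadTimeout
    retry if attempts <= retries
    raise
  end
end
```

wrapping the client call in something like `with_retry { client.get_campaign(id) }` (a hypothetical call site) is the "try once more before you throw an exception" option described above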
00:20:11.919 good job y'all we've solved our first
00:20:14.240 exception of triage duty let's move on
00:20:16.880 to the second one
00:20:19.440 all right next in our queue is the
00:20:21.520 missing correct access error from the
00:20:23.200 project fetcher worker we know our first
00:20:25.440 step is understanding the error so
00:20:27.200 missing correct access that's not an
00:20:29.360 error i really know off the bat
00:20:31.360 so i'm going to look at the call site
00:20:32.880 and be like what can you tell me more
00:20:34.720 about this error and i see this is
00:20:36.720 actually a custom error that we've
00:20:38.240 thrown
00:20:39.120 um
00:20:40.320 someone at one point was like i don't
00:20:42.000 want to see 403s i want to raise this
00:20:44.320 error that tells you more about what's
00:20:46.320 going on that's missing correct access
00:20:49.280 another
00:20:50.960 utility of having a custom error is that you
00:20:52.880 can kind of throw it from different
00:20:54.080 places in the app so we can build an
00:20:55.760 understanding when it's the same root
00:20:57.440 cause
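the custom-error pattern described here can be sketched like this -- the class name matches the talk, but the message and the guard method are illustrative assumptions

```ruby
# A named error instead of a bare 403: the name carries the diagnosis,
# and raising the same class from several call sites lets reports group
# by root cause.
class MissingCorrectAccess < StandardError
  def initialize(msg = "user is not correctly authed into the external service")
    super
  end
end

# Illustrative guard at the HTTP-client boundary.
def check_access!(status)
  raise MissingCorrectAccess if status == 403
  status
end
```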
00:20:59.600 let's move on to what the purpose of
00:21:00.960 this code is okay we have this id we
00:21:02.559 have this custom error missing correct
00:21:04.080 access maybe we are starting to
00:21:05.200 understand what that means
00:21:07.200 we can look at the stack trace and see
00:21:09.120 again there's a worker the
00:21:11.440 project fetcher worker it is trying to
00:21:14.320 get project information and we're
00:21:16.799 seeing this error that we're raising
00:21:19.440 we might need some more information
00:21:21.120 about what the purpose of this worker is
00:21:22.880 like where was it called
00:21:24.559 and our root call site here is just
00:21:27.360 sidekiq
00:21:28.640 not super useful
00:21:30.960 in the case of workers because
00:21:32.960 it is
00:21:34.559 being thrown from sidekiq
00:21:36.880 it can be difficult to know
00:21:39.280 where this is called from so one tool to help
00:21:41.600 you do that is just putting
00:21:43.679 what is calling the worker in the params
00:21:47.120 so breadcrumbs and params are a really
00:21:48.640 good tool another thing that could be
00:21:50.720 useful is to save queries for post
00:21:52.480 requests to help you debug
00:21:55.120 or any other information that you wanted
00:21:57.120 when you were debugging
00:21:58.400 throwing it in the breadcrumbs or params
00:22:00.880 could be really useful
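that tip can be sketched in plain ruby -- sidekiq itself is elided here, and `ProjectFetcherWorker` plus the `enqueued_by` key are illustrative names; the idea is to pass the enqueuing call site along as a job argument so the exception report shows who scheduled the work

```ruby
# Plain-Ruby stand-in for a Sidekiq worker; with Sidekiq you would pass
# these args through perform_async, and they would then appear in the
# exception report's params when the job blows up.
class ProjectFetcherWorker
  def perform(project_id, enqueued_by: "unknown")
    # ... fetch the project from the external API ...
    { project_id: project_id, enqueued_by: enqueued_by }
  end
end

# At the enqueue site (e.g. a daily scheduled task), name yourself:
job = ProjectFetcherWorker.new.perform(42, enqueued_by: "ProjectUpdateTask")
```

with that breadcrumb in place, a root call site of "sidekiq" is no longer a dead end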
00:22:02.720 so someone has already done us a favor
00:22:04.559 we can look at params and we can see
00:22:06.640 okay there's a project update task that
00:22:09.120 is calling this worker that is causing
00:22:11.039 this exception
00:22:12.559 so i look in the heroku scheduler and i
00:22:14.559 see oh cool this update task runs every
00:22:17.280 day so
00:22:19.200 it'll just fix itself tomorrow right
00:22:22.080 the thing is if it's gonna run again
00:22:24.400 it's also going to throw that exception
00:22:26.000 so let's try kicking it off again i see
00:22:28.159 that it's idempotent i try it again and
00:22:31.280 okay this error is still happening i
00:22:32.640 still get a missing correct access
00:22:34.799 and what that means for our frequency is
00:22:36.799 it's a pretty noisy error we see that
00:22:39.280 it's being raised most days and if we
00:22:41.440 think about triage duty that means every
00:22:43.600 day someone's going to be triaging this
00:22:45.760 exception
00:22:46.960 which is painful
00:22:49.120 and so when we're thinking about this
00:22:51.039 error another part that i'm going to
00:22:53.440 tell you is that this is raised when we
00:22:55.919 integrate with kickstarter and when our
00:22:57.919 users are incorrectly authed into this
00:22:59.679 external service this is where we get
00:23:01.360 this error
00:23:03.120 which also means there's not much as a
00:23:04.799 dev i can do i can't really auth for
00:23:07.039 them i don't have their credentials
00:23:10.640 but it's really noisy which is maybe the
00:23:12.799 worst of both worlds this is an
00:23:14.400 exception that i'm just gonna sit there
00:23:15.919 and be like cool here it is again
00:23:18.400 and we integrate with kickstarter a lot
00:23:20.559 so you're going to see this error and
00:23:22.480 any other action this user is trying to
00:23:24.240 do on our app
00:23:26.640 so what can we do
00:23:28.799 how do we reduce the noise of low value
00:23:30.960 exceptions to support our goal of only
00:23:33.600 high-value actionable errors in
00:23:36.000 our backlog
00:23:38.240 one tool we can do is just error
00:23:39.679 grouping
00:23:41.120 Honeybadger by default groups
00:23:44.080 exceptions by the type of error and
00:23:45.679 where it's thrown from but you can
00:23:47.360 redefine a fingerprint to group the same
00:23:50.880 error thrown from two different places when you
00:23:52.400 know it's the same root cause there's
00:23:54.480 a gotcha there where it can
00:23:55.919 affect some statistics or comments but
00:23:57.840 it can be a useful tool
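As a sketch of that regrouping, Honeybadger's Ruby agent lets you set a custom fingerprint in a `before_notify` hook (the error class name here is illustrative; check the agent docs for your version):

```ruby
# Hedged sketch: override Honeybadger's default grouping (error class +
# raise location) so every occurrence of a known root cause lands in one
# error group, regardless of which call site raised it.
Honeybadger.configure do |config|
  config.before_notify do |notice|
    if notice.error_class == "MissingCorrectAccess"
      # same fingerprint => same group in the Honeybadger UI
      notice.fingerprint = "missing-correct-access"
    end
  end
end
```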
00:24:00.000 another thing we can do is the rescue
00:24:02.080 retry pattern we know in this case it's
00:24:04.240 not really going to help us because we
00:24:05.679 know we're going to keep seeing it
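A generic rescue/retry sketch (the helper is hypothetical; note that Sidekiq jobs also get automatic retries, which is part of why this pattern doesn't help for the error above):

```ruby
require 'timeout'

# Retry a block a bounded number of times when a transient error
# (e.g. a timeout) is raised, then re-raise if it keeps failing.
def with_retries(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue Timeout::Error, IOError
    retry if attempts < max_attempts
    raise
  end
end
```

This helps for flaky, transient failures; for an error that is raised deterministically (like a permanently invalid credential), retrying just repeats the noise.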
00:24:07.679 if it has low
00:24:10.000 actionability and you kind of question
00:24:11.679 whether this is even an exception
00:24:13.760 silencing it can be a really useful tool
00:24:15.679 actually we don't want things in our
00:24:18.240 observability tool that we can't do much
00:24:20.240 about or it doesn't help to know that
00:24:21.679 it's happening
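Silencing can be a one-line configuration change. With Honeybadger, something like the following (a hedged sketch; the same can be done via `exceptions.ignore` in honeybadger.yml, and the error class is illustrative):

```ruby
# Keep known low-actionability errors out of the tracker entirely,
# so triage duty only sees errors someone can actually act on.
Honeybadger.configure do |config|
  config.exceptions.ignore += [MissingCorrectAccess]
end
```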
00:24:22.880 we can surface errors to users and allow
00:24:24.640 them to be self-serviceable or
00:24:26.960 train teams and processes with a
00:24:28.559 manual workaround remember the staff and
00:24:32.400 people who work with us are part of our
00:24:34.159 system
00:24:35.200 or if it's widespread and affecting a
00:24:37.520 specific sliver maybe an overlooked
00:24:40.400 data consideration you might actually
00:24:42.000 fix that data
00:24:44.159 i want to show one thing that we did um
00:24:46.320 for this case as i said
00:24:48.640 you can't do a lot of things in our app
00:24:50.320 if you're not authed correctly and so we
00:24:52.799 introduced this idea of a valid platform
00:24:55.279 credential which gives you an
00:24:57.600 object to check whether
00:25:00.400 something's valid and it allows you to
00:25:02.559 have a check before you even run a
00:25:04.400 process that you know is just going to
00:25:06.080 throw an exception down the road
00:25:09.200 we also complemented this with a worker
00:25:11.200 that would update valid platform
00:25:13.440 credential
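The valid-platform-credential idea can be sketched like this (class and method names are hypothetical, not BackerKit's actual code):

```ruby
# A stored validity flag, refreshed periodically by a background worker,
# consulted as a guard before starting work that would otherwise raise
# MissingCorrectAccess downstream.
class PlatformCredential
  def initialize(valid:)
    @valid = valid
  end

  def valid?
    @valid
  end
end

def fetch_project_data(credential)
  # guard: skip work we already know will fail without valid auth
  return :skipped unless credential.valid?
  :fetched
end
```

The guard turns a noisy, recurring exception into an explicit, silent skip, while the complementary worker keeps the validity flag up to date.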
00:25:16.559 i was so interested in exceptions and
00:25:18.720 how other people handled them that i sent
00:25:20.880 out a survey before my talk to
00:25:22.880 understand some other pain points et
00:25:24.400 cetera and i was going to share them
00:25:26.480 with you here but as you can tell we're
00:25:28.480 pretty close to time and i'm kind of
00:25:30.320 about out of time what i'm going to do
00:25:32.400 is have an exception tip of the
00:25:35.200 week on my twitter my handle is down
00:25:37.760 there look out for that
00:25:40.880 if you're interested i actually also
00:25:42.880 learned a bunch of things just from
00:25:44.480 folks at rubyconf so i'm super excited
00:25:47.760 to dive deeper into those
00:25:50.559 thank you all for joining me on this
00:25:52.640 voyage please collect all your
00:25:54.240 belongings on your way out
00:25:56.559 if you have any questions please come
00:25:57.919 find me outside or ask me on discord and
00:26:00.640 if you try anything please at me on
00:26:02.320 twitter i'd love to know how it went for
00:26:03.840 you
00:26:04.799 i want to thank my coworkers who
00:26:06.720 continuously inspire me to keep
00:26:09.039 experimenting
00:26:10.559 ian and lindsey are here right now ian
00:26:12.880 was the originator of badgerbot so
00:26:16.240 if you want to talk to him you should
00:26:18.320 i also want to give a shout out to the
00:26:19.919 wmbcfp working group who gave me a ton
00:26:22.400 of feedback and helped me on this whole
00:26:23.840 process would not be here without you
00:26:26.320 and my parents who flew here my dad
00:26:28.880 who's in the crowd right now i would
00:26:31.440 also not be here without them thanks everyone
00:26:35.679 last slide i had to shout out BackerKit
00:26:37.840 if you're interested in tdd
00:26:39.240 experimentation crowdfunding come talk
00:26:41.760 to us we're hiring
00:26:43.440 thanks everyone