00:00:10.719
all right
00:00:12.000
hello everyone
00:00:15.280
our applications live in complex systems
00:00:17.920
with points of failure that span space
00:00:20.960
and time
00:00:22.480
as developers we write code that's
00:00:24.560
executed in systems that we rarely know
00:00:26.720
all the ins and outs of and it would be
00:00:29.039
a poor use of our time to try to
00:00:30.800
anticipate all the ways in which our
00:00:33.280
code can fail
00:00:34.960
and so the errors that our code raises
00:00:37.520
give us feedback on code paths, code
00:00:40.399
paths that have some assumptions baked
00:00:42.719
into them that require a little bit of
00:00:44.960
re-examination
00:00:47.200
it's our superpower to make
00:00:49.360
reasonable assumptions let our tests
00:00:51.680
guide us but also ship fast and allow
00:00:54.399
our systems to tell us the ways in which
00:00:57.280
our system is falling down instead of
00:00:59.359
trying to predict those failures
00:01:03.039
this also means
00:01:05.519
that sometimes we must put on our gear
00:01:08.159
step into our spaceships and identify
00:01:10.799
which invaders are going to cause us
00:01:12.560
harm and which ones are okay to allow to
00:01:15.119
stick around
00:01:17.759
we may all share this understanding of
00:01:20.799
the importance of reacting to uncaught
00:01:22.799
exceptions
00:01:24.560
and yet some of our processes or lack
00:01:27.439
thereof can make this notification
00:01:30.320
still elicit some
00:01:32.159
dread
00:01:35.040
welcome to Schrödinger's Error: Living in
00:01:37.280
the Gray Area of Exceptions. i'm Sweta
00:01:40.720
Sanghvi, i'm a developer at BackerKit, and
00:01:43.600
i've become really interested in why
00:01:45.680
exceptions are so difficult to manage
00:01:48.000
and what we can do about it
00:01:50.079
today i'm going to walk through some
00:01:51.280
process experiments we've tried on my
00:01:53.280
team at BackerKit
00:01:55.040
and learnings we've surfaced and then
00:01:57.680
we're all going to go on to triage duty
00:01:59.680
together and see some tools we can use
00:02:01.920
to help us when we're managing
00:02:03.680
exceptions
00:02:05.439
all right put on your helmets and your
00:02:06.960
space boots
00:02:08.399
keep your arms inside the vehicle at all
00:02:11.120
times
00:02:12.879
so i'm going to lay some groundwork of
00:02:14.640
where we started. at BackerKit we
00:02:16.800
support creators in fulfilling their
00:02:18.720
crowdfunding projects
00:02:20.800
if you've backed a crowdfunding campaign
00:02:22.480
you may have gotten your survey through
00:02:23.840
BackerKit. take a look at that url
00:02:26.560
we're a pretty small lean team
00:02:28.800
our daily flow is to pair up in the
00:02:30.800
morning and work down a backlog or a
00:02:34.080
queue of features
00:02:35.840
chores and bugs
00:02:38.160
we surfaced a problem that though we had
00:02:40.480
a general idea there's utility to
00:02:42.959
reacting to impactful exceptions we
00:02:45.519
weren't super effective at doing so
00:02:49.599
when we started thinking more critically
00:02:51.920
about this we had a pretty light process
00:02:53.760
for addressing exceptions we had a slack
00:02:56.400
integration that surfaced any unresolved
00:02:58.640
or unignored errors that went to a
00:03:01.120
channel
00:03:02.080
and since we were usually pairing
00:03:03.519
someone who was either soloing or had a
00:03:05.200
spare moment might look at the channel
00:03:07.120
and notice when there was something worth
00:03:09.120
addressing
00:03:10.720
but one it required context switching
00:03:12.879
from our prioritized queue work and there
00:03:15.440
was no owner or designated dev and so it
00:03:18.239
ended up being a few folks who were in
00:03:20.000
there regularly and a lot of folks who
00:03:22.000
weren't really looking in that channel
00:03:24.879
we also had a pretty low alignment of
00:03:26.879
what process we wanted
00:03:29.120
to use for managing exceptions and not a
00:03:31.840
shared goal of what effective
00:03:33.840
management was and which exceptions were
00:03:36.000
high priority in addressing or what
00:03:38.239
actions to take in response
00:03:40.560
there was a pretty large backlog of
00:03:42.640
errors and so it was hard to scan it
00:03:45.200
quickly or really surface when there
00:03:47.040
was something that should be addressed
00:03:48.879
coming up
00:03:50.720
and so we started to experiment with
00:03:52.799
some process changes to get better at
00:03:55.280
exception management
00:03:57.439
our first experiment was badger duty,
00:04:00.640
affectionately called. we use Honeybadger
00:04:02.239
as our observability tool, and so,
00:04:05.360
like badger patrol, badger duty was the
00:04:08.159
experiment where each of us owned
00:04:10.400
managing exceptions for the week
00:04:12.720
(you could either solo or
00:04:14.159
pair on it) and tracking how long it
00:04:16.000
took to triage and address any important
00:04:18.639
exceptions
00:04:20.000
this was our first stab at starting to
00:04:21.759
chip away at the errors and also to
00:04:23.600
de-silo exception management
00:04:27.280
as we were all becoming more exposed to
00:04:29.440
exception management we started to
00:04:31.199
surface what our pain points were and
00:04:33.040
why this was difficult for our team
00:04:37.520
one was the priority was unclear
00:04:40.080
it was unclear when um something was
00:04:42.800
urgent when we were just scanning
00:04:44.240
exceptions
00:04:45.520
and when compared to our high value
00:04:47.440
feature work it could feel like low
00:04:49.600
value to be managing or triaging
00:04:52.080
exceptions
00:04:53.280
internally we weren't super aligned on
00:04:55.520
what those high priority exceptions were
00:04:59.440
there were also a lot of exceptions that
00:05:01.440
had low actionability,
00:05:03.360
nor did we have a well-aligned path
00:05:05.840
to what we should be doing about it and
00:05:08.240
so it could lead to feeling a little
00:05:10.479
stuck when we came across an exception
00:05:14.240
we also started surfacing that we
00:05:16.080
weren't aligned on what the goal was
00:05:18.560
and so
00:05:19.520
kind of our takeaway from this
00:05:21.680
first
00:05:22.560
experiment was that having this process
00:05:25.360
helped us start to have a place to
00:05:27.120
iterate from
00:05:28.400
actually just having something that we
00:05:30.080
knew we were doing did ease some of the
00:05:32.240
anxiety of not knowing who was tackling
00:05:34.720
this
00:05:36.479
and it gave us a process to start
00:05:38.880
unearthing where we weren't aligned
00:05:42.160
we started to realize that getting
00:05:44.080
aligned on our goals would be a useful
00:05:46.000
exercise as we kept iterating on this
00:05:49.120
um also badger duty kind of gave us a
00:05:51.759
place like when we were pairing or
00:05:53.199
working on an exception to start digging
00:05:55.360
for finding where we were disagreeing
00:05:57.280
which is actually the first piece to
00:05:59.199
starting to align
00:06:00.960
so i'm going to go through some of the
00:06:02.240
goals we landed on as a team
00:06:04.800
our first goal was that we should be
00:06:06.080
able to see signal through noise when we
00:06:08.400
are managing exceptions
00:06:10.160
it should be clear when there's
00:06:11.440
something that should be addressed and
00:06:13.520
it should surface through or surface
00:06:15.840
above the rest of maybe the more noisy
00:06:18.319
exceptions
00:06:19.520
a piece of this means tackling some of
00:06:21.600
that noise so that the signals can kind
00:06:24.479
of come through
00:06:26.080
our goal is not to get zero exceptions
00:06:28.080
or never raise exceptions or inbox zero
00:06:31.840
and it's also not to, like, fight
00:06:34.319
fires
00:06:35.600
monitoring is where we expect those
00:06:37.759
kinds of things to come through
00:06:40.639
we also had the goal of having a
00:06:43.759
collective ownership of addressing
00:06:45.759
exceptions
00:06:46.880
we saw a lot of utility for all of us to
00:06:49.520
be exposed to the exceptions that
00:06:51.360
were happening in the app it allowed for
00:06:53.759
all of us to be involved in making the
00:06:56.160
solutions and um benefiting from other
00:06:58.960
people's perspectives or what they found
00:07:01.039
or perhaps if they were closer to that
00:07:02.800
piece of the code base
00:07:05.039
it was an opportunity for knowledge
00:07:06.400
share
00:07:08.960
and last our goal was to proactively
00:07:11.199
address exceptions this may be a shared
00:07:12.960
goal of
00:07:14.160
we should see when there's some
00:07:15.759
unfavorable user impact or if a user has
00:07:18.479
gotten themselves into a
00:07:20.720
state that they can't get themselves
00:07:22.479
out of, before we get those user
00:07:24.800
reports that's like the utility of our
00:07:26.720
exception management
00:07:28.560
and so we proposed another process
00:07:30.720
change
00:07:31.759
the next thing we tried is a rotating
00:07:33.919
pair triaging exceptions daily
00:07:37.199
this had the benefit again of having a
00:07:38.960
clear owner so it wasn't like all the
00:07:41.039
team was expected to look at the slack
00:07:43.120
or check in on what's going on
00:07:45.599
it also allowed us because this was a
00:07:47.440
rotating designation it allowed us to
00:07:50.080
make resource decisions at the top of
00:07:51.680
the day
00:07:53.440
looking at schedules or looking at high
00:07:55.039
priority work who is going to work on
00:07:56.879
bugs and also do we have the bandwidth
00:07:58.720
to do so
00:08:00.560
again everyone was exposed to nuances of
00:08:03.360
exception management and the exceptions
00:08:05.039
that were coming through our app which
00:08:06.960
allowed us to drive towards alignment
00:08:12.160
as different people came across the same
00:08:14.000
exceptions and why they were painful we
00:08:16.479
could discuss them and come up with
00:08:17.840
solutions to incorporate the the team's
00:08:20.560
ideas
00:08:22.080
i'm going to go through what it looks
00:08:23.680
like to be on triage duty at BackerKit
00:08:26.960
first at the top you come to
00:08:29.199
you start at the top of the backlog of
00:08:31.280
the unaddressed
00:08:32.959
um errors
00:08:35.039
at the top of the backlog you get a 15
00:08:37.279
minute time box
00:08:38.560
your goal in that 15 minutes is to
00:08:40.320
determine the priority of that bug
00:08:43.279
it basically boils down to three options
00:08:45.760
this is something that should be fixed
00:08:47.920
but i'm not going to do it right now
00:08:49.839
because i have 15 minutes so i'm going
00:08:50.959
to write a bug ticket and put it in the
00:08:52.560
bug section of our sprint
00:08:55.440
of our backlog
00:08:57.200
i'm going to acknowledge it so we sort
00:08:58.800
of came up with the idea of
00:09:00.000
acknowledgement but it's pretty similar
00:09:01.279
to resolve the idea being i've looked at
00:09:03.839
it this is not something we need to
00:09:05.200
really do anything about acknowledge and
00:09:07.279
move on i do want to know the next time
00:09:09.279
it happens
00:09:10.399
or
00:09:11.360
this is something that's worth fixing
00:09:12.880
and it's worth fixing now i'm going to
00:09:14.480
make a bug ticket and i'm going to work
00:09:16.000
on it now
00:09:17.680
for really high priority exceptions
00:09:20.959
a tool that we use to support um
00:09:24.000
bug duty is a dashboard
00:09:26.640
so what this helped us do is to get a
00:09:29.120
quick idea of what exceptions had not
00:09:31.440
been addressed
00:09:32.880
this is pretty much pulling those same
00:09:34.880
errors that we saw in our slack
00:09:36.080
integration into a more user-friendly ui
00:09:40.720
it also really supported this like going
00:09:42.800
down the line of triaging starting at
00:09:44.959
the top and going down and you get some
00:09:46.880
useful information just
00:09:49.120
at a glance
00:09:50.800
this also supported some of those
00:09:52.160
resource decisions of like how deep is
00:09:54.080
our backlog how important is it to make
00:09:56.640
sure someone's looking at
00:09:58.080
exceptions today
00:10:00.399
i also want to talk about the idea of
00:10:02.480
acknowledgement that you can see in
00:10:05.200
badgerbot
00:10:06.560
um
00:10:07.680
so the purpose of acknowledge as i spoke
00:10:09.680
about before is to kind of say we're not
00:10:12.720
going to do something about this
00:10:14.399
as part of our process we also started
00:10:17.440
auto resolving weekly so basically an
00:10:20.480
acknowledgement would expire in a
00:10:23.040
week and so you would see that error
00:10:24.720
again
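
The expiring acknowledgement described here can be sketched in plain Ruby. This is only an illustration of the idea, not badgerbot's actual code: the `Acknowledgement` struct, its fields, `needs_triage?`, and the seven-day constant are all assumed names.

```ruby
# Hypothetical model of an acknowledgement that lapses after a week.
ACK_LIFETIME_SECONDS = 7 * 24 * 60 * 60

Acknowledgement = Struct.new(:error_class, :reason, :acknowledged_at) do
  # True while the acknowledgement still keeps the error out of the triage list.
  def active?(now = Time.now)
    now - acknowledged_at < ACK_LIFETIME_SECONDS
  end
end

# An error needs triage if nobody acknowledged it, or the acknowledgement expired.
def needs_triage?(acknowledgement, now = Time.now)
  acknowledgement.nil? || !acknowledgement.active?(now)
end
```

The `reason` field is where a bucket from the drop-down (for example "low frequency") would be recorded, so the team can later see why errors were acknowledged.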
00:10:26.640
you can see the drop down that comes
00:10:28.079
down when you hit acknowledge and you
00:10:30.000
can see what this kind of represents is
00:10:32.640
us starting to bucket
00:10:34.240
why we might be acknowledging exceptions
00:10:36.399
which actually ends up being really
00:10:37.760
useful as you're having conversations
00:10:39.680
about the priority of different
00:10:40.959
exceptions
00:10:42.079
you can see like yeah there's this get
00:10:44.000
timeout but it's not high priority or
00:10:46.000
it's not in high frequency i'm gonna
00:10:48.320
move on
00:10:49.440
um also some things about like i already
00:10:51.519
see a story about this i'm not gonna
00:10:52.959
make a new one
00:10:54.560
and also you could keep adding new
00:10:57.120
um
00:10:58.079
choices that you wish were here that are
00:10:59.760
not
00:11:01.440
so it kind of allows us to have a way
00:11:03.440
forward for what i think is starting to
00:11:06.079
talk about the gray area of like yes i
00:11:08.959
want to see this exception no i don't
00:11:11.120
want to do anything about it and like
00:11:13.279
being comfortable with that designation
00:11:16.000
and knowing that there are many
00:11:17.440
exceptions that are going to live in
00:11:18.640
that bucket
00:11:23.360
next i want to start talking about some
00:11:25.040
learnings from this next process
00:11:27.600
change that we made
00:11:30.000
one big one
00:11:31.360
especially for me was that exceptions
00:11:32.959
can be solved collectively over time
00:11:35.440
when you have a 15-minute time box
00:11:37.040
you're likely hitting it frequently
00:11:39.680
which means that you have to kind of
00:11:41.360
think about what can i do in 15 minutes
00:11:43.519
that's going to move debugging this
00:11:45.440
exception forward
00:11:47.360
it really reframed how i thought
00:11:49.360
about what it means to fix something
00:11:51.279
because sometimes it can mean that i
00:11:53.519
don't think this is high enough value
00:11:54.880
for me to spend time really digging into
00:11:56.720
this but maybe next time i wish that i
00:11:59.519
had this piece of information i'm going
00:12:01.200
to add it to the breadcrumbs or the
00:12:03.040
params
00:12:04.880
other kinds of
00:12:06.320
small things you can do like refactoring
00:12:08.560
out a Law of Demeter violation so that
00:12:10.399
next time the line from which the error
00:12:12.560
was thrown is more useful
00:12:14.720
um or just like starting a bug ticket
00:12:17.680
and having a place to log what happened
00:12:20.399
what's gonna happen next time and being
00:12:22.399
okay with allowing it to happen again
00:12:24.880
and letting all of that be more data
00:12:26.720
points to support a broader solution
00:12:29.680
it also allows you to try something
00:12:31.680
right now and push forward the
00:12:34.399
exception for the next person to take
00:12:36.079
over next time and start putting
00:12:38.320
patterns into our app for dealing with
00:12:40.560
exceptions that we were seeing a lot of
00:12:45.040
another learning is about prioritizing
00:12:47.040
and writing bug stories
00:12:49.440
writing bug stories that i myself would
00:12:51.279
actually want to pick up is quite the
00:12:53.200
art
00:12:54.399
providing enough context for the next
00:12:56.240
person about what you've discovered
00:12:59.200
but not allowing like any half-baked
00:13:01.440
theories to be in the ticket is
00:13:04.000
really useful and can be difficult
00:13:06.240
things that were useful for us is like
00:13:07.680
links to faults how to reproduce that
00:13:10.000
same exception or other information you
00:13:12.480
uncovered
00:13:13.920
one way to really prove to yourself that
00:13:15.680
you understand how
00:13:17.680
why an exception is being raised is to
00:13:19.920
write a test that shows the same
00:13:22.720
system that you think is causing this
00:13:24.800
exception and seeing that failing test
00:13:26.720
which is a really great thing to link to
00:13:28.160
a bug ticket
00:13:30.160
one other learning from our triage duty
00:13:33.200
kind of process is that we started
00:13:35.839
making a lot of bug tickets but weren't
00:13:38.000
always picking them up and so our bug
00:13:39.920
backlog also grew
00:13:41.680
and so another addition to this process
00:13:43.360
was pulling a set number of bugs over
00:13:46.320
each sprint this kind of alludes to like
00:13:49.199
when you want to tackle your backlog you
00:13:51.920
have to have that investment and you do
00:13:53.600
have to prioritize those bugs
00:13:56.079
and make room in your sprint for them
00:13:59.760
it also serves as a purpose of um
00:14:02.720
if no one is advocating for certain bugs
00:14:04.880
to get into the sprint maybe they
00:14:06.480
weren't that important and letting them
00:14:08.000
fall off can also be a really useful
00:14:09.600
tool
00:14:12.000
another learning is if you look at your
00:14:14.160
top occurring errors you might find that
00:14:16.639
it usually follows a power law
00:14:18.720
and the top several are probably going
00:14:20.639
to take up a chunk of the exceptions and
00:14:23.120
so
00:14:24.160
making a um a deliberate effort to start
00:14:26.560
chipping away at those can be
00:14:28.880
really useful
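
One way to check how concentrated the errors are is to compute the share of all occurrences covered by the top few error groups. The sketch below is a minimal illustration: `top_share` is an assumed helper name, and the counts are made up for the example, not figures from the talk.

```ruby
# Share of all occurrences covered by the `n` most frequent error groups.
def top_share(counts_by_error, n)
  total = counts_by_error.values.sum
  return 0.0 if total.zero?

  counts_by_error.values.max(n).sum.to_f / total
end

# Made-up occurrence counts, for illustration only.
counts = {
  "MissingCorrectAccess" => 600,
  "Net::ReadTimeout" => 250,
  "ActiveRecord::RecordNotFound" => 100,
  "NoMethodError" => 30,
  "ArgumentError" => 20
}

top_two = top_share(counts, 2) # 850 of 1000 occurrences
```

If the top two or three groups cover most of the volume, chipping away at just those removes most of the noise.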
00:14:30.560
and last learning is identifying the
00:14:32.880
levers for efficiency we kind of had
00:14:35.199
this general goal of triaging quicker
00:14:38.000
and moving faster and this process
00:14:40.480
started to show us what was taking a
00:14:42.720
long time for us. exceptions require you
00:14:45.680
to
00:14:46.480
load up a lot of context really quickly
00:14:49.120
um
00:14:50.639
and
00:14:51.600
i think
00:14:53.440
exposing us more to triage duty
00:14:55.600
regularly and building that muscle was
00:14:57.680
really useful for us moving faster
00:15:00.959
also having more paths to actionability,
00:15:03.040
which ended up being one of the reasons
00:15:04.560
why i was going so slow
00:15:07.519
we are going to move on to triage duty
00:15:10.079
so i've pulled two errors out of our app
00:15:12.959
and i wanted to actually go through the
00:15:14.639
process of triaging them together
00:15:17.120
so the first is
00:15:19.600
this Faraday timeout error that we see
00:15:22.000
is being thrown from the project lead
00:15:24.720
fetcher first step understanding the
00:15:27.519
error
00:15:28.880
we see here it's a Net::ReadTimeout
00:15:32.079
if the error that you're seeing is not
00:15:34.000
something you're familiar with really
00:15:35.440
familiarize yourself with the
00:15:37.920
error code or if the http status is new
00:15:40.240
to you what it really means and if it's
00:15:42.560
a custom error looking into the api and
00:15:44.800
understanding when that error is
00:15:47.920
thrown
00:15:50.160
in our case we see that we have a
00:15:52.240
Net::ReadTimeout, a subclass of
00:15:54.560
Timeout::Error, which is raised when a chunk of
00:15:56.800
the response can't be read
00:15:58.880
within the read timeout that
00:16:01.519
you've set, the default of which is 60
00:16:03.680
seconds
00:16:04.959
just for a little refresher we know when
00:16:06.480
we're making an http request we open up
00:16:08.959
our tcp connection send a request over
00:16:11.360
the wire and then read the response back
00:16:14.320
what our error is telling us is that
00:16:16.000
that last step is not happening i bring
00:16:18.480
that up because you might view it
00:16:20.160
differently if it's a get or a post
00:16:22.959
in a post request we really want to
00:16:24.480
know: did we send something over
00:16:26.399
the wire or not and in this case we do
00:16:28.639
have to check versus if you were
00:16:30.720
unable to open the connection
00:16:32.880
at all
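
The distinction between "could not connect" and "connected but could not read the response" shows up in Ruby's standard library as two different error classes. The sketch below uses only `net/http`; `timeout_phase` and `safe_to_repeat?` are assumed helper names for illustration, and the rule inside `safe_to_repeat?` is a simplification of the get-versus-post reasoning above.

```ruby
require "net/http"

# Classify a timeout by the phase of the request it interrupted.
def timeout_phase(error)
  case error
  when Net::OpenTimeout then :connect # the connection was never opened
  when Net::ReadTimeout then :read    # the request may have been sent; no response was read in time
  when Timeout::Error   then :unknown
  end
end

# Repeating is low risk if nothing was sent, or if the request only reads data.
def safe_to_repeat?(error, http_method)
  timeout_phase(error) == :connect || http_method == :get
end

http = Net::HTTP.new("example.com") # builds the client only; no connection is opened
default_read_timeout = http.read_timeout # 60 seconds unless configured otherwise
```

For a post that hit a read timeout, `safe_to_repeat?` returns false, which matches the point above: you have to check whether the request actually went over the wire.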
00:16:34.959
next step let's understand the call site
00:16:37.279
and what this piece of code was meant to
00:16:39.199
do that it was unsuccessful at doing
00:16:41.680
we can see that this error is being
00:16:43.199
thrown from a get method in our client
00:16:45.199
with our Indiegogo integration. if we
00:16:47.680
look at our backtrace we can look down
00:16:51.199
and see okay there's a get campaign
00:16:52.880
method that was expecting to be able to
00:16:55.519
get some information
00:16:57.120
it's being thrown from our project lead
00:16:59.199
fetcher worker all right so we're
00:17:01.360
starting to get some information
00:17:04.000
whatever project lead fetcher needed
00:17:05.839
campaign information for is unsuccessful
00:17:08.079
at doing that
00:17:09.760
we might then look at where this worker
00:17:11.919
is being called from to understand more
00:17:13.760
about the purpose of this code
00:17:15.600
i've done some digging for us and we can
00:17:17.280
see that it's coming from a staff
00:17:18.640
controller action called sync from a
00:17:20.640
method sync and update
00:17:22.640
so again we're starting to get some more
00:17:24.000
information like oh this is a staff
00:17:26.160
action that is exposed this is a get
00:17:28.640
request so that some information that we
00:17:30.799
expect to have we did not and we also
00:17:34.160
know it's coming from something that is
00:17:36.080
exposed to a user so theoretically they
00:17:38.559
could kick the worker off again
00:17:40.559
the next question is how often is this
00:17:42.240
happening so we see that it's 67
00:17:45.280
occurrences in the last four months this
00:17:47.520
is not new it's not super frequent so
00:17:50.000
it's not super noisy it can be useful
00:17:52.240
just to ask yourself is this something
00:17:53.840
that's gone
00:17:54.960
out in a recent deploy and maybe
00:17:56.640
that will change how likely you are to
00:17:58.799
um fix it etc and also just gives us an
00:18:01.919
idea of how noisy it is
00:18:04.640
some other tools for debugging that you
00:18:06.320
might use is to just kick off the
00:18:07.919
process again in our case i kicked this
00:18:10.720
off
00:18:11.600
when i was debugging and hey the next
00:18:13.440
time we did it we were successful
00:18:15.919
try to reproduce the error and also like
00:18:18.080
finding patterns in past occurrences of
00:18:20.320
the error can be really useful
00:18:24.160
so now let's think about priority
00:18:26.640
and this again there is no equation i
00:18:28.799
wish i could give one to you but it's an
00:18:30.720
art two things that you might consider
00:18:32.799
is the frequency
00:18:34.559
as we talked about we want to reduce the
00:18:36.559
noise and so things that have high
00:18:38.559
frequency that even if the user impact
00:18:40.960
is low it might change our priority of
00:18:43.440
addressing that exception or not
00:18:45.039
seeing it again
00:18:46.880
user impact is the other big one we want
00:18:48.880
our users to have really great
00:18:50.160
experiences and so considering what was
00:18:52.799
the impact for the downstream
00:18:55.039
user you might have a higher priority
00:18:57.600
for a page timing out and a user not
00:18:59.280
being able to do something if it's
00:19:01.760
affecting even more users that also
00:19:03.360
might change
00:19:04.880
your
00:19:05.919
appetite for making a change
00:19:08.799
so in our example we see pretty low
00:19:12.000
frequency
00:19:13.200
and we also know that the user has a way
00:19:15.919
to get themselves out of this
00:19:17.200
situation, and they're missing
00:19:19.440
some
00:19:20.240
data they expect to have
00:19:22.720
all right so let's think about what we
00:19:24.080
could do in this case
00:19:25.840
we see that as we talked about the
00:19:27.840
read timeout is something we've configured
00:19:30.960
you could change
00:19:32.880
this this would be a pretty global
00:19:34.960
change
00:19:35.919
that's an option though you can use a
00:19:38.080
retry you can rescue this error and
00:19:40.080
retry and be like i don't want to see
00:19:41.679
this try once more before you
00:19:44.000
throw an exception we can acknowledge it
00:19:46.320
we can say hey this is something that
00:19:47.600
happens in our app we are
00:19:49.600
working with web requests they are going
00:19:51.200
to be flaky
00:19:52.960
we could snooze we could say not only do
00:19:54.720
i not care about this time i don't want
00:19:56.480
to see it for another 100 times
00:19:59.200
or we could handle this case if it's
00:20:00.960
representing some specific user flow
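
The rescue-and-retry option can be sketched with Ruby's `retry` keyword. This is a minimal illustration, not the code from the app in the talk: `with_read_timeout_retry` and `max_attempts` are assumed names, and it rescues the standard library's `Net::ReadTimeout` directly so the example needs no gems. An HTTP client library may wrap this error in its own class, in which case that wrapper is what you would rescue.

```ruby
require "net/http"

# Run the block, retrying on a read timeout, for at most `max_attempts` tries in total.
def with_read_timeout_retry(max_attempts: 2)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue Net::ReadTimeout
    retry if attempts < max_attempts
    raise # out of attempts: let the exception reach the error tracker
  end
end
```

Usage would look like `with_read_timeout_retry { client.get_campaign(id) }`, where `client.get_campaign` is a placeholder. This fits idempotent reads like the get request in this example; for a post you would first want to know whether the original request was sent.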
00:20:03.840
in this case we've kind of determined
00:20:05.919
low user impact low frequency seems like
00:20:08.720
a really good candidate to dismiss and
00:20:10.480
move on
00:20:11.919
good job y'all we've solved our first
00:20:14.240
exception of triage duty let's move on
00:20:16.880
to the second one
00:20:19.440
all right next in our queue is the
00:20:21.520
missing correct access error from the
00:20:23.200
project fetcher worker. we know our first
00:20:25.440
step is understanding the error so
00:20:27.200
missing correct access that's not an
00:20:29.360
error i really know off the bat
00:20:31.360
so i'm going to look at the call site
00:20:32.880
and be like what can you tell me more
00:20:34.720
about this error and i see this is
00:20:36.720
actually a custom error that we've
00:20:38.240
thrown
00:20:39.120
um
00:20:40.320
someone at one point was like i don't
00:20:42.000
want to see 403s i want to raise this
00:20:44.320
error that tells you more about what's
00:20:46.320
going on that's missing correct access
00:20:49.280
another
00:20:50.960
utility of having a custom error is that you
00:20:52.880
can kind of throw it from different
00:20:54.080
places in the app so we can build an
00:20:55.760
understanding when it's the same root
00:20:57.440
cause
00:20:59.600
let's move on to what the purpose of
00:21:00.960
this code is we okay we have this id we
00:21:02.559
have this custom error missing correct
00:21:04.080
access maybe we are starting to
00:21:05.200
understand what that means
00:21:07.200
we can look at the stack trace and see
00:21:09.120
again project or again there's a worker
00:21:11.440
project fetcher worker it is trying to
00:21:14.320
get project information and we're
00:21:16.799
seeing this error that we're raising
00:21:19.440
we might need some more information
00:21:21.120
about what the purpose of this worker is
00:21:22.880
like where was it called
00:21:24.559
and our root call site here is just
00:21:27.360
Sidekiq
00:21:28.640
not super useful
00:21:30.960
in the case for workers, because
00:21:32.960
it is um
00:21:34.559
being thrown from Sidekiq
00:21:36.880
it can be difficult to know what the um
00:21:39.280
where this is called so one tool to help
00:21:41.600
you do that is just putting the
00:21:43.679
what is calling the worker in params
00:21:47.120
so breadcrumbs and params are a really
00:21:48.640
good tool another thing that could be
00:21:50.720
useful is to save queries for post
00:21:52.480
requests to help you debug
00:21:55.120
or any other information that you wanted
00:21:57.120
when you were debugging
00:21:58.400
throwing it in the breadcrumbs or params
00:22:00.880
could be really useful
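
Passing the caller's name along with the job can be sketched as follows. Everything here is a stand-in for illustration: `ErrorContext` is a hypothetical substitute for an error tracker's context or params feature, and `ProjectFetcherWorker#perform` with an `enqueued_by` argument is an assumed shape, not the worker from the talk.

```ruby
# Hypothetical stand-in for an error tracker's per-job context; not a real API.
module ErrorContext
  def self.current
    Thread.current[:error_context] ||= {}
  end

  def self.add(data)
    current.merge!(data)
  end

  def self.clear
    Thread.current[:error_context] = {}
  end
end

# The caller passes its own name as a job argument, so it travels with the job
# and appears next to any exception the job raises.
class ProjectFetcherWorker
  def perform(project_id, enqueued_by)
    ErrorContext.add(project_id: project_id, enqueued_by: enqueued_by)
    fetch_project(project_id)
  end

  private

  def fetch_project(project_id)
    # The real fetch would go here.
    { id: project_id }
  end
end
```

With this in place, a backtrace that bottoms out in the job runner still tells you which task or controller action enqueued the work.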
00:22:02.720
so someone already has done us a favor
00:22:04.559
we can look at params and we can see
00:22:06.640
okay there's a project update task that
00:22:09.120
is calling this worker that is causing
00:22:11.039
this exception
00:22:12.559
so i look in Heroku Scheduler and i
00:22:14.559
see oh cool this update task runs every
00:22:17.280
day so
00:22:19.200
it'll just fix itself tomorrow right
00:22:22.080
the thing is if it's gonna raise again
00:22:24.400
it's also going to throw that exception
00:22:26.000
so let's try kicking it off again i see
00:22:28.159
that it's idempotent. i try it again and
00:22:31.280
okay this error is still happening i
00:22:32.640
still get a missing correct access
00:22:34.799
and what that means for our frequency is
00:22:36.799
it's a pretty noisy error we see that
00:22:39.280
it's being raised most days and if we
00:22:41.440
think about triage duty that means every
00:22:43.600
day someone's going to be triaging this
00:22:45.760
exception
00:22:46.960
which is painful
00:22:49.120
and so when we're thinking about this
00:22:51.039
error another part that i'm going to
00:22:53.440
tell you is that this is raised when we
00:22:55.919
integrate with Kickstarter and when our
00:22:57.919
users are incorrectly authed into this
00:22:59.679
external service this is where we get
00:23:01.360
this error
00:23:03.120
which also means there's not much as a
00:23:04.799
dev i can do. i can't really auth for
00:23:07.039
them i don't have their credentials
00:23:09.200
um
00:23:10.640
but it's really noisy which is maybe the
00:23:12.799
worst of both worlds this is an
00:23:14.400
exception that i'm just gonna sit there
00:23:15.919
and be like cool here it is again
00:23:18.400
and we integrate with Kickstarter a lot
00:23:20.559
so you're going to see this error on
00:23:22.480
any other action this user is trying to
00:23:24.240
do on our app
00:23:26.640
so what can we do
00:23:28.799
how do we reduce the noise of low value
00:23:30.960
exceptions to support our goal of only
00:23:33.600
high value, actionable errors
00:23:36.000
in our backlog
00:23:38.240
one tool we can use is just error
00:23:39.679
grouping
00:23:41.120
um Honeybadger by default groups
00:23:44.080
exceptions by the type of error and
00:23:45.679
where it's thrown from but you can
00:23:47.360
redefine a fingerprint to throw the same
00:23:50.880
error from two different places when you
00:23:52.400
know that's the same root cause there's
00:23:54.480
a gotcha there where it can
00:23:55.919
affect some statistics or comments but
00:23:57.840
it can be a useful tool
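
The grouping idea can be modelled in plain Ruby. The sketch does not use Honeybadger's API; `default_group_key` and `root_cause_fingerprint` are assumed names that only imitate the two behaviours described: grouping by error type plus location, versus grouping by a fingerprint you choose.

```ruby
require "digest"

class MissingCorrectAccess < StandardError; end

# Imitates default grouping: error class plus the line it was raised from.
def default_group_key(error)
  location = error.backtrace&.first.to_s
  Digest::SHA1.hexdigest("#{error.class.name}:#{location}")
end

# Custom fingerprint: the same root cause groups together wherever it is raised.
def root_cause_fingerprint(error)
  Digest::SHA1.hexdigest(error.class.name)
end

# Small helper that returns the raised error instead of propagating it.
def capture
  yield
rescue StandardError => e
  e
end

from_fetcher = capture { raise MissingCorrectAccess, "fetching project" }
from_sync    = capture { raise MissingCorrectAccess, "syncing project" }
```

Here the two errors get different default keys because they were raised on different lines, but share one root-cause fingerprint, so they would show up as a single item during triage.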
00:24:00.000
other things we can do is the rescue
00:24:02.080
retry pattern we know in this case it's
00:24:04.240
not really going to help us because we
00:24:05.679
know we're going to keep seeing it
00:24:07.679
if it has low
00:24:10.000
actionability and you kind of question
00:24:11.679
is this even an exception we should see,
00:24:13.760
silencing it can be a really useful tool
00:24:15.679
actually we don't want things in our
00:24:18.240
observability tool that we can't do much
00:24:20.240
about or it doesn't help to know that
00:24:21.679
it's happening
00:24:22.880
we can surface errors to users and allow
00:24:24.640
them to be self-serviceable, or train
00:24:26.960
teams and processes with a
00:24:28.559
manual workaround. remember our staff,
00:24:31.440
our
00:24:32.400
people who work with us are part of our
00:24:34.159
system
00:24:35.200
or if it's widespread and affecting a
00:24:37.520
specific sliver maybe an overlooked
00:24:40.400
data consideration you might actually
00:24:42.000
fix that data
00:24:44.159
i want to show one thing that we did um
00:24:46.320
for this case as i said
00:24:48.640
you can't do a lot of things in our app
00:24:50.320
if you're not authed correctly, and so we
00:24:52.799
introduced this idea of a valid platform
00:24:55.279
credential which allows you to have an
00:24:57.600
object to check on, to check if
00:25:00.400
something's valid and it allows you to
00:25:02.559
have a check before you even run a
00:25:04.400
process that you know is just going to
00:25:06.080
throw an exception down the road
00:25:09.200
we also complemented this with a worker
00:25:11.200
that would update valid platform
00:25:13.440
credential
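
The check-before-you-run idea can be sketched like this. It is an illustration under assumptions, not the implementation from the talk: `PlatformCredential`, its `valid` flag, and `fetch_project` are hypothetical, and marking the credential invalid inside the rescue stands in for the separate updating worker mentioned above.

```ruby
# Hypothetical record of whether a user's external-platform auth still works.
PlatformCredential = Struct.new(:platform, :valid) do
  def valid?
    valid
  end
end

class MissingCorrectAccess < StandardError; end

# Skip work that is known to fail instead of raising the same exception every day.
def fetch_project(credential)
  return :skipped_invalid_credential unless credential.valid?

  yield
rescue MissingCorrectAccess
  credential.valid = false # remember the failure so later jobs can skip
  :marked_invalid
end
```

The first failure is still recorded once, but subsequent runs return early, so the daily task stops filling the triage backlog with an error a developer cannot fix. The invalid state is also something the app can show to the user so they can re-authenticate.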
00:25:16.559
i was so interested in exceptions and how
00:25:18.720
other people handled them that i sent
00:25:20.880
out a survey before my talk to
00:25:22.880
understand some other pain points et
00:25:24.400
cetera and i was going to share them
00:25:26.480
with you here but as you can tell we're
00:25:28.480
pretty close to time and i'm kind of
00:25:30.320
right out of time what i'm going to do
00:25:32.400
is have an exception tip of the
00:25:35.200
week on my twitter my handle is down
00:25:37.760
there look out for that if you are going
00:25:40.159
to
00:25:40.880
or if you're interested i actually also
00:25:42.880
learned a bunch of things just from
00:25:44.480
folks at RubyConf so i'm super excited
00:25:47.039
to
00:25:47.760
dive deeper into here
00:25:50.559
thank you all for joining me on this
00:25:52.640
voyage please collect all your
00:25:54.240
belongings on your way out
00:25:56.559
if you have any questions please come
00:25:57.919
find me outside or ask me on discord and
00:26:00.640
if you try anything please at me on
00:26:02.320
twitter i'd love to know how it went for
00:26:03.840
you
00:26:04.799
i want to thank my coworkers who
00:26:06.720
continuously inspire me to keep
00:26:09.039
experimenting
00:26:10.559
ian and lindsey are here right now ian
00:26:12.880
was the originator of badgerbot so
00:26:16.240
if you want to talk to him you should
00:26:18.320
i also want to give a shout out to the
00:26:19.919
WNB.rb CFP working group who gave me a ton
00:26:22.400
of feedback and helped me on this whole
00:26:23.840
process would not be here without you
00:26:26.320
and my parents who flew here my dad
00:26:28.880
who's in the crowd right now would also
00:26:31.440
not be there without us thanks everyone
00:26:35.679
last slide i had to shout out BackerKit.
00:26:37.840
if you're interested in TDD,
00:26:39.240
experimentation crowdfunding come talk
00:26:41.760
to us we're hiring
00:26:43.440
thanks everyone