00:00:25.519
I'm Joseph Ruscio, co-founder and CTO of a company called Librato.
00:00:31.439
We do monitoring, and I personally love graphs. The title of this talk,
00:00:37.760
"It's not in production unless it's monitored," is one of my favorite quotes, and so I
00:00:42.879
thought I would actually dig in and try to find out where it came from. As best I can tell,
00:00:49.360
it was Greg, and I'm not even going to pretend to mangle his last name;
00:00:55.360
I'm not quite sure. He's a DevOps engineer, an infrastructure engineer if you will, at Evite.
00:01:02.879
It's kind of interesting that he said this, because Evite is one of the old Web 1.0 properties:
00:01:09.280
it launched in 1998, and they've sent something to the tune of over a billion
00:01:15.600
invitations. Around the time he put this up on Twitter, a year and a half ago,
00:01:22.560
they had just finished completely overhauling their system, moving from Java and Oracle RAC over to things like
00:01:28.960
Python, Google App Engine, and various polyglot NoSQL solutions.
00:01:34.320
And so it made me start thinking: what about the context of that switch, as they were preparing for the
00:01:40.640
next decade, made him say this? If you think about how
00:01:47.360
SaaS was built ten or fourteen years ago: you'd start,
00:01:52.960
you'd get funded, your quote-unquote seed round, to the tune of millions of dollars,
00:01:58.479
and that was because, just to get going, you had a big upfront capital expense: buy a lot of servers,
00:02:03.600
physically rack and stack them, have a dedicated ops team who put them all together, and finally
00:02:10.879
write your own custom software stack. Think of the Googles and the Yahoos, where
00:02:16.080
everything is in-house and everything runs on their own custom hardware.
00:02:21.200
Now, in 2012:
00:02:27.920
your seed round, if you're even doing a seed round and not bootstrapping, is to the tune of twenty thousand dollars, and your infrastructure is a monthly expense,
00:02:34.720
just like your cable bill: you're using Amazon or Rackspace or whatever.
00:02:40.879
If you have an ops person, if you're lucky, it's one person; it might even just be you.
00:02:46.160
And finally, you're using open-source software and external services to build your whole stack.
00:02:55.120
What this means, and it's interesting if you compare the two, is that our infrastructure now is what I
00:03:01.599
like to think of as agile, definitely in the sense of moving quickly and adapting to change. It's
00:03:07.760
ephemeral: servers and instances come and go. When you talk with Amazon,
00:03:14.159
they'll tell you literally that you have to use multiple availability zones, because they reserve the right to take your servers away at
00:03:20.080
any time. So we're now in an environment where we have more change,
00:03:25.680
but we actually have worse tools. Google has amazing tools for
00:03:31.440
monitoring and understanding what's happening inside of Google, but that doesn't do you any good.
00:03:38.640
Because of this, we're now seeing what I like to think of as a renaissance in monitoring.
00:03:45.040
If you look at the companies who are leading it,
00:03:50.879
the Etsys, Flickr, even GitHub, I was trying to find a common thread:
00:03:56.879
what was driving these guys to be so heavily into monitoring while others sit at the other end of the scale?
00:04:03.760
And the one common thread I came up with was continuous deployment.
00:04:10.080
How many people here do continuous deployment? Okay, that's about half. That's
00:04:15.439
good. A quick digression on that: one of the fascinating things I've found
00:04:20.479
about continuous deployment is that it's easy to see the case where you say, oh yeah, we ship all the time,
00:04:25.759
so we don't have the huge three-month release with all these features we ended up building that we didn't have to build, where we wasted
00:04:32.000
all that time. But a lot of times someone will come in and say, okay,
00:04:38.000
we'll be fine, we'll just schedule a release every week. That sounds great at first, but it is
00:04:44.320
kind of a false economy, because if you ship once a week or once every two weeks,
00:04:50.160
that means every week or two you have a day where everyone's scrambling to do
00:04:55.919
the big ship: they're trying to get code in in time to make the ship, or they're trying to figure out why
00:05:01.759
something's not working and the ship is being delayed. So anything other than continuous deployment is
00:05:08.000
really a trade-off between a scheduled waste of time and wasting time on
00:05:13.120
features that you may not have needed. So, we do continuous deployment, and
00:05:20.240
I like to think of five steps to it. The first is continuous integration: as
00:05:26.880
developers, you run tests all the time, so you have the confidence that, hey, I'm pushing new code
00:05:32.080
out, but I know it didn't break anything; there are no regressions.
00:05:37.120
Second, make deploying as cheap as possible: one click, whether it's a Campfire bot or a single button.
00:05:44.320
Make deploys as costless as possible.
00:05:49.360
Third, once everything is deployed, use feature flagging as an additional layer of
00:05:54.479
insulation, where you can bleed a feature out: to your own people internally first,
00:05:59.520
and then to a select percentage of users, until everyone gets it.
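That percentage-based bleed-out is small enough to sketch. This is an illustration, not any particular library's API (in practice you might reach for a feature-flagging gem); the class name and the choice of bucketing by a stable hash are my own:

```ruby
require 'zlib'

# Percentage rollout by stable hash: each user lands in a bucket 0..99
# derived from the flag name and their id, so the same user always gets
# the same answer, and raising the percentage only ever adds users.
class FeatureFlag
  def initialize(name, percentage: 0, allowlist: [])
    @name = name
    @percentage = percentage   # 0..100 rollout
    @allowlist = allowlist     # e.g. your own team's user ids
  end

  def enabled_for?(user_id)
    return true if @allowlist.include?(user_id)

    bucket = Zlib.crc32("#{@name}:#{user_id}") % 100
    bucket < @percentage
  end
end

flag = FeatureFlag.new('new_checkout', percentage: 10, allowlist: [1])
flag.enabled_for?(1)   # always on for the internal allowlist
```

Because the bucket is a stable hash rather than a random draw, nobody flaps in and out of the feature as the rollout percentage grows.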
00:06:06.319
This right here is a great setup already, but the problem is that even this is not enough. Bugs still make
00:06:11.840
it through, and that's where monitoring comes in. I think it's instructive that
00:06:18.479
continuous integration came out of agile, test-driven development: we have these tests, so we're going to run them all the time to reap the benefit.
00:06:25.360
If there's one takeaway, it's that you should really start to think about monitoring and instrumentation as being to operations
00:06:33.600
what unit tests are to development. If we have good monitoring in place, then our
00:06:39.360
ops people will sleep better, because they know: I'll be able to tell instantaneously,
00:06:44.400
once this hits production, that I don't have regressions. I can look at my dashboards and see that there were no
00:06:50.319
regressions. And then the final component:
00:06:55.759
that's all active monitoring, where you're actually visualizing and checking right after the deploy; but there are latent bugs too, so you have to have good alerting, and
00:07:02.880
alerting is definitely a component of monitoring, so that if something does happen six or seven hours after a deploy,
00:07:08.800
you find out. As an example, here's a
00:07:14.240
slide; yeah, it's kind of dark. Travis CI, which is a great continuous
00:07:20.160
integration project, did a post recently on monitoring, and
00:07:25.280
it illustrates the cycle that you go through. You start, and you see a deploy;
00:07:31.120
they're tracking the number of error responses they get, and there's an immediate spike after the deploy, at which point I imagine they were
00:07:37.120
scurrying around their keyboards trying to figure out what was going on. They deployed a fix, and you can see it come back down.
00:07:42.960
But there's still some noise (the green-line noise), so they kept digging in, and you can see where they
00:07:49.520
fixed another one. At every stage they had immediate feedback on their progress in
00:07:54.879
production. So that's a good illustration, I
00:08:01.280
think, and a good driver. If that's not enough (I'm not sure how many hardcore Rubyists there are in here),
00:08:06.720
you can also find the chunky bacon using monitoring.
00:08:13.199
This is actually a graph from the wild, too. I'd like to say I came up with some algorithm to draw that, but
00:08:19.440
it actually happened. So now you know: monitoring leads to bacon. That's our first takeaway; if you didn't
00:08:24.960
get anything else, just think of it that way: monitoring means bacon. Now, you go to Google and say, hey,
00:08:31.120
I want to do monitoring, let me search for monitoring tools, and you get hit with
00:08:36.560
this explosion of all these different services and tools you can use.
00:08:44.880
I put a mix up here. Some of these I personally think are really, really good;
00:08:50.959
some of these I think are not so good. My hope is that by the end of the talk I can leave it as an exercise for the
00:08:56.560
audience to discern which is which for yourself. Now, if you do pick some of the not-so-
00:09:02.640
good ones, you're going to end up with something that looks like this. You'll come in and say, okay, I
00:09:08.320
need to monitor this (and yes, I did pick those names; that's a hint). You'll
00:09:15.120
end up with a system where you say, okay, I want to monitor this one thing, load on this one host, so I'll
00:09:20.160
pull this tool off the shelf. It's got an agent, it's got storage, it's got a UI that you have to configure,
00:09:25.920
and you're all set, until you find something it doesn't do. So you go to Google and find another tool that does that, and you pull that off the shelf,
00:09:32.560
and pretty soon you've got multiple vertically integrated silos: you can't correlate across them, and you have to
00:09:38.000
learn how to use each of them. At this point you reach the stage
00:09:43.680
where you say, you know what? Monitoring sucks. This is really hard. These are huge tools with verbose configuration,
00:09:50.560
configuration as code, and they're designed for extremely long-lived physical hosts.
00:09:57.279
They try to be a jack-of-all-trades, so they don't do anything well, and you invariably need more than one of
00:10:02.560
them. And here's something else: if you're interested in this stuff, "monitoring sucks" is a Twitter hashtag;
00:10:08.480
there's an IRC room; there's even a GitHub repository. It's a whole movement of DevOps people trying to
00:10:15.120
make monitoring better. So we need a better model, and that
00:10:21.040
really gets to the core of the talk, because I want to build up a model of the things you should strive for when you're evaluating monitoring tools or building
00:10:27.519
your own monitoring solution. First, consider the different
00:10:33.440
metric types we have. You're going to be tracking your business drivers. This is probably a small number of things, but
00:10:39.040
it's very important: these are the things that make you money, the things that keep you employed. What are the numbers that, if they go
00:10:44.800
up or down, improve your business?
00:10:49.839
You're going to have your application performance. Generally speaking, this is how well your requests perform: how
00:10:56.240
does your app feel to your customers? Tied to that, there are the system resources used by the
00:11:01.680
application: are there memory leaks, is that what's causing our application to slow down, is a disk full?
00:11:07.760
And there's the network: how many connection attempts are we receiving, what's the load balancer doing?
00:11:15.200
What's interesting is that you're often going to want the ability to cut across all levels of the stack.
00:11:22.160
For example, our business is driven by (you can think of it as) the number of API
00:11:28.240
calls we can handle; we're a volume-based business, so we track
00:11:33.279
very closely the number of API calls we're handling every second, and that
00:11:38.480
is impacted by how many our application can handle, what system resources we're using, and
00:11:43.680
the network. So, before we use
00:11:49.120
one of those monolithic tools, look inside: what does one of those monolithic tools actually look like
00:11:55.040
internally? Generally speaking, there's a collection stage. This happens on every request in your app:
00:12:00.959
something happens, you measure it, and now you have a measurement. This happens on the order of every 50, 100,
00:12:07.120
300 milliseconds, or, if you have one of Elia's web pages, every 15 seconds.
00:12:14.959
You get a lot of these, so for trending, this information is way too dense, both for visualizing
00:12:20.480
and for storing. So there's an aggregation phase, where we roll up all these sub-second measurements into, say,
00:12:28.000
a 10-second or 15-second interval. Then we have to get this to disk somehow;
00:12:33.120
we have to store it somewhere. And then we have different types of analysis, whether it's plain visualization,
00:12:38.480
or learning, or even some type of algorithmic mining.
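That aggregation phase can be sketched in a few lines. This is illustrative only; the 10-second interval and the summary fields are common choices, not any specific tool's:

```ruby
# Roll sub-second measurements up into fixed 10-second buckets,
# keeping a small summary (count/sum/min/max) instead of every raw point.
class Rollup
  Bucket = Struct.new(:count, :sum, :min, :max)

  def initialize(interval = 10)
    @interval = interval
    @buckets = Hash.new do |h, k|
      h[k] = Bucket.new(0, 0.0, Float::INFINITY, -Float::INFINITY)
    end
  end

  # timestamp in whole seconds; value e.g. a request latency in ms
  def record(timestamp, value)
    b = @buckets[(timestamp / @interval) * @interval]  # bucket start time
    b.count += 1
    b.sum += value
    b.min = value if value < b.min
    b.max = value if value > b.max
  end

  def mean(bucket_start)
    b = @buckets[bucket_start]
    b.sum / b.count
  end
end

rollup = Rollup.new
rollup.record(1_000_005, 20.0)   # two raw latencies inside one interval...
rollup.record(1_000_007, 40.0)
rollup.mean(1_000_000)           # ...become one summary point: 30.0
```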
00:12:44.240
As people who maintain and operate software platforms, this kind of diagram should start
00:12:50.399
screaming out at us if we're thinking about monolithic solutions, because it's a prime example of
00:12:57.680
where we can use separation of concerns: we can split these behaviors out and
00:13:03.360
use well-defined interfaces between them, so we can mix and match what we want to do.
00:13:09.200
Digging into some of those: the most important one,
00:13:14.480
or at least the most important place to focus, is probably collection, because if we're going to use monitoring for our
00:13:20.480
operations the way we use tests for our development, then the cost of collection has to be as close to zero as possible.
00:13:26.560
When monitoring sucks, nobody does it, because it sucks and there's a lot of friction. So we need to make it super
00:13:32.959
easy: when new code goes out, it should have instrumentation with it, just like it has tests, and
00:13:39.360
we have to make that cheap. The cheapest way to monitor something
00:13:46.000
you already have is logging, and there are some cool projects you
00:13:51.040
can use; Etsy's Logster is one of them. Think about logs as streams. Log files are
00:13:57.040
interesting, semi-structured text: you've got a big log file and you can throw all kinds of weird queries at it, which is
00:14:02.720
nice. But as the log streams to that file, it's good to think of the things in there you can count: the number of requests
00:14:09.279
in one minute of that stream, or the number of 200s, the number of 500s. There are several projects
00:14:15.120
that will count things in your logs as they're being written and generate statistics off them, which
00:14:21.519
you can then graph, alert on, and do all kinds of interesting things with. Etsy's Logster is one;
00:14:26.560
Logstash is another. (Anywhere I have a github.com URL on a slide, it's a GitHub project.)
00:14:32.880
Logster just parses your existing log files; Logstash actually
00:14:38.079
mimics syslogd and comes with a storage engine. And then, finally, there are a
00:14:43.839
lot of good services too. We love Papertrail, which also looks like syslogd and has integrations with
00:14:50.959
third-party services like ours for graphing. In Rails specifically, there's ActiveSupport
00:14:57.600
Notifications; one of my colleagues gave a talk yesterday on this, and you should definitely check the slides out.
00:15:04.320
It's basically pub/sub instrumentation for Rails 3, which makes it really cheap. First of all,
00:15:09.600
there's a ton of out-of-the-box instrumentation, and then it's really cheap to add new instrumentation to your Rails apps, and you can pipe
00:15:17.519
that stuff, publish it, to multiple places. A couple of cool projects: Mathias Meyer's
00:15:22.639
Lograge uses this to trim your Rails logs, which will go well with anything on the previous slide,
00:15:28.720
and Harness is a neat thing that hooks in and, instead of just writing to
00:15:34.000
logs, publishes these to any third-party service. Another interesting collector for any
00:15:40.240
Ruby project (though you can pull ActiveSupport in outside Rails, too) is
00:15:45.600
Metriks. This is a project that gives you simple primitives like counters, meters, and timers,
00:15:51.600
and makes it really easy to plug into multiple reporting backends. I think it has Graphite support, Librato support,
00:15:57.120
Riemann support.
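To make those primitives concrete, here is an illustrative re-sketch of the counter and timer idea. This is not Metriks's actual code, just the shape of it; in the real library a reporter thread periodically flushes these instruments to a backend like Graphite, Librato, or Riemann:

```ruby
# A counter: a cheap in-process tally you bump on every event.
class Counter
  def initialize = @value = 0
  attr_reader :value

  def increment(by = 1)
    @value += by
  end
end

# A timer: wraps a block, records how long it took, and keeps
# enough state to report count and mean duration later.
class Timer
  def initialize = @durations = []

  def time
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    @durations << Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    result
  end

  def count = @durations.length
  def mean  = @durations.sum / @durations.length
end

requests = Counter.new
render   = Timer.new
requests.increment
render.time { :render_the_page_here }
```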
00:16:02.160
So now you're collecting, and we've made it really cheap to collect. Some of those collectors,
00:16:08.240
like Metriks on the last slide, actually do aggregation as well. But if you're writing
00:16:13.279
something custom, or using a collector that doesn't, you don't want to worry about that yourself, and there are tools you can use. The
00:16:20.399
most well-known one is statsd, which comes out of Etsy as well; they've done
00:16:25.759
a lot of good stuff with monitoring. statsd is really interesting: it's about 319 lines of Node.js,
00:16:32.880
just a little daemon. It supports several different metric types:
00:16:38.480
counters, timers, gauges. What's neat is that it just sits on a port and listens for UDP packets,
00:16:44.240
which means that for anything you're instrumenting, there's an almost zero cost to,
00:16:50.160
in the middle of your request/response cycle, dumping a UDP packet; it's just a memory copy to a kernel buffer.
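Dropping that packet on the wire is a one-liner. statsd's line format is `name:value|type` (`c` for counters, `ms` for timers, `g` for gauges); the metric name below is made up:

```ruby
require 'socket'

# Fire-and-forget: the send is just a copy into a kernel buffer,
# so it's safe in the middle of a request/response cycle even if
# nothing is listening. statsd's default port is 8125.
def statsd_increment(name, host: '127.0.0.1', port: 8125)
  sock = UDPSocket.new
  sock.send("#{name}:1|c", 0, host, port)
ensure
  sock&.close
end

statsd_increment('api.requests')   # emits "api.requests:1|c" over loopback
```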
00:16:56.560
Now, the way we have this set up, the way I like to think about it: you can
00:17:02.000
set one of these statsd daemons up somewhere in your network and point all your servers at it. But given that
00:17:08.240
it's so small and so lightweight, we prefer to think of it almost like a syslogd, where
00:17:15.280
it just runs on every host (this is an example from one of our front-end interfaces) and sits there. So if I bring any
00:17:21.439
new service onto this box, I know I have statsd locally, and I don't have a single point of failure for aggregation in my
00:17:26.720
network. And because the UDP is going over the loopback: I mean, for trending data, if you
00:17:33.280
lose some measurements it's not a big deal, but I'm kind of pedantic, so I like knowing that it goes over the loopback and I
00:17:38.480
have almost zero loss as long as the box is healthy.
00:17:44.160
And besides the UDP being near zero cost, what's really neat about statsd is that it defined a UDP wire
00:17:51.120
protocol; that's the most important thing it did, because now there are a ton of clients for it:
00:17:56.320
an nginx module we use; GitHub put out rack-statsd, so you can get all kinds of Rack-level statistics
00:18:02.559
into it; Shopify has statsd-instrument, which you use at the application level to wrap code blocks and
00:18:09.679
such and get things directly into statsd. What's nice about collecting the same kinds of stats at multiple levels
00:18:16.240
is that when you're trying to debug a network issue and find out where in your stack it's coming from, you can compare the
00:18:21.280
different levels. And because,
00:18:26.559
like I said, it's just a UDP wire protocol, if you don't want to run Node.js, that's fine; it's only 319 lines. This guy at
00:18:33.559
joemiller.me maintains on his blog a comprehensive list,
00:18:38.720
at least as of recently, of all the different statsd server implementations: in Perl, in Ruby, I think
00:18:43.840
there's one in Go. So whatever language you're comfortable managing in production, you can probably
00:18:50.240
find an implementation in it. So: now we've got collection
00:18:57.360
and aggregation being very cheap and low-friction. The next important thing is: how do you get all this data into a central
00:19:04.400
location where you can access it and do your arbitrary correlations?
00:19:09.520
This is where I do think you want to think carefully, because collection and aggregation are
00:19:15.840
streaming, so with open interfaces it's very easy to swap things in and out; storage,
00:19:22.000
being its own component, isn't as easy to swap in and out, because there's persistence involved.
00:19:28.880
The first tool is RRDtool, the round-robin database tool, and this is
00:19:35.280
the default; I think it's 10 or 12 years old,
00:19:40.559
and most of these monolithic solutions actually use it internally: Cacti, Munin.
00:19:47.200
What's nice about it, if you're not aware of it, is that it uses a circular-buffer file to give you a
00:19:52.480
constant guarantee on storage. It writes new values into the buffer, and if
00:19:58.000
you've buffered enough space for 100 measurements, then the 101st measurement will overwrite the first one.
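The circular-buffer idea is small enough to sketch (a toy illustration of the behavior, not RRDtool's actual file format):

```ruby
# Fixed storage forever: measurement N+1 overwrites measurement 1
# once all N slots are full, so disk usage never grows.
class RoundRobinArchive
  def initialize(slots)
    @slots = slots
    @buffer = Array.new(slots)
    @writes = 0
  end

  def push(value)
    @buffer[@writes % @slots] = value
    @writes += 1
  end

  # Oldest-to-newest view of whatever is still retained.
  def values
    return @buffer.compact if @writes < @slots

    start = @writes % @slots   # oldest surviving slot
    @buffer[start..-1] + @buffer[0...start]
  end
end

archive = RoundRobinArchive.new(3)
[1, 2, 3, 4].each { |v| archive.push(v) }
archive.values   # the 4th write evicted the 1st: [2, 3, 4]
```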
00:20:05.039
And it's designed for rollups, so you actually get multiple circular buffers per metric, which means you can have multiple resolutions
00:20:11.360
(you can configure this): raw data,
00:20:16.799
10-second and one-minute rollups, 15-minute rollups.
00:20:22.480
So that was one of the first ones. What's probably more interesting are the next couple of solutions. Graphite
00:20:29.679
is also based around an RRD-style store, Whisper, but it bundles a visualization component with it. It's a separate visualization
00:20:35.679
component, but it makes a lot of sense to bundle it next to your storage, because
00:20:41.600
the big pieces of visualization are just pulling the data: retrieving it from storage and graphing
00:20:46.840
it. So Graphite uses Whisper; it came out of Orbitz (orbitz.com) in 2008.
00:20:54.640
A couple of things to know: it's got a flat, hierarchical namespace, which means it stores plain key-values,
00:21:00.400
but you can use dotted notation in your keys to add dimensions to the name, as long as it forms a
00:21:06.480
proper hierarchy. And to pull graphs out, it supports
00:21:13.039
PNGs and HTTP queries. A couple of things to consider here: this does seem to be kind
00:21:18.559
of the intro tool a lot of people use, and it's nice for that, but you do want to be careful to plan
00:21:24.240
your capacity. It's generally used as a scale-up solution, and it pre-allocates files for your metrics,
00:21:31.520
so based on typical retention settings I usually end up seeing people using about 3.2 megabytes per metric file.
00:21:39.679
If you end up with a lot of metrics: I have seen people in production running Graphite on a 64-gigabyte ramfs;
00:21:47.039
I've seen people with 10 SSD drives together in a RAID 0; big hardware solutions to scale up.
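Those two figures are easy to sanity-check against each other: because Whisper pre-allocates each metric's file up front, capacity planning is just division. A back-of-the-envelope calculation (the ~3.2 MB per-metric figure depends entirely on your retention settings):

```ruby
# Graphite/Whisper capacity: (available bytes) / (bytes per metric file).
BYTES_PER_METRIC = 3.2 * 1024 * 1024   # ~3.2 MB per pre-allocated file
ramfs_bytes      = 64 * 1024**3        # the 64 GB ramfs mentioned above

max_metrics = (ramfs_bytes / BYTES_PER_METRIC).floor
# => 20480 metrics before the ramfs is full
```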
00:21:55.679
So if you think you're going to have that much data, one solution that's pretty neat came out of StumbleUpon: OpenTSDB, the open time-series
00:22:02.640
database. It's based on HBase and Hadoop, and what's neat about it is that it's
00:22:08.400
horizontally scalable: Hadoop, lots of storage. Because of that, it supports multiple
00:22:14.000
dimensions. It uses denormalization: if you want to tag a particular measurement with multiple dimensions, saying I want to look this up on this
00:22:20.960
dimension and on that dimension, say on the hostname as well as the time zone,
00:22:27.280
it does that by writing the measurement multiple times. And it supports HTTP queries.
00:22:32.640
The only downside here is that you have to run Hadoop.
00:22:37.919
Your last option is to use a service. We provide one;
00:22:43.520
there are several others you can find if you google around. Typically (I think this is the best way to do it, and it's what we
00:22:49.600
do) it's JSON over HTTP: you just push measurements to an API. There are additionally, typically, agents
00:22:56.720
and language bindings to make that easy. These services generally do rollups, and
00:23:03.600
they have interactive front ends and visualization.
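The push model is simple enough to sketch. The endpoint, payload shape, and auth header below are invented for illustration; every service defines its own, so check its API docs:

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the kind of payload a hosted-metrics API typically accepts:
# a named measurement with a value and a timestamp.
def measurement_payload(name, value, time = Time.now.to_i)
  { 'name' => name, 'value' => value, 'measure_time' => time }
end

# POST it as JSON (the endpoint URL and X-Api-Key header are hypothetical).
def post_measurement(endpoint, api_key, payload)
  uri = URI(endpoint)
  req = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json',
                                 'X-Api-Key'    => api_key)
  req.body = JSON.generate(payload)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(req)
  end
end

payload = measurement_payload('api.requests.per_sec', 1250, 1_700_000_000)
# post_measurement('https://metrics.example.com/v1/measurements', key, payload)
```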
00:23:09.200
There are a couple of things to think about with visualization. One of the most important is
00:23:14.720
correlation. You want this ability, and it's part of driving towards this common infrastructure: if
00:23:20.840
collection is really cheap, with all these different collectors you can come up with new collectors and get
00:23:26.880
things into a central repository, a Graphite or an OpenTSDB. At that point, whenever you're trying to diagnose
00:23:32.799
something, you have all the data in one place and you can build any graph you want.
00:23:38.960
In addition to putting metrics on that graph, ideally you'll have a solution
00:23:44.080
(this is one of the things we're working on, and some existing solutions have it) with annotations. It's important to
00:23:49.600
also think about events, like deploys, as something you should be monitoring: you should be pushing a stream of them,
00:23:56.799
ideally with context, such as the SHA of the deploy, so that when you're doing a correlation you can overlay your deploys,
00:24:03.440
your network events, or whatever other events you're tracking.
00:24:09.919
And you want arbitrary combinations. A couple of examples: this is an
00:24:15.919
OpenTSDB correlation. It's hard to read the legend (this came right off their front page),
00:24:21.440
but I think it's MySQL delete queries against 99th-percentile performance.
00:24:30.159
This one is a correlation I took out of our own system, for our storage ring.
00:24:35.200
It's correlating read requests into the ring (that's the orange, really periodic, so we have periodic read
00:24:41.360
activity) with disk ops, well, IOPS, and then disk
00:24:47.520
bytes read off disk. You can see those always correlate with the read requests, but at the right end
00:24:52.720
of the graph there's something much bigger happening, so there's probably some other driver of traffic there that we want to
00:24:58.799
look into. Those are the kinds of things that are very, very simple to pull up with correlations.
00:25:07.200
Dashboards are a bigger part of visualization. A lot of tools have correlations, but one
00:25:12.640
of the things I think a lot of them fall down on, and it's hyper-important, is dashboards. The reason is that
00:25:19.360
dashboards are a way, much like a wiki, to have a shared understanding
00:25:24.880
across all levels of your team. A lot of times someone asks you to go do something:
00:25:30.240
go do this, the CEO said we need to do it. Well, why do we need to do that? You don't know; you go do it.
00:25:36.559
If you have your business drivers or your app performance or whatever up on a wall, on a plasma where everybody
00:25:42.799
can see it when they're walking by to get their coffee, then everybody's on the same page. And not only that:
00:25:47.919
everybody's on the same page up to the minute.
00:25:53.039
Also, there's no more sophisticated aberration detection
00:25:58.159
than your marketing guy, who knows nothing about your technology, looking at a dashboard. One of the most
00:26:04.240
interesting things is that this has literally happened to me: I'll be hacking away, and our marketing guy
00:26:09.279
will go, "Joe, why is that? What is that?" And I'll look up and, oh crap, we
00:26:14.799
have to go do something. So what's really cool about having these up is that people will see things.
00:26:22.320
And finally, even if your marketing guy isn't up at 1 a.m. looking at a dashboard when
00:26:28.000
something goes wrong and your alert catches it: if you have a dashboard for
00:26:33.120
everything, it actually provides a firefighting manual. Your expert says, okay, I'm going to build
00:26:38.799
a dashboard for this thing, and these are the most important things for it; when something's wrong, you go look here first.
00:26:44.400
Then you're going to open that up when something's going wrong: open the dashboard for that thing, and you'll see, okay, this one
00:26:50.159
looks really funny, so I'm not a total expert here, but this is where I'm going to go first.
00:26:56.320
So when you're looking at systems (we've focused a lot on this), I think you want something that makes it very easy to template dashboards,
00:27:02.799
make a lot of them, and make them easy to pull up. This example is one of ours; it's
00:27:08.559
for our API. The ones we have up: our average response time, the response times for
00:27:15.120
every POST operation, response times for GET operations, and the number of error and success codes
00:27:20.960
we're returning, broken out by host in the stacked graph below. When any of these shifts (you
00:27:27.039
know, we're dealing with thousands upon thousands of requests per second) things can get messy in a
00:27:32.840
hurry. Dashboards could be a whole other talk. This is a great book;
00:27:38.640
it talks a lot about visualization techniques in general (a very easy read), but also how to design dashboards and what to look for.
00:27:48.399
So the last bit I'm going to talk about is alerting. Oh, sorry, you had a question?
00:28:05.440
Right.
00:28:23.760
Right.
00:28:34.960
Yeah, so the question, in summary, is: if you're using something like a platform-as-a-service, where
00:28:40.720
you don't have as much transparency into things, like whether there's a load-balancer issue on their end, how do you build a common dashboard that says, it's
00:28:47.360
not my service, it's theirs? That's actually one of the hard platform-as-a-service problems,
00:28:54.240
and I think that space is still maturing. I think it's one of the things they're
00:28:59.440
going to get to, but they actually need to give you access. Right now,
00:29:04.559
using Heroku as an example, they do provide heroku logs, though that's one thing
00:29:09.919
that won't help you in the case where their load balancer is dropping packets.
00:29:15.039
So using an external service like a Pingdom or a New Relic is going to give you insight,
00:29:20.320
and that's usually one good approach. Using specialized services is not necessarily bad; I mean,
00:29:26.320
everyone should obviously still use Google Analytics, and if you have a lot of New Relic expertise and you're using it, you
00:29:32.000
should keep using it. I would look for services that give you the ability to pull out the cream of the
00:29:37.360
crop of what they're generating for you and get it into a common place where you can correlate it against other things.
00:29:44.000
So one thing I would look for, with your Pingdom perhaps, is an API you can pull
00:29:49.200
stats out of and get them into common storage: every minute you query Pingdom, get the latest value, and put
00:29:55.360
it into the common canonical repository that's driving your graphs and your dashboards.
00:30:02.320
We can talk more afterwards if you want. So, alerts. The big thing with alerts
00:30:09.279
is that you have to think of them as being alive. Something like
00:30:15.440
disk capacity is pretty easy: set a threshold that says when this disk gets 90% full (it's not magically
00:30:20.720
going to get bigger), let me know; then you can walk away and never touch it again. But for a lot of other things, as your
00:30:26.320
service evolves your thresholds are going to change, so you want to be thinking about ways you can tune them, because
00:30:32.880
there's nothing worse than a noisy alert: first it annoys me, and second I stop caring about it. If my inbox
00:30:39.279
is piling up with alerts that are always firing, I tune them out, and meanwhile something could actually be on fire.
00:30:45.520
With alerts there's different nomenclature for this, but the way I think about it, you want your
00:30:51.840
threshold, which is what actually triggers the alert; that's simple enough. But also look for things like
00:30:57.039
a cancel threshold. Part of this is monkey see, monkey do; I'm still building this into what I'm
00:31:03.519
building myself, but you're going to want a cancel threshold where you say: when the value goes over this, throw me an
00:31:10.559
alert, but do not come to me again unless it goes down below this lower bound and then comes back up.
00:31:16.960
You probably want a rearm window too, which says: even with the cancel threshold, if the alert re-fires within five minutes, don't bother me again.
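A minimal sketch of those two ideas as a little state machine, with illustrative names (this is not any product's real alerting API):

```python
class Alert:
    """Stateful alert with a trigger threshold, a lower cancel
    threshold, and a rearm window to suppress rapid re-fires."""

    def __init__(self, trigger: float, cancel: float, rearm_seconds: float):
        self.trigger = trigger              # fire when value goes above this
        self.cancel = cancel                # re-arm only after dropping below this
        self.rearm_seconds = rearm_seconds  # suppress re-fires inside this window
        self.armed = True
        self.last_fired = None

    def observe(self, value: float, now: float) -> bool:
        """Feed one sample; return True only when a notification should go out."""
        if self.armed and value > self.trigger:
            self.armed = False
            recently = (self.last_fired is not None
                        and now - self.last_fired < self.rearm_seconds)
            if not recently:
                self.last_fired = now
                return True                 # notify
        elif not self.armed and value < self.cancel:
            self.armed = True               # dipped below the lower bound: re-arm
        return False
```

So a value bouncing around just above the trigger only pages once, and a dip-and-spike inside the rearm window stays quiet.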
00:31:24.399
You'll probably also want the ability to apply a function, to say I'm really looking at the exponentially weighted moving average, or the min, or the max,
00:31:32.640
and you probably want to do that over several samples, so look for windowing capabilities too when you're alerting.
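For example, alerting on the exponentially weighted moving average of a sliding window, rather than on each raw sample, keeps one spiky data point from paging anyone. A rough sketch (the smoothing factor and window size are arbitrary choices):

```python
from collections import deque

def ewma(samples, alpha: float = 0.3) -> float:
    """Exponentially weighted moving average over an iterable of samples."""
    avg = None
    for s in samples:
        avg = s if avg is None else alpha * s + (1 - alpha) * avg
    return avg

def breached(window, threshold: float, alpha: float = 0.3) -> bool:
    """Check the smoothed window, not the latest raw value."""
    return ewma(window, alpha) > threshold

window = deque(maxlen=5)  # sliding window of the last 5 samples
```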
00:31:38.080
Another thing for alerting: really look for integrations, especially now with
00:31:44.799
third-party services. We do alerts ourselves, but we just work on your metric stream, and I think you want
00:31:51.440
something that says: you can hook this into your email, your Campfire, your PagerDuty, and not
00:31:56.720
something that says you'd better like our escalation strategy. Myself, I'm in monitoring; I
00:32:03.120
don't know the first thing about escalation, so if you want escalation go use PagerDuty. Whatever tools you have,
00:32:08.799
look for integrations. And the last cool thing, if I could leave
00:32:14.159
you with one more thing: this could be the subject of a whole talk, but even with living alerts and paying attention,
00:32:26.240
there are some very hard scenarios to deal with, like aberrant behavior.
00:32:33.600
If your load is evolving over time and you have some seasonal shifts, like Saturday night my
00:32:41.360
load is way less than Monday morning, how do you deal with that? There's some interesting work that has been done here; I
00:32:47.200
know we're looking into it, and probably other people are too: how do you automatically detect that?
00:32:47.200
One thing to look at is Holt-Winters, a 50-year-old algorithm from the field of
00:32:54.080
time series modeling. The idea is that
00:32:59.360
once you've gotten past the collection point, this is all generic time series data, and there are very advanced mathematical models for it. Guys
00:33:06.159
on Wall Street make money off time series data, so there are very advanced models for prediction. If
00:33:11.679
I apply one of these models to, say, my CPU load average coming in, I can predict
00:33:17.760
what the next value should be. If the next value shows up and it's some number of deviations
00:33:23.039
away from the prediction, or maybe three in a row are, that's something you should start looking at.
00:33:28.640
This particular one is called triple exponential smoothing, and what's interesting about it is that it
00:33:33.840
takes three things into account: a stationary component, a linear trend (so if traffic's growing from eight
00:33:40.720
in the morning to noon), and a seasonal effect, which is, for an e-commerce site,
00:33:46.399
that December is probably way crazier than June.
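The additive form of Holt-Winters fits in a few lines of Python. This is a sketch: the smoothing constants and the naive first-season initialization are illustrative, and a real monitoring system would compare each incoming value against its one-step-ahead forecast and flag large deviations:

```python
def holt_winters_additive(series, period, alpha=0.5, beta=0.3, gamma=0.2):
    """Additive triple exponential smoothing: level (stationary component),
    trend (linear growth), and a repeating seasonal offset of length `period`.
    Returns the one-step-ahead forecasts for series[period:]."""
    level = series[0]
    trend = (series[period] - series[0]) / period       # crude initial slope
    seasonal = [series[i] - level for i in range(period)]
    forecasts = []
    for t in range(period, len(series)):
        y = series[t]
        forecasts.append(level + trend + seasonal[t % period])
        last_level = level
        level = alpha * (y - seasonal[t % period]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (y - level) + (1 - gamma) * seasonal[t % period]
    return forecasts
```

On a perfectly periodic series the forecasts match the data exactly; an observed value several deviations off its forecast is the kind of aberrant point to alert on.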
00:33:52.399
All right, so in conclusion, a couple of takeaways. When you're looking
00:33:58.480
at tools, building tools, or putting things together, try to achieve those separations of concern.
00:34:05.120
Monitoring equals tests: for your ops folks, monitoring is your unit tests, and if you're a dev you probably wouldn't
00:34:10.800
want to live without those, so make it easy for your ops team. For yourself, you want,
00:34:17.200
as much as you can, a single repository that you use as a canonical store of truth, so you can build arbitrary
00:34:22.960
correlations from it. Dashboard all the things: plasmas are cheap, like a couple
00:34:29.359
hundred bucks now, so if you're bootstrapped and very early stage, I get it, but anything past that, the
00:34:34.399
money you'll save just by having a couple of plasmas up and a couple of Mac minis is well worth it.
00:34:40.240
And finally, care for and feed your alerts: have alerts, but constantly tune and update them.
00:34:47.599
And that's it.