
It's Not in Production Unless it's Monitored

by Joseph Ruscio

In this talk, presented at RailsConf 2012, Joseph Ruscio focuses on the importance of monitoring in modern software development, particularly for teams adopting agile and continuous deployment practices. Ruscio emphasizes that in today's tech landscape even small startups operate on a much leaner infrastructure model than in the earlier days of SaaS. He covers several aspects of monitoring:

  • Transition to Modern Infrastructure: The evolution from requiring large upfront investments in dedicated hardware and custom software to utilizing cloud services like AWS and open source tools.

  • The Importance of Monitoring: Monitoring is akin to running unit tests for operations; it not only helps detect changes and bugs post-deployment but also provides crucial insights into business and technical performance.

  • Continuous Deployment: Advocates for continuous deployment, highlighting its benefits in minimizing release cycles and avoiding large feature dumps.

  • Collecting Metrics: Discusses the types of metrics to monitor, including business drivers, application performance, system resources, and network activity. He introduces tools such as StatsD for easy metric collection and aggregation.

  • Visualization and Dashboards: Stresses the need for effective visualization of metrics through dashboards that give the entire team situational awareness and enable quick reaction to issues.

  • Alerts and Active Monitoring: Stresses the importance of alerting systems that adapt as the environment changes, avoiding alert fatigue by configuring sensible trigger and cancel thresholds.

Ruscio illustrates his points with examples from companies like Etsy and GitHub, which are known for their robust monitoring practices. He also suggests a variety of tools and approaches for building a monitoring system that lets a business correlate different data points and insights. In conclusion, he asserts that monitoring should be integrated as seamlessly as testing within the development process to ensure smooth operations and robust software performance.

In the 21st century, successful teams are data-driven. We'll present a complete introduction to everything you need to start monitoring your service at every level, from business drivers to per-request metrics in Rails/Rack, down to server memory/CPU. The talk provides a high-level overview of the fundamental components that comprise a holistic monitoring system, then drills into real-world examples with tools like ActiveSupport::Notifications, statsd/rack-statsd, and CollectD. It also covers best practices for active alerting on custom monitoring data.

Help us caption & translate this video!

http://amara.org/v/FGic/

Rails Conf 2012

00:00:25.519 I'm Joseph Ruscio, co-founder and CTO of a company called Librato. We do monitoring, and I personally love graphs. The title of this talk, "it's not in production unless it's monitored," is one of my favorite quotes, so I thought I'd dig in and try to find out where it came from. As best I can tell it was Greg — I'm not even going to pretend to mangle his last name — a devops engineer, or infrastructure engineer if you prefer, at Evite.

00:01:02.879 It's interesting that he's the one who said this, because Evite is one of the old Web 1.0 properties. It launched in 1998, and they've sent something to the tune of over a billion invitations. Around the time he put this up on Twitter, about a year and a half ago, they had just finished completely overhauling their system, moving from Java and Oracle RAC over to things like Python, Google App Engine, and various polyglot NoSQL solutions.
00:01:34.320 That got me thinking about the context of that switch: what about preparing for the next decade made him say this? Think about how SaaS was built ten or fourteen years ago. You'd get funded — your quote-unquote seed round — to the tune of millions of dollars, because just to get going you had a big upfront capital expense: buy a lot of servers, physically rack and stack them, hire a dedicated ops team to put it all together, and finally write your own custom software stack. The Googles and Yahoos, where everything is in-house and everything runs on their own custom hardware.

00:02:21.200 Now, in 2012, your seed round — if you're even doing a seed round and not bootstrapping — is to the tune of twenty thousand dollars. Your infrastructure is a monthly expense, just like your cable bill; you're using Amazon or Rackspace or whatever. If you're lucky you have an ops person — it might even just be you. And finally, you're using open source software and external services to build your whole stack.
00:02:55.120 What this means, if you compare the two, is that our infrastructure is now what I like to think of as agile — agile in the sense of moving quickly and adapting to change. It's ephemeral: servers and instances come and go. When you talk with Amazon they'll tell you literally that you have to use multiple availability zones, because they reserve the right to take your servers away at any time. So we're now in an environment where we have more change but actually worse tools — Google has amazing tools for monitoring and understanding what's happening inside of Google, but that doesn't do you any good.
00:03:38.640 Because of this, we're seeing what I like to think of as a renaissance in monitoring. If you look at the companies leading it — the Etsys, Flickr, even GitHub — I tried to find a common thread driving them to be so heavily into monitoring while others sit at the other end of the scale, and the one common thread I came up with was continuous deployment. How many people here do continuous deployment? Okay, that's about half — that's good.

00:04:15.439 A quick digression on that. One of the fascinating things about continuous deployment is that it's easy to see the case where you say, "we ship all the time, so we don't have the huge three-month release full of features we ended up building that we didn't need to build, where we wasted all this time." But a lot of times someone will come in and say, "we'll be fine, we'll just schedule a release every week." That sounds great at first, but it's a false economy: if you ship once a week or once every two weeks, then every week or two you have a day where everyone's scrambling to do the big ship — trying to get code in in time to make it, or trying to figure out why something's not working and why the ship is being delayed. So anything other than continuous deployment is really a trade-off between a scheduled waste of time and wasting time on features you may not have needed.
00:05:20.240 So we do continuous deployment, and I like to think of it as five steps. The first is continuous integration: as developers you run tests all the time, so you have the confidence that when you push new code out, it didn't break anything and there are no regressions. Second, make deploying as cheap as possible — one click, whether it's a Campfire bot or a single button; make deploys as costless as you can. Then, once a change is deployed, use feature flagging as an additional safeguard, so you can bleed users onto it: your own people first, then a select percentage of users, until everyone gets it.
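
To make the feature-flagging step concrete, here is a minimal, hypothetical Ruby sketch of a percentage-based rollout — it is not code from the talk, and the `staff?` helper and the constant are assumptions; in practice you would likely reach for a gem or a flags table.

```ruby
# Hypothetical percentage-based feature flag: staff first, then a slice of
# users, then (at 100) everyone. Names are illustrative, not from the talk.
ROLLOUT_PERCENTAGE = { "new_dashboard" => 10 } # 10% of users

def feature_enabled?(feature, user)
  return true if user.staff?                 # dogfood on your own team first
  pct = ROLLOUT_PERCENTAGE.fetch(feature, 0)
  (user.id % 100) < pct                      # stable per-user bucketing
end
```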
00:06:06.319 That right there is already a great setup, but even this is not enough — bugs still make it through, and that's where monitoring comes in. I think it's instructive that continuous integration came out of agile, test-driven development: we have these tests, so we run them all the time to reap the benefit. If there's one takeaway, it's that monitoring and instrumentation are to operations what unit tests are to development. If we have good monitoring in place, our ops people sleep better, because they know they'll be able to tell instantaneously, once code hits production, whether there are regressions — they can look at the dashboards and see.

00:06:55.759 That's active monitoring: you're visualizing and checking right after the deploy. But there are also latent bugs, and for those you need good alerting — alerting is definitely a component of monitoring. If something does happen six or seven hours after the deploy, you find out.
00:07:08.800 As an example, here's a slide — it's kind of dark — from Travis CI, a great continuous integration project, which did a post recently on monitoring. It illustrates the cycle you go through: they're tracking the number of error responses they get, and there's an immediate spike right after a deploy, at which point I imagine they were scurrying around their keyboards trying to figure out what was going on. They deployed a fix and you can see it come back down, but there's still some noise on the green line, so they kept digging, and you can see where they fixed another issue. At every stage they had immediate feedback on their progress in production.

00:07:54.879 That's a good illustration and, I think, a good driver. If that's not enough — I'm not sure how many hardcore Rubyists are in here — you can also find the chunky bacon using monitoring. This is actually a graph from the wild; I'd like to say I came up with some algorithm to draw that, but it actually happened. So now you know: monitoring leads to bacon. If you don't take anything else away, just think of it that way — monitoring means bacon.
00:08:31.120 You go to Google and say, "hey, I want to do monitoring, let me search for monitoring tools," and you get hit with an explosion of different services and tools you can use. I put a mix up here: some of these I personally think are really, really good, and some I think are not so good. My hope is that by the end of the talk I can leave it as an exercise to the audience to discern which is which for yourselves.

00:09:02.640 If you do pick some of the not-so-good ones — and I did pick those names deliberately; that's a hint — you're going to end up with something that looks like this. You come in and say, "I need to monitor load on this one thing," so you pull a tool off the shelf: it's got an agent, it's got storage, it's got a UI you have to configure, and you're all set — until you find something it doesn't do. So you go back to Google, find another tool that does that, pull it off the shelf, and pretty soon you've got multiple vertically integrated silos: you can't correlate across them and you have to learn how to use each of them.

00:09:38.000 At this point you reach the stage where you say, "you know what, monitoring sucks." It's really hard: there are these huge tools with verbose configurations, they're designed for extremely long-lived physical hosts, they try to be a jack-of-all-trades and don't do anything well, and you invariably need more than one of them. If you're interested in this stuff, "monitoring sucks" is a Twitter hashtag; there's an IRC room and even a GitHub repository, and it's a whole movement of devops people trying to make monitoring better.
00:10:21.040 So we need a better model, and that really gets to the core of this talk: I want to build up a model of what you should think about when you're evaluating monitoring tools or building your own monitoring solution, and the things you should strive for.

00:10:33.440 First, consider the different metric types we have. You're going to be tracking your business drivers — probably a small number of things, but very important, because these are the things that make you money and keep you employed. What are the numbers that, when they move, improve your business? You're going to have your application performance: generally speaking, how well your requests are served and how your app feels to your customers. Tied to that are the system resources used by the application — are there memory leaks, is that what's causing the application to slow down, is a disk full? And there's the network: how many connection attempts are we receiving, what's the load balancer doing?

00:11:15.200 What's interesting is that you often want the ability to cut across all levels of the stack. Our business, for example, is driven by the number of API calls we can handle — we're a volume-based business — so we track very closely the number of API calls we serve every second, and that is impacted by how many requests our application can handle, what system resources are available, and the network.
00:11:49.120 So what does one of those monolithic tools actually look like inside? Generally speaking there's a collection stage: something happens on every request in your app, you measure it, and now you have a measurement. That happens on the order of every 50, 100, 300 milliseconds — or, if you have one of Elia's web pages, every 15 seconds. You get a lot of these, and for trending this information is far too dense, both for visualizing and for storing, so there's an aggregation phase where all these sub-second measurements get rolled up into, say, 10- or 15-second intervals. Then we have to get this to disk somehow — we have to store it somewhere — and finally there are different types of analysis, whether that's plain visualization, learning, or some kind of algorithmic mining.
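
As a rough illustration of the aggregation phase described here — rolling sub-second measurements up into fixed intervals — here is a small Ruby sketch; the field names and structure are illustrative, not taken from any particular tool.

```ruby
# Roll raw [timestamp, value] measurements up into fixed-interval buckets
# before storing or graphing them. Purely illustrative.
def rollup(measurements, interval = 10)
  measurements
    .group_by { |time, _value| (time.to_i / interval) * interval }
    .map do |bucket_start, samples|
      values = samples.map { |_time, value| value }
      { time:  Time.at(bucket_start),
        count: values.size,
        mean:  values.sum / values.size.to_f,
        min:   values.min,
        max:   values.max }
    end
end

now = Time.now
raw = [[now, 120], [now + 1, 95], [now + 12, 310]]   # request timings in ms
rollup(raw).each { |bucket| puts bucket.inspect }
```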
00:12:44.240 As people who maintain and operate software platforms, this kind of diagram should start screaming out at us when we look at monolithic solutions, because it's a prime example of where we can use separation of concerns: we can split these behaviors out and use well-defined interfaces between them, so we can mix and match what we want.

00:13:09.200 Digging into those stages, the most important place to focus is probably collection, because if we're going to use monitoring for our operations the way we use tests for our development, then the cost of collection has to be as close to zero as possible. When monitoring sucks, nobody does it, because there's a lot of friction. We need to make it super easy: when new code goes out, it should have instrumentation with it, just like it has tests, and we have to make that cheap.
00:13:46.000 The cheapest way to monitor something you already have is logging, and there are some cool projects you can use — Etsy's Logster is one of them. Think about logs as streams. Log files are interesting semi-structured text: you've got a big log file and you can throw all kinds of weird queries at it, which is nice, but as data streams into that file it's useful to think, "there are things in there I can count" — the number of requests in one minute of that stream, the number of 200s, the number of 500s. There are several projects that count things and generate statistics as your logs are being written, which you can then graph, alert on, and do all kinds of interesting things with.

00:14:21.519 Etsy's Logster is one; Logstash is another — anywhere I show a GitHub URL on a slide, the project lives on GitHub. Logster parses your existing log files; Logstash actually mimics syslogd and comes with a storage engine. And there are a lot of good services too — we love Papertrail, which also looks like syslogd and integrates with third-party services like ours for graphing.
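
To give a feel for the "logs as streams" idea, here is a minimal Ruby sketch that counts response codes out of a log file. It is not how Logster or Logstash are implemented, and the `status=` pattern is an assumption about the log format:

```ruby
# Count HTTP status codes appearing in a log. Real log-parsing tools do this
# continuously as the log streams in; this one-shot version is illustrative.
counts = Hash.new(0)

File.foreach("log/production.log") do |line|
  counts[$1] += 1 if line =~ /status=(\d{3})/   # assumes "status=200"-style lines
end

puts counts.inspect   # e.g. {"200"=>9812, "404"=>37, "500"=>4}
```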
00:14:57.600 notifications one of my colleagues gave a talk yesterday on this you should definitely check the slides out
00:15:04.320 but basically it's a pub sub instrumentation for rails 3 which makes it really cheap first of all
00:15:09.600 there's a ton of out of the box instrumentation and then it's really cheap to add new instrumentation your rails apps and you can up you can pipe
00:15:17.519 that stuff publish it to multiple place a couple cool projects matthias myers
00:15:22.639 log rage uses this to trim your rails logs which will go well with anything in the previous slide
00:15:28.720 and then harness is a neat thing to hook in and then actually instead of just going to
00:15:34.000 logs publish these to any third-party service another interesting collector for any
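
Here is a small sketch of what the ActiveSupport::Notifications approach looks like in practice. The `process_action.action_controller` event and the subscribe/instrument calls are standard Rails 3+ API; the custom event name and the simple `puts` reporting are illustrative stand-ins for whatever backend you forward to:

```ruby
# Subscribe to Rails' built-in controller instrumentation and report timings.
# The reporting side (here just a puts) is where you would hand off to StatsD,
# a metrics library, or a hosted service.
ActiveSupport::Notifications.subscribe("process_action.action_controller") do |_name, start, finish, _id, payload|
  duration_ms = ((finish - start) * 1000).round
  puts "#{payload[:controller]}##{payload[:action]} " \
       "status=#{payload[:status]} duration=#{duration_ms}ms"
end

# Adding your own instrumentation is just as cheap:
ActiveSupport::Notifications.instrument("invite.sent", plan: "free") do
  # ... the work being measured ...
end
```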
00:15:40.240 Another interesting collector for any Ruby project — although you can use ActiveSupport::Notifications outside Rails too — is Metrics, a project that gives you simple primitives like counters, meters, and timers and makes it really easy to plug them into multiple reporting backends; I think it has Graphite, Librato, and Riemann support.
00:16:02.160 So now you're collecting, and we've made it really cheap to collect. Some of those collectors — the Metrics project on the last slide, for instance — also do aggregation, but if you're writing something custom, or using a collector that doesn't, you don't want to worry about that yourself, and there are tools you can use. The best known is StatsD, which also comes out of Etsy — they've done a lot of good work in monitoring. StatsD is really interesting: it's about 319 lines of Node.js, just a little daemon. It supports several metric types — counters, timers, gauges — and what's neat is that it just sits on a port and listens for UDP packets. That means anything you're instrumenting has almost zero cost: in the middle of your request/response cycle you dump a UDP packet, which is just a memory copy to a kernel buffer.

00:16:56.560 The way we have this set up — you can set up one StatsD daemon somewhere on your network and point all your servers at it, but given that it's so small and lightweight, we prefer to treat it almost like syslogd: install it so it runs on every box. This example is one of our front-end interfaces, and StatsD just sits there, so if I bring any new service onto this box I know I have StatsD locally and I don't have a single point of failure for aggregation in my network. And the UDP goes over the loopback — for trending data, losing some measurements is not a big deal, but I'm kind of pedantic, so I like knowing that it goes over the loopback and I have almost zero loss as long as the box is healthy.
00:17:44.160 Besides the UDP being essentially free, what's really neat about StatsD is that it defined a UDP wire protocol — that's the most important thing it did — so there are a ton of clients for it. There's an nginx module; GitHub put out rack-statsd, so you can get all kinds of Rack-level statistics into it; and Shopify has statsd-instrument, which lets you wrap code blocks at the application level and get things directly into StatsD. What's nice about collecting the same kind of stats at multiple levels is that when you're trying to debug a network issue and find out where in your stack it's coming from, you can compare them.

00:18:26.559 And because it's just a UDP wire protocol, if you don't want to run Node.js that's fine — it's only 319 lines. Joe Miller (joemiller.me) maintains on his blog a comprehensive, recently updated list of StatsD server implementations — Perl, Ruby, I think there's one in Go — so whatever language you're comfortable managing in production, you can probably find an implementation.
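
Since StatsD's main contribution is the wire format itself, here is a bare-bones Ruby sketch of emitting metrics to a local daemon over UDP. The `name:value|type` format (`c` for counters, `ms` for timers, `g` for gauges) is the StatsD protocol; the metric names and the fire-and-forget error handling are illustrative:

```ruby
require "socket"

# Fire-and-forget StatsD packets over the loopback; losing the odd trending
# sample is preferable to slowing down (or failing) the request.
STATSD_SOCKET = UDPSocket.new

def statsd_send(packet)
  STATSD_SOCKET.send(packet, 0, "127.0.0.1", 8125)   # local statsd, default port
rescue SystemCallError
  nil
end

statsd_send("api.requests:1|c")           # counter: one more request served
statsd_send("api.response_time:42|ms")    # timer: this request took 42 ms
statsd_send("worker.queue_depth:17|g")    # gauge: current queue depth
```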
00:18:57.360 So now that collection and aggregation are cheap and low-friction, the next important thing is getting all this data into a central location where you can access it and do arbitrary correlations. This is where I think you want to think carefully: collection and aggregation are streaming, so with open interfaces it's very easy to swap pieces in and out, but storage, even as its own component, isn't as easy to swap, because there's persistence involved.
00:19:28.880 The first option is RRDtool, the round-robin database tool. This is the old default — I think it's 10 or 12 years old — and most of the monolithic solutions, like Cacti and Munin, actually use it internally. What's nice about it, if you're not aware of it, is that it uses a circular-buffer file to give you a constant guarantee on storage: it writes new values into the buffer, and if you've allocated space for 100 measurements, the 101st measurement overwrites the first. It's also designed for rollups, so you get multiple circular buffers per metric and can configure the resolutions — raw data, 10-second and one-minute rollups, 15-minute rollups, and so on.
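
The circular-buffer behavior is easy to picture with a toy version. This is a sketch of the idea only — RRDtool's on-disk format, consolidation functions, and rollups are far more involved:

```ruby
# A toy round-robin series: storage never grows, and once the buffer is full
# each new measurement overwrites the oldest one.
class RoundRobinSeries
  def initialize(slots)
    @slots  = slots
    @data   = Array.new(slots)
    @writes = 0
  end

  def record(value)
    @data[@writes % @slots] = value
    @writes += 1
  end

  # Oldest-to-newest view of whatever is still retained.
  def to_a
    return @data.compact if @writes < @slots
    split = @writes % @slots
    @data[split..-1] + @data[0...split]
  end
end

series = RoundRobinSeries.new(100)
101.times { |i| series.record(i) }
p series.to_a.first   # => 1 (the very first measurement has been overwritten)
```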
00:20:22.480 That was one of the first ones; what's probably more interesting are the next couple of solutions. Graphite is also built around a round-robin-style store, Whisper, but it bundles a visualization component with it. It's a separate component, but it makes a lot of sense to bundle it next to your storage, because the big pieces of visualization are just pulling the data back out of storage and graphing it. Graphite came out of Orbitz in 2008. A couple of things to know: it has a flat, hierarchical namespace, meaning it stores plain key-values, but you can use dotted notation in your keys to add dimensions to the name, as long as it forms a proper hierarchy. And it supports pulling graphs out as PNGs via HTTP queries.

00:21:13.039 A couple of things to consider here: Graphite does seem to be the intro tool a lot of people use, and it's nice for that, but you do want to be careful to plan your capacity. It's generally used as a scale-up solution, and it pre-allocates files for your metrics — with typical retention settings I usually see about 3.2 megabytes per metric file. If you end up with a lot of metrics, I've seen people in production running Graphite on a 64-gigabyte RAM filesystem, or with 10 SSDs together in a RAID 0 — big hardware solutions to scale it up.
00:21:55.679 If you think you're going to have that much data, one neat solution that came out of StumbleUpon is OpenTSDB, the open time series database, which is built on HBase and Hadoop. What's neat about it is that it's horizontally scalable with lots of storage, and because of that it supports multiple dimensions: it uses denormalization, so if you want to tag a particular measurement with several dimensions — say, look it up by host name as well as by time zone — it writes the measurement multiple times. It supports HTTP queries, and the only downside is that you have to run Hadoop.
00:22:37.919 Your last option is to use a service. We provide one, and there are several others if you Google around. Typically — and I think it's the best way to do this; it's what we do — it's JSON over HTTP: you just push measurements to an API. There are usually agents and language bindings to make that easy, the services generally handle rollups, and they have interactive front ends.
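
As a sketch of the "JSON over HTTP" push model, here is what submitting a measurement to a hosted metrics API might look like. The endpoint, payload shape, and credentials are hypothetical — every service defines its own, and most also provide an agent or client gem:

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical example: push one gauge reading to a hosted metrics service.
uri  = URI("https://metrics.example.com/v1/measurements")
body = {
  gauges: [{ name: "api.requests_per_sec", value: 1830, source: "web-3" }]
}.to_json

request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
request.basic_auth(ENV["METRICS_USER"], ENV["METRICS_TOKEN"])
request.body = body

Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  response = http.request(request)
  warn "metrics push failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
end
```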
00:23:09.200 With visualization, there are a couple of things to think about, and one of the most important is correlation. Part of the point of driving toward this common infrastructure is that if collection is really cheap, and all these different collectors feed a central repository — Graphite, OpenTSDB, whatever — then whenever you're trying to diagnose something you have all the data in one place and you can build any graph you want.

00:23:38.960 In addition to putting metrics on a graph, ideally your solution supports annotations — that's one of the things we're working on, and some existing solutions have it. Events like deploys are something you should be monitoring too: push a stream of them, ideally with context such as the SHA of the deploy, so that when you're doing a correlation you can overlay your deploys, your network events, or whatever other events you're tracking.

00:24:09.919 You want these arbitrary combinations. A couple of examples: this one is an OpenTSDB correlation — the legend is hard to read; it came straight off their front page — and I think it's MySQL delete queries against 99th-percentile performance. This next one is a correlation I took out of our own system, for our storage ring: it correlates read requests into the ring — the orange line, very periodic — with disk ops (IOPS) and bytes read from disk. You can see those always correlate with the read requests, but at the right end of the graph something much bigger happened, so there's probably some other driver of traffic there that we want to look into. Those are the kinds of things that are very simple to pull up with correlations.
00:25:07.200 Dashboards are a bigger part of visualization. A lot of tools have correlations, but one of the things many of them fall down on — and it's hyper-important — is dashboards. Dashboards work much like a wiki: they give you a shared understanding across all levels of your team. There are a lot of times when someone asks you to go do something — the CEO said we need to do this — and why do we need to do it? You don't know; you just go do it. If your business drivers, or your app performance, or whatever, are up on a plasma on the wall where everybody can see them as they walk by to get their coffee, then everybody's on the same page — and not only that, everybody's on the same page up to the minute.

00:25:53.039 There is also no more sophisticated aberration detection than your marketing guy, who knows nothing about your technology, looking at a dashboard. That has literally happened to me: I'll be hacking away and our marketing guy will go, "Joe, what is that?" and I'll look up and — oh crap, we have to go do something. That's what's really cool about having these up: people will see things.
00:26:22.320 And finally, your marketing guy isn't up at 1 a.m. looking at a dashboard when something goes wrong and your alert catches it — but if you have a dashboard for everything, you've effectively got a firefighting manual. Your expert says, "I'm going to build a dashboard for this service; these are the most important things; when something's wrong, look here first." Then when something does go wrong, you open that dashboard and see, okay, this one looks really funny — I'm not the expert here, but this is where I'm going to start.

00:26:56.320 So when you're evaluating systems — and we've focused a lot on this — you want something that makes it very easy to template dashboards, make a lot of them, and make them easy to pull up. This example is one of ours, for our API: we have up our average response time, the response times for every POST operation, the response times for GET operations, and the number of error and success codes we're returning, broken out by host in the stacked graph below. When any of these shifts — and we're dealing with thousands upon thousands of requests per second — things can get messy in a hurry. Dashboards could be a whole other talk; this is a great book that covers visualization techniques in general — a very easy read — and also how to design dashboards and what to look for.
00:27:48.399 The last bit I'm going to talk about is alerting — oh, sorry, you had a question?

00:28:05.440 [Inaudible audience question.]
00:28:34.960 So, to summarize the question: if you're using something like a platform as a service, where you don't have much transparency — say there's a load balancer issue on their side — how do you build a common dashboard that tells you it's their service and not yours? That's actually one of the hard platform-as-a-service problems. I think that space is still maturing; it's one of the things providers will have to get to, because they need to give you access. Using Heroku as an example, they do provide heroku logs, but that won't help you in the case where their load balancer is dropping packets.

00:29:15.039 Using an external service — a Pingdom or a New Relic — will give you insight. And using specialized services is not necessarily bad; everyone should obviously still use Google Analytics, and if you have a lot of New Relic expertise you should keep using it. What I would look for are services that give you the ability to pull out the cream of the crop of what they're generating for you and get it into a common place where you can correlate it against other things. With Pingdom, for example, look for an API you can pull stats from, so that every minute you query Pingdom, grab the latest value, and put it into the common canonical repository that feeds the graphs on your dashboards. We can talk more afterward if you want.
00:30:02.320 So, alerts. The big thing with alerts is that you have to think of them as being alive. Something like disk capacity is easy: set a threshold, say "let me know when the disk is 90% full" — it's not magically going to get bigger — and walk away, never touching it again. But for a lot of other things, as your service evolves your thresholds are going to change, so you want to think about ways to tune them. There's nothing worse than a noisy alert, because first it annoys me, and second I stop caring about it: if my inbox is piling up with a thing that's always going off, something could actually be on fire and I'd miss it.
00:30:45.520 With alerts — there's different nomenclature for this, but this is the way I think about it — you want a clear trigger threshold: that's what actually fires the alert, simple enough. But also look for things like a cancel threshold. (Part of this is monkey-see, monkey-do — I'm still building it into what we're building.) With a cancel threshold you say, "when the value goes over this, send me an alert, but don't come back to me again unless it drops below this lower bound and then climbs back up." You probably want a rearm window too, which says, "we have a cancel threshold, but if it re-triggers within five minutes, don't bother me again." You'll probably also want the ability to apply a function — say the exponentially weighted moving average, or the min, or the max — and to evaluate it over several samples, so look for windowing capabilities in your alerting as well.
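
To make the trigger/cancel/rearm idea concrete, here is a small Ruby sketch of a "living" alert. The thresholds, the five-minute rearm window, and the notify target are all illustrative; real alerting systems also handle windowing and functions over multiple samples:

```ruby
# A toy alert with a trigger threshold, a lower cancel threshold (so it does
# not flap), and a rearm window (so a brief recovery does not re-page you).
class Alert
  def initialize(trigger:, cancel:, rearm_seconds: 300)
    @trigger, @cancel, @rearm = trigger, cancel, rearm_seconds
    @firing     = false
    @cleared_at = nil
  end

  def check(value, now = Time.now)
    if !@firing && value >= @trigger
      return if @cleared_at && (now - @cleared_at) < @rearm   # still inside rearm window
      @firing = true
      notify("ALERT: #{value} crossed trigger threshold #{@trigger}")
    elsif @firing && value <= @cancel
      @firing = false
      @cleared_at = now
      notify("resolved: #{value} back under cancel threshold #{@cancel}")
    end
  end

  def notify(message)
    puts message   # swap in email, Campfire, PagerDuty, etc.
  end
end

alert = Alert.new(trigger: 500, cancel: 400)
[450, 520, 510, 430, 390, 510].each { |value| alert.check(value) }
```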
00:31:38.080 Another thing with alerting: really look for integrations, especially with third-party services. We do alerts — we just work on your metric stream — but you want something that says "hook this into your email, hook this into your Campfire, hook this into your PagerDuty," rather than "you'd better like our escalation strategy." We do monitoring; I don't know the first thing about escalation, so if you want escalation, go use PagerDuty. Whatever tools you pick, look for integrations.
00:32:14.159 The last cool thing — this could be the subject of a whole talk on its own — is that even with living alerts and people paying attention, there are some very hard scenarios to deal with: aberrant behavior. If your load is evolving over time, and you have seasonal shifts — Saturday night my load is way lower than Monday morning — how do you deal with that? There's interesting work being done (we're looking into it, and I'm sure others are too) on detecting that automatically.

00:32:47.200 One thing to look for is Holt-Winters, a fifty-year-old algorithm from the field of time series modeling. The idea is that once you're past the collection point, this is all generic time series data, and there are very advanced mathematical models for prediction — people on Wall Street make money off time series data. If I apply one of these models to, say, my CPU load average as it comes in, it can predict what the next value should be; if the next value shows up some number of deviations away from that prediction — or maybe three in a row do — it's something you should start looking at.

00:33:28.640 This particular technique is called triple exponential smoothing, and what's interesting is that it takes three things into account: a stationary component, a linear trend — traffic growing from eight in the morning to noon — and a seasonal effect, which for an e-commerce site means December is way crazier than June, probably.
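
For the curious, here is a compact Ruby sketch of additive Holt-Winters ("triple exponential smoothing") with level, trend, and seasonal terms, used to flag values that land far from the forecast. The smoothing parameters, the crude warm-up, and the fixed deviation check are simplifying assumptions — real implementations derive a confidence band from the data rather than using a hard-coded tolerance:

```ruby
# Additive Holt-Winters (triple exponential smoothing): level + linear trend
# + seasonal term. Parameters and the warm-up handling are illustrative.
class HoltWinters
  def initialize(season_length:, alpha: 0.5, beta: 0.1, gamma: 0.1)
    @m = season_length                 # samples per season, e.g. per day or week
    @alpha, @beta, @gamma = alpha, beta, gamma
    @level  = nil
    @trend  = 0.0
    @season = Array.new(@m, 0.0)
    @t = 0
  end

  # Returns the forecast made *before* seeing value, then updates the state.
  def observe(value)
    s = @season[@t % @m]
    if @level.nil?
      forecast = value
      @level   = value
    else
      forecast   = @level + @trend + s
      prev_level = @level
      @level  = @alpha * (value - s) + (1 - @alpha) * (prev_level + @trend)
      @trend  = @beta  * (@level - prev_level) + (1 - @beta) * @trend
      @season[@t % @m] = @gamma * (value - @level) + (1 - @gamma) * s
    end
    @t += 1
    forecast
  end
end

hw = HoltWinters.new(season_length: 7)
samples = [10, 12, 11, 30, 10, 12, 11, 31, 10, 13, 11, 95]   # 95 is the aberration
samples.each_with_index do |value, i|
  forecast = hw.observe(value)
  next if i < 7   # let the seasonal terms warm up first
  puts "t=#{i}: saw #{value}, expected ~#{forecast.round(1)}" if (value - forecast).abs > 30
end
```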
00:33:52.399 All right, in conclusion, a few takeaways. When you're evaluating tools, building tools, or putting things together, try to achieve that separation of concerns. Monitoring equals tests for your ops people — monitoring is their unit test, and as a developer you probably wouldn't want to live without yours — so make it easy for your ops people, or for yourself. You want, as much as you can, a single repository that serves as the canonical store of truth, one you can build arbitrary correlations from.

00:34:22.960 Dashboard all the things. Plasmas are cheap — a couple hundred bucks now — so if you're bootstrapped and very early stage, I get it, but past that, the money you'll save just by putting up a couple of plasmas and a couple of Mac Minis is well worth it. And finally, care for and feed your alerts: have alerts, but constantly tune and update them. And that's it.