
Monitoring the Health of Your App


by Carl Lerche and Yehuda Katz

In this session at RailsConf 2013, Yehuda Katz and Carl Lerche discuss the importance of proactive application monitoring to maintain app health. They emphasize that most traditional tools merely react to problems after they occur, like a doctor diagnosing a heart attack after the fact. Instead, developers are encouraged to gather meaningful metrics to detect issues before they impact user experience. Key points from the session include:

  • Collect Data: Emphasizing the need to measure anything that provides value, enabling quick validation of hypotheses about app performance.
  • Traffic Monitoring: Discussing the importance of tracking key business metrics such as revenue per minute to identify anomalies immediately after deploying changes.
  • Response Times: Why average response times mislead compared with medians and standard deviations, and how to use each to make informed decisions on performance.
  • Long-Tail Distribution: Presenting research showing that web response times do not adhere to a normal distribution; instead, they exhibit long-tail behavior where outliers are more common than expected.
  • Real-World Implications: They shared a case study demonstrating how measuring metrics like the number of users adding items to the cart can pinpoint where issues are arising in the checkout process.
  • Performance Tools: Introduction of a monitoring tool, Skylight, designed to visualize app performance data using histograms to illustrate how response times are distributed across a range of interactions. This allows for deeper insights into where optimizations are needed rather than focusing solely on averages.
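The traffic-monitoring idea above — tracking a key business metric like revenue per minute against a benchmark and alerting when it dips — could be sketched roughly as below. The benchmark values, threshold, and method names here are all invented for illustration; the talk doesn't prescribe an API.

```ruby
# Hypothetical sketch of a "revenue per minute" alert.
# In practice the benchmark would be built from historical data
# for every point in the day; here it is a flat invented baseline.
BENCHMARK   = Hash.new(100.0)  # assumed expected dollars for any minute
THRESHOLD   = 0.5              # alert when below 50% of the benchmark
ALERT_AFTER = 5                # ...for this many consecutive minutes

def check_revenue(samples)
  # samples: array of [minute_of_day, dollars] pairs, oldest first
  low_streak = 0
  samples.each do |minute, dollars|
    if dollars < BENCHMARK[minute] * THRESHOLD
      low_streak += 1
      return :alert if low_streak >= ALERT_AFTER
    else
      low_streak = 0   # revenue recovered, reset the streak
    end
  end
  :ok
end
```

With something like this wired to a pager, a broken deploy shows up minutes after shipping rather than when the boss calls.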

Ultimately, the presenters strongly conclude that measuring every aspect of application interaction is essential in fostering a healthy app environment. By adopting this proactive approach to monitoring and measuring, developers can significantly enhance user experience and improve the efficiency of debugging. Understanding and interpreting correct metrics ultimately promotes better software development practices.

Your app is your business, so keeping it healthy is important. Unfortunately, most of the tools available today are more like your doctor verifying the fact that you've had a heart attack—after it's happened.
You can do better. In this session, you'll learn how to use metrics to be more proactive about monitoring your application's health, and to suss out the subtle but important warning signs that can help you prioritize developer time and improve the developer experience. We'll talk about how to instrument your code, what to measure, how to interpret the data, as well as how you can use the data to streamline your development process.

Help us caption & translate this video!

http://amara.org/v/FGb5/

Rails Conf 2013

00:00:16.600 so well again thanks for coming um my name's Carl I work at Tilde and this is
00:00:23.080 my esteemed colleague Yehuda Katz he I don't believe he requires an introduction so today um we're going to
00:00:29.800 be talking about measuring your app um but before we get started I just want to
00:00:35.200 start with a quick story this is my first job you know out of college it was
00:00:40.879 your basic e-commerce app you know they had uh you know products for sale you
00:00:46.320 added to your cart you checked out put in your credit card processed an order pretty basic and the day in question I
00:00:53.239 was tasked with a pretty simple feature just I think it was just something like tweak the coupon page on the checkout
00:00:59.600 process you know pretty basic uh it was easy I just knocked it out in a couple hours
00:01:05.280 shipped it out um QA QA accepted you know they had a couple people that when
00:01:10.439 something was ready to go out they just banged on the app made sure it worked for them very scientific uh so went out to production
00:01:17.600 I went home and everything seemed good except I think it was about 8:00 P.M my
00:01:22.799 boss calls calls me he's like hey how's it going I'm not liking the way this is starting he never calls me um yeah so I was
00:01:30.640 looking at some numbers and sales seem low like okay sales seem low and he goes
00:01:36.159 well I was told you made some changes to the checkout process I I mean I I think
00:01:42.640 you might have broken something it's like all right uh my boss tells me that he's also the owner of the company it
00:01:48.479 was not too huge so you could tell he had some stress in his voice and you just know he's like every I bet every
00:01:53.759 day he's like staying at home refreshing sales making sure nobody broke anything so all right I get out my computer I'm
00:01:59.960 going to fix this step one let's see if I can reproduce it no cannot I go through the entire process no problem
00:02:07.039 it's late at night so next thing I want to do I'll just I'll just revert this I don't want to deal with it uh yeah
00:02:13.160 incidentally this was a long time ago uh we didn't use Source control either so that made reverting a bit trickier but
00:02:19.120 it was it was a simple change so I just like went by memory so I reverted that
00:02:24.480 pushed it out and sent an email to my boss all right I reverted it let me know if everything's okay
00:02:30.280 30 minutes later calls me yeah sales are still down well I hope I reverted
00:02:37.000 correctly but something went wrong so next I'm just going to go like ssh in
00:02:42.440 the server this is the obvious next step SSH into the server and live add debug statements this was also about
00:02:50.519 the same time I was actually first learning Rails so this job was a PHP shop so adding debug statements was much
00:02:57.000 easier you just add it on the fly and it automatically reloads that's deployment oh anyway so I couldn't find anything
00:03:05.480 this took me like about 5 hours of debugging everything uh it turns out
00:03:10.959 that the actual problem was somebody else that day updated the Google
00:03:16.040 analytics tracker code you know that little JavaScript snippet they updated at the same time and given the lack of
00:03:21.840 source control I didn't know didn't have an easy way of looking at that but on the checkout page the JavaScript
00:03:28.840 interacted weirdly with our javascripts on certain browsers so that certain
00:03:34.680 browser happened to be IE which I was young I I didn't think
00:03:39.720 to check IE uh but so what lessons can we learn from this really besides use
00:03:45.480 source control but everybody here has already learned that I assume but the main lesson I want to focus on is uh my
00:03:52.959 boss had a hypothesis right he his hypothesis was where sales are down and
00:03:59.920 I nobody really had any real way to validate that hypothesis quickly you know like he looked at some sales
00:04:06.560 reports just hit refresh a whole bunch but besides that there wasn't anything um then I had the hypothesis that my
00:04:14.799 coupon code like my coupon tweak was the culprit but I had absolutely no way of
00:04:21.000 validating that short of insane amounts of debugging if I had like been if I had
00:04:26.280 been collecting just a little bit of data I think it would have been much easier to debug this issue and that's
00:04:32.759 really going to be the thesis of this talk you know I'm going to try to convince you by the end that just like I
00:10:38.440 don't think I hope nobody here would not write tests for their app I think I'm going to try to convince you that you
00:04:44.080 need to measure anything that matters in your app basically anything that
00:04:49.320 provides value so so that you can quickly detect things that break so
00:04:56.039 before we continue quick little game which line is slow so on the next slide I'm going to
00:05:03.639 show you a very basic rails action only two lines of code and I'm going to tell you this action responds in about one
00:05:10.039 second and it's going to be up to you to try to determine which line that
00:05:15.280 is so here are the two lines give you a
00:05:21.039 minute so uh who thinks line one oh one person let bre two what about
00:05:29.280 the second one line two oh every everybody seems like vast majority line two what what does Señor Katz think I
00:05:36.759 don't oh sorry oh the the just the second line of
00:05:41.919 this action so the first line's users = User.where(admin: false) second one's render json: users
00:05:47.800 first all right so I'm just gonna like let you know I don't know this is a Rails 3 app so what
00:05:55.440 actually is going to happen here in theory is that User.where is going to create a scope and so that's going to
00:06:01.520 just create like a a lazy object and then when you call first on that lazy object it's it's going to rewrite the
00:06:07.319 SQL query to be efficient so in theory if you just look at this it should just pull one user from the
00:06:13.160 database and render it okay but let me show you a little bit more so I'm telling you now that this code is in the
00:06:20.960 model huh that should be self.where bro oh well I didn't actually run this
00:06:26.199 but that should be self.where so it's overriding that uh that function so assuming this now which line
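A purely hypothetical reconstruction of the point on the slides — the actual code isn't shown, so plain Ruby stands in for ActiveRecord here, and the class and constant names are invented:

```ruby
# A tiny stand-in for a Rails 3 lazy scope: nothing is evaluated
# until data is actually requested, so .first can fetch just one row.
class LazyRelation
  def initialize(rows)
    @rows = rows
  end

  def first
    @rows.take(1).first   # only one row is materialized
  end
end

ROWS = (1..1_000).map { |i| { id: i, admin: false } }

# In theory: lazy scope, .first pulls a single user efficiently.
lazy_result = LazyRelation.new(ROWS).first

# In the spirit of the overridden self.where from the second slide:
# it returns a plain Array, so "line one" has already loaded every row.
eager_rows   = ROWS.to_a
eager_result = eager_rows.first
```

Both paths return the same user, but the eager version has already paid for all 1,000 rows before `.first` is ever called — which is why the measurements matter.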
00:06:33.560 do you think is the culprit line one
00:06:40.960 anybody line two uh getting smart so this is what
00:06:46.520 I'm actually going to tell you now I'm going to give you some measurements like the first line takes 0.1% of the request and
00:06:52.000 the second line takes 0.4% so now we have some data we can determine that actually
00:06:57.759 it's not in here let's continue looking further okay we find a before filter with a little
00:07:06.560 bit you know I'm sure many people have seen code like this before so the main point is and the main
00:07:15.199 point is like this is like kind of a tongue-in-cheek example but uh I'm sure nobody here really knows every line of
00:07:22.120 code that you're running in your Rails app like even if you're the only developer on that app you still have all the gems
00:07:27.879 probably hundreds of gems you pull in your app and you don't know every line of code so you
00:07:33.160 need a way to determine when things go wrong how you need a way to quickly determine what's happening so how do we
00:07:39.319 do that so actually I was at CodeConf a few years ago and I saw a talk by this
00:07:45.680 gentleman Coda Hale and it really resonated with me basically the way he put it
00:07:51.400 was as developers we uh aren't paid to code we're paid to provide business
00:07:58.000 value we're paid like to write code but that code is supposed to add a feature or make the
00:08:04.720 site faster or do something that ends up providing value to the business so and
00:08:11.720 that that code provides that value in production so it doesn't matter it doesn't matter at all whether or not the
00:08:18.440 like it runs faster on your Dev box it doesn't matter if CI is green what matters is what's actually happening in
00:08:24.879 production so how do we know um that this is happening basically what we need
00:08:31.759 to have is a correct mental model of how our code runs in production and we
00:08:37.519 do that by measuring so what do we measure well the easy version is we measure business
00:08:44.480 value uh we're going to we need to measure our application responding to
00:08:50.800 users completing tasks that in the end those tasks make us money so going back
00:08:56.000 to my original story uh what's the obvious thing I could have measured right so well the obvious thing to
00:09:01.560 measure is going to be let's measure how much money we're making so this is I think probably the key metric for most
00:09:06.760 businesses if you're not a Silicon Valley VC backed one
00:09:12.120 but uh if I had something as basic as maybe a line graph with dollars made per
00:09:18.760 minute I would have been able to see the dollars dropped off like oh I shipped this and now this is going way
00:09:25.120 down um oh or Bitcoin if that's what you're
00:09:30.959 into so and we can do more than that we can take that data and we can save it
00:09:37.000 off we can like we can use it as a reference to like kind of
00:09:42.680 notice trends we can use it to uh create a benchmark for every point in the day
00:09:48.440 how much money do we think we should be making and we can build uh tools that
00:09:53.600 use that benchmark and take the current value of that and if it falls below a certain threshold for a certain amount
00:09:58.839 of time let's send out an alert let's like have a tool tell us 5 minutes after we deployed that something's
00:10:06.079 broken instead of our boss calling us like five hours later right that in the end if our boss doesn't call us
00:10:11.519 everybody's happier but so that would have just told me right just told me like something was
00:10:17.680 wrong we need to go beyond that like what I need to be able to create like a
00:10:24.000 hypothesis like okay now I already know something's broken we're not making money so what is broken I need to be
00:10:29.640 able to make a decision quicker about where to fix it because things are busted so what we need to do is like all
00:10:37.399 applications like have many steps in the process of making money we really should be instrumenting every single step along
00:10:44.880 the way like for example I don't know in my example like number of users that add
00:10:50.279 items to the shopping cart that's the first step let's track that how many items get added let's track that right
00:10:56.279 how many items get removed that's also important and then from that how many what's the number of users that check
00:11:01.959 out and then from that what's the number of users that like uh have successful
00:11:07.160 orders so what I'm saying also is yes you probably are using some sort of like
00:11:13.000 tools to measure this on the business side but I'm really saying on the developer side we also need something
00:11:18.519 that measure it for us because then when you ship something into production and
00:11:23.560 it breaks everything right you can then look at this data and find pinpoint like which part which part of the process is
00:11:32.399 broken so if it's important for your business measure it because if it's not measured
00:11:38.839 how do you know it's working
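The funnel instrumentation just described — a counter for each step between adding an item and completing an order — could be sketched as below. The metric names and the `track` helper are invented; the talk doesn't name a specific library.

```ruby
# Minimal sketch of instrumenting each step of the checkout funnel.
COUNTERS = Hash.new(0)

def track(metric)
  COUNTERS[metric] += 1
end

# Each step of the process emits a counter as it happens:
track(:cart_item_added)
track(:cart_item_added)
track(:cart_item_removed)
track(:checkout_started)
track(:order_completed)

# When sales drop, comparing counters pinpoints the broken step:
# e.g. plenty of :cart_item_added but zero :checkout_started
# points straight at the checkout page.
```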
00:11:45.800 and performance is also business value right I'm sure you might have already heard this but just going to repeat it
00:11:51.680 like the research that go both Google and Amazon have done right like Google I
00:11:57.320 think they said like oh when their search result page went from 0.4 seconds response time to 0.9
00:12:03.920 seconds response time they lost 20% in ad sales and Amazon every
00:12:10.399 extra 100 milliseconds of response time like caused them a loss of 1% in sales
00:12:17.560 so performance is really critical as well on the business side and well I
00:12:23.680 guess the good news we already have tools that uh measure the average response time for us but hey hold on mic
00:12:30.160 should have been hey yeah the other mic should have been there seems good so uh Carl just said we have tools that measure average response time so uh
00:12:37.600 unfortunately I'm going to have to go off into academic land briefly uh and
00:12:43.240 the reason for that is that I think a lot of us do hear the term average response time and they think that what they're getting is a useful piece of
00:12:48.920 information that they can use to uh understand what their app is doing so Carl talked about having a sane mental
00:12:55.079 model for what your application is doing and where a lot of people start is they start with having an average and what
00:13:01.360 I'm here to tell you is that that actually gives you a very poor misleading and broken mental model of what's going on so I want to start just
00:13:08.199 to make sure we're all on the same page I want to start and go back to statistics 101 or High School statistics
00:13:14.120 uh I'll start with I'll start with the simple so we have a big list of numbers here a small list of numbers and if we
00:13:20.000 want to find the average just want to make sure everyone knows what an average is uh if you want to get an arithmetic
00:13:25.760 average the first thing we're going to do is we're going to add up all the numbers in the list we're going to get a sum then we're going to divide by
00:13:33.560 the count we're going to get a number out and we're going to have an average so that's uh that's step one how do you
00:13:39.680 get an average you add up the numbers this is actually relatively easy thing to collect because you can basically
00:13:45.240 just keep a running total and a running count and as you get more data points it's pretty easy to add another thing to
00:13:51.160 your total and divide by the next bigger number and you get the new average so quite easy to collect and but soon I
00:13:56.959 will show that it's not quite easy to be useful uh median is actually quite uh quite a bit more useful and basically
00:14:03.560 the idea behind median is that you take uh the numbers you line up in order you get the middle numbers uh if there's two
00:14:09.720 in the middle you average them but you get you basically get out a number in this particular example the median and mean are roughly the same so the average
00:14:16.880 that we got by doing the easy to collect information gave us something like 17.35 the median which we got by doing a
00:14:23.560 more complicated thing which is lining everything up in order and getting the middle gave us a similar thing
00:14:29.360 um now it's a little bit deceptive because I think when you learn mean and median in high school you have like a small set of numbers so it's it feels
00:14:35.839 like you're it's a quite similar thing unfortunately it's not very easy to keep a running total of median because you
00:14:41.839 need either all the numbers so you can keep putting them in order and getting the middle one or you need some sort of
00:14:47.880 uh algorithm or data structure that's doing it and it uh it's not super easy
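The mean, median, and (jumping slightly ahead) the standard deviation being walked through here might be computed like this; the slide's actual list of numbers isn't shown, so the sample below is arbitrary:

```ruby
# Mean: easy to collect — a running total and count suffice.
def mean(xs)
  xs.sum(0.0) / xs.size
end

# Median: needs all the values, sorted — there is no trivial
# running total you can keep as data streams in.
def median(xs)
  s = xs.sort
  mid = s.size / 2
  s.size.odd? ? s[mid] : (s[mid - 1] + s[mid]) / 2.0
end

# Population standard deviation: how dispersed the values are.
def stddev(xs)
  m = mean(xs)
  Math.sqrt(xs.sum(0.0) { |x| (x - m)**2 } / xs.size)
end

sample = [12, 15, 16, 17, 18, 20, 25]
```

For this sample the mean (~17.57) and median (17) land close together — which, as the talk goes on to show, is exactly what stops being true for real response times.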
00:14:53.600 um and then another thing and I don't want to go through what how you collect a standard deviation uh I'm just going
00:14:59.519 to say that the standard deviation of this list is 4.96 most people think of a standard deviation as like a variance
00:15:05.639 it's it's uh how how uh dispersed are all the numbers in this group so 4.96
00:15:12.360 basically means that as you move 4.96 and either direction you are going to get less and less likely to hit a
00:15:18.519 particular number so what the way most people think when they hear I have a standard deviation of something they
00:15:24.440 think in their head of this bell curve so again in high school you probably learned about the bell curve so your
00:15:29.920 mental model of a lot of the stuff is really I think deeply rooted if you know statistics at all in the bell curve so
00:15:36.000 let me do another exercise here let's say we have someone tells you I have gone into my application uh performance
00:15:42.920 portal and it has told me that my mean is 50 and my standard deviation is 10 uh most of these things don't give you a
00:15:48.959 standard deviation but let's just assume that somehow you have it most people's mental model is going to be okay jump
00:15:55.160 directly to that normal curve that means that about 68% of all my requests are going to be between 40 and 60
00:16:02.160 milliseconds and something like 95% of all my requests are going to be between 30 and 70 milliseconds and very
00:16:08.120 importantly your mental model says something like as I keep going further and further out the likelihood of
00:16:14.399 hitting something 100 milliseconds 200 milliseconds 500 milliseconds becomes
00:16:19.560 extremely rare it should basically never happen if you your if your mental model model is like this 500 millisecond
00:16:25.480 responses should basically never ever happen because they're so far off from the center when even when you take
00:16:30.800 standard deviation into consideration unfortunately the real world does not actually look like the platonic normal
00:16:36.759 distribution what the real world looks like uh and this is also uh somewhat platonic in the sense that the real
00:16:42.440 world is even more complicated than this but the real world looks a lot like this which is that there's a bunch of stuff
00:16:49.160 that uh you start off pretty high you get like what kind of looks like a skewed uh normal distribution and then
00:16:55.880 you get a pretty long tail and I want what I want to do is to show a couple of illustrations of what these
00:17:02.000 distributions look like when we're talking about real world things that have nothing to do with
00:17:07.160 programming uh so if the average male height is 5'10" and the standard deviation is 3 inches this distribution
00:17:14.280 behaves like you would expect 68% of all men are between 5'7" and 6'1" etc etc and by
00:17:19.959 the time you even get to 4'10" to 6'10" you're getting to a very tiny percentage of people that are outside of
00:17:25.319 it and therefore the correct thing for your gut to say is if you are very unlikely to ever encounter a man who is
00:17:31.760 10 feet tall that is basically astronomically unlikely it will never happen it is also very unlikely to
00:17:38.320 uh encounter a man who is one foot tall uh although more likely um perhaps so uh
00:17:44.960 the point is that we have these distributions and a lot of our gut feeling a lot of what we learned in school um even in college about
00:17:50.919 statistics is based around this kind of distribution we have something we have some bunch of stuff grouped in the
00:17:56.520 middle and as we move out from The Middle we get to be much and much less likely and therefore in our brains we
00:18:02.280 automatically say we don't actually care about things that are far out we don't care about things that are hundred milliseconds
00:18:07.640 out our intuition is basically that outliers are extremely are quickly become extremely rare So when you hear
00:18:13.559 someone talk about an outlier so if I was to tell you that your mean response time is 100 milliseconds and you have a
00:18:18.760 standard deviation of 50 your gut would tell you that 500 milliseconds based on that information is definitely an outlier you should not care about it it
00:18:25.240 doesn't matter but that doesn't actually turn out to be how the distribution Works in reality again I think what this
00:18:31.880 tell what this makes you feel like is we only have to worry about values that are clustered around the middle we only and
00:18:37.919 some approximation of that is we only have to worry about values that are clustered around the mean so we rely a
00:18:43.520 lot on tools that give us mean averages and we start to think we start to feel comfortable with this gut intuition that
00:18:50.720 it's okay we only have to worry about values cluster around the middle like I said unfortunately the world largely is
00:18:56.840 not distributed normally at all so for the most part the thing that I just told you about what our intuition says is
00:19:02.880 completely wrong um so here's an example something like average salary if I tell you the this is these are roughly
00:19:08.360 correct the average salary in the United States is about 60,000 standard deviation is 30,000 and what you'll
00:19:14.280 notice is you would expect that if I got to 90,000 120,000 150,000 you would
00:19:21.200 expect to basically never in the same way that you would expect to never encounter somebody 10 ft tall you'd
00:19:26.480 expect to never ever encounter in your entire lifetime somebody making a million dollars but it doesn't work out like that what happens in practice is
00:19:33.320 that there are vast there there are distributions like the one that we see here for income distribution that really
00:19:40.280 don't look anything like what your gut what your intuition tells you so if someone tells you the average salary in the United States is 60,000 I think a
00:19:46.960 lot of people think ah that pretty much means that it's clustered around the middle but of course that's not true at all of course as you you can see that as
00:19:53.440 I'm having I'm going from 50 to 25 to 12 to 6 you can see that the amount that we're increasing by has stayed roughly
00:19:59.039 the same as we go in half and basically what that ends up meaning is that longtail distributions are not clustered
00:20:05.240 around the middle so the in the uh people making a million dollars are what
00:20:11.720 you would think of as massive outliers but they're actually not because the distribution doesn't look like what you were taught in school is the
00:20:19.440 distribution if we focus on The Middle on the mean we're actually going to lose a large chunk of reality so we're going
00:20:25.200 to feel like 500 milliseconds is extremely unlikely basically astronomically unlikely when in fact
00:20:30.640 it's just a little bit unlikely somewhat unlikely and pro tip it turns out web
00:20:36.760 responses are the same similar type of distribution as income you would think
00:20:41.880 that if you got 100 milliseconds as your average that 500 milliseconds would be extremely uncommon
00:20:47.720 extremely rare but in fact that does not turn out to be true in practice now uh not every single distribution of every
00:20:55.320 single response is exactly what is called a log normal distribution but
00:21:00.360 they are more log normal than they are normal and this and what that means is that they are more longtail there's more
00:21:06.159 data keeping on going out out out out out out out out then there are uh then you rapidly go down into the abyss so
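Said another way in code: the contrast between a normal and a log-normal distribution with the same median can be simulated directly. All parameters below are invented for illustration, and this is a rough sketch, not a claim about any particular app's traffic:

```ruby
rng = Random.new(42)

# Box–Muller transform: 20,000 standard normal samples.
normals = Array.new(20_000) do
  u1 = 1.0 - rng.rand          # shift to (0, 1] so log never sees zero
  u2 = rng.rand
  Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math::PI * u2)
end

def percentile(xs, p)
  xs.sort[(xs.size * p).floor]
end

# "Response times": normal around 150ms (sd 50) versus a log-normal
# with the same 150ms median but a heavy right tail.
normal_times    = normals.map { |z| 150 + 50 * z }
lognormal_times = normals.map { |z| 150 * Math.exp(0.6 * z) }
```

The normal distribution's 95th percentile sits near 230ms, as intuition predicts; the log-normal's lands around 400ms, even though the two share a median — the long tail keeps going out, out, out.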
00:21:15.240 let me try to uh say it another way so you see average response time is 150
00:21:21.400 milliseconds standard deviation is 50 milliseconds you think 95th
00:21:26.679 percentile is between or 95% of all my requests are between 50 and 250 milliseconds and if I was to see a 500 a
00:21:35.039 500 millisecond request it would be extremely rare I could basically ignore it it basically means something crazy
00:21:40.120 happened the reality is nothing like that at all the reality is that you have a long tail which means that in this
00:21:47.400 particular case 500 millisecond responses might be relatively common now that doesn't mean that half of all your
00:21:53.559 responses are going to be 500 milliseconds but it means that it would not be surprising at all to get to a a
00:21:58.880 uh five or 10% of all your requests are in a very what looks to be a very outlier
00:22:05.159 position so that's that's sort of the the Crux of what I'm getting at here um and what I want to say is if what you if
00:22:12.720 the if your mental model is the normal distribution things are clustered around the middle if I know my mean it's fine
00:22:18.000 I'll just drag my mean to the left and everything's great you end up with a situation where you don't have really good intuitions around consistent
00:22:23.720 performance you end up with intuitions that are making you feel like you're making things better but in fact there's
00:22:29.039 a whole bunch of requests that are hundreds and hundreds of milliseconds to the right of your mean and you're just you're not doing anything with them so I
00:22:36.720 think the way I usually have started to look at this since I've been looking uh looking at a lot of real data more is
00:22:42.279 there is basically in many many web requests there's basically three areas there's uh something like 60% of your
00:22:48.760 requests are in on the left and those requests usually are within some kind of Tolerance that you're willing to do like
00:22:54.400 150 milliseconds and you could basically uh you're happy with them you don't have to worry about optimizing those at all
00:23:00.400 uh maybe you might be interested in optimizing them for some reason but you have bigger fish to fry usually than the
00:23:05.640 stuff that's to the to the left of that and then you have like another 20% are
00:23:10.880 uh not not super great but they sort of obey your intuition about the distribution right it's like 20% and now
00:23:16.640 you're at 250 milliseconds seems fine if you you your intuition is if you move
00:23:21.720 the mean over it's going to get those but then the problem is there's like another 15% of requests and again that's something like one out of every eight
00:23:28.080 requests that your that your users make and keep in mind many users make more than one request so if you're seeing one
00:23:34.080 in eight requests is extremely slow it's going to one user is going to hit those
00:23:39.360 from time to time the really slow request uh you basically are getting to you get to a point where you have um
00:23:44.720 this very unintuitive uh group of requests that are hundreds and hundreds of milliseconds to the right of your
00:23:50.279 mean and that make up something like approaching the 95th 96th 97th 99th percentile of all
00:23:56.159 your requests so so basically the way I imagine that you should deal with this is you should pick a tolerance like I am
00:24:02.480 happy with 150 milliseconds and then you should optimize to the
00:24:08.039 right so you should basically go and say Let Me Go deal with the requests that are to the right of my tolerance now of
00:24:14.240 course in order to do that you're going to need the ability to know what exactly is going on not to the left of the of
00:24:20.520 the 150 milliseconds but to the right of the 150 milliseconds and probably you're going to want to differentiate between
00:24:25.840 the ones that are somewhat intuitive and ones that are basically longtail crazy I don't really know what's going on here
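The "pick a tolerance and optimize to the right" idea might look like the sketch below, splitting requests into the three rough groups described: within tolerance, somewhat intuitive, and long-tail. The thresholds and request durations are invented:

```ruby
TOLERANCE = 150   # ms: happy with anything at or under this
SLOW      = 400   # ms: beyond this is the unintuitive long tail

def triage(durations_ms)
  {
    fine:      durations_ms.count { |d| d <= TOLERANCE },
    sluggish:  durations_ms.count { |d| d > TOLERANCE && d <= SLOW },
    long_tail: durations_ms.count { |d| d > SLOW }
  }
end

requests = [80, 120, 140, 90, 200, 250, 310, 520, 900, 135]
# Developer time goes to the :long_tail bucket first — the requests
# to the right of the tolerance, not the ones clustered at the mean.
```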
00:24:31.679 um now a lot of people because their gut intuition is very normal curvy when you start talking about things this way they
00:24:38.360 really want to think that 95th percentile is some crazy outlier number because of the way that you've been taught to think about distributions
00:24:45.000 95th percentile really feels like if I don't have to worry about that anything outside of 95th percentile is like super
00:24:51.039 rare but actually what that means in the real world is that 5% of all your
00:24:56.880 requests one at one one of every 20 requests is slower than this number so in this case let's say we say the 95th percentile is 500 milliseconds that's
00:25:04.360 much worse than the mean the mean here would tell you like 150 or something but um it's very far it's very far to the
00:25:11.360 right but it still means that one out of every 20 times your user makes a request they're going to get something that's slower than 500 milliseconds and of
00:25:18.080 course one out of every 10 users is going to get something slower than like 400
00:25:24.320 going on in these particular regions and what that is in our case is a product
00:25:32.200 that we've been working on called Skylight um Skylight is basically a uh
00:25:37.600 right now it's very early we I would say it's MVP uh quality um but what we have
00:25:43.039 done is basically build an entire performance monitoring solution around the idea of histograms around the idea
00:25:48.919 of actually get looking being able to look at the distribution that is real and basically giving you the insight
00:25:54.880 into what's going on in these particular locations so let me actually switch off and show you that so this is why this is
00:26:00.600 a product and services track uh so how does one do
00:26:07.600 this this is yeah awkward sorry guys okay I'm going to mirror
00:26:15.440 displays come on all right there we go okay so the first thing that I'm
00:26:22.640 going to show you here is here's an example from a real application uh our first customer actually
00:26:28.200 they're using Skylight, you can see that there is a bunch of requests, 279, and the first thing that I would like to
00:26:33.640 call out is that the distribution in fact looks quite like the log-normal distribution, the weird skewed
00:26:38.919 distribution that I showed before, it actually looks nothing like the normal distribution, and in fact if you start using a tool like this that lets you
00:26:45.360 see the real distribution, a very early thing you notice is that the normal distribution basically never
00:26:50.679 happens, for reasons that are not fully understood you never hit that scenario, now like I said before you
00:26:57.279 actually want to differentiate between the faster requests and the slower requests, so you can see here that when I click
00:27:03.720 faster it's selecting the fastest third of requests, and when I click
00:27:09.080 slower it's selecting the slowest third, and down here we're basically using
00:27:14.159 ActiveSupport notifications and a few other things to collect information about what's going on in your request,
00:27:19.880 it's important to note here um so if you're familiar with like the Chrome debugging tools it's quite similar to
00:27:25.159 those except that this is showing the information about a whole bunch of requests mixed together not one
00:27:31.320 particular request so you can see for example that uh the faster requests have very little time spent in GC uh while
00:27:37.399 the slower requests have a lot of time spent in GC um and in our particular case what we've done is uh GC is
00:27:43.559 something that can get very noisy, so if you're looking at how much time is spent in, for example, a SQL query, if
00:27:49.159 you just ignore GC entirely, often Ruby will start running GC in the middle
00:27:54.320 of a SQL query and you'll see like, oh my God, my SQL query is taking like 500 milliseconds, and, I mean,
00:27:59.840 Ruby's GC is not that slow, but my SQL query should be taking like 100 milliseconds, so that makes no sense at all, so one of the
00:28:06.080 first things that we did was basically subtract out the GC time from all these little circles so that you can
00:28:12.320 get a real sense of it and not get all this noise around what's going on, so now of course like I said before
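The GC subtraction described here can be sketched in plain Ruby; this is only an illustration of the idea, the method name is made up and this is not Skylight's actual implementation:

```ruby
# Time a block, then subtract whatever time Ruby's GC spent inside it,
# so a SQL query interrupted by a collection isn't blamed for the pause.
def time_without_gc
  GC::Profiler.enable
  gc_before = GC::Profiler.total_time  # seconds of GC so far
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  gc_inside = GC::Profiler.total_time - gc_before
  elapsed - gc_inside  # the block's "own" time, GC noise removed
end
```

The same idea applies per instrumented segment: whatever GC ran while a segment was open gets subtracted from that segment before it is drawn.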
00:28:20.880 you don't necessarily only want to look at faster or slower you also want the ability to look at particular distributions so um you can select any
00:28:28.559 distribution that you want and you can see that the bottom area is live updating which is pretty awesome um and
00:28:35.679 one thing that we have not done yet, but which we are planning to do in the near future, is basically for any
00:28:42.200 area that you happen to have selected so you select this area over here right now you can see basically what's going on uh
00:28:47.399 we plan to have key requests down here that are basically some particular requests that represent that area that you selected that will give you much
00:28:53.960 more information, like what exact SQL queries are being called, so one other option if you
00:29:01.399 want to get detailed information is just to get the very slowest request, but obviously the very slowest request really is an outlier, but here you can
00:29:08.279 basically say I want a request from this range, or I want a request from this range, and you get the full details, so
00:29:13.320 that's great um and then I just want to show another example so this is uh this is a different distribution and what you
00:29:20.399 see here is, after this morning's keynote I'm sure everyone's going to run home and cache, cache, cache all the
00:29:26.880 nesting, and what you can see here is that often if you're caching a lot, obviously
00:29:33.559 the cache hits are going to be represented by information over here, and then you have all the way
00:29:38.799 down here the cache misses, right, that are much much slower, and of course averages completely hide this type of thing,
00:29:44.080 right, the average is somewhere in the middle, but there's no real request in this case even near what the average would tell you, so you would have
00:29:50.840 completely lost it, where if you just click on this example here you can see right away that there's
00:29:56.760 essentially two different requests going on here, there's the cached requests and the non-cached requests, um the
00:30:02.000 last thing that I want to say here in the demo is, actually I'll show the real
00:30:07.360 one, that'll be better, so right now we're
00:30:13.320 planning on adding things like notifications to the right side here, but for right now what we have basically
00:30:19.159 here is a list of endpoints, and there's a couple things that I want to call out here, the first one is that the order is basically based on
00:30:25.399 importance and that's a combination of how many requests per minute you're getting and how slow it is so if um your
00:30:32.080 response times are very slow but basically nobody hits it then maybe it doesn't really matter unless it's really
00:30:37.279 really slow maybe then it does matter um and the response time here is actually not the average response time it's the
00:30:43.519 95th percentile, and the 95th percentile again you might think is this crazy super outlier-ish thing, except
00:30:49.679 remember that web responses are not clustered around the middle, so the 95th percentile in this case, so
00:30:56.480 here we have 279 requests, so 5% is
00:31:01.919 like 12 or 13 responses, so we can see that the 95th percentile goes out um
00:31:09.000 95th percentile is everything, yeah, it's a lot of stuff, but there's still like 20 or 30 requests, actually it's
00:31:15.799 more like here, right, there's still like 20 or 30 requests that are outside the range, I'm trying
00:31:21.120 to get it for reals, so let's say something like this, right, there's still a bunch of requests that are outside, which you
00:31:27.320 would not expect if your mental model was the simple mean, and then when you get out of there it's like crazy, so
00:31:32.880 still one out of every 20 requests here is going to be worse than 600 or
00:31:38.240 700 milliseconds whatever we had there and I think one thing that's important to note is I think one of the first things that happens we have a few people
00:31:44.679 using Skylight now, one of the first things that happens is that that number is really a shocker, right, so
00:31:50.039 they're used to seeing like 150 over there, or 100, because that's what other tools give you when they talk about
00:31:55.200 averages, and we're giving you a number that looks much worse, but what we're giving you is a number that is much more
00:32:01.200 honest, and admittedly harder to collect, but much more honest, so
00:32:06.720 that's the deal uh I don't know this is on is that on all right so that uh uh
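The endpoint ordering Yehuda describes, a combination of requests per minute and slowness, could look something like the sketch below; the score formula and the numbers are my own guesses for illustration, not Skylight's actual weighting:

```ruby
# Hypothetical endpoint stats: requests per minute and 95th-percentile ms.
endpoints = {
  "HomeController#show"    => { rpm: 400, p95_ms: 150 },
  "SearchController#index" => { rpm: 120, p95_ms: 300 },
  "AdminController#report" => { rpm: 1,   p95_ms: 2000 },
}

# One plausible importance score: p95-weighted time consumed per minute,
# so a slow-but-rarely-hit endpoint ranks below a busy, mildly slow one.
ranked = endpoints.sort_by { |_, s| -(s[:rpm] * s[:p95_ms]) }.map(&:first)

ranked.each { |name| puts name }
# HomeController#show comes first (400 * 150 = 60,000)
```

A really-really-slow endpoint can still surface, as the talk notes, because its p95 alone can dominate the product.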
00:32:12.279 thank you. I don't think we're quite done, I've shown you the demo, I've talked about statistics, and I think me and Carl are going to close off, so let me just
00:32:19.600 let me switch back to non-mirror, and I think, yeah, for us the main thing is that we want to get this out as soon as
00:32:25.760 possible because we want people on it so that we can be solving real world problems, just having a few people on
00:32:31.919 it helped us tremendously, so, and yeah, like you mentioned, I think the most
00:32:37.240 interesting thing we found was the first response is not just sticker shock but I had no idea what was going on, because they had
00:32:43.639 what is considered a pretty performant Rails app, but they discovered many things, yeah, and I think one thing I found that was interesting when I was
00:32:50.080 starting to look into this is, in 2009 DHH wrote a blog post called The Problem with Averages, and essentially in this
00:32:56.679 blog post, at the end, he said what we need are histograms, basically you need to actually
00:33:01.799 physically look at it, because honestly your intuition is just wrong, right, you look at a percentile, even if you look at standard deviation, your intuition about
00:33:08.480 what's going on is just wrong, you need to see what's really happening, the real distribution, and also in a
00:33:15.960 comment in response to DHH's post, the principal architect from New Relic said
00:33:21.039 I also agree that a histogram is the ultimate answer to understanding the user experience and I think basically
00:33:26.519 our position is, averages are, I keep wanting to say averages are great,
00:33:32.679 averages are not great, averages are basically information that is telling
00:33:38.039 you something that is inaccurate, that's misleading, that plays to the wrong part of your intuition, the wrong part of your schooling, and you basically are going to
00:33:43.799 be making assumptions, like what Carl said before about measurement, around what you know, that are just completely incorrect,
00:33:50.240 that don't reflect how web responses actually work in the real world, so
00:33:55.960 you can sign up, yeah, so it's still private right now, but we'll be at the booth adding anybody who wants access,
00:34:03.200 so try it out, give us feedback, because we want to build this tool to solve real world problems, your
00:34:08.560 problems, yeah, so just a couple more slides, so there's a lot, obviously this
00:34:14.480 was what we were able to finish in time for this talk, so what you've seen is super early, there's a lot
00:34:21.119 coming soon, and one thing I want to say is, getting real data really changed the way I thought about
00:34:26.440 the app like my initial gut feeling about what we would actually build and what would be important turned out to be
00:34:31.480 a little bit different, like once we put real users on, the questions that they were asking were quite different than the questions that I
00:34:37.280 expected them to to ask and that's because once you can get a sense of like okay like 30% of my requests are in this
00:34:42.560 range and I can see that, you know, a certain amount of time is being spent in SQL or a certain amount of time is being spent in Ruby, you start really
00:34:49.560 asking questions that are hard to ask in the first place when you're looking at averages, so if you get an average of 150 milliseconds there's
00:34:55.000 really not much to do, but if you see 30% of my requests are in a range where the Ruby is super slow, you start asking,
00:35:00.560 like, why is the Ruby super slow, is it related to how big of a payload I'm processing here, or maybe
00:35:06.079 it's like my server is occasionally slow and not giving me enough CPU, right, actually being able to ask
00:35:12.640 these questions is really powerful and I think what we're probably going to be working on next is just making it easier
00:35:17.760 to ask these questions now or to answer these questions now that it's much easier to ask them in the first place um
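The kind of question Carl describes, "what is the time going to for the requests in this range", amounts to bucketing requests into histogram bins and aggregating within each bin; a toy sketch with invented numbers:

```ruby
# Hypothetical per-request measurements: total ms and ms spent in SQL.
requests = [
  { total: 90,  sql: 20  }, { total: 110, sql: 25  }, { total: 100, sql: 30 },
  { total: 320, sql: 260 }, { total: 350, sql: 300 }, { total: 400, sql: 310 },
]

# 100 ms histogram bins, then: where does the time go in each bin?
buckets = requests.group_by { |r| (r[:total] / 100) * 100 }
buckets.sort.each do |bin, reqs|
  sql_share = reqs.sum { |r| r[:sql] }.to_f / reqs.sum { |r| r[:total] }
  puts "#{bin}-#{bin + 99} ms: #{reqs.size} requests, #{(sql_share * 100).round}% in SQL"
end
```

In this made-up data the slow bins turn out to be dominated by SQL while the fast ones are not, which is exactly the kind of fact a single overall average cannot show.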
00:35:23.440 and I think in general this type of structure of operating around histograms, seeing aggregate distributions, really gives you a better
00:35:30.440 feel for what you don't know, right, so when you see a mean, basically the amount of things that you don't know is so vast that you don't
00:35:37.200 even really have a good way to think about what you might want to know, but when you open the door to
00:35:43.320 seeing basically what's going on for real, you start having good questions about what's up. I want to close with
00:35:48.680 just us, I'm really excited that it's now like a year and a half or
00:35:54.560 something since we started Tilde, we've been doing a lot of consulting, we've been doing bootstrapping, I
00:36:00.319 pretty much worked at VC funded companies a lot before Tilde, and I'm definitely a big believer in bootstrapping
00:36:06.560 now, a big believer in healthy sustainable businesses that are people working with other community members that they
00:36:13.240 have a kinship to, not necessarily, you know, big sales-run businesses, I've been there before and I'm pretty much over
00:36:19.960 that, so we definitely want to, we're basically developers, you know, we do consulting and we build real products
00:36:26.960 that show us basically what is wrong, and I think we want to help ourselves and you get to a better place understanding
00:36:33.319 what's happening in your application's performance, etc., so thank you very much, and definitely come to our booth
00:36:39.280 tomorrow and get signed up, thanks