
How to Build a Smart Profiler for Rails

by Tom Dale and Yehuda Katz

In the RailsConf 2014 talk titled "How to Build a Smart Profiler for Rails," speakers Yehuda Katz and Tom Dale discuss the development and functionality of Skylight, a performance profiling tool for Rails applications. They explain the origins of their company, Tilde, and the desire to create a sustainable business while focusing on the open-source community. The core purpose of Skylight is to provide developers with accurate performance measurements without the skewed data that averages might present. The key points of the presentation include:

  • Honest Response Times: The importance of using histograms instead of a single average to understand response times, since an average can mask the real distribution of application performance.

  • Architecture: How Skylight processes data through a custom-built architecture that leverages Kafka and Storm for high throughput and reliability, allowing for sophisticated analysis without overwhelming memory usage.

  • User Interface: The implementation of a responsive UI using Ember and D3 that enables real-time data interaction, allowing developers to quickly analyze their application's performance without tedious waits for updates.

  • Detection Features: Skylight's ability to automatically identify common performance issues such as N+1 queries through extensive data collection and heuristics, simplifying the troubleshooting process for developers.

  • Robustness: The use of Rust for building the Skylight agent, ensuring low overhead and stability in production environments without adding unnecessary memory burden.

The presentation emphasizes that to effectively address performance problems, tools need to be intuitive and actionable rather than simply presenting raw data. The functionality of Skylight is touted as a significant advance in performance tooling for Rails, making it easier for developers to monitor and improve their applications actively.

In conclusion, Skylight is characterized as a 'smart profiler' that operates in production environments, providing real-time insights and actionable advice, enhancing the developer experience and improving application performance through informed decision-making.

RailsConf 2014

00:00:15.869 Hey, you guys ready? Thank you so much for coming, this is awesome. When
00:00:21.430 they were putting together the schedule, I said, make sure that you put us down in the caves of Moria, so
00:00:26.890 thank you for coming down and making it. I'm Tom, this is Yehuda. When
00:00:33.850 people told me I was signed up to do a back-to-back talk, I don't know what I was thinking. So we want to talk to you
00:00:40.809 today about Skylight, but just a little bit before that, I want to talk about us. In
00:00:47.739 2011 we started a company called Tilde. Ah, the shirt made me self-conscious, because this is actually the first
00:00:53.860 edition and it's printed off center. Well, either I'm off center or the shirt's off center, one of the two. So we started
00:01:00.460 Tilde in 2011, and we had all just left a venture-backed company, and that was a
00:01:06.100 pretty traumatic experience for us, because we spent a lot of time building the company, and then we ran out of money and it sold to Facebook, and we really
00:01:11.740 didn't want to repeat that experience. So we decided to start Tilde, and when we did it, DHH and the other
00:01:20.170 people at Basecamp were talking about, you know, being bootstrapped and proud, and that was a message that really resonated with us, and so we wanted to
00:01:26.320 capture the same thing. There's only one problem with being bootstrapped and proud, and that is, in order to be both of
00:01:31.479 those things you actually need money. It turns out it's not like you just say it in a blog post and then all of a sudden you are in business. So we had to think a lot
00:01:39.640 about, okay, well, how do we make money? How do we make a profitable and, most importantly,
00:01:45.130 sustainable business? Because we didn't want to just flip it to Facebook in a couple of years. So looking around, I think
00:01:51.759 the most obvious thing that people suggested to us is, well, why don't you guys just become Ember, Inc.? Raise a few
00:01:58.090 million dollars, you know, build... what's your business model? Mostly prayer. But
00:02:04.899 that's not really how we want to think about building open source communities; we don't really think that that
00:02:10.119 necessarily leads to the best open source communities. And if you are interested more in that, I recommend Leah
00:02:15.430 Silber, who is one of our co-founders; she's giving a talk this afternoon, sorry, Friday afternoon, about
00:02:21.380 how to build a company that is centered on open source. So if you want to learn more about how we've done that, I would
00:02:27.050 really suggest you go check out her talk. So no, no Ember, Inc. for us. We
00:02:34.280 really wanted to build something that leveraged the strengths we thought we had. The one, I think most
00:02:41.210 importantly, is a really deep knowledge of open source and a deep knowledge of the Rails stack. And also, Carl, it turns out, is
00:02:46.700 really, really good at building highly scalable big data systems, lots of
00:02:52.340 Hadoop in there. So last year at RailsConf we announced the private beta
00:02:58.460 of Skylight. How many of you have used Skylight? Can you raise your hand if you have used it? Okay, many of you, awesome. So
00:03:04.690 Skylight is a tool for profiling and measuring the performance of your Rails applications in production, and as a
00:03:12.350 product, Skylight, I think, was built on three key breakthroughs.
00:03:19.340 We didn't want to ship a product that was incrementally better than the
00:03:24.590 competition; we wanted to ship a product that was dramatically better, a quantum leap, an order of magnitude better. And in
00:03:30.950 order to do that we spent a lot of time thinking about how we could solve most of the problems that we saw in the existing landscape. And
00:03:37.940 delivering a product that does that is predicated on these three
00:03:44.000 breakthroughs. So the first one I want to talk about is honest response times.
00:03:49.300 Honest response times. So DHH wrote a blog post on what was then the 37signals
00:03:56.360 blog, now the Basecamp blog, called "The problem with averages." How many of you have read this? Awesome. For those of you
00:04:02.930 that have not... how many of you hate raising your hands at presentations? So,
00:04:09.500 for those of you that do, they should put a button in every seat. So if you read this blog post, the
00:04:19.730 way it opens is: our average response time for Basecamp right now is 87 milliseconds. That sounds fantastic, and
00:04:26.000 it easily leads you to believe that all is well and that we wouldn't need to
00:04:31.160 spend any more time optimizing performance. But that's actually wrong. The average
00:04:37.670 number is completely skewed by tons of fast responses to feed requests and other cached replies. If you have 1,000
00:04:45.590 requests that return in five milliseconds, and then you have 200 requests taking two thousand
00:04:51.500 milliseconds, two full seconds, you can still report a
00:04:56.570 respectable 338-millisecond average. That's useless. So what does DHH say that we need? DHH
00:05:02.690 says the solution is histograms. So, for those of you like me who slept
00:05:07.700 through your statistics class in high school or college, a brief primer on histograms. A histogram is very simple:
00:05:14.780 basically, you have a series of buckets along some axis, and every
00:05:20.120 time a value lands in a bucket, you increment that bar by one. So this is an example of a
00:05:26.360 histogram of response times in a Rails application. You can see that there's a big cluster in the middle around four
00:05:32.780 hundred eighty-eight milliseconds (500 milliseconds isn't a super-speedy app, but it's not the worst thing in the world) and they're all clustered, and then
00:05:38.870 as you move to the right, you can see that the response times get longer and longer, and if you move to the left, the response times get shorter
00:05:44.390 and shorter.
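
(To make the bucketing concrete, here is a minimal Ruby sketch; the 50 ms bucket width is an arbitrary choice for illustration.)

```ruby
# Minimal histogram bucketing: each response time increments the bar for
# the bucket it falls into. The 50ms bucket width is arbitrary.
BUCKET_MS = 50

def histogram(response_times_ms)
  response_times_ms
    .group_by { |ms| ms / BUCKET_MS * BUCKET_MS } # floor to bucket start
    .sort.to_h
    .transform_values(&:size)
end

p histogram([12, 48, 488, 490, 512, 2_010])
# => {0=>2, 450=>2, 500=>1, 2000=>1}
```
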
00:05:52.430 So why do you want a histogram? What's the most important thing about a histogram? Well, I think it's because most endpoints don't actually look like a single number. If you think about what your Rails app is doing, it's a complicated beast, right? It
00:05:58.430 turns out Ruby is Turing-complete: you can do branching logic, you can do a lot of things. And so what that means is
00:06:05.510 that if you represent one endpoint with a single number, you are losing a lot of fidelity, to the point where it
00:06:12.169 becomes, as DHH said, useless. For example, in a histogram you can easily see: oh, here's a group of requests and
00:06:19.220 response times where I'm hitting the cache, and here's another group where I'm missing it, and you can see that that cluster is significantly slower than the
00:06:25.880 faster cache-hitting cluster. And the other thing that you get when you keep the whole
00:06:31.070 distribution in the histogram is that you can look at this number, the 95th
00:06:37.280 percentile. The way to think about the performance of your web
00:06:42.289 application is not the average, because the average doesn't really tell you anything. You want to
00:06:49.650 think about the 95th percentile, because that's not the average response time, that's the average worst response time
00:06:55.380 that a user is likely to hit. And the thing to keep in mind is that it's not as though a customer comes to your site, they
00:07:02.490 issue one request, and then they're done, right? As someone is using your website, they're going to be generating a lot of
00:07:09.150 requests, and you need to look at the 95th percentile, because otherwise every
00:07:15.060 request is basically you rolling the dice that they're not going to hit one of those two-second, three-second, four-second responses, close the tab, and go to
00:07:21.570 your competitor.
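
(A quick Ruby sketch of the distinction, reusing DHH's numbers; the percentile lookup here is the simple nearest-rank flavor.)

```ruby
# With the whole distribution you can report a percentile; with a running
# average you can only report the (misleading) mean.
def percentile(sorted, pct)
  sorted[((pct / 100.0) * (sorted.size - 1)).round]
end

times = ([5] * 1_000) + ([2_000] * 200)  # DHH's example workload
times.sort!

puts times.sum / times.size.to_f  # => 337.5  the "respectable" average
puts percentile(times, 95)        # => 2000   what users actually experience
```
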
00:07:30.960 So here's what I think is crazy: that blog post DHH wrote is from 2009. Five years went by, and there's still no tool that does what DHH was asking for. So, frankly, we
00:07:38.130 smelled money. We were like, ah, that slide should be green, green as in dollars; I think Keynote has the
00:07:44.370 "make it rain" effect I should have used. So we smelled blood in the water;
00:07:50.639 we're like, this is awesome. There's only one problem that we discovered, and that
00:07:56.340 is, it turns out that building this thing is actually really, really freaking hard. Really, really hard. So we announced the
00:08:02.639 private beta at RailsConf last year; before doing that, we'd spent a year of
00:08:08.699 research, spiking out prototypes, building prototypes, building out the beta. We launched at RailsConf, and we realized we'd
00:08:16.530 made a lot of errors when we were building the system. So after RailsConf last year, we
00:08:24.150 basically took six months to completely rewrite the backend from the ground up.
00:08:30.090 And I think, tying into your keynote, Yehuda, we were like, oh, we clearly
00:08:35.459 have a bespoke problem, no one else is doing this. So we wrote our own custom
00:08:40.740 backend, and then we had all these problems, and we realized that they had actually already all been solved by the open source community, and so we
00:08:46.740 benefited tremendously from having a shared solution. So our first release of this was really very bespoke, and the
00:08:53.190 current release uses a tremendous number of off-the-shelf open source projects that each solve a
00:09:00.330 particular problem very effectively, very well, none of which are as easy to use as
00:09:05.640 Rails, but all of which solve really thorny problems very effectively. So let's just talk, for your own
00:09:12.960 understanding, about how most performance monitoring tools work. The way that most of these work is that you
00:09:18.570 run your Rails app, and running inside of your Rails app is some gem, some agent, that you install, and every time the
00:09:25.050 Rails app handles a request, it generates events, and those events, which include information about performance data, those
00:09:32.460 events are passed into the agent, and then the agent sends that data to some
00:09:37.680 kind of centralized server. Now, it turns out that doing a running average is actually really simple, which is why
00:09:44.490 everyone does it. You can basically do it in a single query: all you do is have three columns in the
00:09:49.650 database, the endpoint, the running average, and the number of requests, and
00:09:55.380 those are the only things that you need to keep a running average. So keeping a running average is really simple from a technical point of view. I don't think you could do it in JavaScript, due to the
00:10:01.530 lack of integers. Yes, you probably wouldn't want to do any math in JavaScript, it turns out.
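
(A minimal sketch of why this is so easy; the three-column schema from the talk is reduced to a struct here.)

```ruby
# The whole "database" a running average needs: endpoint, average, count.
# Note how the shape of the distribution is thrown away on every update.
Stat = Struct.new(:endpoint, :average, :count)

def record(stat, duration_ms)
  stat.count += 1
  stat.average += (duration_ms - stat.average) / stat.count
end

row = Stat.new("PostsController#index", 0.0, 0)
[5, 5, 2_000].each { |ms| record(row, ms) }
puts row.average  # => 670.0, with no hint that most requests took 5ms
```
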
00:10:08.400 So we took a little bit of a different approach. Do you want to go over that? Sure, sure. So when we first started, right at the beginning,
00:10:15.180 we basically did a similar thing, where your app creates a bunch of events. Most of those start off as
00:10:21.360 ActiveSupport notifications, although it turns out that there's very limited use of ActiveSupport notifications, so we had
00:10:27.390 to do some normalization work to get them sane, which we're going to be upstreaming back into Rails.
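
(For reference, this is the standard Rails instrumentation hook such an agent starts from; the hand-off to the agent is hypothetical.)

```ruby
require "active_support/notifications"

# Rails publishes instrumentation events like "sql.active_record" and
# "process_action.action_controller"; a profiling agent subscribes to them.
ActiveSupport::Notifications.subscribe("sql.active_record") do |name, started, finished, _id, payload|
  duration_ms = (finished - started) * 1000.0
  # Agent.record(name, duration_ms, payload[:sql])  # hypothetical hand-off
  puts format("%s (%.1fms): %s", name, duration_ms, payload[:sql])
end
```
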
00:10:33.150 But one thing that's kind of unfortunate about having every single Rails app run an agent is that you end up having to do a lot of the same kind of work over and
00:10:39.840 over again, and use up a lot of memory. So, for example, every one of these things is making HTTP requests, so now you have
00:10:45.120 a queue of things that you're sending over HTTP in every single one of your Rails processes, and of course you
00:10:50.520 probably don't notice; people are used to Rails taking up hundreds and hundreds of megabytes, so you probably don't notice if you install some agent and it
00:10:55.830 suddenly starts taking 20, 30, 40, 50 more megabytes. But we really wanted to keep the actual memory per process down to a
00:11:02.850 small amount. So one of the very first things that we did (we even did it before last year) was to pull out all that shared logic
00:11:09.240 into a separate process called the coordinator. The agent is basically responsible simply for collecting the
00:11:15.810 trace; it's not responsible for actually talking to our server at all, and that means that the coordinator only has to do this queueing, keeping a
00:11:22.890 bunch of that work in one place, and it doesn't end up using as much memory.
00:11:27.950 And I think this ended up being very effective for us; that low overhead also allows us to just collect more information in general.
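
(The talk doesn't show code for this split; purely as an illustration, an in-process agent handing traces to one shared local daemon might look like the following, with all names and the socket path hypothetical.)

```ruby
require "json"
require "socket"

# Hypothetical sketch of the agent side of the agent/coordinator split:
# each Rails process does the bare minimum (serialize and hand off), and
# the single coordinator process owns the HTTP queue and its memory.
module Agent
  SOCKET_PATH = "/tmp/coordinator.sock" # hypothetical

  def self.submit(trace)
    UNIXSocket.open(SOCKET_PATH) { |sock| sock.puts(JSON.generate(trace)) }
  rescue Errno::ENOENT, Errno::ECONNREFUSED
    # Coordinator unavailable: drop the trace rather than disturb the app.
  end
end

Agent.submit(endpoint: "PostsController#index", duration_ms: 123.4)
```
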
00:11:33.780 Yep. Now, after our first attempt, we started
00:11:39.630 getting a bunch of customers that were telling us... so the separate coordinator is a good thing and a bad thing. On the one hand,
00:11:45.480 there's only one of them, so it uses up only one set of memory; on the other hand, it's really easy for someone to go in
00:11:51.180 and ps that process and see how many megabytes of memory it's using. So we got a lot of initial complaints that said, oh,
00:11:56.490 your process is using a lot of memory. And I spent a few weeks (I know Ruby pretty well), I spent a couple of weeks, and I
00:12:03.510 actually wrote a gem called allocation_counter that basically went in and tried to pinpoint exactly where the allocations were coming from. But it
00:12:09.480 turns out that it's actually really, really hard to track down exactly where allocations are coming from in Ruby, because something as simple as using a
00:12:15.240 regular expression in Ruby can allocate match objects that get put back on the stack, and so I was only able to pare this down to some degree.
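
(Ruby 2.1's objspace extension can attribute allocations to source locations; a minimal sketch of that kind of hunting, including the surprise MatchData.)

```ruby
require "objspace"

# Track allocations inside the block, then ask where each object was born.
ObjectSpace.trace_object_allocations do
  "skylight" =~ /sky(.*)/  # even a regexp match allocates a MatchData
end

ObjectSpace.each_object(MatchData) do |md|
  file = ObjectSpace.allocation_sourcefile(md)
  puts "MatchData from #{file}:#{ObjectSpace.allocation_sourceline(md)}" if file
end
```
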
00:12:21.180 But I discovered pretty quickly that trying to keep a lid on the memory allocation by doing
00:12:27.330 all this stuff in Ruby is mostly fine, but for our specific use case, where we really want to be telling you,
00:12:32.820 you can run the agent on your box and it's not going to use a lot of memory, we really needed something more
00:12:38.490 efficient. And our first thought was, we'll use C or C++. No problem, C is native, it's
00:12:43.710 great. And Carl did the work (Carl is very smart), and then he said, Yehuda, it is
00:12:48.750 now your turn, you need to start maintaining this. And I said, I don't trust myself to write C++ code that's running on all you guys's boxes and
00:12:54.990 not segfault, so I don't think that works for me. And so I noticed
00:13:00.840 that Rust was coming along, and what Rust really gives you is the ability to write low-level code, a la C
00:13:06.390 or C++ with manual memory management, that keeps your memory allocation low and keeps things speedy, low
00:13:11.820 resource utilization, while also giving you compile-time guarantees about not segfaulting. Again, if your processes
00:13:17.940 randomly started segfaulting because you installed our agent, I think you would stop being our customer very quickly, so having pretty much one hundred
00:13:23.340 percent guarantees about that was very important to us, and that's why we decided to use Rust. Just keep going? Keep
00:13:30.570 going. So we have this coordinator object, and basically the coordinator is receiving events; the events
00:13:36.180 basically end up being these traces that describe what's happening in your application. And the next thing: in
00:13:42.210 our initial work on this we used JSON just to send the payload to the server, but we noticed that a lot of people have
00:13:47.580 really big requests. You may have a big request with a big SQL query in it, or a lot of big SQL queries in it; some
00:13:53.370 people have traces that are hundreds and hundreds of nodes long. And so we really wanted to figure out how to shrink down the payload size to something that we
00:14:01.080 could be, you know, pumping out of your box on a regular basis without running up your bandwidth costs. So one of the
00:14:07.950 first things that we did early on was to switch to using protobufs as the transport mechanism, and that really shrunk down the payloads a lot.
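
(A sketch with the google-protobuf gem, schema hypothetical: declare a message type inline instead of via protoc, then compare the binary encoding against JSON.)

```ruby
require "google/protobuf"

# Hypothetical single-span schema, declared with the gem's inline DSL.
Google::Protobuf::DescriptorPool.generated_pool.build do
  add_file("span.proto", syntax: :proto3) do
    add_message "demo.Span" do
      optional :name, :string, 1
      optional :duration_us, :uint64, 2
    end
  end
end

Span = Google::Protobuf::DescriptorPool.generated_pool.lookup("demo.Span").msgclass

span   = Span.new(name: "SELECT * FROM users WHERE id = ?", duration_us: 4_200)
binary = Span.encode(span)
json   = Span.encode_json(span)
puts "protobuf: #{binary.bytesize} bytes, JSON: #{json.bytesize} bytes"
```
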
00:14:15.300 Our earlier prototypes for actually collecting the data were written in Ruby, but I think Carl did like a weekend hack
00:14:20.640 to just port it over to Java and got something like 200x performance. And you don't always get 200x performance: if mostly
00:14:26.490 what you're doing is database queries, you're not going to get a huge performance win, but mostly what we're doing is math and algorithms and data
00:14:32.070 structures, and for that, Ruby could in theory one day have a good JIT or something, but today, writing that code
00:14:38.220 in Java didn't end up being significantly more code, because it's just, you know, algorithms and data structures. Also, one thing
00:14:43.560 about standardizing on protobufs in our stack that's actually a huge win: we realized, hey, browsers, it
00:14:49.590 turns out, are pretty powerful these days. They can allocate memory, they can do all kinds of computation, and protobuf libraries
00:14:56.550 exist everywhere. So we save ourselves a lot of computation and a lot of time by just treating protobuf as the canonical
00:15:03.120 serialization form, and then you can move payloads around the entire stack and everything speaks the same language, so you save on the serialization and
00:15:08.370 deserialization. And JavaScript is actually surprisingly effective at taking protobufs and converting them to
00:15:13.470 the format that we need, efficiently.
00:15:19.140 So we basically take this data: the Java collector is collecting all these protobufs, and pretty much it just turns around, and this is sort of
00:15:24.420 where we got into bespoke territory before, where we started rolling our own. But we realized that when you write a big,
00:15:29.490 distributed, fault-tolerant system, there's a lot of problems that you really just want someone else to have thought about. So
00:15:35.130 what we do is we basically take these payloads that are coming in, we convert them into batches, and we send
00:15:41.009 the batches down into the Kafka queue. And the next thing that happens: so, Kafka is basically just a queue that
00:15:48.209 you can throw things into; I guess it might be considered similar to something like AMQP. It has some nice
00:15:54.120 fault-tolerance properties and integrates well with Storm, but most importantly, it's just super, super high throughput. So we basically don't want to
00:16:00.630 put any barrier between you giving us the data and us getting it into Storm, which I think we'll talk about in a
00:16:05.639 bit. So basically Kafka takes the data and starts sending it into Storm.
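
(From Ruby, handing a batch to Kafka is nearly a one-liner with the ruby-kafka gem; the broker address and topic name here are hypothetical.)

```ruby
require "kafka" # the ruby-kafka gem

kafka = Kafka.new(["localhost:9092"], client_id: "collector")
batch = ["trace-1", "trace-2"].join("\n") # stand-in for a protobuf batch
kafka.deliver_message(batch, topic: "trace-batches")
```
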
00:16:10.800 And if you think about what has to happen in order to aggregate your requests: you have these requests, there are, you know, maybe
00:16:16.579 traces that have a bunch of SQL queries, and our job is basically to take all those SQL queries and say, okay, I can see that in all of your requests you
00:16:22.620 have this SQL query, and it took around this amount of time, and it happened as a child of this other node. And the way to think about that is basically just a
00:16:28.440 processing pipeline, right? You have these traces that come in one side, you start passing them through a bunch of processing steps, and then you end up on
00:16:34.740 the other side with the data. And Storm is actually a way of describing that processing pipeline in a sort of
00:16:40.350 functional style, and then you tell it, okay, here's how many servers I need, here's how I'm going
00:16:45.389 to handle failures, and it basically deals with distribution and scaling and all that stuff for you. And part of that
00:16:52.110 is because you wrote everything in a functional style. So what happens is, Kafka sends the data into the entry
00:16:57.930 spout ("spout" is the terminology in Storm for these
00:17:03.120 streams that get created), and they basically go into these processing things, which very cleverly, cutely, are
00:17:08.520 called bolts. This is definitely not the naming I would have used, but they're
00:17:13.530 called bolts. And the idea is that basically every request may need several kinds of processing: for example, we now
00:17:18.600 automatically detect N+1 queries, and that's sort of a different kind of processing from "make a picture of
00:17:24.929 the entire request" or "what is the 95th percentile across your entire app," right? These are all different kinds of processing. So we take the data and we
00:17:30.929 send it into a bunch of bolts, and the cool thing about bolts is that, again, because they're just functional chaining,
00:17:36.900 you can take the output from one bolt and feed it into another bolt, and that works pretty well. And you
00:17:43.500 don't have to worry about... I mean, you have to worry about things like fault tolerance, failure, idempotence, but you worry about
00:17:49.830 them at the abstraction level, and then the operational part is handled for you. It's a very declarative way of
00:17:55.799 describing how this computation works, in a way that's easy to scale.
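
(Storm topologies are defined on the JVM, so this is only a Ruby analogy of the spout-and-bolt idea: declare pure processing steps, then let a runner chain them.)

```ruby
# Toy stand-in for a Storm topology: each bolt maps one event to zero or
# more events (like Storm's emit), and bolts compose into a pipeline.
class Topology
  def initialize
    @bolts = []
  end

  def bolt(&block)
    @bolts << block
    self
  end

  def run(events)
    @bolts.reduce(events) { |stream, b| stream.flat_map(&b) }
  end
end

traces = [
  { queries: ["SELECT * FROM comments WHERE post_id = 1"] },
  { queries: ["SELECT * FROM comments WHERE post_id = 2"] },
]

counts = Topology.new
  .bolt { |t| t[:queries].map { |q| q.gsub(/\b\d+\b/, "?") } } # normalize binds
  .bolt { |sql| [[sql, 1]] }                                   # emit (key, 1)
  .run(traces)
  .group_by(&:first)
  .transform_values { |pairs| pairs.sum(&:last) }

p counts # => {"SELECT * FROM comments WHERE post_id = ?"=>2}
```
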
00:18:01.860 Carl actually talked about this at very high speed yesterday, and some of you may have been there; I would recommend watching the video when it comes out if
00:18:07.170 you want to learn more about how to make use of this stuff in your own applications. And then, when you're finally done with all the processing, you
00:18:13.530 need to actually do something with it, you need to put it somewhere so that the web app can get access to it, and we use Cassandra for that.
00:18:20.669 Cassandra, again, is mostly a dumb database, but it has high
00:18:26.010 capacity, and it has the fault-tolerance properties we need, and we're very write-heavy, like, we tend to be
00:18:31.470 writing more than we're reading. Yep. And then, when we're done with a particular batch, Cassandra
00:18:37.919 basically kicks off the process over again. So we're basically doing these things as batches; these are roll-ups, is what's happening here. So
00:18:43.650 basically every minute, every 10 minutes, and then every hour, we reprocess and re-aggregate, so that when you query
00:18:49.380 us we know exactly what to give you. Yep. So we sort of have this cycle where, obviously, in
00:18:55.230 the first minute you really want high granularity, you want to see what's happening right now, but if you want to go back and look at data from three
00:19:00.480 months ago, you probably care about it at, like, the day granularity, or maybe the hour granularity. So we basically do
00:19:06.179 these roll-ups, and we cycle through the process.
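
(A minimal sketch of a time-bucket roll-up, schema hypothetical: minute buckets re-aggregate into coarser windows so old data stays cheap to query.)

```ruby
require "time"

def rollup(buckets, window_seconds)
  buckets
    .group_by { |b| b[:at].to_i / window_seconds * window_seconds }
    .map do |window_start, group|
      { at: Time.at(window_start).utc,
        count: group.sum { |b| b[:count] },
        total_ms: group.sum { |b| b[:total_ms] } }
    end
end

minutes = [
  { at: Time.utc(2014, 4, 25, 10, 0), count: 40, total_ms: 6_000 },
  { at: Time.utc(2014, 4, 25, 10, 1), count: 55, total_ms: 9_500 },
]
p rollup(minutes, 3_600) # one hourly bucket: count 95, total_ms 15_500
```
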
00:19:11.250 So it turns out building this system required an intense amount of work. Carl spent probably six months reading PhD theses to
00:19:21.090 find data structures and algorithms that we could use, because this is a huge amount of data. Like, I
00:19:26.940 think even a few months after we were in private beta, we were
00:19:34.140 already handling over a billion requests per month; basically, the number of requests that we handle is the sum of all the requests
00:19:39.600 that all of our customers handle. Right, right. So that's a lot of requests. So obviously we can't provide a
00:19:44.730 service, at least not an affordable, accessible service, if we have to store
00:19:51.150 terabytes or exabytes of data just to tell you how your app is doing. And I think it's also problematic if you
00:19:56.880 store all the data in a database and then, every single time someone wants to learn something about it, you have to do a query; those queries can take a very
00:20:02.400 long time, they can take minutes, and we really wanted something where the
00:20:07.560 feedback loop would be fast. So we wanted to find algorithms that let us handle the data in real time and then provide
00:20:13.830 it to you in real time, instead of this "dump the data somewhere and then run complicated queries" approach. So... this
00:20:21.480 slide was not supposed to be here, it's supposed to be a real slide. I went too far. Okay, we'll watch that again,
00:20:28.950 that's pretty cool. So the last thing I want to say is, perhaps your takeaway from looking at this architecture
00:20:34.320 diagram is: oh my gosh, these Rails guys have gone completely off the deep end, they jumped the shark, they ditched Rails. I saw like three
00:20:41.310 tweets yesterday (I wasn't here, I was in Portland yesterday), but I saw like three tweets that were like, I'm at RailsConf and I haven't seen a single talk about
00:20:46.530 Rails. So that's true here too, but I want
00:20:51.900 to assure you that we are only using this stack for the heavy computation. We started in Rails. We started with, like,
00:20:59.880 what do we need? Well, people need to authenticate and log in, and we probably need to do billing, and those are
00:21:05.010 all things that Rails is really, really good at. So we started with Rails as basically the starting point, and then we
00:21:10.140 realized, oh my gosh, the computation is really slow, there's no way we're going to be able to offer the service; okay, now
00:21:15.300 let's think about how we can fix that. I think, notably, a lot of people who look at Rails... there's a lot of companies that have built big stuff on
00:21:21.330 Rails, and their attitude is like, oh, this legacy, terrible Rails app, we should really get rid of it; we could just write everything in Scala or Clojure or Go and
00:21:28.410 everything would be amazing. That is definitely not our attitude. Our attitude is that Rails is really amazing at the kinds of things that
00:21:34.950 are really common across everyone's web applications (authentication, billing, etc.), and we really want to be using Rails for
00:21:41.160 those parts. Even things like error tracking we do through the Rails app. We want to be using Rails because it's very productive at doing those things; it
00:21:47.070 happens to be very slow at doing data crunching, so we're going to use a different tool for that. But I don't think you'll ever see me getting up and
00:21:52.230 saying I really wish we'd just started writing, you know, the Rails app in Rust.
00:21:57.920 So that's number one: honest response times, which, it turns out,
00:22:03.900 seems like it should be easy but requires storing an insane amount of data. The
00:22:09.270 second thing that we realized, looking at all of these tools, is that most of them focus on data. They focus on
00:22:14.370 giving you the raw data, but I'm not a machine, I'm not a computer, I don't enjoy sifting through data; that's what
00:22:20.460 computers are good for. I would rather be drinking a beer; it's really nice out this time of year. So we wanted to think about: if you are trying to
00:22:27.570 solve the performance problems in your application, what are the things that you would suss out with the existing tools
00:22:33.480 after spending, like, four hours digging through them? And I think part of this is just that people
00:22:40.580 like to think that they're going to use these tools, but when the tools require you to dig through a lot of data, people just don't use them very much. So
00:22:46.680 the goal here was to build a tool that people actually use and actually like using, and not a tool that happens to provide a lot of data you can
00:22:52.710 sift through. Yeah, so probably one of the first things that we realized is what we don't want to provide. This is a
00:22:58.710 trace of a request. You've probably seen similar UIs in other tools, or, for example, in the inspector in Chrome
00:23:05.310 or Safari, and this is basically a visual stack trace of where your application is
00:23:11.040 spending its time. But I think what was important for us is showing not just a single request, because your app handles,
00:23:18.540 you know, hundreds of thousands of requests, or millions of requests, so looking at a single request is statistically incomplete; it's just noise.
00:23:24.540 And it's especially bad if it's the worst request, because the worst request really is noise.
00:23:30.120 Right, it's literally the outlier. Yep. So what we present in Skylight is
00:23:37.440 something a little bit different, something that we call the aggregate trace. The aggregate trace is
00:23:43.760 basically us taking all of your requests and averaging out where each of these
00:23:49.710 things spends its time, and then showing you that. So this is like the
00:23:56.760 Statue of David: it is the idealized form of the stack trace, of how your application is behaving. But of course you
00:24:03.630 have the same problem as before, which is that if this was all we were showing you, it would be obscuring a lot
00:24:10.600 of information. You want to be able to tell the difference between, okay, what does my stack trace look like for fast requests, and how does that differ from
00:24:16.860 requests that are slower? So I've got a little video here: you can see that when I move this slider,
00:24:22.539 the trace below it is actually updating in real time. As I move the
00:24:28.419 slider around, you can see that the aggregate trace actually updates with it, and that's because we're collecting,
00:24:33.970 like I said, a lot of data. We can recompute this aggregate trace on the fly; basically, for each bucket we're
00:24:40.299 storing a different trace, and then on the client we're reassembling them (we'll go into that a little bit). And I think it's really important that you be able to do
00:24:46.450 these experiments quickly. If every time you thought, oh, I wonder what happens if I add another histogram bucket, it
00:24:51.850 required a whole full-page refresh, that would basically make people not want to use the tool, not able to use the tool. So actually building something
00:24:58.330 that is real-time and fast, and gets the data as it comes in, was very important to us. So that's number one. And the second
00:25:04.299 thing: so we built that, and we're like, okay, well, what's next? And I think the big problem with this is that you need to
00:25:09.580 know that there is a problem before you go look at it, right? So we've been working for the past few months; the
00:25:15.760 Storm infrastructure that we've built makes it pretty straightforward to start building more abstractions on top of the data that we've already collected;
00:25:20.980 it's a very declarative system. So we've been working on a feature called inspections, and what's cool about
00:25:26.559 inspections is that we can look at this tremendous volume of data that we've collected from your app, and we can
00:25:31.570 automatically tease out what the problems are. So the first one that we've shipped: this is in beta right now, it's not out and enabled by default,
00:25:38.289 but it is behind a feature flag that we've had some users turning on and trying out. And what we can do in this
00:25:44.860 case is, because we have information about all the database queries in your app, we can look and see if you have N
00:25:50.980 +1 queries. Maybe explain what an N+1 query is? Yes, so people hopefully know what N+1 queries are, but it's the
00:25:56.169 idea that, by accident, for some reason, instead of making one query, you ask for, like, all the posts, and then you
00:26:02.320 iterate through all of them and get the comments for each one. So now, instead of having one query, you have one query per
00:26:08.020 post. What you'd like to do is eager loading, where you say include the comments, but you have to
00:26:13.870 know that you have to do that. There are some tools that will catch this in development mode, if you happen to hit it there, like Bullet. This is basically a tool
00:26:19.929 that's looking at every single one of your requests, and it has some thresholds. Once we see that a bunch of your requests have the same exact query...
00:26:26.470 and we do some work to pull out binds, so if it's like WHERE something = 1, we will automatically pull out the 1
00:26:32.530 and replace it with a question mark... we basically take all those queries, and if it's the exact same query repeated multiple times, subject to
00:26:39.430 some thresholds, we'll start showing you: hey, there's an N+1 query here.
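
(The bind-stripping plus repetition-counting they describe fits in a few lines of Ruby; the threshold here is hypothetical.)

```ruby
N_PLUS_ONE_THRESHOLD = 5 # hypothetical cut-off

# Strip string and numeric bind values so repeated queries compare equal.
def normalize(sql)
  sql.gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?")
end

def n_plus_one_candidates(queries_in_request)
  queries_in_request
    .group_by { |sql| normalize(sql) }
    .select { |_, dupes| dupes.size >= N_PLUS_ONE_THRESHOLD }
    .keys
end

queries = (1..10).map { |id| "SELECT * FROM comments WHERE post_id = #{id}" }
p n_plus_one_candidates(queries)
# => ["SELECT * FROM comments WHERE post_id = ?"]
```
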
00:26:45.940 And you can imagine the same sort of thing being done for things like: are you missing an
00:26:51.640 index? Or are you using the Ruby version of, say, JSON when you should be using the native version? These
00:26:58.000 are all things that we can start detecting, just because we're consuming an enormous amount of information and we can start writing heuristics for bubbling it up. So, the third and final breakthrough: we realized that we really, really needed a lightning-fast UI,
00:27:04.180 something really responsive. In particular, the feedback loop is critical, right? You can imagine if the way that
00:27:09.880 you dug into data was: you click, and you wait an hour, and then you get your results. No one would do it, no one would
00:27:14.920 ever do it. And the existing tools are okay, but you click and you wait, you look at it,
00:27:20.320 and you're like, oh, I want a different view, so then you go edit your query, and then you click and you wait, and it's just not a pleasant experience. So we use
00:27:28.330 Ember. The UI that you're using when you log into Skylight, even though it feels just like a regular website (almost like a native app), all of the
00:27:36.040 routing, all the rendering, all the decision-making is happening in an Ember.js app, and we pair that with D3.
00:27:42.310 So all of the charts, the chart that you saw there in the aggregate trace, that is all Ember components powered by D3, and
00:27:49.510 this has actually significantly cleaned up our client-side code; it makes reusability really, really awesome. So
00:27:55.540 I'll give you an example. This is from our billing page: the designer came in, and they had a component that was like the date component, and it seemed really
00:28:03.250 boring at first, seemed really boring, but this is the implementation, right? You could copy and paste this code over and
00:28:09.550 over again, everywhere you go; just remember to format it correctly, because if you forget to format it, it's not going to look the same everywhere. But I was like,
00:28:14.890 hey, we're using this all over the place, why don't we bundle this up into a component? And so with Ember it was super
00:28:20.230 easy: we basically just said, okay, here's a new calendar-date component; it has a property on it called date; just set that
00:28:25.270 to any JavaScript Date object, nothing to remember about converting it or formatting it. Here's the component: set the date, and it will render the
00:28:31.420 correct thing automatically. And so the architecture of the Ember app looks a little bit something
00:28:36.580 like this, where you have many, many different components, most of them just driven by D3, and they're plugged into the model and the controller, and the
00:28:43.060 Ember app will go fetch those models from the cloud (from the Java app, which just queries Cassandra)
00:28:48.850 and render them. And what's neat about this model is that turning on WebSockets is super easy, right? Because all of
00:28:56.050 these components are bound to a single place, so when the WebSocket says, hey, we have updated information for you to show,
00:29:01.120 it just pushes it onto the model, or onto the controller, and the whole UI updates automatically. It's like magic.
00:29:06.760 It's like magic. And when debugging, this is especially awesome, too; I don't know, maybe we'll show a demo
00:29:13.510 of the Ember Inspector using this. So yeah, that lightning-fast UI, shrinking the feedback loop so that you can quickly play with your data, makes it
00:29:20.200 go from a chore to something that actually feels kind of fun. So these were
00:29:25.690 the breakthroughs that we had when we were building Skylight, the things that made us think: yes, this is actually a product that we think deserves to be on
00:29:30.730 the market. So: one, honest response times, collecting data that no one else can collect; two, focusing on answers instead of just
00:29:36.010 dumping data; and three, having a lightning-fast UI to do it with. So we'd like to think of Skylight as basically a smart profiler, a smart profiler that
00:29:42.130 runs in production. It's like the profiler you run on your local development machine, but instead of being on your
00:29:47.470 local dev box, which has nothing to do with the performance characteristics of what your users are experiencing, we're actually running in production. So let me
00:29:55.540 just give you guys a quick demo.
00:30:01.049 So this is what Skylight looks like; let me get this set up, there we go. So the first thing here
00:30:08.820 is we've got the app dashboard. This is our 95th-percentile response time; maybe you're
00:30:15.029 all hammering it right now, that would be nice. So this is a graph of your response time over time, and then on the right,
00:30:21.090 this is a graph of the RPMs, the requests per minute, that your app is handling. So this is app-wide, and this is
00:30:26.970 live; this updates every minute. Then down below you have a list of the endpoints in your application, so you can see
00:30:33.659 the slowest ones at the top. For us, we have an instrumentation API, and we've gone and instrumented our background workers, so we can see them
00:30:39.869 here with their response times mixed in. So we can see that we have this reporting worker that's taking, at the 95th percentile, 13
00:30:47.009 seconds. All that time used to be inside of some request somewhere, and we discovered that there was a lot of time being spent in things that we could push
00:30:53.399 to the background. We probably need to update the agony index so that it doesn't rank workers very high, because spending some time in your workers is not that
00:30:59.549 big of a deal. So then, if we dive into one of these, you can see that for this endpoint we've got the time explorer up
00:31:06.419 above, and that shows a graph of response time, again at the 95th percentile, and if you want to go back and look at
00:31:12.450 historical data, you just drag it like this. And this has got a brush, so you can zoom in and out on different times, and
00:31:17.489 every time you change that range, you can see that it's very responsive; it's never waiting for the server, but it is going back and fetching more data from the
00:31:23.489 server, and then when the data comes back, you see the whole UI just updates, and we get that for free with Ember and D3. And then down below, as we discussed, you
00:31:30.840 actually have a real histogram, and this histogram in this case is showing 457 requests, and if we click and
00:31:37.529 drag, we can just move this, and you can see that the aggregate trace below updates in response to us dragging this.
00:31:43.049 And if we want to look at the fastest requests, we just click "faster" and we'll choose that range on the histogram; I think
00:31:48.659 it's actually the fastest third. And then if you click on "slower," you can see the slow requests. So this makes it really easy to
00:31:53.940 compare and contrast: okay, why are certain requests fast, or why are certain requests slow? You can see these
00:31:59.609 blue areas: this is Ruby code. Right now it's not super granular (it would be nice if you could actually know what's
00:32:06.029 going on here), but it'll at least tell you where in your controller action this is happening. And then you can actually
00:32:11.879 see which database queries are being executed and what their duration is, and you can see that we actually extract the SQL
00:32:18.180 and we normalize it, so you can see exactly what those requests are, even if the values
00:32:23.790 are totally different between them. Yeah, the real query, courtesy of Rails not yet supporting bind extraction, is like WHERE
00:32:29.550 id = 1, or 10, or whatever. Yep. So that's pretty cool. One other thing is that
00:32:36.380 initially we actually just showed the whole trace, but we discovered that, obviously, when you show whole traces, you have information that doesn't
00:32:42.450 really matter that much. So we've recently started to collapse things that don't matter so
00:32:48.390 much; you can basically expand or condense the trace, and we wanted to make it so that you don't have to think about expanding or condensing individual areas;
00:32:55.380 you just see what matters the most, and then you can see the trivial areas. Yep. So that's the demo of Skylight; we'd really
00:33:02.610 like it if you checked it out. There is one more thing I want to show you that is, like, really freaking cool. This is
00:33:07.680 coming out of Tilde Labs. Carl has been hacking; he's been up until past
00:33:12.870 midnight, getting almost no sleep, for the past month, trying to have this ready. I don't know how many of you know this, but
00:33:19.460 Ruby 2.1 has a new stack-sampling
00:33:25.320 feature, so you can get really granular information about how your Ruby code is performing. So I want to show you: I just
00:33:32.940 mentioned how it would be nice if we could get more information out of what your Ruby code is doing, and now we can
00:33:38.220 do that. Basically, every few milliseconds, this code that Carl wrote goes into MRI and takes
00:33:45.570 a snapshot of the stack, and because this is built in, it's very low-impact: it's
00:33:50.790 not allocating any new memory, it has very little performance cost; you basically wouldn't even notice it. So every few
00:33:56.160 milliseconds it's sampling, and we take that information and we send it up to our servers. So it's almost like you're running ruby-prof on your local dev
00:34:02.820 box, where you get extremely granular information about where your code is spending its time, in Ruby, per method, for
00:34:08.370 all of these things, except it's happening in production.
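
(The talk's agent does this from Rust; from plain Ruby, the same 2.1-era sampling machinery is exposed through gems like stackprof, for example:)

```ruby
require "stackprof" # sampling profiler built on Ruby 2.1's rb_profile_frames

profile = StackProf.run(mode: :wall, interval: 1_000) do # sample every 1000us
  100_000.times { "railsconf".reverse.upcase }
end

top = profile[:frames].values.max_by { |f| f[:samples].to_i }
puts "hottest frame: #{top[:name]} (#{top[:samples]} samples)"
```
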
00:34:16.110 So this is... we enabled it in staging; you can see that we've got some rendering bugs, it's still in beta, and we haven't yet collapsed things
00:34:21.330 that are not important for this particular feature; we want to hide things like framework code. Obviously,
00:34:27.460 this gives you an incredibly, incredibly granular view of what your app is doing in production. This is an
00:34:36.550 API that's built into Ruby 2.1, and because our agent is running so low-
00:34:42.849 level, because we wrote it in Rust, we have the ability to do things like this. And Carl thinks that we may be able to
00:34:48.190 actually backport this to older Rubies too, so if you're not on Ruby 2.1, we think that we can actually bring this to you, but that's TBD. So I think the cool thing
00:34:55.330 about this in general is: when you run a sampling profiler (it's a sampling profiler, right, we don't want to be burdening every
00:35:00.730 single thing that you do in your program with tracing, that would be very slow), so when you normally run a sampling profiler,
00:35:06.640 you basically have to make a loop: you have to run this code a million times and keep sampling,
00:35:12.160 and eventually you'll get enough samples to get the information. But it turns out that your production server is a loop: your production server is serving tons
00:35:19.060 and tons of requests. So by simply taking a few microseconds out of every request and collecting a couple
00:35:24.849 of samples over time, we can actually get this really high-fidelity picture at
00:35:30.220 basically no cost, and that's pretty mind-blowing. And this is the kind of stuff that we can start doing by really caring about both the user
00:35:36.580 experience and the implementation, and getting really geeky about it. And honestly, this is a
00:35:42.010 really exciting feature that really shows what we can do as we keep building, once we've got that groundwork. So if you guys want to
00:35:49.089 check it out: skylight.io. It's available today; it's no longer in private beta, everyone can sign up, no invitation token necessary, and you can
00:35:55.869 get a 30-day free trial if you haven't started one already. So if you have any questions, please come see us right now, or we have a booth in the vendor hall.
00:36:01.030 Thank you guys very much.