
How to Build a Smart Profiler for Rails

by Tom Dale and Yehuda Katz

In the RailsConf 2014 talk titled "How to Build a Smart Profiler for Rails," speakers Yehuda Katz and Tom Dale discuss the development and functionality of Skylight, a performance profiling tool for Rails applications. They explain the origins of their company, Tilde, and the desire to create a sustainable business while focusing on the open-source community. The core purpose of Skylight is to provide developers with accurate performance measurements without the skewed data that averages might present. The key points of the presentation include:

  • Honest Response Times: The importance of using histograms instead of a single average to understand response times, since an average can mask the real distribution of application performance.

  • Architecture: How Skylight processes data through a custom-built architecture that leverages Kafka and Storm for high throughput and reliability, allowing for sophisticated analysis without overwhelming memory usage.

  • User Interface: The implementation of a responsive UI using Ember and D3 that enables real-time data interaction, allowing developers to quickly analyze their application's performance without tedious waits for updates.

  • Detection Features: Skylight's ability to automatically identify common performance issues such as N+1 queries through extensive data collection and heuristics, simplifying the troubleshooting process for developers.

  • Robustness: The use of Rust for building the Skylight agent, ensuring low overhead and stability in production environments without adding unnecessary memory burden.

The presentation emphasizes that to effectively address performance problems, tools need to be intuitive and actionable rather than simply presenting raw data. The functionality of Skylight is touted as a significant advance in performance tooling for Rails, making it easier for developers to monitor and improve their applications actively.

In conclusion, Skylight is characterized as a 'smart profiler' that operates in production environments, providing real-time insights and actionable advice, enhancing the developer experience and improving application performance through informed decision-making.

RailsConf 2014

00:00:15.869 Hey, you guys ready? Thank you so much for coming, this is awesome. When
00:00:21.430 they were putting together the schedule, I said, make sure that you put us down in the caves of Moria, so
00:00:26.890 thank you for coming down and making it. I'm Tom, this is Yehuda. When
00:00:33.850 people told me I was signed up to do a back-to-back talk, I don't know what I was thinking. So we want to talk to you
00:00:40.809 today about Skylight, but just a little bit before that, I want to talk about us. In
00:00:47.739 2011 we started a company called Tilde. Ah, the shirt made me self-conscious, because this is actually the first
00:00:53.860 edition and it's printed off center. Well, either I'm off center or the shirt's off center, one of the two. So we started
00:01:00.460 Tilde in 2011, and we had all just left a venture-backed company, and that was a
00:01:06.100 pretty traumatic experience for us, because we spent a lot of time building the company, and then we ran out of money and it sold to Facebook, and we really
00:01:11.740 didn't want to repeat that experience. So we decided to start Tilde, and when we did it, DHH and the other
00:01:20.170 people at Basecamp were talking about, you know, being bootstrapped and proud, and that was a message that really resonated with us, and so we wanted to
00:01:26.320 capture the same thing. There's only one problem with being bootstrapped and proud, and that is, in order to be both of
00:01:31.479 those things you actually need money. It turns out it's not like you just say it in a blog post and then all of a sudden you are in business. So we had to think a lot
00:01:39.640 about, okay, well, how do we make money? How do we make a profitable and, most importantly,
00:01:45.130 sustainable business? Because we didn't want to just flip it to Facebook in a couple of years. So looking around, I think
00:01:51.759 the most obvious thing that people suggested to us is, well, why don't you guys just become Ember, Inc.? Raise a few
00:01:58.090 million dollars, you know, build... what's your business model? Mostly prayer. But
00:02:04.899 that's not really how we want to think about building open source communities; we don't really think that that
00:02:10.119 necessarily leads to the best open source communities. And if you are interested more in that, I recommend Leah
00:02:15.430 Silber, who is one of our co-founders; she's giving a talk this afternoon, sorry, Friday afternoon, about
00:02:21.380 how to build a company that is centered on open source. So if you want to learn more about how we've done that, I would
00:02:27.050 really suggest you go check out her talk. So no, no Ember, Inc. for us. We
00:02:34.280 really wanted to build something that leveraged the strengths we thought we had. The one, I think most
00:02:41.210 importantly, is a really deep knowledge of open source and a deep knowledge of the Rails stack. And also, Carl, it turns out, is
00:02:46.700 really, really good at building highly scalable big data systems, lots of
00:02:52.340 Hadoop in there. So last year at RailsConf we announced the private beta
00:02:58.460 of Skylight. How many of you have used Skylight? Can you raise your hand if you have used it? Okay, many of you, awesome. So
00:03:04.690 Skylight is a tool for profiling and measuring the performance of your Rails applications in production, and as a
00:03:12.350 product, Skylight, I think, was built on three key breakthroughs.
00:03:19.340 We didn't want to ship a product that was incrementally better than the
00:03:24.590 competition; we wanted to ship a product that was dramatically better, a quantum leap, an order of magnitude better. And in
00:03:30.950 order to do that we spent a lot of time thinking about how we could solve most of the problems that we saw in the existing landscape. And
00:03:37.940 delivering a product that does that is predicated on these three
00:03:44.000 breakthroughs. So the first one I want to talk about is honest response times.
00:03:49.300 Honest response times. So DHH wrote a blog post on what was then the 37signals
00:03:56.360 blog, now the Basecamp blog, called "The problem with averages." How many of you have read this? Awesome. For those of you
00:04:02.930 that have not... how many of you hate raising your hands at presentations? So,
00:04:09.500 for those of you that do, they should put a button in every seat. So if you read this blog post, the
00:04:19.730 way it opens is: our average response time for Basecamp right now is 87 milliseconds. That sounds fantastic, and
00:04:26.000 it easily leads you to believe that all is well and that we wouldn't need to
00:04:31.160 spend any more time optimizing performance. But that's actually wrong. The average
00:04:37.670 number is completely skewed by tons of fast responses to feed requests and other cached replies. If you have 1,000
00:04:45.590 requests that return in five milliseconds, and then you have 200 requests taking two thousand
00:04:51.500 milliseconds, two full seconds, you can still report a
00:04:56.570 respectable 338-millisecond average. That's useless. So what does DHH say that we need? DHH
00:05:02.690 says the solution is histograms. So, for those of you like me who slept
00:05:07.700 through your statistics class in high school or college, a brief primer on histograms. A histogram is very simple:
00:05:14.780 basically, you have a series of buckets along some axis, and every
00:05:20.120 time a value lands in a bucket, you increment that bar by one. So this is an example of a
00:05:26.360 histogram of response times in a Rails application. You can see that there's a big cluster in the middle around four
00:05:32.780 hundred eighty-eight milliseconds (500 milliseconds isn't a super-speedy app, but it's not the worst thing in the world) and they're all clustered, and then
00:05:38.870 as you move to the right, you can see that the response times get longer and longer, and if you move to the left, the response times get shorter
00:05:44.390 and shorter.
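
(To make the bucketing concrete, here is a minimal Ruby sketch; the 50 ms bucket width is an arbitrary choice for illustration.)

```ruby
# Minimal histogram bucketing: each response time increments the bar for
# the bucket it falls into. The 50ms bucket width is arbitrary.
BUCKET_MS = 50

def histogram(response_times_ms)
  response_times_ms
    .group_by { |ms| ms / BUCKET_MS * BUCKET_MS } # floor to bucket start
    .sort.to_h
    .transform_values(&:size)
end

p histogram([12, 48, 488, 490, 512, 2_010])
# => {0=>2, 450=>2, 500=>1, 2000=>1}
```
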
00:05:52.430 So why do you want a histogram? What's the most important thing about a histogram? Well, I think it's because most endpoints don't actually look like a single number. If you think about what your Rails app is doing, it's a complicated beast, right? It
00:05:58.430 turns out Ruby is Turing-complete: you can do branching logic, you can do a lot of things. And so what that means is
00:06:05.510 that if you represent one endpoint with a single number, you are losing a lot of fidelity, to the point where it
00:06:12.169 becomes, as DHH said, useless. For example, in a histogram you can easily see: oh, here's a group of requests and
00:06:19.220 response times where I'm hitting the cache, and here's another group where I'm missing it, and you can see that that cluster is significantly slower than the
00:06:25.880 faster cache-hitting cluster. And the other thing that you get when you keep the whole
00:06:31.070 distribution in the histogram is that you can look at this number, the 95th
00:06:37.280 percentile. The way to think about the performance of your web
00:06:42.289 application is not the average, because the average doesn't really tell you anything. You want to
00:06:49.650 think about the 95th percentile, because that's not the average response time, that's the average worst response time
00:06:55.380 that a user is likely to hit. And the thing to keep in mind is that it's not as though a customer comes to your site, they
00:07:02.490 issue one request, and then they're done, right? As someone is using your website, they're going to be generating a lot of
00:07:09.150 requests, and you need to look at the 95th percentile, because otherwise every
00:07:15.060 request is basically you rolling the dice that they're not going to hit one of those two-second, three-second, four-second responses, close the tab, and go to
00:07:21.570 your competitor.
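
(A quick Ruby sketch of the distinction, reusing DHH's numbers; the percentile lookup here is the simple nearest-rank flavor.)

```ruby
# With the whole distribution you can report a percentile; with a running
# average you can only report the (misleading) mean.
def percentile(sorted, pct)
  sorted[((pct / 100.0) * (sorted.size - 1)).round]
end

times = ([5] * 1_000) + ([2_000] * 200)  # DHH's example workload
times.sort!

puts times.sum / times.size.to_f  # => 337.5  the "respectable" average
puts percentile(times, 95)        # => 2000   what users actually experience
```
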
00:07:30.960 So here's what I think is crazy: that blog post DHH wrote is from 2009. Five years went by, and there's still no tool that does what DHH was asking for. So, frankly, we
00:07:38.130 smelled money. We were like, ah, that slide should be green, green as in dollars; I think Keynote has the
00:07:44.370 "make it rain" effect I should have used. So we smelled blood in the water;
00:07:50.639 we're like, this is awesome. There's only one problem that we discovered, and that
00:07:56.340 is, it turns out that building this thing is actually really, really freaking hard. Really, really hard. So we announced the
00:08:02.639 private beta at RailsConf last year; before doing that, we'd spent a year of
00:08:08.699 research, spiking out prototypes, building prototypes, building out the beta. We launched at RailsConf, and we realized we'd
00:08:16.530 made a lot of errors when we were building the system. So after RailsConf last year, we
00:08:24.150 basically took six months to completely rewrite the backend from the ground up.
00:08:30.090 And I think, tying into your keynote, Yehuda, we were like, oh, we clearly
00:08:35.459 have a bespoke problem, no one else is doing this. So we wrote our own custom
00:08:40.740 backend, and then we had all these problems, and we realized that they had actually already all been solved by the open source community, and so we
00:08:46.740 benefited tremendously from having a shared solution. So our first release of this was really very bespoke, and the
00:08:53.190 current release uses a tremendous number of off-the-shelf open source projects that each solve a
00:09:00.330 particular problem very effectively, very well, none of which are as easy to use as
00:09:05.640 Rails, but all of which solve really thorny problems very effectively. So let's just talk, for your own
00:09:12.960 understanding, about how most performance monitoring tools work. The way that most of these work is that you
00:09:18.570 run your Rails app, and running inside of your Rails app is some gem, some agent, that you install, and every time the
00:09:25.050 Rails app handles a request, it generates events, and those events, which include information about performance data, those
00:09:32.460 events are passed into the agent, and then the agent sends that data to some
00:09:37.680 kind of centralized server. Now, it turns out that doing a running average is actually really simple, which is why
00:09:44.490 everyone does it. You can basically do it in a single query: all you do is have three columns in the
00:09:49.650 database, the endpoint, the running average, and the number of requests, and
00:09:55.380 those are the only things that you need to keep a running average. So keeping a running average is really simple from a technical point of view. I don't think you could do it in JavaScript, due to the
00:10:01.530 lack of integers. Yes, you probably wouldn't want to do any math in JavaScript, it turns out.
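
(A minimal sketch of why this is so easy; the three-column schema from the talk is reduced to a struct here.)

```ruby
# The whole "database" a running average needs: endpoint, average, count.
# Note how the shape of the distribution is thrown away on every update.
Stat = Struct.new(:endpoint, :average, :count)

def record(stat, duration_ms)
  stat.count += 1
  stat.average += (duration_ms - stat.average) / stat.count
end

row = Stat.new("PostsController#index", 0.0, 0)
[5, 5, 2_000].each { |ms| record(row, ms) }
puts row.average  # => 670.0, with no hint that most requests took 5ms
```
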
00:10:08.400 So we took a little bit of a different approach. Do you want to go over that? Sure, sure. So when we first started, right at the beginning,
00:10:15.180 we basically did a similar thing, where your app creates a bunch of events. Most of those start off as
00:10:21.360 ActiveSupport notifications, although it turns out that there's very limited use of ActiveSupport notifications, so we had
00:10:27.390 to do some normalization work to get them sane, which we're going to be upstreaming back into Rails.
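
(For reference, this is the standard Rails instrumentation hook such an agent starts from; the hand-off to the agent is hypothetical.)

```ruby
require "active_support/notifications"

# Rails publishes instrumentation events like "sql.active_record" and
# "process_action.action_controller"; a profiling agent subscribes to them.
ActiveSupport::Notifications.subscribe("sql.active_record") do |name, started, finished, _id, payload|
  duration_ms = (finished - started) * 1000.0
  # Agent.record(name, duration_ms, payload[:sql])  # hypothetical hand-off
  puts format("%s (%.1fms): %s", name, duration_ms, payload[:sql])
end
```
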
00:10:33.150 But one thing that's kind of unfortunate about having every single Rails app run an agent is that you end up having to do a lot of the same kind of work over and
00:10:39.840 over again, and use up a lot of memory. So, for example, every one of these things is making HTTP requests, so now you have
00:10:45.120 a queue of things that you're sending over HTTP in every single one of your Rails processes, and of course you
00:10:50.520 probably don't notice; people are used to Rails taking up hundreds and hundreds of megabytes, so you probably don't notice if you install some agent and it
00:10:55.830 suddenly starts taking 20, 30, 40, 50 more megabytes. But we really wanted to keep the actual memory per process down to a
00:11:02.850 small amount. So one of the very first things that we did (we even did it before last year) was to pull out all that shared logic
00:11:09.240 into a separate process called the coordinator. The agent is basically responsible simply for collecting the
00:11:15.810 trace; it's not responsible for actually talking to our server at all, and that means that the coordinator only has to do this queueing, keeping a
00:11:22.890 bunch of that work in one place, and it doesn't end up using as much memory.
00:11:27.950 And I think this ended up being very effective for us; that low overhead also allows us to just collect more information in general.
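
(The talk doesn't show code for this split; purely as an illustration, an in-process agent handing traces to one shared local daemon might look like the following, with all names and the socket path hypothetical.)

```ruby
require "json"
require "socket"

# Hypothetical sketch of the agent side of the agent/coordinator split:
# each Rails process does the bare minimum (serialize and hand off), and
# the single coordinator process owns the HTTP queue and its memory.
module Agent
  SOCKET_PATH = "/tmp/coordinator.sock" # hypothetical

  def self.submit(trace)
    UNIXSocket.open(SOCKET_PATH) { |sock| sock.puts(JSON.generate(trace)) }
  rescue Errno::ENOENT, Errno::ECONNREFUSED
    # Coordinator unavailable: drop the trace rather than disturb the app.
  end
end

Agent.submit(endpoint: "PostsController#index", duration_ms: 123.4)
```
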
00:11:33.780 Yep. Now, after our first attempt, we started
00:11:39.630 getting a bunch of customers that were telling us... so the separate coordinator is a good thing and a bad thing. On the one hand,
00:11:45.480 there's only one of them, so it uses up only one set of memory; on the other hand, it's really easy for someone to go in
00:11:51.180 and ps that process and see how many megabytes of memory it's using. So we got a lot of initial complaints that said, oh,
00:11:56.490 your process is using a lot of memory. And I spent a few weeks (I know Ruby pretty well), I spent a couple of weeks, and I
00:12:03.510 actually wrote a gem called allocation_counter that basically went in and tried to pinpoint exactly where the allocations were coming from. But it
00:12:09.480 turns out that it's actually really, really hard to track down exactly where allocations are coming from in Ruby, because something as simple as using a
00:12:15.240 regular expression in Ruby can allocate match objects that get put back on the stack, and so I was only able to pare this down to some degree.
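
(Ruby 2.1's objspace extension can attribute allocations to source locations; a minimal sketch of that kind of hunting, including the surprise MatchData.)

```ruby
require "objspace"

# Track allocations inside the block, then ask where each object was born.
ObjectSpace.trace_object_allocations do
  "skylight" =~ /sky(.*)/  # even a regexp match allocates a MatchData
end

ObjectSpace.each_object(MatchData) do |md|
  file = ObjectSpace.allocation_sourcefile(md)
  puts "MatchData from #{file}:#{ObjectSpace.allocation_sourceline(md)}" if file
end
```
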
00:12:21.180 But I discovered pretty quickly that trying to keep a lid on the memory allocation by doing
00:12:27.330 all this stuff in Ruby is mostly fine, but for our specific use case, where we really want to be telling you,
00:12:32.820 you can run the agent on your box and it's not going to use a lot of memory, we really needed something more
00:12:38.490 efficient. And our first thought was, we'll use C or C++. No problem, C is native, it's
00:12:43.710 great. And Carl did the work (Carl is very smart), and then he said, Yehuda, it is
00:12:48.750 now your turn, you need to start maintaining this. And I said, I don't trust myself to write C++ code that's running on all you guys's boxes and
00:12:54.990 not segfault, so I don't think that works for me. And so I noticed
00:13:00.840 that Rust was coming along, and what Rust really gives you is the ability to write low-level code, a la C
00:13:06.390 or C++ with manual memory management, that keeps your memory allocation low and keeps things speedy, low
00:13:11.820 resource utilization, while also giving you compile-time guarantees about not segfaulting. Again, if your processes
00:13:17.940 randomly started segfaulting because you installed our agent, I think you would stop being our customer very quickly, so having pretty much one hundred
00:13:23.340 percent guarantees about that was very important to us, and that's why we decided to use Rust. Just keep going? Keep
00:13:30.570 going. So we have this coordinator object, and basically the coordinator is receiving events; the events
00:13:36.180 basically end up being these traces that describe what's happening in your application. And the next thing: in
00:13:42.210 our initial work on this we used JSON just to send the payload to the server, but we noticed that a lot of people have
00:13:47.580 really big requests. You may have a big request with a big SQL query in it, or a lot of big SQL queries in it; some
00:13:53.370 people have traces that are hundreds and hundreds of nodes long. And so we really wanted to figure out how to shrink down the payload size to something that we
00:14:01.080 could be, you know, pumping out of your box on a regular basis without running up your bandwidth costs. So one of the
00:14:07.950 first things that we did early on was to switch to using protobufs as the transport mechanism, and that really shrunk down the payloads a lot.
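
(A sketch with the google-protobuf gem, schema hypothetical: declare a message type inline instead of via protoc, then compare the binary encoding against JSON.)

```ruby
require "google/protobuf"

# Hypothetical single-span schema, declared with the gem's inline DSL.
Google::Protobuf::DescriptorPool.generated_pool.build do
  add_file("span.proto", syntax: :proto3) do
    add_message "demo.Span" do
      optional :name, :string, 1
      optional :duration_us, :uint64, 2
    end
  end
end

Span = Google::Protobuf::DescriptorPool.generated_pool.lookup("demo.Span").msgclass

span   = Span.new(name: "SELECT * FROM users WHERE id = ?", duration_us: 4_200)
binary = Span.encode(span)
json   = Span.encode_json(span)
puts "protobuf: #{binary.bytesize} bytes, JSON: #{json.bytesize} bytes"
```
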
00:14:15.300 Our earlier prototypes for actually collecting the data were written in Ruby, but I think Carl did like a weekend hack
00:14:20.640 to just port it over to Java and got something like 200x performance. And you don't always get 200x performance: if mostly
00:14:26.490 what you're doing is database queries, you're not going to get a huge performance win, but mostly what we're doing is math and algorithms and data
00:14:32.070 structures, and for that, Ruby could in theory one day have a good JIT or something, but today, writing that code
00:14:38.220 in Java didn't end up being significantly more code, because it's just, you know, algorithms and data structures. Also, one thing
00:14:43.560 about standardizing on protobufs in our stack that's actually a huge win: we realized, hey, browsers, it
00:14:49.590 turns out, are pretty powerful these days. They can allocate memory, they can do all kinds of computation, and protobuf libraries
00:14:56.550 exist everywhere. So we save ourselves a lot of computation and a lot of time by just treating protobuf as the canonical
00:15:03.120 serialization form, and then you can move payloads around the entire stack and everything speaks the same language, so you save on the serialization and
00:15:08.370 deserialization. And JavaScript is actually surprisingly effective at taking protobufs and converting them to
00:15:13.470 the format that we need, efficiently.
00:15:19.140 So we basically take this data: the Java collector is collecting all these protobufs, and pretty much it just turns around, and this is sort of
00:15:24.420 where we got into bespoke territory before, where we started rolling our own. But we realized that when you write a big,
00:15:29.490 distributed, fault-tolerant system, there's a lot of problems that you really just want someone else to have thought about. So
00:15:35.130 what we do is we basically take these payloads that are coming in, we convert them into batches, and we send
00:15:41.009 the batches down into the Kafka queue. And the next thing that happens: so, Kafka is basically just a queue that
00:15:48.209 you can throw things into; I guess it might be considered similar to something like AMQP. It has some nice
00:15:54.120 fault-tolerance properties and integrates well with Storm, but most importantly, it's just super, super high throughput. So we basically don't want to
00:16:00.630 put any barrier between you giving us the data and us getting it into Storm, which I think we'll talk about in a
00:16:05.639 bit. So basically Kafka takes the data and starts sending it into Storm.
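
(From Ruby, handing a batch to Kafka is nearly a one-liner with the ruby-kafka gem; the broker address and topic name here are hypothetical.)

```ruby
require "kafka" # the ruby-kafka gem

kafka = Kafka.new(["localhost:9092"], client_id: "collector")
batch = ["trace-1", "trace-2"].join("\n") # stand-in for a protobuf batch
kafka.deliver_message(batch, topic: "trace-batches")
```
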
00:16:10.800 And if you think about what has to happen in order to aggregate your requests: you have these requests, there are, you know, maybe
00:16:16.579 traces that have a bunch of SQL queries, and our job is basically to take all those SQL queries and say, okay, I can see that in all of your requests you
00:16:22.620 have this SQL query, and it took around this amount of time, and it happened as a child of this other node. And the way to think about that is basically just a
00:16:28.440 processing pipeline, right? You have these traces that come in one side, you start passing them through a bunch of processing steps, and then you end up on
00:16:34.740 the other side with the data. And Storm is actually a way of describing that processing pipeline in a sort of
00:16:40.350 functional style, and then you tell it, okay, here's how many servers I need, here's how I'm going
00:16:45.389 to handle failures, and it basically deals with distribution and scaling and all that stuff for you. And part of that
00:16:52.110 is because you wrote everything in a functional style. So what happens is, Kafka sends the data into the entry
00:16:57.930 spout ("spout" is the terminology in Storm for these
00:17:03.120 streams that get created), and they basically go into these processing things, which very cleverly, cutely, are
00:17:08.520 called bolts. This is definitely not the naming I would have used, but they're
00:17:13.530 called bolts. And the idea is that basically every request may need several kinds of processing: for example, we now
00:17:18.600 automatically detect N+1 queries, and that's sort of a different kind of processing from "make a picture of
00:17:24.929 the entire request" or "what is the 95th percentile across your entire app," right? These are all different kinds of processing. So we take the data and we
00:17:30.929 send it into a bunch of bolts, and the cool thing about bolts is that, again, because they're just functional chaining,
00:17:36.900 you can take the output from one bolt and feed it into another bolt, and that works pretty well. And you
00:17:43.500 don't have to worry about... I mean, you have to worry about things like fault tolerance, failure, idempotence, but you worry about
00:17:49.830 them at the abstraction level, and then the operational part is handled for you. It's a very declarative way of
00:17:55.799 describing how this computation works, in a way that's easy to scale.
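
(Storm topologies are defined on the JVM, so this is only a Ruby analogy of the spout-and-bolt idea: declare pure processing steps, then let a runner chain them.)

```ruby
# Toy stand-in for a Storm topology: each bolt maps one event to zero or
# more events (like Storm's emit), and bolts compose into a pipeline.
class Topology
  def initialize
    @bolts = []
  end

  def bolt(&block)
    @bolts << block
    self
  end

  def run(events)
    @bolts.reduce(events) { |stream, b| stream.flat_map(&b) }
  end
end

traces = [
  { queries: ["SELECT * FROM comments WHERE post_id = 1"] },
  { queries: ["SELECT * FROM comments WHERE post_id = 2"] },
]

counts = Topology.new
  .bolt { |t| t[:queries].map { |q| q.gsub(/\b\d+\b/, "?") } } # normalize binds
  .bolt { |sql| [[sql, 1]] }                                   # emit (key, 1)
  .run(traces)
  .group_by(&:first)
  .transform_values { |pairs| pairs.sum(&:last) }

p counts # => {"SELECT * FROM comments WHERE post_id = ?"=>2}
```
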
00:18:01.860 Carl actually talked about this at very high speed yesterday, and some of you may have been there; I would recommend watching the video when it comes out if
00:18:07.170 you want to learn more about how to make use of this stuff in your own applications. And then, when you're finally done with all the processing, you
00:18:13.530 need to actually do something with it, you need to put it somewhere so that the web app can get access to it, and we use Cassandra for that.
00:18:20.669 Cassandra, again, is mostly a dumb database, but it has high
00:18:26.010 capacity, and it has the fault-tolerance properties we need, and we're very write-heavy, like, we tend to be
00:18:31.470 writing more than we're reading. Yep. And then, when we're done with a particular batch, Cassandra
00:18:37.919 basically kicks off the process over again. So we're basically doing these things as batches; these are roll-ups, is what's happening here. So
00:18:43.650 basically every minute, every 10 minutes, and then every hour, we reprocess and re-aggregate, so that when you query
00:18:49.380 us we know exactly what to give you. Yep. So we sort of have this cycle where, obviously, in
00:18:55.230 the first minute you really want high granularity, you want to see what's happening right now, but if you want to go back and look at data from three
00:19:00.480 months ago, you probably care about it at, like, the day granularity, or maybe the hour granularity. So we basically do
00:19:06.179 these roll-ups, and we cycle through the process.
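
(A minimal sketch of a time-bucket roll-up, schema hypothetical: minute buckets re-aggregate into coarser windows so old data stays cheap to query.)

```ruby
require "time"

def rollup(buckets, window_seconds)
  buckets
    .group_by { |b| b[:at].to_i / window_seconds * window_seconds }
    .map do |window_start, group|
      { at: Time.at(window_start).utc,
        count: group.sum { |b| b[:count] },
        total_ms: group.sum { |b| b[:total_ms] } }
    end
end

minutes = [
  { at: Time.utc(2014, 4, 25, 10, 0), count: 40, total_ms: 6_000 },
  { at: Time.utc(2014, 4, 25, 10, 1), count: 55, total_ms: 9_500 },
]
p rollup(minutes, 3_600) # one hourly bucket: count 95, total_ms 15_500
```
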
00:19:11.250 So it turns out building this system required an intense amount of work. Carl spent probably six months reading PhD theses to
00:19:21.090 find data structures and algorithms that we could use, because this is a huge amount of data. Like, I
00:19:26.940 think even a few months after we were in private beta, we were
00:19:34.140 already handling over a billion requests per month; basically, the number of requests that we handle is the sum of all the requests
00:19:39.600 that all of our customers handle. Right, right. So that's a lot of requests. So obviously we can't provide a
00:19:44.730 service, at least not an affordable, accessible service, if we have to store
00:19:51.150 terabytes or exabytes of data just to tell you how your app is doing. And I think it's also problematic if you
00:19:56.880 store all the data in a database and then, every single time someone wants to learn something about it, you have to do a query; those queries can take a very
00:20:02.400 long time, they can take minutes, and we really wanted something where the
00:20:07.560 feedback loop would be fast. So we wanted to find algorithms that let us handle the data in real time and then provide
00:20:13.830 it to you in real time, instead of this "dump the data somewhere and then run complicated queries" approach. So... this
00:20:21.480 slide was not supposed to be here, it's supposed to be a real slide. I went too far. Okay, we'll watch that again,
00:20:28.950 that's pretty cool. So the last thing I want to say is, perhaps your takeaway from looking at this architecture
00:20:34.320 diagram is: oh my gosh, these Rails guys have gone completely off the deep end, they jumped the shark, they ditched Rails. I saw like three
00:20:41.310 tweets yesterday (I wasn't here, I was in Portland yesterday), but I saw like three tweets that were like, I'm at RailsConf and I haven't seen a single talk about
00:20:46.530 Rails. So that's true here too, but I want
00:20:51.900 to assure you that we are only using this stack for the heavy computation. We started in Rails. We started with, like,
00:20:59.880 what do we need? Well, people need to authenticate and log in, and we probably need to do billing, and those are
00:21:05.010 all things that Rails is really, really good at. So we started with Rails as basically the starting point, and then we
00:21:10.140 realized, oh my gosh, the computation is really slow, there's no way we're going to be able to offer the service; okay, now
00:21:15.300 let's think about how we can fix that. I think, notably, a lot of people who look at Rails... there's a lot of companies that have built big stuff on
00:21:21.330 Rails, and their attitude is like, oh, this legacy, terrible Rails app, we should really get rid of it; we could just write everything in Scala or Clojure or Go and
00:21:28.410 everything would be amazing. That is definitely not our attitude. Our attitude is that Rails is really amazing at the kinds of things that
00:21:34.950 are really common across everyone's web applications (authentication, billing, etc.), and we really want to be using Rails for
00:21:41.160 those parts. Even things like error tracking we do through the Rails app. We want to be using Rails because it's very productive at doing those things; it
00:21:47.070 happens to be very slow at doing data crunching, so we're going to use a different tool for that. But I don't think you'll ever see me getting up and
00:21:52.230 saying I really wish we'd just started writing, you know, the Rails app in Rust.
00:21:57.920 So that's number one: honest response times, which, it turns out,
00:22:03.900 seems like it should be easy but requires storing an insane amount of data. The
00:22:09.270 second thing that we realized, looking at all of these tools, is that most of them focus on data. They focus on
00:22:14.370 giving you the raw data, but I'm not a machine, I'm not a computer, I don't enjoy sifting through data; that's what
00:22:20.460 computers are good for. I would rather be drinking a beer; it's really nice out this time of year. So we wanted to think about: if you are trying to
00:22:27.570 solve the performance problems in your application, what are the things that you would suss out with the existing tools
00:22:33.480 after spending, like, four hours digging through them? And I think part of this is just that people
00:22:40.580 like to think that they're going to use these tools, but when the tools require you to dig through a lot of data, people just don't use them very much. So
00:22:46.680 the goal here was to build a tool that people actually use and actually like using, and not a tool that happens to provide a lot of data you can
00:22:52.710 sift through. Yeah, so probably one of the first things that we realized is what we don't want to provide. This is a
00:22:58.710 trace of a request. You've probably seen similar UIs in other tools, or, for example, in the inspector in Chrome
00:23:05.310 or Safari, and this is basically a visual stack trace of where your application is
00:23:11.040 spending its time. But I think what was important for us is showing not just a single request, because your app handles,
00:23:18.540 you know, hundreds of thousands of requests, or millions of requests, so looking at a single request is statistically incomplete; it's just noise.
00:23:24.540 And it's especially bad if it's the worst request, because the worst request really is noise.
00:23:30.120 Right, it's literally the outlier. Yep. So what we present in Skylight is
00:23:37.440 something a little bit different, something that we call the aggregate trace. The aggregate trace is
00:23:43.760 basically us taking all of your requests and averaging out where each of these
00:23:49.710 things spends its time, and then showing you that. So this is like the
00:23:56.760 Statue of David: it is the idealized form of the stack trace, of how your application is behaving. But of course you
00:24:03.630 have the same problem as before, which is that if this was all we were showing you, it would be obscuring a lot
00:24:10.600 of information. You want to be able to tell the difference between, okay, what does my stack trace look like for fast requests, and how does that differ from
00:24:16.860 requests that are slower? So I've got a little video here: you can see that when I move this slider,
00:24:22.539 the trace below it is actually updating in real time. As I move the
00:24:28.419 slider around, you can see that the aggregate trace actually updates with it, and that's because we're collecting,
00:24:33.970 like I said, a lot of data. We can recompute this aggregate trace on the fly; basically, for each bucket we're
00:24:40.299 storing a different trace, and then on the client we're reassembling them (we'll go into that a little bit). And I think it's really important that you be able to do
00:24:46.450 these experiments quickly. If every time you thought, oh, I wonder what happens if I add another histogram bucket, it
00:24:51.850 required a whole full-page refresh, that would basically make people not want to use the tool, not able to use the tool. So actually building something
00:24:58.330 that is real-time and fast, and gets the data as it comes in, was very important to us. So that's number one. And the second
00:25:04.299 thing: so we built that, and we're like, okay, well, what's next? And I think the big problem with this is that you need to
00:25:09.580 know that there is a problem before you go look at it, right? So we've been working for the past few months; the
00:25:15.760 Storm infrastructure that we've built makes it pretty straightforward to start building more abstractions on top of the data that we've already collected;
00:25:20.980 it's a very declarative system. So we've been working on a feature called inspections, and what's cool about
00:25:26.559 inspections is that we can look at this tremendous volume of data that we've collected from your app, and we can
00:25:31.570 automatically tease out what the problems are. So the first one that we've shipped: this is in beta right now, it's not out and enabled by default,
00:25:38.289 but it is behind a feature flag that we've had some users turning on and trying out. And what we can do in this
00:25:44.860 case is, because we have information about all the database queries in your app, we can look and see if you have N
00:25:50.980 +1 queries. Maybe explain what an N+1 query is? Yes, so people hopefully know what N+1 queries are, but it's the
00:25:56.169 idea that, by accident, for some reason, instead of making one query, you ask for, like, all the posts, and then you
00:26:02.320 iterate through all of them and get the comments for each one. So now, instead of having one query, you have one query per
00:26:08.020 post. What you'd like to do is eager loading, where you say include the comments, but you have to
00:26:13.870 know that you have to do that. There are some tools that will catch this in development mode, if you happen to hit it there, like Bullet. This is basically a tool
00:26:19.929 that's looking at every single one of your requests, and it has some thresholds. Once we see that a bunch of your requests have the same exact query...
00:26:26.470 and we do some work to pull out binds, so if it's like WHERE something = 1, we will automatically pull out the 1
00:26:32.530 and replace it with a question mark... we basically take all those queries, and if it's the exact same query repeated multiple times, subject to
00:26:39.430 some thresholds, we'll start showing you: hey, there's an N+1 query here.
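
(The bind-stripping plus repetition-counting they describe fits in a few lines of Ruby; the threshold here is hypothetical.)

```ruby
N_PLUS_ONE_THRESHOLD = 5 # hypothetical cut-off

# Strip string and numeric bind values so repeated queries compare equal.
def normalize(sql)
  sql.gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?")
end

def n_plus_one_candidates(queries_in_request)
  queries_in_request
    .group_by { |sql| normalize(sql) }
    .select { |_, dupes| dupes.size >= N_PLUS_ONE_THRESHOLD }
    .keys
end

queries = (1..10).map { |id| "SELECT * FROM comments WHERE post_id = #{id}" }
p n_plus_one_candidates(queries)
# => ["SELECT * FROM comments WHERE post_id = ?"]
```
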
00:26:45.940 And you can imagine the same sort of thing being done for things like: are you missing an
00:26:51.640 index? Or are you using the Ruby version of, say, JSON when you should be using the native version? These
00:26:58.000 are all things that we can start detecting, just because we're consuming an enormous amount of information and we can start writing heuristics for bubbling it up. So, the third and final breakthrough: we realized that we really, really needed a lightning-fast UI,
00:27:04.180 something really responsive. In particular, the feedback loop is critical, right? You can imagine if the way that
00:27:09.880 you dug into data was: you click, and you wait an hour, and then you get your results. No one would do it, no one would
00:27:14.920 ever do it. And the existing tools are okay, but you click and you wait, you look at it,
00:27:20.320 and you're like, oh, I want a different view, so then you go edit your query, and then you click and you wait, and it's just not a pleasant experience. So we use
00:27:28.330 Ember. The UI that you're using when you log into Skylight, even though it feels just like a regular website (almost like a native app), all of the
00:27:36.040 routing, all the rendering, all the decision-making is happening in an Ember.js app, and we pair that with D3.
00:27:42.310 So all of the charts, the chart that you saw there in the aggregate trace, that is all Ember components powered by D3, and
00:27:49.510 this has actually significantly cleaned up our client-side code; it makes reusability really, really awesome. So
00:27:55.540 I'll give you an example. This is from our billing page: the designer came in, and they had a component that was like the date component, and it seemed really
00:28:03.250 boring at first, seemed really boring, but this is the implementation, right? You could copy and paste this code over and
00:28:09.550 over again, everywhere you go; just remember to format it correctly, because if you forget to format it, it's not going to look the same everywhere. But I was like,
00:28:14.890 hey, we're using this all over the place, why don't we bundle this up into a component? And so with Ember it was super
00:28:20.230 easy: we basically just said, okay, here's a new calendar-date component; it has a property on it called date; just set that
00:28:25.270 to any JavaScript Date object, nothing to remember about converting it or formatting it. Here's the component: set the date, and it will render the
00:28:31.420 correct thing automatically. And so the architecture of the Ember app looks a little bit something
00:28:36.580 like this, where you have many, many different components, most of them just driven by D3, and they're plugged into the model and the controller, and the
00:28:43.060 Ember app will go fetch those models from the cloud (from the Java app, which just queries Cassandra)
00:28:48.850 and render them. And what's neat about this model is that turning on WebSockets is super easy, right? Because all of
00:28:56.050 these components are bound to a single place, so when the WebSocket says, hey, we have updated information for you to show,
00:29:01.120 it just pushes it onto the model, or onto the controller, and the whole UI updates automatically. It's like magic.
00:29:06.760 It's like magic. And when debugging, this is especially awesome, too; I don't know, maybe we'll show a demo
00:29:13.510 of the Ember Inspector using this. So yeah, that lightning-fast UI, shrinking the feedback loop so that you can quickly play with your data, makes it
00:29:20.200 go from a chore to something that actually feels kind of fun. So these were
00:29:25.690 the breakthroughs that we had when we were building Skylight, the things that made us think: yes, this is actually a product that we think deserves to be on
00:29:30.730 the market. So: one, honest response times, collecting data that no one else can collect; two, focusing on answers instead of just
00:29:36.010 dumping data; and three, having a lightning-fast UI to do it with. So we'd like to think of Skylight as basically a smart profiler, a smart profiler that
00:29:42.130 runs in production. It's like the profiler you run on your local development machine, but instead of being on your
00:29:47.470 local dev box, which has nothing to do with the performance characteristics of what your users are experiencing, we're actually running in production. So let me
00:29:55.540 just give you guys a quick demo.
00:30:01.049 So this is what Skylight looks like; let me get this set up, there we go. So the first thing here
00:30:08.820 is we've got the app dashboard. This is our 95th-percentile response time; maybe you're
00:30:15.029 all hammering it right now, that would be nice. So this is a graph of your response time over time, and then on the right,
00:30:21.090 this is a graph of the RPMs, the requests per minute, that your app is handling. So this is app-wide, and this is
00:30:26.970 live; this updates every minute. Then down below you have a list of the endpoints in your application, so you can see
00:30:33.659 the slowest ones at the top. For us, we have an instrumentation API, and we've gone and instrumented our background workers, so we can see them
00:30:39.869 here with their response times mixed in. So we can see that we have this reporting worker that's taking, at the 95th percentile, 13
00:30:47.009 seconds. All that time used to be inside of some request somewhere, and we discovered that there was a lot of time being spent in things that we could push
00:30:53.399 to the background. We probably need to update the agony index so that it doesn't rank workers very high, because spending some time in your workers is not that
00:30:59.549 big of a deal. So then, if we dive into one of these, you can see that for this endpoint we've got the time explorer up
00:31:06.419 above, and that shows a graph of response time, again at the 95th percentile, and if you want to go back and look at
00:31:12.450 historical data, you just drag it like this. And this has got a brush, so you can zoom in and out on different times, and
00:31:17.489 every time you change that range, you can see that it's very responsive; it's never waiting for the server, but it is going back and fetching more data from the
00:31:23.489 server, and then when the data comes back, you see the whole UI just updates, and we get that for free with Ember and D3. And then down below, as we discussed, you
00:31:30.840 actually have a real histogram, and this histogram in this case is showing 457 requests, and if we click and
00:31:37.529 drag, we can just move this, and you can see that the aggregate trace below updates in response to us dragging this.
00:31:43.049 And if we want to look at the fastest requests, we just click "faster" and we'll choose that range on the histogram; I think
00:31:48.659 it's actually the fastest third. And then if you click on "slower," you can see the slow requests. So this makes it really easy to
00:31:53.940 compare and contrast: okay, why are certain requests fast, or why are certain requests slow? You can see these
00:31:59.609 blue areas: this is Ruby code. Right now it's not super granular (it would be nice if you could actually know what's
00:32:06.029 going on here), but it'll at least tell you where in your controller action this is happening. And then you can actually
00:32:11.879 see which database queries are being executed and what their duration is, and you can see that we actually extract the SQL
00:32:18.180 and we normalize it, so you can see exactly what those requests are, even if the values
00:32:23.790 are totally different between them. Yeah, the real query, courtesy of Rails not yet supporting bind extraction, is like WHERE
00:32:29.550 id = 1, or 10, or whatever. Yep. So that's pretty cool. One other thing is that
00:32:36.380 initially we actually just showed the whole trace, but we discovered that, obviously, when you show whole traces, you have information that doesn't
00:32:42.450 really matter that much. So we've recently started to collapse things that don't matter so
00:32:48.390 much; you can basically expand or condense the trace, and we wanted to make it so that you don't have to think about expanding or condensing individual areas;
00:32:55.380 you just see what matters the most, and then you can see the trivial areas. Yep. So that's the demo of Skylight; we'd really
00:33:02.610 like it if you checked it out. There is one more thing I want to show you that is, like, really freaking cool. This is
00:33:07.680 coming out of Tilde Labs. Carl has been hacking; he's been up until past
00:33:12.870 midnight, getting almost no sleep, for the past month, trying to have this ready. I don't know how many of you know this, but
00:33:19.460 Ruby 2.1 has a new stack-sampling
00:33:25.320 feature, so you can get really granular information about how your Ruby code is performing. So I want to show you: I just
00:33:32.940 mentioned how it would be nice if we could get more information out of what your Ruby code is doing, and now we can
00:33:38.220 do that. Basically, every few milliseconds, this code that Carl wrote goes into MRI and takes
00:33:45.570 a snapshot of the stack, and because this is built in, it's very low-impact: it's
00:33:50.790 not allocating any new memory, it has very little performance cost; you basically wouldn't even notice it. So every few
00:33:56.160 milliseconds it's sampling, and we take that information and we send it up to our servers. So it's almost like you're running ruby-prof on your local dev
00:34:02.820 box, where you get extremely granular information about where your code is spending its time, in Ruby, per method, for
00:34:08.370 all of these things, except it's happening in production.
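
(The talk's agent does this from Rust; from plain Ruby, the same 2.1-era sampling machinery is exposed through gems like stackprof, for example:)

```ruby
require "stackprof" # sampling profiler built on Ruby 2.1's rb_profile_frames

profile = StackProf.run(mode: :wall, interval: 1_000) do # sample every 1000us
  100_000.times { "railsconf".reverse.upcase }
end

top = profile[:frames].values.max_by { |f| f[:samples].to_i }
puts "hottest frame: #{top[:name]} (#{top[:samples]} samples)"
```
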
00:34:16.110 So this is... we enabled it in staging; you can see that we've got some rendering bugs, it's still in beta, and we haven't yet collapsed things
00:34:21.330 that are not important for this particular feature; we want to hide things like framework code. Obviously,
00:34:27.460 this gives you an incredibly, incredibly granular view of what your app is doing in production. This is an
00:34:36.550 API that's built into Ruby 2.1, and because our agent is running so low-
00:34:42.849 level, because we wrote it in Rust, we have the ability to do things like this. And Carl thinks that we may be able to
00:34:48.190 actually backport this to older Rubies too, so if you're not on Ruby 2.1, we think that we can actually bring this to you, but that's TBD. So I think the cool thing
00:34:55.330 about this in general is: when you run a sampling profiler (it's a sampling profiler, right, we don't want to be burdening every
00:35:00.730 single thing that you do in your program with tracing, that would be very slow), so when you normally run a sampling profiler,
00:35:06.640 you basically have to make a loop: you have to run this code a million times and keep sampling,
00:35:12.160 and eventually you'll get enough samples to get the information. But it turns out that your production server is a loop: your production server is serving tons
00:35:19.060 and tons of requests. So by simply taking a few microseconds out of every request and collecting a couple
00:35:24.849 of samples over time, we can actually get this really high-fidelity picture at
00:35:30.220 basically no cost, and that's pretty mind-blowing. And this is the kind of stuff that we can start doing by really caring about both the user
00:35:36.580 experience and the implementation, and getting really geeky about it. And honestly, this is a
00:35:42.010 really exciting feature that really shows what we can do as we keep building, once we've got that groundwork. So if you guys want to
00:35:49.089 check it out: skylight.io. It's available today; it's no longer in private beta, everyone can sign up, no invitation token necessary, and you can
00:35:55.869 get a 30-day free trial if you haven't started one already. So if you have any questions, please come see us right now, or we have a booth in the vendor hall.
00:36:01.030 Thank you guys very much.