00:00:15.869
hey you guys ready thank you guys so much for coming this is awesome I was
00:00:21.430
really I when they were putting together the schedule I said make sure that you put us down in the caves of Moria so
00:00:26.890
thank you guys for coming down and making it. I'm Tom, this is Yehuda. When
00:00:33.850
people told me I was signed up to do a back-to-back talk, I don't know what I was thinking. Yeah, so we want to talk to you
00:00:40.809
today about Skylight. Just a little bit before we talk about that, I want to talk about us a little bit. So in
00:00:47.739
2011 we started a company called Tilde. Ah, the shirt made me self-conscious, because this is actually the first
00:00:53.860
edition and it's printed off-center. Well, either I'm off-center or the shirt's off-center, one of the two. So we started
00:01:00.460
Tilde in 2011, and we had all just left a venture-backed company, and that was a
00:01:06.100
pretty traumatic experience for us, because we spent a lot of time building the company, and then we ran out of money and it sold to Facebook, and we really
00:01:11.740
didn't want to repeat that experience. So we decided to start Tilde, and when we did it, DHH and the other
00:01:20.170
people at Basecamp were talking about, you know, being bootstrapped and proud, and that was a message that really resonated with us, and so we wanted to
00:01:26.320
capture the same thing. There's only one problem with being bootstrapped and proud, and that is, in order to be both of
00:01:31.479
those things you actually need money. It turns out it's not like you just say it in a blog post and then all of a sudden you are in business. So we had to think a lot
00:01:39.640
about, okay, well, how do we make money? How do we build a profitable and, most importantly,
00:01:45.130
sustainable business? Because we didn't want to just flip it to Facebook in a couple of years. So looking around, I think
00:01:51.759
the most obvious thing that people suggested to us is, well, why don't you guys just become Ember, Inc. — raise a few
00:01:58.090
million dollars, you know, build... what's your business model? Mostly prayer. But
00:02:04.899
that's not really how we want to think about building open source communities. We don't really think that that
00:02:10.119
necessarily leads to the best open source communities, and if you are interested in more on that, I recommend Leah
00:02:15.430
Silber, who is one of our co-founders. She's giving a talk this afternoon — sorry, Friday afternoon — about
00:02:21.380
how to build a company that is centered on open source. So if you want to learn more about how we've done that, I would
00:02:27.050
really suggest you go check out her talk. So no, no Ember, Inc. — not a lot of that. So we
00:02:34.280
really wanted to build something that leveraged the strengths that we thought we had. The one, I think, most
00:02:41.210
importantly: a really deep knowledge of open source and a deep knowledge of the Rails stack. And also, Carl, it turns out, is
00:02:46.700
really, really good at building highly scalable big data systems — lots of
00:02:52.340
Hadoop in there. So last year at RailsConf we announced the private beta
00:02:58.460
of Skylight. How many of you have used Skylight? Can you raise your hand if you have used it? Okay, many of you, awesome. So
00:03:04.690
Skylight is a tool for profiling and measuring the performance of your Rails applications in production, and as a
00:03:12.350
product, Skylight, I think, was built on three key breakthroughs.
00:03:19.340
We didn't want to ship a product that was incrementally better than the
00:03:24.590
competition; we wanted to ship a product that was dramatically better — quantum leap, order of magnitude better — and in
00:03:30.950
order to do that we spent a lot of time thinking about how we could solve most of the problems that we saw in the existing landscape. And
00:03:37.940
delivering a product that does that is predicated on these three
00:03:44.000
breakthroughs so the first one I want to talk about is honest response times
00:03:49.300
honest response times. So DHH wrote a blog post on what was then the 37signals
00:03:56.360
blog, now the Basecamp blog, called "The Problem with Averages." How many of you have read this? Awesome. For those of you
00:04:02.930
that have not: how many of you hate raising your hands at presentations? So
00:04:09.500
for those of you, they should put a button in every seat that you can press instead. So if you read this blog post, the
00:04:19.730
way it opens is: "Our average response time for Basecamp right now is 87 milliseconds. That sounds fantastic, and
00:04:26.000
it easily leads you to believe that all is well and that we wouldn't need to
00:04:31.160
spend any more time optimizing performance. But that's actually wrong. The average
00:04:37.670
number is completely skewed by tons of fast responses to feed requests and other cached replies. If you have 1,000
00:04:45.590
requests that return in five milliseconds, and then you have 200 requests taking two thousand
00:04:51.500
milliseconds, or two seconds, you can still report a respectable 170
00:04:56.570
milliseconds of average. That's useless." So what does DHH say that we need? DHH
00:05:02.690
says the solution is histograms. So for those of you, like me, who were sleeping
00:05:07.700
through your statistics class in high school or college, a brief primer on histograms. A histogram is very simple:
00:05:14.780
basically, you have a series of buckets along some axis, and every
00:05:20.120
time a number falls in a bucket, you increment that bar by one. So this is an example of a
00:05:26.360
histogram of response times in a Rails application. So you can see that there's a big cluster in the middle, around four
00:05:32.780
hundred eighty-eight milliseconds — 500 milliseconds isn't a super-speedy app, but it's not the worst thing in the world — and they're all clustered. And then
00:05:38.870
as you kind of move to the right, you can see that the response times get longer and longer and longer, and if you move to the left, the response times get shorter and shorter.
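As an aside, the bucketing being described can be sketched in a few lines of Ruby (the bucket width and numbers here are illustrative, not Skylight's actual code):

```ruby
# Build a histogram of response times (in ms) using fixed-width buckets.
# Each bucket counts how many requests fell into its time range.
def histogram(durations_ms, bucket_width: 100)
  durations_ms.each_with_object(Hash.new(0)) do |ms, buckets|
    bucket = (ms / bucket_width) * bucket_width  # integer division floors to the bucket start
    buckets[bucket] += 1
  end
end

counts = histogram([45, 488, 512, 530, 2100], bucket_width: 100)
counts[500]  # => 2 (the two requests between 500 and 599 ms)
```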
00:05:44.390
So why do you want a histogram? What's the most important thing about a histogram? Well, I think it's because most endpoints
00:05:52.430
don't actually look like this, right? If you think about what your Rails app is doing, it's a complicated beast, right? It
00:05:58.430
turns out Ruby is Turing complete — you can do branching logic, you can do a lot of things — and so what that means is
00:06:05.510
that if you represent one endpoint with a single number, you are losing a lot of fidelity, to the point where it
00:06:12.169
becomes, as DHH said, useless. So for example, in a histogram you can easily see, oh, here's a group of requests and
00:06:19.220
response times where I'm hitting the cache, and here's another group where I'm missing it, and you can see that that cluster is significantly slower than the
00:06:25.880
faster cache-hitting cluster.
00:06:31.070
And the other thing that you get when you have a distribution — we keep the whole distribution in the histogram — is you can look at this number, the 95th
00:06:37.280
percentile. Right, so the way to think about the performance of your web
00:06:42.289
application is not the average, because the average doesn't really tell you anything. You want to
00:06:49.650
think about the 95th percentile, because that's not the average response time,
00:06:55.380
that's the average worst response time that a user is likely to hit. And the thing to keep in mind is that it's not as though a customer comes to your site, they
00:07:02.490
issue one request, and then they're done, right? As someone is using your website, they're going to be generating a lot of
00:07:09.150
requests, and you need to look at the 95th percentile because otherwise every
00:07:15.060
request is basically you rolling the dice that they're not going to hit one of those two-second, three-second, four-second responses, close the tab, and go to
00:07:21.570
your competitor.
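As an aside, a naive 95th percentile is easy to compute once you have the response times (a sketch; a real system like the one described computes this from histogram buckets rather than sorting raw samples):

```ruby
# Naive 95th percentile: sort the samples and take the value below
# which 95% of them fall.
def percentile(samples, pct)
  sorted = samples.sort
  index = ((pct / 100.0) * sorted.length).ceil - 1
  sorted[[index, 0].max]
end

times = [5] * 1000 + [2000] * 200   # a fast/slow mix like the blog post's example
p95 = percentile(times, 95)         # => 2000, exposing the slow requests the mean hides
```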
00:07:30.960
So we looked at this, and here's what I think is crazy: that blog post DHH wrote is from 2009. It's been five years, and there's still no tool that does what DHH was asking for. So, frankly, we
00:07:38.130
smelled money. We were like, ooh — I should have made that slide green, you know, with dollar signs,
00:07:44.370
the make-it-rain effect I should have used. So we smelled blood in the water,
00:07:50.639
we were like, this is awesome. There's only one problem that we discovered, and that
00:07:56.340
is, it turns out that building this thing is actually really, really freaking hard. Really, really hard. So we announced the
00:08:02.639
private beta at RailsConf last year. Before doing that, we'd spent a year of
00:08:08.699
research, spiking out prototypes, building prototypes, building out the beta. We launched at RailsConf and we realized we had
00:08:16.530
made a lot of errors when we were building the system. So then, after RailsConf last year, we
00:08:24.150
basically took six months to completely rewrite the backend from the ground up
00:08:30.090
and I think, tying into your keynote, Yehuda, we were like, oh, we clearly
00:08:35.459
have a bespoke problem, no one else is doing this. So we wrote our own custom
00:08:40.740
backend, and then we had all these problems, and we realized that they had actually already all been solved by the open source community, and so we
00:08:46.740
benefited tremendously from having a shared solution. So our first release of this was really very bespoke, and the
00:08:53.190
current release uses a tremendous amount of off-the-shelf open source projects that each solve a
00:09:00.330
particular problem very effectively, very well — none of which are as easy to use as
00:09:05.640
Rails, but all of which solve really thorny problems very effectively. So let's just talk, for your own
00:09:12.960
understanding let's talk about how most performance monitoring tools work so the way that most of these work is that you
00:09:18.570
run your rails app and running inside of your rails app is some gem some agent that you install and every time the
00:09:25.050
rails app handles a request it generates events and those events which include information about performance data those
00:09:32.460
events are passed into the agent and then the agent sends that data to some
00:09:37.680
kind of centralized server. Now, it turns out that computing a running average is actually really simple, which is why
00:09:44.490
everyone does it — you can basically do it with a single query, right? All you do is have three columns in the
00:09:49.650
database — the endpoint, the running average, and the number of requests — and those are the only
00:09:55.380
things that you need to keep a running average. Right, so keeping a running average is actually really simple from a technical point of view. I don't think you could do it in JavaScript, due to the
00:10:01.530
lack of integers. Yes — you probably wouldn't want to do any math in JavaScript, it turns out.
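As an aside, the three-column running-average scheme being described boils down to one update formula (the endpoint name here is made up for illustration):

```ruby
# Incremental (running) average: no need to store individual requests.
#   new_avg = old_avg + (sample - old_avg) / (count + 1)
Row = Struct.new(:endpoint, :average, :count)

def record(row, duration_ms)
  row.count += 1
  row.average += (duration_ms - row.average) / row.count.to_f
end

row = Row.new("PostsController#index", 0.0, 0)
[5, 5, 2000].each { |ms| record(row, ms) }
row.average  # => 670.0 — one slow request skews the whole number
```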
00:10:08.400
So we took a little bit of a different approach. Yehuda, do you want to go over this part? Sure, sure, sure. So, um, when we first started, right at the beginning,
00:10:15.180
we basically did a similar thing, where your app creates events. Most of those start off as being
00:10:21.360
ActiveSupport notifications, although it turns out that there's very limited use of ActiveSupport notifications, so we had
00:10:27.390
to do some normalization work to get them sane, which we're going to be upstreaming back into Rails.
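As an aside, here is a rough sketch of that event flow, using a tiny stand-in for the pub/sub mechanism (in a real Rails app this is ActiveSupport::Notifications, and the agent subscribes to events such as `sql.active_record`; the module below is illustrative, not Skylight's code):

```ruby
# Minimal stand-in for the notification pub/sub the agent relies on.
module Notifications
  @subscribers = Hash.new { |h, k| h[k] = [] }

  def self.subscribe(name, &block)
    @subscribers[name] << block
  end

  def self.instrument(name, payload = {})
    start = Time.now
    result = yield if block_given?
    finish = Time.now
    @subscribers[name].each { |s| s.call(name, start, finish, payload) }
    result
  end
end

events = []
Notifications.subscribe("sql.active_record") do |name, start, finish, payload|
  events << { name: name, duration: finish - start, sql: payload[:sql] }
end

Notifications.instrument("sql.active_record", sql: "SELECT * FROM posts") { :db_work }
events.first[:sql]  # => "SELECT * FROM posts"
```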
00:10:33.150
But one thing that's kind of unfortunate about having every single Rails app run an agent is that you end up having to do a lot of the same kind of work over and
00:10:39.840
over again, and use up a lot of memory. So for example, every one of these things is making HTTP requests, so now you have
00:10:45.120
a queue of things that you're sending over HTTP in every single one of your Rails processes, and of course you
00:10:50.520
probably don't notice — people are used to Rails taking up hundreds and hundreds of megabytes, so you probably don't notice if you install some agent and it
00:10:55.830
suddenly starts taking 20, 30, 40, 50 more megabytes — but we really wanted to keep the actual memory per process down to a
00:11:02.850
small amount. So one of the very first things that we did — we even did it before last year — was we pulled out all that shared logic
00:11:09.240
into a separate process called the coordinator, and the agent is basically responsible simply for collecting the
00:11:15.810
trace. It's not responsible for actually talking to our server at all, and that means that the coordinator only has to do this queueing, keeping this
00:11:22.890
bunch of state and work in one place, and doesn't end up using as much memory.
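As an aside, the agent/coordinator split can be sketched with a local socket — each app process hands its traces to one coordinator, which alone owns the upload queue (a toy sketch with made-up payloads, not Skylight's actual implementation):

```ruby
require "socket"
require "json"
require "tmpdir"

# Each Rails worker's agent writes traces to a local socket; a single
# coordinator process batches them and is the only one talking to the backend.
path = File.join(Dir.mktmpdir, "agent.sock")
server = UNIXServer.new(path)

coordinator = Thread.new do
  batch = []
  client = server.accept
  while (line = client.gets)
    batch << JSON.parse(line)  # the coordinator batches; memory cost is paid once
  end
  batch
end

agent = UNIXSocket.new(path)
agent.puts({ endpoint: "PostsController#index", duration_ms: 488 }.to_json)
agent.close

coordinator.value.length  # => 1
```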
00:11:27.950
and I think this ended up being very effective for us I think that low overhead also allows us to just collect
00:11:33.780
more information in general yep um now after our first attempt we started
00:11:39.630
getting a bunch of customers that were telling us that — so, the separate coordinator is sort of a good thing and a bad thing. On the one hand,
00:11:45.480
there's only one of them, so it uses up only one set of memory; on the other hand, it's really easy for someone to go in
00:11:51.180
and ps that process and see how many megabytes of memory it's using. So we got a lot of initial complaints that said, oh,
00:11:56.490
your process is using a lot of memory. I know Ruby pretty well, and I spent a couple of weeks — I
00:12:03.510
actually wrote a gem called allocation counter — that basically went in and tried to pinpoint exactly where the allocations were coming from. But it
00:12:09.480
turns out that it's actually really, really hard to track down exactly where allocations are coming from in Ruby, because something as simple as using a
00:12:15.240
regular expression in Ruby can allocate match objects that get put back on the stack. And so I was able to pare this
00:12:21.180
down to some degree, but I quickly discovered that trying to keep a lid on memory allocation by doing
00:12:27.330
all this stuff in Ruby is mostly fine — but for our specific use case, where we really want to be telling you
00:12:32.820
you can run the agent in your process, on your box, and it's not going to use a lot of memory, we really needed something more
00:12:38.490
efficient. And our first thought was, we'll use C++ or C — no problem, C is native, it's
00:12:43.710
great. And Carl did the work — Carl is very smart — and then he said, Yehuda, it is
00:12:48.750
now your turn, you need to start maintaining this. And I said, I don't trust myself to write C++ code that's running on all you guys' boxes and
00:12:54.990
not segfault, so I don't think that works for me. And so I noticed
00:13:00.840
that Rust was coming along, and what Rust really gives you is the ability to write low-level code, à la C
00:13:06.390
or C++ with manual memory management — which keeps your memory allocation low and keeps things speedy, low
00:13:11.820
resource utilization — while also giving you compile-time guarantees about not segfaulting. So again, if your process
00:13:17.940
randomly started segfaulting because you installed our agent, I think you would stop being our customer very quickly, so having pretty much one-hundred-
00:13:23.340
percent guarantees about that was very important to us, and that's why we decided to use Rust. Just keep going, keep
00:13:30.570
going. Um, so we have this coordinator object, and basically the coordinator object is receiving events, so the events
00:13:36.180
basically end up being these traces that describe what's happening in your application. And the next thing — in
00:13:42.210
our initial work on this we used JSON just to send the payload to the server, but we noticed that a lot of people have
00:13:47.580
really big requests. You may have a big request with a big SQL query in it, or a lot of big SQL queries in it; some
00:13:53.370
people have traces that are hundreds and hundreds of nodes long. And so we really wanted to figure out how to shrink down the payload size to something that we
00:14:01.080
could be, you know, pumping out of your box on a regular basis without running up your bandwidth costs. So one of the
00:14:07.950
first things that we did early on was we switched to using protobufs as the transport mechanism, and that really shrunk down the payloads a lot. Our
00:14:15.300
earlier prototypes for actually collecting the data were written in Ruby, but I think Carl did like a weekend hack
00:14:20.640
to just port it over to Java and got like 200x performance. And you don't always get 200x performance — if mostly
00:14:26.490
what you're doing is database queries, you're not going to get a huge performance win — but mostly what we're doing is math and algorithms and data
00:14:32.070
structures, and for that, Ruby could in theory one day have a good JIT or something, but today, writing that code
00:14:38.220
in Java didn't end up being significantly more code, because it's just, you know, algorithms and data structures. And also, just to note something
00:14:43.560
about standardizing on protobufs in our stack: it's actually a huge win, because we realized, hey, browsers, it
00:14:49.590
turns out, are pretty powerful these days — they can allocate memory, they can do all types of computation — and protobuf libraries
00:14:56.550
exist everywhere. So we save ourselves a lot of computation and a lot of time by just treating protobuf as the canonical
00:15:03.120
serialization format, and then you can move payloads around the entire stack and everything speaks the same language, so you save on the serialization and
00:15:08.370
deserialization, and JavaScript is actually surprisingly effective at taking protobufs and converting them to
00:15:13.470
the format that we need, efficiently. So we basically take this data — the Java
00:15:19.140
collector is basically collecting all these protobufs, and pretty much it just turns around — and this is sort of
00:15:24.420
where we got into bespoke territory before. We started rolling our own, but we realized that when you write a big
00:15:29.490
distributed, fault-tolerant system, there are a lot of problems that you really just want someone else to have thought about. So
00:15:35.130
what we do is we basically take these payloads that are coming in, we convert them into batches, and we send
00:15:41.009
the batches down into the Kafka queue. So Kafka is basically just a queue that
00:15:48.209
allows you to throw things into — I guess it might be considered similar to something like AMQP. It has some nice
00:15:54.120
fault-tolerance properties and integrates well with Storm, but most importantly, it's just super, super high throughput. So we basically don't want to
00:16:00.630
put any barrier between you giving us the data and us getting it into the Storm cluster, which we'll talk about in a
00:16:05.639
bit. So basically Kafka takes the data and starts sending it into Storm. And
00:16:10.800
if you think about what has to happen in order to aggregate these requests — you have these requests, there are, you know, maybe
00:16:16.579
traces that have a bunch of SQL queries, and our job is basically to take all those SQL queries and say, okay, I can see that in all of your requests you
00:16:22.620
have this SQL query, and it took around this amount of time, and it happened as a child of this other node. And the way to think about that is basically just as a
00:16:28.440
processing pipeline, right? So you have these traces that come in one side, you start passing them through a bunch of processing steps, and then you end up on
00:16:34.740
the other side with the data. And Storm is actually a way of describing that processing pipeline in a sort of
00:16:40.350
functional style, and then you tell it, okay, here's how many servers I need, here's how I'm going
00:16:45.389
to handle failures, and it basically deals with distribution and scaling and all that stuff for you — and part of that
00:16:52.110
is because you wrote everything using a functional style. And so what happens is Kafka sends the data into the entry
00:16:57.930
spout — which is the terminology in Storm for these
00:17:03.120
streams that get created — and they basically go into these processing things which, very cleverly, cutely, are
00:17:08.520
called bolts. This is definitely not the naming I would have used, but they're
00:17:13.530
called bolts, and the idea is that basically every request may have several things happening to it. So, for example, we now
00:17:18.600
automatically detect N+1 queries, and that's sort of a different kind of processing from "make a picture of
00:17:24.929
the entire request" or "what is the 95th percentile across your entire app," right? These are all different kinds of processing. So we take the data and we
00:17:30.929
send it into a bunch of bolts, and the cool thing about bolts is that, again, because they're just functional chaining,
00:17:36.900
you can take the output from one bolt and feed it into another bolt, and that works pretty well. And you
00:17:43.500
don't have to worry about — I mean, you have to worry about things like fault tolerance, failure, idempotence, but you worry about
00:17:49.830
them at the abstraction level, and then the operational part is handled for you. It's a very declarative way of
00:17:55.799
describing how this computation works, in a way that's easy to scale.
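As an aside, the "functional pipeline" idea can be sketched in plain Ruby — each "bolt" below is just a function whose output feeds the next (a toy stand-in for Storm's spouts and bolts; the endpoint names and data shapes are made up):

```ruby
# Toy processing pipeline: traces flow through a chain of "bolts",
# each a pure function, ending in per-endpoint aggregates.
extract_queries = ->(traces) {
  traces.flat_map { |t| t[:queries].map { |q| { endpoint: t[:endpoint], sql: q } } }
}
count_by_endpoint = ->(rows) {
  rows.group_by { |r| r[:endpoint] }.transform_values(&:length)
}

pipeline = [extract_queries, count_by_endpoint]

traces = [
  { endpoint: "PostsController#index", queries: ["SELECT ...", "SELECT ..."] },
  { endpoint: "PostsController#show",  queries: ["SELECT ..."] }
]

result = pipeline.reduce(traces) { |data, bolt| bolt.call(data) }
# result => { "PostsController#index" => 2, "PostsController#show" => 1 }
```

Because each step is a pure function over its input, a framework like Storm can distribute and retry the steps for you — that is the declarative property being described.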
00:18:01.860
And Carl actually talked about this at very high speed yesterday — some of you may have been there — and I would recommend watching the video when it comes out if
00:18:07.170
you want to learn more about how to make use of this stuff in your own applications. And then, when you're finally done with all the processing, you
00:18:13.530
need to actually do something with it — you need to put it somewhere so that the web app can get access to it — and we use Cassandra for this. And
00:18:20.669
Cassandra, again, is mostly a dumb database, but it has high
00:18:26.010
capacity and it has some of the fault-tolerance properties that we need. We're just very write-heavy, right? Like, we tend to be
00:18:31.470
writing more than we're reading. Yep. And then, when we're done with a particular batch, Cassandra
00:18:37.919
basically kicks off the process over again. So we're basically doing these things as batches — these are roll-ups, is what's happening here. So
00:18:43.650
basically every minute, every 10 minutes, and then every hour, we re-process the earlier aggregates, so that when you query
00:18:49.380
us we know exactly what to give you. Yep. So we sort of have this cycle where we start off — obviously, in
00:18:55.230
the first minute you really want high granularity, you want to see what's happening right now, but if you want to go back and look at data from three
00:19:00.480
months ago, you probably care about, like, the day granularity, or maybe the hour granularity. So we basically do
00:19:06.179
these roll-ups, and we cycle through the process.
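As an aside, the roll-up cycle amounts to re-aggregating fine-grained buckets into coarser ones (granularities and data shapes here are illustrative, not Skylight's actual schema):

```ruby
# Roll up fine-grained buckets into coarser ones — e.g. per-minute
# request counts into per-hour counts — so old data stays cheap to query.
def roll_up(buckets, factor)
  buckets.group_by { |timestamp, _| timestamp - (timestamp % factor) }
         .transform_values { |group| group.sum { |_, count| count } }
end

per_minute = { 0 => 10, 60 => 20, 120 => 5, 3600 => 7 }  # epoch second => count
per_hour = roll_up(per_minute, 3600)
# per_hour => { 0 => 35, 3600 => 7 }
```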
00:19:11.250
So building this system, it turns out, required an intense amount of work. Carl spent probably six months reading PhD theses
00:19:21.090
to find data structures and algorithms that we could use, because this is a huge amount of data. Like, I
00:19:26.940
think even a few months after we were in private beta, we were already handling over a billion requests
00:19:34.140
per month. And obviously — basically, the number of requests that we handle is the sum of all the requests
00:19:39.600
that you handle, and all of our customers handle, right? So that's a lot of requests. So obviously we can't provide a
00:19:44.730
service — at least not an affordable service, an accessible service — if we have to store
00:19:51.150
terabytes or exabytes of data just to tell you how your app is running. And I think it's also problematic if you
00:19:56.880
store all the data in a database and then, every single time someone wants to learn something about it, you have to do a query — those queries can take a very
00:20:02.400
long time, they can take minutes — and we really wanted to have something where the
00:20:07.560
feedback loop would be fast. So we wanted to find algorithms that let us handle the data in real time and then provide
00:20:13.830
it to you in real time, instead of this "dump the data somewhere and then do these complicated queries" approach. So this
00:20:21.480
slide was not supposed to be here — there was supposed to be a real slide. Um, I went too far. Okay, we'll watch that again —
00:20:28.950
that's pretty cool. So the last thing I want to say is, perhaps your takeaway from looking at this architecture
00:20:34.320
diagram is, oh my gosh, these Rails guys have completely — they've jumped the shark, they've ditched Rails. I saw like three
00:20:41.310
tweets yesterday — I wasn't here, I was in Portland yesterday — but I saw like three tweets that were like, I'm at RailsConf and I haven't seen a single talk about
00:20:46.530
Rails. So that's true here too, but I want
00:20:51.900
to assure you that we are only using this stack for the heavy computation. We started in Rails. We started with, like,
00:20:59.880
what do we need? People need to authenticate and log in, and we probably need to do billing, and those are
00:21:05.010
all things that Rails is really, really good at. So we started with Rails as basically the starting point, and then we
00:21:10.140
realized, oh my gosh, the computation is really slow, there's no way we're going to be able to offer the service — okay, now
00:21:15.300
let's think about how we can fix that. I think, notably, a lot of people who look at Rails — there are a lot of companies that have built big stuff on
00:21:21.330
Rails, and their attitude is like, oh, this legacy, terrible Rails app, we really should get rid of it; if we could just write everything in Scala or Clojure or Go,
00:21:28.410
everything would be amazing. That is definitely not our attitude. Our attitude is that Rails is really amazing at the kinds of things that
00:21:34.950
are really common across everyone's web applications — authentication, billing, etc. — and we really want to be using Rails for
00:21:41.160
those parts. Even things like error tracking we do through the Rails app. We want to be using Rails because it's very productive at doing those things; it
00:21:47.070
happens to be very slow at doing data crunching, so we're going to use a different tool for that. But I don't think you'll ever see me getting up and
00:21:52.230
saying, I really wish we'd just started writing, you know, the Rails app in Rust.
00:21:57.920
So that's number one: honest response times — which, it turns out,
00:22:03.900
seems like it should be easy, but requires storing an insane amount of data. So the
00:22:09.270
second thing that we realized, looking at all of these tools, is that most of them focus on data. They focus on
00:22:14.370
giving you the raw data, but I'm not a machine, I'm not a computer — I don't enjoy sifting through data, that's what
00:22:20.460
computers are good for. I would rather be drinking a beer — it's really nice out this time of year. So we wanted to think about: if you are trying to
00:22:27.570
solve the performance problems in your application, what are the things that you would suss out with the existing tools
00:22:33.480
after spending, like, four hours digging through the data to get there? And I think part of this is just —
00:22:40.580
people like to think that they're going to use these tools, but when the tools require you to dig through a lot of data, people just don't use them very much. So
00:22:46.680
the goal here was to build a tool that people actually use, and actually like using, and not to build a tool that happens to provide a lot of data you can
00:22:52.710
sift through. Yeah, so probably one of the first things that we realized is what we don't want to provide. This is a
00:22:58.710
trace of a request — you've probably seen similar UIs in other tools, or, for example, the inspector in Chrome
00:23:05.310
or Safari — and this is just showing, basically, a visual stack trace of where your application is
00:23:11.040
spending its time. But I think what was important for us is showing not just a single request, because your app handles,
00:23:18.540
you know, hundreds of thousands of requests, or millions of requests, so looking at a single request, statistically speaking, is just noise —
00:23:24.540
and it's especially bad if it's the worst request, because the worst request really is noise; it's like, hey, a slow
00:23:30.120
network, right? Yeah, it's literally the outlier. It's literally the outlier, yep. So what we present in Skylight is
00:23:37.440
something a little bit different, something that we call the aggregate trace. So the aggregate trace is
00:23:43.760
basically us taking all of your requests and averaging out where each of these
00:23:49.710
things spends its time, and then showing you that. So this is basically, like, the
00:23:56.760
Statue of David: it is the idealized form of the stack trace of how your application is behaving. But of course you
00:24:03.630
have the same problem as before, which is, if this was all that we were showing you, it would be obscuring a lot
00:24:10.600
of information. You want to actually be able to tell the difference between, okay, what does my stack trace look like for fast requests, and how does that differ from
00:24:16.860
requests that are slower. So I've got a little video here — you can see that when I move this slider,
00:24:22.539
the trace below it is actually updating in real time. As I move the
00:24:28.419
slider around, you can see that the aggregate trace actually updates with it, and that's because we're collecting all
00:24:33.970
of this information — we're collecting, like I said, a lot of data — so we can recompute this aggregate trace on the fly. Basically, for each bucket we're
00:24:40.299
storing a different trace, and then on the client we're reassembling that — we'll go into that a little bit. And I think it's really important that you be able to do
00:24:46.450
these experiments quickly: if every time you think, oh, I wonder what happens if I add another histogram bucket, it
00:24:51.850
requires a whole full-page refresh, then that would basically make people not want to use the tool, not able to use the tool. So actually building something
00:24:58.330
that is real-time and fast and gets the data as it comes in was very important to us. So that's number one — and the second
00:25:04.299
thing: so we built that, and we're like, okay, well, what's next? And I think the big problem with this is that you need to
00:25:09.580
know that there is a problem before you go look at it, right? So we've been working for the past few months — and the
00:25:15.760
Storm infrastructure that we've built makes it pretty straightforward to start building more abstractions on top of the data that we've already collected;
00:25:20.980
it's a very declarative system — we've been working on a feature called inspections. And what's cool about
00:25:26.559
inspections is that we can look at this tremendous volume of data that we've collected from your app, and we can
00:25:31.570
automatically tease out what the problems are. So the first one that we've shipped — this is in beta right now, it's not out and enabled by default,
00:25:38.289
but it's behind a feature flag that we've had some users turning on and trying out — what we can do in this
00:25:44.860
case is, because we have information about all the database queries in your app, we can look and see if you have N+1
00:25:50.980
plus 1 queries. Maybe explain what an N+1 query is? Yes — so hopefully people know what N+1 queries are, but it's the
00:25:56.169
idea that, by accident or for some reason, instead of making one query, you ask for, like, all the posts and then you
00:26:02.320
iterate through all of them and get all the comments. So now, instead of having one query, you have one query per
00:26:08.020
post, right? What you'd like to do is eager loading, where you say includes(:comments) — but you have to know to do that.
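A toy simulation of the query counts involved — plain Ruby, no Rails required, and the query strings are made up for illustration:

```ruby
# An N+1 access pattern issues one query for the posts, then one per post:
def n_plus_one_queries(post_ids)
  ["SELECT * FROM posts"] +
    post_ids.map { |id| "SELECT * FROM comments WHERE post_id = #{id}" }
end

# Eager loading (includes(:comments) in ActiveRecord) issues two queries
# regardless of how many posts there are:
def eager_queries(post_ids)
  ["SELECT * FROM posts",
   "SELECT * FROM comments WHERE post_id IN (#{post_ids.join(', ')})"]
end

n_plus_one_queries([1, 2, 3]).size  # => 4 (1 + N)
eager_queries([1, 2, 3]).size       # => 2
```

With a thousand posts, the first pattern issues 1001 queries and the second still issues two — which is why the pattern is worth detecting automatically.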
00:26:13.870
So there are some tools that will run in development mode, if you happen to catch it there — like Bullet. This is basically a tool
00:26:19.929
that's looking at every single one of your requests, and it has some thresholds: once we see that a bunch of your requests have the same exact query —
00:26:26.470
so we do some work to pull out binds; if it's like WHERE something = 1, we will automatically pull out the 1
00:26:32.530
and replace it with a question mark — then we basically take all of those queries, and if they're the exact same query repeated multiple times, subject to
00:26:39.430
some thresholds, we'll start showing you: hey, there's an N+1 query. And you can imagine the same sort of thing being done for things like, are you missing an
00:26:45.940
index, right? Or are you using the Ruby version of, say, JSON when you should be using the native version? These
00:26:51.640
are all things that we can start detecting just because we're consuming an enormous amount of information, and we can start writing heuristics for bubbling it up.
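A rough sketch of that normalize-and-count heuristic — illustrative only: the regexes and threshold here are made up, and Skylight's real bind extraction is considerably more careful:

```ruby
# Replace literal values with "?" so that structurally identical queries
# compare equal (a crude stand-in for bind extraction).
def normalize(sql)
  sql.gsub(/'[^']*'/, "?").gsub(/\b\d+\b/, "?")
end

# Flag a request whose query log repeats one normalized query past a threshold.
def n_plus_one?(queries, threshold: 5)
  queries.group_by { |q| normalize(q) }.any? { |_, group| group.size >= threshold }
end

log = (1..10).map { |id| "SELECT * FROM comments WHERE post_id = #{id}" }
normalize(log.first)                  # => "SELECT * FROM comments WHERE post_id = ?"
n_plus_one?(log)                      # => true
n_plus_one?(["SELECT * FROM posts"])  # => false
```

The threshold keeps the occasional legitimate repeated query from being flagged; only a run of many structurally identical queries in one request trips the inspection.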
00:26:58.000
So, third and final breakthrough: we realized that we really, really needed a lightning-fast UI —
00:27:04.180
something really responsive. In particular, the feedback loop is critical, right? You can imagine if the way that
00:27:09.880
you dug into data was: you click, and you wait an hour, and then you get your results — no one would do it, no one would
00:27:14.920
ever do it. And the existing tools are okay, but you click and you wait, you look at it,
00:27:20.320
and you're like, oh, I want a different view — so then you go edit your query, and then you click and you wait. It's just not a pleasant experience. So we use
00:27:28.330
Ember. The UI that you're using when you log into Skylight, even though it feels just like a regular website — it doesn't feel like a native app — all of the
00:27:36.040
routing, all the rendering, all the decision making is happening in an Ember.js app, and we pair that with D3.
00:27:42.310
So all of the charts — the chart that you saw there, the aggregate trace — those are all Ember components powered by D3. So
00:27:49.510
this has actually significantly cleaned up our client-side code; it makes reusability really, really awesome. So
00:27:55.540
I'll give you an example. This is from our billing page: the designer came, and they had a component that was like the date component, and it seemed really
00:28:03.250
boring at first — seemed really boring — but this is the implementation, right? So you could copy and paste this code over and
00:28:09.550
over again everywhere you go; you just have to remember to format it correctly, because if you forget to format it, it's not going to look the same everywhere. But I was like,
00:28:14.890
hey, we're using this all over the place — why don't we bundle this up into a component? And so with Ember it was super
00:28:20.230
easy: we basically just said, okay, here's a new calendar-date component; it has a property on it called date. Just set that
00:28:25.270
to any JavaScript date object — you don't have to remember anything about converting it or formatting it. Here's the component: set the date, and it will render the correct thing automatically.
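The win is that the formatting rule lives in one place instead of being copy-pasted. The real component is an Ember/Handlebars one; the shape of the idea can be sketched in Ruby (the class name and the format string here are made up):

```ruby
require "date"

# One reusable "component": set the date, and the correct formatting is
# applied everywhere it's rendered -- nothing to remember at the call sites.
class CalendarDate
  def initialize(date)
    @date = date
  end

  def render
    @date.strftime("%b %-d, %Y")  # %-d: day of month without zero-padding
  end
end

CalendarDate.new(Date.new(2014, 4, 25)).render  # => "Apr 25, 2014"
```

If the designer later changes the date format, it changes in one place and every usage stays consistent.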
00:28:31.420
And so the architecture of the Ember app looks a little bit something
00:28:36.580
like this: you have many, many different components, most of them driven by D3, and they're plugged into the models on the controller, and the
00:28:43.060
Ember app will go fetch those models from the cloud — from the Java app, which just queries Cassandra —
00:28:48.850
and render them. And what's neat about this model is that turning on WebSockets is super easy, right? Because all of
00:28:56.050
these components are bound to a single place, when the WebSocket says, hey, we have updated information for you to show,
00:29:01.120
it just pushes it onto the model or onto the controller, and the whole UI updates automatically. It's like magic.
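That push-based flow is essentially the observer pattern, which Ember's bindings implement for you. A minimal plain-Ruby sketch of the idea (all names made up):

```ruby
# Minimal stand-in for a bound model: "components" subscribe once, and a
# single push (say, from a WebSocket message) re-renders every subscriber.
class BoundModel
  def initialize
    @data = {}
    @subscribers = []
  end

  def subscribe(&render)
    @subscribers << render
  end

  def push(update)
    @data.merge!(update)
    @subscribers.each { |render| render.call(@data) }
  end
end

model = BoundModel.new
rendered = []
model.subscribe { |data| rendered << "chart: #{data[:rpm]} rpm" }
model.subscribe { |data| rendered << "label: #{data[:rpm]} rpm" }

model.push(rpm: 420)  # one push updates both "components"
rendered  # => ["chart: 420 rpm", "label: 420 rpm"]
```

Because every view reads from the same bound model, the WebSocket handler never needs to know which components exist; it just pushes data in one place.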
00:29:06.760
And when debugging, this is especially awesome too, because — I don't know, maybe we'll show a demo
00:29:13.510
of the Ember inspector using this. So yeah — that lightning-fast UI, reducing the feedback loop so that you can quickly play with your data, makes it
00:29:20.200
go from a chore to something that actually feels kind of fun. So these were
00:29:25.690
the breakthroughs that we had when we were building Skylight — the things that made us think, yes, this is actually a product that we think deserves to be on
00:29:30.730
the market. So: honest response times, collecting data that no one else can collect, focusing on answers instead of just
00:29:36.010
dumping data, and having a lightning-fast UI to do it. So we'd like to think of Skylight as basically a smart profiler — a smart profiler that
00:29:42.130
runs in production. It's like the profilers that run on your local development machine, but instead of being on your
00:29:47.470
local dev box — which has nothing to do with the performance characteristics of what your users are experiencing — we're actually running in production. So let me
00:29:55.540
just give you guys a quick demo
00:30:01.049
So this is what Skylight looks like — let me — there we go. So the first thing here
00:30:08.820
is: we've got the app dashboard. So this shows our 95th-percentile response time — maybe you're
00:30:15.029
all hammering it right now; that would be nice. So this is a graph of your response time over time, and then on the right,
00:30:21.090
this is a graph of the RPMs — the requests per minute — that your app is handling. So this is app-wide, and this is
00:30:26.970
live; this updates every minute. Then down below you have a list of the endpoints in your application, so you can see
00:30:33.659
actually the top — the slowest ones. For us, we have an instrumentation API and we've gone and instrumented our background workers, so we can see them
00:30:39.869
here, and their response times play in. So we can see that we have this reporting worker that's taking, at the 95th percentile, 13
00:30:47.009
seconds. All that time used to be inside of some request somewhere, and we discovered that there was a lot of time being spent in things that we could push
00:30:53.399
to the background. We probably need to update the agony index so that it doesn't rank workers very high, because spending some time in your workers is not that big of a deal.
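As an aside, the 95th-percentile figure shown throughout is just the value below which 95% of request times fall. A minimal nearest-rank-style sketch of computing it from raw samples — illustrative only; Skylight's real aggregation works over histogram buckets, not raw sample lists:

```ruby
# Illustrative percentile over raw samples (nearest rank, with rounding).
def percentile(samples, p)
  sorted = samples.sort
  sorted[((p / 100.0) * (sorted.size - 1)).round]
end

response_times_ms = (1..100).to_a
percentile(response_times_ms, 95)  # => 95
```

The point of reporting p95 rather than a mean is that one request in twenty is still slower than this number, so it tracks what your unlucky users actually experience.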
00:30:59.549
So then if we dive into one of these, you can see that for this request we've got the time explorer up
00:31:06.419
above, and that shows a graph of response time — again, 95th percentile — and if you want to go back and look at
00:31:12.450
historical data, you just drag it like this. And this has got a brush, so you can zoom in and out on different time ranges, and
00:31:17.489
every time you change that range, you can see that it's very responsive — it's never waiting for the server, but it is going back and fetching more data from the
00:31:23.489
server, and then when the data comes back, you see the whole UI just update — and we get that for free with Ember and D3. And then down below, as we discussed, you
00:31:30.840
actually have a real histogram — and this histogram in this case is showing, so, 457 requests — and if we click and
00:31:37.529
drag, we can just move this, and you can see that the aggregate trace below updates in response to us dragging.
00:31:43.049
And if we want to look at the fastest quartile, we just click Faster, and we'll choose that range on the histogram — I
00:31:48.659
think it's the fastest third — and then if you click on Slower, you can see the slow requests. So this makes it really easy to
00:31:53.940
compare and contrast: okay, why are certain requests fast, or why are certain requests slow? You can see these
00:31:59.609
blue areas — this is Ruby code. So right now it's not super granular — it would be nice if you could actually know what's
00:32:06.029
going on here — but it'll at least tell you where in your controller action this is happening. And then you can actually
00:32:11.879
see which database queries are being executed and what their duration is, and you can see that we actually extract the SQL
00:32:18.180
and we normalize it, so you can see exactly what those requests are even if the values
00:32:23.790
are totally different between them. Yeah — so the real query, courtesy of Rails not yet supporting bind extraction, is like WHERE
00:32:29.550
id = 1, or 10, or whatever. Yep. So that's pretty cool. One other thing is,
00:32:36.380
initially we actually just showed the whole trace, but we discovered that, obviously, when you show whole traces, you have information that doesn't
00:32:42.450
really matter that much. So we've recently started to collapse things that don't matter so
00:32:48.390
much — you can basically expand or condense the trace — and we wanted to make it so that you don't have to think about expanding or condensing individual areas:
00:32:55.380
you just see what matters the most, and then you can expand the trivial areas. Yep. So that's the Skylight demo. We'd really
00:33:02.610
like it if you checked it out. There is one more thing I want to show you that is, like, really freaking cool. This is
00:33:07.680
coming out of Tilde Labs. Carl has been hacking — he's been up until past
00:33:12.870
midnight, getting almost no sleep, for the past month, trying to have this ready. I don't know how many of you know this, but
00:33:19.460
Ruby 2.1 has a new stack-sampling
00:33:25.320
feature, so you can get really granular information about how your Ruby code is performing. So I want to show you — I just
00:33:25.320
mentioned how it would be nice if we could get more information about what your Ruby code is doing, and now we can
00:33:32.940
do that. Basically, every few milliseconds, this code that Carl wrote is going into MRI and taking
00:33:45.570
a snapshot of the stack, and because this is built in, it's very low-impact: it's
00:33:50.790
not allocating any new memory, there's very little performance overhead — you basically wouldn't even notice it. So every few
00:33:56.160
milliseconds it's sampling, and we take that information and we send it up to our servers. So it's almost like you're running a Ruby profiler on your local dev
00:34:02.820
box — where you get extremely granular information about where your code is spending its time, per line of Ruby or per method —
00:34:08.370
except it's happening in production.
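A toy version of the idea in pure Ruby — illustrative only: the real agent uses MRI's C-level frame-profiling API (the one gems like stackprof build on), which avoids the allocation that `Thread#backtrace` incurs here:

```ruby
# Toy sampling profiler: periodically capture another thread's stack and
# tally the top frame. Hot code dominates the tallies.
def sample(target, interval: 0.001, duration: 0.2)
  counts = Hash.new(0)
  (duration / interval).to_i.times do
    frames = target.backtrace
    counts[frames.first] += 1 if frames && frames.first
    sleep interval
  end
  counts
end

# A worker thread doing busy work, standing in for a production server loop.
worker = Thread.new do
  x = 0
  loop { x += 1 }
end

counts = sample(worker)
worker.kill
# The frames for the busy loop account for nearly all of the samples.
```

Because sampling only reads the stack every few milliseconds, the profiled code pays almost nothing, which is what makes it safe to leave running in production.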
00:34:16.110
So we enabled it in staging — you can see that we've got some rendering bugs; it's still in beta, and we haven't yet collapsed things
00:34:21.330
that are not important for this particular feature; we want to hide things like framework code, obviously. But
00:34:27.460
this gives you an incredibly, incredibly granular view of what your app is doing in production. And this is an
00:34:36.550
API that's built into Ruby 2.1 — and because our agent is running so low-
00:34:42.849
level, because we wrote it in Rust, we have the ability to do things like this. And Carl thinks that we may be able to
00:34:48.190
actually backport this to older Rubies too — so if you're not on Ruby 2.1, we think that we can actually bring this to you, but that's TBD. So I think the cool thing
00:34:55.330
about this in general is: this is a sampling profiler, right? We don't want to be burdening every
00:35:00.730
single thing that you do in your program with tracing; that would be very slow. So when you normally run a sampling profiler,
00:35:06.640
you have to basically create a loop — run this code a million times and keep sampling —
00:35:12.160
and eventually you'll get enough samples to get the information. But it turns out that your production server is a loop: your production server is serving tons
00:35:19.060
and tons of requests. So by simply taking a few microseconds out of every request and collecting a couple
00:35:24.849
of samples over time, we can actually get this really high-fidelity picture at basically no cost, and that's pretty
00:35:30.220
mind-blowing. And this is the kind of stuff that we can start doing by really caring about both the user
00:35:36.580
experience and the implementation, and getting really geeky about it. And honestly, this is a
00:35:42.010
really exciting feature that really shows what we can do as we start building things once we've got that groundwork. So if you guys want to
00:35:49.089
check it out: skylight.io. It's available today — it's no longer in private beta, everyone can sign up, no invitation token necessary, and you can
00:35:55.869
get a 30-day free trial if you haven't started one already. So if you have any questions, please come see us right now, or we have a booth in the vendor hall.
00:36:01.030
Thank you guys very much.