00:00:13.200
My name is Simon. I work on the infrastructure team at Shopify, and today I'm going to talk about the past five
00:00:19.920
years of scaling Rails at Shopify. I've only been around at
00:00:26.439
Shopify for four years, so the first year took a little bit of digging, but I want to talk about all of the
00:00:32.480
things that we've learned, and I hope that people in this audience can place themselves on this timeline and
00:00:37.719
learn from some of the lessons that we've had to learn over the past five years. This talk is inspired by
00:00:44.480
a guy called Jeff Dean from Google. He's a genius, and he did a talk about how they scaled Google for the
00:00:51.520
first couple of years: he showed how they ran with a couple of MySQLs, how they sharded, how they went through
00:00:57.879
the NoSQL paradigm, and how they finally arrived at the NewSQL paradigm
00:01:03.000
that we're now starting to see. This was really interesting to me because you saw why they made the decisions they made at that point in time. I've
00:01:10.159
always been fascinated by what made, say, Facebook decide that now was the
00:01:16.200
time to write a VM for PHP to make it faster. So this talk is about
00:01:21.240
that: it's an overview of the decisions we've made at Shopify, and less so the very technical details of
00:01:28.400
all of them. It's meant to give you an overview and a mental model for how we evolved our platform. There's tons of
00:01:35.960
documentation out there on all of the things I'm going to talk about today:
00:01:42.280
other talks by co-workers, blog posts, READMEs, and things like that. So I'm not going to get into the weeds, but I'm
00:01:47.719
going to provide an overview. I work at Shopify, and at this point you're probably tired of hearing
00:01:54.119
about Shopify, so I'm not going to talk too much about it. Overall,
00:01:59.880
Shopify is something that allows merchants to sell things to other people, and that's relevant for the rest of this
00:02:05.799
talk. We have hundreds of thousands of people who depend on Shopify for their
00:02:11.560
livelihood, and through this platform we run almost 100K RPS at peak: the largest
00:02:18.400
sales hit our Rails servers with almost 100K requests per second. Our steady
00:02:23.920
state is around 20 to 40K requests per second, and we run this on tens of thousands of workers across two data
00:02:31.160
centers. About $30 billion have made it through this platform, which means that downtime is costly, to say the
00:02:38.159
least. You should keep these numbers in the back of your head as the numbers that got us to the point
00:02:45.599
that we are at today. Roughly, these metrics double every year; that's the rule we've used. So if you go back five
00:02:52.000
years, you just have to cut them in half five times. I want to introduce a little bit
00:02:57.200
of vocabulary for Shopify, because I'm going to use it loosely in this talk. To understand how Shopify works:
00:03:03.799
Shopify has at least four sections. One of those sections is the storefront. This is where people are browsing
00:03:10.239
collections, browsing products, adding to the cart. This is the majority of traffic: somewhere between 80 to 90%
00:03:15.920
of our traffic is people browsing storefronts. Then we have the checkout.
00:03:21.239
This is where it gets a little bit more complicated. We can't cache as heavily as we do on the storefront; this is where we
00:03:26.599
have to do writes, decrement inventory, and capture payments. Admin is more complex: you have people
00:03:33.120
who apply actions to hundreds of thousands of orders concurrently, you have people who
00:03:40.000
change billing and need to be billed, and all these things are much more complex than both checkout and
00:03:45.799
storefront in terms of consistency. And the API allows you to change the majority of the things you can change in
00:03:51.720
the admin. The only real difference is that computers hit the API, and computers can
00:03:59.040
hit the API really fast. Recently I saw an app for people who wanted an offset on order numbers, starting at a
00:04:06.640
million. So this app will create a million orders and then delete all of them to get that offset. People do
00:04:12.599
crazy things with this API, and it is our
00:04:17.759
second largest source of traffic, after the
00:04:23.560
storefront. I want to talk a little bit about a philosophy that has shaped this platform over the past five years: flash
00:04:30.160
sales. Flash sales are really what built and shaped the platform that we have. When Kanye wants to drop his new album on
00:04:36.600
Shopify, it is the team that I am on that is
00:04:42.080
terrified. We had a sort of fork in the road five years ago. Five years
00:04:47.160
ago was when we started seeing customers who could drive more traffic for one of their sales than the entire
00:04:53.320
platform was otherwise serving, and they would drive it in multiples. So if we were serving a thousand requests per
00:05:00.320
second for all the stores on Shopify, some of these stores could get us to 5,000, and this happens in a matter of
00:05:07.080
seconds: their sale might start at 2 p.m., and that's when everyone comes in. So
00:05:12.360
there was a fork in the road: do we become a company that supports these sales, or do
00:05:17.639
we kick them off the platform, throttle them heavily, and say this is not the platform for you? That's a
00:05:23.039
reasonable path to take; 99.9-something percent of the stores don't have this
00:05:28.440
pattern and can't drive that much traffic. But we decided to go the other
00:05:33.800
route. We wanted to be a company that could support these sales, and we decided to form a team that would solve this
00:05:40.280
problem of customers who could drive enormous amounts of traffic in a very short amount of
00:05:46.880
time. This, I think, was a fantastic decision, and it happened exactly five years ago, which is why the time
00:05:53.400
frame of this talk is five years. And I think it was a powerful decision, because it has served as a
00:05:59.319
canary in the coal mine: the flash sales that we see today,
00:06:04.360
and the amount of traffic they can drive, say 80K RPS, are what steady state is
00:06:10.120
going to look like next year. So when we prepare for these sales, we know what next year is going to look like, and we
00:06:15.960
know that we're going to be fine next year, because we're already working on that problem. So they help us stay one
00:06:21.520
to two years ahead. In the meat of this talk, I will walk through the past
00:06:27.440
five years of the major infrastructure projects that we've done. These are not the only projects we've done:
00:06:32.720
there have been other apps and many other efforts, but these are the most important to the scaling of our Rails
00:06:39.560
application. 2012 was the year when we sat down and decided that we were going
00:06:44.840
to go the antifragile route with flash sales: we were going to become the best place in the world to have flash sales.
00:06:51.520
So a team was formed whose sole job was to make sure that Shopify as an application would stay up and be
00:06:57.440
responsive under these circumstances. And the first thing you do when you start optimizing an application is you
00:07:03.479
try to identify the low-hanging fruit. In this case, or in many cases,
00:07:08.720
the low-hanging fruit is very application-dependent. The lowest-hanging fruit on
00:07:13.759
the infrastructure side has already been harvested in your load balancers (Rails is really good at this) and your operating
00:07:19.199
system; they take care of all the generic optimization and tuning. So at some point that work has to be handed off to
00:07:25.280
you, and you have to understand your problem domain well enough to know where the biggest wins
00:07:30.960
are. For us, the first ones were things like backgrounded checkouts, and this sounds crazy: what do you mean, they
00:07:36.400
weren't backgrounded before? Well, the app was started in 2004 or 2005, and back then
00:07:41.960
backgrounding jobs in Ruby or Rails was not really a common thing, and we hadn't
00:07:47.080
really done it after that either, because it was such a large source of technical debt. So in 2012 a team sat down and
00:07:54.560
paid down the massive amount of technical debt needed to move the checkout process
00:08:00.319
into background jobs, so the payments were captured not in a request that took a long time, but in jobs, asynchronously
00:08:06.560
with the rest. This of course was a massive source of speedup: now you're not occupying all these workers with
00:08:12.319
long-running requests.
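To make that concrete, here is a minimal sketch of the idea (my own toy code, not Shopify's: a plain Ruby thread and queue stand in for a real job backend like Resque or Sidekiq):

```ruby
require "thread"

# Stand-in job queue; in a real Rails app this would be Resque or Sidekiq.
JOBS = Queue.new
CAPTURED = Queue.new

# A background worker captures payments off the request path.
worker = Thread.new do
  while (job = JOBS.pop) != :shutdown
    # The slow payment-gateway call happens here, not in the request.
    CAPTURED << "captured #{job[:amount]}"
  end
end

# The "request" now just enqueues the work and returns immediately,
# instead of holding a web worker for the whole payment capture.
def create_checkout(amount)
  JOBS << { amount: amount }
  { status: 202, body: "checkout pending" }
end

response = create_checkout(100)
result   = CAPTURED.pop # the job completes asynchronously

JOBS << :shutdown
worker.join

puts response[:status] # => 202
puts result            # => "captured 100"
```

The web worker is freed as soon as the enqueue succeeds; the capture happens whenever a job worker picks it up.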
00:08:18.560
Another domain-specific problem was inventory. You might think that inventory is just decrementing one number, and doing that really fast if you
00:08:24.240
have thousands of people, but MySQL is not good at that: if you try to decrement the same number from thousands
00:08:29.960
of queries at the same time, you run into lock contention. So we had to solve this problem as well.
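The talk doesn't say how this was solved, so as a labeled assumption on my part (not necessarily Shopify's fix), here is one common way to relieve that kind of hot-row contention: buffer the decrements and let a single writer apply them in batches, so thousands of requests don't fight over the same row lock.

```ruby
require "thread"

# In-memory buffer of pending decrements per product; a mutex stands in
# for the coordination a real implementation might do in Redis.
PENDING = Hash.new(0)
LOCK = Mutex.new

def record_sale(product_id, qty)
  # Cheap: bump a counter instead of hitting the MySQL row directly.
  LOCK.synchronize { PENDING[product_id] += qty }
end

def flush!(inventory)
  # One writer applies the whole batch: a single UPDATE per product
  # instead of thousands of competing row-lock acquisitions.
  LOCK.synchronize do
    PENDING.each { |id, qty| inventory[id] -= qty }
    PENDING.clear
  end
end

inventory = { 42 => 10_000 }
threads = 100.times.map { Thread.new { 10.times { record_sale(42, 1) } } }
threads.each(&:join)
flush!(inventory)
puts inventory[42] # => 9000
```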
00:08:35.599
These are just two of many problems we solved. In general, what we did was print out the debug logs of every single query on
00:08:41.519
the storefront, the checkout, and all of the other hot paths, and basically start checking them off. I couldn't
00:08:47.160
find the original picture, but I found one from a talk that someone from the company did three years ago, where you
00:08:53.160
can see the wall where the debug logs were taped. The team at the time would go and cross off queries, write
00:08:59.600
their names on them, and figure out how to reduce them as much as possible. But you need a feedback loop
00:09:04.839
for this. We couldn't wait until the next sale to see if the optimizations we'd done actually made a
00:09:10.200
difference. We needed a better way than just crashing at every single flash sale: a tighter feedback loop, just
00:09:16.040
like when you run the tests locally and know whether it worked pretty much right away. We wanted the same for
00:09:22.519
performance, so we wrote a load testing tool. What this load testing tool does is simulate a
00:09:28.519
user performing a checkout: it will go to the storefront, browse around a little bit, find some products, add them
00:09:34.839
to its cart, and perform the full checkout. It will fuzz-test this entire checkout procedure, and then we spin up
00:09:41.480
thousands of these in parallel to test whether a performance change actually made any difference. This is now so deeply
00:09:46.880
embedded in our infrastructure culture that whenever someone makes a performance change, people ask: well, how
00:09:52.040
did the load testing go? This was really important for us. A siege that just hits the storefront, something
00:09:58.399
that just runs a bunch of the same requests, is just not a realistic benchmark of real scenarios.
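As a toy illustration of the shape of such a tool (a hypothetical structure; the real load tester fuzzes the full checkout against a live storefront over HTTP), this spins up many simulated shoppers in parallel:

```ruby
require "thread"

# One simulated shopper: browse, add to cart, check out.
# `app` is a stand-in for HTTP calls against a real storefront.
def simulated_checkout(app)
  app.call("/products")                  # browse around a little
  app.call("/cart/add?id=#{rand(100)}")  # fuzz which product gets bought
  app.call("/checkout")
end

# Spin up many shoppers in parallel and time the whole run, giving a
# tight feedback loop for performance changes.
def load_test(app, shoppers: 50)
  started = Time.now
  shoppers.times.map { Thread.new { simulated_checkout(app) } }.each(&:join)
  Time.now - started
end

requests = Queue.new
fake_app = ->(path) { requests << path } # records calls instead of real HTTP
elapsed = load_test(fake_app, shoppers: 50)
puts "#{requests.size} requests in #{elapsed.round(3)}s"
```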
00:10:05.000
Another thing we did at this time was write a library called Identity
00:10:10.240
Cache. We had one MySQL at this point hosting tens of thousands of stores, and when you have
00:10:17.680
that, you're pretty protective of that single database. We were doing a lot of queries against it, and these
00:10:23.920
sales especially were driving a massive amount of traffic at once to the database. So we needed a way of reducing the load
00:10:30.640
on the database. The normal way of doing this, or the most common way of doing
00:10:35.959
this, is to start sending queries to the read slaves: you have databases that feed off of the one that you write to,
00:10:41.880
and you start reading from those. We tried to do that at the time; it has a
00:10:47.160
lot of nice properties over other methods. But when we did this back in the day, there weren't
00:10:53.440
any really good libraries in the Rails world. We tried to fork some and tried to figure something out, but we ran into
00:10:58.839
data corruption issues, and the management of the read slaves
00:11:04.240
was really problematic at the time because we didn't have any DBAs. And mind you, this was a
00:11:10.079
team of Rails developers who had to turn into infrastructure developers and understand all of this
00:11:16.240
stuff, and learn on the job, because we had decided to handle flash sales
00:11:22.120
the way that we did. We just didn't know enough about MySQL and these things to go down that path, so we decided to
00:11:30.200
figure out something else. Deep inside of Shopify, Toby had written a commit many, many years ago introducing
00:11:36.920
this idea of identity cache: managing your cache out of band in Memcached, the idea
00:11:43.240
being that if I query for a product, I look at Memcached first and see if it's there. If it's there, I'll
00:11:49.480
just grab it and not even touch the database. If it's not there, I'll put it there, so that for the next request it
00:11:55.320
will be there. And every time we do a write, we just expire those entries. That's how we manage the cache.
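The pattern that commit introduced looks roughly like this (a self-contained sketch: Hashes stand in for Memcached and MySQL; the real IdentityCache gem is open source and far more thorough):

```ruby
MEMCACHED = {}                              # stand-in for the out-of-band cache
DB = { 1 => { id: 1, title: "T-shirt" } }   # stand-in for MySQL

# Read-through: hit the cache first, fall back to the database, and
# populate the cache so the next request never touches MySQL.
def fetch_product(id)
  MEMCACHED[id] ||= DB[id].dup.freeze # frozen: read-only, can't corrupt the DB
end

# Every write expires the cache entry instead of updating it in place.
def update_product(id, attrs)
  DB[id] = DB[id].merge(attrs)
  MEMCACHED.delete(id)
end

fetch_product(1)                    # miss: reads the DB, fills the cache
fetch_product(1)                    # hit: never touches the database
update_product(1, title: "Hoodie")  # the write expires the entry
puts fetch_product(1)[:title]       # => "Hoodie"
```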
00:12:00.399
This has a lot of drawbacks, because that cache is never going to be 100% consistent with what is in the
00:12:05.440
database. So when we do a read from that managed cache, we never write it back to the database; it's too dangerous.
00:12:12.079
That's also why the API is opt-in: you have to call fetch instead of find to use
00:12:17.480
IdentityCache, because we only want to use it on these paths, and it returns read-only records, so you cannot change them and
00:12:23.519
corrupt your database. This is the massive downside of using read slaves or identity cache or something
00:12:30.480
like this: you have to deal with what you're going to do when the cache is stale or
00:12:36.600
old. This is what we decided to do at the time. I don't know if it's what we
00:12:41.880
would do today; we've gotten much better at handling read slaves, and they have a lot of other advantages, such as being able
00:12:47.959
to do much more complicated queries. But this is what we did at the time, and if you're having severe scaling issues already, identity cache is a very simple
00:12:55.199
thing to adopt and use. So that was the end of 2012, and what would have
00:13:02.560
been probably our worst Black Friday/Cyber Monday ever, because the team was working night and day to make this
00:13:07.920
happen. There's this famous picture of our CTO face-planted on the ground after
00:13:14.760
the exhausting work of scaling Shopify at the time, and someone then woke him up and told him: hey dude, checkout is
00:13:21.760
down. We were not in a good place, but identity cache, load testing, and all this optimization saved us. And once
00:13:30.560
the team had decompressed after this massive sprint to survive these sales and survive Black Friday and Cyber Monday that year, we decided to raise the
00:13:38.199
question of how we could never get into this situation again. We'd spent a lot of time optimizing checkout and storefront,
00:13:45.079
but this is not sustainable: if you keep optimizing something for long enough, it becomes
00:13:50.480
inflexible. Fast code is often hard-to-change code. If you've optimized
00:13:55.680
storefront and checkout and have a team that only knows how to do that, there's going to be a developer who's going to
00:14:00.720
come in, add a feature, and add a query as a result, and this should be okay: people should be allowed to add queries without
00:14:07.600
understanding everything about the infrastructure. Often the slower thing is the more flexible thing. Think of a
00:14:13.880
completely normalized schema: it is much easier to change and build upon, and that is the entire point of a relational
00:14:19.720
database. But once you make it fast, there is often a trade-off of becoming more inflexible. Think of, say, an algorithm
00:14:26.839
like bubble sort; O(n²) is the complexity of that algorithm. You
00:14:32.560
can make it really fast. You can make the fastest bubble sort in the world; you can write a C extension for
00:14:38.360
Ruby with inline assembly, and it is the best bubble sort in the world. But my terrible implementation of
00:14:45.480
quicksort, which is O(n log n), is still going to be faster.
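The arithmetic behind that claim is easy to check (illustrative numbers, not a real benchmark): give the hand-tuned O(n²) sort a thousand-fold constant-factor advantage, and the naive O(n log n) sort still wins at scale.

```ruby
n = 1_000_000
bubble_ops = n * n            # O(n^2): comparisons a bubble sort performs
quick_ops  = n * Math.log2(n) # O(n log n): comparisons a quicksort performs

# Even with a 1000x constant-factor head start, the bubble sort
# still does roughly 50x more work at a million elements.
ratio = (bubble_ops / 1000.0 / quick_ops).round
puts ratio # => 50
```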
00:14:50.519
So at some point you have to stop optimizing, zoom out, and re-architect. That's what we did with
00:14:57.480
sharding. At some point we needed that flexibility back, and
00:15:02.800
sharding seemed like a good way to get it. We also had the problem that, fundamentally, Shopify is an
00:15:08.399
application that will take a lot of writes: during these sales there are going to be a lot of writes to the database, and you can't cache writes. So we had to
00:15:15.680
find a way to handle that, and sharding was it. So basically we built this API. A shop
00:15:21.399
is fundamentally isolated from other shops; shop A should not have to care about shop B. So we did
00:15:28.639
per-shop sharding, where one shop's data would all be on one shard, another shop might be on another shard,
00:15:33.800
and a third shop might be on the same shard together with the first one. So this was the API; this is basically all the
00:15:39.519
sharding API exposes internally. Within that block, it will select the correct database where the product data for that
00:15:45.680
shop lives. Within that block you can't reach the other shards; that's illegal. In a controller this might
00:15:51.839
look something like this: at this point most developers don't have to care about it. It's all done by a
00:15:57.319
filter that will find the shop on another database, wrap the entire request in the connection that shop is on, and any
00:16:05.920
product query will then go to the correct shard. This is really simple, and it means that the majority of the time
00:16:11.759
developers don't have to care about sharding. They don't even have to know it exists; it just works like this.
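A stripped-down sketch of what such an API can look like (hypothetical names, with Hashes standing in for MySQL connections; the real internals differ, but the contract is the same: inside the block, every query goes to that shop's shard):

```ruby
# Each shard would be a connection to a different MySQL; Hashes stand in.
SHARDS = {
  0 => { products: { 1 => "Snowboard" } },
  1 => { products: { 2 => "Surfboard" } },
}
SHOP_TO_SHARD = { "shop-a" => 0, "shop-b" => 1 }

module Sharding
  def self.with_shard(shard_id)
    prev = Thread.current[:shard]
    Thread.current[:shard] = shard_id # queries in the block use this shard
    yield
  ensure
    Thread.current[:shard] = prev # restore so nothing leaks across requests
  end

  def self.current
    SHARDS.fetch(Thread.current[:shard]) { raise "no shard selected" }
  end
end

# What an around filter in a controller effectively does:
def with_shop(shop, &block)
  Sharding.with_shard(SHOP_TO_SHARD.fetch(shop), &block)
end

with_shop("shop-b") do
  puts Sharding.current[:products][2] # => "Surfboard"
end
```

Outside the block there is no shard selected at all, which is what makes reaching across shops "illegal" by construction.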
00:16:17.079
Jobs work the same way. But it has drawbacks: there are tons of
00:16:22.440
things that you now can't do. I talked about how you might lose flexibility with optimization, but
00:16:29.480
with architecture you lose flexibility at a much grander scale. Fundamentally, shops should be isolated from each other,
00:16:36.000
but in the few cases where you want them not to be, there's nothing you can do. That's the drawback of architecture,
00:16:43.160
and of changing the architecture. For example, you might want to do joins across shops; you might want to gather
00:16:48.880
some data or run an ad hoc query about app installations across shops. This
00:16:54.120
might not really seem like something you would need to do, but the partners interface, for all of our partners who build applications, actually needs to do
00:17:01.000
that: it needs to get all the shops and the installations for them. It was just written as something that did a join across all the shops and listed them,
00:17:06.919
and this had to be changed. The same went for our internal dashboard that would do things across
00:17:13.480
shops, like find all the shops with a certain app. You just couldn't do that anymore, so we had to find
00:17:18.799
alternatives. If you can get around it, don't shard. Fundamentally, Shopify is an application that will take a lot of
00:17:24.720
writes, but that might not be your application. It's really hard, and it took us a year to do and figure
00:17:31.160
out. We ended up doing it at the application level, but there are many different levels where
00:17:37.240
you can shard. If your database is magical, you don't have to do any of this:
00:17:42.559
some databases are really good at handling this stuff, and you can make some trade-offs at the database level so you don't have to do it at the
00:17:48.280
application level. But there are really nice things about being on a relational database: transactions and schemas,
00:17:54.760
and the fact that most developers are just familiar with them, are massive benefits. And they're reliable: they've been around
00:18:00.880
for 30 years, so they're probably going to be around for another 30 years at least. We decided to do it at the
00:18:07.440
application level because we didn't have the experience to write a proxy, and the databases that we looked at at the time
00:18:12.799
were just not mature enough. I recently looked back at some of the databases we were considering at the time, and
00:18:18.840
most of them have gone out of business, so we were lucky that we didn't buy into proprietary technology and instead
00:18:25.200
solved it at the level that we felt most comfortable with at the time. Today we have a different team, and we might have
00:18:30.440
solved this at the proxy level or somewhere else, but this was the right decision at the
00:18:35.880
time. In 2014 we started investing in resiliency, and you might ask what
00:18:41.720
resiliency is doing in a talk about performance and scaling. Well, as a function of scale,
00:18:48.320
you're going to have more failures, and this led us to a threshold in 2014 where
00:18:53.559
we had enough components that failures were happening quite rapidly, and they had a disproportionate impact on the
00:18:59.320
platform. When one of our shards was experiencing a problem, requests to other shards, and
00:19:05.480
shops that were on other shards, were either much slower or failing altogether. It didn't make sense that when a single
00:19:11.440
Redis server blew up, all of Shopify was down. This reminds me of a concept from
00:19:18.520
chemistry, where your reaction time is proportional to the amount of surface
00:19:24.520
area that you expose. If you have two glasses of water and you put a teaspoon of loose sugar in one and a
00:19:30.400
sugar cube in the other glass, the loose sugar is going to dissolve into the water quicker, because
00:19:36.520
its surface area is larger. The same goes for technology: when you have more servers and more components, there are more
00:19:43.360
things that will react and can potentially fail and make it all fall apart. This means that if you have a ton
00:19:50.039
of components, and they're all tightly knitted together in a web where, if one of these components fails, it
00:19:56.520
drags a bunch of others down with it, and you have never thought about this, adding a component will probably decrease your
00:20:01.880
availability. And this happens exponentially: as you add more components, your overall availability goes down. If
00:20:09.200
you have 10 components with four nines each, you have a lot more downtime if
00:20:14.840
they're tightly webbed together in a way where each of them is a single point of failure. And we hadn't really, at this
00:20:20.000
point, had the luxury of finding out what our single points of failure even were. We thought it was going to be okay, but I
00:20:26.280
bet you, if you haven't actually verified this, you will have single points of failure all over your application, where
00:20:33.679
one failure will take everything down with it. Do you know what happens if your Memcached cluster goes down? We
00:20:39.799
didn't, and we were quite surprised to find out. This means that you're only really
00:20:45.520
as good as your weakest single point of failure, and if you have multiple single points of failure,
00:20:51.200
multiply the probabilities of all of those single points of failure together and you have the final probability of your
00:20:57.679
app being available. Very quickly, what looks like hours of downtime per
00:21:03.039
component becomes days or even weeks of downtime globally, amortized over an entire year, if you're not paying attention.
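That multiplication is worth making concrete (back-of-the-envelope numbers, not Shopify's real figures): chain ten four-nines components together as single points of failure, and the whole is far worse than any part.

```ruby
component = 0.9999          # four nines of availability per component
overall   = component ** 10 # ten chained single points of failure
minutes_per_year = 365 * 24 * 60

puts overall.round(5)                           # ~0.999 overall availability
puts ((1 - component) * minutes_per_year).round # ~53 minutes down per component
puts ((1 - overall) * minutes_per_year).round   # ~525 minutes down overall
```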
00:21:10.159
Again: adding a component will probably decrease your overall availability. The outages look something
00:21:16.840
like this: your response time increases. This is a real graph from an incident at the time, in 2014, where
00:21:24.039
something became slow, and as you can see here, the timeout is probably 20 seconds
00:21:29.600
exactly. So something was being really slow and hitting a timeout of 20 seconds. If all of the workers in your
00:21:35.640
application are spending 20 seconds waiting for something that's never going to return, because it's going to time out,
00:21:41.480
then there's no time to serve any requests that might actually work. So if shard one is slow, requests for shard
00:21:47.960
zero are going to lag behind in a queue, because these requests to shard one will never
00:21:55.400
complete. The mantra that you have to adopt, when this starts becoming a problem for you, is that single-component failure
00:22:02.600
cannot compromise the availability or performance of your entire system. Your job is to build a reliable
00:22:10.240
system from unreliable components. A really useful mental model
00:22:16.200
for thinking about this is the resiliency matrix. On the left-hand side we have all
00:22:21.520
the components in our infrastructure; at the top we have the sections of the infrastructure, such as admin, checkout,
00:22:28.960
and storefront, the ones I showed before. Every cell tells you what happens to the section if the
00:22:35.320
component on the left is unavailable or slow. So if
00:22:41.880
Redis goes down, is storefront up? Is checkout up? Is admin up? This is not what
00:22:47.400
it actually looked like in reality; when we drew this out it was probably a lot worse, and we were shocked to find out
00:22:53.440
how red and blue, how down and degraded, Shopify looked when what we thought were
00:22:58.640
tangential data stores like Memcached and Redis took everything down along with them.
00:23:05.720
The other thing we were shocked about when we wrote this down was that it is really hard to figure out: figuring out what all
00:23:10.760
these cells are, and their values, is really difficult. How do you do that? Do you go into production and just start taking things down? How do you know
00:23:17.440
what you would do in development? So we wrote a tool that helps you do this. The tool is called
00:23:23.760
Toxiproxy, and what it does is that, for the duration of a block, it emulates network failures at the network level by
00:23:30.240
sitting in between you and the component on the left. This means that you can write a test for every single
00:23:35.960
cell in that grid, so when you flip a cell from red to green, from being
00:23:41.520
bad to being good, you can know that no one will ever reintroduce that failure. These tests might look
00:23:48.960
something like this: when some message queue is down, I GET this section and assert that the response is a success.
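Here is a self-contained sketch of that test shape. Note that `FakeProxy` below is a stand-in I wrote so the example runs on its own; with the real toxiproxy-ruby gem you'd wrap the block in something like `Toxiproxy[:message_queue].down { ... }`, which cuts the actual TCP connection.

```ruby
require "minitest/autorun"

# Stand-in for the Toxiproxy client: flips a flag instead of breaking
# real TCP connections the way the actual proxy does.
class FakeProxy
  @@down = {}

  def self.down(name)
    @@down[name] = true
    yield
  ensure
    @@down[name] = false
  end

  def self.down?(name)
    @@down[name]
  end
end

# A storefront that degrades gracefully when the message queue is gone.
def render_storefront
  raise "queue unreachable" if FakeProxy.down?(:message_queue)
  { status: 200, body: "storefront" }
rescue
  { status: 200, body: "storefront (events dropped)" } # degraded, but up
end

class ResiliencyMatrixTest < Minitest::Test
  # One cell of the matrix: message queue down, storefront must stay up.
  def test_storefront_up_when_message_queue_down
    FakeProxy.down(:message_queue) do
      assert_equal 200, render_storefront[:status]
    end
  end
end
```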
00:23:55.039
At this point at Shopify we have very good coverage of our resiliency matrix via unit tests that are all backed
00:24:01.919
by Toxiproxy, and this is really simple to do. Another tool we wrote is called
00:24:09.039
Semian. It's fairly complicated exactly how all of its components work and
00:24:14.200
how they work together, so I'm not going to go into it, but there's a README that goes into vivid detail
00:24:20.000
about how Semian works. Semian is a library that helps your application become more resilient, and for how it does
00:24:25.840
that, I encourage you to read the README. This tool was also invaluable
00:24:33.159
in helping us become a more resilient application.
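The README is the authoritative description; to give a flavor of one core idea, here is a minimal circuit breaker (my sketch, not Semian's actual implementation, which also has bulkheads, timeouts, and a half-open state): after enough consecutive failures it opens and fails fast, instead of letting every worker wait 20 seconds on a dead dependency.

```ruby
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold:)
    @threshold = threshold # consecutive failures before we stop trying
    @failures  = 0
  end

  def acquire
    # Fail fast: don't spend 20 seconds waiting on something that's down.
    raise OpenError, "circuit open" if @failures >= @threshold
    result = yield
    @failures = 0 # a success closes the circuit again
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    raise
  end
end

breaker = CircuitBreaker.new(threshold: 3)
flaky = -> { raise "timeout" } # stand-in for a call to a dead Redis/MySQL

errors = []
5.times do
  begin
    breaker.acquire(&flaky)
  rescue StandardError => e
    errors << e.class
  end
end
puts errors.inspect # first 3 are real timeouts, then the breaker opens
```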
00:24:39.080
The mental model we mapped out for how to work with resiliency was a pyramid, where we had a
00:24:46.760
lot of resiliency debt, because for 10 years we hadn't paid any attention to this. The web I talked about before, of
00:24:53.080
certain elements dragging everything down with them, was imminent; it was happening everywhere. The resiliency matrix was completely red when we
00:25:00.360
started, and nowadays it's in pretty good shape. So we started climbing the pyramid, writing and incorporating all
00:25:06.559
these tools, and when we got to the very top, someone asked a question: what
00:25:13.159
happens if you flood the data center? That's when we started working on multi-DC, in
00:25:19.960
2015. We needed a way such that, if the data center caught fire, we could fail
00:25:25.240
over to the other data center. But resiliency and sharding and optimization were more important to us than going
00:25:31.720
multi-DC. Multi-DC was largely an infrastructure effort of just
00:25:38.360
going from one to n. It required a massive amount of changes in our cookbooks, but finally we had
00:25:45.360
procured all the inventory, all the servers and so on, to spin up a second data center. And at this point, if you
00:25:50.559
want to fail Shopify over to another data center, you just run this script and it's done: all of Shopify has
00:25:57.559
moved to a different data center. The strategy it uses is actually quite simple, and one that most
00:26:03.720
Rails apps can use pretty much as is, if the traffic and things like that are set up
00:26:09.440
correctly. Shopify is running in a data center in Virginia right now, and one in Chicago. If you go to a Shopify-owned IP,
00:26:17.440
you will go to the data center that is closest to you: if you're in Toronto, you're going to go to the data center in Chicago; if you are in New Orleans, you
00:26:25.080
might go to the data center in Virginia. When you hit that data center, the load balancers in that data center,
00:26:31.120
inside of our network, know which one of the two data centers is active. Is it Chicago or is it Ashburn? And they
00:26:36.880
route all the traffic there. So when we do a failover, we tell the load balancers in all the data centers what the
00:26:43.200
primary data center is. If the primary data center was Chicago and we're moving it to Ashburn, we tell the load balancers
00:26:48.760
in both data centers to route all traffic to Ashburn, in
00:26:54.600
Virginia. When the traffic gets there, and we've just moved over, any write will fail: the databases at
00:27:00.720
that point are read-only. They are not writable in both locations at once, because the risk of data corruption is too high. But that means most things
00:27:08.159
actually work: if you're browsing around a Shopify storefront,
00:27:13.600
looking at products, which is the majority of traffic, you won't see anything. Even if you are in the admin, you
00:27:19.279
might just be looking at your products and not notice this at all. While that's happening, we're failing over all of the databases, which means checking
00:27:25.480
that they're caught up in the new data center and then making them writable. So the shards recover fairly quickly, over a
00:27:31.960
couple of minutes; it could be anywhere from 10 to 60 seconds per database, and then Shopify works again. We
00:27:37.919
then move the jobs: when we move all the traffic, we stop the jobs in the source data
00:27:43.320
center, so we move all the jobs over to the new data center and everything just ticks.
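As a recap of the sequence just described, sketched as an ordered runbook (a hypothetical helper; the real script does far more, including health checks):

```ruby
# Order matters: traffic first, then databases, then jobs.
def failover!(from:, to:)
  log = []
  log << "route all traffic from #{from} to #{to}"    # load balancers repointed
  log << "#{from}: databases now read-only"           # writes fail, reads work
  log << "wait for #{to} replicas to catch up"        # so no writes are lost
  log << "#{to}: promote databases to writable"       # shards recover one by one
  log << "move background jobs from #{from} to #{to}" # jobs were stopped at source
  log
end

steps = failover!(from: "chicago", to: "ashburn")
steps.each { |s| puts s }
```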
00:27:49.600
ticks but then how do we use both of these data centers we have one data center that is essentially doing nothing
00:27:56.000
just very very expensive Hardware sitting there doing absolutely nothing how can we get to a state where we're
00:28:01.399
running traffic out of multiple data centers at the same time utilizing both of
00:28:06.480
them the architecture at first looked something like this it was shared we had shared red as instances shared mcash
00:28:13.080
between all of the shops when we say a shard we're referring to a MySQL shard
00:28:18.120
but we hadn't sharded Redis we hadn't sharded memcached and other things so all of this was
00:28:23.760
shared what if instead of running one big Shopify like this that we're moving around we run many small Shopifys that
00:28:30.960
are independent from each other and have everything they need to run and we call this a pod so a pod will have everything
00:28:37.360
that a Shopify needs to run the workers the Redis the memcached the MySQL
00:28:43.159
whatever else there needs to be for a little Shopify to run if you have these many Shopifys and
00:28:50.080
they're completely independent they can be in multiple data centers at the same time you can have
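A pod, as described, bundles everything a little Shopify needs to run, plus a record of which data center it is currently active in. A hedged sketch of what such a pod definition could look like (hostnames and structure invented):

```ruby
# Hypothetical pod definitions: each pod owns its own MySQL, Redis and
# memcached, and is active in exactly one data center at a time.
PODS = {
  1 => { active_datacenter: "dc2",
         mysql:     "mysql.pod1.internal:3306",
         redis:     "redis.pod1.internal:6379",
         memcached: "memcached.pod1.internal:11211" },
  2 => { active_datacenter: "dc1",
         mysql:     "mysql.pod2.internal:3306",
         redis:     "redis.pod2.internal:6379",
         memcached: "memcached.pod2.internal:11211" },
}.freeze

# Because pods share nothing, moving one pod between data centers is
# independent of every other pod.
def active_datacenter(pod_id)
  PODS.fetch(pod_id).fetch(:active_datacenter)
end
```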
00:28:55.480
some of them active in data center one and some of them active in data center 2 pod one might be active in data
00:29:01.519
center 2 and pod two might be active in data center one so that's good but how do you get
00:29:08.200
traffic there so for Shopify every single
00:29:15.440
shop has usually a domain it might be a free domain that we provide or their own domain when this request hits one
00:29:22.640
of the data centers the one that you're closest to Chicago or um Virginia depending on where in the world you are
00:29:28.880
it goes to this little script that's very aptly named Sorting Hat um and what Sorting Hat will do is
00:29:36.399
that it will look at the request and interpolate what shop what pod what mini
00:29:42.440
Shopify does this request belong to if that request is on a shop that is
00:29:48.559
going to pod two it will route us to Data Center one on the left but if it's
00:29:53.600
another one it will go to the right so Sorting Hat is just sitting there sorting the request and sending them to
00:29:58.720
the right data center it doesn't care which data center you land in it will just route you to the other data center if it needs
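The routing decision Sorting Hat makes could be sketched like this; the domain table and all names here are hypothetical, not the actual implementation:

```ruby
# Hypothetical sketch of Sorting Hat: map the request's shop domain to a
# pod, then route to whichever data center that pod is active in.
SHOP_PODS = {
  "snowdevil.myshopify.com" => 2,
  "shop.example.com"        => 1,
}.freeze

ACTIVE_DATACENTER = { 1 => "dc2", 2 => "dc1" }.freeze

def sorting_hat(host)
  pod = SHOP_PODS.fetch(host)   # which mini Shopify does this request belong to?
  ACTIVE_DATACENTER.fetch(pod)  # send it to wherever that pod is active
end
```

Sorting Hat doesn't care which data center the request landed in first; it simply forwards to the pod's active one.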
00:30:04.880
to okay so we have an idea now of what this multi-DC strategy can look like but how do we know if it's safe turns out
00:30:12.080
that there just need to be two rules that are honored rule number one is that any
00:30:17.600
request must be annotated with the shop or the pod that it's going to all of these requests for the storefront are on
00:30:23.159
the shop domain so they're indirectly annotated with the shop they're going to through the domain with the domain we
00:30:29.640
know which pod which mini Shopify that this request is belonging to the second
00:30:34.760
rule is that any request can only touch one pod otherwise it would have to go
00:30:40.880
across data centers and potentially this means that one request might have to reach Asia Europe maybe also North
00:30:46.600
America all in the same request and that's just not reliable again fundamentally shops and request to shops
00:30:52.799
should be independent so we should be able to honor these two rules so you might think well it sounds
00:30:59.039
reasonable like Shopify should just be an application with a bunch of controller actions that just go to a shop but there
00:31:04.559
were hundreds if not a thousand requests that violated this they might look something like this they
00:31:10.960
might do something going over every shard and counting something or doing something like that or maybe it's
00:31:16.360
uninstalling PayPal accounts and seeing if there are any other stores with it or something like that across multiple
00:31:22.360
stores when you have hundreds of endpoints that are violating something you're trying to do and you have a
00:31:28.639
hundred developers who are doing all kinds of other things and introducing new endpoints every single day that's
00:31:33.799
going to be a losing battle if you just send an email because tomorrow someone joins who's never read that email who's going to violate
00:31:40.080
this Raphael talked a little bit about this yesterday um he called it whitelisting we called it shitlist-driven
00:31:49.039
development the idea is that your job if you want to honor rule one and two is to
00:31:54.799
build something that gives you a list a list of all the things that violate the rule if you do not obey the
00:32:01.039
list you raise an error telling people what to do instead this needs to be actionable you can't just tell
00:32:06.840
people not to do something unless you provide an alternative even if the alternative is that they come to you and
00:32:11.880
you help them solve the problem but this means that you stop the bleeding and you can then going forward rely on rule one
00:32:19.080
and two in this case being honored when we had this for Shopify
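Shitlist-driven development as described might look something like this sketch: known violators are frozen into a list and tolerated, while any new cross-pod request raises an actionable error. All names are illustrative:

```ruby
# Hypothetical shitlist: endpoints that already violate the
# "one request touches one pod" rule are grandfathered in.
CROSS_POD_SHITLIST = [
  "ShopsController#count_across_shards",
  "PaypalController#uninstall_everywhere",
].freeze

class CrossPodQueryError < StandardError; end

def assert_single_pod!(endpoint, pods_touched)
  return if pods_touched.size <= 1                 # rule honored
  return if CROSS_POD_SHITLIST.include?(endpoint)  # known offender, tolerated for now
  raise CrossPodQueryError,
        "#{endpoint} touched pods #{pods_touched.inspect} but a request may " \
        "only touch one pod; scope the query to the current shop instead"
end
```

This stops the bleeding: the shitlist only ever shrinks, and a developer who joins tomorrow gets an error telling them what to do instead of an email they never read.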
00:32:25.799
rule one and two honored our multi-DC strategy worked and today with all of this building on top of
00:32:33.600
five years of work we're running 80,000 requests per second out of multiple data
00:32:39.440
centers and this is how we got there thank
00:32:48.200
you do you have any global data that doesn't fit into a Shard yes we have a
00:32:53.840
dreaded Master database and that database holds data that doesn't belong to a single
00:33:00.039
shop in there is for example the shop model right we need something that stores the shop globally because
00:33:05.720
otherwise the load balancers can't know globally where the shop is other examples are apps apps are sort of
00:33:11.399
inherently Global and then they're installed by many shops it can be billing data because it might span
00:33:17.519
multiple shops partner data there's actually a lot of this data so I didn't go into this at all but
00:33:24.159
I actually spent six months of my life solving this problem so we have a master database and
00:33:29.440
it spans multiple data centers and the way that we solve this is essentially we have read slaves in every single data center that feed
00:33:36.159
off of a master database that is in one of the data centers if you do a write you do cross-DC writes this sounds
00:33:43.880
super scary but we eliminated pretty much every path that has a high SLO from writing on this so billing has
00:33:51.559
a lower SLO in Shopify because the writes have to be cross-DC but the thing is that billing and partner and the
00:33:58.240
other sections of this master database they're in different sections they're
00:34:03.880
fundamentally different applications and as we speak they're actually being extracted out of Shopify because Shopify should be a completely sharded
00:34:10.079
application and if they're extracted out of Shopify then you're also doing a cross-DC write because you don't know where
00:34:15.399
that thing is so it's not really making the SLOs worse and it's okay that some
00:34:20.839
of these things have lower SLOs than the checkout and storefront and the admin that have the highest SLOs so that's how
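The master-database arrangement described — a read slave in every data center feeding off a single master, with writes crossing data centers — might be sketched as follows (hostnames invented):

```ruby
# Hypothetical sketch: reads go to the local replica, writes always
# cross to the master's data center (hence the lower SLO for writers).
READ_REPLICAS = {
  "dc1" => "master-replica.dc1.internal",
  "dc2" => "master-replica.dc2.internal",
}.freeze

MASTER_HOST = "master.dc1.internal" # the one writable copy

def master_db_host(local_dc, write:)
  return MASTER_HOST if write      # cross-DC write: slower, but rare
  READ_REPLICAS.fetch(local_dc)    # reads stay local to the data center
end
```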
00:34:26.919
we deal with that um we don't really deal with it how do you deal with a disproportionate amount of traffic to a
00:34:33.320
single pod or a single shop so I showed a diagram earlier
00:34:40.599
that shows that the workers are isolated per pod this is actually a lie the
00:34:46.200
workers are shared which means that a single pod can grab up to 60 to 70% of
00:34:52.800
all of the capacity of Shopify so what's actually isolated in the pod are all the
00:34:58.240
data stores and the workers can sort of move between pods fluidly like they're
00:35:04.200
fungible they will move between pods on the fly the load balancer just sends requests to it and it will appropriately
00:35:10.119
connect to the correct pod so this means that the maximum capacity of a single
00:35:15.359
store is somewhere between 60 and 70% of an entire data center and it's not 100%
00:35:20.520
because that would cause an outage because of a single store which we're not interested in but that's how we sort
00:35:25.880
of move this around does that answer how do we deal with large amounts of
00:35:31.839
data yeah like someone who's importing 100,000 customers or
00:35:37.760
100,000 orders well this is where the multi-tenancy strategy or architecture sort of shines
00:35:44.880
these databases are massive half a terabyte of memory many many tens of cores and so if one customer has tons of
00:35:52.000
orders then that just fits um and if the customer is so large that they
00:35:57.359
need to be moved that's sort of what this defragmentation project is about moving these stores to
00:36:02.480
somewhere where there might be more space for them so basically we just deal with it by having massive massive data stores that can handle this without a
00:36:08.680
problem the import itself is just done in a job uh some of these jobs are quite slow for the big customers and there we
00:36:15.040
need to do some more parallelization work but most of the time it's not a big deal if you have millions of orders and
00:36:20.960
it takes a week to import that you have plenty of other work to do during that time um so this is not
00:36:26.800
something that's been high on the list uh how much time do I have left done okay
00:36:45.160
thank you