
5 Years of Rails Scaling to 80k RPS

by Simon Eskildsen

In his 2017 RailsConf talk, Simon Eskildsen discusses the evolution of Shopify's Rails infrastructure over five years, culminating in the ability to scale the platform to handle 80,000 requests per second (RPS) during peak times. He reflects on key milestones from 2012 to 2016 that enabled this growth, focusing on strategic decisions and optimizations that shaped the architecture of Shopify.

Key Points Discussed:

  • Shopify's Scaling Journey: Initially, the team realized its infrastructure needed to support massive traffic surges, particularly during 'flash sales,' which posed challenges due to high demand.
  • Background Jobs Optimization (2012): Transitioning from synchronous to asynchronous processing for checkout processes and inventory management was pivotal, alleviating long-running requests and improving response times.
  • Load Testing Introduction: The creation of a load-testing tool allowed the team to simulate user checkout scenarios and assess performance improvements in real-time, establishing a culture of continuous performance validation.
  • Identity Cache Implementation: To reduce database load, they implemented an identity cache system, balancing data freshness and cache efficiency amidst heavy request traffic.
  • Sharding for Flexibility (2013): Sharding was introduced to isolate each shop's data on its own shard, allowing better management of read/write operations and preventing stores from interfering with one another during data-intensive sales.
  • Resiliency Strategies (2014): As the infrastructure expanded, the team focused on identifying and mitigating single points of failure to ensure system reliability and reduce the cascading effects of failures across components.
  • Multi-Data Center Strategy (2015): To enhance reliability, Shopify transitioned to a multi-data center architecture for failover capability, enabling seamless traffic routing without service disruption.
  • Current Metrics (2016): Ultimately, the platform achieved the capability to handle 80,000 RPS across its multiple data centers, processing substantial sales traffic while running efficiently.

Conclusions and Takeaways:
- The evolution involved recognizing the need for not just performance but also resilience in infrastructure.
- Lessons learned reflect ongoing integrations of new technologies, adaptations to existing processes, and an emphasis on outcomes rather than just technical specifications. Shopify's commitment to scalability and technology foresight highlights its position as a robust platform for e-commerce under pressure.
- Continued collaboration and knowledge transfer among engineering teams are essential to manage and innovate the platform effectively, ensuring readiness for future demands.

RailsConf 2017: 5 Years of Rails Scaling to 80k RPS by Simon Eskildsen

Shopify has taken Rails through some of the world's largest sales: the Super Bowl, celebrity launches, and Black Friday. In this talk, we will go through the evolution of the Shopify infrastructure: from re-architecting and caching in 2012, to sharding in 2013, to reducing the blast radius of every point of failure in 2014, and on to 2016, where we accomplished running our 325,000+ stores out of multiple datacenters. It'll be a whirlwind tour of the lessons learned scaling one of the world's largest Rails deployments for half a decade.

RailsConf 2017

My name is Simon. I work on the infrastructure team at Shopify, and today I'm going to talk about the past five years of scaling Rails at Shopify. I've only been around at Shopify for four years, so the first year was a little bit of digging, but I want to talk about all of the things we've learned, and I hope that people in this audience can place themselves on this timeline and learn from some of the lessons we've had to learn over the past five years. This talk is inspired by a guy called Jeff Dean from Google. He's a genius, and he did a talk about how they scaled Google for the first couple of years: he showed how they ran it all off a couple of MySQLs, how they sharded, how they went through the NoSQL paradigm and finally arrived at the NewSQL paradigm we're now starting to see. This was really interesting to me because you saw why they made the decisions they made at that point in time. I've always been fascinated by what made, say, Facebook decide that now was the time to write a VM for PHP to make it faster. So this talk is about that: it's an overview of the decisions we've made at Shopify, and less so about the very technical details of all of them. It's to give you an overview and a mental model for how we evolved our platform. There's tons of documentation out there on all of the things I'm going to talk about today, in other talks by co-workers, blog posts, READMEs, and things like that, so I'm not going to get into the weeds; I'm going to provide an overview.
I work at Shopify, and at this point you're probably tired of hearing about Shopify, so I'm not going to talk too much about it. Overall, Shopify is something that allows merchants to sell things to other people, and that's relevant for the rest of this talk. We have hundreds of thousands of people who depend on Shopify for their livelihood, and through this platform we run almost 100K RPS at peak: the largest sales hit our Rails servers with almost 100K requests per second. Our steady state is around 20 to 40K requests per second, and we run this on tens of thousands of workers across two data centers. About $30 billion has made it through this platform, which means that downtime is costly, to say the least. These are the numbers you should keep in the back of your head as the numbers we've used to get to the point we are at today. Roughly, these metrics double every year; that's the metric we've used, so if you go back five years, you just have to cut them in half five times.
I want to introduce a little bit of vocabulary for Shopify, because I'm going to use it loosely in this talk. To understand how Shopify works: Shopify is at least four sections. One of those sections is the storefront. This is where people are browsing collections, browsing products, adding to the cart; it's the majority of traffic, somewhere between 80 and 90% of our traffic is people browsing storefronts. Then we have the checkout. This is where it gets a little bit more complicated: we can't cache as heavily as we do on the storefront, and this is where we have to do writes, decrement inventory, and capture payments. The admin is more complex: you have people who apply actions to hundreds of thousands of orders concurrently, you have people who change billing and need to be billed, and all these things are much more complex than both checkout and storefront in terms of consistency. And the API allows you to change the majority of the things you can change in the admin; the only real difference is that computers hit the API, and computers can hit the API really fast. Recently I saw an app for people who wanted an offset on their order numbers of a million, so this app will create a million orders and then delete all of them to get that offset. People do crazy things with this API, and it is our second largest source of traffic after the storefront.
I want to talk a little bit about a philosophy that has shaped this platform over the past five years. Flash sales are really what has built and shaped the platform that we have: when Kanye wants to drop his new album on Shopify, it is the team that I am on that is terrified. We had a sort of fork in the road five years ago. Five years ago was when we started seeing these customers who could drive more traffic for one of their sales than the entire platform was otherwise serving, and they would drive a multiple of it. So if we were serving 1,000 requests per second for all the stores on Shopify, some of these stores could get us to 5,000, and this happens in a matter of seconds: their sale might start at 2 p.m., and that's when everyone comes in. So there's a fork in the road: do we become a company that supports these sales, or do we just kick them off the platform, throttle them heavily, and say this is not the platform for you? That's a reasonable path to take; 99.9-something percent of the stores don't have this pattern, they can't drive that much traffic. But we decided to go the other route: we wanted to be a company that could support these sales, and we decided to form a team that would solve this problem of customers who could drive enormous amounts of traffic in a very short amount of time. This, I think, was a fantastic decision, and it happened exactly five years ago, which is why the time frame of this talk is five years. I think it was a powerful decision because it has served as a canary in the coal mine: the flash sales that we see today, and the amount of traffic that they can drive of, say, 80K RPS, that's what the steady state is going to look like next year. So when we prepare for these sales, we know what next year is going to look like, and we know we'll be ready for next year because we're already working on that problem. They help us stay ahead, one to two years ahead.
In the meat of this talk, I will walk through the past five years of the major infrastructure projects that we've done. These are not the only projects we've done; there have been other apps and many other efforts, but these are the most important to the scaling of our Rails application.

2012 was the year where we sat down and decided that we were going to go the antifragile route with flash sales: we were going to become the best place in the world to have flash sales. So a team was formed whose sole job was to make sure that Shopify as an application would stay up and be responsive under these circumstances. The first thing you do when you start optimizing an application is to try to identify the low-hanging fruit, and in many cases the low-hanging fruit is very application dependent. The lowest-hanging fruit on the infrastructure side has already been harvested by your load balancers, by Rails, which is really good at this, and by your operating system; they take care of all of the generic optimization and tuning. So at some point that work has to be handed off to you, and you have to understand your problem domain well enough to know where the biggest wins are.
For us, the first ones were things like backgrounding checkouts. This sounds crazy: what do you mean they weren't backgrounded before? Well, the app was started in 2004 or 2005, and back then backgrounding jobs in Ruby or Rails was not really a common thing, and we hadn't really done it after that either, because it was such a large source of technical debt. So in 2012 a team sat down and collected that massive amount of technical debt to move the checkout process into background jobs, so payments were captured not in a request that took a long time, but in jobs, asynchronously with the rest. This, of course, was a massive source of speedup: now you're not occupying all these workers with long-running requests.
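As a rough illustration of the idea (not Shopify's actual code, and using ActiveJob syntax that only arrived in Rails years later), moving payment capture out of the request cycle looks something like this:

```ruby
# Hypothetical sketch: capture the payment in a background job instead of inside
# the web request, so web workers are freed up immediately.
class PaymentCaptureJob < ActiveJob::Base
  queue_as :checkout

  def perform(checkout_id)
    checkout = Checkout.find(checkout_id)  # Checkout is an illustrative model name
    checkout.capture_payment!              # the slow call to the payment gateway
    checkout.complete!
  end
end

# In the controller, enqueue and respond right away instead of blocking a worker:
#   PaymentCaptureJob.perform_later(checkout.id)
```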
Another one of these domain-specific problems was inventory. You might think that inventory is just decrementing one number, and doing that really fast when you have thousands of people, but MySQL is not good at that: if you're trying to decrement the same number from thousands of queries at the same time, you run into lock contention. We had to solve this problem as well, and these are just two of many problems we solved.
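A minimal sketch of the contended pattern being described, with illustrative model names; this shows the problem, not the fix Shopify shipped:

```ruby
# Naive inventory decrement: during a flash sale, thousands of checkouts all
# queue up behind the row lock on a single product's inventory row.
InventoryLevel.transaction do
  level = InventoryLevel.lock.find_by!(product_id: product_id)  # SELECT ... FOR UPDATE
  level.update!(available: level.available - 1) if level.available > 0
end
```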
In general, what we did was print out the debug logs of every single query on the storefront, on the checkout, and on all of the other hot paths, and basically start checking them off. I couldn't find the original picture, but I found one from a talk that someone from the company did three years ago, where you can see the wall where the debug logs were taped. The team at the time would go and cross queries off, write their name on them, and figure out how to reduce them as much as possible.

But you need a feedback loop for this. We couldn't wait until the next sale to see if the optimizations we'd done actually made a difference; we needed a better way than just crashing at every single flash sale, a tighter feedback loop. Just like when you run the tests locally you know whether it worked pretty much right away, we wanted the same for performance. So we wrote a load testing tool. What this load testing tool does is simulate a user performing a checkout: it will go to the storefront, browse around a little bit, find some products, add them to its cart, and perform the full checkout, fuzzing the entire checkout procedure. Then we spin up thousands of these in parallel to test whether the performance change actually made any difference. This is now so deeply embedded in our infrastructure culture that whenever someone makes a performance change, people ask: well, how did the load testing go? This was really important for us, because a siege that just hits the storefront with a bunch of the same requests is just not a realistic benchmark of realistic scenarios.
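A sketch of the kind of scenario such a tool runs; the class, endpoints, and helpers below are assumptions for illustration, since Shopify's internal tool is not public in this form:

```ruby
require "http"          # the `http` gem
require "securerandom"

# One simulated shopper: browse the storefront, add to cart, complete a fuzzed checkout.
class CheckoutFlow
  def run(base_url)
    HTTP.get("#{base_url}/collections/all")                                    # browse around
    HTTP.post("#{base_url}/cart/add.js", json: { id: variant_id, quantity: 1 })
    HTTP.post("#{base_url}/checkout", form: fuzzed_customer)                   # fuzzed checkout fields
  end

  private

  # Illustrative stand-ins: a real tool would pick products from the storefront
  # response and generate far more varied (fuzzed) customer data.
  def variant_id
    rand(1..1_000)
  end

  def fuzzed_customer
    { email: "#{SecureRandom.hex(6)}@example.com", name: "Load Test", address: "123 Example St" }
  end
end

# Spin up thousands of these in parallel and watch latency and error rates:
#   1_000.times.map { Thread.new { CheckoutFlow.new.run("https://shop.example.com") } }.each(&:join)
```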
Another thing we did at this time was write a library called IdentityCache. We had one MySQL at this point, hosting tens of thousands of stores, and when you have that, you're pretty protective of that single database. We were doing a lot of queries to it, and these sales especially were driving a massive amount of traffic to it at once, so we needed a way of reducing the load on the database. The normal way of doing this, or the most common way, is to start sending queries to read slaves: you have databases that feed off of the one you write to, and you start reading from those. We tried to do that at the time; using read slaves has a lot of nice properties over other methods, but when we did this back in the day there weren't any really good libraries for it in the Rails world. We tried to fork some and figure something out, but we ran into data corruption issues, and managing the read slaves was really problematic at the time because we didn't have any DBAs. Mind you, this was a team of Rails developers who had to turn into infrastructure developers, understand all of this stuff, and learn it on the job, because we had decided to handle flash sales the way that we did. So we just didn't know enough about MySQL and these things to go down that path, and we decided to figure out something else.
Deep inside of Shopify, Toby had written a commit many years ago introducing this idea of IdentityCache: managing your cache out of band in memcached. The idea is that if I query for a product, I look in memcached first and see if it's there. If it's there, I just grab it and don't even touch the database; if it's not there, I put it there so that for the next request it will be, and every time we do a write we just expire those entries. This has a lot of drawbacks, because that cache is never going to be 100% what is in the database. So when we do a read from that managed cache, we never write it back to the database; that's too dangerous. That's also why the API is opt-in: you have to call fetch instead of find to use IdentityCache, because we only want to use it on these paths, and it returns read-only records so you cannot change them and corrupt your database. This is the massive downside of using read slaves or IdentityCache or something like it: you have to deal with what you're going to do when the cache is expired or stale. So this is what we decided to do at the time. I don't know if it's what we would do today; we've gotten much better at handling read slaves, and they have a lot of other advantages, such as being able to serve much more complicated queries. But this is what we did at the time, and if you're already having severe scaling issues, IdentityCache is a very simple thing to adopt and use.
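IdentityCache is open source; roughly, using it looks like this (see the gem's README for the authoritative API):

```ruby
# Opt a model into IdentityCache and declare which lookups should be cached.
class Product < ActiveRecord::Base
  include IdentityCache
  cache_index :shop_id, :handle, unique: true
end

# Opt-in read path: check memcached first, fall back to MySQL and fill the cache.
product = Product.fetch(product_id)                             # returns a read-only record
product = Product.fetch_by_shop_id_and_handle(shop_id, handle)  # generated by cache_index

# Writes go through MySQL as usual; IdentityCache expires the affected cache
# entries after the transaction commits.
```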
So that was 2012, and it would probably have been our worst Black Friday/Cyber Monday ever, because the team was working night and day to make this happen. There's this famous picture of our CTO face-planted on the ground after the exhausting work of scaling Shopify at the time, and then someone woke him up and told him: hey dude, checkout is down. We were not in a good place, but IdentityCache, load testing, and all this optimization saved us.

Once the team had decompressed after this massive sprint to survive these sales and survive Black Friday and Cyber Monday that year, we raised the question of how we could never get into this situation again. We'd spent a lot of time optimizing checkout and storefront, but this is not sustainable: if you keep optimizing something for long enough, it becomes inflexible. Often, fast code is hard-to-change code. If you've optimized storefront and checkout and had a team that only knew how to do that, there's going to be a developer who comes in, adds a feature, and adds a query as a result, and this should be okay: people should be allowed to add queries without understanding everything about the infrastructure. Often the slower thing is more flexible. Think of a completely normalized schema: it is much easier to change and build upon, and that is the entire point of a relational database, but once you make it fast, it's often a trade-off of becoming more inflexible. Or think of an algorithm, say a bubble sort, which is O(n²). You can make it really fast, the fastest bubble sort in the world; you can write a C extension in Ruby with inline assembly, and it's the best bubble sort in the world. But my terrible implementation of quicksort, which is O(n log n), is still going to be faster. So at some point you have to stop optimizing, zoom out, and re-architect. That's what we did with sharding.
At some point we needed that flexibility back, and sharding seemed like a good way to get it. We also had the problem that, fundamentally, Shopify is an application that will have a lot of writes: during these sales there are going to be a lot of writes to the database, and you can't cache writes. So we had to find a way to handle that, and sharding was it. Basically, we built this API. A shop is fundamentally isolated from other shops; shop A should not have to care about shop B. So we did per-shop sharding, where one shop's data is all on one shard, another shop might be on another shard, and a third shop might be on the same shard as the first one. This was the API: basically, this is all the sharding API exposes internally. Within that block, it selects the correct database where the products for that shop live; within that block you can't reach another shard, that's illegal. In a controller this might look something like this: at this point most developers don't have to care about it, because it's all done by a filter that finds the shop in another database and wraps the entire request in the connection that that shop is on, so any product query goes to the correct shard. This is really simple, and it means that the majority of the time developers don't have to care about sharding; they don't even have to know it exists. It just works like this, and jobs work the same way.
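The slides aren't reproduced here, but a sketch of what a per-shop sharding API like the one described can look like, with illustrative names rather than Shopify's actual internals:

```ruby
# Everything inside the block runs against the shard holding this shop's data;
# reaching across to another shard from inside the block is not allowed.
Shard.with(shop.shard_id) do
  shop.products.find(params[:id])
end

# In a controller, an around filter hides the shard selection from most developers:
class StorefrontController < ApplicationController
  around_action :wrap_in_shop_shard

  private

  def wrap_in_shop_shard
    # The shop lookup lives on the global (master) database, outside any shard.
    shop = Shop.find_by!(domain: request.host)
    Shard.with(shop.shard_id) { yield }
  end
end
```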
But it has drawbacks: there are tons of things you now can't do. I talked about how you might lose flexibility with optimization, but with architecture you lose flexibility at a much grander scale. Fundamentally, shops should be isolated from each other, but in the few cases where you want them not to be, there's nothing you can do; that's the drawback of architecture and of changing the architecture. For example, you might want to do joins across shops, or gather some data, or run an ad hoc query about app installations across shops. This might not seem like something you would need to do, but the partners interface, for all of our partners who build applications, actually needs to do exactly that: they need to get all the shops and their installations. It was just written as something that did a join across all the shops and listed them, and this had to be changed. The same went for our internal dashboard, which would do things across shops, like finding all the shops with a certain app; you just couldn't do that anymore, so we had to find alternatives.

If you can get around it, don't shard. Fundamentally, Shopify is an application that will have a lot of writes, but that might not be your application. It's really hard, and it took us a year to do and figure out. We ended up doing it at the application level, but there are many different levels where you can shard. If your database is magical, you don't have to do any of this: some databases are really good at handling this stuff, and you can make some trade-offs at the database level so you don't have to do it at the application level. But there are really nice things about being on a relational database: transactions, schemas, and the fact that most developers are familiar with them are massive benefits, and they're reliable. They've been around for 30 years, so they're probably going to be around for another 30 years at least. We decided to do it at the application level because we didn't have the experience to write a proxy, and the databases that we looked at at the time were just not mature enough. I actually looked at some of the databases we were considering at the time, and most of them have gone out of business, so we were lucky that we didn't buy into that proprietary technology and instead solved it at the level that we felt most comfortable with at the time. Today we have a different team, and we might have solved this at the proxy level or somewhere else, but it was the right decision at the time.
In 2014 we started investing in resiliency, and you might ask: what is resiliency doing in a talk about performance and scaling? Well, as a function of scale, you're going to have more failures, and this brought us to a threshold in 2014 where we had enough components that failures were happening quite rapidly, and they had a disproportionate impact on the platform. When one of our shards was experiencing a problem, requests to other shards, and to shops on those other shards, were either much slower or failing altogether. It didn't make sense that when a single Redis server blew up, all of Shopify was down.

This reminds me of a concept from chemistry, where your reaction time is proportional to the amount of surface area you expose. If you have two glasses of water and you put a teaspoon of loose sugar in one and a sugar cube in the other, the loose sugar is going to dissolve into the water quicker, because the surface area is larger. The same goes for technology: when you have more servers and more components, there are more things that can react, potentially fail, and make it all fall apart. This means that if you have a ton of components and they're all tightly knitted together in a web where, if one of these components fails, it drags a bunch of others down with it.
If you have never thought about this, adding a component will probably decrease your availability, and this happens exponentially: as you add more components, your overall availability goes down. If you have 10 components with four nines each, you have a lot less uptime if they're tightly webbed together in a way where each of them is a single point of failure. And at this point we hadn't really had the luxury of finding out what our single points of failure even were. We thought it was going to be okay, but I bet you, if you haven't actually verified this, you have single points of failure all over your application, where one failure will take everything down with it. Do you know what happens if your memcached cluster goes down? We didn't, and we were quite surprised to find out. This means that you're only as good as your weakest single point of failure; and if you have multiple single points of failure, you multiply the probabilities of all of them together to get the final probability of your app being available. Very quickly, what looks like hours of downtime per component becomes days or even weeks of downtime globally, amortized over an entire year, if you're not paying attention to this. It means that adding a component will probably decrease your overall availability.
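As a rough worked example of the availability math being described (my numbers, not figures from the talk):

```ruby
# Ten components, each with "four nines" (99.99%) availability, all wired up as
# single points of failure: the availabilities multiply.
per_component = 0.9999
overall       = per_component**10        # ~0.9990, i.e. roughly three nines

hours_per_year = 365 * 24
(1 - per_component) * hours_per_year     # ~0.9 hours of downtime per year for one component
(1 - overall)       * hours_per_year     # ~8.8 hours per year for the tightly coupled system
```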
The outages look something like this: your response time increases. This is a real graph from an incident at the time, in 2014, where something became slow, and as you can see, the timeout is probably exactly 20 seconds: something was being really slow and hitting a timeout of 20 seconds. If all of the workers in your application are spending 20 seconds waiting for something that's never going to return, because it's going to time out, then there's no time to serve any requests that might actually work. So if shard one is slow, requests for shard zero are going to lag behind in a queue, because the requests to shard one will never complete. The mantra you have to adopt when this starts becoming a problem is that a single component failure cannot compromise the availability or performance of your entire system. Your job is to build a reliable system from unreliable components.
A really useful mental model for thinking about this is the resiliency matrix. On the left-hand side we have all the components in our infrastructure; at the top we have the sections of the infrastructure, such as admin, checkout, and storefront, the ones I showed before. Every cell tells you what happens to that section when the component on the left is unavailable or slow. So if Redis goes down: is storefront up, is checkout up, is admin up? This is not what it actually looked like; in reality, when we drew this out, it was probably a lot worse, and we were shocked to find out how red and blue, how down and degraded, Shopify looked when what we thought were tangential data stores, like memcached and Redis, took everything down with them. The other thing we were shocked about when we wrote this down was that it is really hard to figure out: working out what all these cells are, and what their values are, is really difficult. How do you do that? Do you go into production and just start taking things down? What would you do in development? So we wrote a tool to help you do this. The tool is called Toxiproxy, and what it does is that, for the duration of a block, it emulates failures at the network level by sitting in between you and the component on the left. This means you can write a test for every single cell in that grid, so when you flip a cell from red to green, from bad to good, you know that no one will ever reintroduce that failure. These tests might look something like this: when some message queue is down, I GET this section and I assert that the response is a success.
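A sketch of what such a resiliency-matrix test can look like with the open-source Toxiproxy gem; the proxy names and routes here are illustrative:

```ruby
require "test_helper"
require "toxiproxy"

class StorefrontResiliencyTest < ActionDispatch::IntegrationTest
  test "storefront renders when Redis is down" do
    Toxiproxy[:redis].down do
      get "/products/example-product"
      assert_response :success
    end
  end

  test "storefront tolerates a slow memcached" do
    Toxiproxy[:memcached].downstream(:latency, latency: 1_000).apply do
      get "/products/example-product"
      assert_response :success
    end
  end
end
```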
At this point, Shopify has very good coverage of our resiliency matrix through unit tests that are all backed by Toxiproxy, and this is really simple to do. Another tool we wrote is called Semian. Exactly how all of its components work, and how they work together, is fairly complicated, so I'm not going to go into it, but there's a README that goes into vivid detail about how Semian works. Semian is a library that helps your application become more resilient, and for how it does that, I encourage you to read the README. This tool was also invaluable for us in becoming a more resilient application.
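Roughly, Semian wraps drivers such as mysql2 and redis with circuit breakers and bulkheads; per its README, configuration looks something like this, with numbers that are purely illustrative:

```ruby
require "semian"
require "semian/redis"   # patches the Redis client to accept a :semian option

redis = Redis.new(
  host: "shard1-redis.internal",  # hypothetical host
  semian: {
    name: "shard1_redis",
    tickets: 4,            # bulkhead: at most 4 workers talk to this Redis at once
    error_threshold: 3,    # open the circuit after 3 errors within error_timeout seconds
    error_timeout: 10,     # keep the circuit open for 10 seconds before probing again
    success_threshold: 2,  # close the circuit after 2 consecutive successes
  }
)
```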
The mental model we mapped out for how to work with resiliency was a pyramid. We had a lot of resiliency debt, because for 10 years we hadn't paid any attention to this; the web I talked about before, of certain elements dragging everything down with them, was evident, it was happening everywhere. The resiliency matrix was completely red when we started, and nowadays it's in pretty good shape. So we started climbing the pyramid, writing and incorporating all these tools, and then, when we got to the very top, someone asked a question: what happens if the data center floods? That's when we started working on multi-DC, in 2015.

We needed a way such that if the data center caught fire, we could fail over to the other data center, but resiliency and sharding and optimization had been more important for us than going multi-DC. Multi-DC was largely an infrastructure effort of going from one to n; it required a massive amount of changes in our cookbooks, but finally we had procured all the inventory, all the servers and so on, to spin up a second data center. At this point, if you want to fail over Shopify to another data center, you just run a script and it's done: all of Shopify has moved to a different data center.
The strategy it uses is actually quite simple, and one that most Rails apps can use pretty much as is, if the traffic and things like that are set up correctly. Shopify is running in a data center in Virginia and one in Chicago right now. If you go to a Shopify-owned IP, you will go to the data center that is closest to you: if you're in Toronto, you're going to go to the data center in Chicago; if you are in New Orleans, you might go to the data center in Virginia. When you hit that data center, the load balancers inside our network know which one of the two data centers is active, Chicago or Ashburn, and they route all the traffic there. So when we do a failover, we tell the load balancers in all the data centers what the primary data center is: if the primary data center was Chicago and we're moving it to Ashburn, we tell the load balancers in both data centers to route all traffic to Ashburn, in Virginia.

When the traffic gets there and we've just moved over, any write will fail: the databases at that point are read-only, because they are not writable in both locations at once, since the risk of data corruption is too high. That means most things actually still work. If you're browsing around a Shopify storefront looking at products, which is the majority of traffic, you won't notice anything; even if you're in the admin, you might just be looking at your products and not notice at all. While that's happening, we're failing over all of the databases, which means checking that they're caught up in the new data center and then making them writable. So the shards recover very quickly, over a couple of minutes; it can be anywhere from 10 to 60 seconds per database, and then Shopify works again. We then move the jobs: when we move all the traffic we stop the jobs in the source data center, so we move them over to the new data center, and everything just ticks.
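A hypothetical outline of the failover steps just described; the real script and its commands are internal to Shopify, so every helper named here is an assumption:

```ruby
# Sketch of a data center failover, in the order described in the talk.
def failover!(from:, to:)
  announce_primary(to)        # tell the load balancers in every DC to send traffic to `to`
  stop_jobs(from)             # pause background job workers in the old primary

  each_shard do |shard|
    # Writes fail while the shard is read-only; reads keep being served.
    wait_until_caught_up(shard, replica_in: to)   # replication catch-up, ~10-60s per database
    make_writable(shard, datacenter: to)
  end

  start_jobs(to)              # resume background jobs in the new primary
end
```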
But then, how do we use both of these data centers? We had one data center essentially doing nothing: very expensive hardware sitting there doing absolutely nothing. How can we get to a state where we're running traffic out of multiple data centers at the same time, utilizing both of them? The architecture at first looked something like this: it was shared. We had shared Redis instances and shared memcached between all of the shops; when we say a shard, we're referring to a MySQL shard, but we hadn't sharded Redis, we hadn't sharded memcached and other things, so all of that was shared. What if, instead of running one big Shopify that we move around, we ran many small Shopifys that are independent from each other and have everything they need to run? We call this a pod. A pod has everything that a Shopify needs to run: the workers, the Redis, the memcached, the MySQL, whatever else there needs to be for a little Shopify to run. If you have many of these Shopifys and they're completely independent, they can be in multiple data centers at the same time: you can have some of them active in data center one and some of them active in data center two. Pod one might be active in data center two, and pod two might be active in data center one. So that's good, but how do you get traffic there?
For Shopify, every single shop usually has a domain; it might be a free domain that we provide, or their own domain. When a request hits one of the data centers, the one you're closest to, Chicago or Virginia, depending on where in the world you are, it goes to a little script that's very aptly named Sorting Hat. What Sorting Hat does is look at the request and figure out which shop, which pod, which mini Shopify this request belongs to. If that request is for a shop that belongs to pod two, it will route it to data center one, on the left; if it's for another one, it will go to the right. Sorting Hat is just sitting there sorting the requests and sending them to the right data center. It doesn't care which data center you land on; it will just route you to the other data center if it needs to.
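A sketch of the kind of lookup Sorting Hat performs; in reality this runs in the load balancer layer rather than in the Rails app, and every name below is illustrative:

```ruby
# Map each pod to the data center where it is currently active (hypothetical data).
POD_TO_ACTIVE_DC = { 1 => "chicago", 2 => "ashburn" }.freeze

# The shop's domain annotates the request with its pod; route to that pod's active DC.
def sort(request)
  shop = Shop.find_by!(domain: request.host)   # global lookup: domain -> shop -> pod
  POD_TO_ACTIVE_DC.fetch(shop.pod_id)          # => the data center to forward the request to
end
```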
Okay, so we have an idea now of what this multi-DC strategy can look like, but how do we know it's safe? It turns out that there just need to be two rules honored. Rule number one is that any request must be annotated with the shop, or the pod, that it's going to. All of these storefront requests are on the shop's domain, so they're indirectly annotated with the shop they're going to: with the domain we know which pod, which mini Shopify, the request belongs to. The second rule is that any request can only touch one pod. Otherwise it would have to go across data centers, and potentially that means one request might have to reach Asia, Europe, and maybe also North America, all in the same request, and that's just not reliable. Again, fundamentally, shops and requests to shops should be independent, so we should be able to honor these two rules.
You might think, well, that sounds reasonable: Shopify should just be an application with a bunch of controller actions that each go to one shop. But there were hundreds, if not a thousand, requests that violated this. They might do something like going over every shard and counting something, or maybe uninstalling PayPal accounts and checking whether any other stores still use them, or something like that across multiple stores. When you have hundreds of endpoints violating something you're trying to do, and you have a hundred developers doing all kinds of other things and introducing new endpoints every single day, it's going to be a losing battle if you just send an email, because tomorrow someone joins who's never read that email and who's going to violate it. Raphael talked a little bit about this yesterday; he called it whitelisting, we called it shitlist-driven development. The idea is that, if you want to honor rules one and two, your job is to build something that gives you a list of all the things that violate the rule; anything that violates the rule and isn't on that list raises an error telling people what to do instead. This needs to be actionable: you can't just tell people not to do something unless you provide an alternative, even if the alternative is that they come to you and you help them solve the problem. But this means that you stop the bleeding, and you can then rely on rules one and two being honored going forward.
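A sketch of the "shitlist" pattern being described, with an illustrative constant and error class; the real mechanism lives inside Shopify's sharding layer:

```ruby
# Known offenders are grandfathered onto the list; anything new that tries to
# touch a second pod in the same request raises an actionable error.
CROSS_POD_SHITLIST = %w[
  internal/app_installs#search
  admin/legacy_reports#index
].freeze                             # hypothetical endpoint names

class CrossPodAccessError < StandardError; end

# Called from the sharding layer when a request already pinned to one pod
# attempts to switch connections to another.
def assert_single_pod!(endpoint)
  return if CROSS_POD_SHITLIST.include?(endpoint)

  raise CrossPodAccessError, <<~MSG
    #{endpoint} tried to touch a second pod within one request.
    Scope the query to the current shop, or talk to the infrastructure team
    about an alternative.
  MSG
end
```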
When we had rules one and two honored for Shopify, our multi-DC strategy worked. And today, with all of this built on top of five years of work, we're running 80,000 requests per second out of multiple data centers. That's how we got there. Thank you.
Question: do you have any global data that doesn't fit into a shard? Yes, we have a dreaded master database, and that database holds data that doesn't belong to a single shop. In there is, for example, the shop model: we need something that stores the shop globally, because otherwise the load balancers can't know globally where the shop is. Other examples are apps, which are sort of inherently global and are installed by many shops; billing data, because it might span multiple shops; and partner data. There's actually a lot of this data, so I didn't go into it at all, but I actually spent six months of my life solving this problem. We have a master database and it spans multiple data centers. The way we solve this is essentially that we have read slaves in every single data center that feed off of a master database in one of the data centers, and if you do a write, you do a cross-DC write. This sounds super scary, but we eliminated pretty much every path that has a high SLO from writing to it. So billing has a lower SLO in Shopify, because the writes have to be cross-DC; but the thing is that billing and partners and the other sections of this master database are in different sections, they're fundamentally different applications, and as we speak they're actually being extracted out of Shopify, because Shopify should be a completely sharded application. And if they're extracted out of Shopify, then you're also doing a cross-DC write, because you don't know where that thing is, so it's not really making the SLOs worse. It's okay that some of these things have lower SLOs than the checkout, storefront, and admin, which have the highest SLOs. So that's how we deal with that; well, we don't really deal with it.
Question: how do you deal with a disproportionate amount of traffic to a single pod or a single shop? I showed a diagram earlier that shows the workers isolated per pod; this is actually a lie. The workers are shared, which means that a single pod can grab up to 60 to 70% of all of the capacity of Shopify. What's actually isolated in the pod is all the data stores, and the workers can move between pods fluidly; they're fungible, they move between pods on the fly. The load balancer just sends requests to them and they connect to the correct pod. This means that the maximum capacity of a single store is somewhere between 60 and 70% of an entire data center, and it's not 100%, because that would mean an outage caused by a single store, which we're not interested in. But that's how we move this around. Does that answer it?
Question: how do we deal with large amounts of data, like someone importing 100,000 customers or 100,000 orders? Well, this is where the multi-tenancy architecture sort of shines. These databases are massive, half a terabyte of memory and many tens of cores, so if one customer has tons of orders, that just fits. And if a customer is so large that they need to be moved, that's what the defragmentation project is about: moving stores to somewhere with more space for them. So basically we deal with it by having massive data stores that can handle this without a problem. The import itself is just done in a job; some of these jobs are quite slow for the big customers, and there we need to do some more parallelization work, but most of the time it's not a big deal. If you have millions of orders and it takes a week to import them, you have plenty of other work to do during that time, so this has not been high on the list. How much time do I have left? Done? Okay.