
Zero-downtime payment platforms


by Prem Sichanugrist and Ryan Twomey

In the presentation titled "Zero-downtime payment platforms", Prem Sichanugrist and Ryan Twomey address the critical importance of maintaining high availability in payment processing platforms. They emphasize that even minor downtimes can adversely affect customer experience and revenue. Various strategies are discussed to mitigate risks associated with both internal and external downtimes.

Key Points:

  • Definition of Downtime: The speakers define two types of downtime:

    • Internal Downtime: Caused by issues within their own application, such as application errors or infrastructure failures.
    • External Downtime: Resulting from dependencies on third-party services, such as payment gateways or email providers.
  • Handling External Downtime: To counteract scenarios where payment gateways might go down, the team implemented a risk assessment system that allows the acceptance of low-risk orders even when the payment gateway is unavailable. This is managed through:

    • A manual shutdown system initially, which evolved into automated processes for efficiency.
    • A timeout around the payment gateway call; when it expires, the order's risk is assessed before deciding whether to accept the order anyway and charge the card later.
  • Internal Downtime Solutions: The presenters describe a fallback system, including:

    • Chocolate, a separate Sinatra application that acts as a request replayer when their main Rails application fails. This ensures that customer requests can still be stored and processed later without immediate disruptions.
    • Akamai Dynamic Router: This is utilized to reroute requests, minimizing the impact of application errors by handing off to the backup (Chocolate).
    • The use of a unique request ID, attached to every incoming request, to avoid duplicate charges and to deduplicate orders when requests are replayed between the two applications.

Significant Examples:

  • The talk grounds these points in practice, describing how, during high-traffic periods and real outages involving a high volume of transactions, having automated failover processes already in place was crucial to maintaining functionality and customer satisfaction.

Conclusions and Key Takeaways:

  • The presenters stress that all systems are prone to failure; hence, preparations should include having a robust failover strategy.
  • Implementing a replayer mechanism can significantly enhance user experience by ensuring operations continue smoothly during disruptions.
  • It's crucial to constantly evaluate and refine risk assessment models to appropriately manage order acceptance during downtimes.

Ultimately, the speakers convey that while it is impossible to entirely eliminate downtime, thorough planning and intelligent design can have substantial positive impacts on system reliability and user satisfaction.

By Prem Sichanugrist & Ryan Twomey

When you're building a payment platform, you want to make sure that your system is always available to accept orders. However, the complexity of the platform introduces the potential for it to go down when any one of the moving parts fails. In this talk, we'll show you the approaches we've taken and the risks we've had to take to ensure that our platform is always available to our customers. Even if you're not building a payment platform, these approaches can be applied to ensure high availability for your platform or service as well.

Help us caption & translate this video!

http://amara.org/v/FGaR/

Rails Conf 2013

00:00:16.680 All right, good afternoon everyone. I hope you all had a great
00:00:22.920 lunch. We're going to get started here. So today we're going to talk about
00:00:30.199 zero-downtime payment platforms, also known as some techniques to make
00:00:35.800 your app never look like it's down. My name is Prem Sichanugrist. I
00:00:43.600 work for a company called thoughtbot. We're in Boston, San Francisco, Boulder, and
00:00:50.760 Stockholm. And I'm Ryan Twomey, and I work for a company called LevelUp, which is in
00:00:56.480 Boston. But since you're here at RailsConf, I know you want to learn something. At
00:01:01.719 thoughtbot we have a site called Learn, at learn.thoughtbot.com, so you might want
00:01:08.479 to go check it out. We have screencasts and books about Rails
00:01:14.680 development, and you can use the promo code "railsconf" to get 20% off your first
00:01:19.720 month of Prime or anything else in the store.
00:01:27.240 So let's start out with some background. LevelUp is a mobile payments and advertising platform. It's based in Boston, like I said, and what it does is:
00:01:34.520 you hold up your mobile phone, just like you see on the left there. It's got a QR code, and you point it at the cashier's
00:01:40.520 scanner, and that's how you place an order. So if you want to get a coffee or a sandwich or something, that's what you
00:01:45.560 would do. What this does is hit our REST API, the create action actually,
00:01:51.719 on a Rails 3.2 app for the main part. What that ultimately does is
00:01:56.840 go through a whole bunch of processing, but it eventually ends up hitting the customer's credit card to
00:02:02.000 complete the order. And to complete that order we go through a payment gateway, such as
00:02:08.160 Braintree or Authorize.Net or anything else really, but the idea is that we go off to a third-party service to actually complete that part.
00:02:14.480 So our stack is made up of a Rails 3.2 app, we're on Heroku, and it points
00:02:23.239 to a Postgres 9.1 database. We're in the process of evaluating 9.2, which is
00:02:28.280 really exciting if you haven't checked it out, by the way. This database has two followers: one is
00:02:36.080 in the same data center but a separate availability zone (don't worry about the details there), and the other one is on the
00:02:41.800 west coast, the other side of the country. A follower, incidentally, if you're not familiar, in Heroku parlance
00:02:47.920 just means a read-only replica of the master database. One last important
00:02:53.400 thing to note: your app, or any app really, always depends on a lot of different things, many of which are
00:03:00.599 outside of your control. Heroku, for instance, is built on top of Amazon Web
00:03:05.760 Services, so if there's ever an issue with Amazon Web Services, that issue
00:03:11.400 could percolate up and eventually affect your app. Always being aware of everything that touches your app and
00:03:16.840 affects its uptime is critical. We're also doing quite a bit of
00:03:22.799 volume: at peak time we could be doing $1,000 a minute. That's a lot of money,
00:03:28.159 not Amazon money, but a lot of money, and we're growing pretty
00:03:33.519 quickly. So downtime sucks. We really don't want it.
00:03:39.879 So let's talk about the different kinds of downtime that could affect us. There are really two. There's us,
00:03:45.400 i.e. we can't execute our own code. That's internal downtime: perhaps our app is crashing, or
00:03:52.319 Heroku is down, or something catastrophic is happening. The other kind of downtime
00:03:57.360 is third-party downtime: something that we rely on, something critical to us like
00:04:02.400 our payment gateway or our email provider, something that we need to function, is
00:04:07.519 down. All right, so I'm going to start off with
00:04:12.599 "them", the things that are external, that we
00:04:18.440 don't have control over. As Ryan said, this includes the external database, the email provider, the caching
00:04:26.759 provider, and the payment gateway. But because LevelUp is a
00:04:33.199 payment platform, we're actually going to focus on the payment gateway.
00:04:39.600 So ask yourself a question: what happens to your payment platform if the
00:04:45.840 payment gateway goes away? It used to be the case, before we implemented all this stuff, it
00:04:52.560 used to be the case that when a new order came in and our payment gateway
00:04:57.680 was down, the order would get rejected, and then we would turn away the
00:05:03.199 customer. They wouldn't be able to pay for the sandwich or the hamburger that they want, and we'd end up
00:05:10.759 with sad customers. So we started to think: what if
00:05:18.160 we take on some risk, but
00:05:23.479 end up with a happy customer instead?
00:05:32.039 So we went through this
00:05:37.360 process iteratively. We started with something
00:05:42.520 simple, what we call a manual shutdown. Basically, on the
00:05:50.520 admin panel we have a big red button like this. The admin would go in,
00:05:57.280 press the button, and just shut down the system and make it
00:06:04.600 go into a failover mode. In this failover mode we
00:06:11.840 accept low-risk orders, save them into the database, and then
00:06:18.680 charge the customer later. You see that I keep saying
00:06:24.759 "accepting low-risk orders": before we actually save the
00:06:31.440 order into the database, we do some risk assessment. We can't
00:06:38.759 tell you exactly how we do the risk assessment, but it could be something as simple as this:
00:06:45.599 if the order is less than 100 bucks, then we consider it low
00:06:52.000 risk. There are actually many ways that you
00:06:58.479 can assess the risk. For example, you could also
00:07:04.479 check how many times they have paid with your system, how long they have been signed up,
00:07:12.800 and whether they have ever had a failed transaction, something like that, just
00:07:20.160 to make sure that you'll be able to collect the money from them later on.
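
The real risk model is deliberately not disclosed in the talk, but a toy version of the checks Prem lists here (order size, how long the customer has been signed up, whether they have had failed transactions) might look like the sketch below. The class, attribute names, and thresholds are invented for illustration; only the "under 100 bucks" rule comes from the talk.

    # Toy risk check along the lines described in the talk. All names and
    # thresholds are illustrative; the real model is intentionally not public.
    class RiskAssessment
      LOW_RISK_AMOUNT_CENTS = 100_00            # "less than 100 bucks"
      MIN_ACCOUNT_AGE       = 30 * 24 * 60 * 60 # roughly 30 days, in seconds

      def initialize(order, customer)
        @order = order
        @customer = customer
      end

      def low_risk?
        small_amount? && established_customer? && clean_history?
      end

      private

      def small_amount?
        @order.amount_cents < LOW_RISK_AMOUNT_CENTS
      end

      def established_customer?
        (Time.now - @customer.created_at) > MIN_ACCOUNT_AGE &&
          @customer.successful_orders_count > 0
      end

      def clean_history?
        @customer.failed_charges_count.zero?
      end
    end
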
00:07:29.080 By implementing this kind of system, at least we're able to let customers make purchases when our system, well,
00:07:35.400 not our system, when the payment gateway goes down, and we don't turn
00:07:41.160 away any customers. But this
00:07:47.599 manual process requires a human to be there when the payment gateway goes down,
00:07:55.000 and, well, people need sleep. They
00:08:01.800 don't stay up 24/7. So we started to realize that this
00:08:08.919 doesn't work so well. Why don't we just automate it?
00:08:14.639 So in terms of automating it, there are a couple of different ways we go about this. The simplest approach is
00:08:22.000 in three main steps. We start by taking the charge, the part that goes out to the
00:08:27.120 payment gateway and hits the credit card, and we wrap that in a timeout with a preset number of seconds that we're
00:08:32.200 going to wait for this thing to complete. If it times out, if we can't get out to
00:08:37.399 the payment gateway, if something doesn't happen that we wanted to happen within that timeout, then we'll stop and evaluate the
00:08:43.880 risk. If the risk is too high, we return a failure that says sorry, you can't get
00:08:50.160 your hamburger, move on. If the risk is low, however, we'll save it and we'll return a success, and that success
00:08:56.240 message mimics, it looks exactly like, a normal success. There's no difference to the client; in fact the client, the
00:09:02.640 scanner, the cashier has no idea they've even gone through this process. And lastly, since we saved all these
00:09:09.399 things, we then need some kind of mechanism later that'll go through, find them all, and retry them in the
00:09:15.920 background, so we can eventually reconcile or complete these orders.
00:09:22.600 So let's look at how we do this. The timeout code is pretty basic: charge_card_via_gateway, pretty descriptive. I
00:09:27.880 like descriptive names, by the way. That's wrapped in a timeout; we've got a timeout in seconds, and what we'll do is
00:09:35.240 go through this (this is in our CustomerCharger class) as part of the charge process, and if we
00:09:41.040 run out of time, if that timeout happens, we'll fall into the rescue there, and we'll call a method whose name I really
00:09:47.440 like: assess_risk_of_saving_order_without_charging_card.
00:09:52.519 What this guy will do is first check the risk. So
00:09:59.399 we'll see whether the risk is low or high. If it's high, if it's not low basically,
00:10:05.360 then we return a very generic validation message, generic because we really don't want to be too descriptive
00:10:10.640 about what we're really doing here. Something like "card failed" sounds pretty good to me. We return false so that there's
00:10:16.680 no more processing that goes on in the order. In the other branch, though, if the risk is
00:10:23.360 low, we'll set a string on the gateway ID. The gateway ID, just as some
00:10:28.440 background, is a random, let's call it Base64, string that lets us uniquely identify any charges that we
00:10:35.320 run via our gateway. So we're going to save something that we can easily pattern-match against, and that's going
00:10:41.240 to say "gateway-down-" and then, if you're not familiar with it, SecureRandom, which will just give us a nice
00:10:46.839 long random string, something that's not likely to collide with, say, another one happening at the exact same
00:10:53.200 time. And then lastly we return true so that processing can continue.
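
A rough reconstruction of what Ryan is describing might look like this. The names charge_card_via_gateway, assess_risk_of_saving_order_without_charging_card, and the "gateway-down-" prefix come from the talk; the surrounding class shape, the timeout value, and the error-reporting call are assumptions made for the sketch.

    require "timeout"
    require "securerandom"

    class CustomerCharger
      TIMEOUT_IN_SECONDS = 10 # illustrative value; the talk doesn't give one

      def initialize(order)
        @order = order
      end

      def charge
        Timeout.timeout(TIMEOUT_IN_SECONDS) do
          charge_card_via_gateway
        end
      rescue Timeout::Error
        assess_risk_of_saving_order_without_charging_card
      end

      private

      def assess_risk_of_saving_order_without_charging_card
        unless @order.low_risk?
          # Deliberately generic: don't reveal what is actually going on.
          @order.errors.add(:base, "Card failed")
          return false # stop any further processing of this order
        end

        # Tag the order so a background task can find and retry it later.
        @order.gateway_id = "gateway-down-#{SecureRandom.hex(16)}"
        true
      end

      def charge_card_via_gateway
        # Calls out to the third-party gateway (Braintree, Authorize.Net, ...)
        # and records the real gateway ID on success.
      end
    end
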
00:10:59.760 Now, the last step of this whole process is that we need a cron task that runs in the background that will find all these orders and retry
00:11:06.079 them. So roughly every 10 minutes or so throughout the day we have a cron task that kicks in and looks for any
00:11:12.839 orders that need to be reconciled. We have something called Order.reconcilable, and what we'll do is,
00:11:19.160 well, there's one interesting note before I move on to the next slide: there's a race condition in this, and I'm going to point it out
00:11:25.880 later, but just keep that in mind. So Order.reconcilable is an Active
00:11:31.639 Record scope on the order model, and that'll find anything that matches our "gateway-down"
00:11:37.880 pattern. If we find any, then we call order.reconcile on it,
00:11:44.120 and what this guy does is step through and first run something called SimilarOrderFinder. Don't
00:11:49.839 worry about the details; what's cool about it is that it calls out to our gateway and asks: do you
00:11:56.079 have any charges from, say, plus or minus 20 seconds of when this order was created, for the
00:12:02.440 same amount, and pointed at the same credit card? If it does, there's a really good chance that this is a duplicate and
00:12:09.320 somehow we've already run this charge. This is like a paranoia check; we really don't want to charge our customers twice by mistake, so we've got
00:12:16.040 these kinds of things all over the place. So if we do find one of those, we just update the gateway ID with the actual ID
00:12:22.800 that we found from our gateway and save it, so we don't have to keep rerunning this thing. On the flip side,
00:12:30.160 if we don't find an order, then we'll charge and we'll save. And remember, charge from before was defined
00:12:35.880 to also be in that timeout, so we're going to keep doing that process over and over again, just in case this thing fails
00:12:43.720 again. I mentioned earlier that issue with the race condition. The race
00:12:49.279 condition here is that there are no locks, and if this thing doesn't run sequentially, what if you kicked off, say,
00:12:54.360 10 of those cron tasks at the same time, or what if one ran really slowly and the next one started to kick in? That could be
00:13:00.199 bad, because then you can end up charging your customer a number of times, and that's not fun. So if you're going to do
00:13:06.680 this, either make sure that it's running sequentially or put in some locks.
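
Putting the reconciliation pieces together, a sketch along the lines Ryan describes might look like this. Order.reconcilable, order.reconcile, the "gateway-down-" pattern, the roughly-every-ten-minutes cron run, and the SimilarOrderFinder idea are from the talk; the exact query, the finder's return value, and the row lock used to avoid the race are assumptions.

    # Scope and reconcile method on the order model (Rails app).
    class Order < ActiveRecord::Base
      scope :reconcilable, -> { where("gateway_id LIKE ?", "gateway-down-%") }

      def reconcile
        # Ask the gateway for a charge on the same card, for the same amount,
        # within roughly +/- 20 seconds of this order's creation.
        duplicate = SimilarOrderFinder.new(self).find

        if duplicate
          # The charge already went through; just record its real gateway ID.
          self.gateway_id = duplicate.id
          save
        else
          # Goes back through the timeout-wrapped charge from the earlier sketch.
          CustomerCharger.new(self).charge
          save
        end
      end
    end

    # Run from a cron task roughly every 10 minutes. To avoid the race the
    # speakers warn about, run it strictly sequentially or take a lock per order.
    Order.reconcilable.find_each do |order|
      order.with_lock { order.reconcile }
    end
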
00:13:14.560 In terms of pros and cons: on the upside, there are no humans involved, so this can happen at 2 a.m. and I don't have to worry about being up at 2 a.m. to check
00:13:20.040 if our payment gateway is down. I can get some sleep, so that's pretty cool. On the downside,
00:13:26.880 however, well, first off we found this worked really well for a long period of time. There were a couple of
00:13:33.839 rough spots: we found that SimilarOrderFinder was a little finicky in a couple of cases. I would just
00:13:38.920 be careful with your gateway and do lots of testing to make sure it works the way you think it will. But we found
00:13:44.480 that this technique really worked well. So that's what we do when they are down.
00:13:52.399 Now let's talk about what happens if we are down.
00:13:57.759 There are several reasons we could go down: we have an
00:14:03.360 application error, like the application just throwing 500 errors,
00:14:10.199 Heroku is failing, or Amazon Web Services goes
00:14:15.600 away. The problem is that there's
00:14:21.800 not much we can do to fix this internally when it
00:14:27.000 happens. I'll go into more detail, but first let's take a break.
00:14:33.199 Break time. That's a purrito, by the way, a cat burrito, if you didn't know.
00:14:39.800 So let me tell you a story involving burritos, as all good stories do. On October 22nd, AWS went down. I bet at
00:14:48.040 least some of you knew about that. Here's what it looked like after the fact, but
00:14:53.360 during the day there were a lot of status updates. Some of it was misleading, some of it was helpful,
00:14:59.199 but we really had no idea when things were coming back. I was still at work and I was watching these status
00:15:05.560 updates. I was getting hungry; I wanted a burrito. I remembered that there's a Qdoba
00:15:10.680 near my house, but I couldn't remember its hours, so I pulled up their website to
00:15:15.759 check those hours, and yep: Qdoba is on
00:15:21.480 Heroku, and Heroku is on us-east-1. Bad times. I never got my burrito. So
00:15:30.959 remember, kids: always use a CDN to serve your static pages, like your hours page, if
00:15:37.560 you're a Qdoba.
00:15:43.480 All right, so let's get back into it. So what if Heroku goes down for us,
00:15:51.399 say the Amazon Web Services that Heroku is on? On that same day this was
00:15:59.720 happening to us: the number of failures spiked up badly,
00:16:06.000 and normally when you see something like this you'd be freaked
00:16:11.600 out, because, oh my God, this is how many customers you're turning
00:16:16.880 away. However, we had already planned ahead. We knew that
00:16:22.199 something like this could happen, so we built several
00:16:29.639 new parts into our system to handle this kind of
00:16:35.560 situation. So I'm going to introduce you to
00:16:41.600 Chocolate; this is a request replayer in our
00:16:46.680 stack. And we're also using a dynamic failover
00:16:53.480 service, which is powered by Akamai.
00:17:00.720 You might not be able to see yet how everything fits together, so let me walk you through
00:17:07.160 how we layer our stack. At the top we have the
00:17:12.360 internet, which, as a gateway, accepts the requests. All the
00:17:18.160 requests route through the Akamai dynamic router, which routes them to our Heroku
00:17:27.400 application. Behind the Akamai dynamic router we also put Chocolate,
00:17:34.280 and we also have the Akamai CDN to serve static assets, and a static version
00:17:41.840 of the site as well. So what happened on that day is that
00:17:51.240 when a customer tried to make a purchase, they scan the phone,
00:17:56.640 the request goes out over the internet and through the Akamai dynamic router, and it hits our Rails application.
00:18:04.679 But then our application doesn't work: sometimes it raises an application error,
00:18:11.640 and sometimes it might not respond within the timeout that we specify. When
00:18:21.240 that happens, the Akamai dynamic router reroutes the
00:18:27.360 request to Chocolate, and then Chocolate handles
00:18:33.240 the request from there.
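
The routing itself is configured on Akamai's side rather than in the application, so there is no real configuration to show, but a toy stand-in for the behaviour Prem describes (send the request to the primary app, and on an error or a timeout hand the very same request to Chocolate) might look like this. The hosts, path handling, and the 15-second value are placeholders.

    require "net/http"
    require "uri"

    PRIMARY = URI("https://api.example-primary.com") # placeholder hosts
    BACKUP  = URI("https://chocolate.example-backup.com")

    def forward(base, path, body, headers, read_timeout: 15)
      http = Net::HTTP.new(base.host, base.port)
      http.use_ssl = true
      http.read_timeout = read_timeout # give up on a slow primary
      request = Net::HTTP::Post.new(path, headers)
      request.body = body
      http.request(request)
    end

    # Try the primary Rails app first; on a 5xx or a timeout, hand the very
    # same request to the Chocolate replayer instead.
    def route_order(path, body, headers)
      response = forward(PRIMARY, path, body, headers)
      return response if response.code.to_i < 500
      forward(BACKUP, path, body, headers)
    rescue StandardError
      forward(BACKUP, path, body, headers)
    end
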
00:18:39.120 So now you may be curious: what is Chocolate? Well, it's really this yummy thing,
00:18:44.840 and Chocolate is actually a separate Sinatra application that we
00:18:53.320 wrote from scratch. We decided not to use the same codebase as our
00:18:58.520 production application, because there might be some bugs in
00:19:04.200 production and we don't want these two things failing at the same
00:19:10.200 time. This application performs the risk assessment that we were talking about before as well, and it
00:19:20.760 stores the raw request in its database, and when the production website is back up
00:19:27.440 it replays the request back to production. So we call it Chocolate, but
00:19:35.000 it's really just a replayer for web
00:19:40.480 requests. As I said before,
00:19:46.400 Chocolate is a Sinatra app, and we also deploy this guy to a different cloud,
00:19:54.120 not Amazon Web Services, so that we don't have a single point of
00:19:59.880 failure. I mean, if Heroku or Amazon Web Services goes down, our customers
00:20:07.280 don't even notice. Happy customer. I love that picture, by the
00:20:12.960 way. But again, even
00:20:18.000 though we have this, we still have to assess the same risk as before. If the order gets accepted but
00:20:25.280 can't be charged later, then we're still out of luck, and so this also
00:20:34.440 needs a good support team that will follow up with the customer and try to get the money from them. That sounds
00:20:40.760 pretty bad, but, I mean, we want to be able
00:20:46.039 to collect the money eventually.
00:20:52.760 So let's talk about how it works. Chocolate, as we mentioned, is a Sinatra app. It has a POST endpoint that looks identical to the path on our
00:20:58.559 production Rails app, and the idea is that
00:21:03.799 whenever a request comes in, we're going to do some basic checks. We're going to pull out some interesting things, such as
00:21:09.120 how much the order is for, how much that hamburger costs, and what customer we want to
00:21:14.559 charge. If the order looks legit, if it passes some of our basic sniff tests, then we're going to apply our risk
00:21:21.760 model. If the risk is acceptable, we'll save everything that we know about this request to the Chocolate database: the
00:21:28.240 params, the headers, everything that we possibly know, both for debugging later and for our replay
00:21:34.720 functionality. And then finally we're going to return a response that's absolutely identical to what production would return, in both the success and
00:21:42.200 failure states. So again, the cashier and the customer will have no idea anything has ever happened here.
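
A minimal sketch of a Chocolate-style endpoint, assuming Sinatra's classic style. The route, parameter names, header name, and the stand-in persistence below are invented; what follows the talk is the flow: pull out the interesting fields, apply the risk check, store the raw params and headers, and answer exactly as production would.

    require "sinatra"
    require "json"

    # Stand-in persistence; the real app writes to Chocolate's own database.
    StoredRequest = Struct.new(:params, :headers, :request_id, keyword_init: true) do
      def self.create(attrs)
        (@all ||= []) << new(**attrs)
      end
    end

    helpers do
      # Naive stand-in for the real risk model.
      def low_risk?(amount_cents)
        amount_cents > 0 && amount_cents < 100_00
      end
    end

    post "/api/v1/orders" do # hypothetical path; must mirror production's route
      content_type :json
      amount = params[:amount_cents].to_i

      # Same generic failure production would return for a rejected order.
      halt 422, { error: "Card failed" }.to_json unless low_risk?(amount)

      # Store everything we know (params, headers, the unique request ID),
      # both for debugging and for the later replay.
      StoredRequest.create(
        params:     params.to_json,
        headers:    request.env.select { |k, _| k.start_with?("HTTP_") }.to_json,
        request_id: request.env["HTTP_X_REQUEST_ID"] # header name is an assumption
      )

      # Respond exactly like production's success so the scanner never notices.
      { status: "ok" }.to_json
    end
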
00:21:49.240 So let's talk about what a replay actually looks like. What is replaying orders? Once the Rails production site
00:21:56.520 is back up again, it's time to replay orders. We have an order model on our Sinatra application that has a replay
00:22:02.600 method defined on it. This method opens up a connection to the production app
00:22:07.799 and replays the request almost exactly as if the production app were receiving it straight from Akamai, or
00:22:14.000 straight from the cashier. To kick off these replays, our support team will initiate them manually. This is on
00:22:20.720 purpose: it allows them to track any orders that don't replay successfully, and also to keep track of what our
00:22:26.159 volume of replays is per day, and they can then follow up with the customers if they have to. There's one
00:22:32.039 last key piece to all this that we haven't touched on, and it ties back to that paranoia that I was
00:22:37.200 talking about earlier. Because there are two separate apps, we have to be very careful
00:22:42.400 about duplicates. It is possible that an order could end up in production, could end up
00:22:47.600 actually being completed through our payment gateway and charging the customer's credit card, but still end
00:22:53.240 up in Chocolate, our replayer. I'll get into some reasons why that could happen, but the important thing is that
00:22:59.360 if it does happen, we want to be absolutely 100% sure that we don't accidentally charge that customer again
00:23:05.640 for that same order. So we need to de-dupe. The way that we do this is that Akamai
00:23:13.559 injects a custom HTTP header into every one of our requests that comes in. Every time an order is placed, it puts an extra
00:23:20.679 HTTP header in the request that has a unique request ID. This unique request ID is then
00:23:28.240 stored on the order on the Rails production side, so if the order goes through and it's saved, it's got this
00:23:34.640 request ID. At the same time, if the order ever fails over, if it's ever sent over to Chocolate, Chocolate also gets that same
00:23:41.640 request ID. It doesn't change; if it gets replayed, it's the same one as what was originally on the Rails production side,
00:23:48.039 and we store it there too. And then finally, when we go to replay an order, we're going to take that
00:23:53.600 order from Chocolate and add that original request ID as part of the order POST. The Rails
00:24:01.400 production site will then take it in and say: oh yeah, I've already got that order, I've got an order that has that request
00:24:06.520 ID, I'm going to reject this, that's a duplicate. On the flip side, if it finds that it doesn't have that order already,
00:24:12.240 then it's time to save it and move forward.
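
A sketch of the two halves of that de-duplication. In the talk the replay method lives on Chocolate's persisted order model; here it is a plain object so the snippet stands alone, and the production URL, the header name, and the uniqueness validation are assumptions.

    require "net/http"
    require "uri"

    # Chocolate side: replay a stored request against production, carrying the
    # original request ID along so production can spot duplicates.
    class StoredOrder
      PRODUCTION_ORDERS_URI = URI("https://api.example-production.com/api/v1/orders")

      attr_reader :params, :request_id

      def initialize(params:, request_id:)
        @params = params         # the original form params, as a Hash
        @request_id = request_id # the unique ID Akamai attached to the request
      end

      def replay
        http = Net::HTTP.new(PRODUCTION_ORDERS_URI.host, PRODUCTION_ORDERS_URI.port)
        http.use_ssl = true

        post = Net::HTTP::Post.new(PRODUCTION_ORDERS_URI.path)
        post["X-Request-Id"] = request_id # header name is an assumption
        post.set_form_data(params)

        http.request(post) # production treats this like any other order
      end
    end

    # Production side (Rails): because the request ID is saved with every order,
    # a replayed order that already completed is rejected as a duplicate, e.g.
    #
    #   class Order < ActiveRecord::Base
    #     validates :request_id, uniqueness: true, allow_nil: true
    #   end
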
00:24:18.600 Now let's get into the details of when we do the failover. As I said before, we have a
00:24:26.399 timeout on the Akamai dynamic router to resend the same request to Chocolate
00:24:35.240 if it takes more than 15 seconds after that request went
00:24:41.080 through to the production server. Because sometimes, even
00:24:47.919 though the request times out, the customer might already have been
00:24:53.720 charged on our payment gateway.
00:25:00.039 So we were a little worried about it, but since we have the
00:25:05.919 de-dupe logic in production, that's actually handled
00:25:13.399 automatically.
00:25:21.919 So let's talk about the pros of using multiple layers like this. It actually allows you to replay the exact same
00:25:27.159 request into a separate application, which doesn't have to be in
00:25:34.080 the same physical location, and if it's done correctly, then your site never goes
00:25:41.159 down, or at least it never appears to be
00:25:46.320 down. However, because you add so many pieces to the system,
00:25:54.799 it adds a layer of complexity,
00:26:01.480 and it also adds cost to your bill, because you have to pay for those
00:26:11.520 pieces.
00:26:18.480 Now I want to talk about something a little strange that we found. After we set up this whole failover system that we've talked about, we noticed that every day we would still see some orders end up in Chocolate, even
00:26:25.960 though our site was up, nothing was being reported down, all of our external services were up, there were
00:26:31.640 no problems being reported. We were still seeing these issues; we were still seeing orders pop up in Chocolate. So what
00:26:38.799 could be happening? It's very strange. Check out all these spikes. These are just random days; there were no
00:26:44.440 downtime incidents on AWS, Heroku, or our side at all, and yet we were still seeing
00:26:50.120 a not insignificant number that were failing over. Nothing was down. What could be
00:26:56.480 causing this? Has anyone heard of random routing
00:27:05.600 before? Yes: dynos get backed up. So every day a handful of orders still end up
00:27:11.600 failing over to Chocolate, and the reason why this happens is the way that random routing works. If a
00:27:19.000 handful of requests all come in at the same time, the router is going to randomly assign them to our available
00:27:24.559 dyno pool. Let's look at how that actually works. You've got the Heroku
00:27:29.760 router randomly assigning requests. Let's pretend the blue box is a request coming in; it hits the router and gets
00:27:36.440 assigned randomly to a dyno, let's call it dyno one. Another request comes in while
00:27:42.159 that first request is being processed, and this one gets randomly assigned to another
00:27:47.480 dyno. This continues while requests are still being processed,
00:27:52.679 and if you happen to have some requests that are a little slow, like this blue one that's been sitting around for a
00:27:58.240 little while, then we could run into a problem. Let's say another request comes in, and let's say that it randomly
00:28:04.960 gets assigned to a dyno that's already busy, say dyno one. What's going to happen? It's going to get queued. It's
00:28:12.120 going to sit there and wait until that dyno frees up and can actually process that
00:28:17.360 request. As requests get processed and new requests come in and get assigned,
00:28:23.600 we've still got this one request that's sitting there backed up. Notice also that there are a couple of
00:28:29.399 dynos here that aren't doing anything. That's kind of unfortunate.
00:28:34.600 As Prem mentioned earlier, we have a timeout on the Akamai side that's been set for 15
00:28:40.679 seconds. So if we find that a request comes in and it's sitting around waiting to be processed, or perhaps it's even in
00:28:47.880 process on the dyno, and the timer goes off, we're still going to end up timing
00:28:53.640 out that request. As soon as it gets timed out, it's no longer in our control, it's no
00:28:58.799 longer in Heroku's control; it's taken away, because the Akamai layer is before all that, and it now gets replayed to Chocolate. And
00:29:05.080 this is one of the key reasons why we would find so many orders that got completed anyway, that were still being
00:29:12.279 processed, say they were still in process on our app side, and still ended up in Chocolate, our
00:29:18.799 failover. This is doubly unfortunate, because we also have a whole bunch of dynos that are just sitting there doing
00:29:24.640 nothing. So how do we solve this? Well, you can't. The best thing you can do is
00:29:29.799 speed up your dynos, sorry, speed up your requests, as much as possible. Make it so that every dyno is
00:29:36.919 processing requests very, very quickly. This will reduce the number that are going to be sitting there just waiting to get
00:29:43.159 processed. You can also just continuously tune. Everyone's app is very different, and we've done a
00:29:50.279 lot of aggressive tuning, both of the number of Unicorn workers that we have running and of how long we're
00:29:56.840 willing to wait for a single request to be processed, not just the system as a whole but individual
00:30:03.720 requests. Ultimately you just have to accept that with random routing there's really no way you can ever fully solve this; there's no solution, you can just make it better.
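
The talk doesn't show the actual Unicorn settings, but the kind of tuning Ryan mentions (how many workers run per dyno, and how long any single request is allowed to take relative to the 15-second Akamai window) typically lives in config/unicorn.rb. The values below are placeholders, not LevelUp's.

    # config/unicorn.rb (illustrative values only)
    worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
    preload_app true
    timeout 15 # hard-kill any worker stuck on one request longer than this

    before_fork do |_server, _worker|
      # Drop the inherited connection so each forked worker opens its own.
      ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
    end

    after_fork do |_server, _worker|
      ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
    end
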
00:30:11.240 So before we close, there
00:30:20.080 are several things to remember from this talk. First of all, your site will go
00:30:27.039 down. No matter what, something will break and your site will go down, so consider
00:30:34.600 having something like this in your planning
00:30:42.720 process. Second, you should have a replayer, some dynamic routing to
00:30:51.519 replay the critical web requests, such as making an order or something
00:30:56.760 like that. Third, you might want to take on some risk, if
00:31:02.840 that's going to make your customers happy and make your
00:31:09.000 platform
00:31:14.639 reliable. Fourth, you need to keep your endpoints lean and fast; try to put
00:31:22.679 stuff into a background job if it can be, to free up the web
00:31:29.440 server. And fifth, because everyone sometimes wants a burrito late at night,
00:31:36.120 use a CDN to serve your static assets or a static site if your main site goes
00:31:43.360 down. So that's it. Does anybody have any questions?
00:31:50.559 Sure, so let me just repeat the question. Actually, let me paraphrase, because that was a little bit much, but I think what
00:31:55.799 you're saying is: it sounds like the problem is Heroku, it sounds like the problem is your main app going down, so why
00:32:01.080 not have a second copy of your app that you could scale up, and if the first goes down, send all your requests to that, and then
00:32:07.919 you're actually up, you're not just faking that you're up. Did I get that right? Cool. So I think the easiest answer
00:32:14.919 to that is: because state is very, very hard. State being things like: I need to
00:32:20.919 then synchronize, or de-dupe, whatever you want to call it. If I have an order
00:32:26.039 come in on my second application, well, an order doesn't exist in isolation; it does a lot of different stuff. When
00:32:32.840 someone places an order, we have probably hundreds of different things that get kicked off that all affect all kinds of
00:32:39.480 different state. If someone has, say, prepaid credit, we need to deduct that, right? So in that particular case, if
00:32:47.519 the second site has a couple of orders go against it, and then we resume,
00:32:52.799 we go back to our main site, well, those numbers are going to be off until they've had a chance to synchronize and talk to each other, right? So we found
00:33:00.519 that the state, and in other people's applications it could be even more complicated, but the thing is that the
00:33:05.600 state here, the synchronizing between the two applications, is incredibly hard. So
00:33:11.600 let's go shopping, let's create a replayer. That's the main idea. Yeah, and I would like to add as
00:33:17.960 well that deploying the site to Heroku is actually better than having to have,
00:33:23.919 like, dedicated sysops in our startup, because pretty
00:33:31.440 much every developer knows how to use git and knows how to deploy to Heroku. So even though
00:33:38.279 we have to maintain that stuff, the pros and cons actually cancel out, and it's not that bad. I
00:33:44.679 think there's another question. Yeah, right there.
00:33:52.000 Definitely, yeah. So the question was: didn't you just move the single point of failure from Amazon to Akamai? And you're right,
00:33:57.559 except that Akamai is a lot better about being up than we are. We're still pretty small compared to them. I think the last
00:34:04.120 time I read some of their marketing material it was something like 20% of all internet traffic ultimately ends up
00:34:10.440 routing through Akamai. But replace Akamai with X, with some other replayer that you
00:34:16.200 trust. They're very good about staying up. I don't remember if you guys
00:34:21.520 remember, maybe six months ago or something like that, they did have some downtime, which was pretty catastrophic. So
00:34:27.520 they're not perfect either, but they're better than we are right now. So that's what we're doing
00:34:34.399 so far, but it's an evolving system. Have you thought about a third failover?
00:34:39.919 So the question is: have you thought about a third failover? Absolutely. You could have a hundred
00:34:45.399 replayers if you wanted to; there's no reason why you couldn't. You could even have redundant Akamais. What if
00:34:51.480 you had Akamai and then X and then Y in between, right? You could scale up this kind of concept in many different
00:34:58.119 ways, to have something along the chain that captures those requests and stores them for later processing.
00:35:04.320 Absolutely.
00:35:09.400 Yes. So the question is: how do you do deployments when you have multiple places that you're deploying to? The
00:35:15.520 two applications, the main Rails app and Chocolate, the replayer, are both
00:35:20.760 very different, completely separate, and they are not deployed at the same time; they're not even deployed on the same day. As our standard operating
00:35:28.040 procedure, we do not touch both apps at the same time if we can ever avoid it,
00:35:33.440 and so far we've been able to avoid it. Yeah, and there are some parts of the code,
00:35:39.320 like the risk assessment, that are shared between those two applications, so we push those
00:35:44.880 up as a gem, on private gem hosting. Yep, and then lastly, just in
00:35:50.880 terms of tools, we just push directly to Heroku like everyone does, and we also
00:35:55.960 use Capistrano for the second one, which is on a VPS. So, other questions? Way in
00:36:02.160 the back. Awesome question.
00:36:09.640 The question is: how are you assessing risk without depending on a third party? So
00:36:14.720 what we showed you is a super simplistic example. I think it was: anything under 100 bucks is low risk,
00:36:20.880 right? That's one way to do it; it's pretty naive. There are other ways too. You could consider, for instance, that you
00:36:27.960 could have, say, a nightly synchronization job between the two apps that checks
00:36:33.599 all of your most frequent customers, customers that you know have been a customer for months and
00:36:40.079 months or years, that place lots of orders and have perhaps a predictable pattern. You could take something like
00:36:45.599 that and say: yes, this person is low risk up to some dollar amount or some
00:36:51.400 preset threshold, and if your replayer ever encounters an order from that person, then boom, low risk,
00:36:58.480 accept it. There are many different ways that you could do that. The idea is to move the logic into your replayer and
00:37:04.440 never have the replayer depend on something outside in order to assess that risk. Everyone's
00:37:11.680 app is different, but for us we found a couple of different ways we can do that very reliably. So, any other questions?
00:37:17.920 Yeah, right over here. Great question.
00:37:23.839 So the question is: what if you just use a second gateway, so if your first payment gateway goes down you fail over to a second gateway? The short answer is we actually already
00:37:29.560 do that. But the longer answer is that you can take this technique and
00:37:35.079 apply it to things beyond just a payment gateway. What if you were doing something else that didn't rely on
00:37:41.680 payment gateways and you were somehow interested in something else? A lot of this is similar to Delayed Job
00:37:47.240 retries, right? If you're sending an email in a delayed job, Delayed Job by default
00:37:52.520 will do up to three retries, I believe. So the same concept exists; that's just
00:37:57.760 inside your app. Same thing here: you could apply n payment gateways or n redundant services, but you
00:38:04.400 may not have that option, and if that's the case, you could use something like this. So I think we have time for maybe
00:38:09.839 one more question. Yes, right over there.
00:38:17.280 Great question. So the question is: what percentage of payments, what
00:38:22.680 percentage of orders placed while you're down, are considered high risk and thus rejected? The number started off very, very high.
00:38:30.640 I don't have the figures in front of me, but I wouldn't be surprised if it was more than 50%, and we've tuned that over time as
00:38:37.800 we learned more about our customers, learned more about their habits, and kind of figured out what really is high
00:38:43.960 risk to us. Is it the dollar amount? Is a $5 sandwich high-risk
00:38:49.960 enough that we're not willing to extend that credit just in case it works out? Or is it more about something
00:38:56.400 like patterns? For us, if someone places one order at $5, okay,
00:39:02.240 maybe that's low risk, because $5 isn't too much, especially when you're VC-backed, haha. But what if they start
00:39:09.520 placing lots of $5 orders, right away and all consecutively? Well, then that starts to become high risk, so we have to
00:39:16.760 put in some logic about that. We've been tuning this metric quite a bit over time, and like we said before, we're still
00:39:23.240 investigating. We don't have a perfect answer; I don't think there ever will be. But it's something that we can tune and experiment with and learn from. So,
00:39:31.160 cool. Well, thanks everyone. Thank
00:39:55.400 you.