00:00:16.680
Right, good afternoon, everyone. Hope you guys had a great
00:00:22.920
lunch. Yeah, we're going to get started here. So today we're going to talk about
00:00:30.199
zero-downtime payment platforms, also known as some techniques to make
00:00:35.800
your app never look like it's down. So, my name is Prem Sichanugrist, and I
00:00:43.600
work for a company called thoughtbot. We are in Boston, San Francisco, Boulder, and
00:00:50.760
Stockholm. And I'm Ryan Twomey, and I work for a company called LevelUp, which is in
00:00:56.480
Boston. But since you're here at RailsConf, I know you want to learn something. At
00:01:01.719
thoughtbot we have a website called thoughtbot Learn, it's at learn.thoughtbot.com, so you might want
00:01:08.479
to go check it out. We have online screencasts and books about Rails
00:01:14.680
development, and you can use the promo code RAILSCONF to get 20% off your first
00:01:19.720
month of Prime or anything else in the store. So let's start out with some
00:01:27.240
background. LevelUp is a mobile payments and advertising platform. It's based in Boston, like I said, and what it does is
00:01:34.520
you hold up your mobile phone, just like you see on the left there. It's got a QR code, and you point it at the cashier's
00:01:40.520
scanner, and that's how you place an order. So if you want to get a coffee or a sandwich or something, that's what you
00:01:45.560
would do. And what this does is it hits our REST API, the create action actually,
00:01:51.719
and it's on a Rails 3.2 app for the main part. And what that ultimately does is it
00:01:56.840
goes through a whole bunch of processing, but eventually ends up hitting the customer's credit card to
00:02:02.000
then complete the order. And to complete that order we go through a payment gateway, such as
00:02:08.160
Braintree or Authorize.net or anything else really, but the idea is that we go off to a third-party service to actually
00:02:14.480
complete that part. So our stack is made up of a Rails 3.2 app, we're on Heroku, and it points
00:02:23.239
to a Postgres 9.1 database. We're in the process of evaluating 9.2, which is
00:02:28.280
really exciting if you guys haven't checked it out, by the way. This database has two followers: one is
00:02:36.080
in the same data center but a separate availability zone, don't worry about the details there; the other one is on the
00:02:41.800
west coast, the other side of the country. And a follower, incidentally, if you're not familiar, in Heroku parlance
00:02:47.920
just means a read-only replica of the master database. And then one last important
00:02:53.400
thing to note: your app, or any app really, is always dependent on a lot of different things, many of which are
00:03:00.599
outside of your control. Heroku, for instance, is built on top of Amazon Web
00:03:05.760
Services, so if there's ever an issue with Amazon Web Services, then that issue
00:03:11.400
could percolate up and eventually affect your app. So always being aware of everything that touches your app and
00:03:16.840
affects its uptime is critical. We're also doing quite a bit of
00:03:22.799
volume. At peak time we could be doing $1,000 a minute. That's a lot of money,
00:03:28.159
not Amazon money, but a lot of money, and we're growing pretty
00:03:33.519
quickly. So downtime sucks; we really don't want it. So let's talk about the different
00:03:39.879
kinds of downtime that could affect us. There are really kind of two. There's us,
00:03:45.400
i.e., we can't execute our own code. It's internal downtime: perhaps our app is crashing, or
00:03:52.319
Heroku is down, or something catastrophic is happening. The other kind of downtime
00:03:57.360
is third-party downtime: something that we rely on, something critical to us, like
00:04:02.400
our payment gateway or our email provider, something that we need to function, is
00:04:07.519
down. All right, so I'm going to start off with
00:04:12.599
something on them, which is something external that we
00:04:18.440
don't have control over. As Ryan said, this includes the external database, the email provider, the caching
00:04:26.759
provider, and the payment gateway. But because LevelUp is a
00:04:33.199
payment platform, we're actually focusing on the payment gateway.
00:04:39.600
So ask yourself a question: what happens to your payment platform if the
00:04:45.840
payment gateway goes away? It used to be the case, before we implemented all this stuff, it
00:04:52.560
used to be the case that when a new order came in and our payment gateway
00:04:57.680
went down, the order would get rejected, and then we would turn away the
00:05:03.199
customer. They wouldn't be able to pay for the sandwich or the hamburgers they want, and we'd end up
00:05:10.759
with sad customers. So we started to think: what if
00:05:18.160
we take on some risk, but
00:05:23.479
then we end up with happy customers?
00:05:32.039
So we went through the
00:05:37.360
process iteratively. We started with something
00:05:42.520
simple, what we call a manual shutdown. Basically, on the
00:05:50.520
admin panel we have a big red button, like this, that the admin would go in and
00:05:57.280
press, and it just shuts down the system and makes it
00:06:04.600
go into a failover mode. In this failover mode, we
00:06:11.840
accept low-risk orders, we save them into the database, and then we
00:06:18.680
charge the customer later. A minimal sketch of wiring like that is below.
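For reference, here's a minimal sketch of how a manual shutdown like that could be wired up; the Setting model, its gateway_shutdown flag, and the save_low_risk_order_for_later_charge method are hypothetical stand-ins, since the talk doesn't show this code.

```ruby
# A minimal sketch, assuming a hypothetical Setting model backed by a
# key/value table sits behind the admin panel's big red button.
class Setting < ActiveRecord::Base
  def self.gateway_shutdown?
    where(key: "gateway_shutdown", value: "on").exists?
  end
end

# In the charge path, the flag short-circuits the gateway call:
def charge
  if Setting.gateway_shutdown?
    # Failover mode: accept only low-risk orders, save them to the
    # database, and charge the customer later (hypothetical method).
    save_low_risk_order_for_later_charge
  else
    charge_card_via_gateway
  end
end
```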
00:06:24.759
So you see I'm talking about accepting low-risk orders. Before we actually save the
00:06:31.440
order into the database, we do some risk assessment, and, well, we can't
00:06:38.759
tell you guys exactly how we do the risk assessment, but it could be something as simple as this:
00:06:45.599
if the order is less than 100 bucks, then we consider it low
00:06:52.000
risk. There are actually many ways that you
00:06:58.479
can assess the risk. For example, you could also
00:07:04.479
check how many times they have been paying with your system, how long they have been signed up,
00:07:12.800
and whether they have ever made a failed transaction, something like that, just
00:07:20.160
to make sure that you'll be able to collect money from them later on. A sketch of that kind of check follows.
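As a concrete illustration, here's a minimal sketch of that kind of risk assessment. Only the sub-$100 rule comes from the talk; the attribute names and the other thresholds are assumptions.

```ruby
# A minimal sketch of a risk assessment like the one described above.
# The $100 threshold comes from the talk's example; everything else
# (attribute names, the other rules) is hypothetical.
class RiskAssessment
  LOW_RISK_MAXIMUM_CENTS = 100_00 # "less than 100 bucks"

  def initialize(order)
    @order = order
  end

  def low_risk?
    small_amount? && established_customer? && clean_history?
  end

  private

  def small_amount?
    @order.amount_cents < LOW_RISK_MAXIMUM_CENTS
  end

  def established_customer?
    # How long have they been signed up, and how often have they paid?
    customer.created_at < 3.months.ago && customer.completed_orders_count > 5
  end

  def clean_history?
    # Have they ever made a failed transaction?
    customer.failed_orders_count.zero?
  end

  def customer
    @order.customer
  end
end
```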
00:07:29.080
So by implementing this kind of system, at least we're able to let customers make purchases when our system goes, well,
00:07:35.400
not our system, when the payment gateway goes down, and we don't turn
00:07:41.160
away any customers. But this
00:07:47.599
manual process requires a human to be there when the payment gateway goes down,
00:07:55.000
and, well, people need sleep; they
00:08:01.800
don't stay up 24/7. So we started to realize that this
00:08:08.919
doesn't work so well. Why don't we just automate
00:08:14.639
it? So in terms of automating it, there are a couple of different ways to go about this. The simplest approach is
00:08:22.000
in three main steps. We start by taking the charge, the part that goes out to the
00:08:27.120
payment gateway and hits the credit card, and we wrap that in a timeout with a preset number of seconds that we're
00:08:32.200
going to wait for this thing to complete. If it times out, if we can't get out to
00:08:37.399
the payment gateway, if something doesn't happen that we wanted in that timeout, then we'll stop and evaluate the
00:08:43.880
risk. If the risk is too high, we return back a failure that says: sorry, can't get
00:08:50.160
your hamburger, move on. If the risk is low, however, we'll save it and we'll return a success, and that success
00:08:56.240
message mimics, it looks exactly like, a normal success. There's no difference to the client; in fact the client, the
00:09:02.640
scanner, the cashier, has no idea they've even gone through this process. And lastly, as we save all these
00:09:09.399
things, we then need to have some kind of mechanism later that'll go through, find them all, and then retry them in the
00:09:15.920
background, so we can eventually reconcile or complete these orders for us. So let's look at how we do this. So
00:09:22.600
the timeout code is pretty basic: charge_card_via_gateway, pretty descriptive, I
00:09:27.880
like descriptive names, by the way. That's wrapped in a timeout, we've got a timeout in seconds, and what we'll do is
00:09:35.240
we'll go through, this is in our CustomerCharger class, this will go through as part of the charge process, and if
00:09:41.040
that timeout happens, we'll fall into the rescue there, and we'll call a method whose name I really
00:09:47.440
like: assess_risk_of_saving_order_without_charging_card.
00:09:52.519
And what this guy will do is first check the risk. So
00:09:59.399
we'll see if the risk is low or high. If it's high, if it's not low basically,
00:10:05.360
then we return back a very generic validation message, generic because we really don't want to be too descriptive
00:10:10.640
about what we're really doing here, so something like "card failed" sounds pretty good to me. We return false so that there's
00:10:16.680
no more processing that goes on for the order. In the other branch, though, if the risk is
00:10:23.360
low, we'll then set a string on the gateway ID. The gateway ID, just as some
00:10:28.440
background, is a random, let's call it Base64, just a random string that lets us uniquely identify any charges that we
00:10:35.320
run via our gateway. So we're going to save something that we can easily pattern match against, and that's going
00:10:41.240
to say "gateway-down-" and then, if you're not familiar with it, SecureRandom will just give us a nice
00:10:46.839
long random string, something that's not likely to collide with, say, another one that's happening at the exact same
00:10:53.200
time. And then lastly we return true so that processing can continue. Roughly, it looks like the sketch below.
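Here's a minimal sketch of that CustomerCharger flow, reconstructed from the description above; the class and method names come from the talk, while the order attributes and the RiskAssessment class are carried over from the earlier hypothetical sketch.

```ruby
require "timeout"
require "securerandom"

class CustomerCharger
  TIMEOUT_IN_SECONDS = 15 # a preset number of seconds (assumed value)

  def initialize(order)
    @order = order
  end

  def charge
    Timeout.timeout(TIMEOUT_IN_SECONDS) do
      charge_card_via_gateway
    end
  rescue Timeout::Error
    assess_risk_of_saving_order_without_charging_card
  end

  private

  def assess_risk_of_saving_order_without_charging_card
    if RiskAssessment.new(@order).low_risk?
      # Save a pattern-matchable placeholder gateway ID so a background
      # task can find this order later and retry the charge.
      @order.gateway_id = "gateway-down-#{SecureRandom.urlsafe_base64}"
      true # processing continues; the client sees a normal success
    else
      # Deliberately generic: don't reveal that the gateway is down.
      @order.errors.add(:base, "Card failed")
      false # stop any further processing of the order
    end
  end

  def charge_card_via_gateway
    # Calls out to the third-party payment gateway (Braintree,
    # Authorize.net, etc.) to charge the customer's card.
  end
end
```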
00:10:59.760
Now, the last step of this whole process is that we need a cron task that runs in the background that will then find all these orders and retry
00:11:06.079
them. So roughly every 10 minutes or so throughout the day we have a cron task that kicks in and looks for any
00:11:12.839
orders that need to be reconciled. We have something called Order.reconcilable. And,
00:11:19.160
well, there's one interesting note before I move on to the next slide: there's a race condition in this, and I'm going to point it out
00:11:25.880
later, but just keep that in mind. So Order.reconcilable, this is an Active
00:11:31.639
Record scope on the Order model, and that'll find anything that matches our "gateway-down"
00:11:37.880
pattern. If we find any, then we call order.reconcile on each of them; a sketch of the scope and the cron entry point follows.
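A sketch of what that scope and the cron task's entry point might look like; the SQL pattern and Rails 3.2-style scope are assumptions based on the description.

```ruby
class Order < ActiveRecord::Base
  # Finds orders saved while the gateway was down by pattern matching
  # the "gateway-down-" placeholder prefix.
  scope :reconcilable, -> { where("gateway_id LIKE ?", "gateway-down-%") }
end

# Kicked off by cron roughly every 10 minutes throughout the day:
Order.reconcilable.find_each(&:reconcile)
```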
00:11:44.120
And what this guy does is it steps through, and it first runs something called SimilarOrderFinder. This is,
00:11:49.839
well, don't worry about the details; what's cool about it is that it calls out to our gateway and asks: do you
00:11:56.079
have any charges that were in, say, plus or minus 20 seconds of when this order was created, that are for the
00:12:02.440
same amount and pointed at the same credit card? If it does, there's a really good chance that this is a duplicate and
00:12:09.320
somehow we've already run this charge. This is like a paranoia check; we really don't want to charge our customers twice by mistake, so we've got
00:12:16.040
these kinds of things all over the place. So if we do find one of those, we just update the gateway ID with the actual ID
00:12:22.800
that we found from our gateway and save it, so we don't have to keep rerunning this thing. On the flip side,
00:12:30.160
if we don't find an order, then we'll charge and we'll save. And you remember, charge from before was defined
00:12:35.880
to also be in that timeout, so we're going to keep doing that process over and over again, just in case this thing fails again. A sketch of reconcile follows.
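Here's a sketch of that reconcile flow. SimilarOrderFinder is named in the talk, but its interface (matching_charge returning a gateway charge or nil) is assumed.

```ruby
class Order < ActiveRecord::Base
  def reconcile
    if (existing_charge = SimilarOrderFinder.new(self).matching_charge)
      # Paranoia check hit: the gateway already ran this charge, so
      # just record the real gateway ID instead of charging again.
      update_attributes(gateway_id: existing_charge.id)
    else
      # charge is wrapped in the same timeout as before, so if the
      # gateway is still down, this order simply stays reconcilable
      # and the next cron run picks it up again.
      CustomerCharger.new(self).charge
      save
    end
  end
end
```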
00:12:43.720
I mentioned earlier that issue with the race condition. The race
00:12:49.279
condition here is that there are no locks, and if this thing doesn't run sequentially, what if you kicked off, say,
00:12:54.360
10 of those cron tasks at the same time, or what if one ran really slowly and the next one started to kick in? That could be
00:13:00.199
bad, because then you can end up charging your customer a number of times, and that's not fun. So if you're going to do
00:13:06.680
this, either make sure that it's running sequentially or put in some locks, along the lines of the sketch below.
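For example, one hedge against overlapping runs is a row-level lock; this sketch assumes Postgres and Rails' pessimistic locking (with_lock issues SELECT ... FOR UPDATE and reloads the record).

```ruby
Order.reconcilable.find_each do |order|
  order.with_lock do
    # Re-check after the lock is acquired: a concurrent run may have
    # already reconciled this order while we were waiting.
    order.reconcile if order.gateway_id.to_s.start_with?("gateway-down-")
  end
end
```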
00:13:14.560
In terms of pros and cons: on the upside, there are no humans involved, so this can happen at 2 a.m. and I don't have to worry about being up at 2 a.m. to check
00:13:20.040
if our payment gateway is down; I can get some sleep. So that's pretty cool. On the downside,
00:13:26.880
however, well, first off, we found this worked really well for a long period of time, but there were a couple of
00:13:33.839
sort of rough spots: we found that SimilarOrderFinder was a little finicky in a couple of cases. I would just
00:13:38.920
be careful with your gateway and do lots of testing to make sure that it works the way you think it will. But we found
00:13:44.480
that this technique really worked well. So that's when they are down. Now
00:13:52.399
let's talk about what happens if we are down.
00:13:57.759
So there are several reasons we could go down. Like, we have an
00:14:03.360
application error, the application is just throwing 500 errors,
00:14:10.199
our Heroku is failing, Amazon Web Services goes
00:14:15.600
away. The problem is that there's
00:14:21.800
nothing, nothing much, we can do to fix this internally when it
00:14:27.000
happens. I would go into more detail, but first let's take a break.
00:14:33.199
Break time! That's a purrito, by the way; that's a cat burrito, if you didn't know.
00:14:39.800
So let me tell you a story involving burritos, as all good stories do. On October 22nd, AWS went down; I bet at
00:14:48.040
least some of you knew about that. Here's what it looked like after the fact, but
00:14:53.360
during the day there were a lot of status updates; some of it was misleading, some of it was helpful,
00:14:59.199
but we really had no idea what was coming back. I was still at work and I was watching these status
00:15:05.560
updates. I was getting hungry; I wanted a burrito. I remembered that there's a Qdoba
00:15:10.680
near my house, but I couldn't remember its hours, and I pulled up their website to
00:15:15.759
check those hours, and yep: Qdoba is on
00:15:21.480
Heroku. Heroku is on us-east-1. Bad times. I never got my burrito. So
00:15:30.959
remember, kids: always use a CDN to serve your static pages, like your hours page, if
00:15:37.560
you're a Qdoba. So let's get back into it. All right,
00:15:43.480
so what if, OK, so what if Heroku goes down for us, like,
00:15:51.399
say, Amazon Web Services, which Heroku is on? So on that same day this was
00:15:59.720
happening to us: the number of failures spiked up, so bad.
00:16:06.000
And normally when you see something like this, you'd be freaked
00:16:11.600
out, because, oh my God, this is how many customers you've turned
00:16:16.880
away. However, we had already planned ahead. We knew that
00:16:22.199
something like this could happen, so we built several parts,
00:16:29.639
we built several new parts, to our system to handle this kind of
00:16:35.560
situation. So I'm going to introduce you to
00:16:41.600
Chocolate. This is a request replayer on our
00:16:46.680
stack. And we're also using a dynamic failover
00:16:53.480
service, which is powered by Akamai. You might not
00:17:00.720
see how everything fits together or works out, so let me walk you through
00:17:07.160
how we layer our stack. So at the top we have the
00:17:12.360
internet, which, like a gateway, accepts the requests. All the
00:17:18.160
requests route through the Akamai dynamic router, which routes them to our Heroku
00:17:27.400
application. Behind the Akamai dynamic router we also put our Chocolate in
00:17:34.280
there, and we also have the Akamai CDN to serve static assets, like a static version
00:17:41.840
of the site as well. So on that day, what happened is that
00:17:51.240
when a customer tried to make a purchase, they scanned their phone,
00:17:56.640
the request went to the internet and through the Akamai dynamic router, and it hit our Rails application.
00:18:04.679
But then our application doesn't work: sometimes it raises an application error,
00:18:11.640
and sometimes it might not respond within the timeout that we specify. So when
00:18:21.240
that happens, the Akamai dynamic router would reroute the
00:18:27.360
request to Chocolate, and then Chocolate will handle
00:18:33.240
the request after that. So now you may be curious: what is
00:18:39.120
Chocolate? Is it really this yummy?
00:18:44.840
Well, Chocolate is actually a separate Sinatra application that we
00:18:53.320
wrote up from scratch. We decided not to use the same code base as our
00:18:58.520
production application, because there might be some bugs in
00:19:04.200
production, and we don't want these two things to be failing at the same
00:19:10.200
time. So this application performs the risk assessment that we were talking about before as well, and it
00:19:20.760
stores the raw request in the database, and when the production website is back up,
00:19:27.440
it replays the request back to production. So we call it Chocolate, but
00:19:35.000
it's really the DVR for a web
00:19:40.480
request. As I said before,
00:19:46.400
Chocolate is a Sinatra app. We also deployed this guy onto a different cloud,
00:19:54.120
not Amazon Web Services, so that we don't have a single point of
00:19:59.880
failure. So, I mean, if Heroku or Amazon Web Services goes down, our customers
00:20:07.280
don't even notice. Happy customer. I love that picture, by the
00:20:12.960
way. So, I mean, but again, even
00:20:18.000
though we have this, we still have to assess the same risk as before. If the order gets accepted but
00:20:25.280
cannot be charged later, then we're still, yeah, we're still out of luck. And so this also
00:20:34.440
needs a good support team that will follow up with the customer and try to get the money from them. That sounds
00:20:40.760
pretty bad, but, yeah, I mean, we want to be able
00:20:46.039
to collect the money, usually. Yeah. So let's talk about how it works. So Chocolate, as we mentioned, is a
00:20:52.760
Sinatra app. It has a POST endpoint that looks identical to the path on our
00:20:58.559
production Rails app. And the idea is,
00:21:03.799
whenever a request comes in, we're going to do some basic checks. We're going to pull out some interesting things, such as
00:21:09.120
how much the order is for, how much that hamburger costs, and which customer we want to
00:21:14.559
charge. If the order looks legit, if it passes some of our basic sniff tests, then we're going to apply our risk
00:21:21.760
model. If the risk is acceptable, we'll save everything that we know about this request to the Chocolate database, so the
00:21:28.240
params, the headers, everything that we possibly know, both for debugging later but also for our replay
00:21:34.720
functionality. And then finally we're going to return a response that's absolutely identical to what production would return, both in the success and
00:21:42.200
failure states. So again, the cashier and the customer will have no idea anything has ever happened here. A sketch of such an endpoint follows.
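A minimal sketch of what such an endpoint could look like. The path, the helper methods, and the column names are assumptions; the real Chocolate code isn't public.

```ruby
require "sinatra"
require "json"

# The path must look identical to production's create-order path
# (the actual path here is hypothetical).
post "/orders" do
  raw_body = request.body.read
  payload = JSON.parse(raw_body)

  # Basic sniff tests: pull out the amount and the customer to charge.
  # looks_legit?, low_risk?, and the response helpers are assumed.
  halt 422, production_style_failure("Card failed") unless looks_legit?(payload)

  # Apply the same kind of risk model production uses.
  halt 422, production_style_failure("Card failed") unless low_risk?(payload)

  # Save everything we know -- params, headers, the raw body -- both
  # for debugging later and for the replay functionality.
  Order.create!(
    raw_body: raw_body,
    headers: request.env.select { |key, _| key.start_with?("HTTP_") }.to_json,
    request_id: request.env["HTTP_X_REQUEST_ID"]
  )

  # Respond exactly as production would on success.
  production_style_success(payload)
end
```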
00:21:49.240
So let's talk about what a replay actually looks like, what replaying orders is. So once the Rails production site
00:21:56.520
is back up again, it's time to replay orders. We have an order model on our Sinatra application that has a replay
00:22:02.600
method defined on it. This method opens up a connection to the production app
00:22:07.799
and replays the request almost exactly as if the production app were receiving it straight from Akamai or
00:22:14.000
straight from the cashier. A sketch of such a replay method follows.
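Here's a sketch of a replay method along those lines. The production URL, model name, and stored columns are assumptions; the key idea is that the replayed request carries the original body and the original request ID.

```ruby
require "net/http"
require "uri"

class Order < ActiveRecord::Base
  # Hypothetical production endpoint.
  PRODUCTION_URL = URI("https://api.example.com/orders")

  def replay
    http = Net::HTTP.new(PRODUCTION_URL.host, PRODUCTION_URL.port)
    http.use_ssl = true

    post = Net::HTTP::Post.new(PRODUCTION_URL.path)
    post.body = raw_body
    post["Content-Type"] = "application/json"
    # Reuse the original Akamai-injected request ID so production can
    # recognize and reject duplicates (explained below).
    post["X-Request-Id"] = request_id

    http.request(post)
  end
end
```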
00:22:20.720
So to kick off these replays, our support team initiates them manually. This is on purpose: it allows them to track any orders that don't replay successfully, and also to keep track of what our
00:22:26.159
volume of replays is per day, and they can then follow up with the customers if they have to. There's one
00:22:32.039
last key piece to all this that we haven't touched on, and it ties back to that paranoia that I was
00:22:37.200
talking about earlier. Because there are two separate apps, we have to be very careful
00:22:42.400
about duplicates. It is possible that an order could end up in production, could end up
00:22:47.600
actually being completed through our payment gateway and charging the customer's credit card, but it still ends
00:22:53.240
up in Chocolate, our replayer. And I'll get into some reasons why that could happen, but the important thing is that,
00:22:59.360
if it does happen, we want to be absolutely 100% sure that we don't accidentally charge that customer again
00:23:05.640
for that same order. So we need to de-dupe. The way that we do this is that Akamai
00:23:13.559
injects a custom HTTP header into every one of our requests that comes in. Every time an order is placed, it puts an extra
00:23:20.679
header, an extra HTTP header, in the request that has a unique request ID. This unique request ID is then
00:23:28.240
stored on the order on the Rails production side, and if the order goes through and it's saved, it's got this
00:23:34.640
request ID. At the same time, if the order ever fails over, if it's ever sent over to Chocolate, Chocolate also gets that same
00:23:41.640
request ID. It doesn't change; if it gets replayed, it's the same one as what was originally on the Rails production side,
00:23:48.039
and we store it there too. And then finally, when we go to replay an order, we're going to take that
00:23:53.600
order from Chocolate and we're going to add that original request ID as part of the order POST. And so the Rails
00:24:01.400
production site will then take it in and say: oh yeah, I've already got that order, I've got an order that has that request
00:24:06.520
ID, I'm going to reject this, that's a duplicate. On the flip side, if it finds that it doesn't have that order already,
00:24:12.240
then it's time to save it and move forward. A sketch of that check is below.
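A sketch of that production-side de-dupe check; the header name X-Request-Id and the controller shape are assumptions.

```ruby
class OrdersController < ApplicationController
  def create
    request_id = request.headers["X-Request-Id"]

    if Order.exists?(request_id: request_id)
      # A replayed duplicate: this order already completed here, so
      # reject it rather than charging the customer again.
      render json: { error: "duplicate order" }, status: :conflict
      return
    end

    order = Order.new(params[:order].merge(request_id: request_id))
    # ... normal risk checks and charge processing continue from here ...
  end
end
```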
00:24:18.600
So now let's get into the details of when we would do the failover. As I said before, we have the
00:24:26.399
timeout on the Akamai dynamic router to resend the same request to Chocolate
00:24:35.240
if it takes more than 15 seconds after the request went
00:24:41.080
through to the production server. Because sometimes, even
00:24:47.919
though the request times out, the customer might already have been
00:24:53.720
charged on our payment gateway. OK,
00:25:00.039
so we were a little worried about it, but since we have the
00:25:05.919
de-dupe thing in production, that's actually automatically
00:25:13.399
handled. So let's talk about the pros of using multiple
00:25:21.919
layers like this. It actually allows you to replay the exact same
00:25:27.159
request into a separate application, which doesn't have to be in
00:25:34.080
the same physical location, and if it's done correctly, then your site will never go
00:25:41.159
down, or, well, it will never appear to be
00:25:46.320
down. However, because you add so many pieces to the system,
00:25:54.799
it actually, yeah, it adds a layer of complexity,
00:26:01.480
and also it adds cost to your bill, because you have to pay for those
00:26:11.520
pieces. So I want to talk about something a little strange that we found after we set up this whole failover
00:26:18.480
system that we've talked about. We noticed that every day we would still see some orders end up in Chocolate, even
00:26:25.960
though our site was up: nothing was being reported down, all of our external services were up, there were
00:26:31.640
no problems being reported. We were still seeing these issues; we were still seeing orders pop up in Chocolate. So what
00:26:38.799
could be happening? It's very strange. Check out all these spikes. These are just random days; there were no
00:26:44.440
downtime incidents on AWS, Heroku, or our side at all, and yet we were still seeing
00:26:50.120
a not insignificant number that were failing over. Nothing was down. What could be
00:26:56.480
causing this? Has anyone heard of random routing
00:27:05.600
before? Yes: dynos get backed up. So every day a handful of orders still end up
00:27:11.600
failing over to Chocolate, and the reason why this happens is due to the way that random routing works. If a
00:27:19.000
handful of requests all come in at the same time, the router is going to randomly assign them to our available
00:27:24.559
dyno pool. So let's look at how that actually works. So you have a router, you've got the Heroku
00:27:29.760
router, and it's randomly assigning requests. Let's pretend the blue box is a request coming in: it hits the router and gets
00:27:36.440
assigned randomly to a dyno, let's call it dyno 1. Another request comes in while
00:27:42.159
that first request is being processed, and this one gets randomly assigned to another
00:27:47.480
dyno. This continues while requests are still being processed,
00:27:52.679
and if you happen to have some requests that are a little slow, like this blue one that's been sitting around for a
00:27:58.240
little while, then we could run into a problem. So let's say another request comes in, and let's say that it randomly
00:28:04.960
gets assigned to a dyno that's already busy, like, say, dyno 1. So what's going to happen? It's going to get queued; it's
00:28:12.120
going to sit there and wait until that dyno frees up and can actually process that
00:28:17.360
request. So as requests get processed, and new requests come in and get assigned,
00:28:23.600
we've still got this one request that's sitting there, backed up. Notice also that there are a couple of
00:28:29.399
dynos here that aren't doing anything; that's kind of unfortunate. And
00:28:34.600
as Prem mentioned earlier, we have a timeout on the Akamai side that's been set for 15
00:28:40.679
seconds. So if we find that a request comes in and it's sitting around waiting to be processed, or perhaps it's even in
00:28:47.880
process on the dyno, and the timer goes off, we're still going to end up timing
00:28:53.640
out that request. As soon as it gets timed out, it's no longer in our control, it's no
00:28:58.799
longer in Heroku's control; it's taken away, because of the Akamai layer before that, and it now gets replayed to Chocolate. And
00:29:05.080
this is one of the key reasons why we would find so many orders that could get completed anyway, that were still being
00:29:12.279
processed, say they were still in process on our app side, and still end up in Chocolate, our
00:29:18.799
failover. This is doubly unfortunate, because we also have a whole bunch of dynos that are just sitting there doing
00:29:24.640
nothing. So how do we solve this? Well, you can't. The best thing you can do is: you
00:29:29.799
can speed up your dynos, sorry, speed up your requests as much as possible: make it so that every dyno is
00:29:36.919
processing requests very, very quickly. This will reduce the number that are going to be sitting there just waiting to get
00:29:43.159
processed. You can also just continuously tune. Everyone's app is very different, and we've done a
00:29:50.279
lot of aggressive tuning, both for the number of Unicorn workers that we have running, as well as, say, how long we're
00:29:56.840
willing to wait for a single request to be processed, not just the system as a whole but individual requests. One example of that kind of tuning is sketched below.
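As one example of that kind of tuning, a Unicorn config caps both the worker count and the per-request time; the numbers here are illustrative, not LevelUp's actual settings.

```ruby
# config/unicorn.rb
worker_processes 4   # tune the number of Unicorn workers per dyno

# Kill workers whose request runs too long, keeping a single slow
# request well under the 15-second Akamai failover timeout.
timeout 10

preload_app true
```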
00:30:03.720
And ultimately you just have to accept that with random routing there's really no way that you can ever solve
00:30:11.240
this. There's no solution; you can just make it better. So before we close, there
00:30:20.080
are several things to remember from this talk. First of all, your site will go
00:30:27.039
down. No matter what, something will break and your site will go down, so consider
00:30:34.600
having something like this in your planning
00:30:42.720
process. Second, you should have a replayer, some dynamic routing, to
00:30:51.519
replay the critical web requests, such as making an order or something
00:30:56.760
like that. Third, you might want to take on some risk, if
00:31:02.840
that's going to make your customers happy and make your
00:31:09.000
platform, like, be
00:31:14.639
reliable. Fourth, you need to keep your dynos lean and fast: try to put
00:31:22.679
stuff into background jobs where you can, to free up the web
00:31:29.440
server. And five: because everyone sometimes wants a burrito late at night,
00:31:36.120
use a CDN to serve your static assets, or a static site, if your main site goes
00:31:43.360
down. So that's it. Anybody have any questions? Sure, so let me just repeat the
00:31:50.559
question. So he's saying, well, let me just paraphrase actually, because that was a little bit much, but I think what
00:31:55.799
you're saying is: it sounds like the problem is Heroku, it sounds like the problem is your main app going down, so why
00:32:01.080
not have a second copy of your app that you could scale up, and if it goes down, send all your requests to that, and
00:32:07.919
you're actually up, you're not just faking that you're up. Did I get that right? Cool. So I think the easiest answer
00:32:14.919
to that is: because state is very, very hard. State being things like, I need to
00:32:20.919
then synchronize, or, you know, de-dupe, whatever you want to call it. If I have an order
00:32:26.039
come in on my second application, well, an order doesn't exist in isolation; it does a lot of different stuff. When
00:32:32.840
someone places an order, we have probably hundreds of different things that get kicked off that all affect all kinds of
00:32:39.480
different state. If someone has, say, prepaid credit, we need to deduct that, right? So in that particular case, if
00:32:47.519
the second site has a couple of orders go against it, and then we resume,
00:32:52.799
we go back to our main site, well, those numbers are going to be off until they've had a chance to synchronize and talk to each other, right? So we found
00:33:00.519
that the state, and in other people's applications it could be even more complicated, but the thing is that the
00:33:05.600
state here, the synchronizing between the two applications, is incredibly hard. So
00:33:11.600
let's go shopping, let's create a replayer. So that's the main idea. Yeah, and I think I would like to add as
00:33:17.960
well that deploying the site to Heroku is actually better than having to have
00:33:23.919
the, like, sysops in our startup, I mean, because pretty
00:33:31.440
much every developer knows how to use git and knows how to deploy to Heroku. So even though
00:33:38.279
we have to maintain that stuff, the pros and cons actually cancel out, and then it's not that bad. I
00:33:44.679
think there's another question. Yeah, right there. You just, the single point of failure? Definitely, yeah. So the question
00:33:52.000
was: didn't you just move the single point of failure from Amazon to Akamai? And you're right,
00:33:57.559
except that Akamai is a lot better about being up than we are; we're still pretty small compared to them. I think the last
00:34:04.120
time I read some of their marketing stuff, it was something like 20% of all internet traffic ultimately ends up
00:34:10.440
routing through Akamai. But replace Akamai with X, with some other replayer that you
00:34:16.200
trust. They're very good about staying up. I don't know if you guys
00:34:21.520
remember, maybe six months ago or so, they did have some downtime, which was pretty catastrophic, so clearly
00:34:27.520
they're not perfect either, but they're better than we are right now. So that's what we're doing
00:34:34.399
so far, but it's an evolving system. So, have you thought about a third failover?
00:34:39.919
So the question is: have you thought about a third failover? Absolutely. You could have a hundred
00:34:45.399
replayers if you wanted to; there's no reason why you couldn't. You could even have redundant Akamais. What if
00:34:51.480
you had Akamai and then X and then Y in between, right? You could scale up this kind of concept in many different
00:34:58.119
ways, to have something along the chain that captures those requests and stores them for later processing.
00:35:04.320
Absolutely. So,
00:35:09.400
yes. So the question is: how do you do deployments when you have multiple places that you're deploying to? So the
00:35:15.520
two applications, the main Rails app and Chocolate, the replayer, are both
00:35:20.760
very different, completely separate, and they are not deployed at the same time; they're not even deployed on the same day. As our standard operating
00:35:28.040
procedure, we do not touch both apps at the same time if we can ever avoid it,
00:35:33.440
and so far we've been able to avoid it. Yeah, pretty much. There are some parts of the code,
00:35:39.320
like the risk assessments, that are shared between those two applications, so we push those
00:35:44.880
things up as a gem, on a private gem host. Yep. And then lastly, just in
00:35:50.880
terms of tools, we just push directly to Heroku like everyone does, and we also
00:35:55.960
use Capistrano for the second one, which is on a VPS. So, other questions? Way in
00:36:02.160
the back. Awesome question. So the
00:36:09.640
question is: how are you assessing risk without depending on a third party? So
00:36:14.720
what we showed you is a super simplistic example; I think it was anything under 100 bucks is low risk,
00:36:20.880
right? That's one way to do it; that's pretty naive. There are other ways too. You could consider, for instance,
00:36:27.960
having, say, a nightly synchronization job between the two apps that checks
00:36:33.599
all of your most frequent customers, so customers that you know have been a customer for months and
00:36:40.079
months, or years, that place lots of orders and have perhaps a predictable pattern. You could take something like
00:36:45.599
that and say: yes, this person is low risk up to some dollar amount, some
00:36:51.400
preset, you know, threshold, and if your replayer ever encounters an order from that person, then boom, low risk,
00:36:58.480
accept it. A sketch of that idea follows.
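Here's a sketch of that nightly sync idea; the TrustedCustomer model, the thresholds, and the counter columns are all hypothetical.

```ruby
# Runs nightly on production, pushing a precomputed whitelist into the
# replayer's database so the replayer never needs an outside service
# to assess risk.
class TrustedCustomerSync
  def run
    Customer.where("created_at < ?", 6.months.ago)
            .where("completed_orders_count > ?", 50)
            .find_each do |customer|
      trusted = TrustedCustomer.find_or_initialize_by_customer_id(customer.id)
      trusted.max_amount_cents = 25_00 # low risk up to a preset threshold
      trusted.save!
    end
  end
end

# In the replayer, the risk check is then a purely local lookup:
def low_risk?(customer_id, amount_cents)
  trusted = TrustedCustomer.find_by_customer_id(customer_id)
  !trusted.nil? && amount_cents <= trusted.max_amount_cents
end
```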
00:37:04.440
There are many different ways you could do that; the idea is to move the logic into your replayer and never have the replayer depend on something outside in order to assess that risk. And so, everyone's
00:37:11.680
app is different, but for us, we found a couple of different ways we can do that very reliably. So, any other questions?
00:37:17.920
Yeah, right over here. Great question. So the question is: what if you just used a second gateway for your app? So what if
00:37:23.839
your first payment gateway goes down, you then fail over to a second gateway? The short answer is we actually already
00:37:29.560
do that. But the longer answer is that you can take this technique and
00:37:35.079
apply it to things beyond just a payment gateway. Like, what if, you know, you were doing something else that didn't rely on
00:37:41.680
payment gateways and you were somehow interested in something else? A lot of this is similar to delayed job
00:37:47.240
retries, right? If you're sending an email in a delayed job, delayed_job by default
00:37:52.520
will have up to three retries, I believe, so the same concept exists; that's just
00:37:57.760
inside your app. Same thing too: you could apply N payment gateways or N redundant services, but you
00:38:04.400
may not have that option, and if that's the case, you could use something like this. So I think we have time for maybe
00:38:09.839
one more question. Yes, right over
00:38:17.280
there. Great question. So the question is: what percentage of payments, what
00:38:22.680
percentage of orders while you're down, are considered high risk and thus rejected? So the number started off very, very high;
00:38:30.640
I don't have the figures in front of me, but I wouldn't be surprised if it was more than 50%. And we've tuned that over time as
00:38:37.800
we learn more about our customers, learn more about their habits, and then kind of figured out: well, what really is high risk
00:38:43.960
to us? Is it the dollar amount? Is a $5, you know, sandwich, is that high risk
00:38:49.960
enough that we're not willing to extend that out just in case it works? Or is it more about something
00:38:56.400
like patterns? And so for us, if someone places one order at $5, OK,
00:39:02.240
maybe that's low risk, because $5 isn't too much, especially when you're VC-backed, haha. But second, what if they start
00:39:09.520
placing lots of $5 orders, right away and all consecutively? Well, then that starts to become high risk, so we have to
00:39:16.760
put in some logic about that. And we've been tuning this metric quite a bit over time, and like we said before, we're still
00:39:23.240
investigating. We don't have a perfect answer, and I don't think there ever will be one, but it's something that we can tune and experiment with and learn from. So,
00:39:31.160
cool. Well, thanks, everyone. Thank
00:39:55.400
you.