00:00:20.689
thank you all for coming to my talk thank you all for coming to railsconf and thank you Megan for the wonderful introduction I'm sure a lot of y'all
00:00:28.650
have done this in fact all of y'all might have done this but how many know
00:00:35.129
how to do this by the end of this talk hopefully all of you before we get too
00:00:43.710
far in I want to explain who I am I'm Matt, Matt Duszynski, you can find
00:00:52.170
me on Twitter you can find me on github please do fun fact I can plug a USB cord
00:00:58.020
in the right way round first time every time that's not true actually anyway
00:01:06.450
as Megan said I work for a high-tech company in Southern California
00:01:12.479
we're possibly the highest tech company and we are building the platform for the
00:01:18.990
cannabis industry and the millions of cannabis users around the world sounds
00:01:24.330
lofty DHH might take me to task for that but if that sounds interesting to you definitely stop by the booth over in the
00:01:30.899
exhibit hall and talk you can't miss it we're always hiring and it's a damn fine
00:01:37.500
place to work if you just like having a good time and you're happy with your current job stop by the booth anyway and
00:01:43.530
RSVP for our party tonight which is going to be awesome I hope to see y'all there so in my time
00:01:53.610
at Weedmaps I've been there for a little over a year and a half and unfortunately I have taken down
00:02:00.810
production before I go into that though I want to do a quick audience
00:02:06.660
participation if I can see can you
00:02:11.790
all raise your hands if you use a relational database like Postgres or MySQL as your primary data store well
00:02:18.629
you're at a talk about database migrations so I figured that was about what you'd expect second question
00:02:25.680
how many of y'all have tables with 1 million rows or more in that database
00:02:33.000
okay good good then this talk is very applicable to you because the sort of
00:02:40.370
migrations that work pretty well for 50 records tend to not work so well for 50
00:02:47.670
million so for our first example
00:02:54.540
let's talk about something that I think you would consider pretty simple adding
00:02:59.610
a column to the database let's say we have our user's table we have our
00:03:05.910
application we've got millions and millions of people using it because we're a really successful startup and we
00:03:15.090
want to be able to mark a user account as active or inactive I don't know how
00:03:23.220
you've got this many users without needing to mark them active but bear with me for the sake of this example
00:03:29.150
pretty simple database structure we had a column active it's a boolean is true
00:03:35.160
false and we probably want to mark our existing users active so we don't lock
00:03:42.870
everybody out of the system so in order to do that you might write a migration
00:03:48.989
that looks something like this add_column :users, :active, :boolean, default: true
00:03:54.120
let's look at that job done right wrong no one can log in no one can sign
00:04:05.280
up everyone is furious your app gets raked on Yahoo News your VCs get angry
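For reference, the culprit migration might be sketched as Rails code like this (the class name and Rails version are illustrative, not from the talk):

```ruby
class AddActiveToUsers < ActiveRecord::Migration[5.2]
  def change
    # Dangerous on a large table (pre-Postgres-11 behavior, as discussed
    # later in the talk): this statement takes an ACCESS EXCLUSIVE lock
    # on users and holds it while every existing row is rewritten with
    # the default value.
    add_column :users, :active, :boolean, default: true
  end
end
```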
00:04:11.640
it begins the downward spiral somehow a plane crashes into a mountain we're not sure how that happened but in order to
00:04:19.260
understand what's going on here we have to talk about what's going on under the
00:04:24.360
hood with locks locks there we go locks
00:04:31.080
are the mechanism that your relational database uses to ensure that multiple operations don't update the same row at
00:04:37.620
the same time there are a lot of different lock modes unfortunately too many to go into detail here but to give
00:04:45.030
you an idea they range from an access share which basically is the lowest level I'm
00:04:50.669
accessing data but anyone else can access it too I'm not doing anything dangerous all the way up to access
00:04:57.930
exclusive which is no one else can do anything with this piece of data that
00:05:03.300
I'm trying to access because I'm doing something that is potentially dangerous
00:05:08.659
like updating or deleting it why does this matter when you're doing a database
00:05:16.680
migration even one as simple as the one we saw earlier it will obtain an access
00:05:22.589
exclusive lock on the table in question and prevent any other connections and
00:05:28.979
the entire rest of your application from reading or writing data to that table
00:05:34.020
until the migration is complete so let's
00:05:39.930
watch that in real time real ish time with our example users table you'll have
00:05:50.039
to imagine the remaining nine point nine nine nine million users
00:05:55.229
I could only fit so many on the slide but our migration begins running and the
00:06:02.699
first thing it's going to do is obtain that access exclusive lock it's going to
00:06:09.659
add the column so far so good and then we're gonna begin filling that column
00:06:16.469
with active true so people can log in so
00:06:21.539
our little elephant here is Postgres doing its thing we've marked the first
00:06:26.550
user Alice as active keep it going so far so good we marked Bob active uh-oh
00:06:34.580
first problem Alice is trying to log in but she can't because the entire table
00:06:41.939
is locked your application can't read data from it and she's stuck there at a
00:06:47.550
hung login screen or 500 error or she goes and uses your competitors application overall not good
00:06:55.349
your migration's still running though mark the next row Charlie as active now Bob's trying to
00:07:03.120
log in same problem can't do it can't read the user data can't log in the
00:07:10.530
migration is continuing to run and users are continuing to try to log in and you
00:07:16.380
wind up with this pileup of queries that can't be responded to and this pileup of users that are very upset
00:07:24.270
the migration finally finishes running releases that lock now that it's done filling all the data and what you're
00:07:32.820
left with is a cadre of dissatisfied customers probably not what you were
00:07:38.580
planning on when you decided to mark users as active or inactive so how do we
00:07:44.430
avoid this well don't do it first of all
00:07:49.520
don't add columns with a default value that's a dangerous operation because of the locking mode it uses and it can and will
00:07:57.480
cause downtime if you have enough rows in your database and enough traffic on your system it is worth mentioning at
00:08:03.870
this point that Postgres 11 actually addresses this particular problem in certain circumstances adding a static
00:08:11.340
default value no longer requires obtaining a table level access exclusive lock but note the caveat under certain
00:08:20.220
circumstances there are other circumstances where adding a default value for example backfilling a new UUID
00:08:26.490
column will obtain that lock and it will
00:08:31.710
be a dangerous operation also it's hard to stay up to date on that latest
00:08:37.919
version for example we use Aurora at Weedmaps and Aurora does not support
00:08:44.580
Postgres 11 the highest version they have is 10.6 so you're at the
00:08:50.430
behest of your database your SaaS provider your DBA your ops team to be on
00:08:58.740
the latest version to be able to take advantage of those new features which you might not be so general good
00:09:05.790
practice don't add columns with a default value sweet we know what to do now we're not gonna
00:09:14.440
add that column with the default value just get rid of that little bit right there we're gonna add it without a
00:09:22.000
default value first we're going to change the default once the column has been added and then we're going to go
00:09:28.900
back and update all of our users to mark them as active and of course we've split it into up and down because we're good
00:09:35.530
rails developers and our migration should be reversible and we tested this in development and everything is
00:09:41.170
hunky-dory nailed it not so what's happening this
00:09:49.630
time sorry I just wanted to see that loop one more time so what's happening
00:09:57.610
this time now we got to talk about transactions transactions basically
00:10:08.110
combine multiple database operations into a single all-or-nothing operation either all of your statements succeed
00:10:14.770
within the transaction and it is committed or something fails violates an
00:10:20.530
integrity constraint whatever goes wrong and the entire operation is rolled back
00:10:26.340
so transactions provide four guarantees atomicity consistency isolation and
00:10:31.930
durability which you might hear referred to by the acronym acid check out our
00:10:37.540
sister site LSD Maps to learn more not a real product so this is a topic for a
00:10:46.600
talk in and of itself probably many talks and we don't necessarily have to go into it in great depth here suffice
00:10:54.040
it to say that the consistency and isolation that a transaction provides are guaranteed by locks specifically
00:11:01.380
consistency means that any operation will take your database from a valid
00:11:06.730
state to another valid state if you write a row that row has to be valid in
00:11:11.920
order to be saved and in order to be read by subsequent queries isolation
00:11:17.290
guarantees that uncommitted operations from one connection
00:11:23.430
can't interfere with the operation of another connection if the transaction hasn't been committed the database
00:11:29.579
hasn't made that consistent transition to a new state and so other connections can't read the data and there's some
00:11:37.649
interesting stuff going on under the hood with tuples and indexing and
00:11:42.920
multi-layer indexing and again not going to get into that right now
00:11:49.220
come find me at the booth because I would love to nerd out on database stuff with you though when a rows being
00:11:57.149
updated an exclusive lock is issued to prevent other people from updating that same row which makes sense right if I've
00:12:04.500
got some data to write and you've got some data to write and we try to do it at the same time what's going to come out well we don't know that violates the
00:12:11.699
consistency principle in relational databases and so you cannot do that updates to rows are always made in a
00:12:19.500
serial manner and that serial order is guaranteed by locking within a
00:12:26.519
transaction locks are issued on a first-come first-served basis they live
00:12:32.339
for the duration of a transaction even if the statement that requested the lock has already executed so for most of your
00:12:40.589
database operations this might not be a very big problem because they usually happen on the order of milliseconds
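The lock lifetime just described is why a single giant transaction around a backfill is dangerous; a sketch of the pattern to avoid, with `User` as a stand-in ActiveRecord model:

```ruby
# Anti-pattern: one transaction around the whole backfill.
# Each row lock is acquired as its UPDATE runs, but every lock is
# held until the final COMMIT -- minutes, on a ten-million-row table.
User.transaction do
  User.find_each do |user|
    user.update!(active: true) # row-level exclusive lock until COMMIT
  end
end
```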
00:12:47.360
where it does become a problem is as you saw earlier when you have to perform
00:12:53.120
millions of database operations on a very large data set it tends to take a
00:12:59.250
long time so how does this tie in to
00:13:05.339
migrations well through a little bit of that rails magic that we all know and
00:13:11.459
love everything you do in a migration is automatically wrapped inside a
00:13:16.589
transaction so you can imagine what
00:13:24.690
this looks like but fortunately you don't have to because I have more mock tables we're going to start after our
00:13:32.279
column's been added and it's time to mark our users active again starting with row 1 we do not lock the
00:13:39.900
entire table this time because we're not adding a default value for the column
00:13:45.210
we're just individually updating all of our rows however in order to update a
00:13:52.680
row you have to obtain a lock on it so
00:13:57.810
we mark our first row as active lock it mark it active true move on and even
00:14:03.570
though that was successful as I mentioned that lock doesn't get released until your transaction can commit and
00:14:11.660
that can't happen until you update all of the rows just like you wrote in the migration so now we're on row 2 we
00:14:20.880
obtain another lock for that row we mark it as true we keep going in this manner
00:14:26.910
and now Alice is trying to log in and unfortunately that row is locked so
00:14:33.720
we can't write data to it now Charlie's
00:14:39.060
Row is locked mark him active one thing to note about this is that it is actually better than
00:14:46.440
the first failure case that we described because some of your users those that
00:14:52.680
haven't been updated to be marked as active are not locked yet during this
00:14:58.710
migration and can log in successfully so Doug is a happy user and then he goes
00:15:06.660
off on his merry way to use your application none the wiser that there are any users experiencing issues and
00:15:12.770
now we move on to his row and now more users are trying to log in we lock that
00:15:19.560
and we mark it as true and Doug is now active fortunately for him he already
00:15:25.050
logged in and now our migration is done all of the locks are released as the
00:15:30.630
transaction is committed and you're left with slightly fewer but probably no less
00:15:37.260
angry users again not what you were intending to do now if you were
00:15:45.960
listening closely you might have noticed I said can't write data because update locks don't
00:15:53.360
prevent you from reading the previous value of the row until that transaction
00:15:58.580
is committed and the value is updated so when a user is logging in we're just
00:16:05.720
reading the password we can read data why is this a problem are you sure
00:16:13.279
you're not making updates when your user logs in you might want to consider it and even if this particularly contrived
00:16:22.310
example doesn't necessarily apply to you I'm sure there's a table in your application that you can't afford to
00:16:28.940
block updates and write traffic to for an e-commerce application that might be your orders table if you're a SAS
00:16:35.930
blogging service that might be your posts table there's always some reason
00:16:41.510
for your application to be writing data and while you can chance it and maybe two
00:16:48.920
unhappy users is an acceptable margin of error it's better to just not backfill
00:16:55.880
data inside a transaction a lot of advice is just don't do it right if only
00:17:02.630
everything was so easy but hold on a
00:17:11.419
second we like transactions they have all those nice acid properties that we
00:17:18.020
want when we're manipulating a database we don't want to get into an inconsistent state we don't want connections to start conflicting with
00:17:24.740
each other that makes for an unstable system and that's just not gonna work so you'll notice it says don't backfill
00:17:33.740
data inside a transaction what we're gonna do is first pull this line out of
00:17:42.770
the migration because we're inside a transaction we like that we want our
00:17:48.620
column changes to happen in a safe manner and we're going to write a
00:17:53.870
completely separate migration to mark all of our users as active that runs after the first the line here that you
00:18:02.300
see disable_ddl_transaction! DDL meaning data definition language disables that global
00:18:09.290
transaction that I mentioned earlier so it's implicitly enabled but you do
00:18:15.080
have the option to explicitly opt out of that behavior while you're running a particular migration then within our up
00:18:24.710
method rather than marking every single user as active inside a transaction
00:18:30.560
which is dangerous and rather than marking every single user as active outside of a transaction which is
00:18:36.290
dangerous for different reasons what we're doing here is iterating through our users in batches and wrapping each
00:18:43.010
individual batch inside of a transaction batch size defaults to a thousand of
00:18:49.400
course it's configurable based on your individual needs but the key here is that this transaction that's updating a
00:18:55.730
thousand rows is going to complete and commit much faster than a transaction
00:19:00.920
updating ten million rows orders of magnitude faster believe it or not so
00:19:06.320
that changes your lag time from minutes where your entire user base can't login
00:19:12.920
or use your software to the order of seconds or even less while an individual
00:19:18.350
subset of users might receive a slightly delayed response which in almost all
00:19:24.710
circumstances is an extremely extremely good trade-off and your users excuse me
00:19:30.590
your users most likely won't even notice that anything happened so our rule of
00:19:37.400
thumb here is don't mix schema and data changes because you're gonna wind up
00:19:43.970
getting lumped into one transaction and that's gonna delay your updates cause
00:19:50.240
extensive locks to be held for long periods of time and generally bottleneck
00:19:55.610
your system at the database don't forget
00:20:00.830
to put those updates back inside a transaction just at a scale that works
00:20:07.610
for you and works for your system and it isn't going to block for extended periods of time
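Putting that together, the safe approach described above might look like two separate migrations, one for schema and one for data, with the backfill committing a batch at a time (class names and batch size are illustrative):

```ruby
# Migration 1: schema change only; safe inside the default transaction.
class AddActiveToUsers < ActiveRecord::Migration[5.2]
  def change
    add_column :users, :active, :boolean
    change_column_default :users, :active, true
  end
end

# Migration 2: data change only, with the global transaction disabled.
class BackfillUsersActive < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def up
    # Each batch commits on its own, so row locks are held for
    # milliseconds rather than for the entire backfill.
    User.in_batches(of: 1000) do |batch|
      batch.update_all(active: true)
    end
  end

  def down
    # No data rollback needed; removing the column reverts everything.
  end
end
```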
00:20:14.230
now we finally got our column added we finally got all the values put into it
00:20:20.170
it only cost us two production outages pretty good week right and now we
00:20:27.700
probably want to be able to look up which of our users are active or not so
00:20:36.220
we're gonna add an index you'll note that this says for Postgres only and
00:20:41.260
that's because the internal implementation of indexing in MySQL doesn't have this particular
00:20:48.580
failure mode that we're about to discuss again not going to talk about how
00:20:54.690
different meta indexes point to different tuples which point to actual disk locations suffice it to say that
00:21:01.120
Postgres has a slightly more complex indexing strategy that has a particular
00:21:06.880
edge case associated with it so we have
00:21:12.460
our very simple migration here we're adding an index to the active column on users totally straightforward nothing
00:21:21.370
could go wrong of course you've been at this talk long enough to know that's not
00:21:27.790
why I put that up on the screen in addition to running that inside a
00:21:33.610
transaction adding an index in Postgres will by default obtain one of those
00:21:40.450
nasty table level locks that we described earlier while it reads all of
00:21:46.180
the data in the table and writes it all to the index in one fell swoop and just
00:21:51.970
like you are locked from updating an individual row while Postgres is trying
00:21:57.730
to write an index you can't make changes that would affect the way that index is
00:22:03.100
written such as updating a different row and so basically your database state is
00:22:09.730
frozen by Postgres until the index is calculated and written at which point
00:22:15.070
you can then modify rows again fortunately this is generally realized
00:22:22.510
as a bad thing even by the folks who maintain Postgres and they added a very clever
00:22:28.980
algorithm that can create a new index and write it to disk without blocking
00:22:36.120
reads or writes so in addition to disabling that transaction at the top
00:22:41.770
just like we talked about earlier we're going to apply algorithm: :concurrently to our index which tells
00:22:49.150
Postgres to add it in a non-blocking manner simplest thing in the world
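In migration form, the concurrent version might look like this (Postgres only; class name illustrative):

```ruby
class IndexUsersOnActive < ActiveRecord::Migration[5.2]
  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
  # so the implicit migration transaction must be disabled.
  disable_ddl_transaction!

  def change
    add_index :users, :active, algorithm: :concurrently
  end
end
```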
00:22:56.280
note as I said Postgres only if you put algorithm: :concurrently on your migration
00:23:02.590
and you're using MySQL this will fail with an error you won't break anything but it won't work either so
00:23:10.770
don't try it or do try it it's your database
00:23:16.980
but whatever you do do add Postgres indexes concurrently it adds a little
00:23:22.600
bit of overhead but it avoids massive table level locking that will generally cause problems with your application all
00:23:31.930
right so now that we've talked about several potential failure modes and
00:23:37.360
things that you shouldn't do now it's time to talk about a time when I did
00:23:44.320
those things that you shouldn't do so at
00:23:50.320
Weedmaps we have a pretty significant audit trail of operations that people
00:23:55.810
are doing in the system makes sense right we want to know when people are
00:24:00.940
logging in who's making updates to what compliance security general goodness and what that
00:24:09.040
looks like at a very high level is we write actual changes to the model to our
00:24:15.130
main Postgres database and then we write an audit record for whatever changed
00:24:20.890
what operation happened to create update or delete and the properties of the model that were changed and we write
00:24:26.800
that we write that to a separate database you could easily write it to a separate table in the same database
00:24:32.650
that's an architecture decision based on the size of your data set and scaling needs but I will say
00:24:40.030
I'm very excited for Rails 6 and the multi-DB support because there's a lot of hackery that we had to do to make
00:24:46.120
this work but that's neither here nor there the specific case that we were
00:24:52.750
trying to address is that we needed to update a compound index on the audits
00:24:58.060
table if you haven't seen a compound index it's the same as a simple index
00:25:04.090
it's just on two columns in this case it was a polymorphic association and we wanted to index on both the type and
00:25:10.450
the ID column because that's how we're going to be querying so what do we do we
00:25:16.480
drop the old index we add the new index concurrently of course no locks no
00:25:23.080
transactions didn't make any obvious mistakes with the migration what could
00:25:28.660
possibly go wrong a lot ultimately we
00:25:39.940
wound up taking down our entire production system for about 45 minutes which is about 45 minutes in the middle
00:25:48.820
of the afternoon as people are getting off work as people are using our service
00:25:54.310
to try to find cannabis so a bad time to have a massive production outage but
00:26:02.560
what happened if we look at the migration it looks pretty sane we're not
00:26:07.750
in a transaction we're adding that index concurrently what could possibly have
00:26:14.440
gone wrong why did we take down production with this indexing change in
00:26:20.500
order to know that we have to talk about how concurrency works now there's gonna
00:26:26.290
be some math on this next slide that's the math don't worry we'll walk through
00:26:32.110
it Little's Law is a very simple equation that governs the relationship between
00:26:37.720
concurrency throughput and response time which ultimately controls how many requests your system and your
00:26:44.770
application is able to handle within a given time frame for example if you have
00:26:51.790
a hundred requests per second and you're averaging a forty millisecond response for those
00:26:59.110
requests that means that at any given point in time you have four
00:27:04.240
requests being served by your application this also works in Reverse where if you have four threads in Puma
00:27:13.840
and your average response time is 40 milliseconds you can calculate and know
00:27:19.930
that you can serve up to a theoretical maximum of 100 requests per second note
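As a quick sketch in code, the arithmetic here is Little's Law, L = λ × W (concurrency equals arrival rate times average response time):

```ruby
# Little's Law: average concurrency L = arrival rate (requests/sec)
# * average time each request spends in the system (seconds).
def concurrency(arrival_rate_per_sec, avg_response_sec)
  arrival_rate_per_sec * avg_response_sec
end

concurrency(100, 0.040) # => 4.0 requests in flight at 40 ms responses
concurrency(100, 0.400) # => 40.0 add a zero to response time, add a zero to concurrency
```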
00:27:27.430
the relationship between response time and concurrency here if I add a zero to
00:27:33.010
your response time you also end up adding a zero to your concurrent requests and I don't know how many of
00:27:42.790
y'all run 40 or more servers but what I do know is that scaling up from 4 to 40
00:27:49.360
is not something that you can do very quickly and possibly won't expect to do
00:27:55.630
at all so Little's Law helps us determine the
00:28:01.900
theoretical maximum level of concurrency that your application can support at any given time if you use APM if you've
00:28:09.610
heard of Little's Law before if you've applied it to your own application you might be thinking in terms of Rails requests and responses but your database
00:28:17.460
obeys the same principles it's a service it has a connection pool it takes
00:28:22.990
requests from your Rails app and it generates responses where
00:28:35.410
you can run into problems here is when requests start queuing because they arrive faster than your application or
00:28:42.490
its database can respond to them if a database operation blocks many requests
00:28:50.230
for a long time you'll wind up saturating your connection pool you won't be able
00:28:58.150
to access your database first of all usually that causes problems for your
00:29:03.580
application so you probably won't be able to serve any requests and the entire system will
00:29:10.889
grind to a halt probably at the worst possible time when you're at a
00:29:16.019
conference and no one's on call so why
00:29:25.169
was this a problem for our audits well that's because we didn't understand what
00:29:33.210
was happening when those audit records were being written so now an auditing
00:29:38.610
overview simplified but accurate first we write
00:29:43.740
the actual changes to the database but second we have a little tiny read
00:29:51.539
query here just to see what version that record is at so we're able to
00:29:58.139
subsequently write the next version to the audit database so rather than being
00:30:04.500
a write-only append log that we thought we had it turns out that every
00:30:13.730
audited database operation was actually generating a read in addition to the
00:30:19.350
write on that audits database rather than being 99% writes unless a user
00:30:28.470
happens to want to look at the audit trail for their listing we were actually about half and half reads and writes and
00:30:40.159
that causes problems because why do you index things fast reads what did we do
00:30:48.419
we dropped an index and that changed our
00:30:54.809
query performance on the audit database from several milliseconds generally
00:31:01.200
acceptable to several seconds generally not acceptable and if you remember the
00:31:06.870
Little's Law multiplier earlier that changed our level of concurrency or the
00:31:15.539
number of requests that we can handle by several orders of magnitude so rather than being able to handle our
00:31:21.890
10,000 rpm production traffic at that particular time we wound up with enough
00:31:28.169
bandwidth to handle 10 RPM so what that
00:31:36.090
looks like with some more pretty animations is when we have our auto
00:31:41.220
scaling group we have our multi-region multi-AZ cool stuff this is true for a single
00:31:46.710
deployment as well but I figured I might as well give you an accurate representation of the Weedmaps stack we have
00:31:53.789
these different rails applications talking to the database and requests start coming in they're trying to
00:32:00.210
generate that audit trail as they've been programmed to do but queries that were previously quite fast are now quite
00:32:07.140
slow and so once all of these requests come in they're all trying to access
00:32:12.330
that audit database you wind up with
00:32:17.490
your connection pool saturated all waiting for a response from Postgres and
00:32:22.940
none of your applications can do anything else because all these requests
00:32:28.380
are hung open waiting for a response from the database server and then all
00:32:35.340
the other requests come in and completely take everything down note
00:32:42.000
that our particular example relates to our audits database and our audit trail
00:32:47.240
but you don't have to be writing an audit to be affected by this any long
00:32:52.679
running query can potentially saturate your database connection pool on any
00:32:59.340
table at any time and if that happens all of your queries are going to be stuck waiting for those long-running
00:33:05.850
queries to complete so what do you do
00:33:13.250
you do test your database performance you need to understand the
00:33:19.620
performance characteristics of your application to understand how they change during and after a migration and
00:33:27.590
you need to do this on a regular basis so you know what your application is doing if we had
00:33:34.650
identified the audit reads as a significant portion of our database traffic before that migration ran we
00:33:41.520
likely would not have written it and run it in the manner that we did and even though query performance was fine before
00:33:48.770
the migration and it would have been fine after the migration once that concurrent index was re-added so slow
00:33:56.430
queries during the migration caused a cascading failure that took down our entire application so let's talk about
00:34:07.110
how you can do that some tools and resources to help keep your database
00:34:13.140
healthy while still allowing you to make schema changes first a couple gems there
00:34:21.179
are some static analysis tools that will warn in advance about certain unsafe migrations the things that I've covered
00:34:26.880
here and the things that haven't been covered here changing column type dropping a column adding a JSON column
00:34:33.659
to a Postgres database there's a lot of things that are unsafe that I just don't have time to cover but in general you
00:34:41.760
want to catch problems at dev time not deploy time if you write safe migrations
00:34:46.830
you'll run safe migrations so I advise checking out one of these two gems
00:34:52.380
strong_migrations zero_downtime_migrations possibly check out both of them they hook into the
00:35:02.070
db:migrate rake task and will throw an error if your migration is written in an
00:35:07.440
unsafe manner which one works better for you will likely depend on your specific
00:35:12.660
database engine version schema what you're doing they don't exactly overlap
00:35:18.840
they both cover some slightly different things and it's not technically a gem
00:35:24.960
but I did want to throw it in there that gitlab actually has a really really cool
00:35:31.140
powerful migration helper system that they've built in that handles a lot of
00:35:36.450
these things and more which unfortunately is not abstracted out into a gem so it's difficult to pull in and
00:35:42.630
use in your own project but it's worth checking out because they give a very good detailed explanation of these
00:35:49.250
things that I've talked about here as well second APM I think you could probably
00:35:57.020
throw this slide in any talk about any application in any language but
00:36:02.619
understanding your performance is absolutely critical to understanding how migrations are going to change it there
00:36:10.400
are all sorts of options for this New Relic Skylight Scout AppSignal many
00:36:17.030
others I hope all of you are using one already but if not maybe start by
00:36:24.320
checking out the ones that sponsored this conference but the most important tool in your toolbox is you you as the
00:36:34.820
developer are going to be your own best resource in not doing dangerous things static analysis can only go so far and
00:36:41.990
as we saw in the audits case study you can take down your database in a manner
00:36:47.210
that the software itself would be completely unable to detect because it
00:36:52.460
doesn't know how your queries work and it doesn't know what the relative performance of them will be you need to
00:37:00.680
keep up with changes in your database engine both knowing that there are changes knowing what's in those changes
00:37:06.710
and keeping your database up-to-date I wish we were on Postgres 11 but we're
00:37:13.430
not practice these things in a
00:37:18.950
low-stakes environment you don't want to have the first time you write a
00:37:24.580
complicated safe multistage migration to be when you're running that migration on
00:37:31.250
a multi million row table in production there's nothing specific or special
00:37:37.400
about these techniques that doesn't apply to smaller tables and I advise
00:37:43.390
implementing this even in situations where you don't think you need it because when you write safe migrations
00:37:50.270
by default you never have that cognitive overhead of man how big is my table does
00:37:55.820
this one need to disable the transaction or can I run it in the transaction you don't want to
00:38:01.060
spend time thinking about it just do it safely all the time and you won't have
00:38:06.760
to worry it's always better to listen to
00:38:11.980
that nagging little voice in your head did I do this right is this safe am I
00:38:17.050
gonna take down prod I would always rather have someone prove that a
00:38:22.210
migration is safe or challenge me to prove that a migration is safe than assume it's safe by default right up
00:38:30.160
until that little adrenaline hit or large adrenaline hit when all of your
00:38:35.320
metrics go into the toilet on a deploy because you've got a bad migration running quick little recap commit this
00:38:45.520
to your memory don't add columns with a default value
00:38:52.950
don't backfill data inside a single transaction use many smaller ones
00:38:59.260
don't mix schema and data changes in the same migration because you wind up with that global transaction that you need to
00:39:06.040
avoid do add your Postgres indexes concurrently and do monitor and test
00:39:13.900
your database performance before during and after migrations test all the
00:39:19.960
friggin time and remember uptime begins