rails db:migrate:safely

by Matt Duszynski

In the talk titled "rails db:migrate:safely" at RailsConf 2019, Matt Duszynski presents valuable insights on safely handling database migrations when working with large datasets, specifically focusing on relational databases like PostgreSQL. The discussion emphasizes the risks associated with copying and modifying millions of records during migrations, urging developers to learn from real-world mistakes.

Key points discussed include:
- Understanding Locks: Migrations can obtain exclusive locks, causing system downtime when adding or modifying columns, leading to user frustrations. For example, adding a column with a default value can lock the entire table, preventing all access until the operation completes.
- Transactions and Their Role: Locks acquired inside a transaction are held until the transaction commits, so a long-running migration transaction delays other updates. The problem intensifies with large data sets, where every row touched by a bulk update stays locked until the whole operation finishes.
- Best Practices for Migrations:
  - Avoid adding columns with default values.
  - Perform data updates in smaller transactions, rather than a single large batch, to keep the system responsive.
  - Disable the automatic DDL transaction when necessary, for example when backfilling data or creating indexes concurrently.
  - Use concurrent index creation to avoid table-level locks while indexes are built.
- Anecdotal Case Study: Duszynski shares a cautionary tale about an index change at Weedmaps that took down production for about 45 minutes, because the team had not realized how many read queries the auditing process was generating.
- Testing and Monitoring: Regular performance tests of database operations before, during, and after migrations are crucial to ensure stability and performance. Tools like strong_migrations can help catch unsafe migration practices early in the development cycle.

Conclusions and Takeaways:
- Developers must prioritize writing safe and efficient database migrations. By following best practices and being aware of database behavior during migrations, one can significantly reduce the risk of downtime or performance degradation.
- Uptime is paramount, and the effort put into safe migrations reflects directly on user satisfaction and system reliability.
- Ultimately, learning from past mistakes and being proactive in testing and monitoring can lead to a pain-free migration experience.

RailsConf 2019 - rails db:migrate:safely by Matt Duszynski
When you're dealing with fifty million records, simple migrations can become very dangerous. Come and learn from my mistakes instead of making your own. We'll talk about what's going on behind the scenes in your database, and how to safely write and run some common migrations. Remember, uptime begins at $HOME.

RailsConf 2019

00:00:20.689 thank you all for coming to my talk thank you all for coming to railsconf and thank you Megan for the wonderful introduction I'm sure a lot of y'all
00:00:28.650 have done this in fact all of y'all might have done this but how many know
00:00:35.129 how to do this by the end of this talk hopefully all of you before we get too
00:00:43.710 far in, I want to explain who I am. I'm Matt, Matt Duszynski. You can find
00:00:52.170 me on Twitter, you can find me on GitHub, please do. Fun fact: I can plug a USB cord
00:00:58.020 in the right way round, first time, every time. That's not true, actually.
00:01:06.450 Anyway, as Megan said, I work for a high-tech company in Southern California
00:01:12.479 we're possibly the highest tech company and we are building the platform for the
00:01:18.990 cannabis industry and the millions of cannabis users around the world sounds
00:01:24.330 lofty DHH might take me to task for that but if that sounds interesting to you definitely stop by the booth over in the
00:01:30.899 exhibit hall and talk to us, you can't miss it. We're always hiring, and it's a damn fine
00:01:37.500 place to work. If you just like having a good time and you're happy with your current job, stop by the booth anyway and
00:01:43.530 RSVP for our party tonight, which is going to be awesome. I hope to see y'all there. So, in my time
00:01:53.610 at Weedmaps, I've been there for a little over a year and a half, and unfortunately I have taken down
00:02:00.810 production before I go into that though I want to do a quick audience
00:02:06.660 participation, if I can. Can you
00:02:11.790 all raise your hands if you use a relational database like Postgres or MySQL as your primary data store? Well,
00:02:18.629 you're at a talk about database migrations, so I figured that was about what you'd expect. Second question:
00:02:25.680 how many of y'all have tables with 1 million rows or more in that database
00:02:33.000 okay, good, good. Then this talk is very applicable to you, because the sort of
00:02:40.370 migrations that work pretty well for 50 records tend to not work so well for 50
00:02:47.670 million. So for our first example,
00:02:54.540 let's talk about something that I think you would consider pretty simple adding
00:02:59.610 a column to the database let's say we have our user's table we have our
00:03:05.910 application we've got millions and millions of people using it because we're a really successful startup and we
00:03:15.090 want to be able to mark a user account as active or inactive I don't know how
00:03:23.220 you've got this many users without needing to mark them active but bear with me for the sake of this example
00:03:29.150 pretty simple database structure: we add a column, active, it's a boolean, true/
00:03:35.160 false, and we probably want to mark our existing users active so we don't lock
00:03:42.870 everybody out of the system. So in order to do that, you might write a migration
00:03:48.989 that looks something like this: add_column :users, :active, :boolean, default: true.
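For reference, a minimal sketch of the migration just described (the class name and Rails version here are assumptions, not from the talk):

```ruby
# Adding a column with a default: on Postgres versions before 11, this takes
# an ACCESS EXCLUSIVE lock on the users table and rewrites every existing row
# before the lock is released.
class AddActiveToUsers < ActiveRecord::Migration[5.2]
  def change
    add_column :users, :active, :boolean, default: true
  end
end
```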
00:03:54.120 Let's look at that. Job done, right? Wrong. No one can log in, no one can sign
00:04:05.280 up everyone is furious your app gets raked on Yahoo News your VC's get angry
00:04:11.640 it begins the downward spiral somehow a plane crashes into a mountain we're not sure how that happened but in order to
00:04:19.260 understand what's going on here we have to talk about what's going on under the
00:04:24.360 hood with locks locks there we go locks
00:04:31.080 are the mechanism that your relational database uses to ensure that multiple operations don't update the same row at
00:04:37.620 the same time there are a lot of different lock modes unfortunately too many to go into detail here but to give
00:04:45.030 you an idea they range from an access share which basically is the lowest level I'm
00:04:50.669 accessing data but anyone else can access it too I'm not doing anything dangerous all the way up to access
00:04:57.930 exclusive which is no one else can do anything with this piece of data that
00:05:03.300 I'm trying to access because I'm doing something that is potentially dangerous
00:05:08.659 like updating or deleting it. When you're doing a database
00:05:16.680 migration, even one as simple as the one we saw earlier, it will obtain an access
00:05:22.589 exclusive lock on the table in question and prevent any other connections and
00:05:28.979 the entire rest of your application from reading or writing data to that table
00:05:34.020 until the migration is complete so let's
00:05:39.930 watch that in real time real ish time with our example users table you'll have
00:05:50.039 to imagine the remaining nine point nine nine nine million users
00:05:55.229 I could only fit so many on the slide but our migration begins running and the
00:06:02.699 first thing it's going to do is obtain that access exclusive lock it's going to
00:06:09.659 add the column so far so good and then we're gonna begin filling that column
00:06:16.469 with active true so people can log in. So
00:06:21.539 our little elephant here is Postgres doing its thing we've marked the first
00:06:26.550 user Alice as active keep it going so far so good we marked Bob active uh-oh
00:06:34.580 first problem Alice is trying to log in but she can't because the entire table
00:06:41.939 is locked your application can't read data from it and she's stuck there at a
00:06:47.550 hung login screen or 500 error or she goes and uses your competitors application overall not good
00:06:55.349 Your migration's still running, though. We've marked the next row, Charlie, as active. Now Bob's trying to
00:07:03.120 log in same problem can't do it can't read the user data can't log in the
00:07:10.530 migration is continuing to run and users are continuing to try to log in and you
00:07:16.380 wind up with this pileup of queries that can't be responded to and this pileup of users that are very upset
00:07:24.270 the migration finally finishes running releases that lock now that it's done filling all the data and what you're
00:07:32.820 left with is a cadre of dissatisfied customers probably not what you were
00:07:38.580 planning on when you decided to mark users as active or inactive so how do we
00:07:44.430 avoid this well don't do it first of all
00:07:49.520 don't add columns with a default value. That's a dangerous operation because of the locking mode it uses, and it can and will
00:07:57.480 cause downtime if you have enough rows in your database and enough traffic on your system. It is worth mentioning at
00:08:03.870 this point that Postgres 11 actually addresses this particular problem in certain circumstances adding a static
00:08:11.340 default value no longer requires obtaining a table level access exclusive lock but note the caveat under certain
00:08:20.220 circumstances there are other circumstances where adding a default value for example backfilling a new UUID
00:08:26.490 column will obtain that lock and it will
00:08:31.710 be a dangerous operation also it's hard to stay up to date on that latest
00:08:37.919 version. For example, we use Aurora at Weedmaps, and Aurora does not support
00:08:44.580 Postgres 11 yet; the highest version they have is 10.6. So you're at the
00:08:50.430 behest of your database your SAS provider your DBA your ops team to be on
00:08:58.740 the latest version to be able to take advantage of those new features, which you might not be. So, general good
00:09:05.790 practice: don't add columns with a default value. Sweet, we know what to do now. We're not gonna
00:09:14.440 add that column with the default value, just get rid of that little bit right there. We're gonna add it without a
00:09:22.000 default value first, we're going to change the default once the column has been added, and then we're going to go
00:09:28.900 back and update all of our users to mark them as active. And of course we've split it into up and down, because we're good
00:09:35.530 Rails developers and our migration should be reversible. And we tested this in development and everything is hunky-dory.
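A sketch of that second attempt (again, class name and Rails version are assumptions). Note that everything, backfill included, still runs inside the migration's single transaction, which is exactly the problem discussed next:

```ruby
class AddActiveToUsers < ActiveRecord::Migration[5.2]
  def up
    add_column :users, :active, :boolean         # no default, so no table rewrite
    change_column_default :users, :active, true  # new rows get active = true
    User.update_all(active: true)                # backfill every existing row
  end

  def down
    remove_column :users, :active
  end
end
```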
00:09:41.170 Nailed it... not so. What's happening this
00:09:49.630 time? Sorry, I just wanted to see that loop one more time. So what's happening
00:09:57.610 this time? Now we've got to talk about transactions. Transactions basically
00:10:08.110 combine multiple database operations into a single all-or-nothing operation either all of your statements succeed
00:10:14.770 within the transaction and it is committed or something fails violates an
00:10:20.530 integrity constraint whatever goes wrong and the entire operation is rolled back
00:10:26.340 So transactions provide four guarantees: atomicity, consistency, isolation, and
00:10:31.930 durability, which you might hear referred to by the acronym ACID. Check out our
00:10:37.540 sister site LSD Maps to learn more not a real product so this is a topic for a
00:10:46.600 talk in and of itself probably many talks and we don't necessarily have to go into it in great depth here suffice
00:10:54.040 it to say that the consistency and isolation that a transaction provides are guaranteed by locks specifically
00:11:01.380 consistency means that any operation will take your database from a valid
00:11:06.730 state to another valid state if you write a row that row has to be valid in
00:11:11.920 order to be saved and in order to be read by subsequent queries isolation
00:11:17.290 guarantees that uncommitted operations from one connection
00:11:23.430 can't interfere with the operation of another connection. If the transaction hasn't been committed, the database
00:11:29.579 hasn't made that consistent transition to a new state and so other connections can't read the data and there's some
00:11:37.649 interesting stuff going on under the hood with tuples and indexing and
00:11:42.920 multi-layer indexing and again not going to get into that right now
00:11:49.220 come find me at the booth because I would love to nerd out on database stuff with you though when a rows being
00:11:57.149 updated, an exclusive lock is issued to prevent other people from updating that same row, which makes sense, right? If I've
00:12:04.500 got some data to write and you've got some data to write and we try to do it at the same time, what's going to come out? Well, we don't know; that violates the
00:12:11.699 consistency principle in relational databases, and so you cannot do that. Updates to rows are always made in a
00:12:19.500 serial manner and that serial order is guaranteed by locking within a
00:12:26.519 transaction locks are issued on a first-come first-served basis they live
00:12:32.339 for the duration of a transaction even if the statement that requested the lock has already executed so for most of your
00:12:40.589 database operations this might not be a very big problem because they usually happen on the order of milliseconds
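A rough illustration of that point, with assumed model and column names: row locks taken by an update live until the surrounding transaction commits, not until the statement that took them finishes.

```ruby
User.transaction do
  alice = User.find_by!(email: "alice@example.com")
  alice.update!(active: true)  # exclusive row lock on Alice's row is taken here...
  sleep 30                     # ...and is still held while everything else in the block runs
end                            # lock released only when the transaction commits
```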
00:12:47.360 where it does become a problem is as you saw earlier when you have to perform
00:12:53.120 millions of database operations on a very large data set it tends to take a
00:12:59.250 long time so how does this tie in to
00:13:05.339 migrations well through a little bit of that rails magic that we all know and
00:13:11.459 love everything you do in a migration is automatically wrapped inside a
00:13:16.589 transaction so but you can imagine what
00:13:24.690 this looks like, but fortunately you don't have to, because I have more mock tables. We're going to start after our
00:13:32.279 column's been added and it's time to mark our users active. Again, starting with row 1, we do not lock the
00:13:39.900 entire table this time because we're not adding a default value for the column
00:13:45.210 we're just individually updating all of our rows however in order to update a
00:13:52.680 row you have to obtain a lock on it so
00:13:57.810 we mark our first row as active: lock it, mark it active true, move on. And even
00:14:03.570 though that was successful as I mentioned that lock doesn't get released until your transaction can commit and
00:14:11.660 that can't happen until you update all of the rows just like you wrote in the migration so now we're on row 2 we
00:14:20.880 obtain another lock for that row we mark it as true we keep going in this manner
00:14:26.910 and now Alice is trying to log in, and unfortunately that row is locked, so
00:14:33.720 we can't write data to it. Now Charlie's
00:14:39.060 row is locked, mark him active. One thing to note about this is that it is actually better than
00:14:46.440 the first failure case that we described because some of your users those that
00:14:52.680 haven't been updated to be marked as active are not locked yet during this
00:14:58.710 migration and can log in successfully so Doug is a happy user and then he goes
00:15:06.660 off on his merry way to use your application none the wiser that there are any users experiencing issues and
00:15:12.770 now we move on to his row and now more users are trying to log in and lock that
00:15:19.560 and we mark it as true and Doug is now active fortunately for him he already
00:15:25.050 logged in and now our migration is done all of the locks are released as the
00:15:30.630 transaction is committed and you're left with slightly fewer but probably no less
00:15:37.260 angry users again not what you were intending to do now if you were
00:15:45.960 listening closely you might have noticed I said can't write data because update locks don't
00:15:53.360 prevent you from reading the previous value of the row until that transaction
00:15:58.580 is committed and the value is updated so when a user is logging in we're just
00:16:05.720 reading the password we can read data why is this a problem are you sure
00:16:13.279 you're not making updates when your user logs in you might want to consider it and even if this particularly contrived
00:16:22.310 example doesn't necessarily apply to you I'm sure there's a table in your application that you can't afford to
00:16:28.940 block updates and write traffic to for an e-commerce application that might be your orders table if you're a SAS
00:16:35.930 blogging service that might be your posts table there's always some reason
00:16:41.510 for your application to be writing data, and while you can chance it, and maybe two
00:16:48.920 unhappy users is an acceptable margin of error, it's better to just not backfill
00:16:55.880 data inside a transaction a lot of advice is just don't do it right if only
00:17:02.630 everything was so easy but hold on a
00:17:11.419 second we like transactions they have all those nice acid properties that we
00:17:18.020 want when we're manipulating a database we don't want to get into an inconsistent state we don't want connections to start conflicting with
00:17:24.740 each other that makes for an unstable system and that's just not gonna work so you'll notice it says don't backfill
00:17:33.740 data inside a transaction what we're gonna do is first pull this line out of
00:17:42.770 the migration because we're inside a transaction we like that we want our
00:17:48.620 column changes to happen in a safe manner and we're going to write a
00:17:53.870 completely separate migration to mark all of our users as active that runs after. The first line here,
00:18:02.300 the disable DDL transaction call, DDL being data definition language, disables that global
00:18:09.290 transaction that I mentioned earlier. So it's implicitly enabled, but you do
00:18:15.080 have the option to explicitly opt out of that behavior while you're running a particular migration. Then within our up
00:18:24.710 method, rather than marking every single user as active inside a transaction,
00:18:30.560 which is dangerous, and rather than marking every single user as active outside of a transaction, which is
00:18:36.290 dangerous for different reasons, what we're doing here is iterating through our users in batches and wrapping each
00:18:43.010 individual batch inside of a transaction. Batch size defaults to a thousand; of
00:18:49.400 course it's configurable based on your individual needs but the key here is that this transaction that's updating a
00:18:55.730 thousand rows is going to complete and commit much faster than a transaction
00:19:00.920 updating ten million rows orders of magnitude faster believe it or not so
00:19:06.320 that changes your lag time from minutes where your entire user base can't login
00:19:12.920 or use your software to the order of seconds or even less while an individual
00:19:18.350 subset of users might receive a slightly delayed response which in almost all
00:19:24.710 circumstances is an extremely extremely good trade-off and your users excuse me
00:19:30.590 your users most likely won't even notice that anything happened so our rule of
00:19:37.400 thumb here is don't mix schema and data changes because you're gonna wind up
00:19:43.970 getting lumped into this transaction, and that's gonna delay your updates, cause
00:19:50.240 exclusive locks to be held for long periods of time, and generally bottleneck
00:19:55.610 your system at the database don't forget
00:20:00.830 to put those updates back inside a transaction just at a scale that works
00:20:07.610 for you and works for your system and it isn't going to block for extended periods of time
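Putting the pieces from this section together, a sketch of the batched backfill being described (class name and batch size are assumptions; in_batches defaults to batches of 1,000):

```ruby
class BackfillUsersActive < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!   # opt out of the single wrapping migration transaction

  def up
    User.in_batches(of: 1_000) do |batch|
      # Each batch commits on its own, so row locks are held for a moment
      # rather than for the entire backfill.
      User.transaction do
        batch.update_all(active: true)
      end
    end
  end

  def down
    # Intentionally a no-op; the data backfill isn't meaningfully reversible here.
  end
end
```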
00:20:14.230 now we finally got our column added we finally got all the values put into it
00:20:20.170 we only cost two production outages pretty good week right and now we
00:20:27.700 probably want to be able to look up which of our users are active or not so
00:20:36.220 we're gonna add an index you'll note that this says for Postgres only and
00:20:41.260 that's because the internal implementation of indexing in MySQL doesn't have this particular
00:20:48.580 failure mode that we're about to discuss. Again, not going to talk about how
00:20:54.690 different meta indexes point to different tuples which point to actual disk locations suffice it to say that
00:21:01.120 Postgres has a slightly more complex indexing strategy that has a particular
00:21:06.880 edge case associated with it so we have
00:21:12.460 our very simple migration here, adding an index to the active column on users. Totally straightforward, nothing
00:21:21.370 could go wrong of course you've been at this talk long enough to know that's not
00:21:27.790 why I put that up on the screen in addition to running that inside a
00:21:33.610 transaction, adding an index in Postgres will by default obtain one of those
00:21:40.450 nasty table level locks that we described earlier while it reads all of
00:21:46.180 the data in the table and writes it all to the index in one fell swoop and just
00:21:51.970 like you are locked from updating an individual row while Postgres is trying
00:21:57.730 to write an index you can't make changes that would affect the way that index is
00:22:03.100 written such as updating a different row and so basically your database state is
00:22:09.730 frozen by Postgres until the index is calculated and written at which point
00:22:15.070 you can then modify rows again fortunately this is generally realized
00:22:22.510 as a bad thing, even by the folks who maintain Postgres, and they added an
00:22:28.980 algorithm that can create a new index and write it to disk without blocking
00:22:36.120 reads and writes. So in addition to disabling that transaction at the top,
00:22:41.770 just like we talked about earlier we're going to apply the algorithm concurrently to our index which tells
00:22:49.150 Postgres to add it in a non-blocking manner simplest thing in the world
00:22:56.280 note as I said Postgres only if you put algorithm concurrently on your migration
00:23:02.590 and you're using MySQL, this will fail with an error. You won't break anything, but it won't work either, so
00:23:10.770 don't try it or do try it your database
00:23:16.980 but whatever you do, do add Postgres indexes concurrently. It adds a little
00:23:22.600 bit of overhead, but it avoids massive table-level locking that will generally cause problems with your application.
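A sketch of the concurrent version of that index migration (class name and Rails version are assumptions; Postgres only). The DDL transaction has to be disabled because CREATE INDEX CONCURRENTLY cannot run inside a transaction block:

```ruby
class IndexUsersOnActive < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def change
    # Builds the index without holding a table-level lock on users.
    add_index :users, :active, algorithm: :concurrently
  end
end
```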
00:23:31.930 All right, so now that we've talked about several potential failure modes and
00:23:37.360 things that you shouldn't do now it's time to talk about a time when I did
00:23:44.320 those things that you shouldn't do. So at
00:23:50.320 Weedmaps we have a pretty significant audit trail of operations that people
00:23:55.810 are doing in the system makes sense right we want to know when people are
00:24:00.940 logging in who's making updates to what compliance security general goodness and what that
00:24:09.040 looks like at a very high level is we write actual changes to the model to our
00:24:15.130 main Postgres database and then we write an audit record for whatever changed
00:24:20.890 what operation happened, create, update, or delete, and the properties of the model that were changed, and we write
00:24:26.800 that to a separate database. You could easily write it to a separate table in the same database;
00:24:32.650 that's an architecture decision based on the size of your data set and scaling needs but I will say
00:24:40.030 I'm very excited for Rails 6 and the multi-DB support, because there's a lot of hackery that we had to do to make
00:24:46.120 this work but that's neither here nor there the specific case that we were
00:24:52.750 trying to address is that we needed to update a compound index on the audits
00:24:58.060 table if you haven't seen a compound index it's the same as a simple index
00:25:04.090 it's just on two columns. In this case it was a polymorphic association, and we wanted to index on both the type and
00:25:10.450 the ID column, because that's how we're going to be querying. So what do we do? We
00:25:16.480 drop the old index, we add the new index concurrently, of course. No locks, no
00:25:23.080 transactions, didn't make any obvious mistakes with the migration. What could
00:25:28.660 possibly go wrong?
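A sketch of roughly what that case-study migration could have looked like (the index and column names here are assumptions; the talk only says it was a compound index on the polymorphic type and ID columns of the audits table):

```ruby
class ReindexAuditsOnAuditable < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def up
    # Drop the old single-column index, then build the compound one
    # without taking a table-level lock.
    remove_index :audits, :auditable_id
    add_index :audits, [:auditable_type, :auditable_id], algorithm: :concurrently
  end

  def down
    remove_index :audits, [:auditable_type, :auditable_id]
    add_index :audits, :auditable_id, algorithm: :concurrently
  end
end
```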
00:25:39.940 A lot. Ultimately, we wound up taking down our entire production system for about 45 minutes, which is about 45 minutes in the middle
00:25:48.820 of the afternoon, as people are getting off work, as people are using our service
00:25:54.310 to try to, you know, find cannabis. So, not a great time to have a massive production outage. But
00:26:02.560 what happened if we look at the migration it looks pretty sane we're not
00:26:07.750 in a transaction we're adding that index concurrently what could possibly have
00:26:14.440 gone wrong why did we take down production with this indexing change in
00:26:20.500 order to know that we have to talk about how concurrency works now there's gonna
00:26:26.290 be some math on this next slide that's the math don't worry we'll walk through
00:26:32.110 it. Little's Law is a very simple equation that governs the relationship between
00:26:37.720 concurrency, throughput, and response time, which ultimately controls how many requests your system and your
00:26:44.770 application is able to handle within a given time frame. For example, if you have
00:26:51.790 a hundred requests per second and you're averaging a forty-millisecond response for those
00:26:59.110 requests, that means that at any given point in time you have four
00:27:04.240 requests being served by your application. This also works in reverse: if you have four threads in Puma
00:27:13.840 and your average response time is 40 milliseconds, you can calculate and know
00:27:19.930 that you can serve up to a theoretical maximum of 100 requests per second. Note
00:27:27.430 the relationship between response time and concurrency here: if I add a zero to
00:27:33.010 your response time, you also end up adding a zero to your concurrent requests.
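In code, the relationship being walked through here, using the example numbers from the talk:

```ruby
# Little's Law: average concurrency = throughput × average response time
throughput_rps  = 100    # requests per second
response_time_s = 0.040  # 40 ms average response
concurrency     = throughput_rps * response_time_s  # => 4.0 requests in flight

# Rearranged: 4 Puma threads at 40 ms per request caps theoretical throughput at
max_throughput = 4 / response_time_s                 # => 100.0 requests per second
```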
00:27:42.790 And I don't know how many of y'all run 40 or more servers, but what I do know is that scaling up from 4 to 40
00:27:49.360 is not something that you can do very quickly and possibly won't expect to do
00:28:01.900 at all. So Little's Law helps us determine the
00:28:01.900 theoretical maximum level of concurrency that your application can support at any given time if you use APM if you've
00:28:09.610 heard of Little's Law before, if you've applied it to your own application, you might be thinking in terms of Rails requests and responses, but your database
00:28:17.460 obeys the same principles it's a service it has a connection pool it takes
00:28:22.990 requests from your Rails app, and it generates responses. Where
00:28:35.410 you can run into problems is when requests start queuing, because they arrive faster than your application or
00:28:42.490 its database can respond to them if a database operation blocks many requests
00:28:50.230 for a long time you'll wind up saturating your connection pool you won't be able to serve you won't be able
00:28:58.150 to access your database first of all usually that causes problems for your
00:29:03.580 application so you probably won't be able to serve any requests and the entire system will
00:29:10.889 grind to a halt probably at the worst possible time when you're at a
00:29:16.019 conference and no one's on call so why
00:29:25.169 was this a problem for our audits well that's because we didn't understand what
00:29:33.210 was happening when those audit records were being written so now an auditing
00:29:38.610 overview, quick but accurate. First, we write
00:29:43.740 the actual changes to the database, but second, we have a little tiny read
00:29:51.539 query here just to see what version that record is at so we're able to
00:29:58.139 subsequently write the next version to the audit database so rather than being
00:30:04.500 a write-only append log that we thought we had, it turns out that every
00:30:13.730 audited database operation was actually generating a read in addition to the
00:30:19.350 write on that audits database. Rather than being 99% writes, unless a user
00:30:28.470 happens to want to look at the audit trail for their listing, we were actually about half and half reads and writes, and
00:30:40.159 that causes problems. Because why do you index things? Fast reads. What did we do?
00:30:48.419 we dropped an index and that changed our
00:30:54.809 query performance on the audit database from several milliseconds generally
00:31:01.200 acceptable to several seconds generally not acceptable and if you remember the
00:31:06.870 Little's Law multiplier earlier, that changed our level of concurrency, or the
00:31:15.539 number of requests that we can handle, by several orders of magnitude. So rather than being able to handle our
00:31:21.890 10,000 rpm production traffic at that particular time we wound up with enough
00:31:28.169 bandwidth to handle 10 RPM so what that
00:31:36.090 looks like with some more pretty animations is when we have our auto
00:31:41.220 scaling group we have our multi region AZ cool stuff this is true for a single
00:31:46.710 deployment as well, but I figured I might as well give you an accurate representation of the Weedmaps stack. We have
00:31:53.789 these different rails applications talking to the database and requests start coming in they're trying to
00:32:00.210 generate that audit trail as they've been programmed to do but queries that were previously quite fast are now quite
00:32:07.140 slow and so once all of these requests come in they're all trying to access
00:32:12.330 that audit database you wind up with
00:32:17.490 your connection pool saturated all waiting for a response from Postgres and
00:32:22.940 none of your applications can do anything else because all these requests
00:32:28.380 are hung open waiting for a response from the database server and then all
00:32:35.340 the other requests come in and completely take everything down note
00:32:42.000 that our particular example relates to our audits database and our audit trail
00:32:47.240 but you don't have to be writing an audit to be affected by this any long
00:32:52.679 running query can potentially saturate your database connection pool on any
00:32:59.340 table at any time and if that happens all of your queries are going to be stuck waiting for those long-running
00:33:05.850 queries to complete so what do you do
00:33:13.250 you do test your database performance. You not only need to understand the
00:33:19.620 performance characteristics of your application and how they change during and after a migration,
00:33:27.590 you need to do this on a regular basis so you know what your application is doing if we had
00:33:34.650 identified the audit reads as a significant portion of our database traffic before that migration ran we
00:33:41.520 likely would not have written it and run it in the manner that we did and even though query performance was fine before
00:33:48.770 the migration, and it would have been fine after the migration once that concurrent index was re-added. So slow
00:33:56.430 queries during the migration caused a cascading failure that took down our entire application so let's talk about
00:34:07.110 how you can do that: a little tooling and some resources to help keep your database
00:34:13.140 healthy while still allowing you to make schema changes first a couple gems there
00:34:21.179 are some static analysis tools that will warn in advance about certain unsafe migrations the things that I've covered
00:34:26.880 here and the things that haven't been covered here changing column type dropping a column adding a JSON column
00:34:33.659 to a Postgres database there's a lot of things that are unsafe that I just don't have time to cover but in general you
00:34:41.760 want to catch problems at dev time not deploy time if you write safe migrations
00:34:46.830 you'll run safe migrations so I advise checking out one of these two gems
00:34:52.380 strong_migrations, zero_downtime_migrations; possibly check out both of them. They hook into the db:migrate
00:35:02.070 rake task and will throw an error if your migration is written in an
00:35:07.440 unsafe manner. Which one works better for you will likely depend on your specific
00:35:12.660 database engine, version, schema, what you're doing; they don't exactly overlap,
00:35:18.840 they both cover some slightly different things.
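As a quick illustration, wiring in one of these gems is just a Gemfile line (shown here for strong_migrations; zero_downtime_migrations is added the same way):

```ruby
# Gemfile
# Once bundled, strong_migrations hooks into migrations and raises on unsafe
# operations (adding a column with a default on older Postgres, non-concurrent
# index creation, and so on) so they're caught at dev time, not deploy time.
gem "strong_migrations"
```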
00:35:24.960 And it's not technically a gem, but I did want to throw it in there that GitLab actually has a really cool,
00:35:31.140 powerful migration helper system that they've built in that handles a lot of
00:35:36.450 these things and more which unfortunately is not abstracted out into a gem so it's difficult to pull in and
00:35:42.630 use in your own project, but it's worth checking out, because they give a very good, detailed explanation of these
00:35:49.250 things that I've talked about here as well second APM I think you could probably
00:35:57.020 throw this slide in any talk about any application in any language but
00:36:02.619 understanding your performance is absolutely critical to understanding how migrations are going to change it there
00:36:10.400 are all sorts of options for this: New Relic, Skylight, Scout, AppSignal, many
00:36:17.030 others I hope all of you are using one already but if not maybe start by
00:36:24.320 checking out the ones that sponsored this conference but the most important tool in your toolbox is you you as the
00:36:34.820 developer are going to be your own best resource in not doing dangerous things static analysis can only go so far and
00:36:41.990 as we saw in the audits case study you can take down your database in a manner
00:36:47.210 that the software itself would be completely unable to detect because it
00:36:52.460 doesn't know how your queries work and it doesn't know what the relative performance of them will be you need to
00:37:00.680 keep up with changes in your database engine both knowing that there are changes knowing what's in those changes
00:37:06.710 and keeping your database up-to-date I wish we were on Postgres 11 but we're
00:37:13.430 not. Practice these things in a
00:37:18.950 low-stakes environment you don't want to have the first time you write a
00:37:24.580 complicated safe multistage migration to be when you're running that migration on
00:37:31.250 a multi million row table in production there's nothing specific or special
00:37:37.400 about these techniques that doesn't apply to smaller tables, and I advise
00:37:43.390 implementing this even in situations where you don't think you need it because when you write safe migrations
00:37:50.270 by default you never have that cognitive overhead of man how big is my table does
00:37:55.820 this one need to disable the transaction or can I run it in the transaction you don't want to
00:38:01.060 spend time thinking about it just do it safely all the time and you won't have
00:38:06.760 to worry it's always better to listen to
00:38:11.980 that nagging little voice in your head did I do this right is this safe am I
00:38:17.050 gonna take down prod I would always rather have someone prove that a
00:38:22.210 migration is safe, or challenge me to prove that a migration is safe, than assume it's safe by default, right up
00:38:30.160 until that little adrenaline hit or large adrenaline hit when all of your
00:38:35.320 metrics go into the toilet on a deploy because you've got a bad migration running quick little recap commit this
00:38:45.520 to your memory. Don't add columns with a default value.
00:38:52.950 Don't backfill data inside one big transaction; use many smaller ones.
00:38:59.260 don't mix schema and data changes in the same migration because you wind up with that global transaction that you need to
00:39:06.040 avoid do add your Postgres indexes concurrently and do monitor and test
00:39:13.900 your database performance before during and after migrations test all the
00:39:19.960 friggin' time. And remember: uptime begins at $HOME.