rails db:migrate:safely

by Matt Duszynski

In the talk titled "rails db:migrate:safely" at RailsConf 2019, Matt Duszynski presents valuable insights on safely handling database migrations when working with large datasets, specifically focusing on relational databases like PostgreSQL. The discussion emphasizes the risks associated with copying and modifying millions of records during migrations, urging developers to learn from real-world mistakes.

Key points discussed include:
- Understanding Locks: Migrations can obtain exclusive locks, causing system downtime when adding or modifying columns, leading to user frustrations. For example, adding a column with a default value can lock the entire table, preventing all access until the operation completes.
- Transactions and Their Role: Locks acquired inside a transaction are held until the transaction commits, so a long-running migration transaction delays other updates. The problem intensifies with large data sets, where every row touched by a bulk update stays locked until the whole operation finishes.
- Best Practices for Migrations:
  - Avoid adding columns with default values.
  - Perform data updates in smaller transactions, rather than a single large batch, to keep the system responsive.
  - Disable the automatic DDL transaction when necessary, for example when backfilling data or creating indexes concurrently.
  - Use concurrent index creation to avoid table-level locks while indexes are built.
- Anecdotal Case Study: Duszynski shares a cautionary tale about an index change at Weedmaps that took down production for about 45 minutes, because the team had not realized how many read queries the auditing process was generating.
- Testing and Monitoring: Regular performance tests of database operations before, during, and after migrations are crucial to ensure stability and performance. Tools like strong_migrations can help catch unsafe migration practices early in the development cycle.

Conclusions and Takeaways:
- Developers must prioritize writing safe and efficient database migrations. By following best practices and being aware of database behavior during migrations, one can significantly reduce the risk of downtime or performance degradation.
- Uptime is paramount, and the effort put into safe migrations reflects directly on user satisfaction and system reliability.
- Ultimately, learning from past mistakes and being proactive in testing and monitoring can lead to a pain-free migration experience.

RailsConf 2019 - rails db:migrate:safely by Matt Duszynski
When you're dealing with fifty million records, simple migrations can become very dangerous. Come and learn from my mistakes instead of making your own. We'll talk about what's going on behind the scenes in your database, and how to safely write and run some common migrations. Remember, uptime begins at $HOME.

RailsConf 2019

00:00:20.689 thank you all for coming to my talk thank you all for coming to railsconf and thank you Megan for the wonderful introduction I'm sure a lot of y'all
00:00:28.650 have done this in fact all of y'all might have done this but how many know
00:00:35.129 how to do this by the end of this talk hopefully all of you before we get too
00:00:43.710 far in, I want to explain who I am. I'm Matt, Matt Duszynski. You can find
00:00:52.170 me on Twitter, you can find me on GitHub, please do. Fun fact: I can plug a USB cord
00:00:58.020 in the right way round, first time, every time. That's not true, actually.
00:01:06.450 Anyway, as Megan said, I work for a high-tech company in Southern California
00:01:12.479 we're possibly the highest tech company and we are building the platform for the
00:01:18.990 cannabis industry and the millions of cannabis users around the world sounds
00:01:24.330 lofty DHH might take me to task for that but if that sounds interesting to you definitely stop by the booth over in the
00:01:30.899 exhibit hall and talk to us, you can't miss it. We're always hiring, and it's a damn fine
00:01:37.500 place to work. If you just like having a good time and you're happy with your current job, stop by the booth anyway and
00:01:43.530 RSVP for our party tonight, which is going to be awesome. I hope to see y'all there. So, in my time
00:01:53.610 at Weedmaps, I've been there for a little over a year and a half, and unfortunately I have taken down
00:02:00.810 production before I go into that though I want to do a quick audience
00:02:06.660 participation, if I can. Can you
00:02:11.790 all raise your hands if you use a relational database like Postgres or MySQL as your primary data store? Well,
00:02:18.629 you're at a talk about database migrations, so I figured that was about what you'd expect. Second question:
00:02:25.680 how many of y'all have tables with 1 million rows or more in that database
00:02:33.000 okay, good, good. Then this talk is very applicable to you, because the sort of
00:02:40.370 migrations that work pretty well for 50 records tend to not work so well for 50
00:02:47.670 million. So for our first example,
00:02:54.540 let's talk about something that I think you would consider pretty simple adding
00:02:59.610 a column to the database let's say we have our user's table we have our
00:03:05.910 application we've got millions and millions of people using it because we're a really successful startup and we
00:03:15.090 want to be able to mark a user account as active or inactive I don't know how
00:03:23.220 you've got this many users without needing to mark them active but bear with me for the sake of this example
00:03:29.150 pretty simple database structure: we add a column, active, it's a boolean, true/
00:03:35.160 false, and we probably want to mark our existing users active so we don't lock
00:03:42.870 everybody out of the system. So in order to do that, you might write a migration
00:03:48.989 that looks something like this: add_column :users, :active, :boolean, default: true.
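For reference, a minimal sketch of the migration just described (the class name and Rails version here are assumptions, not from the talk):

```ruby
# Adding a column with a default: on Postgres versions before 11, this takes
# an ACCESS EXCLUSIVE lock on the users table and rewrites every existing row
# before the lock is released.
class AddActiveToUsers < ActiveRecord::Migration[5.2]
  def change
    add_column :users, :active, :boolean, default: true
  end
end
```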
00:03:54.120 Let's look at that. Job done, right? Wrong. No one can log in, no one can sign
00:04:05.280 up everyone is furious your app gets raked on Yahoo News your VC's get angry
00:04:11.640 it begins the downward spiral somehow a plane crashes into a mountain we're not sure how that happened but in order to
00:04:19.260 understand what's going on here we have to talk about what's going on under the
00:04:24.360 hood with locks locks there we go locks
00:04:31.080 are the mechanism that your relational database uses to ensure that multiple operations don't update the same row at
00:04:37.620 the same time there are a lot of different lock modes unfortunately too many to go into detail here but to give
00:04:45.030 you an idea they range from an access share which basically is the lowest level I'm
00:04:50.669 accessing data but anyone else can access it too I'm not doing anything dangerous all the way up to access
00:04:57.930 exclusive which is no one else can do anything with this piece of data that
00:05:03.300 I'm trying to access because I'm doing something that is potentially dangerous
00:05:08.659 like updating or deleting it. When you're doing a database
00:05:16.680 migration, even one as simple as the one we saw earlier, it will obtain an access
00:05:22.589 exclusive lock on the table in question and prevent any other connections and
00:05:28.979 the entire rest of your application from reading or writing data to that table
00:05:34.020 until the migration is complete so let's
00:05:39.930 watch that in real time real ish time with our example users table you'll have
00:05:50.039 to imagine the remaining nine point nine nine nine million users
00:05:55.229 I could only fit so many on the slide but our migration begins running and the
00:06:02.699 first thing it's going to do is obtain that access exclusive lock it's going to
00:06:09.659 add the column so far so good and then we're gonna begin filling that column
00:06:16.469 with active true so people can log in. So
00:06:21.539 our little elephant here is Postgres doing its thing we've marked the first
00:06:26.550 user Alice as active keep it going so far so good we marked Bob active uh-oh
00:06:34.580 first problem Alice is trying to log in but she can't because the entire table
00:06:41.939 is locked your application can't read data from it and she's stuck there at a
00:06:47.550 hung login screen or 500 error or she goes and uses your competitors application overall not good
00:06:55.349 Your migration's still running, though. We've marked the next row, Charlie, as active. Now Bob's trying to
00:07:03.120 log in same problem can't do it can't read the user data can't log in the
00:07:10.530 migration is continuing to run and users are continuing to try to log in and you
00:07:16.380 wind up with this pileup of queries that can't be responded to and this pileup of users that are very upset
00:07:24.270 the migration finally finishes running releases that lock now that it's done filling all the data and what you're
00:07:32.820 left with is a cadre of dissatisfied customers probably not what you were
00:07:38.580 planning on when you decided to mark users as active or inactive so how do we
00:07:44.430 avoid this well don't do it first of all
00:07:49.520 don't add columns with a default value. That's a dangerous operation because of the locking mode it uses, and it can and will
00:07:57.480 cause downtime if you have enough rows in your database and enough traffic on your system. It is worth mentioning at
00:08:03.870 this point that Postgres 11 actually addresses this particular problem in certain circumstances adding a static
00:08:11.340 default value no longer requires obtaining a table level access exclusive lock but note the caveat under certain
00:08:20.220 circumstances there are other circumstances where adding a default value for example backfilling a new UUID
00:08:26.490 column will obtain that lock and it will
00:08:31.710 be a dangerous operation also it's hard to stay up to date on that latest
00:08:37.919 version. For example, we use Aurora at Weedmaps, and Aurora does not support
00:08:44.580 Postgres 11 yet; the highest version they have is 10.6. So you're at the
00:08:50.430 behest of your database your SAS provider your DBA your ops team to be on
00:08:58.740 the latest version to be able to take advantage of those new features, which you might not be. So, general good
00:09:05.790 practice: don't add columns with a default value. Sweet, we know what to do now. We're not gonna
00:09:14.440 add that column with the default value, just get rid of that little bit right there. We're gonna add it without a
00:09:22.000 default value first, we're going to change the default once the column has been added, and then we're going to go
00:09:28.900 back and update all of our users to mark them as active. And of course we've split it into up and down, because we're good
00:09:35.530 Rails developers and our migration should be reversible. And we tested this in development and everything is hunky-dory.
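A sketch of that second attempt (again, class name and Rails version are assumptions). Note that everything, backfill included, still runs inside the migration's single transaction, which is exactly the problem discussed next:

```ruby
class AddActiveToUsers < ActiveRecord::Migration[5.2]
  def up
    add_column :users, :active, :boolean         # no default, so no table rewrite
    change_column_default :users, :active, true  # new rows get active = true
    User.update_all(active: true)                # backfill every existing row
  end

  def down
    remove_column :users, :active
  end
end
```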
00:09:41.170 Nailed it... not so. What's happening this
00:09:49.630 time? Sorry, I just wanted to see that loop one more time. So what's happening
00:09:57.610 this time? Now we've got to talk about transactions. Transactions basically
00:10:08.110 combine multiple database operations into a single all-or-nothing operation either all of your statements succeed
00:10:14.770 within the transaction and it is committed or something fails violates an
00:10:20.530 integrity constraint whatever goes wrong and the entire operation is rolled back
00:10:26.340 So transactions provide four guarantees: atomicity, consistency, isolation, and
00:10:31.930 durability, which you might hear referred to by the acronym ACID. Check out our
00:10:37.540 sister site LSD Maps to learn more not a real product so this is a topic for a
00:10:46.600 talk in and of itself probably many talks and we don't necessarily have to go into it in great depth here suffice
00:10:54.040 it to say that the consistency and isolation that a transaction provides are guaranteed by locks specifically
00:11:01.380 consistency means that any operation will take your database from a valid
00:11:06.730 state to another valid state if you write a row that row has to be valid in
00:11:11.920 order to be saved and in order to be read by subsequent queries isolation
00:11:17.290 guarantees that uncommitted operations from one connection
00:11:23.430 can't interfere with the operation of another connection. If the transaction hasn't been committed, the database
00:11:29.579 hasn't made that consistent transition to a new state and so other connections can't read the data and there's some
00:11:37.649 interesting stuff going on under the hood with tuples and indexing and
00:11:42.920 multi-layer indexing and again not going to get into that right now
00:11:49.220 come find me at the booth because I would love to nerd out on database stuff with you though when a rows being
00:11:57.149 updated, an exclusive lock is issued to prevent other people from updating that same row, which makes sense, right? If I've
00:12:04.500 got some data to write and you've got some data to write and we try to do it at the same time, what's going to come out? Well, we don't know; that violates the
00:12:11.699 consistency principle in relational databases, and so you cannot do that. Updates to rows are always made in a
00:12:19.500 serial manner and that serial order is guaranteed by locking within a
00:12:26.519 transaction locks are issued on a first-come first-served basis they live
00:12:32.339 for the duration of a transaction even if the statement that requested the lock has already executed so for most of your
00:12:40.589 database operations this might not be a very big problem because they usually happen on the order of milliseconds
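A rough illustration of that point, with assumed model and column names: row locks taken by an update live until the surrounding transaction commits, not until the statement that took them finishes.

```ruby
User.transaction do
  alice = User.find_by!(email: "alice@example.com")
  alice.update!(active: true)  # exclusive row lock on Alice's row is taken here...
  sleep 30                     # ...and is still held while everything else in the block runs
end                            # lock released only when the transaction commits
```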
00:12:47.360 where it does become a problem is as you saw earlier when you have to perform
00:12:53.120 millions of database operations on a very large data set it tends to take a
00:12:59.250 long time so how does this tie in to
00:13:05.339 migrations well through a little bit of that rails magic that we all know and
00:13:11.459 love everything you do in a migration is automatically wrapped inside a
00:13:16.589 transaction so but you can imagine what
00:13:24.690 this looks like, but fortunately you don't have to, because I have more mock tables. We're going to start after our
00:13:32.279 column's been added and it's time to mark our users active. Again, starting with row 1, we do not lock the
00:13:39.900 entire table this time because we're not adding a default value for the column
00:13:45.210 we're just individually updating all of our rows however in order to update a
00:13:52.680 row you have to obtain a lock on it so
00:13:57.810 we mark our first row as active: lock it, mark it active true, move on. And even
00:14:03.570 though that was successful as I mentioned that lock doesn't get released until your transaction can commit and
00:14:11.660 that can't happen until you update all of the rows just like you wrote in the migration so now we're on row 2 we
00:14:20.880 obtain another lock for that row we mark it as true we keep going in this manner
00:14:26.910 and now Alice is trying to log in, and unfortunately that row is locked, so
00:14:33.720 we can't write data to it. Now Charlie's
00:14:39.060 row is locked, mark him active. One thing to note about this is that it is actually better than
00:14:46.440 the first failure case that we described because some of your users those that
00:14:52.680 haven't been updated to be marked as active are not locked yet during this
00:14:58.710 migration and can log in successfully so Doug is a happy user and then he goes
00:15:06.660 off on his merry way to use your application none the wiser that there are any users experiencing issues and
00:15:12.770 now we move on to his row and now more users are trying to log in and lock that
00:15:19.560 and we mark it as true and Doug is now active fortunately for him he already
00:15:25.050 logged in and now our migration is done all of the locks are released as the
00:15:30.630 transaction is committed and you're left with slightly fewer but probably no less
00:15:37.260 angry users again not what you were intending to do now if you were
00:15:45.960 listening closely you might have noticed I said can't write data because update locks don't
00:15:53.360 prevent you from reading the previous value of the row until that transaction
00:15:58.580 is committed and the value is updated so when a user is logging in we're just
00:16:05.720 reading the password we can read data why is this a problem are you sure
00:16:13.279 you're not making updates when your user logs in you might want to consider it and even if this particularly contrived
00:16:22.310 example doesn't necessarily apply to you I'm sure there's a table in your application that you can't afford to
00:16:28.940 block updates and write traffic to for an e-commerce application that might be your orders table if you're a SAS
00:16:35.930 blogging service that might be your posts table there's always some reason
00:16:41.510 for your application to be writing data, and while you can chance it, and maybe two
00:16:48.920 unhappy users is an acceptable margin of error, it's better to just not backfill
00:16:55.880 data inside a transaction a lot of advice is just don't do it right if only
00:17:02.630 everything was so easy but hold on a
00:17:11.419 second we like transactions they have all those nice acid properties that we
00:17:18.020 want when we're manipulating a database we don't want to get into an inconsistent state we don't want connections to start conflicting with
00:17:24.740 each other that makes for an unstable system and that's just not gonna work so you'll notice it says don't backfill
00:17:33.740 data inside a transaction what we're gonna do is first pull this line out of
00:17:42.770 the migration because we're inside a transaction we like that we want our
00:17:48.620 column changes to happen in a safe manner and we're going to write a
00:17:53.870 completely separate migration to mark all of our users as active that runs after. The first line here,
00:18:02.300 the disable DDL transaction call, DDL being data definition language, disables that global
00:18:09.290 transaction that I mentioned earlier. So it's implicitly enabled, but you do
00:18:15.080 have the option to explicitly opt out of that behavior while you're running a particular migration. Then within our up
00:18:24.710 method, rather than marking every single user as active inside a transaction,
00:18:30.560 which is dangerous, and rather than marking every single user as active outside of a transaction, which is
00:18:36.290 dangerous for different reasons, what we're doing here is iterating through our users in batches and wrapping each
00:18:43.010 individual batch inside of a transaction. Batch size defaults to a thousand; of
00:18:49.400 course it's configurable based on your individual needs but the key here is that this transaction that's updating a
00:18:55.730 thousand rows is going to complete and commit much faster than a transaction
00:19:00.920 updating ten million rows orders of magnitude faster believe it or not so
00:19:06.320 that changes your lag time from minutes where your entire user base can't login
00:19:12.920 or use your software to the order of seconds or even less while an individual
00:19:18.350 subset of users might receive a slightly delayed response which in almost all
00:19:24.710 circumstances is an extremely extremely good trade-off and your users excuse me
00:19:30.590 your users most likely won't even notice that anything happened so our rule of
00:19:37.400 thumb here is don't mix schema and data changes because you're gonna wind up
00:19:43.970 getting lumped into this transaction, and that's gonna delay your updates, cause
00:19:50.240 exclusive locks to be held for long periods of time, and generally bottleneck
00:19:55.610 your system at the database don't forget
00:20:00.830 to put those updates back inside a transaction just at a scale that works
00:20:07.610 for you and works for your system and it isn't going to block for extended periods of time
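Putting the pieces from this section together, a sketch of the batched backfill being described (class name and batch size are assumptions; in_batches defaults to batches of 1,000):

```ruby
class BackfillUsersActive < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!   # opt out of the single wrapping migration transaction

  def up
    User.in_batches(of: 1_000) do |batch|
      # Each batch commits on its own, so row locks are held for a moment
      # rather than for the entire backfill.
      User.transaction do
        batch.update_all(active: true)
      end
    end
  end

  def down
    # Intentionally a no-op; the data backfill isn't meaningfully reversible here.
  end
end
```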
00:20:14.230 now we finally got our column added we finally got all the values put into it
00:20:20.170 we only cost two production outages pretty good week right and now we
00:20:27.700 probably want to be able to look up which of our users are active or not so
00:20:36.220 we're gonna add an index you'll note that this says for Postgres only and
00:20:41.260 that's because the internal implementation of indexing in MySQL doesn't have this particular
00:20:48.580 failure mode that we're about to discuss. Again, not going to talk about how
00:20:54.690 different meta indexes point to different tuples which point to actual disk locations suffice it to say that
00:21:01.120 Postgres has a slightly more complex indexing strategy that has a particular
00:21:06.880 edge case associated with it so we have
00:21:12.460 our very simple migration here, adding an index to the active column on users. Totally straightforward, nothing
00:21:21.370 could go wrong of course you've been at this talk long enough to know that's not
00:21:27.790 why I put that up on the screen in addition to running that inside a
00:21:33.610 transaction, adding an index in Postgres will by default obtain one of those
00:21:40.450 nasty table level locks that we described earlier while it reads all of
00:21:46.180 the data in the table and writes it all to the index in one fell swoop and just
00:21:51.970 like you are locked from updating an individual row while Postgres is trying
00:21:57.730 to write an index you can't make changes that would affect the way that index is
00:22:03.100 written such as updating a different row and so basically your database state is
00:22:09.730 frozen by Postgres until the index is calculated and written at which point
00:22:15.070 you can then modify rows again fortunately this is generally realized
00:22:22.510 as a bad thing, even by the folks who maintain Postgres, and they added an
00:22:28.980 algorithm that can create a new index and write it to disk without blocking
00:22:36.120 reads and writes. So in addition to disabling that transaction at the top,
00:22:41.770 just like we talked about earlier we're going to apply the algorithm concurrently to our index which tells
00:22:49.150 Postgres to add it in a non-blocking manner simplest thing in the world
00:22:56.280 note as I said Postgres only if you put algorithm concurrently on your migration
00:23:02.590 and you're using MySQL, this will fail with an error. You won't break anything, but it won't work either, so
00:23:10.770 don't try it or do try it your database
00:23:16.980 but whatever you do, do add Postgres indexes concurrently. It adds a little
00:23:22.600 bit of overhead, but it avoids massive table-level locking that will generally cause problems with your application.
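A sketch of the concurrent version of that index migration (class name and Rails version are assumptions; Postgres only). The DDL transaction has to be disabled because CREATE INDEX CONCURRENTLY cannot run inside a transaction block:

```ruby
class IndexUsersOnActive < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def change
    # Builds the index without holding a table-level lock on users.
    add_index :users, :active, algorithm: :concurrently
  end
end
```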
00:23:31.930 All right, so now that we've talked about several potential failure modes and
00:23:37.360 things that you shouldn't do now it's time to talk about a time when I did
00:23:44.320 those things that you shouldn't do. So at
00:23:50.320 Weedmaps we have a pretty significant audit trail of operations that people
00:23:55.810 are doing in the system makes sense right we want to know when people are
00:24:00.940 logging in who's making updates to what compliance security general goodness and what that
00:24:09.040 looks like at a very high level is we write actual changes to the model to our
00:24:15.130 main Postgres database and then we write an audit record for whatever changed
00:24:20.890 what operation happened, create, update, or delete, and the properties of the model that were changed, and we write
00:24:26.800 that to a separate database. You could easily write it to a separate table in the same database;
00:24:32.650 that's an architecture decision based on the size of your data set and scaling needs but I will say
00:24:40.030 I'm very excited for Rails 6 and the multi-DB support, because there's a lot of hackery that we had to do to make
00:24:46.120 this work but that's neither here nor there the specific case that we were
00:24:52.750 trying to address is that we needed to update a compound index on the audits
00:24:58.060 table if you haven't seen a compound index it's the same as a simple index
00:25:04.090 it's just on two columns. In this case it was a polymorphic association, and we wanted to index on both the type and
00:25:10.450 the ID column, because that's how we're going to be querying. So what do we do? We
00:25:16.480 drop the old index, we add the new index concurrently, of course. No locks, no
00:25:23.080 transactions, didn't make any obvious mistakes with the migration. What could
00:25:28.660 possibly go wrong?
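A sketch of roughly what that case-study migration could have looked like (the index and column names here are assumptions; the talk only says it was a compound index on the polymorphic type and ID columns of the audits table):

```ruby
class ReindexAuditsOnAuditable < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def up
    # Drop the old single-column index, then build the compound one
    # without taking a table-level lock.
    remove_index :audits, :auditable_id
    add_index :audits, [:auditable_type, :auditable_id], algorithm: :concurrently
  end

  def down
    remove_index :audits, [:auditable_type, :auditable_id]
    add_index :audits, :auditable_id, algorithm: :concurrently
  end
end
```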
00:25:39.940 A lot. Ultimately, we wound up taking down our entire production system for about 45 minutes, which is about 45 minutes in the middle
00:25:48.820 of the afternoon, as people are getting off work, as people are using our service
00:25:54.310 to try to, you know, find cannabis. So, not a great time to have a massive production outage. But
00:26:02.560 what happened if we look at the migration it looks pretty sane we're not
00:26:07.750 in a transaction we're adding that index concurrently what could possibly have
00:26:14.440 gone wrong why did we take down production with this indexing change in
00:26:20.500 order to know that we have to talk about how concurrency works now there's gonna
00:26:26.290 be some math on this next slide that's the math don't worry we'll walk through
00:26:32.110 it. Little's Law is a very simple equation that governs the relationship between
00:26:37.720 concurrency, throughput, and response time, which ultimately controls how many requests your system and your
00:26:44.770 application is able to handle within a given time frame. For example, if you have
00:26:51.790 a hundred requests per second and you're averaging a forty-millisecond response for those
00:26:59.110 requests, that means that at any given point in time you have four
00:27:04.240 requests being served by your application. This also works in reverse: if you have four threads in Puma
00:27:13.840 and your average response time is 40 milliseconds, you can calculate and know
00:27:19.930 that you can serve up to a theoretical maximum of 100 requests per second. Note
00:27:27.430 the relationship between response time and concurrency here: if I add a zero to
00:27:33.010 your response time, you also end up adding a zero to your concurrent requests.
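In code, the relationship being walked through here, using the example numbers from the talk:

```ruby
# Little's Law: average concurrency = throughput × average response time
throughput_rps  = 100    # requests per second
response_time_s = 0.040  # 40 ms average response
concurrency     = throughput_rps * response_time_s  # => 4.0 requests in flight

# Rearranged: 4 Puma threads at 40 ms per request caps theoretical throughput at
max_throughput = 4 / response_time_s                 # => 100.0 requests per second
```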
00:27:42.790 And I don't know how many of y'all run 40 or more servers, but what I do know is that scaling up from 4 to 40
00:27:49.360 is not something that you can do very quickly and possibly won't expect to do
00:28:01.900 at all. So Little's Law helps us determine the
00:28:01.900 theoretical maximum level of concurrency that your application can support at any given time if you use APM if you've
00:28:09.610 heard of Little's Law before, if you've applied it to your own application, you might be thinking in terms of Rails requests and responses, but your database
00:28:17.460 obeys the same principles it's a service it has a connection pool it takes
00:28:22.990 requests from your Rails app, and it generates responses. Where
00:28:35.410 you can run into problems is when requests start queuing, because they arrive faster than your application or
00:28:42.490 its database can respond to them if a database operation blocks many requests
00:28:50.230 for a long time you'll wind up saturating your connection pool you won't be able to serve you won't be able
00:28:58.150 to access your database first of all usually that causes problems for your
00:29:03.580 application so you probably won't be able to serve any requests and the entire system will
00:29:10.889 grind to a halt probably at the worst possible time when you're at a
00:29:16.019 conference and no one's on call so why
00:29:25.169 was this a problem for our audits well that's because we didn't understand what
00:29:33.210 was happening when those audit records were being written so now an auditing
00:29:38.610 overview, quick but accurate. First, we write
00:29:43.740 the actual changes to the database, but second, we have a little tiny read
00:29:51.539 query here just to see what version that record is at so we're able to
00:29:58.139 subsequently write the next version to the audit database so rather than being
00:30:04.500 a write-only append log that we thought we had, it turns out that every
00:30:13.730 audited database operation was actually generating a read in addition to the
00:30:19.350 write on that audits database. Rather than being 99% writes, unless a user
00:30:28.470 happens to want to look at the audit trail for their listing, we were actually about half and half reads and writes, and
00:30:40.159 that causes problems. Because why do you index things? Fast reads. What did we do?
00:30:48.419 we dropped an index and that changed our
00:30:54.809 query performance on the audit database from several milliseconds generally
00:31:01.200 acceptable to several seconds generally not acceptable and if you remember the
00:31:06.870 Little's Law multiplier earlier, that changed our level of concurrency, or the
00:31:15.539 number of requests that we can handle, by several orders of magnitude. So rather than being able to handle our
00:31:21.890 10,000 rpm production traffic at that particular time we wound up with enough
00:31:28.169 bandwidth to handle 10 RPM so what that
00:31:36.090 looks like with some more pretty animations is when we have our auto
00:31:41.220 scaling group we have our multi region AZ cool stuff this is true for a single
00:31:46.710 deployment as well, but I figured I might as well give you an accurate representation of the Weedmaps stack. We have
00:31:53.789 these different rails applications talking to the database and requests start coming in they're trying to
00:32:00.210 generate that audit trail as they've been programmed to do but queries that were previously quite fast are now quite
00:32:07.140 slow and so once all of these requests come in they're all trying to access
00:32:12.330 that audit database you wind up with
00:32:17.490 your connection pool saturated all waiting for a response from Postgres and
00:32:22.940 none of your applications can do anything else because all these requests
00:32:28.380 are hung open waiting for a response from the database server and then all
00:32:35.340 the other requests come in and completely take everything down note
00:32:42.000 that our particular example relates to our audits database and our audit trail
00:32:47.240 but you don't have to be writing an audit to be affected by this any long
00:32:52.679 running query can potentially saturate your database connection pool on any
00:32:59.340 table at any time and if that happens all of your queries are going to be stuck waiting for those long-running
00:33:05.850 queries to complete so what do you do
00:33:13.250 you do test your database performance. You not only need to understand the
00:33:19.620 performance characteristics of your application and how they change during and after a migration,
00:33:27.590 you need to do this on a regular basis so you know what your application is doing if we had
00:33:34.650 identified the audit reads as a significant portion of our database traffic before that migration ran we
00:33:41.520 likely would not have written it and run it in the manner that we did and even though query performance was fine before
00:33:48.770 the migration, and it would have been fine after the migration once that concurrent index was re-added. So slow
00:33:56.430 queries during the migration caused a cascading failure that took down our entire application so let's talk about
00:34:07.110 how you can do that: a little tooling and some resources to help keep your database
00:34:13.140 healthy while still allowing you to make schema changes first a couple gems there
00:34:21.179 are some static analysis tools that will warn in advance about certain unsafe migrations the things that I've covered
00:34:26.880 here and the things that haven't been covered here changing column type dropping a column adding a JSON column
00:34:33.659 to a Postgres database there's a lot of things that are unsafe that I just don't have time to cover but in general you
00:34:41.760 want to catch problems at dev time not deploy time if you write safe migrations
00:34:46.830 you'll run safe migrations so I advise checking out one of these two gems
00:34:52.380 strong_migrations, zero_downtime_migrations; possibly check out both of them. They hook into the db:migrate
00:35:02.070 rake task and will throw an error if your migration is written in an
00:35:07.440 unsafe manner. Which one works better for you will likely depend on your specific
00:35:12.660 database engine, version, schema, what you're doing; they don't exactly overlap,
00:35:18.840 they both cover some slightly different things.
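As a quick illustration, wiring in one of these gems is just a Gemfile line (shown here for strong_migrations; zero_downtime_migrations is added the same way):

```ruby
# Gemfile
# Once bundled, strong_migrations hooks into migrations and raises on unsafe
# operations (adding a column with a default on older Postgres, non-concurrent
# index creation, and so on) so they're caught at dev time, not deploy time.
gem "strong_migrations"
```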
00:35:24.960 And it's not technically a gem, but I did want to throw it in there that GitLab actually has a really cool,
00:35:31.140 powerful migration helper system that they've built in that handles a lot of
00:35:36.450 these things and more which unfortunately is not abstracted out into a gem so it's difficult to pull in and
00:35:42.630 use in your own project, but it's worth checking out, because they give a very good, detailed explanation of these
00:35:49.250 things that I've talked about here as well second APM I think you could probably
00:35:57.020 throw this slide in any talk about any application in any language but
00:36:02.619 understanding your performance is absolutely critical to understanding how migrations are going to change it there
00:36:10.400 are all sorts of options for this: New Relic, Skylight, Scout, AppSignal, many
00:36:17.030 others I hope all of you are using one already but if not maybe start by
00:36:24.320 checking out the ones that sponsored this conference but the most important tool in your toolbox is you you as the
00:36:34.820 developer are going to be your own best resource in not doing dangerous things static analysis can only go so far and
00:36:41.990 as we saw in the audits case study you can take down your database in a manner
00:36:47.210 that the software itself would be completely unable to detect because it
00:36:52.460 doesn't know how your queries work and it doesn't know what the relative performance of them will be you need to
00:37:00.680 keep up with changes in your database engine both knowing that there are changes knowing what's in those changes
00:37:06.710 and keeping your database up-to-date I wish we were on Postgres 11 but we're
00:37:13.430 not. Practice these things in a
00:37:18.950 low-stakes environment you don't want to have the first time you write a
00:37:24.580 complicated safe multistage migration to be when you're running that migration on
00:37:31.250 a multi million row table in production there's nothing specific or special
00:37:37.400 about these techniques that doesn't apply to smaller tables, and I advise
00:37:43.390 implementing this even in situations where you don't think you need it because when you write safe migrations
00:37:50.270 by default you never have that cognitive overhead of man how big is my table does
00:37:55.820 this one need to disable the transaction or can I run it in the transaction you don't want to
00:38:01.060 spend time thinking about it just do it safely all the time and you won't have
00:38:06.760 to worry it's always better to listen to
00:38:11.980 that nagging little voice in your head did I do this right is this safe am I
00:38:17.050 gonna take down prod I would always rather have someone prove that a
00:38:22.210 migration is safe, or challenge me to prove that a migration is safe, than assume it's safe by default, right up
00:38:30.160 until that little adrenaline hit or large adrenaline hit when all of your
00:38:35.320 metrics go into the toilet on a deploy because you've got a bad migration running quick little recap commit this
00:38:45.520 to your memory. Don't add columns with a default value.
00:38:52.950 Don't backfill data inside one big transaction; use many smaller ones.
00:38:59.260 don't mix schema and data changes in the same migration because you wind up with that global transaction that you need to
00:39:06.040 avoid do add your Postgres indexes concurrently and do monitor and test
00:39:13.900 your database performance before during and after migrations test all the
00:39:19.960 friggin' time. And remember: uptime begins at $HOME.