
Biggish Data With Rails and Postgresql

by Starr Horne

In the talk "Biggish Data With Rails and Postgresql" at RailsConf 2014, Starr Horne explores the challenges of managing large datasets in Rails applications using PostgreSQL. The presentation focuses on how traditional ActiveRecord conventions become ineffective as databases grow, leading to performance degradation. Horne shares his experiences from working at Honeybadger, a company that monitors errors and uptime, and emphasizes that common Rails practices can falter as data volume increases. The key points discussed include:

  • Understanding Biggish Data: Horne clarifies that biggish data doesn't involve complex solutions like Hadoop but refers to practical issues faced as simple applications scale up to handle large data volumes.
  • Performance Considerations: The speaker discusses how the performance characteristics of a database change significantly as its size increases. For example, typical pagination systems may fail at high page numbers due to inefficiencies in how queries are executed.
  • Query Optimization: Horne emphasizes the importance of refining queries to limit the number of touched rows. He introduces the EXPLAIN command in PostgreSQL as a tool for understanding query performance and suggests using range queries to avoid the pitfalls of inefficient pagination.
  • Infrastructure Improvements: The talk covers practical steps to enhance database performance, such as increasing RAM, optimizing disk I/O, and utilizing connection pooling to manage database connections efficiently.
  • Database Management Techniques: Key strategies for maintaining performance over time include regular database vacuuming, partitioning data to improve deletion processes, and implementing read replicas for intensive queries.
  • Backup Strategies for Large Datasets: Horne highlights challenges in backing up large databases and recommends using tools like WAL-E for incremental backups.

Ultimately, the speaker reassures the audience that while managing biggish data can be daunting, many issues arise one at a time and are solvable with practical approaches. The overarching message is that proper preparation and proactive measures can significantly ease the transition into handling large datasets, making applications more robust and performant.

Once your database hits a few hundred million rows, normal ActiveRecord conventions don't work so well.

...you find yourself in a new world. One where calling count() can bring your app to its knees. One where migrations can take all day.

This talk will show you how to work with big datasets in Rails and Postgresql. We'll show how normal conventions break down, and offer practical real-world advice on maintaining performance, doing maintenance, and tuning Rails for optimal DB performance.

Starr likes building things that make people happy, usually with Ruby and JavaScript. (He once built a bookshelf, but it was crooked and made no one happy.) He's the co-founder of Honeybadger.io. He lives in Guadalajara, Mexico and speaks very bad Spanish.

Help us caption & translate this video!

http://amara.org/v/FG18/

RailsConf 2014

00:00:17.119 so thank you guys for coming uh it's gonna be an awesome talk quite possibly the best talk you'll ever hear
00:00:22.720 in your life but um i actually have a built-in escape hatch here so
00:00:28.160 if things start going wrong i might just bail yeah so anyway a couple weeks ago i was
00:00:39.600 well i i was sort of having a panic attack about this presentation
00:00:46.000 because i'm a programmer my natural habitat is uh in some sort of dark place looking at
00:00:51.280 a screen it's not it's not talking to a bunch of people with like lights what's up with this
00:00:59.840 so i was in the middle of this panic attack and i went out on a walk because that always clears my mind i was listening to a podcast
00:01:06.080 uh my favorite podcast is called hardcore history i highly recommend it yeah oh
00:01:12.479 somebody likes it yeah so this particular episode
00:01:18.080 of hardcore history was about the punic wars and
00:01:24.159 in case you haven't heard about the punic wars don't know the story i'll just give you like a brief rundown basically a long
00:01:30.640 long time ago there was this place uh called rome and they they decided that they should take over the world
00:01:37.600 and for the most part they did but there was this one country that was a thorn in their side for a long time called
00:01:43.680 carthage and they fought all these wars back and forth it was kind of a stalemate and then one day when i don't know maybe
00:01:51.040 he was making a sacrifice to the the fire god or whatever this guy named
00:01:56.399 hannibal had this great idea he was like i'm i'm gonna lead this sneak attack on rome
00:02:01.680 and i'm gonna do so by marching my army through the alps which is pretty cool i mean that's
00:02:08.959 pretty badass but i i think the most awesome part of the story for me at least
00:02:14.400 is that the dude had war elephants so i i don't know if you can see it in
00:02:20.160 the slide but there's actually a war elephant it's it's kind of i don't know it's kind
00:02:26.400 of under the g and a lot of people people have doubted this story you know over the years
00:02:33.120 and so in 1959 there was a british engineer who on a bar bet
00:02:40.879 borrowed a circus elephant named jumbo and marched it across the alps too so
00:02:47.599 i i don't know what my point is here really i guess don't underestimate elephants
00:02:54.319 because they don't like it and they have long
00:03:02.840 memories so okay this talk is really about biggish data so what
00:03:08.159 what the hell is biggish data um it's not wait whoa i've got one of these things it's
00:03:14.400 not big data it's not about you know this talk isn't going to be about hadoop clusters and
00:03:21.200 uh server farms and all that stuff that's that's way over my head it's not about
00:03:27.760 fancy architectures i'm not gonna show you how to make multi-right
00:03:33.440 postgres clusters that do automatic sharding and stuff i mean that's all like wicked cool but it's not what
00:03:40.400 we're going to talk about instead i'm going to talk about something that i think is it's more
00:03:46.319 practical and it's kind of more interesting to me that is how do you keep your app working as your production data set grows and
00:03:52.000 grows into biggish data territory
00:03:58.959 and this is very easy this is very easy to happen even if you don't have a fire hose of data coming in if you run a popular
00:04:06.319 ecommerce site or just a site with a lot of users over the course of years you can accumulate a
00:04:12.560 ton of data and as this data accumulates you find that your site performance goes downhill sort
00:04:19.680 of gradually in ways that you don't understand
00:04:25.280 so what is biggish data this talk is based on my experience at honeybadger in case you haven't heard
00:04:31.759 of us we are an exception performance and uptime monitoring company
00:04:38.080 that means that we essentially have a fire hose of new data coming in every day right now we have about a
00:04:44.400 terabyte of errors in our database the world has a lot of errors you guys
00:04:51.440 need to start like doing a better job and and we get about we get two
00:04:57.360 gigabytes of new errors per day roughly and all this goes into a plain vanilla
00:05:03.600 uh postgres database that backs a pretty plain vanilla rails
00:05:09.520 application well the good news is that postgres can
00:05:15.120 handle this pretty easily postgres got it covered the bad news
00:05:21.039 is that unless you've engineered your app to deal with this level of data
00:05:27.520 you're kind of screwed and the reason is that a 100 megabyte database behaves
00:05:33.520 fundamentally different than a one terabyte database
00:05:38.880 and a lot of the conventions a lot of the normal things we do in rails apps just stop
00:05:44.320 working when you have this much data
00:05:50.840 uh yeah this is actually this is actually a live feed this query has been running
00:05:56.000 since this morning
00:06:02.400 uh something i learned pretty much every pagination system breaks on page like 2000.
00:06:11.520 even if you you're like dear god make this stop i just want to delete half my data
00:06:17.840 you're still screwed because deleting data takes a lot of time
00:06:23.440 now i'm going to explain all of this i'm going to explain why this happens how you can work around it how you can
00:06:28.720 optimize the queries in your own database and optimize your stack but to really understand this we've got
00:06:36.240 to take a little trip back in time to the summer of 1978.
00:06:44.560 the summer of disco free love and the vt-100 terminal
00:06:50.000 the vt-100 was the first sort of computer-shaped thing that you could buy from an actual company and set on your
00:06:56.639 desk to prove that you were the alpha nerd of the office
00:07:01.680 it was also the time that oracle 1.0 was being developed
00:07:06.800 which was one of the first uh databases that used the sql language it's pretty cool it's written
00:07:12.560 in pdp-11 assembly language and in case you're like too young to have ever seen a pdp
00:07:20.000 11 this is what a pdp 11 looks like
00:07:26.960 yeah give show some love for the pdp 11.
00:07:34.560 now this is pretty awesome uh nowadays in our like modern day times right this
00:07:40.960 second uh marketers down in the the presentation what they call that the exhibit hall are
00:07:49.360 right now throwing around all sorts of words about uh cloud computing and platform as a service and
00:07:55.440 and i don't know some some stuff i haven't even heard of probably and back then it was the same but the
00:08:02.160 the buzzwords were different the buzz words on everybody the buzzword on everybody's lips at that
00:08:07.759 time was uh real time computing
00:08:12.879 and i i'm using air quotes around real time because that meant that you
00:08:19.680 could have your your report generated in five minutes instead of five days which i'm sure was pretty
00:08:26.639 awesome like i would have been really happy about that but uh
00:08:32.240 you know looking at this it doesn't really seem like the sort of thing i would want to build a web app on top of
00:08:38.839 right like i yeah i wouldn't stick honey badger on that
00:08:45.440 but a funny thing happened in the past 36 years and that funny thing is that
00:08:53.120 moore's law made this into a web stack now i know everybody here knows sort of
00:08:58.560 vaguely what moore's law is but i thought it would be good to go over the technical
00:09:03.680 precise definition of moore's law
00:09:11.279 and that is that whoa
00:09:16.800 that is that uh computers get more awesome as time goes forward did that come back normally yeah
00:09:22.399 as time goes forward and awesomeness is generally defined as well technically being able to do more
00:09:28.320 stuff faster being able to process more data
00:09:34.839 faster so based on this i would like to postulate uh for the first time in
00:09:40.880 public you guys are very lucky to be able to hear this the first time in public starr's corollary to moore's law which is
00:09:48.560 as database growth outpaces moore's law you literally travel back in time
00:10:00.160 that was the ted moment of this of this talk it's like i can hear your guy your minds being
00:10:05.760 blown it sounds like bubble wrap popping
00:10:12.959 so based on my my extensive research of time travel movies the the first rule
00:10:20.480 for surviving your new time period is that you need to understand the methods and the motivations of
00:10:26.480 the people in that time period we need to start thinking like our ancestors and our ancestors were very interested
00:10:34.240 in this thing called hardware
00:10:40.399 well i i don't mean to be flip but i just want to bring this up first because if you happen to be having
00:10:47.360 uh database scaling issues like if your app right now is getting slower because the amount of
00:10:53.040 data in your database is getting bigger and your app happens to live anywhere
00:10:58.079 named ec2 linode i don't know whatever whatever other
00:11:04.399 companies offer that you can probably solve your scaling issue like right now just by going and
00:11:10.560 buying a real damn computer
00:11:15.920 yeah you'll be shocked and amazed because the two things that databases need to
00:11:21.279 perform uh well to work at high performance with lots of data is they need lots of ram and they need a
00:11:28.399 really fast disk io and virtual servers give you neither of those things so just go buy a real damn computer
00:11:37.279 yeah and while you're at it like throw in a bunch of disks
00:11:42.399 i'm not talking about a raid array uh you probably want each of your you probably want each of your operating
00:11:49.120 system your postgres data your postgres log file you probably want all that stuff to be on
00:11:55.839 separate hard drives that's just going to make the operating system able to more efficiently
00:12:02.240 sort of schedule disk io so if if that solves your problems you know
00:12:08.560 great you're welcome if you guys want to leave the room now
00:12:14.800 that's fine i won't be offended you can go call your local colo facility work something out if that doesn't solve
00:12:21.839 your problem or even if it does you probably want to look at your queries next
00:12:31.040 now most people most of us develop oops most of us develop against smallish data
00:12:36.639 sets so when you develop against a small data set you don't notice inefficient queries
00:12:42.160 it's just life and books have been written about query optimization it's
00:12:47.440 a very huge topic and i can't explain everything in 30 minutes so i'm
00:12:53.279 just going to explain one thing and that's called explain fortunately postgres gives us an
00:12:59.839 awesome tool called explain which basically you pass it a query and
00:13:05.920 it gives us a query plan
00:13:16.480 oh wait wait a second that's that's a chapter from the iliad sorry it gives us a query plan
00:13:23.360 which still kind of looks like a chapter from the iliad but fortunately we only have to
00:13:28.399 to look at one metric here the only thing we need to worry about is rows for this talk
00:13:36.880 and here we have an example of a very efficient query right it's going to use an index and it's
00:13:42.320 going to look at one row that's pretty sweet very fast this is the type of query that biggish data
00:13:48.720 works with but there's one thing you need to know about rows
00:13:57.440 and then obviously the more rows you're dealing with the more data your computer has to crunch to give you the answers you want
00:14:03.680 so the whole the whole name of the game in query optimization is to limit the number of rows that you have to touch
00:14:10.720 let's go back to that inefficient count query so if you run explain on that it turns
00:14:17.519 out that when you ask postgres to count all of the t all of the uh rows in a table it
00:14:25.199 actually literally counts all the rows in the table one by one
00:14:32.240 and so you wind up with a one of these things it's not pretty
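
To make the contrast concrete, here is a minimal SQL sketch of the two plans being described, run against a hypothetical errors table; the costs and row counts shown in the comments are illustrative only, but the shape of each plan is the point.

```sql
-- Hypothetical "errors" table; plan output in the comments is illustrative.

-- An indexed lookup touches a single row:
EXPLAIN SELECT * FROM errors WHERE id = 42;
--  Index Scan using errors_pkey on errors  (cost=0.57..8.59 rows=1 width=64)
--    Index Cond: (id = 42)

-- A bare count has no index to lean on, so it walks every row in the table:
EXPLAIN SELECT count(*) FROM errors;
--  Aggregate  (cost=150000.00..150000.01 rows=1 width=0)
--    ->  Seq Scan on errors  (cost=0.00..137500.00 rows=5000000 width=0)
```
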
00:14:39.600 and it's often hard to know when stuff like this is going to crop up for example i mentioned that most
00:14:45.519 pagination systems break at page a thousand or two thousand and that's because well the offset
00:14:53.199 and limit operators work a lot like count and that
00:14:59.279 if you do an offset of 500,000 postgres is going to count up
00:15:04.959 500,000 rows and then if you have a limit of 100 it's going to count up another 100 and so what you see is that pages 1 two
00:15:11.519 and three load extremely quickly um and so you as a developer are actually when
00:15:18.079 i say you i mean me because i actually i actually fell into this trap you
00:15:23.600 test it with like one or two pages and it works fine so you move on and then later on your
00:15:28.720 customers are complaining because they can't access page 500 it's timing out the solution to this
00:15:35.360 is to use a range query range queries are very fast and i'm
00:15:41.600 going to give you a really stupid example here here's a link with
00:15:48.240 a much more detailed analysis of all this but essentially if you do a range query
00:15:54.880 you're going to be able to use an index it's going to touch 100 rows and it's going to be really fast
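
Here is a rough sketch of that idea on the same hypothetical errors table: the OFFSET form has to walk and discard half a million rows, while the range ("keyset") form lets the index seek straight to the starting point. The column and the "last seen id" value are assumptions for illustration.

```sql
-- Deep pagination with OFFSET: Postgres still counts through 500,000 rows
-- before it can hand back the 100 you asked for.
SELECT * FROM errors ORDER BY id LIMIT 100 OFFSET 500000;

-- Range query instead: remember the last id shown on the previous page and
-- start from there. The index jumps straight to that point.
SELECT * FROM errors
WHERE id > 500000          -- hypothetical "last seen id" from the previous page
ORDER BY id
LIMIT 100;
```
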
00:16:00.880 i know what you guys are thinking that's not the same as an offset limit query
00:16:05.920 and that's the reason i gave you the link in the previous slide so sorting sorting is tricky sorting is the
00:16:13.040 devil sometimes it's super fast sometimes if you happen to be asking for
00:16:18.720 data sorted in exactly the same way that an index is sorted well it'll be super fast other times
00:16:25.600 even if you have an index on a column if it's not set up right or if you're asking for the data in a slightly
00:16:31.600 unusual way you'll wind up sorting the whole damn data set it's no good so here's your rundown for
00:16:40.320 query optimization you want to develop against a real data set because otherwise you won't know when things are
00:16:45.759 going wrong until they go wrong in production use use explain use it a lot and the whole
00:16:53.360 name of the game here is to limit the number of rows that you touch because the more rows you touch the slower your queries are you know in
00:16:59.759 general ah so
00:17:05.039 i i don't know about you but that just like that just was a lot of cognitive load happening
00:17:10.799 right up here so i want to give you guys a cookie
00:17:24.160 for the second half of this talk we're going to cover a lot of issues that
00:17:30.320 relate to kind of the infrastructure around uh biggish data around big data sets and
00:17:35.840 fire hoses of data coming in and here's a legend
00:17:42.320 of of how we're going to approach this we're going to have a cute picture uh that hopefully sort of symbolizes
00:17:48.799 something about the the topic we're gonna have the name of the topic and then we're gonna have a link with
00:17:53.840 more info about the topic if you want to you know find this later you don't have to write all these links down
00:18:00.000 right now at the end of the presentation at the very last slide uh i'll have a url where you can get a
00:18:07.039 list of all of them and uh yeah and i promise this isn't malware so
00:18:12.240 you just gotta take my word on that our ancestors were really interested in
00:18:18.000 disk operating systems it seems like they were building a new disk operating system like every two seconds but
00:18:24.799 what's up with that i i don't know now we have like three the first thing that you should do if
00:18:29.919 you're using linux is to increase your readahead cache a readahead cache is something that
00:18:36.559 well i found not not a ton of people know about but it's a super easy way to get
00:18:41.760 up to a doubling of your read throughput
00:18:48.559 and essentially what this means is that linux examines the requests that you
00:18:54.400 make to the disk and if it sees that you're asking for lots of blocks that come
00:18:59.600 right after another in a row it's going to preload the next set of blocks into ram
00:19:04.720 the normal the the default for this is like 256k ram
00:19:11.679 that it uses to prefetch these blocks if you update this to use like two megs four megs you'll get a
00:19:18.799 really big increase in read performance use a modern file system that means
00:19:25.080 ext3 is not an option uh if you want to know why check out the
00:19:31.200 link and and if you happen to be using ext4
00:19:37.440 or xfs i can never say that unless i say it super slow
00:19:42.960 you might want to consider you might want to look into journaling settings if you have your
00:19:48.160 database on a completely separate hard drive and it's running ext4 and you have
00:19:54.960 full data journaling enabled since postgres does its own journaling you're going to have an inefficiency
00:20:01.039 there because you have two things journaling finally or wait not finally uh anyway you
00:20:08.159 gotta tell pg tell paul graham about all the ram that you uh
00:20:15.919 bought in that fancy new box uh a really easy way to set this up is to use a pgtune script
00:20:23.280 it examines your computer and writes a postgres configuration file
00:20:28.880 that has some pretty pretty reasonable values in it and you can
00:22:34.400 tweak them from there this stuff is all really easy to find
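
As a rough illustration of what a pgtune script is adjusting: it mostly rewrites memory settings in postgresql.conf based on the RAM in the box. These SHOW statements are just one hedged way to inspect the current values from psql.

```sql
-- The stock defaults are tiny; pgtune bumps these to match your hardware.
SHOW shared_buffers;        -- Postgres's own buffer cache
SHOW effective_cache_size;  -- how much OS cache the planner assumes it can use
SHOW work_mem;              -- per-sort / per-hash memory before spilling to disk
```
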
00:22:40.799 and then finally the bane of db administrators is the vacuum command postgres needs to vacuum because it's
00:20:47.440 messy it uh when you run queries when you delete things when you update things
00:20:53.120 it leaves sort of a lot of stuff undone and it does that it does that in the uh
00:21:01.440 in the name of speed right it only does what is necessary at the time of the query to get you an answer to your query and then it uses
00:21:09.360 vacuum to go and sort of clean up after itself uh the problem is that vacuum can be really
00:21:15.520 resource intensive and so if you're if your server is under
00:21:21.120 a lot of load and you see that okay vacuum is also causing a lot of load you may be tempted
00:21:26.720 to turn off vacuum or to uh or to make auto vacuum happen maybe once
00:21:32.960 a night or something and that's generally a bad idea we actually got
00:21:38.320 bitten by this ourselves so that's why i bring it up usually the answer to vacuum problems is to vacuum more often not less often
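
A small sketch of "vacuum more often, not less": per-table autovacuum settings are real Postgres storage parameters, though the table name and thresholds here are made up for illustration.

```sql
-- Make autovacuum kick in after roughly 2% of the table has churned instead
-- of the default ~20%, so each pass has far less work to do.
ALTER TABLE errors SET (
  autovacuum_vacuum_scale_factor  = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);

-- Or run one by hand during a quiet window:
VACUUM (VERBOSE, ANALYZE) errors;
```
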
00:21:46.400 all right so velocity i really wanted to like i didn't have
00:21:52.240 time i really wanted to have a little uh a little like tron guy on a motorcycle like going down the little
00:21:59.280 grid there but now we're going to talk about a couple
00:22:05.039 things that are important if you have a ton of data coming in or a ton of read requests coming in
00:22:11.520 a ton of queries coming in the first is too many database connections each database connection in postgres is
00:22:18.640 its own process and each process has its own ram overhead so there's a limited number of connections that you want to
00:22:25.120 have to your your database server if you have i don't know a thousand
00:22:30.799 workers and and web app processes and all this you don't want to open a thousand
00:22:36.080 database connections and you probably already know about this the solution is to pool connections
00:22:42.000 there's ways to do this in ruby there's also if you're interested in a more opsi approach
00:22:47.280 you can check out something like pg bouncer which is a proxy that sits in between your ruby app and your
00:22:54.400 database and functions as a as a connection pool
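
A quick, hedged way to see how close you are to that ceiling from psql; with a pooler such as PgBouncer in front, the open-connection count should stay small and steady regardless of how many app processes you run.

```sql
SHOW max_connections;            -- the hard limit Postgres enforces

SELECT count(*) AS open_connections
FROM pg_stat_activity;           -- each of these is its own backend process
```
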
00:23:00.799 you can also run into problems with too many locks and this is this is the sort of problem
00:23:05.919 that you don't really ever run into if you don't have a fire hose of database of data coming in
00:23:15.679 i don't expect you to know like everything about locks because it's a pretty complex topic but you should know that
00:23:20.880 within a transaction if you go to update a row that transaction is going
00:23:28.320 to put a lock on the row until it's done it's going to say that nothing else can write to that row until
00:23:34.000 it's done and you know that makes sense but where this can bite you is if you
00:23:40.640 have say imagine you have two rails models you have a parent model and
00:23:45.919 a child model and the parent model has a counter cache that gets incremented every time you add
00:23:51.279 a child normally this is no big deal people do the sort of thing all the time
00:23:57.279 but if something crazy happens and someone you know bombards your api and
00:24:04.640 suddenly you have like 15 000 children created you're going to have some locking issues
00:24:11.279 because what's going to happen is your first query is going to execute fine the child is going to get created it's
00:24:17.279 going to increment the counter everything's going to be fine and while it was doing that it put
00:24:22.320 a lock on on that row and so once the first query is done the lock is
00:24:28.159 is removed and the second query happens and if you had infinite time
00:24:34.400 like this would all be fine it would all eventually work itself out but what happens is that by the time you
00:24:40.240 finish query 100 query 15,000 has timed out
00:24:46.640 which causes all sorts of fun in your uh you know in your unicorns and your
00:24:52.000 sidekiqs and all that it's a yeah it's a huge pain in the neck and the way you avoid this is to
00:25:00.640 it's just an architectural thing you just have to avoid any situation where you could be
00:25:06.480 updating the same record in the database like a jillion times per second
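
Here is a minimal sketch of the contention being described, using a hypothetical parents table and counter column; session B simply queues up behind session A's row lock.

```sql
-- Session A, inside the transaction that creates a child record:
BEGIN;
UPDATE parents SET children_count = children_count + 1 WHERE id = 1;
-- ...insert the child row, do other work...

-- Session B runs the same statement at the same moment and just waits here,
-- blocked on A's row lock, until A commits or rolls back:
UPDATE parents SET children_count = children_count + 1 WHERE id = 1;

-- Session A:
COMMIT;  -- lock released; B's update (and everyone queued behind it) proceeds
```
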
00:25:15.679 intensive database queries like sometimes sometimes we have uh our production
00:25:21.760 database and we need that to be very uh very performant for our users but we also
00:25:29.440 need to maybe do some pre-processing on data as it comes in a really simple way to do this is to use
00:25:35.279 postgres's streaming replication facilities to create a read-only replica and
00:25:42.240 then you just do your intensive queries against that and it doesn't affect affect your users it's super simple
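
If you go this route, two built-in Postgres functions give you a rough health check from the replica itself; this is a sketch, not a full monitoring setup.

```sql
SELECT pg_is_in_recovery();                                      -- true on a streaming replica
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```
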
00:25:49.600 so i i kind of feel dumb even talking about it here but there you go partitioning is
00:25:56.880 awesome partitioning is like the best thing ever because
00:26:02.000 it allows you to dramatically speed up data culling and data archiving what you
00:26:10.240 can set up in in postgres is a partitioning scheme in which data for different days
00:26:16.799 go into different physical database tables and you know that's alright but the
00:26:23.679 really cool thing is that you access that data by querying a parent sort of virtual
00:26:30.400 table that then propagates that query across
00:26:35.600 all its children and you know magically gathers the results and spits them back to you so you don't have to update your ruby code at
00:26:41.360 all which is sweet and this way if you want to say delete data that's
00:26:47.760 180 days old you just drop the table that's associated with 180 days ago
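
A hedged sketch of the inheritance-based partitioning Postgres offered at the time; the table names, the daily scheme, and the routing details (normally a trigger on the parent, not shown) are simplified for illustration.

```sql
-- Parent "virtual" table that the app keeps querying as before:
CREATE TABLE errors (
  id         bigserial,
  message    text,
  created_at timestamptz NOT NULL
);

-- One child table per day; the CHECK constraint lets the planner skip
-- partitions that can't match a query (constraint exclusion):
CREATE TABLE errors_2014_04_22 (
  CHECK (created_at >= '2014-04-22' AND created_at < '2014-04-23')
) INHERITS (errors);

-- Reads still go through the parent, so the Ruby code doesn't change:
SELECT count(*) FROM errors WHERE created_at >= '2014-04-22';

-- Culling old data becomes a near-instant metadata operation:
DROP TABLE errors_2014_04_22;
```
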
00:26:54.320 if you were to do a delete where you know created at is less than 180 days ago you're probably
00:27:01.679 going to be waiting for weeks and last but not least uh backups are a
00:27:07.600 real pain in the neck when you have a huge data set like that cron job that dumps your entire database
00:27:12.960 uh and uploads it to s3 doesn't work so well when your database
00:27:18.000 is like two terabytes big and yeah that's a big s3 bill too
00:27:25.440 fortunately we can take advantage of the same sort of uh things that postgres does to allow it to
00:27:31.679 do streaming replication to do sort of on the fly incremental backups and
00:27:38.640 there's a tool called wal-e which makes this super easy
00:27:44.159 and it's really cool because it allows it makes it very easy to upload
00:27:49.520 incremental backups to say s3 and then when you want to restore your database it makes it really easy
00:27:55.039 to to restore a specific point in time
00:28:00.399 and so wal-e is really awesome love it now i i stayed in my hotel room last
00:28:07.440 night i didn't get to go to the uh the speakeasy thing which kind of bummed me out
00:28:12.880 but i i had to work on these slides for you people
00:28:18.159 and and when i got done with this when i i sort of reached this point i look back
00:28:24.000 on all these these slides there's like 54 of them and i was like holy crap this is a lot
00:28:30.240 of information yeah this is a lot of information
00:28:38.480 but the thing i want to stress is that at least in in my experience uh these issues tend to come at you
00:28:46.080 one at a time this isn't the sort of thing where you have to know all this stuff in advance in order to
00:28:53.600 deal with biggish data in postgres things come at you one at a time and you
00:28:58.640 can deal with them one at a time and i have faith in you i think you can do it because
00:29:04.320 you're awesome because nobody told us that we could turn like
00:29:09.760 transistors into lolcats but we did it like that's the type of people we are
00:29:16.799 and that's that's why i'm proud to be at railsconf if you're interested in like talking to
00:29:21.840 me about this if you think i'm full of about anything which i probably am on at least one
00:29:26.880 point just just a statistic say so just feel free
00:29:31.919 to come up to me after the conference or after my talk here i have delicious hot and
00:29:37.120 spicy mexican candy as an incentive so there you go
00:29:43.120 if you want to learn more about uh the stuff that uh are you taking do you want to take a picture of the mexican candy
00:29:50.240 okay okay i can send it to you
00:29:57.039 if you want to learn more uh all of the links referenced in this talk are at this url and if you are interested in having more
00:30:03.600 visibility into your production errors check out honey badger because yeah
00:30:09.919 we love rails devs and that's it that's uh it says end of show here so i guess it
00:30:15.200 must be the end of show