
Biggish Data With Rails and Postgresql

by Starr Horne

In the talk "Biggish Data With Rails and Postgresql" at RailsConf 2014, Starr Horne explores the challenges of managing large datasets in Rails applications using PostgreSQL. The presentation focuses on how traditional ActiveRecord conventions become ineffective as databases grow, leading to performance degradation. Horne shares his experiences from working at Honeybadger, a company that monitors errors and uptime, and emphasizes that common Rails practices can falter as data volume increases. The key points discussed include:

  • Understanding Biggish Data: Horne clarifies that biggish data doesn't involve complex solutions like Hadoop but refers to practical issues faced as simple applications scale up to handle large data volumes.
  • Performance Considerations: The speaker discusses how the performance characteristics of a database change significantly as its size increases. For example, typical pagination systems may fail at high page numbers due to inefficiencies in how queries are executed.
  • Query Optimization: Horne emphasizes the importance of refining queries to limit the number of touched rows. He introduces the EXPLAIN command in PostgreSQL as a tool for understanding query performance and suggests using range queries to avoid the pitfalls of inefficient pagination.
  • Infrastructure Improvements: The talk covers practical steps to enhance database performance, such as increasing RAM, optimizing disk I/O, and utilizing connection pooling to manage database connections efficiently.
  • Database Management Techniques: Key strategies for maintaining performance over time include regular database vacuuming, partitioning data to improve deletion processes, and implementing read replicas for intensive queries.
  • Backup Strategies for Large Datasets: Horne highlights challenges in backing up large databases and recommends using tools like WAL-E for incremental backups.

Ultimately, the speaker reassures the audience that while managing biggish data can be daunting, many issues arise one at a time and are solvable with practical approaches. The overarching message is that proper preparation and proactive measures can significantly ease the transition into handling large datasets, making applications more robust and performant.

Once your database hits a few hundred million rows, normal ActiveRecord conventions don't work so well.

...you find yourself in a new world. One where calling count() can bring your app to its knees. One where migrations can take all day.

This talk will show you how to work with big datasets in Rails and Postgresql. We'll show how normal conventions break down, and offer practical real-world advice on maintaining performance, doing maintenance, and tuning Rails for optimal DB performance.

Starr likes building things that make people happy, usually with Ruby and JavaScript. (He once built a bookshelf, but it was crooked and made no one happy.) He's the co-founder of Honeybadger.io. He lives in Guadalajara, Mexico and speaks very bad Spanish.

Help us caption & translate this video!

http://amara.org/v/FG18/

RailsConf 2014

00:00:17.119 so thank you guys for coming uh it's gonna be an awesome talk quite possibly the best talk you'll ever hear
00:00:22.720 in your life but um i actually have a built-in escape hatch here so
00:00:28.160 if things start going wrong i might just bail yeah so anyway a couple weeks ago i was
00:00:39.600 well i i was sort of having a panic attack about this presentation
00:00:46.000 because i'm a programmer my natural habitat is uh in some sort of dark place looking at
00:00:51.280 a screen it's not it's not talking to a bunch of people with like lights what's up with this
00:00:59.840 so i was in the middle of this panic attack and i went out on a walk because that always clears my mind i was listening to a podcast
00:01:06.080 uh my favorite podcast is called hardcore history i highly recommend it yeah oh
00:01:12.479 somebody likes it yeah so this particular episode
00:01:18.080 of hardcore history was about the punic wars and
00:01:24.159 in case you haven't heard about the punic wars don't know the story i'll just give you like a brief rundown basically a long
00:01:30.640 long time ago there was this place uh called rome and they they decided that they should take over the world
00:01:37.600 and for the most part they did but there was this one country that was a thorn in their side for a long time called
00:01:43.680 carthage and they fought all these wars back and forth it was kind of a stalemate and then one day when i don't know maybe
00:01:51.040 he was making a sacrifice to the the fire god or whatever this guy named
00:01:56.399 hannibal had this great idea he was like i'm i'm gonna lead this sneak attack on rome
00:02:01.680 and i'm gonna do so by marching my army through the alps which is pretty cool i mean that's
00:02:08.959 pretty badass but i i think the most awesome part of the story for me at least
00:02:14.400 is that the dude had war elephants so i i don't know if you can see it in
00:02:20.160 the slide but there's actually a war elephant it's it's kind of i don't know it's kind
00:02:26.400 of under the g and a lot of people people have doubted this story you know over the years
00:02:33.120 and so in 1959 there was a british engineer who on a bar bet
00:02:40.879 borrowed a circus elephant named jumbo and marched it across the alps too so
00:02:47.599 i i don't know what my point is here really i guess don't underestimate elephants
00:02:54.319 because they don't like it and they have long
00:03:02.840 memories so okay this talk is really about biggish data so what
00:03:08.159 what the hell is biggish data um it's not wait whoa i've got one of these things it's
00:03:14.400 not big data it's not about you know this talk isn't going to be about hadoop clusters and
00:03:21.200 uh server farms and all that stuff that's that's way over my head it's not about
00:03:27.760 fancy architectures i'm not gonna show you how to make multi-right
00:03:33.440 postgres clusters that do automatic sharding and stuff i mean that's all like wicked cool but it's not what
00:03:40.400 we're going to talk about instead i'm going to talk about something that i think is it's more
00:03:46.319 practical and it's kind of more interesting to me that is how do you keep your app working as your production data set grows and
00:03:52.000 grows into biggish data territory
00:03:58.959 and this is very easy this is very easy to happen even if you don't have a fire hose of data coming in if you run a popular
00:04:06.319 ecommerce site or just a site with a lot of users over the course of years you can accumulate a
00:04:12.560 ton of data and as this data accumulates you find that your site performance goes downhill sort
00:04:19.680 of gradually in ways that you don't understand
00:04:25.280 so what is biggish data this talk is based on my experience at honeybadger in case you haven't heard
00:04:31.759 of us we are an exception performance and uptime monitoring company
00:04:38.080 that means that we essentially have a fire hose of new data coming in every day right now we have about a
00:04:44.400 terabyte of errors in our database the world has a lot of errors you guys
00:04:51.440 need to start like doing a better job and and we get about we get two
00:04:57.360 gigabytes of new errors per day roughly and all this goes into a plain vanilla
00:05:03.600 uh postgres database that backs a pretty plain vanilla rails
00:05:09.520 application well the good news is that postgres can
00:05:15.120 handle this pretty easily postgres got it covered the bad news
00:05:21.039 is that unless you've engineered your app to deal with this level of data
00:05:27.520 you're kind of screwed and the reason is that a 100 megabyte database behaves
00:05:33.520 fundamentally different than a one terabyte database
00:05:38.880 and a lot of the conventions a lot of the normal things we do in rails apps just stop
00:05:44.320 working when you have this much data
00:05:50.840 uh yeah this is actually this is actually a live feed this query has been running
00:05:56.000 since this morning
00:06:02.400 uh something i learned pretty much every pagination system breaks on page like 2000.
00:06:11.520 even if you you're like dear god make this stop i just want to delete half my data
00:06:17.840 you're still screwed because deleting data takes a lot of time
00:06:23.440 now i'm going to explain all of this i'm going to explain why this happens how you can work around it how you can
00:06:28.720 optimize the queries in your own database and optimize your stack but to really understand this we've got
00:06:36.240 to take a little trip back in time to the summer of 1978.
00:06:44.560 the summer of disco free love and the vt-100 terminal
00:06:50.000 the vt-100 was the first sort of computer-shaped thing that you could buy from an actual company and set on your
00:06:56.639 desk to prove that you were the alpha nerd of the office
00:07:01.680 it was also the time that oracle 1.0 was being developed
00:07:06.800 which was one of the first uh databases that used the sql language it's pretty cool it's written
00:07:12.560 in pdp-11 assembly language and in case you're like too young to have ever seen a pdp
00:07:20.000 11 this is what a pdp 11 looks like
00:07:26.960 yeah give show some love for the pdp 11.
00:07:34.560 now this is pretty awesome uh nowadays in our like modern day times right this
00:07:40.960 second uh marketers down in the the presentation what they call that the exhibit hall are
00:07:49.360 right now throwing around all sorts of words about uh cloud computing and platform as a service and
00:07:55.440 and i don't know some some stuff i haven't even heard of probably and back then it was the same but the
00:08:02.160 the buzzwords were different the buzz words on everybody the buzzword on everybody's lips at that
00:08:07.759 time was uh real time computing
00:08:12.879 and i i'm using air quotes around real time because that meant that you
00:08:19.680 could have your your report generated in five minutes instead of five days which i'm sure was pretty
00:08:26.639 awesome like i would have been really happy about that but uh
00:08:32.240 you know looking at this it doesn't really seem like the sort of thing i would want to build a web app on top of
00:08:38.839 right like i yeah i wouldn't stick honey badger on that
00:08:45.440 but a funny thing happened in the past 36 years and that funny thing is that
00:08:53.120 moore's law made this into a web stack now i know everybody here knows sort of
00:08:58.560 vaguely what moore's law is but i thought it would be good to go over the technical
00:09:03.680 precise definition of moore's law
00:09:11.279 and that is that whoa
00:09:16.800 that is that uh computers get more awesome as time goes forward did that come back normally yeah
00:09:22.399 as time goes forward and awesomeness is generally defined as well technically being able to do more
00:09:28.320 stuff faster being able to process more data
00:09:34.839 faster so based on this i would like to postulate uh for the first time in
00:09:40.880 public you guys are very lucky to be able to hear this the first time in public starr's corollary to moore's law which is
00:09:48.560 as database growth outpaces moore's law you literally travel back in time
00:10:00.160 that was the ted moment of this of this talk it's like i can hear your guy your minds being
00:10:05.760 blown it sounds like bubble wrap popping
00:10:12.959 so based on my my extensive research of time travel movies the the first rule
00:10:20.480 for surviving your new time period is that you need to understand the methods and the motivations of
00:10:26.480 the people in that time period we need to start thinking like our ancestors and our ancestors were very interested
00:10:34.240 in this thing called hardware
00:10:40.399 well i i don't mean to be flip but i just want to bring this up first because if you happen to be having
00:10:47.360 uh database scaling issues like if your app right now is getting slower because the amount of
00:10:53.040 data in your database is getting bigger and your app happens to live anywhere
00:10:58.079 named ec2 linode i don't know whatever whatever other
00:11:04.399 companies offer that you can probably solve your scaling issue like right now just by going and
00:11:10.560 buying a real damn computer
00:11:15.920 yeah you'll be shocked and amazed because the two things that databases need to
00:11:21.279 perform uh well to work at high performance with lots of data is they need lots of ram and they need a
00:11:28.399 really fast disk io and virtual servers give you neither of those things so just go buy a real damn computer
00:11:37.279 yeah and while you're at it like throw in a bunch of disks
00:11:42.399 i'm not talking about a raid array uh you probably want each of your you probably want each of your operating
00:11:49.120 system your postgres data your postgres log file you probably want all that stuff to be on
00:11:55.839 separate hard drives that's just going to make the operating system able to more efficiently
00:12:02.240 sort of schedule disk io so if if that solves your problems you know
00:12:08.560 great you're welcome if you guys want to leave the room now
00:12:14.800 that's fine i won't be offended you can go call your local colo facility work something out if that doesn't solve
00:12:21.839 your problem or even if it does you probably want to look at your queries next
00:12:31.040 now most people most of us develop oops most of us develop against smallish data
00:12:36.639 sets so when you develop against a small data set you don't notice inefficient queries
00:12:42.160 it's just life and books have been written about query optimization it's
00:12:47.440 a very huge topic and i can't explain everything in 30 minutes so i'm
00:12:53.279 just going to explain one thing and that's called explain fortunately postgres gives us an
00:12:59.839 awesome tool called explain which basically you pass it a query and
00:13:05.920 it gives us a query plan
00:13:16.480 oh wait wait a second that's that's a chapter from the iliad sorry it gives us a query plan
00:13:23.360 which still kind of looks like a chapter from the iliad but fortunately we only have to
00:13:28.399 to look at one metric here the only thing we need to worry about is rows for this talk
00:13:36.880 and here we have an example of a very efficient query right it's going to use an index and it's
00:13:42.320 going to look at one row that's pretty sweet very fast this is the type of query that biggish data
00:13:48.720 works with but there's one thing you need to know about rows
00:13:57.440 and then obviously the more rows you're dealing with the more data your computer has to crunch to give you the answers you want
00:14:03.680 so the whole the whole name of the game in query optimization is to limit the number of rows that you have to touch
00:14:10.720 let's go back to that inefficient count query so if you run explain on that it turns
00:14:17.519 out that when you ask postgres to count all of the t all of the uh rows in a table it
00:14:25.199 actually literally counts all the rows in the table one by one
00:14:32.240 and so you wind up with a one of these things it's not pretty
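
To make the contrast concrete, here is a minimal SQL sketch of the two plans being described, run against a hypothetical errors table; the costs and row counts shown in the comments are illustrative only, but the shape of each plan is the point.

```sql
-- Hypothetical "errors" table; plan output in the comments is illustrative.

-- An indexed lookup touches a single row:
EXPLAIN SELECT * FROM errors WHERE id = 42;
--  Index Scan using errors_pkey on errors  (cost=0.57..8.59 rows=1 width=64)
--    Index Cond: (id = 42)

-- A bare count has no index to lean on, so it walks every row in the table:
EXPLAIN SELECT count(*) FROM errors;
--  Aggregate  (cost=150000.00..150000.01 rows=1 width=0)
--    ->  Seq Scan on errors  (cost=0.00..137500.00 rows=5000000 width=0)
```
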
00:14:39.600 and it's often hard to know when stuff like this is going to crop up for example i mentioned that most
00:14:45.519 pagination systems break at page a thousand or two thousand and that's because well the offset
00:14:53.199 and limit operators work a lot like count and that
00:14:59.279 if you do an offset of 500,000 postgres is going to count up
00:15:04.959 500,000 rows and then if you have a limit of 100 it's going to count up another 100 and so what you see is that pages 1 two
00:15:11.519 and three load extremely quickly um and so you as a developer are actually when
00:15:18.079 i say you i mean me because i actually i actually fell into this trap you
00:15:23.600 test it with like one or two pages and it works fine so you move on and then later on your
00:15:28.720 customers are complaining because they can't access page 500 it's timing out the solution to this
00:15:35.360 is to use a range query range queries are very fast and i'm
00:15:41.600 going to give you a really stupid example here here's a link with
00:15:48.240 a much more detailed analysis of all this but essentially if you do a range query
00:15:54.880 you're going to be able to use an index it's going to touch 100 rows and it's going to be really fast
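
Here is a rough sketch of that idea on the same hypothetical errors table: the OFFSET form has to walk and discard half a million rows, while the range ("keyset") form lets the index seek straight to the starting point. The column and the "last seen id" value are assumptions for illustration.

```sql
-- Deep pagination with OFFSET: Postgres still counts through 500,000 rows
-- before it can hand back the 100 you asked for.
SELECT * FROM errors ORDER BY id LIMIT 100 OFFSET 500000;

-- Range query instead: remember the last id shown on the previous page and
-- start from there. The index jumps straight to that point.
SELECT * FROM errors
WHERE id > 500000          -- hypothetical "last seen id" from the previous page
ORDER BY id
LIMIT 100;
```
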
00:16:00.880 i know what you guys are thinking that's not the same as an offset limit query
00:16:05.920 and that's the reason i gave you the link in the previous slide so sorting sorting is tricky sorting is the
00:16:13.040 devil sometimes it's super fast sometimes if you happen to be asking for
00:16:18.720 data sorted in exactly the same way that an index is sorted well it'll be super fast other times
00:16:25.600 even if you have an index on a column if it's not set up right or if you're asking for the data in a slightly
00:16:31.600 unusual way you'll wind up sorting the whole damn data set it's no good so here's your rundown for
00:16:40.320 query optimization you want to develop against a real data set because otherwise you won't know when things are
00:16:45.759 going wrong until they go wrong in production use use explain use it a lot and the whole
00:16:53.360 name of the game here is to limit the number of rows that you touch because the more rows you touch the slower your queries are you know in
00:16:59.759 general ah so
00:17:05.039 i i don't know about you but that just like that just was a lot of cognitive load happening
00:17:10.799 right up here so i want to give you guys a cookie
00:17:24.160 for the second half of this talk we're going to cover a lot of issues that
00:17:30.320 relate to kind of the infrastructure around uh biggish data around big data sets and
00:17:35.840 fire hoses of data coming in and here's a legend
00:17:42.320 of of how we're going to approach this we're going to have a cute picture uh that hopefully sort of symbolizes
00:17:48.799 something about the the topic we're gonna have the name of the topic and then we're gonna have a link with
00:17:53.840 more info about the topic if you want to you know find this later you don't have to write all these links down
00:18:00.000 right now at the end of the presentation at the very last slide uh i'll have a url where you can get a
00:18:07.039 list of all of them and uh yeah and i promise this isn't malware so
00:18:12.240 you just gotta take my word on that our ancestors were really interested in
00:18:18.000 disk operating systems it seems like they were building a new disk operating system like every two seconds but
00:18:24.799 what's up with that i i don't know now we have like three the first thing that you should do if
00:18:29.919 you're using linux is to increase your readahead cache a readahead cache is something that
00:18:36.559 well i found not not a ton of people know about but it's a super easy way to get
00:18:41.760 up to a doubling of your read throughput
00:18:48.559 and essentially what this means is that linux examines the requests that you
00:18:54.400 make to the disk and if it sees that you're asking for lots of blocks that come
00:18:59.600 right after another in a row it's going to preload the next set of blocks into ram
00:19:04.720 the normal the the default for this is like 256k ram
00:19:11.679 that it uses to prefetch these blocks if you update this to use like two megs four megs you'll get a
00:19:18.799 really big increase in read performance use a modern file system that means
00:19:25.080 ext3 is not an option uh if you want to know why check out the
00:19:31.200 link and and if you happen to be using ext4
00:19:37.440 or xfs i can never say that unless i say it super slow
00:19:42.960 you might want to consider you might want to look into journaling settings if you have your
00:19:48.160 database on a completely separate hard drive and it's running ext4 and you have
00:19:54.960 full data journaling enabled since postgres does its own journaling you're going to have an inefficiency
00:20:01.039 there because you have two things journaling finally or wait not finally uh anyway you
00:20:08.159 gotta tell pg tell paul graham about all the ram that you uh
00:20:15.919 bought in that fancy new box uh a really easy way to set this up is to use a pgtune script
00:20:23.280 it examines your computer and writes a postgres configuration file
00:20:28.880 that has some pretty pretty reasonable values in it and you can
00:22:34.400 tweak them from there this stuff is all really easy to find
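
As a rough illustration of what a pgtune script is adjusting: it mostly rewrites memory settings in postgresql.conf based on the RAM in the box. These SHOW statements are just one hedged way to inspect the current values from psql.

```sql
-- The stock defaults are tiny; pgtune bumps these to match your hardware.
SHOW shared_buffers;        -- Postgres's own buffer cache
SHOW effective_cache_size;  -- how much OS cache the planner assumes it can use
SHOW work_mem;              -- per-sort / per-hash memory before spilling to disk
```
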
00:22:40.799 and then finally the bane of db administrators is the vacuum command postgres needs to vacuum because it's
00:20:47.440 messy it uh when you run queries when you delete things when you update things
00:20:53.120 it leaves sort of a lot of stuff undone and it does that it does that in the uh
00:21:01.440 in the name of speed right it only does what is necessary at the time of the query to get you an answer to your query and then it uses
00:21:09.360 vacuum to go and sort of clean up after itself uh the problem is that vacuum can be really
00:21:15.520 resource intensive and so if you're if your server is under
00:21:21.120 a lot of load and you see that okay vacuum is also causing a lot of load you may be tempted
00:21:26.720 to turn off vacuum or to uh or to make auto vacuum happen maybe once
00:21:32.960 a night or something and that's generally a bad idea we actually got
00:21:38.320 bitten by this ourselves so that's why i bring it up usually the answer to vacuum problems is to vacuum more often not less often
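
A small sketch of "vacuum more often, not less": per-table autovacuum settings are real Postgres storage parameters, though the table name and thresholds here are made up for illustration.

```sql
-- Make autovacuum kick in after roughly 2% of the table has churned instead
-- of the default ~20%, so each pass has far less work to do.
ALTER TABLE errors SET (
  autovacuum_vacuum_scale_factor  = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);

-- Or run one by hand during a quiet window:
VACUUM (VERBOSE, ANALYZE) errors;
```
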
00:21:46.400 all right so velocity i really wanted to like i didn't have
00:21:52.240 time i really wanted to have a little uh a little like tron guy on a motorcycle like going down the little
00:21:59.280 grid there but now we're going to talk about a couple
00:22:05.039 things that are important if you have a ton of data coming in or a ton of read requests coming in
00:22:11.520 a ton of queries coming in the first is too many database connections each database connection in postgres is
00:22:18.640 its own process and each process has its own ram overhead so there's a limited number of connections that you want to
00:22:25.120 have to your your database server if you have i don't know a thousand
00:22:30.799 workers and and web app processes and all this you don't want to open a thousand
00:22:36.080 database connections and you probably already know about this the solution is to pool connections
00:22:42.000 there's ways to do this in ruby there's also if you're interested in a more opsi approach
00:22:47.280 you can check out something like pg bouncer which is a proxy that sits in between your ruby app and your
00:22:54.400 database and functions as a as a connection pool
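
A quick, hedged way to see how close you are to that ceiling from psql; with a pooler such as PgBouncer in front, the open-connection count should stay small and steady regardless of how many app processes you run.

```sql
SHOW max_connections;            -- the hard limit Postgres enforces

SELECT count(*) AS open_connections
FROM pg_stat_activity;           -- each of these is its own backend process
```
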
00:23:00.799 you can also run into problems with too many locks and this is this is the sort of problem
00:23:05.919 that you don't really ever run into if you don't have a fire hose of database of data coming in
00:23:15.679 i don't expect you to know like everything about locks because it's a pretty complex topic but you should know that
00:23:20.880 within a transaction if you go to update a row that transaction is going
00:23:28.320 to put a lock on the row until it's done it's going to say that nothing else can write to that row until
00:23:34.000 it's done and you know that makes sense but where this can bite you is if you
00:23:40.640 have say imagine you have two rails models you have a parent model and
00:23:45.919 a child model and the parent model has a counter cache that gets incremented every time you add
00:23:51.279 a child normally this is no big deal people do the sort of thing all the time
00:23:57.279 but if something crazy happens and someone you know bombards your api and
00:24:04.640 suddenly you have like 15 000 children created you're going to have some locking issues
00:24:11.279 because what's going to happen is your first query is going to execute fine the child is going to get created it's
00:24:17.279 going to increment the counter everything's going to be fine and while it was doing that it put
00:24:22.320 a lock on on that row and so once the first query is done the lock is
00:24:28.159 is removed and the second query happens and if you had infinite time
00:24:34.400 like this would all be fine it would all eventually work itself out but what happens is that by the time you
00:24:40.240 finish query 100 query 15,000 has timed out
00:24:46.640 which causes all sorts of fun in your uh you know in your unicorns and your
00:24:52.000 sidekiqs and all that it's a yeah it's a huge pain in the neck and the way you avoid this is to
00:25:00.640 it's just an architectural thing you just have to avoid any situation where you could be
00:25:06.480 updating the same record in the database like a jillion times per second
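
Here is a minimal sketch of the contention being described, using a hypothetical parents table and counter column; session B simply queues up behind session A's row lock.

```sql
-- Session A, inside the transaction that creates a child record:
BEGIN;
UPDATE parents SET children_count = children_count + 1 WHERE id = 1;
-- ...insert the child row, do other work...

-- Session B runs the same statement at the same moment and just waits here,
-- blocked on A's row lock, until A commits or rolls back:
UPDATE parents SET children_count = children_count + 1 WHERE id = 1;

-- Session A:
COMMIT;  -- lock released; B's update (and everyone queued behind it) proceeds
```
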
00:25:15.679 intensive database queries like sometimes sometimes we have uh our production
00:25:21.760 database and we need that to be very uh very performant for our users but we also
00:25:29.440 need to maybe do some pre-processing on data as it comes in a really simple way to do this is to use
00:25:35.279 postgres's streaming replication facilities to create a read-only replica and
00:25:42.240 then you just do your intensive queries against that and it doesn't affect affect your users it's super simple
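
If you go this route, two built-in Postgres functions give you a rough health check from the replica itself; this is a sketch, not a full monitoring setup.

```sql
SELECT pg_is_in_recovery();                                      -- true on a streaming replica
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```
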
00:25:49.600 so i i kind of feel dumb even talking about it here but there you go partitioning is
00:25:56.880 awesome partitioning is like the best thing ever because
00:26:02.000 it allows you to dramatically speed up data culling and data archiving what you
00:26:10.240 can set up in in postgres is a partitioning scheme in which data for different days
00:26:16.799 go into different physical database tables and you know that's alright but the
00:26:23.679 really cool thing is that you access that data by querying a parent sort of virtual
00:26:30.400 table that then propagates that query across
00:26:35.600 all its children and you know magically gathers the results and spits them back to you so you don't have to update your ruby code at
00:26:41.360 all which is sweet and this way if you want to say delete data that's
00:26:47.760 180 days old you just drop the table that's associated with 180 days ago
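
A hedged sketch of the inheritance-based partitioning Postgres offered at the time; the table names, the daily scheme, and the routing details (normally a trigger on the parent, not shown) are simplified for illustration.

```sql
-- Parent "virtual" table that the app keeps querying as before:
CREATE TABLE errors (
  id         bigserial,
  message    text,
  created_at timestamptz NOT NULL
);

-- One child table per day; the CHECK constraint lets the planner skip
-- partitions that can't match a query (constraint exclusion):
CREATE TABLE errors_2014_04_22 (
  CHECK (created_at >= '2014-04-22' AND created_at < '2014-04-23')
) INHERITS (errors);

-- Reads still go through the parent, so the Ruby code doesn't change:
SELECT count(*) FROM errors WHERE created_at >= '2014-04-22';

-- Culling old data becomes a near-instant metadata operation:
DROP TABLE errors_2014_04_22;
```
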
00:26:54.320 if you were to do a delete where you know created at is less than 180 days ago you're probably
00:27:01.679 going to be waiting for weeks and last but not least uh backups are a
00:27:07.600 real pain in the neck when you have a huge data set like that cron job that dumps your entire database
00:27:12.960 uh and uploads it to s3 doesn't work so well when your database
00:27:18.000 is like two terabytes big and yeah that's a big s3 bill too
00:27:25.440 fortunately we can take advantage of the same sort of uh things that postgres does to allow it to
00:27:31.679 do streaming replication to do sort of on the fly incremental backups and
00:27:38.640 there's a tool called wal-e which makes this super easy
00:27:44.159 and it's really cool because it allows it makes it very easy to upload
00:27:49.520 incremental backups to say s3 and then when you want to restore your database it makes it really easy
00:27:55.039 to to restore a specific point in time
00:28:00.399 and so wal-e is really awesome love it now i i stayed in my hotel room last
00:28:07.440 night i didn't get to go to the uh the speakeasy thing which kind of bummed me out
00:28:12.880 but i i had to work on these slides for you people
00:28:18.159 and and when i got done with this when i i sort of reached this point i look back
00:28:24.000 on all these these slides there's like 54 of them and i was like holy crap this is a lot
00:28:30.240 of information yeah this is a lot of information
00:28:38.480 but the thing i want to stress is that at least in in my experience uh these issues tend to come at you
00:28:46.080 one at a time this isn't the sort of thing where you have to know all this stuff in advance in order to
00:28:53.600 deal with biggish data in postgres things come at you one at a time and you
00:28:58.640 can deal with them one at a time and i have faith in you i think you can do it because
00:29:04.320 you're awesome because nobody told us that we could turn like
00:29:09.760 transistors into lolcats but we did it like that's the type of people we are
00:29:16.799 and that's that's why i'm proud to be at railsconf if you're interested in like talking to
00:29:21.840 me about this if you think i'm full of about anything which i probably am on at least one
00:29:26.880 point just just a statistic say so just feel free
00:29:31.919 to come up to me after the conference or after my talk here i have delicious hot and
00:29:37.120 spicy mexican candy as an incentive so there you go
00:29:43.120 if you want to learn more about uh the stuff that uh are you taking do you want to take a picture of the mexican candy
00:29:50.240 okay okay i can send it to you
00:29:57.039 if you want to learn more uh all of the links referenced in this talk are at this url and if you are interested in having more
00:30:03.600 visibility into your production errors check out honey badger because yeah
00:30:09.919 we love rails devs and that's it that's uh it says end of show here so i guess it
00:30:15.200 must be the end of show