Summarized using AI

Herding Elephants

Clint Shryock • February 20, 2014 • Earth • Talk

Herding Elephants presents insights on how Heroku operates the largest fleet of PostgreSQL databases through a blend of Ruby applications, emphasizing service-oriented architecture, infrastructure as code, and robust fault tolerance. Speaker Clint Shryock, a support engineer at Heroku, uses humor and personal anecdotes to connect with the audience while delving into the technical aspects of their database management approach.

Key Points:

  • Introduction to Heroku and Its Postgres Team:

    • Clint clarifies his role at Heroku and distinguishes his team's responsibilities, noting that they are a small unit managing a vast infrastructure with thousands of PostgreSQL databases.
    • Emphasizes the concept of a database as a service and the add-on relationship of Heroku Postgres, highlighting its early adoption in the marketplace.
  • Evolution of Heroku Postgres:

    • Begins with a simple Sinatra application that has grown into a constellation of applications for effective management.
    • Describes a distributed architecture with specific applications handling different tasks, enhancing operational responsibilities.
  • Monitoring and Managing Databases:

    • Importance of continuously monitoring thousands of databases to spot issues early.
    • States that they adopt an outside-in approach, where workers gather information to assess different database statuses, rather than relying solely on software installations for monitoring.
  • State Machines and Stateless Workers:

    • Clint outlines the use of state machines to manage complex behaviors and transitions among various states (e.g., up, down, uncertain) for server resources.
    • Discusses the efficiency of stateless workers that quickly execute tasks without maintaining deep state connections, allowing for rapid recovery from issues.
  • Incident Management:

    • Design of incident resolution protocols ensures issues are documented and addressed systematically.
    • Playbooks for common incidents promote knowledge sharing across team members, reducing reliance on individual expertise.
  • Handling Failures and Escalations:

    • If resolution efforts fail, there are escalation procedures in place that involve human intervention to resolve complex problems.
    • Stresses the importance of expecting failures as an inherent part of operating at scale and maintaining a positive attitude.

Conclusion:

Clint’s talk illustrates the significance of simplicity in design, effective monitoring, and error management in complex systems. He asserts that embracing the inevitability of failures while having a structured approach to handling them is crucial for success. This presentation serves as a valuable resource for engineering teams looking to improve database management processes while maintaining system reliability.


Herding Elephants: How Heroku Uses Ruby to Run the Largest Fleet of Postgres Databases in the World

Heroku operates the largest fleet of Postgres databases in the world. Service oriented architecture, infrastructure as code, and fault tolerance make it possible. Come hear how the Heroku Postgres team uses a handful of Ruby applications to operate and scale the largest herd of your favorite elephant themed RDBMS.


Big Ruby 2014

00:00:20.480 All right, that's our h.
00:00:25.599 Hi everyone, my name is Clint.
00:00:31.039 I like to start all my presentations with a very awkward photo of myself to kind of break the ice.
00:00:38.079 And my name is spelled very largely; that's mostly for vanity purposes, but also just to clarify that my name is not
00:00:45.920 Glenn, Quint, Client, Chad Quinn, or anything that ends in 'nt'.
00:00:52.000 Do I have any other Clints in the audience? No? If there was another Clint, he would know exactly what I'm talking about.
00:00:58.719 Whenever I call and order things, I have to spell my name. My wife makes fun of me, but if I don't, I'm always called Quinn.
00:01:04.720 Or Client; I don't get that one either.
00:01:10.880 Maybe you've noticed that I'm not wearing boots, I’m not wearing sandals, and I don't have a cowboy hat, so I am not from Texas.
00:01:16.400 I come all the way from Missouri. Yay, Missouri!
00:01:22.400 That's actually the next part, right? So, is anyone from Missouri? Yes!
00:01:27.840 Hey, two people! That's twice as many as I was expecting.
00:01:33.360 When I got here, I was meeting other people, and I said, 'Yeah, I'm from Missouri.'
00:01:38.479 As one from Missouri does, people are like, 'Oh yeah, Missouri! St. Louis! God, I love St. Louis, it's great. You know they’ve got baseball and an awesome hockey team, and the Arch, right?'
00:01:44.000 But yeah, I'm not from St. Louis.
00:01:49.680 So they're like, 'Oh, Kansas City! The City of Fountains!' which I honestly don't even know if that's Kansas City. Google says it is.
00:01:56.159 Why it’s called the City of Fountains, I have no idea.
00:02:01.439 There are some fountains there apparently, and no, I'm not from Kansas City either.
00:02:07.759 So now I have their attention because their list of cities in Missouri is exhausted, and I tell them I'm from Columbia.
00:02:13.360 Of course, they have to talk, I don’t know who you are, but he probably knows what I'm talking about.
00:02:18.640 Because then the next comment is 'Oh, okay, where's Columbia?' And I say, 'Well, come on, Missouri, right?'
00:02:23.840 So we've got Kansas City on the west, St. Louis on the right, and in between we've got Columbia!
00:02:29.120 Columbia is well known for two things: one, the University of Missouri, which had an awesome, fantastic football season better than any Texas team, for sure.
00:02:36.560 And, of course, we're known for being exactly in the middle of Kansas City and St. Louis.
00:02:41.760 I’m convinced we were founded in the wagon days, probably about two days' travel outside of St. Louis.
00:02:47.440 And on the second day, you really don’t want to sleep in your wagon again, so we had to build a roof or something.
00:02:53.040 So, yeah, I tell people I'm from Columbia.
00:02:59.040 That's how that goes.
00:03:05.840 I'm glad I got a few laughs there. If anybody does not find this slide hilarious, I'm really sorry you don't find this funny.
00:03:11.280 I laugh every time I see this internally; I don't know why.
00:03:17.520 But if you don't find this funny, then just hunker down because you're in for a rough ride.
00:03:22.640 This was just posted like an hour ago, and I know it's kind of cliché, but this is my first time presenting at a conference.
00:03:30.400 It happens to be on a Friday, so it'd be really cool if everybody would stand up.
00:03:37.680 I can take that funny little picture. Come on, no one's standing up? Oh God.
00:03:44.000 It's happening! Yay, all right! There we go, yay! Hugs, great.
00:03:50.120 They even got some woos for me! You know, woo is not really native to Missouri, but that's okay.
00:03:55.440 I'll post that picture in a little bit; if I'm on my A-game, go ahead and retweet that.
00:04:02.080 I'll be really famous after that and somehow profit.
00:04:08.639 Yeah, I work for a company named Heroku. If you’ve heard of us, you know how awesome we are.
00:04:14.240 If you have not heard of us, we do not make signs. We're actually a platform company.
00:04:19.840 In Missouri, that is really, really difficult to describe.
00:04:26.800 I’m a support engineer at Heroku, and a lot of people think, 'Oh, so you do support?'
00:04:34.080 Well, yes, I do, but at Heroku a lot of support engineering is just like you.
00:04:40.240 We're programming all day, but we have a mindset of taking all the support tickets we get and engineering ourselves out of support tickets.
00:04:47.760 Specifically, I work with the Heroku Postgres team where we are a database as a service for your favorite elephant-themed relational database.
00:04:55.440 The team itself is less than 10 people; we have hundreds of Amazon servers.
00:05:02.000 The last I checked, I guess I don’t know; we don’t really pay attention, but we have thousands of Postgres databases.
00:05:09.600 Just like the RTF team, we are internally referred to by an acronym - DOD, which stands for Department of Data.
00:05:17.240 And if you need to remember what that means, it just means we are way better than the RTA.
00:05:22.560 Oh, I ruined that joke! Is Richard even here? No? He's not even here for me to make fun of. There he is! Hi, Richard!
00:05:29.440 Sorry for poking fun at you.
00:05:34.240 Today, we're going to talk about the approach we take to managing lots and lots of databases in a talk I call Herding Elephants.
00:05:41.120 Get it? Elephant? You know, Postgres works out.
00:05:47.440 So, how Heroku uses Ruby to run the largest fleet of Postgres databases in the world.
00:05:53.440 The asterisk there means probably the largest at the time of writing.
00:05:58.800 That's probably true; it might not be true for the next few years – who knows?
00:06:03.680 Things change; it's not really a vanity metric we keep track of, but it sounds really cool on slides!
00:06:09.760 So, maybe you can tell I've never really spoken at a large conference.
00:06:17.200 When I got my little acceptance email, you're probably thinking now, like, 'Wow, what were they thinking with that?'
00:06:22.880 But I had this title, you know, this thinking of Herding Elephants. This is going to be great!
00:06:28.960 I thought I could come up with this whole talk that's all oriented around the Postgres elephant and elephants in general.
00:06:36.160 But as I've discovered, talking to a couple of people here, I've been here for the past few days.
00:06:43.680 This is not a PostgreSQL talk. Postgres is awesome, we love it, and it's great.
00:06:50.320 But no, I'm not actually going to talk about Postgres things, really.
00:06:55.680 We could be managing, I don't know, bots or something of whatever.
00:07:01.920 It's really more of an architecture talk, right? It's about managing a lot of things in the cloud.
00:07:07.680 Things that we refer to as fleets. Really, it's about managing fleets.
00:07:14.400 And who doesn't love Star Destroyers? Come on!
00:07:19.440 You are a rebel scum!
00:07:25.680 So, if you came here to hear important and cool things about managing Postgres, I’m sorry.
00:07:32.240 And I had this idea of herding elephants because it's thousands of Postgres databases.
00:07:39.440 I'd make this all elephant-and-herd-themed, but it's also not a talk about elephants.
00:07:46.080 If you came here for elephants, I’m sorry; I can't help you.
00:07:52.080 So, quick backstory: Heroku Postgres is actually not a core thing of the Heroku platform itself.
00:07:59.360 We're what's called an add-on. We exist in the add-on marketplace, which is a nice offering from Heroku.
00:08:06.400 It allows you to easily extend and add things to your applications, like New Relic, Redis, and Postgres.
00:08:12.720 You can attach these things as you will.
00:08:20.000 We were actually one of the first ones, which is cool. We kind of broke a lot of ground and a lot of things.
00:08:27.440 So, you can kind of think of it like this: you talk to Heroku, Heroku talks to us.
00:08:34.240 And we're kind of in our own little realm, even though we actually all work for Heroku.
00:08:40.960 We all sit there and eat the awesome lunches and stuff.
00:08:47.200 And all of our applications run on Heroku, so that’s pretty cool.
00:08:54.080 So, Heroku Postgres version zero, the very first thing, was just a single Sinatra application.
00:09:02.240 It used a library called Stem to speak to AWS. Can you read the orange?
00:09:08.160 Oh, so sorry! I'll just read the orange parts. That’s really disappointing; it looked great on my screen.
00:09:13.839 So, we used Stem to talk directly to AWS, and we used SQL to speak directly to the Postgres instance.
00:09:20.799 There weren't that many databases then, so this worked out great!
00:09:26.640 We just had one app and one server, and they talked to each other.
00:09:32.960 This was all great when you didn't have a lot of databases. The goal, design model, or mantra we had was just the simplest thing that could possibly work.
00:09:40.000 But no less, this is a common theme in the DOD; this is something they strive to do.
00:09:46.160 It's just something that's always in the back of our mind.
00:09:52.240 More things, more features mean more broken stuff.
00:09:58.400 Every line of code you write to do anything is something that will bite you at some point.
00:10:05.680 Or, if you move on, it'll bite someone else, and they won’t like you.
00:10:11.760 So, fast forward to now, the latest version has grown into a constellation of applications.
00:10:18.080 About five applications, still all using Sinatra. We're using Fog now to talk to AWS, and we still use SQL to talk to the Postgres instances themselves.
00:10:25.440 We've also grown to use background workers (which is underlined wonderfully), with Sidekiq and queue_classic to do the bulk of our work.
00:10:32.160 Now it kind of looks like this: this is nothing groundbreaking, right? It's a constellation of applications.
00:10:38.560 They communicate over APIs, and there’s separation of concerns.
00:10:46.480 We have one that's just in charge of managing the production tier and one that's in charge of the starter tier.
00:10:52.280 We've got one that does your data clips, and you could probably count PG backups in there that does backups and snapshots.
00:10:58.160 There's also an internal one used for administratively managing things.
00:11:04.080 Some of them talk to AWS, some of them don't, but we’ve grown and spread out like that.
00:11:11.040 Almost all of these are Sinatra applications, running within Sinatra.
00:11:17.440 So it's expanded; we've got various middleware, and you've got all your different endpoints that themselves really just launch individual Postgres Sinatra applications.
00:11:23.760 By nesting Sinatra applications like this, it allows you to focus and isolate specific things within the single application domain.
00:11:32.320 It makes the code easier to reason about as it's separated into different endpoints.
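The nesting idea can be sketched without Sinatra at all. The following is a minimal illustration, assuming each sub-application is a plain callable that follows Rack's `call(env)` convention; `ResourcesApp`, `ServersApp`, `PrefixRouter`, and the path prefixes are hypothetical names, not taken from the talk.

```ruby
# Hypothetical sub-apps: each is a callable following Rack's call(env) convention.
ResourcesApp = lambda do |env|
  [200, { "content-type" => "text/plain" }, ["resources: #{env['PATH_INFO']}"]]
end

ServersApp = lambda do |env|
  [200, { "content-type" => "text/plain" }, ["servers: #{env['PATH_INFO']}"]]
end

# A tiny prefix router standing in for the outer application (an assumption).
class PrefixRouter
  def initialize(routes)
    @routes = routes # e.g. { "/resources" => ResourcesApp }
  end

  def call(env)
    path = env["PATH_INFO"].to_s
    prefix, app = @routes.find { |pre, _| path == pre || path.start_with?("#{pre}/") }
    return [404, { "content-type" => "text/plain" }, ["not found"]] unless app

    # Hand the sub-app only the part of the path below its mount point.
    app.call(env.merge("SCRIPT_NAME" => prefix, "PATH_INFO" => path[prefix.length..-1]))
  end
end

ROUTER = PrefixRouter.new("/resources" => ResourcesApp, "/servers" => ServersApp)
```

Each sub-app only sees paths below its own prefix, which is one way to get the isolation described above.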
00:11:40.480 The apps themselves are divided into several processes.
00:11:47.280 Anyone familiar with Heroku probably recognizes this; it's a Procfile.
00:11:53.600 It's a way of taking a single application and defining individual processes contained in there.
00:12:00.160 This is a feature that you can use and scale horizontally.
00:12:06.720 You can also see that we have a summary here.
00:12:13.040 The point is that the majority of these are workers.
00:12:20.080 Your background processes do most of the heavy lifting, while the front-end stuff is usually quick and doesn’t do much of the lifting.
00:12:28.080 We literally run hundreds of workers across the five applications we have.
00:12:35.200 I think we have about 50 distinct process types, and each application itself has maybe three or four web workers.
00:12:41.520 Each of them has probably 200 plus workers of various kinds; some of the queues or processes have over 200 workers.
00:12:48.480 So we use workers a lot.
00:12:55.440 So, even while splitting this into an ecosystem of applications, it's still the simplest thing that could possibly work.
00:13:02.640 But no less than that; so, that's kind of the ecosystem.
00:13:09.200 Or the architecture of the lay of the land, so to speak.
00:13:16.640 Now, on to managing databases.
00:13:23.920 So, like I said, we have this fleet, this great awesome fleet of things.
00:13:30.560 In order to successfully run a service like this, you have to be continuously monitoring them.
00:13:37.760 You have to keep watch of everything.
00:13:44.080 If you’re looking at this, you might be asking yourself what's wrong with this picture.
00:13:51.040 It should jump out pretty quickly that it's this guy; we've got this whole fleet doing stuff.
00:13:57.280 Then we got this random one going the wrong direction.
00:14:02.160 What’s this guy doing? I mean, all these ships are coming this way; this is dangerous.
00:14:09.680 This is no good; it's going to run into somebody!
00:14:15.520 So you need to be able to monitor the fleet and identify and spot this guy.
00:14:22.080 Find out what's going on.
00:14:28.480 When you manage and monitor a lot of things, you tend to expect them to go wrong.
00:14:34.960 Things do, and you have to have this attitude about it.
00:14:40.880 But whatever this guy is doing, you need to be able to expect that this is going to happen.
00:14:48.640 Yeah, he is doing his own thing, probably causing trouble.
00:14:54.000 When you see things like this from a service point of view, someone is probably having a bad time.
00:14:59.520 That represents someone's servers or someone's database that’s gone astray.
00:15:05.680 If you don't keep your eye on these things, if you don't monitor them, things will go wrong.
00:15:11.840 People will open support tickets stating, 'Things go wrong! My database is down,' and you can tell that people are mad.
00:15:18.080 So how do you do that? You've got thousands of these things to monitor, both at the server level
00:15:25.760 and at the resource level. I'll get to the resource in a minute.
00:15:32.360 Your first thought might be: well, with these images that we’re using, we'll install software on them.
00:15:37.919 That’s not the approach we took; the approach we’ve taken is kind of outside-in.
00:15:44.160 So workers connect with SSH and they collect information about the environment.
00:15:50.640 The servers themselves are actually very dumb; we try to keep them dumb.
00:15:58.400 They only have the base OS, which is Ubuntu 12 or whatever long-term support we had last.
00:16:03.680 Postgres 9 and up; we finally killed off all the Postgres 8s. We used to have those until about six months ago.
00:16:10.560 That was a pain. There’s also this thing called WAL-E.
00:16:18.080 WAL-E was something developed internally for shipping our write-ahead logs, which is a feature of PostgreSQL.
00:16:24.480 We ship that off-site; that's part of our durability.
00:16:30.000 All that stuff, but WAL-E is written in Python.
00:16:35.280 So, if you came here for the WAL-E story, we're not going to talk about that.
00:16:41.920 Again, outside-in information is gathered by the workers.
00:16:48.320 It's used to determine the state and makes an observation, then decides the action to take.
00:16:54.720 The primary things we observe here are resources; these represent the databases.
00:17:02.080 Things like, uh, information collected such as database name, port, created at, and various database type information.
00:17:09.760 Then we have servers, which represent the physical things on AWS, or virtual things on AWS.
00:17:16.320 You've got IPs, instance ID, what availability zone it's in, how long it’s been up, and that kind of fun stuff.
00:17:24.080 We need to monitor all of these things all the time.
00:17:31.440 To do this, we use two things which are awesome: state machines and stateless workers.
00:17:37.360 I'll explain a little bit more on that; you probably know what a state machine is.
00:17:42.560 The history there is rooted in game programming.
00:17:48.320 Peter V. H. is one of the founders of Heroku Postgres, and his background was in game development.
00:17:55.200 So when it came to this kind of monitoring idea, he naturally thought of gaming.
00:18:01.120 Where you have this constant loop of observing your environment and taking action.
00:18:07.280 Am I on fire? What should I do about that? Am I being attacked by a goblin? What should I do about that?
00:18:13.760 Am I sleeping in a tent? Great!
00:18:20.080 You know, it’s like we connect to a server, and we talk to it; we say hello.
00:18:26.880 The server says hi, all right, well that's established, we can connect to the server and make progress.
00:18:33.960 Then we say things like, 'All right, select one from Postgres,' and it's like, 'Oh one.' You're like, 'Great!'
00:18:41.200 Now the server is not only up, but Postgres is running.
00:18:47.520 We do this all the time, forever.
00:18:54.080 Right? We’ve got thousands of resources, thousands of things that need to be checked.
00:19:00.160 And every one of them gets checked at least once a minute.
00:19:06.480 So, yeah, you connect to another server.
00:19:14.080 It has Postgres installed; hi, hello, select one. Great!
00:19:20.320 Then you connect to yet another server; hello, hi, select one. Great!
00:19:26.240 All these workers are going around, feeling their environment, thinking about things, and maybe doing stuff.
00:19:32.560 And we need to do this all the time, forever.
00:19:38.720 So to do this, we use a queue but treat it like a ring.
00:19:46.240 We want a worker to grab a database off the top of the queue, you know, shift it off.
00:19:52.960 We want to feel, which is a method name, and it sounds kind of odd when I stand up here.
00:19:59.680 But we want to feel its state, and then we tell it to think, where we take action.
00:20:05.120 Once we're done with that, we push it back onto the queue.
00:20:11.760 We don't linger here; these steps are usually pretty quick, and we need to do this all the time.
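The shift, feel, think, push cycle can be sketched as follows. This is a minimal illustration under stated assumptions: the ring is a plain in-memory Array rather than whatever queue backs the real system, and `Resource` and `work_once` are hypothetical names.

```ruby
# Hypothetical stand-in for a monitored database; the real class is not shown in the talk.
class Resource
  attr_reader :name, :log

  def initialize(name)
    @name = name
    @log = []
  end

  def feel
    @log << :feel # observe the environment
  end

  def think
    @log << :think # evaluate the current state and maybe act
  end
end

# One pass of a worker: shift from the front, feel, think, push onto the back.
def work_once(queue)
  resource = queue.shift
  return nil if resource.nil?

  begin
    resource.feel
    resource.think
  ensure
    # Assumption: always return the item to the ring, even if a step raised.
    queue.push(resource)
  end
  resource
end
```

Because the item goes back on the end of the queue, any worker can pick it up on the next pass; no worker owns a resource.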
00:20:17.920 We use state machines to help us out.
00:20:24.000 When you're creating a server or making a new database, we have different states that these things can be in.
00:20:29.920 The creating stage, the happy-up stage, the maybe stage, the 'whoops' stage, and destroying.
00:20:36.000 We have these workers, and they need to go around and find out what state these things are in.
00:20:43.200 They need to feel their environment and determine, 'Am I up? Great! Let’s keep on going!'
00:20:50.960 'Am I maybe up? Well, that was my last state, so maybe I'm up now; maybe I'm back; I don’t know!'
00:20:57.200 My last one was not so great; maybe I'm down.
00:21:04.000 So feeling is when we go and connect to a server and observe the environment.
00:21:10.240 We have this class, Resource, which is obviously abbreviated here.
00:21:17.680 We have this class called Feeler, and the feelers collect information about the system.
00:21:22.880 So when we grab the resource, we say feel.
00:21:28.080 All we do is create a new observation with that feeler and grab the current environment.
00:21:34.240 We record that in the observations table, which is an append-only type table.
00:21:40.480 We don’t update an observation; we create a new record each time; hence, we have a history.
00:21:47.760 The observations are very simple; they have an ID, when they were created, attributes, and foreign UUIDs pointing to the resource or server.
00:21:55.040 Once we record that, we move on to the next step, which is thinking.
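An append-only observation record might look roughly like this. It is a sketch, assuming an in-memory list in place of the real observations table; `ObservationLog`, the `probe` callable, and the `current_environment` method name are illustrative assumptions.

```ruby
require "securerandom"

Observation = Struct.new(:id, :created_at, :attributes, :resource_id)

# Hypothetical in-memory stand-in for the append-only observations table.
class ObservationLog
  def initialize
    @rows = []
  end

  # Only ever appends; existing rows are never updated.
  def record(resource_id, attributes)
    row = Observation.new(SecureRandom.uuid, Time.now, attributes.dup.freeze, resource_id)
    @rows << row
    row
  end

  def last_for(resource_id)
    @rows.reverse_each.find { |row| row.resource_id == resource_id }
  end

  def history_for(resource_id)
    @rows.select { |row| row.resource_id == resource_id }
  end
end

# Hypothetical Feeler: gathers facts about the environment via an injected probe.
class Feeler
  def initialize(probe)
    @probe = probe # any callable returning a Hash of facts
  end

  def current_environment
    @probe.call
  end
end
```

Since rows are never modified, the latest row gives the current picture and the full list gives the history.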
00:22:02.480 We consider the last observation we made.
00:22:08.960 What do we do? We include this thing called... I thought I switched to new slides.
00:22:15.040 So resources have these things called states; it’s a method that comes from the Stateful module.
00:22:22.320 When the resource itself loads, we execute this method, which has a name and a block of code.
00:22:28.400 We end up creating this map of things.
00:22:35.680 Here's the Stateful module summarized: the state method takes a name, a default nil, which I have no idea what it does.
00:22:42.000 Then we create like a map of names to blocks of code.
00:22:49.440 The uncertain one gets this block of code, and the available one gets that block, so on and so forth.
00:22:57.440 And here we get the think method.
00:23:03.760 After we've observed, we now say, hey, evaluate the state that we're currently in.
00:23:10.560 So look at this code and evaluate it; do this thing.
00:23:16.880 So if we're available, what do we do?
00:23:24.080 Well, if the last observation we had said the service was not available, we transition to the uncertain stage.
00:23:31.600 If it didn't say that, well we just move on; hooray!
00:23:38.000 Just like if it's uncertain, and now it says it is true, we'll go back to available and get on with our lives.
00:23:44.800 So pulling something off of the queue, feeling its environment, thinking about it, pushing it back onto the queue.
00:23:51.440 We need to do this all the time, forever.
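The name-to-block map and the `think` step can be reconstructed roughly as below. This is a guess at the shape, not the actual Stateful module: the `transition` helper, the `last_observation` hash, and the omission of the extra default argument are all assumptions, and only the available and uncertain states are modeled.

```ruby
# Hypothetical reconstruction; the real Stateful module is not shown in full.
module Stateful
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Map of state name => block, built as the class body is loaded.
    def states
      @states ||= {}
    end

    def state(name, &block)
      states[name] = block
    end
  end

  attr_reader :current_state

  def transition(name)
    raise ArgumentError, "unknown state #{name}" unless self.class.states.key?(name)

    @current_state = name
  end

  # Evaluate the block registered for the state we are currently in.
  def think
    instance_exec(&self.class.states.fetch(current_state))
    current_state
  end
end

class Resource
  include Stateful

  attr_accessor :last_observation

  def initialize
    @current_state = :available
    @last_observation = { available: true }
  end

  state(:available) { transition(:uncertain) unless last_observation[:available] }
  state(:uncertain) { transition(:available) if last_observation[:available] }
end
```

Each call to `think` looks only at the stored state and the last observation, so it does not matter which worker makes the call.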
00:23:57.440 So, state machines, stateless workers. The workers don't know much about the state.
00:24:05.360 They don't want to track the state because we don't want to tie up a worker to a resource.
00:24:11.680 We want to be able to quickly just grab it, do something really quickly, and move on.
00:24:18.080 We don't want a worker to have too much of an important relationship with what it’s doing.
00:24:24.320 Workers go down; all sorts of different things happen there.
00:24:31.440 So workers are constantly going through the queue; they’re the ones that do all the heavy lifting.
00:24:37.440 They’re the ones who do the things that take time.
00:24:43.680 The stateless workers are the ones who talk to AWS; they’re the ones who talk to the Heroku API.
00:24:50.160 To synchronize information or get commands or whatever they need to do.
00:24:56.960 And they’re the ones that connect to databases and talk directly to Postgres.
00:25:03.760 These, as far as computer terms go, are the things that take time.
00:25:10.560 We need to offload them to background workers because all of these things require networks.
00:25:16.720 In a giant cloud, even a great one like Amazon, all of them can fail, and they all do fail all the time.
00:25:22.720 The great aspect of Amazon's network is the idea of quickly getting and doing this.
00:25:29.440 Feeling, thinking, and moving on; part of this is, if for some reason a worker can’t connect to that service,
00:25:35.360 it just immediately moves on to the next thing.
00:25:42.240 But due to the way Amazon's network happens, sometimes that’s just a little glitch,
00:25:48.480 and the next worker is going to pop up in some other place in an entirely different availability zone.
00:25:55.760 It might have no problem at all connecting.
00:26:02.960 So we've kind of avoided a false positive situation and worked around maybe network partitions or various things that could go wrong.
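One way to picture this: a failed connection is recorded as an observation instead of being retried, and a later pass by another worker can clear it. The sketch below assumes a probe callable that returns 1 on success; `observe`, `next_state`, and the two-second limit are hypothetical, and incident creation is not modeled.

```ruby
require "timeout"

# Hypothetical probe wrapper: a connection failure becomes an observation,
# not a retry loop, so the worker can move straight on to the next item.
def observe(probe, limit: 2)
  Timeout.timeout(limit) { { available: probe.call == 1 } }
rescue Timeout::Error, SystemCallError, IOError => e
  { available: false, error: e.class.name }
end

# Minimal transition table for the two states discussed in the talk.
def next_state(state, observation)
  case [state, observation[:available]]
  when [:available, false] then :uncertain
  when [:uncertain, true]  then :available
  else state
  end
end
```

A single glitch only moves the resource to uncertain; if the next worker connects without trouble, it goes back to available and nobody is paged.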
00:26:10.080 I’m tired of you already. Sorry.
00:26:15.280 I was supposed to tie that in with the last thing: what to do when things fail.
00:26:20.560 When we've figured out that it's not a network issue and the server is down.
00:26:28.240 So, when that last observation said that the service was not available, we need to create an incident.
00:26:35.680 It's a certain type of incident; we have a lot of these.
00:26:42.240 Incidents occur when things go wrong - as they will.
00:26:48.560 As I've said, on the cloud and at scale, strange things happen.
00:26:55.120 I was talking to Tanner about how, as you scale, edge cases will remain edge cases.
00:27:01.840 But they simply become more frequent; you're doing things so many times that they are no longer bizarre.
00:27:08.080 They're just kind of strange, and there are many different types of incidents.
00:27:14.160 So we could have a resource down, stalls, failed followers, stuck mounting drives, critical servers down, or duplicates.
00:27:21.520 In order to address all of these things, you naturally start to develop playbooks.
00:27:27.760 Things that engineers can read and use to solve problems.
00:27:34.160 That way, the solutions to these things aren't tied in an individual's head.
00:27:40.720 Once you start codifying and cataloging these things, you can create yet another state machine.
00:27:46.560 Incidents have their own state machine: you can have a triggered one, and then resolving, waiting, needing a human, or resolved.
00:27:52.720 So we bounce back and forth here using the stateful module to do all of this.
00:27:59.280 We utilize state machines and stateless workers.
00:28:06.720 State machines, stateful modules, take that home with you.
00:28:13.600 We have yet another queue—a ring of incidents!
00:28:20.080 The workers go along, and we don't feel at this point because we know things are wrong.
00:28:27.680 We just need to start taking action!
00:28:33.600 We pop something off the queue; we need to take some action, and then we push it back onto the queue.
00:28:40.000 Maybe that action will actually resolve things.
00:28:46.960 This worker won’t know; it will just execute the code, transition to the stage it needs to, and move on.
00:28:54.160 Then the next worker picks it up and says, 'Hey, you're all better now!'
00:29:01.440 So we need to do this all the time, a lot, all the time.
00:29:07.760 So again, with our wonderful stateful module, we have incidents.
00:29:15.120 All of these incidents are going to attempt this resolution.
00:29:22.080 And if the resolution doesn't immediately work, we will open a ticket to the customer.
00:29:29.920 If we get an error somewhere in there, we escalate to a human.
00:29:36.880 We’ll get to that in a minute.
00:29:43.040 So, the same with a wait for resolution, we have one of these state blocks for basically everything.
00:29:49.040 We usually wrap these explicitly in begin type statements because it's common for these things just to completely bomb out.
00:29:54.880 If something is just unreachable, we deal with it.
00:30:01.680 If all else fails, you know, escalate to a human.
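The pattern of wrapping each incident state in a rescue and handing off to a person on error might be sketched like this. The state names loosely follow the talk, but the `Incident` class, the `attempt` helper, the injected resolver callable, and the note strings are assumptions.

```ruby
# Hypothetical incident with states loosely following those named in the talk.
class Incident
  attr_reader :state, :notes

  def initialize(resolver)
    @resolver = resolver # callable; returns true when the problem is fixed
    @state = :triggered
    @notes = []
  end

  def think
    case state
    when :triggered
      attempt do
        if @resolver.call
          @state = :resolved
        else
          @notes << "ticket opened for customer"
          @state = :waiting
        end
      end
    when :waiting
      attempt { @state = :resolved if @resolver.call }
    end
    state
  end

  private

  # Resolution code can fail in many ways; any error hands off to a person.
  def attempt
    yield
  rescue StandardError => e
    @notes << "escalated: #{e.message}"
    @state = :needs_human
  end
end
```

The worker that runs `think` does not need to know whether its action fixed anything; the next pass re-evaluates from the stored state.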
00:30:08.640 Actually, looking at a different part of the file, we have all these types here; it’s an array.
00:30:14.560 We load all these resolvers; these resolvers are files codified from our playbooks.
00:30:20.160 We've actually written it into code how to do these things.
00:30:26.800 So we load up all these files and create an in-memory hash of them for the type of incident and the resolver that can handle it.
00:30:33.360 We do that by calling the handles method.
00:30:39.760 So, a resolver like this is a basic restart one, and it can handle the resource down state.
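A registry built from a `handles` declaration could look something like the following. Only the `handles` name and the incident types come from the talk; the `Resolver` base class, `REGISTRY`, `resolver_for`, and the subclass names are hypothetical.

```ruby
# Hypothetical base class: each resolver declares which incident types it handles.
class Resolver
  REGISTRY = {} # incident type => resolver class

  def self.handles(*types)
    types.each { |type| REGISTRY[type] = self }
  end

  # Raises KeyError when no resolver has been registered for the type.
  def self.resolver_for(type)
    REGISTRY.fetch(type)
  end
end

class BasicRestart < Resolver
  handles :resource_down
end

class ServerRestart < Resolver
  handles :server_down, :stuck_ebs
end
```

Loading a resolver file is enough to register it, so adding a new playbook means adding one class.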
00:30:46.240 When the worker comes along, it’s going to attempt to initiate the resolution.
00:30:53.280 Here, it gets a lock on the resource in the database, so no other worker comes along and tries to do something.
00:31:00.480 If it sees that this is locked, it won't try to touch it.
00:31:06.720 The first thing it does, in this case, is try to restart it.
00:31:12.800 So, it turns out that on Amazon, if you're using Elastic Block Store and your thing crashes, and if you restart it,
00:31:20.080 your thing—in this case, Postgres—is going to come up probably in the same availability zone.
00:31:25.920 But on a different machine, and if you've used AWS long enough, you notice that sometimes just restarting the thing,
00:31:31.840 and having it come up somewhere else, resolves all your problems.
00:31:38.440 So, the very first thing we do is often just restart it, and everything sorts itself out.
00:31:44.320 Later along the line, we'll call resolve, perform a new observation, get a new state, and we'll do the tick method.
00:31:51.200 This is actually an alias for the think method, and we’ll repeat the process.
00:31:57.440 We'll continue to check these things and transition to the right state along the state machine until we determine things are better.
00:32:04.720 Until we think they get resolved.
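Putting the lock, the restart, and the repeated observe-and-tick together gives roughly the following. It is a sketch under assumptions: the lock is an in-memory flag rather than a database lock, restart and observe are injected callables so the code runs offline, and `initiate` and `with_lock` are hypothetical names.

```ruby
# Hypothetical resource; restart and observe are injected so the sketch runs offline.
class Resource
  attr_reader :state

  def initialize(restart:, observe:)
    @restart = restart
    @observe = observe
    @state = :down
    @locked = false
  end

  # Returns false without yielding if another caller already holds the lock.
  def with_lock
    return false if @locked

    @locked = true
    begin
      yield
    ensure
      @locked = false
    end
    true
  end

  def restart
    @restart.call
  end

  def think
    @state = @observe.call[:available] ? :available : :down
  end
  alias tick think
end

class BasicRestart
  def initialize(resource)
    @resource = resource
  end

  # First step: take the lock and restart; skip if someone else is working on it.
  def initiate
    @resource.with_lock { @resource.restart }
  end

  # Later passes: re-observe and advance the state machine.
  def resolve
    @resource.tick == :available
  end
end
```

A worker calls `initiate` once and then `resolve` on later passes until it returns true.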
00:32:11.200 Here's another example of a resolver; this is for server down or for stuck EBS volumes with a production tier.
00:32:17.440 We might try to fail over if you have a high availability failover.
00:32:24.080 This will actually restart the entire server instead of the Postgres instance.
00:32:30.720 That's when the thing will move.
00:32:38.080 We do another thing like resolve or perform the observation: 'Is it available now? Does it still have a stuck EBS? What’s going on?'
00:32:44.480 So we’re doing good.
00:32:51.200 But even that’s not perfect, right? So, we’re left with the obvious question of what to do when even these resolutions fail.
00:32:58.240 Because they will.
00:33:05.760 So you saw there, when the resolver doesn't work, we have to escalate to a human.
00:33:12.080 We actually have to call somebody, which is the software equivalent of, 'Well, I tried!'
00:33:18.560 Eventually, this will all lead to Dumbledore there, which ultimately leads to PagerDuty.
00:33:25.440 Escalate to humans; we’ll wake somebody up, and then we’ll have to go and figure out why the resolvers didn’t work.
00:33:31.760 Why am I being woken up in the night? Not me, fortunately!
00:33:37.760 So, I have no idea how I'm on time, but I'm pretty close to the end.
00:33:44.160 Oh wow, that's like right on time; odd.
00:33:50.960 So yeah, in summary: the simplest thing that can work.
00:33:56.320 The simplest thing you can do that works, but no less than that.
00:34:03.680 State machines are fantastic for modeling complex behavior, complex states, and things that can go wrong.
00:34:09.600 Stateless workers are great because you don’t get too tied up in what you're doing.
00:34:17.440 You can quickly move to resolution, especially when things take time.
00:34:24.160 When you get big and you need to monitor things like that, you should expect things to break.
00:34:30.960 And have a good attitude about it; just remember: well yeah, things break.
00:34:37.920 This is the summary of my talk.
00:34:43.680 Welp! Well-driven presentations.
00:34:50.320 All right, that’s all I got.
00:34:57.200 Thank you!
00:35:00.000 That's all I've got.