
Event Streaming on Rails


by Brad Urani

In the presentation titled Event Streaming on Rails at RailsConf 2022, Brad Urani, Principal Engineer at Procore, explores the concept of event streaming as a solution for scaling large Rails applications. Urani discusses how companies can transition from monolithic Rails applications to microservice architectures through effective use of event streaming technologies, primarily using Apache Kafka. The core topics addressed in the talk include:

  • Communication Challenges in Large Applications: Urani outlines the difficulties in making large Rails applications communicate, especially when trying to split a monolith into smaller services and facilitating data replication across different geographic regions.

  • Introduction to Apache Kafka: The speaker explains Kafka as a distributed event streaming platform, highlighting its capabilities, including durability, guaranteed delivery, and message ordering, which is crucial for maintaining data consistency across systems.

  • Integration with Ruby on Rails: Practical insights are provided on how to implement Kafka within Rails using the ruby-kafka gem. Urani describes both synchronous and asynchronous message publishing methods, emphasizing the trade-offs between speed and message durability.

  • Common Use Cases for Kafka:

    • Service-to-service communication, where Kafka allows for decoupled services that can operate independently, improving fault tolerance.
    • Multicasting events for diverse processes, such as notifying various downstream services when a user signs up.
    • Ensuring consistency between heterogeneous data stores like PostgreSQL, Elasticsearch, and reporting databases using event sourcing and change data capture techniques.
  • Transactional Outbox Pattern: Urani discusses the dual-write problem inherent in standard database operations and proposes the transactional outbox pattern as a solution. This technique involves writing to an 'outbox' table within a single transaction to ensure that the data published to Kafka aligns with the database entries, providing a reliable mechanism to maintain consistency.

  • Operational Considerations: Urani also covers operational strategies like Kafka Connect, which simplify the integration of various data sinks and sources, particularly for maintaining historical data.

In conclusion, Urani illustrates that adopting event streaming with Kafka not only boosts scalability and performance of Rails applications but also enhances their reliability through strong guarantees on message delivery and ordering. By leveraging these technologies, developers can create a more resilient architecture, paving the way for future growth and adaptability within their applications.

Pop quiz: How do you best make one Rails app communicate with another? How do you split one big Rails app into two smaller ones? How do you switch from a Rails app to a constellation of Rails services? Event streaming provides the most robust answer. Simple but powerful Kafka streams unlock a world of capabilities with their durability and consistency, but integrating with Rails poses challenges. Join us and learn simple reading and writing to Kafka with Rails, broader distributed systems design, and the magical transactional outbox. You'll leave with the knowledge you need to make the switch.

RailsConf 2022

00:00:12.440 excited to be back at RailsConf this is my first conference since 2019. dang it
00:00:17.820 is nice do you all miss this yeah um so my name is Brad Urani I'm a
00:00:24.060 principal engineer at Procore I live in Austin Texas uh what the hell is that oh there we go
00:00:31.560 uh Procore builds software for the construction industry if you're building a skyscraper like this one a shopping
00:00:36.899 mall um you know a subway system uh you're probably using Procore it's a big suite
00:00:43.020 of enterprise software we build the software that builds the world that is our motto
00:00:49.140 um holy cow uh that's one of our iPad apps used for building design and things like that we are I think 200 maybe 300
00:00:57.360 rails developers strong um grew from a tiny little shop up into a great big one starting on Rails 0.9
00:01:04.619 over 16 years ago um we now have I think might be the
00:01:10.260 world's largest rails app I know that Shopify has got a big one not sure if you all are still contributing to that
00:01:15.479 or if it's still growing ours is huge over 43,000 Ruby files over a thousand database tables over 12,000 routes two
00:01:23.280 or three hundred devs probably 250 or 300 devs working on it every single day it continues to grow
00:01:30.840 um with this comes some questions right because well uh this model of giant rails apps has served us well uh it
00:01:37.680 can't scale forever at some point uh we need to do something else to break that apart and that brings
00:01:43.200 up some really tough questions how do we split the world's largest monolith um how does our monolith talk to new
00:01:49.320 services not as obvious as it might seem how do we make it multi-region so if we have like our AWS regions you know we've
00:01:55.979 got an instance of this whole application in Europe in North America they do need to communicate to each other they need to share data with each other
00:02:02.520 how do we do that how do you break out things like search and Reporting which are sometimes tough problems um and the
00:02:08.819 source of a lot of performance problems um having you know reporting queries baked into your primary database or
00:02:15.180 having Search Baked into your primary database along with the problems of making sure all the data gets into all
00:02:20.940 these systems Downstream in a consistent way so that all these different data stores match at the end of the day
00:02:27.420 so on the platform team at Procore this was
00:02:34.980 basically what we came up with uh how we're building solutions to a lot of these problems that's actually a picture of
00:02:40.620 our platform if you zoom in on the water it's actually like little ones and zeros floating Downstream uh so um this is at
00:02:47.220 the core of our distributed system strategy it is basically the backbone of our service to service communication
00:02:52.340 amongst other things it's our way of replicating data Zone to Zone it's a suite of tools built by my team and I
00:02:58.920 that allow sort of self-service creation of services that can publish and consume from streams
00:03:04.800 um so that we can finally break this monolith right and start developing microservices so streams themselves right they go by a
00:03:12.060 few different names you might hear it called an event bus right a transaction log I made the icon a log get it
00:03:17.940 transaction log yeah um a stream um some people call it a pub sub right for those of you born in the 90s that is
00:03:24.659 a newspaper it's like the internet but on paper um
00:03:30.060 and at the heart of this right is Apache Kafka Kafka is a distributed log right
00:03:36.360 it was invented by um Jay Kreps who was at LinkedIn at the time and it is fundamentally um oh well
00:03:43.920 I should say that um Apache Kafka is not the only solution that does that there are some competitors Amazon has Kinesis
00:03:50.099 which is similar but not as fully featured Google has Cloud Pub/Sub I
00:03:55.379 don't know who actually uses Google Cloud but um uh so there are competitors but most people in this ecosystem are using Kafka it's kind of
00:04:01.920 the tool of choice definitely the most common and has the most tools and things built into it and libraries
00:04:09.900 um so Kafka is fundamentally this system where you have something that publishes data
00:04:16.019 um to what's called a topic a topic is like a single Stream So you uh you publish to a Kafka topic um and then
00:04:21.720 something else consumes from it it is uh so I'm going to go into just a few nuts
00:04:28.199 and bolts about how you read and write to Kafka from Ruby before we get into the fun stuff
00:04:34.259 so uh there's this wonderful gem it's called ruby-kafka it's maintained by the fine folks at Zendesk if you're here and
00:04:40.139 you work on this I love you please come talk to me uh so um and using it is actually
00:04:46.919 pretty simple you instantiate this Kafka client you pass it some config options with broker names and things like that
00:04:51.960 you deliver a message you've got the message you've got which topic you're going to write to and you've got a partition number which I'm going to talk
00:04:58.020 about in a second so um not too difficult really to kind of get started with this once you have Kafka up and running somewhere
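Roughly what that synchronous publish looks like with the ruby-kafka gem (broker addresses, topic name, and payload here are placeholders):

    require "kafka"
    require "json"

    kafka = Kafka.new(["kafka1:9092", "kafka2:9092"], client_id: "my-rails-app")

    # Blocks until the brokers acknowledge the write; partition_key keeps all of one
    # user's messages on the same partition, which is what preserves their ordering.
    kafka.deliver_message(
      { id: 42, email: "user@example.com" }.to_json,
      topic: "user-signups",
      partition_key: "42"
    )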
00:05:03.900 um it also has an asynchronous producer mode so the first one I showed you was
00:05:09.840 synchronous where you actually write to Kafka it blocks you wait for a response and then uh you get your thread
00:05:16.139 back right um there's also an async version of this which when you publish a message to Kafka actually puts the
00:05:21.780 message in an in-memory background queue and there's a background thread on your web servers that takes those messages in
00:05:27.539 that queue in small batches and publishes them to Kafka the nice thing there is that it doesn't block your
00:05:32.940 request right so in your rails app users aren't waiting for you to publish to Kafka and you return a little bit faster
00:05:38.539 the downside of that is it's not durable so in like a kill -9 scenario or if your server crashes you can lose messages
00:05:44.100 that way very important things to consider in fact I've got a whole section coming up about how to produce and how to produce consistently based on your needs
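A sketch of that asynchronous mode with ruby-kafka's async producer (same placeholder names); the trade-off he's describing shows up in the buffering options:

    require "kafka"
    require "json"

    kafka    = Kafka.new(["kafka1:9092"], client_id: "my-rails-app")
    producer = kafka.async_producer(
      delivery_threshold: 100, # flush once 100 messages are buffered...
      delivery_interval:  10   # ...or every 10 seconds, whichever comes first
    )

    # Returns immediately; a background thread delivers the buffered messages,
    # so a crash or kill -9 can lose whatever hasn't been flushed yet.
    producer.produce({ id: 42, event: "signed_up" }.to_json,
                     topic: "user-signups", partition_key: "42")

    at_exit { producer.shutdown } # flush and stop the delivery thread on exit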
00:05:51.360 um Kafka topics are broken up into
00:05:56.699 partitions so what happens is these consumers um they subscribe to a Kafka topic they
00:06:02.699 take off one small batch at a time and they process it and then they grab the next batch to make that scale it's
00:06:08.460 broken up into partitions based on a partition key you choose your own partition key that could be something like user ID and for that particular
00:06:15.600 partition messages are in order um and then you can parallelize it by running uh consumers right up to one
00:06:22.740 consumer per partition um that's how Kafka scales um when you're reading from Kafka like
00:06:29.520 just in basic Ruby um once again this is the ruby-kafka gem you configure it with these broker
00:06:34.800 names brokers are like the Kafka servers right um you've got this loop here you subscribe to a topic in this loop and it
00:06:41.759 just loops and loops and loops consuming micro batches of messages how you do exception handling in there is real
00:06:48.360 critical if you don't want it to stop and um definitely something to pay attention to because it determines uh basically
00:06:54.600 your ordering guarantees uh which is something important I'm going to talk about a lot
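The consumer loop he's describing looks roughly like this with ruby-kafka (group and topic names are placeholders):

    require "kafka"
    require "json"

    kafka    = Kafka.new(["kafka1:9092", "kafka2:9092"], client_id: "my-consumer")
    consumer = kafka.consumer(group_id: "signup-processor")
    consumer.subscribe("user-signups")

    # Loops forever, pulling micro-batches; the offset is only committed after the
    # block finishes, so raising here means the message is retried, while rescuing
    # and skipping trades ordering safety for forward progress.
    consumer.each_message do |message|
      data = JSON.parse(message.value)
      handle_signup(data) # hypothetical handler
    end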
00:06:59.639 um if you're using rails right you can in fact turn your rails app into a Kafka consumer it is a little bit fraught it's
00:07:07.620 not necessarily straightforward or easy but um if you basically make a new class right that calls application initialize
00:07:13.860 which you know starts off all your autoloaders and your initializers right and that allows you to use like your
00:07:19.139 model classes and things like that then you subscribe to Kafka and you start processing messages so this is skipping
00:07:24.960 the whole web server right and the normal rails server we're basically using
00:07:30.000 um we're basically just running a ruby file directly and then loading rails classes and you can do that depending on
00:07:35.699 which version of rails you're using it may or may not be more complicated than that to get the environment to load but
00:07:41.099 that's nice you can write all the classes you've already written in your rails app if you don't want to break them out into a gem or something like that
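A sketch of that standalone-consumer approach, assuming a file at the root of the Rails app (file, topic, and model names are illustrative, and exactly how the environment boots can differ between Rails versions):

    # kafka_consumer.rb — run with: bundle exec ruby kafka_consumer.rb
    require_relative "config/environment" # boots Rails: initializers, autoloading, models
    require "kafka"

    kafka    = Kafka.new(["kafka1:9092"], client_id: "rails-kafka-consumer")
    consumer = kafka.consumer(group_id: "signups")
    consumer.subscribe("user-signups")

    consumer.each_message do |message|
      data = JSON.parse(message.value)
      # Your ordinary ActiveRecord models are available here; no web server involved.
      User.find_or_create_by!(external_id: data["id"]) do |user|
        user.email = data["email"]
      end
    end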
00:07:46.199 uh so why Kafka this is the author Franz
00:07:51.720 Kafka he wrote a great book about a man who turned into a cockroach
00:07:57.300 um Kafka has some interesting guarantees that aren't really true about most other similar but different Technologies um
00:08:03.780 one is at least once delivery a Kafka consumer always retries until it
00:08:08.880 succeeds in processing your message so it will sit there you know it's subscribed to one partition it will read the messages
00:08:15.539 off that partition and if it fails to get an acknowledgment back it will keep trying and in that way you are
00:08:22.319 guaranteed to never lose a message provided someday you fix your bug or your performance issue or whatever it is
00:08:28.680 that is preventing that message from getting through it will sit there retry forever it's a disk based storage system
00:08:34.020 the messages don't get lost all the messages are replicated on disk across Brokers
00:08:39.539 um and it's got a retention period which I'm going to talk about Kafka gives you guaranteed ordering the order in which
00:08:46.320 the messages go in is the order in which they are processed and that is key to all of this I can't overstate how
00:08:52.440 important that is for these types of systems because that is what allows you to make a consistent system say for
00:08:58.440 instance you've got a rails app right and it's writing to a database and then you publish to Kafka and
00:09:04.140 then Downstream you're consuming it and writing to another database to a search index or something
00:09:10.320 um what you don't want to do is have like you know a record get inserted in your database and then updated and then
00:09:16.560 updated again but then the last update goes before the first one and then you overwrite new changes with
00:09:21.899 old ones that's bad right because then your data is inconsistent that's why this guaranteed ordering is so important
00:09:29.940 also multicasts so um you can have multiple consumers on a
00:09:35.160 single topic that's key too this is what really allows your system to be really
00:09:40.260 flexible um and you can replay it because uh Kafka messages are retained uh often you
00:09:46.560 set your own retention window ours is a week uh if you mess something up right if you fail to write if you forget to
00:09:52.500 write a column to your database you can go back in time reconsume the stream and repopulate your database so uh with at
00:09:59.040 least once delivery and guaranteed ordering we have a system that is consistent with multicast we have something that is
00:10:04.440 democratized because many people could consume this use cases you don't even know about yet and it is decoupled you
00:10:10.380 have decoupled the upstream and downstream and different services that are consuming are decoupled from each other this is not necessarily true for
00:10:17.339 similar things right if you're used to RabbitMQ or Amazon SQS they don't have
00:10:22.500 retention you pick that message up off the queue it's gone you can't replay it they are not ordered so you can enqueue
00:10:28.980 two things it often depends on some of the settings but you can enqueue two things and have the order be swapped you can have them
00:10:35.940 processed at the same time this is also true of Sidekiq which is different but you know it uses redis as a queue often
00:10:42.540 you get parallel execution of these things which is not consistent because it is not ordered and you can have old things overriding new things if
00:10:49.140 you're not careful and for actually running Kafka you've got some options if you don't actually want to spin up your own Kafka servers
00:10:55.320 Amazon has a relatively new one called MSK even though they have a competing Kinesis go figure
00:11:01.620 um Heroku's got a great one surprisingly we used to use it it's actually really awesome and it's cheap and it's fun and
00:11:07.019 then Confluent is the Kafka company they make all the open source tooling but then they've got commercial versions of
00:11:12.240 this stuff Cloud hosted versions uh use cases that is Jay Kreps the inventor
00:11:17.700 of Kafka he did not turn into a cockroach um so simple question here
00:11:23.940 um actually it's not that simple it's deceptively complicated what is the best way to make one rails app talk to
00:11:29.700 another hmm all right well you know a simple thing to do would be an HTTP post
00:11:36.060 so that's not as great as it sounds um it's not tolerant to errors
00:11:41.760 if that Downstream system is offline if it's being deployed if it's bogged down due to database performance that's going
00:11:48.959 to return an error right and then you either have to swallow that message and lose it or you have to return an error all the way back to the user so it is not
00:11:56.040 fault tolerant at all if this is not your app Downstream if it's a third-party app you're especially vulnerable here this can create big
00:12:02.640 problems latency is also a problem because you don't know 100 percent the performance characteristics of the
00:12:07.860 downstream app if the database is bogged down you could be waiting for seconds for that thing to return you know if
00:12:13.019 under Peak load or if someone runs some giant query that they're not supposed to run you could be sitting there waiting
00:12:18.060 forever your Upstream one could actually time out whereas writing to Kafka is fast once you write to Kafka it's fire
00:12:24.600 and forget continue on with your life right and the downstream system will catch up
00:12:29.760 um and it's also one and done there's no Downstream consumers here the history is kind of lost unless you're like logging
00:12:34.860 all requests right um the history is lost you can't go back and examine your your past history and
00:12:40.500 things like that it's one and done not so great gRPC this is like a streaming technology uh Google's streaming
00:12:46.380 technology which is a much more efficient protocol than HTTP but it suffers from a lot of the same problems
00:12:52.980 um again you can multicast with that but again it's not fault tolerant et cetera et cetera
00:12:59.579 um Kafka fixes these it's buffered which means um like it can smooth out performance spikes right because if you
00:13:05.579 get a spike of writes the reads are relatively consistent it's based on the number of consumers you have and the number of partitions it's durable the
00:13:12.120 message is not lost you can go back and replay it's consistent um and it's democratized meaning lots of
00:13:18.959 consumers can subscribe to it which you know can pay off in the future for use cases you haven't even anticipated yet
00:13:25.019 idempotency is key I put that key emoji on there to reinforce it's key that was my last-minute addition in the speaker
00:13:30.420 lounge it's key um because Kafka always retries there's always the possibility of
00:13:36.839 duplicated messages so you have to upsert on the write side there um otherwise you could get duplicate
00:13:43.440 inserts right so it's got to be idempotent operations
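One way to make that write side idempotent is Rails 6+'s upsert, as in this sketch inside the consumer loop from earlier (table and column names are illustrative; it assumes a unique index on external_id):

    consumer.each_message do |message|
      event = JSON.parse(message.value)
      # Re-delivering or replaying the same event rewrites the same row
      # instead of inserting a duplicate.
      User.upsert(
        { external_id: event["id"], email: event["email"], name: event["name"] },
        unique_by: :external_id
      )
    end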
00:13:48.899 um so use case number one was service to service use case number two here is this kind of multicast idea imagine
00:13:54.360 you've got the service in a new user signs up there's a lot of things you need to do send a welcome email
00:14:00.320 insert a search record like maybe in your elasticsearch cluster so that person is searchable in your directory
00:14:05.339 maybe you suggest people to follow or recipes to try or whatever and you've got some machine learning model or
00:14:10.500 something that you're that you're calling on here we've split that up with multiple consumers um those are now independent they are
00:14:17.220 decoupled you've got independent retry for each one of these and because of guaranteed delivery you know that even
00:14:22.500 if one of these services is down it doesn't affect the other ones and the message will never get lost as long as you get it running again at some point your
00:14:29.040 messages are still there and you're guaranteed to get all these steps to complete heterogeneous data stores similar but
00:14:36.360 different but this is a great case where you've got a rails app you know primarily you're running on postgres but a lot of us also need other things we
00:14:43.019 want to sort of mature a bit and use a real search engine like elasticsearch right maybe we want to use a reporting
00:14:48.180 database like snowflake which is great for running big aggregate queries this
00:14:53.519 allows you to make sure that all of these databases are consistent and have the same data in them if your app writes
00:14:59.279 to Kafka you know that because of its guaranteed delivery that all of these will eventually get the message and then
00:15:05.760 you'll have the same thing in every database and once again that's where the ordering comes in not only will they get there they'll arrive in order so you
00:15:11.760 don't have old things overriding new ones um one thing I do have to mention
00:15:16.860 because if you look at Kafka you're going to come up with this it's called Kafka connect and this is a solution to
00:15:22.199 make it easier to produce and consume from Kafka it's this application you spin up a bunch of pods right like a
00:15:27.899 bunch of containers of this they talk to each other and you can create consumers and producers just by submitting
00:15:33.060 configuration files so if you want to write to Elasticsearch write to S3 HTTP post these batches you just
00:15:39.660 send these JSON configs to Kafka Connect and it spins up connectors for you and in that way a lot of the
00:15:46.620 downstream stuff like the typical things of like take a message write it to a database you know take a message HTTP
00:15:52.260 post it somewhere take a message put it in S3 right you don't have to write all that code
00:15:58.680 um and like I said this works really well on the consuming end so you've got all these built-in database connectors
00:16:03.959 so you don't even have to develop this stuff yourself what the heck
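Submitting one of those configs is just an HTTP POST to Kafka Connect's REST API; here's a sketch in Ruby, assuming a Connect cluster at connect.internal:8083 with Confluent's Elasticsearch sink connector installed (all names and settings are illustrative):

    require "net/http"
    require "json"
    require "uri"

    connector = {
      "name"   => "widgets-elasticsearch-sink",
      "config" => {
        "connector.class" => "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics"          => "widgets",
        "connection.url"  => "http://elasticsearch.internal:9200",
        "key.ignore"      => "false"
      }
    }

    response = Net::HTTP.post(
      URI("http://connect.internal:8083/connectors"),
      connector.to_json,
      "Content-Type" => "application/json"
    )
    puts response.code # 201 means Connect is now running the sink for you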
00:16:10.199 now it is Java but you know you don't actually have to write that much Java to use it right this is the
00:16:17.040 spare girl right here she's really sad about that um and then use case number four right
00:16:22.260 similar but different is this idea of a data lake so if you're not familiar with like sort of uh Enterprise data
00:16:28.260 infrastructure parlance a data lake is like just a giant catch-all of all your Enterprises data most people including
00:16:36.000 us we implement this with S3 buckets right Amazon's cheap storage and this is
00:16:41.040 kind of cool if your services are communicating with each other in Kafka there's no reason not to put a
00:16:46.620 sink on every single Kafka topic and back it all up to the cloud right so it's there forever we have ours there
00:16:53.639 we've never used it but it's there in case we do right but seriously um this
00:17:00.600 um there's new technologies that are coming online that make basically querying S3 buckets as if they're SQL
00:17:06.839 cheaper and better and it's been slow to mature which has been frustrating but this stuff is coming which can basically
00:17:12.179 make all like years and years and years of your Enterprise's data searchable and queryable which is really
00:17:17.220 powerful and Kafka and Kafka connect helps you get it in there I tried to put like a lake Emoji or like a lake clip
00:17:23.760 art behind the buckets but I couldn't get it to look good so I just use a blue circle you know because Lakes are usually kind of ovular and less square
00:17:29.640 and I made it blue to look like water um my dream ultimately we have not built
00:17:35.940 this but I would love to um would be the ability to take historical data out of our data Lake and
00:17:41.880 load it back into Kafka so if you wanted to say hey I want to reprocess all of 2021's data to power a new machine
00:17:48.840 learning model you know we could press a button and have this tool created that goes puts it in Kafka you know populates
00:17:55.440 this thing or if you wanted to say like hey I've redesigned my reporting database I want to use a totally different table structure drop the old
00:18:02.340 database replay traffic from the beginning of time and repopulate that we're trying to build a system where like for
00:18:08.880 every domain like a domain is like like a tool or a piece a slice of functionality in our in our system right
00:18:14.100 that we have complete historical data and the ability to rebuild it to basically repopulate by replaying all
00:18:20.220 that traffic we don't have that yet but we're going to get there
00:18:26.100 um um I forgot to change the title slide but finally use case number six um this
00:18:32.700 should be titled um event sourcing uh we all know that in a rails project right you typically attach
00:18:39.120 it to a database right but you don't necessarily have to do that if you read a lot of the literature on Kafka they
00:18:45.059 advocate this pattern where the rails app or your producing app writes to Kafka first then you have a consumer that
00:18:51.299 writes to the database and all the app does is select from the database so it reads but does not write to the database this
00:18:58.200 decoupling is really powerful because again if you've got multiple databases a search Index right a machine learning
00:19:03.840 model a reporting database and things like that it's easier to make sure they're all the
00:19:09.059 same um we call this event sourcing though you know definitions may vary about what
00:19:14.760 exactly that means uh but it's a powerful thing and if you're doing this in rails of course then you get to like go and reorganize your rails app rails
00:19:20.880 is great for this the way you can customize it you'd have like different read models different write models right and you get to basically redo your whole
00:19:27.660 data layer and create your own framework which would be fun
00:19:33.240 kind of thing uh usually well not usually but often I'd say this is coupled with materialized views so
00:19:40.320 like imagine your basic widget right you've got a widget list view the list of all the widgets you've got an item view that shows like the details of that
00:19:46.740 widget you've got like an aggregate view like how many of those widgets sold you know day by day month by month week by
00:19:52.679 week using this allows you to pre-materialize those things to pre-compute them to make these uh tables
00:20:00.179 with that data in it so you don't have to query your database and calculate that on the fly right there are many
00:20:05.580 ways you can do it postgres has materialized views you know you could use a nosql style store which is pretty
00:20:11.400 common right and actually basically create you know data that matches what your page looks like so list view in this
00:20:17.340 table an item view in this table right an aggregate view in this table you need Kafka to do that because it's the only
00:20:23.880 way to ensure that they all get the same data because of guaranteed delivery three consumers right really consumer
00:20:29.820 groups three consumer groups writing to three stores you know that they all succeed and you don't get those inconsistencies again if you try to just
00:20:36.600 insert into list view insert into item view insert into aggregate view if something fails right then you're
00:20:42.780 inconsistent you've got you know bad data forever basically
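A sketch of one of those consumer groups, say the item view (topic, table, and column names are hypothetical; Rails 6+ upsert and a unique index on widget_id are assumed); the list view and the aggregate view would be identical consumers with their own group_id on the same topic, so each gets every event and each can be rebuilt by replaying:

    require "kafka"
    require "json"

    kafka    = Kafka.new(["kafka1:9092"], client_id: "widget-views")
    consumer = kafka.consumer(group_id: "widget-item-view")
    consumer.subscribe("widgets")

    consumer.each_message do |message|
      event = JSON.parse(message.value)
      # Precompute exactly what the item page needs; upsert keeps replays idempotent.
      WidgetItemView.upsert(
        { widget_id: event["id"], name: event["name"], price_cents: event["price_cents"] },
        unique_by: :widget_id
      )
    end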
00:20:48.720 um and then once again right with this event sourcing pattern where you don't write to your database
00:20:54.380 it does make it easier for heterogeneous stores right a reporting database a search index Etc
00:21:00.720 there's a name for this right this is one version of a pattern called command query responsibility segregation right
00:21:07.320 which basically means that your read model is different from your write model um and you know it already kind of is if
00:21:13.980 I were gonna you know take a shot at rails um you've got this model class which
00:21:19.200 represents a table right let's say that's a user um you always write one user record at a
00:21:24.660 time but when you read you're often inner joining aren't you join join joins and how many people have rails apps where
00:21:30.299 like half the queries are like eight joins or more right we do so fundamentally that is a different shape
00:21:36.000 of data isn't it it looks totally different yet they're one model file which you know leads to some confusion
00:21:42.179 um this definitely smashes that in the extreme way by using totally different tables right
00:21:48.720 um use case number six region to region replication uh we have customers in America we have customers in Europe and
00:21:55.380 Australia and Canada and other places and um they do actually share records because we have general
00:22:01.020 construction contractors you know in Europe who hire subcontractors in the US and um Kafka makes a great way
00:22:08.100 to um to basically replicate that across regions by having one consumer there are
00:22:13.320 a number of patterns you can do this with you can have one consumer you know post to another region
00:22:18.780 um there's this thing called MirrorMaker which replicates a Kafka topic conveniently it is part of Kafka Connect
00:22:24.120 so once again if you don't mind a little bit of java right
00:22:29.640 um you know you can employ that right and replicate topics across zones
00:22:35.400 um one more pattern that I will mention um is that like I said turning your rails app into a Kafka consumer it
00:22:42.179 definitely can be done um there's another way which is to have an HTTP sink which is just something
00:22:48.539 that consumes a micro batch from Kafka and HTTP posts it to your service right you still get the same retry
00:22:55.380 mechanisms because if the post returns you know anything but a 200 response it continues to retry and that saves you
00:23:01.320 some of the ops headaches of having to like spin up other instances of your rails app right or kind of hack the Ruby boot loading you
00:23:08.100 know the boot framework and stuff like that to figure out how to get your classes to run that way the other thing you could do is just use a plain old
00:23:13.559 Ruby consumer right and take your model classes out of your rails app and put them in a gem that you share
00:23:19.919 uh so let's talk a little bit about getting stuff into Kafka because if you don't get it into Kafka the right way if
00:23:25.980 you can't make your message production consistent um your whole system is going to be inconsistent inconsistency is bad
00:23:32.520 imagine you've got someone's medical records or you're dealing with someone's money God forbid you know you'd have one database that shows you have X dollars or
00:23:38.640 you know X whatever maladies and the other says something else you wouldn't want that
00:23:43.799 um so um there's an issue here that I've sort of glossed over
00:23:49.679 um but I'm going to address it now uh rails apps have a database typically
00:23:54.960 um and you may also want to write to Kafka right for Downstream things so
00:24:00.240 other things can see this data this message this introduces a problem called dual write
00:24:05.520 uh what dual write means is all right imagine I publish to Kafka and I open a database transaction insert into my
00:24:12.059 local database right well you've published to Kafka but your database transaction can fail it can roll back
00:24:17.400 and now you've got something in Kafka that's not in the database and you are inconsistent if I put it at the end that doesn't help
00:24:23.520 right because you put it in the database and then you can fail to publish to Kafka right now you are inconsistent
00:24:29.100 again because your local database does not reflect what's Downstream it doesn't help if you try to put the
00:24:34.860 publish in the middle of your database transaction which is typically a bad idea anyway um but sadly we do it a lot
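In code the dual write looks harmless, which is why it sneaks in; a sketch with placeholder names, where either ordering leaves a failure window:

    # Publish first: the message reaches Kafka, then the insert can raise and roll back.
    kafka.deliver_message(user_params.to_json, topic: "users")
    User.create!(user_params)

    # Or commit first: the row is in the database, then the publish can time out or crash.
    User.create!(user_params)
    kafka.deliver_message(user_params.to_json, topic: "users")

    # Nothing ties the two operations together, so either way the database and the
    # stream can drift apart.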
00:24:42.600 um side note right quick side note if you are using Sidekiq right you may already be victim to this problem and not even
00:24:49.500 know it Sidekiq uses redis right Sidekiq or Resque these are these Ruby job processing frameworks if you are
00:24:55.980 inserting into a database and then enqueuing a Sidekiq job there's no guarantee that you get one or the other and that is also a source of
00:25:01.679 inconsistencies or if you're using postgres for your database and elastic or like you know elasticsearch for
00:25:08.340 search right there's no guarantee that you get one or the other because you can always fail in the middle
00:25:14.220 um so to fix this we have a system called change data capture and this is a key to our like I told you we have this
00:25:19.260 giant monolith and we're trying to split it apart this is going to be a multi-year effort and it's probably going to take us eight years
00:25:24.779 um and take hundreds of developers but this is one way we get the data out in a way that avoids that dual write which is
00:25:30.840 uh change data capture we use a system called Debezium conveniently it is a Kafka Connect
00:25:37.559 um source connector which means that to customize it we have to write Java but um
00:25:43.260 what it does is it actually uh reads the Postgres write-ahead log right which is
00:25:49.260 where the data gets serialized on disk and in a single process reads it
00:25:54.299 writes it to Kafka and that way you can actually just track a table you give it the name of a table say it's my users table and it will read the data as it is
00:26:01.080 written to the write-ahead log in Postgres and publish it to Kafka and in that way your database always matches
00:26:08.220 what's in Kafka and you've fixed the dual write Java oh my gosh
00:26:13.799 um one more quick thing to realize about that is it comes with some caveats a rails database is typically if
00:26:21.179 you're using you know like postgres a relational database it's typically a normalized database you know something
00:26:27.059 simple like a user record a product record is often split into multiple tables you are leaking the implementation details of that
00:26:33.000 Downstream if you're just doing direct change data capture right we all know these wonderful rails patterns like
00:26:39.179 single table inheritance and polymorphic relationships now guess what you're dealing with that Downstream right which
00:26:44.520 is not very fun um so the way around that is using this pattern called transactional outbox
00:26:51.419 um it works like this you um you insert like say this is a denormalized user or
00:26:56.460 sorry a normalized user record right so you maybe insert into users you insert into accounts you insert into profiles
00:27:02.460 then you do a fourth insert into a special table called outbox right which could contain all that stuff
00:27:08.039 right you could take all the things that went into those three tables jam it together maybe even in some JSON blob
00:27:13.260 right and then publish it to the outbox and then you use change data capture to follow the Postgres write-ahead log right
00:27:20.179 consume that table publish it to Kafka and in that way because this is all in a database transaction so you begin commit
00:27:26.220 either all of them succeed or all of them fail and you're not left inconsistent
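A minimal sketch of that outbox write, assuming hypothetical User/Account/Profile models and an outbox table with aggregate_id, event_type, and payload columns:

    ActiveRecord::Base.transaction do
      user    = User.create!(user_params)
      account = Account.create!(user: user, plan: "trial")
      profile = Profile.create!(user: user, display_name: user_params[:name])

      # One denormalized event per business change; change data capture follows this
      # table and publishes it to Kafka, so the commit and the event can't diverge.
      Outbox.create!(
        aggregate_id: user.id,
        event_type:   "user.created",
        payload: { user: user.as_json, account: account.as_json,
                   profile: profile.as_json }.to_json
      )
    end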
00:27:33.059 running Debezium right um it's actually not that hard you turn on Postgres logical replication you do have to run Kafka Connect we've got it
00:27:38.100 in kubernetes and then you submit this configuration file with your database connection
00:27:44.279 strings and stuff like that and it kind of attaches and reads all the changes we have it used to be I don't know if it's
00:27:50.220 still true the world's highest volume Aurora Postgres instance in AWS we were processing we still are
00:27:56.400 processing something over 200,000 transactions a second and it keeps up on a single thread
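That configuration file is just another connector config submitted to Kafka Connect's /connectors endpoint, like the sink example earlier; a sketch of a Debezium Postgres source connector (hostnames and names are placeholders, and exact property names vary a little between Debezium versions):

    {
      "name"   => "app-db-outbox",
      "config" => {
        "connector.class"    => "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name"        => "pgoutput",          # Postgres logical decoding plugin
        "database.hostname"  => "db.internal",
        "database.port"      => "5432",
        "database.user"      => "debezium",
        "database.password"  => "...",               # from your secret store
        "database.dbname"    => "app_production",
        "topic.prefix"       => "app",               # "database.server.name" on older versions
        "table.include.list" => "public.outbox"      # follow only the outbox table
      }
    }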
00:28:03.779 you must choose but choose wisely it's RailsConf right I had to put a meme somewhere um so we've got a few methods of
00:28:10.320 publishing directly to Kafka synchronously which is the safe way to do it asynchronously which is faster but you
00:28:16.260 might lose messages if the server restarts right or crashes CDC
00:28:21.299 change data capture just tracking your database tables and publishing the changes directly to
00:28:26.520 Kafka or transactional outbox the cleaner way to do it because you can decouple your stream from your database
00:28:34.919 consistent regimes require consistent producers um your requirements may vary honestly a
00:28:40.080 lot of you could probably deal with the dual write because um you know unless you're dealing with someone's medical history it might not
00:28:45.360 be that big of a problem don't over build it don't use this stuff just because we use it we have different requirements than you
00:28:52.380 so you know summing up a little bit of this stuff event streaming gives us a globally distributed eventually
00:28:57.600 consistent near real-time way to send messages between services and across zones or regions in a way that is fast
00:29:03.360 and reliable with strong guarantees that it is correct
00:29:08.400 um I'm running out of time so I'm going to go through this real fast what's in a message you've got headers partition key
00:29:14.159 the data and you got to choose a format right um you know at your company choose
00:29:20.279 standard headers for all your messages for all different services that allows you to route and filter and do different things Json is good but you might
00:29:27.659 consider a serialized binary format like Avro or Protobuf it's more efficient and these have schemas that get
00:29:34.440 recorded in the schema registry which allows you to track schemas that vary over time there's an issue if you're dealing with like data lakes and
00:29:40.380 reporting databases of schema evolution this helps you keep track of what message had what schema at what time
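A sketch of what choosing those pieces looks like when publishing with the ruby-kafka producer (recent ruby-kafka versions support message headers; the header names and JSON payload here are just conventions, and Avro or Protobuf would replace the to_json step):

    producer = kafka.producer
    producer.produce(
      { id: 42, email: "user@example.com" }.to_json,   # the data
      topic:         "user-signups",
      partition_key: "42",                             # ordering per user
      headers: {                                       # standard headers for routing and filtering
        "event-type"     => "user.created",
        "schema-version" => "3",
        "producer"       => "identity-service"
      }
    )
    producer.deliver_messages # synchronous flush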
00:29:47.880 the future one more thing Apache Flink I've got one minute I don't know why they chose a squirrel for their like whatever I guess
00:29:54.480 those are cute I find them annoying they're annoying because they're cute because like you want to pet them but no
00:30:00.059 one's ever petted a squirrel right um Flink is this thing we're going to
00:30:05.159 add it in the future is super powerful I'm really excited it allows you to join streams together like crossing the
00:30:10.620 streams and Ghostbusters right imagine you've got a situation where you've got like a record of like a post like social
00:30:15.659 media like a Facebook post right it has a user ID oh um well wait now I've got another stream
00:30:21.539 another like a Kafka topic with all my users in it I can actually join them together just like a SQL join on a
00:30:27.120 stream that's great imagine you want to make these searchable right like in Elasticsearch you can't index this post
00:30:32.159 in your Elasticsearch if you don't have the name of the author in it right so by combining by enriching one
00:30:37.679 stream with another incredibly powerful stuff um you can aggregate records Flink has a
00:30:44.279 built-in data store so you can aggregate oh my time is up um aggregate records right just like a group by but on a
00:30:49.500 stream great article the original by Jay Kreps um kind of talks in abstract terms about
00:30:55.679 streams and about Kafka this is essential reading if you get into this I'm Brad Urani I used to tweet I don't
00:31:01.620 tweet anymore I quit I was addicted it made my life better except then I took that like compulsive scrolling and I got addicted to eBay instead and started
00:31:08.159 collecting watches and now I own 100 watches um anyway I work in Austin Texas at Procore we are hiring we're a great
00:31:15.720 place to work come talk to me unfortunately I don't have time for questions but you can come up and ask me if you want nope