
Geolocation EXPLAINed

by Kevin Lesht

In the session titled "Geolocation EXPLAINed," Kevin Lesht discusses how to determine the location of visitors to a website quickly and efficiently. This presentation stems from his experiences at Home Chef, where he encountered challenges in conditionally displaying meal offerings based on customers' geographical locations. The goal was to provide site visitors with relevant meal options depending on which facility their orders were being shipped from, all while ensuring quick response times.

Key Points Discussed:
- Problem Overview: Lesht outlines the major challenge of identifying logged-out visitors' locations without having access to their delivery address, which they can only obtain from logged-in customers.
- Initial Solutions Explored: Several approaches were considered. Third-party geolocation APIs were ruled out because the round trip would add too much latency to every web request, and Cloudflare's geolocation data (available because Cloudflare manages Home Chef's DNS) only provided country-level granularity, which wasn't precise enough.
- Finding a Viable Solution: The talk introduces MaxMind, a service that provides geolocation data linked to IP addresses. Lesht explains how they set up their own geolocation system from scratch using this data.
- Understanding IP Addressing: He delves into networking concepts, including CIDR notation, to explain how to manage and interpret IP address ranges effectively.
- Performance Tuning: A portion of the session focuses on database optimization. Lesht uses PostgreSQL's EXPLAIN and EXPLAIN ANALYZE output to investigate why the geolocation query runs slowly, walking through three fundamental operations: sequential scans, nested loop joins, and index scans.
- Optimization Techniques: Lesht demonstrates how to improve query performance with indexing strategies, showing that a geolocation request that initially took approximately 200 milliseconds could be brought down to around 1.67 milliseconds once the query was restructured to use an index scan.
- Final Results: The improvements dramatically expedited geolocation requests, from roughly 200 milliseconds down to under 2 milliseconds, turning one of the app's slowest transactions into one that barely registers in monitoring.

In conclusion, the presentation not only provides technical insights into geolocation implementation via IP address management but also illustrates how database optimizations, especially indexing, can lead to substantial performance improvements. Lesht emphasizes the importance of these techniques in enhancing user experience and meeting business needs in real-time data processing.

How do you find the location of someone visiting your site? And, how do you do it fast? If you've ever been curious about how analytics services can place your site visitors on a map, or about how to analyze and improve a slow running query, then this talk is for you!

In this session, you'll learn about IP address networking, fundamental database operations, and query performance tuning. We'll develop a geolocation system from the ground up, and make sure it's running lightning fast along the way.

RailsConf 2022

00:00:16.880 geolocation explained
00:00:20.039 we get about 200 slides to get through
00:00:22.920 with this talk here so we're gonna have
00:00:25.019 some fun
00:00:26.340 just trying not to blank okay
00:00:29.939 so the idea for this talk came from some
00:00:31.740 work that we recently produced at Home
00:00:34.079 Chef where I work on our engineering
00:00:36.239 team
00:00:37.200 Home Chef the meal kit delivery company
00:00:39.540 and each week we put out a menu with
00:00:42.000 meals
00:00:43.559 customers can then order meals from a
00:00:45.840 menu
00:00:47.280 and the ingredients for those meals come
00:00:49.200 pre-portioned and ready to cook
00:00:52.500 we ship our orders out of three
00:00:54.300 facilities scattered across the U.S
00:00:58.620 and until recently all customers had
00:01:01.800 access to the same set of meals
00:01:04.199 regardless of where they were ordering
00:01:06.060 from
00:01:07.439 but then we wanted to make a change
00:01:09.720 product team wanted to be able to offer
00:01:12.000 additional meals to customers serviced by
00:01:14.880 our West Coast facility much like a
00:01:17.280 digital A/B test we wanted to begin
00:01:19.380 testing the impact of different meal
00:01:21.479 offerings
00:01:22.619 now this opened up a number of different
00:01:25.020 challenges that we had to work through
00:01:26.640 most of which revolved around figuring
00:01:29.460 out what meals were eligible for a
00:01:32.220 customer
00:01:33.780 if my order is being delivered out of our
00:01:36.000 West Coast facility I'll have access to
00:01:38.640 all the meals
00:01:40.079 but if my order is being delivered out
00:01:42.000 of our Midwest or east coast facility
00:01:45.600 I'm not going to have access to the test
00:01:47.579 meals that are only available out of our
00:01:50.159 West Coast facility
00:01:52.920 so for logged in customers this wasn't
00:01:54.720 too bad to figure out we've got the
00:01:56.880 customer's delivery address and we know
00:01:58.979 which facility is going to be shipping
00:02:01.259 an order for that address
00:02:03.180 so we can determine what meals the
00:02:05.100 customer can access
00:02:07.560 but the logged out experience became a
00:02:10.140 little more interesting
00:02:11.940 for site visitors we don't have any
00:02:14.640 address information and our marketing
00:02:16.980 team wanted to be able to conditionally
00:02:18.540 display the different menu offerings and
00:02:21.420 as best as we could
00:02:23.040 make sure that someone visiting the site
00:02:24.840 from within the service range of our
00:02:26.580 West Coast facility would see those test
00:02:28.560 meals
00:02:29.580 conversely visitors outside of the range
00:02:32.459 of our West Coast facility shouldn't see
00:02:35.340 the test meals
00:02:38.580 so without access to the shipping
00:02:40.560 address information that we have on
00:02:42.480 logged in customers we had to figure out
00:02:45.060 how to find location for individuals
00:02:47.519 visiting our site and that's what my
00:02:49.680 talk brings you today
00:02:50.940 we're going to step through how to
00:02:52.800 geo-locate site visitors and how to do
00:02:55.319 it fast yeah
00:02:57.840 okay so to restate our problem
00:03:00.120 when someone visits our website we want
00:03:02.519 to find their geographic location and
00:03:05.040 then conditionally display some content
00:03:07.080 based on that location
00:03:09.660 and because we're dealing with display
00:03:10.980 logic here we really want to get to that
00:03:13.200 location fast so we can bring our page
00:03:15.840 content to the visitor as quickly as we
00:03:18.000 can
00:03:19.500 so how do we do it
00:03:21.120 well we started by exploring a few
00:03:23.220 options
00:03:24.180 first we thought about leveraging
00:03:26.040 some service with an API to
00:03:29.040 handle everything for us
00:03:30.599 but this option got ruled out pretty
00:03:33.060 quickly because it would add too much
00:03:35.099 time onto our web requests
00:03:37.319 the amount of time it would take to send
00:03:39.239 a request out to a third party and then
00:03:41.400 hear back
00:03:42.480 it's just a non-starter
00:03:44.519 next we thought about cloudflare
00:03:46.620 cloudflare manages our DNS and so a
00:03:49.319 request coming to our site it's first
00:03:51.120 going to pass through them
00:03:52.860 they provide geolocation as a part of
00:03:55.379 their analytics platform so we explored
00:03:57.720 if that information could be forwarded
00:03:59.760 from cloudflare to our app
00:04:02.580 unfortunately they only offer country code
00:04:05.280 which just wasn't granular enough
00:04:08.640 then we found this service called
00:04:10.140 MaxMind
00:04:11.400 MaxMind provides geolocation data that
00:04:14.700 allows you to find a location based on
00:04:16.859 an IP address
00:04:18.120 and rails does give us easy access to a
00:04:21.239 visitor's connecting address
00:04:22.860 so this one seemed like it could be a
00:04:24.360 good fit
00:04:26.040 the geolocation data comes as two tables
00:04:28.500 they separate the IP data and the
00:04:31.620 location data and the two are linked
00:04:33.660 together by Foreign key
00:04:35.699 a network block record belongs to a
00:04:38.280 network location and a network location
00:04:40.460 may have many Network blocks pretty cool
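
To make that relationship concrete, here is a minimal sketch of how the two tables might be modeled in Rails; the model and column names (network_blocks, network_locations, geoname_id, and the start/end columns) follow MaxMind's GeoLite2 CSV conventions but are assumptions for illustration, not the speaker's actual code.

    # Hypothetical models mirroring the MaxMind CSV layout
    class NetworkLocation < ApplicationRecord
      # columns might include: geoname_id, country_iso_code, city_name, time_zone
      has_many :network_blocks, foreign_key: :geoname_id, primary_key: :geoname_id
    end

    class NetworkBlock < ApplicationRecord
      # columns might include: network (CIDR), network_start_ip, network_end_ip, geoname_id
      belongs_to :network_location, foreign_key: :geoname_id, primary_key: :geoname_id
    end
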
00:04:45.240 only let's take a closer look at this
00:04:47.880 data and specifically this network
00:04:50.460 address field
00:04:52.080 it looks like an IP address but what's
00:04:54.960 going on with this slash and the number
00:04:57.000 after it
00:04:58.440 after some research what we have here is
00:05:01.020 CIDR notation this stands for Classless
00:05:04.860 Inter-Domain Routing which is a way to
00:05:08.040 specify within an IP the number of bits
00:05:10.620 that make up the networking portion of
00:05:12.240 the address
00:05:13.139 and the range of IP addresses that are
00:05:15.720 assignable off of that Network
00:05:19.080 so how do we get the range of IP
00:05:20.759 addresses from the notation well let's
00:05:23.520 start by breaking down what an IP
00:05:25.800 address actually is
00:05:27.780 an ipv4 address is composed of 32 bits
00:05:31.800 broken down into four 8-bit segments as
00:05:36.240 divided by a period
00:05:39.900 each segment is typically a numerical
00:05:42.060 value from 0 to 255 and a segment can
00:05:46.620 also be referred to as an octet to
00:05:49.080 express that it's representing eight
00:05:50.820 bits of data plus just a word that's
00:05:53.940 more fun to say
00:05:55.560 so back to cidr notation
00:05:57.780 the number following the forward slash
00:05:59.880 represents the number of bits that make
00:06:02.039 up the networking portion of the address
00:06:04.860 taking our example here it'd be the
00:06:06.900 first three octets that make up 24 bits
00:06:09.900 and Define our Network address
00:06:14.160 the rest of the address can be used to
00:06:16.080 define hosts on the network
00:06:18.960 so looking at this network address any
00:06:21.419 client on this network will have an IP
00:06:23.759 address where that last segment is
00:06:26.039 between 0 and 255.
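
As a quick illustration, Ruby's standard library can expand a CIDR block into the range of addresses it covers; the network below is a documentation example, not one from the talk.

    require "ipaddr"

    block = IPAddr.new("192.0.2.0/24")   # 24 network bits, 8 host bits
    range = block.to_range               # every address assignable on this network
    range.first.to_s                     # => "192.0.2.0"
    range.last.to_s                      # => "192.0.2.255"
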
00:06:29.400 to put pictures to all of this if we
00:06:32.100 take a city like Chicago there might be
00:06:34.319 any number of internet service providers
00:06:36.120 and any number of network nodes handling
00:06:38.639 the clients of those providers
00:06:42.180 and each node is going to have its own
00:06:44.460 unique Network address and an allocation
00:06:46.919 of IP addresses that can be issued off
00:06:49.319 of that Network address
00:06:51.060 and all of this can be represented by
00:06:53.639 cidr notation
00:06:55.860 looking at network node one we know that
00:06:58.259 the first 16 bits of the address are
00:07:00.840 going to represent the networking
00:07:02.220 portion and so it's the last 16 bits
00:07:05.280 that are addressable
00:07:07.440 for network node two we see 24 bits as
00:07:10.620 the network address so that leaves 8
00:07:12.840 Bits as addressable
00:07:15.300 from this we know that for a client
00:07:17.160 connecting to network node one
00:07:19.199 they're going to have an IP address
00:07:20.699 that's within the bounds of that node
00:07:22.979 cidr block and just the same for network
00:07:25.560 node two
00:07:26.699 any client being serviced by this node
00:07:29.160 is going to have an address that's
00:07:30.780 within the bounds of its cidr block
00:07:34.500 once we know the bounds of a cidr block
00:07:36.840 we can identify the starting address and
00:07:39.180 ending address and when a request comes
00:07:41.460 in we find the block record that
00:07:43.680 contains our incoming address
00:07:45.780 from there we just need to follow the
00:07:47.940 foreign key to find the location
00:07:50.460 okay
00:07:52.620 so how do we code out that lookup
00:07:54.840 well we've got these two tables and say
00:07:57.120 the IP address of the client visiting
00:07:59.220 our site
00:08:00.240 the first thing that we could do is
00:08:02.340 filter down our Network blocks table to
00:08:04.500 only those instances where the IP
00:08:06.599 address fits between the starting and
00:08:08.880 ending address
00:08:10.259 and with our Network block records
00:08:12.060 unique by starting address and ending
00:08:14.099 address that should take us down to one
00:08:16.560 row
00:08:17.759 once we've found that record the next
00:08:20.160 thing we could do is leverage that Geo
00:08:22.080 Name ID field which acts as a foreign
00:08:24.419 key pointing to a record on the network
00:08:26.460 locations table
00:08:28.199 what this means is that we can follow
00:08:30.419 the value off of our Network blocks
00:08:32.640 table to join a network locations row
00:08:35.219 where the Geo Name ID field is of the
00:08:37.919 same value
00:08:40.320 once we've joined our rows we can set up
00:08:42.719 our select to return everything off of
00:08:45.000 the network locations table since it's
00:08:47.220 ultimately the location data that we
00:08:48.959 care about
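
Putting those steps together, the SQL being described might look roughly like this sketch; the column names are assumed, the IP is a placeholder literal, and the bound columns are assumed to be stored in a type that compares correctly (such as Postgres inet).

    # a rough sketch of the lookup being described (names assumed, not the talk's actual SQL)
    lookup_sql = <<~SQL
      SELECT network_locations.*
      FROM network_blocks
      JOIN network_locations
        ON network_locations.geoname_id = network_blocks.geoname_id
      WHERE network_blocks.network_start_ip <= '198.51.100.7'
        AND network_blocks.network_end_ip   >= '198.51.100.7'
    SQL
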
00:08:50.220 bring this into Rails for you Active
00:08:52.320 Record fans out there here's what that
00:08:54.360 syntax would look like
00:08:55.740 nice and tidy with the only change to
00:08:57.959 our query that gets generated from this
00:08:59.580 being
00:09:00.839 that a limit is applied which comes from
00:09:03.000 the find_by
00:09:04.140 but since we know that we should be
00:09:05.700 finding a distinct match anyways not a
00:09:08.040 problem and makes for a nice interface
00:09:09.660 within our app
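
The slide itself isn't in the transcript, but a hedged guess at that Active Record call, reusing the assumed model and column names from the earlier sketches, might be:

    NetworkLocation
      .joins(:network_blocks)
      .find_by(
        "network_blocks.network_start_ip <= :ip AND network_blocks.network_end_ip >= :ip",
        ip: "198.51.100.7"
      )
    # find_by adds the LIMIT 1 mentioned above and returns a single NetworkLocation
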
00:09:11.399 putting everything together we now have
00:09:13.620 a system in place for taking an IP
00:09:15.600 address and finding the location for
00:09:18.060 that address
00:09:19.200 pretty cool
00:09:21.480 so we send this work out the door and
00:09:23.640 everything's working as expected only
00:09:26.580 things are running a bit slow
00:09:30.060 taking a look at New Relic our
00:09:31.740 performance monitoring service around
00:09:33.360 the time the work was sent out
00:09:35.760 the controller action where our query
00:09:37.740 lives was already listed within the key
00:09:40.740 transactions overview
00:09:42.660 if you're using New Relic this list here
00:09:44.820 is going to be where you'll find the
00:09:46.560 longest running transactions within your
00:09:48.300 app not a good place to be
00:09:50.580 and to give us a closer look we can
00:09:52.380 navigate through to get even more
00:09:54.240 information about what all is happening
00:09:56.880 this chart is going to break down all of
00:09:59.519 the components making up the response to
00:10:02.100 the request
00:10:03.120 and if we scroll down a bit we're given
00:10:05.279 a nice breakdown table
00:10:07.320 here we can see that it's taking about
00:10:09.240 200 milliseconds to find our Network
00:10:12.060 location
00:10:13.320 and this 200 milliseconds is making up
00:10:15.779 about 95 percent of our geolocation
00:10:18.720 request time
00:10:20.519 not the best for our end users out there
00:10:23.100 but this does offer up something nice
00:10:25.560 what we have here is an opportunity
00:10:27.360 those performance reviews don't write
00:10:29.399 themselves so let's see if we can dig in
00:10:31.620 and make some improvements
00:10:34.200 let's go back to our query
00:10:36.720 looking at things directly we could make
00:10:38.880 some assumptions but we don't really
00:10:40.860 have eyes into why this thing is taking
00:10:43.019 up so much time
00:10:44.640 luckily database Technologies provide us
00:10:47.519 with some utility functions to detail
00:10:50.100 out exactly how they plan to execute our
00:10:53.160 queries
00:10:54.180 we'll be referencing postgres as our
00:10:56.279 database technology but everything that
00:10:58.740 we're going to talk through should
00:11:00.300 generally apply for all SQL-backed
00:11:02.040 databases
00:11:03.959 so back to those utility functions by
00:11:06.600 prefacing our query with an explain
00:11:08.339 statement postgres will evaluate our ask
00:11:11.100 and reference our schema and some
00:11:13.800 statistics on our database like table
00:11:15.660 size to plot out what it thinks will be
00:11:18.540 the best way to respond to our query
00:11:21.720 the output here is a set of nodes that
00:11:24.000 make up a plan each node represents an
00:11:26.880 operation that postgres will perform as
00:11:29.399 a part of fulfilling the query
00:11:31.380 the main thing that postgres is
00:11:33.779 optimizing for when putting together a
00:11:35.640 plan is Speed
00:11:37.500 postgres wants to return to you an
00:11:39.240 answer as quickly as it can and it's
00:11:41.760 going to make some reasonably informed
00:11:43.620 decisions about how to accomplish that
00:11:47.040 taking things one step further adding
00:11:49.500 an analyze onto our explain statement
00:11:51.899 is going to Output what would actually
00:11:53.880 happen when your query runs
00:11:56.519 explain is projection explain analyze is
00:11:59.579 retrospection
00:12:01.320 the thing to be aware of with explain
00:12:02.880 analyze is that postgres is actually
00:12:05.279 going to run your query and return both
00:12:07.740 the plan of execution and also some data
00:12:10.500 points on the actual run of each node in
00:12:13.140 the plan tree
00:12:14.640 here we can see how long each node took
00:12:17.880 to run and how many rows were evaluated
00:12:20.640 at each step along the way
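
If you want to try this from a Rails console, a relation can report its plan, and you can hand Postgres the raw statement for actual timings; this is a sketch using the assumed models from earlier, not the talk's exact queries.

    relation = NetworkLocation
                 .joins(:network_blocks)
                 .where(
                   "network_blocks.network_start_ip <= :ip AND network_blocks.network_end_ip >= :ip",
                   ip: "198.51.100.7"
                 )
                 .limit(1)

    puts relation.explain  # projection: the plan Postgres intends to use

    # retrospection: EXPLAIN ANALYZE actually executes the query and reports real timings
    puts ActiveRecord::Base.connection
           .execute("EXPLAIN ANALYZE #{relation.to_sql}")
           .values
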
00:12:23.760 now explain statements used to look a
00:12:25.860 little intimidating to me and I think
00:12:27.480 that came from just not knowing the
00:12:29.279 language of all that was being said and
00:12:31.800 also how all the pieces fit together
00:12:34.560 so this plan here jumped out as a really
00:12:37.079 good candidate for demonstrating three
00:12:39.480 fundamental database operations that we
00:12:41.820 can step through and then leverage to
00:12:43.980 understand why our queries running so
00:12:45.899 slow and how we can make it run faster
00:12:49.320 the three operations that we're going to
00:12:51.240 take a look at are going to be a
00:12:52.980 sequential scan
00:12:54.360 nested Loop join and index scan
00:12:58.560 gonna be the crowd favorite right there
00:13:00.000 I guarantee it
00:13:02.220 so a sequential scan can be thought of
00:13:04.320 as a for Loop and to put this into terms
00:13:07.139 of time complexity because that's what
00:13:09.480 our database cares about a sequential
00:13:11.760 scan is going to be a linear operation
00:13:14.060 what this means is that the time it
00:13:16.680 takes to perform a sequential scan is
00:13:18.779 going to be directly related to the size
00:13:20.760 of your data set
00:13:22.139 in a worst case scenario where we start
00:13:24.720 scanning at the top of our table and the
00:13:27.000 item we're looking for is ordered last
00:13:28.860 the time it takes to get there is going
00:13:31.200 to be the length of our data set
00:13:33.480 for that reason linear operations can
00:13:36.600 also be represented like o of n where n
00:13:40.560 is the number of items in your set
00:13:43.380 now if you're not familiar with this
00:13:45.300 kind of Big O notation
00:13:47.519 it's just a way of representing time in
00:13:49.500 an abstract way it lets us classify
00:13:51.959 operations without having to deal with
00:13:53.760 raw numbers of the data set or having to
00:13:56.579 factor in variable stuff like computer
00:13:58.440 processing power or language speed
00:14:01.560 it's a way to communicate a generalized
00:14:03.839 runtime
00:14:04.920 so to put this into practical terms
00:14:07.260 taking our where statement from our
00:14:09.480 geolocation query we're looking for a
00:14:12.000 record that can bound our IP address and
00:14:14.519 in order for postgres to fulfill this
00:14:16.680 condition by sequential scan it's going
00:14:19.380 to need to visit every item in our table
00:14:21.420 to identify any matches
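
In plain Ruby terms, a sequential scan behaves like the loop below; find_block_for is a hypothetical helper written only to illustrate the linear cost.

    # O(n): worst case we visit every row before finding (or ruling out) a match
    def find_block_for(ip, blocks)
      blocks.each do |block|
        return block if block.start_ip <= ip && ip <= block.end_ip
      end
      nil
    end
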
00:14:26.279 and that's what we see happening in our
00:14:27.839 query here
00:14:28.920 this portion of the explain statement
00:14:30.959 is telling us that postgres visited a
00:14:33.060 little over a million rows before it
00:14:35.399 found a record meeting our condition
00:14:38.519 now something to note here
00:14:40.320 there are over 3 million rows in our
00:14:42.540 Network blocks table and we know that
00:14:45.240 Network block records represent a unique
00:14:47.760 range of IP addresses and that there
00:14:49.680 should only be one record that can bound
00:14:52.320 an IP address
00:14:53.820 but how was postgres able to figure that
00:14:55.740 out without having to visit all three
00:14:57.600 million rows why was it able to stop
00:15:00.600 filtering about a third of the way
00:15:02.279 through after only visiting a million
00:15:04.380 rows out of the 3 million
00:15:07.320 well the answer is postgres knows that
00:15:09.540 this sequential scan is serving a higher
00:15:11.579 purpose going into the execution of our
00:15:14.040 query it already has a plan and knows
00:15:17.040 that part of that plan is limiting the
00:15:18.899 return to one record
00:15:20.699 so once we've found a match we can stop
00:15:23.100 our scan and move on to the next step of
00:15:25.860 the plan
00:15:27.300 now speaking of a higher purpose
00:15:29.399 postgres has decided that the best way
00:15:31.500 to perform our query is to get our
00:15:33.360 Network block data as a sub step of our
00:15:35.699 join statement
00:15:36.959 it's a good time to note that query
00:15:39.180 plans are executed inside out meaning
00:15:42.480 that the far most nested nodes are going
00:15:44.820 to evaluate before their ancestors
00:15:47.639 this step being interior to the nested
00:15:50.699 Loop node is acting as a dependency
00:15:53.699 postgres has determined that to
00:15:55.320 effectively perform our join we're going
00:15:57.779 to need the return from this step of the
00:15:59.459 plan first
00:16:00.779 but with that in hand
00:16:02.579 let's now take a look at nested loops
00:16:05.519 now there are actually a bunch of
00:16:06.959 algorithms that postgres can choose from
00:16:09.000 for joining two tables together and
00:16:11.339 nested Loop is the most basic operation
00:16:14.279 what's pretty cool is that a nested Loop
00:16:16.500 join can actually be achieved by a
00:16:18.600 number of different time complexities
00:16:21.000 the slowest version of this is going to
00:16:23.760 be performed by sequentially scanning
00:16:25.440 all of the items in your first table and
00:16:27.899 for each item in your first table
00:16:29.540 sequentially scanning each item in your
00:16:32.100 second table
00:16:33.300 at each visit we're evaluating the join
00:16:35.880 condition for True to see if we found a
00:16:38.100 match
00:16:41.639 may have got a little carried away with
00:16:42.899 the slides here
00:16:45.899 we'll call it right about there
00:16:49.440 so in other terms it's a nested for Loop
00:16:52.019 which can be classified as multi-linear
00:16:54.959 time and can be represented by o of n
00:16:58.500 times M where n is the number of items
00:17:01.980 in your first table and M is the number
00:17:04.620 of items in your second table
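
A toy version of that nested loop join in Ruby, again purely illustrative, shows where the n times m cost comes from:

    # O(n * m): for every row of the first table, scan every row of the second
    def nested_loop_join(blocks, locations)
      blocks.flat_map do |block|
        locations
          .select { |location| location.geoname_id == block.geoname_id }
          .map    { |location| [block, location] }
      end
    end
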
00:17:06.839 now that's a pretty slow time
00:17:08.819 complexity
00:17:09.959 what's cool about understanding this
00:17:11.760 though is that it highlights a reason
00:17:13.380 why databases like postgres always index
00:17:16.319 primary keys
00:17:17.760 usually a join between two tables is
00:17:20.760 going to be done by matching a foreign
00:17:22.380 key that lines up with a primary key on
00:17:24.839 another table
00:17:26.040 coming off of our Network blocks table
00:17:27.720 we already have a reference to the key
00:17:30.000 we're looking for in our Network
00:17:31.559 locations table and as we've seen by
00:17:34.320 running a sequential scan we can step
00:17:36.480 through each row to try and find a match
00:17:39.780 but an index scan offers us a way of
00:17:43.080 getting there more quickly
00:17:44.580 so let's take a look at that
00:17:47.400 now postgres supports a bunch of options
00:17:50.100 for creating an index and by default
00:17:52.500 it's going to use a balanced tree
00:17:54.539 structure
00:17:55.799 so if we were to create a b tree index
00:17:58.020 on say a table's primary key postgres is
00:18:01.440 going to gather up all of the values in
00:18:03.240 that field and find a relative midpoint
00:18:06.059 it's then going to distribute our data
00:18:08.160 downwards following a simple set of
00:18:10.200 rules and these rules are going to be
00:18:12.600 incredibly helpful if we want to read
00:18:14.700 back our data
00:18:16.919 they are
00:18:18.299 a higher value will be placed to the
00:18:20.640 right of the current node
00:18:22.620 and a lower value will be placed to the
00:18:24.960 left of the current node
00:18:27.299 following these rules postgres will step
00:18:29.460 through our data set to build out our
00:18:31.500 tree
00:18:36.900 now I do want to note what we've built
00:18:39.539 here is a bit of a simplification of a
00:18:42.299 balanced tree structure and all that it
00:18:44.160 can do which is actually a lot more like
00:18:46.500 having more than two children off of a
00:18:48.360 node and even clustering data together
00:18:50.400 on a single node but we had so much
00:18:54.059 slide space here and remember we're
00:18:56.220 already pushing like 200 slides let's
00:18:58.799 let's see how we're doing by the way
00:19:00.960 okay okay
00:19:03.419 back to indexing what we're going to
00:19:06.120 step through does highlight the
00:19:07.559 fundamentals it's really the rules for
00:19:09.780 organization that we care about and for
00:19:11.940 the most part those are the same
00:19:14.220 following these rules we're going to be
00:19:16.320 able to Traverse our tree and
00:19:18.120 demonstrate that the time complexity of
00:19:19.919 a b tree index scan is logarithmic which
00:19:22.980 can be represented as o of log of n
00:19:26.100 where n is the number of items in our
00:19:28.740 data set
00:19:30.000 so check this out
00:19:31.559 take a value of 53. in order to find the
00:19:34.799 node for this value in our tree we can
00:19:36.960 leverage the rules of the structure to
00:19:38.940 answer some basic questions at each stop
00:19:41.280 we make
00:19:42.419 is the value I'm looking for greater
00:19:44.880 than the current value
00:19:46.679 on our root node here 53 is greater than
00:19:49.620 50 so we move to the right
00:19:52.440 53 is less than 64. so we move to the
00:19:56.100 left and boom we've found the node for
00:19:59.280 our data point
00:20:00.539 by following this progression we're able
00:20:02.580 to throw away half of our remaining tree
00:20:04.980 at each stop and that's what makes this
00:20:07.440 logarithmic
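
That descent is easy to sketch in Ruby; the tiny tree below mirrors the 50/64/53 example from the talk, while the Node struct and bst_include? helper are made up for illustration.

    Node = Struct.new(:value, :left, :right)

    # O(log n) on a balanced tree: each comparison discards half of what's left
    def bst_include?(node, value)
      return false if node.nil?
      return true  if node.value == value

      value > node.value ? bst_include?(node.right, value) : bst_include?(node.left, value)
    end

    root = Node.new(50, Node.new(25), Node.new(64, Node.new(53), Node.new(80)))
    bst_include?(root, 53)  # => true, reached in two hops: right of 50, left of 64
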
00:20:09.000 following a sequential scan we would
00:20:11.220 have had to have visited each row on our
00:20:13.260 table until we found a match and the
00:20:15.600 cost of that operation would scale
00:20:17.460 linearly with the size of our data set
00:20:20.460 but with an index scan we don't have to
00:20:23.280 visit every record for each move we make
00:20:26.340 we effectively filter out half of the
00:20:28.740 remaining data and that's how
00:20:30.720 logarithmic operations really show their
00:20:32.940 speed
00:20:34.500 say we start kicking up the size of
00:20:36.059 our data set
00:20:37.200 to find a record using a sequential scan
00:20:39.600 which is a linear operation the time it
00:20:42.539 would take to find an item is going to
00:20:44.160 be directly proportional to the number
00:20:46.140 of items in our set
00:20:47.880 by using an index scan running in
00:20:50.100 logarithmic Time Each doubling of our
00:20:53.039 data here only introduces one added unit
00:20:56.520 of time onto our search
00:20:59.039 we can find our way through a million
00:21:01.140 items in at most 20 progressions through
00:21:03.720 our index
00:21:04.799 with logarithmic time our data can grow
00:21:07.260 exponentially and the cost of finding a
00:21:09.660 node won't change all that much it's
00:21:12.480 super powerful
00:21:14.520 now one thing to keep in mind
00:21:16.620 indexes exist separate from your data
00:21:19.679 what this means is that once we've found
00:21:21.900 a node matching our condition postgres
00:21:24.600 is going to follow some metadata stored
00:21:26.940 with that node to access our table row
00:21:29.280 which lives in What's called the Heap
00:21:31.740 not a lot we're going to dig into here
00:21:33.780 but it is a good thing to have an
00:21:35.520 awareness for
00:21:36.659 consider that for every table change
00:21:39.299 your database may have to update any
00:21:41.940 indexes that might need to understand
00:21:43.559 that change
00:21:45.419 the concern here could be that if you
00:21:48.059 have a heavily indexed table all of your
00:21:50.280 right and delete operations might get
00:21:52.919 expensive as you spend time keeping your
00:21:54.900 indexes in check
00:21:56.460 not something you run into often
00:21:58.500 but some good knowledge to keep in your
00:22:00.659 back pocket for your next database
00:22:02.340 trivia night
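
For reference, declaring an index like the ones discussed here is a one-line Rails migration; the table and column names are the same assumed ones as before, and every index added is one more structure Postgres has to keep in sync on writes.

    class AddStartAndEndIpIndexToNetworkBlocks < ActiveRecord::Migration[7.0]
      def change
        # a composite B-tree index over the block bounds (names assumed for illustration)
        add_index :network_blocks, [:network_start_ip, :network_end_ip]
      end
    end
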
00:22:03.600 okay
00:22:05.700 so now we have a handle on indexing
00:22:07.380 let's look back at our query plan
00:22:09.600 we know that in order to fulfill our
00:22:11.640 join statement we're going to have to
00:22:13.500 visit both tables there's no getting
00:22:15.299 around that and luckily we Dodge the
00:22:18.059 double sequential scan our Network
00:22:20.280 locations Geo Name ID index is allowing
00:22:22.919 us to more quickly scan that table for
00:22:25.500 any rows that match the foreign key on
00:22:27.900 our Network block rows
00:22:30.120 now before we set up a nested Loop join
00:22:32.880 as multi-linear because we had to
00:22:34.860 sequentially scan both tables but
00:22:37.740 knowing that we're able to leverage the
00:22:39.059 index on network locations the cost of
00:22:41.940 our join is now going to come in at
00:22:44.460 O of n times log of m time: one
00:22:48.840 sequential scan to find network block
00:22:50.400 records and then for each record that we
00:22:52.860 find which from our limit will just be
00:22:55.320 one an index lookup to find our Network
00:22:58.500 location
00:23:00.780 so if we were looking to optimize this
00:23:02.760 query it's that sequential scan that
00:23:05.220 grabs our attention this thing is really
00:23:07.380 slowing us down
00:23:08.820 taking a look at the actual runtime
00:23:10.860 that's everything right there the
00:23:12.960 visiting of over a million rows it's
00:23:15.179 killing us
00:23:17.039 so what if we could evaluate that table
00:23:19.080 with an index scan that would take us to
00:23:22.260 the most performant option for our query
00:23:24.299 here
00:23:25.559 O of log of n plus log of m time: one
00:23:29.760 index lookup by IP address to find our
00:23:31.799 Network block record and then using the
00:23:34.020 Geo Name ID value off of that record
00:23:36.120 one index lookup to find its Network
00:23:39.360 location
00:23:40.860 by understanding the time complexities
00:23:42.840 associated with each operation we know
00:23:44.880 where to look and can leverage this
00:23:46.679 stuff to better design our query or our
00:23:48.960 schema to take advantage
00:23:51.720 now remember back to our explain
00:23:53.220 statement
00:23:54.419 postgres is going to provide you with
00:23:56.520 what it thinks is the fastest query plan
00:23:59.039 and it's usually really good about
00:24:01.380 making that decision
00:24:02.940 but postgres only has available what we
00:24:05.340 give it and so in using explain
00:24:07.080 statements we might be able to identify
00:24:09.480 an opportunity for adding a new index or
00:24:12.179 maybe we could change our query
00:24:14.159 altogether to coerce postgres towards
00:24:16.919 putting together a better plan
00:24:18.900 and that's what we were able to do here
00:24:20.940 an index on the starting IP address and
00:24:23.760 ending IP address was already available
00:24:26.700 but postgres still decided to go for the
00:24:29.220 sequential scan it turns out that by
00:24:32.340 throwing a limit statement onto our
00:24:34.500 query of a low enough value in our case
00:24:37.080 one which came from Active Record's find_by
00:24:39.179 postgres thinks that a sequential
00:24:41.820 scan is going to find our record faster
00:24:44.100 than going through the index remember
00:24:46.320 the query plan is built on projection
00:24:49.500 in actuality though we saw from the
00:24:52.260 numbers that this assumption postgres is
00:24:54.419 making turns out to be the bottleneck of
00:24:56.520 our query
00:24:57.720 so how do we influence postgres towards
00:25:00.419 using that index
00:25:01.860 well
00:25:03.059 turns out to be pretty simple clever but
00:25:06.120 simple
00:25:07.500 if we add an order statement to our
00:25:09.960 query postgres knows that it's going to
00:25:12.360 have to sort our rows at some point in
00:25:14.159 time and with a sequential scan this has
00:25:17.220 to be done as an added step only after
00:25:20.039 we've collected all the rows can
00:25:21.900 postgres figure out how they should be
00:25:23.820 sorted
00:25:25.080 but if you remember back to the rules of
00:25:27.659 how an index is designed they're built
00:25:30.120 around an order postgres is able to
00:25:32.460 leverage this and as it progresses
00:25:34.260 through an index scan it can keep track
00:25:36.539 of how the rows it picked up were
00:25:38.460 ordered
00:25:39.360 this allows us to sort along the way and
00:25:42.299 eliminates the need of that extra step
00:25:44.279 that came with the sequential scan
00:25:47.039 by adding an order statement to our
00:25:48.960 query postgres is going to change its
00:25:51.600 plan to favor our index so that it can
00:25:54.000 reduce what would have been two steps a
00:25:56.460 sequential scan followed by a sort down
00:25:59.700 to one step
00:26:01.740 an index scan sorting along the way
00:26:05.640 by engaging our index we lose the
00:26:08.279 sequential scan and pick up a huge
00:26:10.440 performance win
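
In Active Record terms the nudge is small; a hedged sketch with the same assumed names, where the order clause is the only change from the earlier query:

    NetworkLocation
      .joins(:network_blocks)
      .order("network_blocks.network_start_ip")  # gives Postgres a reason to walk the index, sorted for free
      .find_by(
        "network_blocks.network_start_ip <= :ip AND network_blocks.network_end_ip >= :ip",
        ip: "198.51.100.7"
      )
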
00:26:12.059 with the initial version of our query
00:26:13.980 using the sequential scan we had to
00:26:16.620 filter through about a third of the
00:26:18.120 network blocks table and that led to a
00:26:20.400 run time of about 200 milliseconds
00:26:23.460 now with the index scan the runtime on
00:26:26.520 this thing comes down to
00:26:30.200 0.05 milliseconds
00:26:32.820 look at that change
00:26:35.159 rolling the updated query out to our app
00:26:37.320 check out this reduction of runtime
00:26:39.659 before a geolocation request took about
00:26:42.779 200 milliseconds and was one of the top
00:26:45.480 offending transactions in our app
00:26:47.760 now
00:26:49.140 doesn't even register
00:26:51.559 1.67 milliseconds that's over 100 times
00:26:55.080 faster right there and that is going to
00:26:57.900 look very nice on my performance review
00:27:01.080 and I hope it made for a nice talk too
00:27:02.940 thank you
00:27:05.960 yeah thank you all for attending
00:27:10.200 thank you to our organizers
00:27:12.299 and thank you to our sponsors
00:27:14.460 the slides for this talk can be found at
00:27:16.919 kevinlesht.com/railsconf2022
00:27:21.299 if you have any questions or compliments
00:27:24.360 for me you can find me on Twitter at
00:27:28.320 Kevin Lesht
00:27:28.320 and if you'd like to check out my other
00:27:29.940 projects you can keep up with those at
00:27:32.400 kevinlesht.com
00:27:34.200 thank you