
Geolocation EXPLAINed

by Kevin Lesht

In the session titled "Geolocation EXPLAINed," Kevin Lesht discusses how to determine the location of visitors to a website quickly and efficiently. This presentation stems from his experiences at Home Chef, where he encountered challenges in conditionally displaying meal offerings based on customers' geographical locations. The goal was to provide site visitors with relevant meal options depending on which facility their orders were being shipped from, all while ensuring quick response times.

Key Points Discussed:
- Problem Overview: Lesht outlines the major challenge of identifying logged-out visitors' locations without having access to their delivery address, which they can only obtain from logged-in customers.
- Initial Solutions Explored: Several approaches were considered. Third-party geolocation APIs were ruled out because the round trip would add too much latency to every web request, and Cloudflare's geolocation data (available because Cloudflare manages Home Chef's DNS) only provided country-level granularity, which wasn't precise enough.
- Finding a Viable Solution: The talk introduces MaxMind, a service that provides geolocation data linked to IP addresses. Lesht explains how they set up their own geolocation system from scratch using this data.
- Understanding IP Addressing: He delves into networking concepts, including CIDR notation, to explain how to manage and interpret IP address ranges effectively.
- Performance Tuning: A portion of the session focuses on database optimization. Lesht uses PostgreSQL's EXPLAIN and EXPLAIN ANALYZE output to investigate why the geolocation query runs slowly, walking through three fundamental operations: sequential scans, nested loop joins, and index scans.
- Optimization Techniques: Lesht demonstrates how to improve query performance with indexing strategies, showing that a geolocation request that initially took approximately 200 milliseconds could be brought down to around 1.67 milliseconds once the query was restructured to use an index scan.
- Final Results: The improvements dramatically expedited geolocation requests, from roughly 200 milliseconds down to under 2 milliseconds, turning one of the app's slowest transactions into one that barely registers in monitoring.

In conclusion, the presentation not only provides technical insights into geolocation implementation via IP address management but also illustrates how database optimizations, especially indexing, can lead to substantial performance improvements. Lesht emphasizes the importance of these techniques in enhancing user experience and meeting business needs in real-time data processing.

How do you find the location of someone visiting your site? And, how do you do it fast? If you've ever been curious about how analytics services can place your site visitors on a map, or about how to analyze and improve a slow running query, then this talk is for you!

In this session, you'll learn about IP address networking, fundamental database operations, and query performance tuning. We'll develop a geolocation system from the ground up, and make sure it's running lightning fast along the way.

RailsConf 2022

00:00:16.880 geolocation explained
00:00:20.039 we get about 200 slides to get through
00:00:22.920 with this talk here so we're gonna have
00:00:25.019 some fun
00:00:26.340 just trying not to blank okay
00:00:29.939 so the idea for this talk came from some
00:00:31.740 work that we recently produced at Home
00:00:34.079 Chef where I work on our engineering
00:00:36.239 team
00:00:37.200 Home Chef the meal kit delivery company
00:00:39.540 and each week we put out a menu with
00:00:42.000 meals
00:00:43.559 customers can then order meals from a
00:00:45.840 menu
00:00:47.280 and the ingredients for those meals come
00:00:49.200 pre-portioned and ready to cook
00:00:52.500 we ship our orders out of three
00:00:54.300 facilities scattered across the U.S
00:00:58.620 and until recently all customers had
00:01:01.800 access to the same set of meals
00:01:04.199 regardless of where they were ordering
00:01:06.060 from
00:01:07.439 but then we wanted to make a change
00:01:09.720 product team wanted to be able to offer
00:01:12.000 additional meals to customers serviced by
00:01:14.880 our West Coast facility much like a
00:01:17.280 digital A/B test we wanted to begin
00:01:19.380 testing the impact of different meal
00:01:21.479 offerings
00:01:22.619 now this opened up a number of different
00:01:25.020 challenges that we had to work through
00:01:26.640 most of which revolved around figuring
00:01:29.460 out what meals were eligible for a
00:01:32.220 customer
00:01:33.780 if my order is being delivered out of our
00:01:36.000 West Coast facility I'll have access to
00:01:38.640 all the meals
00:01:40.079 but if my order is being delivered out
00:01:42.000 of our Midwest or east coast facility
00:01:45.600 I'm not going to have access to the test
00:01:47.579 meals that are only available out of our
00:01:50.159 West Coast facility
00:01:52.920 so for logged in customers this wasn't
00:01:54.720 too bad to figure out we've got the
00:01:56.880 customer's delivery address and we know
00:01:58.979 which facility is going to be shipping
00:02:01.259 an order for that address
00:02:03.180 so we can determine what meals the
00:02:05.100 customer can access
00:02:07.560 but the logged out experience became a
00:02:10.140 little more interesting
00:02:11.940 for site visitors we don't have any
00:02:14.640 address information and our marketing
00:02:16.980 team wanted to be able to conditionally
00:02:18.540 display the different menu offerings and
00:02:21.420 as best as we could
00:02:23.040 make sure that someone visiting the site
00:02:24.840 from within the service range of our
00:02:26.580 West Coast facility would see those test
00:02:28.560 meals
00:02:29.580 conversely visitors outside of the range
00:02:32.459 of our West Coast facility shouldn't see
00:02:35.340 the test meals
00:02:38.580 so without access to the shipping
00:02:40.560 address information that we have on
00:02:42.480 logged in customers we had to figure out
00:02:45.060 how to find location for individuals
00:02:47.519 visiting our site and that's what my
00:02:49.680 talk brings you today
00:02:50.940 we're going to step through how to
00:02:52.800 geo-locate site visitors and how to do
00:02:55.319 it fast yeah
00:02:57.840 okay so to restate our problem
00:03:00.120 when someone visits our website we want
00:03:02.519 to find their geographic location and
00:03:05.040 then conditionally display some content
00:03:07.080 based on that location
00:03:09.660 and because we're dealing with display
00:03:10.980 logic here we really want to get to that
00:03:13.200 location fast so we can bring our page
00:03:15.840 content to the visitor as quickly as we
00:03:18.000 can
00:03:19.500 so how do we do it
00:03:21.120 well we started by exploring a few
00:03:23.220 options
00:03:24.180 first we thought about leveraging
00:03:26.040 some service with an API to
00:03:29.040 handle everything for us
00:03:30.599 but this option got ruled out pretty
00:03:33.060 quickly because it would add too much
00:03:35.099 time onto our web requests
00:03:37.319 the amount of time it would take to send
00:03:39.239 a request out to a third party and then
00:03:41.400 hear back
00:03:42.480 it's just a non-starter
00:03:44.519 next we thought about cloudflare
00:03:46.620 cloudflare manages our DNS and so a
00:03:49.319 request coming to our site it's first
00:03:51.120 going to pass through them
00:03:52.860 they provide geolocation as a part of
00:03:55.379 their analytics platform so we explored
00:03:57.720 if that information could be forwarded
00:03:59.760 from cloudflare to our app
00:04:02.580 unfortunately they only offer country code
00:04:05.280 which just wasn't granular enough
00:04:08.640 then we found this service called
00:04:10.140 MaxMind
00:04:11.400 MaxMind provides geolocation data that
00:04:14.700 allows you to find a location based on
00:04:16.859 an IP address
00:04:18.120 and rails does give us easy access to a
00:04:21.239 visitor's connecting address
00:04:22.860 so this one seemed like it could be a
00:04:24.360 good fit
00:04:26.040 the geolocation data comes as two tables
00:04:28.500 they separate the IP data and the
00:04:31.620 location data and the two are linked
00:04:33.660 together by Foreign key
00:04:35.699 a network block record belongs to a
00:04:38.280 network location and a network location
00:04:40.460 may have many Network blocks pretty cool
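
To make that relationship concrete, here is a minimal sketch of how the two tables might be modeled in Rails; the model and column names (network_blocks, network_locations, geoname_id, and the start/end columns) follow MaxMind's GeoLite2 CSV conventions but are assumptions for illustration, not the speaker's actual code.

    # Hypothetical models mirroring the MaxMind CSV layout
    class NetworkLocation < ApplicationRecord
      # columns might include: geoname_id, country_iso_code, city_name, time_zone
      has_many :network_blocks, foreign_key: :geoname_id, primary_key: :geoname_id
    end

    class NetworkBlock < ApplicationRecord
      # columns might include: network (CIDR), network_start_ip, network_end_ip, geoname_id
      belongs_to :network_location, foreign_key: :geoname_id, primary_key: :geoname_id
    end
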
00:04:45.240 only let's take a closer look at this
00:04:47.880 data and specifically this network
00:04:50.460 address field
00:04:52.080 it looks like an IP address but what's
00:04:54.960 going on with this slash and the number
00:04:57.000 after it
00:04:58.440 after some research what we have here is
00:05:01.020 CIDR notation this stands for Classless
00:05:04.860 Inter-Domain Routing which is a way to
00:05:08.040 specify within an IP the number of bits
00:05:10.620 that make up the networking portion of
00:05:12.240 the address
00:05:13.139 and the range of IP addresses that are
00:05:15.720 assignable off of that Network
00:05:19.080 so how do we get the range of IP
00:05:20.759 addresses from the notation well let's
00:05:23.520 start by breaking down what an IP
00:05:25.800 address actually is
00:05:27.780 an ipv4 address is composed of 32 bits
00:05:31.800 broken down into four 8-bit segments as
00:05:36.240 divided by a period
00:05:39.900 each segment is typically a numerical
00:05:42.060 value from 0 to 255 and a segment can
00:05:46.620 also be referred to as an octet to
00:05:49.080 express that it's representing eight
00:05:50.820 bits of data plus just a word that's
00:05:53.940 more fun to say
00:05:55.560 so back to cidr notation
00:05:57.780 the number following the forward slash
00:05:59.880 represents the number of bits that make
00:06:02.039 up the networking portion of the address
00:06:04.860 taking our example here it'd be the
00:06:06.900 first three octets that make up 24 bits
00:06:09.900 and Define our Network address
00:06:14.160 the rest of the address can be used to
00:06:16.080 define hosts on the network
00:06:18.960 so looking at this network address any
00:06:21.419 client on this network will have an IP
00:06:23.759 address where that last segment is
00:06:26.039 between 0 and 255.
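
As a quick illustration, Ruby's standard library can expand a CIDR block into the range of addresses it covers; the network below is a documentation example, not one from the talk.

    require "ipaddr"

    block = IPAddr.new("192.0.2.0/24")   # 24 network bits, 8 host bits
    range = block.to_range               # every address assignable on this network
    range.first.to_s                     # => "192.0.2.0"
    range.last.to_s                      # => "192.0.2.255"
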
00:06:29.400 to put pictures to all of this if we
00:06:32.100 take a city like Chicago there might be
00:06:34.319 any number of internet service providers
00:06:36.120 and any number of network nodes handling
00:06:38.639 the clients of those providers
00:06:42.180 and each node is going to have its own
00:06:44.460 unique Network address and an allocation
00:06:46.919 of IP addresses that can be issued off
00:06:49.319 of that Network address
00:06:51.060 and all of this can be represented by
00:06:53.639 cidr notation
00:06:55.860 looking at network node one we know that
00:06:58.259 the first 16 bits of the address are
00:07:00.840 going to represent the networking
00:07:02.220 portion and so it's the last 16 bits
00:07:05.280 that are addressable
00:07:07.440 for network node two we see 24 bits as
00:07:10.620 the network address so that leaves 8
00:07:12.840 Bits as addressable
00:07:15.300 from this we know that for a client
00:07:17.160 connecting to network node one
00:07:19.199 they're going to have an IP address
00:07:20.699 that's within the bounds of that node
00:07:22.979 cidr block and just the same for network
00:07:25.560 node two
00:07:26.699 any client being serviced by this node
00:07:29.160 is going to have an address that's
00:07:30.780 within the bounds of its cidr block
00:07:34.500 once we know the bounds of a cidr block
00:07:36.840 we can identify the starting address and
00:07:39.180 ending address and when a request comes
00:07:41.460 in we find the block record that
00:07:43.680 contains our incoming address
00:07:45.780 from there we just need to follow the
00:07:47.940 foreign key to find the location
00:07:50.460 okay
00:07:52.620 so how do we code out that lookup
00:07:54.840 well we've got these two tables and say
00:07:57.120 the IP address of the client visiting
00:07:59.220 our site
00:08:00.240 the first thing that we could do is
00:08:02.340 filter down our Network blocks table to
00:08:04.500 only those instances where the IP
00:08:06.599 address fits between the starting and
00:08:08.880 ending address
00:08:10.259 and with our Network block records
00:08:12.060 unique by starting address and ending
00:08:14.099 address that should take us down to one
00:08:16.560 row
00:08:17.759 once we've found that record the next
00:08:20.160 thing we could do is leverage that Geo
00:08:22.080 Name ID field which acts as a foreign
00:08:24.419 key pointing to a record on the network
00:08:26.460 locations table
00:08:28.199 what this means is that we can follow
00:08:30.419 the value off of our Network blocks
00:08:32.640 table to join a network locations row
00:08:35.219 where the Geo Name ID field is of the
00:08:37.919 same value
00:08:40.320 once we've joined our rows we can set up
00:08:42.719 our select to return everything off of
00:08:45.000 the network locations table since it's
00:08:47.220 ultimately the location data that we
00:08:48.959 care about
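
Putting those steps together, the SQL being described might look roughly like this sketch; the column names are assumed, the IP is a placeholder literal, and the bound columns are assumed to be stored in a type that compares correctly (such as Postgres inet).

    # a rough sketch of the lookup being described (names assumed, not the talk's actual SQL)
    lookup_sql = <<~SQL
      SELECT network_locations.*
      FROM network_blocks
      JOIN network_locations
        ON network_locations.geoname_id = network_blocks.geoname_id
      WHERE network_blocks.network_start_ip <= '198.51.100.7'
        AND network_blocks.network_end_ip   >= '198.51.100.7'
    SQL
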
00:08:50.220 bring this into Rails for you Active
00:08:52.320 Record fans out there here's what that
00:08:54.360 syntax would look like
00:08:55.740 nice and tidy with the only change to
00:08:57.959 our query that gets generated from this
00:08:59.580 being
00:09:00.839 that a limit is applied which comes from
00:09:03.000 the find_by
00:09:04.140 but since we know that we should be
00:09:05.700 finding a distinct match anyways not a
00:09:08.040 problem and makes for a nice interface
00:09:09.660 within our app
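
The slide itself isn't in the transcript, but a hedged guess at that Active Record call, reusing the assumed model and column names from the earlier sketches, might be:

    NetworkLocation
      .joins(:network_blocks)
      .find_by(
        "network_blocks.network_start_ip <= :ip AND network_blocks.network_end_ip >= :ip",
        ip: "198.51.100.7"
      )
    # find_by adds the LIMIT 1 mentioned above and returns a single NetworkLocation
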
00:09:11.399 putting everything together we now have
00:09:13.620 a system in place for taking an IP
00:09:15.600 address and finding the location for
00:09:18.060 that address
00:09:19.200 pretty cool
00:09:21.480 so we send this work out the door and
00:09:23.640 everything's working as expected only
00:09:26.580 things are running a bit slow
00:09:30.060 taking a look at New Relic our
00:09:31.740 performance monitoring service around
00:09:33.360 the time the work was sent out
00:09:35.760 the controller action where our query
00:09:37.740 lives was already listed within the key
00:09:40.740 transactions overview
00:09:42.660 if you're using New Relic this list here
00:09:44.820 is going to be where you'll find the
00:09:46.560 longest running transactions within your
00:09:48.300 app not a good place to be
00:09:50.580 and to give us a closer look we can
00:09:52.380 navigate through to get even more
00:09:54.240 information about what all is happening
00:09:56.880 this chart is going to break down all of
00:09:59.519 the components making up the response to
00:10:02.100 the request
00:10:03.120 and if we scroll down a bit we're given
00:10:05.279 a nice breakdown table
00:10:07.320 here we can see that it's taking about
00:10:09.240 200 milliseconds to find our Network
00:10:12.060 location
00:10:13.320 and this 200 milliseconds is making up
00:10:15.779 about 95 percent of our geolocation
00:10:18.720 request time
00:10:20.519 not the best for our end users out there
00:10:23.100 but this does offer up something nice
00:10:25.560 what we have here is an opportunity
00:10:27.360 those performance reviews don't write
00:10:29.399 themselves so let's see if we can dig in
00:10:31.620 and make some improvements
00:10:34.200 let's go back to our query
00:10:36.720 looking at things directly we could make
00:10:38.880 some assumptions but we don't really
00:10:40.860 have eyes into why this thing is taking
00:10:43.019 up so much time
00:10:44.640 luckily database Technologies provide us
00:10:47.519 with some utility functions to detail
00:10:50.100 out exactly how they plan to execute our
00:10:53.160 queries
00:10:54.180 we'll be referencing postgres as our
00:10:56.279 database technology but everything that
00:10:58.740 we're going to talk through should
00:11:00.300 generally apply for all SQL-backed
00:11:02.040 databases
00:11:03.959 so back to those utility functions by
00:11:06.600 prefacing our query with an explain
00:11:08.339 statement postgres will evaluate our ask
00:11:11.100 and reference our schema and some
00:11:13.800 statistics on our database like table
00:11:15.660 size to plot out what it thinks will be
00:11:18.540 the best way to respond to our query
00:11:21.720 the output here is a set of nodes that
00:11:24.000 make up a plan each node represents an
00:11:26.880 operation that postgres will perform as
00:11:29.399 a part of fulfilling the query
00:11:31.380 the main thing that postgres is
00:11:33.779 optimizing for when putting together a
00:11:35.640 plan is Speed
00:11:37.500 postgres wants to return to you an
00:11:39.240 answer as quickly as it can and it's
00:11:41.760 going to make some reasonably informed
00:11:43.620 decisions about how to accomplish that
00:11:47.040 taking things one step further adding
00:11:49.500 an analyze onto our explain statement
00:11:51.899 is going to Output what would actually
00:11:53.880 happen when your query runs
00:11:56.519 explain is projection explain analyze is
00:11:59.579 retrospection
00:12:01.320 the thing to be aware of with explain
00:12:02.880 analyze is that postgres is actually
00:12:05.279 going to run your query and return both
00:12:07.740 the plan of execution and also some data
00:12:10.500 points on the actual run of each node in
00:12:13.140 the plan tree
00:12:14.640 here we can see how long each node took
00:12:17.880 to run and how many rows were evaluated
00:12:20.640 at each step along the way
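
If you want to try this from a Rails console, a relation can report its plan, and you can hand Postgres the raw statement for actual timings; this is a sketch using the assumed models from earlier, not the talk's exact queries.

    relation = NetworkLocation
                 .joins(:network_blocks)
                 .where(
                   "network_blocks.network_start_ip <= :ip AND network_blocks.network_end_ip >= :ip",
                   ip: "198.51.100.7"
                 )
                 .limit(1)

    puts relation.explain  # projection: the plan Postgres intends to use

    # retrospection: EXPLAIN ANALYZE actually executes the query and reports real timings
    puts ActiveRecord::Base.connection
           .execute("EXPLAIN ANALYZE #{relation.to_sql}")
           .values
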
00:12:23.760 now explain statements used to look a
00:12:25.860 little intimidating to me and I think
00:12:27.480 that came from just not knowing the
00:12:29.279 language of all that was being said and
00:12:31.800 also how all the pieces fit together
00:12:34.560 so this plan here jumped out as a really
00:12:37.079 good candidate for demonstrating three
00:12:39.480 fundamental database operations that we
00:12:41.820 can step through and then leverage to
00:12:43.980 understand why our queries running so
00:12:45.899 slow and how we can make it run faster
00:12:49.320 the three operations that we're going to
00:12:51.240 take a look at are going to be a
00:12:52.980 sequential scan
00:12:54.360 nested Loop join and index scan
00:12:58.560 gonna be the crowd favorite right there
00:13:00.000 I guarantee it
00:13:02.220 so a sequential scan can be thought of
00:13:04.320 as a for Loop and to put this into terms
00:13:07.139 of time complexity because that's what
00:13:09.480 our database cares about a sequential
00:13:11.760 scan is going to be a linear operation
00:13:14.060 what this means is that the time it
00:13:16.680 takes to perform a sequential scan is
00:13:18.779 going to be directly related to the size
00:13:20.760 of your data set
00:13:22.139 in a worst case scenario where we start
00:13:24.720 scanning at the top of our table and the
00:13:27.000 item we're looking for is ordered last
00:13:28.860 the time it takes to get there is going
00:13:31.200 to be the length of our data set
00:13:33.480 for that reason linear operations can
00:13:36.600 also be represented like o of n where n
00:13:40.560 is the number of items in your set
00:13:43.380 now if you're not familiar with this
00:13:45.300 kind of Big O notation
00:13:47.519 it's just a way of representing time in
00:13:49.500 an abstract way it lets us classify
00:13:51.959 operations without having to deal with
00:13:53.760 raw numbers of the data set or having to
00:13:56.579 factor in variable stuff like computer
00:13:58.440 processing power or language speed
00:14:01.560 it's a way to communicate a generalized
00:14:03.839 runtime
00:14:04.920 so to put this into practical terms
00:14:07.260 taking our where statement from our
00:14:09.480 geolocation query we're looking for a
00:14:12.000 record that can bound our IP address and
00:14:14.519 in order for postgres to fulfill this
00:14:16.680 condition by sequential scan it's going
00:14:19.380 to need to visit every item in our table
00:14:21.420 to identify any matches
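
In plain Ruby terms, a sequential scan behaves like the loop below; find_block_for is a hypothetical helper written only to illustrate the linear cost.

    # O(n): worst case we visit every row before finding (or ruling out) a match
    def find_block_for(ip, blocks)
      blocks.each do |block|
        return block if block.start_ip <= ip && ip <= block.end_ip
      end
      nil
    end
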
00:14:26.279 and that's what we see happening in our
00:14:27.839 query here
00:14:28.920 this portion of the explain statement
00:14:30.959 is telling us that postgres visited a
00:14:33.060 little over a million rows before it
00:14:35.399 found a record meeting our condition
00:14:38.519 now something to note here
00:14:40.320 there are over 3 million rows in our
00:14:42.540 Network blocks table and we know that
00:14:45.240 Network block records represent a unique
00:14:47.760 range of IP addresses and that there
00:14:49.680 should only be one record that can bound
00:14:52.320 an IP address
00:14:53.820 but how was postgres able to figure that
00:14:55.740 out without having to visit all three
00:14:57.600 million rows why was it able to stop
00:15:00.600 filtering about a third of the way
00:15:02.279 through after only visiting a million
00:15:04.380 rows out of the 3 million
00:15:07.320 well the answer is postgres knows that
00:15:09.540 this sequential scan is serving a higher
00:15:11.579 purpose going into the execution of our
00:15:14.040 query it already has a plan and knows
00:15:17.040 that part of that plan is limiting the
00:15:18.899 return to one record
00:15:20.699 so once we've found a match we can stop
00:15:23.100 our scan and move on to the next step of
00:15:25.860 the plan
00:15:27.300 now speaking of a higher purpose
00:15:29.399 postgres has decided that the best way
00:15:31.500 to perform our query is to get our
00:15:33.360 Network block data as a sub step of our
00:15:35.699 join statement
00:15:36.959 it's a good time to note that query
00:15:39.180 plans are executed inside out meaning
00:15:42.480 that the far most nested nodes are going
00:15:44.820 to evaluate before their ancestors
00:15:47.639 this step being interior to the nested
00:15:50.699 Loop node is acting as a dependency
00:15:53.699 postgres has determined that to
00:15:55.320 effectively perform our join we're going
00:15:57.779 to need the return from this step of the
00:15:59.459 plan first
00:16:00.779 but with that in hand
00:16:02.579 let's now take a look at nested loops
00:16:05.519 now there are actually a bunch of
00:16:06.959 algorithms that postgres can choose from
00:16:09.000 for joining two tables together and
00:16:11.339 nested Loop is the most basic operation
00:16:14.279 what's pretty cool is that a nested Loop
00:16:16.500 join can actually be achieved by a
00:16:18.600 number of different time complexities
00:16:21.000 the slowest version of this is going to
00:16:23.760 be performed by sequentially scanning
00:16:25.440 all of the items in your first table and
00:16:27.899 for each item in your first table
00:16:29.540 sequentially scanning each item in your
00:16:32.100 second table
00:16:33.300 at each visit we're evaluating the join
00:16:35.880 condition for True to see if we found a
00:16:38.100 match
00:16:41.639 may have got a little carried away with
00:16:42.899 the slides here
00:16:45.899 we'll call it right about there
00:16:49.440 so in other terms it's a nested for Loop
00:16:52.019 which can be classified as multi-linear
00:16:54.959 time and can be represented by o of n
00:16:58.500 times M where n is the number of items
00:17:01.980 in your first table and M is the number
00:17:04.620 of items in your second table
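
A toy version of that nested loop join in Ruby, again purely illustrative, shows where the n times m cost comes from:

    # O(n * m): for every row of the first table, scan every row of the second
    def nested_loop_join(blocks, locations)
      blocks.flat_map do |block|
        locations
          .select { |location| location.geoname_id == block.geoname_id }
          .map    { |location| [block, location] }
      end
    end
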
00:17:06.839 now that's a pretty slow time
00:17:08.819 complexity
00:17:09.959 what's cool about understanding this
00:17:11.760 though is that it highlights a reason
00:17:13.380 why databases like postgres always index
00:17:16.319 primary keys
00:17:17.760 usually a join between two tables is
00:17:20.760 going to be done by matching a foreign
00:17:22.380 key that lines up with a primary key on
00:17:24.839 another table
00:17:26.040 coming off of our Network blocks table
00:17:27.720 we already have a reference to the key
00:17:30.000 we're looking for in our Network
00:17:31.559 locations table and as we've seen by
00:17:34.320 running a sequential scan we can step
00:17:36.480 through each row to try and find a match
00:17:39.780 but an index scan offers us a way of
00:17:43.080 getting there more quickly
00:17:44.580 so let's take a look at that
00:17:47.400 now postgres supports a bunch of options
00:17:50.100 for creating an index and by default
00:17:52.500 it's going to use a balanced tree
00:17:54.539 structure
00:17:55.799 so if we were to create a b tree index
00:17:58.020 on say a table's primary key postgres is
00:18:01.440 going to gather up all of the values in
00:18:03.240 that field and find a relative midpoint
00:18:06.059 it's then going to distribute our data
00:18:08.160 downwards following a simple set of
00:18:10.200 rules and these rules are going to be
00:18:12.600 incredibly helpful if we want to read
00:18:14.700 back our data
00:18:16.919 they are
00:18:18.299 a higher value will be placed to the
00:18:20.640 right of the current node
00:18:22.620 and a lower value will be placed to the
00:18:24.960 left of the current node
00:18:27.299 following these rules postgres will step
00:18:29.460 through our data set to build out our
00:18:31.500 tree
00:18:36.900 now I do want to note what we've built
00:18:39.539 here is a bit of a simplification of a
00:18:42.299 balanced tree structure and all that it
00:18:44.160 can do which is actually a lot more like
00:18:46.500 having more than two children off of a
00:18:48.360 node and even clustering data together
00:18:50.400 on a single node but we had so much
00:18:54.059 slide space here and remember we're
00:18:56.220 already pushing like 200 slides let's
00:18:58.799 let's see how we're doing by the way
00:19:00.960 okay okay
00:19:03.419 back to indexing what we're going to
00:19:06.120 step through does highlight the
00:19:07.559 fundamentals it's really the rules for
00:19:09.780 organization that we care about and for
00:19:11.940 the most part those are the same
00:19:14.220 following these rules we're going to be
00:19:16.320 able to Traverse our tree and
00:19:18.120 demonstrate that the time complexity of
00:19:19.919 a b tree index scan is logarithmic which
00:19:22.980 can be represented as o of log of n
00:19:26.100 where n is the number of items in our
00:19:28.740 data set
00:19:30.000 so check this out
00:19:31.559 take a value of 53. in order to find the
00:19:34.799 node for this value in our tree we can
00:19:36.960 leverage the rules of the structure to
00:19:38.940 answer some basic questions at each stop
00:19:41.280 we make
00:19:42.419 is the value I'm looking for greater
00:19:44.880 than the current value
00:19:46.679 on our root node here 53 is greater than
00:19:49.620 50 so we move to the right
00:19:52.440 53 is less than 64. so we move to the
00:19:56.100 left and boom we've found the node for
00:19:59.280 our data point
00:20:00.539 by following this progression we're able
00:20:02.580 to throw away half of our remaining tree
00:20:04.980 at each stop and that's what makes this
00:20:07.440 logarithmic
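
That descent is easy to sketch in Ruby; the tiny tree below mirrors the 50/64/53 example from the talk, while the Node struct and bst_include? helper are made up for illustration.

    Node = Struct.new(:value, :left, :right)

    # O(log n) on a balanced tree: each comparison discards half of what's left
    def bst_include?(node, value)
      return false if node.nil?
      return true  if node.value == value

      value > node.value ? bst_include?(node.right, value) : bst_include?(node.left, value)
    end

    root = Node.new(50, Node.new(25), Node.new(64, Node.new(53), Node.new(80)))
    bst_include?(root, 53)  # => true, reached in two hops: right of 50, left of 64
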
00:20:09.000 following a sequential scan we would
00:20:11.220 have had to have visited each row on our
00:20:13.260 table until we found a match and the
00:20:15.600 cost of that operation would scale
00:20:17.460 linearly with the size of our data set
00:20:20.460 but with an index scan we don't have to
00:20:23.280 visit every record for each move we make
00:20:26.340 we effectively filter out half of the
00:20:28.740 remaining data and that's how
00:20:30.720 logarithmic operations really show their
00:20:32.940 speed
00:20:34.500 say we start kicking up the size of
00:20:36.059 our data set
00:20:37.200 to find a record using a sequential scan
00:20:39.600 which is a linear operation the time it
00:20:42.539 would take to find an item is going to
00:20:44.160 be directly proportional to the number
00:20:46.140 of items in our set
00:20:47.880 by using an index scan running in
00:20:50.100 logarithmic Time Each doubling of our
00:20:53.039 data here only introduces one added unit
00:20:56.520 of time onto our search
00:20:59.039 we can find our way through a million
00:21:01.140 items in at most 20 progressions through
00:21:03.720 our index
00:21:04.799 with logarithmic time our data can grow
00:21:07.260 exponentially and the cost of finding a
00:21:09.660 node won't change all that much it's
00:21:12.480 super powerful
00:21:14.520 now one thing to keep in mind
00:21:16.620 indexes exist separate from your data
00:21:19.679 what this means is that once we've found
00:21:21.900 a node matching our condition postgres
00:21:24.600 is going to follow some metadata stored
00:21:26.940 with that node to access our table row
00:21:29.280 which lives in What's called the Heap
00:21:31.740 not a lot we're going to dig into here
00:21:33.780 but it is a good thing to have an
00:21:35.520 awareness for
00:21:36.659 consider that for every table change
00:21:39.299 your database may have to update any
00:21:41.940 indexes that might need to understand
00:21:43.559 that change
00:21:45.419 the concern here could be that if you
00:21:48.059 have a heavily indexed table all of your
00:21:50.280 right and delete operations might get
00:21:52.919 expensive as you spend time keeping your
00:21:54.900 indexes in check
00:21:56.460 not something you run into often
00:21:58.500 but some good knowledge to keep in your
00:22:00.659 back pocket for your next database
00:22:02.340 trivia night
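
For reference, declaring an index like the ones discussed here is a one-line Rails migration; the table and column names are the same assumed ones as before, and every index added is one more structure Postgres has to keep in sync on writes.

    class AddStartAndEndIpIndexToNetworkBlocks < ActiveRecord::Migration[7.0]
      def change
        # a composite B-tree index over the block bounds (names assumed for illustration)
        add_index :network_blocks, [:network_start_ip, :network_end_ip]
      end
    end
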
00:22:03.600 okay
00:22:05.700 so now we have a handle on indexing
00:22:07.380 let's look back at our query plan
00:22:09.600 we know that in order to fulfill our
00:22:11.640 join statement we're going to have to
00:22:13.500 visit both tables there's no getting
00:22:15.299 around that and luckily we Dodge the
00:22:18.059 double sequential scan our Network
00:22:20.280 locations Geo Name ID index is allowing
00:22:22.919 us to more quickly scan that table for
00:22:25.500 any rows that match the foreign key on
00:22:27.900 our Network block rows
00:22:30.120 now before we set up a nested Loop join
00:22:32.880 as multi-linear because we had to
00:22:34.860 sequentially scan both tables but
00:22:37.740 knowing that we're able to leverage the
00:22:39.059 index on network locations the cost of
00:22:41.940 our join is now going to come in at
00:22:44.460 O of n times log of m time: one
00:22:48.840 sequential scan to find network block
00:22:50.400 records and then for each record that we
00:22:52.860 find which from our limit will just be
00:22:55.320 one an index lookup to find our Network
00:22:58.500 location
00:23:00.780 so if we were looking to optimize this
00:23:02.760 query it's that sequential scan that
00:23:05.220 grabs our attention this thing is really
00:23:07.380 slowing us down
00:23:08.820 taking a look at the actual runtime
00:23:10.860 that's everything right there the
00:23:12.960 visiting of over a million rows it's
00:23:15.179 killing us
00:23:17.039 so what if we could evaluate that table
00:23:19.080 with an index scan that would take us to
00:23:22.260 the most performant option for our query
00:23:24.299 here
00:23:25.559 O of log of n plus log of m time: one
00:23:29.760 index lookup by IP address to find our
00:23:31.799 Network block record and then using the
00:23:34.020 Geo Name ID value off of that record
00:23:36.120 one index lookup to find its Network
00:23:39.360 location
00:23:40.860 by understanding the time complexities
00:23:42.840 associated with each operation we know
00:23:44.880 where to look and can leverage this
00:23:46.679 stuff to better design our query or our
00:23:48.960 schema to take advantage
00:23:51.720 now remember back to our explain
00:23:53.220 statement
00:23:54.419 postgres is going to provide you with
00:23:56.520 what it thinks is the fastest query plan
00:23:59.039 and it's usually really good about
00:24:01.380 making that decision
00:24:02.940 but postgres only has available what we
00:24:05.340 give it and so in using explain
00:24:07.080 statements we might be able to identify
00:24:09.480 an opportunity for adding a new index or
00:24:12.179 maybe we could change our query
00:24:14.159 altogether to coerce postgres towards
00:24:16.919 putting together a better plan
00:24:18.900 and that's what we were able to do here
00:24:20.940 an index on the starting IP address and
00:24:23.760 ending IP address was already available
00:24:26.700 but postgres still decided to go for the
00:24:29.220 sequential scan it turns out that by
00:24:32.340 throwing a limit statement onto our
00:24:34.500 query of a low enough value in our case
00:24:37.080 one which came from Active Record's find_by
00:24:39.179 postgres thinks that a sequential
00:24:41.820 scan is going to find our record faster
00:24:44.100 than going through the index remember
00:24:46.320 the query plan is built on projection
00:24:49.500 in actuality though we saw from the
00:24:52.260 numbers that this assumption postgres is
00:24:54.419 making turns out to be the bottleneck of
00:24:56.520 our query
00:24:57.720 so how do we influence postgres towards
00:25:00.419 using that index
00:25:01.860 well
00:25:03.059 turns out to be pretty simple clever but
00:25:06.120 simple
00:25:07.500 if we add an order statement to our
00:25:09.960 query postgres knows that it's going to
00:25:12.360 have to sort our rows at some point in
00:25:14.159 time and with a sequential scan this has
00:25:17.220 to be done as an added step only after
00:25:20.039 we've collected all the rows can
00:25:21.900 postgres figure out how they should be
00:25:23.820 sorted
00:25:25.080 but if you remember back to the rules of
00:25:27.659 how an index is designed they're built
00:25:30.120 around an order postgres is able to
00:25:32.460 leverage this and as it progresses
00:25:34.260 through an index scan it can keep track
00:25:36.539 of how the rows it picked up were
00:25:38.460 ordered
00:25:39.360 this allows us to sort along the way and
00:25:42.299 eliminates the need of that extra step
00:25:44.279 that came with the sequential scan
00:25:47.039 by adding an order statement to our
00:25:48.960 query postgres is going to change its
00:25:51.600 plan to favor our index so that it can
00:25:54.000 reduce what would have been two steps a
00:25:56.460 sequential scan followed by a sort down
00:25:59.700 to one step
00:26:01.740 an index scan sorting along the way
00:26:05.640 by engaging our index we lose the
00:26:08.279 sequential scan and pick up a huge
00:26:10.440 performance win
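
In Active Record terms the nudge is small; a hedged sketch with the same assumed names, where the order clause is the only change from the earlier query:

    NetworkLocation
      .joins(:network_blocks)
      .order("network_blocks.network_start_ip")  # gives Postgres a reason to walk the index, sorted for free
      .find_by(
        "network_blocks.network_start_ip <= :ip AND network_blocks.network_end_ip >= :ip",
        ip: "198.51.100.7"
      )
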
00:26:12.059 with the initial version of our query
00:26:13.980 using the sequential scan we had to
00:26:16.620 filter through about a third of the
00:26:18.120 network blocks table and that led to a
00:26:20.400 run time of about 200 milliseconds
00:26:23.460 now with the index scan the runtime on
00:26:26.520 this thing comes down to
00:26:30.200 0.05 milliseconds
00:26:32.820 look at that change
00:26:35.159 rolling the updated query out to our app
00:26:37.320 check out this reduction of runtime
00:26:39.659 before a geolocation request took about
00:26:42.779 200 milliseconds and was one of the top
00:26:45.480 offending transactions in our app
00:26:47.760 now
00:26:49.140 doesn't even register
00:26:51.559 1.67 milliseconds that's over 100 times
00:26:55.080 faster right there and that is going to
00:26:57.900 look very nice on my performance review
00:27:01.080 and I hope it made for a nice talk too
00:27:02.940 thank you
00:27:05.960 yeah thank you all for attending
00:27:10.200 thank you to our organizers
00:27:12.299 and thank you to our sponsors
00:27:14.460 the slides for this talk can be found at
00:27:16.919 kevinlesht.com/railsconf2022
00:27:21.299 if you have any questions or compliments
00:27:24.360 for me you can find me on Twitter at
00:27:28.320 Kevin Lesht
00:27:28.320 and if you'd like to check out my other
00:27:29.940 projects you can keep up with those at
00:27:32.400 kevinlesht.com
00:27:34.200 thank you