00:00:16.880
geolocation explained
00:00:20.039
we get about 200 slides to get through
00:00:22.920
with this talk here so we're gonna have
00:00:25.019
some fun
00:00:26.340
just trying not to blank okay
00:00:29.939
so the idea for this talk came from some
00:00:31.740
work that we recently produced at Home
00:00:34.079
Chef where I work on our engineering
00:00:36.239
team
00:00:37.200
Home Chef, the meal kit delivery company
00:00:39.540
and each week we put out a menu with
00:00:42.000
meals
00:00:43.559
customers can then order meals from a
00:00:45.840
menu
00:00:47.280
and the ingredients for those meals come
00:00:49.200
pre-portioned and ready to cook
00:00:52.500
we ship our orders out of three
00:00:54.300
facilities scattered across the U.S
00:00:58.620
and until recently all customers had
00:01:01.800
access to the same set of meals
00:01:04.199
regardless of where they were ordering
00:01:06.060
from
00:01:07.439
but then we wanted to make a change
00:01:09.720
product team wanted to be able to offer
00:01:12.000
additional meals to customers serviced by
00:01:14.880
our West Coast facility much like a
00:01:17.280
digital A/B test we wanted to begin
00:01:19.380
testing the impact of different meal
00:01:21.479
offerings
00:01:22.619
now this opened up a number of different
00:01:25.020
challenges that we had to work through
00:01:26.640
most of which revolved around figuring
00:01:29.460
out what meals were eligible for a
00:01:32.220
customer
00:01:33.780
if my order is being delivered out of our
00:01:36.000
West Coast facility I'll have access to
00:01:38.640
all the meals
00:01:40.079
but if my order is being delivered out
00:01:42.000
of our Midwest or east coast facility
00:01:45.600
I'm not going to have access to the test
00:01:47.579
meals that are only available out of our
00:01:50.159
West Coast facility
00:01:52.920
so for logged in customers this wasn't
00:01:54.720
too bad to figure out we've got the
00:01:56.880
customer's delivery address and we know
00:01:58.979
which facility is going to be shipping
00:02:01.259
an order for that address
00:02:03.180
so we can determine what meals the
00:02:05.100
customer can access
00:02:07.560
but the logged out experience became a
00:02:10.140
little more interesting
00:02:11.940
for site visitors we don't have any
00:02:14.640
address information and our marketing
00:02:16.980
team wanted to be able to conditionally
00:02:18.540
display the different menu offerings and
00:02:21.420
as best as we could
00:02:23.040
make sure that someone visiting the site
00:02:24.840
from within the service range of our
00:02:26.580
West Coast facility would see those test
00:02:28.560
meals
00:02:29.580
conversely visitors outside of the range
00:02:32.459
of our West Coast facility shouldn't see
00:02:35.340
the test meals
00:02:38.580
so without access to the shipping
00:02:40.560
address information that we have on
00:02:42.480
logged in customers we had to figure out
00:02:45.060
how to find location for individuals
00:02:47.519
visiting our site and that's what my
00:02:49.680
talk brings you today
00:02:50.940
we're going to step through how to
00:02:52.800
geo-locate site visitors and how to do
00:02:55.319
it fast yeah
00:02:57.840
okay so to restate our problem
00:03:00.120
when someone visits our website we want
00:03:02.519
to find their geographic location and
00:03:05.040
then conditionally display some content
00:03:07.080
based on that location
00:03:09.660
and because we're dealing with display
00:03:10.980
logic here we really want to get to that
00:03:13.200
location fast so we can bring our page
00:03:15.840
content to the visitor as quickly as we
00:03:18.000
can
00:03:19.500
so how do we do it
00:03:21.120
well we started by exploring a few
00:03:23.220
options
00:03:24.180
first we thought about leveraging
00:03:26.040
some service with an API to
00:03:29.040
handle everything for us
00:03:30.599
but this option got ruled out pretty
00:03:33.060
quickly because it would add too much
00:03:35.099
time onto our web requests
00:03:37.319
the amount of time it would take to send
00:03:39.239
a request out to a third party and then
00:03:41.400
hear back
00:03:42.480
it's just a non-starter
00:03:44.519
next we thought about cloudflare
00:03:46.620
cloudflare manages our DNS and so a
00:03:49.319
request coming to our site it's first
00:03:51.120
going to pass through them
00:03:52.860
they provide geolocation as a part of
00:03:55.379
their analytics platform so we explored
00:03:57.720
if that information could be forwarded
00:03:59.760
from cloudflare to our app
00:04:02.580
unfortunately they only offer country code
00:04:05.280
which just wasn't granular enough
00:04:08.640
then we found this service called
00:04:10.140
MaxMind
00:04:11.400
MaxMind provides geolocation data that
00:04:14.700
allows you to find a location based on
00:04:16.859
an IP address
00:04:18.120
and rails does give us easy access to a
00:04:21.239
visitor's connecting address
00:04:22.860
so this one seemed like it could be a
00:04:24.360
good fit
00:04:26.040
the geolocation data comes as two tables
00:04:28.500
they separate the IP data and the
00:04:31.620
location data and the two are linked
00:04:33.660
together by a foreign key
00:04:35.699
a network block record belongs to a
00:04:38.280
network location and a network location
00:04:40.460
may have many Network blocks pretty cool
00:04:45.240
only let's take a closer look at this
00:04:47.880
data and specifically this network
00:04:50.460
address field
00:04:52.080
it looks like an IP address but what's
00:04:54.960
going on with this slash and the number
00:04:57.000
after it
00:04:58.440
after some research what we have here is
00:05:01.020
CIDR notation this stands for classless
00:05:04.860
inter-domain routing which is a way to
00:05:08.040
specify within an IP the number of bits
00:05:10.620
that make up the networking portion of
00:05:12.240
the address
00:05:13.139
and the range of IP addresses that are
00:05:15.720
assignable off of that Network
00:05:19.080
so how do we get the range of IP
00:05:20.759
addresses from the notation well let's
00:05:23.520
start by breaking down what an IP
00:05:25.800
address actually is
00:05:27.780
an ipv4 address is composed of 32 bits
00:05:31.800
broken down into four 8-bit segments as
00:05:36.240
divided by a period
00:05:39.900
each segment is typically a numerical
00:05:42.060
value from 0 to 255 and a segment can
00:05:46.620
also be referred to as an octet to
00:05:49.080
express that it's representing eight
00:05:50.820
bits of data plus just a word that's
00:05:53.940
more fun to say
00:05:55.560
so back to CIDR notation
00:05:57.780
the number following the forward slash
00:05:59.880
represents the number of bits that make
00:06:02.039
up the networking portion of the address
00:06:04.860
taking our example here it'd be the
00:06:06.900
first three octets that make up 24 bits
00:06:09.900
and Define our Network address
00:06:14.160
the rest of the address can be used to
00:06:16.080
define hosts on the network
00:06:18.960
so looking at this network address any
00:06:21.419
client on this network will have an IP
00:06:23.759
address where that last segment is
00:06:26.039
between 0 and 255.
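As a quick sketch of this in Ruby (the 192.168.1.0/24 block here is just an example address, not one from the MaxMind data), the standard library's IPAddr can parse CIDR notation and hand back the assignable range:

```ruby
require "ipaddr"

# A /24 block: the first 24 bits name the network, leaving the
# last octet (8 bits) for hosts on that network.
block = IPAddr.new("192.168.1.0/24")

block.to_range.first.to_s  # lowest address in the block: "192.168.1.0"
block.to_range.last.to_s   # highest address: "192.168.1.255"

block.include?(IPAddr.new("192.168.1.42"))  # a host inside the block
block.include?(IPAddr.new("192.168.2.1"))   # an address outside it
```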
00:06:29.400
to put pictures to all of this if we
00:06:32.100
take a city like Chicago there might be
00:06:34.319
any number of internet service providers
00:06:36.120
and any number of network nodes handling
00:06:38.639
the clients of those providers
00:06:42.180
and each node is going to have its own
00:06:44.460
unique Network address and an allocation
00:06:46.919
of IP addresses that can be issued off
00:06:49.319
of that Network address
00:06:51.060
and all of this can be represented by
00:06:53.639
CIDR notation
00:06:55.860
looking at network node one we know that
00:06:58.259
the first 16 bits of the address are
00:07:00.840
going to represent the networking
00:07:02.220
portion and so it's the last 16 bits
00:07:05.280
that are addressable
00:07:07.440
for network node two we see 24 bits as
00:07:10.620
the network address so that leaves 8
00:07:12.840
Bits as addressable
00:07:15.300
from this we know that for a client
00:07:17.160
connecting to network node one
00:07:19.199
they're going to have an IP address
00:07:20.699
that's within the bounds of that node
00:07:22.979
CIDR block and just the same for network
00:07:25.560
node two
00:07:26.699
any client being serviced by this node
00:07:29.160
is going to have an address that's
00:07:30.780
within the bounds of its CIDR block
00:07:34.500
once we know the bounds of a CIDR block
00:07:36.840
we can identify the starting address and
00:07:39.180
ending address and when a request comes
00:07:41.460
in we find the block record that
00:07:43.680
contains our incoming address
00:07:45.780
from there we just need to follow the
00:07:47.940
foreign key to find the location
00:07:50.460
okay
00:07:52.620
so how do we code out that lookup
00:07:54.840
well we've got these two tables and say
00:07:57.120
the IP address of the client visiting
00:07:59.220
our site
00:08:00.240
the first thing that we could do is
00:08:02.340
filter down our Network blocks table to
00:08:04.500
only those instances where the IP
00:08:06.599
address fits between the starting and
00:08:08.880
ending address
00:08:10.259
and with our Network block records
00:08:12.060
unique by starting address and ending
00:08:14.099
address that should take us down to one
00:08:16.560
row
00:08:17.759
once we've found that record the next
00:08:20.160
thing we could do is leverage that Geo
00:08:22.080
Name ID field which acts as a foreign
00:08:24.419
key pointing to a record on the network
00:08:26.460
locations table
00:08:28.199
what this means is that we can follow
00:08:30.419
the value off of our Network blocks
00:08:32.640
table to join a network locations row
00:08:35.219
where the Geo Name ID field is of the
00:08:37.919
same value
00:08:40.320
once we've joined our rows we can set up
00:08:42.719
our select to return everything off of
00:08:45.000
the network locations table since it's
00:08:47.220
ultimately the location data that we
00:08:48.959
care about
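As a sketch in SQL (the table and column names here are assumptions modeled on the MaxMind CSV layout, not necessarily the exact schema from the talk):

```sql
SELECT network_locations.*
FROM network_blocks
JOIN network_locations
  ON network_locations.geoname_id = network_blocks.geoname_id
WHERE '203.0.113.7' BETWEEN network_blocks.start_ip
                        AND network_blocks.end_ip;
```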
00:08:50.220
bringing this into Rails for you Active
00:08:52.320
Record fans out there here's what that
00:08:54.360
syntax would look like
00:08:55.740
nice and tidy with the only change to
00:08:57.959
our query that gets generated from this
00:08:59.580
being
00:09:00.839
that a limit is applied which comes from
00:09:03.000
the find_by
00:09:04.140
but since we know that we should be
00:09:05.700
finding a distinct match anyways not a
00:09:08.040
problem and makes for a nice interface
00:09:09.660
within our app
00:09:11.399
putting everything together we now have
00:09:13.620
a system in place for taking an IP
00:09:15.600
address and finding the location for
00:09:18.060
that address
00:09:19.200
pretty cool
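To make the two-step flow concrete, here's a toy in-memory version in plain Ruby (Structs stand in for the two tables; the addresses, IDs, and cities are made up):

```ruby
require "ipaddr"

Block    = Struct.new(:network, :geoname_id)
Location = Struct.new(:geoname_id, :city)

blocks = [
  Block.new(IPAddr.new("198.51.100.0/24"), 7),
  Block.new(IPAddr.new("203.0.113.0/24"), 42)
]
locations = [Location.new(7, "Denver"), Location.new(42, "Chicago")]

ip = IPAddr.new("203.0.113.9")

# Step 1: filter the blocks down to the one row bounding our IP.
block = blocks.find { |b| b.network.include?(ip) }

# Step 2: follow the foreign key over to the location row.
location = locations.find { |l| l.geoname_id == block.geoname_id }
location.city  # => "Chicago"
```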
00:09:21.480
so we send this work out the door and
00:09:23.640
everything's working as expected only
00:09:26.580
things are running a bit slow
00:09:30.060
taking a look at New Relic our
00:09:31.740
performance monitoring service around
00:09:33.360
the time the work was sent out
00:09:35.760
the controller action where our query
00:09:37.740
lives was already listed within the key
00:09:40.740
transactions overview
00:09:42.660
if you're using New Relic this list here
00:09:44.820
is going to be where you'll find the
00:09:46.560
longest running transactions within your
00:09:48.300
app not a good place to be
00:09:50.580
and to give us a closer look we can
00:09:52.380
navigate through to get even more
00:09:54.240
information about what all is happening
00:09:56.880
this chart is going to break down all of
00:09:59.519
the components making up the response to
00:10:02.100
the request
00:10:03.120
and if we scroll down a bit we're given
00:10:05.279
a nice breakdown table
00:10:07.320
here we can see that it's taking about
00:10:09.240
200 milliseconds to find our Network
00:10:12.060
location
00:10:13.320
and this 200 milliseconds is making up
00:10:15.779
about 95 percent of our geolocation
00:10:18.720
request time
00:10:20.519
not the best for our end users out there
00:10:23.100
but this does offer up something nice
00:10:25.560
what we have here is an opportunity
00:10:27.360
those performance reviews don't write
00:10:29.399
themselves so let's see if we can dig in
00:10:31.620
and make some improvements
00:10:34.200
let's go back to our query
00:10:36.720
looking at things directly we could make
00:10:38.880
some assumptions but we don't really
00:10:40.860
have eyes into why this thing is taking
00:10:43.019
up so much time
00:10:44.640
luckily database Technologies provide us
00:10:47.519
with some utility functions to detail
00:10:50.100
out exactly how they plan to execute our
00:10:53.160
queries
00:10:54.180
we'll be referencing postgres as our
00:10:56.279
database technology but everything that
00:10:58.740
we're going to talk through should
00:11:00.300
generally apply for all SQL-backed
00:11:02.040
databases
00:11:03.959
so back to those utility functions by
00:11:06.600
prefacing our query with an explain
00:11:08.339
statement postgres will evaluate our ask
00:11:11.100
and reference our schema and some
00:11:13.800
statistics on our database like table
00:11:15.660
size to plot out what it thinks will be
00:11:18.540
the best way to respond to our query
00:11:21.720
the output here is a set of nodes that
00:11:24.000
make up a plan each node represents an
00:11:26.880
operation that postgres will perform as
00:11:29.399
a part of fulfilling the query
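For instance (the query and the plan shape below are illustrative; names and the exact plan depend on your schema and table statistics):

```sql
EXPLAIN SELECT network_locations.*
FROM network_blocks
JOIN network_locations
  ON network_locations.geoname_id = network_blocks.geoname_id
WHERE '203.0.113.7' BETWEEN network_blocks.start_ip
                        AND network_blocks.end_ip
LIMIT 1;

-- The plan comes back as a tree of nodes, roughly:
--  Limit
--    ->  Nested Loop
--          ->  Seq Scan on network_blocks
--                Filter: (start_ip <= ...) AND (end_ip >= ...)
--          ->  Index Scan on network_locations
```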
00:11:31.380
the main thing that postgres is
00:11:33.779
optimizing for when putting together a
00:11:35.640
plan is Speed
00:11:37.500
postgres wants to return to you an
00:11:39.240
answer as quickly as it can and it's
00:11:41.760
going to make some reasonably informed
00:11:43.620
decisions about how to accomplish that
00:11:47.040
taking things one step further adding
00:11:49.500
and analyze onto our explained statement
00:11:51.899
is going to Output what would actually
00:11:53.880
happen when your query runs
00:11:56.519
explain is projection explain analyze is
00:11:59.579
retrospection
00:12:01.320
the thing to be aware of with explain
00:12:02.880
analyze is that postgres is actually
00:12:05.279
going to run your query and return both
00:12:07.740
the plan of execution and also some data
00:12:10.500
points on the actual run of each node in
00:12:13.140
the plan tree
00:12:14.640
here we can see how long each node took
00:12:17.880
to run and how many rows were evaluated
00:12:20.640
at each step along the way
00:12:23.760
now explain statements used to look a
00:12:25.860
little intimidating to me and I think
00:12:27.480
that came from just not knowing the
00:12:29.279
language of all that was being said and
00:12:31.800
also how all the pieces fit together
00:12:34.560
so this plan here jumped out as a really
00:12:37.079
good candidate for demonstrating three
00:12:39.480
fundamental database operations that we
00:12:41.820
can step through and then leverage to
00:12:43.980
understand why our queries running so
00:12:45.899
slow and how we can make it run faster
00:12:49.320
the three operations that we're going to
00:12:51.240
take a look at are going to be a
00:12:52.980
sequential scan
00:12:54.360
a nested loop join and an index scan
00:12:58.560
gonna be the crowd favorite right there
00:13:00.000
I guarantee it
00:13:02.220
so a sequential scan can be thought of
00:13:04.320
as a for Loop and to put this into terms
00:13:07.139
of time complexity because that's what
00:13:09.480
our database cares about a sequential
00:13:11.760
scan is going to be a linear operation
00:13:14.060
what this means is that the time it
00:13:16.680
takes to perform a sequential scan is
00:13:18.779
going to be directly related to the size
00:13:20.760
of your data set
00:13:22.139
in a worst case scenario where we start
00:13:24.720
scanning at the top of our table and the
00:13:27.000
item we're looking for is ordered last
00:13:28.860
the time it takes to get there is going
00:13:31.200
to be the length of our data set
00:13:33.480
for that reason linear operations can
00:13:36.600
also be represented as O(n) where n
00:13:40.560
is the number of items in your set
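In Ruby terms, a sequential scan behaves like this (a toy version that counts how many rows get touched):

```ruby
# Visit rows in order until the target is found; in the worst
# case we touch every row, so cost grows linearly with the table.
def seq_scan(rows, target)
  steps = 0
  rows.each do |row|
    steps += 1
    return [row, steps] if row == target
  end
  [nil, steps]
end

rows = (1..1_000).to_a
seq_scan(rows, 1)      # best case: found on the first visit
seq_scan(rows, 1_000)  # worst case: 1,000 visits, one per row
```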
00:13:43.380
now if you're not familiar with this
00:13:45.300
kind of Big O notation
00:13:47.519
it's just a way of representing time in
00:13:49.500
an abstract way it lets us classify
00:13:51.959
operations without having to deal with
00:13:53.760
raw numbers of the data set or having to
00:13:56.579
factor in variable stuff like computer
00:13:58.440
processing power or language speed
00:14:01.560
it's a way to communicate a generalized
00:14:03.839
runtime
00:14:04.920
so to put this into practical terms
00:14:07.260
taking our where statement from our
00:14:09.480
geolocation query we're looking for a
00:14:12.000
record that can bound our IP address and
00:14:14.519
in order for postgres to fulfill this
00:14:16.680
condition by sequential scan it's going
00:14:19.380
to need to visit every item in our table
00:14:21.420
to identify any matches
00:14:26.279
and that's what we see happening in our
00:14:27.839
query here
00:14:28.920
this portion of the explained statement
00:14:30.959
is telling us that postgres visited a
00:14:33.060
little over a million rows before it
00:14:35.399
found a record meeting our condition
00:14:38.519
now something to note here
00:14:40.320
there are over 3 million rows in our
00:14:42.540
Network blocks table and we know that
00:14:45.240
Network block records represent a unique
00:14:47.760
range of IP addresses and that there
00:14:49.680
should only be one record that can bound
00:14:52.320
an IP address
00:14:53.820
but how was postgres able to figure that
00:14:55.740
out without having to visit all three
00:14:57.600
million rows why was it able to stop
00:15:00.600
filtering about a third of the way
00:15:02.279
through after only visiting a million
00:15:04.380
rows out of the 3 million
00:15:07.320
well the answer is postgres knows that
00:15:09.540
this sequential scan is serving a higher
00:15:11.579
purpose going into the execution of our
00:15:14.040
query it already has a plan and knows
00:15:17.040
that part of that plan is limiting the
00:15:18.899
return to one record
00:15:20.699
so once we've found a match we can stop
00:15:23.100
our scan and move on to the next step of
00:15:25.860
the plan
00:15:27.300
now speaking of a higher purpose
00:15:29.399
postgres has decided that the best way
00:15:31.500
to perform our query is to get our
00:15:33.360
Network block data as a sub step of our
00:15:35.699
join statement
00:15:36.959
it's a good time to note that query
00:15:39.180
plans are executed inside out meaning
00:15:42.480
that the far most nested nodes are going
00:15:44.820
to evaluate before their ancestors
00:15:47.639
this step being interior to the nested
00:15:50.699
Loop node is acting as a dependency
00:15:53.699
postgres has determined that to
00:15:55.320
effectively perform our join we're going
00:15:57.779
to need the return from this step of the
00:15:59.459
plan first
00:16:00.779
but with that in hand
00:16:02.579
let's now take a look at nested loops
00:16:05.519
now there are actually a bunch of
00:16:06.959
algorithms that postgres can choose from
00:16:09.000
for joining two tables together and
00:16:11.339
nested Loop is the most basic operation
00:16:14.279
what's pretty cool is that a nested Loop
00:16:16.500
join can actually be achieved by a
00:16:18.600
number of different time complexities
00:16:21.000
the slowest version of this is going to
00:16:23.760
be performed by sequentially scanning
00:16:25.440
all of the items in your first table and
00:16:27.899
for each item in your first table
00:16:29.540
sequentially scanning each item in your
00:16:32.100
second table
00:16:33.300
at each visit we're evaluating the join
00:16:35.880
condition for True to see if we found a
00:16:38.100
match
00:16:41.639
may have got a little carried away with
00:16:42.899
the slides here
00:16:45.899
we'll call it right about there
00:16:49.440
so in other terms it's a nested for Loop
00:16:52.019
which can be classified as multi-linear
00:16:54.959
time and can be represented by
00:16:58.500
O(n × m) where n is the number of items
00:17:01.980
in your first table and M is the number
00:17:04.620
of items in your second table
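A toy version of that slowest case in Ruby (hash rows stand in for the two tables; the data is made up):

```ruby
# For every row of the first table, scan every row of the
# second: n * m comparisons in total.
def nested_loop_join(left, right)
  matches = []
  left.each do |l|
    right.each do |r|
      # the join condition: does the foreign key line up?
      matches << [l, r] if l[:geoname_id] == r[:geoname_id]
    end
  end
  matches
end

blocks    = [{ network: "203.0.113.0/24", geoname_id: 42 }]
locations = [{ geoname_id: 7,  city: "Denver" },
             { geoname_id: 42, city: "Chicago" }]

nested_loop_join(blocks, locations)  # one matching pair
```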
00:17:06.839
now there's a pretty slow time
00:17:08.819
complexity
00:17:09.959
what's cool about understanding this
00:17:11.760
though is that it highlights a reason
00:17:13.380
why databases like postgres always index
00:17:16.319
primary keys
00:17:17.760
usually a join between two tables is
00:17:20.760
going to be done by matching a foreign
00:17:22.380
key that lines up with a primary key on
00:17:24.839
another table
00:17:26.040
coming off of our Network blocks table
00:17:27.720
we already have a reference to the key
00:17:30.000
we're looking for in our Network
00:17:31.559
locations table and as we've seen by
00:17:34.320
running a sequential scan we can step
00:17:36.480
through each row to try and find a match
00:17:39.780
but an index scan offers us a way of
00:17:43.080
getting there more quickly
00:17:44.580
so let's take a look at that
00:17:47.400
now postgres supports a bunch of options
00:17:50.100
for creating an index and by default
00:17:52.500
it's going to use a balanced tree
00:17:54.539
structure
00:17:55.799
so if we were to create a B-tree index
00:17:58.020
on say a table's primary key postgres is
00:18:01.440
going to gather up all of the values in
00:18:03.240
that field and find a relative midpoint
00:18:06.059
it's then going to distribute our data
00:18:08.160
downwards following a simple set of
00:18:10.200
rules and these rules are going to be
00:18:12.600
incredibly helpful if we want to read
00:18:14.700
back our data
00:18:16.919
they are
00:18:18.299
a higher value will be placed to the
00:18:20.640
right of the current node
00:18:22.620
and a lower value will be placed to the
00:18:24.960
left of the current node
00:18:27.299
following these rules postgres will step
00:18:29.460
through our data set to build out our
00:18:31.500
tree
00:18:36.900
now I do want to note what we've built
00:18:39.539
here is a bit of a simplification of a
00:18:42.299
balanced tree structure and all that it
00:18:44.160
can do which is actually a lot more like
00:18:46.500
having more than two children off of a
00:18:48.360
node and even clustering data together
00:18:50.400
on a single node but we had so much
00:18:54.059
slide space here and remember we're
00:18:56.220
already pushing like 200 slides let's
00:18:58.799
let's see how we're doing by the way
00:19:00.960
okay okay
00:19:03.419
back to indexing what we're going to
00:19:06.120
step through does highlight the
00:19:07.559
fundamentals it's really the rules for
00:19:09.780
organization that we care about and for
00:19:11.940
the most part those are the same
00:19:14.220
following these rules we're going to be
00:19:16.320
able to Traverse our tree and
00:19:18.120
demonstrate that the time complexity of
00:19:19.919
a B-tree index scan is logarithmic which
00:19:22.980
can be represented as O(log n)
00:19:26.100
where n is the number of items in our
00:19:28.740
data set
00:19:30.000
so check this out
00:19:31.559
take a value of 53. in order to find the
00:19:34.799
node for this value in our tree we can
00:19:36.960
leverage the rules of the structure to
00:19:38.940
answer some basic questions at each stop
00:19:41.280
we make
00:19:42.419
is the value I'm looking for greater
00:19:44.880
than the current value
00:19:46.679
on our root node here 53 is greater than
00:19:49.620
50 so we move to the right
00:19:52.440
53 is less than 64. so we move to the
00:19:56.100
left and boom we've found the node for
00:19:59.280
our data point
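Sketching that traversal as a toy binary search tree in Ruby (a real B-tree node holds many keys, but the ordering rules are the same):

```ruby
Node = Struct.new(:value, :left, :right)

# Walk the tree with the ordering rules: greater goes right,
# lesser goes left, discarding half the remaining tree each stop.
def search(node, target, stops = 1)
  return nil if node.nil?
  return [node.value, stops] if node.value == target
  child = target > node.value ? node.right : node.left
  search(child, target, stops + 1)
end

#        50
#       /  \
#     25    64
#          /
#        53
tree = Node.new(50, Node.new(25), Node.new(64, Node.new(53), nil))

search(tree, 53)  # => [53, 3] -- found in three stops: 50, 64, 53
```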
00:20:00.539
by following this progression we're able
00:20:02.580
to throw away half of our remaining tree
00:20:04.980
at each stop and that's what makes this
00:20:07.440
logarithmic
00:20:09.000
following a sequential scan we would
00:20:11.220
have had to have visited each row on our
00:20:13.260
table until we found a match and the
00:20:15.600
cost of that operation would scale
00:20:17.460
linearly with the size of our data set
00:20:20.460
but with an index scan we don't have to
00:20:23.280
visit every record for each move we make
00:20:26.340
we effectively filter out half of the
00:20:28.740
remaining data and that's how
00:20:30.720
logarithmic operations really show their
00:20:32.940
speed
00:20:34.500
so as we start kicking up the size of
00:20:36.059
our data set
00:20:37.200
to find a record using a sequential scan
00:20:39.600
which is a linear operation the time it
00:20:42.539
would take to find an item is going to
00:20:44.160
be directly proportional to the number
00:20:46.140
of items in our set
00:20:47.880
by using an index scan running in
00:20:50.100
logarithmic Time Each doubling of our
00:20:53.039
data here only introduces one added unit
00:20:56.520
of time onto our search
00:20:59.039
we can find our way through a million
00:21:01.140
items in at most 20 progressions through
00:21:03.720
our index
00:21:04.799
with logarithmic time our data can grow
00:21:07.260
exponentially and the cost of finding a
00:21:09.660
node won't change all that much it's
00:21:12.480
super powerful
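A quick back-of-the-envelope check (pure arithmetic, nothing app-specific):

```ruby
# Steps needed for a balanced-tree lookup: log2 of the row
# count, rounded up. Each doubling adds only one more step.
[1_000, 1_000_000, 1_000_000_000].each do |n|
  puts "#{n} rows -> at most #{Math.log2(n).ceil} steps"
end
# a million rows -> at most 20 steps
```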
00:21:14.520
now one thing to keep in mind
00:21:16.620
indexes exist separate from your data
00:21:19.679
what this means is that once we've found
00:21:21.900
a node matching our condition postgres
00:21:24.600
is going to follow some metadata stored
00:21:26.940
with that node to access our table row
00:21:29.280
which lives in What's called the Heap
00:21:31.740
not a lot we're going to dig into here
00:21:33.780
but it is a good thing to have an
00:21:35.520
awareness for
00:21:36.659
consider that for every table change
00:21:39.299
your database may have to update any
00:21:41.940
indexes that might need to understand
00:21:43.559
that change
00:21:45.419
the concern here could be that if you
00:21:48.059
have a heavily indexed table all of your
00:21:50.280
write and delete operations might get
00:21:52.919
expensive as you spend time keeping your
00:21:54.900
indexes in check
00:21:56.460
not something you run into often
00:21:58.500
but some good knowledge to keep in your
00:22:00.659
back pocket for your next database
00:22:02.340
trivia night
00:22:03.600
okay
00:22:05.700
so now we have a handle on indexing
00:22:07.380
let's look back at our query plan
00:22:09.600
we know that in order to fulfill our
00:22:11.640
join statement we're going to have to
00:22:13.500
visit both tables there's no getting
00:22:15.299
around that and luckily we dodged the
00:22:18.059
double sequential scan our Network
00:22:20.280
locations Geo Name ID index is allowing
00:22:22.919
us to more quickly scan that table for
00:22:25.500
any rows that match the foreign key on
00:22:27.900
our Network block rows
00:22:30.120
now earlier we described a nested loop join
00:22:32.880
as multi-linear because we had to
00:22:34.860
sequentially scan both tables but
00:22:37.740
knowing that we're able to leverage the
00:22:39.059
index on network locations the cost of
00:22:41.940
our join is now going to come in at
00:22:44.460
O(n × log m) time one
00:22:48.840
sequential scan to find network block
00:22:50.400
records and then for each record that we
00:22:52.860
find which from our limit will just be
00:22:55.320
one an index lookup to find our Network
00:22:58.500
location
00:23:00.780
so if we were looking to optimize this
00:23:02.760
query it's that sequential scan that
00:23:05.220
grabs our attention this thing is really
00:23:07.380
slowing us down
00:23:08.820
taking a look at the actual runtime
00:23:10.860
that's everything right there the
00:23:12.960
visiting of over a million rows it's
00:23:15.179
killing us
00:23:17.039
so what if we could evaluate that table
00:23:19.080
with an index scan that would take us to
00:23:22.260
the most performant option for our query
00:23:24.299
here
00:23:25.559
O(log n + log m) time one
00:23:29.760
index lookup by IP address to find our
00:23:31.799
Network block record and then using the
00:23:34.020
Geo Name ID value off of that record
00:23:36.120
another index lookup to find its Network
00:23:39.360
location
00:23:40.860
by understanding the time complexities
00:23:42.840
associated with each operation we know
00:23:44.880
where to look and can leverage this
00:23:46.679
stuff to better design our query or our
00:23:48.960
schema to take advantage
00:23:51.720
now remember back to our explain
00:23:53.220
statement
00:23:54.419
postgres is going to provide you with
00:23:56.520
what it thinks is the fastest query plan
00:23:59.039
and it's usually really good about
00:24:01.380
making that decision
00:24:02.940
but postgres only has available what we
00:24:05.340
give it and so in using explain
00:24:07.080
statements we might be able to identify
00:24:09.480
an opportunity for adding a new index or
00:24:12.179
maybe we could change our query
00:24:14.159
altogether to coerce postgres towards
00:24:16.919
putting together a better plan
00:24:18.900
and that's what we were able to do here
00:24:20.940
an index on the starting IP address and
00:24:23.760
ending IP address was already available
00:24:26.700
but postgres still decided to go for the
00:24:29.220
sequential scan it turns out that by
00:24:32.340
throwing a limit statement onto our
00:24:34.500
query of a low enough value in our case
00:24:37.080
one which came from Active Record's
00:24:39.179
find_by postgres thinks that a sequential
00:24:41.820
scan is going to find our record faster
00:24:44.100
than going through the index remember
00:24:46.320
the query plan is built on projection
00:24:49.500
in actuality though we saw from the
00:24:52.260
numbers that this assumption postgres is
00:24:54.419
making turns out to be the bottleneck of
00:24:56.520
our query
00:24:57.720
so how do we influence postgres towards
00:25:00.419
using that index
00:25:01.860
well
00:25:03.059
turns out to be pretty simple clever but
00:25:06.120
simple
00:25:07.500
if we add an order statement to our
00:25:09.960
query postgres knows that it's going to
00:25:12.360
have to sort our rows at some point in
00:25:14.159
time and with a sequential scan this has
00:25:17.220
to be done as an added step only after
00:25:20.039
we've collected all the rows can
00:25:21.900
postgres figure out how they should be
00:25:23.820
sorted
00:25:25.080
but if you remember back to the rules of
00:25:27.659
how an index is designed they're built
00:25:30.120
around an order postgres is able to
00:25:32.460
leverage this and as it progresses
00:25:34.260
through an index scan it can keep track
00:25:36.539
of how the rows it picked up were
00:25:38.460
ordered
00:25:39.360
this allows us to sort along the way and
00:25:42.299
eliminates the need of that extra step
00:25:44.279
that came with the sequential scan
00:25:47.039
by adding an order statement to our
00:25:48.960
query postgres is going to change its
00:25:51.600
plan to favor our index so that it can
00:25:54.000
reduce what would have been two steps a
00:25:56.460
sequential scan followed by a sort down
00:25:59.700
to one step
00:26:01.740
an index scan that sorts along the way
00:26:05.640
by engaging our index we lose the
00:26:08.279
sequential scan and pick up a huge
00:26:10.440
performance win
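In SQL terms the nudge looks something like this (again with assumed table and column names; in Active Record it would just be an added order clause ahead of the find_by):

```sql
SELECT network_locations.*
FROM network_blocks
JOIN network_locations
  ON network_locations.geoname_id = network_blocks.geoname_id
WHERE '203.0.113.7' BETWEEN network_blocks.start_ip
                        AND network_blocks.end_ip
ORDER BY network_blocks.start_ip  -- matches the index, so no sort step
LIMIT 1;
```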
00:26:12.059
with the initial version of our query
00:26:13.980
using the sequential scan we had to
00:26:16.620
filter through about a third of the
00:26:18.120
network blocks table and that led to a
00:26:20.400
run time of about 200 milliseconds
00:26:23.460
now with the index scan the runtime on
00:26:26.520
this thing comes down to
00:26:30.200
0.05 milliseconds
00:26:32.820
look at that change
00:26:35.159
rolling the updated query out to our app
00:26:37.320
check out this reduction of runtime
00:26:39.659
before a geolocation request took about
00:26:42.779
200 milliseconds and was one of the top
00:26:45.480
offending transactions in our app
00:26:47.760
now
00:26:49.140
doesn't even register
00:26:51.559
1.67 milliseconds that's over 100 times
00:26:55.080
faster right there and that is going to
00:26:57.900
look very nice on my performance review
00:27:01.080
and I hope it made for a nice talk too
00:27:02.940
thank you
00:27:05.960
yeah thank you all for attending
00:27:10.200
back to our organizers
00:27:12.299
and thank you to our sponsors
00:27:14.460
the slides for this talk can be found at
00:27:16.919
kevinlesh.com/railsconf2022
00:27:21.299
if you have any questions or compliments
00:27:24.360
for me you can find me on Twitter at
00:27:26.760
Kevin Lesh
00:27:28.320
and if you'd like to check out my other
00:27:29.940
projects you can keep up with those at
00:27:32.400
kevinlesh.com
00:27:34.200
thank you