00:00:10.240
hello rubyconf i'm peter and i'm a ruby
00:00:12.799
core committer and senior developer at
00:00:14.480
shopify in the ruby infrastructure team
00:00:16.960
and i'm matt i'm also at shopify and i'm
00:00:19.199
also a senior developer in the ruby
00:00:20.720
infrastructure team
00:00:22.400
for the past year we've been working
00:00:24.000
together to make some improvements to
00:00:25.519
how memory is laid out in ruby
00:00:28.080
in this talk we're going to summarize
00:00:29.920
the work that we've done so far the
00:00:31.439
challenges that we've faced and where
00:00:32.719
we're heading in the future
00:00:34.880
but before we can go into detail we're
00:00:36.640
going to provide some background on how
00:00:38.160
ruby's garbage collector structures and
00:00:40.079
manages memory
00:00:44.000
when we create an object in ruby the
00:00:45.760
interpreter stores a data structure
00:00:47.200
called an r value in memory to hold it
00:00:50.000
an r value is always 40 bytes wide and
00:00:52.079
it contains a union of different data
00:00:53.600
structures that represent all the
00:00:55.039
possible ruby core types
00:00:57.199
so we could have a structure
00:00:58.320
representing a string
00:01:00.079
or an array or a class and so on
00:01:04.080
and each one of these types contains
00:01:05.439
their own fields and data structures
00:01:06.960
specific to that type
00:01:08.960
the r value itself is just a container
00:01:10.799
for our ruby objects
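For readers following along at a terminal: ruby exposes some of these layout details at runtime through `GC::INTERNAL_CONSTANTS`. The exact keys vary between ruby versions, so treat the output as illustrative.

```ruby
# GC::INTERNAL_CONSTANTS reports layout details of the running VM.
# Key names differ across ruby versions; on many 64-bit builds
# :RVALUE_SIZE (or :BASE_SLOT_SIZE on newer rubies) reports 40.
constants = GC::INTERNAL_CONSTANTS
puts constants.select { |key, _| key.to_s.include?("SIZE") }
```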
00:01:13.280
let's look more closely at one of ruby's
00:01:14.960
object types r class in this case
00:01:18.799
the first 16 bytes of every type are
00:01:20.799
always the same an 8 byte field called
00:01:22.960
flags followed by an 8 byte pointer to the
00:01:25.439
class of the object
00:01:28.080
the remaining 24 bytes are different for
00:01:30.079
every type
00:01:31.119
r class objects store a pointer to an
00:01:32.880
extension object here
00:01:34.640
strings can store the actual string
00:01:36.159
content and so on
00:01:38.960
these r values are organized into
00:01:40.560
regions called pages
00:01:42.159
a heap page is a 16 kilobyte memory
00:01:44.479
region
00:01:45.520
each page has a maximum of 409 slots
00:01:48.000
where each slot can hold one r value
00:01:52.079
a heap page's memory region is always
00:01:53.840
contiguous the slots have memory
00:01:55.600
addresses that are 40 bytes apart with
00:01:57.360
no gaps between them
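The slot count follows directly from those two numbers; a quick sketch of the arithmetic (ignoring the small per-page header the real implementation reserves, which is why 409 is a maximum):

```ruby
# 16 KiB pages divided into 40-byte slots: integer division gives the
# upper bound on slots per page.
HEAP_PAGE_SIZE = 16 * 1024  # bytes
RVALUE_SIZE    = 40         # bytes
slots_per_page = HEAP_PAGE_SIZE / RVALUE_SIZE
puts slots_per_page  # => 409
```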
00:01:59.840
initially when the heap page is created
00:02:02.000
all the slots will be filled with an
00:02:03.360
internal type called T_NONE
00:02:05.840
this type represents an empty slot it
00:02:08.000
contains only flags
00:02:09.679
and a pointer value called next that can
00:02:11.680
point to an r value
00:02:14.000
this allows these empty slots to be
00:02:16.480
linked together to form a data structure
00:02:18.239
called the free list
00:02:20.400
when a heap page is initialized ruby
00:02:22.239
walks up each page visiting each slot in
00:02:24.480
turn
00:02:26.080
as we arrive at each slot we set a
00:02:27.920
pointer on the heap page called the free
00:02:29.440
list pointer to the address of that slot
00:02:32.000
and then set the slot's next pointer to
00:02:34.160
the address of the previous slot we
00:02:35.599
visited
00:02:37.120
this creates a singly linked list of
00:02:38.959
empty slots
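That initialization walk can be sketched as a toy model in ruby. `Slot` and `HeapPage` are illustrative names standing in for the real C structs.

```ruby
# A toy model of free-list initialization, not Ruby's real internals.
Slot = Struct.new(:next)

class HeapPage
  attr_reader :free_list, :slots

  def initialize(slot_count)
    @slots = Array.new(slot_count) { Slot.new(nil) }
    @free_list = nil
    # Visit each slot in turn: point the slot's `next` at the previous
    # slot we visited, then point the page's free-list head at the slot.
    @slots.each do |slot|
      slot.next = @free_list
      @free_list = slot
    end
  end
end

page = HeapPage.new(3)
# free_list -> slots[2] -> slots[1] -> slots[0] -> nil
```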
00:02:41.680
when we need to allocate an object and
00:02:43.599
we need an empty slot
00:02:45.360
ruby asks the heap page for the first
00:02:46.959
slot on the free list
00:02:48.800
the heap page returns the slot and then
00:02:50.640
the free list pointer is updated to
00:02:52.160
point at the next slot in the list
00:02:55.599
this unlinks our slot from the free list
00:02:57.519
allowing us to put data in it
00:03:00.640
the use of a free list means that this
00:03:02.480
part of object allocation is both fast
00:03:04.879
and a constant time operation even when
00:03:07.040
a page is sparsely populated
00:03:10.000
because if the page has at least one
00:03:11.760
empty slot the free list is guaranteed
00:03:13.680
to always point to it
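A toy model of that constant-time allocation path (again with made-up `FreeSlot` and `Page` names, not the real internals):

```ruby
# Popping the head of the free list is O(1) no matter how sparsely
# the page is populated.
FreeSlot = Struct.new(:next)

class Page
  def initialize(count)
    @free_list = nil
    count.times { @free_list = FreeSlot.new(@free_list) }
  end

  def allocate
    slot = @free_list
    return nil if slot.nil?  # free list empty: no slots left on this page
    @free_list = slot.next   # unlink the returned slot from the list
    slot.next = nil
    slot
  end
end

page = Page.new(2)
a = page.allocate  # a free slot
b = page.allocate  # the last free slot
c = page.allocate  # nil: the page is full, the GC must run
```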
00:03:15.840
but eventually our heap pages are going
00:03:18.319
to run out of slots and this is where the
00:03:20.159
garbage collector is required
00:03:22.720
if the free list is empty when ruby
00:03:24.319
tries to allocate into the page it must
00:03:26.000
mean that there are no free slots left
00:03:27.680
on that page
00:03:29.680
so now peter's going to talk about how
00:03:31.599
and when the garbage collector runs to
00:03:33.120
release objects that are no longer
00:03:34.560
necessary
00:03:35.599
over to you peter
00:03:38.159
now that we've looked at how memory is
00:03:39.599
laid out inside ruby let's look at how
00:03:41.840
ruby's garbage collector works
00:03:44.239
ruby's garbage collection consists
00:03:46.159
of three distinct phases marking
00:03:48.799
sweeping and compaction
00:03:51.200
the last phase compaction is optional
00:03:53.760
and does not run unless manually enabled
00:03:56.560
ruby's garbage collector is stop
00:03:58.080
the world meaning that ruby code is not
00:03:59.920
executed when the garbage collector is
00:04:01.760
active
00:04:02.959
the garbage collector in ruby is fairly
00:04:04.640
complex so we can't possibly cover all
00:04:06.720
the technical details in just a few
00:04:08.480
minutes but we'll provide a very high
00:04:10.640
level overview that contains all the
00:04:12.400
information you will need to know for
00:04:14.239
this talk
00:04:15.840
let's first look at marking marking is
00:04:18.160
the phase where we determine which ruby
00:04:19.919
objects are alive and which ones can be
00:04:22.079
freed
00:04:22.960
we first mark the roots which include
00:04:24.880
things like global variables
00:04:26.639
classes modules and constants
00:04:29.840
we then perform breadth-first search to
00:04:31.600
traverse the network of live objects
00:04:34.639
let's see an example of this here's a
00:04:36.639
toy example of a heap with two pages
00:04:38.639
with seven slots on each page
00:04:40.800
we have a total of 10 objects labeled a
00:04:42.800
through j
00:04:43.919
slots with no label mean that the slot
00:04:45.600
is empty the arrows indicate references
00:04:48.240
for example the arrow from object j to
00:04:50.320
object d means that object j holds a
00:04:52.240
reference to object d
00:04:54.080
examples of references could be the
00:04:55.600
contents of arrays or instance variables
00:04:57.759
of objects
00:04:59.120
the first step of marking requires us to
00:05:00.800
mark the root objects the root objects
00:05:02.720
here are objects a and b
00:05:04.560
let's now mark these two objects
00:05:07.440
from a and b objects c and g are
00:05:09.680
reachable and not marked so we mark them
00:05:13.360
then from object c and g objects f and h
00:05:16.080
are reachable and not marked so we mark
00:05:18.240
them
00:05:19.440
objects f and h do not refer to any
00:05:21.680
unmarked objects since we did not mark
00:05:24.160
any objects in this step it means that
00:05:26.320
marking is complete
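The whole mark phase can be sketched as a breadth-first search over a reference graph; the hash below hand-codes the A through J example from the slides (real marking walks actual object references, of course):

```ruby
# A sketch of the mark phase as breadth-first search over a toy heap.
require "set"

refs = {
  a: [:c], b: [:g], c: [:f], g: [:h],
  d: [], e: [], f: [], h: [], i: [:j], j: [:d]
}
roots = [:a, :b]  # e.g. globals, classes, modules, constants

marked = Set.new
queue = roots.dup
until queue.empty?
  obj = queue.shift
  next if marked.include?(obj)
  marked << obj            # mark the object...
  queue.concat(refs[obj])  # ...and enqueue everything it references
end

live = marked.to_a.sort                # [:a, :b, :c, :f, :g, :h]
dead = (refs.keys - marked.to_a).sort  # [:d, :e, :i, :j] can be swept
```

Note that even though j holds a reference to d, neither is reachable from a root, so both stay unmarked.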
00:05:28.880
now that we finish marking all marked
00:05:31.039
objects are live objects and all
00:05:32.800
unmarked objects are dead and can be
00:05:34.720
reclaimed by the garbage collector
00:05:37.440
we can simply scan the slots on the
00:05:39.039
pages and free the ones that are not
00:05:40.880
marked
00:05:42.479
let's continue to use the example we
00:05:44.080
used for marking
00:05:45.520
here we move our cursor to the first
00:05:47.199
object in the page that is not marked
00:05:49.440
object d in this case
00:05:51.199
we can free the object and reclaim the
00:05:53.199
memory
00:05:54.560
we then continue to move the cursor
00:05:56.400
until we reach the next unmarked object
00:05:58.400
which is object e
00:06:00.400
we've reached the end of page 1 so let's
00:06:02.560
move to sweep page 2.
00:06:04.720
again we move the cursor until the first
00:06:06.560
unmarked object is reached and will free
00:06:08.880
object i
00:06:10.479
we can continue to move the cursor until
00:06:12.400
we reach the next unmarked object which
00:06:14.800
is object j
00:06:16.400
now that we've swept all the pages
00:06:18.319
sweeping is complete
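Sweeping is then a linear scan over every slot; here is a toy version using the same A through J example, with `nil` standing in for an empty T_NONE slot:

```ruby
# A toy sweep: scan each page's slots and free anything unmarked.
require "set"

pages = [
  [:a, :b, :c, :d, :e, :f, :g],     # page 1
  [:h, :i, :j, nil, nil, nil, nil]  # page 2
]
marked = Set[:a, :b, :c, :f, :g, :h]

pages.each do |page|
  page.map! do |slot|
    next slot if slot.nil? || marked.include?(slot)
    nil  # unmarked: free the object and reclaim the slot
  end
end

pages  # d, e, i and j have been swept; their slots are empty again
```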
00:06:20.479
manual compaction was a new feature
00:06:22.240
introduced in ruby 2.7 and further
00:06:24.560
improved with automatic compaction in
00:06:26.479
ruby 3.0
00:06:28.240
however this feature remains optional
00:06:30.240
and is turned off by default
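From ruby code you can exercise compaction directly: `GC.compact` has been available since 2.7, and automatic compaction is opt-in via `GC.auto_compact=` on 3.0 and later.

```ruby
# Manually compact the heap; the return value summarizes how many
# objects were considered and moved, keyed by object type.
stats = GC.compact
puts stats[:moved].inspect

# Opt in to automatic compaction during major GC (Ruby 3.0+).
GC.auto_compact = true if GC.respond_to?(:auto_compact=)
```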
00:06:32.479
compaction moves objects within the heap
00:06:34.400
to the start of the heap this has
00:06:36.240
several benefits including reduced
00:06:37.840
memory usage
00:06:39.199
faster garbage collection and better
00:06:41.039
copy-on-write performance
00:06:43.199
ruby uses the two-finger algorithm for
00:06:45.039
compaction which i will explain in more
00:06:47.360
detail
00:06:48.800
the compaction algorithm involves two
00:06:50.720
steps the compact step and the reference
00:06:52.960
updating step
00:06:54.400
the compact step moves objects from the
00:06:56.080
end of the heap to the start of the heap
00:06:58.319
when the step completes we do the
00:07:00.319
reference updating step in this step we
00:07:03.120
scan all the objects and update the
00:07:04.800
pointers that have moved
00:07:07.199
let's see an example of compaction in
00:07:09.199
action the last live object of the heap
00:07:11.759
is object h and the first empty slot in
00:07:14.240
the heap is slot one so we move object h
00:07:17.280
to slot one and leave the forwarding
00:07:18.960
address
00:07:20.160
this forwarding address will tell us
00:07:21.599
where the object was moved to we will
00:07:23.599
use this forwarding address in the
00:07:25.039
reference updating step
00:07:27.440
we move object g to slot 4 and leave a
00:07:29.919
forwarding address
00:07:31.759
we move object f to slot 5 and leave a
00:07:34.479
forwarding address
00:07:36.319
we are now done with the compact step because
00:07:38.240
there are no more objects to move
00:07:41.280
but we're not done yet we must do the
00:07:43.440
reference updating step
00:07:45.360
let's say our object references look
00:07:47.039
like this where the arrows indicate that
00:07:49.120
a particular object refers to another
00:07:51.039
object
00:07:52.400
we need to check the references of each
00:07:54.160
object and update the reference if it
00:07:56.240
points to a forwarding address
00:07:59.120
here object b has one reference to
00:08:01.440
a forwarding address
00:08:02.960
by reading the forwarding address we can
00:08:04.879
update this reference to the correct one
00:08:06.879
which is object h
00:08:09.440
next we see that object c holds two
00:08:12.080
references to moved objects we can
00:08:14.400
update the references of object c to
00:08:16.400
point to the correct ones
00:08:18.879
now that we have updated all the
00:08:20.240
references we can reclaim these
00:08:22.720
forwarding addresses since we don't need
00:08:24.479
them anymore we'll reclaim slots 7 8 and
00:08:27.840
11.
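The two-finger algorithm plus reference updating condenses into a short sketch. Here the heap is an array of slots, references are slot indices, and the `forward` hash plays the role of the forwarding addresses (all names illustrative):

```ruby
# A toy model of two-finger compaction followed by reference updating.
heap = [:a, :b, nil, :c, nil, nil, :f, :g, :h]  # nil = empty slot
refs = { 1 => [8], 3 => [7, 6] }  # slot 1 points at slot 8, etc.

# Compact step: one finger scans free slots from the front, the other
# scans live objects from the back; moved objects leave a forwarding
# address (old index -> new index) behind.
forward = {}
free, scan = 0, heap.length - 1
loop do
  free += 1 while free < scan && !heap[free].nil?
  scan -= 1 while scan > free && heap[scan].nil?
  break if free >= scan
  heap[free] = heap[scan]
  heap[scan] = nil
  forward[scan] = free
end

# Reference-updating step: rewrite every pointer to a moved object
# through the forwarding table, then the old slots can be reclaimed.
refs = refs.to_h do |from, targets|
  [forward.fetch(from, from), targets.map { |t| forward.fetch(t, t) }]
end
# heap is now [:a, :b, :h, :c, :g, :f, nil, nil, nil]
```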
00:08:29.360
so that's how the garbage collector
00:08:30.960
works
00:08:31.840
but throughout this talk we've been
00:08:33.599
referring to slots as always being a
00:08:35.279
fixed size and matt's going to talk
00:08:37.680
about how we handle the case where the
00:08:39.839
object needs to store additional data
00:08:43.680
so as we heard ruby objects aren't
00:08:45.519
always a fixed size
00:08:47.519
there are two scenarios to consider
00:08:50.560
when objects fit in the 24 bytes at
00:08:52.480
the end of an r value and when they
00:08:54.000
don't
00:08:55.279
we'll look at two strings hello world
00:08:57.600
which is 12 bytes and hello rubyconf
00:08:59.920
thanks for having us which is 36 bytes
00:09:03.440
when we allocate the string ruby checks
00:09:05.360
the requested string length against an
00:09:06.880
internal constant called r string
00:09:08.399
embed len max
00:09:10.000
and if the requested length is shorter
00:09:11.680
than r string embed len max which it
00:09:13.920
is in this case
00:09:15.279
it just pushes the string data straight
00:09:16.880
into the remainder of the slot after the
00:09:18.560
flags
00:09:20.800
when the string length is longer than r
00:09:22.959
string embed len max ruby sets a bit on
00:09:25.360
the r value's flags called the no embed
00:09:27.200
bit
00:09:28.720
this signifies that the data inside the
00:09:30.399
r value is not the actual string but a
00:09:32.720
pointer to a separately allocated memory
00:09:34.560
region
00:09:36.080
it then uses malloc to reserve a new
00:09:38.000
memory region outside of the existing
00:09:39.839
heap it stores the address to that
00:09:41.760
memory in the 24 bytes of the r value
00:09:43.839
after the flags and class
00:09:45.600
and copies the string data into that
00:09:47.279
newly mallocked memory
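You can observe the embed versus no-embed split from ruby with `ObjectSpace.memsize_of`. The reported sizes are version- and build-dependent, so treat the exact values as illustrative.

```ruby
# Observing embedded vs separately allocated string storage.
require "objspace"

short_str = "hello world"            # short enough to embed in the slot
long_str  = "hello rubyconf " * 100  # far too long: needs a malloc'd buffer

puts ObjectSpace.memsize_of(short_str)  # roughly the slot size
puts ObjectSpace.memsize_of(long_str)   # slot plus the external buffer
```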
00:09:49.760
and with that we've looked at the main
00:09:51.360
ways ruby stores objects in memory
00:09:53.760
so now pete is going to talk about some
00:09:55.680
of the challenges with a memory layout
00:09:57.120
like this and introduce our project
00:09:59.200
variable width allocation which is
00:10:00.800
attempting to fix them
00:10:03.120
let's discuss some bottlenecks that
00:10:04.800
exist within the ruby heap that we aim
00:10:06.880
to tackle with this project
00:10:09.600
the first issue is that a ruby object
00:10:11.839
often requires multiple pieces of memory
00:10:13.680
to store data
00:10:15.040
you've just seen an example of this with
00:10:16.640
the string example where the contents of
00:10:18.720
the string exist in a separate location
00:10:20.800
than the string object itself
00:10:22.880
this results in poor cache locality
00:10:25.360
additionally this memory is often
00:10:27.120
acquired through the system using malloc
00:10:28.959
which has a performance impact
00:10:31.440
we often think that the cpu only has a
00:10:33.360
single level of memory but that's not
00:10:35.360
true
00:10:36.240
in order to improve performance there's
00:10:38.160
faster memory built onto the cpu itself
00:10:40.320
called caches
00:10:42.000
these caches are further divided into
00:10:43.760
multiple levels usually three levels
00:10:46.800
the lower the cache level the closer the
00:10:48.720
cache is to the cpu core by being closer
00:10:51.680
the electric signal travels a shorter
00:10:53.519
distance so it has better performance
00:10:56.800
let's talk a little bit about how cpu
00:10:58.640
caches work whenever a piece of memory
00:11:01.360
is fetched from main memory it is also
00:11:03.040
stored in the cpu caches it's important
00:11:05.440
to note that data is fetched from main
00:11:07.279
memory in units of cache lines which on
00:11:09.920
the x86 architecture found in intel and
00:11:12.320
amd cpus is 64 bytes
00:11:15.600
what this means is that even if you need
00:11:17.279
just a single byte of data all 64 bytes
00:11:19.680
are fetched for you when the cpu cache is
00:11:22.000
full it will overwrite the least
00:11:23.839
recently used data to make space for the
00:11:26.000
most recently accessed data
00:11:28.560
two terms are used a cache hit is for
00:11:30.399
when a piece of data we access exists
00:11:32.640
in the cpu cache so we don't need to
00:11:34.640
access the main memory and conversely a
00:11:37.040
cache miss is when the data does not
00:11:38.880
exist in any caches so we need to do a
00:11:41.279
very slow read from the main memory
00:11:43.760
as you might expect a very important
00:11:45.600
metric for optimizing performance is to
00:11:48.000
increase the cache hit rate so we need
00:11:50.000
to read from the slow main memory less
00:11:52.160
frequently
00:11:54.720
let's see the cache performance of the
00:11:56.320
current ruby implementation let's use
00:11:58.480
the example of an array
00:12:00.240
in the current implementation the array
00:12:02.399
points to a region of memory in the
00:12:03.839
system heap acquired through malloc to
00:12:06.000
store its contents
00:12:08.320
our array has four elements so we need
00:12:10.480
four pointers to ruby objects since each
00:12:13.120
pointer is eight bytes on the 64-bit
00:12:14.880
system we need 32 bytes of memory
00:12:18.000
the object occupies one r value which is
00:12:20.399
40 bytes and the content occupies 32
00:12:22.639
bytes
00:12:23.839
in order to read the elements of this
00:12:25.440
array we must do two memory fetches to
00:12:27.600
fetch two cache lines
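A back-of-the-envelope version of that fetch count: because the r value and its malloc'd contents live at unrelated addresses, each allocation costs at least one cache-line fetch of its own.

```ruby
# Minimum cache-line fetches for the array example with separate
# allocations (a rough sketch; real access patterns also depend on
# alignment).
CACHE_LINE_SIZE = 64     # bytes on x86
rvalue_bytes    = 40
contents_bytes  = 4 * 8  # four 8-byte pointers

fetches = rvalue_bytes.fdiv(CACHE_LINE_SIZE).ceil +
          contents_bytes.fdiv(CACHE_LINE_SIZE).ceil
puts fetches  # => 2
```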
00:12:30.079
acquiring memory from the system heap
00:12:31.760
using malloc is not free it comes with
00:12:34.160
performance overhead so we want to
00:12:36.240
minimize the number of times we call
00:12:37.839
malloc
00:12:38.959
malloc also requires space for headers
00:12:40.800
that store metadata when allocating
00:12:42.560
memory
00:12:43.519
which may add up and result in increased
00:12:45.519
memory usage
00:12:46.959
we're not the first ones trying to
00:12:48.240
tackle this particular issue ruby 2.6
00:12:50.959
introduced a second heap into ruby
00:12:53.120
called the transient heap that aims to
00:12:55.040
speed up ruby by reducing the number of
00:12:56.800
malloc calls
00:12:58.560
it was successful in improving
00:12:59.920
performance and some benchmarks saw
00:13:02.000
performance improvements of up to 50 percent
00:13:05.279
however the transient heap has technical
00:13:07.360
limitations and cannot be a drop-in
00:13:09.600
replacement
00:13:12.399
one of the major goals of this project
00:13:14.480
is to improve the overall performance of
00:13:16.480
ruby by letting ruby control its memory
00:13:18.560
layout rather than calling the malloc
00:13:20.399
system library and relying on the system
00:13:22.480
to manage memory
00:13:24.639
since ruby's garbage collector is aware
00:13:26.480
of the memory structure of each type of
00:13:28.160
object it can optimize for faster
00:13:30.240
allocation and for better efficiency
00:13:33.279
by placing the contents of the object
00:13:35.120
right after the object itself we can
00:13:37.440
improve cache locality
00:13:39.440
by allocating dynamic size memory inside
00:13:42.000
the ruby garbage collector we can avoid
00:13:44.240
malloc calls and thus be able to
00:13:46.320
improve performance
00:13:49.440
now let's look at how the memory layout
00:13:51.600
changes when we use variable width
00:13:53.440
allocation
00:13:54.880
here we see the array example again what
00:13:57.199
this project aims to do is to move the
00:13:59.199
array's data from the system heap to the
00:14:01.279
ruby heap right after the object itself
00:14:04.399
so what does this accomplish for us
00:14:06.560
well we remove the need to do a malloc
00:14:08.480
system call when creating the array
00:14:10.079
object let's see if we improve cache
00:14:12.240
performance
00:14:13.360
again recall that every ruby object
00:14:15.600
requires allocating an r value which is
00:14:17.680
40 bytes in size
00:14:19.519
our array data itself is 32 bytes by
00:14:22.399
placing the contents of the array right
00:14:24.079
next to the array object itself we can
00:14:26.079
now access the first three elements of
00:14:27.839
the array while remaining in the same
00:14:29.600
cache line
00:14:32.480
so far we've merged two major iterations
00:14:34.800
into ruby and our third is open for
00:14:36.800
feedback now
00:14:38.560
in our first two pull requests we built
00:14:40.399
the infrastructure required to move data
00:14:42.240
that would previously have been
00:14:43.199
allocated on the system heap into a heap
00:14:45.199
page alongside the r value it's attached
00:14:47.279
to and we provided a reference
00:14:49.279
implementation in r class
00:14:51.440
the changes up until now have been
00:14:52.959
minimally intrusive as this feature was
00:14:55.040
turned off by default and required
00:14:57.040
recompiling with the compile-time flag
00:14:58.800
to enable it
00:15:00.240
in our latest pull request we've
00:15:02.000
implemented variable width allocation
00:15:03.600
for strings and additionally because
00:15:05.920
we're confident about the stability and
00:15:07.519
the performance of variable width
00:15:08.800
allocation we've also proposed to turn
00:15:11.040
it on by default
00:15:12.800
we'll talk more about the performance
00:15:14.160
results later in this talk
00:15:17.519
now let's walk through an example of
00:15:18.959
what this work does so far we're going
00:15:21.279
to talk about class allocation
00:15:23.279
these two scenarios allocate r values
00:15:25.279
containing r class objects a class has
00:15:27.839
data that's attached to it its instance
00:15:30.000
methods instance variables and so on so
00:15:32.800
ruby creates a separate data structure
00:15:34.480
for this called rb class ext it's a
00:15:37.519
fixed size of 104 bytes allocated on the
00:15:40.000
system heap
00:15:41.360
this is what the current allocation code
00:15:43.120
looks like that's not using variable
00:15:44.720
width allocation we first acquire an r
00:15:47.279
value from the garbage collector using
00:15:48.880
new object then we allocate a region for
00:15:51.680
the rb class ext struct using zalloc
00:15:54.880
which will call malloc to acquire memory
00:15:58.000
this is what the class allocation code
00:15:59.680
looks like when using variable width
00:16:01.360
allocation
00:16:02.639
first thing that stands out is that
00:16:04.240
we're calling a different new object
00:16:06.000
entry point and this one takes an
00:16:07.600
allocation size
00:16:09.199
this is the size in bytes of the slot
00:16:10.880
that we wish to allocate from the
00:16:12.240
garbage collector so 40 bytes for the r
00:16:14.480
value and 104 bytes for the rb class
00:16:17.199
ext struct which is 144 bytes in total
00:16:21.440
the garbage collector is guaranteed to
00:16:23.040
allocate a slot that is at least the
00:16:24.720
requested size but may allocate a slot
00:16:27.360
that is larger than the requested size
00:16:29.759
we will see the reason for this soon
00:16:32.720
you've just seen the api to allocate
00:16:34.560
classes with variable width allocation but
00:16:36.880
how does it all work
00:16:38.800
let's take a closer look inside the
00:16:40.240
garbage collector remember that ruby's
00:16:42.560
memory heap is divided into pages and
00:16:45.040
each page is made up of 40 byte slots
00:16:48.320
we've changed the heap to support pages
00:16:50.320
that contain different size slots to do
00:16:52.560
this we've introduced a new structure
00:16:54.399
called a size pool
00:16:55.839
pages within the size pool have a
00:16:57.440
particular slot size
00:16:59.360
the slot size is a power of 2 multiple
00:17:02.240
of r value size so
00:17:04.400
40 bytes 80 bytes 160 bytes 320 bytes
00:17:08.720
and so on
00:17:09.919
we chose powers of two multiples because
00:17:12.079
they're easiest to work with and
00:17:13.439
efficient since we can use bit shift
00:17:15.360
operations to calculate them we use
00:17:18.720
multiples of r value size so we don't
00:17:20.720
waste space when allocating a single r
00:17:22.640
value of space
00:17:24.400
since all objects that don't use
00:17:25.760
variable width allocation will only
00:17:27.600
allocate a single r value worth of space
00:17:32.000
here we see a diagram with four size
00:17:34.240
pools and their page slot sizes are
00:17:36.720
labeled
00:17:38.559
now we want to allocate a class object
00:17:41.520
we first need to calculate the size pool
00:17:43.679
which we're allocating into a class
00:17:46.160
object will need 144 bytes space 40
00:17:49.200
bytes for the object and 104 bytes for
00:17:51.360
the rb class ext struct this is 3.6 r
00:17:54.720
values wide which is rounded up to
00:17:57.280
the next power of 2 so it will require
00:17:59.840
four r values worth of space
00:18:01.919
placing it in the third size pool
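The size-pool selection just described can be sketched as follows (`slot_size_for` is an illustrative helper, not the actual C function):

```ruby
# A sketch of size-pool selection: round the requested size up to the
# next power-of-two multiple of the 40-byte RVALUE size.
RVALUE_SIZE = 40

def slot_size_for(bytes)
  rvalues = bytes.fdiv(RVALUE_SIZE).ceil  # e.g. 144 bytes -> 3.6 -> 4
  pow2 = 1
  pow2 <<= 1 while pow2 < rvalues         # round up to a power of two
  pow2 * RVALUE_SIZE
end

slot_size_for(40)   # =>  40: a plain r value, first size pool
slot_size_for(144)  # => 160: class + rb class ext, third size pool
slot_size_for(54)   # =>  80: the long string example, second size pool
```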
00:18:04.559
let's zoom into the size pool each size
00:18:07.039
pool has its own pool of pages and holds
00:18:09.039
a free list when we want to allocate we
00:18:11.360
just allocate into the head of the free
00:18:12.880
list and then move the free list pointer
00:18:14.640
to the next element in the linked list
00:18:17.360
now let's allocate our class object into
00:18:19.280
the free slot we allocate both the class
00:18:21.840
r value and the rb class ext struct into
00:18:24.960
the same slot
00:18:26.160
of course we waste about 16 bytes at the
00:18:28.480
end because we only used 144 bytes and
00:18:30.960
the slot is 160 bytes wide
00:18:34.320
so that's how an object of dynamic size
00:18:36.559
is allocated using variable width
00:18:38.400
allocation
00:18:40.400
in our latest pull request we
00:18:42.080
implemented variable width allocation
00:18:43.600
for strings
00:18:44.880
however unlike classes which only
00:18:46.799
allocate a fixed size data structure
00:18:49.039
strings are malleable and can change
00:18:50.480
sizes and this makes them more
00:18:51.919
challenging to deal with
00:18:53.919
earlier in this talk we spoke about the
00:18:55.520
two types of ruby strings
00:18:57.840
shorter embedded strings where the
00:19:00.000
string content is stored directly in the
00:19:01.760
r value itself and longer strings that
00:19:04.240
have their content separately allocated
00:19:06.000
using malloc and have just a reference
00:19:07.840
stored in the r value
00:19:10.080
prior to variable width allocation only
00:19:12.160
strings shorter than 24 bytes could be
00:19:14.080
embedded
00:19:15.360
as part of this project we changed string
00:19:17.280
allocation to enable longer length
00:19:18.880
strings to be embedded
00:19:21.120
so let's look at allocating our example
00:19:22.799
strings again with variable width
00:19:24.080
allocation and see what's changed
00:19:26.559
our first string hello world was
00:19:29.039
embeddable before variable width
00:19:30.320
allocation and this is still the case
00:19:33.039
we allocate the headers the flags and
00:19:34.799
the class as normal and then push the
00:19:36.720
string content in directly after them
00:19:39.840
those with a keen eye may notice that
00:19:41.600
the string headers are now 18 bytes
00:19:43.280
instead of 16.
00:19:45.679
this is because strings allocated using
00:19:47.520
variable width allocation have to
00:19:48.960
support a longer embed length and so we
00:19:51.120
need a couple of extra bytes to store
00:19:52.480
this larger number
00:19:54.960
so the headers now occupy 18 bytes
00:19:57.280
and the contents occupy 12 bytes which
00:19:59.200
means we need 30 bytes in total
00:20:02.640
this fits inside the slots of our first
00:20:04.480
size pool which are 40 bytes so we
00:20:06.559
request a free slot and allocate into it
00:20:08.720
with 10 bytes at the end of the slot
00:20:10.240
going unused
00:20:12.559
now let's look at allocating our
00:20:14.000
longer example string
00:20:16.480
remember that before variable width
00:20:18.080
allocation this string was too long to
00:20:19.679
be embedded so the contents were
00:20:21.360
allocated externally using malloc now
00:20:23.919
this string can also be embedded
00:20:26.320
the string contents are 36 bytes and the
00:20:28.480
headers are 18 bytes so this string
00:20:30.320
requires a total of 54 bytes
00:20:33.520
the second size pool is the smallest
00:20:35.120
pool into which this string fits as it
00:20:36.799
has a slot size of 80 bytes so we
00:20:39.039
allocate into a free slot from this pool
00:20:40.960
and 26 bytes go unused at the end of
00:20:42.880
the slot
00:20:45.200
but what happens if we try to resize a
00:20:47.039
string for example what if we tried to
00:20:49.520
double this string by appending it to
00:20:51.039
itself
00:20:52.240
the string is going to become too long
00:20:53.600
to fit in the slot
00:20:56.320
currently we don't have a great way to
00:20:57.919
handle this so we fall back to
00:20:59.440
allocating the contents outside of the
00:21:01.039
ruby heap just as we would have done
00:21:02.480
before vwa was implemented
00:21:05.440
to convert this embedded string to a no
00:21:07.200
embed string we first allocate memory
00:21:09.280
for the new string contents using malloc
00:21:12.320
then we convert the r string headers to
00:21:13.919
use the original 40 byte r string
00:21:15.600
structure and we set the no embed bit in
00:21:17.679
the flags
00:21:19.679
finally we set the string's data pointer
00:21:21.760
to point to the region we just allocated
00:21:25.280
now our string object is exactly the
00:21:27.440
same as if it had been allocated prior
00:21:29.120
to variable width allocation being
00:21:30.640
implemented
00:21:32.400
however this fallback solution does mean
00:21:34.159
that any space past 40 bytes is wasted
00:21:36.960
so as this is allocated in an 80 byte
00:21:38.960
slot resizing this string wastes 40
00:21:41.360
bytes
00:21:42.559
we don't have a way to deal with this
00:21:43.840
yet but we do have some ideas and we're
00:21:45.760
going to discuss those later on in this
00:21:47.200
talk
00:21:48.480
benchmarks of memory performance have
00:21:50.080
shown that this is not a very
00:21:51.600
significant issue for the workloads we
00:21:53.360
tested
00:21:55.120
now peter's going to tell you about our
00:21:56.799
benchmarks and talk about how we're
00:21:58.480
measuring runtime and memory performance
00:22:00.080
throughout this project
00:22:02.799
we deployed variable width allocation
00:22:04.799
and the corresponding commit on the ruby
00:22:06.640
master branch on a small portion of
00:22:08.559
production traffic of a shopify service
00:22:11.360
we collected data over one day where
00:22:13.360
they each served around 84 million
00:22:15.200
requests
00:22:16.320
we did not see a change in response
00:22:18.080
times beyond the margin of error we
00:22:20.159
looked at average 50th percentile 90th
00:22:22.960
percentile and 99th percentile response
00:22:25.280
times however variable width allocation
00:22:27.919
did decrease memory usage by about four
00:22:30.159
percent
00:22:31.760
synthetic benchmarks were run on a bare
00:22:33.840
metal ubuntu machine on aws rails bench
00:22:36.960
and rdoc generation was benchmarked
00:22:38.640
using the glibc and jemalloc allocators
00:22:41.840
glibc and jemalloc are implementations
00:22:44.159
of the c library function called malloc
00:22:46.080
used to allocate memory
00:22:48.000
different allocators are implemented
00:22:49.600
differently and so they have different
00:22:51.120
properties
00:22:52.320
glibc is the default allocator in ubuntu
00:22:55.280
if you're interested in looking at more
00:22:56.720
in-depth results and analysis check out
00:22:59.120
the ticket linked in the slides
00:23:02.080
in rails bench we see a slight increase
00:23:04.080
in memory usage when using glibc and no
00:23:06.559
increase when using jemalloc
00:23:08.720
however we also see a small performance
00:23:10.720
increase for both glibc and jemalloc
00:23:13.919
this graph shows the rss or resident
00:23:16.320
set size which is the amount of memory
00:23:18.159
held in ram
00:23:19.679
we ran the two branches 25 times each
00:23:22.080
and each run consists of 20 000 requests
00:23:25.280
the graph shows the memory usage over
00:23:27.280
time where variable width allocation is
00:23:29.360
graphed in green and the master branch
00:23:31.280
is graphed in blue
00:23:33.760
rdoc generation generates the ruby
00:23:35.679
documentation unlike rails bench where
00:23:38.000
variable width allocation used more
00:23:39.600
memory here we see significantly lower
00:23:42.400
memory usage for a slight reduction in
00:23:44.480
performance
00:23:45.840
you can see in this graph that the
00:23:47.279
branch uses significantly lower memory
00:23:51.200
we wrote some micro benchmarks to
00:23:52.799
highlight the performance increase of
00:23:54.320
variable width allocation
00:23:56.159
for these benchmarks we use 71 byte
00:23:58.799
strings which are embedded when using
00:24:00.640
variable width allocation and not
00:24:02.480
embedded when not using variable width
00:24:04.159
allocation
00:24:05.279
our benchmark results show that
00:24:06.880
comparing two strings is 35 percent
00:24:09.120
faster
00:24:10.159
the times method also benchmarks
00:24:11.840
allocation speed and is 15 percent faster
00:24:15.360
similarly setting a character in the
00:24:17.200
string is also 15 percent faster this clearly
00:24:20.400
shows the performance benefits of using
00:24:22.480
variable width allocation
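[a hedged re-creation of the micro-benchmark idea using the stdlib Benchmark module; the 71-byte string size matches the talk, but the iteration counts and labels are our own choices, not the exact benchmark code:]

```ruby
require "benchmark"

# Operate on 71-byte strings, which fit in an embedded slot under
# variable width allocation but need a separately malloc'd buffer
# without it. Timings will vary by machine and Ruby version.
a = "x" * 71
b = "x" * 71
n = 1_000_000

Benchmark.bm(10) do |x|
  x.report("compare")  { n.times { a == b } }     # string comparison
  x.report("allocate") { n.times { "x" * 71 } }   # allocation speed
  x.report("set char") { n.times { a[0] = "y" } } # in-place write
end
```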
00:24:25.120
we're confident about the stability of
00:24:27.039
variable width allocation
00:24:28.880
variable width allocation passes all
00:24:30.799
tests on ci on shopify's rails
00:24:33.039
monolith and we ran it for a small
00:24:35.200
portion of production traffic of the
00:24:36.720
shopify service for a week where it
00:24:38.799
served over 500 million requests
00:24:41.679
we also believe that the benchmark
00:24:43.200
results are satisfactory so in the
00:24:45.279
latest patch we propose that variable
00:24:47.039
width allocation is enabled by default
00:24:49.760
at the time of recording this talk this
00:24:51.520
proposal is open for feedback
00:24:53.919
however it's likely that it has
00:24:55.520
undergone changes and perhaps even
00:24:57.600
been merged so please check the ticket linked
00:24:59.760
in the slides for the most up-to-date
00:25:01.760
information
00:25:04.080
now that you've seen our current
00:25:05.440
implementation and some performance
00:25:07.279
results matt will now talk about some
00:25:09.520
limitations of this implementation and
00:25:11.679
how we're planning on tackling them
00:25:14.480
the first issue to talk about is
00:25:16.159
increasing coverage
00:25:18.000
currently only classes and strings are
00:25:19.760
taking advantage of the benefits of vwa
00:25:22.480
in order to see an even larger
00:25:23.840
performance improvement we need to have
00:25:25.679
as many object types as possible using
00:25:27.679
it
00:25:29.360
next up is resizing
00:25:32.400
as you saw in our string implementation
00:25:34.320
we don't currently have an elegant way
00:25:36.000
to resize objects
00:25:37.520
this is a tough problem and solving it is
00:25:39.440
one of our next major
00:25:40.840
milestones an idea we're currently
00:25:42.799
investigating is to leverage gc
00:25:44.640
compaction to move object data for us
00:25:48.159
so when we need to resize up and the
00:25:50.400
current slot is not large enough we
00:25:52.400
could allocate the new data into a slot
00:25:54.640
in a larger size pool and then use
00:25:56.559
compaction to move the original object
00:25:58.640
structure next to its data again and
00:26:00.799
free up the old slot
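[ruby already exposes compaction to user code, which is the mechanism this idea would build on; a tiny illustration, noting that the statistics returned are CRuby-specific and the allocation pattern here is just to create some heap fragmentation:]

```ruby
# GC.compact (CRuby 2.7+) triggers a full GC and moves surviving
# objects together to defragment the heap, returning statistics
# about the compaction as a hash.
strings = 10_000.times.map { |i| "str-#{i}" }
strings = strings.first(10) # drop most references, leaving holes

stats = GC.compact
puts stats.class # a Hash of compaction statistics
```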
00:26:03.919
one other thing that we'd like to do
00:26:05.120
eventually is to shrink the size of the
00:26:06.880
r value struct from 40 bytes down to 32
00:26:09.279
bytes
00:26:10.720
we want to do this because it means that
00:26:12.480
slots in the heap page will then be
00:26:13.840
aligned on cpu cache lines
00:26:16.400
for example in the diagram below we can
00:26:19.200
see that many of the 40 byte slots
00:26:20.880
straddle two cache lines and so reading
00:26:23.039
a single slot requires two potentially
00:26:24.799
expensive cache lookups
00:26:27.200
if the slot size was shrunk to 32 bytes
00:26:29.679
all the slots become aligned on cache
00:26:31.279
lines and reading a slot never requires
00:26:33.520
more than one cache lookup
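[the alignment arithmetic can be checked directly; assuming 64-byte cache lines and slots packed contiguously from a line-aligned page start, a quick sketch:]

```ruby
CACHE_LINE = 64 # bytes, a common cache line size (an assumption here)

# Count how many of the first n slots of a given size straddle a
# cache-line boundary (slots assumed packed from aligned offset 0).
def straddling_slots(slot_size, n)
  (0...n).count do |i|
    first = i * slot_size
    last  = first + slot_size - 1
    first / CACHE_LINE != last / CACHE_LINE # first and last byte on different lines
  end
end

puts straddling_slots(40, 16) # => 8: half the 40-byte slots straddle a boundary
puts straddling_slots(32, 16) # => 0: 32-byte slots always fit in one line
```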
00:26:35.679
this isn't something we can do right now
00:26:37.279
as too many objects are still using all
00:26:39.039
40 bytes of the r value so they'd end up
00:26:41.200
being allocated into a 64 byte slot or
00:26:43.520
larger and then wasting 24 bytes or more
00:26:48.000
overall we're proud of the progress that
00:26:49.760
we've made but we still have a long way
00:26:51.600
to go and many things to explore
00:26:54.240
we plan to share progress
00:26:55.919
updates with the community frequently
00:26:57.840
via conference talks and blog posts
00:27:00.320
but for now the best way to keep an eye
00:27:02.080
on what we're up to is the shopify ruby
00:27:04.000
fork on github and the ruby core bug
00:27:06.320
tracker
00:27:09.120
and that's it folks in this talk we
00:27:10.960
looked at how ruby's memory allocator
00:27:12.640
and garbage collector works some of the
00:27:14.720
issues in ruby's memory layout how we're
00:27:16.880
trying to solve them with variable width
00:27:18.320
allocation as well as our current
00:27:20.080
progress in the project we're continuing
00:27:22.320
to improve the implementation and we
00:27:24.159
still have a long way to go thanks for
00:27:26.159
listening and thanks to rubyconf for
00:27:28.159
having us