Optimizing Ruby's Memory Layout

by Peter Zhu and Matt Valentine-House

The video titled "Optimizing Ruby's Memory Layout," presented by Peter Zhu and Matt Valentine-House at RubyConf 2021, discusses the current memory model of Ruby, particularly its fixed-size slots for object storage and the inefficiencies arising from this structure. The presentation highlights their Variable Width Allocation project, aimed at improving Ruby's memory management to enhance performance. Key points discussed in the video include:

  • Ruby's Memory Management: Ruby uses a garbage collector to manage memory, with objects stored in a structure called an RVALUE, which is fixed at 40 bytes and includes flags and pointers.
  • Garbage Collection Phases: The garbage collection process consists of marking live objects and sweeping the heap to reclaim memory. Optional compaction helps reduce memory fragmentation.
  • Issues with Current Layout: Problems arise from objects needing multiple memory allocations, causing poor cache locality. Ruby's current model may lead to increased fetching times due to frequent malloc calls, negatively impacting performance.
  • Variable Width Allocation: This project aims to integrate both object data and additional memory directly within Ruby's heap, significantly reducing malloc calls. This can improve data locality and reduce memory overhead.
  • Performance Benchmarks: Early benchmarks showed a decrease in memory usage by 4% and performance improvements in scenarios using the new allocation methods, demonstrating a 35% increase in string comparison speed.
  • Future Directions: The speakers discuss plans for expanding coverage to more object types, enhancing string resizing capabilities, and reducing the size of the RVALUE structure to better align memory allocation with CPU cache lines.

The overall conclusion emphasizes the potential for improved performance and greater control over memory management in Ruby, making it an ongoing area of development and optimization for the community.

Ruby’s current memory model assumes all objects live in fixed size slots. This means that object data is often stored outside of the Ruby heap, which has implications: an object's data can be scattered across many locations, increasing complexity and causing poor locality resulting in reduced efficiency of CPU caches.

Join us as we explore how our Variable Width Allocation project will move system heap memory into Ruby heap memory, reducing system heap allocations and giving us finer control of the memory layout to optimize for performance.

RubyConf 2021

00:00:10.240 hello rubyconf i'm peter and i'm a ruby
00:00:12.799 core committer and senior developer at
00:00:14.480 shopify in the ruby infrastructure team
00:00:16.960 and i'm matt i'm also at shopify and i'm
00:00:19.199 also a senior developer in the ruby
00:00:20.720 infrastructure team
00:00:22.400 for the past year we've been working
00:00:24.000 together to make some improvements to
00:00:25.519 how memory is laid out in ruby
00:00:28.080 in this talk we're going to summarize
00:00:29.920 the work that we've done so far the
00:00:31.439 challenges that we've faced and where
00:00:32.719 we're heading in the future
00:00:34.880 but before we can go into detail we're
00:00:36.640 going to provide some background on how
00:00:38.160 ruby's garbage collector structures and
00:00:40.079 manages memory
00:00:44.000 when we create an object in ruby the
00:00:45.760 interpreter stores a data structure
00:00:47.200 called an r value in memory to hold it
00:00:50.000 an r value is always 40 bytes wide and
00:00:52.079 it contains a union of different data
00:00:53.600 structures that represent all the
00:00:55.039 possible ruby core types
00:00:57.199 so we could have a structure
00:00:58.320 representing a string
00:01:00.079 or an array or a class and so on
00:01:04.080 and each one of these types contain
00:01:05.439 their own fields and data structures
00:01:06.960 specific to that type
00:01:08.960 the r value itself is just a container
00:01:10.799 for our ruby objects
00:01:13.280 let's look more closely at one of ruby's
00:01:14.960 object types r class in this case
00:01:18.799 the first 16 bytes of every type are
00:01:20.799 always the same an 8 byte field called
00:01:22.960 flags followed by an 8 byte pointer to the
00:01:25.439 class of the object
00:01:28.080 the remaining 24 bytes are different for
00:01:30.079 every type
00:01:31.119 r class objects store a pointer to an
00:01:32.880 extension object here
00:01:34.640 strings can store the actual string
00:01:36.159 content and so on
00:01:38.960 these r values are organized into
00:01:40.560 regions called pages
00:01:42.159 a heap page is a 16 kilobyte memory
00:01:44.479 region
00:01:45.520 each page has a maximum of 409 slots
00:01:48.000 where each slot can hold one r value
00:01:52.079 a heap page's memory region is always
00:01:53.840 contiguous the slots have memory
00:01:55.600 addresses that are 40 bytes apart with
00:01:57.360 no gaps between them
00:01:59.840 initially when the heap page is created
00:02:02.000 all the slots will be filled with an
00:02:03.360 internal type called t none
00:02:05.840 this type represents an empty slot it
00:02:08.000 contains only flags
00:02:09.679 and a pointer value called next that can
00:02:11.680 point to an r value
00:02:14.000 this allows these empty slots to be
00:02:16.480 linked together to form a data structure
00:02:18.239 called the free list
00:02:20.400 when a heap page is initialized ruby
00:02:22.239 walks up each page visiting each slot in
00:02:24.480 turn
00:02:26.080 as we arrive at each slot we set a
00:02:27.920 pointer on the heap page called the free
00:02:29.440 list pointer to the address of that slot
00:02:32.000 and then set the slot's next pointer to
00:02:34.160 the address of the previous slot we
00:02:35.599 visited
00:02:37.120 this creates a singly linked list of
00:02:38.959 empty slots
00:02:41.680 when we need to allocate an object and
00:02:43.599 we need an empty slot
00:02:45.360 ruby asks the heap page for the first
00:02:46.959 slot on the free list
00:02:48.800 the heap page returns the slot and then
00:02:50.640 the free list pointer is updated to
00:02:52.160 point at the next slot in the list
00:02:55.599 this unlinks our slot from the free list
00:02:57.519 allowing us to put data in it
00:03:00.640 the use of a free list means that this
00:03:02.480 part of object allocation is both fast
00:03:04.879 and a constant time operation even when
00:03:07.040 a page is sparsely populated
00:03:10.000 because if the page has at least one
00:03:11.760 empty slot the free list is guaranteed
00:03:13.680 to always point to it
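
The free-list behaviour described here can be sketched as a small Ruby simulation (CRuby does this in C on raw 40-byte slots; `HeapPage` and its hash-based slots below are purely illustrative):

```ruby
# Toy model of a heap page. A real page is a contiguous 16 KB region
# holding up to 409 fixed-size 40-byte slots; here each slot is a hash
# with a next pointer, mimicking the T_NONE empty-slot type.
class HeapPage
  SLOT_COUNT = 409

  def initialize
    @slots = Array.new(SLOT_COUNT) { { type: :T_NONE, next: nil } }
    @freelist = nil
    # Walk the page, linking each empty slot onto the singly linked free list.
    @slots.each do |slot|
      slot[:next] = @freelist
      @freelist = slot
    end
  end

  # Pop the head of the free list: O(1) even on a sparsely populated page.
  def allocate
    slot = @freelist
    return nil if slot.nil?      # no free slots left: the GC must run
    @freelist = slot[:next]
    slot[:type] = :T_OBJECT
    slot
  end
end

page = HeapPage.new
objs = []
objs << page.allocate while objs.size < HeapPage::SLOT_COUNT
extra = page.allocate   # page is now full
```

Once every slot is handed out, `allocate` returns `nil`, which corresponds to the point in CRuby where allocation into the page fails and the garbage collector must be invoked.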
00:03:15.840 but eventually our heap pages are going
00:03:18.319 to run out of slots and this is where the
00:03:20.159 garbage collector is required
00:03:22.720 if the free list is empty when ruby
00:03:24.319 tries to allocate into the page it must
00:03:26.000 mean that there are no free slots left
00:03:27.680 on that page
00:03:29.680 so now peter's going to talk about how
00:03:31.599 and when the garbage collector runs to
00:03:33.120 release objects that are no longer
00:03:34.560 necessary
00:03:35.599 over to you peter
00:03:38.159 now that we've looked at how memory is
00:03:39.599 laid out inside ruby let's look at how
00:03:41.840 ruby's garbage collector works
00:03:44.239 ruby's garbage collection process consists
00:03:46.159 of three distinct phases marking
00:03:48.799 sweeping and compaction
00:03:51.200 the last phase compaction is optional
00:03:53.760 and does not run unless manually enabled
00:03:56.560 ruby's garbage collection is stop
00:03:58.080 the world meaning that ruby code is not
00:03:59.920 executed when the garbage collector is
00:04:01.760 active
00:04:02.959 the garbage collector in ruby is fairly
00:04:04.640 complex so we can't possibly cover all
00:04:06.720 the technical details in just a few
00:04:08.480 minutes but we'll provide a very high
00:04:10.640 level overview that contains all the
00:04:12.400 information you will need to know for
00:04:14.239 this talk
00:04:15.840 let's first look at marking marking is
00:04:18.160 the phase where we determine which ruby
00:04:19.919 objects are alive and which ones can be
00:04:22.079 freed
00:04:22.960 we first mark the roots which include
00:04:24.880 things like global variables
00:04:26.639 classes modules and constants
00:04:29.840 we then perform breadth-first search to
00:04:31.600 traverse the network of live objects
00:04:34.639 let's see an example of this here's a
00:04:36.639 toy example of a heap with two pages
00:04:38.639 with seven slots on each page
00:04:40.800 we have a total of 10 objects labeled a
00:04:42.800 through j
00:04:43.919 slots with no label mean that the slot
00:04:45.600 is empty the arrows indicate references
00:04:48.240 for example the arrow from object j to
00:04:50.320 object d means that object j holds a
00:04:52.240 reference to object d
00:04:54.080 examples of references could be the
00:04:55.600 contents of arrays or instance variables
00:04:57.759 of objects
00:04:59.120 the first step of marking requires us to
00:05:00.800 mark the root objects the root objects
00:05:02.720 here are objects a and b
00:05:04.560 let's now mark these two objects
00:05:07.440 from a and b objects c and g are
00:05:09.680 reachable and not marked so we mark them
00:05:13.360 then from object c and g objects f and h
00:05:16.080 are reachable and not marked so we mark
00:05:18.240 them
00:05:19.440 objects f and h do not refer to any
00:05:21.680 unmarked objects since we did not mark
00:05:24.160 any objects in this step it means that
00:05:26.320 marking is complete
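
The mark phase just described (mark the roots, then traverse breadth-first) reduces to a textbook graph search. A minimal Ruby sketch, reusing the talk's toy object graph (the exact edges here are illustrative):

```ruby
# references[x] lists the objects x refers to, like the arrows on the slide.
references = {
  "A" => ["C"], "B" => ["G"], "C" => ["F", "H"], "G" => ["H"],
  "D" => [], "E" => [], "F" => [], "H" => [], "I" => [], "J" => ["D"]
}
roots = ["A", "B"]   # globals, classes, modules, constants, etc.

# Breadth-first search: mark each reachable, not-yet-marked object.
marked = {}
queue = roots.dup
until queue.empty?
  obj = queue.shift
  next if marked[obj]
  marked[obj] = true
  queue.concat(references[obj])
end

live = marked.keys.sort
```

Everything left unmarked (D, E, I, J — note J is unreachable even though it refers to D) is garbage for the sweep phase to reclaim.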
00:05:28.880 now that we've finished marking all marked
00:05:31.039 objects are live objects and all
00:05:32.800 unmarked objects are dead and can be
00:05:34.720 reclaimed by the garbage collector
00:05:37.440 we can simply scan the slots on the
00:05:39.039 pages and free the ones that are not
00:05:40.880 marked
00:05:42.479 let's continue to use the example we
00:05:44.080 used for marking
00:05:45.520 here we move our cursor to the first
00:05:47.199 object in the page that is not marked
00:05:49.440 object d in this case
00:05:51.199 we can free the object and reclaim the
00:05:53.199 memory
00:05:54.560 we then continue to move the cursor
00:05:56.400 until we reach the next unmarked object
00:05:58.400 which is object e
00:06:00.400 we've reached the end of page 1 so let's
00:06:02.560 move to sweep page 2.
00:06:04.720 again we move the cursor until the first
00:06:06.560 unmarked object is reached and we'll free
00:06:08.880 object i
00:06:10.479 we can continue to move the cursor until
00:06:12.400 we reach the next unmarked object which
00:06:14.800 is object j
00:06:16.400 now that we've swept all the pages
00:06:18.319 sweeping is complete
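
Sweeping is then a linear cursor scan over each page, freeing any unmarked slot. A simulation of the two-page example above (real sweeping also relinks freed slots onto the page's free list):

```ruby
# Each page is an array of slots; nil represents an already-empty slot.
pages = [
  ["A", "B", "C", "D", "E", "F", "G"],   # page 1
  ["H", "I", "J", nil, nil, nil, nil]    # page 2
]
marked = { "A" => true, "B" => true, "C" => true,
           "F" => true, "G" => true, "H" => true }

freed = []
pages.each do |page|
  page.each_index do |i|
    obj = page[i]
    next if obj.nil?          # cursor skips empty slots
    unless marked[obj]        # unmarked object: reclaim its memory
      freed << obj
      page[i] = nil
    end
  end
end
```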
00:06:20.479 manual compaction was a new feature
00:06:22.240 introduced in ruby 2.7 and further
00:06:24.560 improved with automatic compaction in
00:06:26.479 ruby 3.0
00:06:28.240 however this feature remains optional
00:06:30.240 and is turned off by default
00:06:32.479 compaction moves objects within the heap
00:06:34.400 to the start of the heap this has
00:06:36.240 several benefits including reduced
00:06:37.840 memory usage
00:06:39.199 faster garbage collection and better
00:06:41.039 copy on write performance
00:06:43.199 ruby uses the two-finger algorithm for
00:06:45.039 compaction which i will explain in more
00:06:47.360 detail
00:06:48.800 the compaction algorithm involves two
00:06:50.720 steps the compact step and the reference
00:06:52.960 updating step
00:06:54.400 the compact step moves objects from the
00:06:56.080 end of the heap to the start of the heap
00:06:58.319 when the step completes we do the
00:07:00.319 reference updating step in this step we
00:07:03.120 scan all the objects and update the
00:07:04.800 pointers that have moved
00:07:07.199 let's see an example of compaction in
00:07:09.199 action the last live object of the heap
00:07:11.759 is object h and the first empty slot in
00:07:14.240 the heap is slot one so we move object h
00:07:17.280 to slot one and leave the forwarding
00:07:18.960 address
00:07:20.160 this forwarding address will tell us
00:07:21.599 where the object was moved to we will
00:07:23.599 use this forwarding address in the
00:07:25.039 reference updating step
00:07:27.440 we moved object g to slot 4 and leave a
00:07:29.919 forwarding address
00:07:31.759 we move object f to slot 5 and leave a
00:07:34.479 forwarding address
00:07:36.319 we are now done with the compact step because
00:07:38.240 there are no more objects to move
00:07:41.280 but we're not done yet we must do the
00:07:43.440 reference updating step
00:07:45.360 let's say our object references look
00:07:47.039 like this where the arrows indicate that
00:07:49.120 a particular object refers to another
00:07:51.039 object
00:07:52.400 we need to check the references of each
00:07:54.160 object and update the reference if it
00:07:56.240 points to a forwarding address
00:07:59.120 here object b has a reference to a
00:08:01.440 forwarding address
00:08:02.960 by reading the forwarding address we can
00:08:04.879 update this reference to the correct one
00:08:06.879 which is object h
00:08:09.440 next we see that object c holds two
00:08:12.080 references to moved objects we can
00:08:14.400 update the references of object c to
00:08:16.400 point to the correct ones
00:08:18.879 now that we have updated all the
00:08:20.240 references we can reclaim these
00:08:22.720 forwarding addresses since we don't need
00:08:24.479 them anymore we'll reclaim slots 7 8 and
00:08:27.840 11.
00:08:29.360 so that's how the garbage collector
00:08:30.960 works
00:08:31.840 but throughout this talk we've been
00:08:33.599 referring to slots as always being a
00:08:35.279 fixed size and matt's going to talk
00:08:37.680 about how we handle the case where the
00:08:39.839 object needs to store additional data
00:08:43.680 so as we heard ruby objects aren't
00:08:45.519 always a fixed size
00:08:47.519 there are two scenarios to consider
00:08:50.560 when the object fits in the 24 bytes at
00:08:52.480 the end of an r value and when it
00:08:54.000 doesn't
00:08:55.279 we'll look at two strings hello world
00:08:57.600 which is 12 bytes and hello rubyconf
00:08:59.920 thanks for having us which is 36 bytes
00:09:03.440 when we allocate the string ruby checks
00:09:05.360 the requested string length against an
00:09:06.880 internal constant called r string
00:09:08.399 embed len max
00:09:10.000 and if the requested length is shorter
00:09:11.680 than r string embed len max which it
00:09:13.920 is in this case
00:09:15.279 it just pushes the string data straight
00:09:16.880 into the remainder of the slot after the
00:09:18.560 flags
00:09:20.800 when the string length is longer than r
00:09:22.959 string embed len max ruby sets a bit on
00:09:25.360 the r value's flags called the no embed
00:09:27.200 bit
00:09:28.720 this signifies that the data inside the
00:09:30.399 r value is not the actual string but a
00:09:32.720 pointer to a separately allocated memory
00:09:34.560 region
00:09:36.080 it then uses malloc to reserve a new
00:09:38.000 memory region outside of the existing
00:09:39.839 heap it stores the address to that
00:09:41.760 memory in the 24 bytes of the r value
00:09:43.839 after the flags and class
00:09:45.600 and copies the string data into that
00:09:47.279 newly mallocked memory
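
That embed-or-malloc decision can be captured in a short Ruby sketch. The 24-byte payload mirrors the role of `RSTRING_EMBED_LEN_MAX` on 64-bit builds, but the hashes below are a simulation, not CRuby's RString struct (the talk's 12- and 36-byte figures presumably include a terminating byte on top of the 11 and 35 bytes of content):

```ruby
# Pre-VWA 64-bit layout: 16 bytes of headers (flags + class pointer)
# leave 24 bytes of payload in the 40-byte slot for embedded data.
RSTRING_EMBED_LEN_MAX = 24

def allocate_string(str)
  if str.bytesize < RSTRING_EMBED_LEN_MAX
    # Short enough: copy the bytes straight into the slot after the headers.
    { embedded: true, data: str }
  else
    # Too long: set the no-embed flag bit and store only a pointer to a
    # separately malloc'd region (modelled here as a nested hash).
    { embedded: false, noembed_bit: true, ptr: { malloc_region: str } }
  end
end

short = allocate_string("hello world")
long  = allocate_string("hello rubyconf thanks for having us")
```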
00:09:49.760 and with that we've looked at the main
00:09:51.360 ways ruby stores objects in memory
00:09:53.760 so now pete is going to talk about some
00:09:55.680 of the challenges with a memory layout
00:09:57.120 like this and introduce our project
00:09:59.200 variable width allocation which is
00:10:00.800 attempting to fix them
00:10:03.120 let's discuss some bottlenecks that
00:10:04.800 exist within the ruby heap that we aim
00:10:06.880 to tackle with this project
00:10:09.600 the first issue is that a ruby object
00:10:11.839 often requires multiple pieces of memory
00:10:13.680 to store data
00:10:15.040 you've just seen an example of this with
00:10:16.640 the string example where the contents of
00:10:18.720 the string exist in a separate location
00:10:20.800 than the string object itself
00:10:22.880 this results in poor cache locality
00:10:25.360 additionally this memory is often
00:10:27.120 acquired through the system using malloc
00:10:28.959 which has a performance impact
00:10:31.440 we often think that the cpu only has a
00:10:33.360 single level of memory but that's not
00:10:35.360 true
00:10:36.240 in order to improve performance there's
00:10:38.160 faster memory built onto the cpu itself
00:10:40.320 called caches
00:10:42.000 these caches are further divided into
00:10:43.760 multiple levels usually three levels
00:10:46.800 the lower the cache level the closer the
00:10:48.720 cache is to the cpu core by being closer
00:10:51.680 the electric signal travels a shorter
00:10:53.519 distance so it has better performance
00:10:56.800 let's talk a little bit about how cpu
00:10:58.640 caches work whenever a piece of memory
00:11:01.360 is fetched from main memory it is also
00:11:03.040 stored in the cpu caches it's important
00:11:05.440 to note that data is fetched from main
00:11:07.279 memory in units of cache lines which on
00:11:09.920 the x86 architecture found in intel and
00:11:12.320 amd cpus is 64 bytes
00:11:15.600 what this means is that even if you need
00:11:17.279 just a single byte of data all 64 bytes
00:11:19.680 is fetch for you when the cpu cache is
00:11:22.000 full it will overwrite the least
00:11:23.839 recently used data to make space for the
00:11:26.000 most recently accessed data
00:11:28.560 two terms are used a cache hit is
00:11:30.399 when a piece of data we access exists
00:11:32.640 in the cpu cache so we don't need to
00:11:34.640 access the main memory and conversely a
00:11:37.040 cache miss is when the data does not
00:11:38.880 exist in any caches so we need to do a
00:11:41.279 very slow read from the main memory
00:11:43.760 as you might expect a very important
00:11:45.600 metric for optimizing performance is to
00:11:48.000 increase the cache hit rate so we need
00:11:50.000 to read from the slow main memory less
00:11:52.160 frequently
00:11:54.720 let's see the cache performance of the
00:11:56.320 current ruby implementation let's use
00:11:58.480 the example of an array
00:12:00.240 in the current implementation the array
00:12:02.399 points to a region of memory in the
00:12:03.839 system heap acquired through malloc to
00:12:06.000 store its contents
00:12:08.320 our array has four elements so we need
00:12:10.480 four pointers to ruby objects since each
00:12:13.120 pointer is eight bytes on the 64-bit
00:12:14.880 system we need 32 bytes of memory
00:12:18.000 the object occupies one r value which is
00:12:20.399 40 bytes and the content occupies 32
00:12:22.639 bytes
00:12:23.839 in order to read the elements of this
00:12:25.440 array we must do two memory fetches to
00:12:27.600 fetch two cache lines
00:12:30.079 acquiring memory from the system heap
00:12:31.760 using malloc is not free it comes with
00:12:34.160 performance overhead so we want to
00:12:36.240 minimize the number of times we call
00:12:37.839 malloc
00:12:38.959 malloc also requires space for headers
00:12:40.800 that store metadata when allocating
00:12:42.560 memory
00:12:43.519 which may add up and result in increased
00:12:45.519 memory usage
00:12:46.959 we're not the first ones trying to
00:12:48.240 tackle this particular issue ruby 2.6
00:12:50.959 introduced a second heap into ruby
00:12:53.120 called the transient heap that aims to
00:12:55.040 speed up ruby by reducing the number of
00:12:56.800 malloc calls
00:12:58.560 it was successful in improving
00:12:59.920 performance and some benchmarks saw
00:13:02.000 performance improvement by up to 50 percent
00:13:05.279 however the transient heap has technical
00:13:07.360 limitations and cannot be a drop-in
00:13:09.600 replacement
00:13:12.399 one of the major goals of this project
00:13:14.480 is to improve the overall performance of
00:13:16.480 ruby by letting ruby control its memory
00:13:18.560 layout rather than calling the malloc
00:13:20.399 system library and relying on the system
00:13:22.480 to manage memory
00:13:24.639 since ruby's garbage collector is aware
00:13:26.480 of the memory structure of each type of
00:13:28.160 object it can optimize for faster
00:13:30.240 allocation and for better efficiency
00:13:33.279 by placing the contents of the object
00:13:35.120 right after the object itself we can
00:13:37.440 improve cache locality
00:13:39.440 by allocating dynamic size memory inside
00:13:42.000 the ruby garbage collector we can avoid
00:13:44.240 malloc calls and thus be able to
00:13:46.320 improve performance
00:13:49.440 now let's look at how the memory layout
00:13:51.600 changes when we use variable width
00:13:53.440 allocation
00:13:54.880 here we see the array example again what
00:13:57.199 this project aims to do is to move the
00:13:59.199 array's data from the system heap to the
00:14:01.279 ruby heap right after the object itself
00:14:04.399 so what does this accomplish for us
00:14:06.560 well we removed the need to do a malloc
00:14:08.480 call when creating the array
00:14:10.079 object let's see if we improve cache
00:14:12.240 performance
00:14:13.360 again recall that every ruby object
00:14:15.600 requires allocating an r value which is
00:14:17.680 40 bytes in size
00:14:19.519 our array data itself is 32 bytes by
00:14:22.399 placing the contents of the array right
00:14:24.079 next to the array object itself we can
00:14:26.079 now access the first three elements of
00:14:27.839 the array while remaining in the same
00:14:29.600 cache line
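
The fetch counts in these two array examples are simple arithmetic. A sketch, assuming 64-byte cache lines and allocations that start on a line boundary (a simplification; real objects may straddle lines):

```ruby
CACHE_LINE = 64   # bytes per cache line on x86

# Number of cache lines (and thus worst-case main-memory fetches)
# spanned by an allocation of the given size.
def cache_lines(bytes)
  (bytes.to_f / CACHE_LINE).ceil
end

# Current layout: the 40-byte slot and the 32 bytes of element pointers
# live in separate regions, so reading the elements costs two fetches.
separate_fetches = cache_lines(40) + cache_lines(4 * 8)

# VWA layout: the contents sit immediately after the slot, so the header
# plus the first n 8-byte element pointers occupy 40 + 8n contiguous bytes.
def lines_for_first_elements(n)
  cache_lines(40 + n * 8)
end
```

With the contiguous layout the header and the first three elements fit in a single 64-byte line; only the fourth element spills into a second line.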
00:14:32.480 so far we've merged two major iterations
00:14:34.800 into ruby and our third is open for
00:14:36.800 feedback now
00:14:38.560 in our first two pull requests we built
00:14:40.399 the infrastructure required to move data
00:14:42.240 that would previously have been
00:14:43.199 allocated on the system heap into a heap
00:14:45.199 page alongside the r value it's attached
00:14:47.279 to and we provided a reference
00:14:49.279 implementation in r class
00:14:51.440 the changes up until now have been
00:14:52.959 minimally intrusive as this feature was
00:14:55.040 turned off by default and required
00:14:57.040 recompiling with the compile-time flag
00:14:58.800 to enable it
00:15:00.240 in our latest pull request we've
00:15:02.000 implemented variable width allocation
00:15:03.600 for strings and additionally because
00:15:05.920 we're confident about the stability and
00:15:07.519 the performance of variable width
00:15:08.800 allocation we've also proposed to turn
00:15:11.040 it on by default
00:15:12.800 we'll talk more about the performance
00:15:14.160 results later in this talk
00:15:17.519 now let's walk through an example of
00:15:18.959 what this work does so far we're going
00:15:21.279 to talk about class allocation
00:15:23.279 these two scenarios allocate r values
00:15:25.279 containing r class objects a class has
00:15:27.839 data that's attached to it its instance
00:15:30.000 methods instance variables and so on so
00:15:32.800 ruby creates a separate data structure
00:15:34.480 for this called rb class ext it's a
00:15:37.519 fixed size of 104 bytes allocated on the
00:15:40.000 system heap
00:15:41.360 this is what the current allocation code
00:15:43.120 looks like that's not using variable
00:15:44.720 width allocation we first acquire an r
00:15:47.279 value from the garbage collector using
00:15:48.880 new object then we allocate a region for
00:15:51.680 the rb class ext struct using zalloc
00:15:54.880 which will call malloc to acquire memory
00:15:58.000 this is what the class allocation code
00:15:59.680 looks like when using variable width
00:16:01.360 allocation
00:16:02.639 first thing that stands out is that
00:16:04.240 we're calling a different new object
00:16:06.000 entry point and this one takes an
00:16:07.600 allocation size
00:16:09.199 this is the size in bytes of the slot
00:16:10.880 that we wish to allocate from the
00:16:12.240 garbage collector so 40 bytes for the r
00:16:14.480 value and 104 bytes for the rb class
00:16:17.199 ext struct which is 144 bytes in total
00:16:21.440 the garbage collector is guaranteed to
00:16:23.040 allocate a slot that is at least the
00:16:24.720 requested size but may allocate a slot
00:16:27.360 that is larger than the requested size
00:16:29.759 we will see the reason for this soon
00:16:32.720 you've just seen the api to allocate
00:16:34.560 classes with variable width allocation but
00:16:36.880 how does it all work
00:16:38.800 let's take a closer look inside the
00:16:40.240 garbage collector remember that ruby's
00:16:42.560 memory heap is divided into pages and
00:16:45.040 each page is made up of 40 byte slots
00:16:48.320 we've changed the heap to support pages
00:16:50.320 that contain different size slots to do
00:16:52.560 this we've introduced a new structure
00:16:54.399 called a size pool
00:16:55.839 pages within the size pool have a
00:16:57.440 particular slot size
00:16:59.360 the slot size is a power of 2 multiple
00:17:02.240 of the r value size so
00:17:04.400 40 bytes 80 bytes 160 bytes 320 bytes
00:17:08.720 and so on
00:17:09.919 we chose powers of two multiples because
00:17:12.079 they're easiest to work with and
00:17:13.439 efficient since we can use bit shift
00:17:15.360 operations to calculate them we use
00:17:18.720 multiples of r value size so we don't
00:17:20.720 waste space when allocating a single r
00:17:22.640 value of space
00:17:24.400 since all objects that don't use
00:17:25.760 variable with allocation will only
00:17:27.600 allocate a single r value worth of space
00:17:32.000 here we see a diagram with four size
00:17:34.240 pools and their page slot sizes are
00:17:36.720 labeled
00:17:38.559 now we want to allocate a class object
00:17:41.520 we first need to calculate the size pool
00:17:43.679 which we're allocating into a class
00:17:46.160 object will need 144 bytes space 40
00:17:49.200 bytes for the object and 104 bytes for
00:17:51.360 the rb class ext struct this is 3.6 r
00:17:54.720 values wide which is rounded up to
00:17:57.280 the next power of 2 so it will require
00:17:59.840 four r values worth of space
00:18:01.919 or one slot in the third size pool
00:18:04.559 let's zoom into the size pool each size
00:18:07.039 pool has its own pool of pages and holds
00:18:09.039 a free list when we want to allocate we
00:18:11.360 just allocate into the head of the free
00:18:12.880 list and then move the free list pointer
00:18:14.640 to the next element in the linked list
00:18:17.360 now let's allocate our class object into
00:18:19.280 the free slot we allocate both the class
00:18:21.840 r value and the rb class ext struct into
00:18:24.960 the same slot
00:18:26.160 of course we waste about 16 bytes at the
00:18:28.480 end because we only used 144 bytes and
00:18:30.960 the slot is 160 bytes wide
00:18:34.320 so that's how an object of dynamic size
00:18:36.559 is allocated using variable width
00:18:38.400 allocation
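
The size-pool selection above is just rounding up to a power-of-two multiple of the 40-byte slot. A sketch of that arithmetic (the function name is illustrative, not CRuby's):

```ruby
RVALUE_SIZE = 40

# Smallest power-of-two multiple of RVALUE_SIZE that fits the request,
# plus the size pool it lives in (pool 0: 40-byte slots, pool 1: 80,
# pool 2: 160, pool 3: 320, ...).
def size_pool_for(bytes)
  slots_needed = (bytes + RVALUE_SIZE - 1) / RVALUE_SIZE  # ceiling division
  pool = 0
  pool += 1 while (1 << pool) < slots_needed   # round up to a power of two
  { pool: pool, slot_size: RVALUE_SIZE << pool }
end

# The class example: 40-byte r value + 104-byte rb class ext = 144 bytes,
# i.e. 3.6 slots, rounded up to 4, landing in the 160-byte third pool.
klass = size_pool_for(40 + 104)
```

The same function places the talk's string examples: a 30-byte string lands in a 40-byte slot and a 54-byte string in an 80-byte slot.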
00:18:40.400 in our latest pull request we
00:18:42.080 implemented variable width allocation
00:18:43.600 for strings
00:18:44.880 however unlike classes which only
00:18:46.799 allocate a fixed size data structure
00:18:49.039 strings are malleable and can change
00:18:50.480 sizes and this makes them more
00:18:51.919 challenging to deal with
00:18:53.919 earlier in this talk we spoke about the
00:18:55.520 two types of ruby strings
00:18:57.840 shorter embedded strings where the
00:19:00.000 string content is stored directly in the
00:19:01.760 r value itself and longer strings that
00:19:04.240 have their content separately allocated
00:19:06.000 using malloc and have just a reference
00:19:07.840 stored in the r value
00:19:10.080 prior to variable width allocation only
00:19:12.160 strings shorter than 24 bytes could be
00:19:14.080 embedded
00:19:15.360 as part of this project we changed string
00:19:17.280 allocation to enable longer length
00:19:18.880 strings to be embedded
00:19:21.120 so let's look at allocating our example
00:19:22.799 strings again with variable width
00:19:24.080 allocation and see what's changed
00:19:26.559 our first string hello world was
00:19:29.039 embeddable before variable width
00:19:30.320 allocation and this is still the case
00:19:33.039 we allocate the headers the flags and
00:19:34.799 the class as normal and then push the
00:19:36.720 string content in directly after them
00:19:39.840 those with a keen eye may notice that
00:19:41.600 the string headers are now 18 bytes
00:19:43.280 instead of 16.
00:19:45.679 this is because strings allocated using
00:19:47.520 variable width allocation have to
00:19:48.960 support a longer embed length and so we
00:19:51.120 need a couple of extra bytes to store
00:19:52.480 this larger number
00:19:54.960 so the headers now occupy 18 bytes
00:19:57.280 and the contents occupy 12 bytes which
00:19:59.200 means we need 30 bytes in total
00:20:02.640 this fits inside the slots of our first
00:20:04.480 size pool which are 40 bytes so we
00:20:06.559 request a free slot and allocate into it
00:20:08.720 with 10 bytes at the end of the slot
00:20:10.240 going unused
00:20:12.559 now let's look at allocating our
00:20:14.000 longer example string
00:20:16.480 remember that before variable width
00:20:18.080 allocation this string was too long to
00:20:19.679 be embedded so the contents were
00:20:21.360 allocated externally using malloc now
00:20:23.919 this string can also be embedded
00:20:26.320 the string contents are 36 bytes and the
00:20:28.480 headers are 18 bytes so this string
00:20:30.320 requires a total of 54 bytes
00:20:33.520 the second size pool is the smallest
00:20:35.120 pool into which this string fits as it
00:20:36.799 has a slot size of 80 bytes so we
00:20:39.039 allocate into a free slot from this pool
00:20:40.960 and 26 bytes go unused at the end of
00:20:42.880 the slot
00:20:45.200 but what happens if we try to resize a
00:20:47.039 string for example what if we tried to
00:20:49.520 double this string by appending it to
00:20:51.039 itself
00:20:52.240 the string is going to become too long
00:20:53.600 to fit in the slot
00:20:56.320 currently we don't have a great way to
00:20:57.919 handle this so we fall back to
00:20:59.440 allocating the contents outside of the
00:21:01.039 ruby heap just as we would have done
00:21:02.480 before vwa was implemented
00:21:05.440 to convert this embedded string to a no
00:21:07.200 embed string we first allocate memory
00:21:09.280 for the new string contents using malloc
00:21:12.320 then we convert the r string headers to
00:21:13.919 use the original 40 byte r string
00:21:15.600 structure and we set the no embed bit in
00:21:17.679 the flags
00:21:19.679 finally we set the string's data pointer
00:21:21.760 to point to the region we just allocated
00:21:25.280 now our string object is exactly the
00:21:27.440 same as if it had been allocated prior
00:21:29.120 to variable width allocation being
00:21:30.640 implemented
00:21:32.400 however this fallback solution does mean
00:21:34.159 that any space past 40 bytes is wasted
00:21:36.960 so as this is allocated in an 80 byte
00:21:38.960 slot resizing this string wastes 40
00:21:41.360 bytes
00:21:42.559 we don't have a way to deal with this
00:21:43.840 yet but we do have some ideas and we're
00:21:45.760 going to discuss those later on in this
00:21:47.200 talk
00:21:48.480 benchmarks of memory performance have
00:21:50.080 shown that this is not a very
00:21:51.600 significant issue for the workloads we
00:21:53.360 tested
00:21:55.120 now peter's going to tell you about our
00:21:56.799 benchmarks and talk about how we're
00:21:58.480 measuring runtime and memory performance
00:22:00.080 throughout this project
00:22:02.799 we deployed variable width allocation
00:22:04.799 and the corresponding commit on the ruby
00:22:06.640 master branch on a small portion of
00:22:08.559 production traffic of a shopify service
00:22:11.360 we collected data over one day where
00:22:13.360 each deployment served around 84 million
00:22:15.200 requests
00:22:16.320 we did not see a change in response
00:22:18.080 times beyond the margin of error we
00:22:20.159 looked at average 50th percentile 90th
00:22:22.960 percentile and 99th percentile response
00:22:25.280 times however variable width allocation
00:22:27.919 did decrease memory usage by about four
00:22:30.159 percent
00:22:31.760 synthetic benchmarks were run on a bare
00:22:33.840 metal ubuntu machine on aws railsbench
00:22:36.960 and rdoc generation were benchmarked
00:22:38.640 using the glibc and jemalloc allocators
00:22:41.840 glibc and jemalloc are implementations
00:22:44.159 of the c library function called malloc
00:22:46.080 used to allocate memory
00:22:48.000 different allocators are implemented
00:22:49.600 differently and so they have different
00:22:51.120 properties
00:22:52.320 glibc is the default allocator in ubuntu
00:22:55.280 if you're interested in looking at more
00:22:56.720 in-depth results and analysis check out
00:22:59.120 the ticket linked in the slides
00:23:02.080 in railsbench we see a slight increase
00:23:04.080 in memory usage when using glibc and no
00:23:06.559 increase when using jemalloc
00:23:08.720 however we also see a small performance
00:23:10.720 increase for both glibc and jemalloc
00:23:13.919 this graph shows the rss or resident
00:23:16.320 set size which is the amount of memory
00:23:18.159 held in ram
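RSS is easy to sample from a running Ruby process. The snippet below is one portable way to do it via the `ps` command; it is a sketch, not the harness the speakers used for these graphs.

```ruby
# Sample this process's resident set size (RSS) in kilobytes by
# shelling out to ps, which works on Linux and macOS.
def rss_kb
  `ps -o rss= -p #{Process.pid}`.to_i
end

puts rss_kb # memory currently held in RAM, in kilobytes
```

Sampling this periodically over a benchmark run produces memory-over-time curves like the ones described here.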
00:23:19.679 we ran the two branches 25 times each
00:23:22.080 and each run consists of 20 000 requests
00:23:25.280 the graph shows the memory usage over
00:23:27.280 time where variable width allocation is
00:23:29.360 graphed in green and the master branch
00:23:31.280 is graphed in blue
00:23:33.760 rdoc generation generates the ruby
00:23:35.679 documentation unlike railsbench where
00:23:38.000 variable width allocation used more
00:23:39.600 memory here we see significantly lower
00:23:42.400 memory usage for a slight reduction in
00:23:44.480 performance
00:23:45.840 you can see in this graph that the vwa
00:23:47.279 branch uses significantly less memory
00:23:51.200 we wrote some micro benchmarks to
00:23:52.799 highlight the performance increase of
00:23:54.320 variable width allocation
00:23:56.159 for these benchmarks we use 71 byte
00:23:58.799 strings which are embedded when using
00:24:00.640 variable width allocation and not
00:24:02.480 embedded when not using variable width
00:24:04.159 allocation
00:24:05.279 our benchmark results show that
00:24:06.880 comparing two strings is 35 percent
00:24:09.120 faster
00:24:10.159 the times method also benchmarks
00:24:11.840 allocation speed and is 15 percent faster
00:24:15.360 similarly setting a character in the
00:24:17.200 string is also 15 percent faster this clearly
00:24:20.400 shows the performance benefits of using
00:24:22.480 variable width allocation
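Micro benchmarks of this shape are easy to reproduce with Ruby's stdlib `Benchmark` module. The iteration count and exact operations below are assumptions for illustration; the speakers' figures (35% faster comparison, 15% faster allocation and character assignment) came from their own harness.

```ruby
# Sketch of micro benchmarks over 71-byte strings, which are
# embedded under variable width allocation and heap-allocated
# without it. Timings will differ between Ruby builds.
require "benchmark"

STR   = "x" * 71
OTHER = "x" * 71
N = 100_000

Benchmark.bm(10) do |bm|
  bm.report("compare")  { N.times { STR == OTHER } }          # string comparison
  bm.report("allocate") { N.times { "x" * 71 } }              # String#* allocation
  bm.report("setchar")  { s = STR.dup; N.times { s[3] = "y" } } # character assignment
end
```

Running this on a VWA build versus master would be one way to observe the deltas reported above.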
00:24:25.120 we're confident about the stability of
00:24:27.039 variable width allocation
00:24:28.880 variable width allocation passes all
00:24:30.799 tests on ci on shopify's rails
00:24:33.039 monolith and we ran it for a small
00:24:35.200 portion of production traffic of the
00:24:36.720 shopify service for a week where it
00:24:38.799 served over 500 million requests
00:24:41.679 we also believe that the benchmark
00:24:43.200 results are satisfactory so in the
00:24:45.279 latest patch we propose that variable
00:24:47.039 width allocation is enabled by default
00:24:49.760 at the time of recording this talk this
00:24:51.520 proposal is open for feedback
00:24:53.919 however it's likely that it has
00:24:55.520 undergone changes and perhaps even been
00:24:57.600 merged so please check the ticket linked
00:24:59.760 in the slides for the most up-to-date
00:25:01.760 information
00:25:04.080 now that you've seen our current
00:25:05.440 implementation and some performance
00:25:07.279 results matt will now talk about some
00:25:09.520 limitations of this implementation and
00:25:11.679 how we're planning on tackling them
00:25:14.480 the first issue to talk about is
00:25:16.159 increasing coverage
00:25:18.000 currently only classes and strings are
00:25:19.760 taking advantage of the benefits of vwa
00:25:22.480 in order to see even larger
00:25:23.840 performance improvements we need to have
00:25:25.679 as many object types as possible using
00:25:27.679 it
00:25:29.360 next up is resizing
00:25:32.400 as you saw in our string implementation
00:25:34.320 we don't currently have an elegant way
00:25:36.000 to resize objects
00:25:37.520 this is a tough problem and solving it
00:25:39.440 is one of our next major
00:25:40.840 milestones an idea we're currently
00:25:42.799 investigating is to leverage gc
00:25:44.640 compaction to move object data for us
00:25:48.159 so when we need to resize up and the
00:25:50.400 current slot is not large enough we
00:25:52.400 could allocate the new data into a slot
00:25:54.640 in a larger size pool and then use
00:25:56.559 compaction to move the original object
00:25:58.640 structure next to its data again and
00:26:00.799 free up the old slot
00:26:03.919 one other thing that we'd like to do
00:26:05.120 eventually is to shrink the size of the
00:26:06.880 rvalue struct from 40 bytes down to 32
00:26:09.279 bytes
00:26:10.720 we want to do this because it means that
00:26:12.480 slots in the heap page will then be
00:26:13.840 aligned on cpu cache lines
00:26:16.400 for example in the diagram below we can
00:26:19.200 see that many of the 40 byte slots
00:26:20.880 straddle two cache lines and so reading
00:26:23.039 a single slot requires two potentially
00:26:24.799 expensive cache lookups
00:26:27.200 if the slot size was shrunk to 32 bytes
00:26:29.679 all the slots become aligned on cache
00:26:31.279 lines and reading a slot never requires
00:26:33.520 more than one cache lookup
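The cache-line argument can be checked with a little arithmetic. The sketch below counts how many slots straddle a 64-byte cache line boundary over one heap page; the 16 KB page size is an assumption for illustration.

```ruby
# Count slots whose byte range crosses a cache line boundary,
# assuming slots are packed back to back from offset 0.
CACHE_LINE = 64

def straddling_slots(slot_size, page_bytes)
  (page_bytes / slot_size).times.count do |i|
    first_line = (i * slot_size) / CACHE_LINE
    last_line  = (i * slot_size + slot_size - 1) / CACHE_LINE
    first_line != last_line # true when the slot spans two lines
  end
end

puts straddling_slots(40, 16_384) # many 40-byte slots cross a boundary
puts straddling_slots(32, 16_384) # 32-byte slots never do, since 32 divides 64
```

With 40-byte slots, half the slots in each 320-byte repeating pattern straddle two cache lines; with 32-byte slots, none do, which is the motivation for shrinking the struct.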
00:26:35.679 this isn't something we can do right now
00:26:37.279 as too many objects are still using all
00:26:39.039 40 bytes of the rvalue so they'd end up
00:26:41.200 being allocated into a 64 byte slot or
00:26:43.520 larger and then wasting 24 bytes or more
00:26:48.000 overall we're proud of the progress that
00:26:49.760 we've made but we still have a long way
00:26:51.600 to go and many things to explore
00:26:54.240 we plan to share progress
00:26:55.919 updates with the community frequently
00:26:57.840 via conference talks and blog posts
00:27:00.320 but for now the best way to keep an eye
00:27:02.080 on what we're up to is the shopify ruby
00:27:04.000 fork on github and the ruby core bug
00:27:06.320 tracker
00:27:09.120 and that's it folks in this talk we
00:27:10.960 looked at how ruby's memory allocator
00:27:12.640 and garbage collector works some of the
00:27:14.720 issues in ruby's memory layout how we're
00:27:16.880 trying to solve them with variable width
00:27:18.320 allocation as well as our current
00:27:20.080 progress in the project we're continuing
00:27:22.320 to improve the implementation and we
00:27:24.159 still have a long way to go thanks for
00:27:26.159 listening and thanks to rubyconf for
00:27:28.159 having us