Optimizing Ruby's Memory Layout

by Peter Zhu and Matt Valentine-House

The video titled "Optimizing Ruby's Memory Layout," presented by Peter Zhu and Matt Valentine-House at RubyConf 2021, discusses the current memory model of Ruby, particularly its fixed-size slots for object storage and the inefficiencies arising from this structure. The presentation highlights their Variable Width Allocation project, aimed at improving Ruby's memory management to enhance performance. Key points discussed in the video include:

  • Ruby's Memory Management: Ruby uses a garbage collector to manage memory, with objects stored in a structure called an RVALUE, which is fixed at 40 bytes and includes flags and pointers.
  • Garbage Collection Phases: The garbage collection process consists of marking live objects and sweeping the heap to reclaim memory. Optional compaction helps reduce memory fragmentation.
  • Issues with Current Layout: Problems arise from objects needing multiple memory allocations, causing poor cache locality. Ruby's current model may lead to increased fetching times due to frequent malloc calls, negatively impacting performance.
  • Variable Width Allocation: This project aims to integrate both object data and additional memory directly within Ruby's heap, significantly reducing malloc calls. This can improve data locality and reduce memory overhead.
  • Performance Benchmarks: Early benchmarks showed a decrease in memory usage by 4% and performance improvements in scenarios using the new allocation methods, demonstrating a 35% increase in string comparison speed.
  • Future Directions: The speakers discuss plans for expanding coverage to more object types, enhancing string resizing capabilities, and reducing the size of the RVALUE structure to better align memory allocation with CPU cache lines.

The overall conclusion emphasizes the potential for improved performance and greater control over memory management in Ruby, making it an ongoing area of development and optimization for the community.

Ruby’s current memory model assumes all objects live in fixed size slots. This means that object data is often stored outside of the Ruby heap, which has implications: an object's data can be scattered across many locations, increasing complexity and causing poor locality resulting in reduced efficiency of CPU caches.

Join us as we explore how our Variable Width Allocation project will move system heap memory into Ruby heap memory, reducing system heap allocations and giving us finer control of the memory layout to optimize for performance.

RubyConf 2021

00:00:10.240 hello rubyconf i'm peter and i'm a ruby
00:00:12.799 core committer and senior developer at
00:00:14.480 shopify in the ruby infrastructure team
00:00:16.960 and i'm matt i'm also at shopify and i'm
00:00:19.199 also a senior developer in the ruby
00:00:20.720 infrastructure team
00:00:22.400 for the past year we've been working
00:00:24.000 together to make some improvements to
00:00:25.519 how memory is laid out in ruby
00:00:28.080 in this talk we're going to summarize
00:00:29.920 the work that we've done so far the
00:00:31.439 challenges that we've faced and where
00:00:32.719 we're heading in the future
00:00:34.880 but before we can go into detail we're
00:00:36.640 going to provide some background on how
00:00:38.160 ruby's garbage collector structures and
00:00:40.079 manages memory
00:00:44.000 when we create an object in ruby the
00:00:45.760 interpreter stores a data structure
00:00:47.200 called an r value in memory to hold it
00:00:50.000 an r value is always 40 bytes wide and
00:00:52.079 it contains a union of different data
00:00:53.600 structures that represent all the
00:00:55.039 possible ruby core types
00:00:57.199 so we could have a structure
00:00:58.320 representing a string
00:01:00.079 or an array or a class and so on
00:01:04.080 and each one of these types contain
00:01:05.439 their own fields and data structures
00:01:06.960 specific to that type
00:01:08.960 the r value itself is just a container
00:01:10.799 for our ruby objects
00:01:13.280 let's look more closely at one of ruby's
00:01:14.960 object types r class in this case
00:01:18.799 the first 16 bytes of every type are
00:01:20.799 always the same an 8 byte field called
00:01:22.960 flags followed by an 8 byte pointer to the
00:01:25.439 class of the object
00:01:28.080 the remaining 24 bytes are different for
00:01:30.079 every type
00:01:31.119 r class objects store a pointer to an
00:01:32.880 extension object here
00:01:34.640 strings can store the actual string
00:01:36.159 content and so on
00:01:38.960 these r values are organized into
00:01:40.560 regions called pages
00:01:42.159 a heap page is a 16 kilobyte memory
00:01:44.479 region
00:01:45.520 each page has a maximum of 409 slots
00:01:48.000 where each slot can hold one r value
00:01:52.079 a heap page's memory region is always
00:01:53.840 contiguous the slots have memory
00:01:55.600 addresses that are 40 bytes apart with
00:01:57.360 no gaps between them
00:01:59.840 initially when the heap page is created
00:02:02.000 all the slots will be filled with an
00:02:03.360 internal type called t none
00:02:05.840 this type represents an empty slot it
00:02:08.000 contains only flags
00:02:09.679 and a pointer value called next that can
00:02:11.680 point to an r value
00:02:14.000 this allows these empty slots to be
00:02:16.480 linked together to form a data structure
00:02:18.239 called the free list
00:02:20.400 when a heap page is initialized ruby
00:02:22.239 walks up each page visiting each slot in
00:02:24.480 turn
00:02:26.080 as we arrive at each slot we set a
00:02:27.920 pointer on the heap page called the free
00:02:29.440 list pointer to the address of that slot
00:02:32.000 and then set the slot's next pointer to
00:02:34.160 the address of the previous slot we
00:02:35.599 visited
00:02:37.120 this creates a singly linked list of
00:02:38.959 empty slots
00:02:41.680 when we need to allocate an object and
00:02:43.599 we need an empty slot
00:02:45.360 ruby asks the heap page for the first
00:02:46.959 slot on the free list
00:02:48.800 the heap page returns the slot and then
00:02:50.640 the free list pointer is updated to
00:02:52.160 point at the next slot in the list
00:02:55.599 this unlinks our slot from the free list
00:02:57.519 allowing us to put data in it
00:03:00.640 the use of a free list means that this
00:03:02.480 part of object allocation is both fast
00:03:04.879 and a constant time operation even when
00:03:07.040 a page is sparsely populated
00:03:10.000 because if the page has at least one
00:03:11.760 empty slot the free list is guaranteed
00:03:13.680 to always point to it
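
The free-list behaviour described here can be sketched as a small Ruby simulation (CRuby does this in C on raw 40-byte slots; `HeapPage` and its hash-based slots below are purely illustrative):

```ruby
# Toy model of a heap page. A real page is a contiguous 16 KB region
# holding up to 409 fixed-size 40-byte slots; here each slot is a hash
# with a next pointer, mimicking the T_NONE empty-slot type.
class HeapPage
  SLOT_COUNT = 409

  def initialize
    @slots = Array.new(SLOT_COUNT) { { type: :T_NONE, next: nil } }
    @freelist = nil
    # Walk the page, linking each empty slot onto the singly linked free list.
    @slots.each do |slot|
      slot[:next] = @freelist
      @freelist = slot
    end
  end

  # Pop the head of the free list: O(1) even on a sparsely populated page.
  def allocate
    slot = @freelist
    return nil if slot.nil?      # no free slots left: the GC must run
    @freelist = slot[:next]
    slot[:type] = :T_OBJECT
    slot
  end
end

page = HeapPage.new
objs = []
objs << page.allocate while objs.size < HeapPage::SLOT_COUNT
extra = page.allocate   # page is now full
```

Once every slot is handed out, `allocate` returns `nil`, which corresponds to the point in CRuby where allocation into the page fails and the garbage collector must be invoked.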
00:03:15.840 but eventually our heap pages are going
00:03:18.319 to run out of slots and this is where the
00:03:20.159 garbage collector is required
00:03:22.720 if the free list is empty when ruby
00:03:24.319 tries to allocate into the page it must
00:03:26.000 mean that there are no free slots left
00:03:27.680 on that page
00:03:29.680 so now peter's going to talk about how
00:03:31.599 and when the garbage collector runs to
00:03:33.120 release objects that are no longer
00:03:34.560 necessary
00:03:35.599 over to you peter
00:03:38.159 now that we've looked at how memory is
00:03:39.599 laid out inside ruby let's look at how
00:03:41.840 ruby's garbage collector works
00:03:44.239 ruby's garbage collection process consists
00:03:46.159 of three distinct phases marking
00:03:48.799 sweeping and compaction
00:03:51.200 the last phase compaction is optional
00:03:53.760 and does not run unless manually enabled
00:03:56.560 ruby's garbage collection is stop
00:03:58.080 the world meaning that ruby code is not
00:03:59.920 executed when the garbage collector is
00:04:01.760 active
00:04:02.959 the garbage collector in ruby is fairly
00:04:04.640 complex so we can't possibly cover all
00:04:06.720 the technical details in just a few
00:04:08.480 minutes but we'll provide a very high
00:04:10.640 level overview that contains all the
00:04:12.400 information you will need to know for
00:04:14.239 this talk
00:04:15.840 let's first look at marking marking is
00:04:18.160 the phase where we determine which ruby
00:04:19.919 objects are alive and which ones can be
00:04:22.079 freed
00:04:22.960 we first mark the roots which include
00:04:24.880 things like global variables
00:04:26.639 classes modules and constants
00:04:29.840 we then perform breadth-first search to
00:04:31.600 traverse the network of live objects
00:04:34.639 let's see an example of this here's a
00:04:36.639 toy example of a heap with two pages
00:04:38.639 with seven slots on each page
00:04:40.800 we have a total of 10 objects labeled a
00:04:42.800 through j
00:04:43.919 slots with no label mean that the slot
00:04:45.600 is empty the arrows indicate references
00:04:48.240 for example the arrow from object j to
00:04:50.320 object d means that object j holds a
00:04:52.240 reference to object d
00:04:54.080 examples of references could be the
00:04:55.600 contents of arrays or instance variables
00:04:57.759 of objects
00:04:59.120 the first step of marking requires us to
00:05:00.800 mark the root objects the root objects
00:05:02.720 here are objects a and b
00:05:04.560 let's now mark these two objects
00:05:07.440 from a and b objects c and g are
00:05:09.680 reachable and not marked so we mark them
00:05:13.360 then from object c and g objects f and h
00:05:16.080 are reachable and not marked so we mark
00:05:18.240 them
00:05:19.440 objects f and h do not refer to any
00:05:21.680 unmarked objects since we did not mark
00:05:24.160 any objects in this step it means that
00:05:26.320 marking is complete
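
The mark phase just described (mark the roots, then traverse breadth-first) reduces to a textbook graph search. A minimal Ruby sketch, reusing the talk's toy object graph (the exact edges here are illustrative):

```ruby
# references[x] lists the objects x refers to, like the arrows on the slide.
references = {
  "A" => ["C"], "B" => ["G"], "C" => ["F", "H"], "G" => ["H"],
  "D" => [], "E" => [], "F" => [], "H" => [], "I" => [], "J" => ["D"]
}
roots = ["A", "B"]   # globals, classes, modules, constants, etc.

# Breadth-first search: mark each reachable, not-yet-marked object.
marked = {}
queue = roots.dup
until queue.empty?
  obj = queue.shift
  next if marked[obj]
  marked[obj] = true
  queue.concat(references[obj])
end

live = marked.keys.sort
```

Everything left unmarked (D, E, I, J — note J is unreachable even though it refers to D) is garbage for the sweep phase to reclaim.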
00:05:28.880 now that we've finished marking all marked
00:05:31.039 objects are live objects and all
00:05:32.800 unmarked objects are dead and can be
00:05:34.720 reclaimed by the garbage collector
00:05:37.440 we can simply scan the slots on the
00:05:39.039 pages and free the ones that are not
00:05:40.880 marked
00:05:42.479 let's continue to use the example we
00:05:44.080 used for marking
00:05:45.520 here we move our cursor to the first
00:05:47.199 object in the page that is not marked
00:05:49.440 object d in this case
00:05:51.199 we can free the object and reclaim the
00:05:53.199 memory
00:05:54.560 we then continue to move the cursor
00:05:56.400 until we reach the next unmarked object
00:05:58.400 which is object e
00:06:00.400 we've reached the end of page 1 so let's
00:06:02.560 move to sweep page 2.
00:06:04.720 again we move the cursor until the first
00:06:06.560 unmarked object is reached and we'll free
00:06:08.880 object i
00:06:10.479 we can continue to move the cursor until
00:06:12.400 we reach the next unmarked object which
00:06:14.800 is object j
00:06:16.400 now that we've swept all the pages
00:06:18.319 sweeping is complete
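
Sweeping is then a linear cursor scan over each page, freeing any unmarked slot. A simulation of the two-page example above (real sweeping also relinks freed slots onto the page's free list):

```ruby
# Each page is an array of slots; nil represents an already-empty slot.
pages = [
  ["A", "B", "C", "D", "E", "F", "G"],   # page 1
  ["H", "I", "J", nil, nil, nil, nil]    # page 2
]
marked = { "A" => true, "B" => true, "C" => true,
           "F" => true, "G" => true, "H" => true }

freed = []
pages.each do |page|
  page.each_index do |i|
    obj = page[i]
    next if obj.nil?          # cursor skips empty slots
    unless marked[obj]        # unmarked object: reclaim its memory
      freed << obj
      page[i] = nil
    end
  end
end
```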
00:06:20.479 manual compaction was a new feature
00:06:22.240 introduced in ruby 2.7 and further
00:06:24.560 improved with automatic compaction in
00:06:26.479 ruby 3.0
00:06:28.240 however this feature remains optional
00:06:30.240 and is turned off by default
00:06:32.479 compaction moves objects within the heap
00:06:34.400 to the start of the heap this has
00:06:36.240 several benefits including reduced
00:06:37.840 memory usage
00:06:39.199 faster garbage collection and better
00:06:41.039 copy on write performance
00:06:43.199 ruby uses the two-finger algorithm for
00:06:45.039 compaction which i will explain in more
00:06:47.360 detail
00:06:48.800 the compaction algorithm involves two
00:06:50.720 steps the compact step and the reference
00:06:52.960 updating step
00:06:54.400 the compact step moves objects from the
00:06:56.080 end of the heap to the start of the heap
00:06:58.319 when the step completes we do the
00:07:00.319 reference updating step in this step we
00:07:03.120 scan all the objects and update the
00:07:04.800 pointers that have moved
00:07:07.199 let's see an example of compaction in
00:07:09.199 action the last live object of the heap
00:07:11.759 is object h and the first empty slot in
00:07:14.240 the heap is slot one so we move object h
00:07:17.280 to slot one and leave the forwarding
00:07:18.960 address
00:07:20.160 this forwarding address will tell us
00:07:21.599 where the object was moved to we will
00:07:23.599 use this forwarding address in the
00:07:25.039 reference updating step
00:07:27.440 we moved object g to slot 4 and leave a
00:07:29.919 forwarding address
00:07:31.759 we move object f to slot 5 and leave a
00:07:34.479 forwarding address
00:07:36.319 we are now done with the compact step because
00:07:38.240 there are no more objects to move
00:07:41.280 but we're not done yet we must do the
00:07:43.440 reference updating step
00:07:45.360 let's say our object references look
00:07:47.039 like this where the arrows indicate that
00:07:49.120 a particular object refers to another
00:07:51.039 object
00:07:52.400 we need to check the references of each
00:07:54.160 object and update the reference if it
00:07:56.240 points to a forwarding address
00:07:59.120 here object b has a reference to a
00:08:01.440 forwarding address
00:08:02.960 by reading the forwarding address we can
00:08:04.879 update this reference to the correct one
00:08:06.879 which is object h
00:08:09.440 next we see that object c holds two
00:08:12.080 references to moved objects we can
00:08:14.400 update the references of object c to
00:08:16.400 point to the correct ones
00:08:18.879 now that we have updated all the
00:08:20.240 references we can reclaim these
00:08:22.720 forwarding addresses since we don't need
00:08:24.479 them anymore we'll reclaim slots 7 8 and
00:08:27.840 11.
00:08:29.360 so that's how the garbage collector
00:08:30.960 works
00:08:31.840 but throughout this talk we've been
00:08:33.599 referring to slots as always being a
00:08:35.279 fixed size and matt's going to talk
00:08:37.680 about how we handle the case where the
00:08:39.839 object needs to store additional data
00:08:43.680 so as we heard ruby objects aren't
00:08:45.519 always a fixed size
00:08:47.519 there are two scenarios to consider
00:08:50.560 when the object fits in the 24 bytes at
00:08:52.480 the end of an r value and when it
00:08:54.000 doesn't
00:08:55.279 we'll look at two strings hello world
00:08:57.600 which is 12 bytes and hello rubyconf
00:08:59.920 thanks for having us which is 36 bytes
00:09:03.440 when we allocate the string ruby checks
00:09:05.360 the requested string length against an
00:09:06.880 internal constant called r string
00:09:08.399 embed len max
00:09:10.000 and if the requested length is shorter
00:09:11.680 than r string embed len max which it
00:09:13.920 is in this case
00:09:15.279 it just pushes the string data straight
00:09:16.880 into the remainder of the slot after the
00:09:18.560 flags
00:09:20.800 when the string length is longer than r
00:09:22.959 string embed len max ruby sets a bit on
00:09:25.360 the r value's flags called the no embed
00:09:27.200 bit
00:09:28.720 this signifies that the data inside the
00:09:30.399 r value is not the actual string but a
00:09:32.720 pointer to a separately allocated memory
00:09:34.560 region
00:09:36.080 it then uses malloc to reserve a new
00:09:38.000 memory region outside of the existing
00:09:39.839 heap it stores the address to that
00:09:41.760 memory in the 24 bytes of the r value
00:09:43.839 after the flags and class
00:09:45.600 and copies the string data into that
00:09:47.279 newly mallocked memory
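
That embed-or-malloc decision can be captured in a short Ruby sketch. The 24-byte payload mirrors the role of `RSTRING_EMBED_LEN_MAX` on 64-bit builds, but the hashes below are a simulation, not CRuby's RString struct (the talk's 12- and 36-byte figures presumably include a terminating byte on top of the 11 and 35 bytes of content):

```ruby
# Pre-VWA 64-bit layout: 16 bytes of headers (flags + class pointer)
# leave 24 bytes of payload in the 40-byte slot for embedded data.
RSTRING_EMBED_LEN_MAX = 24

def allocate_string(str)
  if str.bytesize < RSTRING_EMBED_LEN_MAX
    # Short enough: copy the bytes straight into the slot after the headers.
    { embedded: true, data: str }
  else
    # Too long: set the no-embed flag bit and store only a pointer to a
    # separately malloc'd region (modelled here as a nested hash).
    { embedded: false, noembed_bit: true, ptr: { malloc_region: str } }
  end
end

short = allocate_string("hello world")
long  = allocate_string("hello rubyconf thanks for having us")
```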
00:09:49.760 and with that we've looked at the main
00:09:51.360 ways ruby stores objects in memory
00:09:53.760 so now pete is going to talk about some
00:09:55.680 of the challenges with a memory layout
00:09:57.120 like this and introduce our project
00:09:59.200 variable width allocation which is
00:10:00.800 attempting to fix them
00:10:03.120 let's discuss some bottlenecks that
00:10:04.800 exist within the ruby heap that we aim
00:10:06.880 to tackle with this project
00:10:09.600 the first issue is that a ruby object
00:10:11.839 often requires multiple pieces of memory
00:10:13.680 to store data
00:10:15.040 you've just seen an example of this with
00:10:16.640 the string example where the contents of
00:10:18.720 the string exist in a separate location
00:10:20.800 than the string object itself
00:10:22.880 this results in poor cache locality
00:10:25.360 additionally this memory is often
00:10:27.120 acquired through the system using malloc
00:10:28.959 which has a performance impact
00:10:31.440 we often think that the cpu only has a
00:10:33.360 single level of memory but that's not
00:10:35.360 true
00:10:36.240 in order to improve performance there's
00:10:38.160 faster memory built onto the cpu itself
00:10:40.320 called caches
00:10:42.000 these caches are further divided into
00:10:43.760 multiple levels usually three levels
00:10:46.800 the lower the cache level the closer the
00:10:48.720 cache is to the cpu core by being closer
00:10:51.680 the electric signal travels a shorter
00:10:53.519 distance so it has better performance
00:10:56.800 let's talk a little bit about how cpu
00:10:58.640 caches work whenever a piece of memory
00:11:01.360 is fetched from main memory it is also
00:11:03.040 stored in the cpu caches it's important
00:11:05.440 to note that data is fetched from main
00:11:07.279 memory in units of cache lines which on
00:11:09.920 the x86 architecture found in intel and
00:11:12.320 amd cpus is 64 bytes
00:11:15.600 what this means is that even if you need
00:11:17.279 just a single byte of data all 64 bytes
00:11:19.680 is fetch for you when the cpu cache is
00:11:22.000 full it will overwrite the least
00:11:23.839 recently used data to make space for the
00:11:26.000 most recently accessed data
00:11:28.560 two terms are used a cache hit is
00:11:30.399 when a piece of data we access exists
00:11:32.640 in the cpu cache so we don't need to
00:11:34.640 access the main memory and conversely a
00:11:37.040 cache miss is when the data does not
00:11:38.880 exist in any caches so we need to do a
00:11:41.279 very slow read from the main memory
00:11:43.760 as you might expect a very important
00:11:45.600 metric for optimizing performance is to
00:11:48.000 increase the cache hit rate so we need
00:11:50.000 to read from the slow main memory less
00:11:52.160 frequently
00:11:54.720 let's see the cache performance of the
00:11:56.320 current ruby implementation let's use
00:11:58.480 the example of an array
00:12:00.240 in the current implementation the array
00:12:02.399 points to a region of memory in the
00:12:03.839 system heap acquired through malloc to
00:12:06.000 store its contents
00:12:08.320 our array has four elements so we need
00:12:10.480 four pointers to ruby objects since each
00:12:13.120 pointer is eight bytes on the 64-bit
00:12:14.880 system we need 32 bytes of memory
00:12:18.000 the object occupies one r value which is
00:12:20.399 40 bytes and the content occupies 32
00:12:22.639 bytes
00:12:23.839 in order to read the elements of this
00:12:25.440 array we must do two memory fetches to
00:12:27.600 fetch two cache lines
00:12:30.079 acquiring memory from the system heap
00:12:31.760 using malloc is not free it comes with
00:12:34.160 performance overhead so we want to
00:12:36.240 minimize the number of times we call
00:12:37.839 malloc
00:12:38.959 malloc also requires space for headers
00:12:40.800 that store metadata when allocating
00:12:42.560 memory
00:12:43.519 which may add up and result in increased
00:12:45.519 memory usage
00:12:46.959 we're not the first ones trying to
00:12:48.240 tackle this particular issue ruby 2.6
00:12:50.959 introduced a second heap into ruby
00:12:53.120 called the transient heap that aims to
00:12:55.040 speed up ruby by reducing the number of
00:12:56.800 malloc calls
00:12:58.560 it was successful in improving
00:12:59.920 performance and some benchmarks saw
00:13:02.000 performance improvement by up to 50 percent
00:13:05.279 however the transient heap has technical
00:13:07.360 limitations and cannot be a drop-in
00:13:09.600 replacement
00:13:12.399 one of the major goals of this project
00:13:14.480 is to improve the overall performance of
00:13:16.480 ruby by letting ruby control its memory
00:13:18.560 layout rather than calling the malloc
00:13:20.399 system library and relying on the system
00:13:22.480 to manage memory
00:13:24.639 since ruby's garbage collector is aware
00:13:26.480 of the memory structure of each type of
00:13:28.160 object it can optimize for faster
00:13:30.240 allocation and for better efficiency
00:13:33.279 by placing the contents of the object
00:13:35.120 right after the object itself we can
00:13:37.440 improve cache locality
00:13:39.440 by allocating dynamic size memory inside
00:13:42.000 the ruby garbage collector we can avoid
00:13:44.240 malloc calls and thus be able to
00:13:46.320 improve performance
00:13:49.440 now let's look at how the memory layout
00:13:51.600 changes when we use variable width
00:13:53.440 allocation
00:13:54.880 here we see the array example again what
00:13:57.199 this project aims to do is to move the
00:13:59.199 array's data from the system heap to the
00:14:01.279 ruby heap right after the object itself
00:14:04.399 so what does this accomplish for us
00:14:06.560 well we removed the need to do a malloc
00:14:08.480 call when creating the array
00:14:10.079 object let's see if we improve cache
00:14:12.240 performance
00:14:13.360 again recall that every ruby object
00:14:15.600 requires allocating an r value which is
00:14:17.680 40 bytes in size
00:14:19.519 our array data itself is 32 bytes by
00:14:22.399 placing the contents of the array right
00:14:24.079 next to the array object itself we can
00:14:26.079 now access the first three elements of
00:14:27.839 the array while remaining in the same
00:14:29.600 cache line
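
The fetch counts in these two array examples are simple arithmetic. A sketch, assuming 64-byte cache lines and allocations that start on a line boundary (a simplification; real objects may straddle lines):

```ruby
CACHE_LINE = 64   # bytes per cache line on x86

# Number of cache lines (and thus worst-case main-memory fetches)
# spanned by an allocation of the given size.
def cache_lines(bytes)
  (bytes.to_f / CACHE_LINE).ceil
end

# Current layout: the 40-byte slot and the 32 bytes of element pointers
# live in separate regions, so reading the elements costs two fetches.
separate_fetches = cache_lines(40) + cache_lines(4 * 8)

# VWA layout: the contents sit immediately after the slot, so the header
# plus the first n 8-byte element pointers occupy 40 + 8n contiguous bytes.
def lines_for_first_elements(n)
  cache_lines(40 + n * 8)
end
```

With the contiguous layout the header and the first three elements fit in a single 64-byte line; only the fourth element spills into a second line.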
00:14:32.480 so far we've merged two major iterations
00:14:34.800 into ruby and our third is open for
00:14:36.800 feedback now
00:14:38.560 in our first two pull requests we built
00:14:40.399 the infrastructure required to move data
00:14:42.240 that would previously have been
00:14:43.199 allocated on the system heap into a heap
00:14:45.199 page alongside the r value it's attached
00:14:47.279 to and we provided a reference
00:14:49.279 implementation in r class
00:14:51.440 the changes up until now have been
00:14:52.959 minimally intrusive as this feature was
00:14:55.040 turned off by default and required
00:14:57.040 recompiling with the compile-time flag
00:14:58.800 to enable it
00:15:00.240 in our latest pull request we've
00:15:02.000 implemented variable width allocation
00:15:03.600 for strings and additionally because
00:15:05.920 we're confident about the stability and
00:15:07.519 the performance of variable width
00:15:08.800 allocation we've also proposed to turn
00:15:11.040 it on by default
00:15:12.800 we'll talk more about the performance
00:15:14.160 results later in this talk
00:15:17.519 now let's walk through an example of
00:15:18.959 what this work does so far we're going
00:15:21.279 to talk about class allocation
00:15:23.279 these two scenarios allocate r values
00:15:25.279 containing r class objects a class has
00:15:27.839 data that's attached to it its instance
00:15:30.000 methods instance variables and so on so
00:15:32.800 ruby creates a separate data structure
00:15:34.480 for this called rb class ext it's a
00:15:37.519 fixed size of 104 bytes allocated on the
00:15:40.000 system heap
00:15:41.360 this is what the current allocation code
00:15:43.120 looks like that's not using variable
00:15:44.720 width allocation we first acquire an r
00:15:47.279 value from the garbage collector using
00:15:48.880 new object then we allocate a region for
00:15:51.680 the rb class ext struct using zalloc
00:15:54.880 which will call malloc to acquire memory
00:15:58.000 this is what the class allocation code
00:15:59.680 looks like when using variable width
00:16:01.360 allocation
00:16:02.639 first thing that stands out is that
00:16:04.240 we're calling a different new object
00:16:06.000 entry point and this one takes an
00:16:07.600 allocation size
00:16:09.199 this is the size in bytes of the slot
00:16:10.880 that we wish to allocate from the
00:16:12.240 garbage collector so 40 bytes for the r
00:16:14.480 value and 104 bytes for the rb class
00:16:17.199 ext struct which is 144 bytes in total
00:16:21.440 the garbage collector is guaranteed to
00:16:23.040 allocate a slot that is at least the
00:16:24.720 requested size but may allocate a slot
00:16:27.360 that is larger than the requested size
00:16:29.759 we will see the reason for this soon
00:16:32.720 you've just seen the api to allocate
00:16:34.560 classes with variable width allocation but
00:16:36.880 how does it all work
00:16:38.800 let's take a closer look inside the
00:16:40.240 garbage collector remember that ruby's
00:16:42.560 memory heap is divided into pages and
00:16:45.040 each page is made up of 40 byte slots
00:16:48.320 we've changed the heap to support pages
00:16:50.320 that contain different size slots to do
00:16:52.560 this we've introduced a new structure
00:16:54.399 called a size pool
00:16:55.839 pages within the size pool have a
00:16:57.440 particular slot size
00:16:59.360 the slot size is a power of 2 multiple
00:17:02.240 of the r value size so
00:17:04.400 40 bytes 80 bytes 160 bytes 320 bytes
00:17:08.720 and so on
00:17:09.919 we chose powers of two multiples because
00:17:12.079 they're easiest to work with and
00:17:13.439 efficient since we can use bit shift
00:17:15.360 operations to calculate them we use
00:17:18.720 multiples of r value size so we don't
00:17:20.720 waste space when allocating a single r
00:17:22.640 value of space
00:17:24.400 since all objects that don't use
00:17:25.760 variable with allocation will only
00:17:27.600 allocate a single r value worth of space
00:17:32.000 here we see a diagram with four size
00:17:34.240 pools and their page slot sizes are
00:17:36.720 labeled
00:17:38.559 now we want to allocate a class object
00:17:41.520 we first need to calculate the size pool
00:17:43.679 which we're allocating into a class
00:17:46.160 object will need 144 bytes space 40
00:17:49.200 bytes for the object and 104 bytes for
00:17:51.360 the rb class ext struct this is 3.6 r
00:17:54.720 values wide which is rounded up to
00:17:57.280 the next power of 2 so it will require
00:17:59.840 four r values worth of space
00:18:01.919 or one slot in the third size pool
00:18:04.559 let's zoom into the size pool each size
00:18:07.039 pool has its own pool of pages and holds
00:18:09.039 a free list when we want to allocate we
00:18:11.360 just allocate into the head of the free
00:18:12.880 list and then move the free list pointer
00:18:14.640 to the next element in the linked list
00:18:17.360 now let's allocate our class object into
00:18:19.280 the free slot we allocate both the class
00:18:21.840 r value and the rb class ext struct into
00:18:24.960 the same slot
00:18:26.160 of course we waste about 16 bytes at the
00:18:28.480 end because we only used 144 bytes and
00:18:30.960 the slot is 160 bytes wide
00:18:34.320 so that's how an object of dynamic size
00:18:36.559 is allocated using variable width
00:18:38.400 allocation
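
The size-pool selection above is just rounding up to a power-of-two multiple of the 40-byte slot. A sketch of that arithmetic (the function name is illustrative, not CRuby's):

```ruby
RVALUE_SIZE = 40

# Smallest power-of-two multiple of RVALUE_SIZE that fits the request,
# plus the size pool it lives in (pool 0: 40-byte slots, pool 1: 80,
# pool 2: 160, pool 3: 320, ...).
def size_pool_for(bytes)
  slots_needed = (bytes + RVALUE_SIZE - 1) / RVALUE_SIZE  # ceiling division
  pool = 0
  pool += 1 while (1 << pool) < slots_needed   # round up to a power of two
  { pool: pool, slot_size: RVALUE_SIZE << pool }
end

# The class example: 40-byte r value + 104-byte rb class ext = 144 bytes,
# i.e. 3.6 slots, rounded up to 4, landing in the 160-byte third pool.
klass = size_pool_for(40 + 104)
```

The same function places the talk's string examples: a 30-byte string lands in a 40-byte slot and a 54-byte string in an 80-byte slot.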
00:18:40.400 in our latest pull request we
00:18:42.080 implemented variable width allocation
00:18:43.600 for strings
00:18:44.880 however unlike classes which only
00:18:46.799 allocate a fixed size data structure
00:18:49.039 strings are malleable and can change
00:18:50.480 sizes and this makes them more
00:18:51.919 challenging to deal with
00:18:53.919 earlier in this talk we spoke about the
00:18:55.520 two types of ruby strings
00:18:57.840 shorter embedded strings where the
00:19:00.000 string content is stored directly in the
00:19:01.760 r value itself and longer strings that
00:19:04.240 have their content separately allocated
00:19:06.000 using malloc and have just a reference
00:19:07.840 stored in the r value
00:19:10.080 prior to variable width allocation only
00:19:12.160 strings shorter than 24 bytes could be
00:19:14.080 embedded
00:19:15.360 as part of this project we changed string
00:19:17.280 allocation to enable longer length
00:19:18.880 strings to be embedded
00:19:21.120 so let's look at allocating our example
00:19:22.799 strings again with variable width
00:19:24.080 allocation and see what's changed
00:19:26.559 our first string hello world was
00:19:29.039 embeddable before variable width
00:19:30.320 allocation and this is still the case
00:19:33.039 we allocate the headers the flags and
00:19:34.799 the class as normal and then push the
00:19:36.720 string content in directly after them
00:19:39.840 those with a keen eye may notice that
00:19:41.600 the string headers are now 18 bytes
00:19:43.280 instead of 16.
00:19:45.679 this is because strings allocated using
00:19:47.520 variable width allocation have to
00:19:48.960 support a longer embed length and so we
00:19:51.120 need a couple of extra bytes to store
00:19:52.480 this larger number
00:19:54.960 so the headers now occupy 18 bytes
00:19:57.280 and the contents occupy 12 bytes which
00:19:59.200 means we need 30 bytes in total
00:20:02.640 this fits inside the slots of our first
00:20:04.480 size pool which are 40 bytes so we
00:20:06.559 request a free slot and allocate into it
00:20:08.720 with 10 bytes at the end of the slot
00:20:10.240 going unused
00:20:12.559 now let's look at allocating our
00:20:14.000 longer example string
00:20:16.480 remember that before variable width
00:20:18.080 allocation this string was too long to
00:20:19.679 be embedded so the contents were
00:20:21.360 allocated externally using malloc now
00:20:23.919 this string can also be embedded
00:20:26.320 the string contents are 36 bytes and the
00:20:28.480 headers are 18 bytes so this string
00:20:30.320 requires a total of 54 bytes
00:20:33.520 the second size pool is the smallest
00:20:35.120 pool into which this string fits as it
00:20:36.799 has a slot size of 80 bytes so we
00:20:39.039 allocate into a free slot from this pool
00:20:40.960 and 26 bytes go unused at the end of
00:20:42.880 the slot
00:20:45.200 but what happens if we try to resize a
00:20:47.039 string for example what if we tried to
00:20:49.520 double this string by appending it to
00:20:51.039 itself
00:20:52.240 the string is going to become too long
00:20:53.600 to fit in the slot
00:20:56.320 currently we don't have a great way to
00:20:57.919 handle this so we fall back to
00:20:59.440 allocating the contents outside of the
00:21:01.039 ruby heap just as we would have done
00:21:02.480 before vwa was implemented
00:21:05.440 to convert this embedded string to a no
00:21:07.200 embed string we first allocate memory
00:21:09.280 for the new string contents using malloc
00:21:12.320 then we convert the r string headers to
00:21:13.919 use the original 40 byte r string
00:21:15.600 structure and we set the no embed bit in
00:21:17.679 the flags
00:21:19.679 finally we set the string's data pointer
00:21:21.760 to point to the region we just allocated
00:21:25.280 now our string object is exactly the
00:21:27.440 same as if it had been allocated prior
00:21:29.120 to variable width allocation being
00:21:30.640 implemented
00:21:32.400 however this fallback solution does mean
00:21:34.159 that any space past 40 bytes is wasted
00:21:36.960 so as this is allocated in an 80 byte
00:21:38.960 slot resizing this string wastes 40
00:21:41.360 bytes
00:21:42.559 we don't have a way to deal with this
00:21:43.840 yet but we do have some ideas and we're
00:21:45.760 going to discuss those later on in this
00:21:47.200 talk
00:21:48.480 benchmarks of memory performance have
00:21:50.080 shown that this is not a very
00:21:51.600 significant issue for the workloads we
00:21:53.360 tested
00:21:55.120 now peter's going to tell you about our
00:21:56.799 benchmarks and talk about how we're
00:21:58.480 measuring runtime and memory performance
00:22:00.080 throughout this project
00:22:02.799 we deployed variable width allocation
00:22:04.799 and the corresponding commit on the ruby
00:22:06.640 master branch on a small portion of
00:22:08.559 production traffic of a shopify service
00:22:11.360 we collected data over one day where
00:22:13.360 each deployment served around 84 million
00:22:15.200 requests
00:22:16.320 we did not see a change in response
00:22:18.080 times beyond the margin of error we
00:22:20.159 looked at average 50th percentile 90th
00:22:22.960 percentile and 99th percentile response
00:22:25.280 times however variable width allocation
00:22:27.919 did decrease memory usage by about four
00:22:30.159 percent
00:22:31.760 synthetic benchmarks were run on a bare
00:22:33.840 metal ubuntu machine on aws railsbench
00:22:36.960 and rdoc generation were benchmarked
00:22:38.640 using the glibc and jemalloc allocators
00:22:41.840 glibc and jemalloc are implementations
00:22:44.159 of the c library function called malloc
00:22:46.080 used to allocate memory
00:22:48.000 different allocators are implemented
00:22:49.600 differently and so they have different
00:22:51.120 properties
00:22:52.320 glibc is the default allocator in ubuntu
00:22:55.280 if you're interested in looking at more
00:22:56.720 in-depth results and analysis check out
00:22:59.120 the ticket linked in the slides
00:23:02.080 in railsbench we see a slight increase
00:23:04.080 in memory usage when using glibc and no
00:23:06.559 increase when using jemalloc
00:23:08.720 however we also see a small performance
00:23:10.720 increase for both glibc and jemalloc
00:23:13.919 this graph shows the rss or resident
00:23:16.320 set size which is the amount of memory
00:23:18.159 held in ram
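RSS is easy to sample from a running Ruby process. The snippet below is one portable way to do it via the `ps` command; it is a sketch, not the harness the speakers used for these graphs.

```ruby
# Sample this process's resident set size (RSS) in kilobytes by
# shelling out to ps, which works on Linux and macOS.
def rss_kb
  `ps -o rss= -p #{Process.pid}`.to_i
end

puts rss_kb # memory currently held in RAM, in kilobytes
```

Sampling this periodically over a benchmark run produces memory-over-time curves like the ones described here.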
00:23:19.679 we ran the two branches 25 times each
00:23:22.080 and each run consists of 20 000 requests
00:23:25.280 the graph shows the memory usage over
00:23:27.280 time where variable width allocation is
00:23:29.360 graphed in green and the master branch
00:23:31.280 is graphed in blue
00:23:33.760 rdoc generation generates the ruby
00:23:35.679 documentation unlike railsbench where
00:23:38.000 variable width allocation used more
00:23:39.600 memory here we see significantly lower
00:23:42.400 memory usage for a slight reduction in
00:23:44.480 performance
00:23:45.840 you can see in this graph that the vwa
00:23:47.279 branch uses significantly less memory
00:23:51.200 we wrote some micro benchmarks to
00:23:52.799 highlight the performance increase of
00:23:54.320 variable width allocation
00:23:56.159 for these benchmarks we use 71 byte
00:23:58.799 strings which are embedded when using
00:24:00.640 variable width allocation and not
00:24:02.480 embedded when not using variable width
00:24:04.159 allocation
00:24:05.279 our benchmark results show that
00:24:06.880 comparing two strings is 35 percent
00:24:09.120 faster
00:24:10.159 the times method also benchmarks
00:24:11.840 allocation speed and is 15 percent faster
00:24:15.360 similarly setting a character in the
00:24:17.200 string is also 15 percent faster this clearly
00:24:20.400 shows the performance benefits of using
00:24:22.480 variable width allocation
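Micro benchmarks of this shape are easy to reproduce with Ruby's stdlib `Benchmark` module. The iteration count and exact operations below are assumptions for illustration; the speakers' figures (35% faster comparison, 15% faster allocation and character assignment) came from their own harness.

```ruby
# Sketch of micro benchmarks over 71-byte strings, which are
# embedded under variable width allocation and heap-allocated
# without it. Timings will differ between Ruby builds.
require "benchmark"

STR   = "x" * 71
OTHER = "x" * 71
N = 100_000

Benchmark.bm(10) do |bm|
  bm.report("compare")  { N.times { STR == OTHER } }          # string comparison
  bm.report("allocate") { N.times { "x" * 71 } }              # String#* allocation
  bm.report("setchar")  { s = STR.dup; N.times { s[3] = "y" } } # character assignment
end
```

Running this on a VWA build versus master would be one way to observe the deltas reported above.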
00:24:25.120 we're confident about the stability of
00:24:27.039 variable width allocation
00:24:28.880 variable width allocation passes all
00:24:30.799 tests on ci on shopify's rails
00:24:33.039 monolith and we ran it for a small
00:24:35.200 portion of production traffic of the
00:24:36.720 shopify service for a week where it
00:24:38.799 served over 500 million requests
00:24:41.679 we also believe that the benchmark
00:24:43.200 results are satisfactory so in the
00:24:45.279 latest patch we propose that variable
00:24:47.039 width allocation is enabled by default
00:24:49.760 at the time of recording this talk this
00:24:51.520 proposal is open for feedback
00:24:53.919 however it's likely that it has
00:24:55.520 undergone changes and perhaps even been
00:24:57.600 merged so please check the ticket linked
00:24:59.760 in the slides for the most up-to-date
00:25:01.760 information
00:25:04.080 now that you've seen our current
00:25:05.440 implementation and some performance
00:25:07.279 results matt will now talk about some
00:25:09.520 limitations of this implementation and
00:25:11.679 how we're planning on tackling them
00:25:14.480 the first issue to talk about is
00:25:16.159 increasing coverage
00:25:18.000 currently only classes and strings are
00:25:19.760 taking advantage of the benefits of vwa
00:25:22.480 in order to see even larger
00:25:23.840 performance improvements we need to have
00:25:25.679 as many object types as possible using
00:25:27.679 it
00:25:29.360 next up is resizing
00:25:32.400 as you saw in our string implementation
00:25:34.320 we don't currently have an elegant way
00:25:36.000 to resize objects
00:25:37.520 this is a tough problem and solving it
00:25:39.440 is one of our next major
00:25:40.840 milestones an idea we're currently
00:25:42.799 investigating is to leverage gc
00:25:44.640 compaction to move object data for us
00:25:48.159 so when we need to resize up and the
00:25:50.400 current slot is not large enough we
00:25:52.400 could allocate the new data into a slot
00:25:54.640 in a larger size pool and then use
00:25:56.559 compaction to move the original object
00:25:58.640 structure next to its data again and
00:26:00.799 free up the old slot
00:26:03.919 one other thing that we'd like to do
00:26:05.120 eventually is to shrink the size of the
00:26:06.880 rvalue struct from 40 bytes down to 32
00:26:09.279 bytes
00:26:10.720 we want to do this because it means that
00:26:12.480 slots in the heap page will then be
00:26:13.840 aligned on cpu cache lines
00:26:16.400 for example in the diagram below we can
00:26:19.200 see that many of the 40 byte slots
00:26:20.880 straddle two cache lines and so reading
00:26:23.039 a single slot requires two potentially
00:26:24.799 expensive cache lookups
00:26:27.200 if the slot size was shrunk to 32 bytes
00:26:29.679 all the slots become aligned on cache
00:26:31.279 lines and reading a slot never requires
00:26:33.520 more than one cache lookup
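The cache-line argument can be checked with a little arithmetic. The sketch below counts how many slots straddle a 64-byte cache line boundary over one heap page; the 16 KB page size is an assumption for illustration.

```ruby
# Count slots whose byte range crosses a cache line boundary,
# assuming slots are packed back to back from offset 0.
CACHE_LINE = 64

def straddling_slots(slot_size, page_bytes)
  (page_bytes / slot_size).times.count do |i|
    first_line = (i * slot_size) / CACHE_LINE
    last_line  = (i * slot_size + slot_size - 1) / CACHE_LINE
    first_line != last_line # true when the slot spans two lines
  end
end

puts straddling_slots(40, 16_384) # many 40-byte slots cross a boundary
puts straddling_slots(32, 16_384) # 32-byte slots never do, since 32 divides 64
```

With 40-byte slots, half the slots in each 320-byte repeating pattern straddle two cache lines; with 32-byte slots, none do, which is the motivation for shrinking the struct.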
00:26:35.679 this isn't something we can do right now
00:26:37.279 as too many objects are still using all
00:26:39.039 40 bytes of the rvalue so they'd end up
00:26:41.200 being allocated into a 64 byte slot or
00:26:43.520 larger and then wasting 24 bytes or more
00:26:48.000 overall we're proud of the progress that
00:26:49.760 we've made but we still have a long way
00:26:51.600 to go and many things to explore
00:26:54.240 we plan to share progress
00:26:55.919 updates with the community frequently
00:26:57.840 via conference talks and blog posts
00:27:00.320 but for now the best way to keep an eye
00:27:02.080 on what we're up to is the shopify ruby
00:27:04.000 fork on github and the ruby core bug
00:27:06.320 tracker
00:27:09.120 and that's it folks in this talk we
00:27:10.960 looked at how ruby's memory allocator
00:27:12.640 and garbage collector works some of the
00:27:14.720 issues in ruby's memory layout how we're
00:27:16.880 trying to solve them with variable width
00:27:18.320 allocation as well as our current
00:27:20.080 progress in the project we're continuing
00:27:22.320 to improve the implementation and we
00:27:24.159 still have a long way to go thanks for
00:27:26.159 listening and thanks to rubyconf for
00:27:28.159 having us