Solid Cache: A disk backed Rails cache

by Donal McBreen

In this presentation, Donal McBreen, a programmer at 37signals, discusses Solid Cache, a new disk-backed caching system designed for Ruby on Rails applications. Solid Cache aims to address the inefficiencies associated with traditional memory-based caching systems like Redis or Memcached, which, while fast, have limitations in scalability and cost. The project was implemented on the hey.com and Basecamp platforms, transitioning from a Redis cache to a system that can handle months of data storage instead of just days.

Key points covered in the video include:

- Project Motivation: The central question is whether disk caching can match the performance of memory caching while offering increased storage and reduced costs.

- Caching Strategy: Solid Cache is classified as a remote disk cache, filling a gap in Rails' caching options. The approach began with the choice to utilize a database, facilitating implementation through existing libraries and SQL capabilities.

- Design Criteria: Goals included making the cache database-agnostic, simple to install, avoiding scheduled tasks for data management, and optimizing performance.

- Handling Expiry: McBreen explains the challenge of managing cache expiry. Traditional memory caches automatically delete old data, but disk caches require a more manual method. Strategies were proposed for expiry based on the age of items and the overall size of the cache.

- Expiry Algorithms: The presentation discusses various strategies for expiry, focusing on FIFO (First-In-First-Out) versus LRU (Least Recently Used) methodologies. FIFO was chosen for its cost efficiency and lower overhead, allowing for effective management of cache size over time.

- Performance Results: Following implementation, Solid Cache yielded significant improvements:

  - Reads averaged around 1 millisecond and writes about 1.4 milliseconds.
  - The setup reduced the required RAM from 1.1 terabytes with Redis to just 80 gigabytes, leading to substantial cost savings.
  - The miss rate improved from 10% with Redis to about 7.5% with Solid Cache, enhancing efficiency.

- Conclusions and Future Scope: McBreen concludes that Solid Cache has proven effective in enhancing application speeds and reducing costs. While its benefits may vary for applications not optimized for caching, the overall efficiency gains and operational improvements make it a valuable tool for Rails developers.

Through Solid Cache, 37signals demonstrates that a thoughtful approach to caching can yield significant performance enhancements and cost reductions in software applications.

A disk cache can hold much more data than a memory one. But is it fast enough?

At @37signals they built Solid Cache, a database-backed ActiveSupport cache, to test this out. Running in production on hey.com, they now cache months' rather than days' worth of data.

The result: emails are 40% faster to display and it costs less to run.

Links:
https://rubyonrails.org/
https://github.com/rails/solid_cache
https://dev.37signals.com/solid-cache/

#RailsWorld #RubyonRails #SolidCache #database #ActiveSupport

Rails World 2023

00:00:15.240 Hi everyone. My name is Donal McBreen, and I am a programmer at 37signals. Today, I'm going to talk about our new disk-backed Rails cache that we're calling Solid Cache.
00:00:22.119 We have been using this cache on hey.com since about February, and since the start of September on Basecamp.
00:00:34.360 The tagline of our project revolves around a critical question: if we start caching data on disk instead of memory, will the performance be sufficient?
00:00:41.480 The traditional approach involves using memory caches like Redis or Memcached. However, our aim was to explore disk caching, which offers the potential for significantly larger storage at a lower cost.
00:00:52.800 While we do not expect the individual operations in our cache to be faster, we hope that the overall caching mechanism will improve the application’s response time.
00:01:03.320 This is our Rails cache store quadrant. We distinguish between local and remote caches, and between memory and disk caches.
00:01:11.159 Rails already includes a file store as a built-in disk cache, but there's nothing currently available in the remote disk cache quadrant. Our goal is to fill this gap.
00:01:41.480 Before we started the project, we established some design criteria. Our first design goal was to utilize a database.
00:01:46.600 Starting with a database simplifies many aspects of implementation. It gives us access to connection libraries, SQL for operations like inserting, selecting, and deleting cache records, as well as indexing capabilities to quickly retrieve records. Additionally, it eliminates the need to interact directly with the file system, as the database handles that for us. Since we already have a database, we thought it would be sensible to utilize it for the cache, although we could also consider sharding it across multiple databases.
00:02:48.520 A major advantage of using a database is that they typically come with built-in memory caches. For instance, in MySQL, there's a buffer pool that can provide high hit rates, often over 99%. At a 99.8% hit rate, for example, you would only need to access the disk once in every 500 operations.
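The arithmetic behind that "once in every 500 operations" figure can be sketched in a couple of lines (the 99.8% hit rate is a hypothetical example within the "over 99%" range mentioned here, not a number from the talk):

```ruby
# Illustrative arithmetic only: how a buffer-pool hit rate translates
# into how rarely a cache read actually touches the disk.
hit_rate  = 0.998               # hypothetical hit rate in the "over 99%" range
miss_rate = 1.0 - hit_rate      # fraction of reads that miss the buffer pool
ops_per_disk_access = (1.0 / miss_rate).round

puts ops_per_disk_access        # => 500
```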
00:03:30.480 The second design criterion was to make it database-agnostic. We aimed to ensure that the cache could work with various databases, such as SQLite, MySQL, or PostgreSQL.
00:03:49.680 The third aspect was to make the cache plug-and-play. We wanted a simple installation process consisting of three main steps: install the gem, run the migrations to create the cache table, and configure the cache store in your settings.
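The configuration step might look like the following sketch. The `:solid_cache_store` symbol is an assumption based on Rails' usual cache-store naming convention; check the solid_cache README for the exact store name and options.

```ruby
# config/environments/production.rb -- a sketch, not verbatim from the
# talk. The :solid_cache_store symbol is assumed from Rails' convention
# of registering cache stores under snake_cased symbols.
Rails.application.configure do
  config.cache_store = :solid_cache_store
end
```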
00:03:57.680 Furthermore, we didn't want to rely on scheduled tasks or cron jobs to manage the cache, such as expiring old data.
00:04:12.200 Finally, performance was a key criterion. We wanted the cache operations to be as fast as possible, but we recognized that we might not achieve the speed of Redis or Memcached, where operations occur in microseconds, while database operations typically take around 100 microseconds.
00:04:33.680 However, all these times are relatively small, so our goal was to see how close we could get to that performance.
00:04:50.960 Let me show you a schema that illustrates a fundamental starting point for building the cache. All that's necessary is a simple key-value store with a key, a value, and an index on the key. With that, we can insert and query records effectively.
00:05:16.160 However, we faced the challenge of cache expiry. This issue is significant, especially as Redis and Memcached manage this automatically: you can simply set a memory limit, and they’ll delete older data when it exceeds that limit.
00:05:43.560 On the other hand, databases act differently; they prefer to retain data and need commands to delete it. To manage expiry in our solution, we consider two important factors: the age of the items and the overall size of the cache.
00:06:10.200 First, why should we expire items based on age? Because we want to ensure our customers are not left with outdated data in the cache. For instance, Redis uses a probabilistic least recently used algorithm to randomly remove older items.
00:06:24.880 This means that when using such a cache, some old records may remain. Therefore, we wanted to build a mechanism that allows you to specify a maximum age for the data, ensuring it gets deleted after that time.
00:06:46.480 We can achieve this by indexing the age of the cache items, allowing for the quick identification of the oldest records.
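A minimal plain-Ruby model of age-based expiry (a sketch of the idea, not the Solid Cache schema or API): each entry records when it was written, and a sweep removes anything older than the maximum age.

```ruby
# Plain-Ruby model of age-based expiry: each entry carries a
# created_at timestamp, and a sweep deletes anything older
# than max_age. In the real system this would be an indexed
# DELETE on the timestamp column.
Entry = Struct.new(:key, :value, :created_at)

DAY = 24 * 3600

def expire_by_age(entries, max_age:, now: Time.now)
  cutoff = now - max_age
  entries.reject { |e| e.created_at < cutoff }
end

now = Time.now
entries = [
  Entry.new("old", 1, now - 90 * DAY),   # written 90 days ago
  Entry.new("new", 2, now - 10 * DAY)    # written 10 days ago
]

kept = expire_by_age(entries, max_age: 60 * DAY, now: now)
puts kept.map(&:key).inspect # => ["new"]
```

With the timestamp indexed, as the talk describes, finding the oldest rows to delete stays cheap even in a very large table.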
00:07:13.160 The second challenge is figuring out how to expire items based on the size of our cache. Even while we are expiring by age, we need to account for varying growth rates within the cache.
00:07:47.640 Caches can grow unexpectedly, especially with fragment caching: if a fragment is used across many pages and a content change alters its digest, the result is a significant influx of new data into the cache.
00:08:25.000 We explored several strategies, including file size checks and database statistics, but neither yielded the real-time information we required.
00:08:41.000 One alternative we considered was using row counts as a proxy. However, with hundreds of millions of rows in the cache, a full count would be too slow.
00:09:32.960 Next, we examined various expiry algorithms to determine how to identify the oldest items for removal. The simplest algorithm is FIFO, or First-In-First-Out.
00:09:53.680 To implement FIFO, we would need to modify our schema to include a 'created_at' timestamp field and index it, enabling us to identify the oldest records.
00:10:29.480 However, Redis and Memcached use an alternative approach based on 'Least Recently Used.' Instead of tracking creation times, they track access times—this requires renaming the timestamp column to reflect access times.
00:11:06.440 With LRU, every time a cache item is accessed, we update its access timestamp. This means the newest accessed item is at the front of the index. Unfortunately, while this method maintains freshness, it comes at a cost; every read operation also results in an update operation to the database.
00:12:14.760 Conversely, using FIFO allows faster reads since we can simply select from the table without additional updates.
00:12:50.320 In fact, with FIFO, the oldest entries align with the lowest IDs, allowing us to scan through IDs to find the oldest records without requiring a separate index.
00:13:21.200 However, a notable downside to FIFO is that it results in a lower hit rate compared to LRU, as we may evict items that are later requested.
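The trade-off can be seen in a toy simulation (a sketch for illustration, not the Solid Cache implementation): FIFO evicts by insertion order, LRU by access order, so a frequently read key survives under LRU but not under FIFO.

```ruby
# Toy caches showing the FIFO vs LRU eviction difference. Both hold
# at most `capacity` keys; Ruby hashes preserve insertion order,
# which both classes exploit.
class FifoCache
  def initialize(capacity)
    @capacity = capacity
    @store = {}
  end

  def write(key, value)
    @store[key] = value                    # position fixed at first insert
    @store.shift while @store.size > @capacity
  end

  def read(key)
    @store[key]                            # reads do no bookkeeping
  end
end

class LruCache
  def initialize(capacity)
    @capacity = capacity
    @store = {}
  end

  def write(key, value)
    @store.delete(key)                     # re-writing refreshes position
    @store[key] = value
    @store.shift while @store.size > @capacity
  end

  def read(key)
    return nil unless @store.key?(key)
    value = @store.delete(key)             # every read is also a write:
    @store[key] = value                    # move key to the "recent" end
    value
  end
end

fifo = FifoCache.new(2)
lru  = LruCache.new(2)

[fifo, lru].each do |cache|
  cache.write("a", 1)
  cache.write("b", 2)
  cache.read("a")      # "a" is hot
  cache.write("c", 3)  # forces one eviction
end

puts fifo.read("a").inspect # => nil (FIFO evicted the oldest insert, "a")
puts lru.read("a").inspect  # => 1   (LRU kept the recently read "a")
```

The delete-and-reinsert inside `LruCache#read` is the toy analogue of the per-read database UPDATE that, as described above, makes LRU expensive on a disk-backed cache.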
00:13:56.720 Yet, due to cost efficiency and longer retention times—up to two months in our use case—FIFO proves to be a viable option.
00:14:34.880 Ultimately, we determined that FIFO reduces fragmentation, easily manages cache size estimates, and avoids the performance penalty associated with frequent updates.
00:14:57.520 We can use row ID ranges as proxies for cache size without fragmentation, allowing us to accurately estimate cache usage without costly overhead.
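Because IDs are allocated sequentially and FIFO deletes from the bottom of the range, the spread between the smallest and largest live ID approximates the row count without a full COUNT(*). A sketch of the estimate (the numbers and the average row size are hypothetical):

```ruby
# Sketch of the ID-range size estimate. With FIFO expiry, rows are
# deleted from the bottom of the ID range, so max_id - min_id
# approximates the live row count without an expensive COUNT(*).
def estimated_cache_bytes(min_id, max_id, avg_row_bytes)
  (max_id - min_id + 1) * avg_row_bytes
end

# e.g. roughly 200 million live rows averaging 1 KB each:
puts estimated_cache_bytes(1_000_000, 201_000_000, 1024) # ~200 GB
```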
00:15:34.120 Now, let’s take a look at some initial results of our implementation. After introducing our expiry process, we observed a significant stabilization in our database size.
00:15:54.880 It's important to note that while databases seldom release disk space back to the operating system, our goal was to prevent further growth, and this was achieved.
00:16:19.800 This analysis assumes the cached data stays reasonably uniform, so that expiry behaves consistently over time.
00:16:41.720 In terms of execution, we established a background task tied to cache writes. Every 80 writes, the cache would begin the expiry procedure.
00:17:19.680 Each expiry run processes the oldest 100 records, examining both their ages and the overall cache size. Records older than a specified duration are then deleted.
00:17:41.920 Because each run can purge more records than were written since the last run, the ratio of records purged to records written keeps the cache in balance and prevents unbounded growth.
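A plain-Ruby sketch of that trigger logic (the counts of 80 writes per trigger and batches of 100 candidates come from the talk; the surrounding class is purely illustrative):

```ruby
# Sketch of the write-triggered expiry loop described in the talk:
# every EXPIRY_EVERY_N_WRITES writes kicks off one expiry pass, which
# in Solid Cache runs as a background task over the oldest
# EXPIRY_BATCH_SIZE rows.
EXPIRY_EVERY_N_WRITES = 80
EXPIRY_BATCH_SIZE     = 100

class ExpiryTrigger
  attr_reader :expiry_runs

  def initialize
    @writes = 0
    @expiry_runs = 0
  end

  def record_write
    @writes += 1
    return unless (@writes % EXPIRY_EVERY_N_WRITES).zero?
    @expiry_runs += 1   # one background expiry pass per 80 writes
  end
end

trigger = ExpiryTrigger.new
400.times { trigger.record_write }
puts trigger.expiry_runs # => 5  (one expiry pass per 80 writes)
```

Since each pass may delete up to 100 rows while only 80 were written since the last pass, deletions can outpace insertions and the cache size stays bounded.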
00:18:02.280 In terms of resilience, our system uses several MySQL databases, each with dedicated memory and storage resources. The design focuses on fast retrieval while keeping the cached data encrypted.
00:18:51.960 The caches are configured for durability without relying on replication, allowing us to sustain a straightforward infrastructure.
00:19:32.560 Transitioning over to performance, although we anticipated the operations to be slower due to encryption, our results revealed that reads averaged around 1 millisecond, and writes around 1.4 milliseconds—values that remain efficient in our context.
00:20:15.640 To compare storage costs, previously, we relied on 1.1 terabytes of RAM for our Redis cache but now utilize only 80 gigabytes of RAM with Solid Cache, resulting in substantial cost savings.
00:20:53.760 Our estimates suggest that scaling with Solid Cache could be around 20 times cheaper than Redis while retaining entries for far longer.
00:21:30.960 Finally, looking at cache efficiency, the miss rate reduced from 10% with Redis to approximately 7.5% with Solid Cache, suggesting a significant improvement.
00:22:17.160 The key takeaway from this improvement is that a cache's efficacy hinges on its size: the much larger disk-backed cache more than compensated for FIFO's simpler eviction strategy.
00:23:00.160 We also examined how the system copes with fluctuations in load while relying on disk-based storage, which has proven advantageous.
00:23:48.440 In conclusion, not only have we confirmed that Solid Cache operates at a larger scale and is more cost-effective, but we also validated that it has indeed sped up our application.
00:24:24.560 For applications not optimized for caching, the benefits might vary. Nevertheless, moving the cache into a database has been a significant gain in operational efficiency for us.
00:24:55.440 As I conclude, here's the repository link for anyone interested in exploring more about our Solid Cache project. Thank you for your attention!