How to Design a Cache

Introduction

As part of my responsibilities as a cloud engineer and SRE, I had to work on a microservice handling statistics from the API gateway we were managing.

This microservice was just a backend serving historical data about the requests and responses that went through the gateway. The data is stored on Timestream, then processed and delivered upon request to an AngularJS front-end implementing a dashboard.

The problem was that each of these requests involved a query to Timestream to fetch the data needed to build the response, with a significant impact on both cloud costs and latency.

This is a vital service for both the team and the stakeholders of our SaaS: it provides deep insights in a cloud environment where we cannot simply use something like Grafana due to the client's security restrictions.

This is why I was assigned to figure out how to build a cache for it.

The mathematical model of a cache

Using set theory, we can model a cache as two domains linked by a series of functions.

The data we need lives in the larger domain, which in this case is AWS Timestream; in classical computing it is usually the RAM or persistent storage of the computer.

The cache usually contains the data our application will immediately need for its operations, thanks to a property of the data called locality of reference, of which there are two kinds: temporal locality (recently accessed data is likely to be accessed again soon) and spatial locality (data stored near recently accessed data is likely to be accessed as well).

The relationship by which one or several items of the larger domain are mapped to the smaller domain, or cache, is called the matching function.

If the cache is written to, then we also need to define a write function that invalidates the affected contents of the cache and commits the change to the larger memory to keep both domains consistent.

According to Phil Karlton, an engineer at Netscape in the 90s, "there are only two hard things in Computer Science: cache invalidation and naming things." Luckily for us, this is a read-only cache: since we are handling historical data for monitoring purposes, we do not need to define this write function for our dashboard.

We will also need a search function: the algorithm by which we move data from one domain to the other according to our matching function.

Finally, we will need a replacement function, because the cache is smaller than the larger memory and we have to decide which data gets overwritten once it becomes full.

Two other recommendations for cache design: it is convenient to separate data from instructions (or specifications) in your architecture, because they usually differ strongly in their locality; and you should have the largest cache your application can afford.

The data

The most important step when designing a cache is to analyze the underlying properties of your data: its schema, how it is used, and the matching function between the data on Timestream and the data your application will actually need.

In this case we were dealing with time series accessed via a time window, which have strong spatial locality because they are grouped into ranges of time within well-defined partitions such as hours, days, weeks, months or years.

This meant that when analyzing the data for a day, we had to obtain groups of metrics per hour. If we were dealing with weeks, the metrics had to be grouped per day, and so on.

This grouping greatly reduced the size of the data we needed to fetch from Timestream: we did not need a one-to-one mapping between the records in the larger domain and the cache, but a many-to-one mapping via a Timestream query that returned the data already grouped.

The fact that this is a time series also implies that we can neither change data in the past nor ask for data from the future, which simplified the semantics for accessing it.

Later in the process I had the idea of instrumenting the application to log accesses in a machine-readable format such as CSV, in order to analyze them with something like Excel or pandas. This would have given me insights into the structure and access patterns of the data, helping me understand how the application behaves and build a data access profile for it.

The application

We could not simply use Redis due to the large amount of data this service needs to cache, so I knew from the start that we had to use either a single-file database of some sort or plain files.

Given that we already had a mock using SQLite, I decided to reuse that code and go with the single-file database. Another reason was that SQLite made it easy to define the schema of the cache itself, as well as the queries and indexes needed to implement the search and replacement functions.

If I had gone with a CSV file for the data, I would most likely have had to spend time on a reader/writer interface of sorts ([although Golang provides really easy-to-use tools to achieve this](https://www.bytesizego.com/blog/golang-csv-reader)).

A big part of the work was analyzing how the application was built and deciding which data to cache.

One option was to cache the requests to and responses from the backend itself, on the side of the application's controller: if you have previously received the same request, you return the same data. But this would have bound the schema of the cache to the specific use cases of the application, which was a no-go: the app is still in development, and those use cases will change, invalidating the cache schema. So we went with the basic schema of the data we were pulling from Timestream.

This meant that the cache itself needed to live behind what Java developers know as the repository pattern, through which the rest of the application's code retrieves the data.

The challenges

The hardest part of this task was figuring out the substitution function and the search function that would optimize the application's hit rate.

For the search function, the issue was that the data received was not exactly the data the application asked for, thus generating a miss and a call to Timestream.

I solved this with a heuristic that detects when the data is indeed in the cache by looking forward and backward within an acceptance time range, determined by the grouping of the data requested.

If no data was found within that time range (for example, one hour after or one hour before the requested time in the case of by-hour groupings), then we assumed the cache was missing data for this time frame and asked Timestream for it. In most calls this was not needed, as long as the data had been previously fetched and the acceptance time range was wide enough, which reduced the misses reported by the cache.

Regarding the substitution function: if the cache is full, it is hard to decide which data should go out when there are no criteria on how it will be accessed.

For example, if the application is used for auditing performance over time or searching for trends over months, then old data is important. But if what you are trying to build is a real-time monitoring system, then you will most likely need to prioritize new data.

It is also possible that the data access profile is different for each kind of user, thus requiring multiple caches with different substitution functions.

However, these explorations go too far into the future and over our budget, so we will consider a single cache and KISS the hell out of it.

For now, until we have more data on how the application is used, we will just cache the newest data, which is what most users request when accessing the site. That means a substitution function that overwrites the oldest records as soon as new data is available.
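That "keep the newest" policy is the simplest substitution function there is. Over a sorted slice of bucket timestamps it is a one-liner; with SQLite the same effect is a single DELETE ordered by the bucket column. The function name is invented for the example.

```go
package main

// keepNewest trims a slice of bucket timestamps, sorted ascending,
// down to the newest max entries. This is the "overwrite the oldest"
// substitution policy chosen for now; in SQLite the equivalent is a
// DELETE keeping only the most recent bucket_start values.
func keepNewest(buckets []int64, max int) []int64 {
	if len(buckets) <= max {
		return buckets
	}
	return buckets[len(buckets)-max:]
}
```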

Conclusions

My perception of the industry is that most software engineering has piggybacked on hardware and networking improvements over time, without being smart about how systems are implemented or what they are actually doing.

In order to answer those questions, we need to generate lots of data, learn how to analyze it, and learn how to move it over the network efficiently.

Learning how to design and implement a cache is a hard skill for anyone attempting to build performant distributed systems, and there are still a lot of questions to solve, such as how to architect systems with cache invalidation protocols.

whoami

Jaime Romero is a software engineer and cybersecurity expert operating in Western Europe.