Production outages are great at teaching you how not to cause production outages. I've caused plenty and hope that by sharing them publicly, it might help some people bypass part one of the production outage learning syllabus. Previously I discussed ways I've broken prod with PostgreSQL and with healthchecks. Now I'll show you how I've done it with Redis too.
For the record, I absolutely love Redis. It works brilliantly if you use it correctly. The gotchas that follow were all occasions when I didn't use it correctly.
Redis executes commands on a single thread, which means concurrency in your application layer creates contention as commands queue on the server. In the normal course of things this probably won't cause problems, because Redis commands are typically very fast to execute. But at times of very high load, or if commands are slow to finish, you'll see either timeouts or latency spikes, depending on how your connection pools are configured.
If you're particularly naive, like I was on one occasion, you'll exacerbate these failures with some poorly-implemented application logic. I wrote a basic session cache using GET, which fell back to a database query and SET to populate the cache in the event of a miss. Crucially, it held onto the Redis connection for the duration of that fallback and allowed errors from SET to fail the entire operation.

Increased traffic, combined with a slow query in Postgres, caused this arrangement to effectively DoS our Redis connection pool for minutes at a time. During these periods, connections timed out across the board and users were left staring at a generic failure page instead of a working application.
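The fix was to stop holding a connection across the database fallback and to treat cache writes as best-effort. Here's a minimal sketch of that shape using redis-py; the helper load_session_from_db and the TTL are invented for illustration:

```python
import logging

import redis

r = redis.Redis(host="localhost", port=6379, socket_timeout=0.1)
SESSION_TTL_SECONDS = 3600

def load_session_from_db(session_id: str) -> bytes:
    # Hypothetical stand-in for your real database query.
    ...

def get_session(session_id: str) -> bytes:
    key = f"session:{session_id}"
    try:
        cached = r.get(key)  # connection returns to the pool right here
        if cached is not None:
            return cached
    except redis.RedisError:
        logging.warning("session cache read failed; falling back to db")

    session = load_session_from_db(session_id)  # no Redis connection held

    try:
        # Best effort: a failed SET must not fail the whole request.
        r.set(key, session, ex=SESSION_TTL_SECONDS)
    except redis.RedisError:
        logging.warning("session cache write failed; serving db result")
    return session
```

The two differences from my original version: no connection is held while the database query runs, and a cache write failure degrades to a cache miss on the next read rather than an error for the user.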
The easiest way to relieve this contention is to shard your data across multiple Redis instances. There are various ways to do this.
If your application contains a few functionally-separate Redis abstractions, you might want to manually shard data from each of those functional areas to its own instance. This approach allows you to vary configuration options like eviction policy by functional area too. The downside is that if any one area gets too heavy, you're back to where you started in terms of needing to shard again.
Alternatively, to shard your data more generally across multiple instances, you can use Redis Cluster. For the most part this lets you forget about how sharding is implemented, unless you're using multi-key commands, transactions or Lua scripts. If you do have any of those, you must ensure that all keys per command, transaction or script resolve to the same shard by using hash tags. A hash tag is just a substring of the key, enclosed in curly braces; only that substring is hashed when assigning the key to a slot.
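For example, both keys below hash on the substring between the braces and therefore land in the same slot, which makes the multi-key read legal. A sketch using redis-py's cluster client, with invented key names:

```python
from redis.cluster import RedisCluster

rc = RedisCluster(host="localhost", port=7000)

# Both keys hash on "user:42", so they map to the same slot.
rc.set("{user:42}:profile", "...")
rc.set("{user:42}:sessions", "...")

# Without the hash tags, these keys could resolve to different slots
# and this multi-key read would fail with a CROSSSLOT error.
rc.mget(["{user:42}:profile", "{user:42}:sessions"])
```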
Redis Cluster may not be available in your deployment environment, for instance if you're using GCP Memorystore. In that case you could of course shard your keyspace manually, but there are a couple of automated options available too: Twemproxy and Codis are third-party, open-source proxies that you can stand up in front of your Redis instances to handle sharding for you.
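If you do go the manual route, the core of it can be as small as a stable hash mapping each key to an instance. A rough sketch; the instance addresses are placeholders:

```python
import zlib

import redis

# Placeholder addresses; in practice these come from configuration.
SHARDS = [
    redis.Redis(host="redis-0.internal", port=6379),
    redis.Redis(host="redis-1.internal", port=6379),
    redis.Redis(host="redis-2.internal", port=6379),
]

def shard_for(key: str) -> redis.Redis:
    # crc32 is stable across processes, unlike Python's builtin hash().
    return SHARDS[zlib.crc32(key.encode()) % len(SHARDS)]

shard_for("session:42").set("session:42", "...")
```

Note that changing the shard count remaps most keys, which is exactly the problem consistent hashing, and proxies like Twemproxy that implement it, exist to soften.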
Redis supports Lua scripts and, from version 7, functions for logic that needs to run atomically. They're especially useful when you need to combine commands conditionally or in a loop. But because of Redis' single-threaded nature, you should pay attention to how long these scripts take to execute. Loops in particular can get out of hand if you're not careful.
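As an illustration of the kind of small, bounded script this is good for, here's an atomic rate-limit check, sketched with redis-py's register_script; the key naming is invented:

```python
import redis

r = redis.Redis()

# Runs as a single unit on the server, so no other command can
# interleave between the INCR and the EXPIRE.
rate_limit = r.register_script("""
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
""")

# Allow up to 100 calls per 60-second window for this key.
allowed = rate_limit(keys=["ratelimit:api:42"], args=[60]) <= 100
```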
I made this mistake when implementing a cache for a permissions graph. In our model, permissions cascaded down the graph, so I added a secondary store for each node: a sorted set populated with the ids of its ancestors. That allowed us to remove entire subgraphs in one operation, because modifying permissions on any node meant modifying permissions on all its ancestors too. This worked well for a long time, but as more features were gradually added to the product, the size of the subgraphs increased. Each of those increases had a compound effect, because it also increased the number of events invalidating the cache. Eventually we reached a point where individual loops in our Lua script were running thousands of iterations, and we began to notice latency spikes in monitoring. At times of particularly heavy traffic, this caused timeouts on our Redis connection pool as commands got stuck waiting to be scheduled.
So keep your scripts and functions simple, and if they can't be simple, consider whether Redis is the right tool for whatever you're trying to do. In my case, it wasn't.
The maxmemory-policy setting determines how Redis behaves when available memory is exhausted. Broadly speaking, it can either fail writes or evict some other data to allow writes to succeed. If you're implementing a cache or any kind of ephemeral store where it's okay to lose data, you can probably pick one of the allkeys-* options and not worry too much about memory usage in production. Otherwise you must choose between noeviction and the volatile-* policies, and design your application to handle failed writes gracefully.
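On a noeviction instance, writes beyond maxmemory fail with an OOM error, which redis-py surfaces as a ResponseError. A sketch of degrading gracefully rather than failing the request; the fallback behaviour is an assumption about your application:

```python
import logging

import redis

r = redis.Redis()

def cache_write(key: str, value: bytes) -> None:
    try:
        r.set(key, value)
    except redis.exceptions.ResponseError as exc:
        # noeviction rejects writes with an error beginning "OOM".
        if str(exc).startswith("OOM"):
            logging.warning("redis out of memory; skipping cache write")
            return  # degrade to a cache miss on the next read
        raise
```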
When those failed writes happen, you don't want them to come as a surprise. Configure monitoring to alert when memory usage hits 80%, 90% and 99%. I like having multiple layers of alerts because sometimes everyone is under pressure to ship features and the early alerts get deprioritised or forgotten. That's not to say they're okay to ignore, just an acknowledgement of the reality of working at a startup. Hopefully you never see that 99% alert fire, because the earlier ones gave you a chance to increase memory or reduce usage. But it's nice to know it's there, just in case.
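If your monitoring stack doesn't already collect Redis metrics, a simple poll of INFO can drive those alerts. The thresholds below mirror the 80/90/99% layers; the alert function is a placeholder for whatever paging system you use:

```python
import redis

r = redis.Redis()

def alert(message: str) -> None:
    # Placeholder: wire this up to your paging/alerting system.
    print(message)

def check_memory(thresholds=(0.8, 0.9, 0.99)):
    info = r.info("memory")
    maxmemory = info["maxmemory"]
    if not maxmemory:  # 0 means no limit is configured
        return
    usage = info["used_memory"] / maxmemory
    # Fire only the highest threshold crossed.
    for threshold in sorted(thresholds, reverse=True):
        if usage >= threshold:
            alert(f"redis memory at {usage:.0%} (>= {threshold:.0%})")
            break
```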
I once wrote a debounce abstraction for a system that generated lots of update events, to reduce reindexing activity in Elasticsearch. To save a database query when handling debounced events, I stashed the aggregated event bodies in Redis along with the debounce timestamp. Everything was fine until we added wiki pages as a new feature in the application. Pages were allowed to include base64-encoded image data, so those events turned out to be much larger than any we'd emitted previously. And they were more frequent too, because users tended to make lots of small edits to their pages. This was a noeviction Redis instance and, embarrassingly, I hadn't set up alerts on memory usage. It wasn't until I saw the error spike that I realised something was wrong.
The Redis API is so much richer than just GET, SET and DEL. There's too much to cover in detail here, but make sure you understand the tradeoffs between hashes, lists, sets and sorted sets, and familiarise yourself with bitmaps and bitfields. The docs do a good job of discussing big-O performance for each data type. If you understand your data and these tradeoffs in advance, you can save yourself a lot of time and pain by picking the right structure from the start.
One common mistake is serialising objects to JSON strings before storing them in Redis. This works for reading and writing objects as atomic units, but it's inefficient for reading or updating individual properties within an object, because you pay to parse or serialise the whole thing on every command. Decomposing your objects into hashes instead lets you access individual properties directly. For large objects, this can be a significant performance improvement.
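For instance, compare the two approaches for a small user record; a sketch with invented field names:

```python
import json

import redis

r = redis.Redis(decode_responses=True)

user = {"name": "Ada", "email": "ada@example.com", "logins": 41}

# JSON blob: updating one property means a full read-modify-write cycle.
blob = json.loads(r.get("user:1:json") or "{}")
blob["logins"] = blob.get("logins", 0) + 1
r.set("user:1:json", json.dumps(blob))

# Hash: touch only the field you need, server-side and atomically.
r.hset("user:1", mapping=user)
r.hincrby("user:1", "logins", 1)
print(r.hget("user:1", "email"))
```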
Another common mistake is using lists for large collections. If you find yourself using LINDEX, LINSERT or LSET on a large list, be careful: these commands are O(n), and you might be better off with a sorted set instead.
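As a sketch of that alternative, here's the same mid-collection insert done both ways, with the sorted set using a score to carry the ordering; key names are invented:

```python
import redis

r = redis.Redis(decode_responses=True)
r.delete("feed:list", "feed:zset")  # start clean for the example

# List: LINSERT walks the list to find the pivot, O(n) on every call.
r.rpush("feed:list", "a", "b", "d")
r.linsert("feed:list", "BEFORE", "d", "c")

# Sorted set: O(log n) inserts, with the score carrying the order.
r.zadd("feed:zset", {"a": 1, "b": 2, "d": 4})
r.zadd("feed:zset", {"c": 3})        # lands between b and d
print(r.zrange("feed:zset", 0, -1))  # ['a', 'b', 'c', 'd']
```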