Matt Doherty
kdb+ is famous for its dislike of loops (no stinking loops). As kdb+ developers we have a lot of flexibility in the architectures we can use, and today I’m going to try to convince you there’s something else we should remove from our systems: rdbs. No stinking rdbs!
In kdb+ land there are essentially two ways to store data: in memory, and on disk. The data itself is identical in both. This was a key insight that's now quite common in the columnar data space (see Apache Arrow in particular), as it removes the need for serialization when moving data from one place to another, and enables memory mapping and very fast reads of data from disk. kdb+ supports a range of different architectures, but the standard or default model is still to use two main types of process together: real-time databases (rdbs), holding the data fully in the process's own memory space, and historical databases (hdbs), with the data serialized on disk and memory-mapped into the process's memory space.
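To make that concrete, here's a minimal sketch of the equivalence, using a hypothetical trade table splayed to a local db directory and then mapped straight back:

```q
/ build a small in-memory table
trade:([]time:3?.z.p;sym:3?`AAPL`MSFT;price:3?100f)

/ splay it to disk: syms are enumerated and each column becomes its own file
`:db/trade/ set .Q.en[`:db;trade]

/ load the directory: the same table is now memory-mapped, not copied into the heap
\l db
```

The on-disk column files hold exactly the bytes the in-memory columns do, which is why mapping them back is essentially free.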
My opening is intentionally provocative and a little tongue in cheek. I don't really believe we shouldn't use rdbs at all. A major strength of kdb+ as a technology is its flexibility, and kdb-x promises to add even more. The argument I'll make here is that rdbs should not be considered the default option in all kdb systems. They have always been one of the core process types in kdb, but for many problems they may not be required, and you should at least consider not using them. The approach isn't new, and is deployed in a number of installations and in some of KX's own components in their Insights product, but the details of when and where it's applicable aren't always clear.
Queries against data in memory are of course much faster, aren’t they? I put together a simple benchmark on fairly standard hardware with fairly standard data:
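(As a rough sketch of how such a benchmark is run in q; the data sizes and names here are placeholders rather than the actual benchmark setup:)

```q
n:10000000

/ the same data twice: once in the process's own memory, once splayed on disk
trade:([]time:asc n?.z.p;sym:n?`8;price:n?100f;size:n?1000)
`:db/trade/ set .Q.en[`:db;trade]

/ time the query in this (rdb-style) process...
\t:10 select max price by sym from trade
/ ...and run the same query in a separate process that has done \l db
```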
The key piece of context before we talk more about these results is that hdb processes often won't actually go all the way to disk to get their data. They memory-map files on disk; when a file is accessed, the OS first checks the page cache, and if the file is already in that cache it never touches the disk at all. So in all of the hdb results above the query is hitting the page cache instead of disk, i.e. the cache is warm. When the cache is warm the hdb isn't so different from the rdb: its data is in memory, just in the OS-level page cache instead of the process's own memory space. I've also added a third set of benchmarks where the cache is "cold", so we can see the difference when the OS really does have to go to disk.
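For the cold-cache runs the page cache has to be emptied between queries; on Linux that's done from the shell as root, shown here as comments in a q session:

```q
/ flush the OS page cache between runs (Linux, as root):
/   sync; echo 3 > /proc/sys/vm/drop_caches
/ the first query after the flush then genuinely reads from disk
\t select max price by sym from trade
```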
So what are the key takeaways from these benchmarks?

- With a warm page cache, hdb queries are only ever so slightly slower than the same queries against an rdb.
- With a cold cache, when the OS genuinely has to go to disk, queries are drastically slower.
As well as reading data, we also have to be able to add new data. New trades flow in during the day and we need to be able to add them. Here is another benchmark comparing the two setups:
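(As a sketch of what that comparison looks like, continuing the hypothetical schema from above; upsert works identically against an in-memory table and a splayed table on disk:)

```q
/ a batch of new rows to append
chunk:([]time:1000?.z.p;sym:1000?`8;price:1000?100f;size:1000?1000)

\t:100 `trade upsert chunk                   / insert into the in-memory table
\t:100 `:db/trade/ upsert .Q.en[`:db;chunk]  / append to the splayed table on disk
```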
The in-memory process has a clearer advantage here, but it's perhaps not as large as you might expect, and it's certainly much smaller than it was when storage meant slow spinning disks. Another crucial point: inserts into memory are a serial process. While an rdb is inserting new data it cannot serve queries. This is not true in the latter case: we can insert to a splayed table on disk while other processes simultaneously serve queries on the same data. Compute and memory are separated.
I know above I showed that the rdb is (ever so slightly) faster, but that's for one query. What if you have 2, or 10? An rdb is stuck running queries serially; it cannot run them concurrently. If your data is shared on disk (and in memory via the OS page cache), however, you can go parallel. Each individual query might take slightly longer, but you can run all 10 at once on separate processes. A query that is 10% slower on an hdb becomes 45% faster if you need to run it twice, because you can go parallel: two queries at 110ms each run in parallel take 110ms total, versus two queries at 100ms run serially taking 200ms. This is also a big architectural advantage if you want to support different types of use-case on the same system, for example research queries alongside more latency-sensitive trading queries, without them getting in each other's way.
We can shard and scale basically for free. rdbs have a fixed memory cost: if we want more processes we need more memory, and new processes are slow to start up because they have to read all current data into memory. When your data is stored only once, on disk and in the OS page cache, you can put 10 processes on top of it, or why not 100? These processes can start up almost instantly. This fundamentally separates memory and compute.
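A reader in this model is trivial to stand up, because memory mapping is lazy; something like this, with the db path hypothetical:

```q
/ each reader simply maps the shared on-disk database; startup is near-instant
/ because mapping reads no data until pages are actually touched
\l /data/db

/ data is then paged in on demand, from the page cache when it's warm
select sum size by sym from trade where date=.z.d
```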
You can of course still have gateway processes if you want, but you don’t need them to join rdb and hdb data together. It’s possible to have all your data accessible in one process. Users can spin up their own single process with access to all data, both live and historical.
There are some details to consider around reloading and mapping on-disk data while it's being appended to "live", but on the whole you have fewer processes and fewer process types. There's no coordination problem between rdbs, hdbs and gateways, so your system may well be simpler.
So let’s talk a little more about this…
Standard kdb+ architecture is the way it is for good reason. In a world where memory is far more expensive and all of your non-volatile storage is spinning disks, it makes a lot of sense to split your database in two: inserts against splayed tables on disk will be several orders of magnitude slower than inserts into memory, and any query that misses the OS page cache (a cold cache) will be drastically slower again, probably also by several orders of magnitude. But this is no longer the world we live in.
I’m proposing an alternate default architecture that looks more like this:
The tickerplant process is exactly the same as before, but all data goes straight to disk via “writer” processes (essentially TorQ wdb processes), and all data is read via database processes which mount the data on disk (TorQ hdbs).
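As a minimal sketch of the writer side (paths hypothetical, error handling and table initialisation omitted; a real TorQ wdb process does considerably more), the upd handler can append each update straight to a splayed table in today's live partition:

```q
dbroot:`:/data/db                        / hypothetical database root

/ upd receives (table name; new rows) from the tickerplant and appends them
/ straight to the splayed table in today's "live" partition on disk
upd:{[t;x]
  path:` sv dbroot,(`$string .z.d),t,`;  / e.g. `:/data/db/2025.01.01/trade/
  path upsert .Q.en[dbroot;x] }
```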
Let me be clear here that this isn't a revolutionary idea by any means. I'm sure there are kdb+ systems in operation today with a similar architecture. My argument is not that this is new, but rather that we should consider this alternative much more seriously and broadly.
Here’s a clearer picture showing how we have on-disk data separated into a “live partition” and all your standard historical data:
I believe this approach gives us a few significant advantages:

- Queries can run in parallel across as many database processes as you like, since the data is shared on disk and in the page cache.
- Scaling is close to free: readers have no fixed memory cost and start up almost instantly.
- Live and historical data are accessible from a single process, with no gateway needed to join them.
- There are fewer processes and fewer process types, so the overall system is simpler.
Of course everything in software engineering is trade-offs, and here the trade-offs we're making are:

- Inserts go to disk rather than memory, so ingest is slower than an rdb's.
- Individual queries are slightly slower than against an rdb even with a warm page cache, and drastically slower when the cache is cold.
- The whole approach leans on modern fast storage; on slow spinning disks it would not be viable.
There are also a few key technical considerations for "live append" databases (just dbs, no need to call them hdb and rdb anymore): readers have to reload and re-map on-disk data while it is being appended to live, and the day's data lives in a "live partition" alongside the standard historical partitions.
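The first of these can be sketched as follows: a mapped table's row count is fixed at the moment it's mapped, so readers periodically re-map to pick up freshly appended rows (the timer interval here is arbitrary):

```q
/ re-load the database root on a timer so newly appended rows and any new
/ partitions become visible; \l . re-maps everything in place
.z.ts:{system"l ."}
\t 1000   / hypothetical refresh interval: re-map once a second
```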
Based on the requirements of your specific system, the architecture above can then be expanded with realtime engines and/or cache processes to calculate analytics (for example VWAPs, bars etc.). These values can then be published back into the tickerplant and read from the database processes, or accessed directly.
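For instance, a sketch of the kind of VWAP query such a process might run (schema assumed, not taken from any particular system):

```q
/ per-sym VWAP over today's live partition, volume-weighted via wavg
select vwap:size wavg price by sym from trade where date=.z.d
```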
I don't want to dwell too much on these technical details here, so I'll aim to follow up with details on how to set up this kind of architecture with TorQ.
I’m not one for long conclusions, but hopefully I’ve made you at least consider an alternate default architecture for kdb+. “no rdb” architectures can be simpler, more scalable and nearly as fast. I’m certainly not saying this is the “correct” way, simply that it’s something we should consider a little more strongly. If anyone else has thoughts or wants to discuss these ideas further we’re always happy to hear from you!