Introducing the TORQ IDB | KDB+ Data Ingestion

Read time:
10 minutes

Gabriel Roels

The Intraday Database (IDB) component has recently made its debut in TorQ, bringing with it new features and opportunities to optimise business solutions. In this blog, we will focus on this new TorQ component to highlight how the IDB could aid in tackling massive datasets.

What is an IDB?

An IDB is an on-disk database that sits between the WDB and HDB and deals with data that is downstream of the Tickerplant and WDB but has not yet been archived in the HDB.

Within TorQ, the IDB process gets all of its configuration from an existing WDB, so once you have a running WDB you’re ready to go!

Why Would I Use an IDB?

All data solutions must contend with a simple principle: RAM is fast but expensive, and disk space tends to be cheaper.

Following this principle, it is normally good practice to reserve valuable RAM for data that is needed by extremely latency-sensitive queries. This is an issue in a simple RDB-HDB architecture because it forces users either to keep all their intraday data in RAM (bloating it) or to jerry-rig a solution where portions of today’s data are archived to the HDB early, which undermines the definition of the HDB as a complete, historical archive.

This is exactly the problem that the IDB is designed to solve. It allows users to store intraday data on disk while keeping it logically separate until it is time to archive to the HDB, reserving the RDB for data that really needs it. Maybe you have ten tables in your RDB but nine of them aren’t latency sensitive. Send ’em to the IDB!

IDBs are also much more scalable than RDBs for handling intraday data, especially when redundancy or load-balancing is a requirement. It’s easy to spin up multiple IDB processes to load-balance incoming queries.

On top of all this, advances in modern disk storage mean that the speed gap between querying data in memory and on disk has reduced substantially. Fast disk storage options are perfect for running an IDB while keeping the HDB on a larger, cheaper-per-byte filesystem more suited to archiving.

Finally, WDBs in TorQ systems are already writing intraday data to disk throughout the day to ease the end-of-day write procedure, and the IDB now allows you to put that data to use. In many ways, when your data is being written to disk anyway, the question should be “Why wouldn’t I use one?”

Integrating an IDB into Existing TorQ Set-Ups

IDBs are receive-only components: WDBs broadcast to them asynchronously. This one-way communication allows many IDBs to be spun up and connected to a live WDB without affecting any other services (the WDB itself is barely affected because the broadcast is asynchronous).

As such, these are the only steps that need to be taken to introduce an IDB to a running TorQ stack (without taking anything offline):

1. Ensure the WDB is using the default or partbyenum writedown method.
2. Add an IDB entry to your process.csv.
3. Start the IDB process and watch it automatically connect.
4. Done! The IDB is ready to use directly or via the gateway.
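
As a rough illustration of the process.csv step, an IDB row might look something like the one below. This is a sketch only: the header is the column layout commonly seen in TorQ process.csv files, while the port offset, the process name idb1, the accesslist path and the idb.q script location are all assumptions made for illustration. Copy the conventions from the WDB row in your own file rather than from here.

```csv
host,port,proctype,procname,U,localtime,g,T,w,load,startwithall,extras,qcmd
localhost,{KDBBASEPORT}+20,idb,idb1,${TORQAPPHOME}/appconfig/passwords/accesslist.txt,1,1,180,,${KDBCODE}/processes/idb.q,1,,q
```

The only parts that matter for this example are the proctype (idb) and a procname that is unique within the stack; everything else should mirror your existing entries.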

I think it’s safe to say that you couldn’t get much simpler!

Querying the IDB

Querying the IDB is as easy as querying the RDB or HDB. If using the partbyenum writedown method, the maptoint function is provided to handle symbol encoding.

neg[gwHandle](`.gw.asyncexec;"select from trade where int=maptoint[`GOOG]";`idb);gwHandle[]

Overlapping Data: Considerations

IDB data necessarily overlaps with data within the RDB, which means gateways need to bear in mind exactly how they load-balance incoming queries to avoid duplicate data being returned.

Here is a simple rule of thumb that avoids this issue entirely: unless you have verified that the results cannot overlap, avoid directing the same query to both an IDB and an RDB.
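
To see why the overlap matters, here is a small self-contained sketch. The two toy tables are made up for illustration and simply stand in for what an RDB and an IDB might each return for the same query once some rows have been written to disk.

```q
/ toy result from the RDB: every trade received so far today
rdbRows:([]time:09:00 09:01 09:02;sym:`GOOG;price:100 101 102f)

/ toy result from the IDB: the first two trades have already been written to disk
idbRows:2#rdbRows

/ naively joining both results counts the overlapping rows twice
count rdbRows,idbRows            / 5 rows, although only 3 trades exist

/ de-duplicating recovers the true count for this toy data
count distinct rdbRows,idbRows   / 3
```

Note that distinct is not a general fix, since real data can contain genuinely identical rows. Routing each query to one process type or the other remains the safer approach.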

On-Disk Details

On startup, the IDB automatically connects to a WDB which supplies the required configuration such as IDB database location, HDB database location and writedown mode. Once this is received, the IDB will load the sym-file from the HDB and then load all intraday data written by the WDB on disk. The fact that the IDB and HDB share a sym-file is a key design decision, one which will be expanded upon later on.

It is worth noting that data is not removed from the RDB when it is written to the IDB. Users will have to set up their own functionality to do that.
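
What that functionality looks like will depend on your system. As a minimal sketch of the idea, assuming an in-memory table called trade with a time column and a cutoff that you know has already been written to disk (both hypothetical here), rows could be purged in place like this:

```q
/ toy in-memory table standing in for an RDB table
trade:([]time:09:00 09:05 09:10;sym:`GOOG`AAPL`GOOG;price:100 200 101f)

/ assumption: everything before this time is already safely on disk
cutoff:09:05

/ delete the older rows in place (the backtick makes the change persistent)
delete from `trade where time<cutoff

count trade   / 2 rows remain
```

The hard part in practice is choosing the cutoff safely, so that nothing is removed from memory before it is queryable from the IDB. This sketch does not attempt to solve that.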

Write Modes

The IDB currently works with WDBs that have one of two write-modes: default or partbyenum.

default

When using the default write-mode, the IDB behaves just like an intraday HDB with the same expected structure (except for the lack of its own sym-file).

partbyenum

When using the partbyenum write-mode, the IDB will be partitioned by symbol (as ints on-disk). This comes with the following benefits:

It allows for fast and memory-efficient lookups for queries that filter first by int (each int maps to a symbol). Such queries must use the maptoint function to locate the correct partition on disk.
It means the symbol-partitioning work that is commonly part of the end-of-day save to the HDB has already been done during the day.
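
The idea of mapping a symbol to an int partition can be illustrated with a toy example. This is only a conceptual sketch, not TorQ's implementation of maptoint: it assumes the int is simply the symbol's position in the sym list, and both the sym list contents and the toyMapToInt name are made up.

```q
/ toy sym list standing in for the contents of the shared sym file
sym:`AAPL`GOOG`MSFT

/ hypothetical stand-in for maptoint: position of the symbol in the sym list
toyMapToInt:{[s] sym?s}

toyMapToInt[`GOOG]   / 1, so GOOG data would live under partition 1
```

Because a query such as select from trade where int=maptoint[`GOOG] names a single partition, only that directory needs to be read from disk.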

Write Frequency

The WDB writes the data it has received to the IDB at regular time intervals, the length of which can be configured on the WDB itself. As with everything in computing, there are trade-offs between longer and shorter write intervals:

Shorter intervals between writes will keep the IDB more up to date with the latest data, and each write will consume less memory and generally take less time.
Longer intervals between writes will reduce the potential load on the processor, filesystem (and potentially the network) by reducing the number of write jobs.
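
A quick back-of-the-envelope calculation makes the trade-off concrete. The ingest rate and the candidate intervals below are hypothetical figures chosen purely for illustration.

```q
/ hypothetical figures, purely for illustration
rowsPerSec:5000
intervalSecs:60 300 3600

/ number of writedowns in a 24-hour day for each interval
writesPerDay:86400 div intervalSecs    / 1440 288 24

/ rows accumulated in memory between consecutive writedowns
rowsPerWrite:rowsPerSec*intervalSecs   / 300000 1500000 18000000

/ side-by-side view of the three options
([]intervalSecs;writesPerDay;rowsPerWrite)
```

Moving from one-minute to one-hour writes cuts the number of write jobs by a factor of 60, but each job then has 60 times as many rows to hold in memory and flush.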

Shared Sym File

The IDB data on disk is written by the WDB, which encodes symbols using the same sym-file as the HDB. This ensures both datasets are always trivially compatible.

Figure: The IDB and HDB share a sym file

Reloading the DB and Sym File

After every writedown event, the WDB automatically triggers the IDB to reload its on-disk data. To prevent unnecessary reloads, the IDB applies the following rules:

The sym file will only be reloaded if it has changed size on disk.
The data will only be reloaded if the number of partitions (partitioncount) has changed.

If neither of these conditions is met but a reload still needs to be forced, there is a separate function which accomplishes this.
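
The two reload rules can be sketched as a small pure function. This is an illustration of the logic only, not the IDB's actual code: the function name, the dictionary keys and the example values are all assumptions.

```q
/ hypothetical snapshots of on-disk state: sym file size in bytes and partition count
/ (in a real process these might come from hcount on the sym file and count key on the db directory)
prevState:`symsize`partitioncount!1024 3
currState:`symsize`partitioncount!1024 4

/ decide independently whether the sym file and the data need reloading
needsReload:{[p;c] `sym`data!(p[`symsize]<>c`symsize;p[`partitioncount]<>c`partitioncount)}

needsReload[prevState;currState]   / sym is 0b (size unchanged), data is 1b (new partition)
```

Keeping the two checks separate means a writedown that only adds rows for already-known symbols never forces a sym file reload.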

Conclusions

Expanding the tools available for users to tackle data-related problems is always a good thing and our bread-and-butter at Data Intellect! The IDB is a fantastic option for users who need intraday data to massively scale, users who need to limit the RAM usage of their intraday datasets and even those who feel safer with an extra layer of redundancy.

As always, if you have any questions or need help with implementing an IDB, feel free to get in touch!