At Data Intellect we have been collaborating with AWS and KX, integrating our TorQ framework to work seamlessly with Amazon FinSpace with Managed kdb Insights. In this blog we will explain how we migrated TorQ, and give some pointers to anyone else who might be considering a kdb codebase or framework migration.
Capturing, persisting and analyzing real time data is a challenging problem. Daily data volumes can be fixed (such as a smart meter sending a reading on a pre-set schedule) or fluctuate (such as website traffic, or price quotes in foreign exchange markets). Data may be discarded, but given that future analytic use cases are usually unknown it is common to maintain data indefinitely at full granularity. Well built systems tend to attract additional users and business use cases.
If we look at some specific financial market examples, an intra-day profile of AAPL quote volume is shown below – note the high data rates at market open.
A daily view of AAPL quote volume is below. On most days, capture capacity for 400k updates would suffice, but there are many days with higher volumes, including a peak of 750k updates.
An ideal data capture system needs to be able to cope with:
The ideal solution will likely require flexibility and scalability at both the hardware and software level.
AWS and KX launched Managed kdb Insights in June 2023. kdb Insights is the timeseries database and analytics engine from KX, used by leading organisations to capture, analyze and store huge volumes (think multiple terabytes) of data in real time. Managed kdb Insights provides a managed environment for deploying new or existing kdb applications on AWS infrastructure. The driver behind Managed kdb Insights is to ease the development and operational burden on kdb applications by making it easy to configure, manage and run kdb workloads at scale on AWS.
At Data Intellect we like the approach of lowering the development burden and having a standardised approach to standard problems. That was one of the original drivers behind our kdb framework, TorQ, which is the backbone of many kdb data capture installations. Over recent months we have been collaborating with AWS and KX to ensure TorQ is fully interoperable with Managed kdb Insights. The target outcomes are to:
TorQ is our open source kdb+ framework, released in 2015. TorQ provides a default data capture architecture and solves a number of common problems encountered in a kdb+ system build out.
A basic example TorQ data capture architecture running on two Amazon Elastic Compute Cloud (Amazon EC2) instances is outlined below. A lot of TorQ deployments would be similar to this, and the architecture should be familiar to anyone who has used a kdb+ tick style data capture system (if it’s not familiar and you would like to learn more, you can read up on KX or get in touch). TorQ consists of a number of separate data capture and access components and in this example requires both local storage and shared storage.
Describing the detailed function of each component is outside of the scope of this article, but in summary:
Managed kdb Insights approaches kdb+ from a lowest-common-denominator viewpoint and doesn’t try to mandate a specific way of doing things, with the exception of using managed storage for data.
Managed kdb Insights provides several types of cluster to deploy kdb+ applications. A cluster is defined by its capabilities – primarily the resources it has available to it (for example managed storage, or locally attached disk). Each cluster can contain one or more nodes which map to kdb processes. A cluster can be configured to scale up or scale down the number of available nodes based on load.
Managed storage is the Managed kdb Insights approach to long term data storage. Managed storage is updated by pushing changesets, which allows a version history of the database to be built. The changeset mechanism also allows different versions of the data to be made available to different nodes. Managed Storage is built on Amazon S3 and leverages the associated resilience and scalability. An Amazon FSx caching layer can be deployed in conjunction with managed storage to provide a query performance improvement.
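The changeset idea can be illustrated with a small toy model. The sketch below is not the Managed kdb Insights API; the `VersionedStore` class, its methods and the example paths are hypothetical names invented for illustration. It only shows how an ordered list of changesets yields a version history, and how two nodes can read different versions of the same database.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Toy model of changeset-based storage (hypothetical, not the AWS API)."""

    changesets: list = field(default_factory=list)

    def push(self, changes: dict) -> int:
        """Record a changeset (path -> content, None means delete).

        Returns the version number created by this changeset.
        """
        self.changesets.append(dict(changes))
        return len(self.changesets)

    def view(self, version: int) -> dict:
        """Rebuild the database as of `version` by replaying changesets in order."""
        if not 0 <= version <= len(self.changesets):
            raise ValueError(f"unknown version {version}")
        state = {}
        for changeset in self.changesets[:version]:
            for path, content in changeset.items():
                if content is None:
                    state.pop(path, None)  # deletion
                else:
                    state[path] = content  # add or overwrite
        return state


store = VersionedStore()
v1 = store.push({"2024.01.02/trade": "day-1 trades"})
v2 = store.push({"2024.01.03/trade": "day-2 trades"})
v3 = store.push({"2024.01.02/trade": None})  # remove the first partition

# Two nodes pinned to different versions see different data.
node_a = store.view(v2)
node_b = store.view(v3)
print(sorted(node_a))
print(sorted(node_b))
```

Replaying changesets on every read is the simplest way to show the concept; a real system would not need to work this way.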
At General Availability, Managed kdb Insights provides four types of cluster:
Managed kdb Insights simplifies integrating your kdb applications into AWS tooling by centralizing the capture of logs and performance metrics in Amazon CloudWatch, using AWS Identity and Access Management (IAM) for access control to your databases and clusters, and aligning with the infrastructure-as-code paradigm through the AWS Command Line Interface (AWS CLI) or Terraform.
A key point with Managed kdb Insights is the automatic adherence to Information Security Policy best practice. In a non-managed environment these requirements fall within the remit of the kdb application development teams; they require management and are a diversion from delivering business value. These include:
A standard TorQ installation usually consists of two parts:
An example of (2) is the TorQ Finance Starter Pack (TorQ FSP), a Financial Services oriented data capture system that generates and stores mock market data. A common approach that several TorQ users have taken is to use this as the base for a production application, usually modifying only the schema and data feeds.
To migrate to Managed kdb Insights we needed to do two things:
The resulting MVP architecture is detailed below. It is in essence a stripped back version of the TorQ Finance Starter Pack which focusses on the RDB, HDB and Gateway query components. Some of the standard TorQ components, such as the Housekeeping process and the Reporter, are no longer required in Managed kdb Insights as the functionality they provide is already implemented in Managed kdb Insights or would be better implemented using already available AWS services such as AWS Lambda or AWS Glue.
The changes required for Managed kdb Insights integration were:
Managed kdb Insights provides a managed, secure environment for deploying kdb+ code. It removes the need for a kdb application development team to take on infrastructure-oriented tasks, and aligns by default with infosec best practice. It also opens up the world of AWS services to kdb+ applications. Specifically:
The nodes within a cluster require the memory size to be pre-defined. There are a number of components in the architecture with a memory footprint which is potentially large and unknown in advance. Cloud deployments such as Managed kdb Insights remove the need to estimate how big the problem might get over the next N years when procuring hardware, instead requiring an estimate for the coming day or week.
A memory scaling approach could be built by the developer within Managed kdb Insights. The metrics published to CloudWatch can be analysed and compared to the historic profile, and if thresholds are predicted to be breached a larger replacement cluster can be brought up in the background and workloads migrated without downtime. However, it would be better if this was done automatically and the Managed kdb Insights cluster size grew with application demands.
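The decision logic behind such a developer-built approach could look roughly like the sketch below. It is an assumption-laden illustration rather than our implementation: the function names, the historic profile, the node sizes and the 80% headroom figure are all hypothetical, and fetching metrics from CloudWatch or creating a replacement cluster is deliberately left out.

```python
def projected_peak_gb(current_gb: float, bucket: int, profile: list) -> float:
    """Project end-of-day memory from the current reading.

    `profile[bucket]` is the fraction of the day's final memory footprint
    that has historically been reached by this time bucket.
    """
    fraction = profile[bucket]
    if fraction <= 0:
        raise ValueError("profile fractions must be positive")
    return current_gb / fraction


def choose_node_size(projected_gb: float, sizes_gb: list, headroom: float = 0.8):
    """Return the smallest node size that keeps usage under the headroom limit.

    Returns None when no available size is large enough.
    """
    for size in sorted(sizes_gb):
        if projected_gb <= size * headroom:
            return size
    return None


# Hypothetical inputs: the historic intraday profile, the latest metric reading,
# and the node sizes on offer.
profile = [0.10, 0.25, 0.50, 0.75, 1.00]
current_node_gb = 64
projected = projected_peak_gb(current_gb=20.0, bucket=1, profile=profile)
target = choose_node_size(projected, sizes_gb=[32, 64, 128, 256])

if target is not None and target > current_node_gb:
    print(f"projected {projected:.0f} GB: bring up a {target} GB replacement")
```

With these made-up numbers the projection is 80 GB, which exceeds 80% of the current 64 GB node, so the sketch recommends the 128 GB size.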
The DI Team have enjoyed getting to grips with Managed kdb Insights and working with the AWS team. The AWS team have educated us on their product and approach, and our feature requests and modified approaches have been received well.
Throughout 2024 we plan to incorporate Managed kdb Insights roadmap items as they reach General Availability. We are also going to experiment with the TorQ architecture to investigate how we can fully utilize the flexibility that Managed kdb Insights provides. In the pipeline we have a new approach for managing intraday data which will be specific to Managed kdb Insights.
The easiest way to get started is to install TorQ and download the TorQ Amazon FinSpace Starter Pack. Full startup instructions can be found in the TorQ Amazon FinSpace Starter Pack documentation. The full set of changes required to make TorQ work with Managed kdb Insights can be found in this Pull Request.