We are delighted to announce the release of the Data Quality System to TorQ, our kdb+ framework. The system allows developers to easily set up customizable checks on the quality of data within their database system.
TorQ already provides monitoring tools, but until now they were mainly used to monitor the health of individual processes. Checks would test whether a process was still running, but gave no insight into the quality of the data that process captured. The monitoring tools were capable of performing data quality checks, but this use was not well supported. This release builds on those capabilities by giving developers a formalized framework that includes pre-defined checks and user-defined breach conditions. Developers can also write custom tests.
Examples of pre-defined checks include verifying that table schemas meet expectations and detecting anomalies such as a bid price that is higher than the ask price. Both can be parameterized by the developer.
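To make the two example checks concrete, here is a minimal sketch of the idea in Python. It is not TorQ's implementation or API (TorQ checks are written in q); the function names, the row-of-dicts table representation and the `max_breach_pct` parameter are all assumptions made for illustration. Each check returns a pass/fail flag together with a short description.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for a table: a list of rows keyed by column name.
Row = Dict[str, Any]
CheckResult = Tuple[bool, str]


def check_schema(rows: List[Row], expected: Dict[str, type]) -> CheckResult:
    """Pass only if every row has exactly the expected columns and types."""
    for i, row in enumerate(rows):
        if set(row) != set(expected):
            return False, f"row {i}: columns {sorted(row)} != {sorted(expected)}"
        for col, typ in expected.items():
            if not isinstance(row[col], typ):
                return False, f"row {i}: column {col!r} is not {typ.__name__}"
    return True, "schema matches expectation"


def check_bid_below_ask(rows: List[Row], max_breach_pct: float = 0.0) -> CheckResult:
    """Pass if the share of rows with bid > ask does not exceed max_breach_pct."""
    if not rows:
        return True, "no rows to check"
    crossed = sum(1 for r in rows if r["bid"] > r["ask"])
    pct = 100.0 * crossed / len(rows)
    return pct <= max_breach_pct, f"{pct:.1f}% of rows have bid > ask"


# Illustrative data only: the second quote is crossed (bid above ask).
quotes = [
    {"sym": "AAPL", "bid": 99.5, "ask": 100.0},
    {"sym": "MSFT", "bid": 50.2, "ask": 50.1},
]
schema_result = check_schema(quotes, {"sym": str, "bid": float, "ask": float})
spread_result = check_bid_below_ask(quotes, max_breach_pct=10.0)
```

The threshold argument plays the role of a user-defined breach condition: the same check can be strict for one table and tolerant for another.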
Data Quality System
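The Data Quality System consists of four processes that work in tandem alongside a typical TorQ setup. These four processes are the Data Quality Checker (DQC) and its database (DQCDB), and the Data Quality Engine (DQE) and its database (DQEDB).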
However, the Data Quality System can generally be thought of as just the DQC and the DQE (with their associated databases).
Within the Data Quality System, each process is responsible for different tasks, and the purpose of each is explained in the sections below. Together, the four processes let the user check the quality of data captured by other TorQ processes and of data held in external databases.
DATA QUALITY CHECKER (DQC)
The DQC is responsible for running periodic checks on other TorQ or kdb+ processes. At the heart of the DQC is a results table, which records the checks that have been performed, the process each check ran on, the result of the check (true or false) and a short description of that result. Each check also has a specified start time, end time and run interval. For example, you could set a check to run every five minutes from 9AM until the end of the day. The configuration data and the results table are persisted to disk and made available in the HDB process of the Checker (DQCDB). Once persisted to the DQCDB, this data lets the user see whether there are any data quality issues.
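The scheduling and results-table ideas above can be sketched as follows. This is an illustrative Python model only, not the DQC's actual q code: the column names, the `run_times` and `record_check` helpers, the check name `tableNotEmpty` and the process name `rdb1` are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Any, Callable, Dict, List, Tuple

CheckResult = Tuple[bool, str]


def run_times(start: datetime, end: datetime, interval: timedelta) -> List[datetime]:
    """All scheduled run times from start up to and including end."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    times = []
    t = start
    while t <= end:
        times.append(t)
        t += interval
    return times


def record_check(
    results: List[Dict[str, Any]],
    name: str,
    process: str,
    check: Callable[[], CheckResult],
    when: datetime,
) -> None:
    """Run one check and append its outcome to the results table."""
    passed, description = check()
    results.append(
        {"time": when, "check": name, "process": process,
         "result": passed, "description": description}
    )


def quote_table_not_empty() -> CheckResult:
    row_count = 0  # hypothetical: pretend the remote table came back empty
    return row_count > 0, f"quote table has {row_count} rows"


# Every five minutes from 9AM until the end of the day.
schedule = run_times(
    datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 23, 59), timedelta(minutes=5)
)

results: List[Dict[str, Any]] = []
for when in schedule[:3]:  # only the first three runs, to keep the example short
    record_check(results, "tableNotEmpty", "rdb1", quote_table_not_empty, when)
```

In a real deployment the runs would be driven by a timer rather than a precomputed list, and the results would be persisted to disk rather than held in memory.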
DATA QUALITY ENGINE (DQE)
The DQE is responsible for running queries on other TorQ processes to gather daily statistics. The query results are stored in resultstab, which records the query that was run, the process and table it was run on, and the statistics gathered. The resultstab is then persisted to disk and made available in the HDB process of the Engine (DQEDB). The Checker then uses the data in DQEDB to perform advanced checks on other TorQ processes.
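One way to picture how stored daily statistics enable an advanced check is a comparison of today's value against recent history. The sketch below is an assumption about what such a check might look like, written in Python for illustration; the helper names, the choice of statistics and the tolerance parameter are hypothetical and do not describe the actual contents of resultstab.

```python
from statistics import mean
from typing import Dict, List, Tuple


def daily_stats(values: List[float]) -> Dict[str, float]:
    """Summary statistics for one column on one day (engine side)."""
    if not values:
        return {"count": 0.0}
    return {
        "count": float(len(values)),
        "min": min(values),
        "max": max(values),
        "avg": mean(values),
    }


def compare_to_history(
    today: float, history: List[float], tolerance_pct: float
) -> Tuple[bool, str]:
    """Checker side: pass if today's value is within tolerance of the average."""
    if not history:
        return True, "no history to compare against"
    baseline = mean(history)
    if baseline == 0:
        return today == 0, "historical average is zero"
    deviation = 100.0 * abs(today - baseline) / abs(baseline)
    return (
        deviation <= tolerance_pct,
        f"value deviates {deviation:.1f}% from the {len(history)}-day average",
    )


# Illustrative numbers only.
stats = daily_stats([10.0, 12.0, 11.0])
row_count_check = compare_to_history(
    today=400.0, history=[1000.0, 1100.0, 900.0], tolerance_pct=25.0
)
```

Keeping the statistics in a separate database means a check like this does not need to re-query historical data on the monitored process each time it runs.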
Below is an overview of the architecture of the Data Quality System:
The documentation of the system can be found here:
An example DQE database can be found here:
If you would like to find out more about the system, feel free to contact us at firstname.lastname@example.org. An associated press release is here.