Jonny Press
In April, James Corcoran invited me to be a guest on a panel at the London STAC Summit to discuss the question: “Are open data formats the key to a successful data strategy?”
In this blog I’ll outline some of the talking points and share some of Data Intellect’s thoughts and experiences working with open data formats, with a focus on our traditional playground: storing and analysing large volumes of timeseries data within a capital markets trading domain.
TL;DR: open data formats enable interoperability, avoid vendor lock-in, and allow analytics teams across an organisation to use a wider range of tools on the same data. However, where performance is paramount, and particularly in a real time workflow, proprietary formats still have a role to play. AI use cases will drive open data adoption.
An open file format is a structured way of representing data where the file format specification is published and freely available, so any piece of software can implement and use it without restriction. Open formats have been around for a long time and there are many examples, such as CSV, XML and JSON. We will focus on two column-oriented formats which have risen to prominence in the data analytics space: Apache Parquet and Apache Arrow.
Parquet is designed for efficient data storage (i.e. excellent compression) and retrieval. Arrow is a complementary technology designed for in-memory compute, but can also be persisted to file. Generally speaking, Arrow offers improved analytical performance at the expense of storage size.
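To make the distinction concrete, here is a minimal sketch using pyarrow; the table contents and file names are illustrative assumptions, but it shows the basic trade-off: Parquet is compressed for storage, while an Arrow IPC file can be memory-mapped and handed straight to compute.

```python
import datetime as dt
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as feather

# A tiny tick-style table, standing in for a much larger timeseries dataset.
ticks = pa.table({
    "sym": ["ABC", "ABC", "XYZ"],
    "time": pa.array([dt.datetime(2025, 4, 1, 9, 30, i) for i in range(3)],
                     type=pa.timestamp("us")),
    "price": [101.2, 101.3, 55.0],
})

# Parquet: columnar and heavily compressed, suited to long-term storage.
pq.write_table(ticks, "ticks.parquet", compression="zstd")

# Arrow IPC (Feather v2): typically larger on disk, but memory-mappable,
# so it can be scanned for analytics with minimal deserialisation cost.
feather.write_feather(ticks, "ticks.arrow", compression="uncompressed")
readback = feather.read_table("ticks.arrow", memory_map=True)
print(readback.num_rows, pq.read_table("ticks.parquet").num_rows)
```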
Open table formats build on the file formats, adding support for table-level operations including ACID transactions, schema evolution and time travel (data version history). There are multiple open table formats; the two we come across most frequently are Apache Iceberg and Delta Lake, both of which are usually underpinned by Parquet as the storage format. There has been an extended debate in the community around preferences for Delta Lake vs Iceberg, with each having their pros and cons. The majority of analytical tools support both, and there are efforts underway to ensure interoperability between the two formats.
Open table formats introduce pros and cons: the improved capability comes at a complexity and flexibility cost. In our home ground of managing large volumes of timeseries trading data, some of the features have more value than others. Having a good solution for schema evolution is important, but time travel is more of a nice-to-have. Generally we want to maintain the data exactly as received and to track changes within the structure of the data itself, rather than relying on querying data against a specific version number.
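As a rough sketch of what those table-level features look like in practice, the snippet below uses the deltalake Python package (delta-rs); the table path and columns are illustrative, and parameter details vary between library versions. Each write is committed as an ACID transaction and earlier versions remain queryable via time travel.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "./trades_delta"  # hypothetical local table location

# Version 0: initial load of the table.
write_deltalake(path, pd.DataFrame({"sym": ["ABC"], "price": [101.2]}))

# Version 1: a later append, committed as a separate transaction.
write_deltalake(path, pd.DataFrame({"sym": ["XYZ"], "price": [55.0]}),
                mode="append")

# Time travel: read the table as it stands now, and as it stood at version 0.
latest = DeltaTable(path).to_pandas()
as_of_v0 = DeltaTable(path, version=0).to_pandas()
print(len(latest), len(as_of_v0))
```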
But what are the reasons for choosing to adopt open data standards? Broadly, they are the ones in the TL;DR above: interoperability, avoiding vendor lock-in, and allowing analytics teams across an organisation to work with the same data using whichever tools suit them best.
Of course, there are cons to removing a database or data management layer – there’s a reason that databases have existed since the dawn of time. Open data formats provide a well understood mechanism for representing and interacting with data values, but they don’t provide a mechanism to understand the meaning of the data, the relationships between datasets (e.g. foreign keys), or help with any aspect of data governance. We’ve seen systems built on an open data foundation with an initially limited functional scope grow to become a mess, with datasets of unknown heritage scattered across storage mediums.
Openness and tooling availability will not solve all problems, and data management with good governance layers still has a fundamental role to play. There isn’t a standard approach; getting it right usually requires a combination of tooling and bespoke development.
Vlad Ilyushchenko from QuestDB and Justin Morris from Databricks were also on the panel.
Databricks are key players in the open data arena. Databricks’ underlying storage format is Delta tables, and ultimately Parquet. Data is stored in your own storage account, so it is separable from the Databricks software should you wish to migrate away, or to run analytics on the dataset directly outside of a Databricks environment. Databricks’ governance solution, Unity Catalog, was open sourced in 2024.
QuestDB persists historic data to Iceberg tables with Parquet as the underlying format. An intermediate native format is used to manage the most recently collected datasets. Data is written to the native store initially and queries are served from there, then periodically flushed to Iceberg and Parquet for longer-term access, either from QuestDB or any other supporting tool.
DuckDB is becoming a de facto standard for querying data stored in open data formats. It provides query access to a broad range of formats, including Parquet, Arrow, Delta and Iceberg. It also supports a native format which has shown a performance advantage in some of the testing we’ve done. They’ve recently announced their own open table format, DuckLake.
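As a quick illustration, the sketch below queries the illustrative files from the earlier snippets directly with DuckDB, assuming its delta extension can be installed in your environment; no load step or separate database is needed.

```python
import duckdb

con = duckdb.connect()

# Parquet files can be queried in place.
con.sql("SELECT sym, avg(price) AS avg_px FROM 'ticks.parquet' GROUP BY sym").show()

# Delta tables via the delta extension (a comparable extension exists for Iceberg).
con.sql("INSTALL delta")
con.sql("LOAD delta")
con.sql("SELECT count(*) AS n FROM delta_scan('./trades_delta')").show()
```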
Deephaven supports interoperability with Arrow, Parquet, Iceberg and Delta. It uses a proprietary format to support real time inserts to its live dataframes. After a configurable period of time (sub-second to a day is typical) the data is moved to Parquet under Iceberg for long-term access by both Deephaven and other tools.
KX’s kdb+ relies on a proprietary storage format. Their format provides high performance via memory mapping by default, with options to encrypt and/or compress data. KX have announced that in their next version of kdb+ they will support query engine native access to Parquet, opening the possibility for longer term history to be stored in an open format.
The future of trading data systems will belong to open data formats. The performance edge provided by native and proprietary formats dissolves over time, although making data immediately accessible as it arrives in real time is still a non-trivial problem without a standard solution in the open data ecosystem.