Jonny Press
In April, James Corcoran invited me to be a guest on a panel at the London STAC Summit to discuss the question: “Are open data formats the key to a successful data strategy?”
In this blog I’ll outline some of the talking points and share some of Data Intellect’s thoughts and experiences working with open data formats, with a focus on our traditional playground: storing and analysing large volumes of timeseries data within a capital markets trading domain.
TL;DR: open data formats enable interoperability, avoid vendor lock-in, and allow analytics teams across an organisation to use a wider range of tools on the same data. However, where performance is paramount, and particularly in a real time workflow, proprietary formats still have a role to play. AI use cases will drive open data adoption.
An open file format is a structured way of representing data where the file format specification is published and freely available, so any piece of software can implement and use it without restriction. Open formats have been around for a long time and there are many examples, such as CSV, XML and JSON. We will focus on two column-oriented formats which have risen to prominence in the data analytics space: Apache Parquet and Apache Arrow.
Parquet is designed for efficient data storage (i.e. excellent compression) and retrieval. Arrow is a complementary technology designed for in-memory compute, but can also be persisted to file. Generally speaking, Arrow offers improved analytical performance at the expense of storage size.
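To make the distinction concrete, here is a minimal sketch using pyarrow; the table contents and file names are illustrative assumptions, but it shows the basic trade-off: Parquet is compressed for storage, while an Arrow IPC file can be memory-mapped and handed straight to compute.

```python
import datetime as dt
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as feather

# A tiny tick-style table, standing in for a much larger timeseries dataset.
ticks = pa.table({
    "sym": ["ABC", "ABC", "XYZ"],
    "time": pa.array([dt.datetime(2025, 4, 1, 9, 30, i) for i in range(3)],
                     type=pa.timestamp("us")),
    "price": [101.2, 101.3, 55.0],
})

# Parquet: columnar and heavily compressed, suited to long-term storage.
pq.write_table(ticks, "ticks.parquet", compression="zstd")

# Arrow IPC (Feather v2): typically larger on disk, but memory-mappable,
# so it can be scanned for analytics with minimal deserialisation cost.
feather.write_feather(ticks, "ticks.arrow", compression="uncompressed")
readback = feather.read_table("ticks.arrow", memory_map=True)
print(readback.num_rows, pq.read_table("ticks.parquet").num_rows)
```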
Open table formats build on the file formats, adding support for table-level operations including ACID transactions, schema evolution and time travel (data version history). There are multiple open table formats; the two we come across most frequently are Apache Iceberg and Delta Lake, both of which are usually underpinned by Parquet as the storage format. There has been an extended debate in the community around preferences for Delta Lake vs Iceberg, with each having their pros and cons. The majority of analytical tools support both, and there are efforts underway to ensure interoperability between the two formats.
Open table formats introduce pros and cons: the improved capability comes at a complexity and flexibility cost. In our home ground of managing large volumes of timeseries trading data, some of the features have more value than others. Having a good solution for schema evolution is important, but time travel is more of a nice-to-have. Generally we want to maintain the data exactly as received and to track changes within the structure of the data itself, rather than relying on querying data against a specific version number.
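As a rough sketch of what those table-level features look like in practice, the snippet below uses the deltalake Python package (delta-rs); the table path and columns are illustrative, and parameter details vary between library versions. Each write is committed as an ACID transaction and earlier versions remain queryable via time travel.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "./trades_delta"  # hypothetical local table location

# Version 0: initial load of the table.
write_deltalake(path, pd.DataFrame({"sym": ["ABC"], "price": [101.2]}))

# Version 1: a later append, committed as a separate transaction.
write_deltalake(path, pd.DataFrame({"sym": ["XYZ"], "price": [55.0]}),
                mode="append")

# Time travel: read the table as it stands now, and as it stood at version 0.
latest = DeltaTable(path).to_pandas()
as_of_v0 = DeltaTable(path, version=0).to_pandas()
print(len(latest), len(as_of_v0))
```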
But what are the reasons for choosing to adopt open data standards? Broadly, they are the ones in the TL;DR above: interoperability, avoiding vendor lock-in, and allowing analytics teams across an organisation to work with the same data using whichever tools suit them best.
Of course, there are cons to removing a database or data management layer – there’s a reason that databases have existed since the dawn of time. Open data formats provide a well understood mechanism for representing and interacting with data values, but they don’t provide a mechanism to understand the meaning of the data, the relationships between datasets (e.g. foreign keys), or help with any aspect of data governance. We’ve seen systems built on an open data foundation with an initially limited functional scope grow to become a mess, with datasets of unknown heritage scattered across storage mediums.
Openness and tooling availability will not solve all problems, and data management with good governance layers still has a fundamental role to play. There isn’t a standard approach; getting it right usually requires a combination of tooling and bespoke development.
Vlad Ilyushchenko from QuestDB and Justin Morris from Databricks were also on the panel.
Databricks are key players in the open data arena. Databricks’ underlying storage format is Delta tables, and ultimately Parquet. Data is stored in your own storage account, so it is separable from the Databricks software should you wish to migrate away, or to run analytics on the dataset directly outside of a Databricks environment. Databricks’ governance solution, Unity Catalog, was open sourced in 2024.
QuestDB persists historic data to Iceberg tables with Parquet as the underlying format. An intermediate native format is used to manage the most recently collected datasets. Data is written to the native store initially and queries are served from there, then periodically flushed to Iceberg and Parquet for longer-term access, either from QuestDB or any other supporting tool.
DuckDB is becoming a de facto standard for querying data stored in open data formats. It provides query access to a broad range of formats, including Parquet, Arrow, Delta and Iceberg. It also supports a native format which has shown a performance advantage in some of the testing we’ve done. They’ve recently announced their own open table format, DuckLake.
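As a quick illustration, the sketch below queries the illustrative files from the earlier snippets directly with DuckDB, assuming its delta extension can be installed in your environment; no load step or separate database is needed.

```python
import duckdb

con = duckdb.connect()

# Parquet files can be queried in place.
con.sql("SELECT sym, avg(price) AS avg_px FROM 'ticks.parquet' GROUP BY sym").show()

# Delta tables via the delta extension (a comparable extension exists for Iceberg).
con.sql("INSTALL delta")
con.sql("LOAD delta")
con.sql("SELECT count(*) AS n FROM delta_scan('./trades_delta')").show()
```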
Deephaven supports interoperability with Arrow, Parquet, Iceberg and Delta. It uses a proprietary format to support real time inserts to its live dataframes. After a configurable period of time (sub-second to a day is typical) the data is moved to Parquet under Iceberg for long-term access by both Deephaven and other tools.
KX’s kdb+ relies on a proprietary storage format. Their format provides high performance via memory mapping by default, with options to encrypt and/or compress data. KX have announced that in their next version of kdb+ they will support query engine native access to Parquet, opening the possibility for longer term history to be stored in an open format.
The future of trading data systems will belong to open data formats. The performance edge provided by native and proprietary formats dissolves over time, although making data immediately accessible as it arrives in real time is still a non-trivial problem without a standard solution in the open data ecosystem.