Introduction to Starburst

Carla McLaverty

IN THE BEGINNING...THERE WAS AN INTRODUCTION

Starburst Data is a rapidly rising star within the data world, and here at Data Intellect we decided to have a look for ourselves and see what all the fuss was about!

In principle, Starburst Data provides a distributed SQL query engine, primarily for in-memory processing (nothing too exciting yet…). However, when we look at it as a data lake analysis tool, it gets a lot more interesting.

UNVEILING COSMIC BENEFITS

So, why is it useful?

Starburst identified a problem in the data market: the impossibility of true data centralisation. With the number of data sources and stores ever increasing, Starburst wagers that attempting to centralise all of them into one on-prem or cloud database is highly inefficient, costly and slow, and carries the risk of vendor lock-in. There are also legal and latency considerations that, in some cases, make it unavoidable for data to be held in specific locations. This was already a problem when Starburst was founded (2017), and it has only grown with the ever-increasing scale of data that needs to be stored and queried.

Starburst decided to work with, rather than against, the distributed nature of the data landscape and built its platform on Trino, an open-source query engine that grew out of Presto (a project initially developed at Facebook). Trino allows users to query data wherever it lives, whether on-prem or across several cloud sources, and presents it as a single virtual, federated data layer that can be queried across.
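To make the idea of a single federated layer a little more concrete, here is a minimal local sketch. It is not Starburst or Trino code: it uses Python's built-in sqlite3 module and two attached in-memory databases as stand-ins for two separate sources, and the source names (crm, trading), tables and figures are all invented for illustration. The point is only that one SQL statement can join tables that live in different stores by qualifying each table with the name of its source.

```python
import sqlite3

# Stand-in for a federated engine: one connection that can see several
# independent databases. In a real deployment these would be separate
# systems; here they are two attached in-memory SQLite databases.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")      # hypothetical source 1
conn.execute("ATTACH DATABASE ':memory:' AS trading")  # hypothetical source 2

# Each source holds its own table (invented sample data).
conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE trading.orders (customer_id INTEGER, notional REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Alpha Fund"), (2, "Beta Capital")],
)
conn.executemany(
    "INSERT INTO trading.orders VALUES (?, ?)",
    [(1, 100.0), (1, 250.0), (2, 75.0)],
)

# A single query spanning both sources: each table is addressed by the
# name of the source it lives in, and the join happens in one statement.
federated_sql = """
    SELECT c.name, SUM(o.notional) AS total_notional
    FROM crm.customers AS c
    JOIN trading.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
"""
rows = conn.execute(federated_sql).fetchall()
print(rows)  # [('Alpha Fund', 350.0), ('Beta Capital', 75.0)]
```

In a federated engine the same pattern applies at a larger scale, with the engine rather than a local file library deciding how to fetch and combine the data from each source.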

Starburst also released a corporate version of Trino: Starburst Enterprise, with additional features and optimisations, and more recently Starburst Galaxy. Starburst Galaxy is their SaaS offering, querying only across cloud sources.

I’ve mentioned data federation, another buzzword at the moment. This refers to the fact that data is first analysed locally by the organisation that holds it; the resulting insights are then aggregated, or statistical summaries/gradients are encrypted, before being shared (in this case with Starburst). This preserves the privacy and decentralisation of the data, allowing it to be shared and analysed whilst respecting legal, privacy and security requirements.
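As a rough illustration of "analyse locally, share only summaries", the sketch below has two hypothetical banks each compute a count and a total over their own transactions, and then combines those two summaries into an overall mean. The class and function names and the figures are invented for this example, and the encryption step mentioned above is deliberately left out; this is a toy model of the idea, not how Starburst implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalSummary:
    """What a participant shares: aggregates only, never raw rows."""
    count: int
    total: float


def summarise_locally(transaction_amounts):
    """Runs inside each organisation, on data that never leaves it."""
    return LocalSummary(count=len(transaction_amounts),
                        total=sum(transaction_amounts))


def combined_mean(summaries):
    """Runs centrally, and only ever sees the shared summaries."""
    count = sum(s.count for s in summaries)
    total = sum(s.total for s in summaries)
    return total / count if count else 0.0


# Invented sample data held separately by two hypothetical banks.
bank_a = [120.0, 80.0, 100.0]
bank_b = [300.0, 100.0]

shared = [summarise_locally(bank_a), summarise_locally(bank_b)]
overall_mean = combined_mean(shared)
print(overall_mean)  # 140.0
```

The combined mean matches what you would get from pooling all five transactions, yet neither bank has revealed an individual transaction to the other.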

Within the financial industry, federated analysis carries many benefits, including:

Fraud detection: Multiple financial institutions can collaborate to identify patterns and detect fraudulent activities across their customer base without sharing personally identifiable information or transaction details.
Risk assessment: Financial institutions can pool anonymised customer data to analyse and assess risks associated with lending or credit.
Market analysis: Financial institutions can collaborate to analyse market trends, customer behaviour, and investment patterns without sharing confidential trading or client information.
Anti-money laundering (AML): Multiple banks can work together to analyse transactions and identify potential money laundering activities while maintaining data privacy.

Starburst has published a case study of a global investment bank that reports using Starburst for many of the reasons stated above, adding that it enabled the bank to rapidly test, update and deploy new AML models.

There are many similarly varied benefits across a range of industries, which are detailed here: Starburst case studies.

Another question surrounding centralisation vs decentralisation regards who has access to, and control of, the data. Should this be reserved to one group or distributed amongst different units working with the same data? Starburst landed on the data mesh as a solution.

Data mesh endorses a decentralised approach, where individual data streams are encouraged to branch out from the traditional data lake monolith and be treated as their own products (usually data sets, although occasionally APIs or applications). Distributing responsibility across domains in this way promotes data sharing and increases trust in the data, because its curators have contextual knowledge of it.

[ If you’re interested in learning more about data mesh, please read our blog: Is your kdb+ data a data mesh? ].

Within Starburst Galaxy, they have implemented ‘catalog’ features which make it easy to build these data products. Catalogs have data/schema discovery capabilities (which are also available within Starburst Enterprise) and can be placed within clusters, enabling role-based access control.

INTERGALACTIC INVESTIGATION

Following this initial background investigation, we decided to go ahead and have a play around with Starburst Galaxy.

The setup was very simple following their ‘Get Started’ guide. You first set up an account and domain, then connect to your chosen cloud sources. Once you’ve connected a source, you’ll be asked to create a catalog and place your source within it, and then to place one or more catalogs within a cluster of your creation. It is recommended that your data sources and cluster sit with the same cloud provider and in the same region, to enable optimal performance and avoid unnecessary data transfer costs. Once these are configured, the data can be queried in SQL in the usual manner, by enabling the cluster and then accessing the catalog and its nested schemas and tables. RBAC can be controlled throughout this process, at domain or cluster level.

Initially, we followed the examples laid out by Starburst in their tutorials section to get the gist of Galaxy’s quirks and capabilities. We ran simple queries on one of their sample databases, followed by some federated queries across two different sample data sources. We found that both of these processes ran smoothly, aided by the clear documentation.

We then connected our own cloud source (BigQuery) and experimented with some simple queries on our selected financial database. Again, we found Galaxy easy to navigate, with latency varying between 2 and 3 seconds per simple query (including the unavoidable 1 second delay imparted by BigQuery itself). We are aware that this is not the most representative use case, as Starburst is optimised for querying across multiple sources rather than just one… but we wanted to give it a go anyway. In Starburst’s own collated source (federated query) example, we saw processing speeds of up to 174k rows/s (for our simple queries this figure was around 4.5k rows/s), and we assume that this is much more representative of the software’s capabilities.
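For a sense of scale, the figures quoted above can be put side by side with a few lines of arithmetic. The variable names are our own, and the comparison is only rough: the 174k rows/s figure is a peak from a different, federated workload, so the ratio should be read as indicative rather than as a benchmark.

```python
# Figures quoted in the text above.
federated_rate = 174_000   # rows/s, peak seen in Starburst's federated example
simple_rate = 4_500        # rows/s, approximate rate for our simple queries

# How many times faster the federated example processed rows.
speedup = federated_rate / simple_rate
print(f"{speedup:.1f}x")   # 38.7x

# Observed latency range per simple query, and the fixed BigQuery delay.
observed_latency_s = (2.0, 3.0)
bigquery_overhead_s = 1.0

# Latency left over once the BigQuery delay is subtracted.
net_latency_s = tuple(t - bigquery_overhead_s for t in observed_latency_s)
print(net_latency_s)       # (1.0, 2.0)
```

In other words, roughly 1 to 2 seconds of each simple query was spent outside BigQuery's own delay, and the federated example moved rows nearly forty times faster than our single-source test.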

THE FINAL FRONTIER

To conclude:

It appears that multi-cloud data storage is here to stay with as many as 87% of companies already implementing a multi-cloud approach. Multi-cloud systems enable companies to make use of the advantages of each respective cloud provider for specific needs/purposes, and as such it is likely that this percentage will continue to grow in the future.

With this being the case, there is a definite need for a platform to collate and query across this data securely, and judging by their existing customer base it appears that Starburst are more than qualified to fulfill this role.

Here at Data Intellect we hope to one day further our technical analysis, but in the meantime it’s fair to say… we can understand the hype.