In the world of Time Series Data we focus on the live capture of the system and then saving that data to be used historically. Once the data is persisted to disk, especially with kdb+, it just becomes part of the platform we maintain, occasionally backfilled, occasionally modified, very rarely removed.
The piece that we rarely discuss, unless budgets require, is the cost of that data. When we do, in my experience, it’s usually around the actual storage itself and how we can potentially move to “lesser grade” storage that costs a little less.
What we tend to avoid is addressing two bigger questions, that really we should:
The first question as a simple answer and then a more complicated one; the simple, none of it, well not for N years because of audit requirements, the complicated, it depends on the clients of the data. Funnily enough, this should be the same answer as the second question.
Often, especially with the rise of cloud storage, we term data hot, warm and cold depending in how often it is used. This qualitative categorization is good but it doesn’t include the most critical data type – dead data. Dead data is data that is not used and no longer relevant. If we think on the high level purpose of Time Series Data it’s about capturing data that changes with time allowing using the raw and derived data to make analyze or make decisions. Those decisions either directly or indirectly make money for our companies. With that in mind, dead data doesn’t make us any money, yet it costs us to keep it – it has become a loss making dataset.
With that in mind we should define our data-on-disk strategies considering the likelihood of how profitable our data can be:
There is no mention of dead data in the above as it should be deleted and we should have processes in place to do that. If we find there is too much warm/cold data we don’t need to just demote them to dead, we could actually downsample the data reducing it in terms of rows/columns/precision in a way that it does not impact the overall decision making.
In order to do all of this we need 2 things:
From this we can then draft up a data strategy and implement a solution that takes data through the qualitative characteristics and most importantly does the rm -rf on the end of it.
The key takeaway from this blog is for everyone to look around for data that might be dead, check it’s pulse (by looking at query logs, speaking to folks) having the doctor sign the death cert (audit/compliance/quants sign off), send that data to the big unix box in the sky (NOT the cloud) and to answer the blog title, yes lets write an obituary for our dead time series data.
If we all did this, how much money could we save?