Data Creep: The Uninvited Guest in Your Database

Read time: 3 minutes

Gary Davies

What is Data Creep?

Previously, we highlighted the cost and impact of dead data in a time series capture. This time we want to focus on the data equivalent of scope creep.

Imagine your database as a tidy room with neatly organized shelves. Initially, you carefully place only the essential items on those shelves: your core data columns. But over time, as more requirements emerge, you start adding extra shelves and piling up more items. These new columns and data points accumulate, often without a second thought. This phenomenon is what we call data creep.

In technical terms, data creep refers to the gradual expansion of data capture beyond the original project scope or initial requirements. It’s like that cluttered closet at home: you keep adding stuff, and suddenly you’re drowning in a sea of old shoes, forgotten sweaters, and mismatched socks.

From a time series perspective, this gradual expansion also covers the historical data partitions that grow naturally (usually daily). I’m sure many of us have worked on or owned a system where we wondered whether anyone ever looks at the data from 10+ years ago.

The Problem with Data Creep

Complexity Overload: As you add more columns and data points, your once-sleek database becomes a labyrinth. Queries become convoluted, and maintaining the system becomes a headache. You’re no longer sure which data is critical and which is just digital dust.
Storage Costs: Every additional column consumes storage, whether that is disk space or, depending on your solution, RAM. While storage costs have decreased over time, they still add up. Unused or redundant data is like paying rent for an empty room.
Performance Drag: More data generally means slower queries, because there is more to read and filter. Retrieving information becomes sluggish, impacting user experience and system efficiency.
Data Quality: Data creep often leads to inconsistencies, inaccuracies, and outdated information. Cleaning up the mess becomes a Herculean task. Think about your oldest historical database: is the data you captured back then the same as the data you capture now? Is it still relevant?
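
To put a rough number on the storage point, here is a back-of-the-envelope sketch in Python. Every figure in it (rows per day, bytes per value, partitions per year, years retained) is a made-up assumption for illustration, and it ignores compression, replication and indexes, so treat the result as an order-of-magnitude estimate only.

```python
def column_footprint_bytes(rows_per_day: int,
                           bytes_per_value: int,
                           partitions_per_year: int,
                           years: int) -> int:
    """Uncompressed size of ONE column across all daily partitions."""
    return rows_per_day * bytes_per_value * partitions_per_year * years


# Hypothetical inputs: 50 million rows a day, one 8-byte value per row,
# 252 daily partitions a year, 10 years of history.
footprint = column_footprint_bytes(
    rows_per_day=50_000_000,
    bytes_per_value=8,
    partitions_per_year=252,
    years=10,
)

print(f"{footprint / 1e9:.0f} GB for a single column")
```

Under these assumptions a single extra 8-byte column comes to roughly a terabyte of uncompressed history. Swapping in your own row counts and retention period is a quick way to see what a column nobody reads might be costing.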

Solving the Data Creep Dilemma

Regular Audits: Just as you declutter your home periodically, audit your database. Review existing columns and ask: Do we really need this? Is it still relevant? If not, consider archiving or deleting it.
Purpose-Driven Capture: Before adding a new column, ask why. What purpose does it serve? Will it enhance decision-making or analytics? If not, think twice.
Calculated Fields: Instead of storing redundant data, calculate it on the fly. For example, if you have price and size, don’t store trade value; calculate it when needed.
Metadata Management: Maintain detailed metadata. Document the purpose, source, and usage of each column. This helps prevent accidental data creep.
Data Governance: Establish clear guidelines for data capture. Define ownership, access controls, and retention policies. Educate your team about responsible data management.
Understanding Your Data: Having a clear picture of what data is used, and how it is accessed, is key. Purpose-Driven Capture is all well and good, but it assumes you are starting from the requirements phase. In systems that have suffered years or even decades of data creep, we may not even know what is being used or how. This is where capturing and understanding usage becomes key.
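
Several of the ideas above (audits, calculated fields, metadata and usage capture) can be combined in one small sketch. The Python below is illustrative only: the column names, the `ColumnInfo` record and the query log are all hypothetical, and matching column names as whole words in query text is a deliberately crude stand-in for whatever usage capture your platform actually provides.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ColumnInfo:
    """Minimal metadata record for one column (hypothetical structure)."""
    name: str
    purpose: str
    derived_from: Tuple[str, ...] = ()  # source columns, if it can be recalculated


def column_usage(columns: List[ColumnInfo], query_log: List[str]) -> Dict[str, int]:
    """Count how many logged queries mention each column as a whole identifier."""
    usage = {c.name: 0 for c in columns}
    for query in query_log:
        tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", query))
        for c in columns:
            if c.name in tokens:
                usage[c.name] += 1
    return usage


def trade_value(price: float, size: int) -> float:
    """Calculated field: derive trade value on read instead of storing it."""
    return price * size


# Hypothetical metadata and query log, invented for this example.
columns = [
    ColumnInfo("price", "execution price"),
    ColumnInfo("size", "executed quantity"),
    ColumnInfo("trade_value", "price multiplied by size", derived_from=("price", "size")),
    ColumnInfo("legacy_flag", "unknown; added for a retired report"),
]
query_log = [
    "select price, size from trade where sym = 'ABC'",
    "select sum(price * size) from trade",
    "select trade_value from trade",
]

usage = column_usage(columns, query_log)
unused = [c.name for c in columns if usage[c.name] == 0]
derivable = [c.name for c in columns if c.derived_from]

print("usage counts:", usage)
print("never queried (audit candidates):", unused)
print("could be calculated instead of stored:", derivable)
```

In this toy data, `legacy_flag` is never queried and `trade_value` can be rebuilt from `price` and `size`, so both would be worth raising in an audit. A real system would need a more reliable source of usage information than text matching, for example parsed queries or access logs.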

Conclusion

Data creep is like a silent intruder: it sneaks in and wreaks havoc, often going unnoticed until the growing complexity of the system drives up the resourcing and cost needed to support it.

In this blog we’ve outlined approaches that can be used to reduce and mitigate the creep, and to help you figure out which data doesn’t belong.

At Data Intellect, we are always keen to help improve the efficiency and cost effectiveness of data systems. Please reach out to us if you are interested in hearing more, or would like us to help reduce your creep.

If we all did this, how much money could we save?