Kent Lee
Have you ever gotten into a situation where a perfectly well-built app still gets into trouble, with the business raising incidents and complaints about something that doesn't work? You end up spending half of your day debugging your code and logic, find nothing wrong with your system, and discover it was those nasty little data points that messed everything up. We can point fingers at the upstream systems and say their data is the problem, but the business is just not having it: as far as they are concerned, our app is rubbish.
Well, to be honest, there is just no such thing as a perfect set of data. Data comes from many different sources, and it is persisted in many different ways with no standard template that fits all. It is noisy, messy, and dirty, and without proper hygiene your application is easily exposed to the risk of poor-quality data, which can harm your business in different ways:
inaccurate financial reports that can influence the business to make the wrong decisions
a surveillance system that consistently outputs false alerts, reducing confidence in spotting genuine market abuse activity
the list goes on…
Ever heard of RIRO? "Rubbish in, rubbish out", which I find quite true. Would it be fair to expect bad inputs into a healthy system to produce output that is accurate and meets expectations? For the next part of our ARK series, following on from our previous post on Testing, I'll take you through some of the major pain points we face when dealing with data on a daily basis.
As mentioned before, data comes in many different shapes and forms. For a financial institution, it is quite common to store all the trading activity done by the bank, plus other useful reference data such as market data, instrument details, trader information, etc. When it comes to data capture, there will be many different upstreams, and they all have their own way of delivering the data.
One of the discrepancies is field names. There is just no standard naming convention. The challenge is to figure out whether the content of a field is actually what its name suggests, and whether it can be related to data provided by other upstreams, as in the sketch below.
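To make this concrete, here is a minimal Python sketch of one common approach: keep a mapping of upstream aliases to canonical field names and rename on ingest. The alias lists and field names below are purely illustrative, not any particular upstream's schema.

```python
# Minimal sketch: map upstream-specific field names onto one canonical schema.
# The alias lists below are illustrative, not a real upstream's field names.

FIELD_ALIASES = {
    "trade_price": ["trd_px", "tradeprice", "px_last"],
    "trade_time":  ["trd_ts", "exec_time", "timestamp"],
    "instrument":  ["sym", "ric", "instrument_id"],
}

# Reverse lookup keyed on a normalised form of each alias.
_LOOKUP = {
    alias.lower().replace("_", ""): canonical
    for canonical, aliases in FIELD_ALIASES.items()
    for alias in aliases
}

def normalise_record(record: dict) -> dict:
    """Rename known fields to their canonical name; keep unknown fields as-is."""
    out = {}
    for key, value in record.items():
        canonical = _LOOKUP.get(key.lower().replace("_", ""), key)
        out[canonical] = value
    return out

raw = {"TRD_PX": 101.25, "Exec_Time": "2024-05-01T09:30:00.123", "SYM": "ABC.L"}
print(normalise_record(raw))
# {'trade_price': 101.25, 'trade_time': '2024-05-01T09:30:00.123', 'instrument': 'ABC.L'}
```

The hard part in practice is not the renaming itself but confirming, with the upstream, that two similarly named fields really do mean the same thing.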
Another discrepancy is the data type: the number of decimal places, the precision of timestamps, product code vs. full product name, etc. These are not easy to overcome, and it is quite time-consuming to come up with rules that fit all the data points.
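As a rough illustration of what those rules tend to look like, here is a Python sketch covering the three cases just mentioned. The target precision of four decimal places and the product-name mapping are assumptions for the example, not a recommendation.

```python
# Sketch of normalising types across upstreams: decimal precision,
# timestamp precision, and product code vs. full product name.

from decimal import Decimal, ROUND_HALF_UP
from datetime import datetime, timezone

# Illustrative mapping from full product names to codes.
PRODUCT_CODES = {"Interest Rate Swap": "IRS", "Foreign Exchange Forward": "FXF"}

def normalise_price(raw) -> Decimal:
    """Coerce prices quoted at varying precision to 4 decimal places."""
    return Decimal(str(raw)).quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)

def normalise_timestamp(raw: str) -> datetime:
    """Accept second- or sub-second timestamps and return UTC datetimes."""
    for fmt in ("%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%dT%H:%M:%S"):
        try:
            return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised timestamp format: {raw!r}")

def normalise_product(raw: str) -> str:
    """Map a full product name onto its code; pass codes through unchanged."""
    return PRODUCT_CODES.get(raw, raw)

print(normalise_price("101.2"),
      normalise_timestamp("2024-05-01T09:30:00.123"),
      normalise_product("Interest Rate Swap"))
```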
How do we know if we have captured all the data we need? When data is delivered as a flat file, it is quite common that no metadata is available in the extract other than the data itself. The extract could still be half-written when we start loading it, and without metadata that is unfortunately hard to detect.
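One pragmatic workaround, when there is genuinely no metadata to rely on, is to only treat a file as complete once its size has stopped changing for a while. It is a heuristic rather than a guarantee, and the settle interval below is just an assumption for the sketch:

```python
# Heuristic sketch: treat a file as "complete" only once its size has been
# stable for `settle_seconds`. This does not prove completeness, it only
# reduces the chance of loading a half-written extract.

import os
import time

def wait_until_stable(path: str, settle_seconds: float = 5.0, timeout: float = 300.0) -> bool:
    """Return True once the file size is unchanged across one settle interval."""
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        size = os.path.getsize(path)
        if size == last_size:
            return True
        last_size = size
        time.sleep(settle_seconds)
    return False
```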
Should metadata be embedded in flat files such as CSV, DAT, etc.? It would certainly help to describe the data content and confirm completeness, with information such as start of data, end of data, row count, and so on. However, without standardisation of that metadata, each file needs to be handled in a custom manner, which ties back to the first point on "Structure".
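Here is a small sketch of what such a completeness check could look like, assuming a hypothetical trailer line of the form EOF|&lt;row count&gt; at the end of the extract. The trailer convention is an assumption for the example; in reality each upstream's layout tends to need its own handler.

```python
# Sketch of a completeness check for a CSV whose last line is a trailer
# such as "EOF|<row count>". The header/trailer layout here is hypothetical.

import csv

def validate_extract(path: str) -> bool:
    """Check the trailer's declared row count against the rows actually present."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh))
    if not rows or not rows[-1] or not rows[-1][0].startswith("EOF"):
        return False                  # no trailer: file may be truncated
    declared = int(rows[-1][0].split("|")[1])
    data_rows = len(rows) - 2         # exclude header and trailer lines
    return data_rows == declared
```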
Even with a pub/sub mechanism, there is still potential for a message to be dropped during transfer. Did we recover properly when our process was restarted? Does the number of records consumed today match the number of records published by the upstream? These are all important questions when building out a data-capturing service.
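A simple end-of-day reconciliation catches a lot of this. The sketch below assumes the upstream reports a daily published count and attaches a sequence number to each message; both are assumptions about the feed rather than a specific protocol.

```python
# Sketch of an end-of-day reconciliation between what the upstream says it
# published and what we actually consumed, plus a sequence-gap check.
# The counts and sequence numbers below are illustrative.

def reconcile(published_count: int, consumed_seqnos: list[int]) -> dict:
    """Compare consumed records against the upstream's published count."""
    consumed = len(set(consumed_seqnos))
    expected = set(range(1, published_count + 1))
    gaps = sorted(expected - set(consumed_seqnos))
    return {"published": published_count, "consumed": consumed, "missing_seqnos": gaps}

print(reconcile(published_count=10, consumed_seqnos=[1, 2, 3, 5, 6, 7, 8, 9, 10]))
# {'published': 10, 'consumed': 9, 'missing_seqnos': [4]}
```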
Documentation is crucial for understanding the context of the data we hold. This also relates to the "Structure" point mentioned above, since all the collected data needs to make sense when consolidated together. The documentation needs to be kept as up-to-date and accurate as possible in order to be fully effective.
There are many benefits to well-documented data content.
Keeping these data quality challenges in mind, we at Data Intellect, under the leadership of the head of our Data Science stream, developed a data validation engine called DIVE. It is a fully kdb-native validation framework, executing validations 100x faster than Python equivalents. With built-in checks for data availability, missing segments, and null values, DIVE ensures comprehensive and efficient data validation.
If you are interested in learning more about DIVE, please do reach out to us!
Here is a reference link to our LinkedIn post: LinkedIn – DIVE
The list here is not exhaustive, and there is certainly more to debate. In summary, data quality is essential to the accuracy of the product you are developing; it gives a clear picture of the data you are visualising. Without it, no matter how impressive your application's functionality is, the result is an inaccurate output that will not help you achieve what you set out to do.
Stay tuned for our next ARK article 🙂