Stephen Barr
On the 7th of March 2022, Zerodha, a major stockbroker in India, reported an issue with the National Stock Exchange of India (NSE): equity prices were not being updated on the exchange. While this glitch occurred, orders were still being executed in the market, resulting in chaos for India's largest trading market. During this time, the Sensex index dropped 3% and the Nifty index dropped 2.58%. Let's be clear: we cannot attribute these drops entirely to the glitch, but it certainly had an impact. This error occurred just over a year after the same market had to be shut down because of glitches in its servers.
Due to its scale, this market-wide issue was picked up by a major broker, who could then effectively broadcast it to traders. To catch such problems earlier, we must examine how to discover them on a smaller scale as well. How does a trader identify when their data has gone stale, especially when they are monitoring multiple data feeds in real time? This business question led us to develop a program that monitors data feeds to detect, flag, and alert when incoming data has gone stale, in real time. We covered some of the fundamental modelling around purely time-based considerations in our last stale data post; here, we'll dive into the implementation of a similar approach, based on the quality of the data.
With the above feed, we can implement a staleness checker for live data directly in the data consumer itself. Note this is a relatively blunt approach: we expect the consumer to stay up all day, so we need a one-size-fits-all time threshold, but using the analysis from our previous post we can set this at an appropriate value. For NYSE TAQ, a few hundred milliseconds would be entirely reasonable.
Below we roll this threshold directly into the Kafka consumer itself, automatically raising an exception if the feed breaks.
from kafka import KafkaConsumer

def setup_kafka_consumer(timeout_time):
    """
    inputs:
        timeout_time (int) - timeout wait in ms before flagging a break
    output:
        consumer (object) - Kafka consumer providing messages
    """
    # Set up the consumer for the topic and bootstrap server
    # consumer_timeout_ms bounds how long the consumer waits for the next message
    consumer = KafkaConsumer(
        "tick_gen",
        group_id="my-group",
        bootstrap_servers=["localhost:9092"],  # redpanda location
        consumer_timeout_ms=timeout_time,
    )
    return consumer
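To see how the threshold surfaces at runtime, here is a minimal consumption loop; it's a sketch assuming a 500 ms threshold, in line with the TAQ figure above. With consumer_timeout_ms set, kafka-python raises StopIteration from next() when no message arrives in time, which is the exception mentioned above. The print statements are hypothetical stand-ins for real tick processing and alerting.

consumer = setup_kafka_consumer(timeout_time=500)  # threshold from the TAQ analysis

try:
    while True:
        # next() blocks until a tick arrives; if nothing arrives within
        # consumer_timeout_ms, kafka-python raises StopIteration
        message = next(consumer)
        print(message.value)  # stand-in for real tick processing
except StopIteration:
    # The feed has gone quiet for longer than the threshold: flag the break
    print("Feed break detected: no tick within 500 ms")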
We can define a single column, or multiple columns, to check for duplicate values from one entry to the next. For example, in a highly liquid market we wouldn't expect repeats of the same stock price over multiple ticks, as the price is constantly changing. If two consecutive ticks were observed with the same value, we would flag them as a cause for concern. This is typical of a system generating data in calculation cycles: when the input stalls, the output repeats. Detecting stationary values can help identify an issue further up the chain.
from rapidfuzz import fuzz  # fuzzywuzzy/thefuzz expose the same fuzz.ratio API

def detect_duplicate_fuzzy(df_latest_tick, df_prior_tick, columns_check, threshold):
    """
    inputs:
        df_latest_tick (polars df) - single-row dataframe containing the latest tick
        df_prior_tick (polars df) - single-row dataframe containing the prior (second-latest) tick
        columns_check (list) - column names to check for fuzzy duplicates;
            only these columns are compared between the latest and prior ticks
        threshold (int) - value between 0 and 100; a similarity score at or
            above this flags the pair of values as a duplicate
    output:
        duplicate_fuzz_count (int) - number of columns flagged as fuzzy duplicates;
            cannot be larger than the length of columns_check
    """
    # On the first tick read into the monitor there is no prior row to compare
    # against: df_prior_tick still holds its initial integer value rather than
    # a dataframe, so report no duplicates
    if isinstance(df_prior_tick, int):
        return 0

    duplicate_fuzz_count = 0
    for check in columns_check:
        # fuzz.ratio compares strings, so cast the cell values before scoring
        latest = str(df_latest_tick[check].item())
        prior = str(df_prior_tick[check].item())
        if fuzz.ratio(latest, prior) >= threshold:
            duplicate_fuzz_count += 1
    return duplicate_fuzz_count
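As a quick illustration, here is how the check behaves on synthetic one-row frames; the "symbol" and "price" columns are hypothetical stand-ins for real feed fields.

import polars as pl

latest = pl.DataFrame({"symbol": ["ABC"], "price": ["101.24"]})
prior = pl.DataFrame({"symbol": ["ABC"], "price": ["101.24"]})

# Both columns match exactly, so both are flagged: prints 2
print(detect_duplicate_fuzzy(latest, prior, ["symbol", "price"], threshold=100))

# First tick of the session: the prior tick is still the integer sentinel, prints 0
print(detect_duplicate_fuzzy(latest, 0, ["symbol", "price"], threshold=100))

Setting the threshold below 100 lets the check also catch near-duplicates, which can be useful when a stalled upstream calculation repeats values with only minor formatting differences.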
While the scope of this proof of concept is to examine live stock market ticks for data quality and timeliness issues, we kept the code general to allow for its adaptation to many different use cases. We're also taking steps to include features such as a time-weighted variance, as well as notifications via Microsoft Teams and email. Beyond this, we're researching new metrics and methods of detecting data quality issues to incorporate into the program.
The NSE issue with prices not being updated on the exchange was quickly spotted by a major stockbroker due to the macro nature of the problem. This will not always be the case: if the issue affects a specific stock, or is an in-house issue for a firm, there will not be the same degree of oversight. With tools like those described above, checking the timeliness of tick data makes it possible to automatically flag potential issues for the user without having to rely on an outside source such as Zerodha.
This reduction in latency between the error occurring and the end user discovering it could mean a large decrease in losses from trading on stale information. With the NIFTY 50, the blue-chip index of the NSE, dropping 2-3%, being able to detect this market-wide data issue ahead of press releases from outside agents could have mitigated the resulting losses. The NSE suffered similar technical issues in 2017 and 2021 that required trading to be halted, so having internal tools to detect such issues and act on them could significantly reduce the costs that come from delayed action when relying on third-party information.