ARK 7: Supportability

Data Intellect

"Little Grey Cells"

Previously on ARK: Jemma talked us through Security and Access Controls, before leaving our readers with this cliffhanger – how do we support all this? The answer and more in this month’s exciting episode of ARK!

I love TV detective shows. A good whodunnit with lots of twists and turns. It’s such a diverse genre – silly ones like Brooklyn 99 or Death in Paradise, classics like Poirot and Columbo, procedurals like CSI and The Wire, scary ones like Hannibal and Evil, ones for kids like Scooby Doo, historical ones like Ripper Street, supernatural ones like The X Files or thrillers like 24. I love them all.

When asked to do an ARK blog post on supportability, I started to dissect the subject matter and noticed an interesting metaphor. A kdb+ framework is much like a well functioning society and like any society there will be problems. The bugs, failures and edge cases that can cause so much mischief in our highly tuned systems aren’t quite as nefarious as the criminals being chased by my silver screen heroes. Though by choosing to emulate the procedure and philosophy of these gumshoes, we are able to police our infrastructure, catch the defects and solve any mysteries that we encounter while developing our supportable application. Over the next few paragraphs, I will walk you through some of the key topics to consider for system supportability and I’ll compare each to the attitude or skill set of the characters from a specific detective show.

Monitoring

“We’re building something here, detective. We’re building it from scratch. All the pieces matter”
Lester Freamon – The Wire S1 Ep6

We start our journey of supportability in Baltimore, with one of the greatest TV shows ever made. Following those in the orbit of detective Jimmy McNulty (Dominic West) and the criminals he aims to incarcerate, The Wire focuses on surveillance and wire-taps to try and infiltrate and disrupt organised criminals and corrupt bureaucrats. Being able to listen to and interpret the dialogue from the inside of a criminal enterprise is a perfect mirror of what we hope to achieve with our monitoring suite. Ensuring that you are capturing metrics from your most important processes and that you can decipher the information being returned is a requirement to efficiently catch any problems. Like the detectives in the show, we need our system to be able to filter out the noise so we can focus on impactful messages.

An additional comparison can be made between successful application monitoring and the philosophy of McNulty’s season 1 boss, Lieutenant Daniels (Lance Reddick). After some initial doubts, Daniels’ focus is on identifying the root cause of what he’s investigating rather than treating the symptoms. In the show, this amounts to aiming for drug producers rather than the replaceable ‘foot soldiers’ dealing on street corners, but we can still use this mindset within our supportability framework.

Working with limited resources is a difficulty a lot of IT systems share with the fictional interpretation of Baltimore PD. Therefore our focus should be efficiency, with monitoring giving you important information at the right time and not overusing infrastructure resources or impacting your application’s latency. Human resources should also be considered, as was highlighted in the series – it’s important to make sure that someone is listening at the end of the wire-tap into your system!
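To make the idea of filtering out the noise a little more concrete, here is a minimal sketch of one possible approach. It is written in Python purely for illustration and is not kdb+ code; the class name, threshold and sample values are all hypothetical. An alert only fires when a metric stays above its threshold for several consecutive samples, so a one-off spike doesn’t wake anyone up.

```python
from collections import deque


class SustainedBreachAlert:
    """Fire only when a metric exceeds its threshold for `window`
    consecutive samples, so isolated spikes are treated as noise."""

    def __init__(self, threshold, window=3):
        self.threshold = threshold
        # Keeps only the most recent `window` breach flags.
        self.recent = deque(maxlen=window)

    def observe(self, value):
        """Record one sample; return True if an alert should fire."""
        self.recent.append(value > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Hypothetical memory-usage samples (percent) from a single process.
samples = [40, 95, 42, 91, 93, 96, 50]
monitor = SustainedBreachAlert(threshold=90, window=3)
alerts = [monitor.observe(s) for s in samples]
```

In this sketch the lone spike to 95 is ignored, and only the third sample of the sustained run (91, 93, 96) raises an alert. The right window and threshold are very much a judgment call for each system.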

Logging

“Where did you get that?” – “From the case file.”
Evelyn Shelby and Lilly Rush, Cold Case S1 Ep1

We leave Baltimore via the I-95 and drive for 100 minutes before reaching Philadelphia, the next stop of our TV sleuth showcase. This time we focus on Lilly Rush, the protagonist of Cold Case, who revisits unsolved cases with fresh eyes, modern techniques and a ‘won’t take no for an answer’ attitude. As you can imagine, looking at cases that are sometimes decades old, Lilly and team heavily rely on the files that previous investigators have left for them. The importance of these files is paramount – tracking down individuals, understanding the circumstances of a crime and piecing together the solutions relies on insightful and accurate records of what took place both at the crime scene and during the investigation.

When looking at the supportability of our kdb+ system, we are often investigating “cold” issues. Normally, a user has reported a problem or our monitoring has sent out an alert and one of our first steps is to look back in time to see exactly what happened. The clarity of this retrospect is highly dependent on the relevancy and level of detail in our log files. When developing our system, implementing good logging should be an early priority that will pay dividends as time goes on. As with all detective cases, we want to solve them as quickly as possible and avoid any repeat occurrences. Getting our logging right from day one will allow us to be all over these problems like a SWAT team.
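As a rough illustration of what a useful “case file” entry might contain, the sketch below uses Python’s standard logging module to write delimited log lines holding a timestamp, severity, process name, user and message. This is an assumption about what a sensible layout could look like rather than a prescribed format; the process name `rdb1`, the user `jsmith` and the field order are all hypothetical.

```python
import io
import logging


def build_logger(process_name, stream):
    """Return a logger that writes pipe-delimited lines to `stream`."""
    logger = logging.getLogger(process_name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate handlers on re-use
    logger.propagate = False     # keep output confined to our handler
    handler = logging.StreamHandler(stream)
    # Every call must pass extra={"user": ...} for the user field below.
    handler.setFormatter(logging.Formatter(
        "%(asctime)s|%(levelname)s|%(name)s|user=%(user)s|%(message)s"))
    logger.addHandler(handler)
    return logger


# An in-memory stream stands in for a real log file in this sketch.
stream = io.StringIO()
log = build_logger("rdb1", stream)
log.info("query received: select from trade", extra={"user": "jsmith"})
log.error("query failed: type", extra={"user": "jsmith"})
lines = stream.getvalue().splitlines()
```

A fixed delimiter and consistent field order mean that, months later, the log can be loaded into a table and queried by user, process or severity rather than read line by line.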

Documentation

“Irregulars. I’m your consultant, Captain, they are mine. They’re experts I turn to when I encounter problems which are beyond my knowledge.”
Sherlock Holmes, Elementary S3 E3

Continuing down the I-95 for another 2 hours, we reach New York City and the brownstone that is the home of Sherlock Holmes (Jonny Lee Miller) and Joan Watson (Lucy Liu). Elementary sees Sir Arthur Conan Doyle’s famous detective plying his trade in the United States while consulting with the NYPD. As with most iterations of the Holmes character, Sherlock is a font of obscure knowledge which helps him make incredible connections in order to solve crimes. Paying tribute to the original novels and putting a (very slightly) more realistic spin on the great detective’s deductive skills, Elementary introduces ‘Irregulars’ – a group of individuals with exceptional knowledge on a specific subject. Insights into such subjects as mathematics, knife throwing and smell detection allow Sherlock and Joan to gain the upper hand in their investigations.

Applying the same logic to the system we are supporting, it is easy to see the importance of documentation. Well written schematics of a process or job can tell you what it is, how it works, why it was created, what its limitations are, who is impacted if something goes wrong and a host of other details that can aid an investigation. Contributing to documentation across development and support teams helps build a knowledge base that combines many insights, skillsets and observations. It’s also the first resource anyone will look at if a problem arises that is new to them, and the quality of the information available can make the difference between a quick solve and a serious incident.

Architecture

“Get out. I need to go to my mind palace”
Sherlock Holmes, Sherlock S2 E2

Our next destination in the world of television investigators jumps across the Atlantic to London, for another modern incarnation of everyone’s favourite deerstalker-wearing detective. This time, the protagonist is the rude and abrasive, but morally unshakeable Sherlock, played by Benedict Cumberbatch. In this iteration we first see the character as a solitary figure, before he builds friendships with John Watson (Martin Freeman) and others around him. This network becomes increasingly important to his deductive reasoning throughout the series.

Sherlock’s relationship with John works very much as a sounding board to test his theories against and his contacts within the police and the lab at St. Bart’s hospital provide opportunity for him to forensically analyse his case findings. He also utilises his “homeless network” to search large areas more efficiently and gather information. Another plot device that Sherlock uses to solve his cases is his mind palace: a data warehouse within his own head that he can imagine himself walking through and interacting with. Sherlock uses this tool to store and retrieve memories and facts, while also using it to debug his hypotheses and make inferences on his cases.

Sherlock has created an architecture around himself with the singular goal of being a better detective and catching criminals. We can learn from this approach when designing our own application by ensuring our system architecture allows and aids us in any investigations.

Are there non-client facing processes to debug from? Can we access production data from safe environments? Are we able to utilise UAT testing effectively? Are we able to mimic how a query executes for a specific user? Does our hardware have enough memory to run this? Can you release a patch without having to restart processes? Is there an advantage in capturing some of our own performance metrics or user data? There are hundreds of questions like this that we can consider when deciding on the architecture of our application and the answers are very individual to each team. By evaluating these types of questions and proactively building a support architecture that provides integrated tools and spaces to investigate issues, we set ourselves up to succeed when problems in our code, network or hardware arise.

Code

“Whenever two objects are broken there occurs what we call striae — two unique connecting points. If I can match the nail in the sneaker to the suspect’s clippings … Alcatraz!”
Gil Grissom, CSI: Crime Scene Investigation S1 E1

Our whistle-stop tour around the world of TV detective shows now takes us to Las Vegas, Nevada to the boffins at CSI: Crime Scene Investigation. CSI focuses on the forensic analysts who investigate crime scenes with a fine-tooth comb to look for microscopic clues left behind by criminals. Once they’ve collected the evidence, they work with their team of lab technicians and attend police interviews to try and recreate the sequence of events that occurred during the crime. Using a range of technological and chemical tools, such as ‘luminol’ for finding blood or various fingerprint lifting gadgets, the team are able to pick out unique attributes in human genetic code to identify criminals and solve cases.

The CSIs are able to extract massive amounts of information from tiny amounts of material that have been left behind and we want to give whoever is supporting our system this ability. By considering how we write code for our application and our coding standards, we can make sure that we leave behind clues that can be easily turned into solutions by investigators. Writing error traps, manual debug overrides and enabling backtrace logic in our code helps us to extract the ‘DNA’ of functions for analysis, while ensuring that user and query mapping is retained through IPC queries helps us track the path of our user queries.
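The sketch below shows one way these ideas could fit together: an error trap that captures a backtrace, keeps the requesting user attached to the result, and offers a manual debug override. It is a language-neutral illustration in Python rather than kdb+ code, and every name in it (`run_protected`, `DEBUG`, `average_price`, `jsmith`) is hypothetical.

```python
import traceback

# Hypothetical manual override: set to True to let errors surface
# to a debugger instead of being trapped.
DEBUG = False


def run_protected(user, func, *args):
    """Run func(*args) on behalf of `user`. On failure, return a record
    holding the user, the error and the backtrace instead of crashing."""
    try:
        return {"user": user, "ok": True, "result": func(*args)}
    except Exception as exc:
        if DEBUG:
            raise  # re-raise the original error untouched
        return {"user": user, "ok": False, "error": repr(exc),
                "backtrace": traceback.format_exc()}


def average_price(prices):
    """Toy analytic that fails on an empty input."""
    return sum(prices) / len(prices)


good = run_protected("jsmith", average_price, [10.0, 12.0])
bad = run_protected("jsmith", average_price, [])
```

Here the failing call leaves behind who ran it, what went wrong and where in the code it happened, which is exactly the kind of clue an investigator wants to find at the scene.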

Writing code that is easily readable and digestible with plenty of comments ensures that we can understand the fundamentals of what we’re looking at when something has gone wrong. Likewise, making sure we write code that is performant and efficient will lower the stress on our hardware and network, while creating helper functions or breaking larger code blocks into digestible chunks can help avoid uninterpretable black boxes.

Testing

“Imitation allows us to better understand the behaviour of others”
Will Graham, Hannibal S3 E4

The final destination on our guided excursion through the most deductive minds to grace the small screen takes a dark turn in Minnesota. Hannibal follows forensic psychiatrist and secret cannibal, Hannibal Lecter (Mads Mikkelsen) and FBI agent Will Graham as they work together to profile and hunt down serial killers. The dual skillset our two investigators display makes this series particularly interesting. Doctor Lecter provides insights into the mind and motivation of the mentally ill assailants, allowing the FBI to predict their next moves and tighten the net. Will Graham, on the other hand, has the ability to empathise with a killer, giving him scarcely-credible insights into how they committed their attack just by looking at a crime scene. While the pair admittedly struggle with their own issues throughout the series, they are extremely effective at getting into the mind of the killer, ultimately allowing them to catch their prey.

The philosophy of the two ‘protagonists’ of the show again provides a great foundation for building testing methodology into a kdb+ system. Hannibal looks at the weaknesses of a character and uses them to predict where they might come unstuck, and we can employ this same logic by creating robust unit and regression tests. Meanwhile Will’s empathetic understanding maps neatly onto our use of UAT or Pre-Prod testing – putting ourselves in as close a position as possible to our production environment and gaining insights from what we can see. One of DI’s future ARK blogs will cover testing in more detail, but from a supportability point of view, a robust testing framework allows us to catch bugs before they become incidents. As with the investigators in Hannibal, we want to avoid ‘serial’ issues and catch and prevent problems before they reach our production environment.

"Just One More Thing ..."

So the cuffs are on, the evidence has been collected and we’re sending those pesky errors to virtual jail! Our system is running smoothly and your team of kdb+ detectives are being given the keys to the city. Ok, maybe I’ve stretched the metaphor as far as it can go, but my final takeaway is that there is no one-size-fits-all solution for supportability. There are many different TV detectives out there and although they all work slightly differently, they nearly always get their man/woman/supernatural being. It’s up to you what type of sleuth you want to be, but it’s highly recommended that you have a cool gimmick to end every incident with.

Horatio Caine
Stay tuned for next month’s exciting episode of ARK!