ARK 5: Disaster Recovery / Continuity of Business

Blog ARK 19 Jun 2024

Ross Edens

Introduction

Following on from our recent blog discussing Environments, we take a closer look at Disaster Recovery and how you can protect your application/business from disaster. When I think of Disaster Recovery (DR), the first thing that comes to mind is a quote from Franz Kafka: “Better to have, and not need, than to need, and not have”. DR is certainly not the most glamorous aspect of software development, but I have found that a disastrous event can actually give an application (or team) an opportunity to shine.

What is a Disaster?

So what actually constitutes a disaster? Ultimately, it is any event or action that could disrupt or halt business operations. This could range from running out of coffee in the office to a nuclear apocalypse! The list of potential disasters impacting an application is practically endless, but some examples include:

  • Power/network outages
  • Post-release outages (bugs)
  • Natural disasters (earthquakes, floods, etc.)
  • Pandemics
  • Hardware failure
  • Cyber/terrorist attacks

Disasters are inevitable, so how can we protect our applications and minimise impact to the business?

What is Disaster Recovery (DR)?

Disaster recovery implementations can take many forms, but simply put, DR is our backup plan in the event of a disaster. It is a general term that can be applied in various scenarios: it could refer to application-specific DR, or to a DR plan put in place across an entire organisation. You may even have been given a run-through of a DR plan as part of orientation/on-boarding for your current role!

For the purpose of this blog, we will focus on DR methods applied at the application level. As an example, let’s assume we have a standard kdb+ tick application (TP/RDB/HDB/GW) that both captures and serves data to a number of business users. To implement successful DR for this scenario, there are various strategies/configurations we can consider.

DR Environment (Site Replication)

This strategy embodies the classic Blue Peter line “Here’s one I made earlier” (non-UK readers may need to google this one). As the heading suggests, it consists of replicating a production-ready environment (ideally in a different location) to act as a backup in the event of a disaster. This can be set up in various ways to suit different use cases:

Hot-cold

This DR environment contains no live processes; it holds a backup of the primary system at a given point in time. As an example, this could be a copy of our kdb+ application as of the last stable code release. In the event of a disaster, failover would require starting all necessary processes in the secondary environment and routing users across once it is initialised. Although day-to-day resource usage is low for this method, failover is generally more complex. Depending on your application, initialising the backup environment could take significant time, and any data loss between the primary and secondary sites could require further management. With this configuration, regularly testing your failover procedure is critical. There is nothing worse than trying to recover to the DR site only to find that critical dependencies are missing and recovery is no longer possible.
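
To illustrate that last point, the snippet below is a minimal q sketch of a reachability check you might run against the DR-site processes after bringing them up. The hostnames, ports and process list are hypothetical placeholders, and a real test would go much further (checking data, code versions and external dependencies).

```q
/ hypothetical DR-site processes (TP, RDB, HDB) - replace with your own
drProcs:`$("dr-host:5010";"dr-host:5011";"dr-host:5012")

/ try to open a handle with a 2 second timeout, closing it again on success
alive:{[hp] h:@[hopen;(hsym hp;2000);0Ni]; $[null h;0b;[hclose h;1b]]}

/ dictionary of process -> reachable flag; run after starting the DR site
checkDR:{[] drProcs!alive each drProcs}
```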

Hot-warm

This scenario is similar to hot-cold, with the key difference being that processes are live and mirror the primary environment. This environment is independent of the primary and responsible for its own data capture, resulting in a more straightforward failover procedure. As the secondary environment is already primed, failover is achieved by simply routing users to it via the application's entry point (i.e. the GW). Although failover is less complex, this method requires careful management of the DR environment to ensure it is always functional and viable for user traffic (i.e. prod-ready).
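
To make that routing step a little more concrete, here is a rough q sketch of how a simple synchronous gateway could repoint user queries from the primary site to the warm secondary. The addresses and the failover/runQuery function names are illustrative assumptions rather than part of any standard framework.

```q
/ hypothetical addresses for the primary and warm secondary data sites
sites:`primary`secondary!`:prodhost:5011`:drhost:5011
active:`primary

/ invoked (manually or by monitoring) to repoint the gateway at the warm site
failover:{[] active::`secondary; -1"Routing user queries to ",string sites active;}

/ user-facing wrapper: forward the query to whichever site is currently active
runQuery:{[qry] h:hopen sites active; r:h qry; hclose h; r}
```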

Hot-hot

A hot-hot configuration goes a step further than hot-warm. Instead of a dormant DR environment used only for recovery, we utilise the secondary environment to serve a portion of the user load on our application. In the event of a disaster impacting one site, we simply route user traffic to the other, still-functional site. This is often seen as a more efficient use of resources, and by balancing user traffic between sites it can reduce the strain on the application overall. Provided you have the infrastructure, this concept can also be scaled across several environments, resulting in a large-scale, flexible system on which to support your application.
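
As a rough illustration (again with made-up addresses), a gateway in a hot-hot setup could balance queries across the live sites with something as simple as a round-robin, dropping a site from the rotation if disaster strikes:

```q
/ hypothetical addresses of the two live sites serving user queries
liveSites:`:siteA:5011`:siteB:5011
n:0

/ round-robin: send each query to the next site in the rotation
route:{[qry] hp:liveSites n mod count liveSites; n::n+1; h:hopen hp; r:h qry; hclose h; r}

/ if one site suffers a disaster, remove it so all traffic flows to the other
dropSite:{[hp] liveSites::liveSites except hp}
```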

So which environment is right for you? Generally speaking, if you have relatively low demand for data availability, a hot-cold setup could be sufficient. On the other hand, applications requiring high data availability will likely need more sophisticated hot-hot setups. There may even be cases that call for a combination of the various setups (hot-hot-hot-warm-cold…). It is advisable to tailor the landscape of your DR environments to suit the needs of your application.

Critical Data Backup

A simple yet often overlooked strategy for effective DR is the creation of data backups. Whereas a DR environment may consist of a fully replicated system (including replicated data), backups of environment-specific data alone can be an extremely efficient route to recovery in the event of a disaster. Consider a kdb+ historical database (HDB). Creating backups of the critical sym file is a trivial task, but it has the potential to significantly improve recovery time should the live sym file be lost or corrupted. It is also common practice to store backups in a secure location separate from the live environment, to ensure the backups themselves are not collateral damage in the event of a disaster (yes, I found this out the hard way).
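
As a minimal sketch of what this can look like in practice (the paths below are hypothetical, and the copies should of course land on separate storage), backing up and restoring an HDB sym file can be as simple as:

```q
/ hypothetical locations - the backup directory should live on separate storage
hdbSym:"/data/hdb/sym"
backupDir:"/backup/hdb/"

/ copy the live sym file to a dated backup, e.g. /backup/hdb/sym.2024.06.19
backupSym:{[] system "cp ",hdbSym," ",backupDir,"sym.",string .z.d}

/ restore a chosen dated backup over the live sym file (date d, e.g. 2024.06.19)
restoreSym:{[d] system "cp ",backupDir,"sym.",(string d)," ",hdbSym}
```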

Sticking with the kdb+ theme, another classic example of data backup/recovery is found within the widely used kdb+ tick framework. Applications using a standard kdb+ tickerplant (TP) will typically create a tickerplant log. This TP log is created daily by default and effectively contains a record of the raw data processed by the TP for a given day. Its primary use is to provide built-in recovery for real-time subscribers, which replay its contents upon restart. However, archiving these logs day to day also provides the means to recover historical data should the need arise.
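
For example, here is a rough sketch of how an archived TP log could be replayed in a kdb+ session to rebuild a given day's data. It assumes the standard tick setup, where each log record takes the form (`upd;table;data) and the table schemas have already been loaded; the log path is a made-up placeholder.

```q
/ assumes the table schemas (trade, quote, etc.) are already defined in this session
upd:insert                            / replay each logged record as a plain insert

/ stream and evaluate every record in a TP log file
replayLog:{[path] -11!hsym `$path}

/ e.g. rebuild 19 Jun 2024 from an archived log (hypothetical path)
/ replayLog "/backup/tplogs/sym2024.06.19"
```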

In addition to backing up data critical to the application itself, another backup strategy to consider is the archival of data critical to the business. We can utilise the same storage infrastructure to house alternative resources that allow business to continue in situations where the primary application is not available to end users.

Failover

However you choose to implement DR, a critical component will be how you initiate failover. The complexity of failover will largely depend on your DR strategy and setup. Failover procedures can range from manually setting up a DR site (hot-cold) to simply switching a DNS entry to re-route user traffic to a backup site (hot-hot/hot-warm). Systems where high data availability is critical may even employ automated failover, providing a seamless transition to the DR site and removing the need for manual intervention completely.
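
Automated failover can become very sophisticated, but at its core it is usually just a health check driving a routing change. The sketch below is purely illustrative: it polls a hypothetical primary gateway on a timer and, if the gateway becomes unreachable, calls a failover function like the one sketched in the hot-warm section.

```q
/ hypothetical primary gateway address and polling interval (ms)
primaryGW:`:prodhost:5010
pollMs:10000

/ is the primary gateway still reachable? (2 second connection timeout)
primaryUp:{[] h:@[hopen;(primaryGW;2000);0Ni]; $[null h;0b;[hclose h;1b]]}

/ timer callback: if the primary is down, stop polling and trigger failover
.z.ts:{ if[not primaryUp[]; system "t 0"; failover[]] }
system "t ",string pollMs
```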

An often-overlooked aspect of failover (and DR in general) is testing. Regular testing should take place to ensure your failover mechanism is working as expected. It is also good practice to document this procedure and understand the time and effort involved in the process.

Conclusion

It is worth repeating: disasters can and will happen. Implementing even a basic DR plan could greatly reduce the impact to your application/business in the event of a disaster. We have discussed common practices for effective DR, but it is important to tailor these solutions to your application. You may require extensive DR involving multiple environments and complex failover options, or you may simply require a basic backup of critical data. Ultimately, the first step in implementing DR is to ask yourself: “How would your application/business respond today in the event of a disaster?”

Look out for the next post in our ARK series, where we will be taking a deep dive into Performance and Scalability!
