July 20, 2023
Involving the Business in Developing your BIA
As we dig deeper into the topic of Business Impact Analyses (BIAs), a picture is worth a thousand words, so let us start there. Based on guidance in the FFIEC Business Continuity Management Booklet, the following diagram depicts the three components of the BIA (Recovery Point Objective, Recovery Time Objective, and Maximum Tolerable Downtime):
Recovery Point Objective (RPO): This metric measures the amount of acceptable data loss. How much data loss you can endure is directly tied to your business processes and what it would take to recover normal functions if data were lost or corrupted. Within your business continuity plans, you set your target RPO and then test that your technologies (e.g., data backups and data restores) can meet this objective. If they cannot, then you should reevaluate the frequency of your backups. Consider these examples:
Business Process A processes a large amount of critical customer data throughout the day using Server1. The organization has determined that they can absorb only one hour’s worth of data re-entry related to Business Process A. Backups of Server1 therefore need to occur every hour, so that no more than one hour’s worth of data would be lost if a disruption were to occur.
Business Process B processes low priority data using Server2. The organization has determined that they can absorb one day’s worth of data re-entry for Business Process B and have set an objective of 24 hours of data loss. The business processes have been established to perform backups of Server2 nightly. If a disruption were to occur, then no more than one day’s worth of data would be lost, which should meet the target objective.
These backup intervals reflect the RPOs identified for each process. RPOs can vary greatly between processes. Keep in mind that the shorter the RPO, the more expensive the backup processes and technology become, so it is important to find the optimal balance between the cost of taking and storing backups and the cost of data loss. There may very well be a point where it is more expensive to maintain and manage high-frequency backups than it is to re-enter data after a failure. This dialogue with the relevant business units is part of the value of understanding, evaluating, and establishing RPOs: it lets you set realistic expectations for everybody, and each department understands the effort that must go into the recovery process.
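To make the relationship between backup interval and RPO concrete, here is a minimal Python sketch using the two examples above. It assumes the worst-case data loss is roughly one full backup interval (a disruption just before the next backup runs). The dictionary layout and function names are hypothetical, chosen only for illustration.

```python
from datetime import timedelta

# Figures from the two examples above; the structure itself is hypothetical.
processes = {
    "Business Process A": {
        "rpo": timedelta(hours=1),
        "backup_interval": timedelta(hours=1),
    },
    "Business Process B": {
        "rpo": timedelta(hours=24),
        "backup_interval": timedelta(hours=24),
    },
}


def meets_rpo(rpo, backup_interval):
    """Assumption: worst-case data loss is about one full backup interval."""
    return backup_interval <= rpo


def rpo_report(procs):
    """Map each process name to True if its backup interval satisfies its RPO."""
    return {
        name: meets_rpo(p["rpo"], p["backup_interval"])
        for name, p in procs.items()
    }


if __name__ == "__main__":
    for name, ok in rpo_report(processes).items():
        print(f"{name}: {'meets RPO' if ok else 'GAP: back up more often'}")
```

A check like this only compares stated targets with scheduled intervals. It does not prove that a restore actually works, which is what recovery testing is for.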
Recovery Time Objective (RTO): This is typically the best-understood metric in a BIA, aimed at answering the question “How long until it is back up?” RTOs establish the objective for recovery of a system, process, or other operational component impacted by a business disruption. When performing business continuity testing, you can evaluate the ability to meet this objective by measuring the time to recover. RTOs can be applied to almost anything: a particular piece of equipment (a server, a firewall), a service provided by a third party (e.g., an Internet connection), or even a business process or business unit (e.g., processing a wire transfer).
Different business units may have different expectations for the target objectives related to their processes. IT (Information Technology) and management may look at the defined expectations and determine that the cost is too high or that the technology is not capable of delivering on them. In such a case, business units, IT, and management must work together to decide the appropriate course of action.
For example, a server processing customer information may need to be back online within a matter of hours, while a server that holds only archive information may be able to be down for multiple days without serious impact. Internet service may need to be back online within minutes or hours, while a secondary connection could be down for a day or more. The same may apply with various business units within an organization. Front-line customer service staff may need to be back in service ASAP, while less time-sensitive groups can wait longer with less impact. To expect everything to be “back to normal” within the same short time frame is not only unrealistic, but it may also be a recipe for disaster if IT resources (both people and technology) are limited. Realistic RTOs are a way to declare recovery prioritization before everyone is running around like headless chickens and pressuring IT to get to them next!
There are a couple of common challenges to RPO/RTO development:
Capabilities do not match expectations. Frequently, we see RTO and/or RPO values in a BIA that are overly optimistic given the backup systems an organization actually uses. A classic example is “aggressive” RTOs for servers that have never gone through detailed, timed recovery testing. This type of testing helps determine whether the systems can meet expectations. Simply setting a one-hour RTO for a server does not make it achievable. IT staff, business unit leaders, and senior management need to collaborate. Business unit leaders can make their case for short RTOs and RPOs based on the criticality of their operation. IT can then share the results of restore testing using existing equipment and processes, illustrate any gaps between expectations and capabilities, and propose new solutions. Senior management can then make a budgetary decision: whether it makes sense to pay for a new solution, more staff, or improved resiliency for infrastructure, or whether a reset of expectations is in order.
Understanding interdependencies. Most business processes have multiple technical components supporting their activities. A strong BIA should catalog interdependencies, and RTO/RPO values should align accordingly. For example, Accounting uses a program that leverages a database on Server A, and that program also integrates with a cloud-based service over the Internet. Accounting needs this program to be fully functional within four hours of a failure, which sets a recovery objective of four hours. Both Server A and Internet access (providing access to the cloud service) should be able to recover in four hours or less to meet the target RTO. If either takes longer, then you have a misalignment and a gap between expectations and recovery realities.
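The alignment check in the Accounting example can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the eight-hour figure for Internet access is an invented value used to show what a gap looks like.

```python
def find_rto_gaps(process_rto_hours, dependency_rto_hours):
    """Return the dependencies whose RTO is longer than the process RTO."""
    return {
        name: rto
        for name, rto in dependency_rto_hours.items()
        if rto > process_rto_hours
    }


# Accounting's four-hour objective comes from the example above.
accounting_rto = 4

# Hypothetical dependency RTOs; 8 hours for Internet access is made up
# purely to demonstrate a misalignment.
dependencies = {"Server A": 4, "Internet access": 8}

gaps = find_rto_gaps(accounting_rto, dependencies)
for name, rto in gaps.items():
    print(f"{name}: RTO of {rto}h exceeds the process RTO of {accounting_rto}h")
```

An empty result means every cataloged dependency is at least nominally aligned with the process objective. Any entry returned is a conversation to have between the business unit, IT, and management.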
Maximum Tolerable Downtime (MTD): This metric is sometimes referred to as Maximum Allowable Downtime (MAD) and is more strategic in nature. NIST defines MTD as “the amount of time mission/business process can be disrupted without causing significant harm to the organization’s mission.” Leaders in an organization should consider strategic and reputational risks when developing MTDs. How long can a critical service be down (RTO) before customers start considering switching to another company? How much critical data could be lost (RPO) without irreparable reputational or strategic damage? And again, MTDs vary with criticality: some MTDs may be measured in hours or days, while low-risk items may have MTDs measured in weeks.
With this improved understanding of the BIA components, what does it take to get started? All too often, creation and management of an organization’s BIA starts and stops with IT management. Effective BIA development is a team effort. It should involve multiple groups within an organization, including business unit managers, information security staff, and senior leadership. The BIA is a business analysis, and each business unit needs to understand how disruption of processes will impact their part of the overall operation and determine how they will manage through a business-impacting event.
How to Leverage Your BIA for Business Continuity Management
Now that you have established these three metrics for your business processes, what is next? With an understanding of these building blocks, you can create business continuity and incident response plans that meet the expectations of your business and properly plan what you will do if a disruption occurs. In other words, the BIA helps you create plans that actually work and that your employees will understand and want to use.
When thinking about your RPOs, start with an evaluation of your data backup processes. The backup interval must align with the amount of data loss the business function or system can endure. In some cases, the backup interval can be shorter than what your business may require, but if it is longer than desired, then you are not properly preparing your business for data loss.
Again, consider the examples above. Recovering from data loss that has impacted a critical process can be painful. How do you recover the data? What if you must manually recreate the data? A shorter interval between backups will minimize that loss and provide for a smoother recovery. If the business process can endure as much as a twenty-four-hour period between backups, then it may be more cost-effective to perform backups daily and still meet the expectations of the business. As you plan for the various possibilities that could result in data loss or corruption, ensure that you define the steps for handling the data loss and what manual intervention may be needed to support business operations for the duration of the impact. And remember, as part of your annual BCP testing, you should evaluate your ability to recover from backups to ensure your business targets can be achieved.
Your operational processes are also critical for ensuring that you can recover from data loss or corruption. Monitoring of your backup processes is crucial so that valid backups are available and reliable if a disruption were to occur. Encrypting your backups, storing them at a secondary location, and ensuring that they are immutable (e.g., unable to be edited, off-line or air-gapped) will provide protection against possible malicious or accidental corruption.
Backup retention is another factor to consider when managing backups. Maintaining at least 90 days of backups ensures you have data for recovery if a special case warrants going back further than a few days. Note that the 90-day figure does not necessarily mean that you have 90 days of granular, full backups stored at any point in time. Given storage constraints, many organizations opt for generational backup storage, keeping something like 7 daily backups from the past week, 4 weekly backups from previous weekends, and 3 monthly backups from previous months. Those numbers are not rigid, and variations are generally acceptable if the restoration capabilities are communicated to stakeholders during BIA development and the stakeholders accept any recovery constraints. Many backup solutions available today can communicate recovery capabilities graphically, with "calendar views" of available restore points that are easily understood by those not typically involved in backup or restoration processes.
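As a rough illustration of generational retention, the Python sketch below lists which restore points a 7 daily / 4 weekly / 3 monthly scheme would keep on a given day. It rests on assumptions not stated in this article: weekly backups are taken on Sundays, and monthly backups are taken on the first day of each previous month. The function name is hypothetical, and real backup products implement retention in their own ways.

```python
from datetime import date, timedelta


def retained_restore_points(today, daily=7, weekly=4, monthly=3):
    """Return the sorted dates of backups kept under a generational scheme."""
    keep = set()

    # Daily tier: one backup for each of the most recent `daily` days.
    for i in range(daily):
        keep.add(today - timedelta(days=i))

    # Weekly tier (assumption: the weekly backup day is Sunday).
    days_since_sunday = (today.weekday() + 1) % 7
    last_sunday = today - timedelta(days=days_since_sunday)
    for i in range(weekly):
        keep.add(last_sunday - timedelta(weeks=i))

    # Monthly tier (assumption: first day of each previous month).
    year, month = today.year, today.month
    for _ in range(monthly):
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        keep.add(date(year, month, 1))

    return sorted(keep)


if __name__ == "__main__":
    today = date(2023, 7, 20)
    points = retained_restore_points(today)
    print(f"{len(points)} restore points retained")
    print(f"oldest restore point is {(today - points[0]).days} days old")
```

A listing like this is essentially the data behind a "calendar view": it shows stakeholders that older restore points exist, but only at weekly or monthly granularity.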
When performing testing, measuring how quickly you can recover from a business disruption lays the foundation for your BCP. The recovery time for individual components may influence how you architect your network and systems environment. For example, if having a single Internet connection down for as much as a day would be too impactful, then you may decide that it is worth the investment to have a secondary Internet connection. This would lessen the likelihood of a day-long Internet outage. As you define and manage your target RTOs, also consider how one impact, like an Internet outage, can have a residual impact on system and data recovery efforts. As you plan for and execute your BCP testing, look for test scenarios that may demonstrate a domino effect so you can ensure your recovery processes are set up for success. Testing also allows you to confirm that your target objectives are realistic. It is an opportunity to evaluate recovery against the objectives in your BIA and to prepare each business unit for managing through a business disruption. If recovery efforts fail to meet the RTO, then changes to supporting technology, resources, or other adjustments may be needed to meet the target objective.
The last component is your Maximum Tolerable Downtime. If you thought the last two metrics were hard to wrap your head around, then this one may really have you stumped. Oftentimes, an organization looks past this one; however, a proper BIA includes thinking through scenarios that could put your business under undue stress. This discussion needs to include senior leadership, as it is more of a strategic thinking exercise. Consider what can cause the organization undue or extreme stress (e.g., extended power outages, loss of Internet connectivity, data center facility damage). What triggers should be in place to prevent adverse impacts (e.g., when to fail over to a backup data center, when to move wire operations to an alternate location)? Where is more investment needed to prevent or lessen impacts (e.g., purchase of a backup generator, a fully operational backup data center)?
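One way to reason about a domino effect when planning tests is to model which components must be recovered before others can start. The Python sketch below is an illustration under simplifying assumptions: everything fails at once, a component cannot begin recovery until all of its prerequisites are back, and prerequisites recover in parallel. The component names, hour figures, and function name are all hypothetical.

```python
def effective_recovery_hours(component, own_hours, prerequisites, _path=None):
    """Hours until `component` is usable, counting upstream recovery first."""
    path = _path or set()
    if component in path:
        raise ValueError(f"circular dependency involving {component!r}")
    path = path | {component}

    # The slowest prerequisite chain determines when recovery can begin.
    upstream = max(
        (
            effective_recovery_hours(dep, own_hours, prerequisites, path)
            for dep in prerequisites.get(component, [])
        ),
        default=0,
    )
    return upstream + own_hours[component]


# Hypothetical recovery times (hours) for each component on its own.
own_hours = {
    "Internet": 6,
    "Cloud restore of Server A": 3,
    "Accounting application": 1,
}

# Hypothetical prerequisites: restoring from the cloud needs the Internet,
# and the application needs the restored server.
prerequisites = {
    "Cloud restore of Server A": ["Internet"],
    "Accounting application": ["Cloud restore of Server A"],
}

total = effective_recovery_hours("Accounting application", own_hours, prerequisites)
print(f"Accounting application usable after about {total} hours")
```

With these made-up figures, a server restore that looks fast in isolation still cannot begin until the Internet connection returns, so the chained total is what should be compared against the RTO in the BIA.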
Many potential impacts can be brought into the discussion. For example, an ice storm could result in an extended power outage. How many days could you handle without electricity? Are there recovery plans that you might kick into motion on day two or three, such as executing a rental agreement with a generator vendor or purchasing a generator, so you are better prepared? Including the discussion on MTD as part of your BIA development will support a more thorough planned-for response.
Developing your BIA is necessary for a solid business continuity plan. And remember that the BIA does not begin and end at the planning stage. Testing real-world scenarios, with engagement by management from all areas of your business, should be baked into your annual review process to keep moving business continuity planning to the next level and ensure that the next disruption will be handled gracefully.
Authored By: Renee Keffer