The purpose of Business Continuity Planning (BCP) is to ensure the uninterrupted delivery of products and services to your customers. In essence, its goal is to help you keep performing your daily operations and stay in business by preventing:
- Loss of business to competitors
- Supply chain interruptions
- Injury to customers or employees
- Loss of reputation
Another way to look at BCP is as a path to business continuity assurance: assurance that unplanned service interruptions caused by probable events are identified and planned for. A key phrase in this definition is “probable events.” As we’ll see as we move through the BCP steps, it isn’t necessary to plan for every event your team can imagine; the probability of occurrence must be considered. For example, a business in Wisconsin would not plan for a hurricane, while an organization in Florida might place hurricane planning near the top of its list.
Many people think of BCP as synonymous with Disaster Recovery Planning (DRP). Although DRP is important, it’s only one piece of effective BCP. The probability that your business will suffer a catastrophic event is much less than the probability of experiencing a failed server or router. BCP should be integrated into all business processes, a standard part of any technology project or implementation plan.
So how does an organization achieve a reasonable and appropriate level of business continuity assurance? The rest of this series is focused on answering that question.
There are five steps to achieving business continuity assurance.
Step 1: Business Analysis
The first step is to analyze your business. This goes far beyond a simple analysis of your network infrastructure. It also includes the following:
1. An understanding of all processes that make your business function, including how those processes work together to produce business outcomes. This includes identifying the vendors and other business partners whose contributions to your operation are critical for product and service delivery. Include why and in what manner you interact with each entity. It’s also important to record contact information as well as the existence of agreements that contain clauses dealing with interruption of deliveries, service, support, payments, etc.
2. A thorough understanding of your information processing infrastructure. It isn’t enough to understand your internal network. You must also understand how your network interfaces with those of your customers, banks, and suppliers. Your infrastructure assessment must include all required workstations, servers, storage devices, backup/restore systems, and communication services.
3. An understanding of which people are critical to your business. These individuals are often not on your management team. Rather, they are the people who work in the trenches every day. Their understanding of how to get work done is a key element in maintaining business continuity. Additional information about them, and the tasks they perform, includes:
- The existence of cross-training to ensure more than one person can adequately perform business critical tasks.
- An assessment of how to maintain business continuity if key people are unable or unwilling to participate in recovery operations.
4. The identification of vendors who will assist with your recovery. They might include:
- Computer hardware and software vendors
- Recovery site vendors
- Communication vendors
5. The creation of a contact list including all key employees. Contact information should include:
- Home address
- Home phone
- Cell phone
6. An assessment of all key support services, including:
- Voice communication
- Fax services
- Shipping and receiving
Upon completion of the analysis step, you should have a clear view into the people, processes, and technologies necessary to continue delivering products and services. As we move to the next BCP step, we begin assessing the risk associated with the full or partial loss of one or more of them.
Step 2: Assess Risks
The first step in a BCP risk assessment is identifying internal and external threats. Next, look at the critical components of your organization to determine vulnerabilities of each to the identified threats. Finally, determine the business impact of the partial or complete loss of each critical operational component. Some of the areas to address include:
- Loss of short-term revenue
- Loss of long-term revenue
- Loss of investor confidence
- Loss of key employees
- Loss of facilities or other key fixed assets
As you work through these and other possible business issues, try to change a limited set of variables through the use of scenario planning. Scenario planning enables management and key employees to work through several kinds of business continuity interruptions, and helps determine if the team has considered all critical recovery requirements. Your list of scenarios might include:
- One or more facilities are untenable, but the information processing infrastructure is still operational. This could be caused by a chemical spill, for example.
- One or more facilities no longer exist due to fire, hurricane, explosion, etc.
- All facilities are operational, but a supplier is temporarily shut down because of a catastrophic event
- The central data center is no longer operational, but all other facility functions are capable of normal operation
- In the case of a catastrophic event, many key employees or their families are affected. One or more of these employees might be unable to help with your recovery efforts. The following questions should therefore be answered as part of BCP (from John Burtles’ Beware the Complex Plan…):
- Who is prepared to do what? What activities and conditions will they tolerate?
- Who is not prepared to do certain things?
- What are the general reservations, or things the entire team is reluctant to do?
- List what people can do above and beyond their normal duties. For example, who may have a 4-wheel drive vehicle or unique skills other than those used daily at the office?
Additional scenarios are found in John Burtles’ Some thoughts on exercise scenarios and plot lines.
Using the results of scenario planning activities, build a quantitative or qualitative risk assessment chart. The resulting risk scores help with prioritization of process recovery.
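One simple way to build such a chart is to score each scenario on likelihood and impact and rank by the product. The sketch below assumes a 1 to 5 scale for both and a multiplicative score; the scale, the formula, and the sample scenarios are illustrative assumptions, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    scenario: str    # scenario taken from the planning exercise
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def prioritize(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical entries for illustration only.
risks = [
    Risk("Facility untenable (chemical spill)", likelihood=1, impact=4),
    Risk("Key supplier shut down", likelihood=3, impact=3),
    Risk("Central data center offline", likelihood=2, impact=5),
]

ranked = prioritize(risks)
for r in ranked:
    print(f"{r.score:>2}  {r.scenario}")
```

A qualitative chart works the same way if you replace the numbers with labels such as low, medium, and high and agree on how the labels combine.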
Finally, list all key processes in a matrix that includes, at a minimum, the following information:
- The process owner
- Key individuals required to produce the desired outcome(s)
- The technology required to execute each process and any manual tasks used as workarounds
- The maximum number of hours or days the organization can survive without the output of the process
- Any special considerations resulting from scenario planning
- Dependencies (what processes must be operational to support one or more other processes)
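The matrix can live in a spreadsheet, but a small data structure makes the columns explicit. The following is a minimal sketch: the field names, the sample rows, and the two helper functions are hypothetical and simply mirror the items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessEntry:
    name: str
    owner: str
    key_people: list
    technology: list
    manual_workarounds: list
    max_outage_hours: int  # longest the organization can survive without the output
    special_considerations: str = ""
    depends_on: list = field(default_factory=list)


# Hypothetical rows for illustration only.
matrix = [
    ProcessEntry("Payroll", "HR manager", ["Payroll clerk"], ["Payroll server"],
                 ["Manual check run"], 72, depends_on=["Network", "Bank link"]),
    ProcessEntry("Order entry", "Sales manager", ["Order desk lead"], ["Order system"],
                 ["Paper order forms"], 8, depends_on=["Network"]),
    ProcessEntry("Network", "IS manager", ["Network engineer"], ["Routers", "Switches"],
                 [], 4),
]


def by_urgency(entries):
    """Order processes from shortest to longest tolerable outage."""
    return sorted(entries, key=lambda e: e.max_outage_hours)


def undocumented_dependencies(entries):
    """Find dependencies that have no row of their own in the matrix."""
    known = {e.name for e in entries}
    gaps = {}
    for e in entries:
        missing = [d for d in e.depends_on if d not in known]
        if missing:
            gaps[e.name] = missing
    return gaps


print([e.name for e in by_urgency(matrix)])
print(undocumented_dependencies(matrix))
```

Checking for dependencies that have no row of their own is a quick way to spot processes the analysis step may have missed.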
After compiling all assessment information, you’re ready to begin developing a recovery strategy and plan.
Step 3: Strategy & Plan Development
Before developing your recovery plan, review the risk assessment matrix. Select the appropriate business continuity strategy for each risk. The strategies you develop for each system directly impact recovery. Possible strategies fall into one of three categories:
- Accept the risk
- Transfer the risk
- Reduce the risk to an acceptable level
Accepting the risk means taking no steps to prevent or mitigate the impact of a continuity event. However, planning should include clear recovery steps, steps that minimize business interruption via quick, efficient recovery activities.
Transferring risk includes purchasing business interruption insurance. It’s important not to be too short-sighted. Your insurance carrier might pay for short-term losses, but you may never recover from the long-term effects of the loss of customer or investor confidence.
Reducing risk is typically accomplished by reducing or eliminating vulnerabilities, including:
- A single point of failure, such as a server, router, switch, or firewall
- Lack of proper documentation to rebuild one or more components of a system
- Insufficient skills within the technical teams to quickly recover from system failures
- Lack of agreements with vendors that obligate them to respond within a defined time-frame
- Lack of an overall technology recovery plan, or the presence of an untested plan
- Lack of documented manual processes that can be initiated if automated systems fail
- Lack of cross-training programs that ensure more than one person possesses a critical skill set
- Lack of involvement by non-IS personnel in recovery testing
Strategies for dealing with these and other potential business continuity weaknesses can take many forms. For example, single points of failure can be mitigated by maintaining one or more duplicate components “on the shelf,” helping reduce downtime by eliminating equipment acquisition cycles. Another method is implementing redundant components. This provides for minimal downtime through automatic failover from a failed device to one that is either on standby or in a load-balancing relationship. Another way to mitigate risk is to include the proper maintenance of system build documentation in all project plans. Regardless of the vulnerabilities you identify, ensure you mitigate them so you can recover each system before its maximum tolerable downtime is reached.
Maximum Tolerable Downtime (MTD): MTD is the longest period a specific business process can be down without significant or irrecoverable business impact. Every effort should be made to ensure a process is recovered prior to exceeding its MTD.
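As a rough illustration of that check, the sketch below compares an expected or measured recovery time for each process against its MTD and flags any process that would exceed it. All figures and process names are hypothetical; real values come from your own analysis and tests.

```python
# Hypothetical figures in hours; real values come from your own analysis and tests.
mtd_hours = {"Order entry": 8, "Payroll": 72, "Email": 24}
measured_recovery_hours = {"Order entry": 10, "Payroll": 30, "Email": 24}


def mtd_margins(mtd, measured):
    """Map each process to its margin: MTD minus measured recovery time."""
    return {name: mtd[name] - measured[name] for name in mtd if name in measured}


margins = mtd_margins(mtd_hours, measured_recovery_hours)

# A negative margin means recovery takes longer than the process can tolerate.
at_risk = sorted(name for name, margin in margins.items() if margin < 0)

for name, margin in margins.items():
    status = "EXCEEDS MTD" if margin < 0 else "within MTD"
    print(f"{name}: margin {margin} h ({status})")
```

A margin of zero still counts as within MTD here, but it leaves no room for error and probably deserves the same attention as a negative one.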
Now that the risks are identified, and you’ve documented strategies for dealing with them, you’re ready to build your recovery plans. The following are some recommended steps for creating a successful plan:
- **Create a clear communication plan.** When a business continuity event occurs, communication is probably the most important recovery activity. All stakeholders must be kept informed of the type of event, the impact on the business overall, and the impact on their teams or departments. Understanding the scope of an event helps managers determine the best course of action to maintain the critical processes for which they’re responsible. Other points of contact for inclusion in the communication plan include:
- Fire services
- Law enforcement
- Insurance carriers
- Create recovery teams. Looking at your recovery requirements, create a team for each specific recovery area. For example, select a team of individuals who will travel to your hot site to rebuild your data center. Another consideration is a team assigned to set up a temporary office environment with phones, workstations, fax machines, and other office equipment necessary to perform day-to-day activities.
- Create easy-to-follow checklists. When first responding to a business continuity incident, your response teams shouldn’t be encumbered with lengthy, verbose technical or process documentation. Rather, they should follow checklists that quickly guide them through the initial stages of recovery. Reacting quickly during the first few hours is critical to positioning your organization for a successful recovery. Completion of checklists should result in:
- Notification of critical personnel
- Identification of incident type
- Identification of incident scope
- Business impact mitigation
- Initiation of process and technology recovery efforts, if necessary
- Create system/process recovery documentation. In addition to lists of forms and other items necessary to implement temporary manual processes, this step requires the creation of detailed documentation that results in the recovery of all delivery systems. Examples include:
- Server and workstation build documents
- Application and data recovery documents
- Manual process instructions
- Plan for worst-case scenarios. Creating documentation for each possible scenario might not be practical. Your business continuity teams are usually engaged in day-to-day operational activities when they’re not working on BCP. In such cases, develop all recovery documentation with the intent to recover from catastrophic events. If your teams are properly trained, they will be able to adapt the plans on-the-fly to lesser incidents. Regular testing will help develop necessary awareness and flexibility.
Step 4: Test the Plan
Testing the plan is probably the most important part of BCP, and often the most neglected. Organizations that fail to conduct regular tests can’t reasonably expect their recovery teams to react quickly enough to an actual incident.
“The true measure of success from a business perspective is the pace of recovery. All of our business continuity plans and preparations are aimed at improving response and recovery times in order to reduce the impact on the business” (John Burtles, 2004).
To reach the measure of success defined by Burtles, your testing must target two primary objectives. First, your teams must be so familiar with the recovery process that management intervention is unnecessary except to help teams overcome external obstacles. Second, documentation should be free from inaccuracies, including those caused by mistakes or by changes in the business environment. It’s also a good idea to question all documented activities. Do they represent the most efficient path to recovery?
Testing should be a multi-step process. Without proper preparation, BCP tests will fall far short of your objectives. The elements of effective testing are to educate, maintain, and test.
Educate each team on the contents of the documents related to their areas of responsibility. Team education should result in:
- An understanding of team roles.
- Solid knowledge of the processes and steps contained in the documentation.
- Elimination of team member resistance and apathy. In many organizations, the initial reaction to business continuity activities is that they are a waste of time, something that pulls team members away from “real work.” The education process must address this issue by helping team members understand the importance of BCP.
- Recommendations from the teams on how to improve the recovery processes.
- Team leaders having a high-level view of how their teams’ activities fit into the overall recovery plan.
Between tests, the documentation must be properly maintained. The best way to accomplish this is through an effective change management process. Some of the deliverables of change management are updated configuration and build documentation for infrastructure components, updated process diagrams, and changes to forms required for manual processes. The responsibility for ensuring documentation changes are made and included in the BCP must be clearly defined.
The purpose of testing is to ensure that the documentation is accurate and to increase the awareness of recovery teams. Other reasons to test include:
- Recording system recovery times
- Identifying and documenting system recovery dependencies – in some cases, systems must be recovered in a specific order to fully recover delivery systems
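Once recovery dependencies are documented, a valid recovery order can be derived mechanically. The sketch below uses the `graphlib` module from the Python standard library (Python 3.9 or later); the system names and their dependencies are hypothetical examples.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key lists the systems that must be
# running before that system can be recovered.
dependencies = {
    "Network": set(),
    "Directory services": {"Network"},
    "Database server": {"Network", "Directory services"},
    "Payroll application": {"Database server"},
}


def recovery_order(deps):
    """Return systems in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = recovery_order(dependencies)
for step, system in enumerate(order, start=1):
    print(f"{step}. {system}")
```

If the documented dependencies contain a loop, `static_order()` raises `graphlib.CycleError`, which is itself a useful finding: the documentation describes a recovery sequence that cannot be carried out as written.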
Once you’re confident your documentation is reasonably accurate, plan the test. You don’t have to successfully restore a system to have a successful test. Remember, testing is designed to raise the awareness of your teams and to identify inaccuracies and inefficiencies in your documentation. However, you should establish certain guidelines and test objectives for each test. The following steps will help with the test planning process:
- Establish a test strategy. This should include the type of test and the systems and processes you want to recover. There are three basic types of tests – checklist, walk-through, and hot site. A checklist test is performed by the individuals in your organization most familiar with the process or system being tested. The purpose of this test is to ensure the accuracy of the documentation. A walk-through is typically performed by one or more recovery teams sitting at a conference table. Walking through the recovery documents as though they were actually recovering from an incident increases awareness and helps identify roadblocks to recovery. A hot site test is an actual physical build of infrastructure and business processes.
- Clearly define the objectives of the test. Often, the objective of a test may be to simply test how long it takes to recover one or more systems. Other times, you may need to demonstrate you can recover and run a specific task. For example, your objective may be to recover your payroll system and actually print checks. Whatever your objectives, everyone participating in the test should understand what it is they’re trying to accomplish.
- Define how the test is to be conducted. Test planning should include criteria governing how the test will be performed. This includes how the documentation should be used, the types of logs and reports each team must complete, and the sequence of events. One challenge you should address is the tendency for teams to recover processes and systems based on memory rather than using recovery documentation. This is usually a bad idea. Planning for worst case scenarios includes planning for recovery situations in which your internal staff might not be available. If the documentation is not tested through strict adherence to it during recovery tests, you probably won’t be able to rely on it in an actual declared disaster.
- Select the test team. Selecting the team for the test is relatively easy. The team responsible for the process or technology being tested should conduct the test. Ensure each member of the team understands the test strategy, objectives, and how the test is to be conducted.
- **Test.** Step through each phase of recovery, including initial notification of team members, immediate response through checklist implementation, and full system/process recovery. During a test, the following items should be documented in detail:
- Test start time
- Time each task in the plan is completed
- Actual time to complete each task
- Inaccuracies encountered in the documentation
- Recommendations for improvement
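If the test log records the start time and the time each task was completed, the actual time per task can be computed afterward instead of being tracked by hand. This is a minimal sketch that assumes tasks run one after another on the same day; the task names and times are made up for illustration.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M"

# Hypothetical log: test start time, then (task, completion time) in order.
test_start = "09:00"
completions = [
    ("Notify team members", "09:10"),
    ("Restore server", "11:40"),
    ("Verify application", "12:05"),
]


def task_durations(start, completed):
    """Return (task, minutes) pairs, assuming tasks ran sequentially."""
    previous = datetime.strptime(start, TIME_FORMAT)
    durations = []
    for task, finished_at in completed:
        finished = datetime.strptime(finished_at, TIME_FORMAT)
        minutes = int((finished - previous).total_seconds() // 60)
        durations.append((task, minutes))
        previous = finished
    return durations


durations = task_durations(test_start, completions)
total_minutes = sum(minutes for _, minutes in durations)

for task, minutes in durations:
    print(f"{task}: {minutes} min")
print(f"Total: {total_minutes} min")
```

Tests that span midnight or run tasks in parallel would need full timestamps and a start time per task rather than this simplified log.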
Step 5: Manage Test Results
Using the documentation generated during the test, conduct an After Action Review (AAR). The fundamental purpose of the AAR is to identify and address people, process, and technology issues related to efficient and effective recovery. The output of the AAR is an action plan that, at a minimum, should include the following activities:
- Documentation updates
- Modifications to agreements with recovery vendors
- Changes to processes
- Team restructuring
The results of the test, including the AAR action plan, should be communicated to management as soon as possible after the test.
The BCP produced by Steps 1 through 3 is not just a book for the auditors that sits unused on someone’s bookshelf. Regular testing followed by a remediation action plan, Steps 4 and 5, is the cornerstone of an effective business continuity program. This is an incremental, evolving process. Each time you execute the test-manage cycle, your team becomes a little more capable of responding to business continuity events in a way that prevents significant business impact.
Business Continuity Planning is not a one-time project. It is a continuous process that results in incremental improvements in your organization’s ability to effectively recover from unplanned business interruptions.