Disaster recovery (DR) is all too often an afterthought in business continuity strategies. Even enterprises with complex systems and terabyte upon terabyte of sensitive data can be guilty of having outdated and untested DR plans, or no DR plan at all. An effective DR plan focuses on the technology systems supporting critical business functions; it involves a set of policies and procedures for recovering and/or continuing vital technology infrastructure and systems following any kind of disaster. Essentially, in an effective DR plan, technology systems will transition from the primary site to the DR site.
One of the biggest challenges companies face when creating DR plans is deciding between self-managed, on-prem hardware or cloud solutions. For enterprises and organizations with complex monolithic applications, the relative ease of expanding their existing on-prem solutions for disaster recovery is tempting; after all, using a cloud DR solution would require refactoring and modernization. But there are some hefty risks associated with on-premises hardware—labor-intensive maintenance, infrastructural rigidity, potential outages, networking limitations, high latencies, and data storage and retrieval issues. SAP customers teetering between the two strategies should consider a number of important factors.
What matters to your disaster recovery strategy
There is no one-size-fits-all approach to disaster recovery. Strategies differ from application to application according to structure, function, and objective. The most successful DR plans consider the entire technology network and the company’s end-goals.
Identifying the best strategy, architecture, and toolset for your business begins with defining your Recovery Time Objective (RTO), which is how long you can afford to have your business offline, and your Recovery Point Objective (RPO), which is how much data loss you can sustain before you run into compliance issues or financial losses. The smaller your RTO and RPO goals are, the more the DR solution for that application will cost.
Every organization, regardless of its situation and goals, also needs to determine and factor in the costs to the business while the system is offline, and the costs for data loss and re-creation.
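As a rough illustration of the cost side of this decision, the sketch below estimates the business cost of a single outage from an RTO, an RPO, and two hourly rates. The class and function names, the simple linear cost model, and every figure are assumptions made for this example; they are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageAssumptions:
    """Hypothetical inputs; replace them with figures from your own business."""
    rto_hours: float                 # how long the system may stay offline
    rpo_hours: float                 # how many hours of recent data may be lost
    downtime_cost_per_hour: float    # revenue and productivity lost while offline
    recreation_cost_per_hour: float  # cost to re-create one hour of lost data


def estimated_outage_cost(a: OutageAssumptions) -> float:
    """Worst-case cost of one disaster under a simple linear model."""
    downtime_cost = a.rto_hours * a.downtime_cost_per_hour
    data_loss_cost = a.rpo_hours * a.recreation_cost_per_hour
    return downtime_cost + data_loss_cost


if __name__ == "__main__":
    # Same hourly rates, two very different recovery objectives.
    relaxed = OutageAssumptions(48, 24, 10_000, 2_000)
    strict = OutageAssumptions(0.5, 0.0, 10_000, 2_000)
    print("relaxed objectives:", estimated_outage_cost(relaxed))
    print("strict objectives:", estimated_outage_cost(strict))
```

Comparing the difference between the two results with the extra yearly cost of the stricter DR setup gives a first, coarse indication of whether tighter objectives are worth paying for.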
3 types of applications and 3 paths to DR
Depending on the application and databases involved, there are several ways of replicating data and the corresponding application configuration from the primary site to the DR site.
Path 1: RTO within days/RPO depends on function
This scenario is meant for non-critical business applications and non-production environments; it has a recovery time objective in the range of a few hours to a few days, with a recovery point objective of less than a day. In the event of a disaster, SAP systems running in Google Cloud are recovered from persistent disk snapshots, backups stored in Cloud Storage buckets, or both. New VMs for database and application servers can also be created from Compute Engine machine images (beta). In addition, SAP HANA databases can be recovered directly from Cloud Storage buckets when the SAP HANA Backint agent for Google Cloud (beta) is used for database backup. The frequency of backups for the SAP system’s database and application servers determines the RPO. One of the key advantages of this path is that no costs are incurred for standby systems (hot or cold) during normal operations, because new VMs are created only after a disaster occurs. Managed backup solutions from third parties such as Actifio, Commvault, and Dell EMC can also be used.
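Because backup frequency drives the RPO on this path, it can help to check a proposed schedule against the target. The sketch below assumes a simplified model in which a backup captures the system state at the moment it starts and only becomes usable once it completes; the function names and example durations are hypothetical.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta, backup_duration: timedelta) -> timedelta:
    """Largest possible data loss under the simplified model.

    If a disaster strikes just before a running backup completes, the last
    usable backup is the previous one, which started one full interval
    earlier. The loss is therefore the interval plus the backup duration.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              backup_duration: timedelta,
              rpo_target: timedelta) -> bool:
    """True when the schedule keeps worst-case data loss within the target."""
    return worst_case_rpo(backup_interval, backup_duration) <= rpo_target


if __name__ == "__main__":
    target = timedelta(hours=24)
    # Backups every 12 hours that take 1 hour each.
    print(meets_rpo(timedelta(hours=12), timedelta(hours=1), target))
    # Daily backups that take 1 hour each.
    print(meets_rpo(timedelta(hours=24), timedelta(hours=1), target))
```

Under this model a daily backup does not quite deliver an RPO of one day, which is one reason to schedule backups more often than the target alone suggests.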
Path 2: RTO in less than one day/RPO within minutes
This path is meant for applications that a business can function without temporarily, provided there’s a reasonable recovery plan. In the event of a disaster, SAP application servers are recovered from persistent disk snapshots or Compute Engine machine images, the same approach as in the previous path. For database server recovery, the approach differs based on the type of database underlying the SAP system (SAP HANA or other databases). The SAP HANA database has an asynchronous replication feature that ensures near real-time replication. For other databases, the recovery approach relies on the database’s own replication features, or on restoring from backup and replaying the most recent replicated logs. Because you can recover the database to any point in time up to the last replicated log, you also help protect the system from potential user error.
In Google Cloud, persistent disk snapshots and Compute Engine machine images can have multiregional storage locations for geo-redundancy of data. Cloud Storage buckets also offer the additional option of dual-region storage locations that combine the performance of a single region with geo-redundancy. The key consideration in this approach is that the benefit of a shorter RTO and a lower RPO comes with the cost of running a database server in the DR site (for data or log replication). An additional risk is a potential capacity crunch in the DR region when standing up application servers within the targeted RTO. This can be mitigated either by making reservations for capacity (at an additional cost) or by running a non-production system, such as a quality assurance or test system, in the DR region, whose capacity can be repurposed for the recovery of a production system in the event of a disaster.
Path 3: RTO in minutes/RPO as close to zero as possible
This final strategy is best suited for business-critical applications. With this path, the full reservation of resources is guaranteed at the disaster recovery site. The SAP systems in the DR region are always on and configured to the same size as the source systems, which ensures that your applications will recover quickly. While the benefit of the lowest RTO/RPO numbers comes at the cost of constantly running servers in the DR region, Google Cloud’s pricing options, such as sustained use discounts, help you architect a cost-effective DR strategy.
In any of the paths that you choose for DR, Google Cloud’s premium networking brings industry-leading network performance, software-defined networking, global virtual private networks, and best-in-class security, all of which enable a simplified, yet robust and reliable DR architecture.
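The three paths can be read as a small decision table: choose the least expensive path whose recovery capability still meets the targets. The sketch below encodes that idea. The numeric cut-offs are assumptions chosen only to make the example concrete, since the descriptions above give rough ranges rather than exact limits, and the names `DrPath` and `choose_path` are hypothetical.

```python
from datetime import timedelta
from typing import NamedTuple


class DrPath(NamedTuple):
    name: str
    achievable_rto: timedelta  # assumed recovery time this path can deliver
    achievable_rpo: timedelta  # assumed data-loss window this path can deliver


# Ordered from least to most expensive to operate.
# All durations are assumed cut-offs for illustration only.
PATHS = [
    DrPath("Path 1: restore from snapshots and backups",
           timedelta(days=2), timedelta(days=1)),
    DrPath("Path 2: replicated database, application servers rebuilt",
           timedelta(hours=8), timedelta(minutes=15)),
    DrPath("Path 3: always-on, full-size DR systems",
           timedelta(minutes=15), timedelta(0)),
]


def choose_path(rto_target: timedelta, rpo_target: timedelta) -> DrPath:
    """Return the cheapest path that satisfies both targets."""
    for path in PATHS:
        if path.achievable_rto <= rto_target and path.achievable_rpo <= rpo_target:
            return path
    # Nothing cheaper fits, so fall back to the strictest path.
    return PATHS[-1]


if __name__ == "__main__":
    print(choose_path(timedelta(days=3), timedelta(days=1)).name)
    print(choose_path(timedelta(hours=12), timedelta(minutes=30)).name)
    print(choose_path(timedelta(minutes=30), timedelta(0)).name)
```

Keeping the cut-offs in one table makes it easy to revisit them per application as real recovery tests produce measured numbers.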
More considerations for planning your DR strategy
After you’ve defined the RPOs and RTOs that will guide your DR design, consider capacity planning and automation as part of your larger business continuity plan. Begin by making sure there’s enough capacity available to stand up a copy of a development system, so that you can control how to develop and transport any emergency SAP changes to the production system.
Although initiating a DR plan is usually a manual task, recovery and startup should be automated to ensure fast and error-free recovery. With Google Cloud, infrastructure is treated as code: we believe that repeatable tasks like provisioning, configuration, and deployment should be automated. All Google Cloud customers have access to infrastructure as code (IaC) capabilities with which you can repeatedly build, start, and stop landscapes (these are the three steps needed to bring systems back into operation).
For SAP installations, Google Cloud also offers specific Deployment Manager and Terraform scripts that not only reduce infrastructure creation times but also automate typical SAP system configurations, such as an SAP HANA cluster setup with HSR and Pacemaker (full list of configurations here). These scripts can be enhanced or customized for specific deployment use cases, including standing up systems in the first recovery path mentioned above. Google Cloud also has additional automation tools like Cloud Scheduler which, in combination with Cloud Functions and Cloud Pub/Sub, can be used to automate your backups as well as the testing of your DR strategies.
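As a minimal sketch of what an automated backup step might look like, the function below builds a `gcloud compute disks snapshot` command with a timestamped snapshot name and a multi-regional storage location. It only constructs the command so that it can be reviewed or logged before anything runs. The disk name, zone, and naming scheme are hypothetical, and the flags should be checked against the current gcloud reference before use.

```python
from datetime import datetime, timezone
from typing import List, Optional


def snapshot_command(disk: str,
                     zone: str,
                     storage_location: str,
                     now: Optional[datetime] = None) -> List[str]:
    """Build (but do not run) a gcloud command that snapshots one disk."""
    now = now or datetime.now(timezone.utc)
    # Timestamped name so repeated runs never collide (assumed naming scheme).
    snapshot_name = f"{disk}-{now:%Y%m%d-%H%M%S}"
    return [
        "gcloud", "compute", "disks", "snapshot", disk,
        f"--zone={zone}",
        f"--snapshot-names={snapshot_name}",
        f"--storage-location={storage_location}",
    ]


if __name__ == "__main__":
    # Hypothetical disk and zone; "us" requests a multi-regional location.
    cmd = snapshot_command("sap-hana-data", "us-central1-a", "us")
    print(" ".join(cmd))
```

On a machine with the gcloud CLI installed and authenticated, the returned list could be passed to `subprocess.run(cmd, check=True)`. A scheduled job could call the function once per disk, and the same pattern can drive periodic restore tests.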
Don’t wait to develop your DR plan
Most businesses have learned firsthand that planning for the unexpected requires urgent attention. It begins with developing your DR plan, but that’s not enough. Your plan needs to address the full recovery process, from fail-over to fail-back, which includes planning, architecting, testing, and iterating or updating. Keep your business objectives top of mind so that your solution provides the right service at the right cost. Once your plan is in place, remember that frequent testing and updating are critical to business continuity and DR strategies. And finally, automate whenever and wherever possible. Automation can be daunting without features like Cloud Scheduler, Cloud Functions, and Deployment Manager. With these features at hand in Google Cloud, you can keep your DR plan ready to go and reduce the risk of errors with minimal effort.
To learn more about creating the optimal disaster recovery strategies for your SAP systems and applications on Google Cloud, download our whitepaper SAP on Google Cloud: Disaster Recovery Strategies and view this video on Google Cloud Disaster Recovery Strategies and Solutions for SAP Customers.