What is a Disaster?
- A hazard that has come to realization
- A perceived tragedy
  - Natural calamity
  - Man-made catastrophe
Disasters are the consequence of inappropriately managed risks
DR for a BSS Data Centre
4
Risks to be Addressed
What is Disaster Recovery from an IT Perspective?
- The timely and effective restoration of IT services after a major incident
- Any plan or set of procedures a business implements to maintain uptime and/or prevent data loss in the event of a system failure
Disaster Recovery
- People: staff, outsourced
- Process: crisis management
- Technology: hardware, software
Metrics for Disaster Recovery (1/2)
Driven by two metrics:
- Recovery Time Objective (RTO): interrupted for how long?
- Recovery Point Objective (RPO): how much data loss?
Metrics for Disaster Recovery (2/2)
[Timeline figure, 5 a.m. to 7 p.m.: disaster declared at 10 a.m.; the Recovery Point Objective (RPO) is measured backwards from that point, the Recovery Time Objective (RTO) forwards.]
RPO: the amount of data lost in the failure, measured as the span of time preceding the disaster event
RTO: the targeted amount of time to restart a business service after the disaster event
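As a quick sketch of the two metrics (the helper name and timestamps are illustrative, not from the deck), the achieved RPO and RTO for a single incident can be computed directly from three points in time:

```python
from datetime import datetime, timedelta

def rpo_rto(last_good_copy: datetime, disaster: datetime,
            service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (achieved RPO, achieved RTO) for one incident.

    RPO: data written after `last_good_copy` is lost, so the loss window
    runs backwards from the disaster to the last replicated/backed-up copy.
    RTO: the outage window runs forward from the disaster to restoration.
    """
    return disaster - last_good_copy, service_restored - disaster

# Hypothetical incident matching the slide's timeline: disaster declared
# at 10 a.m., last usable copy of the data taken at 9 a.m., service back
# at 1 p.m. -> 1 hour of data lost, 3 hours of downtime.
rpo, rto = rpo_rto(datetime(2024, 1, 1, 9, 0),
                   datetime(2024, 1, 1, 10, 0),
                   datetime(2024, 1, 1, 13, 0))
```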
Understanding RPO and RTO
Cost of downtime per hour
- Employee cost per hour + cost of problem repair + cost of employee overtime
- Loss of customers
- Reputation of the company
Recovery Point Objective (RPO)
- A point in time to which the data must be recovered
- The acceptable loss of data in a disaster situation
Recovery Time Objective (RTO)
- The duration of time within which a business process must be restored after a disaster (the underlying infrastructure and application components are restored first)
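The downtime-cost bullet can be turned into a minimal calculator. All figures below are illustrative assumptions, not numbers from the deck; the intangible items (customer loss, reputation) are deliberately left out because they resist a per-hour price:

```python
def downtime_cost_per_hour(employees: int, employee_cost_per_hour: float,
                           repair_cost_per_hour: float,
                           revenue_loss_per_hour: float) -> float:
    """Tangible cost of one hour of downtime: idle staff, repair effort
    and lost business.  Reputation damage and customer churn are real
    but hard to price, so they are excluded here."""
    return (employees * employee_cost_per_hour
            + repair_cost_per_hour
            + revenue_loss_per_hour)

# Illustrative numbers: 200 affected employees at INR 500/hour,
# INR 50,000/hour of repair effort, INR 1,000,000/hour of lost business.
hourly = downtime_cost_per_hour(200, 500, 50_000, 1_000_000)
```

Multiplying `hourly` by an expected outage duration is what makes an RTO target comparable to the cost of the DR investment that achieves it.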
Investment Scenario
High Availability vs. Disaster Tolerance
High Availability
- Providing redundancy within a data centre to maintain the service (with or without a short outage)
  - Hardware failures, software failures, human error
Disaster Tolerance
- Providing redundancy between data centres to restore the service quickly (tens of minutes) after certain disasters, using dedicated equipment
  - Power loss; fire, flood, earthquake; sabotage, terrorism
- Network and power related changes
- Hardware repair
- Hardware and/or software upgrades
- Software maintenance (OS, database, applications)
- Data backup and storage management
  - As data grows in size, tape backup becomes less effective
  - What data must be archived?
  - How is the data archived?
- Software failure
  - Crashes, errors, hangs, etc., in the OS and applications
- Human error
  - Hardware, software, data
- Disasters (man-made and otherwise)
What causes the most Downtime?
Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008
Measure of Availability
Hours of downtime per year per IT service
Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008
Section 2
Architecture & Sizing for Disaster Recovery
2-Site Architecture
- 100% capacity at the Primary Site + 100% at the DR Site
- Database changes are frequent, hence log-based replication of the database between the Primary and DR sites
- Synchronous replication is not possible because of WAN bandwidth; asynchronous replication is
- RPO depends on how much data has to be replicated; RTO depends on people and processes
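The "RPO depends on how much data" point can be made concrete with a back-of-the-envelope model (function name and figures are illustrative assumptions): with log-based asynchronous replication, the worst-case lag is the time needed to drain the accumulated log backlog over the WAN link:

```python
def async_rpo_seconds(change_rate_mbps: float, link_mbps: float,
                      burst_backlog_mb: float = 0.0) -> float:
    """Worst-case replication lag (roughly the achieved RPO) for
    asynchronous, log-based replication.

    If the database generates redo faster than the WAN link can ship it,
    the lag grows without bound; otherwise any backlog drains at the
    link's spare capacity.
    """
    if change_rate_mbps >= link_mbps:
        return float("inf")
    return burst_backlog_mb * 8 / (link_mbps - change_rate_mbps)

# Illustrative: a 900 MB burst of redo, steady 10 Mbps of changes on a
# 34 Mbps link -> the DR copy trails the primary by about 5 minutes.
lag = async_rpo_seconds(change_rate_mbps=10, link_mbps=34,
                        burst_backlog_mb=900)
```

This is why the deck ties RPO to link sizing: a bigger pipe shortens the window of unshipped transactions that a disaster would destroy.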
2-Site Architecture: Working
[Diagram: Primary and DR sites, each with DBCI servers in a cluster and a SAN holding storage, application-files and archive-log volume groups; the storage tiers are linked over dark fibre with asynchronous replication.]
2-Site Architecture
Advantages
- Simple to manage
- Less expensive than other solutions
- Only one link needs to be procured
Disadvantages
- The impact of a 15-minute RPO is not quantifiable (it could be high or low)
- The kind of data loss that will occur cannot be estimated
- The RTO of the DR site cannot be quantified to the business because of lost transactions
3 Site Architecture (for RPO=0)
For RPO=0
- Must have synchronous replication of the database
- Synchronous replication is limited in distance (40 to 60 km), so it cannot be used over long distances
- It can, however, cover short distances
- So a 3-site (Primary, Near, DR) solution can achieve RPO=0 (almost)
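The distance limit follows from physics: light travels through fibre at roughly 200,000 km/s, and a synchronous write cannot be acknowledged until the remote copy confirms it, so every write pays at least one round trip. A small sketch (the function and the one-round-trip assumption are illustrative; real protocols may need several round trips per write):

```python
def sync_write_penalty_ms(distance_km: float, round_trips: int = 1) -> float:
    """Minimum added latency per synchronously replicated write,
    from propagation delay alone (ignoring switching and disk time)."""
    speed_km_per_ms = 200.0  # ~200 km per millisecond in optical fibre
    return 2 * distance_km * round_trips / speed_km_per_ms

# At 50 km the penalty is ~0.5 ms per write, which a busy OLTP database
# can absorb; at 1,000 km it is ~10 ms per write, which it cannot --
# hence the 40-60 km ceiling on the Near site.
near = sync_write_penalty_ms(50)
far = sync_write_penalty_ms(1000)
```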
In what cases will RPO be zero?
- For regional disasters that don't destroy the Primary and Near sites at the same time
- For all kinds of DC failures, RPO=0 can be achieved
- If a regional disaster wipes out both the Primary and Near sites, RPO will depend on the link between the Primary and DR sites (it could be 15 minutes, depending on the size of the link)
[Diagram: Primary Site and Near Site, each with DBCI server clusters, application servers and a SAN (storage, application-files and archive-log volume groups), linked by synchronous replication; the Near Site replicates asynchronously over a WAN link to the DR Site, which has its own DBCI cluster, application servers and SAN.]
3 SITE ARCHITECTURE: Working
[Diagram: Site A (PROD) linked to Site B (Near/Bunker) over dark fibre at a distance < 25 km; Site B linked onward to Site C (DR).]
3-Site DR Considerations
What a Near site must have:
- Different and multiple power sources / power grids
- Network termination exactly the same as the Primary DC (if the Near site has to take over Primary site operations)
- Replication links from multiple vendors (no SPOF)
- A link to the DR site
What should be in the Near Site?
Option 1: Full 100% replica of the Primary Site
- High cost (infrastructure + people)
  - Servers, storage, firewalls, switches, backup, power sources
  - Applications, databases, etc.
  - Security, personnel, processes
  - Network connectivity
- Would protect against any local problems at the Primary DC
What should be in the Near Site?
Option 2: Split configuration between the Primary and Near Sites
- Database servers split between the Primary and Near Sites (extended cluster)
- When the Primary DC fails, operations move to the Near Site
- Maintenance and continuous upkeep of the Near Site are essential
- Redundancy required for application servers, firewalls, routers, servers, backup, etc.
What should be in the Near Site?
Option 3: Minimalist
- Treat the Near site only as a means to RPO=0, not for operations
- Replicate storage continuously for RPO=0
- Keep only the hardware needed to push data from the Near site to the DR site if the Primary DC fails
- Keeps the simplicity of 2-site DR while adding the RPO=0 of 3-site
- RPO=0 is not achieved if the Primary and Near Sites go down together
Section 3
Connectivity to DR Site
Connectivity
Most businesses deploy wide area networks (WANs) to connect the remote parts of the business back to centralized resources. Bandwidth is always an issue in disaster recovery, and if you're replicating data for potential failover both locally and remotely, the bandwidth question becomes more complicated. We want a DR site far enough away that it won't be affected by the same disaster, but not so far away that WAN bandwidth costs become prohibitive.
The physical distance involved will often dictate the type of replication used to move data between sites. There are two types of replication: 1) synchronous replication and 2) asynchronous replication.
Synchronous replication moves data in real time, so that the data centre and DR site contain the same data from moment to moment, but synchronous transfers generally need high bandwidth. Asynchronous replication moves data on a bandwidth-available basis. This allows data movement over cheaper, lower-bandwidth connections, but introduces the possibility of data loss, because the data centre and DR site may be out of sync by up to several hours.
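The behavioural difference can be shown with a toy model (class and method names are hypothetical, not any vendor's API): synchronous replication acknowledges a write only once the replica has it, while asynchronous replication acknowledges immediately and ships the write when bandwidth allows:

```python
class ReplicatedVolume:
    """Toy model of synchronous vs. asynchronous block replication."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary: list[str] = []   # blocks committed at the data centre
        self.replica: list[str] = []   # blocks present at the DR site
        self.pending: list[str] = []   # async: committed but not yet shipped

    def write(self, block: str) -> None:
        self.primary.append(block)
        if self.synchronous:
            self.replica.append(block)  # ack only after the remote commit
        else:
            self.pending.append(block)  # ack now, ship later

    def drain(self, n: int) -> None:
        """Ship up to n pending writes (bandwidth-available basis)."""
        self.replica.extend(self.pending[:n])
        del self.pending[:n]

    def lost_on_disaster(self) -> list[str]:
        """Blocks destroyed if the primary site is lost right now."""
        return [b for b in self.primary if b not in self.replica]

sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for vol in (sync_vol, async_vol):
    for blk in ("a", "b", "c"):
        vol.write(blk)
async_vol.drain(1)  # the WAN only managed to ship one write so far
```

A disaster at this instant loses nothing on the synchronous volume but loses the two unshipped writes on the asynchronous one; that unshipped tail is exactly what the RPO measures.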
With the popularity of IP connectivity there are many connectivity options available. SAN connectivity can be provided in several ways:
- Ethernet
- FC (Fibre Channel)
- iSCSI (Internet Small Computer System Interface)
- FCIP
- FCoE (Fibre Channel over Ethernet)
The sites can also be connected by a VPN, which provides cost benefits.
1) Ethernet: Traditional Ethernet ports support 10/100 Mbps, far slower than Fibre Channel. Ethernet bandwidth is increasing, and 10 Gigabit Ethernet (10GigE) is now widely available for data centres.
2) Fibre Channel: Early FC implementations ran at 1 Gbps per port, and 2 Gbps reigned until recently. Today 4 Gbps FC is readily available, and 10 Gbps implementations are appearing on some high-end systems and director-class switches.
3) iSCSI (Internet Small Computer System Interface): iSCSI transfers data over LANs, WANs or the Internet and supports storage management over long distances. It encapsulates SCSI commands into IP packets for transmission over an Ethernet connection rather than a Fibre Channel connection. iSCSI still has two disadvantages for storage: at 1 GigE it does not perform as fast as Fibre Channel, and Ethernet will drop packets during network congestion. These problems may be alleviated soon, thanks to the emergence of 10 GigE and Data Center Ethernet.
4) FCIP: FCIP translates Fibre Channel commands and data into IP packets, which can be exchanged between distant Fibre Channel SANs. Note that FCIP only works to connect Fibre Channel SANs, whereas iSCSI can run on any Ethernet network.
5) FCoE: Storage vendors are working on a Fibre Channel over Ethernet (FCoE) standard to enable SAN and LAN convergence.
Requirements
- Establish WAN connectivity between the central location and 2 remote locations for the data transfer application
- A leased-line-based network design, primarily for the online data transfer application, with automatic ISDN backup connectivity
- Connectivity from the central location to the remote locations at 64 Kbps to 2 Mbps
- The connectivity is to be always on
- The network devices are to be SNMP-managed
- Provision for future scalability
DAX Network
Central Location:
At the central location, Dax recommended that the customer opt for one DX2650 modular access router with one 10/100 port, 4 NM slots and VoIP module support. The router was populated as follows:
- Slot 1: 2-port sync/async serial module (speeds up to 2 Mbps)
- Slot 2: 4-port ISDN U module
The remaining 2 slots were left free for future scalability.
Remote Location:
At each remote location, Dax recommended a DX-1721 modular router with one 10/100 port and 4 WAN slots for WAN/VoIP modules. Each DX1721 was loaded with the following modules:
- Slot 1: ISDN S/T module, providing automatic backup connectivity
- Slot 2: 1-port high-speed sync/async serial WAN interface module, connecting the leased-line link at 64 Kbps up to 2 Mbps
The remaining 2 slots were left free for future scalability.
Section 4
Backup Solution
Possible Options
- Backup and recovery from tape
- Host-based replication
- Storage-based replication
- Data replication infrastructure
- Replicating databases
- A comparison of the various disaster recovery solutions
- Metro clusters
Backup And Recovery From Tape
- RAID technology provides high levels of data availability, but cannot protect against data loss if the data is deleted (accidentally or otherwise) or corrupted
- Tapes can be cloned, i.e., copied to new media, so that they can be stored off-site at a disaster recovery location
- The least expensive of all the options; it is only really applicable as the primary disaster recovery mechanism for non-critical services, i.e., services whose RPOs tolerate data loss and whose RTOs can be longer
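Why tape implies relaxed RPO/RTO targets can be shown with two one-line formulas (function names and figures are illustrative assumptions):

```python
def tape_rpo_hours(backup_interval_hours: float) -> float:
    """Worst case: the disaster strikes just before the next backup runs,
    so everything written since the last backup is lost."""
    return backup_interval_hours

def tape_rto_hours(data_gb: float, restore_gb_per_hour: float,
                   tape_retrieval_hours: float) -> float:
    """Fetch the cloned tapes from off-site storage, then stream them back."""
    return tape_retrieval_hours + data_gb / restore_gb_per_hour

# Illustrative: a nightly backup gives a worst-case RPO of 24 hours, and
# restoring 2,000 GB at 200 GB/hour after a 4-hour off-site retrieval
# gives an RTO of about 14 hours.
worst_rpo = tape_rpo_hours(24)
worst_rto = tape_rto_hours(2000, 200, 4)
```

Both numbers are orders of magnitude above what replication-based options deliver, which is why tape sits at the bottom of the cost and capability ladder.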
Host-based replication
The remote mirror software works at the OS kernel level to intercept writes to underlying logical devices as well as to physical devices, such as disk slices and hardware-RAID-protected LUNs. It then forwards these writes to one or more remote Solaris OS-based nodes connected through an IP-based network.
Two modes of data transfer: synchronous-mode replication and asynchronous-mode replication.
Storage-Based Data Replication
Data replication is performed by the CPUs or controllers resident in the storage systems.
Again there are two modes, synchronous and asynchronous, but the software operates at a much lower level. Consequently, storage-based replication software can replicate data held by applications such as Oracle OPS and Oracle RAC, even though the I/Os to a single LUN might be issued by several nodes concurrently.
The software provides remote replication through disk-based journaling. Journaling techniques improve the reliability and robustness of remote copying operations, thereby also providing better data recovery capabilities.
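Disk-based journaling can be illustrated with a toy model (the function and data layout are hypothetical, not a vendor implementation): the replica keeps a point-in-time base image plus an ordered journal of block writes, and replaying the journal, or any prefix of it, reproduces a consistent state:

```python
def replay_journal(base: dict[str, str],
                   journal: list[tuple[str, str]]) -> dict[str, str]:
    """Rebuild a volume from a base image plus an ordered write journal.

    `base` maps block IDs to contents at the last checkpoint; `journal`
    is the ordered list of (block, data) writes since then.  Replaying a
    prefix of the journal recovers any intermediate point in time, which
    is what gives journaled replication its recovery flexibility.
    """
    volume = dict(base)
    for block, data in journal:
        volume[block] = data
    return volume

# Base image with one block, then two journaled writes: the replay
# yields the primary's final state.
state = replay_journal({"b0": "x"}, [("b1", "y"), ("b0", "z")])
```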
Data replication infrastructure
Replicating databases
The relational database management system (RDBMS) portfolios from IBM and Oracle include a wide range of tools to manage and administer data held in their respective databases, DB2 and Oracle. Because the RDBMS software is designed to handle logical changes to the underlying data, it offers considerably greater flexibility and lower network traffic than a corresponding block-based replication solution.
Metro Clusters
- The ability to cluster systems across hundreds of kilometres using Dense Wavelength Division Multiplexing (DWDM) and SAN-connected Fibre Channel storage devices
- Cluster deployments that combine availability and disaster recovery by splitting the two halves of the cluster and its storage between two widely separated data centres
- The physically separated cluster nodes work identically, but offer the added benefits of protecting against local disasters and eliminating the requirement for a dedicated disaster recovery environment
Section 5
Costing
Investments in DR don't increase top-line revenue, though they will likely let you retain more of your profits through cost avoidance and corporate viability. Building the business case requires a different approach: calculate the cost of downtime, define specific requirements, identify realistic risks, select cost-effective technologies and services, and show a commitment to disaster recovery planning and preparedness as an ongoing program.
SEVEN KEY STEPS FOR DISASTER RECOVERY SPENDING
1. Implement a continuity management process.
2. Conduct a business impact analysis (BIA) and risk assessment.
3. Calculate the cost of downtime.
4. Develop impact scenarios that address all risks, not just disasters.
5. Position DR as a competitive necessity.
6. Develop a DR services catalog.
7. Align DR technology investments with other IT initiatives.
Capex                        Assumption              Qty      Unit Price (INR)   Cost (INR crores)
DC site                      33% of space in sqft    20,000   25,000             50
Servers                      33% of CPUs             2,000    500,000            100
Storage                      33% of storage in TB    2,000    400,000            80
Network                      10% of server cost                                  10
Software                     15% of storage cost                                 12
Implementation - Consulting  10% of Capex                                        20
Total                                                                           272

Opex                         Assumption                                   Qty    Unit Price (INR)   Cost (INR crores)
Bandwidth                                                                        100,000            50
Power                        Rs. 50,000 per kW per annum, 6 kW per rack   600    300,000            18
Manpower                     6 NOC seats, 20 on-site
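The capex arithmetic in the table can be checked mechanically (the quantities and unit prices are taken from the slide; 1 crore = 10^7 INR):

```python
# Slide's capex rows as (qty, unit_price_inr): 20,000 sqft of DC space,
# 2,000 CPUs, 2,000 TB of storage.
capex_rows = {
    "DC site": (20_000, 25_000),
    "Servers": (2_000, 500_000),
    "Storage": (2_000, 400_000),
}
CRORE = 1e7  # 1 crore = 10 million INR

costs = {item: qty * price / CRORE for item, (qty, price) in capex_rows.items()}
costs["Network"] = 0.10 * costs["Servers"]    # 10% of server cost
costs["Software"] = 0.15 * costs["Storage"]   # 15% of storage cost
total = sum(costs.values()) + 20              # + implementation/consulting
```

The computed line items (50, 100, 80, 10 and 12 crores) and the grand total of 272 crores match the slide's figures.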