Disaster Recovery for a BSS Data Center

Published May 2016

DR for a BSS Data Centre

Disaster Recovery: The Lighter Side

Section 1

Disaster Recovery Overview

What is a Disaster?
• A hazard that has materialized
• A perceived tragedy
  - Natural calamity
  - Man-made catastrophe

• Disasters are the consequence of inappropriately managed risks

Risks to be Addressed

What is Disaster Recovery from an IT Perspective?
• Timely and effective restoration of IT services after a major incident
• Any plan or set of procedures implemented by a business to maintain uptime and/or prevent data loss in the event of a system failure

Disaster Recovery
• People
  - Staff, outsourced
• Process
  - Crisis management
• Technology (IT)
  - Hardware, software

Metrics for Disaster Recovery (1/2)
• Driven by two metrics
  - Recovery Time Objective (RTO): for how long can the service be interrupted?
  - Recovery Point Objective (RPO): how much data can be lost?

Metrics for Disaster Recovery (2/2)
[Figure: timeline from 5 a.m. to 7 p.m. with "DECLARE DISASTER" marked at 10 a.m.; the Recovery Point Objective (RPO) lies before the disaster and the Recovery Time Objective (RTO) after it.]

RPO: the amount of data lost in a failure, measured as the time between the last recoverable copy of the data and the disaster event

RTO: the targeted amount of time to restart a business service after a disaster event
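As a rough illustration of the two metrics, the Python sketch below measures one hypothetical incident on the 10 a.m. timeline: the data-loss window runs from the last good copy of the data to the disaster, and the outage runs from the disaster to service restoration. The function name `incident_metrics`, the times and the targets are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def incident_metrics(last_good_copy: datetime, disaster: datetime,
                     restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (data-loss window, outage length) for a single incident."""
    if not last_good_copy <= disaster <= restored:
        raise ValueError("expected last_good_copy <= disaster <= restored")
    data_loss = disaster - last_good_copy  # compare with the RPO
    outage = restored - disaster           # compare with the RTO
    return data_loss, outage


# Hypothetical incident: last good copy 9:45, disaster declared 10:00, service back 14:00.
day = datetime(2016, 5, 2)
loss, outage = incident_metrics(day.replace(hour=9, minute=45),
                                day.replace(hour=10),
                                day.replace(hour=14))
rpo_target = timedelta(minutes=15)  # hypothetical objectives
rto_target = timedelta(hours=4)
print(loss, outage)                              # 0:15:00 4:00:00
print(loss <= rpo_target, outage <= rto_target)  # True True
```

The RPO and RTO are targets; an incident meets them only if the measured data-loss window and outage are no longer than the targets.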

Understanding RPO and RTO
• Cost of downtime per hour
  - Employee cost per hour + cost of problem repair + cost of employee overtime
  - Loss of customers
  - Damage to the company's reputation

• Recovery Point Objective (RPO)
  - The point in time to which data must be recovered
  - The acceptable amount of data loss in a disaster situation

• Recovery Time Objective (RTO)
  - The time within which a business process must be restored after a disaster (the underlying infrastructure and application components are restored first)
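The hourly cost of downtime listed above can be turned into a cost per outage. A minimal sketch, assuming every component is expressed in INR per hour and that customer loss and reputational damage (hard to quantify) are folded into an optional `other` term; the function names and figures are hypothetical.

```python
def cost_per_hour(employee: float, repair: float, overtime: float,
                  other: float = 0.0) -> float:
    """Hourly downtime cost: employee cost + repair + overtime (+ optional estimate
    of customer/reputation loss), all in INR per hour."""
    return employee + repair + overtime + other


def outage_cost(hours: float, hourly: float) -> float:
    """Total cost of an outage lasting `hours`."""
    return hours * hourly


# Hypothetical figures in INR per hour.
hourly = cost_per_hour(employee=200_000, repair=50_000, overtime=30_000)
for rto_hours in (1, 4, 24):
    print(f"RTO {rto_hours:>2} h -> INR {outage_cost(rto_hours, hourly):,.0f}")
```

Evaluating the cost at several candidate RTOs makes the trade-off explicit: each hour removed from the RTO is worth the hourly figure, which can be weighed against the investment needed to achieve it.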

Investment Scenario

High Availability vs. Disaster Tolerance
• High availability
  - Providing redundancy within a data center to maintain the service (with or without a short outage)
  - Addresses: hardware failures, software failures, human error

• Disaster tolerance
  - Providing redundancy between data centers (with dedicated equipment) to restore the service quickly (within tens of minutes) after certain disasters
  - Addresses: power loss; fire, flood, earthquake; sabotage, terrorism

Availability Events (1/2)
• Planned outages
  - Network and power-related changes
  - Hardware repair
  - Hardware and/or software upgrades
  - Software maintenance
    • OS, database, applications
  - Data backup and storage management
    • As data grows in size, tape backup becomes less effective
    • What data must be archived?
    • How is the data archived?

Availability Events (2/2)
• Unplanned outages
  - Hardware failure
    • Server, storage, network, power
  - Software failure
    • Crashes, errors, hangs, etc., in the OS and applications
  - Human error
    • Affecting hardware, software, data
  - Disasters (man-made and otherwise)

What Causes the Most Downtime?

Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008

Measure of Availability

[Chart: hours of downtime per year per IT service]

Source: Best practices for Continuous Application Availability, Gartner Data Center Conference, 2008
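A common way to read an availability figure is to convert it into downtime per year. A minimal sketch, assuming a 365-day year (8,760 hours) and availability expressed as a percentage; the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def downtime_hours_per_year(availability_pct: float) -> float:
    """Convert an availability percentage into expected hours of downtime per year."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% available -> {downtime_hours_per_year(pct):.2f} h downtime/year")
```

For example, 99.9% availability corresponds to about 8.76 hours of downtime per year, while 99.99% corresponds to less than one hour.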

Section 2

Architecture & Sizing for Disaster Recovery

2-Site Architecture
• 100% primary site + 100% DR site
• Database changes are frequent, so the database is replicated between the primary and DR sites using log-based replication
• Synchronous replication is not possible because of WAN bandwidth limits
• Asynchronous replication is possible
• RPO depends on how much data has to be replicated
• RTO depends on people and processes
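To see why the RPO of asynchronous log shipping depends on how much data has to be replicated, here is a simple model, assuming archive logs are shipped in batches every `ship_interval_min` minutes over a dedicated link. In the worst case a disaster strikes just before a batch finishes transferring, so the DR site is missing one full interval plus the transfer time. The function name and every figure are hypothetical.

```python
def async_rpo_minutes(change_mb_per_min: float, ship_interval_min: float,
                      link_mbps: float) -> float:
    """Worst-case data loss (minutes) for periodic asynchronous log shipping."""
    batch_mb = change_mb_per_min * ship_interval_min  # log volume per batch
    transfer_min = batch_mb * 8 / link_mbps / 60       # MB -> Mb, / Mbps -> s, -> min
    if transfer_min >= ship_interval_min:
        raise ValueError("link cannot keep up: the replication backlog grows without bound")
    # Lost data = everything since the last batch that fully reached the DR site.
    return ship_interval_min + transfer_min


# Hypothetical load: 30 MB of archive log per minute, shipped every 10 minutes.
for link in (100, 10):
    print(f"{link:>3} Mbps link -> worst-case RPO {async_rpo_minutes(30, 10, link):.1f} min")
```

If the link cannot move one batch within the interval, the backlog keeps growing and there is no bounded RPO; the function raises in that case (for example, with a 2 Mbps link and the same load).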

2-Site Architecture: Working

[Figure: 2-site architecture. The primary site (ACTIVE) and the DR site each have an application tier (application servers), a DB tier (DBCI servers in a cluster) and a storage tier (SAN with storage, application-files and archive-logs volume groups). The two SANs are linked over dark fiber using asynchronous replication.]

2-Site Architecture
• Advantages
  - Simple to manage
  - Less expensive than other solutions
  - Only one link needs to be procured

• Disadvantages
  - The business impact of a 15-minute RPO cannot be quantified (it could be high or low)
  - The kind of data loss that will occur cannot be estimated
  - The RTO of the DR site cannot be quantified for the business because of lost transactions

3-Site Architecture (for RPO = 0)

• For RPO = 0
  - The database must be replicated synchronously
  - Synchronous replication is limited by distance (40 to 60 km)
  - Hence synchronous replication cannot be used over long distances
  - But it can be used over short distances
  - So a 3-site (primary, near, DR) solution can achieve an RPO of (almost) zero
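The distance limit on synchronous replication comes from the fact that each write must be acknowledged by the remote copy before the application continues. A back-of-the-envelope sketch, assuming light travels through optical fibre at roughly 200,000 km/s (about 5 µs per km one way) and ignoring switch, protocol and disk latency; the `round_trips` parameter is an assumption, since some storage protocols need more than one round trip per write.

```python
FIBRE_US_PER_KM = 5.0  # approx. one-way propagation delay in optical fibre, microseconds/km


def sync_write_penalty_ms(distance_km: float, round_trips: int = 1) -> float:
    """Propagation delay (ms) added to every synchronous write over `distance_km`.

    Ignores switching, protocol processing and disk latency.
    """
    return 2 * distance_km * FIBRE_US_PER_KM * round_trips / 1000


for km in (25, 60, 500):
    print(f"{km:>3} km: +{sync_write_penalty_ms(km):.2f} ms per write")
```

At 60 km the propagation delay alone adds about 0.6 ms to every write; a workload that commits many small transactions one after another pays this on each commit, which is why synchronous replication is kept to short distances.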

• When will RPO be zero?
  - In regional disasters that don't destroy the primary and near sites at the same time
  - For all kinds of failures confined to a single data center, RPO = 0 can be achieved
  - In a regional disaster that wipes out both the primary and near sites, RPO depends on the link between the primary and DR sites (it could be 15 minutes, depending on the capacity of the link)
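The RPO cases above can be summarised as a small decision function. A sketch under the assumptions of the minimalist near-site design: if the near site survives, its synchronous copy can be pushed to the DR site (RPO = 0); if the primary and near sites are both lost, the DR site holds only what the asynchronous link had delivered. The function name and the 15-minute lag are hypothetical.

```python
def effective_rpo_minutes(primary_up: bool, near_up: bool, async_lag_min: float) -> float:
    """Worst-case data loss for the 3-site layout under the scenarios described above."""
    if primary_up:
        return 0.0            # no failover: production keeps running on the primary
    if near_up:
        return 0.0            # near site holds a synchronous copy that can be pushed to DR
    return async_lag_min      # regional disaster: DR has only what the async link delivered


for primary_up, near_up in [(True, True), (False, True), (False, False)]:
    rpo = effective_rpo_minutes(primary_up, near_up, async_lag_min=15)
    print(f"primary up={primary_up}, near up={near_up} -> RPO {rpo} min")
```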

[Figure: 3-site architecture. The primary site, near site and DR site each have application servers, DBCI servers (clustered at the primary and DR sites) and a SAN with storage, application-files and archive-logs volume groups. The primary SAN replicates synchronously to the near site, and asynchronous replication runs over a WAN link to the DR site.]

3-Site Architecture: Working
[Figure: Site A (PROD) is connected over dark fibre (distance < 25 km) to Site B (near/bunker site); Site C is the DR site.]

3-Site DR Considerations
• What a near site must have
  - Different and multiple power sources/power grids
  - Network termination exactly the same as the primary DC (if the near site is to be used for primary-site operations)
  - Replication links from multiple vendors (no single point of failure)
  - A link to the DR site

What Should Be in the Near Site?
• Option 1: Full 100% replica of the primary site
  - High cost (infrastructure + people)
    • Servers, storage, firewalls, switches, backup, power sources
    • Applications, databases, etc.
    • Security, personnel, processes
    • Network connectivity
  - Would protect against any local problems at the primary DC

What Should Be in the Near Site?
• Option 2: Split configuration between the primary and near sites
  - Database servers split between the primary and near sites (extended cluster)
  - When the primary DC fails, operations move to the near site
  - Maintenance and continuous upkeep of the near site are essential
  - Redundancy is required for application servers, firewalls, routers, servers, backup, etc.

What Should Be in the Near Site?
• Option 3: Minimalist
  - Treat the near site only as a means of achieving RPO = 0, not for operations
  - Replicate storage continuously for RPO = 0
  - Keep only the hardware needed to push data from the near site to the DR site if the primary DC fails
  - Keeps the simplicity of a 2-site DR setup while achieving the RPO = 0 of a 3-site setup
  - RPO = 0 is not achieved if the primary and near sites go down together

Section 3

Connectivity to DR Site

Connectivity
Most businesses deploy wide area networks (WANs) to connect the remote parts of the business back to centralized resources. Bandwidth is always an issue in disaster recovery, and if you are replicating data for potential failover both locally and remotely, the bandwidth issues become more complicated. We want to establish a DR site that is far enough away not to be affected by the same disaster, but not so far away that WAN bandwidth costs become prohibitive.
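Sizing the WAN link starts from how much data changes. A minimal sketch, assuming the daily change volume is spread evenly over a replication window and inflated by a hypothetical `overhead` factor for protocol overhead and bursts; decimal units are used (1 GB = 8,000 megabits), and the 500 GB/day figure is an assumption.

```python
def required_mbps(changed_gb_per_day: float, window_hours: float = 24.0,
                  overhead: float = 1.2) -> float:
    """Average link speed (Mbps) needed to ship one day's changes within the window."""
    megabits = changed_gb_per_day * 8 * 1000  # decimal GB -> megabits
    return megabits / (window_hours * 3600) * overhead


# Hypothetical: 500 GB of changed data per day.
for window in (24, 8):
    print(f"{window:>2} h window -> {required_mbps(500, window):.1f} Mbps")
```

An average says nothing about peaks: to hold an RPO target, the link must also absorb the busiest hour, not just the daily mean.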

The physical distance involved often dictates the type of replication used to move data between sites. There are two types of replication:
1) Synchronous replication
2) Asynchronous replication

Synchronous replication moves data in real time, so the data center and DR site contain the same data from moment to moment, but synchronous transfers often need high bandwidth. Asynchronous replication moves data as bandwidth becomes available. This allows data movement over cheaper, lower-bandwidth connections, but presents a possibility of data loss, because the data center and DR site may be out of sync by up to several hours.
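The difference in data-loss exposure can be shown with a toy model, assuming a synchronous write is acknowledged only after the DR copy is written, while an asynchronous write is acknowledged immediately and queued until bandwidth is available. The class and method names are hypothetical; real replication products track blocks or log records rather than strings.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary copy, a DR copy and the link between them."""

    def __init__(self, mode: str):
        if mode not in ("sync", "async"):
            raise ValueError("mode must be 'sync' or 'async'")
        self.mode = mode
        self.primary: list[str] = []
        self.dr: list[str] = []
        self.in_flight: deque = deque()  # async writes not yet sent to DR

    def write(self, block: str) -> None:
        self.primary.append(block)
        if self.mode == "sync":
            self.dr.append(block)         # acknowledged only after the DR copy exists
        else:
            self.in_flight.append(block)  # acknowledged at once, shipped later

    def drain(self, n: int) -> None:
        """Ship up to n queued writes when bandwidth becomes available."""
        for _ in range(min(n, len(self.in_flight))):
            self.dr.append(self.in_flight.popleft())

    def lose_primary(self) -> list[str]:
        """Disaster at the primary: queued writes never reach the DR site."""
        lost = list(self.in_flight)
        self.in_flight.clear()
        return lost


for mode in ("sync", "async"):
    vol = ReplicatedVolume(mode)
    for i in range(5):
        vol.write(f"txn{i}")
    vol.drain(2)  # the link only had time to ship two writes
    print(mode, "DR has", len(vol.dr), "writes; lost:", vol.lose_primary())
```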

With the popularity of IP connectivity, many connectivity options are available. SAN connectivity can use several options, such as:
• Ethernet
• FC (Fibre Channel)
• iSCSI (Internet Small Computer System Interface)
• FCIP (Fibre Channel over IP)
• FCoE (Fibre Channel over Ethernet)
The sites can be connected by a VPN, which provides cost benefits.

1) Ethernet
Traditional Ethernet ports support 10/100 Mbps, far slower than Fibre Channel. Ethernet bandwidth is increasing, however, and 10 Gigabit Ethernet (10GigE) is widely available for data centers.

2) Fibre Channel
Early FC implementations ran at 1 Gbps per port, and 2 Gbps reigned until recently. Today, 4 Gbps FC is readily available, and 10 Gbps implementations are appearing on some high-end systems and director-class switches.

3) iSCSI (Internet Small Computer System Interface)
iSCSI transfers data over LANs, WANs or the Internet and supports storage management over long distances. It encapsulates SCSI commands into IP packets for transmission over an Ethernet connection rather than a Fibre Channel connection, which makes long-distance storage networking easier. iSCSI still has two disadvantages for storage:
• At 1 GigE, it does not perform as fast as Fibre Channel.
• Ethernet drops packets during network congestion.
These problems may be alleviated by the emergence of 10 GigE and Data Center Ethernet.

4) FCIP
FCIP translates Fibre Channel commands and data into IP packets, which can be exchanged between distant Fibre Channel SANs. Note that FCIP only connects Fibre Channel SANs, whereas iSCSI can run on any Ethernet network.

5) FCoE
Storage vendors are working on a Fibre Channel over Ethernet (FCoE) standard to enable SAN and LAN convergence.
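To put these link speeds in perspective, the sketch below estimates how long an initial full copy of a data set would take over each link, assuming the nominal line rates quoted in the text, decimal units, and a hypothetical 80% efficiency factor for encoding and protocol overhead. The 10 TB data set is an assumption.

```python
# Nominal line rates (Gbps) for the link types discussed above.
LINKS_GBPS = {
    "Fast Ethernet (100 Mbps)": 0.1,
    "1 GigE (e.g. iSCSI)": 1.0,
    "2 Gbps FC": 2.0,
    "4 Gbps FC": 4.0,
    "10 GigE": 10.0,
}


def copy_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours to move `dataset_tb` terabytes at `efficiency` x the nominal line rate."""
    bits = dataset_tb * 8e12  # decimal TB -> bits
    return bits / (link_gbps * 1e9 * efficiency) / 3600


for name, gbps in LINKS_GBPS.items():
    print(f"{name:26} {copy_hours(10, gbps):7.1f} h for 10 TB")
```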

Requirements
• Establish WAN connectivity between the central location and 2 remote locations for a data transfer application.
• The leased-line-based network is primarily to be used for implementing the online data transfer application, with automatic ISDN backup connectivity.
• Connectivity from the central location to the remote locations at 64 Kbps to 2 Mbps.
• The connectivity must be always on.
• The network devices must be SNMP-managed.
• Provision for future scalability.

DAX Network
Central Location:
At the central location, Dax recommended that the customer use one DX2650 Modular Access Router with one 10/100 port, 4 NM slots and VoIP module support. The router was populated as follows:
• Slot 1: 2-port sync/async serial module (speeds up to 2 Mbps)
• Slot 2: 4-port ISDN U module
• The remaining 2 slots were left free for future scalability.

Remote Location:
At each remote location, Dax recommended that the branch use a DX-1721 Modular Router with one 10/100 port and 4 WAN slots for WAN/VoIP modules. Each DX1721 was loaded with the following modules:
• Slot 1: ISDN S/T module for automatic backup connectivity
• Slot 2: 1-port high-speed sync/async serial WAN interface module for connecting the leased-line link at 64 Kbps up to 2 Mbps
• The remaining 2 slots were left free for future scalability.

Section 4

Backup Solution

Possible Options
• Backup and recovery from tape
• Host-based replication
• Storage-based replication
• Data replication infrastructure
• Replicating databases
• A comparison of the various disaster recovery solutions
• Metro clusters

Backup and Recovery from Tape
• RAID technology is used to provide high levels of data availability
• But RAID cannot protect against data loss if data is deleted (accidentally or otherwise) or corrupted
• Tapes can be cloned, i.e., copied to new media, so that they can be stored off-site at a disaster recovery location
• Least expensive of all the options
• Really only applicable as the primary disaster recovery mechanism for non-critical services, i.e., services where data loss (a large RPO) and longer RTOs are acceptable

Host-Based Replication
The remote mirror software works at the OS kernel level to intercept writes to underlying logical devices as well as to physical devices, such as disk slices and hardware-RAID-protected LUNs. It then forwards these writes to one or more remote Solaris OS-based nodes connected through an IP-based network.

There are 2 modes of data transfer: synchronous replication and asynchronous replication.

Storage-Based Data Replication
Storage-based replication performs data replication on the CPUs or controllers resident in the storage systems.

It also works in 2 modes, synchronous and asynchronous, but the software operates at a much lower level.

Consequently, storage-based replication software can replicate data held by applications such as Oracle OPS and Oracle RAC, even though the I/Os to a single LUN might be issued by several nodes concurrently.

The software provides remote replication through disk-based journaling.

Journaling techniques can improve the reliability and robustness of remote copy operations, thereby also providing better data recovery capabilities.
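A minimal sketch of the journaling idea, assuming each write is recorded at the primary with a sequence number and the remote side applies entries strictly in that order, buffering any that arrive early and ignoring retransmitted duplicates. This is a generic illustration, not any vendor's journal format; the class names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class JournalEntry:
    seq: int     # write-order sequence number assigned at the primary
    lba: int     # logical block address written
    data: bytes


@dataclass
class RemoteReplica:
    """Applies journal entries in sequence order so the copy stays write-order consistent."""
    blocks: dict[int, bytes] = field(default_factory=dict)
    next_seq: int = 1
    pending: dict[int, JournalEntry] = field(default_factory=dict)

    def receive(self, entry: JournalEntry) -> None:
        if entry.seq < self.next_seq:
            return  # duplicate after a retransmission; already applied
        self.pending[entry.seq] = entry
        # Apply the longest contiguous run of entries; stop at the first gap.
        while self.next_seq in self.pending:
            e = self.pending.pop(self.next_seq)
            self.blocks[e.lba] = e.data
            self.next_seq += 1


replica = RemoteReplica()
# Three writes to the same block arrive out of order over the link.
for e in (JournalEntry(1, 10, b"A"), JournalEntry(3, 10, b"C"), JournalEntry(2, 10, b"B")):
    replica.receive(e)
print(replica.blocks, replica.next_seq)
```

Applying in arrival order would leave block 10 holding `b"B"`; applying by sequence number leaves the final value `b"C"`, and after a link outage the replica can resume from `next_seq`.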

Data replication infrastructure

Replicating Databases
The relational database management system (RDBMS) portfolios from IBM and Oracle include a wide range of tools to manage and administer data held in their respective databases, DB2 and Oracle. Because the RDBMS software is designed to handle logical changes to the underlying data, it offers considerably greater flexibility and lower network traffic than a corresponding block-based replication solution.
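A rough comparison of network traffic, assuming block-based replication ships each rewritten database block in full (8 KB, a common default block size) while logical replication ships only a change record of about 200 bytes per row change. Both sizes and the workload of 10,000 single-row updates, each in a different block, are hypothetical; real ratios depend heavily on the workload.

```python
def block_replication_bytes(dirty_blocks: int, block_size: int = 8192) -> int:
    """Bytes shipped when every modified block is replicated in full."""
    return dirty_blocks * block_size


def logical_replication_bytes(row_changes: int, avg_change_record: int = 200) -> int:
    """Bytes shipped when only logical change records are replicated."""
    return row_changes * avg_change_record


updates = 10_000  # hypothetical single-row updates, each dirtying a different block
blk = block_replication_bytes(updates)
logical = logical_replication_bytes(updates)
print(f"block-based: {blk / 1e6:.1f} MB, logical: {logical / 1e6:.1f} MB, "
      f"ratio {blk / logical:.0f}x")
```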

Metro Clusters
• The ability to cluster systems across hundreds of kilometers using dense wavelength division multiplexing (DWDM) and SAN-connected Fibre Channel storage devices
• Cluster deployments that try to combine availability and disaster recovery by separating the two halves of the cluster, and the storage, between two widely separated data centers
• The physically separated cluster nodes work identically, but offer the added benefits of protecting against local disasters and eliminating the requirement for a dedicated disaster recovery environment

Section 5

Costing

• Investments in DR don't increase top-line revenue, though they will likely let you retain more of your profits through cost avoidance and corporate viability.
• Building the business case requires a different approach: calculate the cost of downtime, define specific requirements, identify realistic risks, select cost-effective technologies and services, and show a commitment to disaster recovery planning and preparedness as an ongoing program.
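One standard way to frame the cost of downtime in a business case is the annualized loss expectancy (ALE = annual rate of occurrence × single loss expectancy). The sketch below, with entirely hypothetical inputs, compares the yearly downtime cost avoided by DR with the yearly cost of running it; the function names are illustrative.

```python
def annualized_loss(events_per_year: float, outage_hours: float,
                    cost_per_hour: float) -> float:
    """ALE = ARO x SLE, taking the single-loss expectancy as outage hours x hourly cost."""
    return events_per_year * outage_hours * cost_per_hour


def net_annual_benefit(events_per_year: float, cost_per_hour: float,
                       outage_without_dr_h: float, outage_with_dr_h: float,
                       annual_dr_cost: float) -> float:
    """Downtime cost avoided each year by having DR, minus what DR costs per year."""
    avoided = (annualized_loss(events_per_year, outage_without_dr_h, cost_per_hour)
               - annualized_loss(events_per_year, outage_with_dr_h, cost_per_hour))
    return avoided - annual_dr_cost


# Hypothetical: one major disaster every 10 years, INR 2 crore per hour of downtime,
# 500 h to rebuild without DR vs 4 h to fail over with DR, DR costing INR 50 crore a year.
net = net_annual_benefit(0.1, 2e7, 500, 4, 5e8)
print(f"net annual benefit: INR {net / 1e7:.1f} crore")
```

A positive result supports the investment under the stated assumptions; the inputs themselves come from the business impact analysis and risk assessment.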

Seven Key Steps for Disaster Recovery Spending
• Implement a continuity management process.
• Conduct a business impact analysis (BIA) and risk assessment.
• Calculate the cost of downtime.
• Develop impact scenarios that address all risks, not just disasters.
• Position DR as a competitive necessity.
• Develop a DR services catalog.
• Align DR technology investments with other IT initiatives.

Item | Assumption | Qty | Unit Price (INR) | Cost (INR crores)
Capex
DC site | 33% of space in sqft | 20,000 | 25,000 | 50
Servers | 33% of CPUs | 2,000 | 500,000 | 100
Storage | 33% of storage in TB | 2,000 | 400,000 | 80
Network | 10% of server cost | | | 10
Software | 15% of storage cost | | | 12
Implementation-Consulting | 10% of Capex | | | 20
Total | | | | 272
Opex
Bandwidth | | | 100,000 | 50
Power | Rs. 50,000 per kW per annum, 6 kW per rack | 600 | 300,000 | 18
Manpower | 6 NOC seats, 20 on-site | | | 10
AMC | 6% of Capex | | | 15
Total | | | | 93
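The arithmetic behind the cost table can be checked in a few lines. This sketch recomputes the rows whose formula is fully stated (quantity × unit price, the percentages of server and storage cost, and power from its per-rack assumption, taking 600 as the rack count) and uses the listed figures for the remaining rows; 1 crore = 10,000,000 INR.

```python
CRORE = 10_000_000  # 1 crore = 10 million INR

# (quantity, unit price in INR) for rows costed as qty x unit price
capex_qty_price = {
    "DC site": (20_000, 25_000),   # sq ft
    "Servers": (2_000, 500_000),   # CPUs
    "Storage": (2_000, 400_000),   # TB
}
capex = {item: qty * price / CRORE for item, (qty, price) in capex_qty_price.items()}
capex["Network"] = 0.10 * capex["Servers"]   # 10% of server cost
capex["Software"] = 0.15 * capex["Storage"]  # 15% of storage cost
capex["Implementation-Consulting"] = 20.0    # as listed in the table

# Power: 600 racks x 6 kW per rack x INR 50,000 per kW per annum; other rows as listed.
opex = {"Bandwidth": 50.0, "Power": 600 * 6 * 50_000 / CRORE,
        "Manpower": 10.0, "AMC": 15.0}

for name, table in (("Capex", capex), ("Opex", opex)):
    for item, crores in table.items():
        print(f"{name:5} {item:26} {crores:6.1f}")
    print(f"{name:5} {'Total':26} {sum(table.values()):6.1f}")
```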

Thank You
