Big Data Compression Management


Three Must-Have Storage Tools for
Managing Big Data



February 2012
Dick Csaplar



This document is the result of primary research performed by Aberdeen Group. Aberdeen Group's methodologies provide for objective fact-based research and
represent the best analysis available at the time of publication. Unless otherwise noted, the entire contents of this publication are copyrighted by Aberdeen Group, Inc.
and may not be reproduced, distributed, archived, or transmitted in any form or by any means without prior written consent by Aberdeen Group, Inc.




One of the key IT challenges organizations are facing is how to store
greater amounts of data without dramatically increasing their IT spending.
Aberdeen surveyed the IT community to investigate how fast their data is
growing and how they are dealing with the increasing demand for storage
space. From the 106 respondents, Aberdeen identified three important
storage tools for managing data growth: storage virtualization, data
compression or deduplication, and data tiering. Each tool addresses a
different challenge for managing the storage of Big Data.
Big Data Trends
Aberdeen asked survey respondents to report on their most pressing IT
issues. Fifty-eight percent (58%) of respondents reported that the growing
demand for storage capacity is one of their top three IT job pressures.
Figure 1: Storage Capacity – A Top IT Pressure
[Bar chart; y-axis: % of Companies Surveyed; n = 106]
Meeting the increasing demands for storage    58%
Reduced budgets                               39%
Managing outages                              25%
Source: Aberdeen Group, January 2012

The pressure of increasing storage demands was cited 38% more often than
the pressure of reducing IT budgets.
Aberdeen also investigated how quickly the demand for storage is rising. On
average, the storage capacity required by organizations grew 35% from 2010
to 2011. In addition, 8% of respondents reported that their data grew
between 90% and 140% per year, and another 6% claimed their data grew
over 150% per year.

Analyst Insight
Aberdeen’s Insights provide the analyst perspective of the research as
drawn from an aggregated view of the research surveys, interviews, and
data analysis.

“We are constantly struggling to get additional storage in place as well
as upgrade outdated hardware.”
~ IT Manager, Mid-sized Healthcare Provider, USA

© 2012 Aberdeen Group. Telephone: 617 854 5200
www.aberdeen.com Fax: 617 723 7897

Table 1: Data Growth Rates

                            All Respondents   Small   Mid-Sized   Large
2010 to 2011 Growth Rate    35%               29%     35%         44%
Source: Aberdeen Group, January 2012
The growth of storage requirements is especially troubling for large
enterprises, which reported an average growth rate of 44%. The 35%
average growth rate across all company sizes means that data storage is
doubling every 2.5 years. Unless organizations get smarter and more
efficient, they will soon need twice the number of storage devices, twice the
space in the data center, and twice the time and effort to manage their
business data.
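
The doubling-time claim above follows from compound growth. The sketch below is a minimal illustration that assumes a constant annual growth rate compounded once per year; the function names are hypothetical and not part of the Aberdeen research. Under that assumption a 35% rate doubles capacity in about 2.3 years, slightly faster than the rounded 2.5-year figure quoted in the report.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for capacity to double at a constant compound annual rate."""
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    return math.log(2) / math.log(1 + annual_growth)


def projected_capacity(start_tb: float, annual_growth: float, years: float) -> float:
    """Capacity after `years` of compound growth from `start_tb`."""
    return start_tb * (1 + annual_growth) ** years


# Growth rates taken from Table 1 (2010 to 2011).
GROWTH_RATES = {"All": 0.35, "Small": 0.29, "Mid-Sized": 0.35, "Large": 0.44}

if __name__ == "__main__":
    for segment, rate in GROWTH_RATES.items():
        years = doubling_time_years(rate)
        print(f"{segment:10s} {rate:.0%} growth doubles in {years:.1f} years")
```

The same helper shows why large enterprises are under the most pressure: at 44% growth the doubling time drops below two years.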
Big Data is a term used to describe the vast pool of data organizations are
required to manage, mine, and store. Three aspects need to be considered
when managing Big Data:
• First, there is the total size of the data stored. Today it is not
unusual to hear of companies managing petabytes (PB) or exabytes
(EB) of storage (a petabyte is 1,000 terabytes, and an exabyte is 1,000
petabytes).
• The second factor to consider is how fast the data changes. One PB
of static or slowly changing data can be managed differently than a
similar volume of data that changes every month, every week, or
every day.
• Finally, organizations must understand the different data types they
handle. Video, spreadsheets, formatted databases, and fully
unstructured data require different tools to optimize storage.
For more on Big Data and how to manage it, see the Aberdeen report “Big
Data, Big Moves” (August 2011).
Storage Capabilities to Manage Big Data
Data is growing too rapidly for organizations to keep pace by just getting
bigger; they have to get smarter. Aberdeen has found that three important
storage features help manage the growing impact of Big Data. Each of the
following storage tools makes data more manageable in different ways:
• Storage virtualization: Managing the variety of data types
• Data deduplication and compression: Managing the size of the
data
• Storage tiering: Managing the speed at which the data changes
Managing the Variety of Storage: Storage Virtualization
Today there is a great buzz around the benefits of virtualization. The term
“virtualization” has great marketing value; however, it is defined differently
by different companies. Most storage solutions support virtualized storage
over and above basic Logical Unit Numbers (LUNs – the grouping of
multiple hard drives into a single manageable unit). No two storage
virtualization solutions are identical, and some are widely different from the
average. All, however, share a common element: storage virtualization adds
an abstraction layer of software that hides physical devices from the user,
and allows all devices to be managed as a single pool.
This means data is not necessarily represented by its physical location, but
rather is managed as a logical unit. On a practical level, storage devices can
all be managed by a single application.
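
The abstraction layer described above can be pictured as a mapping from logical volumes to physical devices. The following is a minimal sketch, not any vendor's implementation: the class names, the "most free space" placement rule, and the restriction that a volume lives on one device are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalDevice:
    name: str
    capacity_tb: float
    used_tb: float = 0.0

    @property
    def free_tb(self) -> float:
        return self.capacity_tb - self.used_tb


@dataclass
class StoragePool:
    """Hypothetical virtualization layer: callers see one pool, not devices."""
    devices: list
    placement: dict = field(default_factory=dict)  # volume name -> device name

    @property
    def free_tb(self) -> float:
        # Free space is reported for the pool as a whole.
        return sum(d.free_tb for d in self.devices)

    def allocate(self, volume: str, size_tb: float) -> None:
        # Assumed policy: place the volume on the device with the most free space.
        target = max(self.devices, key=lambda d: d.free_tb)
        if target.free_tb < size_tb:
            raise RuntimeError("no single device can hold this volume")
        target.used_tb += size_tb
        self.placement[volume] = target.name

    def locate(self, volume: str) -> str:
        # Only the pool knows the physical location of a logical volume.
        return self.placement[volume]


if __name__ == "__main__":
    pool = StoragePool([PhysicalDevice("array-a", 10), PhysicalDevice("array-b", 20)])
    pool.allocate("finance", 5)
    print(pool.locate("finance"), pool.free_tb)
```

Applications ask the pool for a volume by name; which array actually holds it is a detail the pool can change without the caller noticing.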
While storage virtualization has been around for many years, it has not been
as widely adopted as server virtualization. Table 2 shows the current
adoption of storage virtualization in the enterprise.
Table 2: Virtualization Statistics

                                        All Respondents   Small   Mid-Sized   Large
Average number of storage devices       2.7               2.3     2.6         3.2
Average number of storage management
applications                            2.4               2.4     2.0         3.0
Percentage of companies with a single
storage management app                  21%               38%     32%         20%
Source: Aberdeen Group, January 2012
The ratio of storage devices to management applications shows that
virtualization has not penetrated as deeply as it should. Only about one fifth
of organizations have attained the goal of managing all their storage devices
with a single application. This shows that a major segment of the potential
market has not fully adopted storage virtualization.
Aberdeen investigated what business benefits were experienced by
organizations that virtualized their storage devices. The research shows that
storage virtualization offers both operational and financial benefits (Figure
2).
“Organizations need to
understand how their data is
used and look for tiering
alternatives that can provide
fast access for current data and
slower access to less used
data.”
~ IT Manager, Large IT
Consulting Organization, US

Figure 2: Storage Virtualization Benefits
[Bar chart; y-axis: % of Surveyed Respondents; n = 106]
Reduced effort to manage SANs           54%
Improved time to deploy servers/apps    41%
Reduced IT expense                      38%
Reduced number of SANs                  31%
Source: Aberdeen Group, January 2012
Table 3 shows further performance gains reported by organizations that
deployed this technology. While these are small reductions, they
need to be considered in the light of the very high growth of the amount of
data being handled.
Table 3: Further Benefits of Storage Virtualization

                                    Benefits since deploying
                                    storage virtualization
Change in application downtime      -12%
Change in IT spending on storage    -6%
Change in IT storage headcount      -3.5%
Source: Aberdeen Group, January 2012
The improvement in application downtime is linked to the storage
virtualization feature of consistent file management. Key business
applications are always directed to the right data, even though it may have
moved as part of data tiering or a file management process. The reductions
in spending for headcount and equipment are likewise linked to increased
ease of management and reduced islands of storage.
Aberdeen expects these benefits will be experienced more broadly and
deeply as companies achieve the goal of managing all their storage devices
under a single management application.
Manage the Size of the Data: Deduplication and Compression
One key strategy to deal with Big Data is to reduce the size of the files
being stored. Aberdeen investigated the overall adoption rate of
deduplication and compression technologies as part of organization-wide
storage programs. Forty-seven percent (47%) of respondents use data
deduplication and 56% use data compression.
Data deduplication is a specialized file compression technique for eliminating
redundant data. The technique is used to improve storage utilization and
can also be applied to network data transfers to reduce the number of bytes
sent across a link. In the deduplication process, unique chunks of data, or
byte patterns, are identified and stored during an analysis process. As the
analysis continues, other chunks are compared to the stored copy, and
whenever a match occurs, the redundant chunk is replaced with a small
reference that points to the stored original. Given that the same byte
pattern may occur many times, the amount of data that must be stored or
transferred can be greatly reduced. Data deduplication is designed to handle
full files and larger amounts of data.
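
The process described above (store each unique byte pattern once, replace repeats with a reference) can be sketched in a few lines. This is a simplified illustration only: the fixed 4 KB chunk size, the use of SHA-256 digests as references, and the `DedupStore` class are assumptions, not details from the report or from any particular product.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, in bytes


class DedupStore:
    """Toy deduplicating store: each unique chunk is kept exactly once."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes (the stored original)
        self.files = {}   # file name -> ordered list of chunk digests

    def put(self, name: str, data: bytes) -> None:
        refs = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # A chunk seen before is not stored again; only the reference is kept.
            self.chunks.setdefault(digest, chunk)
            refs.append(digest)
        self.files[name] = refs

    def get(self, name: str) -> bytes:
        # Rebuild the file by following its references back to stored chunks.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


if __name__ == "__main__":
    attachment = b"".join(i.to_bytes(4, "big") for i in range(4096))  # 16 KB
    store = DedupStore()
    for mailbox in ("alice", "bob", "carol"):
        store.put(f"{mailbox}/report.doc", attachment)
    print(store.stored_bytes(), "bytes stored for", 3 * len(attachment), "logical bytes")
```

Three copies of the same attachment occupy the space of one, which is the effect the paragraph describes.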
Data compression differs from deduplication in that compression tools
identify short, repeated strings within an individual file and store only a
single copy of each. Deduplication, by contrast, works across files. For
example, a typical email system might contain many copies of the same Word
document file attachment. With data deduplication, only one instance of the
Word attachment is actually stored, and subsequent instances are referenced
back to the saved copy. Data compression is designed to handle single files
and smaller amounts of data.
Table 4: Data Reduction Rates

                        Average Data Reduction Rate
Data reduction rates    31%
Source: Aberdeen Group, January 2012
Thirteen percent (13%) of respondents reported that they achieved reduction
rates of over 50% from their compression tools. Some vendors advertise
data deduplication and compression rates of up to 80%. While this may be
possible in the lab, it is important to note that in the real world such results
are almost impossible to attain. Highly random data and encrypted files present
almost no opportunity for compression. An organization needs highly
repetitive files to gain very high rates. However, file size reductions of
35% to 50% remain valuable: they translate directly into fewer required hard
drives, and eventually into fewer storage devices.
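
The gap between repetitive and random data is easy to observe. The sketch below uses Python's standard `zlib` module as a stand-in for a storage compression tool (an assumption; the report does not name any algorithm), and the 2 TB drive size in `drives_needed` is a hypothetical figure chosen only to show how a reduction rate translates into drive count.

```python
import math
import os
import zlib


def reduction_rate(data: bytes) -> float:
    """Fraction of space saved by zlib compression (negative if data grows)."""
    compressed = zlib.compress(data, 9)
    return 1 - len(compressed) / len(data)


def drives_needed(data_tb: float, rate: float, drive_tb: float = 2.0) -> int:
    """Whole drives required to hold data_tb after a given reduction rate."""
    return math.ceil(data_tb * (1 - rate) / drive_tb)


if __name__ == "__main__":
    repetitive = b"quarterly sales report\n" * 10_000
    random_like = os.urandom(100_000)  # behaves like encrypted data
    print(f"repetitive text: {reduction_rate(repetitive):.1%} saved")
    print(f"random bytes:    {reduction_rate(random_like):.1%} saved")
    # Average reduction rate of 31% from Table 4, applied to 100 TB of data.
    print(drives_needed(100, 0.0), "drives without reduction")
    print(drives_needed(100, 0.31), "drives at a 31% reduction rate")
```

Random input typically comes out marginally larger than it went in, while highly repetitive text shrinks to a small fraction of its size.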
Manage the Change of Data: Data Tiering
Storage tiering advocates using the fastest, most reliable storage medium to
house the newest and most important enterprise data. As the data ages and
its usefulness diminishes, it can be transferred to slower, less reliable and
cheaper storage devices. The cost savings come from the effective use of
less expensive storage mediums, such as SAS or SATA hard disks, for lower
tiers of data storage.
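
An age-based tiering rule of the kind described above might look like the following sketch. The tier names and the 30-day and 180-day thresholds are hypothetical values chosen for illustration; the report does not prescribe specific cut-offs.

```python
from datetime import date

# Assumed policy: (maximum age in days, tier). Thresholds are illustrative only.
TIER_RULES = [(30, "tier1"), (180, "tier2")]
DEFAULT_TIER = "tier3"


def choose_tier(last_access: date, today: date) -> str:
    """Pick a storage tier from how long ago the data was last accessed."""
    age_days = (today - last_access).days
    for max_age, tier in TIER_RULES:
        if age_days <= max_age:
            return tier
    return DEFAULT_TIER


if __name__ == "__main__":
    today = date(2012, 2, 1)
    for last_access in (date(2012, 1, 20), date(2011, 10, 1), date(2010, 6, 1)):
        print(last_access, "->", choose_tier(last_access, today))
```

An automated tiering product would run a rule like this on a schedule and migrate the data that has changed tier.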
Table 5: How many storage tiers do you have?

                    How many tiers of storage do you use?
2 Tiers             33%
3 Tiers             47%
4 or more tiers     13%
Source: Aberdeen Group, January 2012
Automated storage tiering is a fairly new technology, and not widely
deployed in most organizations. Only 15% of organizations reported
deploying automated storage tiering in their current storage strategy.
In September 2011, Aberdeen published a complete report on storage
tiering, called “Reduce the Cost of Storage with Tiering”. The next section
shares details from the analysis presented in that report.
Tiering Details
Managers must make three basic choices concerning the make-up of their
storage tiers: the storage technology, the speed of the device, and the form
of RAID chosen to protect the data. Each of these choices has a direct
impact on the speed of storage performance and the cost per TB of data.
Figure 3: Device Types by Storage Tier
[Bar chart; y-axis: Percentage of Respondents (n = 106)]
                                     Tier 1   Tier 2
Solid State Disks - SSDs             8%       0%
Fibre Channel Hard Drive - FC HDD    46%      22%
SAS Hard Drive                       8%       39%
SATA Hard Drive                      38%      39%
Source: Aberdeen Group, May 2011
Figures 3, 4 and 5 show the percentages of companies that deploy
different technologies in Tier 1 vs. Tier 2 storage.
SSDs are the fastest, but by far the most expensive form of data storage.
They are reported to be 25 to 100 times faster than SAS hard drives, and
are ideal for storing the top percentage of the data that is constantly
accessed by users. Since SSDs have no spinning disks, there is no "seek
time"—the milliseconds required to get the disk head over the right portion
of the drive to perform a read or write command. This technology is fairly
new, and will make up a higher percentage of Tier 1 storage as prices
decline.
RAID Levels
The following terms are used when discussing storage data protection:
√ RAID - Redundant Array of Independent Disks (RAID) describes how data
is distributed across hard disks to provide protection in case of the
loss of a drive.
√ RAID 10 - Creates a mirrored set of the data, striped across pairs of
drives. Fastest read/write performance but high cost in terms of number
of disks.
√ RAID 5 - Striped data with parity distributed across drives. Can
survive the loss of a single drive.
√ RAID 6 - Striped data with two sets of parity distributed across
drives. Can survive the loss of two disks in the set.
Fibre Channel drives are the most reliable, and the primary choice for disk
drives used in Tier 1 storage. They are also the most costly hard disk
option. Likely due to their expense, only half the respondents who used
these drives for Tier 1 storage also used them for Tier 2 storage, preferring
instead to use lower cost SAS and SATA drives for that task.
Generally, the faster the hard drive spins, the faster data is read or written,
since the head reaches targeted blocks of the disk more rapidly. Disk drives
with 15K RPM account for almost half of the HDDs used in Tier 1. This
percentage decreases to just 25% in Tier 2. Drives with 10K RPM become
the speed of choice in Tier 2, with some even slower 7.2K RPM drives
beginning to appear.
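
The link between spindle speed and access time can be made concrete with average rotational latency, the time the platter needs to complete half a revolution. The short sketch below computes it for the three drive speeds mentioned; it ignores seek time and transfer time, so it shows only one component of total access time.

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average wait, in milliseconds, for a sector to rotate under the head."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2  # on average, half a revolution


if __name__ == "__main__":
    for rpm in (7_200, 10_000, 15_000):
        print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
```

A 15K RPM drive waits about 2 ms on average, against roughly 4.2 ms for a 7.2K RPM drive.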
Figure 4: Device Speed by Storage Tier
[Bar chart; y-axis: Percentage of Respondents (n = 106)]
              Tier 1   Tier 2
7.2K RPM      0%       8%
10K RPM       15%      33%
15K RPM       46%      25%
No Factor     38%      33%
Source: Aberdeen Group, May 2011

In the survey process, Aberdeen allowed end users to choose "no factor,"
meaning that they did not use device speed as a criterion for deciding how
to format their storage tiers. About one third of end users do not consider
the speed of the drive significant enough to influence their selection of
technology for their storage tiers.
Figure 5: RAID Level by Storage Tier
[Bar chart; y-axis: Percentage of Respondents (n = 106)]
              Tier 1   Tier 2
RAID 10       46%      17%
RAID 5        8%       42%
RAID 6        8%       0%
No Factor     38%      41%
Source: Aberdeen Group, May 2011

“You can replace destroyed
hardware, but you cannot
replace destroyed critical
business data. Protect it
adequately. We use LTO Tape
as our final tier. That's how we
recovered after a flood
destroyed all of our hardware.
Our tapes were safely stored
off-site.”
~ IT Director, Large
Transportation Company, US

RAID is a relationship set among a group of hard drives to protect the
recorded data. Different RAID levels support the loss of one, two or
several hard drives at a time without the loss of any of the data they
contain. They do this by linking drives into logical units and dividing the data
across multiple members.
The different RAID formats have tradeoffs between levels of data
protection and the speed of read / write commands. RAID 10 simply copies
data to two different drives with no calculated error protection (called
parity). This method is very fast, but requires 100% more storage capacity,
since all data is completely replicated on two different disks. At the other
end, RAID 6 requires heavy parity calculations, but requires less additional
disk capacity to provide data protection. RAID 6 requires fewer hard drives
to store the data, but takes longer to read and write from the drives, as it
has to calculate the parity bits. Another factor to consider is that if a drive
in a RAID 6 set fails, high processor overhead is required to recreate the
failed disks on a hot spare – a backup hard drive that fills in for any failing
disk.
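
The capacity and rebuild tradeoffs described above can be illustrated with a small sketch. It assumes equal-sized drives in a single RAID set, and the function names are hypothetical. The XOR routine shows single-parity reconstruction of the kind used by RAID 5; RAID 6 adds a second, differently computed parity that is not shown here.

```python
def usable_drives(level: str, drives: int) -> int:
    """Drives' worth of usable capacity in a set of equal-sized drives."""
    if level == "RAID 10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        return drives // 2  # every block is written to two drives
    if level == "RAID 5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return drives - 1   # one drive's worth of parity
    if level == "RAID 6":
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return drives - 2   # two drives' worth of parity
    raise ValueError(f"unknown RAID level: {level}")


def capacity_overhead(level: str, drives: int) -> float:
    """Extra raw capacity required, as a fraction of usable capacity."""
    return drives / usable_drives(level, drives) - 1


def xor_blocks(blocks) -> bytes:
    """XOR equal-length blocks together (single-parity calculation)."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


if __name__ == "__main__":
    for level in ("RAID 10", "RAID 5", "RAID 6"):
        print(f"{level}: {capacity_overhead(level, 8):.0%} overhead with 8 drives")
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)
    # Simulate losing the second drive and rebuilding it from the survivors.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
```

Rebuilding touches every surviving drive in the set, which is where the processor and I/O overhead of a parity rebuild comes from.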
Organizations prefer to use RAID 10 in Tier 1, as it has the fastest read /
write performance. For this level of important data, the extra cost of RAID
10 is often justified. In Tier 2 the dominant choice is RAID 5, which also
protects against the loss of a single drive but performs read / write
commands more slowly due to parity calculations.
Forty-one percent (41%) of respondents reported having no preference as
to the type of RAID. For these respondents, the effect of RAID on Tier 2
may not be all that important, as the data is not accessed or updated very
often.
Tiering Cost Analysis
Aberdeen priced out equal amounts of storage for both a Tier 1 and Tier 2
configuration as defined by the responding organizations. Aberdeen used
component prices current to September 2011, with standard discounts, and
kept to generally accepted storage configuration rules, such as not
exceeding 80% of hard disk volumes and providing additional volumes for
different levels of RAID.
The Tier 2 configuration was 25% less costly than that of Tier 1.
This was predominantly caused by two factors:
• Tier 1 SSD storage - SSD storage is very expensive compared to the
cost of hard disk drives. In the Tier 1 configuration, SSD storage
provided only 8% of the capacity, but cost 40% of the budget.
• Tier 1 RAID choices - About 50% of the Tier 1 capacity was
configured as RAID 10. This requires about twice the amount of
raw disk capacity, as the data is written fully to two separate disks.
The choice of RAID increased the cost of the hard drives in Tier 1
by 50%, compared to only 25% in Tier 2.
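
The sizing rules mentioned in this section (an 80% fill limit plus extra raw capacity for RAID) can be combined into a rough calculation. The sketch below is an assumption-laden illustration, not Aberdeen's pricing model: the function names are hypothetical, and treating the non-RAID 10 half of Tier 1 as having no overhead is a simplification chosen only because it reproduces the 50% figure quoted above.

```python
def blended_raid_overhead(mix) -> float:
    """mix: list of (share_of_capacity, raid_overhead) pairs; shares sum to 1."""
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("capacity shares must sum to 1")
    return sum(share * overhead for share, overhead in mix)


def raw_capacity_tb(usable_tb: float, raid_overhead: float, max_fill: float = 0.80) -> float:
    """Raw disk capacity to buy for usable_tb of data under a fill limit."""
    return usable_tb / max_fill * (1 + raid_overhead)


if __name__ == "__main__":
    # Assumption: half of Tier 1 is RAID 10 (100% overhead); the rest is
    # treated as overhead-free purely to mirror the report's 50% figure.
    tier1_overhead = blended_raid_overhead([(0.5, 1.0), (0.5, 0.0)])
    tier2_overhead = 0.25  # figure quoted in the report for Tier 2
    print(f"Tier 1: {raw_capacity_tb(100, tier1_overhead):.1f} TB raw per 100 TB usable")
    print(f"Tier 2: {raw_capacity_tb(100, tier2_overhead):.1f} TB raw per 100 TB usable")
```

This compares raw capacity only; the 25% cost difference reported above also reflects the price gap between SSD, Fibre Channel, SAS, and SATA media.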
“Have 3 level on-line storage
and 1 level optical disk
archive.”
~ CIO, Mid-sized Financial
Services Organization, Middle
East

The pricing was based on storage options provided by one of the leading
storage vendors in the industry. Every effort was made to keep the cost of
enclosures, management software, and operational factors constant
between the two tiers, allowing Aberdeen to focus only on the storage
medium.
Other Savings from Storage Tiering
Storage tiering can also provide other benefits:
• Reduced power and cooling charges - By using higher density and
slower drives in Tier 2, fewer hard drives can store the same
amount of data. Fewer drives require less space in a computer rack,
consume less power, produce less heat, and require less cooling.
• Faster performance - By keeping the most accessed data on SSD
and FC drives, the performance of the storage arrays can be
improved. This benefits the entire IT architecture, as storage I/O is
traditionally a limit to overall datacenter performance.
Storage tiering is a fairly new concept, and there are a limited number of
organizations that have implemented it long enough to realize solid,
quantifiable returns. Aberdeen will revisit this topic next year and look to
publish metrics from all aspects of a storage tiering program.
Report Summary: Managing Big Data
Organizations are experiencing radical increases in the size and complexity
of the business data they are required to manage. A 35% growth rate means
that the size of data that must be stored doubles every 2.5 years. IT cannot
allow their spending for storing data to increase at that rate, so technologies
and strategies must be adopted to make storage more efficient. According
to Aberdeen’s research, these strategies include:
• Storage virtualization brings all the storage devices under a
single management umbrella. Islands of unused storage capacity
become available to everyone, and data can be stored where it is
most logical. IT can manage larger data pools more effectively if
storage devices have been virtualized.
• Data compression and deduplication can, on average, reduce
the size of stored data files by 30%. This impacts an organization’s
budget by reducing the number of hard drives needed and,
ultimately, the number of SAN and NAS devices. This results in real
capital savings.
• Data tiering ensures that the most important data is stored on the
fastest and most reliable media. Older, less important data can be
moved automatically to less expensive storage devices. This ensures
the corporation is not paying premium dollars for housing non-
critical data.
Using these three storage features, organizations can reduce the financial
impact of Big Data management.

For more information on this or other research topics, please visit
www.aberdeen.com.

Author: Dick Csaplar, Senior Research Analyst, Virtualization and Storage
Practice ([email protected])

For more than two decades, Aberdeen's research has been helping corporations worldwide become Best-in-Class.
Having benchmarked the performance of more than 644,000 companies, Aberdeen is uniquely positioned to provide
organizations with the facts that matter —the facts that enable companies to get ahead and drive results. That's why
our research is relied on by more than 2.5 million readers in over 40 countries, 90% of the Fortune 1,000, and 93% of
the Technology 500.
As a Harte-Hanks Company, Aberdeen’s research provides insight and analysis to the Harte-Hanks community of
local, regional, national and international marketing executives. Combined, we help our customers leverage the power
of insight to deliver innovative multichannel marketing programs that drive business-changing results. For additional
information, visit Aberdeen http://www.aberdeen.com or call (617) 854-5200, or to learn more about Harte-Hanks, call
(800) 456-9748 or go to http://www.harte-hanks.com.
