November - 2014

Research Paper
Analytics of Data using Hadoop - A Review
Vineet Sajwan
Student, Computer Science and Engineering
MGMCoET, Noida Sec-62

Vikash Yadav
Student, Computer Science and Engineering
MGMCoET, Noida Sec-62

Abstract— We live in a world of data. It is not easy to measure how much data is produced every day, but according to an IDC estimate our planet held about 2.75 zettabytes (ZB) of data in 2013, and that figure was projected to reach 8 ZB by 2015. This flood of data comes from social websites (such as Facebook and Twitter) and from other sources such as GPS, Google Maps, and heat/pressure sensors. This situation has given rise to the term "big data". Our paper discusses what big data really is and what platform can be used to process it.
Keywords— Big data, big data analytics, MapReduce, HDFS, HBase, Pig, Hive
I. INTRODUCTION
A. Definition
The volume of data that enterprises and modern technologies acquire every day is increasing exponentially. Many
organizations now use big data analytics to process massive amounts of data. Big data refers to data sets so large
that they must be analysed computationally to reveal patterns and trends; big data analytics is the process of
examining such data sets, which may be structured or unstructured, to uncover hidden patterns, unknown
correlations, market trends, and business information.
B. Big Data Parameters
To understand the phenomenon of big data, it is commonly described in terms of five Vs: Volume, Velocity,
Variety, Veracity, and Value.
a. Volume: refers to the vast amount of data generated every second. Facebook, for example, holds
roughly 10 billion photos, taking up about one petabyte of storage, and its users send around 10
billion messages and upload 300 million new pictures every day. The volume of stored data has been
growing exponentially. Big data analytics tools let us store and use such huge data sets through
distributed systems, where parts of the data are kept in different locations and brought together
by software.
b. Velocity: in today's world of social networks, data is generated quickly and moves around very
fast; a social media message (on Facebook, for instance) can go viral in seconds. Big data
technology allows us to analyse data as it is generated, without first putting it into a database.
c. Variety: refers to the structured, semi-structured, and unstructured data we can now use. In the
past we focused on data that fits into tables, but today roughly 80% of the world's data is
unstructured and cannot easily be put into tables. Big data technology lets us harness different
types of data (including social media messages, photos, video, and sensor data) and combine them
with more traditional, structured data.
d. Veracity: refers to the trustworthiness of data. As the variety of data grows, quality and
accuracy tend to decrease; Twitter posts, for instance, contain hashtags, abbreviations, typos,
and colloquial speech. Big data analytics nevertheless allows us to work with data of this kind.
e. Value: the most important V of big data. Big data analytics is of no use if it does not produce
results in the form of value, such as statistics, hypotheses, and other insights.
II. LITERATURE SURVEY
Shilpa, Manjit Kaur (10 October 2013), "Big Data and Methodology" describes big data, its parameters, its
methodology, the issues surrounding big data, and their solutions.
Pareedpa, A.; Dr. Antony Selvadoss Thanamani (8 August 2013), "Significant Trends of Big Data" compares big
data analytics with traditional warehouse appliances. It also describes the future of big data and big data
tools.
Gurpreet Singh Bedi, Ashima Singh (5 April 2013), "Big Data Analysis with Dataset Scaling in Yet Another
Resource Negotiator (YARN)" describes the performance of Apache Hadoop applications, including the performance
of HDFS.
III.PROBLEM DEFINITION
Because extremely large amounts of data now exist, various challenges arise in managing them, such as handling
unstructured data, tolerating hardware faults, and storing such large volumes of data.
A. Problem Description

The problem is simple: while the storage capacity of hard drives has increased massively over the years, the rate
at which data can be read from them has not kept up. In 1990 a typical drive could store 1,370 MB of data and had
a transfer speed of 4.4 MB/s, so all the data could be read in about five minutes. Twenty years later, a typical
drive stores a terabyte or more, but transfer speeds are only around 100 MB/s, so reading all the data off the
disk takes more than two and a half hours.
B. Problem Solution
The obvious way to reduce the time is to read from multiple disks at once. Imagine we had 100 drives, each
holding one hundredth of the data; working in parallel, we could read everything in under two minutes. Using
only one hundredth of each disk may seem wasteful, but we can store 100 datasets, each one terabyte in size,
and provide shared access to them.
C. Problem to Solve
1. Hardware failure: the chance of failure grows as more drives are used. A common way of avoiding data loss is
to replicate the data, and this is the approach taken by the Hadoop Distributed Filesystem (HDFS).
2. Analysis tasks need to combine data in some way: data read from one disk may need to be combined with data
from any of the other 99 disks. Various distributed systems allow such combination, but doing it correctly is
very challenging. MapReduce provides a programming model that abstracts the problem away from disk reads and
writes, transforming it into a computation over sets of keys and values.
IV. BIG DATA ANALYTICS USING HADOOP
Hadoop: Apache Hadoop is an open-source Java framework for processing and querying vast amounts of data on
large clusters of commodity hardware. Hadoop is a top-level Apache project, initiated and led by Yahoo! and
Doug Cutting.
Apache Hadoop has two main components: the Hadoop Distributed File System (HDFS) and MapReduce. The current
Apache Hadoop ecosystem consists of HDFS, MapReduce, Pig, Hive, Sqoop, HBase, and ZooKeeper.
MapReduce: a programming framework for pulling data out of a cluster in parallel. It is a low-level programming
model based on map and reduce functions over key/value pairs.
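To make the key/value programming model concrete, the following is a minimal sketch of the classic word-count
job written against Hadoop's Java MapReduce API; the input and output HDFS paths are hypothetical command-line
arguments.

    // Word count: the map phase emits (word, 1) pairs; the reduce phase
    // sums the counts for each word.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts collected for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }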
HDFS: a Java-based file system that provides scalable and reliable data storage.
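As an illustration, here is a minimal sketch of writing and then reading a file through the HDFS Java API
(org.apache.hadoop.fs.FileSystem); the NameNode address and file path are hypothetical.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // hypothetical NameNode
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS replicates its blocks across the
        // cluster for fault tolerance.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
          System.out.println(in.readLine());
        }
      }
    }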
Pig: because MapReduce is a low-level model, implementing jobs directly in it can be difficult. Pig provides a
high-level, procedural data-flow language (Pig Latin), with which queries against the cluster can be written in
minutes rather than hours, days, or weeks.
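For instance, the sketch below runs Pig Latin statements from Java through Pig's embedded PigServer API; the
input file and its schema are hypothetical.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
      public static void main(String[] args) throws Exception {
        // Run on the cluster; ExecType.LOCAL would run on one machine.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each Pig Latin statement below compiles down to MapReduce jobs.
        pig.registerQuery(
            "users = LOAD '/data/users.tsv' AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER users BY age >= 18;");
        pig.store("adults", "/data/adults");  // write the result to HDFS
      }
    }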
Hive: gives us another way to work with data inside a Hadoop cluster. Its purpose is similar to Pig's: to
provide a quick and easy way to work with data stored in HDFS. Hive offers HiveQL, an SQL-oriented query
language.
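Below is a minimal sketch of issuing a HiveQL query from Java over JDBC (via HiveServer2); the server address,
database, and table are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveExample {
      public static void main(String[] args) throws Exception {
        // The Hive JDBC driver handles jdbc:hive2:// URLs.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {
          // HiveQL looks like SQL but executes as batch jobs over HDFS data.
          ResultSet rs = stmt.executeQuery(
              "SELECT name, COUNT(*) AS visits FROM page_views GROUP BY name");
          while (rs.next()) {
            System.out.println(rs.getString("name") + "\t" + rs.getLong("visits"));
          }
        }
      }
    }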
HBase: works on top of HDFS. HBase does not support a structured query language; HBase applications are written
in Java. It supports batch-style computation using MapReduce as well as point queries.
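As a sketch of such a point query, the following uses the HBase Java client API to write one cell and fetch it
back by row key; the table, column family, and values are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

          // Write one cell: row "row1", column family "info", qualifier "name".
          Put put = new Put(Bytes.toBytes("row1"));
          put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                        Bytes.toBytes("Alice"));
          table.put(put);

          // Point query: fetch the row back by its key.
          Result result = table.get(new Get(Bytes.toBytes("row1")));
          byte[] name = result.getValue(Bytes.toBytes("info"),
                                        Bytes.toBytes("name"));
          System.out.println(Bytes.toString(name));
        }
      }
    }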
ZooKeeper: a distributed coordination service for distributed applications. It takes care of the difficult
coordination work inside a distributed application so that we can focus on functionality rather than
architecture.
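A minimal sketch of such coordination with the ZooKeeper Java client follows: create a znode that other
processes can observe, then read it back. The connection string and znode path are hypothetical.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperExample {
      public static void main(String[] args) throws Exception {
        // Connect to the ensemble (hypothetical host); 3000 ms session timeout.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> { });

        // Create a persistent znode that other processes can watch.
        String path = zk.create("/demo-config", "v1".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read the data back.
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + new String(data));
        zk.close();
      }
    }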
Sqoop: the name combines "SQL" and "Hadoop". It is a data-transfer tool that imports data into Hadoop from
relational database systems and also exports data out of Hadoop, usually the results of MapReduce jobs.
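As a sketch, a Sqoop 1 import can be driven from Java through Sqoop.runTool; the JDBC URL, table name, and
target directory below are hypothetical.

    import org.apache.sqoop.Sqoop;

    public class SqoopExample {
      public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",  // hypothetical source DB
            "--username", "etl",
            "--table", "orders",                       // relational table to pull
            "--target-dir", "/data/orders"             // HDFS destination
        };
        int ret = Sqoop.runTool(importArgs);  // returns 0 on success
        System.exit(ret);
      }
    }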
V. HADOOP AND RDBMS
MapReduce is used alongside databases because it is a complement to the RDBMS, not a replacement for it.
TABLE I
COMPARISON BETWEEN MAPREDUCE AND RDBMS

                Traditional RDBMS              MapReduce
Data Size       Gigabytes                      Petabytes
Access          Batch and interactive          Batch
Updates         Read and write many times      Write once, read many times
Structure       Fixed schema                   Unstructured schema
Latency         Low                            High
Integrity       High                           Low
Language        SQL                            Procedural (Java, C++)
Scaling         Nonlinear                      Linear

MapReduce is a good fit for problems that need to analyse a whole dataset in batch fashion, whereas an RDBMS
is good for point queries and updates.
MapReduce suits applications where the data is written once and read many times, while a database suits
datasets that are continually updated.
MapReduce works on unstructured or semi-structured data, but an RDBMS works only on structured data.
Relational data is normalized to retain integrity and remove redundancy, but normalization poses problems for
MapReduce.

VI. CONCLUSION
We live in a world of data, and big data brings with it various challenges and issues. We conclude that big
data is not just huge data; it is an opportunity to find new insights in, and new analyses of, our data. We
have discussed the various parameters of big data and compared RDBMS with MapReduce, and we have argued that
Hadoop is a good option for data analytics. Continued support for, and research on, big data is needed to
obtain new results.

ACKNOWLEDGMENT
This work is part of a final-year project at Mahatma Gandhi College of Engineering and Technology, Noida, and
was completed under the guidance of Dr. Mohammad Haider, Assistant Professor, Department of CSE, Mahatma Gandhi
College of Engineering.

REFERENCES
[1] Shilpa, Manjit Kaur, "Big Data and Methodology", 10 October 2013
[2] Pareedpa, A.; Dr. Antony Selvadoss Thanamani, "Significant Trends of Big Data", 8 August 2013
[3] Gurpreet Singh Bedi, Ashima Singh, "Big Data Analysis with Dataset Scaling in Yet Another Resource
Negotiator (YARN)", 5 April 2013
[4] Tom White, Hadoop: The Definitive Guide, 3rd Edition, O'Reilly Media, 27 January 2012
