Hadoop and Big Data

Published June 2016

Introduction to Big Data


Big Data and Hadoop Essentials


Agenda
• Brief History in time
• How Big is Big Data?
• Why Hadoop?
• Hadoop Architecture
• Hadoop Ecosystem
• Map Reduce Algorithm Exemplified
• Demo
Brief History in time
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t
budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for
bigger computers, but more systems of computers.
—Grace Hopper, American Computer Scientist

How Big is Big Data?
Why Hadoop?
The Problem
BIG DATA – the four V's

Volume
Big Data arrives at large scale, measured in terabytes and even petabytes.
Records, transactions, tables, files.

Veracity
The quality, consistency, reliability and provenance of data.
Good, bad, undefined, inconsistent, incomplete.

Variety
Big Data extends beyond structured data to include semi-structured and unstructured data of every variety.
Text, logs, XML, audio, video, streams, flat files.

Velocity
Data flows continuously; it is time-sensitive and streaming.
Batch, real time, streams, historic.
Challenges in managing Big Data
Hadoop evolved to overcome the challenges of Big Data:
• Cost effective – commodity hardware
• Big clusters – up to ~1,000 nodes, providing both storage and processing
• Parallel processing – MapReduce
• Big storage – storage per node × number of nodes ÷ replication factor (RF)
• Failover mechanism – automatic failover
• Data distribution
• Moving code to the data
• Heterogeneous hardware (IBM, HP, AIX, Oracle machines of any memory and CPU configuration)
• Scalable
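As a quick sanity check of the storage formula on this slide (reading per-node capacity as disk storage, since the replication factor applies to stored blocks), here is a small calculation with purely illustrative, made-up figures:

```python
# Usable capacity ~= storage per node * number of nodes / replication factor (RF).
# All figures below are illustrative assumptions, not measurements.
storage_per_node_tb = 10      # TB of raw disk per node (assumed)
num_nodes = 1000              # the "big cluster" size from the slide
replication_factor = 3        # HDFS default: each block is stored 3 times

usable_tb = storage_per_node_tb * num_nodes / replication_factor
print(f"Raw: {storage_per_node_tb * num_nodes} TB, usable: {usable_tb:.0f} TB")
```

So a 1,000-node cluster with 10 TB per node holds 10 PB of raw data, but only about a third of that is usable once every block is replicated three times.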

What Exactly is Hadoop?
What’s in a name?
Hadoop Vendors
Who uses Hadoop?
15
What is Hadoop used for?
Stop and Ponder
• Is Hadoop an alternative to an RDBMS?
• Hadoop is not replacing the traditional data systems used for building analytic applications – the RDBMS, EDW and MPP systems – but rather complements them, and works well alongside an RDBMS.
• Hadoop is being used to distill large quantities of data into something more manageable.
Stop and Ponder
• But don't we already know Coherence to be distributed too? Why Hadoop?
• Coherence is the market-leading in-memory data grid. Hadoop works well for large processing operations over many terabytes of data that can be handled in a batch-like way; for use cases where the processing requirements are more real-time and the data volumes are smaller, Coherence is a better choice than HDFS for storing the data.
Hadoop vs. RDBMS

            RDBMS                   MapReduce
Data size   Gigabytes               Petabytes
Access      Interactive and batch   Batch
Structure   Fixed schema            No fixed schema
Language    SQL                     Procedural (Java, C++, Ruby, etc.)
Integrity   High                    Low
Scaling     Nonlinear               Linear
Updates     Read and write          Write once, read many times
Latency     Low                     High
Using Hadoop in the Enterprise
Hadoop Architecture

• Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of
large data sets on compute clusters.
[Diagram: Hadoop = HDFS (storage) + MapReduce (processing)]
Hadoop Distributed File System (HDFS)
HDFS Architecture(Master-Slave)
Secondary
Name Node
Master
Book Keeper
Slave(s)
Periodic checkpoint
Data Block
23
The CORE
[Diagram: a Client submits data-analytics jobs to MapReduce and data-storage jobs to HDFS; both span the Master and the Slave nodes]
Hadoop Ecosystem
The MapReduce Algorithm, Exemplified!
Goal: calculate the yearly average temperature per state.
1. Group the city average temperatures by state.
2. We don't really care about the city names, so we discard those and keep only the state names and the city temperatures.
3. We get a list of temperature averages for each state.
4. All we have to do is calculate the average temperature for each state.
That was Map/Reduce!
Let’s do it again…
• Map/Reduce has 3 stages : Map/Shuffle/Reduce
• The Shuffle part is done automatically by Hadoop, you just need to
implement the Map and Reduce parts.
• You get input data as <Key,Value> for the Map part.
• In this example, the Key is the City name, and the Value is the set
of attributes : State and City yearly average temperature.

31
• Since we want to regroup the temperatures by state, we get rid of the city name: the state becomes the Key and the temperature becomes the Value.
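A minimal sketch of this Map step in plain Python (not Hadoop's Java API); the cities, states and temperatures are made-up sample data:

```python
# Input records: key = city name, value = (state, yearly average temperature).
# All data here is hypothetical sample data.
records = [
    ("Buffalo",       ("NY", 48.0)),
    ("New York City", ("NY", 53.0)),
    ("Los Angeles",   ("CA", 66.0)),
    ("San Francisco", ("CA", 57.0)),
]

def map_record(city, value):
    """Drop the city name: the state becomes the key, the temperature the value."""
    state, temperature = value
    return (state, temperature)

mapped = [map_record(city, value) for city, value in records]
print(mapped)  # [('NY', 48.0), ('NY', 53.0), ('CA', 66.0), ('CA', 57.0)]
```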
Shuffle
• The shuffle task now runs on the output of the Map task. It groups all the values by Key, so each Key ends up with a List<Value>.
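The Shuffle step can be sketched the same way; the (state, temperature) pairs below are the hypothetical output of the Map sketch:

```python
from collections import defaultdict

# Output of the Map step: (state, temperature) pairs (hypothetical sample data).
mapped = [("NY", 48.0), ("NY", 53.0), ("CA", 66.0), ("CA", 57.0)]

# Shuffle: group all values by key, producing key -> List<Value>.
shuffled = defaultdict(list)
for state, temperature in mapped:
    shuffled[state].append(temperature)

print(dict(shuffled))  # {'NY': [48.0, 53.0], 'CA': [66.0, 57.0]}
```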
Reduce
• The Reduce task is the one that applies the logic to the data; in our case, it calculates each state's yearly average temperature.
• And that is what we get as the final output.
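And a sketch of the Reduce step, again in plain Python over the hypothetical shuffled data:

```python
# Input to Reduce: each state mapped to its list of city temperatures
# (hypothetical sample data, as produced by the Shuffle step).
shuffled = {"NY": [48.0, 53.0], "CA": [66.0, 57.0]}

def reduce_state(state, temperatures):
    """Compute the state's yearly average temperature."""
    return (state, sum(temperatures) / len(temperatures))

reduced = dict(reduce_state(s, t) for s, t in shuffled.items())
print(reduced)  # {'NY': 50.5, 'CA': 61.5}
```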
Hadoop AppStore

Ecosystem Matrix
Pig and HIVE in the Hadoop Ecosystem
Hadoop Ecosystem Development
Demo
References
• http://hadoop.apache.org/
• http://hadoop.apache.org/hive/
• Hadoop in Action
(http://www.manning.com/lam/)
• Hadoop: The Definitive Guide, 2nd ed.
(http://oreilly.com/catalog/0636920010388)
• Yahoo! Hadoop blog
(http://developer.yahoo.net/blogs/hadoop/)
• Cloudera
(http://www.cloudera.com/)
Q & A
Thank You
