• Big Data Overview
• Characteristics
• Applications & Use Case
Hadoop Distributed File System (HDFS) Overview
HDFS Architecture
Data replication
Node types
Jobtracker / Tasktracker
HDFS Data Flows
HDFS Limitations
Hadoop Overview
Inputs & Outputs
Data Types
What is MapReduce (MR)
Example
Functionalities of MR
Speculative Execution
Hadoop Streaming
Hadoop Job Scheduling
Data Science Analytics & Research Centre
Big Data Overview
Characteristics
Applications & Use Case
Data Footprint & Time Horizon
Technology Adoption Lifecycle
9/20/2014
[Figure: Data Footprint & Time Horizon. Consumption layers (highly summarized visualization & dashboards; aggregated analytic marts & cubes; detailed events/facts for predictive analytics) are mapped against time horizons from real time through hourly, daily, weekly, monthly, quarterly, and yearly out to 3, 5, and 10 years. Sources range from core ERP & legacy applications & data warehouse to unstructured web/telemetry Big Data (Hadoop etc.). The data footprint grows from GB in real time, through TB at daily/monthly horizons, to PB at yearly horizons.]
Financial Services
• Detect fraud
• Model and manage risk
• Improve debt recovery rates
• Personalize banking/insurance products

Healthcare
• Optimal treatment pathways
• Remote patient monitoring
• Predictive modeling for new drugs
• Personalized medicine

Retail
• In-store behavior analysis
• Cross selling
• Optimize pricing, placement, design
• Optimize inventory and distribution
Web / Social / Mobile
• Location-based marketing
• Social segmentation
• Sentiment analysis
• Price comparison services

Government
• Reduce fraud
• Segment populations, customize action
• Support open data initiatives
• Automate decision making

Manufacturing
• Design to value
• Crowd-sourcing
• “Digital factory” for lean manufacturing
• Improve service via product sensor data
Hadoop Distributed File System (HDFS)
Overview
HDFS Architecture
Data replication
Node types
Jobtracker / Tasktracker
HDFS Data Flows
HDFS Limitations
Hadoop's own implementation of a distributed file system.
It is coherent and provides all the facilities of a file system.
Implements ACLs and provides a subset of the usual UNIX commands for accessing or querying the filesystem.
It has a large block size (default 64 MB; 128 MB recommended) so that seek time stays small relative to transfer time at network bandwidth. Very large files are therefore ideal for storage.
Streaming data access: a write-once, read-many-times architecture. Since files are large, the time to read the whole file is a more significant parameter than the seek to the first record.
Commodity hardware: it is designed to run on commodity hardware, which may fail; HDFS is capable of handling such failures.
E.g. with a 128 MB block size, a 420 MB file is split into blocks of 128 MB, 128 MB, 128 MB, and 36 MB.
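The split arithmetic above can be sketched in a few lines of Java (a toy illustration, not HDFS code; the 128 MB block size is the recommended value from the slide):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplit {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, the recommended size

    // Return the sizes of the blocks a file of fileSize bytes occupies:
    // full blocks, plus one final partial block for the remainder.
    static List<Long> splitIntoBlocks(long fileSize, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        for (long remaining = fileSize; remaining > 0; remaining -= blockSize) {
            blocks.add(Math.min(blockSize, remaining));
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        for (long b : splitIntoBlocks(420 * mb, BLOCK_SIZE)) {
            System.out.println(b / mb + " MB");
        }
        // Prints: 128 MB, 128 MB, 128 MB, 36 MB
    }
}
```

Note that the final partial block only occupies 36 MB on disk; HDFS does not pad it out to a full block.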
[Figure: HDFS data replication. A client issues Create and Complete calls for File 1 to the Namenode; the file's blocks B1, B2, and B3 are each replicated across datanodes n1 to n4 spread over Rack 1, Rack 2, and Rack 3, with the Namenode tracking the block locations.]
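The rack-aware placement shown in the figure can be illustrated with a small Java sketch of the default 3-replica policy (a simplified model, not the actual Namenode code; node and rack names are made up): the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that second rack.

```java
import java.util.*;

public class ReplicaPlacement {
    // Simplified sketch of the default HDFS placement policy for 3 replicas.
    static List<String> placeReplicas(String writer, Map<String, String> nodeToRack) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writer); // 1st replica: the writer's own node
        String writerRack = nodeToRack.get(writer);
        // 2nd replica: first node found in a different rack
        String second = nodeToRack.keySet().stream()
                .filter(n -> !nodeToRack.get(n).equals(writerRack))
                .findFirst().orElseThrow(() -> new IllegalStateException("no off-rack node"));
        replicas.add(second);
        // 3rd replica: a different node in the same rack as the 2nd
        String secondRack = nodeToRack.get(second);
        String third = nodeToRack.keySet().stream()
                .filter(n -> nodeToRack.get(n).equals(secondRack) && !n.equals(second))
                .findFirst().orElseThrow(() -> new IllegalStateException("no peer node"));
        replicas.add(third);
        return replicas;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("n1", "rack1"); cluster.put("n2", "rack1");
        cluster.put("n3", "rack2"); cluster.put("n4", "rack2");
        System.out.println(placeReplicas("n1", cluster)); // [n1, n3, n4]
    }
}
```

This layout survives the loss of any single node or any single rack while keeping two of the three replicas within one rack to limit cross-rack write traffic.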
• HDFS Flow – Read
• HDFS Flow – Write
Command: Usage and Syntax

cat: Copies source paths to stdout.
    hadoop dfs -cat URI [URI …]

chgrp: Change group association of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chgrp [-R] GROUP URI [URI …]

chmod: Change the permissions of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]

chown: Change the owner of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]

copyFromLocal: Similar to the put command, except that the source is restricted to a local file reference.
    hadoop dfs -copyFromLocal <localsrc> URI

copyToLocal: Similar to the get command, except that the destination is restricted to a local file reference.
    hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

cp: Copy files from source to destination.
    hadoop dfs -cp URI [URI …] <dest>

du: Displays the aggregate length of files contained in the directory, or the length of a file in case it's just a file.

dus: Displays a summary of file lengths.

expunge: Empty the Trash.

get: Copy files to the local file system.

getmerge: Concatenates files in source into the destination local file.

ls (or) lsr: For a file, returns stat on the file; for a directory, returns the list of its direct children.

mv: Moves files from source to destination.
    hadoop dfs -mv URI [URI …] <dest>

put: Copy single src, or multiple srcs, from the local file system to the destination filesystem.
    hadoop dfs -put <localsrc> ... <dst>

rm (or) rmr: Delete files specified as args.
    hadoop dfs -rm URI [URI …]

setrep: Changes the replication factor of a file. The -R option is for recursively increasing the replication factor of files within a directory.
    hadoop dfs -setrep [-R] <path>

stat: Returns the stat information on the path.
    hadoop dfs -stat URI [URI …]

tail: Displays the last kilobyte of the file to stdout.
    hadoop dfs -tail [-f] URI

test: Options: -e if the file exists, -z if the file is zero length, -d if the path is a directory.
    hadoop dfs -test -[ezd] URI

text: Takes a source file and outputs the file in text format.
    hadoop dfs -text <src>

touchz: Create a file of zero length.
    hadoop dfs -touchz URI [URI …]
Low-latency data access: HDFS is not optimized for low-latency data access; it trades latency to increase the throughput of the data.
Lots of small files: since the block size is 64 MB, lots of small files waste blocks and increase the memory requirements of the namenode, which holds metadata for every block.
Multiple writers and arbitrary modification: there is no support for multiple writers in HDFS; files are written to by a single writer, with writes always made at the end of the file.
Hadoop Overview
Inputs & Outputs
Data Types
What is MR
Example
Functionalities of MR
Speculative Execution
How Hadoop runs MR
Hadoop Streaming
Hadoop Job Scheduling
Hadoop is a framework which provides open-source libraries for distributed computing using a simple MapReduce interface and its own distributed filesystem, HDFS. It facilitates scalability and takes care of detecting and handling failures.
• 1.0.X - current stable version, 1.0 release
• 1.1.X - current beta version, 1.1 release
• 2.X.X - current alpha version
• 0.23.X - similar to 2.X.X but missing NN HA
• 0.22.X - does not include security
• 0.20.203.X - old legacy stable version
• 0.20.X - old legacy version
• Risk Modeling:
– How business/industry can better understand risk.
• Recommendation Engine:
– How to predict customer preferences.
• Ad Targeting:
– How to increase campaign efficiency.
• Point of Sale Transaction Analysis:
– Targeting promotions to make customers buy.
• Predicting Network Failure:
– Using machine-generated data to identify trouble spots.
• Threat Analysis:
– Detecting threats and fraudulent activity.
• Trade Surveillance:
– Helping businesses spot the rogue trader.
• Search Quality:
– Delivering more relevant search results to customers.
The framework was introduced by Google.
It processes vast amounts of data (multi-terabyte data-sets) in parallel.
It achieves high performance on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
It splits the input data-set into independent chunks.
It sorts the outputs of the maps, which are then input to the reduce tasks.
It takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of
<key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, List(v2)> -> reduce -> <k3, v3> (output)
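This <key, value> flow can be simulated outside Hadoop with a short Java sketch of map, shuffle/sort, and reduce, using word count as the example (the method names here are illustrative, not Hadoop APIs):

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: <k1, v1> = (line) -> <k2, v2> = (word, 1) per token
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                out.add(Map.entry(word, 1));
        return out;
    }

    // Shuffle/sort: group values by key, sorted -> <k2, List(v2)>
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairs)
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return groups;
    }

    // Reduce: sum the values for each key -> <k3, v3>
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println(reduce(shuffle(map(input))));
        // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

In real Hadoop the shuffle happens across the network between map and reduce tasks; here it is just an in-memory grouping, but the type signatures mirror the <k1, v1> -> <k2, v2> -> <k2, List(v2)> -> <k3, v3> pipeline above.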
• Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.
• Hadoop has the Writable interface supporting serialization.
• The following predefined implementations are available for WritableComparable:
1. IntWritable
2. LongWritable
3. DoubleWritable
4. VLongWritable - variable size, stores as much as needed (1-9 bytes of storage)
5. VIntWritable - less used, as it is largely covered by VLongWritable
6. BooleanWritable
7. FloatWritable
Apart from the above, there are four Writable collection types:
1. ArrayWritable
2. TwoDArrayWritable
3. MapWritable
4. SortedMapWritable
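The Writable pattern can be illustrated without Hadoop on the classpath. Below is a minimal sketch of an IntWritable-like value type (the IntValue class is hypothetical, not the Hadoop class) that serializes itself to and from a byte stream via write/readFields, the same two-method contract the Writable interface defines:

```java
import java.io.*;

public class WritableSketch {
    // Minimal Writable-style value type: fixed 4-byte serialization of an int.
    static class IntValue {
        int value;
        IntValue(int value) { this.value = value; }
        void write(DataOutput out) throws IOException { out.writeInt(value); }
        void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    public static void main(String[] args) throws IOException {
        // Serialize to a byte stream (as Hadoop does for network/disk I/O)...
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new IntValue(42).write(new DataOutputStream(buf));

        // ...and deserialize back from the bytes.
        IntValue restored = new IntValue(0);
        restored.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(restored.value); // 42
    }
}
```

The variable-length types (VIntWritable, VLongWritable) differ only in that they emit between 1 and 9 bytes depending on the magnitude of the value, instead of a fixed 4 or 8.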
[Figure: MapReduce job structure. Input Data passes through an Input Data Format to produce <K1, V1> pairs; the MapperClass's Mapper emits <K2, V2>; after grouping into <K2, List(V2)>, the ReducerClass's Reducer emits <K3, V3>.]
// Mapper fields (implied by the slide's code):
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // Tokenize the line and emit (word, 1) for each token
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}

public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // Sum all counts for this word
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
}
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Mapper implementation. The first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

Combiner implementation. Output of the first map after combining:
< Bye, 1>
< Hello, 1>
< World, 2>
Output of the second map after combining:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
• A way of coping with individual machine performance
• The same input can be processed multiple times in parallel, to exploit differences in machine capabilities
• The Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform

Name: mapred.map.tasks.speculative.execution
Value: true
Description: If true, then multiple instances of some map tasks may be executed in parallel.

Name: mapred.reduce.tasks.speculative.execution
Value: true
Description: If true, then multiple instances of some reduce tasks may be executed in parallel.
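For reference, these two properties would be set in mapred-site.xml like this (a sketch assuming the classic Hadoop 1.x configuration file format):

```xml
<configuration>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
  </property>
</configuration>
```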
Utility that comes with the Hadoop distribution.
Allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
  -reducer /bin/wc \
  -jobconf mapred.reduce.tasks=2
Default Scheduler
• Single priority-based queue of jobs.
• Scheduling tries to balance map and reduce load on all tasktrackers in the cluster.

Capacity Scheduler
• Within a queue, jobs with higher priority will have access to the queue's resources before jobs with lower priority.
• In order to prevent one or more users from monopolizing its resources, each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is competition for them.

Fair Scheduler
• Multiple queues (pools) of jobs, sorted FIFO or by fairness limits.
• Each pool is guaranteed a minimum capacity, and excess is shared by all jobs using a fairness algorithm.
• The scheduler tries to ensure that, over time, all jobs receive an equal share of resources.
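The Fair Scheduler's "guaranteed minimum plus shared excess" idea can be sketched in Java. This is a conceptual toy, not Hadoop's actual algorithm (it ignores demand, weights, and preemption): each pool first receives its minimum capacity, and the remaining task slots are split evenly among the pools.

```java
import java.util.*;

public class FairShareSketch {
    // Allocate totalSlots across pools: each pool gets its guaranteed
    // minimum, then the excess is divided evenly among all pools.
    static Map<String, Integer> allocate(Map<String, Integer> minShares, int totalSlots) {
        Map<String, Integer> alloc = new LinkedHashMap<>(minShares);
        int used = minShares.values().stream().mapToInt(Integer::intValue).sum();
        int excess = totalSlots - used;
        int pools = minShares.size();
        int i = 0;
        for (String pool : minShares.keySet()) {
            // Spread the excess as evenly as integer arithmetic allows
            int extra = excess / pools + (i < excess % pools ? 1 : 0);
            alloc.merge(pool, extra, Integer::sum);
            i++;
        }
        return alloc;
    }

    public static void main(String[] args) {
        Map<String, Integer> minShares = new LinkedHashMap<>();
        minShares.put("research", 10);   // pool names are made up
        minShares.put("production", 30);
        System.out.println(allocate(minShares, 100));
        // {research=40, production=60}
    }
}
```

The real scheduler additionally caps each pool at its actual demand and redistributes unused capacity, but the guarantee-then-share structure is the same.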