• Big Data Overview
• Characteristics
• Applications & Use Case
Hadoop Distributed File System (HDFS) Overview
HDFS Architecture
Data replication
Node types
Jobtracker / Tasktracker
HDFS Data Flows
HDFS Limitations
Hadoop Overview
Inputs & Outputs
Data Types
What is MapReduce (MR)
Example
Functionalities of MR
Speculative Execution
Hadoop Streaming
Hadoop Job Scheduling
Data Science Analytics & Research Centre
Big Data Overview
Characteristics
Applications & Use Case
Data Footprint & Time Horizon
Technology Adoption Lifecycle
9/20/2014
[Figure: Data Footprint & Time Horizon. Consumption layers (highly summarized visualization & dashboards; aggregated analytic marts & cubes; detailed events/facts for predictive analytics) are mapped against time horizons from real time through hourly, daily, weekly, monthly, quarterly, and yearly out to 3, 5, and 10 years. Sources range from core ERP & legacy applications & data warehouse to unstructured web/telemetry Big Data (Hadoop etc.). The data footprint grows from GB in real time, through TB at daily/monthly horizons, to PB at yearly horizons.]
Financial Services
• Detect fraud
• Model and manage risk
• Improve debt recovery rates
• Personalize banking/insurance products

Healthcare
• Optimal treatment pathways
• Remote patient monitoring
• Predictive modeling for new drugs
• Personalized medicine

Retail
• In-store behavior analysis
• Cross selling
• Optimize pricing, placement, design
• Optimize inventory and distribution
Web / Social / Mobile
• Location-based marketing
• Social segmentation
• Sentiment analysis
• Price comparison services

Government
• Reduce fraud
• Segment populations, customize action
• Support open data initiatives
• Automate decision making

Manufacturing
• Design to value
• Crowd-sourcing
• “Digital factory” for lean manufacturing
• Improve service via product sensor data
Hadoop Distributed File System (HDFS)
Overview
HDFS Architecture
Data replication
Node types
Jobtracker / Tasktracker
HDFS Data Flows
HDFS Limitations
Hadoop's own implementation of a distributed file system.
It is coherent and provides all the facilities of a file system.
Implements ACLs and provides a subset of the usual UNIX commands for accessing or querying the filesystem.
It has a large block size (default 64 MB; 128 MB recommended) so that seek time stays small relative to transfer time at network bandwidth. Very large files are therefore ideal for storage.
Streaming data access: a write-once, read-many-times architecture. Since files are large, the time to read the whole file is a more significant parameter than the seek to the first record.
Commodity hardware: it is designed to run on commodity hardware, which may fail; HDFS is capable of handling such failures.
E.g. with a 128 MB block size, a 420 MB file is split into blocks of 128 MB, 128 MB, 128 MB, and 36 MB.
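The split arithmetic above can be sketched in a few lines of Java (a toy illustration, not HDFS code; the 128 MB block size is the recommended value from the slide):

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplit {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, the recommended size

    // Return the sizes of the blocks a file of fileSize bytes occupies:
    // full blocks, plus one final partial block for the remainder.
    static List<Long> splitIntoBlocks(long fileSize, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        for (long remaining = fileSize; remaining > 0; remaining -= blockSize) {
            blocks.add(Math.min(blockSize, remaining));
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        for (long b : splitIntoBlocks(420 * mb, BLOCK_SIZE)) {
            System.out.println(b / mb + " MB");
        }
        // Prints: 128 MB, 128 MB, 128 MB, 36 MB
    }
}
```

Note that the final partial block only occupies 36 MB on disk; HDFS does not pad it out to a full block.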
[Figure: HDFS data replication. A client issues Create and Complete calls for File 1 to the Namenode; the file's blocks B1, B2, and B3 are each replicated across datanodes n1 to n4 spread over Rack 1, Rack 2, and Rack 3, with the Namenode tracking the block locations.]
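The rack-aware placement shown in the figure can be illustrated with a small Java sketch of the default 3-replica policy (a simplified model, not the actual Namenode code; node and rack names are made up): the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that second rack.

```java
import java.util.*;

public class ReplicaPlacement {
    // Simplified sketch of the default HDFS placement policy for 3 replicas.
    static List<String> placeReplicas(String writer, Map<String, String> nodeToRack) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writer); // 1st replica: the writer's own node
        String writerRack = nodeToRack.get(writer);
        // 2nd replica: first node found in a different rack
        String second = nodeToRack.keySet().stream()
                .filter(n -> !nodeToRack.get(n).equals(writerRack))
                .findFirst().orElseThrow(() -> new IllegalStateException("no off-rack node"));
        replicas.add(second);
        // 3rd replica: a different node in the same rack as the 2nd
        String secondRack = nodeToRack.get(second);
        String third = nodeToRack.keySet().stream()
                .filter(n -> nodeToRack.get(n).equals(secondRack) && !n.equals(second))
                .findFirst().orElseThrow(() -> new IllegalStateException("no peer node"));
        replicas.add(third);
        return replicas;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("n1", "rack1"); cluster.put("n2", "rack1");
        cluster.put("n3", "rack2"); cluster.put("n4", "rack2");
        System.out.println(placeReplicas("n1", cluster)); // [n1, n3, n4]
    }
}
```

This layout survives the loss of any single node or any single rack while keeping two of the three replicas within one rack to limit cross-rack write traffic.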
• HDFS Flow – Read
• HDFS Flow – Write
Command: Usage and Syntax

cat: Copies source paths to stdout.
    hadoop dfs -cat URI [URI …]

chgrp: Change group association of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chgrp [-R] GROUP URI [URI …]

chmod: Change the permissions of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]

chown: Change the owner of files. With -R, make the change recursively through the directory structure.
    hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]

copyFromLocal: Similar to the put command, except that the source is restricted to a local file reference.
    hadoop dfs -copyFromLocal <localsrc> URI

copyToLocal: Similar to the get command, except that the destination is restricted to a local file reference.
    hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

cp: Copy files from source to destination.
    hadoop dfs -cp URI [URI …] <dest>

du: Displays the aggregate length of files contained in the directory, or the length of a file in case it's just a file.

dus: Displays a summary of file lengths.

expunge: Empty the Trash.

get: Copy files to the local file system.

getmerge: Concatenates files in source into the destination local file.

ls (or) lsr: For a file, returns stat on the file; for a directory, returns the list of its direct children.

mv: Moves files from source to destination.
    hadoop dfs -mv URI [URI …] <dest>

put: Copy single src, or multiple srcs, from the local file system to the destination filesystem.
    hadoop dfs -put <localsrc> ... <dst>

rm (or) rmr: Delete files specified as args.
    hadoop dfs -rm URI [URI …]

setrep: Changes the replication factor of a file. The -R option is for recursively increasing the replication factor of files within a directory.
    hadoop dfs -setrep [-R] <path>

stat: Returns the stat information on the path.
    hadoop dfs -stat URI [URI …]

tail: Displays the last kilobyte of the file to stdout.
    hadoop dfs -tail [-f] URI

test: Options: -e if the file exists, -z if the file is zero length, -d if the path is a directory.
    hadoop dfs -test -[ezd] URI

text: Takes a source file and outputs the file in text format.
    hadoop dfs -text <src>

touchz: Create a file of zero length.
    hadoop dfs -touchz URI [URI …]
Low-latency data access: HDFS is not optimized for low-latency data access; it trades latency to increase the throughput of the data.
Lots of small files: since the block size is 64 MB, lots of small files waste blocks and increase the memory requirements of the namenode, which holds metadata for every block.
Multiple writers and arbitrary modification: there is no support for multiple writers in HDFS; files are written to by a single writer, with writes always made at the end of the file.
Hadoop Overview
Inputs & Outputs
Data Types
What is MR
Example
Functionalities of MR
Speculative Execution
How Hadoop runs MR
Hadoop Streaming
Hadoop Job Scheduling
Hadoop is a framework which provides open-source libraries for distributed computing using a simple MapReduce interface and its own distributed filesystem, HDFS. It facilitates scalability and takes care of detecting and handling failures.
• 1.0.X - current stable version, 1.0 release
• 1.1.X - current beta version, 1.1 release
• 2.X.X - current alpha version
• 0.23.X - similar to 2.X.X but missing NN HA
• 0.22.X - does not include security
• 0.20.203.X - old legacy stable version
• 0.20.X - old legacy version
• Risk Modeling:
– How business/industry can better understand risk.
• Recommendation Engine:
– How to predict customer preferences.
• Ad Targeting:
– How to increase campaign efficiency.
• Point of Sale Transaction Analysis:
– Targeting promotions to make customers buy.
• Predicting Network Failure:
– Using machine-generated data to identify trouble spots.
• Threat Analysis:
– Detecting threats and fraudulent activity.
• Trade Surveillance:
– Helping businesses spot the rogue trader.
• Search Quality:
– Delivering more relevant search results to customers.
The framework was introduced by Google.
It processes vast amounts of data (multi-terabyte data-sets) in parallel.
It achieves high performance on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
It splits the input data-set into independent chunks.
It sorts the outputs of the maps, which are then input to the reduce tasks.
It takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of
<key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, List(v2)> -> reduce -> <k3, v3> (output)
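This <key, value> flow can be simulated outside Hadoop with a short Java sketch of map, shuffle/sort, and reduce, using word count as the example (the method names here are illustrative, not Hadoop APIs):

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: <k1, v1> = (line) -> <k2, v2> = (word, 1) per token
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                out.add(Map.entry(word, 1));
        return out;
    }

    // Shuffle/sort: group values by key, sorted -> <k2, List(v2)>
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairs)
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        return groups;
    }

    // Reduce: sum the values for each key -> <k3, v3>
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of("Hello World Bye World", "Hello Hadoop Goodbye Hadoop");
        System.out.println(reduce(shuffle(map(input))));
        // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

In real Hadoop the shuffle happens across the network between map and reduce tasks; here it is just an in-memory grouping, but the type signatures mirror the <k1, v1> -> <k2, v2> -> <k2, List(v2)> -> <k3, v3> pipeline above.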
• Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.
• Hadoop has the Writable interface supporting serialization.
• The following predefined implementations are available for WritableComparable:
1. IntWritable
2. LongWritable
3. DoubleWritable
4. VLongWritable - variable size, stores as much as needed (1-9 bytes of storage)
5. VIntWritable - less used, as it is largely covered by VLongWritable
6. BooleanWritable
7. FloatWritable
Apart from the above, there are four Writable collection types:
1. ArrayWritable
2. TwoDArrayWritable
3. MapWritable
4. SortedMapWritable
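The Writable pattern can be illustrated without Hadoop on the classpath. Below is a minimal sketch of an IntWritable-like value type (the IntValue class is hypothetical, not the Hadoop class) that serializes itself to and from a byte stream via write/readFields, the same two-method contract the Writable interface defines:

```java
import java.io.*;

public class WritableSketch {
    // Minimal Writable-style value type: fixed 4-byte serialization of an int.
    static class IntValue {
        int value;
        IntValue(int value) { this.value = value; }
        void write(DataOutput out) throws IOException { out.writeInt(value); }
        void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    public static void main(String[] args) throws IOException {
        // Serialize to a byte stream (as Hadoop does for network/disk I/O)...
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new IntValue(42).write(new DataOutputStream(buf));

        // ...and deserialize back from the bytes.
        IntValue restored = new IntValue(0);
        restored.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(restored.value); // 42
    }
}
```

The variable-length types (VIntWritable, VLongWritable) differ only in that they emit between 1 and 9 bytes depending on the magnitude of the value, instead of a fixed 4 or 8.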
[Figure: MapReduce job structure. Input Data passes through an Input Data Format to produce <K1, V1> pairs; the MapperClass's Mapper emits <K2, V2>; after grouping into <K2, List(V2)>, the ReducerClass's Reducer emits <K3, V3>.]
// Mapper fields (implied by the slide's code):
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // Tokenize the line and emit (word, 1) for each token
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}

public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // Sum all counts for this word
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
}
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop

Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

Mapper implementation. The first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

Combiner implementation. Output of the first map after combining:
< Bye, 1>
< Hello, 1>
< World, 2>
Output of the second map after combining:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
• A way of coping with individual machine performance
• The same input can be processed multiple times in parallel, to exploit differences in machine capabilities
• The Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform

Name: mapred.map.tasks.speculative.execution
Value: true
Description: If true, then multiple instances of some map tasks may be executed in parallel.

Name: mapred.reduce.tasks.speculative.execution
Value: true
Description: If true, then multiple instances of some reduce tasks may be executed in parallel.
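For reference, these two properties would be set in mapred-site.xml like this (a sketch assuming the classic Hadoop 1.x configuration file format):

```xml
<configuration>
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
  </property>
</configuration>
```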
Utility that comes with the Hadoop distribution.
Allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
  -reducer /bin/wc \
  -jobconf mapred.reduce.tasks=2
Default Scheduler
• Single priority-based queue of jobs.
• Scheduling tries to balance map and reduce load on all tasktrackers in the cluster.

Capacity Scheduler
• Within a queue, jobs with higher priority will have access to the queue's resources before jobs with lower priority.
• In order to prevent one or more users from monopolizing its resources, each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is competition for them.

Fair Scheduler
• Multiple queues (pools) of jobs, sorted FIFO or by fairness limits.
• Each pool is guaranteed a minimum capacity, and excess is shared by all jobs using a fairness algorithm.
• The scheduler tries to ensure that, over time, all jobs receive an equal share of resources.
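The Fair Scheduler's "guaranteed minimum plus shared excess" idea can be sketched in Java. This is a conceptual toy, not Hadoop's actual algorithm (it ignores demand, weights, and preemption): each pool first receives its minimum capacity, and the remaining task slots are split evenly among the pools.

```java
import java.util.*;

public class FairShareSketch {
    // Allocate totalSlots across pools: each pool gets its guaranteed
    // minimum, then the excess is divided evenly among all pools.
    static Map<String, Integer> allocate(Map<String, Integer> minShares, int totalSlots) {
        Map<String, Integer> alloc = new LinkedHashMap<>(minShares);
        int used = minShares.values().stream().mapToInt(Integer::intValue).sum();
        int excess = totalSlots - used;
        int pools = minShares.size();
        int i = 0;
        for (String pool : minShares.keySet()) {
            // Spread the excess as evenly as integer arithmetic allows
            int extra = excess / pools + (i < excess % pools ? 1 : 0);
            alloc.merge(pool, extra, Integer::sum);
            i++;
        }
        return alloc;
    }

    public static void main(String[] args) {
        Map<String, Integer> minShares = new LinkedHashMap<>();
        minShares.put("research", 10);   // pool names are made up
        minShares.put("production", 30);
        System.out.println(allocate(minShares, 100));
        // {research=40, production=60}
    }
}
```

The real scheduler additionally caps each pool at its actual demand and redistributes unused capacity, but the guarantee-then-share structure is the same.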