What Is Big Data
In order to understand 'Big Data', we first need to know what 'data' is. The Oxford dictionary defines 'data' as "The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media."
'Big Data', then, is also data, but of huge size: the term describes a collection of data that is enormous in volume and yet keeps growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.
Examples Of 'Big Data'
Following are some examples of 'Big Data':
• Stock Exchange
The New York Stock Exchange generates about one terabyte of new trade data per day.
• Social Media
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
• Jet Engines
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

Categories Of 'Big Data'
'Big Data' can be found in three forms:
1. Structured
2. Un-structured
3. Semi-structured
Structured
Any data that can be stored, accessed and processed in the form of a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
Do you know? 10^21 bytes equal 1 zettabyte; in other words, one billion terabytes form a zettabyte.
Looking at these figures, one can easily understand why the name 'Big Data' is given and imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of 'structured' data.
Examples Of Structured Data
An 'Employee' table in a database is an example of Structured Data
Employee_ID   Employee_Name     Gender   Department   Salary_In_lacs
2365          Rajesh Kulkarni   Male     Finance      650000
3398          Pratibha Joshi    Female   Admin        650000
7465          Shushil Roy       Male     Admin        500000
7500          Shubhojit Das     Male     Finance      500000
7699          Priya Sane        Female   Finance      550000

Un-structured
Any data with unknown form or structure is classified as unstructured data. In addition to the sheer size, un-structured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available to them, but unfortunately they don't know how to derive value from it, since this data is in its raw, unstructured form.
Examples Of Un-structured Data
Output returned by 'Google Search'


Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Examples Of Semi-structured Data
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Data Growth over years


Please note that web application data, which is unstructured, consists of log files, transaction history files, etc. OLTP systems are built to work with structured data, wherein data is stored in relations (tables).
Characteristics Of 'Big Data'
(i) Volume – The name 'Big Data' itself is related to a size which is enormous. The size of data plays a very crucial role in determining its value. Also, whether a particular data set can actually be considered Big Data or not depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with 'Big Data'.
(ii) Variety – The next aspect of 'Big Data' is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analysing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demands determines the real potential in the data.
Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus hampering the process of handling and managing the data effectively.
Advantages Of Big Data Processing
The ability to process 'Big Data' brings multiple benefits, such as:
• Businesses can utilize outside intelligence while taking decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.
• Improved customer service
Traditional customer feedback systems are getting replaced by new systems designed with 'Big Data' technologies. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.
• Early identification of risk to the product/services, if any
• Better operational efficiency
'Big Data' technologies can be used for creating a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of 'Big Data' technologies and a data warehouse helps an organization offload infrequently accessed data.
From <http://www.guru99.com/what-is-big-data.html>
Learn Hadoop In 10 Minutes
Apache HADOOP is a framework used to develop data processing applications which are executed in a distributed computing
environment.


Similar to data residing in the local file system of a personal computer, in Hadoop data resides in a distributed file system called the Hadoop Distributed File System (HDFS).
The processing model is based on the 'Data Locality' concept, wherein computational logic is sent to the cluster nodes (servers) containing the data.
This computational logic is nothing but a compiled version of a program written in a high-level language such as Java. Such a program processes data stored in Hadoop HDFS.
HADOOP is an open source software framework. Applications built using HADOOP are run on large data sets distributed across clusters
of commodity computers.
Commodity computers are cheap and widely available. These are mainly useful for achieving greater computational power at low cost.
Do you know? A computer cluster consists of a set of multiple processing units (storage disk + processor) which are connected to each other and act as a single system.
Components of Hadoop
The below diagram shows the various components in the Hadoop ecosystem.

Apache Hadoop consists of two sub-projects:
1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications which are run on Hadoop. These MapReduce programs are capable of processing enormous data in parallel on large clusters of computation nodes.
2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas of data blocks and distributes them on compute nodes in a cluster. This distribution enables reliable and extremely rapid computations.
Although Hadoop is best known for MapReduce and its distributed file system, HDFS, the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume and ZooKeeper.
Features Of 'Hadoop'
• Suitable for Big Data Analysis
As Big Data tends to be distributed and unstructured in nature, HADOOP clusters are best suited for analysis of Big Data. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept is called data locality, and it helps increase the efficiency of Hadoop-based applications.
• Scalability


HADOOP clusters can easily be scaled to any extent by adding additional cluster nodes, thus allowing for the growth of Big Data. Also, scaling does not require modifications to application logic.
• Fault Tolerance
HADOOP ecosystem has a provision to replicate the input data on to other cluster nodes. That way, in the event of a cluster node failure,
data processing can still proceed by using data stored on another cluster node.
Network Topology In Hadoop
The topology (arrangement) of the network affects the performance of a Hadoop cluster as the cluster grows in size. In addition to performance, one also needs to care about high availability and handling of failures. In order to achieve this, Hadoop cluster formation makes use of network topology.

Typically, network bandwidth is an important factor to consider while forming any network. However, as measuring bandwidth can be difficult, in Hadoop the network is represented as a tree, and the distance between nodes of this tree (the number of hops) is considered an important factor in the formation of a Hadoop cluster. Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor (illustrated by the sketch after the list below).
A Hadoop cluster consists of data centers, racks, and the nodes which actually execute jobs; a data center consists of racks and a rack consists of nodes. The network bandwidth available to processes varies depending upon the location of the processes. That is, the available bandwidth becomes lower as we go from:
• Processes on the same node, to
• Different nodes on the same rack, to
• Nodes on different racks of the same data center, to
• Nodes in different data centers
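As a toy illustration of this distance rule (this is not Hadoop's own NetworkTopology class, just a sketch using made-up node addresses of the form /datacenter/rack/node):

public class TopologyDistance {

    // Distance = hops from each node up to their closest common ancestor
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/rack1/node1", "/d1/rack1/node1")); // 0: same node
        System.out.println(distance("/d1/rack1/node1", "/d1/rack1/node2")); // 2: same rack
        System.out.println(distance("/d1/rack1/node1", "/d1/rack2/node3")); // 4: same data center
        System.out.println(distance("/d1/rack1/node1", "/d2/rack5/node7")); // 6: different data centers
    }
}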
From <http://www.guru99.com/learn-hadoop-in-10-minutes.html>
How To Install Hadoop
Prerequisites:
You must have Ubuntu installed and running
You must have Java Installed.
Step 1) Add a Hadoop system user using below command
sudo addgroup hadoop_


sudo adduser --ingroup hadoop_ hduser_

Enter your password, name and other details.
NOTE:
There is a possibility of the below-mentioned error in this setup and installation process.
"hduser is not in the sudoers file. This incident will be reported."

This error can be resolved as follows-
Log in as the root user

Execute the command
sudo adduser hduser_ sudo

Re-login as hduser_

Step 2) Configure SSH
In order to manage nodes in a cluster, Hadoop requires SSH access.
First, switch user by entering the following command
su - hduser_

This command will create a new key.
ssh-keygen -t rsa -P ""

Enable SSH access to local machine using this key.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now test the SSH setup by connecting to localhost as the 'hduser' user.
ssh localhost


Note:
Please note, if you see below error in response to 'ssh localhost', then there is a possibility that SSH is not available on this system-

To resolve this, purge SSH using-
sudo apt-get purge openssh-server
It is good practice to purge SSH before the start of installation

Install SSH using the command-
sudo apt-get install openssh-server

Step 3) The next step is to download Hadoop
Go to http://www.apache.org/dyn/closer.cgi/hadoop/core


Select Stable


Select the tar.gz file (not the file with src)

Once the download is complete, navigate to the directory containing the tar file
Enter: sudo tar xzf hadoop-2.2.0.tar.gz
Now, rename hadoop-2.2.0 as hadoop
sudo mv hadoop-2.2.0 hadoop

sudo chown -R hduser_:hadoop_ hadoop

Step 4) Modify ~/.bashrc file
Add the following lines to the end of file ~/.bashrc

#Set HADOOP_HOME
export HADOOP_HOME=<Installation Directory of Hadoop>
#Set JAVA_HOME
export JAVA_HOME=<Installation Directory of Java>
# Add bin/ directory of Hadoop to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Now, source this environment configuration using below command
. ~/.bashrc

Step 5) Configurations related to HDFS
Set JAVA_HOME inside the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh, replacing its existing JAVA_HOME export with the installation directory of Java on your system.

There are two parameters in $HADOOP_HOME/etc/hadoop/core-site.xml which need to be set:
1. 'hadoop.tmp.dir' - Used to specify a directory which will be used by Hadoop to store its data files.
2. 'fs.defaultFS' - This specifies the default file system.
To set these parameters, open core-site.xml
sudo gedit $HADOOP_HOME/etc/hadoop/core-site.xml
Copy the below lines in between the tags <configuration></configuration>

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>Parent directory for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system.</description>
</property>

Navigate to the directory $HADOOP_HOME/etc/hadoop

Now, create the directory mentioned in core-site.xml
sudo mkdir -p <Path of Directory used in above setting>

Grant permissions to the directory
sudo chown -R hduser_:hadoop_ <Path of Directory created in above step>

sudo chmod 750 <Path of Directory created in above step>

Step 6) Map Reduce Configuration
Before you begin with these configurations, let's set the HADOOP_HOME path
sudo gedit /etc/profile.d/hadoop.sh
And Enter
export HADOOP_HOME=/home/guru99/Downloads/Hadoop


Next enter
sudo chmod +x /etc/profile.d/hadoop.sh

Exit the terminal and restart it again.
Type echo $HADOOP_HOME to verify the path

Now copy files
sudo cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml

Open the mapred-site.xml file
sudo gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add below lines of setting in between tags <configuration> and </configuration>

<property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:54311</value>
<description>MapReduce job tracker runs at this host and port.</description>
</property>


Open $HADOOP_HOME/etc/hadoop/hdfs-site.xml as below,
sudo gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add below lines of setting between tags <configuration> and </configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hduser_/hdfs</value>
</property>


Create the directory specified in the above setting-
sudo mkdir -p <Path of Directory used in above setting>
sudo mkdir -p /home/hduser_/hdfs

sudo chown -R hduser_:hadoop_ <Path of Directory created in above step>
sudo chown -R hduser_:hadoop_ /home/hduser_/hdfs
sudo chmod 750 <Path of Directory created in above step>
sudo chmod 750 /home/hduser_/hdfs
Step 7) Before we start Hadoop for the first time, format HDFS using below command
$HADOOP_HOME/bin/hdfs namenode -format


Step 8) Start Hadoop single node cluster using below command
$HADOOP_HOME/sbin/start-dfs.sh
Output of above command

$HADOOP_HOME/sbin/start-yarn.sh

Using 'jps' tool/command, verify whether all the Hadoop related processes are running or not.


If Hadoop has started successfully then output of jps should show NameNode, NodeManager, ResourceManager, SecondaryNameNode,
DataNode.
Step 9) Stopping Hadoop
$HADOOP_HOME/sbin/stop-dfs.sh

$HADOOP_HOME/sbin/stop-yarn.sh

From <http://www.guru99.com/how-to-install-hadoop.html>





Learn HDFS: A Beginner's Guide
Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System). HADOOP-based applications make use of HDFS. HDFS is designed for storing very large data files, running on clusters of commodity hardware. It is fault tolerant, scalable, and extremely simple to expand.
Do you know? When data exceeds the capacity of storage on a single physical machine, it becomes essential to divide it across a number of separate machines. A file system that manages storage-specific operations across a network of machines is called a distributed file system.
An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.
NameNode: The NameNode can be considered the master of the system. It maintains the file system tree and the metadata for all the files and directories present in the system. Two files, the 'namespace image' and the 'edit log', are used to store metadata information. The NameNode has knowledge of all the DataNodes containing data blocks for a given file; however, it does not store block locations persistently. This information is reconstructed from the DataNodes every time the system starts.
DataNode: DataNodes are slaves which reside on each machine in a cluster and provide the actual storage. They are responsible for serving read and write requests from the clients.
Read/write operations in HDFS operate at a block level. Data files in HDFS are broken into block-sized chunks, which are stored as independent units. The default block size is 64 MB.
HDFS operates on a concept of data replication wherein multiple replicas of data blocks are created and are distributed on nodes
throughout a cluster to enable high availability of data in the event of node failure.
Do you know? A file in HDFS, which is smaller than a single block, does not occupy a block's full storage.
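As a small illustration of blocks and replicas (not part of the original tutorial), the public FileSystem API can report where each block of a file is stored; the sketch below assumes a Hadoop configuration on the classpath and a file path passed as the first command-line argument:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from the Hadoop configuration on the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // One BlockLocation per block; getHosts() lists the DataNodes holding its replicas
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", replicas on " + Arrays.toString(block.getHosts()));
        }
    }
}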
Read Operation In HDFS
A data read request is served by HDFS, the NameNode and the DataNodes. Let's call the reader a 'client'. The below diagram depicts the file read operation in Hadoop.

1. The client initiates the read request by calling the 'open()' method of the FileSystem object; this is an object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as the locations of the blocks of the file. Please note that these addresses are of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains DFSInputStream, which takes care of interactions with the DataNodes and NameNode. In step 4 shown in the above diagram, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams wherein the client invokes the 'read()' method repeatedly. This read() operation continues till it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client is done with the reading, it calls the close() method.
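A compact sketch of this client-side read path, using the FileSystem API directly rather than java.net.URL (the URL-based approach appears later in the 'Access HDFS using JAVA API' section); the HDFS path is assumed to be passed as the first command-line argument:

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);               // DistributedFileSystem when fs.defaultFS points to HDFS
        InputStream in = null;
        try {
            in = fs.open(new Path(args[0]));                // open() returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false); // repeated read() calls, block by block
        } finally {
            IOUtils.closeStream(in);                        // the client closes the stream when done
        }
    }
}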
Write Operation In HDFS
In this section, we will understand how data is written into HDFS through files.


1. The client initiates the write operation by calling the 'create()' method of the DistributedFileSystem object, which creates a new file - step no. 1 in the above diagram.
2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates new file creation. However, this file create operation does not associate any blocks with the file. It is the responsibility of the NameNode to verify that the file (which is being created) does not already exist and that the client has correct permissions to create a new file. If the file already exists or the client does not have sufficient permission to create a new file, then an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
3. Once the new record in the NameNode is created, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS. The data write method is invoked (step 3 in the diagram).
4. FSDataOutputStream contains a DFSOutputStream object which looks after communication with the DataNodes and NameNode. While the client continues writing data, DFSOutputStream continues creating packets with this data. These packets are enqueued into a queue which is called the DataQueue.
5. There is one more component, called DataStreamer, which consumes this DataQueue. DataStreamer also asks the NameNode for allocation of new blocks, thereby picking desirable DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using DataNodes. In our case, we have chosen a replication level of 3 and hence there are 3 DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in the pipeline stores the packets received by it and forwards the same to the second DataNode in the pipeline.
9. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets which are waiting for acknowledgement from the DataNodes.
10. Once acknowledgement for a packet in the queue is received from all DataNodes in the pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode failure, packets from this queue are used to reinitiate the operation.
11. After the client is done with writing data, it calls the close() method (step 9 in the diagram). The call to close() results in flushing the remaining data packets to the pipeline, followed by waiting for acknowledgement.
12. Once the final acknowledgement is received, the NameNode is contacted to tell it that the file write operation is complete.
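From the client's point of view, all of the pipeline machinery above sits behind the stream returned by create(). A minimal sketch of the triggering code (the path and the written text are just placeholders, not part of the tutorial):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode to record the new file (steps 1-2 above)
        FSDataOutputStream out = fs.create(new Path(args[0]));
        // Data is packetized and pushed down the DataNode pipeline behind the scenes
        out.write("Hello HDFS\n".getBytes("UTF-8"));
        // close() flushes the remaining packets and waits for acknowledgements
        out.close();
    }
}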
Access HDFS using JAVA API
In this section, we try to understand the Java interface used for accessing Hadoop's file system.
In order to interact with Hadoop's filesystem programmatically, Hadoop provides multiple Java classes. The package named org.apache.hadoop.fs contains classes useful for manipulating a file in Hadoop's filesystem. These operations include open, read, write, and close. Actually, the file API for Hadoop is generic and can be extended to interact with filesystems other than HDFS.
Reading a file from HDFS, programmatically
A java.net.URL object is used for reading the contents of a file. To begin with, we need to make Java recognize Hadoop's hdfs URL scheme. This is done by calling the setURLStreamHandlerFactory method on the URL class, passing it an instance of FsUrlStreamHandlerFactory. This method needs to be executed only once per JVM, hence it is enclosed in a static block.
An example code is-
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // Register the hdfs:// URL scheme handler; allowed only once per JVM
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
This code opens and reads the contents of a file. The path of this file on HDFS is passed to the program as a command-line argument.
Access HDFS Using COMMAND-LINE INTERFACE
This is one of the simplest ways to interact with HDFS. The command-line interface has support for filesystem operations like reading files, creating directories, moving files, deleting data, and listing directories.
We can run '$HADOOP_HOME/bin/hdfs dfs -help' to get detailed help on every command. Here, 'dfs' is a shell command of HDFS which supports multiple subcommands.
Some of the widely used commands are listed below along with some details of each one.
1. Copy a file from local filesystem to HDFS
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /

This command copies file temp.txt from local filesystem to HDFS.
2. We can list files present in a directory using -ls
$HADOOP_HOME/bin/hdfs dfs -ls /

We can see a file 'temp.txt' (copied earlier) being listed under ' / ' directory.
3. Command to copy a file to local filesystem from HDFS
$HADOOP_HOME/bin/hdfs dfs -copyToLocal /temp.txt


We can see temp.txt copied to local filesystem.
4. Command to create new directory
$HADOOP_HOME/bin/hdfs dfs -mkdir /mydirectory

Check whether directory is created or not. Now, you should know how to do it ;-)
From <http://www.guru99.com/learn-hdfs-a-beginners-guide.html>
Create Your First Hadoop Program
Problem Statement:
Find out Number of Products Sold in Each Country.
Input: Our input data set is a CSV file, SalesJan2009.csv
Prerequisites:
• This tutorial is developed on the Linux - Ubuntu operating system.
• You should have Hadoop (version 2.2.0 used for this tutorial) already installed.
• You should have Java (version 1.8.0 used for this tutorial) already installed on the system.
Before we start with the actual process, change user to 'hduser' (the user used for Hadoop).
su - hduser_

Steps:
1. Create a new directory with name MapReduceTutorial
sudo mkdir MapReduceTutorial
Give permissions
sudo chmod -R 777 MapReduceTutorial
Copy the files SalesMapper.java, SalesCountryReducer.java and SalesCountryDriver.java into this directory.
Download Files Here
If you want to understand the code in these files refer this Guide

Check the file permissions of all these files, and if 'read' permissions are missing then grant the same.
2. Export the classpath
export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.2.0.jar:
$HADOOP_HOME/share/hadoop/common/hadoop-common-2.2.0.jar:~/MapReduceTutorial/SalesCountry/*:
$HADOOP_HOME/lib/*"

3. Compile the java files (these files are present in directory Final-MapReduceHandsOn). Their class files will be put in the package directory
javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java

This warning can be safely ignored.
This compilation will create a directory in a current directory named with package name specified in the java source file
(i.e. SalesCountry in our case) and put all compiled class files in it.

4. Create a new file Manifest.txt
sudo gedit Manifest.txt
Add the following line to it,
Main-Class: SalesCountry.SalesCountryDriver


SalesCountry.SalesCountryDriver is the name of the main class. Please note that you have to hit the enter key at the end of this line.
5. Create a Jar file
jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class
Check that the jar file is created

6. Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
7. Copy the File SalesJan2009.csv into ~/inputMapReduce
Now Use below command to copy ~/inputMapReduce to HDFS.
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/inputMapReduce /

We can safely ignore this warning.
Verify whether file is actually copied or not.
$HADOOP_HOME/bin/hdfs dfs -ls /inputMapReduce

8. Run MapReduce job
$HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar /inputMapReduce /mapreduce_output_sales

This will create an output directory named mapreduce_output_sales on HDFS. Contents of this directory will be a file containing product
sales per country.
9. Result can be seen through command interface as,
$HADOOP_HOME/bin/hdfs dfs -cat /mapreduce_output_sales/part-00000


Output of the above:
OR
Results can also be seen via the web interface. To see results through the web interface, open http://localhost:50070/ in a web browser.

Now select 'Browse the filesystem' and navigate up to /mapreduce_output_sales
Output of the above:


Open part-r-00000

From <http://www.guru99.com/create-your-first-hadoop-program.html>
Understanding MapReducer Code
This tutorial is divided into 3 sections
Explanation of SalesMapper Class
Explanation of SalesCountryReducer Class
Explanation of SalesCountryDriver Class
Explanation of SalesMapper Class
In this section, we will understand the implementation of the SalesMapper class.
1. We begin by specifying the name of the package for our class. SalesCountry is the name of our package. Please note that the output of compilation, SalesMapper.class, will go into a directory named by this package name: SalesCountry.

Followed by this, we import library packages.
Below snapshot shows implementation of SalesMapper class-
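The snapshot referred to here is an image and is not reproduced in this document. A sketch of the class, assembled from the fragments walked through below (an approximation, not necessarily the tutorial's exact file), might look like this:

package SalesCountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // Split the input line on ',' and emit (country, 1); country sits at index 7
        String valueString = value.toString();
        String[] SingleCountryData = valueString.split(",");
        output.collect(new Text(SingleCountryData[7]), one);
    }
}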

Code Explanation:
1. SalesMapper Class Definition
public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
Every mapper class must be extended from the MapReduceBase class and it must implement the Mapper interface.
2. Defining the 'map' function
public void map(LongWritable key,
                Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException
The main part of the Mapper class is the 'map()' method, which accepts four arguments.
At every call to the 'map()' method, a key-value pair ('key' and 'value' in this code) is passed.
The 'map()' method begins by splitting the input text which is received as an argument, breaking the line into individual fields using a delimiter.
String valueString = value.toString();
String[] SingleCountryData = valueString.split(",");
Here, ',' is used as a delimiter.
After this, a pair is formed using the record at the 7th index of the array 'SingleCountryData' and the value '1'.
output.collect(new Text(SingleCountryData[7]), one);
We choose the record at the 7th index because we need the Country data, and it is located at the 7th index in the array 'SingleCountryData'.
Please note that our input data is in the below format (where Country is at the 7th index, with 0 as the starting index)-
Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
The output of the mapper is again a key-value pair, which is output using the 'collect()' method of 'OutputCollector'.
Explanation of SalesCountryReducer Class
In this section, we will understand the implementation of the SalesCountryReducer class.
1. We begin by specifying the name of the package for our class. SalesCountry is the name of our package. Please note that the output of compilation, SalesCountryReducer.class, will go into a directory named by this package name: SalesCountry.
Followed by this, we import library packages.
Below snapshot shows implementation of SalesCountryReducer class-
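Again, the snapshot is an image and is not reproduced here. A sketch of the class, assembled from the fragments explained below (an approximation of the tutorial's file), might look like this:

package SalesCountry;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text t_key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        Text key = t_key;
        int frequencyForCountry = 0;
        // Sum all the 1s emitted by the mapper for this country
        while (values.hasNext()) {
            IntWritable value = (IntWritable) values.next();
            frequencyForCountry += value.get();
        }
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}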


Code Explanation:
1. SalesCountryReducer Class Definition
public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
Here, the first two data types, 'Text' and 'IntWritable', are the data types of the input key-value pair to the reducer.
The output of the mapper is in the form of <CountryName1, 1>, <CountryName2, 1>. This output of the mapper becomes the input to the reducer. So, to align with its data type, Text and IntWritable are used as data types here.
The last two data types, 'Text' and 'IntWritable', are the data types of the output generated by the reducer in the form of a key-value pair.
Every reducer class must be extended from the MapReduceBase class and it must implement the Reducer interface.
2. Defining the 'reduce' function
public void reduce(Text t_key,
                   Iterator<IntWritable> values,
                   OutputCollector<Text, IntWritable> output,
                   Reporter reporter) throws IOException {
Input to the reduce() method is a key with a list of multiple values.
For example, in our case it will be-
<United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>, <United Arab Emirates, 1>.
This is given to the reducer as <United Arab Emirates, {1,1,1,1,1,1}>
So, to accept arguments of this form, the first two data types are used, viz., Text and Iterator<IntWritable>. Text is the data type of the key and Iterator<IntWritable> is the data type for the list of values for that key.
The next argument is of type OutputCollector<Text, IntWritable>, which collects the output of the reducer phase.
The reduce() method begins by copying the key value and initializing the frequency count to 0.
Text key = t_key;
int frequencyForCountry = 0;
Then, using a 'while' loop, we iterate through the list of values associated with the key and calculate the final frequency by summing up all the values.
while (values.hasNext()) {
    // replace type of value with the actual type of our value
    IntWritable value = (IntWritable) values.next();
    frequencyForCountry += value.get();
}
Now, we push the result to the output collector in the form of the key and the obtained frequency count.
The below code does this-
output.collect(key, new IntWritable(frequencyForCountry));
Explanation of SalesCountryDriver Class
In this section, we will understand the implementation of the SalesCountryDriver class.
1. We begin by specifying the name of the package for our class. SalesCountry is the name of our package. Please note that the output of compilation, SalesCountryDriver.class, will go into a directory named by this package name: SalesCountry.
Here is a line specifying package name followed by code to import library packages.

2. Define a driver class which will create a new client job, configuration object and advertise Mapper and Reducer classes.
The driver class is responsible for setting our MapReduce job to run in Hadoop. In this class, we specify job name, data type of
input/output and names of mapper and reducer classes.


3. In the below code snippet, we set the input and output directories which are used to consume the input dataset and produce output, respectively.
args[0] and args[1] are the command-line arguments passed with the command given in the MapReduce hands-on, i.e.,
$HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar /inputMapReduce /mapreduce_output_sales

4. Trigger our job
The below code starts execution of the MapReduce job-
try {
    // Run the job
    JobClient.runJob(job_conf);
} catch (Exception e) {
    e.printStackTrace();
}
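The driver snapshots referred to above are likewise images. Pulling the fragments together, a sketch of the driver, assuming the old org.apache.hadoop.mapred API used elsewhere in this tutorial (and therefore an approximation, not the exact downloadable file), might look like:

package SalesCountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SalesCountryDriver {
    public static void main(String[] args) {
        JobConf job_conf = new JobConf(SalesCountryDriver.class);
        job_conf.setJobName("SalePerCountry");

        // Output key/value types produced by the reducer
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);

        // Advertise the mapper and reducer classes
        job_conf.setMapperClass(SalesMapper.class);
        job_conf.setReducerClass(SalesCountryReducer.class);

        // args[0] = input directory on HDFS, args[1] = output directory
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

        try {
            // Run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}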
From <http://www.guru99.com/understanding-map-reducer-code.html>
Introduction to Counters & Joins In MapReduce
A counter in MapReduce is a mechanism used for collecting statistical information about the MapReduce job. This information could be useful for diagnosing a problem in MapReduce job processing. Counters are similar to putting a log message in the code for a map or reduce task.
Typically, these counters are defined in a program (map or reduce) and are incremented during execution when a particular event or
condition (specific to that counter) occurs. A very good application of counters is to track valid and invalid records from an input dataset.
There are two types of counters:
1. Hadoop Built-In counters: There are some built-in counters which exist per job. Below are the built-in counter groups-
• MapReduce Task Counters - Collects task-specific information (e.g., number of input records) during its execution time.
• FileSystem Counters - Collects information like the number of bytes read or written by a task.
• FileInputFormat Counters - Collects information on the number of bytes read through FileInputFormat.
• FileOutputFormat Counters - Collects information on the number of bytes written through FileOutputFormat.
• Job Counters - These counters are used by the JobTracker. Statistics collected by them include, e.g., the number of tasks launched for a job.
2. User Defined Counters
In addition to built-in counters, a user can define his own counters using similar functionality provided by programming languages. For example, in Java, 'enum's are used to define user-defined counters.
An example MapClass with Counters to count the number of missing and invalid values:
public static class MapClass
            extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text>
{
    static enum SalesCounters { MISSING, INVALID };

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException
    {
        //Input string is split using ',' and stored in 'fields' array
        String fields[] = value.toString().split(",", -20);

        //Value at 4th index is country. It is stored in 'country' variable
        String country = fields[4];

        //Value at 8th index is sales data. It is stored in 'sales' variable
        String sales = fields[8];

        if (country.length() == 0) {
            reporter.incrCounter(SalesCounters.MISSING, 1);
        } else if (sales.startsWith("\"")) {
            reporter.incrCounter(SalesCounters.INVALID, 1);
        } else {
            output.collect(new Text(country), new Text(sales + ",1"));
        }
    }
}
The above code snippet shows an example implementation of counters in MapReduce.
Here, SalesCounters is a counter defined using an 'enum'. It is used to count MISSING and INVALID input records.
In the code snippet, if the 'country' field has zero length then its value is missing and hence the corresponding counter SalesCounters.MISSING is incremented.
Next, if the 'sales' field starts with a double quote (") then the record is considered INVALID. This is indicated by incrementing the counter SalesCounters.INVALID.
MapReduce Join
Joining two large datasets can be achieved using a MapReduce join. However, this process involves writing a lot of code to perform the actual join operation.
Joining of two datasets begins by comparing the size of each dataset. If one dataset is smaller compared to the other, then the smaller dataset is distributed to every data node in the cluster. Once it is distributed, either the Mapper or the Reducer uses the smaller dataset to perform a lookup for matching records from the larger dataset and then combines those records to form output records.
Depending upon the place where the actual join is performed, this join is classified into-
1. Map-side join - When the join is performed by the mapper, it is called a map-side join. In this type, the join is performed before data is actually consumed by the map function. It is mandatory that the input to each map is in the form of a partition and is in sorted order. Also, there must be an equal number of partitions and they must be sorted by the join key.
2. Reduce-side join - When the join is performed by the reducer, it is called a reduce-side join. There is no necessity in this join to have the dataset in a structured form (or partitioned).
Here, map-side processing emits the join key and corresponding tuples of both the tables. As an effect of this processing, all the tuples with the same join key fall into the same reducer, which then joins the records with the same join key.
The overall process flow is depicted in the below diagram.
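The downloadable join program itself is not reproduced in this document. Purely as an illustration of the reduce-side pattern just described, a minimal sketch in the old mapred API might look like the following; it assumes two CSV inputs whose first field is the join key, and it tags records by their input file name (the 'DeptName' prefix is a hypothetical choice, echoing the files used in the next section):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ReduceSideJoinSketch {

    // Map side: emit (joinKey, "fileName<TAB>restOfRecord") so records from both
    // datasets that share a key meet in the same reduce() call.
    public static class TaggingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Assumption: each record is "joinKey,restOfRecord"
            String fileName = ((FileSplit) reporter.getInputSplit()).getPath().getName();
            String[] fields = value.toString().split(",", 2);
            output.collect(new Text(fields[0]), new Text(fileName + "\t" + fields[1]));
        }
    }

    // Reduce side: separate the tuples by their tag and combine them into joined records.
    public static class JoinReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            List<String> left = new ArrayList<String>();
            List<String> right = new ArrayList<String>();
            while (values.hasNext()) {
                String[] tagged = values.next().toString().split("\t", 2);
                // Hypothetical rule: records from files named DeptName* form the left side
                if (tagged[0].startsWith("DeptName")) {
                    left.add(tagged[1]);
                } else {
                    right.add(tagged[1]);
                }
            }
            for (String l : left) {
                for (String r : right) {
                    output.collect(key, new Text(l + "," + r));
                }
            }
        }
    }
}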


From <http://www.guru99.com/introduction-to-counters-joins-in-map-reduce.html>
MapReduce Hadoop Program To Join Data
Problem Statement:
There are 2 Sets of Data in 2 Different Files


The Key Dept_ID is common in both files.
The goal is to use MapReduce Join to combine these files
Input: Our input data set consists of two txt files, DeptName.txt & DeptStrength.txt
Download Input Files From Here
Prerequisites:
• This tutorial is developed on the Linux - Ubuntu operating system.
• You should have Hadoop (version 2.2.0 used for this tutorial) already installed.
• You should have Java (version 1.8.0 used for this tutorial) already installed on the system.
Before we start with the actual process, change user to 'hduser' (the user used for Hadoop).
su - hduser_

Steps:
Step 1) Copy the zip file to location of your choice

Step 2) Uncompress the Zip File
sudo tar -xvf MapReduceJoin.tar.gz


Step 3)
Go to directory MapReduceJoin/
cd MapReduceJoin/

Step 4) Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh


Step 5) DeptStrength.txt and DeptName.txt are the input files used for this program.
These files need to be copied to HDFS using the below command-
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal DeptStrength.txt DeptName.txt /

Step 6) Run the program using the below command-
$HADOOP_HOME/bin/hadoop jar MapReduceJoin.jar /DeptStrength.txt /DeptName.txt /output_mapreducejoin


Step 7)
After execution, the output file (named 'part-00000') will be stored in the directory /output_mapreducejoin on HDFS.
Results can be seen using the command line interface
$HADOOP_HOME/bin/hdfs dfs -cat /output_mapreducejoin/part-00000

Results can also be seen via web interface as-


Now select 'Browse the filesystem' and navigate upto /output_mapreducejoin

Open part-r-00000


Results are shown


NOTE: Please note that before running this program again, you will need to delete the output directory /output_mapreducejoin
$HADOOP_HOME/bin/hdfs dfs -rm -r /output_mapreducejoin
An alternative is to use a different name for the output directory.
From <http://www.guru99.com/map-reduce-hadoop-program-to-join-data.html>
Introduction To Flume and Sqoop
Before we learn more about Flume and Sqoop, let's study the
Issues with Data Load into Hadoop
Analytical processing using Hadoop requires loading huge amounts of data from diverse sources into Hadoop clusters.
This process of bulk data load into Hadoop, from heterogeneous sources, and then processing it comes with a certain set of challenges. Maintaining and ensuring data consistency and ensuring efficient utilization of resources are some factors to consider before selecting the right approach for data load.
Major Issues:
1. Data load using scripts
The traditional approach of using scripts to load data is not suitable for bulk data load into Hadoop; this approach is inefficient and very time-consuming.
2. Direct access to external data via Map-Reduce applications
Providing direct access to the data residing at external systems (without loading it into Hadoop) for map-reduce applications complicates these applications. So, this approach is not feasible.


3. In addition to having the ability to work with enormous data, Hadoop can work with data in several different forms. So, to load such heterogeneous data into Hadoop, different tools have been developed. Sqoop and Flume are two such data loading tools.
Introduction to SQOOP
Apache Sqoop (SQL-to-Hadoop) is designed to support bulk import of data into HDFS from structured data stores such as relational
databases, enterprise data warehouses, and NoSQL systems. Sqoop is based upon a connector architecture which supports plugins to
provide connectivity to new external systems.
An example use case of Sqoop is an enterprise that runs a nightly Sqoop import to load the day's data from a production transactional RDBMS into a Hive data warehouse for further analysis.
Sqoop Connectors
All the existing database management systems are designed with the SQL standard in mind. However, each DBMS differs with respect to dialect to some extent. So, this difference poses challenges when it comes to data transfers across the systems. Sqoop connectors are components which help overcome these challenges.
Data transfer between Sqoop and external storage system is made possible with the help of Sqoop's connectors.
Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and
DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for connecting
to any database that supports Java's JDBC protocol. In addition, Sqoop provides optimized MySQL and PostgreSQL connectors that use
database-specific APIs to perform bulk transfers efficiently.

In addition to this, Sqoop has various third-party connectors for data stores, ranging from enterprise data warehouses (including Netezza, Teradata, and Oracle) to NoSQL stores (such as Couchbase). However, these connectors do not come with the Sqoop bundle; they need to be downloaded separately and can be added easily to an existing Sqoop installation.





Introduction to FLUME
Apache Flume is a system used for moving massive quantities of streaming data into HDFS. Collecting log data present in log files from
web servers and aggregating it in HDFS for analysis, is one common example use case of Flume.
Flume supports multiple sources like-
• 'tail' (which pipes data from a local file and writes it into HDFS via Flume, similar to the Unix command 'tail')
• System logs
• Apache log4j (enabling Java applications to write events to files in HDFS via Flume).
Data Flow in Flume


A Flume agent is a JVM process which has 3 components - Flume Source, Flume Channel and Flume Sink - through which events propagate after being initiated at an external source.
• In the above diagram, the events generated by the external source (WebServer) are consumed by the Flume Data Source. The external source sends events to the Flume source in a format that is recognized by the target source.
• The Flume Source receives an event and stores it into one or more channels. The channel acts as a store which keeps the event until it is consumed by the Flume sink. This channel may use the local file system in order to store these events.
• The Flume sink removes the event from the channel and stores it into an external repository, e.g., HDFS. There could be multiple Flume agents, in which case the Flume sink forwards the event to the Flume source of the next Flume agent in the flow.
Some Important features of FLUME
• Flume has a flexible design based upon streaming data flows. It is fault tolerant and robust, with multiple failover and recovery mechanisms. Flume offers different levels of reliability, which include 'best-effort delivery' and 'end-to-end delivery'. Best-effort delivery does not tolerate any Flume node failure, whereas 'end-to-end delivery' mode guarantees delivery even in the event of multiple node failures.
• Flume carries data between sources and sinks. This gathering of data can either be scheduled or event-driven. Flume has its own query processing engine which makes it easy to transform each new batch of data before it is moved to the intended sink.
• Possible Flume sinks include HDFS and HBase. Flume can also be used to transport event data including, but not limited to, network traffic data, data generated by social-media websites and email messages.
Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original release, also known as Flume OG (Original Generation).
Sqoop vs. Flume
• Sqoop is used for importing data from structured data sources such as RDBMS, whereas Flume is used for moving bulk streaming data into HDFS.
• Sqoop has a connector-based architecture: connectors know how to connect to the respective data source and fetch the data. Flume has an agent-based architecture: code is written (called an 'agent') which takes care of fetching the data.
• HDFS is a destination for data import using Sqoop, whereas with Flume, data flows to HDFS through zero or more channels.
• Sqoop data load is not event driven, whereas Flume data load can be driven by an event.
• In order to import data from structured data sources, one has to use Sqoop, because its connectors know how to interact with structured data sources and fetch data from them. In order to load streaming data, such as tweets generated on Twitter or log files of a web server, Flume should be used, because Flume agents are built for fetching streaming data.

From <http://www.guru99.com/introduction-to-flume-and-sqoop.html>
Create Your First FLUME Program
Prerequisites:
This tutorial is developed on Linux - Ubuntu operating System.
You should have Hadoop (version 2.2.0 used for this tutorial) already installed and is running on the system.
You should have Java(version 1.8.0 used for this tutorial) already installed on the system.
You should have set JAVA_HOME accordingly.

Before we start with the actual process, change user to 'hduser' (user used for Hadoop ).
su - hduser

Steps:
1. Flume, library and source code setup
Create a new directory with name 'FlumeTutorial'
sudo mkdir FlumeTutorial
Give read, write and execute permissions-
sudo chmod -R 777 FlumeTutorial
Copy the files MyTwitterSource.java and MyTwitterSourceForFlume.java into this directory.
Download Input Files From Here
Check the file permissions of all these files and if 'read' permissions are missing then grant the same-

2. Download 'Apache Flume' from site- https://flume.apache.org/download.html
Apache Flume 1.4.0 has been used in this tutorial.

Next Click


3. Copy the downloaded tarball in the directory of your choice and extract contents using the following command
sudo tar -xvf apache-flume-1.4.0-bin.tar.gz

This command will create a new directory named apache-flume-1.4.0-bin and extract files into it. This directory will be referred to
as <Installation Directory of Flume> in rest of the article.
4. Flume library setup
Copy twitter4j-core-4.0.1.jar, flume-ng-configuration-1.4.0.jar, flume-ng-core-1.4.0.jar, flume-ng-sdk-1.4.0.jar to
<Installation Directory of Flume>/lib/
It is possible that some or all of the copied JARs will have execute permission. This may cause issues with the compilation of code. So, revoke execute permission on such JARs.
In my case, twitter4j-core-4.0.1.jar had execute permission. I revoked it as below-
sudo chmod -x twitter4j-core-4.0.1.jar

After this command give 'read' permission on twitter4j-core-4.0.1.jar to all.
sudo chmod +rrr /usr/local/apache-flume-1.4.0-bin/lib/twitter4j-core-4.0.1.jar
Please note that I have downloaded twitter4j-core-4.0.1.jar from http://mvnrepository.com/artifact/org.twitter4j/twitter4j-core and all Flume JARs, i.e., flume-ng-*-1.4.0.jar, from http://mvnrepository.com/artifact/org.apache.flume
Load data from Twitter using Flume
1. Go to directory containing source code files in it.
2. Set CLASSPATH to contain <Flume Installation Dir>/lib/* and ~/FlumeTutorial/flume/mytwittersource/*
export CLASSPATH="/usr/local/apache-flume-1.4.0-bin/lib/*:~/FlumeTutorial/flume/mytwittersource/*"

3. Compile the source code using the command-
javac -d . MyTwitterSourceForFlume.java MyTwitterSource.java

4. Create the jar
First, create a Manifest.txt file using a text editor of your choice and add the below line in it-
Main-Class: flume.mytwittersource.MyTwitterSourceForFlume
Here flume.mytwittersource.MyTwitterSourceForFlume is the name of the main class. Please note that you have to hit the enter key at the end of this line.

Now, create the JAR 'MyTwitterSourceForFlume.jar' as-
jar cfm MyTwitterSourceForFlume.jar Manifest.txt flume/mytwittersource/*.class

5. Copy this jar to <Flume Installation Directory>/lib/
sudo cp MyTwitterSourceForFlume.jar <Flume Installation Directory>/lib/

6. Go to configuration directory of Flume, <Flume Installation Directory>/conf
If flume.conf does not exist, then copy flume-conf.properties.template and rename it to flume.conf
sudo cp flume-conf.properties.template flume.conf

If flume-env.sh does not exist, then copy flume-env.sh.template and rename it to flume-env.sh
sudo cp flume-env.sh.template flume-env.sh

7. Create a Twitter application by signing in to https://dev.twitter.com/user/login?destination=home

a. Go to 'My applications' (This option gets dropped down when 'Egg'
button at top right corner is clicked)


b. Create a new application by clicking 'Create New App'
c. Fill up application details by specifying name of application, description
and website. You may refer to the notes given underneath each input box.

d. Scroll down the page and accept terms by marking 'Yes, I agree' and click on button 'Create your Twitter application'


e. On the window of the newly created application, go to the tab 'API Keys', scroll down the page and click the button 'Create my access token'


f. Refresh the page.
g. Click on 'Test OAuth'. This will display 'OAuth' settings of application.

h. Modify 'flume.conf' (created in Step 6) using these OAuth settings. Steps to modify 'flume.conf' are given in step 8 below.


We need to copy the Consumer key, Consumer secret, Access token and Access token secret to update 'flume.conf'.
Note: These values belong to the user and hence are confidential, so they should not be shared.
8. Open 'flume.conf' in write mode and set values for below parameters-
[A]
sudo gedit flume.conf
Copy below contents-
MyTwitAgent.sources = Twitter
MyTwitAgent.channels = MemChannel
MyTwitAgent.sinks = HDFS
MyTwitAgent.sources.Twitter.type = flume.mytwittersource.MyTwitterSourceForFlume
MyTwitAgent.sources.Twitter.channels = MemChannel
MyTwitAgent.sources.Twitter.consumerKey = <Copy consumer key value from Twitter App>
MyTwitAgent.sources.Twitter.consumerSecret = <Copy consumer secret value from Twitter App>
MyTwitAgent.sources.Twitter.accessToken = <Copy access token value from Twitter App>
MyTwitAgent.sources.Twitter.accessTokenSecret = <Copy access token secret value from Twitter App>
MyTwitAgent.sources.Twitter.keywords = guru99
MyTwitAgent.sinks.HDFS.channel = MemChannel
MyTwitAgent.sinks.HDFS.type = hdfs
MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser/flume/tweets/
MyTwitAgent.sinks.HDFS.hdfs.fileType = DataStream
MyTwitAgent.sinks.HDFS.hdfs.writeFormat = Text
MyTwitAgent.sinks.HDFS.hdfs.batchSize = 1000
MyTwitAgent.sinks.HDFS.hdfs.rollSize = 0
MyTwitAgent.sinks.HDFS.hdfs.rollCount = 10000
MyTwitAgent.channels.MemChannel.type = memory
MyTwitAgent.channels.MemChannel.capacity = 10000
MyTwitAgent.channels.MemChannel.transactionCapacity = 1000

[B]
Also, set MyTwitAgent.sinks.HDFS.hdfs.path as below,
MyTwitAgent.sinks.HDFS.hdfs.path = hdfs://<Host Name>:<Port Number>/<HDFS Home Directory>/flume/tweets/


To know <Host Name>, <Port Number> and <HDFS Home Directory> , see value of parameter 'fs.defaultFS' set
in $HADOOP_HOME/etc/hadoop/core-site.xml

[C]
In order to flush the data to HDFS as and when it comes, delete the below entry if it exists,
MyTwitAgent.sinks.HDFS.hdfs.rollInterval = 600
9. Open 'flume-env.sh' in write mode and set values for below parameters,
JAVA_HOME=<Installation directory of Java>
FLUME_CLASSPATH="<Flume Installation Directory>/lib/MyTwitterSourceForFlume.jar"


10. Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
11. Two of the JAR files from the Flume tar ball are not compatible with Hadoop 2.2.0. So, we will need to follow below steps to make
Flume compatible with Hadoop 2.2.0.
a. Move protobuf-java-2.4.1.jar out of '<Flume Installation Directory>/lib'.
Go to '<Flume Installation Directory>/lib'
cd <Flume Installation Directory>/lib
sudo mv protobuf-java-2.4.1.jar ~/

b. Find the JAR file 'guava' as below
find . -name "guava*"

Move guava-10.0.1.jar out of '<Flume Installation Directory>/lib'.
sudo mv guava-10.0.1.jar ~/

c. Download guava-17.0.jar from http://mvnrepository.com/artifact/com.google.guava/guava/17.0


Now, copy this downloaded jar file to '<Flume Installation Directory>/lib'
12. Go to '<Flume Installation Directory>/bin' and start Flume as-
./flume-ng agent -n MyTwitAgent -c conf -f <Flume Installation Directory>/conf/flume.conf

Command prompt window where flume is fetching Tweets-

From the command window messages, we can see that the output is written to the /user/hduser/flume/tweets/ directory.
Now, open this directory using a web browser.
13. To see the result of the data load, open http://localhost:50070/ in a browser and browse the file system, then go to the directory where the data has been loaded, that is-


<HDFS Home Directory>/flume/tweets/

From <http://www.guru99.com/create-your-first-flume-program.html>
Introduction To Pig And Hive
In this tutorial we will discuss Pig & Hive
INTRODUCTION TO PIG
In Map Reduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming
model which data analysts are familiar with. So, in order to bridge this gap, an abstraction called Pig was built on top of Hadoop.
Pig is a high level programming language useful for analyzing large data sets. Pig was a result of development effort at Yahoo!
Pig enables people to focus more on analyzing bulk data sets and to spend less time in writing Map-Reduce programs.
Similar to Pigs, who eat anything, the Pig programming language is designed to work upon any kind of data. That's why the name, Pig!

Pig consists of two components:
1. Pig Latin, which is a language
2. A runtime environment, for running Pig Latin programs
A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These
operations describe a data flow which is translated into an executable representation by the Pig execution environment. Underneath, the results
of these transformations are a series of MapReduce jobs which a programmer is unaware of. So, in a way, Pig allows the programmer to focus
on the data rather than the nature of execution.
Pig Latin is a fairly rigid language which uses familiar keywords from data processing, e.g., Join, Group and Filter.

Execution modes:
Pig has two execution modes:
1. Local mode: In this mode, Pig runs in a single JVM and makes use of the local file system. This mode is suitable only for analysis
of small data sets using Pig.
2. MapReduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop
cluster (the cluster may be pseudo or fully distributed). MapReduce mode with a fully distributed cluster is useful for running Pig on large data
sets (see the commands after this list).
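A quick sketch of starting the Grunt shell in each mode:
pig -x local        # run Pig in local mode, using the local file system
pig -x mapreduce    # run Pig in MapReduce mode (the default), using HDFS and the Hadoop cluster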
INTRODUCTION TO HIVE
The size of data sets being collected and analyzed in the industry for business intelligence is growing, and in a way it is making traditional
data warehousing solutions more expensive. Hadoop, with its MapReduce framework, is being used as an alternative solution for analyzing
data sets of huge size. Though Hadoop has proved useful for working on huge data sets, its MapReduce framework is very low level
and it requires programmers to write custom programs which are hard to maintain and reuse. This is where Hive comes to the rescue of
programmers.
Hive evolved as a data warehousing solution built on top of the Hadoop Map-Reduce framework.
Hive provides an SQL-like declarative language, called HiveQL, which is used for expressing queries. Using HiveQL, users familiar with
SQL are able to perform data analysis very easily.
The Hive engine compiles these queries into Map-Reduce jobs to be executed on Hadoop. In addition, custom Map-Reduce scripts can also
be plugged into queries. Hive operates on data stored in tables which consist of primitive data types and collection data types like arrays
and maps.
Hive comes with a command-line shell interface which can be used to create tables and execute queries.
The Hive query language is similar to SQL and supports subqueries. With the Hive query language, it is possible to perform MapReduce joins
across Hive tables. It has support for simple SQL-like functions- CONCAT, SUBSTR, ROUND etc., and aggregation functions- SUM,
COUNT, MAX etc. It also supports GROUP BY and SORT BY clauses. It is also possible to write user-defined functions in the Hive query
language.
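As a small, hypothetical HiveQL sketch (the 'sales' table and its columns are made up for illustration and are not part of this tutorial), a table can be created and queried from the shell as below:
hive -e "CREATE TABLE IF NOT EXISTS sales (product STRING, country STRING, price DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
         SELECT country, COUNT(*) AS products_sold FROM sales GROUP BY country;"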
Comparing Sqoop and Flume

Sqoop: Used for importing data from structured data sources such as RDBMS.
Flume: Used for moving bulk streaming data into HDFS.

Sqoop: Has a connector based architecture. Connectors know how to connect to the respective data source and fetch the data.
Flume: Has an agent based architecture. Here, code is written (called an 'agent') which takes care of fetching data.

Sqoop: HDFS is a destination for data import using Sqoop.
Flume: Data flows to HDFS through zero or more channels.

Sqoop: Sqoop data load is not event driven.
Flume: Flume data load can be driven by an event.

Sqoop: In order to import data from structured data sources, one has to use Sqoop only, because its connectors know how to interact with structured data sources and fetch data from them.
Flume: In order to load streaming data such as tweets generated on Twitter or log files of a web server, Flume should be used. Flume agents are built for fetching streaming data.
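As a hedged illustration of a typical Sqoop import (the database name, table and credentials below are placeholders, not values from this tutorial):
sqoop import \
  --connect jdbc:mysql://localhost/salesdb \
  --username hduser -P \
  --table customers \
  --target-dir /user/hduser/customers \
  -m 1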

From <http://www.guru99.com/introduction-to-pig-and-hive.html>
Create your First PIG Program
Problem Statement:
Find out Number of Products Sold in Each Country.
Input: Our input data set is a CSV file, SalesJan2009.csv
Prerequisites:
This tutorial is developed on the Linux - Ubuntu operating system.
You should have Hadoop (version 2.2.0 is used in this tutorial) already installed and running on the system.
You should have Java (version 1.8.0 is used in this tutorial) already installed on the system.
You should have set JAVA_HOME accordingly. The commands shown after this list can be used to confirm these.
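A quick way to verify these prerequisites from the shell:
java -version      # should report the installed Java version
echo $JAVA_HOME    # should print the Java installation directory
hadoop version     # should report Hadoop 2.2.0 (or your installed version)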
This guide is divided into 2 parts:
1. Pig Installation
2. Pig Demo
PART 1) Pig Installation
Before we start with the actual process, change user to 'hduser' (user used for Hadoop configuration).

Step 1) Download the latest stable release of Pig (version 0.12.1 is used in this tutorial) from one of the mirror sites available at
http://pig.apache.org/releases.html

Select tar.gz (and not src.tar.gz) file to download.
Step 2) Once the download is complete, navigate to the directory containing the downloaded tar file and move the tar to the location where
you want to set up Pig. In this case, we will move it to /usr/local.

Move to the directory containing the Pig files
cd /usr/local
Extract the contents of the tar file as below
sudo tar -xvf pig-0.12.1.tar.gz

Step 3) Modify ~/.bashrc to add Pig related environment variables.
Open the ~/.bashrc file in any text editor of your choice and make the below modifications-
export PIG_HOME=<Installation directory of Pig>
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH

Step 4) Now, source this environment configuration using below command
. ~/.bashrc

Step 5) We need to recompile PIG to support Hadoop 2.2.0
Here are the steps to do this-
Go to the PIG home directory
cd $PIG_HOME
Install Ant
sudo apt-get install ant

Note: The download will start and will take time depending on your internet speed.
Recompile PIG
sudo ant clean jar-all -Dhadoopversion=23

Please note that multiple components are downloaded in this recompilation process, so the system should be connected to the internet.
Also, in case this process gets stuck and you don't see any movement on the command prompt for more than 20 minutes, then
press Ctrl + C and rerun the same command.
In our case, it took about 20 minutes.

Step 6) Test the Pig installation using command
pig -help

PART 2) Pig Demo
Step 7) Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 8) Pig takes files from HDFS in MapReduce mode and stores the results back to HDFS.
Copy the file SalesJan2009.csv (stored on the local file system at ~/input/SalesJan2009.csv) to the HDFS (Hadoop Distributed File System) home
directory.
Here the file is in the folder 'input'. If the file is stored in some other location, give that path instead.
$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/input/SalesJan2009.csv /

Verify whether the file is actually copied or not.
$HADOOP_HOME/bin/hdfs dfs -ls /

Step 9) Pig Configuration
First navigate to $PIG_HOME/conf
cd $PIG_HOME/conf
sudo cp pig.properties pig.properties.original

Open pig.properties using text editor of your choice, and specify log file path using pig.logfile
sudo gedit pig.properties

The logger will make use of this file to log errors.
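For example, a single line such as the one below can be added; the path is illustrative and any writable location works:
pig.logfile=/home/hduser/pig/pig.log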
Step 10) Run the command 'pig', which will start the Pig command prompt, an interactive shell for Pig queries.
pig

Step 11) In the Grunt command prompt for Pig, execute the below Pig commands in order.
-- A. Load the file containing data.
salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS (Transaction_date:chararray,Product:chararray,Price:chararray,Payment_Type:chararray,Name:chararray,City:chararray,State:chararray,Country:chararray,Account_Created:chararray,Last_Login:chararray,Latitude:chararray,Longitude:chararray);
Press Enter after this command.

-- B. Group data by field Country
GroupByCountry = GROUP salesTable BY Country;

-- C. For each tuple in 'GroupByCountry', generate the resulting string of the form-> Name of Country : No. of products sold
CountByCountry = FOREACH GroupByCountry GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
Press Enter after this command.

-- D. Store the results of Data Flow in the directory 'pig_output_sales' on HDFS
STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');

This command will take some time to execute. Once done, you should see the following screen-
Step 12) Result can be seen through command interface as,
$HADOOP_HOME/bin/hdfs dfs -cat pig_output_sales/part-r-00000

OR
Results can also be seen via the web interface-
Open http://localhost:50070/ in a web browser.

Now select 'Browse the filesystem' and navigate up to /user/hduser/pig_output_sales

Open part-r-00000

From <http://www.guru99.com/create-your-first-pig-program.html>
Learn OOZIE in 5 Minutes




What is OOZIE?
Apache Oozie is a workflow scheduler for Hadoop. It is a system which runs workflows of dependent jobs. Here, users are permitted to
create Directed Acyclic Graphs of workflows, which can be run in parallel and sequentially in Hadoop.
It consists of two parts:
1. Workflow engine: The responsibility of a workflow engine is to store and run workflows composed of Hadoop jobs, e.g.,
MapReduce, Pig, Hive.
2. Coordinator engine: It runs workflow jobs based on predefined schedules and availability of data.
Oozie is scalable and can manage the timely execution of thousands of workflows (each consisting of dozens of jobs) in a Hadoop cluster.

Oozie is very flexible as well. One can easily start, stop, suspend and rerun jobs. Oozie makes it very easy to rerun failed
workflows. One can easily understand how difficult it can be to catch up on missed or failed jobs due to downtime or failure. It is even
possible to skip a specific failed node.
How does OOZIE work?
Oozie runs as a service in the cluster and clients submit workflow definitions for immediate or later processing.
An Oozie workflow consists of action nodes and control-flow nodes.
An action node represents a workflow task, e.g., moving files into HDFS, running a MapReduce, Pig or Hive job, importing data using
Sqoop, or running a shell script or a program written in Java.
A control-flow node controls the workflow execution between actions by allowing constructs like conditional logic, wherein different
branches may be followed depending on the result of an earlier action node.
Start Node, End Node and Error Node fall under this category of nodes.
Start Node designates the start of the workflow job.
End Node signals the end of the job.
Error Node designates the occurrence of an error and the corresponding error message to be printed.
At the end of the execution of a workflow, an HTTP callback is used by Oozie to update the client with the workflow status. Entry to or exit from an
action node may also trigger a callback.
Example Workflow Diagram
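As a sketch of this structure, a minimal, hypothetical workflow definition (workflow.xml) could look as follows; the workflow and action names are made up, and the bare map-reduce action would still need its job configuration in practice:
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="demo-action"/>
    <action name="demo-action">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- job configuration would go here -->
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>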
Why use Oozie?
The main purpose of using Oozie is to manage the different types of jobs being processed in the Hadoop system.
Dependencies between jobs are specified by the user in the form of Directed Acyclic Graphs. Oozie consumes this information and takes
care of their execution in the correct order as specified in a workflow. That way the user's time to manage the complete workflow is saved. In
addition, Oozie has a provision to specify the frequency of execution of a particular job.
FEATURES OF OOZIE
• Oozie has a client API and command line interface which can be used to launch, control and monitor jobs from a Java application (a couple of typical commands are sketched after this list).
• Using its Web Service APIs one can control jobs from anywhere.
• Oozie has a provision to execute jobs which are scheduled to run periodically.
• Oozie has a provision to send email notifications upon completion of jobs.
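A couple of typical commands, assuming an Oozie server running at http://localhost:11000/oozie and a prepared job.properties file (neither is covered in this tutorial):
oozie job -oozie http://localhost:11000/oozie -config job.properties -run    # submit and start a workflow
oozie job -oozie http://localhost:11000/oozie -info <job-id>                 # check the status of a submitted job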
From <http://www.guru99.com/learn-oozie-in-5-minutes.html>
Big Data Testing: Functional & Performance
What is Big Data?
Big data is a collection of large datasets that cannot be processed using traditional computing techniques. Testing these datasets
involves various tools, techniques and frameworks. Big data relates to data creation, storage, retrieval and analysis that is
remarkable in terms of volume, variety, and velocity. You can learn more about Big Data, Hadoop and MapReduce in the earlier sections of this document.
In this tutorial we will learn,
• Big Data Testing Strategy
• Testing Steps in verifying Big Data Applications
• Step 1: Data Staging Validation
• Step 2: "MapReduce" Validation
• Step 3: Output Validation Phase
• Architecture Testing
• Performance Testing
• Performance Testing Approach
• Parameters for Performance Testing
• Test Environment Needs
• Big data Testing Vs. Traditional database Testing
• Tools used in Big Data Scenarios
• Challenges in Big Data Testing
Big Data Testing Strategy
Testing a Big Data application is more about verifying its data processing than testing the individual features of the software product.
When it comes to Big data testing, performance and functional testing are the key.
In Big data testing, QA engineers verify the successful processing of terabytes of data using commodity clusters and other supportive
components. It demands a high level of testing skill as the processing is very fast. Processing may be of three types: batch, real-time, and interactive.

Along with this, data quality is also an important factor in big data testing. Before testing the application, it is necessary to check the
quality of the data, and this should be considered as a part of database testing. It involves checking various characteristics like conformity,
accuracy, duplication, consistency, validity, data completeness, etc.
Testing Steps in verifying Big Data Applications
The following figure gives a high level overview of phases in Testing Big Data Applications

Big Data Testing can be broadly divided into three steps.
Step 1: Data Staging Validation
The first step of big data testing, also referred to as the pre-Hadoop stage, involves process validation.
• Data from various sources like RDBMS, weblogs, social media, etc. should be validated to make sure that correct data is pulled into the system
• Comparing source data with the data pushed into the Hadoop system to make sure they match
• Verify that the right data is extracted and loaded into the correct HDFS location
• Tools like Talend and Datameer can be used for data staging validation (a minimal record-count check is sketched after this list)
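A minimal record-count check of this kind, reusing the SalesJan2009.csv file and paths from the Pig example earlier in this document, could be as simple as:
wc -l ~/input/SalesJan2009.csv                              # record count in the source file
$HADOOP_HOME/bin/hdfs dfs -cat /SalesJan2009.csv | wc -l    # record count after loading into HDFS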
Step 2: "MapReduce" Validation
The second step is validation of "MapReduce". In this stage, the tester verifies the business logic on every node and then
validates it after running against multiple nodes, ensuring that:
• The MapReduce process works correctly
• Data aggregation or segregation rules are implemented on the data
• Key-value pairs are generated
• The data is validated after the MapReduce process
Step 3: Output Validation Phase
The final or third stage of Big Data testing is the output validation process. The output data files are generated and ready to be moved to
an EDW (Enterprise Data Warehouse) or any other system based on the requirement.
Activities in the third stage include:
• Checking that the transformation rules are correctly applied
• Checking the data integrity and successful data load into the target system
• Checking that there is no data corruption by comparing the target data with the HDFS file system data
Architecture Testing
Hadoop processes very large volumes of data and is highly resource intensive. Hence, architectural testing is crucial to ensure the success of
your Big Data project. A poorly or improperly designed system may lead to performance degradation, and the system could fail to meet the
requirements. At the least, Performance and Failover test services should be done in a Hadoop environment.
Performance testing includes testing of job completion time, memory utilization, data throughput and similar system metrics, while the
motive of the Failover test service is to verify that data processing occurs seamlessly in case of failure of data nodes.




Performance Testing
Performance Testing for Big Data includes the following main actions:
• Data ingestion and Throughput: In this stage, the tester verifies how fast the system can consume data from various data
sources. Testing involves identifying the number of messages that the queue can process in a given time frame. It also includes how quickly data
can be inserted into the underlying data store, for example the insertion rate into a MongoDB or Cassandra database.
• Data Processing: It involves verifying the speed with which the queries or MapReduce jobs are executed. It also includes
testing the data processing in isolation when the underlying data store is populated with the data sets, for example running MapReduce
jobs on the underlying HDFS.
• Sub-Component Performance: These systems are made up of multiple components, and it is essential to test each of these
components in isolation, for example how quickly a message is indexed and consumed, MapReduce jobs, query performance, search, etc.
Performance Testing Approach
Performance testing for a big data application involves testing huge volumes of structured and unstructured data, and it requires a
specific testing approach to test such massive data.

Performance Testing is executed in this order:
1. The process begins with the setting up of the Big data cluster which is to be tested for performance
2. Identify and design corresponding workloads
3. Prepare individual clients (custom scripts are created)
4. Execute the test and analyze the results (if objectives are not met, then tune the component and re-execute)
5. Optimum configuration
Parameters for Performance Testing
Various parameters to be verified for performance testing are:
• Data Storage: How data is stored in different nodes
• Commit logs: How large the commit log is allowed to grow
• Concurrency: How many threads can perform write and read operations
• Caching: Tune the cache settings "row cache" and "key cache"
• Timeouts: Values for connection timeout, query timeout, etc.
• JVM Parameters: Heap size, GC collection algorithms, etc.
• MapReduce performance: Sorts, merge, etc.
• Message queue: Message rate, size, etc.

Test Environment Needs
Test environment needs depend on the type of application you are testing. For Big data testing, the test environment should encompass:
• Enough space to store and process a large amount of data
• A cluster with distributed nodes and data
• Minimum CPU and memory utilization to keep performance high
Big data Testing Vs. Traditional database Testing

Data
Traditional database testing: The tester works with structured data.
Big data testing: The tester works with both structured as well as unstructured data.

Traditional database testing: The testing approach is well defined and time-tested.
Big data testing: The testing approach requires focused R&D efforts.

Traditional database testing: The tester has the option of a "Sampling" strategy done manually or an "Exhaustive Verification" strategy using an automation tool.
Big data testing: A "Sampling" strategy in Big data is a challenge.

Infrastructure
Traditional database testing: It does not require a special test environment as the file size is limited.
Big data testing: It requires a special test environment due to the large data size and files.

Validation Tools
Traditional database testing: The tester uses either Excel-based macros or UI-based automation tools.
Big data testing: No defined tools; the range is wide, from programming tools like MapReduce to query languages like HiveQL.

Traditional database testing: Testing tools can be used with basic operating knowledge and less training.
Big data testing: It requires a specific set of skills and training to operate the testing tools, which are still at a nascent stage and over time may come up with new features.

Tools used in Big Data Scenarios

Big Data Cluster        Big Data Tools
NoSQL databases:        CouchDB, MongoDB, Cassandra, Redis, ZooKeeper, HBase
MapReduce:              Hadoop, Hive, Pig, Cascading, Oozie, Kafka, S4, MapR, Flume
Storage:                S3, HDFS (Hadoop Distributed File System)
Servers:                Elastic, Heroku, Google App Engine, EC2
Processing:             R, Yahoo! Pipes, Mechanical Turk, BigSheets, Datameer

Challenges in Big Data Testing
Automation
Automation testing for Big data requires someone with technical expertise. Also, automated tools are not equipped to handle
unexpected problems that arise during testing.
Virtualization
It is one of the integral phases of testing. Virtual machine latency creates timing problems in real-time big data testing. Also, managing
images in Big data is a hassle.
Large Dataset
• Need to verify more data and need to do it faster
• Need to automate the testing effort
• Need to be able to test across different platforms
Performance testing challenges
• Diverse set of technologies: Each sub-component belongs to a different technology and requires testing in isolation
• Unavailability of specific tools: No single tool can perform the end-to-end testing; for example, a tool suited to NoSQL might not fit message queues
• Test Scripting: A high degree of scripting is needed to design test scenarios and test cases
• Test environment: A special test environment is needed due to the large data size
• Monitoring Solution: Limited solutions exist that can monitor the entire environment
• Diagnostic Solution: A custom solution needs to be developed to drill down into the performance bottleneck areas
Summary
• As data engineering and data analytics advance to the next level, Big data testing is inevitable.
• Big data processing could be Batch, Real-Time, or Interactive.
• The 3 stages of testing Big Data applications are:
  - Data staging validation
  - "MapReduce" validation
  - Output validation phase
• Architecture Testing is an important phase of Big data testing, as a poorly designed system may lead to unprecedented errors and degradation of performance.
• Performance testing for Big data includes verifying:
  - Data throughput
  - Data processing
  - Sub-component performance
• Big data testing is very different from traditional data testing in terms of Data, Infrastructure and Validation Tools.
• Big Data Testing challenges include virtualization, test automation and dealing with large datasets. Performance testing of Big Data applications is also an issue.
From <http://www.guru99.com/big-data-testing-functional-performance.html#4>
