Guru Nanak Dev Engg. College

Big Data Using
Hadoop
SUBMITTED BY :-

GAGANPAL SINGH
D3IT-A1
1311374

BIG DATA

[Figure: Growth of and digitization of global information storage capacity]

Big data is a broad term for work with data sets so large or
complex that traditional data processing applications are
inadequate and distributed databases are needed. Challenges
include sensor design, capture, data curation, search, sharing,
storage, transfer, analysis, fusion, visualization, and information
privacy.
The term often refers simply to the use of predictive
analytics or certain other advanced methods to extract value
from data, and seldom to a particular size of data set. Accuracy
in big data may lead to more confident decision making, and
better decisions can mean greater operational efficiency, cost
reduction and reduced risk. Analysis of data sets can find new
correlations to "spot business trends, prevent diseases, combat
crime and so on". Scientists, business executives, practitioners of
media and advertising, and governments alike regularly meet
difficulties with large data sets in areas including Internet
search, finance and business informatics. Scientists encounter
limitations in e-Science work,
including meteorology, genomics, connectomics, complex physics
simulations, and biological and environmental research.
Relational database management systems and desktop statistics
and visualization packages often have difficulty handling big data.
The work instead requires "massively parallel software running on
tens, hundreds, or even thousands of servers". What is
considered "big data" varies depending on the capabilities of the
users and their tools, and expanding capabilities make big data a
moving target. Thus, what is considered "big" one year becomes
ordinary later. "For some organizations, facing hundreds of
gigabytes of data for the first time may trigger a need to
reconsider data management options. For others, it may take tens
or hundreds of terabytes before data size becomes a significant
consideration."
"Big data is high volume, high velocity, and/or high variety
information assets that require new forms of processing to
enable enhanced decision making, insight discovery and process
optimization. Additionally, a new V "Veracity" is added by some
organizations to describe it.

Gartner’s definition of the 3Vs is still widely used, and in
agreement with a consensual definition that states that "Big Data
represents the Information assets characterized by such a High
Volume, Velocity and Variety to require specific Technology and
Analytical Methods for its transformation into Value". The 3Vs
have been expanded to other complementary characteristics of
big data:




Volume: big data doesn't sample; it just observes and tracks what happens.
Velocity: big data is often available in real time.
Variety: big data draws from text, images, audio and video, and it completes missing pieces through data fusion.
Machine learning: big data often doesn't ask why and simply detects patterns.
Digital footprint: big data is often a cost-free byproduct of digital interaction.

Characteristics

Volume:- The quantity of generated data. The size of the data
determines the value and potential of the data under
consideration; the name "big data" itself contains a term
related to size, hence this characteristic.

Variety:- The type of content, an essential fact that data
analysts must know. Knowing the variety helps the people
who work with and analyze the data to use it effectively
to their advantage, and thus upholds its importance.

Velocity:- The speed at which the data is generated and
processed to meet the demands and the challenges that lie
in the path of growth and development.

Variability:- The inconsistency the data can show at times, which
can hamper the process of handling and managing the
data effectively.

Veracity:- The quality of captured data, which can vary greatly.
Accurate analysis depends on the veracity of the source
data.

Complexity:- Data management can be very complex, especially
when large volumes of data come from multiple
sources. Data must be linked, connected, and
correlated so users can grasp the information the
data is supposed to convey.

Applications
Big data has increased the demand for information management
specialists: Software AG, Oracle Corporation, IBM, Microsoft,
SAP, EMC, HP and Dell have spent more than $15 billion on
software firms specializing in data management and analytics.
In 2010, this industry was worth more
than $100 billion and was growing at almost 10 percent a year:
about twice as fast as the software business as a whole.
Developed economies increasingly use data-intensive technologies.
There are 4.6 billion mobile-phone subscriptions worldwide, and
between 1 billion and 2 billion people accessing the internet.
Between 1990 and 2005, more than 1 billion people worldwide
entered the middle class, which means more people become more
literate, which in turn leads to information growth. The world's
effective capacity to exchange information
through telecommunication networks was 281 petabytes in 1986,
471 petabytes in 1993, 2.2 exabytes in 2000, 65 exabytes in
2007 and predictions put the amount of internet traffic at 667
exabytes annually by 2014. According to one estimate, one third
of the globally stored information is in the form of alphanumeric
text and still image data, which is the format most useful for
most big data applications. This also shows the potential of yet
unused data (i.e. in the form of video and audio content).
While many vendors offer off-the-shelf solutions for Big Data,
experts recommend the development of in-house solutions
custom-tailored to solve the company's problem at hand if the
company has sufficient technical capabilities.

Government

The use and adoption of Big Data within governmental processes
is beneficial and allows efficiencies in terms of cost, productivity,
and innovation. That said, this process does not come without its
flaws. Data analysis often requires multiple parts of government
(central and local) to work in collaboration and create new and
innovative processes to deliver the desired outcome. Below are
some leading examples within the governmental big data space.

United States of America
 In 2012, the Obama administration announced the Big Data
Research and Development Initiative, to explore how big data
could be used to address important problems faced by the
government. The initiative is composed of 84 different big data
programs spread across six departments.
 Big data analysis played a large role in Barack Obama's
successful 2012 re-election campaign.
 The United States Federal Government owns six of the ten
most powerful supercomputers in the world.
 The Utah Data Center is a data center currently being
constructed by the United States National Security Agency.
When finished, the facility will be able to handle a large amount
of information collected by the NSA over the Internet. The exact
amount of storage space is unknown, but more recent sources
claim it will be on the order of a few exabytes.

India
 Big data analysis was partly responsible for the BJP and its
allies winning the Indian General Election 2014.

 The Indian Government utilises numerous techniques to
ascertain how the Indian electorate is responding to
government action, as well as ideas for policy augmentation

United Kingdom
 Data on prescription drugs: by connecting origin, location
and the time of each prescription, a research unit was able
to exemplify the considerable delay between the release of
any given drug, and a UK-wide adaptation of the National
Institute for Health and Care Excellence guidelines. This
suggests that new or the most up-to-date drugs take some time to
filter through to the general patient.

HADOOP
Apache Hadoop is an open-source software framework written
in Java for distributed storage and distributed processing of
very large data sets on computer clusters built from commodity
hardware. All the modules in Hadoop are designed with a
fundamental assumption that hardware failures (of individual
machines, or racks of machines) are commonplace and thus should
be automatically handled in software by the framework.
The core of Apache Hadoop consists of a storage part (Hadoop
Distributed File System (HDFS)) and a processing part
(MapReduce). Hadoop splits files into large blocks and distributes
them amongst the nodes in the cluster. To process the data,
Hadoop MapReduce transfers packaged code for nodes to process
in parallel, based on the data each node needs to process. This
approach takes advantage of data locality (nodes manipulating the
data that they have on hand) to allow the data to be processed
faster and more efficiently than it would be in a more
conventional supercomputer architecture that relies on a parallel
file system where computation and data are connected
via high-speed networking.
The base Apache Hadoop framework is composed of the following
modules:


Hadoop Common – contains libraries and utilities needed by
other Hadoop modules;

Hadoop Distributed File System (HDFS) – a distributed file
system that stores data on commodity machines, providing
very high aggregate bandwidth across the cluster;

Hadoop YARN – a resource-management platform
responsible for managing computing resources in clusters and
using them for scheduling of users' applications; and

Hadoop MapReduce – a programming model for large-scale
data processing.

The term "Hadoop" has come to refer not just to the base
modules above, but also to the "ecosystem", or collection of
additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache
HBase, Apache Phoenix, Apache Spark, Apache Zookeeper,
Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache
Storm and others.
Apache Hadoop's MapReduce and HDFS components were
inspired by Google papers on their MapReduce and Google File
System.
The Hadoop framework itself is mostly written in the Java
programming language, with some native code in C and
command-line utilities written as shell scripts. For end-users, though
MapReduce Java code is common, any programming language can
be used with "Hadoop Streaming" to implement the "map" and
"reduce" parts of the user's program. Other related projects

expose other higher-level user interfaces.
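
As a brief illustration of Hadoop Streaming (a sketch added here rather than taken from the report; the jar path shown is the usual location in a Hadoop 2.x installation under /usr/local/hadoop, and the input/output paths are placeholders), the map and reduce steps can be supplied as ordinary command-line programs:

$ hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -input /user/hduser/input \
    -output /user/hduser/streaming-output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Here /bin/cat acts as an identity mapper and /usr/bin/wc as the reducer, so no Java code is involved at all.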

PROJECT DETAILS
REQUIREMENTS:




VMware WORKSTATION
OPERATING SYSTEM (UBUNTU)
FRAMEWORK (HADOOP)
PROGRAMMING LANGUAGE (JAVA)

STEPS FOLLOWED:

STEP 1: MAKING A VIRTUAL MACHINE

 Install VMware WORKSTATION to create a virtual machine on your system.
 Make a virtual machine using the ubuntu.iso file.
 Update Ubuntu after installation by executing the command
(sudo apt-get update) in a terminal.
 Install Java with the following command in a terminal:
(sudo apt-get install java-8-oracle)

k@laptop:~$ cd ~
# Update the source list
k@laptop:~$ sudo apt-get update
# The OpenJDK project is the default version of Java
# that is provided from a supported Ubuntu repository.
k@laptop:~$ sudo apt-get install java-8-oracle
k@laptop:~$ java -version
java version "1.8.0_52"
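
The later steps switch to a dedicated hduser account in a hadoop group (for example su hduser and chown hduser:hadoop below). Creating such an account is not shown explicitly in these steps; one common way to add it on Ubuntu, assuming the standard adduser tools, is:

k@laptop:~$ sudo addgroup hadoop
k@laptop:~$ sudo adduser --ingroup hadoop hduser
k@laptop:~$ sudo adduser hduser sudo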

STEP 2: INSTALLING SSH

ssh has two main components:
1. ssh : The command we use to connect to remote machines - the
client.
2. sshd : The daemon that is running on the server and allows clients
to connect to the server.
SSH is pre-enabled on Linux, but in order to start the sshd daemon, we
need to install ssh first. Use this command to do that:
k@laptop:~$ sudo apt-get install ssh

This will install ssh on our machine. If we get something similar to the
following, we can assume it is set up properly:
k@laptop:~$ which ssh
/usr/bin/ssh
k@laptop:~$ which sshd
/usr/sbin/sshd

Create and Setup SSH Certificates
Hadoop requires SSH access to manage its nodes, i.e. remote machines
plus our local machine. For our single-node setup of Hadoop, we therefore
need to configure SSH access to localhost.
So, we need to have SSH up and running on our machine and configured
to allow SSH public key authentication.

Hadoop uses SSH (to access its nodes) which would normally require the
user to enter a password. However, this requirement can be eliminated by
creating and setting up SSH certificates using the following commands. If
asked for a filename just leave it blank and press the enter key to continue.
k@laptop:~$ su hduser
Password:
k@laptop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
50:6b:f3:fc:0f:32:bf:30:79:c2:41:71:26:cc:7d:e3 hduser@laptop
The key's randomart image is:
+--[ RSA 2048]----+
|     .oo.o       |
|    . .o=. o     |
|     . + .       |
|      o =   o .  |
|       S +     E |
|        . +      |
|         O +     |
|         O o     |
|          o..    |
+-----------------+

hduser@laptop:/home/k$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The second command adds the newly created key to the list of authorized
keys so that Hadoop can use ssh without prompting for a password.
We can check if ssh works:
hduser@laptop:/home/k$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)
...

STEP 3: DISABLE IPV6
•Edit the file using the following command
–$ sudo gedit /etc/sysctl.conf
•Add the following lines at the end of the file
–net.ipv6.conf.all.disable_ipv6 = 1
–net.ipv6.conf.default.disable_ipv6 = 1
–net.ipv6.conf.lo.disable_ipv6 = 1
•Restart the system
•To check whether IPv6 is disabled, use the following command

–cat /proc/sys/net/ipv6/conf/all/disable_ipv6
–It should return the value 1.
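
If a reboot is inconvenient, the same settings can usually be applied immediately by reloading /etc/sysctl.conf (an extra step, not part of the original list):

$ sudo sysctl -p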

STEP 4: INSTALL HADOOP

•(a) Extract and Modify Permissions
–Move the Hadoop package to /usr/local.
–Then change the directory to /usr/local.
–Extract the package using the tar command.
–Then rename the extracted directory to hadoop, and change the owner
of the hadoop directory and of all files and directories in it.
–The following commands are used for this purpose.
•sudo mv /home/hduser/Downloads/hadoop-2.6.0.tar.gz /usr/local
•cd /usr/local
•sudo tar xzf hadoop-2.6.0.tar.gz
•sudo mv hadoop-2.6.0 hadoop
•sudo chown -R hduser:hadoop hadoop
–Note: if the -R option gives an error (this can happen when the dash is
pasted as a non-ASCII character), use the long option --recursive instead.

Setup Configuration Files

The following files will have to be modified to complete the
Hadoop setup:
1. ~/.bashrc
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
3. /usr/local/hadoop/etc/hadoop/core-site.xml
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
1. ~/.bashrc:
Before editing the .bashrc file in our home directory, we need to
find the path where Java has been installed to set
the JAVA_HOME environment variable using the following
command:
hduser@laptop:~$ update-alternatives --config java
There is only one alternative in link group java (providing
/usr/bin/java): /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.

Now we can append the following to the end of ~/.bashrc:
hduser@laptop:~$ vi ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL

export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
hduser@laptop:~$ source ~/.bashrc

Note that JAVA_HOME should be set to the path just before
the '.../bin/':
hduser@ubuntu-VirtualBox:~$ javac -version
javac 1.7.0_75
hduser@ubuntu-VirtualBox:~$ which javac
/usr/bin/javac
hduser@ubuntu-VirtualBox:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac
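
Since Hadoop was already extracted to /usr/local/hadoop in Step 4 and ~/.bashrc now puts its bin and sbin directories on the PATH, a quick extra check (not in the original steps) is to ask Hadoop for its version; it should report the installed release:

hduser@laptop:~$ hadoop version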

2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying hadoop-env.sh file.
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Adding the above statement in the hadoop-env.sh file ensures
that the value of JAVA_HOME variable will be available to
Hadoop whenever it is started up.

3. /usr/local/hadoop/etc/hadoop/core-site.xml:
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains
configuration properties that Hadoop uses when starting up.
This file can be used to override the default settings that
Hadoop starts with.
hduser@laptop:~$ sudo mkdir -p /app/hadoop/tmp
hduser@laptop:~$ sudo chown hduser:hadoop /app/hadoop/tmp

Open the file and enter the following in between the
<configuration></configuration> tag:
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
 </property>

 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
 </property>
</configuration>
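
As an optional sanity check (not part of the original write-up), the hdfs getconf tool can be used to confirm that the value entered above is the one Hadoop actually sees; it should print the configured URI, hdfs://localhost:54310:

hduser@laptop:~$ hdfs getconf -confKey fs.default.name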

4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains a
/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
file, which has to be copied/renamed to mapred-site.xml:
hduser@laptop:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

The mapred-site.xml file is used to specify which framework is
being used for MapReduce.
We need to enter the following content in between the
<configuration></configuration> tag:
<configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
 </property>
</configuration>

5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to
be configured for each host in the cluster that is being used.
It is used to specify the directories which will be used as
the namenode and the datanode on that host.
Before editing this file, we need to create two directories which
will contain the namenode and the datanode for this Hadoop
installation.
This can be done using the following commands:
hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@laptop:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@laptop:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

Open the file and enter the following content in between the
<configuration></configuration> tag:
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>

<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the
file is created.
The default is used if replication is not specified at
create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>

Format the New Hadoop Filesystem
Now, the Hadoop file system needs to be formatted so that we
can start using it. The format command should be issued with
write permission, since it creates the current directory
under the /usr/local/hadoop_store/hdfs/namenode folder:
hduser@laptop:~$ hadoop namenode -format

DEPRECATED: Use of this script to execute hdfs command is
deprecated.
Instead use the hdfs command for it.
15/04/18 14:43:03 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = laptop/192.168.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.0
STARTUP_MSG:   classpath = /usr/local/hadoop/etc/hadoop
...
STARTUP_MSG:   java = 1.7.0_65
************************************************************/
15/04/18 14:43:03 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
15/04/18 14:43:03 INFO namenode.NameNode: createNameNode [-format]
15/04/18 14:43:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-e2f515ac-33da-45bc-8466-5b1100a2bf7f
15/04/18 14:43:09 INFO namenode.FSNamesystem: No KeyProvider found.
15/04/18 14:43:09 INFO namenode.FSNamesystem: fsLock is fair:true
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
15/04/18 14:43:10 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: The block deletion will start around 2015 Apr 18 14:43:10
15/04/18 14:43:10 INFO util.GSet: Computing capacity for map BlocksMap
15/04/18 14:43:10 INFO util.GSet: VM type       = 64-bit
15/04/18 14:43:10 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
15/04/18 14:43:10 INFO util.GSet: capacity      = 2^21 = 2097152 entries
15/04/18 14:43:10 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: defaultReplication         = 1
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxReplication             = 512
15/04/18 14:43:10 INFO blockmanagement.BlockManager: minReplication             = 1
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
15/04/18 14:43:10 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
15/04/18 14:43:10 INFO namenode.FSNamesystem: fsOwner             = hduser (auth:SIMPLE)
15/04/18 14:43:10 INFO namenode.FSNamesystem: supergroup          = supergroup
15/04/18 14:43:10 INFO namenode.FSNamesystem: isPermissionEnabled = true
15/04/18 14:43:10 INFO namenode.FSNamesystem: HA Enabled: false
15/04/18 14:43:10 INFO namenode.FSNamesystem: Append Enabled: true
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map INodeMap
15/04/18 14:43:11 INFO util.GSet: VM type       = 64-bit
15/04/18 14:43:11 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
15/04/18 14:43:11 INFO util.GSet: capacity      = 2^20 = 1048576 entries
15/04/18 14:43:11 INFO namenode.NameNode: Caching file names occuring more than 10 times
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map cachedBlocks
15/04/18 14:43:11 INFO util.GSet: VM type       = 64-bit
15/04/18 14:43:11 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
15/04/18 14:43:11 INFO util.GSet: capacity      = 2^18 = 262144 entries
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
15/04/18 14:43:11 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
15/04/18 14:43:11 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map NameNodeRetryCache
15/04/18 14:43:11 INFO util.GSet: VM type       = 64-bit
15/04/18 14:43:11 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
15/04/18 14:43:11 INFO util.GSet: capacity      = 2^15 = 32768 entries
15/04/18 14:43:11 INFO namenode.NNConf: ACLs enabled? false
15/04/18 14:43:11 INFO namenode.NNConf: XAttrs enabled? true
15/04/18 14:43:11 INFO namenode.NNConf: Maximum size of an xattr: 16384
15/04/18 14:43:12 INFO namenode.FSImage: Allocated new BlockPoolId: BP-130729900-192.168.1.1-1429393391595
15/04/18 14:43:12 INFO common.Storage: Storage directory /usr/local/hadoop_store/hdfs/namenode has been successfully formatted.
15/04/18 14:43:12 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/04/18 14:43:12 INFO util.ExitUtil: Exiting with status 0
15/04/18 14:43:12 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at laptop/192.168.1.1
************************************************************/

Note that the hadoop namenode -format command should be
executed only once, before we start using Hadoop.
If this command is executed again after Hadoop has been used,
it'll destroy all the data on the Hadoop file system.

Starting Hadoop
Now it's time to start the newly installed single node cluster.
We can use start-all.sh or (start-dfs.sh and start-yarn.sh)
k@laptop:~$ cd /usr/local/hadoop/sbin
k@laptop:/usr/local/hadoop/sbin$ ls
distribute-exclude.sh    start-all.cmd        stop-balancer.sh
hadoop-daemon.sh         start-all.sh         stop-dfs.cmd
hadoop-daemons.sh        start-balancer.sh    stop-dfs.sh
hdfs-config.cmd          start-dfs.cmd        stop-secure-dns.sh
hdfs-config.sh           start-dfs.sh         stop-yarn.cmd
httpfs.sh                start-secure-dns.sh  stop-yarn.sh
kms.sh                   start-yarn.cmd       yarn-daemon.sh
mr-jobhistory-daemon.sh  start-yarn.sh        yarn-daemons.sh
refresh-namenodes.sh     stop-all.cmd
slaves.sh                stop-all.sh

k@laptop:/usr/local/hadoop/sbin$ sudo su hduser
hduser@laptop:/usr/local/hadoop/sbin$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/04/18 16:43:13 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java
classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-laptop.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-laptop.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-laptop.out
15/04/18 16:43:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-laptop.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-laptop.out

We can check if it's really up and running:

hduser@laptop:/usr/local/hadoop/sbin$ jps
9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode
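
Another way to verify the cluster (an addition here, assuming the default Hadoop 2.x ports) is to open the daemons' web interfaces in a browser:

NameNode web UI:        http://localhost:50070
ResourceManager web UI: http://localhost:8088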

Stopping Hadoop
We run stop-all.sh or (stop-dfs.sh and stop-yarn.sh) to stop all
the daemons running on our machine:
hduser@laptop:/usr/local/hadoop/sbin$ pwd
/usr/local/hadoop/sbin
hduser@laptop:/usr/local/hadoop/sbin$ ls

distribute-exclude.sh    start-all.cmd        stop-balancer.sh
hadoop-daemon.sh         start-all.sh         stop-dfs.cmd
hadoop-daemons.sh        start-balancer.sh    stop-dfs.sh
hdfs-config.cmd          start-dfs.cmd        stop-secure-dns.sh
hdfs-config.sh           start-dfs.sh         stop-yarn.cmd
httpfs.sh                start-secure-dns.sh  stop-yarn.sh
kms.sh                   start-yarn.cmd       yarn-daemon.sh
mr-jobhistory-daemon.sh  start-yarn.sh        yarn-daemons.sh
refresh-namenodes.sh     stop-all.cmd
slaves.sh                stop-all.sh

hduser@laptop:/usr/local/hadoop/sbin$
hduser@laptop:/usr/local/hadoop/sbin$ stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
15/04/18 15:46:31 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java
classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: no secondarynamenode to stop
15/04/18 15:46:59 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java
classes where applicable
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop

WORDCOUNT PROGRAM
Step 1:
Open Eclipse IDE and create a new project with 3 class files –
WordCount.java, WordCountMapper.java and WordCountReducer.java

Step 2:
Open WordCount.java and paste the following code.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool{
public int run(String[] args) throws Exception

{
//creating a JobConf object and assigning a job name for identification purposes
JobConf conf = new JobConf(getConf(), WordCount.class);
conf.setJobName("WordCount");

//Setting configuration object with the Data Type of output Key and Value
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

//Providing the mapper and reducer class names
conf.setMapperClass(WordCountMapper.class);
conf.setReducerClass(WordCountReducer.class);
//We will give 2 arguments at run time: one is the input path and the other is the output path
Path inp = new Path(args[0]);
Path out = new Path(args[1]);
//the hdfs input and output directory to be fetched from the command line
FileInputFormat.addInputPath(conf, inp);
FileOutputFormat.setOutputPath(conf, out);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception
{
// this main function will call run method defined above.

int res = ToolRunner.run(new Configuration(), new WordCount(),args);
System.exit(res);
}
}

Step 3:
Open WordCountMapper.java and paste the following code.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text,
Text, IntWritable>
{
//hadoop supported data types
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
//map method that performs the tokenizing job and frames the initial key-value pairs
//after all lines are converted into key-value pairs, the reducer is called
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException
{ //taking one line at a time from input file and tokenizing the same
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);

//iterating through all the words available in that line and forming the key value pair
while (tokenizer.hasMoreTokens())
{
word.set(tokenizer.nextToken());
//sending to output collector which inturn passes the same to reducer
output.collect(word, one);
}
}
}

Step 4:
Open WordCountReducer.java and paste the following code.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable,
Text, IntWritable>
{
//reduce method accepts the key-value pairs from the mappers, aggregates the values by key and produces the final output
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException
{
int sum = 0;
/*iterate through all the values available for a key, add them together and emit the
final result as the key and the sum of its values*/
while (values.hasNext())
{
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
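
The report ends with the source code, so the following is a minimal sketch of how the job could be compiled and run on the single-node cluster configured above; the jar name, the wordcount_classes directory, the sample.txt input file and the HDFS paths are illustrative assumptions rather than part of the original report:

# compile the three classes against the Hadoop client libraries and package them into a jar
hduser@laptop:~$ mkdir wordcount_classes
hduser@laptop:~$ javac -classpath $(hadoop classpath) -d wordcount_classes WordCount.java WordCountMapper.java WordCountReducer.java
hduser@laptop:~$ jar -cvf wordcount.jar -C wordcount_classes/ .

# copy a local text file into HDFS as the job input
hduser@laptop:~$ hdfs dfs -mkdir -p /user/hduser/input
hduser@laptop:~$ hdfs dfs -put sample.txt /user/hduser/input

# run the job; the output directory must not exist yet
hduser@laptop:~$ hadoop jar wordcount.jar WordCount /user/hduser/input /user/hduser/output

# inspect the word counts (the old mapred API writes part-00000 files)
hduser@laptop:~$ hdfs dfs -cat /user/hduser/output/part-00000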
