CCD-410 Demo

Vendor: Cloudera

Exam Code: CCD-410

Exam Name: Cloudera Certified Developer for Apache
Hadoop (CCDH)

Version: Demo

Question: 1
When is the earliest point at which the reduce method of a given Reducer can be called?
A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.

Answer: C
Explanation:
In a MapReduce job, reducers do not start executing the reduce method until all of the map tasks have
completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are
available, but the programmer-defined reduce method is called only after all of the mappers have
finished.
Note: The reduce phase has 3 steps: shuffle, sort, and reduce. Shuffle is where the data is collected
by the reducer from each mapper. This can happen while mappers are generating data since it is
only a data transfer. On the other hand, sort and reduce can only start once all the mappers are
done. Why is starting the reducers early a good thing? Because it spreads out the data transfer from
the mappers to the reducers over time, which helps if your network is the bottleneck. Why is starting
the reducers early a bad thing? Because they "hog up" reduce slots while only copying data, so another
job that starts later and would actually use those reduce slots cannot get them. You can customize
when the reducers start up by changing the default value of mapred.reduce.slowstart.completed.maps
in mapred-site.xml. A value of 1.00 will wait for all the mappers to finish before starting the reducers,
a value of 0.0 will start the reducers right away, and a value of 0.5 will start the reducers when half of
the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a
job-by-job basis. Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever
has multiple jobs running at once, so that a job does not hog reducers while they are doing nothing but
copying data. If you only ever have one job running at a time, a value around 0.1 would probably be
appropriate.
Reference:
24 Interview Questions & Answers for Hadoop MapReduce developers, When is the reducers are
started in a MapReduce job?
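
Note: As an illustration only, the slow-start threshold discussed above could be set per job in a driver
that uses the old (mapred) API; the driver class name below is a placeholder:
JobConf conf = new JobConf(MyDriver.class);                  // hypothetical driver class
conf.set("mapred.reduce.slowstart.completed.maps", "0.90");  // start reducers after 90% of maps finish
The same property can also be passed on the command line as
-D mapred.reduce.slowstart.completed.maps=0.90, provided the driver uses ToolRunner/GenericOptionsParser.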

Question: 2
Which describes how a client reads a file from HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns the block
location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data
responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the
DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode
redirects the client to the DataNode that holds the requested data block(s). The client then reads the
data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the
DataNode that holds the requested data block. Data is transferred from the DataNode to the
NameNode, and then from the NameNode to the client.

Answer: A
Explanation:
Client communication with HDFS happens through the Hadoop HDFS API. Client applications talk to the
NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file
on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode
servers where the data lives. Client applications can then talk directly to a DataNode once the
NameNode has provided the location of the data.
Reference:
24 Interview Questions & Answers for Hadoop MapReduce developers, How the Client
communicates with HDFS?
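
Note: A minimal client-side sketch of this read path using the FileSystem API (imports from
org.apache.hadoop.conf and org.apache.hadoop.fs; run inside a method that declares IOException; the
file path is illustrative):
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);                           // metadata requests go to the NameNode
FSDataInputStream in = fs.open(new Path("/user/me/data.txt"));  // NameNode returns the block locations
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);                                   // file bytes stream directly from the DataNodes
}
reader.close();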

Question: 3
You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys,
IntWritable values. Which interface should your class implement?
A. Combiner <Text, IntWritable, Text, IntWritable>
B. Mapper <Text, IntWritable, Text, IntWritable>
C. Reducer <Text, Text, IntWritable, IntWritable>
D. Reducer <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>

Answer: D
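
Note: A combiner is simply a Reducer whose input and output types both match the map output types.
A minimal sketch using the old (mapred) API, with an illustrative class name (imports from java.io,
java.util, org.apache.hadoop.io and org.apache.hadoop.mapred omitted):
public class SumCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // aggregate the map-side counts for this key
        }
        output.collect(key, new IntWritable(sum));
    }
}
The class would be registered in the driver with conf.setCombinerClass(SumCombiner.class).
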
Question: 4
Identify the utility that allows you to create and run MapReduce jobs with any executable or script
as the mapper and/or the reducer.
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
E. mapred

Answer: D
Explanation:
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to
create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Reference:
http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop Streaming, second
sentence)

Question: 5
How are keys and values presented and passed to the reducers during a standard sort and shuffle
phase of MapReduce?
A. Keys are presented to reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to reducer in sorted order; values for a given key are sorted in ascending
order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending
order.

Answer: A
Explanation:
Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the
same key).
The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are
merged.
SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application should
extend the key with the secondary key and define a grouping comparator. The keys will be sorted
using the entire key, but will be grouped using the grouping comparator to decide which keys and
values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference:
org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
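
Note: A minimal sketch of how a secondary sort is wired up, assuming a new-API Job object named job;
CompositeKey and the comparator/partitioner classes named here are illustrative and would be written
by the application:
job.setPartitionerClass(NaturalKeyPartitioner.class);               // keep equal natural keys on one reducer
job.setSortComparatorClass(CompositeKeyComparator.class);           // order by natural key, then secondary key
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // group reduce() calls by natural key only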

Question: 6
Assuming default settings, which best describes the order of data provided to a reducer’s reduce
method:
A. The keys given to a reducer aren’t in a predictable order, but the values associated with those
keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order but the values associated with each key are in no
predictable order

Answer: D
Explanation:
Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the
same key).
The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are
merged.
SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application should
extend the key with the secondary key and define a grouping comparator. The keys will be sorted
using the entire key, but will be grouped using the grouping comparator to decide which keys and
values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference:
org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

Question: 7
You wrote a map function that throws a runtime exception when it encounters a control character in
input data. The input supplied to your mapper contains twelve such characters in total, spread across
five file splits. The first four file splits each have two control characters and the last split has four
control characters. Identify the number of failed task attempts you can expect when you run the
job with mapred.max.map.attempts set to 4:
A. You will have forty-eight failed task attempts
B. You will have seventeen failed task attempts
C. You will have five failed task attempts
D. You will have twelve failed task attempts
E. You will have twenty failed task attempts

Answer: E
Explanation:
There will be four failed task attempts for each of the five file splits: 4 attempts x 5 splits = 20
failed task attempts.

Question: 8
You want to populate an associative array in order to perform a map-side join. You’ve decided to put
this information in a text file, place that file into the DistributedCache and read it in your Mapper
before any records are processed. Identify which method in the Mapper you should use to
implement code for reading the file and populating the associative array?
A. combine
B. map
C. init
D. configure

Answer: D
Explanation:
See 3) below.
Here is an illustrative example on how to use the DistributedCache:
// Setting up the cache for the application
1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz

2. Setup the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
3. Use the cached files in the Mapper or Reducer:
public static class MapClass extends MapReduceBase
        implements Mapper<K, V, K, V> {

    private Path[] localArchives;
    private Path[] localFiles;

    public void configure(JobConf job) {
        // Get the cached archives/files
        localArchives = DistributedCache.getLocalCacheArchives(job);
        localFiles = DistributedCache.getLocalCacheFiles(job);
    }

    public void map(K key, V value,
            OutputCollector<K, V> output, Reporter reporter)
            throws IOException {
        // Use data from the cached archives/files here
        // ...
        output.collect(key, value);
    }
}

Reference:
org.apache.hadoop.filecache , Class DistributedCache

Question: 9
You've written a MapReduce job that will process 500 million input records and generate 500
million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a
significant amount of intermediate data that it needs to transfer between mappers and reducers,
which is a potential bottleneck. A custom implementation of which interface is most likely to reduce
the amount of intermediate data transferred across the network?
A. Partitioner
B. OutputFormat
C. WritableComparable
D. Writable
E. InputFormat
F. Combiner

Answer: F
Explanation:
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate
map output locally, on the individual mapper nodes, before it is sent over the network. Combiners can
help you reduce the amount of data that needs to be transferred across to the reducers. You can use
your reducer code as a combiner if the operation performed is commutative and associative.
Reference:
24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When
should I use a combiner in my MapReduce Job?
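
Note: An illustrative driver fragment (old mapred API, class names are placeholders) showing how a
reducer is reused as a combiner; this is safe only because summing counts is commutative and associative:
JobConf conf = new JobConf(WordCount.class);
conf.setMapperClass(WordCountMapper.class);
conf.setCombinerClass(WordCountReducer.class);   // runs on map output before the shuffle
conf.setReducerClass(WordCountReducer.class);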

Question: 10
Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that
the two tables are formatted as comma-separated files in HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.

Answer: A
Explanation:
Note:
* Join Algorithms in MapReduce
A) Reduce-side join
B) Map-side join
C) In-memory join
/ Striped variant
/ Memcached variant
* Which join to use?
/ In-memory join > map-side join > reduce-side join
/ Limitations of each?
In-memory join: memory
Map-side join: sort order and partitioning
Reduce-side join: general purpose

Question: 11
You have just executed a MapReduce job. Where is intermediate data written to after being emitted
from the Mapper’s map method?
A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never
written to disk.
B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are
written into HDFS.
C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the
Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker
node running the Reducer
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are
written into HDFS.

Answer: C
Explanation:
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each
individual mapper node. This is typically a temporary directory location which can be set up in the
configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job
completes.
Reference:
24 Interview Questions & Answers for Hadoop MapReduce developers, Where is the Mapper Output
(intermediate key-value data) stored?

Question: 12
You want to understand more about how users browse your public website, such as which pages
they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How
will you gather this data for your analysis?
A. Ingest the server web logs into HDFS using Flume.
B. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for
reducers.
C. Import all users’ clicks from your OLTP databases into Hadoop, using Sqoop.
D. Channel these clickstreams into Hadoop using Hadoop Streaming.
E. Sample the weblogs from the web servers, copying them into Hadoop using curl.

Answer: A
Explanation:
Flume is designed to collect, aggregate and move large amounts of log data from many sources, such
as a farm of web servers, into HDFS, where it can then be analyzed.
Hadoop MapReduce for Parsing Weblogs
Here are the steps for parsing a log file using Hadoop MapReduce:
Load log files into the HDFS location using this Hadoop command:
hadoop fs -put <local file path of weblogs> <hadoop HDFS location>
The Opencsv2.3.jar framework is used for parsing log records.
Below is the Mapper program for parsing the log file from the HDFS location.
public static class ParseMapper
        extends Mapper<Object, Text, NullWritable, Text> {

    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        CSVParser parse = new CSVParser(' ', '\"');
        String sp[] = parse.parseLine(value.toString());
        int spSize = sp.length;
        StringBuffer rec = new StringBuffer();
        for (int i = 0; i < spSize; i++) {
            rec.append(sp[i]);
            if (i != (spSize - 1))
                rec.append(",");
        }
        word.set(rec.toString());
        context.write(NullWritable.get(), word);
    }
}
The command below is the Hadoop-based log parse execution. The MapReduce program is attached
in this article. You can add extra parsing methods in the class. Be sure to create a new JAR with any
change and move it to the Hadoop distributed job tracker system.
hadoop jar <path of logparse jar> <hadoop HDFS logfile path> <output path of parsed log file>
The output file is stored in the HDFS location, and the output file name starts with "part-".

Question: 13
MapReduce v2 (MRv2/YARN) is designed to address which two issues?
A. Single point of failure in the NameNode.
B. Resource pressure on the JobTracker.
C. HDFS latency.
D. Ability to run frameworks other than MapReduce, such as MPI.
E. Reduce complexity of the MapReduce APIs.
F. Standardize on a single MapReduce API.

Answer: B, D
Explanation:
YARN (Yet Another Resource Negotiator), as an aspect of Hadoop, has two major kinds of benefits:
* (D) The ability to use programming frameworks other than MapReduce.
/ MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce
alternative
* Scalability, no matter what programming framework you use.
Note:
* The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker,
resource management and job scheduling/monitoring, into separate daemons. The idea is to have a
global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a
single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
* (B) The central goal of YARN is to clearly separate two things that are unfortunately smushed
together in current Hadoop, specifically in (mainly) JobTracker:
/ Monitoring the status of the cluster with respect to which nodes have which resources available.
Under YARN, this will be global.
/ Managing the parallel execution of any specific job. Under YARN, this will be done separately
for each job.
The current Hadoop MapReduce system is fairly scalable: Yahoo runs 5,000 Hadoop jobs, truly
concurrently, on a single cluster, for a total of 1.5–2 million jobs per cluster per month. Still, YARN will
remove scalability bottlenecks.
Reference:
Apache Hadoop YARN – Concepts & Applications

Question: 14
You need to run the same job many times with minor variations. Rather than hardcoding all job
configuration options in your driver code, you've decided to have your Driver subclass
org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.
Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop?
A. hadoop “mapred.job.name=Example” MyDriver input output
B. hadoop MyDriver mapred.job.name=Example input output
C. hadoop MyDriver -D mapred.job.name=Example input output
D. hadoop setproperty mapred.job.name=Example MyDriver input output
E. hadoop setproperty (“mapred.job.name=Example”) MyDriver input output

Answer: C
Explanation:
Configure the property using the -D key=value notation:
-D mapred.job.name='My Job'
You can list a whole bunch of options by calling the streaming jar with just the -info argument
Reference:
Python hadoop streaming : Setting a job name
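
Note: A minimal sketch of such a driver (imports from org.apache.hadoop.conf and
org.apache.hadoop.util omitted); the class name is a placeholder and the job setup inside run() is left
out. Because ToolRunner passes the arguments through GenericOptionsParser, the -D option is stripped
out and applied to the Configuration before run() sees the remaining arguments:
public class MyDriver extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();   // already contains mapred.job.name=Example
        // ... build and submit the job using args[0] (input) and args[1] (output) ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}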

Question: 15
You are developing a MapReduce job for sales reporting. The mapper will process input keys
representing the year (IntWritable) and input values representing product identifiers (Text).
Identify what determines the data types used by the Mapper for a given job.
A. The key and value types specified in the JobConf.setMapInputKeyClass and
JobConf.setMapInputValuesClass methods
B. The data types specified in HADOOP_MAP_DATATYPES environment variable
C. The mapper-specification.xml file submitted with the job determine the mapper’s input key and
value types.
D. The InputFormat used by the job determines the mapper’s input key and value types.

Answer: D
Explanation:
The input types fed to the mapper are controlled by the InputFormat used. The default input format,
"TextInputFormat," will load data in as (LongWritable, Text) pairs. The long value is the byte offset of
the line in the file. The Text object holds the string contents of the line of the file.
Note: The data types emitted by the reducer are identified by setOutputKeyClass() and
setOutputValueClass().
By default, it is assumed that these are the output types of the mapper as well. If this is not the case,
the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class
will override these.
Reference:
Yahoo! Hadoop Tutorial, THE DRIVER METHOD
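
Note: An illustrative old-API driver fragment, assuming a JobConf named conf; the InputFormat chosen
here is what fixes the mapper's input key/value types, and the key/value classes shown are only examples
for the sales job described above:
conf.setInputFormat(SequenceFileInputFormat.class);  // e.g. SequenceFiles of <IntWritable, Text> records
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapOutputKeyClass(Text.class);               // only needed if the map output types
conf.setMapOutputValueClass(IntWritable.class);      // differ from the job output types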

Question: 16
Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers
and monitoring application resource usage?
A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. ApplicationMasterService
E. TaskTracker
F. JobTracker

Answer: C
Explanation:
The fundamental idea of MRv2 (YARN) is to split up the two major functionalities of the JobTracker,
resource management and job scheduling/monitoring, into separate daemons. The idea is to have a
global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a
single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
Note: Let's walk through an application execution sequence:
* A client program submits the application, including the necessary specifications to launch the
application-specific ApplicationMaster itself.
* The ResourceManager assumes the responsibility to negotiate a specified container in which to start
the ApplicationMaster and then launches the ApplicationMaster.
* The ApplicationMaster, on boot-up, registers with the ResourceManager; the registration allows the
client program to query the ResourceManager for details, which allow it to directly communicate with
its own ApplicationMaster.
* During normal operation the ApplicationMaster negotiates appropriate resource containers via the
resource-request protocol.
* On successful container allocations, the ApplicationMaster launches the container by providing the
container launch specification to the NodeManager. The launch specification, typically, includes the
necessary information to allow the container to communicate with the ApplicationMaster itself.
* The application code executing within the container then provides necessary information (progress,
status etc.) to its ApplicationMaster via an application-specific protocol.
* During the application execution, the client that submitted the program communicates directly with
the ApplicationMaster to get status, progress updates etc. via an application-specific protocol.
* Once the application is complete, and all necessary work has been finished, the ApplicationMaster
deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
Reference:
Apache Hadoop YARN – Concepts & Applications

Question: 17
Which best describes how TextInputFormat processes input files and line breaks?
A. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of
the split that contains the beginning of the broken line.
B. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of
both splits containing the broken line.
C. The input file is split exactly at the line breaks, so each RecordReader will read a series of
complete lines.
D. Input file splits may cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of
the split that contains the end of the broken line.

Answer: E
Explanation:
As the Map operation is parallelized the input file set is first split to several pieces called FileSplits. If
an individual file is so large that it will affect seek time it will be split to several Splits. The splitting
does not know anything about the input file's internal logical structure, for example line-oriented
text files are split on arbitrary byte boundaries. Then a new map task is created per FileSplit. When
an individual map task starts it will open a new output writer per configured reduce task. It will then
proceed to read its FileSplit using the RecordReader it gets from the specified InputFormat.
InputFormat parses the input and generates key-value pairs. InputFormat must also handle records
that may be split on the FileSplit boundary. For example TextInputFormat will read the last line of
the FileSplit past the split boundary and, when reading other than the first FileSplit, TextInputFormat
ignores the content up to the first newline.
Reference:
How Map and Reduce operations are actually carried out

Question: 18
For each input key-value pair, mappers can emit:
A. As many intermediate key-value pairs as designed. There are no restrictions on the types of those
key-value pairs (i.e., they can be heterogeneous).
B. As many intermediate key-value pairs as designed, but they cannot be of the same type as the
input key-value pair.
C. One intermediate key-value pair, of a different type.
D. One intermediate key-value pair, but of the same type.
E. As many intermediate key-value pairs as designed, as long as all the keys have the same types and
all the values have the same type.

Answer: E
Explanation:
Mapper maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual
tasks that transform input records into intermediate records. The transformed intermediate records
do not need to be of the same type as the input records. A given input pair may map to zero or many
output pairs.
Reference:
Hadoop Map-Reduce Tutorial

Question: 19
You have the following key-value pairs as output from your Map task:
(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)
How many keys will be passed to the Reducer’s reduce method?
A. Six
B. Five
C. Four
D. Two
E. One
F. Three

Answer: B
Explanation:
The two (the, 1) pairs share a single key, so the reducer sees five distinct keys: the, fox, faster, than, and dog.

Question: 20
You have user profile records in your OLTP database that you want to join with web logs you have
already ingested into the Hadoop file system. How will you obtain these user records?
A. HDFS command
B. Pig LOAD command
C. Sqoop import
D. Hive LOAD DATA command
E. Ingest with Flume agents
F. Ingest with Hadoop Streaming

Answer: C
Explanation:
Sqoop is designed to import data from relational databases, such as the OLTP system holding the user
profiles, into HDFS, where it can then be joined with the web logs. For the log-analysis side, Apache
Hadoop and Pig provide excellent tools for extracting and analyzing data from very large Web logs.
We use Pig scripts for sifting through the data and to extract useful information from the Web logs.
We load the log file into Pig using the LOAD command.
raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray);
Note 1:
Data Flow and Components
* Content will be created by multiple Web servers and logged in local hard discs. This content will
then be pushed to HDFS using FLUME framework. FLUME has agents running on Web servers; these
are machines that collect data intermediately using collectors and finally push that data to HDFS.
* Pig Scripts are scheduled to run using a job scheduler (could be cron or any sophisticated batch job
solution). These scripts actually analyze the logs on various dimensions and extract the results.
Results from Pig are by default inserted into HDFS, but we can use storage implementation for other
repositories also such as HBase, MongoDB, etc. We have also tried the solution with HBase (please
see the implementation section). Pig Scripts can either push this data to HDFS and then MR jobs will
be required to read and push this data into HBase, or Pig scripts can push this data into HBase
directly. In this article, we use scripts to push data onto HDFS, as we are showcasing the Pig
framework applicability for log analysis at large scale.
* The database HBase will have the data processed by Pig scripts ready for reporting and further
slicing and dicing.
* The data-access Web service is a REST-based service that eases the access and integrations with
data clients. The client can be in any language to access REST-based API. These clients could be BI- or
UI-based clients.
Note 2:
The Log Analysis Software Stack
* Hadoop is an open source framework that allows users to process very large data in parallel. It's
based on the framework that supports Google search engine. The Hadoop core is mainly divided into
two modules:
1. HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using
multiple commodity servers connected in a cluster.
2. Map-Reduce (MR) is a framework for parallel processing of large data sets. The default
implementation is bonded with HDFS.
* The database can be a NoSQL database such as HBase. The advantage of a NoSQL database is that
it provides scalability for the reporting module as well, as we can keep historical processed data for
reporting purposes. HBase is an open source columnar DB or NoSQL DB, which uses HDFS. It can also
use MR jobs to process data. It gives real-time, random read/write access to very large data sets;
HBase can store very large tables having millions of rows. It's a distributed database and can also keep
multiple versions of a single row.
* The Pig framework is an open source platform for analyzing large data sets and is implemented as
a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of
developers who write code in the Map-Reduce format, since code in Map-Reduce format needs to
be written in Java. In contrast, Pig enables users to write code in a scripting language.
* Flume is a distributed, reliable and available service for collecting, aggregating and moving a large
amount of log data (src flume-wiki). It was built to push large logs into Hadoop-HDFS for further
processing. It's a data flow solution, where there is an originator and destination for each node and
is divided into Agent and Collector tiers for collecting logs and pushing them to destination storage.
Reference:
Hadoop and Pig for Large-Scale Web Log Analysis

Question: 21
What is the disadvantage of using multiple reducers with the default HashPartitioner and
distributing your workload across your cluster?
A. You will not be able to compress the intermediate data.
B. You will no longer be able to take advantage of a Combiner.
C. By using multiple reducers with the default HashPartitioner, output files may not be in globally
sorted order.
D. There are no concerns with this approach. It is always advisable to use multiple reducers.

Answer: C
Explanation:
Multiple reducers and total ordering: If your sort job runs with multiple reducers (either because
mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because
you've used the -r option to specify the number of reducers on the command-line), then by default
Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the
HashPartitioner means that you can't concatenate your output files to create a single sorted output
file. To do this you'll need total ordering.
Reference:
Sorting text files with MapReduce
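
Note: A hedged sketch of how total ordering is usually achieved with multiple reducers, using
TotalOrderPartitioner with sampled split points (new mapreduce API, assuming a Job named job; the
partition-file path and the key/value types are illustrative and must match the job's map output types):
Path partitionFile = new Path("/tmp/sort_partitions");   // illustrative path
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
job.setPartitionerClass(TotalOrderPartitioner.class);
InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10)); // sample input keys to pick split points
With this in place each reducer receives a contiguous, non-overlapping key range, so the per-reducer
output files can simply be concatenated in order to obtain a globally sorted result.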

Question: 22
Given a directory of files with the following structure: line number, tab character, string:
Example:
1	abialkjfjkaoasdfjksdlkjhqweroij
2	kadfjhuwqounahagtnbvaswslmnbfgy
3	kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which InputFormat should you use to
complete the line: conf.setInputFormat(____.class); ?
A. SequenceFileAsTextInputFormat
B. SequenceFileInputFormat
C. KeyValueFileInputFormat
D. BDBInputFormat

Answer: B
Explanation:
Note:
The output format for your first MR job should be SequenceFileOutputFormat - this will store the
Key/Values output from the reducer in a binary format, that can then be read back in, in your second
MR job using SequenceFileInputFormat.
Reference:
How to parse CustomWritable from text in Hadoop
http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop
(see answer 1 and then see the comment #1 for it)

Question: 23
You need to perform statistical analysis in your MapReduce job and would like to call methods in the
Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which
is the best way to make this library available to your MapReduce job at runtime?
A. Have your system administrator copy the JAR to all nodes in the cluster and set its location in the
HADOOP_CLASSPATH environment variable before you submit your job.
B. Have your system administrator place the JAR file on a Web server accessible to all cluster nodes
and then set the HTTP_JAR_URL environment variable to its location.
C. When submitting the job on the command line, specify the –libjars option followed by the JAR file
path.
D. Package your code and the Apache Commons Math library into a zip file named JobJar.zip

Answer: C
Explanation:
The usage of the jar command is like this:
Usage: hadoop jar <jar> [mainClass] args...
If you want the commons-math3.jar to be available for all the tasks, you can do either of these:
1. Copy the jar file into the $HADOOP_HOME/lib dir, or
2. Use the generic option -libjars.
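
Note: An illustrative -libjars invocation (jar names and paths are placeholders); this only works if the
driver is run through ToolRunner/GenericOptionsParser, which strips the generic options before the
remaining arguments reach the driver:
hadoop jar myjob.jar MyDriver -libjars /path/to/commons-math.jar input output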

Question: 24
The Hadoop framework provides a mechanism for coping with machine issues such as faulty
configuration or impending hardware failure. MapReduce detects that one or a number of machines
are performing poorly and starts more copies of a map or reduce task. All the tasks run
simultaneously and the results of the task that finishes first are used. This is called:
A. Combine
B. IdentityMapper
C. IdentityReducer
D. Default Partitioner
E. Speculative Execution

Answer: E
Explanation:
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across
many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if
one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the
other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map
task to check in, which takes much longer than all the other nodes. By forcing tasks to run in
isolation from one another, individual tasks do not know where their inputs come from. Tasks trust
the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be
processed multiple times in parallel, to exploit differences in machine capabilities. As most of the
tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the
remaining tasks across several nodes which do not have other work to perform. This process is
known as speculative execution. When tasks complete, they announce this fact to the JobTracker.
Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing
speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The
Reducers then receive their inputs from whichever Mapper completed successfully, first.
Reference:
Apache Hadoop, Module 4: MapReduce
Note:
* Hadoop uses "speculative execution." The same task may be started on multiple boxes. The first
one to finish wins, and the other copies are killed.
Failed tasks are tasks that error out.
* There are a few reasons Hadoop can kill tasks by its own decision:
a) Task does not report progress during timeout (default is 10 minutes)
b) FairScheduler or CapacityScheduler needs the slot for some other pool (FairScheduler) or queue
(CapacityScheduler).
c) Speculative execution causes the results of a task not to be needed, since the task has already
completed elsewhere.
Reference:
Difference failed tasks vs killed tasks
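
Note: As an illustration, speculative execution can be toggled per job with the MRv1 properties below,
assuming an old-API JobConf/Configuration object named conf (JobConf also offers the convenience
setters setMapSpeculativeExecution() and setReduceSpeculativeExecution()):
conf.setBoolean("mapred.map.tasks.speculative.execution", true);     // speculate slow map tasks
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false); // but not reduce tasks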

Question: 25
For each intermediate key, each reducer task can emit:
A. As many final key-value pairs as desired. There are no restrictions on the types of those key-value
pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate
key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the
values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.

Answer: E
Explanation:
Reducer reduces a set of intermediate values which share a key to a smaller set of values. Reducing
lets you aggregate values together. A reducer function receives an iterator of input values from an
input list. It then combines these values together, returning a single output value.
Reference:
Hadoop Map-Reduce Tutorial; Yahoo! Hadoop Tutorial, Module 4: MapReduce

Question: 26
What data does a Reducer reduce method process?
A. All the data in a single input file.
B. All data produced by a single mapper.
C. All data for a given key, regardless of which mapper(s) produced it.
D. All data for a given value, regardless of which mapper(s) produced it.

Answer: C
Explanation:
Reducing lets you aggregate values together. A reducer function receives an iterator of input values
from an input list. It then combines these values together, returning a single output value. All values
with the same key are presented to a single reduce task.
Reference:
Yahoo! Hadoop Tutorial, Module 4: MapReduce

Question: 27
All keys used for intermediate output from mappers must:
A. Implement a splittable compression algorithm.
B. Be a subclass of FileInputFormat.
C. Implement WritableComparable.
D. Override isSplitable.
E. Implement a comparator for speedy sorting.

Answer: C
Explanation:
The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views
the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types. The key and value classes have to be serializable by
the framework and hence need to implement the Writable interface. Additionally, the key classes
have to implement the WritableComparable interface to facilitate sorting by the framework.
Reference:
MapReduce Tutorial
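
Note: A minimal sketch of a custom intermediate key that satisfies this requirement; the class and field
names are illustrative (imports from java.io and org.apache.hadoop.io omitted):
public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                  // serialization used by the framework
    }

    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
    }

    public int compareTo(YearKey other) {    // defines the sort order used during the shuffle
        return Integer.compare(this.year, other.year);
    }

    @Override
    public int hashCode() {                  // the default HashPartitioner relies on hashCode()
        return year;
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof YearKey) && ((YearKey) o).year == year;
    }
}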

Question: 28
On a cluster running MapReduce v1 (MRv1), a TaskTracker heartbeats into the JobTracker on your
cluster, and alerts the JobTracker it has an open map task slot. What determines how the JobTracker
assigns each map task to a TaskTracker?
A. The amount of RAM installed on the TaskTracker node.
B. The amount of free disk space on the TaskTracker node.
C. The number and speed of CPU cores on the TaskTracker node.
D. The average system load on the TaskTracker node over the past fifteen (15) minutes.
E. The location of the InputSplit to be processed in relation to the location of the node.

Answer: E
Explanation:
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few seconds, to
reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number
of available slots, so the JobTracker can stay up to date with where in the cluster work can be
delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce
operations, it first looks for an empty slot on the same server that hosts the DataNode containing
the data, and if not, it looks for an empty slot on a machine in the same rack.