
Data Collection Framework for Social Network:
Hadoop-HBase Performance Evaluation

Dileep V K, Anuj Kumar, Dr. Udayakumar Shenoy
Computer Science and Engineering Department, NMAMIT, Nitte, India.
[email protected] [email protected]

Formcept, Bangalore, India. [email protected]

ABSTRACT

The data collection framework focuses on the social network Twitter. It collects data from search results for a specified keyword. The resultant data is stored in HBase, which is a distributed database. Each result consists of the tweet message, the image of the author, the date the message was created, and the user id. Each search consists of hundreds of tweets. HBase is a distributed database built on top of a distributed file system, Hadoop. The searched data is analyzed, and the analysis is shown using graphs. HBase is the open source version of BigTable, the distributed storage system developed by Google for the management of large volumes of structured data. Like most non-SQL database systems, HBase is written in Java. The current work's purpose is to evaluate the performance of the HBase implementation in comparison with a SQL database and, of course, the performance offered by HBase itself. The tests aim at evaluating performance for random writing and random reading of rows and for sequential writing and sequential reading, and how these are affected by increasing the number of column families and by using MapReduce functions.

Keywords: Hadoop, HBase, BigTable, MapReduce, DataNode, NameNode, Twitter, ActiveMQ

1. INTRODUCTION

Many modern applications include a database server serving multiple web servers accessed by many clients. In this situation, many consider upgrading the hardware without taking the database server into consideration. We can consider that SQL technology has reached its maximum point of scalability. Applications that use databases are generally designed for complex environments of large information management. As a general idea, the ease of usage

is more and more substituted by non-SQL databases. There is a need for a new approach in which a large gain in performance requires insignificant costs and offers good scalability. Apache Hadoop is a top-level Apache project that includes open source implementations of a distributed file system and MapReduce that were inspired by Google's GFS and MapReduce projects. The Hadoop ecosystem also includes projects like Apache HBase, which is inspired by Google's BigTable, Apache Hive, a data warehouse built on top of Hadoop, and Apache ZooKeeper, a coordination service for distributed systems.

Here, Hadoop has been used in conjunction with HBase for the storage and analysis of large data sets. Hadoop workloads typically read and write large amounts of data from disk sequentially. As such, there has been less emphasis on making Hadoop performant for random access or on providing low latency access to HDFS. Administering MySQL clusters requires a relatively high management overhead and they typically use more expensive hardware. Given our high confidence in the reliability and scalability of HDFS, we began to explore Hadoop and HBase for such applications. We decided to use HBase for this project, which in turn leverages HDFS for scalable and fault tolerant storage and ZooKeeper for distributed consensus. In the following sections we present these applications in more detail and why we decided to use Hadoop and HBase as the common foundation technologies, and we discuss ongoing and future work in the project.

2. RELATED WORK

The first step of the project is setting up Hadoop, with a dedicated Hadoop user account for running Hadoop. This is a single-node setup of Hadoop. Our setup will use the Hadoop Distributed File System. The first step to start after installation is formatting the Hadoop file system.

Whenever changes are made to an HBase configuration, make sure to copy the content of the conf directory to all nodes of the cluster. In distributed mode, replace the hadoop jar found in the HBase lib directory with the hadoop jar running on the cluster to avoid version mismatch issues; make sure to replace the jar in HBase everywhere on the cluster. Distributed modes require an instance of the Hadoop Distributed File System (HDFS). A pseudo-distributed mode is simply a distributed mode run on a single host; use this configuration for testing and prototyping on HBase.

2.1 HDFS CLIENT CONFIGURATION

To finish the HDFS client configuration on your Hadoop cluster, do one of the following:
I. Add a pointer to your HADOOP_CONF_DIR to the HBASE_CLASSPATH environment variable in hbase-env.sh.
II. Add a copy of hdfs-site.xml (or hadoop-site.xml) or, better, symlinks, under ${HBASE_HOME}/conf, or
III. If only a small set of HDFS client configurations is needed, add them to hbase-site.xml.

2.2 RUNNING AND CONFIRMING YOUR INSTALLATION

Make sure HDFS is running first. Ensure it started properly by testing the put and get of files into the Hadoop filesystem. HBase starts up ZooKeeper as part of its start process. Shutdown can take longer if the cluster is comprised of many machines. When running a distributed operation, wait until HBase has shut down completely before stopping the Hadoop daemons.

2.3 MESSAGE QUEUE IMPLEMENTATION USING ACTIVEMQ

ActiveMQ is an open source, Java Message Service (JMS) compliant, message-oriented middleware (MOM) from the Apache Software Foundation that provides high availability, performance, scalability, reliability, and security for enterprise messaging. The generic term "Destination" refers both to Queues and Topics. Consumers and producers only share the name of a Destination. The message is deleted from storage after a service-level acknowledgement from the consumer. Remember to start the connection; otherwise the consumer will never receive a message. Remember to close the connection in order to save resources; the program will not end if the connection is not closed. The receive command blocks; a consumer will block waiting forever if there is no message in the Queue.
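The queue behaviour described above can be illustrated with a short JMS client. The following is only a minimal sketch, assuming a broker at the default tcp://localhost:61616 and a hypothetical queue named tweets; it is not the framework's actual producer or consumer code and only shows the start/receive/close points noted above.

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Destination;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class QueueSketch {
        public static void main(String[] args) throws JMSException {
            // Broker URL and queue name are assumptions for illustration only.
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            connection.start();                      // without start() the consumer never receives anything
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Destination destination = session.createQueue("tweets");

            // Producer side: publish one message to the Destination.
            MessageProducer producer = session.createProducer(destination);
            producer.send(session.createTextMessage("sample tweet payload"));

            // Consumer side: receive(timeout) avoids blocking forever on an empty queue.
            MessageConsumer consumer = session.createConsumer(destination);
            Message received = consumer.receive(1000);
            if (received instanceof TextMessage) {
                System.out.println(((TextMessage) received).getText());
            }

            connection.close();                      // close the connection to release resources
        }
    }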

2.4 TWITTER API PROGRAMMING USING JAVA

Ensure that search parameters are properly URL encoded. A query is constructed against http://search.twitter.com/search.json?q= . (A short Java sketch of building and fetching such a query appears after the parameter list below.)


The API serves real-time search results; see the result_type parameter below for more information.

Resource URL: http://search.twitter.com/search.format

Parameters:
q (required): the search query. It should be URL encoded, and queries will be limited by complexity.
count: indicates the number of previous statuses to consider for delivery before transitioning to live delivery. On unfiltered streams, all considered statuses are delivered, so the number requested is the number returned. On filtered streams, the number requested is the number of statuses that are applied to the filter predicate, and not the number of statuses returned.
result_type: specifies what type of search results you would prefer to receive. Valid values include:
  mixed: include both popular and real-time results in the response.
  recent: return only the most recent results in the response.
  popular: return only the most popular results in the response.
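As a rough illustration, a query against the legacy search endpoint quoted above can be built and fetched with plain JDK classes. The endpoint is taken as given in the text, the keyword is a placeholder, and response parsing is omitted.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class SearchQuerySketch {
        public static void main(String[] args) throws Exception {
            String keyword = "hadoop";                       // placeholder search term
            String q = URLEncoder.encode(keyword, "UTF-8");  // the q parameter must be URL encoded
            URL url = new URL("http://search.twitter.com/search.json?q=" + q + "&result_type=recent");

            // Read the JSON response; extracting tweet fields is left out of this sketch.
            StringBuilder json = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    json.append(line);
                }
            }
            System.out.println(json);
        }
    }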

3. SYSTEM MODEL

3.1 DATA COLLECTION FRAMEWORK

This project deals with a data collection framework for the storage of unstructured data. The framework is looking forward to the challenge of managing the incoming flow of content, irrespective of its source. The solution is built on top of a distributed file system and a message queue to handle the incoming flow of data, the storage of data, and retrieval on demand.

This data collection framework currently focuses only on the social network Twitter. It collects data from search results for a specified keyword. The resultant data is stored in HBase, which is a distributed database. Each result consists of the tweet message, the image of the author, the date the message was created, and the user id. Each search consists of hundreds of tweets. The controlled movement of data is handled through a message queue using ActiveMQ. HBase is a distributed database which is built on top of a distributed file system, Hadoop. The searched data is analyzed, and the analysis is shown using graphs. The last part of the project ran a series of performance tests against the HBase server. HBase performed adequately for the most part under the various test conditions. HBase works well in many situations, especially if it is not pushed to its limits, but it is still a work in progress. Finally, the stored data can be retrieved from the database.
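As a rough sketch of the storage step, the code below writes one tweet into a Twitter table under a message column family, using the older HTable/Put client API that was current for HBase 0.20/0.90. The table, family and qualifier names follow the schema described later in the results section; the row key and field values are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StoreTweetSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            HTable table = new HTable(conf, "Twitter");

            // One tweet becomes one row; the tweet id serves here as a placeholder row key.
            Put put = new Put(Bytes.toBytes("tweet-0001"));
            put.add(Bytes.toBytes("message"), Bytes.toBytes("text"), Bytes.toBytes("sample tweet text"));
            put.add(Bytes.toBytes("message"), Bytes.toBytes("date"), Bytes.toBytes("2012-01-01T00:00:00Z"));
            put.add(Bytes.toBytes("message"), Bytes.toBytes("userID"), Bytes.toBytes("someUser"));
            table.put(put);

            table.close();
        }
    }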

Documents  Lifestyle  Home & Garden

10 views

0

RELATED TITLES

0



Tech Paper Uploaded by Dileep Vk Mapred Tutorial

Full description 







Save

Embed

Share

Print

3.2 HADOOP FRAMEWORK

Hadoop keeps each block of data replicated three times. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks. MapReduce also provides a job tracker and a number of task trackers. The job tracker works as a master by coordinating all the jobs run on the system and by establishing a schedule for tasks to run on task trackers. Task trackers are the slaves that run tasks and send progress reports to the job tracker. (A minimal mapper and reducer skeleton is sketched at the end of this subsection.) HDFS comes in handy when the dataset outgrows the storage capacity of a single physical machine and we need several machines to process the task.

Blocks. The default measurement unit for HDFS is the block size. This is the minimum amount of data that it can read or write. HDFS has the concept of a block, but it is a much larger unit: 64 MB by default.

Namenodes and Datanodes. An HDFS cluster has two types of nodes that operate in a master-slave configuration: a namenode (the master) and a number of datanodes (slaves). The namenode is in charge of the filesystem's namespace. It maintains the metadata for all the files and directories in the tree. The user accesses the filesystem by communicating with the namenode and datanodes. In fact, if the machine running the namenode went down, all the files on the filesystem would be lost, since there would be no way of finding out how to reconstruct the files from the blocks on the datanodes.

HBase. It is said that data is stored in a database in a structured manner, while a distributed storage system similar to the one proposed by Google through BigTable can store large amounts of semi-structured data without having to redesign the entire scheme. In this paper, we try to assess the performance of an open source implementation of BigTable, named HBase, developed using the Java programming language. HBase is a distributed column-oriented database built on top of HDFS. HBase is built from the ground up to scale just by adding nodes. Applications that use MapReduce store data into labeled tables. Tables are made of rows and columns. Table cells are versioned; the version is just a timestamp assigned by HBase at the time any information is inserted into a cell. Table row keys are also byte arrays, so theoretically anything can serve as a row key, from strings to binary representations of longs or even serialized data structures. Table rows are sorted by row key, the table's primary key, and all table accesses are via the table primary key. HBase uses column families as a response to relational indexing; row columns are grouped into column families, and all column family members have a common prefix. Tables are automatically partitioned horizontally by HBase into regions.
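To make the key/value abstraction concrete, here is a minimal mapper and reducer skeleton written against the org.apache.hadoop.mapreduce API. It simply counts word occurrences, standing in for the kind of per-keyword aggregation the framework could run over tweet text; it is not code from the paper.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountSketch {

        // Map task: each input line becomes a set of (word, 1) pairs.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce task: all counts emitted for the same word are summed.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count sketch");
            job.setJarByClass(WordCountSketch.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }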


Like Hadoop, HBase uses SSH as its remote commands mechanism. The conf/hbase-site.xml and conf/hbase-env.sh files are used to keep the cluster configuration, having the same format as their equivalents up in HDFS. HBase also has some special catalog tables, -ROOT- and .META., within which it maintains the current list, recent history, and location of all regions. The -ROOT- table holds the list of .META. table regions. The .META. table holds the list of all user-space regions.

4. WHY HADOOP AND HBASE

The requirements for the storage system can be summarized as follows:
1. Elasticity: We need to be able to add incremental capacity to our storage systems with minimal overhead and no downtime. In some cases we may want to add capacity rapidly, and the system should automatically balance load and utilization across new hardware.
2. High write throughput: Most of the applications store tremendous amounts of data and require high aggregate write throughput.
3. Efficient and low-latency strong consistency semantics within a data center: There are important applications like Messages that require strong consistency within a data center. This requirement often arises directly from user expectations. We also knew that Messages was easy to federate so that a particular user could be served entirely out of a single data center, making strong consistency within a single data center sufficient.
4. Efficient random reads from disk: In spite of the widespread use of application level caches, a lot of accesses miss the cache and hit the back-end storage system.
5. High Availability and Disaster Recovery: We need to provide a service with very high uptime to users, covering both planned and unplanned events.
6. Fault Isolation: In the warehouse usage of Hadoop, individual disk failures affect only a small part of the data and the system quickly recovers from such faults.
7. Range Scans: Several applications require efficient retrieval of a set of rows in a particular range, for example the last 100 messages for a given user or the hourly impression counts over the last 24 hours for a given advertiser. (A minimal range-scan sketch appears at the end of this section.)

HBase is massively scalable and delivers fast random writes as well as random and streaming reads. It also provides row-level atomicity guarantees, but no native cross-row transactional support. From a data model perspective, column orientation gives extreme flexibility in storing data, and wide rows allow the creation of billions of indexed values within a single table. HBase is ideal for workloads that are write-intensive, need to maintain a large amount of random-access indices, and need the flexibility to scale out quickly.
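For the range-scan requirement, HBase expresses a range through the client Scan object with start and stop rows. The sketch below uses the older client API; the table name, key layout (a per-user prefix) and column family are assumptions made purely for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "Twitter");        // table name assumed for illustration

            // Rows are sorted by row key, so keys sharing the prefix "user42#" form one contiguous range.
            Scan scan = new Scan(Bytes.toBytes("user42#"), Bytes.toBytes("user42$"));
            scan.addFamily(Bytes.toBytes("message"));

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            } finally {
                scanner.close();                               // always release the scanner
                table.close();
            }
        }
    }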


5. RESULTS

1) Data Collection Framework: The Twitter Search API searches an index of recent Tweets. At the moment that index includes between 6 and 9 days of Tweets, so the Search API cannot be used to find Tweets older than about a week. Queries can be limited due to complexity. Search does not support authentication, meaning all queries are made anonymously. Search is focused on relevance and not completeness, which means that some Tweets and users may be missing from search results.

The Data Collection Framework is an API for running searches against the index of recent tweets based on keywords. Here the Search API is used to fetch JSON feeds. Using the framework we can keep an index that includes all the Tweets related to a particular keyword; that is, we can find Tweets older than one month or more. Queries can be flexible to get various attributes related to the Tweets. Our framework shows details such as the message, the timestamp of the Tweet, and author identification such as image and user id. All the details are stored in HBase, from which they can be retrieved later. Further analysis can be done using these data.

2) Hadoop-HBase Performance Evaluation:

HBase vs RDBMS

HBase Twitter Message Example: Table Twitter with column family message. A row is a RowKey with columns:
• message:text stores the tweet message
• message:date stores the timestamp of the tweet
• message:userID stores the id of the user
If processing raw data for hyperlinks and images, add column families links and images:
• links:<RowKey>, a column for each hyperlink
• images:<RowKey>, a column for each image

RDBMS Twitter Message Example: Table Twitter with columns RowKey, text, date, userID; table links with columns RowKey and link; table images with columns RowKey and image. How will this scale? Consider 10M documents with an average of 10 links and 10 images each.

A rough sketch of creating the HBase table through the Java admin API is shown below.
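The following is a sketch of how such a table could be created with the admin API of that era (HBaseAdmin, HTableDescriptor); it mirrors the family layout listed above and is not code taken from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTwitterTableSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // One table, three column families, as in the schema sketched above.
            HTableDescriptor desc = new HTableDescriptor("Twitter");
            desc.addFamily(new HColumnDescriptor("message"));   // message:text, message:date, message:userID
            desc.addFamily(new HColumnDescriptor("links"));     // links:<RowKey>, one column per hyperlink
            desc.addFamily(new HColumnDescriptor("images"));    // images:<RowKey>, one column per image

            if (!admin.tableExists("Twitter")) {
                admin.createTable(desc);
            }
        }
    }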

3) Column Test

    # of columns     Writes/sec    Reads/sec
    1,000                55            15
    10,000               42            41

Test Description: The BigTable paper claims that BigTable can handle an unbounded number of columns. This test was designed to test that claim within HBase. The test worked by creating a table with a single column family and then writing one 1000-byte value to that column family for each of the specified number of columns.
Test Analysis: The table shows that HBase does not hold up well as the number of columns increases to very large values. Write performance suffers somewhat, but read performance suffers a lot. This is probably because, as the number of columns increases, the reads have a higher chance of having to fetch the row from disk instead of from memory. Testing 1,000,000 columns exposed an HBase bug that caused the program to crash.

4) Sort Test

    # of rows     Lexicographic    Reverse    Random
    10,000             485            450        462
    100,000            430            471        421

Test Description: HBase stores rows lexicographically by row key. The motivation of this test was to determine whether performance changes if rows are written in reverse lexicographic order or in random order.
Test Analysis: The results show that there is no significant performance loss when rows are input in reverse lexicographic order or in random order. In fact, the test performed slightly better with 100,000 rows.

5) Interspersed Read/Write Test

    # of writes     Reads/sec
    10,000              95
    100,000            100
    -                   42

Test Description: The goal of this test is to determine the performance hit when reading from very large tables. The test works by writing a specified number of rows, then reading back 5000 of them randomly and averaging the elapsed time.


Secondly, scanning data structures will be more costly as their size increases.

6) Testing by Number of Column Families

In this test, we studied the speed of reading, writing and updating rows in a table with multiple column families, and we tried to identify the maximum number of column families that can be used in an HBase table. HBase performance was evaluated using tables of 10, 50, 100, 500, 700 and 1,000 column families. For each of the above cases the following steps were performed (a rough code sketch of this procedure appears after the table and its discussion):
Step 1. A table called "test" was created with the given number of column families, a single column and a single row. For this single row, a randomly generated 1000-byte value was inserted in each column (sequential insert).
Step 2. Using the table created at Step 1, 5000 sequential reads were made.
Step 3. Using the table created at Step 1, 5000 random reads were made.
Step 4. Using the table created at Step 1, sequential updates of the rows in the table were made.

    # of column families                        10     50    100    500              700
    Setup time for one col. family (ms)         12     24     45    133              100
    Sequential insert (col. families/sec)      181     41     77     51               32
    Sequential reads (col. families/sec)       800     23    142      6    Crash (Eclipse)
    Random reads (col. families/sec)           800     23    140      6    Crash (Eclipse)
    Sequential updates (col. families/sec)     136     50     76     56    Crash (Eclipse)

The table shows that one of the main problems is the time needed to add column families to the table: the time needed to set up a family increases with the number of column families.
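The per-family measurements above can be reproduced in outline with a small driver that creates the test table with the requested number of families, writes one 1000-byte value per family, and times the reads. This is only a sketch of Steps 1 through 3 under the older client API, not the harness actually used for the measurements; sequential updates (Step 4) are omitted.

    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ColumnFamilyTestSketch {
        public static void main(String[] args) throws Exception {
            int families = Integer.parseInt(args[0]);            // e.g. 10, 50, 100, 500, 700
            Configuration conf = HBaseConfiguration.create();

            // Step 1: create the "test" table with the requested number of column families.
            HTableDescriptor desc = new HTableDescriptor("test");
            for (int i = 0; i < families; i++) {
                desc.addFamily(new HColumnDescriptor("f" + i));
            }
            new HBaseAdmin(conf).createTable(desc);

            // Step 1 (continued): one row, one random 1000-byte value per family (sequential insert).
            HTable table = new HTable(conf, "test");
            byte[] row = Bytes.toBytes("row1");
            byte[] value = new byte[1000];
            new Random().nextBytes(value);
            long start = System.currentTimeMillis();
            for (int i = 0; i < families; i++) {
                Put put = new Put(row);
                put.add(Bytes.toBytes("f" + i), Bytes.toBytes("col"), value);
                table.put(put);
            }
            System.out.println("insert, families/sec: " + families * 1000.0 / (System.currentTimeMillis() - start));

            // Steps 2 and 3: time 5000 single-family reads (sequential here; shuffle the index for random reads).
            start = System.currentTimeMillis();
            for (int i = 0; i < 5000; i++) {
                Get get = new Get(row);
                get.addFamily(Bytes.toBytes("f" + (i % families)));
                table.get(get);
            }
            System.out.println("reads/sec: " + 5000.0 * 1000.0 / (System.currentTimeMillis() - start));
            table.close();
        }
    }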


Google specialists argue that the column families should be kept to a limited number, not exceeding a few hundred. Our tests show that the number can go up to 500, but the performance decreases considerably. At more than 500 column families the table could still be built, but could not be used. When trying to build a table with 1000 column families, the answer was a clear one: "Connection refused".

6. CONCLUSION

Our tests were performed on an Ubuntu platform. Hadoop together with HBase, being oriented towards columns, is often compared with relational databases. As we pointed out, HBase is a distributed database system oriented on using columns. It is a continuation of Hadoop, offering read/write access and having its data storage based on HDFS. It was built from scratch following a few important goals: a very large number of rows (on the order of billions), a large number of columns (on the order of millions), together with horizontal partitioning and the ability of easy replication over a large number of nodes in the system.

The general relational databases have a fixed schema, use rows and columns, offer the ACID properties and have a powerful SQL engine behind them. The accent is placed on consistency, on referential integrity, on abstraction from the physical layer and on complex queries through the SQL language. One can easily create secondary indexes, perform complex inner and outer joins, use functions such as Sum, Count and Sort, and track data across multiple tables, rows and columns.

Among the advantages of using a Hadoop and HBase platform we find that the data is processed in parallel, so the execution time decreases. The data is replicated, so there is always a backup, and the problems of storing data on a single machine are passed to HDFS. Another advantage is that one or more column families can be added to a table at any time.

A disadvantage might be that HBase does not support joins between tables. This is not a major drawback, because the related information must be kept together in a single table, where it is more easily accessible; in this way, the need for joins is eliminated. Extracting data from HBase proved to be fast regardless of the number of entries in the table. In addition, HBase provides the possibility of conducting analyses in which the HBase + MapReduce combination proved more effective than sequential/random reads once the table scan option is implemented within HBase. Another benefit of HBase is its use of ZooKeeper, which relieves the master node of various tasks such as tracking the availability of cluster servers and pointing client applications to the root catalog table.

FUTURE WORK

We are trying to work on the analysis of the data stored in HBase.


Future work also targets modularity, pluggability, and coexistence, on both the storage and application execution tiers. The use of Hadoop and HBase at Facebook is just getting started and we expect to make several iterations on this suite of technologies and continue to optimize for our applications. As we try to use HBase for more applications, we have discussed adding support for maintenance of secondary indices and summary views in HBase. Finally, as we try to use Hadoop and HBase for applications that are built to serve the same data in an active-active manner across different data centers, we are exploring approaches to deal with multi data-center replication and conflict resolution. Our data collection framework will be useful for other social networks besides Twitter.

ACKNOWLEDGMENTS

The current state of the Hadoop Realtime Infrastructure has been the result of ongoing work over the last couple of years. Last but not least, thanks are also due to the users of our infrastructure who have patiently dealt with periods of instability during its evolution and have provided valuable feedback that enabled us to make continued improvements to this infrastructure.

REFERENCES

[1] Apache Hadoop. Available at http://hadoop.apache.org
[2] Apache HDFS. Available at http://hadoop.apache.org/hdfs
[3] Apache HBase. Available at http://hbase.apache.org
[4] The Google File System. Available at http://labs.google.com/papers/gfs-sosp2003.pdf
[5] Tom White. Hadoop: The Definitive Guide.
[6] Dorin Carstoiu, Elena Lepadatu, Mihai Gaspar. HBase - non SQL Database, Performances Evaluation.
[7] BigTable: A Distributed Storage System for Structured Data. Available at http://labs.google.com/papers/bigtable-osdi06.pdf
[8] Lars George. HBase: The Definitive Guide.
[9] Bruce Snyder, Dejan Bosanac, Rob Davies. ActiveMQ in Action.
[10] Apache Hadoop Goes Realtime at Facebook. Available at borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf
[11] D. Carstoiu, A. Cernian, A. Olteanu, "Politehnica" University. Hadoop HBase-0.20.2 Performance Evaluation.
