Hadoop Developer Day
Nicolas Morales IBM Big Data
[email protected] @NicolasJMorales
Big Data Developers @
FREE Monthly Events
– San Jose & Foster City: full-day Developer Days, afternoon & evening hackathons
– Past meetups covered: Text Analytics, Real-time Analytics, SQL for Hadoop, HBase, Social Media Analytics, Machine Data Analytics, Security and Privacy
– Development environment provided; live streaming; topic suggestions welcome
http://www.meetup.com/BigDataDevelopers/
NEXT MEETUP: Streams Developer Day on Thursday, April 17. Coming soon: Big R, Watson, Big Data in the Cloud, Big SQL, MongoDB & more!
© 2013 IBM Corporation
Agenda: Hadoop Developer Day

Time                  | Subject
8:00 AM – 9:00 AM     | Registration & Breakfast
9:00 AM – 9:30 AM     | Introduction to Hadoop
9:30 AM – 11:00 AM    | Hadoop Architecture and HDFS + Hands-on Lab
11:00 AM – 11:45 AM   | Introduction to MapReduce
11:45 AM – 12:45 PM   | Lunch
12:45 PM – 2:00 PM    | MapReduce Hands-on Lab
2:00 PM – 4:00 PM     | Using Hive for Data Warehousing + Hands-on Lab
4:00 PM – 6:00 PM     | SQL for Hadoop + Hands-on Lab
6:00 PM               | Closing Remarks
Big Data University www.bigdatauniversity.com
Quick Start Edition VM
Download (.tar.gz): http://ibm.co/QuickStart – unpack using WinRAR, 7-Zip, etc.
Your feedback is important – please complete your survey
Introduction to Hadoop
Rafael Coss IBM Big Data
[email protected] @racoss 9
Executive Summary

What's Big Data?
– More Analytics on More Data for More People
– More than just Hadoop

What's Hadoop?
– A distributed computing framework that is:
  • Cost Effective
  • Flexible
  • Fault Tolerant

What's a Hadoop Distribution?
– Common set of Apache projects
– Installer
– Unique value add
Key Business-driven Use Cases: Improve Business Outcomes
– Enrich Your Base Information with Big Data Exploration
– Improve Customer Interaction with an Enhanced 360º View of the Customer
– Help Reduce Risk and Prevent Fraud with Security and Intelligence Extension
– Optimize Infrastructure and Monetize Data with Operations Analysis
– Gain IT efficiency and scale with Data Warehouse Modernization

Representative results: 99% reduction in time required for analysis; 1,100 publishing partnerships; 60K metered customers in five states; 42 TB of acoustic data analyzed; 40X gain in analysis performance
Why is Big Data important?

The gap between the data AVAILABLE to an organization and the data an organization can PROCESS keeps growing: organizations are able to process less and less of the available data, and enterprises are "more blind" to new opportunities.

100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net – 80% of them spam and viruses. => Pre-filtering is more and more important.
What is Big Data?

More Analytics on More Data for More People

– Transactional & Application Data: Volume; Structured; Throughput
– Machine Data: Velocity; Semi-structured; Ingestion
– Social Data: Variety; Highly unstructured; Veracity
– Enterprise Content: Variety; Highly unstructured; Volume
Every Industry can Leverage Big Data and Analytics
– Banking: Optimizing Offers and Cross-sell; Customer Service and Call Center Efficiency; Fraud Detection & Investigation; Credit & Counterparty Risk
– Insurance: 360˚ View of Domain or Subject; Catastrophe Modeling; Fraud & Abuse; Producer Performance Analytics; Analytics Sandbox
– Retail: Actionable Customer Insight; Merchandise Optimization; Dynamic Pricing
– Travel & Transport: Customer Analytics & Loyalty Marketing; Predictive Maintenance Analytics; Capacity & Pricing Optimization
– Telco: Pro-active Call Center; Network Analytics; Location Based Services
– Consumer Products: Shelf Availability; Promotional Spend Optimization; Merchandising Compliance; Promotion Exceptions & Alerts
– Energy & Utilities: Smart Meter Analytics; Distribution Load Forecasting/Scheduling; Condition Based Maintenance; Create & Target Customer Offerings
– Government: Civilian Services; Defense & Intelligence; Tax & Treasury Services
– Media & Entertainment: Business process transformation; Audience & Marketing Optimization; Multi-Channel Enablement; Digital commerce optimization
– Healthcare: Measure & Act on Population Health Outcomes; Engage Consumers in their Healthcare
– Life Sciences: Increase visibility into drug safety and effectiveness
– Automotive, Chemical & Petroleum, Aerospace & Defense, Electronics: Advanced Condition Monitoring; Operational Surveillance, Analysis & Optimization; Uniform Information Access Platform; Customer/Channel Analytics; Data Warehouse Optimization; Airliner Certification Platform; Actionable Customer Intelligence; Data Warehouse Consolidation, Integration & Augmentation; Big Data Exploration for Interdisciplinary Collaboration
Big Data use study

Big data adoption: when segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors.

2012 Big Data @ Work Study, surveying 1,144 business and IT professionals in 95 countries
Warehouse Modernization Has Two Themes

Traditional Analytics – Structured & Repeatable (structure built to store data):
– IT team builds the system to answer known questions, then delivers data
– Business users determine what questions to ask
– Analyzed information is a capacity-constrained down-sampling of the available information
– Carefully cleanse all information before any analysis

Big Data Analytics – Iterative & Exploratory (data is the structure):
– Analyze ALL available information; whole-population analytics connects the dots
– Business users explore and ask any question on flexible information
– Analyze information as is & cleanse as needed
Warehouse Modernization Has Two Themes

Traditional Analytics – Structured & Repeatable:
– Hypothesis → Question → Data → Answer
– Start with a hypothesis; test against selected data
– Analyze after landing…

Big Data Analytics – Iterative & Exploratory:
– All Information → Exploration → Correlation → Actionable Insight
– Data leads the way: explore all data, identify correlations
– Analyze in motion…
Getting the Value from Big Data – Why a Platform?

BIG DATA PLATFORM:
– Systems Management; Application Development; Discovery; Accelerators
– Engines: Hadoop System, Stream Computing, Data Warehouse
– Information Integration & Governance
– Data sources: data, media, content, machine, social

The Whole is Greater than the Sum of the Parts:
– Almost all big data use cases require an integrated set of big data technologies to address the business pain completely
– Reduce time and cost and provide quick ROI by leveraging pre-integrated components
– Provide both out-of-the-box and standards-based services
– Start small with a single project and progress to others over your big data journey
Watson Foundations Differentiators

Data types (machine and sensor data, image and video, enterprise content, transaction & application data, third-party data) flow through real-time processing & analytics (Streams, data replication) into operational systems and into the exploration, landing and archive and trusted data zones. These feed predictive analytics & modeling; reporting, analysis and content analytics; discovery and exploration; decision management; and deep analytics & modeling – all under Information Integration & Governance – to produce actionable insight.

1. More than Hadoop: greater resiliency and recoverability; advanced workload management, multi-tenancy; enhanced, flexible storage management (GPFS); enhanced data access (Big SQL, Search); analytics accelerators & visualization; enterprise-ready security framework
2. Data in Motion: enterprise-class stream processing & analytics
3. Analytics Everywhere: richest set of analytics capabilities; ability to analyze data in place
4. Governance Everywhere: complete integration & governance capabilities; ability to govern all data wherever it is
5. Complete Portfolio: end-to-end capabilities to address all needs; ability to grow and address future needs; remains open to work with existing investments
IBM Big Data & Analytics

New/Enhanced Applications run on IBM Watson Foundations:
– All Data: Real-time Data Processing & Analytics; Operational data zone; Landing, Exploration and Archive data zone; EDW and data mart zone
– Cognitive Fabric answering: What is happening? (discovery and exploration); Why did it happen? (reporting and analysis); What could happen? (predictive analytics and modeling); What action should I take? (decision management); plus Deep Analytics
– Information Integration & Governance
– Systems, Security, Storage: on premise, cloud, as a service

IBM Big Data & Analytics Infrastructure
What is Hadoop?

– Apache open source software framework for reliable, scalable, distributed computing over massive amounts of data
– Hides underlying system details and complexities from the user
– Developed in Java
– Core subprojects: MapReduce, HDFS, Hadoop Common
– Supported by several Hadoop-related projects: HBase, ZooKeeper, Avro, etc.
– Meant for heterogeneous commodity hardware
Design principles of Hadoop

A new way of storing and processing the data – let the system handle most of the issues automatically:
• Failure
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of the operating system
• Relatively inexpensive hardware ($2–4K)

Hadoop = HDFS + MapReduce infrastructure + …

Optimized to handle:
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware

Reliability provided through replication
Hadoop is not for all types of work
– Not for processing transactions (random access)
– Not good when work cannot be parallelized
– Not good for low-latency data access
– Not good for processing lots of small files
– Not good for intensive calculations with little data
Who uses Hadoop?
Map-Reduce → Hadoop → BigInsights
What is Apache Hadoop?

Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo; originally built to address scalability problems of Nutch, an open source web search technology
– Well-suited to batch-oriented, read-intensive applications

Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + disks = "node"
– Nodes can be combined into clusters
– New nodes can be added as needed without changing data formats, how data is loaded, or how jobs are written
Hadoop Open Source Projects
Hadoop is supplemented by an ecosystem of open source projects
How do I leverage Hadoop to create new value for my enterprise?
– Technologies: Hadoop, Pig, Hive, ZooKeeper, Jaql, HBase, Oozie, Flume; HDFS, MapReduce, AQL
– Scale: terabytes, petabytes, exabytes
– Workloads: log analysis, CDRs, machine learning, sentiment analysis, …
What's a Hadoop Distribution?

What's a Linux Distribution?
– Linux kernel
– Open source tools around the kernel
– Installer
– Administration UI

Open Source Distribution Formula:
– Kernel
– Core projects around the kernel
– Value add: tested components, installer, administration UI, apps
– (Compare WebSphere / WAS)

A Hadoop distribution (BigInsights):
– 25+ Apache projects + additional open source + installer + IBM value add
BigInsights: Value Beyond Open Source

Key differentiators – Enterprise Capabilities: Visualization & Exploration; Development Tools; Advanced Engines; Connectors; Workload Optimization; Administration & Security
Open source components: IBM-certified Apache Hadoop

• Built-in analytics: text engine, annotators, Eclipse tooling; interface to project R (statistical platform)
• Enterprise software integration
• Spreadsheet-style analysis
• Integrated installation of supported open source and other components
• Web console for admin and application access
• Platform enrichment: additional security, performance features, …
• World-class support
• Full open source compatibility

Business benefits
• Quicker time-to-value due to IBM technology and support
• Reduced operational risk
• Enhanced business knowledge with a flexible analytical platform
• Leverages and complements existing software
From Getting Started to Enterprise Deployment: Different BigInsights Editions for Varying Needs

Enterprise Edition (enterprise class):
– Accelerators
– GPFS–FPO
– Adaptive MapReduce
– Text analytics
– Enterprise integration
– Monitoring and alerts
– InfoSphere Streams*
– Watson Explorer*
– Cognos BI*
– …

Standard Edition:
– Spreadsheet-style tool
– Web console
– Dashboards
– Pre-built applications
– Eclipse tooling
– RDBMS connectivity
– Big SQL
– Jaql
– Platform enhancements
– …

Quick Start Edition: free, non-production

Apache Hadoop

* Limited use license
(Editions increase in breadth of capabilities)
IBM Enriches Hadoop

Scalable
– Massively parallel computing on commodity servers
– New nodes can be added on the fly
Affordable
Flexible
– Hadoop is schema-less, and can absorb any type of data
Fault Tolerant
– Through the MapReduce software framework

Enterprise Hardening of Hadoop
– Performance & reliability: Adaptive MapReduce, compression, indexing, flexible scheduler, +++
– Productivity accelerators: web-based UIs and tools, end-user visualization, analytic accelerators, +++
– Enterprise integration: to extend & enrich your information supply chain
Big Database Vendors Adopt Hadoop
IBM Internal Use Only
Competing Hadoop Distribution Vendors
Cloudera
– "Cloudera makes it easy to run open source Hadoop in production"
– "Focus on deriving business value from all your data instead of worrying about managing Hadoop"

Hortonworks
– "Make Hadoop easier to consume for enterprises and technology vendors"
– "Provide expert support by the leading contributors to the Apache Hadoop open source projects"

EMC Greenplum HD / Pivotal HD
– "Provides a complete platform including installation, training, global support, and value-add beyond simple packaging of the Apache Hadoop distribution"

MapR
– "High Performance Hadoop, up to 2-5 times faster performance than Apache-based distributions"
– "The first distribution to provide high availability at all levels making it more dependable"

Amazon Elastic MapReduce
– "Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit"
Capabilities Required for Hadoop Style Workloads
– Application Support and Development Tooling
– Visualization & Discovery
– Analytics Engines
– Data Ingest
– Cluster and Workload Management
– Runtime
– Data Store
– Security
– File System
Open Source Hadoop Components
– Visualization & Discovery: Lucene
– Application Support and Development Tooling: Pig, Hive, Oozie, Avro, HCatalog
– Analytics Engines / Runtime: MapReduce
– Data Ingest: Flume, Sqoop
– Cluster Optimization and Management: ZooKeeper
– Data Store: HBase, Derby
– Security
– File System: HDFS
(All open source)
Open Source Components Across Distributions

Component | BigInsights 2.0 | HortonWorks HDP 1.2 | MapR 2.0 | Greenplum HD 1.2 | Cloudera CDH3u5 | Cloudera CDH4*
Hadoop    | 1.0.3  | 1.1.2  | 0.20.2 | 1.0.3  | 0.20.2 | 2.0.0*
HBase     | 0.94.0 | 0.94.2 | 0.92.1 | 0.92.1 | 0.90.6 | 0.92.1
Hive      | 0.9.0  | 0.10.0 | 0.9.0  | 0.8.1  | 0.7.1  | 0.8.1
Pig       | 0.10.1 | 0.10.1 | 0.10.0 | 0.9.2  | 0.8.1  | 0.9.2
ZooKeeper | 3.4.3  | 3.4.5  | X      | 3.3.5  | 3.3.5  | 3.4.3
Oozie     | 3.2.0  | 3.2.0  | 3.1.0  | X      | 2.3.2  | 3.1.3
Avro      | 1.6.3  | X      | X      | X      | X      | X
Flume     | 0.9.4  | 1.3.0  | 1.2.0  | X      | 0.9.4  | 1.1.0
Sqoop     | 1.4.1  | 1.4.2  | 1.4.1  | X      | 1.3.0  | 1.4.1
HCatalog  | 0.4.0  | 0.5.0  | 0.4.0  | X      | X      | X
Two Key Aspects of Hadoop

Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system

MapReduce framework
– How Hadoop understands and assigns work to the nodes (machines)
What is the Hadoop Distributed File System?

– HDFS stores data across multiple nodes
– HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
– The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS
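The replication idea above can be illustrated with a toy simulation in plain Java. This is not the real HDFS code – the class name, the round-robin placement, and the fixed block size are illustrative assumptions (real HDFS uses a rack-aware placement policy): a file is split into fixed-size blocks, each block is replicated on 3 of the data nodes, and every block remains reachable after any single node fails.

```java
import java.util.*;

public class HdfsReplicationSketch {
    // Assign each block to `replication` distinct nodes (round-robin placement,
    // a simplification of HDFS's real rack-aware policy).
    static Map<Integer, Set<Integer>> placeBlocks(long fileBytes, long blockSize,
                                                  int nodes, int replication) {
        int blocks = (int) ((fileBytes + blockSize - 1) / blockSize); // ceiling division
        Map<Integer, Set<Integer>> placement = new HashMap<>();
        for (int b = 0; b < blocks; b++) {
            Set<Integer> replicas = new HashSet<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((b + r) % nodes);
            }
            placement.put(b, replicas);
        }
        return placement;
    }

    // A block survives a node failure if at least one replica lives elsewhere.
    static boolean allBlocksAvailable(Map<Integer, Set<Integer>> placement, int failedNode) {
        for (Set<Integer> replicas : placement.values()) {
            Set<Integer> alive = new HashSet<>(replicas);
            alive.remove(failedNode);
            if (alive.isEmpty()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // 1 GB file, 64 MB blocks (the Hadoop 1.x default), 10 nodes, replication factor 3
        Map<Integer, Set<Integer>> p = placeBlocks(1L << 30, 64L << 20, 10, 3);
        System.out.println("blocks: " + p.size());                              // 16
        System.out.println("survives node 0 failing: " + allBlocksAvailable(p, 0)); // true
    }
}
```

With replication factor 3, losing any single node never makes a block unavailable, which is the reliability property the slide claims.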
MapReduce

MAP:
– Take a large problem and divide it into sub-problems: break the data set down into small chunks
– Perform the same function on all sub-problems: DoWork() … DoWork() … DoWork()

REDUCE:
– Combine the output from all sub-problems
– Output
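The map → combine flow above can be sketched as a tiny in-memory word count in plain Java. This is a toy, not the Hadoop API – the class and method names are illustrative assumptions: map emits (word, 1) pairs per input chunk, a shuffle step groups the pairs by key, and reduce sums each group.

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: each input chunk independently emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> partition : mapOutputs)
            for (Map.Entry<String, Integer> kv : partition)
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        return grouped;
    }

    // Reduce: sum the values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, ones) ->
            result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    static Map<String, Integer> wordCount(List<String> chunks) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String chunk : chunks) mapOutputs.add(map(chunk)); // would run in parallel on a cluster
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data big", "data is big")));
        // prints {big=3, data=2, is=1}
    }
}
```

On a real cluster the map calls run in parallel on the nodes holding the chunks, and the shuffle moves interim output over the network; the dataflow, however, is exactly this.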
MapReduce Example

Hadoop computation model:
– Data stored in a distributed file system spanning many inexpensive computers
– Bring function to the data: distribute the application to the compute resources where the data is stored
– Scalable to thousands of nodes and petabytes of data

Processing flow on the Hadoop data nodes:
1. Map phase (break the job into small parts); distribute map tasks to the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set); the MapReduce application returns a single result set

Word-count mapper and reducer:

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> val, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
So What Does This Result In?
Easy To Scale
Fault Tolerant and Self-Healing
Data Agnostic
Extremely Flexible
Resources
– bigdatauniversity.com
– youtube.com/ibmBigData
– Quick Start Editions: ibm.co/quickstart, ibm.co/streamsqs
– ibm.meetup.com
– ibmdw.net/streamsdev, ibm.co/streamscon
– ibmbigdatahub.com
– ibm.co/bigdatadev
– http://tinyurl.com/biginsights (links to demos, papers, forum, downloads, etc.)
Thank You! Your feedback is important – please fill out the survey.
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2014. All rights reserved.
• U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Other company, product, or service names may be trademarks or service marks of others.
Backup
Global TLE Framework
Implications of Big Data

Just reading 100 terabytes is slow
– Standard computer (100 MB/s): ~11 days
– Across a 10 Gbit link (high-end storage): 1 day
– 1,000 standard computers: 15 minutes!

Seek times for random disk access are a problem
– 1 TB data set with 10^10 100-byte records:
  • Updates to 1% would require 1 month
  • Reading and rewriting the whole data set would take 1 day*

One node is not enough! Need to scale out, not up!

* From the Hadoop mailing list
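The timings above follow from simple throughput arithmetic. A quick check (assuming 100 MB/s sequential throughput per node and ignoring coordination overhead; the slide's "15 minutes" assumes slightly higher per-node throughput than this model):

```java
public class ScanTimeMath {
    // Seconds to read `bytes` at `bytesPerSec` per node, split evenly across `nodes`.
    static double scanSeconds(double bytes, double bytesPerSec, int nodes) {
        return bytes / (bytesPerSec * nodes);
    }

    public static void main(String[] args) {
        double hundredTb = 100e12; // 100 terabytes in bytes
        System.out.printf("1 node @ 100 MB/s: %.1f days%n",
            scanSeconds(hundredTb, 100e6, 1) / 86400);     // ~11.6 days
        System.out.printf("10 Gbit link:      %.1f days%n",
            scanSeconds(hundredTb, 1.25e9, 1) / 86400);    // ~0.9 days
        System.out.printf("1000 nodes:        %.1f min%n",
            scanSeconds(hundredTb, 100e6, 1000) / 60);     // ~16.7 min
    }
}
```

The scan time divides almost linearly by the node count, which is the whole argument for scaling out.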
Scaling out

Bad news: nodes fail, especially if you have many
– Mean time between failures for 1 node = 3 years; for 1,000 nodes = 1 day
– Super-fancy hardware still fails, and commodity machines give better performance per dollar
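The 1-node-vs-1,000-node failure figure is simple division, assuming independent node failures (a standard simplification): with a per-node MTBF of 3 years, a 1,000-node cluster sees a failure roughly every day.

```java
public class ClusterMtbf {
    // Expected days between failures somewhere in the cluster,
    // assuming independent, identically distributed node failures.
    static double clusterMtbfDays(double nodeMtbfYears, int nodes) {
        return nodeMtbfYears * 365.0 / nodes;
    }

    public static void main(String[] args) {
        System.out.printf("1 node:     %.0f days%n", clusterMtbfDays(3, 1));    // 1095 days
        System.out.printf("1000 nodes: %.1f days%n", clusterMtbfDays(3, 1000)); // ~1.1 days
    }
}
```

3 × 365 / 1000 ≈ 1.1 days – the slide's "1 day" is this number rounded; the point is that at cluster scale, failure is routine rather than exceptional.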
Bad news II: distributed programming is hard
– Communication, synchronization, and deadlocks
– Recovering from machine failure
– Debugging
– Optimization
A new model is needed
• It's all about the right level of abstraction
– Hide system-level details from the developers
– No more race conditions, lock contention, etc.
• Separate the what from the how
– The developer specifies the computation that needs to be performed
– The execution framework ("runtime") handles the actual execution
MapReduce
(Diagram: traditional computing vs. MapReduce computing)
MapReduce, the reality
Many nodes, little communication between the nodes, some stragglers and failures
Big Difference: Schema on Run

Regular database – schema on load:
  Raw data → schema to filter → storage (pre-filtered data) → output

Big Data (Hadoop) – schema on run:
  Raw data → storage (unfiltered, raw data) → schema to filter → output
Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):
– Schema must be defined before any data is loaded
– An explicit load operation has to take place which transforms data to the internal DB structure
– New columns must be added explicitly before new data for such columns can be loaded into the database
– Pros: reads are fast; standards/governance

Schema-on-Read (Hadoop):
– A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
– Data is copied to the file store; no transformation is needed
– New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
– Pros: loads are fast; flexibility/agility
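The late-binding idea can be sketched in plain Java. This is a toy, not Hive's actual SerDe interface – the class name, the comma-separated sample data, and the column names are illustrative assumptions: raw lines land unmodified, a "SerDe" function extracts columns only at read time, and exposing a new column means updating the reader, not reloading the data.

```java
import java.util.*;
import java.util.function.Function;

public class SchemaOnReadSketch {
    // Raw data lands unmodified; no schema is enforced at load time.
    static final List<String> rawStore = new ArrayList<>();

    static void load(String rawLine) { rawStore.add(rawLine); }

    // A toy "SerDe": applied per line at read time to extract the requested columns.
    static List<Map<String, String>> read(Function<String, Map<String, String>> serde) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : rawStore) rows.add(serde.apply(line));
        return rows;
    }

    public static void main(String[] args) {
        load("2014-04-17,hadoop,42");
        load("2014-04-18,hive,17");

        // The first reader only extracts two columns...
        Function<String, Map<String, String>> v1 = line -> {
            String[] f = line.split(",");
            return Map.of("date", f[0], "topic", f[1]);
        };
        System.out.println(read(v1).get(0).get("topic")); // hadoop

        // ...later the SerDe is updated to expose a third column.
        // The already-loaded raw data shows the new column retroactively.
        Function<String, Map<String, String>> v2 = line -> {
            String[] f = line.split(",");
            return Map.of("date", f[0], "topic", f[1], "attendees", f[2]);
        };
        System.out.println(read(v2).get(1).get("attendees")); // 17
    }
}
```

In a schema-on-write system the third column would have required an ALTER TABLE plus a reload; here only the read-time projection changed.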
Scalability: Scalable Software Deployment