Hadoop Developer Day
Nicolas Morales IBM Big Data
[email protected] @NicolasJMorales
Big Data Developers @
FREE Monthly Events
– San Jose & Foster City: full-day Developer Days, afternoon & evening hackathons
– Past meetups covered: Text Analytics, Real-time Analytics, SQL for Hadoop, HBase, Social Media Analytics, Machine Data Analytics, Security and Privacy
– Development environment provided; live streaming; topic suggestions welcome
http://www.meetup.com/BigDataDevelopers/
NEXT MEETUP: Streams Developer Day on Thursday, April 17. Coming soon: Big R, Watson, Big Data in the Cloud, Big SQL, MongoDB & more!
© 2013 IBM Corporation
Agenda: Hadoop Developer Day

Time                  | Subject
8:00 AM – 9:00 AM     | Registration & Breakfast
9:00 AM – 9:30 AM     | Introduction to Hadoop
9:30 AM – 11:00 AM    | Hadoop Architecture and HDFS + Hands-on Lab
11:00 AM – 11:45 AM   | Introduction to MapReduce
11:45 AM – 12:45 PM   | Lunch
12:45 PM – 2:00 PM    | MapReduce Hands-on Lab
2:00 PM – 4:00 PM     | Using Hive for Data Warehousing + Hands-on Lab
4:00 PM – 6:00 PM     | SQL for Hadoop + Hands-on Lab
6:00 PM               | Closing Remarks
Big Data University www.bigdatauniversity.com
Quick Start Edition VM
Download (.tar.gz): http://ibm.co/QuickStart – unpack using WinRAR, 7-Zip, etc.
Your feedback is important – please complete your survey
Introduction to Hadoop
Rafael Coss IBM Big Data
[email protected] @racoss 9
Executive Summary

What's Big Data?
– More Analytics on More Data for More People
– More than just Hadoop

What's Hadoop?
– A distributed computing framework that is:
  • Cost Effective
  • Flexible
  • Fault Tolerant

What's a Hadoop Distribution?
– Common set of Apache projects
– Installer
– Unique value add
Key Business-driven Use Cases: Improve Business Outcomes
– Enrich Your Base Information with Big Data Exploration
– Improve Customer Interaction with an Enhanced 360º View of the Customer
– Help Reduce Risk and Prevent Fraud with Security and Intelligence Extension
– Optimize Infrastructure and Monetize Data with Operations Analysis
– Gain IT efficiency and scale with Data Warehouse Modernization

Representative results: 99% reduction in time required for analysis; 1,100 publishing partnerships; 60K metered customers in five states; 42 TB of acoustic data analyzed; 40X gain in analysis performance
Why is Big Data important?

The gap between the data AVAILABLE to an organization and the data an organization can PROCESS keeps growing: organizations are able to process less and less of the available data, and enterprises are "more blind" to new opportunities.

100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net – 80% of them spam and viruses. => Pre-filtering is more and more important.
What is Big Data?

More Analytics on More Data for More People

– Transactional & Application Data: Volume; Structured; Throughput
– Machine Data: Velocity; Semi-structured; Ingestion
– Social Data: Variety; Highly unstructured; Veracity
– Enterprise Content: Variety; Highly unstructured; Volume
Every Industry can Leverage Big Data and Analytics
– Banking: Optimizing Offers and Cross-sell; Customer Service and Call Center Efficiency; Fraud Detection & Investigation; Credit & Counterparty Risk
– Insurance: 360˚ View of Domain or Subject; Catastrophe Modeling; Fraud & Abuse; Producer Performance Analytics; Analytics Sandbox
– Retail: Actionable Customer Insight; Merchandise Optimization; Dynamic Pricing
– Travel & Transport: Customer Analytics & Loyalty Marketing; Predictive Maintenance Analytics; Capacity & Pricing Optimization
– Telco: Pro-active Call Center; Network Analytics; Location Based Services
– Consumer Products: Shelf Availability; Promotional Spend Optimization; Merchandising Compliance; Promotion Exceptions & Alerts
– Energy & Utilities: Smart Meter Analytics; Distribution Load Forecasting/Scheduling; Condition Based Maintenance; Create & Target Customer Offerings
– Government: Civilian Services; Defense & Intelligence; Tax & Treasury Services
– Media & Entertainment: Business process transformation; Audience & Marketing Optimization; Multi-Channel Enablement; Digital commerce optimization
– Healthcare: Measure & Act on Population Health Outcomes; Engage Consumers in their Healthcare
– Life Sciences: Increase visibility into drug safety and effectiveness
– Automotive, Chemical & Petroleum, Aerospace & Defense, Electronics: Advanced Condition Monitoring; Operational Surveillance, Analysis & Optimization; Uniform Information Access Platform; Customer/Channel Analytics; Data Warehouse Optimization; Airliner Certification Platform; Actionable Customer Intelligence; Data Warehouse Consolidation, Integration & Augmentation; Big Data Exploration for Interdisciplinary Collaboration
Big Data use study

Big data adoption: when segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors.

2012 Big Data @ Work Study, surveying 1,144 business and IT professionals in 95 countries
Warehouse Modernization Has Two Themes

Traditional Analytics – Structured & Repeatable (structure built to store data):
– IT team builds the system to answer known questions, then delivers data
– Business users determine what questions to ask
– Analyzed information is a capacity-constrained down-sampling of the available information
– Carefully cleanse all information before any analysis

Big Data Analytics – Iterative & Exploratory (data is the structure):
– Analyze ALL available information; whole-population analytics connects the dots
– Business users explore and ask any question on flexible information
– Analyze information as is & cleanse as needed
Warehouse Modernization Has Two Themes

Traditional Analytics – Structured & Repeatable:
– Hypothesis → Question → Data → Answer
– Start with a hypothesis; test against selected data
– Analyze after landing…

Big Data Analytics – Iterative & Exploratory:
– All Information → Exploration → Correlation → Actionable Insight
– Data leads the way: explore all data, identify correlations
– Analyze in motion…
Getting the Value from Big Data – Why a Platform?

BIG DATA PLATFORM:
– Systems Management; Application Development; Discovery; Accelerators
– Engines: Hadoop System, Stream Computing, Data Warehouse
– Information Integration & Governance
– Data sources: data, media, content, machine, social

The Whole is Greater than the Sum of the Parts:
– Almost all big data use cases require an integrated set of big data technologies to address the business pain completely
– Reduce time and cost and provide quick ROI by leveraging pre-integrated components
– Provide both out-of-the-box and standards-based services
– Start small with a single project and progress to others over your big data journey
Watson Foundations Differentiators

Data types (machine and sensor data, image and video, enterprise content, transaction & application data, third-party data) flow through real-time processing & analytics (Streams, data replication) into operational systems and into the exploration, landing and archive and trusted data zones. These feed predictive analytics & modeling; reporting, analysis and content analytics; discovery and exploration; decision management; and deep analytics & modeling – all under Information Integration & Governance – to produce actionable insight.

1. More than Hadoop: greater resiliency and recoverability; advanced workload management, multi-tenancy; enhanced, flexible storage management (GPFS); enhanced data access (Big SQL, Search); analytics accelerators & visualization; enterprise-ready security framework
2. Data in Motion: enterprise-class stream processing & analytics
3. Analytics Everywhere: richest set of analytics capabilities; ability to analyze data in place
4. Governance Everywhere: complete integration & governance capabilities; ability to govern all data wherever it is
5. Complete Portfolio: end-to-end capabilities to address all needs; ability to grow and address future needs; remains open to work with existing investments
IBM Big Data & Analytics

New/Enhanced Applications run on IBM Watson Foundations:
– All Data: Real-time Data Processing & Analytics; Operational data zone; Landing, Exploration and Archive data zone; EDW and data mart zone
– Cognitive Fabric answering: What is happening? (discovery and exploration); Why did it happen? (reporting and analysis); What could happen? (predictive analytics and modeling); What action should I take? (decision management); plus Deep Analytics
– Information Integration & Governance
– Systems, Security, Storage: on premise, cloud, as a service

IBM Big Data & Analytics Infrastructure
What is Hadoop?

– Apache open source software framework for reliable, scalable, distributed computing over massive amounts of data
– Hides underlying system details and complexities from the user
– Developed in Java
– Core subprojects: MapReduce, HDFS, Hadoop Common
– Supported by several Hadoop-related projects: HBase, ZooKeeper, Avro, etc.
– Meant for heterogeneous commodity hardware
Design principles of Hadoop

A new way of storing and processing the data – let the system handle most of the issues automatically:
• Failure
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of the operating system
• Relatively inexpensive hardware ($2–4K)

Hadoop = HDFS + MapReduce infrastructure + …

Optimized to handle:
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware

Reliability provided through replication
Hadoop is not for all types of work
– Not for processing transactions (random access)
– Not good when work cannot be parallelized
– Not good for low-latency data access
– Not good for processing lots of small files
– Not good for intensive calculations with little data
Who uses Hadoop?
Map-Reduce → Hadoop → BigInsights
What is Apache Hadoop?

Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo; originally built to address scalability problems of Nutch, an open source web search technology
– Well-suited to batch-oriented, read-intensive applications

Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + disks = "node"
– Nodes can be combined into clusters
– New nodes can be added as needed without changing data formats, how data is loaded, or how jobs are written
Hadoop Open Source Projects
Hadoop is supplemented by an ecosystem of open source projects
How do I leverage Hadoop to create new value for my enterprise?
– Technologies: Hadoop, Pig, Hive, ZooKeeper, Jaql, HBase, Oozie, Flume; HDFS, MapReduce, AQL
– Scale: terabytes, petabytes, exabytes
– Workloads: log analysis, CDRs, machine learning, sentiment analysis, …
What's a Hadoop Distribution?

What's a Linux Distribution?
– Linux kernel
– Open source tools around the kernel
– Installer
– Administration UI

Open Source Distribution Formula:
– Kernel
– Core projects around the kernel
– Value add: tested components, installer, administration UI, apps
– (Compare WebSphere / WAS)

A Hadoop distribution (BigInsights):
– 25+ Apache projects + additional open source + installer + IBM value add
BigInsights: Value Beyond Open Source

Key differentiators – Enterprise Capabilities: Visualization & Exploration; Development Tools; Advanced Engines; Connectors; Workload Optimization; Administration & Security
Open source components: IBM-certified Apache Hadoop

• Built-in analytics: text engine, annotators, Eclipse tooling; interface to project R (statistical platform)
• Enterprise software integration
• Spreadsheet-style analysis
• Integrated installation of supported open source and other components
• Web console for admin and application access
• Platform enrichment: additional security, performance features, …
• World-class support
• Full open source compatibility

Business benefits
• Quicker time-to-value due to IBM technology and support
• Reduced operational risk
• Enhanced business knowledge with a flexible analytical platform
• Leverages and complements existing software
From Getting Started to Enterprise Deployment: Different BigInsights Editions for Varying Needs

Enterprise Edition (enterprise class):
– Accelerators
– GPFS–FPO
– Adaptive MapReduce
– Text analytics
– Enterprise integration
– Monitoring and alerts
– InfoSphere Streams*
– Watson Explorer*
– Cognos BI*
– …

Standard Edition:
– Spreadsheet-style tool
– Web console
– Dashboards
– Pre-built applications
– Eclipse tooling
– RDBMS connectivity
– Big SQL
– Jaql
– Platform enhancements
– …

Quick Start Edition: free, non-production

Apache Hadoop

* Limited use license
(Editions increase in breadth of capabilities)
IBM Enriches Hadoop

Scalable
– Massively parallel computing on commodity servers
– New nodes can be added on the fly
Affordable
Flexible
– Hadoop is schema-less, and can absorb any type of data
Fault Tolerant
– Through the MapReduce software framework

Enterprise Hardening of Hadoop
– Performance & reliability: Adaptive MapReduce, compression, indexing, flexible scheduler, +++
– Productivity accelerators: web-based UIs and tools, end-user visualization, analytic accelerators, +++
– Enterprise integration: to extend & enrich your information supply chain
Big Database Vendors Adopt Hadoop
IBM Internal Use Only
Competing Hadoop Distribution Vendors
Cloudera
– "Cloudera makes it easy to run open source Hadoop in production"
– "Focus on deriving business value from all your data instead of worrying about managing Hadoop"

Hortonworks
– "Make Hadoop easier to consume for enterprises and technology vendors"
– "Provide expert support by the leading contributors to the Apache Hadoop open source projects"

EMC Greenplum HD / Pivotal HD
– "Provides a complete platform including installation, training, global support, and value-add beyond simple packaging of the Apache Hadoop distribution"

MapR
– "High Performance Hadoop, up to 2-5 times faster performance than Apache-based distributions"
– "The first distribution to provide high availability at all levels making it more dependable"

Amazon Elastic MapReduce
– "Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit"
Capabilities Required for Hadoop Style Workloads
– Application Support and Development Tooling
– Visualization & Discovery
– Analytics Engines
– Data Ingest
– Cluster and Workload Management
– Runtime
– Data Store
– Security
– File System
Open Source Hadoop Components
– Visualization & Discovery: Lucene
– Application Support and Development Tooling: Pig, Hive, Oozie, Avro, HCatalog
– Analytics Engines / Runtime: MapReduce
– Data Ingest: Flume, Sqoop
– Cluster Optimization and Management: ZooKeeper
– Data Store: HBase, Derby
– Security
– File System: HDFS
(All open source)
Open Source Components Across Distributions

Component | BigInsights 2.0 | HortonWorks HDP 1.2 | MapR 2.0 | Greenplum HD 1.2 | Cloudera CDH3u5 | Cloudera CDH4*
Hadoop    | 1.0.3  | 1.1.2  | 0.20.2 | 1.0.3  | 0.20.2 | 2.0.0*
HBase     | 0.94.0 | 0.94.2 | 0.92.1 | 0.92.1 | 0.90.6 | 0.92.1
Hive      | 0.9.0  | 0.10.0 | 0.9.0  | 0.8.1  | 0.7.1  | 0.8.1
Pig       | 0.10.1 | 0.10.1 | 0.10.0 | 0.9.2  | 0.8.1  | 0.9.2
ZooKeeper | 3.4.3  | 3.4.5  | X      | 3.3.5  | 3.3.5  | 3.4.3
Oozie     | 3.2.0  | 3.2.0  | 3.1.0  | X      | 2.3.2  | 3.1.3
Avro      | 1.6.3  | X      | X      | X      | X      | X
Flume     | 0.9.4  | 1.3.0  | 1.2.0  | X      | 0.9.4  | 1.1.0
Sqoop     | 1.4.1  | 1.4.2  | 1.4.1  | X      | 1.3.0  | 1.4.1
HCatalog  | 0.4.0  | 0.5.0  | 0.4.0  | X      | X      | X
Two Key Aspects of Hadoop

Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system

MapReduce framework
– How Hadoop understands and assigns work to the nodes (machines)
What is the Hadoop Distributed File System?

– HDFS stores data across multiple nodes
– HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
– The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS
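The replication idea above can be illustrated with a toy simulation in plain Java. This is not the real HDFS code – the class name, the round-robin placement, and the fixed block size are illustrative assumptions (real HDFS uses a rack-aware placement policy): a file is split into fixed-size blocks, each block is replicated on 3 of the data nodes, and every block remains reachable after any single node fails.

```java
import java.util.*;

public class HdfsReplicationSketch {
    // Assign each block to `replication` distinct nodes (round-robin placement,
    // a simplification of HDFS's real rack-aware policy).
    static Map<Integer, Set<Integer>> placeBlocks(long fileBytes, long blockSize,
                                                  int nodes, int replication) {
        int blocks = (int) ((fileBytes + blockSize - 1) / blockSize); // ceiling division
        Map<Integer, Set<Integer>> placement = new HashMap<>();
        for (int b = 0; b < blocks; b++) {
            Set<Integer> replicas = new HashSet<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((b + r) % nodes);
            }
            placement.put(b, replicas);
        }
        return placement;
    }

    // A block survives a node failure if at least one replica lives elsewhere.
    static boolean allBlocksAvailable(Map<Integer, Set<Integer>> placement, int failedNode) {
        for (Set<Integer> replicas : placement.values()) {
            Set<Integer> alive = new HashSet<>(replicas);
            alive.remove(failedNode);
            if (alive.isEmpty()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // 1 GB file, 64 MB blocks (the Hadoop 1.x default), 10 nodes, replication factor 3
        Map<Integer, Set<Integer>> p = placeBlocks(1L << 30, 64L << 20, 10, 3);
        System.out.println("blocks: " + p.size());                              // 16
        System.out.println("survives node 0 failing: " + allBlocksAvailable(p, 0)); // true
    }
}
```

With replication factor 3, losing any single node never makes a block unavailable, which is the reliability property the slide claims.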
MapReduce

MAP:
– Take a large problem and divide it into sub-problems: break the data set down into small chunks
– Perform the same function on all sub-problems: DoWork() … DoWork() … DoWork()

REDUCE:
– Combine the output from all sub-problems
– Output
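The map → combine flow above can be sketched as a tiny in-memory word count in plain Java. This is a toy, not the Hadoop API – the class and method names are illustrative assumptions: map emits (word, 1) pairs per input chunk, a shuffle step groups the pairs by key, and reduce sums each group.

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: each input chunk independently emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> partition : mapOutputs)
            for (Map.Entry<String, Integer> kv : partition)
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        return grouped;
    }

    // Reduce: sum the values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, ones) ->
            result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    static Map<String, Integer> wordCount(List<String> chunks) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String chunk : chunks) mapOutputs.add(map(chunk)); // would run in parallel on a cluster
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data big", "data is big")));
        // prints {big=3, data=2, is=1}
    }
}
```

On a real cluster the map calls run in parallel on the nodes holding the chunks, and the shuffle moves interim output over the network; the dataflow, however, is exactly this.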
MapReduce Example

Hadoop computation model:
– Data stored in a distributed file system spanning many inexpensive computers
– Bring function to the data: distribute the application to the compute resources where the data is stored
– Scalable to thousands of nodes and petabytes of data

Processing flow on the Hadoop data nodes:
1. Map phase (break the job into small parts); distribute map tasks to the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set); the MapReduce application returns a single result set

Word-count mapper and reducer:

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> val, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
So What Does This Result In?
Easy To Scale
Fault Tolerant and Self-Healing
Data Agnostic
Extremely Flexible
Resources
– bigdatauniversity.com
– youtube.com/ibmBigData
– Quick Start Editions: ibm.co/quickstart, ibm.co/streamsqs
– ibm.meetup.com
– ibmdw.net/streamsdev, ibm.co/streamscon
– ibmbigdatahub.com
– ibm.co/bigdatadev
– http://tinyurl.com/biginsights (links to demos, papers, forum, downloads, etc.)
Thank You! Your feedback is important – please fill out the survey.
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2014. All rights reserved.
• U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, and InfoSphere BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Other company, product, or service names may be trademarks or service marks of others.
Backup
Global TLE Framework
Implications of Big Data

Just reading 100 terabytes is slow
– Standard computer (100 MB/s): ~11 days
– Across a 10 Gbit link (high-end storage): 1 day
– 1,000 standard computers: 15 minutes!

Seek times for random disk access are a problem
– 1 TB data set with 10^10 100-byte records:
  • Updates to 1% would require 1 month
  • Reading and rewriting the whole data set would take 1 day*

One node is not enough! Need to scale out, not up!

* From the Hadoop mailing list
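The timings above follow from simple throughput arithmetic. A quick check (assuming 100 MB/s sequential throughput per node and ignoring coordination overhead; the slide's "15 minutes" assumes slightly higher per-node throughput than this model):

```java
public class ScanTimeMath {
    // Seconds to read `bytes` at `bytesPerSec` per node, split evenly across `nodes`.
    static double scanSeconds(double bytes, double bytesPerSec, int nodes) {
        return bytes / (bytesPerSec * nodes);
    }

    public static void main(String[] args) {
        double hundredTb = 100e12; // 100 terabytes in bytes
        System.out.printf("1 node @ 100 MB/s: %.1f days%n",
            scanSeconds(hundredTb, 100e6, 1) / 86400);     // ~11.6 days
        System.out.printf("10 Gbit link:      %.1f days%n",
            scanSeconds(hundredTb, 1.25e9, 1) / 86400);    // ~0.9 days
        System.out.printf("1000 nodes:        %.1f min%n",
            scanSeconds(hundredTb, 100e6, 1000) / 60);     // ~16.7 min
    }
}
```

The scan time divides almost linearly by the node count, which is the whole argument for scaling out.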
Scaling out

Bad news: nodes fail, especially if you have many
– Mean time between failures for 1 node = 3 years; for 1,000 nodes = 1 day
– Super-fancy hardware still fails, and commodity machines give better performance per dollar
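The 1-node-vs-1,000-node failure figure is simple division, assuming independent node failures (a standard simplification): with a per-node MTBF of 3 years, a 1,000-node cluster sees a failure roughly every day.

```java
public class ClusterMtbf {
    // Expected days between failures somewhere in the cluster,
    // assuming independent, identically distributed node failures.
    static double clusterMtbfDays(double nodeMtbfYears, int nodes) {
        return nodeMtbfYears * 365.0 / nodes;
    }

    public static void main(String[] args) {
        System.out.printf("1 node:     %.0f days%n", clusterMtbfDays(3, 1));    // 1095 days
        System.out.printf("1000 nodes: %.1f days%n", clusterMtbfDays(3, 1000)); // ~1.1 days
    }
}
```

3 × 365 / 1000 ≈ 1.1 days – the slide's "1 day" is this number rounded; the point is that at cluster scale, failure is routine rather than exceptional.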
Bad news II: distributed programming is hard
– Communication, synchronization, and deadlocks
– Recovering from machine failure
– Debugging
– Optimization
A new model is needed
• It's all about the right level of abstraction
– Hide system-level details from the developers
– No more race conditions, lock contention, etc.
• Separate the what from the how
– The developer specifies the computation that needs to be performed
– The execution framework ("runtime") handles the actual execution
MapReduce
(Diagram: traditional computing vs. MapReduce computing)
MapReduce, the reality
Many nodes, little communication between the nodes, some stragglers and failures
Big Difference: Schema on Run

Regular database – schema on load:
  Raw data → schema to filter → storage (pre-filtered data) → output

Big Data (Hadoop) – schema on run:
  Raw data → storage (unfiltered, raw data) → schema to filter → output
Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):
– Schema must be defined before any data is loaded
– An explicit load operation has to take place which transforms data to the internal DB structure
– New columns must be added explicitly before new data for such columns can be loaded into the database
– Pros: reads are fast; standards/governance

Schema-on-Read (Hadoop):
– A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding)
– Data is copied to the file store; no transformation is needed
– New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
– Pros: loads are fast; flexibility/agility
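The late-binding idea can be sketched in plain Java. This is a toy, not Hive's actual SerDe interface – the class name, the comma-separated sample data, and the column names are illustrative assumptions: raw lines land unmodified, a "SerDe" function extracts columns only at read time, and exposing a new column means updating the reader, not reloading the data.

```java
import java.util.*;
import java.util.function.Function;

public class SchemaOnReadSketch {
    // Raw data lands unmodified; no schema is enforced at load time.
    static final List<String> rawStore = new ArrayList<>();

    static void load(String rawLine) { rawStore.add(rawLine); }

    // A toy "SerDe": applied per line at read time to extract the requested columns.
    static List<Map<String, String>> read(Function<String, Map<String, String>> serde) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String line : rawStore) rows.add(serde.apply(line));
        return rows;
    }

    public static void main(String[] args) {
        load("2014-04-17,hadoop,42");
        load("2014-04-18,hive,17");

        // The first reader only extracts two columns...
        Function<String, Map<String, String>> v1 = line -> {
            String[] f = line.split(",");
            return Map.of("date", f[0], "topic", f[1]);
        };
        System.out.println(read(v1).get(0).get("topic")); // hadoop

        // ...later the SerDe is updated to expose a third column.
        // The already-loaded raw data shows the new column retroactively.
        Function<String, Map<String, String>> v2 = line -> {
            String[] f = line.split(",");
            return Map.of("date", f[0], "topic", f[1], "attendees", f[2]);
        };
        System.out.println(read(v2).get(1).get("attendees")); // 17
    }
}
```

In a schema-on-write system the third column would have required an ALTER TABLE plus a reload; here only the read-time projection changed.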
Scalability: Scalable Software Deployment