Published December 2016

Data Science with SAS and Cloudera
Josh Wills, Senior Director of Data Science, Cloudera

What is a Data Scientist?

One Definition…

…versus Another

What Do Data Scientists Do?

What I Think I Do

What Other People Think I Do

What I Actually Do

A Brief Introduction to Hadoop

Data Storage in 2001: Databases
• Structured schemas
• Intensive processing done where data is stored
• Somewhat reliable
• Expensive at scale

Data Storage in 2001: Filers
• No schemas, stores any kind of file
• No data processing capability
• Reliable
• Expensive at scale

And Then, This Happened

Data Economics: Return on Byte

Big Data Economics
• No individual record is particularly valuable
• Having every record is incredibly valuable
  • Web index
  • Recommendation systems
  • Sensor data
  • Market basket analysis
  • Online advertising

Enter Hadoop

The Hadoop Distributed File System
• Based on the Google File System
• Data stored in large files
• Large block size: 64MB to 256MB per block
• Blocks are replicated to multiple nodes in the cluster
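
The block and replication bullets above can be made concrete with a little arithmetic. This is only a rough sketch: the 128MB block size is one value picked from the 64MB to 256MB range on the slide, the replication factor of 3 is an assumption (the slide says only "multiple nodes"), and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
MB = 1024 * 1024

def hdfs_footprint(file_bytes, block_bytes=128 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    block_bytes and replication are illustrative assumptions, not values
    taken from the slides.
    """
    if file_bytes < 0 or block_bytes <= 0 or replication < 1:
        raise ValueError("sizes must be non-negative and replication >= 1")
    # Ceiling division: a partly filled final block still counts as a block.
    blocks = -(-file_bytes // block_bytes)
    # The final block only occupies the bytes it actually holds, so raw
    # storage is the file size times the number of copies.
    return blocks, blocks * replication, file_bytes * replication

blocks, replicas, raw = hdfs_footprint(1024 * MB)  # a 1 GiB file
print(blocks, replicas, raw // MB)
```

Under these assumptions a 1 GiB file is split into 8 blocks, stored as 24 block replicas across the cluster.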

Reliable Distributed Processing: MapReduce
• Map Stage
  • Embarrassingly parallel
  • Like a DATA Step
• Shuffle Stage: Large-scale distributed sort
  • Like PROC SORT
• Reduce Stage
  • Process all of the values that have the same key in a single step
  • Like PROC MEANS with a BY statement
• Process the data where it is stored
• Write once and you’re done.
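
The three stages above can be sketched in a single Python process. This is an illustration of the pattern only, not the Hadoop API: the function names and the `(region, amount)` record layout are hypothetical, and a real cluster would run the map and reduce stages on many nodes in parallel.

```python
from collections import defaultdict

def map_stage(records):
    """Map: handle each record on its own and emit (key, value) pairs."""
    for region, amount in records:
        yield region, amount

def shuffle_stage(pairs):
    """Shuffle: sort the pairs by key and gather the values for each key."""
    groups = defaultdict(list)
    for key, value in sorted(pairs, key=lambda kv: kv[0]):
        groups[key].append(value)
    return groups

def reduce_stage(groups):
    """Reduce: see every value for one key in a single step; here, the mean."""
    return {key: sum(values) / len(values) for key, values in groups.items()}

# Hypothetical input records: (region, sale amount).
sales = [("east", 10.0), ("west", 4.0), ("east", 20.0), ("west", 6.0)]
means = reduce_stage(shuffle_stage(map_stage(sales)))
print(means)
```

The reduce step plays the role the slide assigns to PROC MEANS with a BY statement: one mean per key.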

Getting Started with Hadoop
• Apache Hive
  • Data Warehouse System on top of Hadoop
  • SQL-based query language
    • SELECT, INSERT, CREATE TABLE
    • Includes some MapReduce-specific extensions

Thinking Like a Data Scientist

Solving The Right Problem

Scarcity vs. Abundance

The Star Schema

Going Supernova

Batch vs. Interactive Processing

Cloudera Impala

SAS LASR

Advanced Analytics on Hadoop

Data Science as ETL

Iterative Algorithms

Iterative Algorithms: Hadoop

Iterative Algorithms: SAS HPA

MapReduce and You

Iterative Algorithms: Getting Clever

Case Study: Rare Event Prediction

K-Means Clustering

K-Means Clustering: Lloyd’s Algorithm
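
The slide names only the algorithm, so what follows is a minimal textbook sketch of Lloyd's algorithm rather than anything taken from the talk. The function name, the stopping rule, and the sample points are all assumptions made for illustration.

```python
import math

def lloyd(points, centers, max_iters=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        # A center with no members stays where it is.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Hypothetical 2-D data with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = lloyd(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Each pass needs a full scan of the data, which is why iterative algorithms like this one are a recurring topic in the slides before this one.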

K-Means++
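
As a rough sketch of the idea behind K-Means++, the seeding step picks each new center with probability proportional to its squared distance from the nearest center already chosen. This is the standard sequential seeding, not the scalable variant and not the Cloudera ML API; the function name and the sample points are assumptions.

```python
import math
import random

def kmeans_pp_seeds(points, k, rng=None):
    """Pick k starting centers, favoring points far from those already chosen."""
    rng = rng or random.Random()
    seeds = [rng.choice(points)]  # first seed: uniform at random
    while len(seeds) < k:
        # Weight each point by its squared distance to the nearest seed.
        weights = [min(math.dist(p, s) ** 2 for s in seeds) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a seed; fall back to a uniform pick.
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

# Hypothetical data: two identical points and one distant point.
points = [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0)]
seeds = kmeans_pp_seeds(points, k=2, rng=random.Random(0))
print(seeds)
```

Points that coincide with an existing seed get weight zero, so with this data the two seeds always land on the two distinct locations. The result can then be passed as the starting centers to a Lloyd-style iteration.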

Scalable K-Means++ with Cloudera ML

Thinking About the Future

Data Science as Statistics

Data Science as Decision Engineering

Decisions Should Be Cheap.

Operational Analytics

Understanding Operational Analytics

Investigative Analytics
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Output is embedded into a report or in-database scoring engine

Operational Analytics
• Metric-driven
• Automated
• Systematic
• Fluid data
• Output is a production system that makes customer-facing decisions

Building Data Products

Thank you!

Josh Wills, Director of Data Science, Cloudera

@josh_wills
