Published December 2016
Data Science with SAS and Cloudera
Josh Wills, Senior Director of Data Science, Cloudera

What is a Data Scientist?

One Definition…

…versus Another

What Do Data Scientists Do?

What I Think I Do

What Other People Think I Do

What I Actually Do

A Brief Introduction to Hadoop

Data Storage in 2001: Databases
• Structured schemas
• Intensive processing done where data is stored
• Somewhat reliable
• Expensive at scale

Data Storage in 2001: Filers
• No schemas, stores any kind of file
• No data processing capability
• Reliable
• Expensive at scale

And Then, This Happened

Data Economics: Return on Byte

Big Data Economics
• No individual record is particularly valuable
• Having every record is incredibly valuable
  • Web index
  • Recommendation systems
  • Sensor data
  • Market basket analysis
  • Online advertising

Enter Hadoop

The Hadoop Distributed File System
• Based on the Google File System
• Data stored in large files
  • Large block size: 64 MB to 256 MB per block
  • Blocks are replicated to multiple nodes in the cluster
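
The block and replication arithmetic above can be made concrete with a small sketch. The 128 MB block size is one value picked from the 64 MB to 256 MB range on the slide, and the replication factor of 3 is an assumption (it is the usual HDFS default, but the slide does not state it). The function name `hdfs_block_layout` is hypothetical.

```python
import math


def hdfs_block_layout(file_size_mb, block_size_mb=128, replication=3):
    """Return (blocks, block replicas, raw storage in MB) for one file.

    Assumptions for illustration: block_size_mb=128 and replication=3
    are not taken from the slides.
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is copied to `replication` different nodes.
    n_replicas = n_blocks * replication
    # A partial last block only occupies its actual size, so raw storage
    # is simply the file size multiplied by the replication factor.
    raw_storage_mb = file_size_mb * replication
    return n_blocks, n_replicas, raw_storage_mb


layout = hdfs_block_layout(1024)
print(layout)  # (8, 24, 3072)
```

Under these assumptions a 1 GB file becomes 8 blocks, stored as 24 block replicas that occupy 3 GB of raw disk across the cluster.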

Reliable Distributed Processing: MapReduce
• Map Stage
  • Embarrassingly parallel
  • Like a DATA Step
• Shuffle Stage: Large-scale distributed sort
  • Like PROC SORT
• Reduce Stage
  • Process all of the values that have the same key in a single step
  • Like PROC MEANS with a BY statement
• Process the data where it is stored
• Write once and you’re done.
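
The three stages can be sketched in a single process to show how they fit together. This is an in-memory illustration only, not Hadoop's API: the function names, the `sales` records, and the choice of a per-key mean (mirroring the PROC MEANS with BY analogy) are all assumptions made for the example.

```python
from collections import defaultdict


def map_stage(records, mapper):
    """Apply the mapper to each record independently (like a DATA step)."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_stage(pairs):
    """Sort the pairs by key and group values by key (like PROC SORT)."""
    groups = defaultdict(list)
    for key, value in sorted(pairs, key=lambda kv: kv[0]):
        groups[key].append(value)
    return groups


def reduce_stage(groups, reducer):
    """Process all values for one key in a single step (like PROC MEANS with BY)."""
    return {key: reducer(key, values) for key, values in groups.items()}


def emit_region_amount(record):
    # Hypothetical mapper: emit one (key, value) pair per input record.
    region, amount = record
    return [(region, amount)]


def mean_reducer(key, values):
    # Hypothetical reducer: the mean of all values that share a key.
    return sum(values) / len(values)


# Hypothetical input records: (region, sale amount).
sales = [("east", 10.0), ("west", 4.0), ("east", 20.0), ("west", 6.0)]

pairs = map_stage(sales, emit_region_amount)
groups = shuffle_stage(pairs)
result = reduce_stage(groups, mean_reducer)
print(result)  # {'east': 15.0, 'west': 5.0}
```

In a real cluster the map calls run on the nodes that hold the data blocks and the shuffle moves data across the network; here everything runs in one process so that the data flow is easy to follow.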

Getting Started with Hadoop
• Apache Hive
  • Data Warehouse System on top of Hadoop
• SQL-based query language
  • SELECT, INSERT, CREATE TABLE
  • Includes some MapReduce-specific extensions

Thinking Like a Data Scientist

Solving The Right Problem

Scarcity vs. Abundance

The Star Schema

Going Supernova

Batch vs. Interactive Processing

Cloudera Impala

SAS LASR

Advanced Analytics on Hadoop

Data Science as ETL

Iterative Algorithms

Iterative Algorithms: Hadoop

Iterative Algorithms: SAS HPA

MapReduce and You

Iterative Algorithms: Getting Clever

Case Study: Rare Event Prediction

K-Means Clustering

K-Means Clustering: Lloyd’s Algorithm
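
The slide names the algorithm without spelling it out, so here is a minimal single-machine sketch of the standard Lloyd iteration: assign each point to its nearest center, then move each center to the mean of its assigned points. The helper names, the sample points, and the starting centers are assumptions for illustration, and this is not the distributed implementation discussed in the talk.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd(points, centers, max_iters=100):
    """Run Lloyd's algorithm from the given starting centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iters):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers


# Hypothetical data: two well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_centers = lloyd(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(final_centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

Each iteration is one full pass over the data, which is why iterative algorithms like this one are costly when every pass is a separate batch job.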

K-Means++
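
K-means++ changes only how the starting centers are chosen: the first is picked uniformly at random, and each later one is sampled with probability proportional to its squared distance from the nearest center chosen so far. The sketch below shows that seeding step on one machine. The function names and sample data are assumptions, the input is assumed to contain at least `k` distinct points, and this is the sequential seeding rule rather than the scalable variant named on the following slide.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng=None):
    """Choose k starting centers with the k-means++ seeding rule.

    Assumes `points` contains at least k distinct points.
    """
    rng = rng or random.Random()
    # First center: uniform at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest chosen
        # center; points that are already centers get weight 0.
        weights = [min(squared_distance(p, c) for c in centers)
                   for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical data, with a seeded generator so runs are repeatable.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeans_pp_init(points, k=3, rng=random.Random(0))
print(seeds)
```

Because far-away points are more likely to be picked, the starting centers tend to be spread across the data, which usually gives the subsequent Lloyd iterations a better starting point than uniform random seeding.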

Scalable K-Means++ with Cloudera ML

Thinking About the Future

Data Science as Statistics

Data Science as Decision Engineering

Decisions Should Be Cheap.

Operational Analytics

Understanding Operational Analytics

Investigative Analytics
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Output is embedded into a report or in-database scoring engine

Operational Analytics
• Metric-driven
• Automated
• Systematic
• Fluid data
• Output is a production system that makes customer-facing decisions

Building Data Products

Thank you!

Josh Wills, Senior Director of Data Science, Cloudera

@josh_wills
