Big Data, Data Management & Visualization
Ashish Sharma, Director & Co-Founder, BRIDGEi2i Analytics Solutions
[email protected] January 2012
© 2012 BRIDGEi2i Analytics Solutions Pvt. Ltd. All rights reserved
Agenda
1 BIG DATA
2 BUSINESS INTELLIGENCE
3 DATA MANAGEMENT
4 VISUALIZATION
Competing on Analytics with Big Data is big news
What is BIG Data
In the world of social media and news:
• The entire US allocated research infrastructure is 12PB of disk and 22PB of tape!
• Microsoft’s Bing search engine uses 150PB of spinning disk
• Biggest scientific projects will generate only 10-20TB / day of data, while Twitter alone produces 28PB of new data a day and Bing processes 2PB / day
• 200 MILLION new tweets a day
• 1 BILLION new Facebook items a day: average person adds 3 items to Facebook every single day
What is BIG Data
• Entire New York Times 1945-2005 = 18M articles = 2.9 billion words
• 5 BILLION words added to Twitter each DAY (almost twice the total volume of the Times in the last 60 years)
• Estimated 49.5 trillion words ever printed in books over last 600 yrs
• Twitter alone will reach that size in just three years with its current rate of tripling post volume each year
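The comparisons above are simple arithmetic and can be checked with a short sketch. The figures are the ones quoted on the slide. The function name and the two growth readings are assumptions made for illustration, since the slide does not say whether the tripling applies from the first year onward.

```python
# Figures quoted on the slide
NYT_WORDS = 2.9e9              # New York Times, 1945-2005
TWITTER_WORDS_PER_DAY = 5e9    # words added to Twitter each day
BOOK_WORDS = 49.5e12           # estimated words ever printed in books

# One day of Twitter versus 60 years of the Times: about 1.72x
daily_ratio = TWITTER_WORDS_PER_DAY / NYT_WORDS


def years_to_reach(target_words, first_year_words, growth):
    """Count the years until cumulative volume reaches target_words.

    Hypothetical helper: volume in the first year is first_year_words and
    each following year is `growth` times the year before.
    """
    total = 0.0
    yearly = first_year_words
    years = 0
    while total < target_words:
        total += yearly
        years += 1
        yearly *= growth
    return years


current_annual = TWITTER_WORDS_PER_DAY * 365

# Reading 1 (assumption): the first counted year is already tripled
years_tripled_first = years_to_reach(BOOK_WORDS, current_annual * 3, 3)

# Reading 2 (assumption): the first counted year runs at today's rate
years_flat_first = years_to_reach(BOOK_WORDS, current_annual, 3)
```

Under the first reading the cumulative total passes 49.5 trillion words in the third year, which agrees with the slide. Under the second reading it takes four years, so the claim depends on when the tripling starts.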
What makes it BIG Data
• Example sources: social media, blogs, smart meters
• Four dimensions: VOLUME, VELOCITY, VARIETY, VALUE
• It is not a single number but a set of parameters
• Types of data: Social Data, Machine-Generated Data, Video and Images, Documents
Why is BIG Data Important
• US Health Care: increase industry value per year by $300 B
• Manufacturing: decrease development and assembly costs by 50%
• Global Personal Location Data: increase service provider revenue by $100 B
• Europe Public Sector Admin: increase industry value per year by €250 B
• US Retail: increase net margin by 60+%
Today’s Challenge, New Data, What’s Possible
• Healthcare
– Today’s challenge: Expensive office visits
– New data: Remote patient monitoring
– What’s possible: Preventive care, reduced hospitalization
• Manufacturing
– Today’s challenge: In-person support
– New data: Product sensors
– What’s possible: Automated diagnosis, support
• Location-Based Services
– Today’s challenge: Based on home zip code
– New data: Real time location data
– What’s possible: Geo-advertising, traffic, local search
• Public Sector
– Today’s challenge: Standardized services
– New data: Citizen surveys
– What’s possible: Tailored services, cost reductions
• Retail
– Today’s challenge: One size fits all marketing
– New data: Social media
– What’s possible: Sentiment analysis segmentation
Catalina Marketing: Building loyalty one customer at a time
Coupon redemption rate
• No targeting: 1%
• Basic targeting, e.g., offer dog food coupon to customer buying dog food: 6-10%
• Using predictive models to find latent correlations: 25%
Marketing to a segment of one
– 195 million US loyalty program members
– Every coupon printed is unique to the individual customer
– Customized based on three years' worth of purchase history
• Identifies items that shoppers are likely to buy in future visits
• 25% increase in coupon redemption rates
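One simple way to look for items that tend to be bought together is to compute the "lift" of each item pair across past baskets. The sketch below is illustrative only: it is not Catalina's model, and the function name and sample baskets are invented for the example.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return lift for every item pair that occurs together at least once.

    lift(a, b) = P(a and b) / (P(a) * P(b)); values above 1 mean the two
    items appear together more often than independence would suggest.
    """
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(set(basket)), 2)
    )
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


# Hypothetical purchase history, one set of items per store visit
baskets = [
    {"dog food", "dog treats"},
    {"dog food", "dog treats", "milk"},
    {"milk", "bread"},
    {"bread", "milk"},
]
lift = pair_lift(baskets)
```

In this toy data the pair ("dog food", "dog treats") has a lift of 2.0, while ("dog food", "milk") falls below 1. A real targeting system would use far richer models and years of history, but the idea of ranking candidate offers by how strongly they associate with past purchases is the same.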
What needs to be done & how
• Tapping into diverse data sets
• Finding and monetizing unknown relationships
• Data driven business decisions
ACQUIRE → ORGANIZE & DISTILL → ANALYZE → DECIDE
In the US alone, an additional 150K people with advanced analytical talent and 1.5 M data-savvy managers are required to take full advantage of Big Data
Analytics as Competitive Advantage
• Research to identify “real world” applications of data and analytics in business
– Summarize the business challenge
– Name of company, function, time, sources of example
– What data was used
– Insights
– How it created value for business
• 1-page PowerPoint output (font 12+)
– Use additional pages if required to show sample dashboards / outputs
• No two examples should be the same
What is Business Intelligence
How is it used today …
• Transactional Database, Simple Query: a customer transaction is recorded as a business transaction, e.g. Item: ‘Shoes’, Cost: ‘$34’, Cust: ‘James’
• Data Warehouse, Complex Query: a Business Analyst asks for Sales & Profit for Shoes & Belts, Year >= 2005
• Output: BI Reports & Dashboards, e.g. SALES by year (2005 to 2010), 2011 Sales
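The difference between the two kinds of query can be sketched with an in-memory SQLite table. The table layout, the `price` column and the sample rows are assumptions added for illustration. Only the shoes example and the "Shoes & Belts, Year >= 2005" question come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per customer transaction
conn.execute(
    "CREATE TABLE sales (item TEXT, cust TEXT, year INTEGER, price REAL, cost REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("Shoes", "James", 2011, 50.0, 34.0),
        ("Shoes", "Asha", 2005, 45.0, 30.0),
        ("Belts", "Ravi", 2005, 20.0, 12.0),
        ("Belts", "James", 2004, 18.0, 11.0),
        ("Hats", "Asha", 2010, 15.0, 9.0),
    ],
)

# Simple query: look up a single transaction
simple_result = conn.execute(
    "SELECT item, cost, cust FROM sales WHERE cust = 'James' AND item = 'Shoes'"
).fetchall()

# Complex query: sales and profit for Shoes & Belts, Year >= 2005, by year
complex_result = conn.execute(
    """
    SELECT year, SUM(price) AS sales, SUM(price - cost) AS profit
    FROM sales
    WHERE item IN ('Shoes', 'Belts') AND year >= 2005
    GROUP BY year
    ORDER BY year
    """
).fetchall()
```

The simple query returns one stored record. The complex query filters, aggregates and groups many records, which is the kind of workload usually moved from the transactional database to a data warehouse.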
BI & DW will evolve to meet BIG Data challenges
BI & DW will need to integrate with ANALYTICS in the BIG DATA world
So what is Business reporting & analysis
Why?
• To report findings from different transactional, performance & financial data stored by businesses
• Best interpretation of data based on the right business metrics and optimal slicing and dicing of data
• Integrate reporting platforms, remove redundancies across reports
• Automate all repeatable reports
• Provide drill-down analysis built into automated reporting tools
Monthly Operations Metrics Analysis
Website Click Stream Analysis
Spend Analysis
Help business track performance based on historical data
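"Slicing and dicing" means aggregating the same metric along whichever dimensions the reader asks for. A minimal sketch, assuming records held as dictionaries (the function name, field names and figures are invented for the example):

```python
from collections import defaultdict


def slice_and_dice(records, dimensions, metric):
    """Sum `metric` for every combination of values of `dimensions`."""
    totals = defaultdict(float)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        totals[key] += rec[metric]
    return dict(totals)


# Hypothetical spend data
spend = [
    {"month": "Jan", "region": "North", "category": "Travel", "amount": 100.0},
    {"month": "Jan", "region": "South", "category": "Travel", "amount": 50.0},
    {"month": "Feb", "region": "North", "category": "IT", "amount": 200.0},
    {"month": "Feb", "region": "North", "category": "Travel", "amount": 25.0},
]

by_month = slice_and_dice(spend, ["month"], "amount")
by_region_category = slice_and_dice(spend, ["region", "category"], "amount")
```

The same function serves a monthly summary and a region-by-category drill-down. Running it on a schedule against refreshed data is one simple form of automating a repeatable report.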
Visualization – an example
http://news.yahoo.com/s/yblog_thelookout/watch-200-years-of-history-in-5-minutes
Principles of Visualization
1. Who is my audience
• What are they keen to know, what do they already know, what do they believe, what metrics do they understand
2. What am I trying to communicate
3. How do I expect the message to be used – informative, actionable etc.
4. What type of Dashboard am I creating
What type of Dashboard am I creating
Chart Chooser
Information Discrimination
• Find the crux of what needs to be presented
– What is the story, what metrics articulate the full story, what should one do to improve
• Ask a better question
– Not all questions are important; separate what is merely good to know from what would drive action if someone knew it
• Have a hypothesis, analyze, but only show what matters
What is the right metric
What do you think of this?
Chart summarizing sales performance of a business
What do you think of this?
Very common charts to show how Company G is doing compared to competition
Resources
• http://www.juiceanalytics.com/white-papers-guides-and-more/#registration
• http://www-958.ibm.com/software/data/cognos/manyeyes
• http://www.perceptualedge.com/library.php
Visualization Assignment
• Identify a data set of your choice. Present a visualization of that data that you think is interesting. Key parameters of evaluation:
– Interesting insight
– Quality of visual output
– Ease of interpretation
• You could use data sets from the following (but do not use any existing visual outputs): http://www-958.ibm.com/software/data/cognos/manyeyes/datasets?q=
Understand Data & Define Objective
• Understand Data (“INFORMATION”)
– Data Collection Approach – Actual / Imputed Values
– Structure, Type, Granularity / Level of Data
  – Credit Bureau Attributes
  – Demographic Data (Infobase)
  – Commercial Data (D&B)
  – Macro Economic Data – DRI-WEFA
  – Market Research Data
– Period / Population for which data available / would be collected (VERY IMPORTANT)
  – Keep the “END IMPLEMENTATION / USE” in mind
– Confirm “Operational Definition” of Attributes (esp. Dependent Attribute)
  – Dictionary / Discussions
  – Financial Ratios / Response / Net Response / Profitability Calculation
– Data Integrity Check for Relationship between various attributes
  – Check for Logical Relationships
  – Frequency / Recency
  – Response / Quote / Conversion
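A data integrity check of this kind can be written as a small set of rules applied to every record. The sketch below is one possible reading of the "Response / Quote / Conversion" and "Frequency / Recency" checks. The field names and the rules themselves are assumptions for illustration and should be replaced by the operational definitions agreed for the actual data.

```python
def integrity_issues(record):
    """Return a list of logical-relationship violations found in one record."""
    issues = []
    # Assumed funnel: a conversion needs a quote, and a quote needs a response
    if not (record["conversions"] <= record["quotes"] <= record["responses"]):
        issues.append("conversions, quotes and responses are out of order")
    # Assumed rule: recency is only defined when there has been activity
    if record["frequency"] == 0 and record["recency_days"] is not None:
        issues.append("recency present but frequency is zero")
    if record["frequency"] > 0 and record["recency_days"] is None:
        issues.append("frequency is positive but recency is missing")
    return issues


# Hypothetical records
good = {"responses": 10, "quotes": 4, "conversions": 1,
        "frequency": 3, "recency_days": 12}
bad = {"responses": 2, "quotes": 5, "conversions": 1,
       "frequency": 0, "recency_days": 30}

good_issues = integrity_issues(good)
bad_issues = integrity_issues(bad)
```

Records that return a non-empty list are worth tracing back to the data collection step before any modeling starts.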
Data Preparation
– Data Extraction / Reading into SAS
– Numeric Attributes:
  – Plot Univariate Graphs (PROC UNIVARIATE) for every attribute
  – “Identify” & Cap Outliers (MAX(MIN( , ), )) (Don’t use a 99.XX percentile approach)
  – Identify Treatment for Missing Values (Differentiate from “ALL MISSING”)
    – Business understanding / Data Collection Methodology based Imputation
    – Mean Imputation (To Minimize Distortion in end model)
    – Median Imputation (for Discrete Attributes)
    – Regression Based Imputation (Can be explored)
    – MCMC Based Imputation (Vijay’s GB Project)
– Character Attributes:
  – Frequency Plots (PROC FREQ) (For “Numeric Valued” Continuous Character Attributes convert to numeric and then use the approach for Numeric Attributes)
  – Treatment of Missing Values
  – Identify Appropriate ways of creating Numeric Attributes from them:
    – Ordered Values (Income Deciles)
      » Level (1,2,3 … or 12000, 15000, 25000 etc.)
    – Meaningful Bucketing (Education, Type of Car, SIC Code, Credit Score)
      » Create Dummies for various Categories (example: Gender, Marital Status etc.)
        • (Can Use CHAID to identify some useful combinations, will discuss in Modeling Session)
– Confirm No Missing Data, All “Information” in Numeric Form (PROC MEANS nmiss option)
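The preparation steps above are described for SAS. The Python sketch below is a stand-in that mirrors the same ideas: capping with the MAX(MIN(...)) pattern, mean or median imputation, dummy creation and a final missing-value count. The function names, capping bounds and sample values are assumptions for illustration, and in practice the bounds would come from business understanding of each attribute.

```python
from statistics import mean, median


def cap(value, lower, upper):
    """Cap a value to [lower, upper], i.e. MAX(MIN(value, upper), lower)."""
    return max(min(value, upper), lower)


def impute(values, strategy="mean"):
    """Replace missing values (None) with the mean or median of the rest."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("attribute is all missing; treat it separately")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def dummies(values):
    """Create one 0/1 indicator list per category; None sets no indicator."""
    categories = sorted({v for v in values if v is not None})
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def nmiss(columns):
    """Count missing values per column, similar in spirit to an nmiss check."""
    return {name: sum(v is None for v in col) for name, col in columns.items()}


# Hypothetical raw attributes
income = [12000, 15000, None, 25000, 900000]
gender = ["M", "F", None, "F", "M"]

# Cap first (bounds chosen by judgment, not by percentile), then impute
capped = [None if v is None else cap(v, 0, 100000) for v in income]
income_ready = impute(capped, strategy="mean")

gender_dummies = dummies(gender)
final = {"income": income_ready, **{f"gender_{k}": v for k, v in gender_dummies.items()}}
missing_counts = nmiss(final)
```

Capping before imputing keeps the extreme value from distorting the mean that fills the gap. The final count confirms that every attribute is numeric and complete before modeling.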
Summary Day 2
• Big Data & technological advancements are creating opportunities for businesses to turn analytics into a competitive advantage
• Business Intelligence is fast evolving to provide insights, integrating data storage, reporting & analytics to give real-time, relevant answers
• Visualization that is crisp & insightful is key to getting executives to take decisions based on analytics
• At the core of any analytics exercise is “Data Management”: treating the data appropriately to maximize the information obtained from it
Deliverables
• Assignment 1: One “Unique” example (within the class) of Impact from Analytics in business world
• Assignment 2: Visualization of an interesting insight using any data of your choice
THANK YOU
E-Mail – [email protected]
LinkedIn – http://www.linkedin.com/company/bridgei2i-analytics-solutions
Facebook – http://www.facebook.com/pages/BRIDGEi2i-Analytics-Solutions/127891620624459
Twitter – @BRIDGEi2i
Web – www.bridgei2i.com