Big Data and its Analytics –A Challenge or Boon for Governance
Ravikumar Ramachandran
My Profile
• CISA, CISM, CGEIT, CRISC, SSCP, CAP, CISSP-ISSAP, CFE, CIA, CRMA, PMP, CEH, ECSA, CHFI, FCMA • COBIT 5 (F), ISO 27001:2013 Lead Auditor • More than 22 years Industry experience • Last 12 years as CRO, CISO • Research and Review Committee –ISACA • e-journal editor of Mumbai Chapter & CGEIT Coordinator • Presently in Hewlett-Packard
References
• Big Data, Big Analytics - Michael Minelli, Michele Chambers, Ambiga Dhiraj • Big Data Analytics: Turning Big Data into Big Money - Frank J. Ohlhorst • Big Data Now: Current Perspectives from O’Reilly Media • Privacy and Big Data - Terence Craig & Mary E. Ludloff • Ethics of Big Data - Kord Davis with Doug Patterson • Big Data: A Revolution That Will Transform How We Live, Work, and Think - Viktor Mayer-Schonberger and Kenneth Cukier • Big Data: The Next Frontier for Innovation, Competition and Productivity - McKinsey Global Institute, June 2011 • Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph - David Loshin
Disclaimer & Author’s Note
• The views expressed belong to the author and not to the employer or any of the Professional Associations • This Presentation is meant for the members of the Institute of Chartered Accountants of India • The Author is sharing his own independent views and whenever references have been made to other works, due credit is given to the respective authors
Seizing the future…..
• “ As for the future, your task is not to foresee it, but to enable it” -French aviator and author Antoine de Saint-Exupery
What is Big Data
• Extremely large data sets
• Unmanageable by database software tools • Relative and not an absolute figure • Increase with technology advances • Varies with Sector
What is Big Data
• “Every two days now we create as much information as we did from the dawn of civilization up until 2003. That’s something like five exabytes of data” - Former Google CEO Eric Schmidt
Human Brain (Scientific American)
• Storage Capacity - 2.5 Petabytes (about 2.5 million gigabytes) • Capacity to hold 3 million hours of TV shows • TV to run for more than 300 years……!!
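A quick arithmetic check of the TV figure, as a minimal sketch. The hours-per-year constant (24 x 365, ignoring leap years) is an assumption introduced here, not a number from the slide.

```python
# Convert the quoted 3 million hours of TV into years of continuous viewing.
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, TV running 24 hours a day

tv_hours = 3_000_000  # figure quoted on the slide
years_of_tv = tv_hours / HOURS_PER_YEAR

print(f"{tv_hours:,} hours is roughly {years_of_tv:.0f} years of non-stop TV")
```

The result is consistent with the "more than 300 years" on the slide.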
Internet-World’s largest library
• Estimated at Yottabytes as on date • 11 trillion years using the fastest internet connectivity • Estimated at 5 lakh TB in 2003 • In 10 years…. Expanded 20 lakh times!!
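The "20 lakh times" growth can be checked with simple arithmetic. This is a sketch under two assumptions not stated on the slide: that "Yottabytes" means roughly one yottabyte, and that decimal units are used (1 YB = 10^24 bytes, 1 TB = 10^12 bytes).

```python
# Compare the 2003 estimate (5 lakh TB) with an assumed size of 1 yottabyte today.
LAKH = 100_000
TB_PER_YOTTABYTE = 10**12  # decimal units: 10^24 bytes / 10^12 bytes

size_2003_tb = 5 * LAKH            # figure quoted on the slide
size_now_tb = 1 * TB_PER_YOTTABYTE  # assumption: exactly one yottabyte

growth = size_now_tb // size_2003_tb
growth_in_lakh = growth // LAKH

print(f"Growth factor: {growth:,} ({growth_in_lakh} lakh times)")
```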
Internet-World’s largest library
• “The Internet emphasizes the depth of our ignorance because our knowledge can only be finite, while our ignorance must necessarily be infinite” - Sir Karl Popper, Conjectures and Refutations: The Growth of Scientific Knowledge (2002)
IDC’s Digital Universe Study
• “Between 2009 and 2020, digital data will grow 44-fold to 35 zettabytes per year”
IDC’s Prediction
• Volume of Digital Content: 2012 - 2.7 billion terabytes (48% more than 2011); 2015 - 8 billion terabytes • Digital content doubles every 18 months
Economist
• Humans created 150 exabytes of information in the year 2005
• In 2011-more than 1200 exabytes!!
Gartner’s prediction
• More than 90% of the world’s data has been created in the last two years
• About 80% of enterprise data will be in the form of unstructured data
The arrival of Analytics
• Big Data-Big Opportunity • NASA, National Oceanic and Atmospheric Administration • Pharmaceutical companies, energy companies • Big Data & Today’s business
Dimensions of Big Data
• Volume : Whole and sample size • Variety : Structured and unstructured • Structured : Any data capable of being entered in a data field • Unstructured : Audio, video, image, geospatial, click streams and log files
Dimensions of Big Data
• Velocity : The speed at which the data is created, accumulated, ingested and processed • Real-time decision making
Big Data Synergies
• Traditional Business Intelligence • Data Mining • Statistical applications • Predictive analysis • Data Modeling
Getting the Big of Big Data
• Transformation Capabilities • Big Data is too big an opportunity • Best Integration • Storage Technologies
Open Source
• Hadoop-its suitability • Limitations-Pre-requisites, hardware requirements
Business Takeaway
• Businesses cannot wait for complete, structured data before taking decisions • They need to take decisions on unstructured data as well • However, not all unstructured data is useful • Business houses that ignore unstructured data are doomed
Factors enabling Big Data
• Internet and digitization of opinions & behaviour • Mobile computing • Social Networking • Moore’s Law & Cloud
Key factors driving Big Data-1
• Increasing data volumes being captured and stored • 2011 IDC Digital Universe Study- “In 2011, the amount of information created and replicated will surpass 1.8 zettabytes…growing by a factor of 9 in just 5 years…” • The scale of this growth surpasses traditional technologies and configuration setups
Key factors driving Big Data-2
• Rapid acceleration of data growth • 2012 IDC Digital Universe study, “ From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40000 exabytes…” • From now, until 2020, the digital universe will double about every two years
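The "double about every two years" claim follows from the quoted IDC figures if growth is assumed to be steadily exponential. A minimal sketch of that calculation (the helper name `doubling_time_years` is introduced here for illustration):

```python
import math

def doubling_time_years(start: float, end: float, years: float) -> float:
    """Years needed to double, assuming constant exponential growth."""
    growth_factor = end / start
    return years * math.log(2) / math.log(growth_factor)

# Figures quoted from the 2012 IDC study: 130 EB in 2005, 40,000 EB in 2020.
factor = 40_000 / 130
t_double = doubling_time_years(130, 40_000, 2020 - 2005)

print(f"Growth factor: about {factor:.0f}x")
print(f"Implied doubling time: about {t_double:.1f} years")
```

The growth factor comes out slightly above 300 and the doubling time slightly under two years, in line with both statements on the slide.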
Key factors driving Big Data-3
• Increased data volumes pushed into the network • According to Cisco’s annual Visual Networking Index Forecast, “ By 2016, annual global IP traffic is forecasted to be 1.3 zettabytes” • Driven by the increasing number of smartphones, tablets and other internet devices • Increased bandwidth and proliferation of Wi-Fi availability
Key factors driving Big Data-4
• Growing variation in types of data assets for analysis • Data scientists take advantage of unstructured datasets as against structured datasets • Acquired from a wide variety of sources • Format can be that of text, images, audio and video content • Existing structured data management needs to be enhanced to accommodate the above
Key factors driving Big Data-5
• Alternate and unsynchronized methods for facilitating data delivery • Structured environment gives clear methods of data delivery and exchange • File transfers through tape and disk storage systems • Unstructured data coming from Twitter, Government websites • Pressure for rapid acquisition, absorption and analysis
Key factors driving Big Data-6
• Rising demand for real-time integration of analytical results • Increasing number of consumers for analytical results • Businesses require real-time results on consumer behaviour
Data Explosion
• Data doubles every two years
Malthusian Theory of Population
• Thomas Malthus, author of “An Essay on the Principle of Population” (1798) • Food production increases in arithmetic progression (A.P) every 25 years • Population increases in geometric progression (G.P) every 25 years • Restraint on reproduction
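The gap between the two progressions can be seen in a few lines. This is an illustrative sketch only: the starting value of 1, the step of 1 and the ratio of 2 are assumptions chosen for readability, with each step standing for one 25-year period.

```python
def arithmetic_progression(first: int, step: int, periods: int) -> list[int]:
    """Values that grow by a fixed amount each period."""
    return [first + step * n for n in range(periods)]

def geometric_progression(first: int, ratio: int, periods: int) -> list[int]:
    """Values that grow by a fixed multiple each period."""
    return [first * ratio ** n for n in range(periods)]

# Five 25-year periods, both series starting from 1 unit (assumed values).
food = arithmetic_progression(first=1, step=1, periods=5)
population = geometric_progression(first=1, ratio=2, periods=5)

for period, (f, p) in enumerate(zip(food, population)):
    print(f"After {period * 25:>3} years: food = {f}, population = {p}")
```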
Malthusian Theory of Data Explosion (Imaginary)
• Population growth increases in G.P (25 years) • Data explodes every 2 years (approx. 1024 times) • Do not use mobile devices • Restraint on internet • Do not go to social sites • Reproduction is allowed • But no DATA Reproduction!! • All economists to become Data Scientists
Evolution of Big Data
• Farnam Jahanian, Assistant Director for Computer and Information Science and Engineering at the National Science Foundation (NSF), defines data as “a transformative new currency for science, engineering, education and commerce”
Evolution of Big Data
• “Big Data is characterized not only by the enormous volume of data but also by the diversity and heterogeneity of the data and the velocity of its generation”
Implications of Big Data-Farnam
• Creation of new products and services • Accelerate the pace of discovery in every science and engineering discipline
• Solve the nation’s challenges-medicine to cyber security
Data Explosion & Knowledge Management
• Data multiplies every two years
Big Data Technology
• Hadoop
• Open source software framework for processing huge datasets on a distributed system • Development was inspired by Google’s MapReduce and Google File System • Allows you to query both structured and unstructured data
Hadoop
• Stores any kind of data in its native format • Stores petabytes of data inexpensively • Assurance of availability • Runs on a cluster of servers each having its own CPU and disk storage
Components of Hadoop
• Hadoop Distributed File System (HDFS)
• Storage system for Hadoop cluster • HDFS breaks the data into pieces • Distributes among the servers in the cluster • Each server stores a small segment of the data set • Each piece of data is replicated on more than one server
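The split-and-replicate idea can be sketched as a toy model. This is not the HDFS implementation: the round-robin placement, the tiny block size, and the names `split_into_blocks` and `place_replicas` are all assumptions made for illustration (real HDFS uses much larger blocks and its own placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Break the data into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, servers: list[str], replication: int) -> dict[int, list[str]]:
    """Toy round-robin placement: each block is stored on `replication` distinct servers."""
    if replication > len(servers):
        raise ValueError("replication factor cannot exceed the number of servers")
    return {
        block_id: [servers[(block_id + r) % len(servers)] for r in range(replication)]
        for block_id in range(num_blocks)
    }

data = b"abcdefghijklmnopqrstuvwxyz"
blocks = split_into_blocks(data, block_size=8)
servers = ["server-1", "server-2", "server-3", "server-4"]  # hypothetical cluster
placement = place_replicas(len(blocks), servers, replication=2)

for block_id, holders in placement.items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) stored on {holders}")
```

Because every block lives on two different servers in this sketch, losing any single server still leaves one copy of each block.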
Components of Hadoop
• MapReduce
• Each server does its part of the analytical job • Reports the results for collation into a comprehensive answer • MapReduce is the agent that distributes the work and collects the results
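The distribute-then-collate pattern can be sketched in plain Python with a word count, the usual teaching example. This runs in a single process and does not use the Hadoop API; the function names and the sample chunks are assumptions, and each chunk stands in for the piece of data held by one server.

```python
from collections import defaultdict

def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Each 'server' emits a (word, 1) pair for every word in its own chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group the emitted values by key so each key can be reduced independently."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Collate the partial results into one answer per key."""
    return {key: sum(values) for key, values in grouped.items()}

# Three chunks, as if stored on three different servers (sample data).
chunks = ["big data big analytics", "big opportunity", "data governance"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))

print(word_counts)
```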
Hadoop
• HDFS continually monitors the data stored in the cluster • In case of hardware or software failure, it retrieves the data from a known good replica • MapReduce monitors the progress of each server • In case of a server slowing down or failing to return an answer….
Hadoop
• MapReduce automatically starts another instance of the task on a server that holds a copy of the data • HDFS and MapReduce together deliver a fast and reliable job
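The recovery behaviour can be illustrated with a simplified retry loop. This sketch tries replicas one after another, which is simpler than launching a second task instance in parallel; the function names, server names and the simulated failure are all hypothetical.

```python
def run_with_failover(task, replica_servers: list[str]):
    """Run the task on each server holding a copy of the data until one succeeds."""
    errors = []
    for server in replica_servers:
        try:
            return server, task(server)
        except RuntimeError as exc:
            errors.append((server, str(exc)))  # remember the failure, try the next replica
    raise RuntimeError(f"all replicas failed: {errors}")

def count_records(server: str) -> int:
    """Hypothetical task: 'server-1' is simulated as failed, other servers answer."""
    if server == "server-1":
        raise RuntimeError("server not responding")
    return 42

winner, result = run_with_failover(count_records, ["server-1", "server-2"])
print(f"answer {result} came from {winner}")
```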
Hadoop Users
• As of early 2013, Facebook was recognized as having the largest Hadoop cluster in the world • Other prominent users: Google, Yahoo, IBM
New Approach of Data processing
• Data needs to be stored in a system in which hardware is infinitely scalable • Storage and network cannot be a bottleneck • Data must be processed into BI where it resides • Move the code to the data and not the other way around • Data sits in one place and is never moved around
Challenges in Protection of Big Data
• Big Data: risk of permanent loss • Data from monitoring devices and surveillance cameras, arriving at high frequency and in real time • Uniqueness: no deduplication • Large files: huge CPU processing power • No good backup solution available
Challenges in Protection of Big Data
• Not handled well by RDBMS • NoSQL: new DBMS evolution • HIPAA & PCI compliance challenge • Very risky in medical industry
SQL/NoSQL
• SQL Databases
• Predefined schema • Standard definition and interface language • Tight consistency • Well defined semantics
SQL/NoSQL
• NoSQL Database
• No predefined schema • Per-product definition and interface language • Getting an answer quickly is more important than getting a correct answer
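The schema difference can be sketched with Python's built-in `sqlite3` module for the SQL side and a plain list of dictionaries standing in for a schema-less document store. The list is only a stand-in, not a real NoSQL product, and the table, column names and sample records are hypothetical.

```python
import sqlite3

# SQL side: the schema is declared up front and enforced on every insert.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER NOT NULL)"
)
conn.execute("INSERT INTO patients (name, age) VALUES (?, ?)", ("patient-a", 41))

try:
    # The age column is missing, so the NOT NULL constraint rejects the row.
    conn.execute("INSERT INTO patients (name) VALUES (?)", ("patient-b",))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0]

# Schema-less side: each document can carry a different set of fields.
documents = []
documents.append({"name": "patient-a", "age": 41})
documents.append({"name": "patient-b", "notes": "age not recorded"})

print(f"SQL rejected incomplete row: {rejected}, rows stored: {row_count}")
print(f"Document store accepted {len(documents)} differently shaped records")
```

The flexibility on the schema-less side is also why validation has to happen somewhere else, which links to the compliance challenge noted earlier.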
Challenges in Protection of Big Data
• CIA Triad: focus on Access Control • Balance with performance: high levels of encryption, complex security technology, additional security layers • Liability
Way forward….
• Destroy data if not legally required (logs) • Classify data
Protection measures
• Control access on Need to Know • Secure the Data at rest • Keep the cryptographic keys on a separate hardened server • Ensure that security does not impede performance • Pick the right encryption scheme • Flexible security solution with changing requirements
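The "Need to Know" measure can be sketched as a default-deny lookup. This is a minimal illustration, not a complete access control system: the role names, dataset names and the `can_access` helper are hypothetical.

```python
# Explicit grants: each role lists only the datasets it needs for its job.
NEED_TO_KNOW: dict[str, set[str]] = {
    "billing_clerk": {"invoices"},
    "fraud_analyst": {"invoices", "transaction_logs"},
}

def can_access(role: str, dataset: str) -> bool:
    """Default deny: access is allowed only if the dataset is explicitly granted to the role."""
    return dataset in NEED_TO_KNOW.get(role, set())

print(can_access("billing_clerk", "invoices"))          # granted explicitly
print(can_access("billing_clerk", "transaction_logs"))  # not needed for the role
print(can_access("intern", "invoices"))                 # unknown role
```

A dictionary lookup like this adds very little overhead, which fits the point that security should not impede performance.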
Big Data & IP
• Inventions, literary and artistic works • Symbols, images, designs • What to protect • Prioritize protection • Labeling and locking • Security awareness • Holistic approach
Governance Measures
• Strategic Alignment
• Identify business priorities • Define problems to be solved • Time frame • Measurable and achievable outcomes
Strategic alignment
• Demonstration of Value: Whether these technologies add value to real business problems • Operationalization : How to migrate the big data projects into the production environment in a controlled and managed way
Governance Measures
• Management Sponsorship
• Management support for fact-based decision making • Identify champions for consumption of analytics • Ensure benefits realization from various reports and statistical models
Integration of Big Data Analytics
• Standard processes for soliciting input from business users • Clear evaluation criteria for acceptability and adoption • Massive data scalability • Data reuse • Oversight and Governance • Mainstreaming accepted technologies
Governance Measures
• Analytical Human Capital • Mobilize resources for analytics • Hire the right talent and retain them • Increasing demand for analysts skilled in mathematics, business and technology
Key Governance Role
• Ensure business effectively uses analytics to make better decisions • Ensure investment is made in right type of analytics • Ensure investment happens in right type of people, process & technology
Data Governance
• Alert : Identify data issues that might have negative business impact • Triage : Prioritize those issues in relation to corresponding business value drivers • Remediate : Data owners to take proper actions when alerted to the existence of those issues
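The alert, triage and remediate steps can be sketched as a small pipeline. This is an illustrative sketch: the `DataIssue` fields, the 0 to 10 impact score, the threshold and the sample issues are all assumptions, not part of any governance standard.

```python
from dataclasses import dataclass

@dataclass
class DataIssue:
    description: str
    owner: str
    business_impact: int  # hypothetical score: 0 (none) to 10 (severe)

def alert(issues: list[DataIssue], threshold: int = 1) -> list[DataIssue]:
    """Alert: keep only the issues that might have a negative business impact."""
    return [issue for issue in issues if issue.business_impact >= threshold]

def triage(issues: list[DataIssue]) -> list[DataIssue]:
    """Triage: order the issues by business impact, highest first."""
    return sorted(issues, key=lambda issue: issue.business_impact, reverse=True)

def remediate(issues: list[DataIssue]) -> list[str]:
    """Remediate: hand each issue to its data owner as an action item."""
    return [f"{issue.owner}: fix '{issue.description}'" for issue in issues]

issues = [
    DataIssue("duplicate customer records", "CRM owner", 6),
    DataIssue("cosmetic typo in archived report", "Reporting owner", 0),
    DataIssue("missing transaction amounts", "Finance owner", 9),
]
work_queue = remediate(triage(alert(issues)))

for action in work_queue:
    print(action)
```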
McKinsey study
• Approximately 1,40,000 to 1,90,000 unfilled positions for data analytics experts in the U.S. by 2018
• Shortage of 1.5 million managers and analysts who have the ability to understand and make decisions using Big Data
Rise of Data Scientist
• New designation • The Data Scientist
Yesterday’s skills
• Business + Mathematics = Consulting profession • Usage of heuristics and persuasive arguments in the boardroom
Yesterday’s skills
• Business + Technology = IT Profession • Automate algorithmic Tasks improving productivity and efficiency
Yesterday’s skills
• Mathematics + Technology = Software Development • Address a wide range of business problems
Tomorrow’s Skills (Big Data, Big Analytics –Michael Minelli et al)
Privacy Landscape-Businesses
• Increased need to leverage privacy information for competitive advantage
• Huge investment in data sources and data analytics
Privacy Landscape-Criminals
• Rise in Identity theft • Sophisticated technology to exploit data security vulnerabilities
Privacy Landscape-Consumers
• Increased awareness and concern about collection, use and disclosure of personal information
Privacy Landscape-Legislators
• Responding to consumer concern by restricting use of PI
• Significant impact and restriction for business
Seven Global Privacy Principles
• Notice : Inform individuals the purpose for which information is collected • Choice : Offer individuals the opportunity to choose or opt-out • Consent : Only disclose information to third parties consistent with the above principles • Security : Take responsibility for CIA of PI
Seven Global Privacy Principles-Cont’d
• Data Integrity : Assure the reliability of PI • Access : Provide access to individuals to PI about them • Accountability : A firm must be accountable for following principles-compliance mechanism
Other Regulations
• HIPAA • GLB
• FTC
Different approach
• Privacy may be wrong focus • “Data privacy is the thing you do to keep from getting sued, data ethics is the thing you do to make your relationship with your customers positive”-James Stogdill, O’Reilly Radar
James Powell, CTO, Thomson Reuters, 2011, O’Reilly Strata Data Conference
Conclusion
• Availability of Big Data • Low Cost Hardware • New Information Management and Analytic software • Enormous opportunity • Efficiency, productivity, profitability
Concluding Remarks
• “There are known knowns, there are known unknowns, but there are also unknown unknowns”-Former U.S. Secretary of Defense, Donald Rumsfeld
Concluding Remarks…..
• “I love that quote…When I think about these three things in our daily life, they fall into these three outcomes for me.. The known unknowns more fall into the category of analysis throwing………the thing I love is the last part, if you could figure this thing out, we could have saved Afghanistan from big problems” –Google’s Avinash Kaushik in his presentation at Strata 2012, “A Big Data Imperative, Driving Big Action”