Objective
• Share a contemporary understanding of Big Data.
• Create awareness and spark interest in exploring new avenues in Big Data trends and technologies.
• Present Big Data initiatives at Thomson Reuters.
Content
The rise of the Bytes
Astonishing facts and figures
World Data forecast
Broad classification of Big Data
Characteristics of Big Data – The 3 Vs of Big Data
Challenges of Big Data and next-gen tools
Big Data’s impact on Thomson Reuters
The rise of the Bytes …
1000^8 bytes = 1 YB (Yottabyte)
1000^7 bytes = 1 ZB (Zettabyte)
1000^6 bytes = 1 EB (Exabyte)
1000^5 bytes = 1 PB (Petabyte)
1000^4 bytes = 1 TB (Terabyte)
1000^3 bytes = 1 GB (Gigabyte)
1000^2 bytes = 1 MB (Megabyte)
1000^1 bytes = 1 KB (Kilobyte)
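The unit ladder above can be sketched in a few lines of Python. This is a minimal illustration assuming decimal (SI) units, where each step is a factor of 1000; the helper names `to_bytes` and `humanize` are made up for this example.

```python
# Decimal (SI) byte units from the list above: each step is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (1 KB = 1000 bytes)."""
    return value * 1000 ** UNITS.index(unit)

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(to_bytes(1, "EB"))        # number of bytes in one exabyte
print(humanize(5 * 1000 ** 6))  # 5 EB
```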
Astonishing facts and figures …
Eric Schmidt, Chairman of Google, said:
“From the dawn of humanity to 2003, the data produced by the human race was 5 exabytes (1 EB = 1000^6 bytes), and now every 2 days we are creating 2 exabytes of data.”
World Data forecast.
• In 2010, the estimated amount of digital data in the world was 1.2 ZB.
• In 2013, web data reached 4 ZB.
• Data growth will be 44 times greater in 2020 than in 2009.
• Data volume is doubling every 1.2 years.
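The arithmetic linking a growth factor to a doubling period can be sketched as follows. The function names are illustrative. The two figures quoted above (44 times over 2009 to 2020, and doubling every 1.2 years) may come from different estimates, and the sketch only shows how each translates into the other's terms.

```python
import math

def growth_factor(years, doubling_period):
    """Factor by which a quantity grows if it doubles every `doubling_period` years."""
    return 2 ** (years / doubling_period)

def implied_doubling_period(factor, years):
    """Doubling period (in years) implied by growing `factor`-fold over `years`."""
    return years / math.log2(factor)

span = 2020 - 2009  # 11 years

# Doubling period implied by 44x growth over the span:
print(round(implied_doubling_period(44, span), 2))

# Growth factor over the span if volume doubled every 1.2 years:
print(round(growth_factor(span, 1.2)))
```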
Big Data :Broad classification
• Structured data
– Fits into tables; stored in an RDBMS
– Makes up 20% of the world's data
• Semi-Structured Data:
• Unstructured data:
– 80% of the world's data is semi-structured or unstructured
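The three classes can be illustrated with a small Python sketch. All records and field names here are hypothetical and only show how each class is typically accessed.

```python
import json

# Structured: fixed columns, every record has the same fields (a table row).
columns = ("id", "name", "country")
row = (1, "Dan", "England")
record = dict(zip(columns, row))

# Semi-structured: self-describing, but fields may vary or nest (here, JSON).
doc = json.loads('{"id": 2, "name": "Ana", "tags": ["news", "legal"]}')

# Unstructured: free text with no declared fields; structure must be extracted.
text = "Dan, who is from England, filed the report on Tuesday."

print(record["country"])   # field lookup by column name
print(doc.get("country"))  # an optional field may simply be absent
print("England" in text)   # without parsing, only text search is available
```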
Big Data :Characteristics
• The 3 Vs of Big Data
• Volume: Huge volumes of data are being generated by different sources.
• Velocity: The speed at which data arrives, often in real time, from different sources.
• Variety: The different forms of data.
– Machine-generated data: sensors, machines, satellites, weather data
– User-generated data: social media sites such as Facebook and Twitter
– Operational data: stock markets, application logs
Big Data :Significant data producers
• NYSE trading: 1 TB per day
• New websites created: 571 per minute
• Google data processing: 20 petabytes per day
• Data uploaded to Facebook: 100 terabytes per day
Aadhaar card for India
• UIDs for an Indian population of 1.5 billion
• Per resident: 5 MB
• I/O: 30 TB per day
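As a back-of-the-envelope check, the Aadhaar figures above imply a total store in the petabyte range. The sketch assumes decimal units and counts only the raw per-resident data, ignoring replication, indexes, and other overhead.

```python
MB = 1000 ** 2
PB = 1000 ** 5

residents = 1_500_000_000    # population figure from the slide
bytes_per_resident = 5 * MB  # per-resident figure from the slide

total_bytes = residents * bytes_per_resident
print(total_bytes / PB)      # 7.5 (petabytes)
```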
Big Data : Challenges
• Handle the variety of data.
• Store the huge volumes of data that exist in different forms.
• Process and analyze this huge volume of data.
E.g., decoding the human genome with a traditional RDBMS approach takes 10 years.
What next?
• Next-generation data tools and techniques, such as Hadoop and NoSQL databases, are needed to handle Big Data.
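To give a feel for the processing model that Hadoop popularized, here is a minimal single-process sketch of the MapReduce pattern (word count) in plain Python. It does not use any Hadoop API; the function names and input lines are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

lines = ["big data needs big tools", "data tools"]  # hypothetical input
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```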
Big Data’s impact on Thomson Reuters
What Thomson Reuters intends…
Thomson Reuters’s Big Data strategy
BOLD: Big Object Linked Data
• A Thomson Reuters Big Data initiative to place and link data on one common platform for analytics.
• It is a data lake for all the content from TR.
• Content is pumped in from Legal, IP & Science, F&R, and Tax & Accounting.
• A Hadoop store of our content.
CORE GOAL: A Knowledge Graph that manages facts and relationships extracted from the lake
Linked Data - RDF
• RDF (Resource Description Framework) is a
standard model for data interchange on the Web
• It’s the foundation upon which the web of
semantic data is built
• Organized into triples [Subject, Predicate, Object]
[Subject] --Predicate--> [Object]
• A “predicate” defines the relationship between the “subject”
and “object” nodes
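A triple store can be sketched in plain Python as a set of three-part tuples. The facts and the `match` helper below are hypothetical, and plain strings stand in for the URIs that real RDF uses to identify nodes and predicates.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str

# Hypothetical facts; real RDF identifies nodes and predicates with URIs.
graph = {
    Triple("Dan", "knows", "Ana"),
    Triple("Ana", "lives_in", "Paris"),
}

def match(graph, subject=None, predicate=None, obj=None):
    """Return triples matching the given parts; None acts as a wildcard."""
    return [
        t for t in graph
        if subject in (None, t.subject)
        and predicate in (None, t.predicate)
        and obj in (None, t.object)
    ]

print(match(graph, subject="Dan"))
```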
RDF Example
RDF can be written in an XML-based syntax (RDF/XML) that expresses triples using URIs
Inferred relationships…
Subject = Dan,
Predicate = is_from,
Object = England
This relationship is not stated explicitly; it is inferred from the other two, yielding new knowledge.
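The inference step can be sketched with a single hand-written rule. The two premise triples and the predicate names `born_in` and `part_of` are assumptions made for illustration, since the original diagram showing the source triples is not reproduced in the text; only the inferred triple (Dan, is_from, England) comes from the slide.

```python
# Hypothetical premises: the slide's diagram with the two source triples
# is not reproduced in the text, so these two facts are assumed.
facts = {
    ("Dan", "born_in", "London"),
    ("London", "part_of", "England"),
}

def infer_is_from(facts):
    """Rule: (x born_in y) and (y part_of z)  =>  (x is_from z)."""
    inferred = set()
    for s1, p1, o1 in facts:
        if p1 != "born_in":
            continue
        for s2, p2, o2 in facts:
            if p2 == "part_of" and s2 == o1:
                inferred.add((s1, "is_from", o2))
    # Keep only triples that were not already stated explicitly.
    return inferred - facts

print(infer_is_from(facts))  # {('Dan', 'is_from', 'England')}
```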