For information on obtaining additional copies, reprinting or translating articles, and all other correspondence,
please contact:
Email: [email protected]
© Infosys Limited, 2013
Infosys acknowledges the proprietary rights of the trademarks and product names of the other
companies mentioned in this issue of Infosys Labs Briefings. The information provided in this
document is intended for the sole use of the recipient and for educational purposes only. Infosys
makes no express or implied warranties relating to the information contained in this document or to
any derived results obtained by the recipient from the use of the information in the document. Infosys
further does not guarantee the sequence, timeliness, accuracy or completeness of the information and
will not be liable in any way to the recipient for any delays, inaccuracies, errors in, or omissions of,
any of the information or in the transmission thereof, or for any damages arising there from. Opinions
and forecasts constitute our judgment at the time of release and are subject to change without notice.
This document does not contain information provided to us in confidence by our clients.
BIG DATA:
CHALLENGES AND
OPPORTUNITIES
Subu Goparaju
Senior Vice President
and Head of Infosys Labs
“At Infosys Labs, we constantly look for opportunities to leverage
technology while creating and implementing innovative business
solutions for our clients. As part of this quest, we develop engineering
methodologies that help Infosys implement these solutions right,
first time and every time.”
VOL 11 NO 1
2013
Infosys Labs Briefings
Authors featured in this issue

AADITYA PRAKASH is a Senior Systems Engineer with the FNSP unit of Infosys. He can be
reached at [email protected].
ABHISHEK KUMAR SINHA is a Senior Associate Consultant with the FSI business unit of Infosys.
He can be reached at [email protected].
AJAY SADHU is a Software Engineer with the Big data practice under the Cloud Unit of Infosys.
He can be contacted at [email protected].
ANIL RADHAKRISHNAN is a Senior Associate Consultant with the FSI business unit of Infosys.
He can be reached at [email protected].
BILL PEER is a Principal Technology Architect with the Infosys Labs. He can be reached at
[email protected].
GAUTHAM VEMUGANTI is a Senior Technology Architect with the Corp PPS unit of Infosys.
He can be contacted at [email protected].
KIRAN KALMADI is a Lead Consultant with the FSI business unit of Infosys. He can be contacted
at [email protected].
MAHESH GUDIPATI is a Project Manager with the FSI business unit of Infosys. He can be reached
at [email protected].
NAJU D MOHAN is a Delivery Manager with the RCL business unit of Infosys. She can be contacted
at [email protected].
NARAYANAN CHATHANUR is a Senior Technology Architect with the Consulting
and Systems Integration wing of the FSI business unit of Infosys. He can be reached at
[email protected].
NAVEEN KUMAR GAJJA is a Technical Architect with the FSI business unit of Infosys. He can be
contacted at [email protected].
PERUMAL BABU is a Senior Technology Architect with RCL business unit of Infosys. He can be
reached at [email protected].
PRAKASH RAJBHOJ is a Principal Technology Architect with the Consulting and Systems
Integration wing of the Retail, CPG, Logistics and Life Sciences business unit of Infosys. He can be
contacted at [email protected].
PRASANNA RAJARAMAN is a Senior Project Manager with RCL business unit of Infosys. He can
be reached at [email protected].
SARAVANAN BALARAJ is a Senior Associate Consultant with Infosys’ Retail & Logistics Consulting
Group. He can be contacted at [email protected].
SHANTHI RAO is a Group Project Manager with the FSI business unit of Infosys. She can be
contacted at [email protected].
SUDHEESHCHANDRAN NARAYANAN is a Senior Technology Architect with the Big data practice
under the Cloud Unit of Infosys. He can be reached at [email protected].
ZHONG LI PhD. is a Principal Architect with the Consulting and System Integration Unit of
Infosys. He can be contacted at [email protected].
Big data was the watchword of the year 2012. Even before one could understand
what it really meant, it began getting tossed about in huge doses in almost every
other analyst report. Today, the World Wide Web hosts upwards of 800 million
webpages, each page trying to either educate or build a perspective on the concept
of Big data. Technology enthusiasts believe that Big data is ‘the’ next big thing
after cloud. Big data is of late being adopted across industries with great fervor.
In this issue we explore what the Big data revolution is and how it will likely help
enterprises reinvent themselves.
As citizens of this digital world, we generate more than 200 exabytes of
information each year, equivalent to 20 million Libraries of Congress.
According to Intel, each internet minute sees 100,000 tweets, 277,000 Facebook
logins, 204 million email exchanges, and more than 2 million search queries fired.
Given the scale at which data is being churned out, it is beyond human capability
to process it all, and hence there is a need for machine processing
of information. There is no dearth of data for today’s enterprises. On the contrary,
they are mired with data and quite deeply at that. Today therefore the focus
is on discovery, integration, exploitation and analysis of this overwhelming
information. Big data may be construed as the technological intervention to
undertake this challenge.
Big data systems are expected to help analyze structured and
unstructured data and are hence drawing huge investments. Analysts have
estimated enterprises will spend more than US$120 billion by 2015 on analysis
systems. The success of Big data technologies depends upon natural language
processing capabilities, statistical analytics, large storage and search technologies.
Big data analytics can help cope with large data volumes, data velocity and
data variety. Enterprises have started leveraging these Big data systems to mine
hidden insights from data. In the first issue of 2013, we bring to you papers
that discuss how Big data analytics can make a significant impact on several
industry verticals like medical, retail, IT and how enterprises can harness the
value of Big data.
As always, do let us know your feedback about the issue.
Happy Reading,
Yogesh Dandawate
Deputy Editor
[email protected]
Infosys Labs Briefings
Advisory Board
Anindya Sircar PhD
Associate Vice President &
Head - IP Cell
Gaurav Rastogi
Vice President,
Head - Learning Services
Kochikar V P PhD
Associate Vice President,
Education & Research Unit
Raj Joshi
Managing Director,
Infosys Consulting Inc.
Ranganath M
Vice President &
Chief Risk Officer
Simon Towers PhD
Associate Vice President and
Head - Center of Innovation for
Tomorrow’s Enterprise,
Infosys Labs
Subu Goparaju
Senior Vice President &
Head - Infosys Labs
Big Data: Countering
Tomorrow’s Challenges
Infosys Labs Briefings

Index

Opinion: Metadata Management in Big Data
By Gautham Vemuganti
Any enterprise that is in the process of or considering Big data applications deployment
has to address the metadata management problem. The author proposes a metadata
management framework to realize Big data analytics.
Trend: Optimization Model for Improving Supply Chain Visibility
By Saravanan Balaraj
The paper tries to explore the challenges that dot the Big data adoption in supply chain and
proposes a value model for Big data optimization.
Discussion: Retail Industry – Moving to Feedback Economy
By Prasanna Rajaraman and Perumal Babu
Big data analysis of customers’ preferences can help retailers gain a significant competitive
advantage, suggest the authors.
Perspective: Harness Big Data Value and Empower Customer Experience Transformation
By Zhong Li PhD
Always-on digital customers continuously create more data of various types. Enterprises
are analyzing this heterogeneous data to understand customer behavior, spend and social
media patterns.
Framework: Liquidity Risk Management and Big Data: A New Challenge for Banks
By Abhishek Kumar Sinha
Managing liquidity risk on simple spreadsheets can lead to non-real-time and inappropriate
information that may not be enough for efficient liquidity risk management (LRM). The author
proposes an iterative framework for effective liquidity risk management.
Model: Big Data Medical Engine in the Cloud (BDMEiC): Your New Health Doctor
By Anil Radhakrishnan and Kiran Kalmadi
In this paper the authors describe how Big data analytics can play a significant role in the early
detection and diagnosis of fatal diseases, reduction in healthcare costs and improvement in the
quality of healthcare administration.
Approach: Big Data Powered Extreme Content Hub
By Sudheeshchandran Narayanan and Ajay Sadhu
With the arrival of Big Content, the need to extract, enrich, organize and manage
semi-structured and un-structured content and media is increasing. This paper talks about
the need for an Extreme Content Hub to tame the Big data explosion.
Insight: Complex Events Processing: Unburdening Big Data Complexities
By Bill Peer, Prakash Rajbhoj and Narayanan Chathanur
Complex Event Processing along with in-memory data grid technologies can help in pattern
detection, matching, analysis, processing and split-second decision making in Big data
scenarios, opine the authors.
Practitioner’s Perspective: Big Data: Testing Approach to Overcome Quality Challenges
By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja
This paper suggests the need for a robust testing approach to validate Big data systems to
identify possible defects early in the implementation life cycle.
Research: Nature Inspired Visualization of Unstructured Big Data
By Aaditya Prakash
Classical visualization methods are falling short in accurately representing the multidimensional
and ever-growing Big data. Taking inspiration from nature, the author has proposed a nature-
inspired spider-cobweb visualization technique for visualization of Big data.

“Robust testing approach needs to be defined for validating
structured and unstructured data to identify possible
defects early in the implementation life cycle.”
Naju D. Mohan
Delivery Manager, RCL Business Unit
Infosys Ltd.
“Big Data augmented with Complex Event Processing
capabilities can provide solutions utilizing
in-memory data grids for analyzing trends,
patterns and events in real time.”
Bill Peer
Principal Technology Architect
Infosys Labs, Infosys Ltd.
Metadata Management in Big Data
By Gautham Vemuganti

Big data analytics must reckon with the importance and criticality of metadata

Big data, true to its name, deals with large
volumes of data characterized by volume,
variety and velocity. Any enterprise that is
in the process of or considering a Big data
applications deployment has to address the
metadata management problem. Traditionally,
much of the data that business users use is
structured. This however is changing with the
exponential growth of data or Big data.
Metadata defining this data, however,
is spread across the enterprise in spreadsheets,
databases, applications and even in people’s
minds (the so-called “tribal knowledge”). Most
enterprises do not have a formal metadata
management process in place because of
the misconception that it is an Information
Technology (IT) imperative and it does not have
an impact on the business.
However, the converse is true. It has been
proven that a robust metadata management
process is not only necessary but required for
successful information management. Big data
introduces large volumes of unstructured data
for analysis. This data could be in the form of a
text file or any multimedia file (e.g., audio,
video). To bring this data into the fold of an
information management solution, its metadata
should be correctly defined.
Metadata management solutions
provided by various vendors usually have
a narrow focus. An ETL vendor will capture
metadata for the ETL process. A BI vendor will
provide metadata management capabilities
for their BI solution. The siloed nature of
metadata does not provide business users an
opportunity to have a say and actively engage
in metadata management. A good metadata
management solution must provide visibility
across multiple solutions and bring business
users into the fold for a collaborative, active
metadata management process.
METADATA MANAGEMENT CHALLENGES
Metadata, simply defined, is data about data.
In the context of analytics some common
examples of metadata are report definitions,
table definitions, meaning of a particular master
data entity (sold-to customer, for example),
ETL mappings and formulas and computations.
The importance of metadata cannot be
overstated. Metadata drives the accuracy of
reports, validates data transformations, ensures
accuracy of calculations and enforces consistent
definition of business terms across multiple
business users.
In a typical large enterprise which has
grown by mergers, acquisitions and divestitures,
metadata is scattered across the enterprise in
various forms as noted in the introduction.
In large enterprises, there is wide
acknowledgement that metadata management
is critical but most of the time there is no
enterprise level sponsorship of a metadata
management initiative. Even if there is, it is
usually focused on one specific project sponsored
by one specific business.
The impact of good metadata
management practices is not consistently
understood across the various levels of the
enterprise. Conversely, the impact of poorly
managed metadata comes to light only after
the fact, i.e., when a certain transformation happens,
a report or a calculation is run, or two divisional
data sources are merged.
Metadata is typically viewed as the
exclusive responsibility of the IT organization
with business having little or no input or say in
its management. The primary reason is that there
are multiple layers of organization between IT
and business. This introduces communication
barriers between IT and business.
Finally, metadata is not viewed as a very
exciting area of opportunity. It is only addressed
as an afterthought.
DIFFERENCES BETWEEN TRADITIONAL
AND BIG DATA ANALYTICS
In traditional analytics implementations,
data is typically stored in a data warehouse.
The data warehouse is modeled using one
of several techniques, developed over time
and is a constantly evolving entity. Analytics
applications developed using the data in a data
warehouse are also long-lived. Data governance
in traditional analytics is a centralized process.
Metadata is managed as part of the data
governance process.

Figure 1: Data Governance Shift with Big Data Analytics – from a single monolithic governance process (people, rules, metrics, process) to multiple governance processes. Source: Infosys Research
In traditional analytics, data is discovered,
collected, governed, stored and distributed.
Big data introduces large volumes of
unstructured data. This data is highly
dynamic and therefore needs to be ingested
quickly for analysis.
Big data analytics applications,
however, are characterized by short-lived,
quick implementations focused on solving a
specific business problem. The emphasis of
Big data analytics applications is more on
experimentation and speed as opposed to a long
drawn-out modeling exercise.
The need to experiment and derive
insights quickly using data changes the way
data is governed. In traditional analytics
there is usually one central governance team
focused on governing the way data is used
and distributed in the enterprise. In Big data
analytics, there are multiple governance
processes in play simultaneously, each geared
towards answering a specific business question.
Figure 1 illustrates this.
Most of the metadata management
challenges we referred to in the previous section
alluded to typical enterprise data that is highly
structured. To analyze unstructured data,
additional metadata definitions are necessary.
To illustrate the need to enhance metadata
to support Big data analytics, consider sentiment
analysis using social media conversations as
an example. Say someone posts a message on
Facebook “I do not like my cell-phone reception.
My wireless carrier promised wide cell coverage
but it is spotty at best. I think I will switch
carriers”. To infer the intent of this customer,
the inference engine has to rely on metadata
as well as the supporting domain ontology.
The metadata will define “Wireless Carrier”,
“Customer”, “Sentiment” and “Intent”. The
inference engine will leverage the ontology
dependent on this metadata to infer that this
customer wants to switch cell phone carriers.
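To make this flow concrete, the following is a minimal, hypothetical sketch in Python of how an inference engine might combine such metadata definitions with a simple keyword ontology to tag a post with entity, sentiment and intent. The entity names mirror the example above, but the keyword lists and scoring rules are illustrative assumptions only, not a description of any particular Infosys solution.

# Minimal, illustrative sketch: metadata-driven sentiment/intent tagging.
# The metadata entities ("Wireless Carrier", "Sentiment", "Intent") mirror the
# example in the text; the keyword ontology below is a toy assumption.

METADATA = {
    "Wireless Carrier": {"carrier", "reception", "coverage", "cell-phone", "cell"},
    "Sentiment":        {"negative": {"do not like", "spotty", "poor"},
                         "positive": {"love", "great", "excellent"}},
    "Intent":           {"churn": {"switch carriers", "cancel", "leave"}},
}

def infer(post: str) -> dict:
    """Tag a social media post using the metadata definitions above."""
    text = post.lower()
    result = {"entity": None, "sentiment": "neutral", "intent": None}

    # Entity detection: does the post talk about a wireless carrier at all?
    if any(term in text for term in METADATA["Wireless Carrier"]):
        result["entity"] = "Wireless Carrier"

    # Sentiment: count matches of positive vs. negative ontology terms.
    neg = sum(term in text for term in METADATA["Sentiment"]["negative"])
    pos = sum(term in text for term in METADATA["Sentiment"]["positive"])
    if neg > pos:
        result["sentiment"] = "negative"
    elif pos > neg:
        result["sentiment"] = "positive"

    # Intent: look for churn phrases defined in the ontology.
    if any(term in text for term in METADATA["Intent"]["churn"]):
        result["intent"] = "churn"
    return result

post = ("I do not like my cell-phone reception. My wireless carrier promised "
        "wide cell coverage but it is spotty at best. I think I will switch carriers")
print(infer(post))
# {'entity': 'Wireless Carrier', 'sentiment': 'negative', 'intent': 'churn'}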
Big data is not just restricted to text. It
could also contain images, videos and voice
files. Understanding, categorizing and creating
metadata to analyze this kind of non-traditional
content is critical.
It is evident that Big data introduces
additional challenges in metadata management. It
is clear that there is a need for a robust metadata
management process which will govern metadata
with the same rigor as data for enterprises to be
successful with Big data analytics.
To summarize, a metadata management
process specific to Big data should incorporate
the context and intent of data, support non-
traditional sources of data and be robust to
handle the velocity of Big data.
ILLUSTRATIVE EXAMPLE
Consider an existing master data management
system in a large enterprise. This master data
system has been developed over time. It
has specific master data entities like product,
customer, vendor, employee, etc. The master data
system is tightly governed and data is processed
(cleansed, enriched and augmented) before it is
loaded into the master data repository.
This specific enterprise is considering
bringing in social media data for enhanced
customer analytics. This social media data is to be
sourced from multiple sources and incorporated
into the master data management system.
As noted earlier, social media
conversations have context, intent and
sentiment. The context refers to the situation
in which a customer was mentioned, the intent
refers to the action that an individual is likely
to take and the sentiment refers to the “state of
being” of the individual.
For example, an individual may send a
tweet or start a Facebook conversation about
a retailer from a football game. The context
would then be a sports venue. If the tweet or
conversation consisted of positive comments
about the retailer then the sentiment would be
determined as positive. If the update consisted
of highlighting a promotion by the retailer then
the intent would be to collaborate or share with
the individual’s network.
If such social media updates have to
be incorporated into any solution within the
enterprise then the master data management
solution has to be enhanced with metadata about
“Context”, ”Sentiment” and “Intent”. Static
lookup information will need to be generated
and stored so that an inference engine can
leverage this information to provide inputs for
analysis. This will also necessitate a change in the
back-end. The ETL processes that are responsible
for this master data will now have to incorporate
the social media data as well. Furthermore, the
customer information extracted from these feeds
needs to be standardized before being loaded into
any transaction system.
FRAMEWORK FOR METADATA
MANAGEMENT IN BIG DATA ANALYTICS
We propose that metadata be managed using
the five components shown in Figure 2.

Figure 2: Metadata Management Framework for Big Data Analytics – Metadata Discovery, Collection, Governance, Storage and Distribution. Source: Infosys Research
Metadata Discovery – Discovering metadata
is critical in Big data for the reasons of context
and intent noted in the prior section. Social data
is typically sourced from multiple sources. All
these sources will have different formats. Once
metadata for a certain entity is discovered for
one source it needs to be harmonized across all
sources of interest. This process for Big data
will need to be formalized using metadata
governance.
Metadata Collection – A metadata collection
mechanism should be implemented. A robust
collection mechanism should aim to minimize
or eliminate metadata silos. Once again, a
technology or a process for metadata collection
should be implemented.
Metadata Governance – Metadata creation
and maintenance needs to be governed.
Governance should include resources from
both the business and IT teams. A collaborative
framework between business and IT should
be established to provide this governance.
Appropriate processes (manual or technical)
should be utilized for this purpose. For example,
on-boarding a new Big data source should be
a collaborative effort between business users
and IT. IT will provide the technology to enable
business users to discover metadata.
Metadata Storage – Multiple models for
enterprise metadata storage exist. The Common
Warehouse Meta-model (CWM) is one example.
A similar model or an extension thereof can be
utilized for this purpose. If one such model does
not fit the requirements of an enterprise then
suitable custom models can be developed.
Metadata Distribution – This is the final
component. Metadata, once stored, will need
to be distributed to consuming applications. A
formal distribution model should be put into
place to enable this distribution. For example,
some applications can directly integrate with
the metadata storage layer while others will
need some specialized interfaces to be able to
leverage this metadata.
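As an illustration only, a minimal Python sketch of how the five components might hang together as a lightweight metadata registry is given below. The class and field names (MetadataEntry, MetadataRegistry, steward, and so on) are assumptions made for the sketch, not a prescribed implementation of the framework.

# Minimal, illustrative sketch of the five framework components as a metadata
# registry. All names below are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class MetadataEntry:
    entity: str              # e.g., "Customer", "Sentiment"
    definition: str          # business definition of the entity
    source: str              # system or feed the entity was discovered in
    steward: str             # business/IT owner (governance)
    attributes: dict = field(default_factory=dict)

class MetadataRegistry:
    def __init__(self):
        self._store = {}                     # Metadata Storage

    def discover(self, source, raw_fields):
        """Metadata Discovery: derive candidate entries from a new source."""
        return [MetadataEntry(entity=f, definition="TBD", source=source,
                              steward="unassigned") for f in raw_fields]

    def collect(self, entries):
        """Metadata Collection: bring discovered entries into one place."""
        for e in entries:
            self._store.setdefault(e.entity, []).append(e)

    def govern(self, entity, definition, steward):
        """Metadata Governance: approve a definition and assign a steward."""
        for e in self._store.get(entity, []):
            e.definition, e.steward = definition, steward

    def distribute(self, entity):
        """Metadata Distribution: expose approved metadata to consumers."""
        return [e for e in self._store.get(entity, []) if e.steward != "unassigned"]

registry = MetadataRegistry()
registry.collect(registry.discover("facebook_feed", ["Customer", "Sentiment", "Intent"]))
registry.govern("Sentiment", "Polarity of a customer conversation", "marketing_ops")
print(registry.distribute("Sentiment"))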
We note that in traditional analytics
implementations, a framework similar to the one
we propose exists, but with data.
The metadata management framework
should be implemented alongside a data
management framework to realize Big data analytics.

Figure 3: Equal Importance of Metadata & Data Processing for Big Data Analytics – the data lifecycle (discovery, collection, governance, storage, distribution) runs in parallel with an equivalent metadata lifecycle. Source: Infosys Research
THE PARADIGM SHIFT
The discussion in this paper brings to light the
importance of metadata and the impact it has
not only on Big data analytics but traditional
analytics as well. We are of the opinion that if
enterprises want to get value out of their data
assets and leverage the Big data tidal wave then
the time is right to shift the paradigm from
data governance to metadata governance and
make data management part of the metadata
governance process.
A framework is as good as how it is
viewed and implemented within the enterprise.
The metadata management framework is
successful if there is sponsorship for this effort
from the highest levels of management. This
includes both business and IT leadership within
the enterprise. The framework can be viewed as
being very generic. Change is a constant in any
enterprise. The framework can be made flexible
to adapt to changing needs and requirements
of the business.
All the participants and personas
engaged in the data management function within
an enterprise should participate in the process.
This will promote and foster collaboration
between business and IT. This should be made
sustainable and followed diligently by all the
participants until this framework is used to on-
board not only new data sources but also new
participants in the process.
Metadata and its management is an
oft-ignored area in enterprises, with multiple
consequences. The absence of robust metadata
management processes leads to erroneous results,
project delays and multiple interpretations of
business data entities. These are all avoidable
with a good metadata management framework.
The consequences affect the entire
enterprise either directly or indirectly. From
the lowest-level employee to the senior-most
executive, incorrect or poorly managed
metadata will not only affect operations but also
directly impact the top-line growth and
bottom-line profitability of an enterprise. Big
data is viewed as the most important innovation
that brings tremendous value to enterprises.
Without a proper metadata management
framework, this value might not be realized.
CONCLUSION
Big data has created quite a bit of buzz in the
marketplace. Pioneers like Yahoo and Google
created the foundations of what is today called
Hadoop. There are multiple players in the Big
data market today, developing everything from
technology to manage Big data, to applications
needed to analyze Big data, to companies engaged
in Big data analysis and selling that content.
In the midst of all the innovation in the
Big data space, metadata is often forgotten. It
is important for us to recognize and realize the
importance of metadata management and the
critical impact it has on enterprises.
If enterprises wish to remain competitive,
they have to embark on Big data analytics
initiatives. In this journey, enterprises cannot
afford to ignore the metadata management
problem.
Optimization Model for Improving
Supply Chain Visibility
By Saravanan Balaraj

Enterprises need to adopt different Big data analytic tools and technologies to improve their supply chains

In today’s competitive ‘lead or leave’
marketplace, Big data is seen as a
paradox that offers both challenge and
opportunity. Effective and efficient strategies
to acquire, manage and analyze data lead
to better decision making and competitive
advantage. Unlocking potential business
value out of this diverse and multi-structured
dataset beyond organizational boundary is a
mammoth task.
We have stepped into an interconnected
and intelligent digital world where the
convergence of new technologies is happening
all around us. In this process the underlying
data set is growing not only in volumes but
also in velocity and variety. The resulting data
explosion created by a combination of mobile
devices, tweets, social media, blogs, sensors and
emails demands a new kind of data intelligence.
Big data has started creating a lot of buzz
across verticals and Big data in supply chain is
no different. Supply chain is one of the key focus
areas that has been undergoing transformational
change in the recent past. Traditional supply
chain applications leverage only transactional
data to solve operational problems and improve
efficiency. Having stepped into the Big data world,
existing supply chain applications have
become obsolete as they are unable to cope
with tremendously increasing data volumes
cutting across multiple sources, the speed with
which data is generated and the unprecedented
growth in new data forms.
Enterprises are under tremendous pressure
to solve new problems emerging out of new
forms of data. Handling large volumes of data
across multiple sources and deriving value out
of this massive chunk for strategy execution
is the biggest challenge that enterprises are
facing in today’s competitive landscape.
Careful analysis and appropriate usage of
this data would result in cost reduction and
better operational performance. Competitive
pressures and customers’ ‘more for less’
attitudes have left enterprises with no choice
other than to rethink their supply chain
strategies and create a differentiation.
Enterprises need to adopt appropriate
Big data techniques and technologies and
build suitable models to derive value out
of this unstructured data and thereby
plan, schedule and route in a cost-effective
manner. This paper explores
the challenges that dot Big data adoption in
supply chain and proposes a value model for
Big data optimization.
BIG DATA WAVE
International Data Corporation (IDC) has
predicted that the Big data market will grow from
$3.2 billion in 2010 to $16.9 billion by 2015
at a compound annual growth rate of 40%
[1]. This shows tremendous traction towards
Big data tools, technologies and platforms
among enterprises. A lot of research and
investment is being carried out on how to fully tap
the potential benefits hidden in Big data and
derive financial value out of it. Value derived
out of Big data enables enterprises to achieve
differentiation through cost reduction, efficient
planning and thereby improved process
efficiency.
Big data is an important asset in the supply
chain which enterprises are looking
to capitalize on. They adopt different Big
data analytic tools and technologies to improve
their supply chain, production and customer
engagement processes. The path towards
operational excellence is facilitated through
efficient planning and scheduling of production
and logistic processes.
Though supply chain data is really huge,
it brings about the biggest opportunity for
enterprises to reduce cost and improve their
operational performances. The areas in supply
chain planning where Big data can create an
impact are: demand forecasting, inventory
management, production planning, vendor
management and logistics optimization. Big
data can improve the supply chain planning process
if appropriate business models are identified,
designed, built and then executed. Some of
its key benefits are: short time-to-market,
improved operational excellence, cost reduction
and increased profit margins.

CHALLENGES WITH SUPPLY CHAIN
PLANNING
Supply chain planning success
depends on how closely demands are
forecasted, inventories are managed and
logistics are planned. Supply chain is the
heart of any industry vertical and, if managed
efficiently, drives positive business outcomes and enables
sustainable advantage. With the emergence of
Big data, optimizing supply chain processes
has become more complicated than ever before.
Handling Big data challenges in supply chain
and transforming them into opportunities
is the key to corporate success. The key
challenges are:
■ Volume - According to a McKinsey
report, the number of RFID tags sold
globally is projected to increase from
12 million in 2011 to 209 billion in
2021 [2]. Along with this, with the phenomenal
increase in the usage of temperature
sensors, QR codes and GPS devices, the
underlying supply chain data generated
has multiplied manifold beyond
expectations. Data is flowing across
multiple systems and sources and hence
it is likely to be error-prone and
incomplete. Handling such huge data
volumes is a challenge.
■ Velocity - Business has become highly
dynamic and volatile. The changes arising
due to unexpected events must be handled
in a timely manner in order to avoid losing
out in business. Enterprises are finding it
extremely difficult to cope with this
data velocity. Optimal decisions must
be made quickly, and shorter processing
time is the key for successful operational
execution, which is lacking in traditional
data management systems.
■ Variety - In supply chain, data has
emerged in different forms which
don’t fit in traditional applications and
models. Structured (transactional data),
unstructured (social data), sensor data
(temperature and RFID) along with
new data types (video, voice and digital
images) have created nightmares among
enterprises trying to handle such diverse and
heterogeneous data sets.
In today’s data explosion in terms
of volume, variety and velocity, handling
the data alone doesn’t suffice. Value creation by
analyzing such massive data sets and extraction
of data intelligence for successful strategy
execution is the key.

Figure 1: Optimization Model for Improving Supply Chain Visibility - I (Acquire stage): data sourced from transactional systems (OLTP databases) and Big data systems (sensor, RFID, social, video, voice, digital image, QR and temperature feeds) goes through data sourcing, data extraction & cleansing and data representation using tools such as Cascading, Hive, Pig, MapReduce, HDFS and NoSQL. Source: Infosys Research
BIG DATA IN DEMAND FORECASTING &
SUPPLY CHAIN PLANNING
Enterprises use forecasting to determine how
much to produce of each product type, when
and where to ship them, thereby improving
supply chain visibility. Inaccurate forecasts
have a detrimental effect on the supply chain.
Over-forecasting results in inventory pile-ups
and locked working capital. Under-forecasting
leads to failure in meeting demand, resulting
in loss of customers and sales. Hence, in today’s
volatile market comprised of unpredictable
shifts in customer demand, improving
forecast accuracy is of paramount
importance.
Data in supply chain planning has
mushroomed in terms of volume, velocity
and variety. Tesco, for instance, generates
more than 1.5 billion new data items every
month. Wal-Mart’s warehouse handles
some 2.5 petabytes of information, which is
roughly equivalent to half of all the letters
delivered by the US Postal Service in 2010.
According to the McKinsey Global Institute
report [2], leveraging Big data in demand
forecasting and supply chain planning could
increase profit margin by 2-3% in the Fast Moving
Consumer Goods (FMCG) manufacturing
value chain. This unearths a tremendous
opportunity in forecasting and supply chain
planning for enterprises to capitalize
on the Big data deluge.
MISSING LINKS IN TRADITIONAL
APPROACHES
Enterprises have started realizing the
importance of Big data in forecasting and
have begun investing in Big data forecasting
tools and technologies to improve their supply
chain, production and manufacturing planning
processes. Traditional forecasting tools aren’t
adequate for handling huge data
volumes, variety and velocity. Moreover, they
are missing out on the following key aspects
which improve the accuracy of forecasts:
■ Social Media Data As An Input:
Social media is a platform that enables
enterprises to collect information
about potential and prospective
customers. Technological
advancements have made tracking
customer data easier. Companies can
now track every visit a customer makes
to their websites, e-mails exchanged and
comments logged across social media
websites. Social media data helps
analyze the customer pulse and gain insights
on forecasting, planning and scheduling of
supply chain and inventories. Buzz in
social networks can be used as an input
for demand forecasting, with numerous
benefits. In one such use case, an enterprise
can launch a new product to online fans
to sense customer acceptance. Based on
the response, inventories and supply
chain can be planned to direct stocks
to high-buzz locations during the launch
phase (a minimal forecasting sketch
follows this list).
■ Predict And Respond Approach:
Traditional forecasting is done by
analyzing historical patterns, considering
sales inputs and promotional plans
to forecast demand and supply chain
planning. They focus on ‘what happened’
and work on a ‘sense and respond’ strategy.
‘History repeats itself’ is no longer apt
in today’s competitive marketplace.
Enterprises need to focus on ‘what
will happen’ and require a ‘predict and
respond’ strategy to stay alive in business.
This calls for models and systems capable
of capturing, handling and analyzing
huge volumes of real-time data generated
from unexpected competitive events,
weather patterns, point-of-sale data and
natural disasters (volcanoes, floods, etc.)
and converting them into actionable
information for forecasting plans on
production, inventory holdings and
supply chain distribution.
■ Optimized Decisions with Simulations:
Traditional decision support systems
lack the flexibility to meet changing data
requirements. In real-world scenarios,
supply chain delivery plans change
unexpectedly due to various reasons
like demand change, revised sales
forecasts, etc. The model and system
should have the ability to factor this in and
respond quickly to such unplanned
events. Decisions should be taken only
after careful analysis of the unplanned
event’s impact on other elements of the
supply chain. Traditional approaches
lack this capability and this necessitates
a model for performing what-if analysis
on all possible decisions and selecting
the optimal one in the Big data context.
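As a toy illustration of the first point above (referenced from the Social Media Data item), the Python sketch below folds a social buzz signal into a simple least-squares demand forecast. The weekly numbers and the linear model are invented for the example and are not a recommended forecasting method.

# Illustrative sketch: adding a social-media "buzz" signal to a simple demand
# forecast. The numbers and the linear model are toy assumptions, not a
# production forecasting method.
import numpy as np

# Weekly history: units sold and a social buzz index (e.g., mention counts).
sales = np.array([120, 135, 128, 160, 170, 155, 190, 210], dtype=float)
buzz  = np.array([ 30,  42,  35,  70,  80,  60, 110, 140], dtype=float)

# Design matrix: intercept, previous week's sales, current week's buzz.
X = np.column_stack([np.ones(len(sales) - 1), sales[:-1], buzz[1:]])
y = sales[1:]

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Forecast next week's demand given last observed sales and expected buzz
# (say, a spike around a product launch announced to online fans).
expected_buzz = 160.0
next_week = coef @ np.array([1.0, sales[-1], expected_buzz])
print(f"Forecast demand next week: {next_week:.0f} units")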
IMPROVING SUPPLY CHAIN VISIBILITY
USING BIG DATA
Supply chain doesn’t lack data – what’s missing
is a suitable model to convert this huge diverse
raw data into actionable information so that
enterprises can make critical business decisions
for efficient supply chain planning. A 3-stage
optimized value model helps to overcome
the challenges posed by Big data in supply
chain planning and demand forecasting. It
bridges the existing gaps in traditional Big
data approaches and offers a perspective
to unlock the value from the growing Big data
torrent. Designing and building an optimized
Big data model for supply chain planning is a
complex task but successful execution leads to
significant financial benefits. Let’s take a deep
dive into each stage of this model and analyze
the value each stage adds to the enterprise’s supply
chain planning process.
Acquire Data: The biggest driver of supply
chain planning is data. Acquiring all the relevant
data for supply chain planning is the first step
in this optimized model. It involves three steps,
namely data sourcing, data extraction and
cleansing, and data representation, which make
data ready for further analysis.
■ Data Sourcing - Data is available in
different forms across multiple sources,
systems and geographies. It contains
extensive details of historical demand
data and other relevant information. For
further analysis it is therefore necessary
to source the required data. Data that are
to be sourced for improving forecast
accuracy, in addition to transactional
data, are:
	■ Product Promotion data - items, prices, sales
	■ Launch data - items to be ramped up or down
	■ Inventory data - stock in warehouse
	■ Customer data - purchase history, social media data
	■ Transportation data - GPS and logistics data.
Enterprises should adopt appropriate
Big data systems that are capable of handling
such huge data volumes, variety and velocity.
■ Data Extraction and Cleansing - Data
sources are available in different forms,
from structured (transactional data) to
unstructured (social media, images,
sensor data, etc.), and they are not in
analysis-friendly formats. Also, due
to the large volume of heterogeneous
data, there is a high probability of
inconsistencies and data errors while
sourcing. The sourced data should be
expressed in structured form for supply
chain planning. Moreover, analyzing
inaccurate and untimely data leads to
erroneous, non-optimal results. High
quality and comprehensive data is a
valuable asset and appropriate data
cleansing mechanisms should be in
place for maintaining the quality of Big
data. Choice of Big data tools for data
cleansing and enrichment plays a crucial
role in supply chain planning (a minimal
cleansing sketch follows this list).
■ Data Representation – Database design
for such huge data volumes is a herculean
task and poses some serious performance
issues if not executed properly. Data
representation plays a key role in Big
data analysis. There are numerous ways
to store data and each design has its
own set of advantages and drawbacks.
Selecting an appropriate database design
that favors the business objectives reduces the
effort in reaping benefits out of Big data
analysis in supply chain planning.
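The cleansing sketch promised in the Data Extraction and Cleansing item above is given here. It is a minimal, assumption-laden Python example of harmonizing two hypothetical feeds (a transactional extract and a social feed) into one structured table; the column names, formats and rules are illustrative only.

# Illustrative sketch: extracting and cleansing records from two hypothetical
# sources (a transactional extract and a social feed) into one structured form.
# Column names and rules are assumptions for illustration only.
import pandas as pd

transactions = pd.DataFrame({
    "sku": ["A1", "A1", "B2", None],
    "qty": ["10", "10", "5", "7"],           # arrives as strings
    "sold_on": ["2013-01-02", "2013-01-02", "2013-01-03", "2013-01-04"],
})
social = pd.DataFrame({
    "product": ["A1", "B2"],
    "mentions": [120, 45],
    "date": ["02/01/2013", "03/01/2013"],    # different date format
})

def cleanse(df, colmap, date_fmt=None):
    """Harmonize column names, drop incomplete/duplicate rows, fix date types."""
    out = df.rename(columns=colmap).dropna().drop_duplicates().copy()
    out["date"] = pd.to_datetime(out["date"], format=date_fmt)
    return out

clean_tx = cleanse(transactions, {"sold_on": "date"})
clean_tx["qty"] = clean_tx["qty"].astype(int)
clean_social = cleanse(social, {"product": "sku"}, date_fmt="%d/%m/%Y")

# A single structured table ready for the Analyze stage.
combined = clean_tx.merge(clean_social, on=["sku", "date"], how="left")
print(combined)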
Analyze Data: The next stage is analyzing the
cleansed data and capturing value for forecasting
and supply chain planning. There is a plethora of
Big data techniques available in the market for
forecasting and supply chain planning. The
selection of a Big data technique depends on the
business scenario and enterprise objectives.
Incompatible data formats make value creation
from Big data a complex task and this calls for
innovation in techniques to unlock business
value out of the growing Big data torrent. The
proposed model adopts an optimization technique
to generate insights out of this voluminous and
diverse Big dataset.
■ Optimization in Big data analysis -
Manufacturers have started synchronizing
forecasting with production cycles,
so accuracy of forecasting plays a
crucial role in their success. Adoption
of an optimization technique in Big data
analysis creates a new perspective and
it helps in improving the accuracy of
demand forecasting and supply chain
planning. Analyzing the impact of
promotions on one specific product for
demand forecasting appears to be an easy
task. But real-life scenarios comprise
a huge array of products, with the factors
affecting their demand varying for
every product and location, making
data analysis difficult for traditional
techniques.
The optimization technique has several
capabilities which make it an ideal choice for
data analysis in such scenarios. Firstly, this
technique is designed for analyzing and drawing
insights from highly complex systems with huge
data volumes, multiple constraints and factors
to be accounted for. Secondly, supply chain
planning has a number of enterprise objectives
associated with it like cost reduction, demand
fulfillment, etc. The impact of each of these
objective measures on enterprise profitability
can be easily analyzed using the optimization
technique. Flexibility of the optimization technique
is another benefit that makes it suitable for Big
data analysis to uncover new data connections
and turn them into insights.
The optimization model comprises
four components, viz., (i) input – consistent,
real-time, quality data which is sourced,
cleansed and integrated becomes the input
of the optimization model; (ii) goals – the
model should take into consideration all
the goals pertaining to forecasting and
supply chain planning, like minimizing cost,
maximizing demand coverage, maximizing
profits, etc.; (iii) constraints – the model should
incorporate all the constraints specific to
supply chain planning; some
of these constraints are minimum inventory
in warehouse, capacity constraint, route
constraint and demand coverage constraint;
and (iv) output – results based on the input, goals
and constraints defined in the model that can
be used for strategy execution. The result can
be a demand plan, inventory plan, production
plan, logistics plan, etc.
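To ground the four components just described, here is a minimal Python sketch that expresses a tiny shipment-planning problem as a linear program: the cost vector is the goal, the inequality and equality rows are capacity and demand coverage constraints, and the solution is the output plan. The warehouses, stores and numbers are toy assumptions, not real planning data.

# Illustrative sketch of the optimization component: input, goals and
# constraints expressed as a small linear program.
from scipy.optimize import linprog

# Decision variables: units shipped from warehouses W1, W2 to stores S1, S2
# x = [W1->S1, W1->S2, W2->S1, W2->S2]

# GOAL: minimize total transportation cost.
cost = [4.0, 6.0, 5.0, 3.0]

# CONSTRAINTS
# Capacity: each warehouse can ship at most its stock.
A_ub = [[1, 1, 0, 0],    # W1->S1 + W1->S2 <= 80
        [0, 0, 1, 1]]    # W2->S1 + W2->S2 <= 90
b_ub = [80, 90]

# Demand coverage: each store's forecast demand must be met exactly.
A_eq = [[1, 0, 1, 0],    # shipments into S1 == 60
        [0, 1, 0, 1]]    # shipments into S2 == 70
b_eq = [60, 70]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 4, method="highs")

# OUTPUT: a logistics plan (units per lane) and its total cost.
print("Shipment plan:", res.x)
print("Total cost:", res.fun)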
■ Choice of Algorithm: One of the key
differentiators in supply chain planning
is the algorithm used in modeling.
Optimization problems have numerous
possible solutions and the algorithm
should have the capability to fine-tune
itself for achieving optimal solutions.

Figure 2: Optimization Model for Improving Supply Chain Visibility – II: the Acquire stage (data sourcing, data extraction & cleansing, data representation) feeds an optimization technique combining inputs, goals (minimize cost, maximize profit, maximize demand coverage) and constraints (capacity, route, demand coverage) to produce demand, inventory and logistics plans, which the Achieve stage manages through scenario management (build, compare, simulate), multi-user collaboration and performance trackers (KPI dashboards, actual vs. planned). Source: Infosys Research
Achieve Business Objective: The final stage
in this model is achieving business objectives
through demand forecasting and supply
chain planning. It involves three steps which
facilitate the enterprise’s supply chain decisions.
■ Scenario Management – Business events
are difficult to predict and most of the
time deviate from their standard paths,
resulting in unexpected behaviors and
events. This makes it difficult to plan
and optimize during uncertain times.
Scenario management is the approach
to overcome such uncertain situations.
It facilitates creating business scenarios,
comparing multiple scenarios, and
analyzing and assessing their impact
before making decisions. This
capability helps to balance conflicting
KPIs and arrive at an optimal solution
matching business needs.
■ Multi User Collaboration – The optimization
model in a real business case comprises
highly complex data sets and models,
which require support from an army
of analysts to determine their effects
on enterprise goals. A combination
of technical and domain experts is
required to obtain optimal results.
To achieve near-accurate forecasts
and supply chain optimization the
model should support multi-user
collaboration so that multiple users can
collaboratively produce optimal plans
and schedules and re-optimize as and
when business changes. This model
builds a collaborative system with the
capability of supporting inputs from
multiple users and incorporating them in its
decision making process.
■ Performance Tracker – Demand
forecasting and supply chain planning
do not follow a build-model-execute
approach; they require significant
continuous effort. Frequent changes in
the inputs and business rules necessitate
monitoring of data, model and algorithm
performance. Actual and planned results
are to be compared regularly and steps
are to be taken to minimize the deviations
in accuracy. KPIs are to be defined and
dashboards should be constantly
monitored for model performance (a
small tracking sketch follows this list).
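A small Python sketch of the tracking step referenced in the Performance Tracker item: it compares planned and actual demand and computes two simple KPIs (bias and MAPE). The figures and the re-planning threshold are assumptions made for illustration.

# Illustrative sketch of a performance tracker: comparing actual demand against
# the plan and computing simple KPIs (bias and MAPE). Thresholds are assumptions.
planned = [100, 120, 90, 150]
actual  = [ 95, 130, 85, 170]

def forecast_kpis(planned, actual):
    errors = [a - p for p, a in zip(planned, actual)]
    bias = sum(errors) / len(errors)                       # average over/under forecast
    mape = sum(abs(e) / a for e, a in zip(errors, actual)) / len(actual) * 100
    return {"bias": bias, "mape_pct": round(mape, 1)}

kpis = forecast_kpis(planned, actual)
print(kpis)
if kpis["mape_pct"] > 10:      # hypothetical re-planning trigger
    print("Deviation above threshold: re-run the optimization with fresh inputs")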
KEY BENEFITS
Enterprises can accrue a lot of benefits by adopting
this 3-stage model for Big data analysis. Some of
them are detailed below:
Improves Accuracy of Forecast: One of
the key objectives of forecasting is profit
maximization. This model adopts effective data
sourcing, cleansing and integration systems and
makes data ready for forecasting. Inclusion of
social media data, promotional data, weather
predictions and seasonality in addition to
historical demand and sales histories adds value
and improves forecasting accuracy. Moreover, the
optimization technique for Big data analysis
reduces forecasting errors to a great extent.
Continuous Improvement: The Acquire-Analyze-
Achieve model is not a hard-wired model. It
allows flexibility to fine-tune and supports
what-if analysis. Multiple scenarios can be
created, compared and simulated to identify
the impact of change on the supply chain and
demand forecasting prior to making any
decisions. Also it enables the enterprise to define,
track and monitor KPIs from time to time,
resulting in continuous process improvements.
Better Inventory Management: Inventory data
along with weather predictions, history of sales
and seasonality is considered as an input to
the model for forecasting and planning supply
chain. This approach minimizes incidents of
out-of-stock or over-stocks across different
warehouses. Optimal plan for inventory
movement is forecasted and appropriate stocks
are maintained at each warehouse to meet the
upcoming demand. To a great extent this will
reduce loss of sales and business due to stock-
outs and lead to better inventory management.
Logistic Optimization: Constant sourcing
and continuous analysis of transportation
data (GPS and other logistics data) and using
them for demand forecasting and supply chain
planning through optimization techniques
helps in improving distribution management.
Moreover, optimization of logistics improves
fuel efficiency and enables efficient routing of vehicles,
resulting in operational excellence and better
supply chain visibility.
CONCLUSIONS
As rapid penetration of information technology
in supply chain planning continues, the amount
of data that can be captured, stored and analyzed
has increased manifold. The challenge is to
derive value out of these large volumes of data
by unlocking financial benefits congruent
with the enterprises’ business objectives.
Competitive pressures and customers’
‘more for less’ attitude have left enterprises with
no option other than reducing cost in their
operational execution. Adopting effective
and efficient supply chain planning and
optimization techniques to match customer
expectations with their offerings is the key
to corporate success.
excellence and sustainable advantage, it is
necessary for the enterprise to build innovative
models and frameworks leveraging the power
of Big data.
The optimized value model on Big data
offers a unique way of demand forecasting
and supply chain optimization through
collaboration, scenario management and
performance management. With its focus on
continuous improvement, this model opens up doors
to big opportunities for the next generation of demand
forecasting and supply chain optimization.
REFERENCES
1. IDC - Press Release (2012), IDC Releases First Worldwide Big Data Technology and Services Market Forecast, Shows Big Data as the Next Essential Capability and a Foundation for the Intelligent Economy. Available at http://www.idc.com/getdoc.jsp?containerId=prUS23355112.
2. McKinsey Global Institute (2011), Big data: The next frontier for innovation, competition, and productivity. Available at http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx.
3. Furio, S., Andres, C., Lozano, S., Adenso-Diaz, B. (2009), Mathematical model to optimize land empty container movements. Available at http://www.fundacion.valenciaport.com/Articles/doc/presentations/HMS2009_Paperid_27_Furio.aspx.
4. Stojković, G., Soumis, F., Desrosiers, J., Solomon, M. (2001), An optimization model for a real-time flight scheduling problem. Available at http://www.sciencedirect.com/science/article/pii/S0965856401000398.
5. Beck, M., Moore, T., Plank, J., Swany, M. (2000), Logistical Networking. Available at http://loci.cs.utk.edu/ibp/files/pdf/LogisticalNetworking.pdf.
6. Lasschuit, W., Thijssen, N. (2004), Supporting supply chain planning and scheduling decisions in the oil and chemical industry, Computers and Chemical Engineering, issue 28, pp. 863–870. Available at http://www.aimms.com/aimms/download/case_studies/shell_elsevier_article.pdf.
Retail Industry –
Moving to Feedback Economy
By Prasanna Rajaraman and Perumal Babu

Gain better insight into customer dynamics through Big Data analytics

The retail industry is going through a major
paradigm shift. The past decade has seen
unprecedented churn in the retail industry,
virtually changing the landscape. Erstwhile marquee
brands from the traditional retailing side have
ceded space to start-ups and new business
models.
The key driver of this change is a
confluence of technological, sociological and
customer behavioral trends creating a
strategic inflection point in the retailing ecology.
Trends like the emergence of the internet as a major
retailing channel, social platforms going
mainstream, pervasive retailing and the emergence
of the digital customer have presented a major
challenge to traditional retailers and retailing
models.
On the other hand, these trends have
also enabled opportunities for retailers to better
understand customer dynamics. For the first
time, retailers have access to an unprecedented
amount of publicly available information on
customer behavior and trends, voluntarily
shared by customers. The more effective
retailers can tap into these behavioral and social
reservoirs of data to model purchasing behaviors
and trends of their current and prospective
customers. Such data can also provide the
retailers with predictive intelligence, which
if leveraged effectively can create enough
mindshare that the sale is completed even
before the conscious decision to purchase is
taken.
This move to a feedback economy,
where retailers can have a 360-degree view of
the customer thought process across the selling
cycle, is a paradigm shift for the retail industry –
from the retailer driving sales to the retailer engaging
the customer across the sales and support cycle.
Every aspect of retailing from assortment/
allocation planning, marketing/promotions to
customer interactions has to take the evolving
consumer trends into consideration.
The implication from a business
perspective is that retailers have to better
understand customer dynamics and align
business processes effectively with these
trends. In addition, this implies that cycle
times will be shorter and businesses have to be
more tactical in their promotions and offerings.
Retailers who can ride this wave will be better
able to address demand and command higher
margins for their products and services. Failing
this, retailers will be left in the low-margin
pricing/commodity space.
From an information technology
perspective, the key challenge is that the nature
of this information, with respect to lifecycle,
velocity, heterogeneity of the sources
and volume, is radically different from what
traditional systems handle. Also, there are
overarching concerns like data privacy,
compliance and regulatory changes that need
to be accommodated within internal processes. The
key is to manage the lifecycle of this Big data and
effectively integrate it with the organizational
systems to derive actionable information.
TOWARDS A FEEDBACK ECONOMY
Customer dynamics refers to customer-
business relationships that describe the ongoing
interchange of information and transactions
between customers and organizations that
goes beyond the transactional nature of
the interaction to look at emotions, intent
and desires. Retailers can create significant
competitive differentiation by understanding
the customer’s true intent in a way that also
supports the business’ intents [1, 2, 3, 4].
John Boyd, a colonel and military strategist
in the US Air Force, developed the OODA loop
(Observe, Orient, Decide and Act), which he
used for combat operations. Today’s business
environment is no different: retailers
are battling to get customers into their shops
(physical or net-front) and convert their visits
to sales. Understanding customer dynamics
plays a key role in this effort. The OODA loop
explains the crux of the feedback economy.
Figure 1: The OODA loop (Observe, Orient, Decide, Act): observation of unfolding circumstances, outside information and interaction with the environment feeds orientation (cultural traditions, genetic heritage, new information, previous experiences, analysis and synthesis), which drives decision (hypothesis) and action (test), with feed-forward, feedback and implicit guidance and control paths linking the phases. Source: Reference [5]
In a feedback economy, there is constant feedback
to the system from every phase of its execution.
Along with this, the organization should
observe the external environment, unfolding
circumstances and customer interactions. These
inputs are analyzed and action is taken based
on them. This cycle of adaptation and
optimization makes the organization more
efficient and effective on an ongoing basis.
Leveraging this feedback loop is pivotal
in having a proper understanding of customer
needs and wants and the evolving trends. In
today’s environment, this means acquiring
data from heterogeneous sources, viz., in-
store transaction history, web analytics, etc.
This creates a huge volume of data that has
to be analyzed to get the required actionable
insights.
BIG DATA LIFECYCLE: ACQUIRE-
ANALYZE-ACTIONIZE
The lifecycle of Big data can be visualized as a
three-phased approach resulting in continuous
optimization. The first step in moving towards a
feedback economy is to acquire data. In this
case, the retailer should look into the macro and
micro environment trends and consumer behavior
- their likes, emotions, etc. Data from electronic
channels like blogs, social networking sites
and Twitter will give the retailer a humongous
amount of data regarding the consumer. These
feeds help the retailer understand consumer
dynamics and give more insights into her
buying patterns.
The key advantage of plugging into these
disparate sources is the sheer information one
can gather about the customer – both individually
and in aggregate. On the other hand, Big data is
materially different from the data the retailers
are used to handling. Most of the data is
unstructured (from blogs, Twitter feeds, etc.) and
cannot be directly integrated with traditional
analytics tools, leading to challenges on how the
data can be assimilated with backend decision
making systems and analyzed.
In the assimilate/analyze phase, the retailer must decide which data is of use and define rules for filtering out the unwanted data. Filtering should be done with utmost care, as there are cases where indirect inferences are possible. The data available to the retailer after the acquisition phase will be in multiple formats and has to be cleaned and harmonized with the backend platforms.
Cleaned-up data is then mined for actionable insights. Actionize is the phase where the insights gathered in the analyze phase are converted into actionable business decisions by the retailer. The response, i.e., the business outcome, is fed back to the system so that the system can self-tune on an ongoing basis, resulting in a self-adaptive system that leverages Big data and feedback loops to offer business insight more customized than what would traditionally be possible. It is imperative to understand that this feedback cycle is an ongoing process and not a one-stop solution for the analytics needs of a retailer.
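To make the three phases concrete, the sketch below shows one minimal way the acquire-analyze-actionize cycle with a feedback step could be wired together in code. It is only an illustration: the function names, the mention-count "analysis" and the promotion threshold are assumptions, not part of any specific retail platform.

# Minimal sketch of the acquire-analyze-actionize cycle with a feedback step.
# Function names, the mention-count analysis and the threshold are illustrative.

def acquire(sources):
    """Pull raw records from heterogeneous feeds (POS, web analytics, social)."""
    return [record for source in sources for record in source()]

def analyze(records, filters):
    """Filter out unwanted feeds, then derive a simple demand signal."""
    cleaned = [r for r in records if all(f(r) for f in filters)]
    demand = {}
    for r in cleaned:
        demand[r["product"]] = demand.get(r["product"], 0) + 1
    return demand

def actionize(demand, threshold=100):
    """Convert insights into decisions, e.g., flag products for a promotion."""
    return [product for product, mentions in demand.items() if mentions >= threshold]

def feedback_cycle(sources, filters, outcomes):
    """One pass of the loop; recorded outcomes would tune filters and thresholds over time."""
    decisions = actionize(analyze(acquire(sources), filters))
    outcomes.append(decisions)
    return decisions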
ACQUIRE: FOLLOWING CUSTOMER
FOOTPRINTS
To understand the customer, retailers have to
leverage every interaction with the customer
and tap into the source of customer insight.
Traditionally, retailers have relied primarily on
in-store customer interactions and associated
transaction data along with specialized campaigns
like opinion polls to gain better insight into
customer dynamics. While this interaction looks
limited, a recent incident shows how powerful
customer sales history can be leveraged to gain
predictive intelligence on customer needs.
“A father of a teenage girl called a major North American retailer to complain that the retailer had mailed coupons for child care products addressed to his underage daughter. A few days later, the same father called back and apologized: his daughter was indeed pregnant and he had not been aware of it earlier” [6]. Surprisingly, by all indications, only in-store purchase data was mined by the retailer in this scenario to identify the customer need, which in this case was for childcare products.
To exploit the power of the next generation of analytics, retailers must plug into data from non-traditional sources like social sites, Twitter feeds, environmental sensor networks, etc. to gain better insight into customer needs. Most major retailers now have multiple channels – brick-and-mortar store, online store, mobile apps, etc. Each of these touch points not only acts as a sales channel but can also generate data on customer needs and wants. Coupling this information with other repositories like Facebook posts, Twitter feeds (i.e., sentiment analysis) and web analytics, retailers have the opportunity to track customer footprints both in and outside the store and to customize their offerings and interactions with the customer.
Traditionally retailers have dealt with
voluminous data. For example, Wal-Mart logs
more than 2.5 petabytes of information about
customer transactions every hour, equivalent
to 167 times the books in the Library of
Congress [7].
However, the nature of Big data is materially different from traditional transaction data and this must be considered while data planning is done. Further, while data is readily available, the legality and compliance aspects of gathering and using it are additional considerations. Integrating information from multiple sources can also generate data that is beyond what the user originally consented to, potentially resulting in liability for the retailer. Given that most of this information is accessible globally, retailers should ensure compliance with local regulations (EU data/privacy protection regulations, HIPAA for US medical data, etc.) wherever they operate.
ANALYZE: INSIGHTS LEAD TO INNOVATION
Analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e., increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources) [9]. The key to acquiring Big data is to handle these dimensions while assimilating the aforementioned external sources of data. To understand how Big data analytics can enrich and enhance a typical retail process – allocation planning – let us look at the allocation planning case study of a major North American apparel retailer.
The forecasting engine used in the planning process applies statistical algorithms to determine allocation quantities. Key inputs to the forecasting engine are sales history and the current performance of the store. In addition, adjustments are also based on parameters like promotional events (including markdowns), current stock levels and back orders to determine the inventory that needs to be shipped to a particular store.
While this is fairly in line with the industry standard for allocation forecasting, Big data can enrich this process by including additional parameters that can impact demand. For example, a news piece on a town’s go-green initiative or no-plastic day can be taken as an additional adjustment parameter for non-green items in that area.
Similarly, a weather forecast of a warm front in an area can automatically trigger a reduction in stocks of warm clothing for stores there.
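A minimal sketch of this kind of adjustment is shown below. It simply scales a baseline statistical forecast by factors tied to external signals; the signal names, baseline figure and factors are assumptions made for illustration, not values from the case study.

# Illustrative adjustment of a baseline allocation forecast using external signals.
# The baseline units, signal names and adjustment factors are assumptions for this sketch.

def adjusted_allocation(baseline_units, signals):
    """Scale the statistically forecast quantity by external-signal factors."""
    adjustment_factors = {
        "warm_front_forecast": 0.7,   # trim warm-clothing stock ahead of warm weather
        "go_green_initiative": 0.8,   # reduce non-green items where a no-plastic day is announced
    }
    factor = 1.0
    for signal in signals:
        factor *= adjustment_factors.get(signal, 1.0)
    return round(baseline_units * factor)

# Example: a store forecast of 500 winter jackets, with a warm front expected.
print(adjusted_allocation(500, ["warm_front_forecast"]))  # -> 350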
A high-level logical view of a Big data implementation is explained below to further the understanding of how Big data can be assimilated with traditional data sources. The data feeds for the implementation come from structured sources like forums, feedback forms and rating sites, unstructured sources like the social web, as well as semi-structured data from emails, word documents, etc. This is a veritable data feast compared to what traditional systems are served, but it is important to be selective and use only those feeds that create optimum value. This is done through a synergy of business knowledge and processes specific to the retailer and the industry segment the retailer operates in, and a set of tools specialized in analyzing huge volumes of data at rapid speed. Once data is massaged for downstream systems, Big data analytics tools are used to analyze it. Based on business needs, real-time or offline data processing/analytics can be used; in real-life scenarios, both approaches are used based on situation and need.
Proper analysis needs data not just from consumer insight sources but also from transactional data history and consumer profiles.
ACTIONIZE – BIG DATA TO BIG IDEAS
This is the key part of the Big data cycle. Even the best data cannot be a substitute for timely action. The technology and functional stack will facilitate the retailer getting proper insight into key customer intent on purchase – what, where, why and at what price. By knowing this,
Figure 2: Correlation between customer ratings and sales – best sellers vs. most wished for in tablet PCs. Source: Reference [12]
the retailer can customize the 4Ps (product, pricing, promotions and place) to create enough mindshare from the customer’s perspective that sales become inevitable [10]. For example, a cursory look at a random product category (tablets) on an online retailer’s site shows a strong correlation between customer ratings and sales, i.e., 4 out of 6 of the best user-rated products are in the top five in sales – a 60% correlation even when other parameters like brand, price and release date are not taken into consideration [Fig. 2] [12]. The retailer, knowing the customer ratings, can offer promotions that can tip the balance between a sale and a lost opportunity. While this example may not be the rule, the key to analyzing and actionizing the data is to correlate user feedback data with concomitant sales.
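One simple way to quantify such a relationship is to measure how many of the top user-rated products also appear in the best-seller list. The sketch below does exactly that for made-up product identifiers; the lists are placeholders, not the retailer data behind Figure 2.

# Sketch: measure the overlap between a top-rated list and a best-seller list.
# Product identifiers are placeholders; real lists would come from the retailer's catalog feeds.

def overlap_ratio(top_rated, best_sellers):
    """Fraction of top-rated items that also appear among the best sellers."""
    hits = set(top_rated) & set(best_sellers)
    return len(hits) / len(top_rated)

top_rated = ["tablet_a", "tablet_b", "tablet_c", "tablet_d", "tablet_e", "tablet_f"]
best_sellers = ["tablet_b", "tablet_a", "tablet_f", "tablet_g", "tablet_c"]

print(f"{overlap_ratio(top_rated, best_sellers):.0%}")  # -> 67% for this made-up data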
BIG DATA OPPORTUNITIES
The implications of Big data analytics for major retailing processes will be along the following areas.
■ Identifying the Product Mix: Assortment and allocation will need to take into consideration the evolving user trends identified from Big data analytics to ensure the offering matches market needs. Allocation planning especially has to be tactical, with shorter lead times.
■ Promotions and Pricing: Retailers have to move from generic pricing strategies to customized, user-specific ones.
■ Communication with the Customer: Advertising will move from mass media to personalized communication, and from one-way to two-way communication. Retailers will gain more from viral marketing [12] than from traditional advertising channels.
■ Compliance: Governmental regulations and compliance requirements are mandatory to avoid liability, as co-mingling data from disparate sources can result in the generation of personal data beyond the scope of the original user’s intent. While data is available globally, its use has to comply with the local law of the land and be done keeping in mind the customer’s sensibilities.
■ People, Process and Organizational Dynamics: The move to a feedback economy requires a different organizational mindset and processes. Decision making will need to be more bottom-up and collaborative. Retailers need to engage customers to ensure the feedback loop is in place. Further, Big data, being cross-functional, needs the active participation and coordination of various departments in the organization; hence managing organizational dynamics is a key consideration.
■ Better Customer Experience: Organizations can improve the overall customer experience by providing proactive updates and thereby eliminating surprises. For instance, Big data solutions can be used to proactively inform customers of expected shipment delays based on traffic data, climate and other external factors.
BIG DATA ADOPTION STRATEGY
Presented below is a perspective on how to adopt a Big data solution within the enterprise.
Define Requirements, Scope and Mandate:
Define the mandate and objective in terms of what is required from the Big data solution. A guiding factor in identifying the requirements would be the prioritized list of business strategies. As part of initiation, it is important to also identify the goals and KPIs that vindicate the usage of Big data.
Key Player: Business
Choosing the Right Data Sources:
Once the requirements and scope are defined, the IT department has to identify the various feeds that would fetch the relevant data. These feeds would be structured, semi-structured and unstructured, and the sources could be internal or external. For internal sources, policies and processes should be defined to enable frictionless flow of data.
Key Players: IT and Business
Choosing the Required Tools and Technologies:
After deciding upon the sources of data that would feed the system, the right tools and technology should be identified and aligned with business needs. Key areas are capturing the data, tools and rules to clean the data, identifying tools for real-time and offline analytics, and identifying storage and other infrastructure needs.
Key Player: IT
Creating Inferences from Insights:
One of the key factors in a successful Big data implementation is to have a pool of talented data analysts who can create proper inferences from the insights and facilitate the building and definition of new analytic models. These models help in probing the data and understanding the insights.
Key Player: Data Analyst
Strategy to Actionize the Insights:
Business should create processes that take these inferences as inputs to decision making. Stakeholders in decision making should be identified and actionable inferences have to be communicated at the right time. Speed is critical to the success of Big data.
Key Player: Business
Measuring the Business Benefts:
The success of the Big data initiative depends on the value it creates for the organization and its decision-making body. It should also be noted that, unlike other initiatives, Big data initiatives are usually a continuous process in search of the best results, and organizations should be in tune with this understanding to derive the best results. However, it is important that a goal is set and measured to track the initiative and ensure its movement in the right direction.
Key Players: IT and Business
CONCLUSION
The move to a feedback economy presents an inevitable paradigm shift for the retail industry. Big data as the enabling technology will play a key role in this transformation. As ever, business needs will continue to drive technology, process and solution. However, given the criticality of Big data, organizations will need to treat Big data as an existential strategy and make the right investments to ensure they can ride the wave.
REFERENCES
1. Customer dynamics. Available at http://en.wikipedia.org/wiki/Customer_dynamics.
2. Davenport, T. and Harris, G. (2007), Competing on Analytics, Harvard Business School Publishing.
3. De Borde, M. (2006), Do Your Organizational Dynamics Determine Your Operational Success?, The O and P Edge.
4. Lemon, K., Barnett White, T. and Winer, R. S., Dynamic Customer Relationship Management: Incorporating Future Considerations into the Service Retention Decision, Journal of Marketing.
5. Boyd, J. (September 3, 1976). OODA loop, in Destruction and Creation. Available at http://en.wikipedia.org/wiki/OODA_loop.
6. Doyne, S. (2012), Should Companies Collect Information About You?, NY Times. Available at http://learning.blogs.nytimes.com/2012/02/21/should-companies-collect-information-about-you/.
7. Data, data everywhere (2010), The Economist. Available at http://www.economist.com/node/15557443.
8. IDC Digital Universe (2011). Available at http://chucksblog.emc.com/chucks_blog/2011/06/2011-idc-digital-universe-study-big-data-is-here-now-what.html.
9. Gartner Says Solving ‘Big data’ Challenge Involves More Than Just Managing Volumes of Data (2011). Available at http://www.gartner.com/it/page.jsp?id=1731916.
10. Gens, F. (2012). IDC Predictions 2012: Competing for 2020. Available at http://cdn.idc.com/research/Predictions12/Main/downloads/IDCTOP10Predictions2012.pdf.
11. Bhasin, H. 4Ps of marketing. Available at http://www.marketing91.com/marketing-mix-4-ps-marketing/.
12. Amazon US site, tablets category (2012). Available at http://www.amazon.com/gp/top-rated/electronics/3063224011/ref=zg_bs_tab_t_tr?pf_rd_p=1374969722&pf_rd_s=right-8&pf_rd_t=2101&pf_rd_i=list&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=14YWR6HBVR6XAS7WD2GG.
13. Godin, S. (2008), Viral marketing. Available at http://sethgodin.typepad.com/seths_blog/2008/12/what-is-viral-m.html.
14. Wang, R. (2012), Monday’s Musings: Beyond The Three V’s of Big data – Viscosity and Virality. Available at http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/.
Harness Big Data Value and
Empower Customer Experience
Transformation
By Zhong Li PhD
In today’s hyper-competitive experience
economy, communication service providers
(CSPs) recognize that product and price alone
will not differentiate their business and brand.
Since brand loyalty, retention and long-term
profitability are now so closely aligned with
customer experience, the ability to understand
customers, spot changes in their behavior
and adapt quickly to new consumer needs is
fundamental to the success of the consumer-driven communication services industry.
The increasingly sophisticated digital consumers demand more personalized services through the channel of their choice. In fact, the internet, mobile and, particularly, the rise of social media in the past 5 years have empowered consumers more than ever before. There is a growing challenge for CSPs that are contending with an increasingly scattered relationship with customers who can now choose from multiple channels to conduct business interactions. Recent industry research indicates that some 90% of today’s consumers in the US and Western Europe interact across multiple channels, representing a moving target that makes achieving a full view of the customer that much more challenging.
To compound this trend, the always-on digital customers continuously create more data of various types, from many more touch points, with more interaction options. CSPs encounter the “Big data phenomenon” by accumulating significant amounts of customer-related information such as purchase patterns and activities on the website, from mobile, social media or interactions with the network and call centre.
Such a Big data phenomenon presents CSPs with challenges along 3V dimensions (Fig. 1), viz.,
Communication Service Providers need to
leverage the 3M Framework with a holistic
5C process to extract Big Data value (BDV)
■ Large Volume: Recent industry research shows that the amount of data that the CSP has to manage from consumer transactions and interactions has doubled in the past three years, and its growth is accelerating, set to double in size again in the next two years, with much of it coming from new sources including blogs, social media, internet search and networks [7].
■ Broad Variety: The type, form and format of data are created in a broad variety. Data is created from multiple channels such as online, call centre, stores and social media including Facebook, Twitter and other platforms. It presents itself in a variety of types, comprising structured data from transactions, semi-structured data from call records and unstructured data in multimedia form from social interactions.
■ Rapidly Changing Velocity: The always-on digital consumers change the dynamics of data at the speed of light. They equally demand fast responses from CSPs to satisfy their personalized needs in real time.
CSPs of all sizes have learned the hard way that it is very difficult to take full advantage of all of the customer interactions in Big data if they do not know what their customers are demanding or what their relative value to the business is. Even some CSPs that do segment their customers with the assistance of a customer relationship management (CRM) system struggle to take complete advantage of that segmentation in developing a real-time value strategy. Amid hyper-sophisticated interaction patterns throughout the customer journey spanning marketing, research, order, service and retention, Big data sheds a shining light that exposes treasured customer intelligence along the aspects of the 4Is, viz., interest, insight, interaction and intelligence.
■ Interest and Insight: Customers offer their attention out of interest and share their insights. They visit a web site, make a call, access a retail store or share a view on social media because they want something from the CSP at that moment – information about a product or help with a problem. These interactions present an opportunity for the CSP to communicate with a customer who is engaged by choice and ready to share information regarding her personalized wants and needs.
■ Interaction and Intelligence: It is typically crucial for CSPs to target offerings to particular customer segments based on the intelligence of customer data. The success of these real-time interactions – whether through online, mobile, social media or other channels – depends to a great extent on the CSP’s understanding of the customer’s wants and needs at the time of the interaction.
Figure 1: Big data in 3Vs is accumulated from multiple channels (web, mobile, store, call centre, social). Source: Infosys Research
Therefore, alongside managing and securing
Big data in 3V dimensions, CSPs are facing a
fundamental challenge on how to explore and
harness Big data Value (BDV).
A HOLISTIC 5C PROCESS TO
HARNESS BDV
Rising to the challenges and leveraging the opportunity in Big data, CSPs need to harness BDV with predictive models that provide deeper insight into customer intelligence from the profiles, behaviours and preferences hidden in Big data of vast volume and broad variety, and deliver a superior personalized experience with fast velocity, in real time, throughout the entire customer journey.
In the past decade, most CSPs have invested a significant amount of effort in the implementation of complex CRM systems to manage customer experience. While those CRM systems bring efficiency in helping CSPs deliver on “what” to do in managing historical transactions, they lack the crucial capability of defining “how” to act in time with the most relevant interaction to maximize value for the customer.
CSPs now need to look beyond what CRM has to offer and dive deeper to cover “how” to do things right for the customer: capture the customer’s subjective sentiment in a particular interaction, turn the resultant insight into a prediction of what the customer demands from the CSP, and trigger proactive action to satisfy those needs, which is more likely to lead to customer delight and, ultimately, revenues.
To do so, CSPs need to execute a holistic 5C process, i.e., collect, converge, correlate, collaborate and control, in extracting BDV (Fig. 2).
The holistic 5C process will help CSPs aggregate the whole interaction with a customer across time and channels, support it with a large volume and broad variety of data including promotion, product, order and service, and align interactions with the customer’s preferences. The context of the customer’s relationship with the CSP, and the actual and potential value that she derives, in particular, determine the likelihood that the consumer will take particular actions based on real-time intelligence. Big data can help the CSP correlate the customer’s needs with product, promotion, order and service, and deliver the right offer at the right time in the appropriate context that she is most likely to respond to.
AN OVERARCHING 3M FRAMEWORK
TO EXTRACT BDV
To execute a holistic 5C process for Big data, CSPs need to implement an overarching framework that integrates the various pools of customer-related data residing in the CSP’s enterprise systems, creates an actionable customer profile, delivers insight based on that profile during real-time customer interaction events and effectively matches sales and service resources to take proactive actions, so as to monetize ultimate value on the fly.
Figure 2: Harness BDV with a holistic 5C process (collect, converge, correlate, collaborate, control) across customer, product, service, order and promotion data. Source: Infosys Research
The overarching framework needs to incorporate 3M modules, i.e., Model, Monitor and Mobilize.
■ Model Profile: It models the customer profile based on all the transactions, which helps CSPs gain insight at the individual-customer level. Such a profile requires not only integration of all customer-facing systems and enterprise systems, but also integration with all the customer interactions such as email, mobile, online and social in enterprise systems such as OMS, CMS, IMS and ERP, in parallel with the CRM paradigm, and it models an actionable customer profile so that resources can be deployed effectively for a distinct customer experience.
■ Monitor Pattern: It monitors customer interaction events from multiple touch points in real time, dynamically senses and matches patterns of events against the defined policies and set models, and makes suitable recommendations and offers at the right time through an appropriate channel. It enables CSPs to quickly respond to changes in the marketplace – a seasonal change in demand, for example – and bundle offerings that will appeal to a particular customer, across a particular channel, at a particular time.
■ Mobilize Process: It mobilizes a set of automations that allows customers to enjoy a personalized, engaging journey in real time that spans outbound and inbound communications, sales, orders, service and help intervention, and fulfils the customer’s next immediate demand.
The 3M framework needs to be based on an event-driven architecture (EDA) incorporating an Enterprise Service Bus (ESB) and Business Process Management (BPM), and should be application and technology agnostic. It needs to interact with multiple channels using events; match patterns of sets of events against pre-defined policies, rules and analytical models; and deliver a set of automations to fulfil a personalized experience that spans the complete customer lifecycle.
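The sketch below illustrates the event-driven idea in miniature: events arriving from different channels are kept in a short window and matched against one pre-defined policy, and a match hands an action to the automation layer. The event names, the policy and the window size are assumptions; a real deployment would sit on an ESB/BPM stack rather than in-process Python.

# Minimal sketch of event-pattern matching in an event-driven flow.
# Event names, the policy and the triggered action are illustrative assumptions.

from collections import deque

POLICY = {
    "name": "dropped_cart_follow_up",
    "pattern": ["web_browse", "add_to_cart", "abandon_session"],  # ordered sequence
    "action": "send_personalized_offer",
}

def matches(window, pattern):
    """True if the recent events contain the pattern as an ordered subsequence."""
    it = iter(window)
    return all(any(event == step for event in it) for step in pattern)

def handle_event(window, event, policy=POLICY, max_window=20):
    window.append(event)
    if len(window) > max_window:
        window.popleft()
    if matches(window, policy["pattern"]):
        return policy["action"]  # hand off to the automation/BPM layer
    return None

window = deque()
for evt in ["web_browse", "call_centre_query", "add_to_cart", "abandon_session"]:
    action = handle_event(window, evt)
print(action)  # -> send_personalized_offer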
Furthermore, the 3M framework needs to be supported with key high-level functional components, which include:
■ Customer Intelligence from Big data: A typical implementation of customer intelligence from Big data is the combination of a data warehouse and real-time customer intelligence analytics. It requires aggregation of customer and product data from the CSP’s various data sources in BSS/OSS, leveraging the CSP’s existing investments in data models, workflows, decision tables, user interfaces, etc. It also integrates with the key modules in the CSP’s enterprise landscape, covering:
■ Customer Management: A complete customer relationship management solution combines a 360-degree view of the customer with intelligent guidance and seamless back-office integration to increase first contact resolution and operational efficiency.
■ Offer Management: CSP-specific specialization and re-use capabilities that define new services, products, bundles, fulfilment processes and dependencies, and rapidly capitalize on new market opportunities and improve customer experience.
■ Order Management: Configurable best practices for creating and maintaining a holistic order journey that is critical to the success of such product-intensive functions as account opening, quote generation, ordering, contract generation, product fulfilment and service delivery.
■ Service Management: Case-based work automation and a complete view of each case enable effective management of every case throughout its lifecycle.
■ Event Driven Process Automation: A dynamic process automation engine empowered with EDA leverages the context of the interaction to orchestrate the flow of activities, guiding customer service representatives (CSRs) and self-service customers through every step in their inbound and outbound interactions, in particular for Campaign Management and Retention Management.
■ Campaign Management: Outbound interactions are typically used to target products and services at particular customer segments, based on analysis of customer data, through appropriate channels. It uncovers relevant, timely and actionable consumer and network insights to enable intelligently driven marketing campaigns; to develop, define and refine marketing messages and target customers with a more effective plan; and to meet customers at the touch points of their choosing through optimized display and search results, while generating demand via automated email creation, delivery and results tracking.
■ Retention Management: Customers offer their attention, either intrusively or non-intrusively, to look for the products and services that meet their needs through the channel of their choice. It dynamically captures consumer data from highly active and relevant outlets such as social media, websites and other social sources and enables CSPs to quickly respond to customer needs and proactively deliver relevant offers for upgrades and product bundles that take into account each customer’s personal preferences.
■ Experience Personalization: It provides the customer with a personalized, relevant experience, enabled by business process automation that connects people, processes and systems in real time and eliminates product, process and channel silos. It helps CSPs extend predictive targeting beyond basic cross-sells to automate more of their cross-channel strategies and gain valuable insights from hidden consumption and interaction patterns.
Overall, the 3M framework will empower a BDV solution for the CSP to execute real-time decisions that align individual needs with business objectives and dynamically fulfil the next best action or offer that will increase the value of each personalized interaction.
BDV IN ACTION: CUSTOMER EXPERIENCE OPTIMIZATION
By implementing the proposed BDV solution, CSPs can optimize the customer experience, delivering the right interaction with each customer at the right time so as to build strong relationships, reduce churn and increase customer value to the business.
■ From the Customer Experience Perspective: It provides the CSP with real-time, end-to-end visibility into all the customer interaction events taking place across multiple channels; by correlating and analyzing these events using a set of business rules, it automatically takes proactive actions which ultimately lead to customer experience optimization. It helps CSPs turn their multi-channel contacts with customers into cohesive, integrated interaction patterns, allowing them to better segment their customers and ultimately take full advantage of that segmentation, delivering personalized experiences that are dynamically tailored to each customer while dramatically improving interaction effectiveness and efficiency.
■ From the CSP’s Perspective: It helps CSPs quickly weed out underperforming campaigns and learn more about their customers and their needs. From retail store to contact centre to web to social media, it helps CSPs deliver a new standard of branded, consistent customer experiences that build deeper, more profitable and lasting relationships. It enables CSPs to maximize productivity by handling customer interactions as fast as possible in the most profitable channel.
At every point in the customer lifecycle, from marketing campaigns, offers and orders to servicing and retention efforts, BDV helps inform interactions with that customer’s preferences, the context of her relationship with the business, and actual and potential value, enabling CSPs to focus on creating personalized experiences that balance the customer’s needs with business values.
■ Campaign Management: BDV delivers focused campaigns on the customer with predictive modelling and cost-effective campaign automation that consistently distinguishes the brand and supports personalized communications with prospects and customers.
■ Offer Management: BDV dynamically generates offers that account for such factors as the current interaction with the customer, the individual’s total value across product lines, past interactions, and likelihood of defecting. It helps deliver optimal value and increases the effectiveness of propositions with next-best-action recommendations tailored to the individual customer (a minimal scoring sketch follows this list).
■ Order Management: BDV enables unified process automation applicable to multiple product lines, with agile and flexible workflow, rules and process orchestration that accounts for individual needs in product pricing, configuration, processing, payment scheduling and delivery.
■ Service Management: BDV empowers customer service representatives to act based on the unique needs and behaviours of each customer, using real-time intelligence combined with holistic customer content and context.
■ Retention Management: BDV helps CSPs retain more high-value customers with targeted next-best-action dialogues. It consistently turns customer interactions into sales opportunities by automatically prompting customer service representatives to proactively deliver relevant offers to satisfy each customer’s unique needs.
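As referenced under Offer Management above, a next-best-action decision can be reduced to scoring candidate offers against what is known about the customer. The sketch below is a deliberately simple illustration: the offers, weights and churn-risk adjustment are assumptions, not a production propensity model.

# Illustrative next-best-action scoring across candidate offers.
# The offers, weights and customer attributes are assumptions for this sketch.

CANDIDATE_OFFERS = {
    "device_upgrade":   {"base_value": 40, "premium": True,  "retention_weight": 0.2},
    "retention_bundle": {"base_value": 30, "premium": False, "retention_weight": 1.0},
    "data_add_on":      {"base_value": 15, "premium": False, "retention_weight": 0.5},
}

def next_best_action(customer):
    """Pick the offer with the best blend of value and churn-risk relevance."""
    scored = []
    for name, offer in CANDIDATE_OFFERS.items():
        score = offer["base_value"]
        score += 40 * offer["retention_weight"] * customer["churn_risk"]  # favour retention when churn risk is high
        if offer["premium"] and customer["lifetime_value"] < 1000:
            score -= 50  # suppress premium offers for low-value accounts
        scored.append((score, name))
    return max(scored)[1]

customer = {"churn_risk": 0.8, "lifetime_value": 650, "last_interaction": "billing_complaint"}
print(next_best_action(customer))  # -> retention_bundle for this made-up profile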
CONCLUSION
Today’s increasingly sophisticated digital consumers expect CSPs to deliver product, service and interaction experiences designed “just for me at this moment.” To take on the challenge, CSPs need to deliver customer experience optimization powered by BDV in real time.
By implementing an overarching 3M BDV framework to execute a holistic 5C process, new products can be brought to market with faster velocity and with the ability to easily adapt common services to accommodate unique customer and channel needs.
Suffice it to say that BDV will enable CSPs to deliver customer-focused experiences that match responses to specific individual demands; provide real-time intelligent guidance that streamlines complex interactions; and automate interactions from end to end. The result is an optimized customer experience that helps CSPs substantially increase customer satisfaction, retention and profitability, and consequently empowers CSPs to evolve into the experience-centric Tomorrow’s Enterprise.
REFERENCES
1. IBM Big data solutions deliver insight and relevance for digital media – Solution Brief, June 2012. Available at www-05.ibm.com/fr/events/netezzaDM.../Solutions_Big_Data.pdf.
2. Oracle Big data Premier Presentation (May 2012). Available at http://premiere.digitalmedianet.com/articles/viewarticle.jsp?id=1962030.
3. SAP HANA™ for Next-Generation Business Applications and Real-Time Analytics (July 2012). Available at http://www.saphana.com/docs/DOC-1507.
4. SAS® High-Performance Analytics (June 2012). Available at http://www.sas.com/reg/gen/uk/hpa?gclid=CJKpvvCJiLQCFbMbtAodpj4Aaw.
5. Transform the Customer Experience with Pega-CRM (2012). Available at http://www.pega.com/sites/default/files/private/Transform-Customer-Experience-with-Pega-CRM-WP-Apr2012.pdf.
6. The Forrester Wave™: Enterprise Hadoop Solutions for Big data, Feb 2012. Available at http://center.uoregon.edu/AIM/uploads/INFOTEC2012/HANDOUTS/KEY_2413506/Infotec2012BigDataPresentationFinal.pdf.
7. Shah, S. (2012), Top 5 Reasons Communications Service Providers Need Operational Intelligence. Available at http://blog.vitria.com/bid/88402/Top-5-Reasons-Communications-Service-Providers-Need-Operational-Intelligence.
8. Connolly, S. and Wooledge, S. (2012), Harnessing the Value of Big data Analytics. Available at http://www.asterdata.com/wc-0217-harnessing-value-bigdata/.
Liquidity Risk Management and
Big Data: A New Challenge for Banks
By Abhishek Kumar Sinha
During the 2008 financial crisis, banks faced an enormous challenge in managing liquidity and remaining solvent. As many financial institutions failed, those who survived the crisis have fully understood the importance of liquidity risk management. Managing liquidity risk on simple spreadsheets can lead to non-real-time and inappropriate information that may not be enough for efficient liquidity risk management (LRM). Banks must have reliable data on daily positions and other liquidity measures that have to be monitored continuously. At signs of stress, like changes in the liquidity of various asset classes and unfavorable market conditions, banks need to react to these changes in order to remain credible in the market. In banking, liquidity risk and reputation are so heavily linked that even a single liquidity event can lead to catastrophic funding problems for a bank.
MISMANAGEMENT OF LIQUIDITY RISK:
SOME EXAMPLES OF FAILURES
Northern Rock was a star-performing UK bank until the 2007 crisis struck. Its funding came mostly from wholesale and capital market sources. Hence, in the 2008 crisis, when these funding avenues dried up across the globe, it was unable to fund its operations. During the crisis, the bank’s stock fell 32%, accompanied by a depositor run on the bank. The central bank had to intervene and support the bank in the form of deposit protection and money market operations. Later, the Government took the ultimate step of nationalizing the bank.
Lehman Brothers had $600 billion in assets before its eventual collapse. The bank’s stress testing omitted its riskiest asset – the commercial real estate portfolio – which in turn led to misleading stress test results. The bank’s liquidity was very low compared to its balance sheet size and the risks it had taken. The bank had used deposits with clearing banks as assets in its liquidity buffer, which was not in compliance with regulatory guidelines. The bank lost 73% of its share price during the first half of 2008, and filed for bankruptcy in September 2008.
Implement a Big Data framework and
manage your liquidity risk better
The 2008 financial crisis has shown that the current liquidity risk management (LRM) approach is highly unreliable in a changing and difficult macroeconomic atmosphere. The need of the hour is to improve operational liquidity management on a priority basis.
THE CURRENT LRM APPROACH AND
ITS PAIN POINTS
Compliance/Regulation
Across global regulators, LRM principles have become stricter and more complex in nature. The regulatory focus is mainly on areas like risk governance, measurement, monitoring and disclosure. Hence, the biggest challenge for financial institutions worldwide is to react to these regulatory measures in an appropriate and timely manner. Current systems are not equipped to handle these changes. For example, LRM protocols for stress testing and contingency funding planning (CFP) focus more on the inputs to the scenario analysis and new stress testing scenarios. These complex inputs need to be selected very carefully and hence pose a great challenge for the financial institution.
Siloed Approach to Data Management
Many banks use a spreadsheet-based LRM approach that gets data from different sources which are neither uniform nor comparable. This leads to a great amount of risk in manual processes and to data quality issues. In such a scenario, it becomes impossible to collate an enterprise-wide liquidity position and the risk remains undetected.
Lack of Robust LRM Infrastructure
There is a clear lack of a robust system which can incorporate real-time data and generate the necessary actions in time. The relevant liquidity parameters include changing funding costs, counterparty risks, balance sheet obligations, and the quality of liquidity in capital markets.
THE NEED FOR A READY-MADE SOLUTION
In a recent SWIFT survey, 91% of respondents indicated that there is a lack of ready-made liquidity risk analytics and business intelligence applications to complement risk integration processes. Since regulation around the globe in the form of Basel III, Solvency II, CRD IV, etc., is taking shape, there is an opportunity to standardize the liquidity reporting process. A solution that can do this can be of great help to banks as it would save them both effort and time, as well as increase the efficiency of reporting. Banks can then focus solely on the more complex aspects like inputs to the stress testing process, and on business and strategy to control liquidity risk. Even though banks may differ in their approach to managing liquidity, these differences can be incorporated in the solution as per the requirements.
CHALLENGES/SCOPE OF REQUIREMENTS
FOR LRM
The scope of requirements for LRM ranges from concentration analysis of liquidity exposures, calculation of the average daily peak of liquidity usage, historical and future views of liquidity flows both contractual and behavioral in nature, collateral management, stress testing and scenario analysis, generation of regulatory reports, liquidity gaps across buckets, contingency fund planning, net interest income analysis and fund transfer pricing, to capital allocation. All these liquidity measures are monitored and alerts are generated in case thresholds are breached.
Concentration analysis of liquidity exposures shows whether the assets or liabilities of the institution are dependent on a certain customer or on a product like asset- or mortgage-backed securities. It also tries to see if the concentration is region-wise, country-wise, or along any other parameter that can be used to detect a concentration in the overall funding and liquidity situation.
Calculation of the average daily peak of liquidity usage gives a fair idea of the maximum intraday liquidity demand, so that the firm can take the necessary steps to manage liquidity in an ideal way. The idea is to detect patterns and, in times of high, low or medium liquidity, utilize the available liquidity buffer in the most optimized way.
Collateral management is very important, as the need for collateral and its value after applying the required haircuts have to be monitored on a daily basis. In case of unfavorable margin calls, the amount of collateral needs to be adjusted to avoid default on various outstanding positions.
Stress testing and scenario analysis are like a self-evaluation for banks, in which they need to see how bad things can get in case of high-stress events. Internal stress testing is very important to gauge the amount of loss in case of unfavorable events. For systemically important institutions, regulators have devised stress scenarios based on past crisis events. These scenarios need to be given as input to the stress tests and the results have to be submitted to the regulators. Proper stress testing ensures that the institution is aware of what risk it is taking and what the consequences of the same can be.
Regulatory liquidity reports cover Basel III liquidity ratios like the liquidity coverage ratio (LCR) and net stable funding ratio (NSFR), FSA and Fed 4G guidelines, early warning indicators, funding concentration, liquid assets/collateral, and stress testing analysis. Timely completion of these reports in the prescribed format is important for financial institutions to remain compliant with the norms.
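For illustration, the Basel III LCR is the stock of high-quality liquid assets (HQLA) divided by total net cash outflows over the next 30 calendar days, with a minimum of 100%; recognizable inflows are capped at 75% of outflows. The sketch below computes the ratio from assumed figures so a breach alert could be raised automatically.

# Liquidity coverage ratio (LCR) from illustrative balance-sheet figures.
# LCR = stock of high-quality liquid assets / total net cash outflows over the next 30 days.

def liquidity_coverage_ratio(hqla, cash_outflows_30d, cash_inflows_30d):
    """Basel III caps recognizable inflows at 75% of outflows when computing net outflows."""
    net_outflows = cash_outflows_30d - min(cash_inflows_30d, 0.75 * cash_outflows_30d)
    return hqla / net_outflows

# Assumed figures in millions: 1,200 of HQLA, 1,500 of outflows, 600 of inflows.
lcr = liquidity_coverage_ratio(hqla=1200, cash_outflows_30d=1500, cash_inflows_30d=600)
print(f"LCR = {lcr:.0%}, breach alert: {lcr < 1.0}")  # -> LCR = 133%, breach alert: False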
Net interest income analysis (NIIA), FTP and capital allocation are performance indicators for an institution that raises money from deposits or other avenues and lends it to customers, or invests it to achieve a rate of return. The NII is the difference between the interest earned by lending or investing the funds and the cost of those funds. The implementation of FTP links liquidity risk/market risk to the performance management of the business units. NII analysis helps in predicting the future state of the P/L statement and balance sheet of the bank.
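A tiny worked example of that arithmetic, with assumed balances and rates, is shown below: the bank-level NII is the interest earned less the cost of funds, and an FTP rate splits that margin between the lending and deposit-gathering units.

# Worked example: net interest income (NII) and a simple FTP split.
# Balances and rates are assumptions used only to illustrate the arithmetic.

loans, loan_rate = 800_000_000, 0.065        # interest-earning assets
deposits, deposit_rate = 800_000_000, 0.020  # funding raised from customers
ftp_rate = 0.035                             # internal transfer price incl. liquidity premium

nii = loans * loan_rate - deposits * deposit_rate      # bank-level view
lending_margin = loans * (loan_rate - ftp_rate)        # credited to the lending unit
funding_margin = deposits * (ftp_rate - deposit_rate)  # credited to the deposit-gathering unit

print(f"NII: {nii:,.0f}")                             # 36,000,000
print(f"Lending unit margin: {lending_margin:,.0f}")  # 24,000,000
print(f"Funding unit margin: {funding_margin:,.0f}")  # 12,000,000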
Contingency fund planning consists of wholesale, retail and other funding reports covering both secured and unsecured funds, so that in case these funding avenues dry up, banks can look for alternatives. It states the reserve funding avenues, like the use of credit lines, repo transactions, unsecured loans, etc., that can be accessed in a timely manner and at a reasonable cost in a liquidity crisis situation.
Intra-group borrowing and lending reports show the liquidity position across group companies. Derivatives reports related to market value, collateral and cash flows are very important for efficient derivatives portfolio management. Bucket-wise and cumulative liquidity gaps under business-as-usual and stress scenarios give a fair idea of how liquidity varies across time buckets. Both contractual and behavioral cash flows are tracked to get the final inflow and outflow picture. This is done over different time horizons, from 30 days to 3 years, to get a long-term as well as a short-term view of liquidity. Historic cash flows are tracked as they help in modeling future behavioral cash flows, and historical assumptions plus current market scenarios are very important in the dynamic analysis of behavioral cash flows. Other important reports relate to the available pool of unencumbered assets and non-marketable assets.
All the scoped requirements can only be satisfied when the firm has a framework in place to take the necessary decisions related to liquidity risk. Hence, we next look at an LRM framework as well as a data governance framework for managing liquidity risk data.
LRM FRAMEWORK
A separate group for LRM, constituted of members from the asset liability committee, the risk committee and top management, needs to be formed. This group must function independently of the other groups in the firm and must have the autonomy to take liquidity decisions. Strategic level planning helps in defining the liquidity risk policy in a clear manner, tied to the overall business strategy of the firm.
The risk appetite of the firm needs to be stated in measurable terms and communicated to all the stakeholders in the firm. Liquidity risks across the business need to be identified and the key risk indicators and metrics decided. Risk indicators are to be monitored on a regular basis, so that in the case of an upcoming stress scenario preemptive steps can be taken. Monitoring and reporting are to be done for internal control as well as for regulatory compliance.
Finally, there has to be a periodic analysis of the whole system in order to identify possible gaps in it; the frequency of review has to be at least once a year, and more frequent in case of extreme market scenarios.
To satisfy the scoped-out requirements, data from various sources is used to form a liquidity data warehouse and datamart, which act as inputs to the analytical engines. The engines contain the business rules and logic based on which the key liquidity parameters are calculated. All the analysis is presented in the form of reports and dashboards, for both regulatory compliance and internal risk management as well as for decision-making purposes.
Some Uses of Big data Application in LRM
1. Staging Area Creation for the Data Warehouse: A Big data application can store huge volumes of data and perform some analysis on it, along with aggregating data for further analysis. Due to its fast processing of large amounts of data, it can be used as a loader to load data into the data warehouse, along with facilitating the extract-transform-load (ETL) processes.
Figure 1: Iterative framework for effective liquidity risk management (corporate governance; strategic level planning; identify and assess liquidity risk; monitor and report; periodic analysis for possible gaps; take corrective measures). Source: Infosys Research
2. Preliminary Data Analysis: Data can be moved in from various sources and a visual analytics tool can then be used to create a picture of what data is available and how it can be used.
3. Making Full Enterprise Data Available for High-performance Analytics: Analytics at large firms were often limited to a sample set of records on which the analytical engines would run and provide certain results; but as a Big data application provides distributed parallel processing capacity, the limitation on the number of records no longer exists. Billions of records can now be processed at increasingly amazing speeds (a minimal sketch of this partition-and-aggregate pattern follows this list).
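The sketch below (referenced from point 3) illustrates the partition-aggregate-merge pattern behind that parallelism: synthetic cash-flow records are split into partitions, each partition is summed into liquidity time buckets in a separate process, and the partial totals are merged. A real implementation would run on a distributed engine such as Hadoop/MapReduce or Spark; local multiprocessing is used here purely for illustration.

# Sketch: bucket-wise aggregation of cash flows, partitioned and processed in parallel.
# A real deployment would use a distributed engine; local multiprocessing only
# illustrates the partition-aggregate-merge pattern on synthetic data.

from multiprocessing import Pool
from collections import Counter
from random import choice, uniform

BUCKETS = ["0-30d", "31-90d", "91-365d", "1-3y"]

def aggregate_partition(records):
    """Sum cash flows per maturity bucket for one partition of the data."""
    totals = Counter()
    for bucket, amount in records:
        totals[bucket] += amount
    return totals

def liquidity_gap_report(partitions):
    with Pool() as pool:
        partial_totals = pool.map(aggregate_partition, partitions)
    merged = Counter()
    for part in partial_totals:
        merged.update(part)
    return dict(merged)

if __name__ == "__main__":
    # Synthetic partitions standing in for billions of sourced records.
    partitions = [[(choice(BUCKETS), uniform(-1e6, 1e6)) for _ in range(100_000)]
                  for _ in range(8)]
    print(liquidity_gap_report(partitions))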
HOW BIG DATA CAN HELP IN LRM
ANALYTICS AND BI
■ Operational efficiency and swiftness is an area where high-performance analytics can help achieve faster decision making, because all the required analysis is obtained much faster.
■ Liquidity risk is a killer in today’s financial world and is the most difficult to track, as large banks have diverse instruments and a large number of scenarios need to be analyzed, like changes in interest rates, exchange rates, and liquidity and depth in markets worldwide; for such dynamic analysis Big data analytics is a must.
Figure 2: LRM data governance framework for analytics and BI with Big data capabilities (data sources feeding an ETL/staging layer and Big data application, a data store/warehouse/datamart with general ledger reconciliation, analytical engines for ALM, FTP and liquidity risk and capital calculation, and regulatory and internal reporting/BI). Source: Infosys Research
■ Stress testing and scenario analysis both require intensive computing, as a lot of data is involved; hence faster scenario analysis means quicker action in case of stressed market conditions. With Big data capabilities, scenarios that would otherwise take hours to run can now be run in minutes and hence aid quick decision making and action.
■ Efficient product pricing can be achieved by implementing a real-time fund transfer pricing system and profitability calculations. This ensures the best possible pricing of market risks, along with adjustments like liquidity premiums, across the business units.
CONCLUSION
The LRM system is key for a financial institution to survive in competitive and highly unpredictable financial markets. The whole idea of managing liquidity risk is to know the truth and be ready for the worst market scenarios. This predictability is what is needed, and it can save a bank in times like the 2008 crisis. Even at the business level, a proper LRM system can help in better product pricing using FTP, and hence pricing can be logical and transparent.
Traditionally, data has been a headache for banks and is seen more as a compliance and regulation requirement, but going forward there are going to be even more stringent regulations and reporting standards across the globe. After the crisis of 2008, new Basel III liquidity reporting standards and newer scenarios for stress testing have been issued that require extensive data analysis and can only be completed on time with Big data applications. Everyone in the banking industry knows that the future is uncertain and high margins will always be a challenge, so efficient data management along with Big data capabilities needs to be in place. This will add value to the bank’s profile through a clear focus on new opportunities and bring predictability to its overall business.
Successful banks in the future will be the ones who take LRM initiatives seriously and implement the system successfully. Banks with an efficient LRM system will definitely build a strong brand and reputation in the eyes of investors, customers and regulators around the world.
REFERENCES
1. Banking on Analytics: How High-Performance Analytics Tackle Big data Challenges in Banking (2012), SAS white paper. Available at http://www.sas.com/resources/whitepaper/wp_42594.pdf.
2. New regime, rules and requirements – welcome to the new liquidity, Basel III: implementing liquidity requirements, Ernst & Young (2011).
3. Leveraging Technology to Shape the Future of Liquidity Risk Management, Sybase/Aite Group study, July 2010.
4. Managing liquidity risk: Collaborative solutions to improve position management and analytics (2011), SWIFT white paper.
5. Principles for Sound Liquidity Risk Management and Supervision, BIS Document (2008).
6. Technology Economics: The Cost of Data, Howard Rubin, Wall Street and Technology. Available at http://www.wallstreetandtech.com/data-management/231500503.
Big Data Medical Engine in the Cloud
(BDMEiC): Your New Health Doctor
By Anil Radhakrishnan and Kiran Kalmadi
Imagine a world where the day-to-day data about an individual’s health is tracked, transmitted, stored and analyzed on a real-time basis; where, worldwide, diseases are diagnosed at an early stage without the need to visit a doctor; and, lastly, a world where every individual has a ‘life certificate’ that contains all their health information, updated on a real-time basis. This is the world to which Big data can lead us.
Given the amount of data generated every day in the human body, e.g., body vitals, blood samples, etc., it is a haven for generating Big data. Analyzing this Big data in healthcare is of prime importance: Big data analytics can play a significant role in the early detection/advanced diagnosis of fatal diseases, which can reduce healthcare cost and improve quality.
Hospitals, medical universities, researchers and insurers will be positively impacted by applying analytics to this Big data. However, the principal beneficiaries of analyzing this Big data will be governments, patients and therapeutic companies.
RAMPANT HEALTHCARE COSTS
A look at the healthcare expenditure of countries like the US and the UK would automatically explain the burden that healthcare places on the economy. As per data released by the Centers for Medicare and Medicaid Services, health expenditure in the US is estimated to have reached $2.7 trillion, or over $8,000 per person [1]. By 2020, this is expected to balloon to $4.5 trillion [2]. These costs will have a huge bearing on an economy that is struggling to get back on its feet, having just come out of a recession.
According to the Office for National Statistics in the UK, healthcare expenditure in the UK amounted to £140.8 billion in 2010, up from £136.6 billion in 2009 [3]. With rising healthcare costs, countries like Spain have already pledged to save €7 billion by slashing health spending, while also charging more for drugs [5]. Middle income earners will now have to pay more for drugs.
Diagnose, customize and administer healthcare in real time using BDMEiC
This increase in healthcare costs is not isolated to a few countries alone. According to World Health Organization statistics released in 2011, per capita total expenditure on health jumped from US$566 to US$899 between 2000 and 2008, an alarming increase of 58% [4]. This huge increase is testimony to the fact that, far from increasing steadily, healthcare costs have been increasing exponentially.
While healthcare costs have been increasing, the data generated through body vitals, lab reports, prescriptions, etc. has also been increasing significantly. Analysis of this data will lead to better and advanced diagnosis, early detection and more effective drugs, which in turn will result in a significant reduction in healthcare costs.
HOW BIG DATA ANALYTICS CAN HELP
REDUCE HEALTHCARE COSTS?
Analysis of the ‘Big data’ generated from various real-time patient records possesses a lot of potential for creating quality healthcare at reduced costs. Real time refers to data like body temperature, blood pressure, pulse/heart rate and respiratory rate that can be generated every 2-3 minutes. This data, collected across individuals, provides the volume of data at high velocity, while also providing the required variety since it is obtained across geographies. The analysis of this data can help in reducing costs by enabling real-time diagnosis, analysis and medication, which offers:
■ Improved insights into drug effectiveness
■ Insights for early detection of diseases
■ Improved insights into the origins of various diseases
■ Insights to create personalized drugs.
These insights that Big data analytics provides are unparalleled and go a long way in reducing the cost of healthcare.
USING BIG DATA ANALYTICS FOR
PERSONALIZING DRUGS
The patents of many high-profile drugs are expiring by 2014. Hence, therapeutic companies need to examine the response of patients to these drugs to help create personalized drugs. Personalized drugs are those that are tailored to an individual patient. Real-time data collected from various patients will help generate Big data, the analysis of which will help identify how individual patients reacted to the drugs administered to them. Through this analysis, therapeutic companies will be able to create personalized drugs custom-made for an individual.
A personalized drug is one of the important solutions that Big data analytics has the power to offer. Imagine a situation where analytics helps determine the exact amount and type of medicine that an individual requires, even without them having to visit a doctor. That is the direction in which Big data analytics in healthcare has to move. In addition, the analysis of this data can also significantly reduce healthcare costs that run into billions of dollars every year.
BIG DATA ANALYTICS FOR REAL TIME
DIAGNOSIS USING BIG DATA MEDICAL
ENGINE IN THE CLOUD (BDMEIC)
Big data analytics for real time diagnosis are
characterized by real time Big data analytics
systems. These systems contain a closed loop
feedback system, where insights from the
application of the solution serve as feedback
for further analysis. (Refer Figure 1).
Access to real time data provides a
quick way to accumulate and create Big data.
The closed loop feedback system is important
because it helps the system in building its
intelligence. These systems can not only help
to monitor patients in real time but can also
be used to provide diagnosis, detect early and
deliver medication in real time.
This can be achieved through a Big data
Medical Engine in the Cloud (BDMEiC) [Fig. 2].
This solution would consist of:
• Two medical patches (arm and thigh)
• Analytics engine
• Smartphone
• Data Center.
As depicted above, the BDMEiC solution
consists of the following:
1. Arm and thigh based electronic medical
patch
An arm based electronic medical patch
(these patches are thin, lightweight,
elastic and have embedded sensors) that
can monitor the patient is strapped to the
arm of an individual, which reads vitals
like body temperature, blood pressure,
pulse/heart rate, and respiratory rate to
monitor brain, heart, muscle activity, etc.
The patch then transmits this real time
data to the individual’s smartphone,
which is synced with the patch.
The extraction of the data happens at
regular intervals (every 2-3 minutes).
The smartphone transmits the real time
data to the data center in the medical
engine. The thigh based electronic
medical patch is used for providing
medication. The patch comes with a
drug cartridge (pre-loaded drugs) that
can be inserted into a slot in the patch.
When it receives data from the
smartphone, the device can provide
the required medication to the patient
through auto-injectors that are a part of
the drug cartridge.
2. Data Center
The data center is the Big data cloud
storage that receives real time data from
the medical patch and stores it. This data
center will be a repository of real time
data received across different individuals
across geographies. This data is then
transmitted to the Big data analytics engine.
3. Big Data Analytics Engine
The Big data analytics engine performs
three major functions - analyzing data,
sharing analyzed data with organizations
and transmitting medication instructions
back to the smartphone.
• Analyzing Data: It analyzes the
data (like body temperature, blood
pressure, pulse/heart rate, and
respiratory rate, etc.) received
from the data center using its
inbuilt medical intelligence, across
individuals. As the system keeps
analyzing this data it also keeps
building on its intelligence.
Figure 1: Real Time Big Data Analytics System (real time medical data feeds a real time Big data analytics system whose analysis yields new solutions, with newer insights from those solutions fed back into the analysis). Source: Infosys Research
Figure 2: Big Data Medical Engine in the Cloud (BDMEiC). Source: Infosys Research
• Sharing Analyzed Data: The analytics
engine also transmits its analysis to
various universities, medical centers,
therapeutic companies and other
related organizations for further
research.
• Transmitting Medication Instructions: The analytics engine can also transmit medication instructions to an individual’s smartphone, which in turn transmits data to the thigh patch, whenever medication has to be provided.
The BDMEiC solution can act as a
real time doctor that diagnoses, analyzes,
and provides personalized medication to
individuals. Such a solution that harnesses the
potential of Big data provides manifold benefits
to various beneficiaries.
BENEFITS AND BENEFICIARIES OF
BDMEIC
The BDMEiC solution, if adopted on a large scale,
can offer a multitude of benefits, a few of
which are listed below.
Real time Medication
With the analytics engine monitoring patient
data in real time, the diagnosis and treatment of
patients in real time is possible. With the data
being shared with top research facilities and
medical institutions in the world, the diagnosis
and treatment would be more effective and
accurate.
Specific Instances: Blood pressure data can be
monitored real time and stored in the data
center. The analysis of this data by the analytics
engine can keep the patient as well as the doctor
updated in real time if the blood pressure moves
beyond permissible limits.
Beneficiaries: Patients, medical institutions and
research facilities.
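To make this specific instance concrete, the following is a minimal sketch of how such a permissible-limit rule might be expressed in code; the thresholds, field names and the notify() hook are illustrative assumptions, not part of the BDMEiC specification.

```python
# Minimal sketch of a real-time blood pressure rule (illustrative only).
# Field names, thresholds and the notify() hook are assumptions for this example.
from datetime import datetime

SYSTOLIC_LIMIT = 140   # assumed permissible upper limits (mmHg)
DIASTOLIC_LIMIT = 90

def notify(recipient, message):
    # Placeholder for the engine's alerting channel (SMS, app push, etc.)
    print(f"[{datetime.now().isoformat()}] alert to {recipient}: {message}")

def check_reading(reading):
    """Evaluate one vitals reading streamed from the arm patch."""
    if reading["systolic"] > SYSTOLIC_LIMIT or reading["diastolic"] > DIASTOLIC_LIMIT:
        msg = (f"Patient {reading['patient_id']} BP "
               f"{reading['systolic']}/{reading['diastolic']} beyond permissible limits")
        notify("patient", msg)
        notify("doctor", msg)

# Example: a reading arriving every 2-3 minutes from the patch
check_reading({"patient_id": "P001", "systolic": 152, "diastolic": 95})
```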
Convenience
The BDMEiC solution offers convenience to
patients, who would not always be in a position
to visit a doctor.
Specific Instances: Body vitals can be measured
and analyzed with the patient being at home.
This especially helps in the case of senior citizens
and busy executives who can now be diagnosed
and treated right at home or while on the move.
Beneficiaries: Patients.
Insights into drug effectiveness
The system allows doctors, researchers and
therapeutic companies to understand the
impact of their drugs in real time. This helps
them to create better drugs in the future.
Specific Instances: The patents of many high
profile drugs are ending by 2014. Therapeutic
companies can use BDMEiC to perform real
time Big data analysis, to understand their
existing drugs better, so that they can create
better drugs in the future.
Beneficiaries: Doctors, researchers and
therapeutic companies.
Early Detection of Diseases
As BDMEiC monitors, stores, and analyzes data
in real time, it allows medical researchers, doctors
and medical labs to detect diseases at an early
stage. This allows them to provide an early cure.
Specific Instances: Early detection of diseases
like cancer, childhood pneumonia, etc., using
BDMEiC can help provide medication at an
early stage thereby increasing the survival rate.
Beneficiaries: Researchers, medical labs and
patients.
Improved Insights into Origins of Various
Diseases
With BDMEiC storing and analyzing real time
data, researchers get to know the cause and
symptoms of a disease much better and at an
early stage.
Specific Instances: Newer strains of viruses can
be monitored and researched in real time.
Beneficiaries: Researchers and medical labs.
Insights to Create Personalized Drugs
Real time data collected from BDMEiC will help
doctors administer the right dose of drugs to
the patients.
Specific Instances: Instead of a standard pill,
patients can be given the right amount of drugs,
customized according to their needs.
Beneficiaries: Patients and doctors.
Reduced Costs
Real time data collected from BDMEiC assists in
the early detection of diseases, thereby reducing
the cost of treatment.
Specific Instances: Early detection of cancer and
other life threatening diseases can lead to lesser
spending on healthcare.
Beneficiaries: Government and patients.
CONCLUSION
The present state of the healthcare system
leaves a lot to be desired. Healthcare costs
are spiraling and forecasts suggest that they
are not poised to come down any time soon.
In such a situation, organizations the world over,
including governments, should look to harness
the potential of real time Big data analytics
to provide high quality and cost effective
healthcare. The solution proposed in this
paper tries to utilize this potential to bridge
the gap between medical research and the final
delivery of the medicine.
REFERENCES
1. US Food and Drug Administration, 2012
2. National Health Expenditure Projections
2011-2021 (January 2012), Centers for
Medicare & Medicaid Services, Office of
the Actuary. Available at http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/Downloads/Proj2011PDF.pdf.
3. Jurd, A. (2012), Expenditure on healthcare
in the UK 1997 - 2010, Office for National
Statistics. Available at http://www.ons.
gov.uk/ons/dcp171766_264293.pdf .
4. World Health Statistics 2011, World
Health Organization. Available at
http://www.who.int/whosis/whostat/EN_WHS2011_Full.pdf.
5. The Ministry of Health, Social Policy and
Equality Spain (). Available at http://
www.msssi.gob.es/ssi/violenciaGenero/
publicaciones/comic/docs/PilladaIngles.pdf.
Big Data Powered
Extreme Content Hub
By Sudheeshchandran Narayanan and Ajay Sadhu
Content is getting bigger by the minute
and smarter by the second [5]. As
content grows in size and becomes varied in
structure, discovery of valuable and relevant
content becomes a challenge. Existing Enterprise Content
Management (ECM) products are limited
by scalability, variety, rigid schema, limited
indexing and processing capability.
Content enrichment is often an external
activity and is rarely deployed. The content
manager is more like a content repository
and is used primarily for search and retrieval
of the published content. Existing content
management solutions can handle only a few data
formats and provide very limited capability
with respect to content discovery and
enrichment.
With the arrival of Big Content, the
need to extract, enrich, organize and manage
the semi-structured and un-structured content
and media is increasing. As the next generation
of users will rely heavily on new modes of
interacting with the content, e.g., mobile
devices and tablets, there is a need to re-
look at the traditional content management
strategies. Artificial intelligence will now play
a key role in information retrieval, information
classification and usage for these sophisticated
users. To facilitate the usage of Artificial
Intelligence on this Big Content, there is a need
to have knowledge on entities, domain, etc., to
be captured, processed, reused, and interpreted
by the computer. This has resulted in the formal
specification and capture of the structure of
the domain, called ontologies, the classification
of these entities within the domain into
predefined categories, called taxonomies, and
the inter-relation of these entities to create the semantic web
(web of data).
The new breed of content management
solutions need to bring in elastic indexing,
distributed content storage and low latency
to address these changes. But the story
does not end there. The ease of deploying
technologies like natural language text
analytics and machine learning now takes this
new breed of content management to the
next level of maturity. Time is of the essence for
everyone today. Contextual filtering of the
content based on relevance is an immediate
need. There is a need to organize content,
create new taxonomy, and create new links
and relationships beyond what is specified.
The next generation of content management
solutions should leverage the ontologies,
semantic web and linked data to derive the
context of the content and enrich the content
metadata with this context. Then leveraging
this context, the system should provide real-
time alerts as the content arrives.
In this paper, we discuss the details of
the extreme content hub and its implementation
semantics, technology viewpoint and use
cases.
THE BIG CONTENT PROBLEM IN TODAY’S
ENTERPRISES
Legacy Content Management Systems (CMS)
have focused on addressing the fundamental
problems in content management, i.e., content
organization, indexing, and searching. With
the internet evolution, these CMS’ added
Content Publishing Lifecycle Management
(CPLM) and workflow capabilities to the overall
offering. The focus of these ECM products was
on providing a solution for the enterprise
customers to easily store and retrieve various
documents and provide a simplified search
interface. Some of these solutions evolved to
address the web publishing problem. These
existing content management solutions have
constantly shown performance and scalability
concerns. Enterprises have invested in high
end servers and hired performance engineering
experts to address this. But will this last long?
Figure 1: Augmented Capabilities of Extreme Content Hub (core features: indexing, search, workflow, metadata repository, content versioning; augmented capabilities: heterogeneous content ingestion, automated content discovery, content enrichment, unified intelligent content access and insights, and a highly available, elastic, scalable system). Source: Infosys Research
With the arrival of Big data (volume,
variety and velocity), these problems have
amplified further and the need for next
generation capabilities for content management
has evolved further.
Requirements and demands have gone
beyond just storing, searching and indexing
traditional documents. Enterprises need
to store a wide variety of content ranging
from documents, videos, social media feeds,
blog posts, podcasts, images, etc. Extraction,
enrichment, organization and management
of semi-structured, unstructured and multi-structured
content and media are a big challenge today.
Enterprises are under tremendous competitive
pressure to derive meaningful insights from
these piles of information assets and derive
business value from this Big data. Enterprises
are looking for contextual and relevant
information at lightning speed. The ECM
solution must address all of the above technical
and business requirements.
EXTREME CONTENT HUB: KEY
CAPABILITIES
Key capabilities required for the Extreme Content
Hub (ECH), apart from the traditional indexing,
storage and search capabilities, can be classified
in the following five dimensions (Fig. 2).
Heterogeneous Content Ingestion that
provides input adapters for a wide variety of
content (documents, videos, images, blogs,
feeds, etc.) into the content hub seamlessly. The
next generation of content management systems
needs to support
Real-Time Content Ingestion for RSS feeds,
news feeds, etc., and support streams of events
to be ingested as one of the key capabilities for
content ingestion.
Automated Content Discovery that extracts the
metadata and classifies the incoming content
seamlessly into pre-defined ontologies and
taxonomies.
Scalable, Fault-tolerant Elastic System that can
seamlessly expand to the demands of volume,
velocity and variety growth of the content.
Content Enrichment services that leverage
machine learning and text analytics technologies
to enrich the context of the incoming content.
Unified Intelligent Content Access that
provides a set of content access services that
are context aware and based on information
relevance by user modeling and personalization.
To realize ECH, there is a need to
augment the existing search and indexing
technologies with the next generation of
machine learning and text analytics to bring
in a cohesive platform. The existing content
management solution still provides quite a good
list of features that cannot be ignored.
BIG DATA TECHNOLOGIES: RELEVANCE
FOR THE CONTENT HUB
With the advent of Big data, the technology
landscape has made a significant shift.
Distributed computing has now become a key
enabler for large scale data processing and with
open source contributions this has received a
significant boost in recent years. Year 2012 has
been the year for large scale Big data technology
adoption.
The other significant advancement
has been in the NoSQL (Not Only SQL)
technology which complements the existing
RDBMS systems for scalability and flexibility.
Scalable near real-time access provided by these
systems has boosted the adoption of distributed
computing for real-time data storage and
indexing needs.
Scalable and elastic deployments
provided by the advancement in private and
public cloud deployments has accelerated
adoption of distributed computing in enterprises.
Overall, there is a significant change from our
earlier approaches to solve the ever increasing
data and performance problem by throwing
more hardware at the problem. Today, deploying
a scalable distributed computing infrastructure
that not only addresses the velocity, variety
and volume problem but also provides it as
a cost effective alternative using open source
technologies makes the business case for
building the ECH. The solution to the problem
is to augment the existing content management
solution with the processing capabilities of the
Big data technologies to create a comprehensive
platform that brings in the best of both worlds.
REALIZATION OF THE ECH
ECH requires a scalable, fault tolerant
elastic system that provides scalability on
storage, compute and network infrastructure.
Distributed processing technologies like
Hadoop provide the foundation platform
for this. A private cloud based deployment
model will provide the on-demand elasticity
and scale that is required to set up such a
platform.
A metadata model driven ingestion
framework could ingest a wide variety of
feeds to the hub seamlessly. Content ingestion
could deploy content security tagging during
the ingestion process to ensure that the content
stored inside the hub is secured and authorized
before access.
NoSQL technologies like HBase and
MongoDB could provide the scalable metadata
repository needs for the system.
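As a rough illustration of this idea, the sketch below records ingestion metadata (including security tags) in MongoDB via pymongo; the database, collection and field names are assumptions made for the example, a local MongoDB instance is assumed, and HBase or another NoSQL store could play the same role.

```python
# Sketch of storing ingestion metadata in a NoSQL repository (MongoDB via pymongo).
# Collection and field names are illustrative assumptions, not a prescribed schema.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")      # assumed local instance
metadata = client["content_hub"]["content_metadata"]

def register_content(content_id, source, content_type, security_tags, extracted):
    """Record metadata (and security tags) captured during content ingestion."""
    doc = {
        "content_id": content_id,
        "source": source,                # e.g., RSS feed, crawl job, document upload
        "content_type": content_type,    # document, video, blog post, ...
        "security_tags": security_tags,  # applied during ingestion, checked on access
        "extracted_metadata": extracted, # output of the metadata extractors
        "ingested_at": datetime.now(timezone.utc),
    }
    metadata.insert_one(doc)

register_content("doc-0001", "news-feed", "article",
                 ["internal"], {"title": "Q3 results", "language": "en"})
```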
Figure 2: Extreme Content Hub (heterogeneous feeds and existing enterprise content flow through unstructured content and metadata extractors into a Hadoop distributed file system with HBase link and index storage, governed by a metadata driven processing framework and content processing workflows, and exposed through unified enterprise content access, search, classification, recommendation, dashboard and alert services). Source: Reference [12]
Search and indexing technologies have
matured to the next level after the advent of
Web 2.0, and deploying a scalable indexing
service like Solr, Elastic Search, etc., provides
the much needed scalable indexing and search
capability required for the platform.
Deploying machine learning algorithms
leveraging Mahout and R on this platform can
bring in auto-discovery of the content metadata
and auto-classification for content enrichment.
De-duplication and other value added services
can be seamlessly deployed as batch frameworks
on the Hadoop infrastructure to bring value
added context to the content.
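The paper names Mahout and R for this step; purely as an illustration of the shape of such an auto-classification service, the following sketch uses scikit-learn in Python, with a tiny hypothetical training set standing in for taxonomy-tagged content.

```python
# Illustrative auto-classification sketch. The paper names Mahout and R; this uses
# scikit-learn in Python only to show the shape of the enrichment step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labelled sample standing in for taxonomy-tagged training content
train_docs = [
    "quarterly revenue and profit guidance",      # finance
    "share price falls after earnings call",      # finance
    "new smartphone camera and battery review",   # product
    "tablet launch event and hands-on review",    # product
]
train_labels = ["finance", "finance", "product", "product"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_docs, train_labels)

# Newly ingested content is tagged with a predicted taxonomy node, which can then
# be written back to the content metadata repository as enrichment.
incoming = ["analysts revise revenue outlook"]
print(classifier.predict(incoming))   # e.g., ['finance']
```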
Machine learning and text analytics
technologies can be further leveraged to provide
the recommendation and contextualization of
the user interactions to provide unified context
aware services.
BENEFITS OF ECH
ECH is at the center of enterprise knowledge
management and innovation. Serving contextual
and relevant information to the users will be one
of the fundamental usages of ECH.
Auto-indexing will help discover
multiple facets of the content and help in
discovering new patterns and relationships
between the various entities that would have
otherwise gone unnoticed in the legacy world.
The integrated metadata view of the content will
help in building a 360 degree view on a particular
domain or entity from the various sources.
ECH could enable discovery of user
taste and likings based on the content searched
and viewed. This could serve real-time
recommendations to users through content
hub services. This could help the enterprise
in specific user behavior modeling. Emerging
trends in the various domains can be discovered
as content gets ingested on the hub.
ECH could extend as an analytics
platform for video and text analytics. Real-
time information discovery can be facilitated
using pre-defined alerts/rules which could get
triggered as new content arrives in the hub.
The derived metadata and context could
be pushed to the existing content management
solution to leverage the benefits of investments
made in the existing products and platforms
and augment the processing and analytics
capabilities with new technologies.
ECH will now be able to handle large
volumes, wide variety of content formats and
bring in deep insights leveraging the power of
machine learning. These solutions will be very
cost effective and will also leverage existing
investment in the current CMS.
CONCLUSION
The need is to take a platform centric approach
to this Big content problem rather than a
standalone content management solution. There
is a need to look at it strategically and adopt a
scalable architecture platform to address this.
However, such an initiative doesn’t need to replace
the existing content management solutions but
to augment their capabilities to fill in the required
white spaces. The approach discussed in this
paper provides one such implementation of the
augmented content hub leveraging the current
advancement in Big data technologies. Such
an approach will provide the enterprise with a
competitive edge in years to come.
REFERENCES
1. Agichtein, E., Brill, E. and Dumais, S.
(2006), Improving web search ranking by
incorporating user behavior. Available at
http://research.microsoft.com/en-us/
um/people/sdumais/.
2. Dumais, S. (2011), Temporal Dynamics
52
and Information Retrieval. Available at
http://research.microsoft.com/en-us/
um/people/sdumais/.
3. Reamy, T. ( 2012) , Taxonomy and
Ent erpri se Cont ent Management .
Available at http://www.kapsgroup.
com/presentations.shtml.
4. Reamy, T. (2012), Enterprise Content
Categorization – How to Successfully
Choose, Develop and Implement a
Semantic Strategy. Available at http://www.kapsgroup.com/presentations/ContentCategorization-Development.pdf.
5. Barroca, E. (2012), Big data’s Big
Challenges for Content Management,
TechNewsWorld. Available at http://
www.technewsworld.com/story/74243.
html.
Complex Events Processing:
Unburdening Big Data Complexities
By Bill Peer, Prakash Rajbhoj and Narayanan Chathanur
A study by The Economist revealed that 1.27
zettabytes was the amount of information
in existence in 2010 as household data [1]. The
Wall Street Journal reported Big data as the
new boss in all key sectors such as education,
retail and finance. But on the other side, an
average Fortune 500 enterprise is estimated
to have around 10 years’ worth of customer
data, with more than two-thirds of it being
unusable. How can enterprises make such an
explosion of data usable and relevant? Not
trillions but quadrillions of data points await
analysis overall, and the volume is expected to increase
exponentially, which evidently impacts businesses
worldwide. Additionally, the problem of
providing speedier results is expected
to get worse with more data to analyze unless
technologies innovate at the same pace.
Any function or business, whether it is
road traffic control, high frequency trading,
auto adjudication of insurance claims or
controlling supply chain logistics of electronics
manufacturing, requires huge data sets to be
analyzed as well as timely processing
and decision making. Any delay, even in
seconds or milliseconds, affects the outcome.
Significantly, technology should be capable of
interpreting historical patterns, applying them to
current situations and taking accurate decisions
with minimal human interference.
Big data is about the strategy to deal
with vast chunks of incomprehensible data sets.
There is now awareness across industries that
traditional methods of data storage and processing,
like databases, files, mainframes or even
mundane caching, cannot be used as a solution
for Big data. Still, the existing models do not
address capabilities of processing, analysis
of data, integrating with events and real time
analytics, all in split second intervals.
On the other hand, Complex Event
Processing (CEP) has evolved to provide
solutions that utilize in-memory data grids for
analyzing trends, patterns and events in real
time and making assessments in a matter of milliseconds.
However, Event Clouds, a byproduct of using
CEP techniques, can be further leveraged to
monitor for unforeseen conditions birthing, or
even the emergence of an unknown-unknown,
creating early awareness and potential first
mover advantage for the savvy organization.
To set the context of the paper, we attempt
to highlight how CEP with in-memory data
grid technologies helps in pattern detection,
matching, analysis, processing and decision
making in split seconds with the usage of Big
data. This model should serve any industry
function where time is of the essence, Big
data is at the core and CEP acts as the mantle.
Later, we propose treating an Event Cloud as
more than just an event collection bucket used
for event pattern matching or as simply the
immediate memory store of an exo-cortex for
machine learning; an Event Cloud is also a robust
corpus with its own intrinsic characteristics that
can be measured, quantified, and leveraged for
advantage. For example, by automating the
detection of a shift away from an Event Cloud’s
steady state, the emergence of a previously
unconsidered situation may be observed. It is
this application, programmatically discerning
the shift away from an Event Cloud’s normative
state, which is explored in this paper.
CEP AS REAL TIME MODEL FOR BIG DATA:
SOME RELEVANT CASES
In current times, traffic updates are integrated
with city traffic control systems as well as
many global positioning system (GPS) electronic
receivers used quite commonly by drivers. These
receivers automatically adjust and reroute when
the normal route is congested. This helps,
but the solution is reactionary. Many technology
companies are investing in pursuit of the holy
grail of a solution that detects and predicts traffic
blockages and takes proactive action to control
the traffic itself and even avoid mishaps. For
this there is a need to analyze traffic data over
different parameters such as rush hour, accidents,
seasonal impacts of snow, thunderstorms, etc.,
and come up with predictable patterns over years
and decades. Second is the application of these patterns
to input conditions. All this requires huge data
crunching and analysis, and on top of it a real time
application such as CEP.
Big data has already gained importance in
the financial markets, particularly in high frequency
trading. Since the 2008 economic downturn
and its rippling effects on the stock market, the
volume of trade has come down at all the top
exchanges such as New York, London, Singapore,
Hong Kong and Mumbai. But the contrasting factor
is the rise in High Frequency Trading (HFT). It is
claimed that around 70% of all equity trades were
accounted for by HFT in 2010 versus 10% in 2000.
HFT is 100% dependent on technology and the
trading strategies are developed out of complex
algorithms. Only those traders who have developed
a better strategy and can crunch more data in a faster
time will have a better win ratio. This is where CEP
could be useful.
The healthcare industry in the USA is set to
undergo a rapid change with the Affordable
Care Act. Healthcare insurers are expected to see
an increase in their costs due to the increased risk
of covering more individuals, and legally cannot
deny insurance for pre-existing conditions. Hospitals
are expected to see more patient data, which
means increased analyses, and pharmaceutical
companies need better integration with the
insurers and consumers to have speedier and
more accurate settlements. Even though most of these
transactions can be performed on a non-real time
basis, technology still needs both Big data and
complex processing for a scalable solution.
In India the outstanding cases in various
judicial courts touch 32 million. In USA, family
based cases and immigration related ones
are piling up waiting for a hearing. Judicial
pendency has left no country untouched.
Scanning through various federal, state and local
law points, past rulings, class suits, individual
profiles, evidence details, etc., is required to put
forward the cases for the parties involved, and the
winner is the one who is able to present a better
analysis of available facts. Can technology help
in addressing such problems across nations?
All of these cases across such diverse
industries showcase the importance of
processing gigantic amounts of data and also
the need to have the relevant information
churned out in right time.
WHY AND WHERE BIG DATA
Big data has evolved due to the existing limitations
of current technologies. Two-tier or multi-
tier architecture with even a high performing
database at one end is not enough to analyze
and crunch such colossal information in desired
time frames. The fastest databases today are
benchmarked at tera bytes of information as
noted by the transaction processing council
Volumes of exa and zetta bytes of data need a
different technology. Analysis of unstructured
data is another criterion for the evolution of Big
data. Information available as part of health
records, geo maps, multimedia (audio, video
and picture) is essential for many businesses
and mining such unstructured sets requires
storage power as well as transaction processing
power. Add this to the variety of sources such as
social media, legacy systems, vendor systems,
localized data, mechanical and sensor data.
Finally, there is the critical component of speed: getting
the data through the steps of Unstructured →
Structured → Storage → Mine → Analyze →
Process → Crunch → Customize → Present.
BIG DATA METHODOLOGIES:
SOME EXAMPLES
Apache™ Hadoop™ project [2] and its relatives
such as Avro™, ZooKeeper™, Cassandra™,
Pig™ provided the non-database form of
technology as the way to solve problems with
massive data. It used distributed architecture
as the foundation to remove the constraints of
traditional constructs.
Both Data (storage, transportation) and
Processing (analysis, conversion, formatting)
are distributed in this architecture. Figure 1 and
Figure 2 compare the traditional vs. Distributed
Architecture.
Figure 1: Conventional Multi-Tier Architecture. Source: Infosys Research
Figure 2: Distributed Multi-Nodal Architecture. Source: Infosys Research
A key advantage of distributed
architecture is scalability. Nodes can be added
without affecting the design of the underlying
data structures and processing units.
IBM has even gone a step ahead in getting
Watson [5], the famous artificially intelligent
computer which can learn as it gets more
information and patterns for decision making.
Similarly IBM [6], Oracle [7], Teradata
[8] and many leading software providers
have created the Big data methodologies as
an impetus to help enterprise information
management.
VELOCITY PROBLEM IN BIG DATA
Even though we clearly see the benefits of Big
data and its architecture is easily applicable
to any industry, there are some limitations that
are not easily perceivable. A few pointers:
• Can Big data help a trader to give the
best win scenarios based on millions and
even billions of computations of multiple
trading parameters in real time?
• Can Big data forecast traffic scenarios
based on sensor data, vehicle data,
seasonal change, major public events and
provide alternate path to drivers through
their GPS devices in real time helping both
city officials as well as drivers to save time?
• Can Big data detect fraud
scenarios by running through multiple
shopping patterns of a user through
historical data and match with the
current transaction in real time?
• Can Big data provide real time analytical
solutions out of the box and support
predictive analytics?
There are multiple business scenarios
in which data has to be analyzed in real time.
These data are created, updated and transferred
because of real time business or system level
events. Since the data is in the form of real time
events, this requires a paradigm shift in the
methodology in the way data is viewed and
analyzed. Real time data analysis in such cases
means that data has to be analyzed before the
data hits the disk. The difference between ‘event’
and ‘data’ just vanishes.
In such cases across the industry, where
Big data is unequivocally needed to manage
the data but the data must also be used effectively,
integrated with real time events and made to provide
the business with express results, a complementary
technology is required, and that’s where CEP
can fit in.
VELOCITY PROBLEM: CEP AS A SOLUTION
The need here is the analysis of data arriving
in the form of real time event streams
and the identification of patterns or trends based on
vast historical data. Adding to the complexity
are other real time events.
The vastness is solved with Big data, while
real time analysis of multiple events, pattern
detection and appropriate matching and
crunching are solved by CEP.
Real time event analysis ensures avoiding
duplicates and synchronization issues as data
is still in flight and storage is still a step away.
Similarly, it facilitates predictive analysis of data
by means of pattern matching and trending.
This enables enterprises to provide early
warning signals and take corrective measures
in real time itself.
The reference architecture of traditional CEP
is shown in Figure 3.
CEP’s original objective was to provide
processing capability similar to Big data with
distributed architecture and in memory grid
computing. The difference was that CEP was to
handle multiple, seemingly unrelated events
and correlate them to provide a desired and
meaningful output. The backbone of CEP,
though, can be traditional architectures such as
multi-tier technologies, with CEP usually in the
middle tier.
Figure 4 shows how CEP on Big
data solves the velocity problem with Big
data and complements the overall information
management strategy for any enterprise that
aims to use Big data. CEP can utilize Big data,
particularly highly scalable in-memory data
grids, to store the raw feeds, events of interest
and detected events and analyze this data in real
time by correlating with other in-flight events.
Fraud detection is a very apt example where
historic data of the customer’s transactions,
his usage profile, location, etc., is stored in
the in-memory data grid and every new event
(transaction) from the customer is analyzed
by the CEP engine by correlating and applying
patterns on the event data with the historic data
stored in the memory grid.
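A minimal sketch of this correlation step is shown below; a plain Python dictionary stands in for the in-memory data grid, and the scoring thresholds and field names are assumptions for illustration rather than a prescribed design.

```python
# Sketch of correlating an in-flight transaction event with historic customer data
# held in an in-memory grid. A plain dict stands in for the data grid; thresholds
# and field names are assumptions for illustration.
profile_grid = {
    "C123": {"home_country": "IN", "avg_amount": 120.0},
}

def score_transaction(event):
    """Return a simple suspicion score for a new transaction event."""
    profile = profile_grid.get(event["customer_id"])
    if profile is None:
        return 1.0                      # unknown customer: treat as suspicious
    score = 0.0
    if event["country"] != profile["home_country"]:
        score += 0.5                    # location deviates from the usage profile
    if event["amount"] > 5 * profile["avg_amount"]:
        score += 0.5                    # amount far above the historic average
    return score

event = {"customer_id": "C123", "amount": 900.0, "country": "SG"}
if score_transaction(event) >= 0.5:
    print("flag for review:", event)
```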
There are multiple scenarios, some
of them outlined in this paper, where
CEP complements Big data and other offline
analytical approaches to accomplish an active
and dynamic event analytics solution.
EVENT CLOUDS AND DETECTION
TECHNIQUES
CEP and Event Clouds
A linearly ordered sequence of events is called
an event stream [9]. An event stream may
contain many different types of events, but
there must be some aspect of the events in the
event stream that allows for a specific ordering.
This is typically an ordering via timestamp.
Figure 3: Complex Events Processing - Reference Architecture (event generation and capture feed pre-filtering and an event processing engine with event handlers, patterns, actions, CEP languages and domain specific algorithms, supported by event modeling and management, a metadata repository, and security, scalability, monitoring and administration concerns). Source: Infosys Research
By watching for Event patterns of interest, such
as multiple usages of the same credit card at
a gas station within a 10 minute window, in
an event stream, systems can respond with
predefined business driven behaviors, such as
placing a fraud alert on the suspect credit card.
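The sketch below shows one way such a windowed pattern could be expressed; the event fields, window length and match threshold mirror the example above but are otherwise illustrative assumptions.

```python
# Sketch of the windowed pattern described above: the same card used at a gas
# station three or more times within 10 minutes. Event fields are assumed.
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
recent = defaultdict(deque)    # card_id -> timestamps of gas-station swipes

def on_event(card_id, ts):
    """Feed each swipe event; returns True when the pattern of interest fires."""
    q = recent[card_id]
    q.append(ts)
    while q and ts - q[0] > WINDOW:    # drop swipes outside the 10-minute window
        q.popleft()
    return len(q) >= 3                 # pattern matched: place a fraud alert

t0 = datetime(2013, 1, 1, 9, 0)
for offset in (0, 4, 8):
    fired = on_event("card-42", t0 + timedelta(minutes=offset))
print("fraud alert" if fired else "no alert")
```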
An Event Cloud is “a partially ordered
set of events (POSET), either bounded or
unbounded, where the partial orders are imposed
by the causal, timing and other relationships
between events” [10]. As such, it is a collection of
events within which the ordering of events may
not be possible. Further, there may or may not
be an affinity of the events within a given Event
Cloud. If there is an affinity, it may be as broad
as “all events of interest to our company” or as
specific as “all events from the emitters located
at the back of the building.”
Event Clouds and event streams may
contain events from sources outside of an
organization, such as stock market trades or
tweets from a particular twitter user. Event
Clouds and event streams may have business
events, operational events, or both. Strictly
speaking, an event stream is an Event Cloud,
but an Event Cloud may or may not be an event
stream, as dictated by the ordering requirement.
Typically, a landscape with CEP
capabilities will include three logical units:
(i) emitters that serve as sources of events, (ii)
a CEP engine, and (iii) targets to be notified
under certain event conditions. Sources can
be anything from an application to a sensor to
even the CEP engine itself. CEP engines, which
are the heart of the system, are implemented
in one of two fundamental ways. Some follow
the paradigm of being rules based, matching on
explicitly stated event patterns using algorithms
like Rete, while other CEP engines use the more
sophisticated event analytics approach looking
Figure 4: CEP on Big Data (the reference CEP architecture augmented with an in-memory DB / data grid backed by Big data, connected to the event processing engine through query agents and write connectors). Source: Infosys Research
for probabilities of event patterns emerging using
techniques like Bayesian Classifiers [11]. In either
case of rules or analytics, some consideration of
what is of interest must be identified up front.
Targets can be anything from dashboards to
applications to the CEP engine itself.
Users of the system, using the tools
provided by the CEP provider, articulate events
and patterns of events that they are interested
in exploring, observing, and/or responding to.
For example, a business user may indicate to
the system that for every sequence wherein a
customer asks about a product three times but
does not invoke an action that results in a buy,
the system is then to provide some promotional
material to the customer in real-time. As another
example, a technical operations department
may issue event queries to the CEP engine,
in real time, asking about the number of
server instances being brought online and
the probability that there may be a deficit in
persistence storage to support the servers.
Focusing on events, while extraordinarily
powerful, biases what can be cognized. That
is, what you can think of, you can explore.
What you can think of, you can respond to.
However, by adding the Event Cloud, or event
stream, to the pool of elements being observed,
emergent patterns not previously considered
can be brought to light. This is the crux of this
paper, using the Event Cloud as a porthole into
unconsidered situations emerging.
EVENT CLOUDS HAVE FORM
As represented in Figure 5, there is a
point wherein events flowing through a CEP
engine are unprocessed. This point is an Event
Cloud, which may or may not be physically
located within a CEP engine memory space.
This Event Cloud has events entering its
logical space and leaving it. The only bias to
the events travelling through the CEP engine’s
Event Cloud is based on which event sources
are serving as inputs to the particular CEP
engine. For environments wherein all events,
regardless of source, are sent to a common
CEP engine, there is no bias of events within
the Event Cloud.
There are a number of attributes about
the Event Cloud that can be captured, depending
upon a particular CEP’s implementation.
For example, if an Event Cloud is managed
Figure 5: CEP Engine Components (input adapters feed an event ingress bus into the Event Cloud, where rules, filters, unions, correlation and matching are applied before results flow through an output bus to output adapters). Source: Infosys Research
in memory and is based on a time window,
e.g., events of interest only stay within
consideration by the engine for a period of
time, then the number of events contained
within an Event Cloud can be counted. If the
structure holding an Event Cloud expands
and contracts with the events it is funneling,
then the memory footprint of the Event Cloud
can be measured. In addition to the number of
events and the memory size of the containing
unit, the counts of the event types themselves
that happen to be present at a particular time
within the Event Cloud become a measurable
characteristic. These properties, viz., memory
size, event counts, and event types, can serve as
measurable characteristics describing an Event
Cloud, giving it a size and shape (Figure 6).
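A small sketch of capturing these characteristics is given below; the event structure and the use of a shallow memory measurement are assumptions made for illustration.

```python
# Sketch of capturing the measurable characteristics that give an Event Cloud
# its shape: total event count, counts per event type, and an approximate memory size.
import sys
from collections import Counter
from dataclasses import dataclass

@dataclass
class CloudSnapshot:
    total_events: int
    type_counts: dict      # event type -> count present in the cloud
    memory_bytes: int      # rough (shallow) footprint of the containing structure

def snapshot(event_cloud):
    """event_cloud: iterable of events, each assumed to carry a 'type' field."""
    types = Counter(e["type"] for e in event_cloud)
    return CloudSnapshot(
        total_events=sum(types.values()),
        type_counts=dict(types),
        memory_bytes=sys.getsizeof(event_cloud),
    )

cloud = [{"type": "Ask"}, {"type": "Ask"}, {"type": "Buy"}, {"type": "Look"}]
print(snapshot(cloud))
```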
EVENT CLOUD STEADY STATE
The properties of an Event Cloud that give
it form can be used to measure its state. By
collecting its state over time, a normative
operating behavior can be identified and its
steady state can be determined. This steady
state is critical when watching for unpredicted
patterns. When a new flow pattern of events
causes an Event Cloud’s shape to shift away
from its steady state, a situation change has
occurred (Figure 7). When these steady state
deviations happen, and if no new matching
patterns or rules are being invoked, then an
unknown-unknown may have emerged. That
is, something significant enough to adjust your
system’s operating characteristics has occurred
yet isn’t being acknowledged in some way.
Either it has been predicted but determined
to not be important, or it was simply not
considered.
ANOMALY DETECTION APPLIED TO
EVENT CLOUD STEADY STATE SHIFTS
Finding patterns in data that do not match
a baseline pattern is the realm of anomaly
detection. As such, by using the steady state of
an Event Cloud as the baseline we can apply
anomaly detection techniques to discern a shift.
Table 1 presents a catalog of various
anomaly detection techniques that are applicable
to Event Cloud shift discernment. This list isn’t
to serve as an exhaustive compilation, but
rather to showcase the variety of possibilities.
Each algorithm has its own set of strengths
Figure 6: Event Cloud (the events traversing an Event Cloud at any particular moment give it shape and size). Source: Infosys Research
Figure 7: Event Cloud Shift (the shape shifts as new patterns occur, from the Event Cloud steady state form to a new form). Source: Infosys Research
such as simplicity, speed of computation, and
certainty scores. Each algorithm, likewise, has
weaknesses to include computational demands,
blind spots in data deviations, and difficulty in
establishing a baseline for comparison. All of
these factors must be considered when selecting
an appropriate algorithm.
Using the three properties defined for an
Event Cloud’s shape (e.g., event counts, event
types, and Event Cloud size) combined with
time properties, we have a multivariate data
instance with three of them being continuous
types, viz., counts, sizes, and time and one being
categorical, viz., types. These four dimensions,
and their characteristics, become a constraint
on which anomaly detection algorithms can be
applied [13].
The anomaly type being detected is
also a constraint. In this case, the Event Cloud
deviations are being classified as collective
anomaly. It is collective anomaly, as opposed
to point anomaly or context anomaly as we are
comparing a collection of data instances that
form the Event Cloud shape with a broader
set of all data instances that formed the Event
Cloud steady state shape.
Statistical algorithms lend themselves
well to anomaly detection when analyzing
continuous and categorical data instances.
Further, knowing an Event Cloud’s steady state
shape a priori isn’t assumed, so the use of a non-
parametric statistical model is appropriate [13].
Therefore, the technique of statistical profiling
using histograms is explored as an example
implementation approach for catching a steady
state shift.
One basic approach to trap the moment
of an Event Cloud’s steady state shift is to
leverage a histogram based on each event type,
with the number of times a particular count of
an event type shows up in a given Event Cloud
instance becoming a basis for comparison. The
histogram generated over time would then
serve as the baseline steady state picture of
normative behavior. Individual instances of an
Event Cloud’s shape could then be compared
to the Event Cloud’s steady state histogram to
discern if a deviation has occurred. That is, does
the particular Event Cloud instance contain
counts of events that have rarely, or never,
appeared in the Event Cloud’s history?
Figure 8 represents the case with a
steady state histogram on the left, and the Event
Cloud comparison instance on the right. In this
depiction the histogram shows, as an example,
that three Ask Events were contained within
an Event Cloud instance exactly once in the
history of this Event Cloud. The Event Cloud
Table 1: Applicability of Anomaly Detection Techniques to Event Cloud Steady State Shifts
Source: Derived from Anomaly Detection: A Survey [12]
Technique Classification | Example Constituent Techniques | Event Cloud Shift Applicability Challenges
Classification Based | Neural Networks, Bayesian Networks, Support Vector Machines, Rule-based | Accurately labeled training data for the classifiers is difficult to obtain
Nearest Neighbour Based / Clustering Based | Distance to kth Nearest Neighbour, Relative Density | Defining meaningful distance measures is difficult
Statistical | Parametric, Non-Parametric | Histogram approaches miss unique combinations
Spectral | Low Variance PCA, Eigenspace-Based | High computational complexity
instance, on the right, that will be compared
shows that the instance has six Ask Events in
its snapshot state.
An anomaly score for each event type
is calculated by comparing each Event Cloud
instance event type count to the event type
quantity occurrence bins within the Event
Cloud steady state histogram, and then these
individual scores are combined for an aggregate
score [13]. This aggregate score then becomes the
basis upon which a judgment is made regarding
whether a deviation has occurred or not.
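The following sketch illustrates one possible realization of this histogram-based scoring; the rarity-based scoring formula and the sample data are illustrative assumptions, not the exact computation prescribed by the paper.

```python
# Sketch of histogram-based scoring. The steady state is a per-event-type histogram
# of how often each count value was observed in past Event Cloud instances; the
# rarer a count is in that history, the higher its anomaly contribution.
from collections import Counter, defaultdict

history = [   # counts per event type seen in past Event Cloud instances (sample data)
    {"Ask": 2, "Buy": 2, "Look": 1},
    {"Ask": 3, "Buy": 1, "Look": 2},
    {"Ask": 2, "Buy": 2, "Look": 1},
]

# Build the steady-state histograms: event type -> {count value: occurrences}
histograms = defaultdict(Counter)
for instance in history:
    for etype, count in instance.items():
        histograms[etype][count] += 1

def anomaly_score(instance):
    """Aggregate per-type rarity scores for one Event Cloud instance."""
    score = 0.0
    for etype, count in instance.items():
        seen = histograms[etype][count]
        total = sum(histograms[etype].values()) or 1
        score += 1.0 - (seen / total)      # rare or unseen counts score near 1
    return score

print(anomaly_score({"Ask": 6, "Buy": 2, "Look": 1}))   # six Asks never seen before
```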
While simple to implement, the primary
weakness of using the histogram based
approach is that a rare combination of events
in an Event Cloud would not be detected, if the
quantities of the individual events present were
in their normal or frequent quantities.
LIMITATIONS OF EVENT CLOUD SHIFTS
Anomaly detection algorithms have blind
spots, or situations where they cannot discern
an Event Cloud shift. This implies that it is
possible for an Event Cloud to shift undetected,
under just the right circumstances. However,
following the lead suggested by Okamoto
and Ishida with immunity-based anomaly
detection systems [13], rather than having
a single observer detecting when an Event
Cloud deviates from steady state, a system
could have multiple observers, each with their
own techniques and approaches applied. Their
individual results could then be aggregated,
with varying weights applied to each technique,
to render a composite Event Cloud steady state
shift score. This will help reduce the chances
of missing a state change shift.
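A minimal sketch of such a weighted aggregation is shown below; the individual detectors and their weights are placeholders assumed for illustration.

```python
# Sketch of aggregating multiple steady-state observers into one composite score,
# in the spirit of the multi-observer idea above. Detectors and weights are
# illustrative placeholders.
def histogram_detector(instance):
    return 0.8        # stand-in for the histogram-based score

def density_detector(instance):
    return 0.3        # stand-in for a nearest-neighbour / density-based score

OBSERVERS = [(histogram_detector, 0.6), (density_detector, 0.4)]

def composite_shift_score(instance):
    """Weighted aggregate of the individual observer scores."""
    return sum(weight * detector(instance) for detector, weight in OBSERVERS)

if composite_shift_score({"Ask": 6}) > 0.5:
    print("possible Event Cloud steady-state shift")
```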
With the approach outlined by this
paper, the scope of indicators is such that you
get an early indicator that something new is
emerging and nothing more. Noticing an Event
Cloud shift only indicates that a situational
change has occurred; it does not identify or
highlight what the root cause of the change is,
nor does it fully explain what is happening.
Analysis is still required to determine what
initiated the shift along with what opportunities
for exploitation may be present.
FURTHER RESEARCH
Many enterprise CEP implementations are
architected in layers, wherein event abstraction
hierarchies, event pattern maps and event
processing networks are used in concert to
increase the visibility aspects of the system [14]
as well as to help with overall performance by
allowing for the segmenting of Event flows.
In general, each layer going up the hierarchy
is an aggregation of multiple events from its
immediate child layer. With the lowest layer
containing the finest grained events and the
highest layer containing the coarsest grained
events, the Event Clouds that manifest at
each layer are likewise of varying granularity
(Figure 9). Therefore a noted Event Cloud
steady state shift at the lowest layer represents
the finest granularity shift that can be observed.
An Event Cloud’s steady state shifts at the
highest layer represent the coarsest steady
Figure 8: Event Cloud Histogram and Instance Comparison (an Event Cloud steady state histogram of Ask, Buy and Look event counts on the left, compared against a single Event Cloud comparison instance on the right). Source: Infosys Research
state shifts that can be observed. Techniques
for interleaving individual layer Event Cloud
steady state shifts along with opportunities
and consequences of their mixed granularity
can be explored.
The technique presented in this paper
is designed to capture the beginnings of
a situational change not explicitly coded
for. With the recognition of a new situation
emerging, the immediate task is to discern what
is happening and why, while it is unfolding.
Further research can be done to discern which
elements available from the steady state
shift automated analysis would be of value
to help an analyst, business or technical,
unravel the genesis of the situation change.
By discovering what change information is of
value, not only can an automated alert be sent
to interested parties, but it can contain helpful
clues on where to start their analysis.
CONCLUSION
It would be an understatement to say that without the
right set of systems, methodologies, controls,
checks and balances on data, no enterprise can
survive. Big data solves the problem of vastness
and multiplicity of the ever rising information
in this information age. What Big data does not
address is the complexity associated with real time
data analysis. CEP, though designed purely for
events complements the Big data strategy of
any enterprise.
Event Cloud, a constituent component
of CEP can be used for more than its typical
application. By treating it as a first class citizen
of indicators, and not just a collection point
computing construct, a company can gain
insight into the early emergence of something
new, something previously not considered
and potentially the birthing of an unknown-
unknown.
With organizations growing in their
usage of Big data, and the desire to move closer
to real time response, companies will inevitably
leverage the CEP paradigm. The question
will be: do they use it as everyone else does,
triggering off of conceived patterns, or will they
exploit it for unforeseen situation emergence?
When the situation changes, the capability is
present and the data is present, but are you?
REFERENCES
1. WSJ article on Big data. Available at
http://online.wsj.com/article/SB10000872396390443890304578006252019616768.html.
2. Transaction Processing Council
Benchmark comparison of leading
databases. Available at http://www.tpc.
org/tpcc/results/tpcc_perf_results.asp.
3. Transaction Processing Council
Benchmark comparison of leading
databases. Available at http://www.tpc.
org/tpcc/results/tpcc_perf_results.asp.
4. Apache Hadoop project site. Available
at http://hadoop.apache.org/.
5. IBM Watson – Artificial intelligent super
computer’s Home Page. Available at
http://www-03.ibm.com/innovation/
us/watson/.
Figure 9: Event Hierarchies (CEP in layers: fine-grained events at the lowest layer are aggregated into coarser-grained Event Clouds at higher layers). Source: Infosys Research
6. IBM’s Big data initiative. Available at
http://www-01.ibm.com/software/
data/bigdata/.
7. Oracle’s Big data initiative. Available
at http://www.oracle.com/us/
technologies/big-data/index.html.
8. Teradata Big data Analytics offerings.
Available at http://www.teradata.com/
business-needs/Big-Data-Analytics/.
9. Luckham, D. and Schulte, R. (2011),
Event Processing Glossary – Version 2.0,
Compiled. Available at http://www.
complexevents.com/2011/08/23/event-
processing-glossary-version-2-0/.
10. Bass, T. (2007), What is Complex Event
Processing? TIBCO Software Inc.
11. Bass, T. (2010), Orwellian Event
Processing. Available at http://www.
thecepblog.com/2010/02/28/orwellian-
event-processing/.
12. Chandola, V., Banerjee, A. and Kumar, V.
(2009), Anomaly Detection: A Survey,
ACM Computing Surveys.
13. Okamoto, T. and Ishida, Y. (2009), An
Immunity-Based Anomaly Detection
System with Sensor Agents, Sensors, ISSN
1424-8220.
14. Luckham, D. (2002), The Power of
Events, An Introduction to Complex
Event Processing in Distributed
Enterprise Systems, Addison Wesley,
Boston.
15. Vincent, P. (2011), ACM Overview
of BI Technology misleads on CEP.
Available at http://www.thetibcoblog.
com/2011/07/28/acm-overview-of-bi-
technology-misleads-on-cep/.
16. About Esper and NEsper FAQ, http://
esper.codehaus.org/tutorials/faq_
esper/faq.html#what-algorithms.
17. Ide, T. and Kashima, H. (2004),
Eigenspace-based Anomaly Detection
in Computer Systems, Tenth ACM
SIGKDD International Conference on
Knowledge Discovery and Data Mining,
August pp. 22-25.
Big Data: Testing Approach to
Overcome Quality Challenges
By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja
Testing Big data is one of the biggest
challenges faced by organizations because
of lack of knowledge on what to test and how
much data to test. Organizations have been
facing challenges in defining the test strategies
for structured and unstructured data validation,
setting up an optimal test environment, working
with non-relational databases and performing
non-functional testing. These challenges are
resulting in poor quality of data in production,
delayed implementation and increased
cost. A robust testing approach needs to be defined
for validating structured and unstructured
data, and testing should start early to identify possible
defects early in the implementation life cycle
and to reduce the overall cost and time to
market.
Different testing types like functional
and non-functional testing are required along
with strong test data and test environment
management to ensure that the data from varied
sources is processed error free and is of good
quality to perform analysis. Functional testing
activities like validation of map reduce process,
structured and unstructured data validation,
data storage validation are important to ensure
that the data is correct and is of good quality.
Apart from functional validations, other non-
functional testing like performance and failover
testing plays a key role in ensuring that the whole
process is scalable and is happening within the
specified SLA.
Big data implementation deals with writing complex Pig and Hive programs and running these jobs using the Hadoop map-reduce framework on huge volumes of data across different nodes. Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. Hadoop uses Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. Hadoop utilizes its own distributed file system, HDFS, which makes data available to multiple computing nodes.
Figure 1 shows the step-by-step process of how Big data is processed using the Hadoop ecosystem. The first step, loading source data files into HDFS, involves extracting the data from different source systems and loading it into HDFS. Data is extracted using crawl jobs for web data and tools like Sqoop for transactional data, and is then loaded into HDFS by splitting it into multiple files. Once this step is completed, the second step, performing map-reduce operations, involves processing the input files and applying map and reduce operations to get the desired output. The last step, extracting the output results from HDFS, involves taking the output generated in the second step and loading it into downstream systems, which can be an enterprise data warehouse for generating analytical reports or any transactional system for further processing.
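To make the map-reduce step concrete, a minimal word-count style sketch written for Hadoop Streaming in Python is given below. It is only an illustration of how such a job is structured; the token-counting logic, the tab-delimited output and the way the script is wired into Hadoop Streaming are assumptions and not part of the implementation described above.

# streaming_wordcount.py - a word-count style mapper/reducer pair for Hadoop Streaming.
# Illustrative invocation (jar path and directories are assumptions):
#   hadoop jar hadoop-streaming.jar -input <in> -output <out> \
#       -mapper "python streaming_wordcount.py map" -reducer "python streaming_wordcount.py reduce"
import sys

def mapper():
    # Emit one (word, 1) pair per token read from the input split.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    # Hadoop sorts mapper output by key, so all counts for a word arrive contiguously.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper() if mode == "map" else reducer()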
BIG DATA TESTING APPROACH
As we are dealing with huge volumes of data and executing on multiple nodes, there are high chances of bad data and data quality issues at each stage of the process. Data functional testing is performed to identify data issues arising from coding errors or node configuration errors. Testing should be performed at each of the three phases of Big data processing to ensure that data is processed without any errors. Functional testing includes (i) validation of pre-Hadoop processing; (ii) validation of the Hadoop Map Reduce process data output; and (iii) validation of the data extract and load into the EDW. Apart from these functional validations, non-functional testing including performance testing and failover testing needs to be performed.
Figure 2 shows a typical Big data architecture diagram and highlights the areas where testing should be focused.
Validation of Pre-Hadoop Processing
Data from various sources like weblogs, social
network sites, call logs, transactional data
etc., is extracted based on the requirements
and loaded into HDFS before processing it
further.
Figure 1: Big Data Testing Focus Areas. (1) Loading source data files into HDFS, (2) performing map-reduce operations, and (3) extracting the output results from HDFS. Source: Infosys Research

Issues: Some of the issues we face during this phase, as data moves from source systems to Hadoop, are incorrect data captured from the source systems, incorrect storage of data, and incomplete or incorrect replication.
Validations: Some high-level scenarios that need to be validated during this phase include:

1. Comparing the input data file against source systems data to ensure the data is extracted correctly (see the comparison sketch after this list)
2. Validating the data requirements and ensuring the right data is extracted
3. Validating that the files are loaded into HDFS correctly, and
4. Validating that the input files are split, moved and replicated across different data nodes.
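The first check above can be sketched as follows, assuming the source extract and the file pulled back from HDFS (for example with hdfs dfs -get) are available locally as delimited text; the file names and key column position are hypothetical.

# compare_counts.py - reconcile record counts and keys between a source extract and the
# copy loaded into HDFS (both assumed to be available locally as delimited text files).
import csv

SOURCE_FILE = "source_extract.csv"   # assumed path to the source-system extract
HDFS_COPY   = "hdfs_copy.csv"        # assumed local copy of the file stored in HDFS
KEY_COLUMN  = 0                      # assumed position of the record key

def load_keys(path):
    with open(path, newline="") as f:
        return set(row[KEY_COLUMN] for row in csv.reader(f) if row)

source_keys = load_keys(SOURCE_FILE)
hdfs_keys = load_keys(HDFS_COPY)

print("source records  :", len(source_keys))
print("hdfs records    :", len(hdfs_keys))
print("missing in HDFS :", len(source_keys - hdfs_keys))
print("unexpected keys :", len(hdfs_keys - source_keys))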
Validation of Hadoop Map Reduce Process
Once the data is loaded into HDFS, the Hadoop map-reduce process is run to process the data coming from different sources.
Issues: Some issues that we face during this phase of data processing are coding issues in map-reduce jobs; jobs working correctly when run on a standalone node but incorrectly when run on multiple nodes; incorrect aggregations; node configuration issues; and incorrect output formats.
Figure 2: Big Data Architecture. Web logs, streaming data, social data and transactional data (RDBMS) are loaded into HDFS (for example using Sqoop); Map Reduce jobs execute on HDFS with Pig, Hive and HBase (NoSQL DB) layered above; processed data flows through an ETL process into the enterprise data warehouse and is reported using BI tools. Testing focus areas: (1) pre-Hadoop process validation, (2) Map-Reduce process validation, (3) ETL process validation, (4) non-functional testing (performance, failover testing), and reports testing. Source: Infosys Research

Validations: Some high-level scenarios that need to be validated during this phase include (an aggregation check is sketched after this list):

1. Validating that data processing is completed and the output file is generated
2. Validating the business logic on a standalone node and then validating it after running against multiple nodes
3. Validating the map-reduce process to verify that key-value pairs are generated correctly
4. Validating the aggregation and consolidation of data after the reduce process
5. Validating the output data against the source files to ensure that data processing is completed correctly
6. Validating the output data file format and ensuring that the format is as per the requirement.
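As an illustration of checks 3 to 5, the sketch below recomputes the expected aggregation from the same sample input that the map-reduce job was run on in the test environment and compares it with the job output; the tab-separated key-value layout and the file names are assumptions.

# validate_aggregation.py - recompute the expected (key, count) pairs from the sample
# input and compare them against the reducer output fetched from HDFS.
from collections import Counter

SAMPLE_INPUT = "sample_input.txt"   # assumed sample fed to the map-reduce job
JOB_OUTPUT   = "part-00000"         # assumed reducer output file, fetched locally

expected = Counter()
with open(SAMPLE_INPUT) as f:
    for line in f:
        for word in line.strip().split():
            expected[word.lower()] += 1

actual = {}
with open(JOB_OUTPUT) as f:
    for line in f:
        key, value = line.rstrip("\n").split("\t")
        actual[key] = int(value)

mismatched = [k for k in expected if actual.get(k) != expected[k]]
unexpected = [k for k in actual if k not in expected]
print("keys checked   :", len(expected))
print("mismatched keys:", mismatched[:20])
print("unexpected keys:", unexpected[:20])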
Validation of Data Extract and Load into EDW
Once the map-reduce process is completed and the data output files are generated, this processed data is moved to an enterprise data warehouse or any other transactional system depending on the requirement.

Issues: Some issues that we face during this phase include incorrectly applied transformation rules, incorrect loading of HDFS files into the EDW and incomplete data extracts from Hadoop HDFS.

Validations: Some high-level scenarios that need to be validated during this phase include (a count-and-total reconciliation is sketched after this list):

1. Validating that transformation rules are applied correctly
2. Validating that there is no data corruption, by comparing target table data against the HDFS file data
3. Validating the data load in the target system
4. Validating the aggregation of data
5. Validating the data integrity in the target system.
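A minimal sketch of checks 2 and 3 is given below. It uses an SQLite table purely as a stand-in for the target warehouse table; in practice the same count and total queries would run against the actual EDW, and the file, table and column names are assumptions.

# validate_edw_load.py - compare row counts and an amount total between the HDFS output
# file and the target table; SQLite is used purely as a stand-in for the warehouse.
import csv, sqlite3

HDFS_OUTPUT = "processed_output.csv"        # assumed local copy of the HDFS output file
conn = sqlite3.connect("edw_standin.db")    # stand-in database; a real test targets the EDW
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS sales_fact (order_id TEXT, amount REAL)")

file_rows, file_total = 0, 0.0
with open(HDFS_OUTPUT, newline="") as f:
    for order_id, amount in csv.reader(f):
        file_rows += 1
        file_total += float(amount)

cur.execute("SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM sales_fact")
table_rows, table_total = cur.fetchone()

print("rows   - file: %d, table: %d" % (file_rows, table_rows))
print("amount - file: %.2f, table: %.2f" % (file_total, table_total))
conn.close()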
Validation of Reports
Analytical reports are generated using reporting
tools by fetching the data from EDW or running
queries on Hive.
Issues: Some of the issues faced while generating reports are the report definition not being set as per the requirement, report data issues, and layout and format issues.

Validations: Some high-level validations performed during this phase include:

Reports Validation: Reports are tested after the ETL/transformation workflows are executed for all the source systems and the data is loaded into the DW tables. The metadata layer of the reporting tool provides an intuitive business view of the data available for report authoring. Checks are performed by writing queries to verify whether the views are getting the exact data needed for the generation of the reports.

Cube Testing: Cubes are tested to verify that dimension hierarchies with pre-aggregated values are calculated correctly and displayed in the report.

Dashboard Testing: Dashboard testing consists of testing the individual web parts and reports placed in a dashboard. Testing involves ensuring that all objects are rendered properly and that the resources on the webpage are current. The data fetched from the various web parts is validated against the databases.
VOLUME, VARIETY AND VELOCITY:
HOW TO TEST?
In the earlier sections we have seen step-by-step details of what needs to be tested at each phase of Big data processing. During these phases, the three dimensions or characteristics of Big data, i.e., volume, variety and velocity, are validated to ensure there are no data quality defects and no performance issues.
Volume: The amount of data created both inside corporations and outside the corporations via the web, mobile devices, IT infrastructure, and other sources is increasing exponentially each year [3]. Huge volumes of data flow from multiple systems and need to be processed and analyzed. When it comes to validation it is a big challenge to ensure that the whole data set processed is correct. Manually validating the whole data set is a tedious task, so compare scripts should be used to validate the data. As data stored in HDFS is in file format, scripts can be written to compare two files and extract the differences using compare tools [4]. Even with compare tools it will take a lot of time to do a 100% data comparison. To reduce the execution time we can either run all the comparison scripts in parallel on multiple nodes, just as data is processed using the Hadoop map-reduce process, or sample the data while ensuring maximum scenarios are covered.
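The parallel comparison idea can be sketched as follows: each expected/actual file pair is compared in a separate worker process instead of in one serial pass. The directory layout and the convention that files pair up by name are assumptions.

# parallel_compare.py - compare many expected/actual file pairs in parallel, spreading
# the comparison work across worker processes instead of one serial pass.
import glob, os
from multiprocessing import Pool

EXPECTED_DIR = "expected"   # assumed folder of expected-result files
ACTUAL_DIR   = "actual"     # assumed folder of files extracted from HDFS, paired by name

def compare_pair(expected_path):
    name = os.path.basename(expected_path)
    actual_path = os.path.join(ACTUAL_DIR, name)
    if not os.path.exists(actual_path):
        return name, None, None          # flag a missing output file
    with open(expected_path) as e, open(actual_path) as a:
        exp, act = set(e), set(a)
    return name, len(exp - act), len(act - exp)

if __name__ == "__main__":
    with Pool() as pool:
        for name, missing, extra in pool.map(compare_pair, sorted(glob.glob(EXPECTED_DIR + "/*"))):
            if missing is None:
                print("%s: actual file not found" % name)
            else:
                print("%s: %d missing lines, %d unexpected lines" % (name, missing, extra))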
Figure 3 shows the approach for comparing voluminous amounts of data. Data is converted into the expected-results format and then compared against the actual data using compare tools. This is a faster approach but involves initial scripting time; it also reduces subsequent regression testing cycle time. When there is not enough time to validate the complete data, sampling can be done for validation.
Variety: The variety of data types is increasing, namely unstructured text-based data and semi-structured data like social media data, location-based data, and log-file data.

Structured data is data in a defined format coming from different RDBMS tables or from structured files. Data that is of a transactional nature can be handled in files or tables for validation purposes.
Figure 3: Approach for High Volume Data Validation. Map Reduce jobs are run in the test environment to generate the output data files; custom scripts convert unstructured data to structured data and raw data into the expected-results format; testing scripts validate the data in HDFS; and a compare tool performs a file-by-file comparison of the actual results against the expected results to produce a discrepancy report. Source: Infosys Research
Semi-structured data does not have any defined format, but structure can be derived from the multiple patterns in the data; an example is data extracted by crawling through different websites for analysis purposes. For validation, the data needs to be first transformed into a structured format using custom-built scripts. First the pattern needs to be identified and a copy book or pattern outline prepared; this copy book is then used in scripts to convert the incoming data into a structured format, after which validations are performed using compare tools (a small conversion sketch follows).
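As a small illustration of that conversion step, the sketch below uses a regular expression as the 'pattern outline' to turn semi-structured log lines into delimited records that a compare tool can consume; the log format and the pattern itself are hypothetical.

# pattern_convert.py - convert semi-structured log lines into a structured, delimited
# form using a predefined pattern (the 'copy book'), so that compare tools can be applied.
import csv, re, sys

# Hypothetical line format: "2013-01-15 10:22:31 user=abc action=login status=ok"
PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"user=(?P<user>\S+) action=(?P<action>\S+) status=(?P<status>\S+)"
)
FIELDS = ("date", "time", "user", "action", "status")

writer = csv.writer(sys.stdout)
writer.writerow(FIELDS)
with open("semi_structured.log") as f:      # assumed input file
    for line in f:
        match = PATTERN.search(line)
        if match:
            writer.writerow([match.group(name) for name in FIELDS])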
Unstructured data is data that does not have any format and is stored in documents, web content, etc. Testing unstructured data is very complex and time consuming. Automation can be achieved to some extent by converting the unstructured data into structured data using scripting, like Pig scripting, as shown in Figure 3. But the overall coverage achieved through automation will be low because of the unexpected behavior of the data; input data can be in any form and changes every time a new test is performed. We need to deploy a business-scenario validation strategy for unstructured data. In this strategy we identify the different scenarios that can occur in our day-to-day unstructured data analysis, and test data is set up based on these scenarios and executed.
Velocity: The speed at which new data is being created, and the need for real-time analytics to derive business value from it, is increasing thanks to the digitization of transactions, mobile computing and the sheer number of internet and mobile device users. Data speed needs to be considered when implementing any Big data appliance to overcome performance problems. Performance testing plays an important role in identifying any performance bottleneck in the system and in confirming that the system can handle high-velocity streaming data.
NON-FUNCTIONAL TESTING
In the earlier sections we have seen how functional testing is performed at each phase of Big data processing; these tests are performed to identify functional coding issues and requirements issues. Performance testing and failover testing need to be performed to identify performance bottlenecks and to validate the non-functional requirements.
Performance Testing: Any Big data project involves processing huge volumes of structured and unstructured data across multiple nodes to complete the job in less time. At times, because of bad architecture and poorly designed code, performance is degraded. If the performance does not meet the SLA, the purpose of setting up Hadoop and other Big data technologies is lost. Hence, performance testing plays a key role in any Big data project due to the huge volume of data and the complex architecture.
Some of the areas where performance issues can occur are imbalance in input splits, redundant shuffle and sort operations, and aggregation computations pushed to the reduce process that could have been done in the map process [5]. These performance issues can be eliminated by carefully designing the system architecture and by performance testing to identify the bottlenecks.
Performance testing is conducted by setting up huge volumes of data and an infrastructure similar to production. Utilities like the Hadoop performance monitoring tool can be used to capture the performance metrics and identify the issues. Performance metrics like job completion time and throughput, and system-level metrics like memory utilization, are captured as part of performance testing.
Failover Testing: Hadoop architecture consists of a name node and hundreds of data nodes hosted on several server machines, all of which are connected. There are chances of node failure, in which some of the HDFS components become non-functional. Failures can include name node failure, data node failure and network failure. The HDFS architecture is designed to detect these failures and automatically recover to proceed with the processing.

Failover testing is an important focus area in Big data implementations, with the objective of validating the recovery process and ensuring that data processing happens seamlessly when switched to other data nodes.

Some validations that need to be performed during failover testing are: validating that checkpoints of the edit logs and FsImage of the name node happen at defined intervals; recovery of the edit logs and FsImage files of the name node; absence of data corruption because of a name node failure; data recovery when a data node fails; and validating that replication is initiated when a data node fails or data becomes corrupted. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) metrics are captured during failover testing.
TEST ENVIRONMENT SETUP
As Big data involves handling huge volumes and processing across multiple nodes, setting up a test environment is the biggest challenge. Setting up the environment on the cloud gives us the flexibility to set it up and maintain it during test execution. Hosting the environment on the cloud also helps in optimizing the infrastructure and provides a faster time to market.
Key steps involved in setting up the environment on the cloud are [6]:

A. Big data Test infrastructure requirement assessment

1. Assess the Big data processing requirements
2. Evaluate the number of data nodes required in the QA environment
3. Understand the data privacy requirements to evaluate a private or public cloud
4. Evaluate the software inventory required to be set up on the cloud environment (Hadoop, file system to be used, NoSQL DBs, etc.).

B. Big data Test infrastructure design

1. Document the high-level cloud test infrastructure design (disk space, RAM required for each node, etc.)
2. Identify the cloud infrastructure service provider
3. Document the SLAs, communication plan, maintenance plan and environment refresh plan
4. Document the data security plan
5. Document the high-level test strategy, testing release cycles, testing types, volume of data processed by Hadoop, and third-party tools required.
C. Big data Test Infrastructure Implementation and Maintenance

■ Create a cloud instance of the Big data test environment
■ Install Hadoop, HDFS, MapReduce and other software as per the infrastructure design
■ Perform a smoke test on the environment by processing sample map reduce and Pig/Hive jobs (a possible smoke-test sketch is given after this list)
■ Deploy the code to perform testing.
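A possible smoke-test sketch is given below. It assumes the standard Hadoop command-line tools are on the path and that the bundled examples jar is available at the (illustrative) location shown; any sample Pig or Hive job could be substituted for the example MapReduce job.

# smoke_test.py - minimal environment smoke test: check that HDFS is reachable and that
# a sample MapReduce job from the bundled examples jar runs end to end.
import subprocess

EXAMPLES_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"  # assumed path

checks = [
    ["hdfs", "dfs", "-ls", "/"],                       # HDFS is up and browsable
    ["hadoop", "jar", EXAMPLES_JAR, "pi", "2", "10"],  # sample job completes on the cluster
]

for cmd in checks:
    print("running:", " ".join(cmd))
    result = subprocess.run(cmd)
    print("exit code:", result.returncode)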
BEST PRACTICES
Data Quality: It is very important to establish
the data quality requirements for different
forms of data like traditional data sources, data
from social media, data from sensors, etc. If the
data quality is ascertained, the transformation
logic alone can be tested, by executing tests
against all possible data sets.
Data Sampling: Data sampling gains significance in Big data implementations, and it becomes the tester's job to identify suitable sampling techniques that include all critical business scenarios and the right test data set.

Automation: Automate the test suites as much as possible. The Big data regression test suite will be used multiple times as the database is periodically updated. Hence an automated regression test suite should be built and used after each release. This will save a lot of time during Big data validations.
CONCLUSION
Data quality challenges can be countered by deploying a structured testing approach for both functional and non-functional requirements. Applying the right test strategies and following best practices will improve testing quality, which will help in identifying defects early and reduce the overall cost of the implementation. Organizations need to invest in building skill sets in both development and testing. Big data testing will be a specialized stream, and the testing team should be built with a diverse skill set including coding, white-box testing and data analysis skills so that it can do a better job of identifying quality issues in the data.
REFERENCES
1. Big data overview, Wikipedia.org. Available at http://en.wikipedia.org/wiki/Big_data.
2. White, T. (2010), Hadoop: The Definitive Guide, 2nd Edition, O'Reilly Media.
3. Kelly, J. (2012), Big data: Hadoop, Business Analytics and Beyond, A Big data Manifesto from the Wikibon Community. Available at http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond, Mar 2012.
4. Informatica Enterprise Data Integration (1998), Data verification using File and Table compare utility for HDFS and Hive tool. Available at https://community.informatica.com/solutions/1998.
5. Bhandarkar, M. (2009), Practical Problem Solving with Hadoop, USENIX '09 annual technical conference, June 2009. Available at http://static.usenix.org/event/usenix09/training/tutonefile.html.
6. Naganathan, V. (2012), Increase Business Value with Cloud-based QA Environments. Available at http://www.infosys.com/IT-services/independent-validation-testing-services/Pages/cloud-based-QA-environments.aspx.
VOL 11 NO 1
2013
Nature Inspired Visualization of Unstructured Big Data
By Aaditya Prakash

Reconstruct self-organizing maps as spider graphs for better visual interpretation of large unstructured datasets
Exponential growth of data-capturing devices has led to an explosion of available data. Unfortunately, not all available data is in a database-friendly format. Data which cannot be easily categorized, classified or imported into a database is termed unstructured data. Unstructured data is ubiquitous and is assumed to be around 80% of all data generated [1]. While tremendous advancements have taken place in analyzing, mining and visualizing structured data, the field of unstructured data, especially unstructured Big data, is still in a nascent stage.

Lack of recognizable structure and huge size make it very challenging to work with large unstructured datasets. Classical visualization methods limit the amount of information presented and are asymptotically slow with rising dimensions of the data. We present here a model to mitigate these problems and allow efficient, large-scale visualization of unstructured datasets.
A novel approach in unsupervised machine learning is the Self-Organizing Map (SOM). Along with classification, SOMs have the added benefit of dimensionality reduction. SOMs are also used for visualizing multidimensional data as a 2D planar diffusion map. This achieves data reduction, thus enabling visualization of large datasets. Present models used to visualize SOM maps lack deductive ability, which may be defeating the power of SOM. We introduce a better restructuring of SOM-trained data for more meaningful interpretation of very large data sets.
Taking inspiration from nature, we model the large unstructured dataset as spider cobweb type graphs. This has the benefit of allowing multivariate analysis, as different variables can be presented in one spider graph and their inter-variable relations can be projected, which cannot be done with classical SOM maps.
UNSTRUCTURED DATA
Unstructured data comes in different formats and sizes. Broadly, textual data, sound, video, images, webpages, logs, emails, etc., are categorized as unstructured data. In some cases even a bundle of numeric data could be collectively unstructured; for example, the health records of a patient. While a table of the cholesterol levels of all patients is more structured, all the biostats of a single patient are largely unstructured.
Unstructured data could be of any form and could contain any number of independent variables. Labeling, as done in machine learning, is only possible with data where information about variables such as size, length, dependency, precision, etc., is known. Even extraction of the underlying information in a cluster of unstructured data is very challenging because it is not known what is to be extracted [2].
The potential of hidden analytics within large unstructured datasets could be a valuable asset to any business or research entity. Consider the case of the Enron emails (collected and prepared by the CALO project). Emails are primarily unstructured, mostly because people often reply above the last email even when the new email's content and purpose might be different. Therefore most organizations do not analyze emails or logs, but several researchers have analyzed the Enron emails and their results show that a lot of predictive and analytical information can be obtained from them [3, 4, 5].
SELF ORGANIZING MAPS
The ability to harness increased computing power has been a great boon to business. From traditional business analytics to machine learning, the knowledge we get from data is invaluable. With computing forecast to get faster, perhaps through quantum computing someday, an even greater role for data is promised. While there has been a lot of effort to bring some structure into unstructured data [6], the cost of doing so has been a hindrance. With larger datasets it is an even greater problem, as they entail more randomness and unpredictability in the data.

Self-Organizing Maps (SOM) are a class of artificial neural networks proposed by Teuvo Kohonen [7] that transform the input dataset into a two-dimensional lattice, also called a Kohonen Map.
Structure
All the points of the input layer are mapped onto a two-dimensional lattice, called the Kohonen Network. Each point in the Kohonen Network is potentially a neuron.
Figure 1: Kohonen Network
Source: Infosys Research
Competition of Neurons
Once the Kohonen Network is completed, the neurons of the network compete according to the weights assigned from the input layer. The function used to declare the winning neuron is the simple Euclidean distance between the input point and the corresponding weight of each neuron. This function, called the discriminant function, is represented as

d_j(x) = \sqrt{ \sum_i (x_i - w_{ji})^2 }

where, x = point on the Input Layer
w = weight of the input point (x)
i = index over all the input points
j = index over all the neurons on the lattice
d = Euclidean distance between x and the weight vector of neuron j

Simply put, the winning neuron is the one whose weight vector is closest to the input point. This process effectively discretizes the output layer.
Cooperation of Neighboring Neurons
Once the winning neuron is found, the topological structure can be determined. Similar to the behavior of human brain cells (neurons), the winning neuron also excites its neighbors. Thus the topological structure is determined by the cooperative weights of the winning neuron and its neighbors.
Self-Organization
The process of selecting winning neurons and forming the topological structure is adaptive. The process runs multiple times to converge on the best mapping of the given input layer. SOM is better than other clustering algorithms in that it requires very few repetitions to reach a stable structure.
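The competition, cooperation and adaptation steps described above can be sketched in a few lines of NumPy. This is a simplified illustration of the update rule, not the implementation behind the plots in this paper, which were produced with R packages; the toy data, grid size and decay schedules are arbitrary choices.

# som_sketch.py - a tiny Self-Organizing Map: find the winning neuron by Euclidean
# distance, then pull the winner and its lattice neighbours towards the input.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 4))           # toy input: 500 samples, 4 variables
grid_w, grid_h = 10, 10
weights = rng.random((grid_w, grid_h, 4))
# Pre-compute lattice coordinates for the neighbourhood function.
gx, gy = np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij")

n_iter, sigma0, lr0 = 2000, 3.0, 0.5
for t in range(n_iter):
    x = data[rng.integers(len(data))]
    # Competition: squared Euclidean distance to every neuron's weight vector.
    dist = ((weights - x) ** 2).sum(axis=2)
    wi, wj = np.unravel_index(dist.argmin(), dist.shape)
    # Cooperation: Gaussian neighbourhood around the winner, shrinking over time.
    sigma = sigma0 * np.exp(-t / n_iter)
    lr = lr0 * np.exp(-t / n_iter)
    h = np.exp(-((gx - wi) ** 2 + (gy - wj) ** 2) / (2 * sigma ** 2))
    # Adaptation: move weights towards the input, scaled by the neighbourhood strength.
    weights += lr * h[:, :, None] * (x - weights)

print("trained weight grid shape:", weights.shape)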
Parallel SOM for large datasets
Among all classifying machine learning algorithms, the convergence speed of the SOM has been found to be the fastest [8]. This implies that for large data sets SOM is the most viable model.

Since the formation of the topological structure is independent of the input points, it can easily be parallelized. Carpenter et al. have demonstrated the ability of SOM to work under massively parallel processing [9]. Kohonen himself has shown that even where the input data may not be in vector form, as found in some unstructured data, large-scale SOM can be run nonetheless [10].
SOM PLOTS
SOM plots are a two-dimensional representation of the topological structure obtained after training the neural nets for a given number of repetitions and with a given radius. The SOM can be visualized as a complete 2-D topological structure [Fig. 2].
Figure 2: SOM Visualization using Rapidminer
(AGPL Open Source)
Source: Infosys Research
Figure 2 shows the overall topological structure obtained after dimensionality reduction of a multivariate dataset. While the graph above may be useful for outlier detection or general categorization, it is not very useful for analysis of individual variables.

Another option for visualizing SOM is to plot different variables in a grid format. One can use the R programming language (GNU Open Source) to plot the SOM results.
Note on running example
All the plots presented henceforth have been obtained using the R programming language. The dataset used is the SPAM Email Database. The database is in the public domain and freely available for research at the 'UCI Machine Learning Repository'. It contains 266,858 word instances from 4,601 SPAM emails. Emails are a good example of unstructured data. Using the public packages in R, we obtain the SOM plots.

Figure 3 is the plot of the SOM-trained result using the package 'Kohonen' [11]. This plot gives inter-variable analysis, the variables in this case being four of the most used words in the SPAM database, viz. 'order', 'credit', 'free' and 'money'. While this plot is better than the topological plot given in Figure 2, it is still difficult to interpret the result in a canonical sense.

Figure 3: SOM Visualization in R using the Package 'Kohonen'. Source: Infosys Research

Figure 4 is again the SOM plot of the above four most common words in the SPAM database, but this one uses the package called 'SOM' [12]. While this plot is numerical and gives the strength of the inter-variable relationship, it does not help in giving us the analytical picture. The information obtained is not actionable.

Figure 4: SOM Visualization in R using the Package 'SOM'. Source: Infosys Research
SPIDER PLOTS OF SOM
As we have seen in Figures 2, 3 and 4, the current visualization of SOM output could be improved for more analytical ability. We introduce a new method to plot SOM output, especially designed for large datasets.
Algorithm
1. Filter the results of SOM.
2. Make a polygon with as many sides as there are variables in the input.
3. Make the radius of the polygon the maximum of the values in the dataset.
4. Draw the grid for the polygon.
5. Make segments inside the polygon if the strength of the two variables inside the segment is greater than the specified threshold.
6. Loop step 5 for every variable against every other variable.
7. Color the segments based on the frequency of the variable.
8. Color the line segments based on the threshold of each variable pair plotted.

A minimal sketch of this construction is given after the list.
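The sketch below illustrates the polygon and pair-connection steps (2 to 6), with a simple colour and line-width encoding standing in for steps 7 and 8. It uses Python and matplotlib in place of the R code used for the figures in this paper; the pairwise strength matrix, the threshold and the variable names are made-up stand-ins for values that would come from the trained SOM.

# spider_sketch.py - place variables on a polygon and connect pairs whose strength
# exceeds a threshold; line width and colour encode the strength of the pair.
import numpy as np
import matplotlib.pyplot as plt

variables = ["order", "credit", "free", "money"]          # illustrative variable names
strength = np.array([[0.0, 0.2, 0.1, 0.3],                # made-up pairwise strengths
                     [0.2, 0.0, 0.6, 0.8],
                     [0.1, 0.6, 0.0, 0.4],
                     [0.3, 0.8, 0.4, 0.0]])
threshold = 0.35
n = len(variables)
angles = 2 * np.pi * np.arange(n) / n
xs, ys = np.cos(angles), np.sin(angles)                   # polygon vertices, unit radius

fig, ax = plt.subplots()
ax.plot(np.append(xs, xs[0]), np.append(ys, ys[0]), color="grey")   # polygon outline/grid
for i, name in enumerate(variables):
    ax.text(1.15 * xs[i], 1.15 * ys[i], name, ha="center", va="center")
# Connect every pair whose strength exceeds the threshold (steps 5 and 6).
for i in range(n):
    for j in range(i + 1, n):
        if strength[i, j] > threshold:
            ax.plot([xs[i], xs[j]], [ys[i], ys[j]],
                    color=plt.cm.viridis(strength[i, j]),
                    linewidth=4 * strength[i, j])
ax.set_aspect("equal")
ax.axis("off")
plt.show()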
Plots
As we can see in Figure 5, this plot is more meaningful than the SOM visualization plots obtained before. From the figure we can easily deduce that the words 'free' and 'order' do not have a relation similar to that of 'credit' and 'money'. Understandably so, because if a Spam email is selling something, it will probably contain the word 'order'; conversely, if it is advertising a product or software for 'free' download, it is unlikely to contain the word 'order'. The high relationship between 'credit' and 'money' signifies Spam emails advertising 'Credit Score' programs and other marketing traps.
Figure 6 shows the relationship of each variable, in this case four popular recurring words in the Spam database. The number of threads between one variable and another shows the probability of the second variable given the first variable. The several threads between 'free' and 'credit' suggest that Spam emails offering 'free credit' (disguised in other forms by fees or deferred interest) are among the most popular.

Using these spider plots we can analyze several variables at once. This may cause the graph to be messy, but sometimes we need to see the complete picture in order to make canonical decisions about the dataset.
From Figure 7 we can see that even though the figure shows 25 variables, it is not as cluttered as a scatter plot or bar chart would be if plotted with 25 variables.
Figure 5: SOM Visualization in R Using the Above Algorithm, Showing Segments (inter-variable dependency). Source: Infosys Research
Figure 6: SOM Visualization in R Using the Above Algorithm, Showing Threads (inter-variable strength). Source: Infosys Research
Figure 7: Spider Plot Showing 25 Sampled Words from the Spam Database. Source: Infosys Research
Figure 8: Uncolored Representation of Threads in Six Variables. Source: Infosys Research
Figure 8 shows the different levels of strength between the different variables. While the 'contact' variable is strongly linked with 'need' but not with 'help', it is no surprise that 'you' and 'need' are strongly linked. The idea here is only to present the visualization technique and not an analysis of the Spam dataset; for more on Spam filtering and Spam analysis one may refer to several independent works on the same [13, 14].
ADVANTAGES
There are several visual and non-visual advantages of using this new plot over the existing plots. This plot has been designed to handle Big data. Most of the existing plots mentioned above are limited in their capacity to scale; principally, if the range of the data is large, most existing plots tend to get skewed and important information is lost. By normalizing the data, this new plot prevents that issue. Allowing multiple dimensions to be incorporated also enables recognition of indirect relationships.
CONCLUSION
While unstructured data is abundant, free and rich in hidden information, the tools for analyzing it are still nascent and the cost of converting it to structured form is very high. Machine learning is used to classify unstructured data but comes with speed and space constraints. SOMs are among the fastest machine learning algorithms, but their visualization powers are limited. We have presented a naturally intuitive method to visualize SOM outputs which facilitates multi-variable analysis and is also highly scalable.
REFERENCES
1. Grimes, S., Unstructured data and the 80 percent rule. Retrieved from http://clarabridge.com/default.aspx?tabid=137.
2. Doan, A., Naughton, J. F., Ramakrishnan, R., Baid, A., Chai, X., Chen, F. and Vuong, B. Q. (2009), Information extraction challenges in managing unstructured data, ACM SIGMOD Record, vol. 37, no. 4, pp. 14-20.
3. Diesner, J., Frantz, T. L. and Carley, K. M. (2005), Communication networks from the Enron email corpus "It's always about the people. Enron is no different". In Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 201-228.
4. Chapanond, A., Krishnamoorthy, M. S., and Yener, B. (2005), Graph theoretic and spectral analysis of Enron email data. In Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 265-281.
5. Peterson, K., Hohensee, M., and Xia, F. (2011), Email formality in the workplace: A case study on the Enron corpus. In Proceedings of the Workshop on Languages in Social Media, pp. 86-95, Association for Computational Linguistics.
6. Buneman, P., Davidson, S., Fernandez, M., and Suciu, D. (1997), Adding structure to unstructured data. Database Theory, ICDT'97, pp. 336-350.
7. Kohonen, T. (1990), The self-organizing map. Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480.
8. Waller, N. G., Kaiser, H. A., Illian, J. B., and Manry, M. (1998), A comparison of the classification capabilities of the 1-dimensional Kohonen neural network with two partitioning and three hierarchical cluster analysis algorithms. Psychometrika, vol. 63, no. 1, pp. 5-22.
9. Carpenter, G. A., and Grossberg, S. (1987), A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, vol. 37, no. 1, pp. 54-115.
10. Kohonen, T., and Somervuo, P. (2002), How to make large self-organizing maps for non-vectorial data. Neural Networks, vol. 15, no. 8, pp. 945-952.
11. Wehrens, R. and Buydens, L. M. C. (2007), Self- and Super-organizing Maps in R: The Kohonen Package. Journal of Statistical Software, vol. 21, no. 5, pp. 1-19.
12. Yan, J. (2012), Self-Organizing Map (with application in gene clustering) in R. Available at http://cran.r-project.org/web/packages/som/som.pdf.
13. Dasgupta, A., Gurevich, M., and Punera, K. (2011), Enhanced email spam filtering through combining similarity graphs. In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 785-794.
14. Cormack, G. V. (2007), Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, vol. 1, no. 4, pp. 335-455.
Editor
Praveen B Malla PhD
Deputy Editor
Yogesh Dandawate
Graphics & Web Editor
Rakesh Subramanian
Chethana M G
Vivek Karkera
IP Manager
K V R S Sarma
Marketing Manager
Gayatri Hazarika
Online Marketing
Sanjay Sahay
Production Manager
Sudarshan Kumar V S
Database Manager
Ramesh Ramachandran
Distribution Managers
Santhosh Shenoy
Suresh Kumar V H
How to Reach Us:
Email:
[email protected]
Phone: +91 40 44290563
Post:
Infosys Labs Briefings,
B-19, Infosys Ltd.
Electronics City, Hosur Road,
Bangalore 560100, India
Subscription:
[email protected]
Rights, Permission,
Licensing and Reprints:
[email protected]
Infosys Labs Briefings is a journal published by Infosys Labs with the
objective of offering fresh perspectives on boardroom business technology.
The publication aims at becoming the most sought after source for thought
leading, strategic and experiential insights on business technology
management.
Infosys Labs is an important part of Infosys’ commitment to leadership
in innovation using technology. Infosys Labs anticipates and assesses the
evolution of technology and its impact on businesses and enables Infosys
to constantly synthesize what it learns and catalyze technology enabled
business transformation and thus assume leadership in providing best
of breed solutions to clients across the globe. This is achieved through
research supported by state-of-the-art labs and collaboration with industry
leaders.
About Infosys
Many of the world’s most successful organizations rely on Infosys to deliver measurable business value. Infosys provides business consulting, technology, engineering and outsourcing services to help clients in over 32 countries build tomorrow’s enterprise.
For more information about Infosys (NASDAQ:INFY), visit www.infosys.com
Authors featured in this issue
PRASANNA RAJARAMAN is a Senior Project Manager with RCL business unit of Infosys. He can
be reached at [email protected].
SARAVANAN BALARAJ is a Senior Associate Consultant with Infosys’ Retail & Logistics Consulting
Group. He can be contacted at [email protected].
SHANTHI RAO is a Group Project Manager with the FSI business unit of Infosys. She can be
contacted at [email protected].
SUDHEESHCHANDRAN NARAYANAN is a Senior Technology Architect with the Big data practice
under the Cloud Unit of Infosys. He can be reached at [email protected].
ZHONG LI PhD. is a Principal Architect with the Consulting and System Integration Unit of
Infosys. He can be contacted at [email protected].
Big Data: Countering Tomorrow's Challenges

Big data was the watchword of year 2012. Even before one could understand what it really meant, it began getting tossed about in huge doses in almost every other analyst report. Today, the World Wide Web hosts upwards of 800 million webpages, each page trying to either educate or build a perspective on the concept of Big data. Technology enthusiasts believe that Big data is 'the' next big thing after cloud. Big data is of late being adopted across industries with great fervor. In this issue we explore what the Big data revolution is and how it will likely help enterprises reinvent themselves.

As the citizens of this digital world we generate more than 200 exabytes of information each year. This is equivalent to 20 million Libraries of Congress. According to Intel, each internet minute sees 100,000 tweets, 277,000 Facebook logins, 204 million email exchanges, and more than 2 million search queries fired. Looking at the scale at which data is getting churned, it is beyond the scope of human capability to process it, and hence there is a need for machine processing of information. There is no dearth of data for today's enterprises. On the contrary, they are mired with data, and quite deeply at that. Today, therefore, the focus is on discovery, integration, exploitation and analysis of this overwhelming information. Big data may be construed as the technological intervention to undertake this challenge.

Since Big data systems are expected to help analyze structured and unstructured data, they are drawing huge investments. Analysts have estimated that enterprises will spend more than US$120 billion by 2015 on analysis systems. The success of Big data technologies depends upon natural language processing capabilities, statistical analytics, and large storage and search technologies. Big data analytics can help cope with large data volumes, data velocity and data variety. Enterprises have started leveraging these Big data systems to mine hidden insights from data. In this first issue of 2013, we bring to you papers that discuss how Big data analytics can make a significant impact on several industry verticals like medical, retail and IT, and how enterprises can harness the value of Big data.

Like always, do let us know your feedback about the issue.

Happy Reading,
Yogesh Dandawate
Deputy Editor
[email protected]
Infosys Labs Briefings
Advisory Board
Anindya Sircar PhD
Associate Vice President &
Head - IP Cell
Gaurav Rastogi
Vice President,
Head - Learning Services
Kochikar V P PhD
Associate Vice President,
Education & Research Unit
Raj Joshi
Managing Director,
Infosys Consulting Inc.
Ranganath M
Vice President &
Chief Risk Officer
Simon Towers PhD
Associate Vice President and
Head - Center of Innovation for
Tomorrow’s Enterprise,
Infosys Labs
Subu Goparaju
Senior Vice President &
Head - Infosys Labs