CONTENT BEYOND SYLLABUS – I
INTRODUCTION TO DATA MINING AND KNOWLEDGE DISCOVERY
What are Data Mining and Knowledge Discovery?
With the enormous amount of data stored in files, databases, and other repositories, it is
increasingly important, if not necessary, to develop powerful means for analysis and perhaps
interpretation of such data and for the extraction of interesting knowledge that could help in
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers
to the nontrivial extraction of implicit, previously unknown and potentially useful information
from data in databases. While data mining and knowledge discovery in databases (or KDD) are
frequently treated as synonyms, data mining is actually part of the knowledge discovery process.
The following figure (Figure 1.1) shows data mining as a step in an iterative knowledge
The Knowledge Discovery in Databases process comprises a few steps leading from raw data
collections to some form of new knowledge. The iterative process consists of the following
Data cleaning: also known as data cleansing, it is a phase in which noisy data and
irrelevant data are removed from the collection.
Data integration: at this stage, multiple data sources, often heterogeneous, may be
combined in a common source.
Data selection: at this step, the data relevant to the analysis is decided on and retrieved
from the data collection.
Data transformation: also known as data consolidation, it is a phase in which the
selected data is transformed into forms appropriate for the mining procedure.
Data mining: it is the crucial step in which clever techniques are applied to extract
potentially useful patterns.
Pattern evaluation: in this step, strictly interesting patterns representing knowledge are
identified based on given measures.
Knowledge representation: the final phase, in which the discovered knowledge is
visually represented to the user. This essential step uses visualization techniques to help
users understand and interpret the data mining results.
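The steps listed above can be sketched as a chain of small functions. This is only an illustrative sketch, not a prescribed implementation: the rental records, the function names and the 0.5 support threshold are all assumptions made for the example, and the "mining" step is reduced to counting item pairs that occur together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rental records from two stores (None marks a missing value).
store_a = [
    {"customer": "c1", "items": ["comedy", "drama"]},
    {"customer": "c2", "items": ["comedy", "action"]},
    {"customer": None, "items": ["drama"]},          # noisy record
]
store_b = [
    {"customer": "c3", "items": ["comedy", "drama", "action"]},
    {"customer": "c4", "items": []},                 # irrelevant record
]

def clean(records):
    """Data cleaning: drop records with a missing customer or no items."""
    return [r for r in records if r["customer"] is not None and r["items"]]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the attribute relevant to the analysis."""
    return [r["items"] for r in records]

def transform(baskets):
    """Data transformation: turn each basket into a sorted tuple of unique items."""
    return [tuple(sorted(set(b))) for b in baskets]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))
for pair, support in sorted(patterns.items()):
    print(pair, round(support, 2))
```

Knowledge representation is left as a plain printout here; in practice this is where a chart or other visualization would be produced.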
It is common to combine some of these steps together. For instance, data cleaning and data
integration can be performed together as a pre-processing phase to generate a data warehouse.
Data selection and data transformation can also be combined where the consolidation of the data
is the result of the selection, or, as for the case of data warehouses, the selection is done on
KDD is an iterative process. Once the discovered knowledge is presented to the user, the
evaluation measures can be enhanced, the mining can be further refined, new data can be
selected or further transformed, or new data sources can be integrated, in order to get different,
more appropriate results.
Data mining derives its name from the similarities between searching for valuable
information in a large database and mining rocks for a vein of valuable ore. Both imply either
sifting through a large amount of material or ingeniously probing the material to exactly pinpoint
where the values reside. It is, however, a misnomer, since mining for gold in rocks is usually
called "gold mining" and not "rock mining", thus by analogy, data mining should have been
called "knowledge mining" instead. Nevertheless, data mining became the accepted customary
term, and very rapidly a trend that even overshadowed more general terms such as knowledge
discovery in databases (KDD) that describe a more complete process. Other similar terms
referring to data mining are: data dredging, knowledge extraction and pattern discovery.
A data warehouse, as a storehouse, is a repository of data collected from multiple data
sources (often heterogeneous) and is intended to be used as a whole under the same unified
schema. A data warehouse gives the option to analyze data from different sources under the same
roof. Let us suppose that OurVideoStore becomes a franchise in North America. Many video
stores belonging to the OurVideoStore company may have different databases and different
structures. If the executive of the company wants to access the data from all stores for strategic
decision-making, future direction, marketing, etc., it would be more appropriate to store all the
data in one site with a homogeneous structure that allows interactive analysis. In other words,
data from the different stores would be loaded, cleaned, transformed and integrated together. To
facilitate decision-making and multi-dimensional views, data warehouses are usually modeled by
a multi-dimensional data structure. Figure 1.3 shows an example of a three-dimensional subset of
a data cube structure used for the OurVideoStore data warehouse.
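To make the idea of a multi-dimensional view concrete, the following minimal sketch stores a few rental totals along three dimensions and aggregates them. The store names, categories, quarters and figures are hypothetical, and the helper names roll_up and slice_cube are chosen only for this illustration; a real data warehouse would use a dedicated engine rather than plain Python lists.

```python
from collections import defaultdict

# Hypothetical fact records: (store, category, quarter, rentals).
facts = [
    ("Edmonton", "comedy", "Q1", 120),
    ("Edmonton", "drama", "Q1", 80),
    ("Edmonton", "comedy", "Q2", 150),
    ("Calgary", "comedy", "Q1", 90),
    ("Calgary", "drama", "Q2", 60),
]

DIMENSIONS = ("store", "category", "quarter")

def roll_up(facts, keep):
    """Aggregate rentals over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for *dims, rentals in facts:
        totals[tuple(dims[i] for i in idx)] += rentals
    return dict(totals)

def slice_cube(facts, dimension, value):
    """Fix one dimension to a single value, keeping the matching facts."""
    i = DIMENSIONS.index(dimension)
    return [f for f in facts if f[i] == value]

by_store = roll_up(facts, ["store"])
by_category_q1 = roll_up(slice_cube(facts, "quarter", "Q1"), ["category"])
print(by_store)
print(by_category_q1)
```

Rolling up by store gives one total per store, while slicing on a quarter first restricts the analysis to that quarter before aggregating by category.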
CONTENT BEYOND SYLLABUS – II
The Information Visualization Research Group at the Institute for Software Research at
the University of California, Irvine states on its Web pages:
Information visualization focuses on the development and empirical analysis of methods
for presenting abstract information in visual form. The visual display of information allows
people to become more easily aware of essential facts, to quickly see regularities and outliers in
data, and therefore to develop a deeper understanding of data. Interactive visualization
additionally takes advantage of people's ability to also identify interesting facts when the visual
display changes, and allows them to manipulate the visualization or the underlying data to
explore such changes.
Issues to consider in Information Visualisation
Before using one or more IV techniques we have to consider several issues (Spence, 2001; Card
et al., 1999; Hearst, 2003; Reed and Heller, 1997):
1. The problem. This relates to what has to be presented, found, or demonstrated.
2. The nature of the data. Data types could be numerical (e.g. list of integers or reals),
ordinal (non-numerical data having a conventional ordering, such as days of the week), and
categorical (data with no order, such as names of persons or cities).
3. Number of data dimensions. Depending on the number of dimensions (also called attributes
or variables (Spence, 2001)), representations are said to be handling univariate
(one dimension), bivariate (two dimensions), trivariate (three dimensions), and multivariate
(four or more dimensions) data. We perceive our world in three spatial dimensions, so it
is easy to map and interpret up to three dimensions. However, handling more than three
dimensions is very frequent in real world situations and represents one of the most challenging
tasks in IV.
4. Structure of the data. This could be linear (data coded in plain data structures such as
arrays, tables, alphabetical lists, sets, etc.), temporal (data that changes over time),
spatial or geographic (data that corresponds to something physical, e.g. maps,
floor plans, 3D CAD; usually this is a subject of scientific visualisation and is not considered
to be IV in the strict sense), hierarchical (data that naturally arise in taxonomies, the
structures of organisations, disk space management, genealogies, etc.), or network (data describing
graph structures, i.e. nodes and links, with a node representing a data point and a link
representing a relationship between two nodes).
5. Type of interaction. Whether the resulting graphical representation is static (e.g. a print
or a static image on a display screen), transformable (users can manipulate how the
representation is rendered, such as zooming or filtering), or manipulable (users may control
parameters during the process of image generation, i.e. restricting the view to certain data
A problem people face in everyday life is having too many things to place in
a limited space: books on shelves, addresses in an agenda, windows on a computer screen, data to
display on a Personal Digital Assistant (PDA). The information explosion of recent
years has produced more data than can easily be displayed at once. "Too much
data, too little display area" is a common problem in Information Visualisation (Spence, 2001).
There are several techniques proposed to solve this problem, some of which are zooming,
panning, scrolling, focus+context and magic lenses (Spence, 2001).
• Zooming is the increasing magnification of a decreasing (or increasing) fraction of a
• Panning is the smooth movement of a viewing frame over a two-dimensional image of
• Scrolling is the movement of data past a window able to contain only a part of it (as
when scrolling a long document in a word processing program).
• Focus+context's basic idea is to show the overall picture (the context) at the same time
as the details of immediate interest (the focus). This technique allows users to expand
and contract selected sections of a large image, thereby displaying simultaneously the
contents of individual sections of a document as well as its overall structure.
• Magic lenses follow the metaphor of reading a text by means of a lens that enlarges the
size of the text. In IV a lens can be placed upon the area of interest to obtain more
detailed information on the data amplified by the lens. For instance, magic lenses could
be applied to Figure 9: an application could show this map to tourists and a lens placed over
parts of the map could show details about the historical place selected.
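Three of the techniques above (panning, zooming and scrolling) can be expressed as simple operations on a viewing frame. The sketch below is only one possible formulation: the Viewport class, its field names and the scroll helper are assumptions introduced for this example, and focus+context and magic lenses are not covered.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Viewport:
    """A viewing frame over a 2-D image, in image coordinates."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

    def pan(self, dx, dy):
        """Panning: move the frame without changing its size."""
        return Viewport(self.x + dx, self.y + dy, self.width, self.height)

    def zoom(self, factor):
        """Zooming: magnify by `factor` around the frame centre.

        A factor above 1 shows a smaller fraction of the image.
        """
        cx = self.x + self.width / 2
        cy = self.y + self.height / 2
        w, h = self.width / factor, self.height / factor
        return Viewport(cx - w / 2, cy - h / 2, w, h)

def scroll(lines, first, window_size):
    """Scrolling: return the part of `lines` visible through a fixed window."""
    return lines[first:first + window_size]

view = Viewport(0, 0, 100, 80)
document = ["line %d" % i for i in range(1, 11)]
print(view.pan(10, 5))
print(view.zoom(2))
print(scroll(document, first=3, window_size=4))
```

Zooming in by a factor of 2 halves the width and height of the frame, so the same display area shows a quarter of the original region at twice the magnification.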
CONTENT BEYOND SYLLABUS – III
APPLICATIONS OF VISUALIZATION
A scientific visualization of an extremely large simulation of a Rayleigh-Taylor instability
caused by two mixing fluids.
As a subject in computer science, scientific visualization is the use
of interactive, sensory representations, typically visual, of abstract data to reinforce cognition,
hypothesis building, and reasoning. Data visualization is a related subcategory of visualization
dealing with statistical graphics and geographic or spatial data (as in thematic cartography) that
is abstracted in schematic form.
Scientific visualization is the transformation, selection, or representation of data from
simulations or experiments, with an implicit or explicit geometric structure, to allow the
exploration, analysis, and understanding of the data. Scientific visualization focuses and
emphasizes the representation of higher order data using primarily graphics and animation
techniques. It is a very important part of visualization and maybe the first one, as the
visualization of experiments and phenomena is as old as science itself. Traditional areas of
scientific visualization are flow visualization, medical visualization, astrophysical visualization,
and chemical visualization. There are several different techniques to visualize scientific data,
with isosurface reconstruction and direct volume rendering being the more common.
Educational visualization uses a simulation, normally created on a computer, to produce an
image of something so that it can be taught. This is very useful when teaching a topic
that is otherwise difficult to see, for example atomic structure, because atoms are far too small to
be studied easily without expensive and difficult-to-use scientific equipment. It can also be used
to view the past, such as looking at dinosaurs, or to examine things that are difficult or too fragile
to examine in reality, such as the human skeleton.
Information visualization concentrates on the use of computer-supported tools to explore large
amounts of abstract data. The term "information visualization" was originally coined by the User
Interface Research Group at Xerox PARC, whose members included Dr. Jock Mackinlay. Practical
application of information visualization in computer programs involves selecting, transforming,
and representing abstract data in a form that facilitates human interaction for exploration and
understanding. Important aspects of information visualization are the dynamics of the visual
representation and its interactivity. Strong techniques enable the user to modify the visualization
in real-time, thus affording unparalleled perception of patterns and structural relations in the
abstract data in question.
Knowledge visualization is the use of visual representations to transfer knowledge between at
least two persons. It aims to improve the transfer of knowledge by using computer and
non-computer-based visualization methods complementarily. Examples of such visual formats are
sketches, diagrams, images, objects, interactive
visualizations, information visualization applications, and imaginary visualizations as in stories.
While information visualization concentrates on the use of computer-supported tools to derive
new insights, knowledge visualization focuses on transferring insights and creating new
knowledge in groups. Beyond the mere transfer of facts, knowledge visualization aims to further
transfer insights, experiences, attitudes, values, expectations, perspectives, opinions, and
predictions by using various complementary visualizations.
Product visualization involves visualization software technology for the viewing and
manipulation of 3D models, technical drawing and other related documentation of manufactured
components and large assemblies of products. It is a key part of product lifecycle management.
Product visualization software typically provides high levels of photorealism so that a product
can be viewed before it is actually manufactured. This supports functions ranging from design
and styling to sales and marketing. Technical visualization is an important aspect of product
development. Originally technical drawings were made by hand, but with the rise of advanced
computer graphics the drawing board has been replaced by computer-aided design (CAD). CAD
drawings and models have several advantages over hand-made drawings such as the possibility
of 3-D modeling, rapid prototyping, and simulation.
Systems visualization is a new field of visualization which integrates and subsumes existing
visualization methodologies and adds to them narrative storytelling, visual metaphors (from the
field of advertising) and visual design. It also recognizes the importance of complex systems
theory, the interconnections of systems of systems and the need for knowledge representation
through ontologies. Systems visualization gives the viewer the ability
to quickly understand the complexity of a system. Unlike other visualization approaches (such
as data visualization, information visualization, flow visualization, scientific visualization and
network visualization), which focus mainly on data representation, systems visualization seeks to
provide a new way to visualize complex systems of systems through an integrative approach.
Visual communication is the communication of ideas through the visual display of information.
Primarily associated with two dimensional images, it includes: alphanumerics, art, signs, and
electronic resources. Recent research in the field has focused on web design and graphically
oriented usability.
Visual analytics focuses on human interaction with visualization systems as part of a larger
process of data analysis. Visual analytics has been defined as "the science of analytical reasoning
supported by the interactive visual interface". Its focus is on human information discourse
(interaction) within massive, dynamically changing information spaces. Visual analytics research
concentrates on support for perceptual and cognitive operations that enable users to detect the
expected and discover the unexpected in complex information spaces. Technologies resulting
from visual analytics find their application in almost all fields, but are being driven by critical
needs (and funding) in biology and national security.
The following are examples of some common visualization techniques:
• direct volume rendering
• streamlines, streaklines, and pathlines
• table, matrix
• charts (pie chart, bar chart, histogram, function graph, scatter plot, etc.)
• graphs (tree diagram, network diagram, flowchart, existential graph, etc.)
• parallel coordinates - a visualization technique for multidimensional data
• treemap - a visualization technique for hierarchical data
• Venn diagram
• Euler diagram
• Chernoff face
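As an illustration of one technique from the list, the sketch below computes a single level of a treemap layout: a rectangle is divided into strips whose areas are proportional to the sizes of the items. The function name slice_layout and the folder sizes are hypothetical, and this is only the simplest layout scheme; applying it recursively to the children of each item, alternating between horizontal and vertical splits, would give a nested treemap of a hierarchy.

```python
def slice_layout(sizes, x, y, width, height, horizontal=True):
    """Split a rectangle into strips whose areas are proportional to `sizes`.

    Returns a dict mapping each name to an (x, y, width, height) rectangle.
    """
    total = sum(sizes.values())
    rects = {}
    offset = 0.0
    for name, size in sizes.items():
        share = size / total
        if horizontal:
            # Strips are placed side by side, left to right.
            rects[name] = (x + offset, y, width * share, height)
            offset += width * share
        else:
            # Strips are stacked from top to bottom.
            rects[name] = (x, y + offset, width, height * share)
            offset += height * share
    return rects

# Hypothetical disk usage per folder, in megabytes.
usage = {"docs": 50, "music": 25, "photos": 25}
layout = slice_layout(usage, 0, 0, 200, 100)
for name, rect in layout.items():
    print(name, rect)
```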
CONTENT BEYOND SYLLABUS – IV
Data visualization is the study of the visual representation of data, meaning "information
that has been abstracted in some schematic form, including attributes or variables for the units of
According to Friedman (2008) the "main goal of data visualization is to communicate
information clearly and effectively through graphical means. It doesn’t mean that data
visualization needs to look boring to be functional or extremely sophisticated to look beautiful.
To convey ideas effectively, both aesthetic form and functionality need to go hand in hand,
providing insights into a rather sparse and complex data set by communicating its key-aspects in
a more intuitive way. Yet designers often fail to achieve a balance between form and function,
creating gorgeous data visualizations which fail to serve their main purpose — to communicate
information". Indeed, Fernanda Viegas and Martin M. Wattenberg have suggested that an ideal
visualization should not only communicate clearly, but stimulate viewer engagement and
attention. Data visualization is closely related to information graphics, information visualization,
scientific visualization, and statistical graphics. In the new millennium, data visualization has
become an active area of research, teaching and development.
Data analysis is the process of studying and summarizing data with the intent to extract useful
information and develop conclusions. Data analysis is closely related to data mining, but data
mining tends to focus on larger data sets with less emphasis on making inference, and often uses
data that was originally collected for a different purpose.
In statistical applications, some people divide data analysis into descriptive statistics, exploratory
data analysis (EDA), and inferential statistics (or confirmatory data analysis, CDA), where EDA focuses on
discovering new features in the data, and CDA on confirming or falsifying existing hypotheses.
Types of data analysis are:
• Exploratory data analysis (EDA): an approach to analyzing data for the purpose of formulating
hypotheses worth testing, complementing the tools of conventional statistics for testing
hypotheses. It was so named by John Tukey.
• Qualitative data analysis (QDA) or qualitative research is the analysis of non-numerical data,
for example words, photographs, observations, etc.
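A typical first step in exploratory data analysis is to summarize a variable before forming any hypothesis about it. The sketch below computes a five-number summary and flags unusual values using the 1.5 x IQR rule, which is one common convention rather than the only one. The data values and the function name are made up for the example; statistics.quantiles requires Python 3.8 or later.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical measurements; the last value looks unusual.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
low, q1, median, q3, high = five_number_summary(data)

# Values further than 1.5 interquartile ranges outside the quartiles are flagged.
iqr = q3 - q1
outliers = [v for v in data if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
print(low, q1, median, q3, high, outliers)
```

A flagged value is not necessarily an error; in the exploratory spirit it is a prompt to look more closely at the data.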
Data governance encompasses the people, processes and technology required to create a
consistent, enterprise view of an organisation's data in order to:
• Increase consistency & confidence in decision making
• Decrease the risk of regulatory fines
• Improve data security
• Maximize the income generation potential of data
• Designate accountability for information quality
Data management comprises all the academic disciplines related to managing data as a valuable
resource. The official definition provided by DAMA is that "Data Resource Management is the
development and execution of architectures, policies, practices, and procedures that properly
manage the full data lifecycle needs of an enterprise." This definition is fairly broad and
encompasses a number of professions that may not have direct technical contact with lower-level
aspects of data management, such as relational database management.
Data mining is the process of sorting through large amounts of data and picking out relevant
information. It is commonly used by business intelligence organizations and financial analysts, but
is increasingly being used in the sciences to extract information from the enormous data sets
generated by modern experimental and observational methods. It has been described as "the
nontrivial extraction of implicit, previously unknown, and potentially useful information from
data" and "the science of extracting useful information from large data sets or databases." In
relation to enterprise resource planning, according to Monk (2006), data mining is "the statistical
and logical analysis of large sets of transaction data, looking for patterns that can aid decision
Data transformation is the process of automatically converting both real-time and offline
data from one format to another. Standards and protocols provide the
specifications and rules, and the conversion usually occurs in a processing pipeline for
aggregation, consolidation or interoperability. The primary users are systems integration
organizations and compliance personnel.
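A small example of converting data from one format to another is turning CSV text into JSON. The sketch below uses only the Python standard library; the column names and the rule that the "amount" column is numeric are assumptions made for this example.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # Hypothetical rule: the "amount" column is numeric.
        row["amount"] = float(row["amount"])
        rows.append(row)
    return json.dumps(rows, indent=2)

source = "id,customer,amount\n1,Ann,12.50\n2,Bob,7.00\n"
result = csv_to_json(source)
print(result)
```

All other columns are kept as text; a real pipeline would take such typing rules from the agreed specification of the target format.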
CONTENT BEYOND SYLLABUS – V
HISTORY OF VISUALIZATION
Static visualisation (from 2500 BC to 1990 AD)
The computer-driven visualisations shown so far are comparatively recent, but visualisations of
various forms date back many millennia. The Mesopotamian clay tablet in Figure 8 is around
4500 years old, and contains a table of administrative information. We may think bureaucracy is
new, but the vast majority of early clay tablet writing is of an administrative / financial nature,
often including simple tables of numbers.
Moving on 3500 years, Figure 9 shows an early line graph of solar, lunar and planetary
movements. The x-axis is days in the month and the lines track each heavenly body, where the
y-axis is their height in the sky. Figure 10 skips forward to the 19th century and shows a
visualisation of the Paris–Lyon train timetable, with the x-axis hours of the day (from 6am to
6am the next day) and the y-axis showing the distance along the route (Paris at the top, Lyon at
the bottom). Fast trains stand out clearly as the steeper lines.
Examples of interactive visualisation can be traced back to early scanning vector graphics
displays, or the seaside information boards where tiny lights were illuminated when you pressed
buttons for different kinds of features. However, it was in the early 1990s when growing
graphics power made it possible for the first time to create rich 3D graphics, complex
visualisations and real-time interaction. This led to a blossoming of information visualisation
(and other graphics) research, notably in the groups at Xerox PARC and the University of
Maryland. Not all the ideas were good; just as with gloriously multi-fonted documents during
the desktop publishing revolution in the 1980s, there were many examples of gratuitous 3D
which are deservedly forgotten. However, despite this, most of the core kinds of visualisations in use
today were introduced at that time (see selection in Figure 11), several of which will be
discussed in the next section.
We have already seen examples of data journalism where rich, but simple to understand,
infographics have made their way into mainstream media. Furthermore, the web has increased the
public expectations of high quality, often interactive, visualisations. These web visualisations are
sometimes 'authored', that is created by the individual or institution responsible for the article or
blog. However, there are also a number of data sharing and analysis sites that make it easy to
upload and visualise your own data; Figure 12 shows one example, IBM's "Many Eyes".
Furthermore open data initiatives by governments and corporations across the world are making
data on many aspects of life available to all from the environment to employment, crime to
concert venues. Often this comes with an invitation to mash up and visualise the data in citizen's