Published on February 2017
A Genetic Programming Approach to Record Deduplication
ABSTRACT
Several systems that rely on consistent data to offer high-quality services, such as digital
libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi-replicas,
or near-duplicate entries in their repositories. Because of this, private and government
organizations have invested significantly in methods for removing replicas from their data
repositories. Clean, replica-free repositories not only allow the retrieval of higher quality
information but also lead to more concise data and to potential savings in the computational time
and resources needed to process this data. In this paper, we propose a genetic programming
approach to record deduplication that combines several different pieces of evidence extracted
from the data content to find a deduplication function that is able to identify whether two entries
in a repository are replicas. As shown by our experiments, our approach outperforms an existing
state-of-the-art method found in the literature. Moreover, the suggested functions are
computationally less demanding since they use fewer pieces of evidence. In addition, our genetic
programming approach is capable of automatically adapting these functions to a given fixed
replica identification boundary, freeing the user from the burden of having to choose and tune
this parameter.
EXISTING SYSTEM
The increasing volume of information available in digital media has become a
challenging problem for data administrators. Usually built on data gathered from different
sources, data repositories such as those used by digital libraries and e-commerce brokers may
contain records with disparate structures. Moreover, problems regarding low response time,
availability, security, and quality assurance become harder to handle as the amount of data
grows. Today, an organization's capacity to provide useful services to its users is proportional
to the quality of the data handled by its systems. In this environment, the decision to keep
repositories with "dirty" data (i.e., with replicas, non-standardized representations, etc.) goes
far beyond technical questions such as the overall speed or performance of data management
systems. The solutions available for addressing this problem require more than technical effort;
they demand management and cultural changes as well.
To better understand the impact of this problem, it is important to list and analyze the
major consequences of allowing "dirty" data in the repositories. These include, for example:
1) performance degradation: as additional useless data demands more processing, more time is
required to answer simple user queries; 2) quality loss: the presence of replicas and other
inconsistencies leads to distortions in reports and to misleading conclusions based on the
existing data; 3) increasing operational costs: because of the additional volume of useless data,
investments are required in more storage media and extra computational processing power to keep
response times acceptable.

PROPOSED SYSTEM
More specifically, record deduplication is the task of identifying, in a data repository,
records that refer to the same real-world entity or object despite misspelled words, typos,
different writing styles, or even different schema representations or data types. Consequently,
private and government organizations have invested heavily in methods for removing replicas from
data repositories: clean, replica-free repositories not only allow the retrieval of higher
quality information but also lead to a more concise data representation and to potential savings
in the computational time and resources needed to process this data.
In this paper, we present a genetic programming (GP) approach to record deduplication.
Our approach combines several different pieces of evidence extracted from the data content to
produce a deduplication function that is able to identify whether two entries in a repository
are replicas. Since record deduplication is a time-consuming task even for small repositories,
our aim is to develop a method that finds a proper combination of the best pieces of evidence,
yielding a deduplication function that maximizes performance while using only a small,
representative portion of the data for training. This function can then be used on the remaining
data or even applied to other repositories with similar characteristics. Moreover, new data can
be treated in the same way by the suggested function, as long as there are no abrupt changes in
the data patterns, which is very improbable in large data repositories. It is worth noting that
this (arithmetic) function, which can be thought of as a combination of several effective
deduplication rules, is easy and fast to compute, allowing its efficient application to the
deduplication of large repositories.

ADVANTAGES OF PROPOSED SYSTEM
In sum, the main contribution of this paper is a GP-based approach to record deduplication that:

 Outperforms an existing state-of-the-art machine-learning-based method found in the
literature;

 Provides less computationally intensive solutions, since it suggests deduplication
functions that use the available evidence more efficiently;

 Frees the user from the burden of choosing how to combine similarity functions and
repository attributes. This distinguishes our approach from all existing methods, since
they require user-provided settings;

 Frees the user from the burden of choosing the replica identification boundary value,
since it automatically selects the deduplication functions that best fit this
deduplication parameter.


MODULES
 Genetic Operations
 Generational Evolutionary Algorithm
 Modeling the Record Deduplication Problem with GP
 Experiments with the Replica Identification Boundary

MODULE DESCRIPTION
Genetic Operations
Usually, GP evolves a population of variable-size data structures, also called individuals,
each one representing a single solution to a given problem. In our modeling, the trees represent
arithmetic functions. When using this tree representation in a GP-based method, a set of
terminals and functions must be defined. Terminals are inputs, constants, or zero-argument
nodes that terminate a branch of a tree; they are also called tree leaves. The function set is
the collection of operators, statements, and basic or user-defined functions that the GP
evolutionary process can use to manipulate the terminal values. These functions are placed in
the internal nodes of the tree.
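The tree representation described above can be sketched in a few lines. The following is an illustrative example, not the authors' implementation: internal nodes hold operators from a hypothetical function set of basic arithmetic, while leaves hold terminals (evidence values, here named E1 and E2, or constants).

```python
import operator

# Hypothetical function set: basic arithmetic operators.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

class Node:
    def __init__(self, value, children=()):
        self.value = value          # an operator symbol, a terminal name, or a constant
        self.children = list(children)

    def evaluate(self, terminals):
        """Recursively evaluate the tree given a mapping of terminal values."""
        if not self.children:       # leaf: an evidence value or a constant
            return terminals.get(self.value, self.value)
        args = [c.evaluate(terminals) for c in self.children]
        return FUNCTIONS[self.value](*args)

# Example individual: the arithmetic function (E1 * 2) + E2.
tree = Node("+", [Node("*", [Node("E1"), Node(2)]), Node("E2")])
print(tree.evaluate({"E1": 0.8, "E2": 0.5}))  # 0.8*2 + 0.5 = 2.1
```

Genetic operations such as crossover and mutation then swap or replace subtrees, which is why the tree representation guarantees that offspring remain valid arithmetic functions.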

Generational Evolutionary Algorithm
1. Initialize the population (with random or user-provided individuals).
2. Evaluate all individuals in the present population, assigning a numeric rating or fitness
value to each one.
3. If the termination criterion is fulfilled, go to the last step. Otherwise, continue.
4. Reproduce the best n individuals into the next generation's population.
5. Select m individuals that, together with the best parents, will compose the next generation.
6. Apply the genetic operations to all selected individuals. Their offspring will compose the
next population. Replace the existing generation with the generated population and go back to
Step 2.
7. Present the best individual(s) in the population as the output of the evolutionary process.
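The steps above can be sketched as a compact loop. This is a generic sketch under simplifying assumptions, not the paper's code: `fitness`, `crossover`, and `mutate` are placeholders the reader would supply for their own individual representation.

```python
import random

def evolve(init_population, fitness, crossover, mutate,
           generations=50, n_elite=2, mutation_rate=0.1):
    population = list(init_population)                        # Step 1
    for _ in range(generations):                              # Step 3: fixed-budget termination
        scored = sorted(population, key=fitness, reverse=True)  # Step 2: evaluate and rank
        next_gen = scored[:n_elite]                           # Step 4: elitism
        while len(next_gen) < len(population):                # Steps 5-6: select and breed
            p1, p2 = random.sample(scored[:len(scored) // 2], 2)  # parents from the better half
            child = crossover(p1, p2)
            if random.random() < mutation_rate:
                child = mutate(child)
            next_gen.append(child)
        population = next_gen                                 # Step 6: generational replacement
    return max(population, key=fitness)                       # Step 7: best individual
```

In the paper's setting the individuals would be the arithmetic trees, and fitness would measure deduplication quality on the training portion of the repository.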

MODELING THE RECORD DEDUPLICATION PROBLEM WITH GP
When using GP (or another evolutionary technique) to solve a problem, some basic
requirements must be fulfilled, which depend on the data structure used to represent the
solution [8]. In our case, we have chosen a tree-based GP representation for the deduplication
function, since it is a natural representation for this type of function. These requirements [9]
are the following:
1. All possible solutions to the problem must be representable as a tree, no matter its size.
2. The evolutionary operations applied to each individual tree must, in the end, result in a
valid tree.
3. Each individual tree must be automatically evaluable.
For Requirement 1, it is necessary to consider the kind of solution we intend to find. In
the record deduplication problem, we look for a function that combines pieces of evidence. In
our approach, each piece of evidence (or simply "evidence") E is a pair <attribute, similarity
function> that represents the use of a specific similarity function over the values of a
specific attribute found in the data being analyzed.
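One way to picture such a pair is shown below. This is an illustrative assumption, not the authors' code: the attribute name ("title"), the records, and the choice of token-level Jaccard as the similarity function are all made up for the example.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

# A piece of evidence E is a pair: which attribute to compare, and how.
evidence = ("title", jaccard)

def apply_evidence(evidence, rec1, rec2):
    """Apply the evidence's similarity function to that attribute of two records."""
    attr, sim = evidence
    return sim(rec1[attr], rec2[attr])

r1 = {"title": "Genetic Programming for Record Deduplication"}
r2 = {"title": "A Genetic Programming Approach to Record Deduplication"}
print(apply_evidence(evidence, r1, r2))  # 4 shared tokens / 8 total = 0.5
```

The terminals of a GP tree would then hold the numeric outputs of such evidence pairs, which the evolved arithmetic function combines into a single score.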

Experiments with the Replica Identification Boundary
A critical aspect of the effectiveness of several record deduplication approaches is
the setup of the boundary value that classifies a pair of records as replicas or not, based on
the output of the deduplication function. In this final set of experiments, our objective was to
study the ability of our GP-based approach to adapt the deduplication functions to changes in
the replica identification boundary, aiming to discover whether it is possible to use a
previously fixed (or suggested) value for this parameter.
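The role of the boundary can be illustrated in a few lines. The scores, record identifiers, and boundary value below are made-up examples, not results from the paper: any pair whose deduplication-function score exceeds the boundary is declared a replica.

```python
def is_replica(score, boundary=1.0):
    """Classify a pair as a replica if the function's output exceeds the boundary."""
    return score > boundary

# Hypothetical (record-pair, score) outputs of an evolved deduplication function.
pairs = [("r1", "r2", 1.7), ("r1", "r3", 0.4), ("r2", "r3", 1.1)]
replicas = [(a, b) for a, b, s in pairs if is_replica(s)]
print(replicas)  # [('r1', 'r2'), ('r2', 'r3')]
```

Because GP can adapt the evolved function's output range to a fixed boundary, the user does not have to tune this threshold per repository.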




Deshmukh Dikshant
Kannapurkar Swapnil
[BE-CSE-I]
