
Data Quality and Master Data Management with Microsoft SQL Server 2008 R2
Dejan Sarka, Davide Mauri

Advanced Techniques & Tools for Storing, Maintaining, Cleansing, and Merging Master Data

Table of Contents

Foreword
Acknowledgements
About the Authors
Chapter 1: Master Data Management
    Types of Data
    What Is Master Data?
    Master Data Management
        MDM Challenges
    Data Models
        Relational Model
        Dimensional Model
        Other Data Formats and Storages
    Data Quality
        Data Quality Dimensions
            Completeness
            Accuracy
            Information
            Consistency
        Data Quality Activities
    Master Data Services and Other SQL Server Tools
        Master Data Services
            Entities
            Attributes
            Members
            Hierarchies
            Collections
            Versions
        Other SQL Server Tools for MDM
    Summary
    References
Chapter 2: Master Data Services Concepts and Architecture
    Master Data Services Setup
        Installation of MDS Components and Tools
        Setup of MDS Database
        Setup of MDS Web Application
    Master Data Manager Web Application
        Explorer
        Version Management
        Integration Management
        System Administration
        User and Groups Permissions
    Models
        Models
        Entities and Attributes
        Hierarchies
            Derived Hierarchy
            Explicit Hierarchy
        Collections
        Business Rules
    Importing, Exporting and Managing Data
        Import Data
        Managing Data
        Export Data
    Multiple Versions of Data
    MDS Database Schema
        Staging Tables
    Summary
    References
Chapter 3: Data Quality and SQL Server 2008 R2 Tools
    Measuring the Completeness
        Attribute Completeness
        XML Data Type Attribute Completeness
        Simple Associations among NULLs
        Tuple and Relation Completeness
        Multivariate Associations among NULLs
    Profiling the Accuracy
        Numeric, Date and Discrete Attributes Profiling
        Strings Profiling
        Other Simple Profiling
        Multivariate Accuracy Analysis
    Measuring Information
    Using Other SQL Server Tools for Data Profiling
        SSAS Cubes
        PowerPivot for Excel 2010
        SSIS Data Profiling Task
        Excel Data Mining Add-Ins
        Clean-Up
    Summary
    References
Chapter 4: Identity Mapping and De-Duplicating
    Identity Mapping
        Problems
        T-SQL and MDS String Similarity Functions
            Preparing the Data
            Testing the String Similarity Functions
            Optimizing Mapping with Partitioning
            Optimizing Mapping with nGrams Filtering
            Comparing nGrams Filtering with Partitioning
    Microsoft Fuzzy Components
        Fuzzy Algorithm Description
        Configuring SSIS Fuzzy Lookup
        Testing SSIS Fuzzy Lookup
        Fuzzy Lookup Add-In for Excel
    De-Duplicating
        Preparing for Fuzzy Grouping
        SSIS Fuzzy Grouping Transformation
        Testing SSIS Fuzzy Grouping
        Clean-Up
    Summary
    References
Index


Foreword
Dejan Sarka
“If all men were just, there would be no need of valor,” said Agesilaus, Spartan King (444 BC–360 BC). Just from this quote, we can tell that Agesilaus was not too keen on fighting. Actually, Agesilaus never hurt his enemies without just cause, and he never took any unjust advantages. Nevertheless, the ancient world was just as imperfect as the contemporary world is, and Agesilaus had to fight his share of battles.

If everyone always inserted correct data into a system, there would be no need for proactive constraints or for reactive data cleansing. We could store our data in text files, and maybe the only application we would need would be Notepad. Unfortunately, in real life, things go wrong. People are prone to making errors. Sometimes our customers do not provide us with accurate and timely data. Sometimes an application has a bug and makes errors in the data. Sometimes end users unintentionally transpose letters or numbers. Sometimes we have more than one application in an enterprise, and each application has slightly different definitions of the data. (We could continue listing data problems forever.)

A good and suitable data model, like the Relational Model, enforces data integrity through the schema and through constraints. Unfortunately, many developers still do not understand the importance of a good data model. Nevertheless, even with an ideal model, we cannot enforce data quality. Data integrity means that the data is in accordance with our business rules; it does not mean that our data is correct. Not all data is equally important. In an enterprise, we can always find the key data, such as customer data. This key data is the most important asset of a company. We call this kind of data master data.


This book deals with master data. It explains how we can recognize our master data. It stresses the importance of a good data model for data integrity. It shows how we can find areas of bad or suspicious data. It shows how we can proactively enforce better data quality and build an authoritative master data source through a specialized Master Data Management application. It also shows how we can tackle the problems of duplicate master data and of identity mapping between different databases in order to create a unique representation of the master data. For all the tasks mentioned in this book, we use the tools that are available in the Microsoft SQL Server 2008 R2 suite. In order to achieve our goal—good quality of our data—nearly every part of the suite turns out to be useful.

This is not a beginner’s book. We, the authors, assume that you, the readers, have quite good knowledge of the SQL Server Database Engine, .NET, and other tools from the SQL Server suite. Achieving good quality of your master data is not an easy task. We hope this book will help you with this task and serve you as a guide for practical work and as a reference manual whenever you have problems with master data.


Acknowledgements
Dejan Sarka
This book would never have been finished without the help and support of several people. I need to thank them for their direct and indirect contributions, their advice, their encouragement, and other kinds of help.

In the first place, I have to mention my coauthor and colleague from SolidQ, Davide Mauri. As the older guy, I have followed his career over several years. I am amazed by the amount of knowledge he has gained in the past few years. He has become a top speaker and a recognized author of several books and articles. Nevertheless, he has retained all the vigor of youth and is still full of ideas. Davide, I am so proud of you. I always enjoy working with you, and I am looking forward to our further cooperation.

With three other colleagues from SolidQ, Itzik Ben-Gan, Herbert Albert, and Gianluca Hotz, we form a gang of four inside the company, called the Quartet. It is not just an unofficial group; our official duty in the company is to certify and confirm places for company parties. Our endless discussions during conferences, hikes, or time spent in pubs are an invaluable source of insight and enlightenment. Besides general help through our socializing, all three of them have made a concrete contribution to this book. Herbert helped with the technical review. Gianluca didn’t review the book officially; nevertheless, he read it. His old man’s grumbling was always a sign to me that I had written something inaccurate or even wrong. Itzik was not directly involved in this book. However, this book would never have been written without him. This is the ninth book I have contributed to so far. I would never even have started writing if Itzik hadn’t pushed me and involved me in coauthoring his book seven years ago. Itzik has invited me to contribute to four of his books so far, and we are already starting the fifth one together. To my friends from the Quartet, thank you for all of the precious time we spend together!

SolidQ is not just a regular company. First of all, we are all friends. Even more, we have the best CEO in the world, Fernando Guerrero. Fernando, thank you for inviting me to become a part of this wonderful group of people, and thank you for all of your patience with me and for all of your advice, not only technical but also about life! And thanks to all of the other members of this company; because together we total more than 150 worldwide experts in SQL Server and related technologies joining our efforts, I simply cannot list every single colleague.

Besides concrete work with SQL Server, I am also involved in theoretical research in The Data Quality Institute. Dr. Uroš Godnov helped me with my first steps in data quality, and he continues to educate me. Although he has some problems with his health, he is always available to me. Uroš, forget what Agesilaus said! We need courage not just because of our enemies; sometimes we need courage because of ourselves. Stay as brave as you are forever!

I cannot express enough how much I appreciate being a member of the Microsoft Most Valuable Professional (MVP) program. Through this program, MVPs have direct contact with the SQL Server team. The help of the team in general, and of the Master Data Services part of the team in particular, is extraordinary. No matter how busy they are with developing a new version of SQL Server, they always take time to respond to our questions.

Finally, I have to thank my family and friends. Thank you for understanding the reduced time I could afford to spend with you! However, to be really honest, I did not miss too many parties and beers.


Davide Mauri
Dejan already wrote a lot about our friendship, our company, and everything that allows us to continue to enjoy our work every day. But even at the cost of seeming repetitious, I’d also like to thank all my SolidQ colleagues, who are not only colleagues but friends above all. I learn a lot from you, and each discussion is mind-opening for me. I cannot thank you enough for this!

I would like to say a big thank you to Dejan, Itzik, Gianluca, and Fernando for being examples—truly mentors—not only through words but also with facts: You really give me inspiration, and you show me each day, with your excellent work, your determination, your honesty, and your integrity, the path one has to follow to be someone who can make a difference, from a professional and ethical point of view. I couldn’t have found better colleagues, friends, and partners. Thanks!

I’d also like to thank specifically “my” Italian team. Guys, we’re really creating something new, setting new standards, finding clever ways to solve business problems, making customers happy, and giving them the maximum quality the market can offer, while being happy ourselves at the same time. This is not easy, but we’re able to work together as a well-trained team, enjoying what we do each day. Thanks!

Last but not least, of course, a big thanks also to my wife Olga and my newborn son Riccardo: You really are the power behind me. Olga, thanks for your patience and for your support! Riccardo, thanks for your smiles that allow me to see the world with different eyes! Thank you!


About the Authors
Dejan Sarka
Dejan Sarka focuses on the development of database and business intelligence applications. Besides working on projects, he spends about half of his time on training and mentoring. He is the founder of the Slovenian SQL Server and .NET Users Group. Dejan is the main author or coauthor of nine books about databases and SQL Server so far. Dejan also developed three courses for SolidQ: Data Modeling Essentials, Data Mining, and Master Data Management. As an MCT, Dejan speaks at many international events, including conferences such as PASS, TechEd, and DevWeek. He is also indispensable at regional Microsoft events. In addition, he is a co-organizer of a top-level technical conference named Bleeding Edge.

Davide Mauri
Davide Mauri is a SolidQ Mentor and a member of the Board of Directors of SolidQ Italia. A well-known Microsoft SQL Server MVP, MCP, MCAD, MCDBA, and MCT, as well as an acclaimed speaker at international SQL Server conferences, Davide enjoys working with T-SQL and Relational Modeling and studying the theory behind these technologies. In addition, he is well-grounded in Reporting Services, .NET, and object-oriented principles, and he has deep knowledge of Integration Services and Analysis Services. This gives him a well-rounded area of expertise around the Microsoft Data Platform, along with the vision and experience needed to handle the development of complex business intelligence solutions. He is a course author for SolidQ, including seminars about upgrading to SQL Server 2008, co-author of the book Smart Business Intelligence Solutions with SQL Server 2008, and author of the well-known DTExec replacement tool DTLoggedExec (http://dtloggedexec.davidemauri.it).


Chapter 1: Master Data Management
Dejan Sarka
Master Data Management (MDM), the process of creating and maintaining master data, is one of the most challenging tasks in IT for an enterprise. In this chapter, we will define what exactly the term master data means and what the most important challenges are, including data quality problems and ways to mitigate them. When talking about data quality, we cannot skip data integrity and data models. The Relational Model is still the best model currently in use by line-of-business applications for enforcing data integrity. Nevertheless, it is not very suitable for analyzing data. Additional models for analytical systems have evolved; the most widely used is the Dimensional Model. We are going to briefly introduce both the Relational and the Dimensional Model and show where the master data is in these models. In addition to data integrity, we are also going to deal with other data quality issues.

After these introductions, we are going to put MDS in the picture. We are going to see that MDS is an integral part of MDM and that there are also other SQL Server tools that can help us with MDM. SQL Server Integration Services (SSIS) and SQL Server Analysis Services (SSAS) can play an important role in MDM activities. We are also going to explain how MDS can help improve data in different kinds of applications, including On-Line Transactional Processing (OLTP), On-Line Analytical Processing (OLAP), and Customer Relationship Management (CRM) applications.

In this chapter, we are going to introduce the following:

- Master data and Master Data Management
- Data models
- Data quality
- Master Data Services and other SQL Server tools


Types of Data
In an average company, many different types of data appear. These types include:

- Metadata – this is data about data. Metadata includes database schemas for transactional and analytical applications, XML document schemas, report definitions, additional database table and column descriptions (stored by using SQL Server extended properties or custom tables), application configuration data, and similar.
- Transactional data – maintained by line-of-business OLTP applications. In this context, we use the term transactional to mean business transactions, not database management system transactions. This data includes, for example, customer orders, invoices, insurance claims, data about manufacturing stages, deliveries, monetary transactions, and similar. In short, this is OLTP data about events.
- Hierarchical data – typically appears in analytical applications. Relationships between the data are represented in hierarchies. Some hierarchies represent the intrinsic structure of the data – they are natural for the data. An example is a product taxonomy: products belong to subcategories, which belong to categories. We can have multiple levels of hierarchies. Such hierarchies are especially useful for drilling down from a general to a detailed level in order to find reasons, patterns, and problems. This is a very common way of analyzing data in OLAP applications.
- Semi-structured data – typically in XML form. XML data can appear in standalone files or as part of a database (a column in a table). Semi-structured data is useful where the metadata, i.e. the schema, changes frequently, or when you do not need a detailed relational schema. In addition, XML is widely used for data exchange.
- Unstructured data – involves all kinds of documents with instructions, company portals, magazine articles, e-mails, and similar. This data can appear in a database, in a file, or even in printed material.


- Finally, we have master data.

What Is Master Data?
Intuitively, everybody has an idea of what kind of data master data is. Let us give you some more precise definitions and examples.

In the Relational Model, a table represents a predicate, and a row represents a proposition about a subject, object, or event from the real world. We start building a relational database by describing the business problem in sentences. Nouns in the sentences describe real-world entities that are of interest for the business, like customers, products, employees, and similar. Verbs in the sentences describe relationships between the nouns – more precisely, the roles the nouns play in the relationships. Nouns define our master data; verbs lead to transactional data. From a description of a relational database, you can easily find the nouns. These critical nouns typically define one of the following:

- People, including customers, suppliers, employees, and sales representatives
- Things, including products, equipment, assets, and stores
- Concepts, including contracts, licenses, and bills of material
- Places, including company locations and customer geographic divisions

Any of these entity sets can be further divided into specialized subsets. For example, a company can segment its customers based on previous sales into premier and other customers, or based on customer type into persons and companies.

For analytical applications, data is often organized in a Dimensional Model. A popular name for the Dimensional Model is Star Schema (although, to be precise, a dimensionally modeled database can include multiple star and snowflake schemas). This is because it has a central Fact table and surrounding dimensional tables, or Dimensions. Fact tables hold the data we are measuring, namely Measures. Dimension attributes are used for pivoting the fact data; they give the measures some meaning. Dimensions give context to measures. Fact tables are populated from transactional data. Dimensions are populated from the entities that represent the nouns in a relational database description. Therefore, in a Dimensional Model, the dimensions are the master data. As we already mentioned, OLAP analyses typically involve drilling down over hierarchies. Dimensions can include multiple hierarchies.

Master data appears in probably every single application in an enterprise. ERP applications include products, bills of material, customers, suppliers, contracts, and similar. Customer Relationship Management (CRM) applications deal, of course, with customers. Human Resources Management (HRM) applications are about employees. Analytical applications often include all the master data that appears in an enterprise. We can easily imagine that master data is a very important part of our data. It is crucial that this master data is known and correct. Data quality issues are mostly about master data. It is easy to imagine that having the same master data in multiple sources can immediately lead to problems with inconsistent definitions, inconsistent identification, and duplication.

Master data typically changes at a much slower rate than transactional data. Customers, for example, do not change addresses frequently; however, they interact with your company through orders, services, or even complaints, probably on a daily basis. Nevertheless, although it is less volatile than transactional data, the master data life cycle is still a classical CRUD cycle: Create, Read, Update, and Destroy.

The question that arises is whether any data about people, things, places, or concepts is really master data for any company. If a company sells only five different products, then product data is not master data for this company. Although technically it is master data, the company does not need any specific management of this data. Every single attribute of these five products has a correct value in the system, and it is unlikely that an inaccurate value will appear. Therefore, for this company, product data is not considered master data. Cardinality thus has an influence on whether we consider specific data to be master data or not.

If a company does not collect many attributes for product categories – if the categories entity is quite simple in terms of the number of attributes (for example, it could have only two attributes, category id and category name) – then again that company probably does not consider this master data. The complexity of the data is another criterion that helps us decide which data needs special treatment and which data is master data.

Another factor to consider when defining master data is volatility. We already mentioned that master data tends to change less frequently than transactional data. Now imagine that some data does not change at all. For example, let us take geographic locations. Data about geographic locations changes quite infrequently; in some systems, we can even consider this data static. As soon as we have cleansed this data, it does not need any further treatment, and therefore we again do not consider it master data.

For some data, we need to maintain history. In transactional applications, we need to know how the data came to its current state. Government or other authorities prescribe formal auditing for some business areas. In analytical applications, we frequently analyze over time; for example, we compare sales of this year with sales of the previous year in a geographic region. In order to make proper analyses, we need to take into account possible movements of customers from region to region; therefore, we need to maintain history again. Data that needs versioning, auditing, or any other kind of history maintenance is typically master data.

Finally, we give more attention to data that we reuse repeatedly. Reuse increases the value of data for us. The value of the data can increase because of other factors as well. For example, in the pharmaceutical industry, an error in a bill of materials can lead to huge damage. Therefore, the more valuable the data is for us, the more likely we are to define it as master data.

Master Data Management
We discussed that data that does not need special treatment is not master data. Clearly, we can conclude that master data needs some special treatment, which is called Master Data Management. In a more formal definition, Master Data Management (MDM) is a set of coordinated processes, policies, tools, and technologies used to create and maintain accurate master data.


Even the formal definition of MDM is still very broad. We do not have a single tool for MDM. We can consider anything we use to improve the quality of master data as an MDM tool. Any formal (scheduled or repeated) or ad-hoc activity that improves master data quality is an MDM process. Any technology, like a Relational Database Management System (RDBMS) that enforces data integrity, is part of MDM technology. However, depending on the approach we use to maintain master data, we might consider using a specialized tool with a clearly defined process for managing master data. Some of the most important goals of MDM include:

- Unifying, or at least harmonizing, master data between different transactional (operational) systems
- Maintaining multiple versions of master data for different needs in different operational systems
- Integrating master data for analytical and CRM systems
- Maintaining history for analytical systems
- Capturing information about hierarchies in master data, which is especially useful for analytical applications
- Supporting compliance with government prescriptions (e.g., Sarbanes-Oxley) through auditing and versioning
- Having a clear CRUD process through a prescribed workflow
- Maximizing Return On Investment (ROI) through reuse of master data

Please note the last bullet. Master Data Management can be quite costly and very intensive in terms of the resources used, including man-hours. For a small company with a single operational system, probably no specific MDM tool is needed. Such a company can maintain its master data in the operational system. The more we reuse master data (in multiple operational, CRM, and analytical applications), the bigger the ROI we get.


An RDBMS and an application can and should enforce data integrity. Data integrity means that the data conforms to business rules. Business rules can be quite simple: order numbers must be unique, there should be no order without a known customer, and the quantity ordered must be greater than zero. Business rules can also be more complicated. For example, a company can define that no customer may order, in a single Web order, a product in a quantity that is more than half of the quantity of that product in stock. If the database and the application do not enforce data integrity, we can expect dirty data.

Nevertheless, even if the database and the application do enforce data integrity, we still should not take data accuracy for granted. How can we prevent typos? For example, an operator could write “42 Hudson Avenue” instead of “24 Hudson Avenue”; both addresses are valid from a data integrity perspective. Another issue arises if we have multiple systems. Do all operators enter data in a consistent way? Some operators could write the correct address, but in a slightly different form, like “24 Hudson Ave.”.

We could resolve data quality issues with occasional data cleansing. With cleansing, data quality rises; however, over time, the quality falls again. This is a reactive approach. A proactive approach, which prevents entering low-quality data in the first place, is even better. What we need is explicit data governance. We must know who is responsible for the data, and we must have a clearly defined process for maintaining the data. Data governance sets the policies – the rules for master data. Data governance rules can prescribe data requirements, such as which information is required, how we derive values, data retention periods, and similar. Data governance can also prescribe security policies, like which master data needs encryption and which part has to be audited. It can prescribe versioning and workflows. It defines how to address data inconsistencies between different source systems; for example, it can define the authoritative source. Data governance policies should also define how to bring new systems online without breaking existing master data processes and quality. Finally, data governance can define the need for explicit roles in an enterprise – roles responsible for maintaining master data and implementing data governance.
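To make the simple business rules above concrete, here is a minimal T-SQL sketch of declarative data integrity. The table and column names are hypothetical (loosely modeled on the orders example used later in this chapter), not a schema taken from the book:

-- Hypothetical tables; a sketch of declarative integrity, not the book's sample schema.
CREATE TABLE dbo.Customers
(
    CustomerId   INT           NOT NULL PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.Orders
(
    OrderId    INT  NOT NULL PRIMARY KEY,        -- order numbers must be unique
    CustomerId INT  NOT NULL
        REFERENCES dbo.Customers (CustomerId),   -- no order without a known customer
    OrderDate  DATE NOT NULL
);

CREATE TABLE dbo.OrderDetails
(
    OrderId   INT NOT NULL REFERENCES dbo.Orders (OrderId),
    ItemId    INT NOT NULL,
    ProductId INT NOT NULL,
    Quantity  INT NOT NULL CHECK (Quantity > 0), -- quantity ordered must be greater than zero
    PRIMARY KEY (OrderId, ItemId)
);

Note that such constraints guarantee integrity, not accuracy: both “42 Hudson Avenue” and “24 Hudson Avenue” would pass any constraint we can reasonably declare.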


In MDM terminology, we have to define the data stewards, the people responsible for their part of master data. Data stewards are the governors. They should work independently of any specific source or destination system for master data, in an objective way. Data stewards must have deep knowledge about the data they govern. Commonly, one data steward covers one business area of master data; we have one steward for customers, one for products, and so on. We should define data stewardship roles and designate data stewards early in the process of implementing an MDM solution.

From what we have seen so far, we can conclude that there are different approaches to master data management. Here is a list of possible approaches:

- No central master data management – we have systems that do not communicate at all. When we need any kind of cross-system interaction, like doing analysis over data from multiple systems, we do ad-hoc merging and cleansing. This approach is very cheap at the beginning; however, it turns out to be the most expensive over time. From this book’s perspective, we really should not treat this approach as a real MDM approach.
- Central metadata storage – with this approach, we have at least unified, centrally maintained definitions for master data. Different systems should follow and implement these central definitions. Ad-hoc merging and cleansing becomes somewhat simpler. In this scenario, we typically do not use a specialized solution for the central metadata storage. This central storage of metadata is probably in an unstructured form – in documents, worksheets, or even on paper only.
- Central metadata storage with identity mapping – besides unified, centrally maintained definitions for master data, we also store key mapping tables in our MDM solution. Data integration applications can be developed much faster and more easily. Although this solution seems quite appealing, it has many problems with maintaining master data over time. We have only keys from different systems in our MDM database; we do not have any other attributes. All attributes in source systems change over time, and we have no versioning or auditing in place to follow the changes. This approach is viable for a limited time only. It is useful, for example, during upgrading, testing, and the initial usage of a new ERP system to provide mapping back to the old ERP system.
- Central metadata storage and central data that is continuously merged – we have metadata as well as master data stored in a dedicated MDM system. However, we do not insert or update master data here; we merge (and cleanse) master data from source systems continuously, on a daily basis. There are multiple issues with this approach, and continuous merging can become expensive. The only viable possibility for this approach is that we can find out what has changed in the source systems since the last merge, enabling us to merge only the delta – only the new and updated data. This approach is frequently used for analytical systems. We prepare a Data Warehouse (DW) as a central storage for analytical data (which includes transactional and master data). We populate the DW overnight, and during population, we merge data and resolve cleansing issues. Although we typically do not create the DW with MDM as the main goal, we can treat the DW as an authoritative source of master data.
- Central MDM, single copy – with this approach, we have a specialized MDM application, where we maintain master data, together with its metadata, in a central location. All existing applications are consumers of this master data. This approach seems preferable at first glimpse. However, it has its own drawbacks. We have to upgrade all existing applications to consume master data from the central storage instead of maintaining their own copies. This can be quite costly, and maybe even impossible with some legacy systems. In addition, our central master metadata should be a union of all metadata from all source systems. Finally, the process of creating and updating master data could simply be too slow. It could happen, for example, that a new customer would have to wait for a couple of days before submitting the first order, because the process of inserting customer data with all possible attributes involves contacting all source systems and simply takes too long.


- Central MDM, multiple copies – in this approach, we have a central storage of master data and its metadata. However, the metadata here includes only an intersection of the common metadata from the source systems. Each source system maintains its own copy of master data, with additional attributes that pertain to that system only. After we insert master data in the central MDM system, we replicate it (preferably automatically) to the source systems, where we update the source-specific attributes. This approach gives us a good compromise between cost, data quality, and the effectiveness of the CRUD process. Still, there is no free lunch. As different systems can also update the common data, we can have update conflicts. Therefore, this approach involves continuous merging as well. However, as at least part of the data is updated centrally, this approach means less work with continuous merging than the central metadata storage and central data that is continuously merged approach.

For the last two approaches, we need a special MDM application. A specialized MDM solution could be useful for the central metadata storage with identity mapping and the central metadata storage and central data that is continuously merged approaches as well. SQL Server 2008 R2 Master Data Services (MDS) is a specialized MDM application. We could also write our own application. Other SQL Server tools, like SSIS and SSAS, are helpful in the MDM process as well. However, for the last two approaches to MDM, MDS is the most efficient solution.

MDM Challenges

For a successful MDM project, we have to tackle all the challenges we meet. These challenges include:

- Different definitions of master metadata in source systems. We can have different coding schemes, data types, collations, and similar. We have to unify the metadata definitions.
- Data quality issues. This is something we always have to expect. In short, if we do not have data quality issues, then we probably do not need a specialized MDM solution anyway. We have to improve the data quality; otherwise, the MDM project fails to accomplish its most important goal.
- Authority. Who is responsible for master data? Different departments want to be authoritative for their part of master data, and the authority for master data can overlap in an enterprise. We have to define policies for master data, with an explicit data stewardship process prescribed. We also define data ownership as part of resolving authority issues.
- Data conflicts. When we prepare the central master data database, we have to merge data from our sources. We have to resolve data conflicts during the project and, depending on the MDM approach we take, replicate the resolved data back to the source systems.
- Domain knowledge. We should include domain experts in an MDM project.
- Documentation. We have to take care that we properly document our master data and metadata.

No matter which approach we take, MDM projects are always challenging. However, tools like MDS can efficiently help us resolve possible issues.


Data Models
It is crucial that we have a basic understanding of the data models used in an enterprise before we start an MDM project. The details of data modeling are out of scope for this book; only the minimum needed is covered here. We are going to introduce the Relational Model, the Dimensional Model, and, briefly, other models and storages.

Relational Model
The relational model was conceived in the 1960s by Edgar F. Codd, who worked for IBM. It is a simple, yet rigorously defined conceptualization of how users perceive and work with data. The most important definition is the Information Principle, which states that all information in a relational database is expressed in one (and only one) way, as explicit values in columns within rows of a table. In the relational model, a table is called a relation, and a row is called a tuple, which consists of attributes. Each relation represents some real-world entity, such as a person, place, thing, or event. An entity is a thing that can be distinctly identified and is of business interest. Relationships are associations between entities. A row in a relation is a proposition, like an employee with identification equal to 17, full name Davide Mauri, lives in Milan. Relation header, or schema of the relation, is the predicate for its propositions. A predicate is a generalized form of proposition, like employee with identification EmployeeId(int), full name EmloyeeName(string), lives in City(CitiesCollection). Note the name / domain pair of placeholders for concrete values. The domain, or the data type, is the first point where a RDBMS can start enforcing data integrity. In the previous example, we cannot insert an EmployeeId that is not an integral number. We cannot insert a city that is not in

Page 23

Chapter 1: Master Data Management our collection of allowed cities. Please also note that without a good naming convention, it is hard to reconstruct predicates and propositions from a database schema. With different source systems, we typically do not have influence on the source schema and naming conventions; however, we should be aware that worse schema and naming mean more problems for our MDM project, and worse data quality. One of the most important tools for enforcing data integrity in a relational model is normalization. As mentioned, we are going to introduce only basics here; for more details on relational model, do please refer to data modeling books, for example An Introduction to Database Systems by C. J. Date. Tables are normalized when they represent propositions about entities of one type - in other words, when they represent a single set. This means that entities do not overlap in tables, and that tables are orthogonal or normal in mathematical terms. When a table meets a certain prescribed set of conditions, it is in a particularly normal form. A database is normalized when all tables are normalized. Normalization is a redesign process to unbundle the entities. The process involves decomposition. The normalization is achieved by applying a sequence of rules to create what are called normal forms. The goal is to eliminate redundancy and incompleteness. Many normal forms are defined. The most important ones are first, second, and third. If a database is in third normal form, it is usually already fully normalized. We are going to introduce the first three normal forms only. If a database is not normalized, you can experience data manipulation anomalies. Let us start with the following example of a model for customers’ orders, shown in figure 1.


Orders (PK: OrderId; OrderDate, CustomerId, CustomerName, Address, City, Country, OrderDetails{ProductId, ProductName, Quantity})

Figure 1: ORDERS MODEL BEFORE 1ST NF

Note that the last column (OrderDetails) is actually a collection of the items ordered. This model leads to many CRUD anomalies. How do you insert a potential customer, a customer without an order? How do you insert a product that is not ordered yet? If you delete the last order for a customer, you lose all information about that customer. If you update the name of a product, you have to take care to update it in all rows where this product appears on an order. Even reading this data is hard; how do you find the total quantity ordered for a single product, for example?

The first normal form says that a table is in first normal form if all columns are atomic, or indivisible. No multivalued columns are allowed. Decomposition has to start with the OrderDetails column. We need a single row per item in an order, and every atomic piece of data of a single item (ProductId, ProductName, Quantity) must have its own column. However, after the decomposition, we have multiple rows for a single order. OrderId by itself cannot be the key anymore. The new key is composed of the OrderId and ItemId columns. Figure 2 shows our Orders table in first normal form.


Orders (PK: OrderId, ItemId; OrderDate, CustomerId, CustomerName, Address, City, Country, ProductId, ProductName, Quantity)

Figure 2: ORDERS MODEL IN 1ST NF

Queries are now simplified; it is easy to find totals for products. However, update anomalies are still possible. You still cannot insert a customer without an order or a product that is not ordered yet, and you have to change a single piece of information, like the order date, which now repeats for the same order in multiple rows, in multiple places.

To achieve the second normal form, a table must be in first normal form, and every non-key column must be functionally dependent on the entire key. This means that no non-key column can depend on only a part of the key. In the Orders model that is in first normal form, we need only OrderId to get the customer data and the order date; we don’t need ItemId, which is also part of the key. To achieve the second normal form, we need to decompose the table into two tables, shown in figure 3.


Orders (PK: OrderId; OrderDate, CustomerId, CustomerName, Address, City, Country)
OrderDetails (PK: OrderId [FK to Orders], ItemId; ProductId, ProductName, Quantity)

Figure 3: ORDERS MODEL IN 2ND NF

With the model in the second normal form, we have resolved the problem of updating the order date. For a single order, we now maintain it in a single place. However, some update anomalies are still possible. There is still some redundancy in the model: the customer address, city, and country repeat over orders, and the product name repeats over order details.

To achieve the third normal form, a table must be in the second normal form, and every non-key column must be non-transitively dependent on the key. In other words, non-key columns must be mutually independent. In our previous model in second normal form, we can find CustomerId from OrderId, and from CustomerId we can transitively find the customer name and address. From the address, we can find the city, and from the city, we can find the country. In addition, in the OrderDetails table, the ProductId and ProductName columns are not mutually independent. To achieve the third normal form, we must create new tables for the dependencies between non-key columns, as shown in figure 4.


Customers (PK: CustomerId; CustomerName, Address, CityId [FK to Cities])
Orders (PK: OrderId; CustomerId [FK to Customers], OrderDate)
OrderDetails (PK: OrderId [FK to Orders], ItemId; ProductId [FK to Products], Quantity)
Cities (PK: CityId; City, CountryId [FK to Countries])
Countries (PK: CountryId; Country)
Products (PK: ProductId; ProductName)

Figure 4: ORDERS MODEL IN 3RD NF

After a database is in third normal form, there are usually no update anomalies. However, queries for retrieving data are more complicated, as they involve multiple joins.

A properly implemented Relational Model helps maintain data integrity and data quality in general. Besides data types and normalization, we can use many additional tools provided by modern RDBMSs like SQL Server. We have declarative constraints, for example the Check constraint, which further narrows down the possible values of an attribute. We can define whether all values have to be known; we can forbid NULLs (NULL is a standard placeholder for an unknown or not applicable value). We can implement integrity rules programmatically, with triggers and stored procedures. Of course, we can also implement them programmatically in the middle tier or client tier of an application.

The important thing for an MDM project is that we understand our source data. Realizing whether the design of the operational databases – the sources where the data is introduced into an enterprise for the first time – is proper or not helps us a lot when evaluating the cost of possible approaches to an MDM solution.
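The more complicated rule mentioned earlier – no single order line for a product in a quantity greater than half of that product's quantity in stock – is hard to express declaratively, so a trigger is one way to enforce it programmatically. The following is a hypothetical sketch only: it assumes a UnitsInStock column on the Products table (not part of the example schema) and, for simplicity, checks each order line rather than the whole order:

-- Hypothetical trigger sketch; assumes dbo.Products has a UnitsInStock column.
CREATE TRIGGER dbo.trg_OrderDetails_MaxHalfOfStock
ON dbo.OrderDetails
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Reject the statement if any inserted or updated order line
    -- orders more than half of the product's quantity in stock.
    IF EXISTS
    (
        SELECT 1
        FROM inserted AS i
            INNER JOIN dbo.Products AS p ON i.ProductId = p.ProductId
        WHERE i.Quantity > p.UnitsInStock / 2.0
    )
    BEGIN
        RAISERROR (N'Quantity ordered exceeds half of the quantity in stock.', 16, 1);
        ROLLBACK TRANSACTION;
    END;
END;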


Before explaining the Dimensional Model, we have to understand another relational design technique – specialization and generalization. We can have NULLs in our data because the value is unknown, or because an attribute is not applicable for a subset of rows. In our example, we could have both persons and companies in the Customers table. For persons, the birth date makes sense; for companies, the number of employees might be valuable information. In order to prevent NULLs, we can introduce subtype tables, or subtypes for short. Two entities are of distinct, or primitive, types if they have no attributes in common. Some relations can have both common and distinct attributes. If they have a common identifier, we can talk about a special supertype-subtype relationship. Supertypes and subtypes are helpful for representing different levels of generalization or specialization. In a business problem description, the verb "is" (or, explicitly, "is a kind of") leads to a supertype-subtype relationship. Specialization leads to additional decomposition of tables, as we can see in the example in figure 5.


Customers (PK: CustomerId; CustomerName, Address, CityId [FK to Cities])
Persons (PK: CustomerId [FK to Customers]; BirthDate)
Companies (PK: CustomerId [FK to Customers]; NumberOfEmployees)
Orders (PK: OrderId; CustomerId [FK to Customers], OrderDate)
OrderDetails (PK: OrderId [FK to Orders], ItemId; ProductId [FK to Products], Quantity)
Cities (PK: CityId; City, CountryId [FK to Countries])
Countries (PK: CountryId; Country)
Products (PK: ProductId; ProductName)

Figure 5: ORDERS MODEL IN 3RD NF WITH SPECIALIZATION

As you can see, even in this small example, our relational design has become quite complicated. Queries for analyses involve multiple joins; they are hard to write, and many times they do not perform well enough for analyzing data on-line, in real time. Analyzing on-line means that we can change the structure of the aggregations we are looking for in real time. For example, instead of aggregating companies’ sales over countries and years, we could decide to aggregate over products and the age of individual persons. Because of these problems with analyses, the Dimensional Model evolved. The Dimensional Model is the basic building block of OLAP systems.
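To make the problem concrete, here is what even a simple analytical aggregation looks like against the normalized model from figure 5. The query below is a hypothetical sketch that sums the quantity ordered by company customers per country and year, assuming the column names shown in the figure:

-- Quantity ordered by company customers, per country and year, in the normalized model;
-- a hypothetical sketch, not code from the book.
SELECT co.Country, YEAR(o.OrderDate) AS OrderYear, SUM(od.Quantity) AS TotalQuantity
FROM dbo.OrderDetails AS od
    INNER JOIN dbo.Orders    AS o   ON od.OrderId   = o.OrderId
    INNER JOIN dbo.Customers AS c   ON o.CustomerId = c.CustomerId
    INNER JOIN dbo.Companies AS cmp ON c.CustomerId = cmp.CustomerId  -- restrict to company customers
    INNER JOIN dbo.Cities    AS ci  ON c.CityId     = ci.CityId
    INNER JOIN dbo.Countries AS co  ON ci.CountryId = co.CountryId
GROUP BY co.Country, YEAR(o.OrderDate);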


Dimensional Model
The Dimensional Model of a database has a more deterministic schema than the Relational Model. We use it for Data Warehouses (DW). In a DW, we store merged and cleansed data from different source systems, with historical data included, in one or more star schemas. A single star schema covers one business area, like sales, inventory, production, or human resources. As we already know, in a star schema we have one central (fact) table and multiple surrounding (dimension) tables. Multiple star schemas are connected through shared dimensions. An explicit Time (or Date, depending on the level of granularity we need) dimension is always present, as we always include historical data in a DW. The star schema was introduced by Ralph Kimball in his famous book The Data Warehouse Toolkit. A star schema is deliberately denormalized. Lookup tables, like Cities and Countries in our example, are flattened back into the original table, and attributes from those lookup tables form natural hierarchies. In addition, we can also flatten specialization tables. Finally, we can add multiple derived attributes and custom-defined hierarchies. An example of a star schema created from our normalized and specialized model is illustrated in figure 6.


[Diagram: DimCustomers, DimProducts, and DimDates dimension tables around the FactOrders fact table.]
Figure 6: ORDERS IN A STAR SCHEMA MODEL
With more dimension tables, the schema would resemble a star even more. We can see derived columns (PersonAge and CompanySize) in the DimCustomers table. This table is flattened from the original Customers table, the Cities and Countries lookup tables, and the Persons and Companies subtype tables. In DimCustomers, we can see a natural hierarchy, Country ⇨ City ⇨ Customer; in the DimDates table, we can see a natural hierarchy, Year ⇨ Quarter ⇨ Month ⇨ Date. For analytical purposes, we could also define ad-hoc hierarchies, for example Country ⇨ CompanySize; a query sketch over such a hierarchy follows below. While relational databases are typically sources of master data, dimensional databases are usually destinations of master data. As with the Relational Model, there is much more to say about the Dimensional Model. However, this brief introduction should be enough for understanding the two most commonly used models, and how they impact our master data management solution.
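
As a hedged illustration of the kind of ad-hoc, on-line aggregation this schema enables, the following query assumes the FactOrders, DimCustomers, and DimDates tables exactly as drawn in figure 6.

-- Aggregate order quantities over the ad-hoc hierarchy Country -> CompanySize, by year
SELECT c.Country,
       c.CompanySize,
       d.[Year],
       SUM(f.Quantity) AS TotalQuantity
FROM dbo.FactOrders AS f
INNER JOIN dbo.DimCustomers AS c ON f.CustomerId = c.CustomerId
INNER JOIN dbo.DimDates     AS d ON f.DateId = d.DateId
GROUP BY c.Country, c.CompanySize, d.[Year]
ORDER BY c.Country, c.CompanySize, d.[Year];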


Other Data Formats and Storages
Besides relational and dimensional databases, data can appear in an enterprise in many other formats and storages. Some companies implement an Operational Data Store (ODS). Bill Inmon, who introduced the concept, defines an ODS as a: subject-oriented, integrated, volatile, current-valued, detailed-only collection of data in support of an organization's need for up-to-the-second, operational, integrated, collective information.
Bill Inmon: Building the Operational Data Store, 2nd Edition (John Wiley & Sons, 1999)

The data in an ODS has limited history and is updated more frequently than the data in a DW. However, the data is already merged from multiple sources. An ODS is often part of a CRM application; typically, it holds data about customers. An MDM solution can actually replace, or integrate, an existing ODS. Some data is semi-structured: either the structure is not prescribed in much detail, or the structure itself is volatile. Nowadays semi-structured data usually appears in XML format. We can have XML data in files in a file system or in a database. Modern relational systems support XML data types. XML instances can have a schema. For defining XML schemas, we use XML Schema Definition (XSD) documents. An XSD is itself an XML instance, with defined namespaces, elements and attributes, that expresses a set of rules to which an XML instance must conform. If an XML instance conforms to its XSD, we say it is schema-validated. Here is an example of an XML schema:
<xsd:schema targetNamespace="ResumeSchema" xmlns:schema="ResumeSchema"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sqltypes="http://schemas.microsoft.com/sqlserver/2004/sqltypes"
            elementFormDefault="qualified">
  <xsd:import namespace="http://schemas.microsoft.com/sqlserver/2004/sqltypes"
              schemaLocation="http://schemas.microsoft.com/sqlserver/2004/sqltypes/sqltypes.xsd" />
  <xsd:element name="Resume">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="skills" minOccurs="0" maxOccurs="1">
          <xsd:simpleType>
            <xsd:restriction base="sqltypes:nvarchar" sqltypes:localeId="1033"
                             sqltypes:sqlCompareOptions="IgnoreCase IgnoreKanaType IgnoreWidth"
                             sqltypes:sqlSortId="52">
              <xsd:maxLength value="1000" />
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:element>
        <xsd:element name="previousemployment" minOccurs="0">
          <xsd:simpleType>
            <xsd:restriction base="sqltypes:nvarchar" sqltypes:localeId="1033"
                             sqltypes:sqlCompareOptions="IgnoreCase IgnoreKanaType IgnoreWidth"
                             sqltypes:sqlSortId="52">
              <xsd:maxLength value="100" />
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

Reading raw XML is not much fun. Nevertheless, we can extract some useful information. From the element definitions in the schema, we can conclude that this is a schema for resumes, probably for job candidates. It allows two elements which, according to their names, describe skills and previous employment. Skills can appear only once, while previous employment can appear multiple times. In order to maintain data quality for XML data, we should force validation of XML instances against XML schemas. SQL Server supports XML schema validation for columns of the XML data type, as the following sketch shows.
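
A minimal sketch of enforcing such validation in SQL Server follows. The schema collection and table names are illustrative, and the schema body is a simplified stand-in for the full Resume schema shown above.

-- Register a (simplified) XSD in the database ...
CREATE XML SCHEMA COLLECTION dbo.ResumeSchemaCollection AS
N'<xsd:schema targetNamespace="ResumeSchema"
              xmlns:xsd="http://www.w3.org/2001/XMLSchema"
              elementFormDefault="qualified">
    <xsd:element name="Resume">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="skills" type="xsd:string" minOccurs="0" maxOccurs="1" />
          <xsd:element name="previousemployment" type="xsd:string"
                       minOccurs="0" maxOccurs="unbounded" />
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:schema>';
GO

-- ... and bind it to a typed XML column; every insert and update is then validated
CREATE TABLE dbo.Candidates
(
  CandidateId INT NOT NULL PRIMARY KEY,
  Resume      XML(dbo.ResumeSchemaCollection) NULL
);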


Every company also has to deal with unstructured data. This data resides in documents, spreadsheets, other computer formats, or even on paper only. If this data is important and is part of master data, we should include it in our MDM solution. Of course, the problem is how to do it. One possibility is simply to store all kinds of files in a database. SQL Server supports a couple of large object (LOB) data types. In addition, SQL Server supports FILESTREAM storage for binary large objects, i.e. the varbinary(max) data type. FILESTREAM integrates the SQL Server Database Engine with the NTFS file system by storing varbinary(max) object data as files on the file system. This way, we get practically unlimited storage for unstructured data inside our relational database. Usually we can find interesting properties of the unstructured data. If we store unstructured data in our database, it makes a lot of sense to store these interesting properties in additional, regular non-LOB attributes. We can classify documents before storing them in a database, either manually or with the help of a tool. Text mining tools, for example, can extract terms from text, and we can use the extracted terms to classify the texts.
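
Here is a hedged sketch of such a table. It assumes FILESTREAM is enabled on the instance and that the database already has a FILESTREAM filegroup; the Category column stands in for any property extracted during classification.

CREATE TABLE dbo.Documents
(
  DocumentId   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
  DocumentName NVARCHAR(100)  NOT NULL,
  Category     NVARCHAR(50)   NULL,            -- regular non-LOB attribute for an extracted property
  Content      VARBINARY(MAX) FILESTREAM NULL  -- stored as a file on the NTFS file system
);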


Data Quality
Data quality is inseparably interleaved with master data management. The most important goal of an MDM solution is to raise the quality of master data, and we should tackle data quality issues in any MDM project. Nevertheless, data quality activities, such as data profiling, finding the root cause of poor data, and quality improvements, can be independent of an MDM project as well. An enterprise can define data quality policies and processes through existing applications only; however, a specialized MDM solution can ease the implementation of those policies a great deal. Before we describe data quality activities, we have to decide for which aspects we are going to measure and improve the quality of our data. Data quality dimensions capture specific aspects of the general term data quality. Measuring data quality, also known as data profiling, should be an integral part of the implementation of an MDM solution. We should always get a thorough comprehension of the source data before we start merging it. We should also measure improvements of data quality over time to understand and explain the impact of the MDM solution. Let us start with data quality dimensions, to show what we can measure and how.

Data Quality Dimensions
Data quality dimensions can refer to data values or to their schema. We introduced the most popular schemas earlier in this chapter. We are going to start with pure data quality dimensions, and return to schema quality dimensions later in this section. There is no exact definition of which data quality dimensions we should inspect: different tools and different books list different sets of dimensions. Nevertheless, some dimensions are analyzed more frequently than others. We are going to focus on the most important, i.e. the most frequently used, ones.


We can measure some data quality dimensions with tools, like Transact-SQL queries. The measurable dimensions are also called hard dimensions. For some dimensions, we depend on the perception of the users of the data; these are soft dimensions. We cannot measure soft dimensions directly; we can measure them only indirectly, through interviews with users of the data or through any other kind of communication with users. Note that this communication can unfortunately include unpleasant events, like customer complaints, which we want to prevent. Let us start with hard dimensions.
Completeness
Completeness is the dimension that can be measured most easily. We can start measuring it at the population level. Under the closed world assumption, we can state that no values other than those actually present in a relational table represent facts in the real world. If the relation does not have unknown values (NULLs), the relation is complete from the population perspective. Under the open world assumption, we cannot state population completeness even if our relation does not contain NULLs. In order to evaluate the completeness of a relation under the open world assumption, we need a reference relation that contains the entire population. With a reference relation, we can define completeness as the ratio of the number of tuples present in our relation to the number of tuples in the reference relation. Because of privacy and legal constraints, it is commonly not possible to acquire the reference relation. However, we can usually get at least the number of tuples in the reference relation; for example, we can easily get the number of citizens of a country. From a technical point of view, it is very easy to measure the completeness of our relation once we have the number of tuples in the reference relation. Under the closed world assumption, in a relational database, the presence of NULLs is what defines completeness. We can measure attribute completeness, i.e. the number of NULLs in a specific attribute, tuple completeness, i.e. the number of unknown attribute values in a tuple, and relation completeness, i.e. the number of tuples with unknown attribute values in the relation. Finally, we can also measure value completeness, which makes sense for complex, semi-structured columns, namely for XML data type columns.
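
A hedged Transact-SQL sketch of these measures follows; it assumes a Customers table with CustomerName and BirthDate columns, as in the earlier examples.

-- Attribute completeness: share of known BirthDate values
SELECT COUNT(*)                            AS NumRows,
       COUNT(BirthDate)                    AS KnownBirthDates,  -- COUNT(column) skips NULLs
       100.0 * COUNT(BirthDate) / COUNT(*) AS PctComplete
FROM dbo.Customers;

-- Relation completeness: tuples with at least one unknown attribute value
SELECT COUNT(*) AS RowsWithUnknownValues
FROM dbo.Customers
WHERE CustomerName IS NULL OR BirthDate IS NULL;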


In an XML instance, a complete element or an attribute can be missing. In addition, the XML standards define a special xsi:nil attribute as a placeholder for missing values; this is similar to relational NULLs.
Accuracy
Accuracy is a complicated dimension. First, we have to determine what is inaccurate. Accuracy is stricter than just conforming to business rules; the latter should be enforced with data integrity. For data that should be unique, duplicate values are inaccurate. Finding duplicate values might be quite easy, with simple queries, or very hard, if we have to find duplicates across different systems. Finding other inaccurate data might involve some manual work. With different algorithms, we can only extract data that is potentially inaccurate. Here is some advice on how to isolate inaccurate data. For discrete data values, we can use the frequency distribution of values; a value with very low frequency is probably incorrect. For strings, we can look at the string length distribution; a string with a very untypical length is potentially incorrect. For strings, we can also try to find patterns, and then create a pattern distribution; patterns with low frequency probably denote wrong values. For continuous attributes, we can use descriptive statistics; just by looking at the minimal and maximal values, we can easily spot potentially problematic data. No matter how we find inaccurate data, we can flag it, and then measure the level of accuracy. We can measure this for columns and tables. A few Transact-SQL sketches of such profiling follow below.
Information
Another measurable dimension is information. Information Theory, a branch of applied mathematics, defines entropy as the quantification of information in a system. We can measure entropy at the column and table level. The more dispersed the values are, i.e. the more equally the frequency distribution of a discrete column is spread among the values, the more information we have in the column. Information is not a direct data quality dimension; however, it can tell us whether our data is suitable for analyses or not.
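
The following sketches show how such distributions can be obtained; again, a Customers table with CustomerName and BirthDate columns is assumed.

-- Frequency distribution of a discrete column: rare values are suspicious
SELECT CustomerName, COUNT(*) AS Frequency
FROM dbo.Customers
GROUP BY CustomerName
ORDER BY Frequency;

-- String length distribution: untypical lengths are suspicious
SELECT LEN(CustomerName) AS NameLength, COUNT(*) AS Frequency
FROM dbo.Customers
GROUP BY LEN(CustomerName)
ORDER BY Frequency;

-- Descriptive statistics for a continuous attribute
SELECT MIN(BirthDate) AS MinBirthDate, MAX(BirthDate) AS MaxBirthDate
FROM dbo.Customers;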


Consistency
Consistency measures the equivalence of information stored in various databases. We can find a lot of inconsistent data by comparing values with a predefined set of possible values; we can find some inconsistencies by comparing data among systems; and we can find some inconsistencies manually. No matter how we find inconsistencies, we can flag them, and then again measure the level of inconsistency at the column or table level. We can measure soft dimensions only indirectly, through interaction with users. Questionnaires, quick polls, user complaints, or any other communication with data users are our tools for measuring the quality of soft data dimensions. The following list includes some typical soft dimensions:
• Timeliness tells us the degree to which data is current and available when needed. There is always some delay between a change in the real world and the moment when this change is entered into a system. Although stale data can appear in any system, this dimension is especially important for Web applications and sites. A common problem on the Web is that owners do not update sites in a timely manner; we can find a lot of obsolete information on the Web.
• Ease of use is a very typical dimension that relies on user perception. This dimension depends on the application and its user interface. In addition, users of data can perceive usage as complex simply because they are undereducated.
• Intention – is the data the right data for the intended usage? Sometimes we do not have the exact data we need; however, we can substitute it with data that carries similar information. For example, we can use phone area codes instead of ZIP codes in order to locate customers approximately. Although phone numbers were not intended for analyses, they can give us reasonable results. Another, worse example of unintended usage is using a column in a table to store unrelated information, like storing a product taxonomy in the product name. This is unintended usage of the schema, which leads to many problems with data cleansing and integration.


• Trust – we have to ask users whether they trust the data. This is a very important dimension. If users do not trust the data in operational systems, they will create their own little, probably unstructured, databases. Integration of master data from unstructured sources is very challenging. If users do not trust the data from analytical applications, they will simply stop using them.
• Presentation quality is another dimension that depends on user perception. When presenting data, format and appearance should support the appropriate use of the information. In operational systems, this dimension is closely related to the ease of use dimension, and depends a lot on the user interface. For example, an application can force users to enter dates manually, or guide them through a calendar control. In analytical systems, presentation is probably even more important. Do we show data in graphs or in tables? How much interactivity should we put in our reports? Questions like these, and the answers to them, can have a big influence on the success or failure of analytical systems.
Finally, we can describe some schema quality dimensions. A common perception is that schema quality cannot be measured automatically. This is true for some dimensions; we cannot measure them without digging into the business problem. Nevertheless, it is possible to find algorithms and create procedures that help us measure some parts of schema quality. The following list shows the most important schema quality dimensions with a brief description of how to measure them, when applicable.
• Completeness tells us to what extent the schema covers the business problem. We cannot measure this dimension without an in-depth analysis of the business problem and its needs.
• Correctness of the model concerns the correct representation of real-world objects in the schema and the correct representation of requirements. For example, the first name could be modeled as an entity with a one-to-one relationship to the Customers entity, or as an attribute of the Customers entity.


Of course, the latter is correct: a first name cannot be uniquely identified, and is therefore not an entity. This is the problem of correctly representing real-world objects. An example of incorrectness with respect to requirements is a model of departments and managers. If a requirement says that a manager can manage a single department, and that each department has a single manager, then the relationship between managers and departments should be one-to-one; modeling this relationship as many-to-many gives an incorrect schema. We can measure this dimension manually, by investigating business requirements, object definitions, and their representation in the schema.
• Documentation tells us whether the schema is properly documented. Schema diagrams, like Entity-Relationship diagrams, should always be part of the documentation. In addition, all business rules that cannot be represented in a diagram should be documented in textual format. In short, we should have complete documentation of the conceptual schema. We can check the quality of this dimension manually, with an overview of the schema documentation.
• Compliance with theoretical models – is the database schema for transactional applications properly normalized and specialized, and does the schema for the data warehouse consist of star schemas? This dimension has to be partially measured manually, with an overview of the data model. However, we can find some problematic areas in the schema procedurally. For example, we can check correlations between non-key attributes; attributes with very high correlation in a table lead to the conclusion that the schema is not normalized, or at least not in third normal form. In addition, many NULLs lead to the conclusion that there is not enough specialization, i.e. not enough subtypes in the schema. This is especially true if some values of an attribute always lead to NULLs in another attribute; we can measure this with database queries.
• Minimalization – abstraction is a very important modeling technique. It means that we should have in the model only objects that are important for the business problems we are solving.


The schema should be minimal, without objects that are not pertinent to the problem it is solving. Again, we can measure this dimension with a manual overview and a comparison to business requirements.
There are many more data and schema quality dimensions defined in the literature that deals specifically with data quality problems. Nevertheless, we have mentioned the most important ones; now it is time to describe the activities needed to understand and improve data quality.

Data Quality Activities
Data quality projects are very intensive in terms of the resources needed. In order to execute a successful project, we have to show the possible benefits to key stakeholders. First, we need to understand the business needs. We can use interviews, overviews of organizational charts, analyses of existing practices in the enterprise, and so on. We have to prioritize the business issues and make a clear project plan. For a successful project, it is important to start with a business area that is either very painful for the stakeholders or quite simple to solve. We should always implement data quality projects step by step. Before we start any MDM project, we have to understand the sources and destinations of master data. Therefore, data quality activities must include overviews. We have to make an extensive overview of all schemas of the databases that pertain to master data. We should interview domain experts and users of the data; this is especially important for getting a grasp of the schema quality dimensions. In addition, after this step, we should also have a clear understanding of the technology the enterprise is using. If needed, we have to plan to include appropriate technology experts in the project. During the overview of the data, we should also focus on the data life cycle, to understand retention periods and similar issues. The next important step is the data quality assessment. We can assess hard dimensions with procedural analysis of the data, i.e. data profiling. There are many different tools for data profiling; we should exploit all the knowledge we have and all the tools available for this task. We should measure soft data quality dimensions in this step as well. We can get a lot of insight into the state of the company by comparing hard and soft dimensions.


If we evaluate hard dimensions as bad, but get good evaluations for soft dimensions, then the company does not realize it has problems with data. In such a case, an additional assessment of the potential damage caused by bad data can help key stakeholders understand the data quality problems. If both soft and hard dimensions are evaluated low, the company is ready for a data quality and/or MDM project. If hard dimensions get good evaluations while soft ones get bad evaluations, then for some reason domain experts and users do not trust the data; usually the reason lies in previous systems, in previous versions of a system, or in insufficient education. If both hard and soft dimensions get good evaluations, then the company does not need a special data quality project; however, the company could still decide on an MDM project in order to minimize the expenses of master data maintenance. After finishing the data assessment, we can re-assess the business impact of low-quality data. We should meet again with key stakeholders in order to review the priorities and elaborate the improvements part of the project plan in detail. Before we start improving data quality, we have to find the root causes of bad data. Finding root causes can substantially narrow down the amount of work needed for data quality improvements. For finding root causes, we can use the five whys method, introduced by Sakichi Toyoda and first used in the Toyota Motor Company. With this method, we simply ask why five times. Imagine the problem is duplicate records for customers.
• We can ask why there are duplicate records. An answer might be: because operators frequently insert a new customer record instead of using an existing one.
• We then ask the second why – why are operators creating new records for existing customers? The answer might be: because operators do not search for existing records of a customer.
• We then ask the third why – why don't they search? The answer might be: because the search would take too long.


• The fourth why is, of course, why does it take so long? The answer might be: because it is very clumsy to search for existing customers.
• Now we ask the final, fifth why – why is searching so clumsy? The answer might be: because one can search only for exact values, not for approximate strings, and an operator does not have the exact name, address, or phone number of a customer memorized.
We found the root cause of the duplication – in this example, it is the application, specifically its user interface. Now we know where to put the effort in order to lower the number of duplicates. Of course, the five whys method is not the only technique for finding root causes. We can also simply track a piece of information through its life cycle; by tracking it, we can easily spot the moment when it becomes inaccurate. We can find some root causes procedurally as well. For example, with quite simple queries we can find that NULLs are in the system because of a lack of subtypes. No matter how we find root causes, we have to use this information to prepare a detailed improvements plan. An improvements plan should include two parts: correcting existing data, and, even more important, preventing future errors. If we focus on correcting only, we will have to repeat the correcting part of the data quality activities regularly. Of course, we have to spend some time correcting existing data; however, we should not forget the preventing part. When we have both parts of the improvements plan, we start implementing it. Implementation of corrective measures involves automatic and manual cleansing methods. Automatic cleansing methods can include our own procedures and queries; if we know the logic for correcting the data, we should use it. We can solve consistency problems by defining a single way of representing the data in all systems, and then replacing inconsistent representations with the newly defined ones. For example, if gender is represented in some system with the numbers 1 and 2, while we define that it should be represented with the letters F and M, we can replace the numbers with letters in a single update statement, as the sketch below shows. For de-duplication and merging data from different sources, we can use string-matching algorithms.
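
A sketch of the gender example above, assuming a Customers table with a single-character Gender column; the column name is an assumption for illustration.

-- Replace the numeric gender codes with the agreed letter codes in one statement
UPDATE dbo.Customers
SET Gender = CASE Gender WHEN '1' THEN 'F' WHEN '2' THEN 'M' END
WHERE Gender IN ('1', '2');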


For correcting addresses, we can use validation and cleansing tools that already exist on the market and use registries of valid addresses. However, we should always be prepared for the fact that part of the data cleansing has to be done manually. Preventing new inserts of inaccurate data involves different techniques as well. The most important one is to implement an MDM solution. For an MDM solution, we need an MDM application; SQL Server Master Data Services is an example of a tool we could use. Besides enabling central storage for master data and its metadata, the MDM application has to support explicit data stewardship, versioning, auditing and data governance workflows. In addition to the MDM solution, we should also focus on the source systems. It is very unlikely that our MDM solution will ever cover all possible master entities with all of their attributes. As we already mentioned, an approach with a central MDM and a single copy of data is usually impractical. Therefore, part of the master data is still going to be maintained in the source, operational applications. We have to enforce proper data models, constraints, and good user interfaces wherever possible. After we have implemented data quality corrective and preventive solutions, we should measure how they perform. Even better, we should prepare the improvements measurement infrastructure in advance, before starting to implement the solutions. By measuring improvements, we can easily show the value of the solutions to key stakeholders. In addition, we can control our own work, since our improvements can fail and even lead to worse data. We can measure soft dimensions with ongoing interviews with end users and domain experts, and store the results of the interviews in a special data quality data warehouse, to track data quality over time. Hard dimensions we can measure automatically on a predefined schedule, and again store the results in the data quality data warehouse. Figure 7 shows a potential schema for a data quality data warehouse, for measuring completeness and accuracy.


[Diagram: FactTables (NumRows, NumUnknownRows, NumErroneousRows) and FactColumns (NumValues, NumUnknownValues, NumErroneousValues) fact tables sharing the DimStewards, DimDates, DimTables, and DimColumns dimension tables.]
Figure 7: DATA QUALITY DATA WAREHOUSE
Although not mentioned as an explicit data quality activity step, it is very important to communicate actions and results throughout the project. The more communication we have, the better. Domain experts can always help us with their knowledge. IT professionals and end users put much more effort into the project's success if they are involved, if they feel the project is their project. Key stakeholders should always know how the project is progressing, and they have to be involved at all decision milestones.


Master Data Services and Other SQL Server Tools
With SQL Server 2008 R2, Microsoft released an MDM solution called Master Data Services as part of the suite. Besides this specialized solution, other parts of the SQL Server suite are extremely useful in many data quality and master data management activities. In this section, we briefly introduce Master Data Services and its architecture, as well as other useful SQL Server tools. Of course, a detailed presentation of Master Data Services and instructions on how to work with it follow in the rest of this book.

Master Data Services
SQL Server Master Data Services is a platform for master data management. It provides a location for master data and metadata, enables processes for data stewardship and workflow, and has connection points we can use to exchange data with sources and destinations directly or through application interfaces, and even to extend the functionality of MDS. It consists of a database, a Web application, a configuration application, a Web service as an interface to applications, and .NET class libraries that can be used to develop and maintain MDS applications programmatically. The MDS database, called the MDS Hub, is central to any MDS deployment. The MDS Hub stores:
• Schema and database objects for master data and metadata
• Versioning settings and business rules
• Information for starting workflows and e-mail notifications
• Database objects, settings and data for the MDS application
• Staging tables for importing and processing data from source systems
• Subscription views for systems that can retrieve master data directly from the Hub


Figure 8 graphically represents the MDS Hub, the processes in the Hub, and the connections to outside systems, i.e. to data sources and destinations.

[Diagram: the MDS Hub (metadata, business rules, workflow, data stewardship, versioning, entities, hierarchies) exchanging data through the Web service and synchronization with ERP, CRM, SharePoint, DW, and other systems.]
Figure 8: MDS HUB
In an MDS database, models are the topmost level of master data organization. Models consist of:
• Entities, which map to real-world entities or DW dimensions
• Attributes, which are part of entities and are containers for entity properties

• Members, which are part of entities and which we can consider as rows in an entity table
• Hierarchies, also defined on entities, that group members in a tree structure
• Collections, which give users the possibility to create ad-hoc groups of hierarchies and members
Entities
Entities are the central objects in the model structure. Typical entities include customers, products, employees and similar. An entity is a container for its members, and members are defined by attributes. We can think of entities as tables. In a model, we can have multiple entities.
Attributes
Attributes are objects within entities; they are containers for values. Values describe the properties of entity members. Attributes can be combined in attribute groups. We can have domain-based attribute values; this means that we get the pool of possible values of an attribute from a lookup table, i.e. from another entity related to the attribute's entity.
Members
Members are the master data. We can think of members as rows of master data entities. A member is a product, a customer, an employee and similar.
Hierarchies
Another key concept in MDS is the Hierarchy. Hierarchies are tree structures that either group similar members for organizational purposes or consolidate and summarize members for analyses. Hierarchies are extremely useful for data warehouse dimensions, because typical analytical processing involves drilling down through hierarchies. Derived hierarchies are based on domain-based attributes, i.e. on relationships that exist in the model. In addition, we can create explicit hierarchies, which we can use for consolidating members in any way we need.


Collections
We can also consolidate members in collections. Collections are not as fixed as hierarchies; we can create them ad-hoc for grouping members and hierarchies for reporting or other purposes.
Versions
MDS supports versioning. We can have multiple versions of master data within a model. Versions contain the members, attributes and attribute values, hierarchies and hierarchy relationships, and collections of a model.
Master Data Manager is an ASP.NET application that serves as the data stewardship portal. In this application, we can create models, manage users and permissions, create MDS packages for deployment to different MDS servers, manage versions, and integrate master data with source and destination systems. Figure 9 shows the Master Data Manager home page.


Figure 9: MASTER DATA MANAGER HOME PAGE

Other SQL Server Tools for MDM
The SQL Server Database Engine, the RDBMS itself, is the system that hosts the MDS database, the MDS Hub. Of course, as an RDBMS, it can host many transactional and analytical databases, and we can use it for many other master data management and data quality related activities. Transact-SQL is a very powerful language for manipulating and querying data, and we can use it to discover and correct bad data. From version 2005 on, we can extend Transact-SQL capabilities with CLR functions and procedures. For example, Transact-SQL does not support validating strings against regular expressions out of the box; however, we can write Visual Basic or Visual C# functions for this task and use them inside SQL Server.
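
For illustration only, the following sketch assumes that such a CLR function has already been written and deployed under the hypothetical name dbo.RegexIsMatch, returning 1 when the input matches the pattern; it is then used like any other scalar function to flag suspicious values.

-- dbo.RegexIsMatch is a hypothetical CLR scalar function: (input, pattern) -> bit
SELECT CustomerId, EmailAddress
FROM dbo.Customers
WHERE dbo.RegexIsMatch(EmailAddress, N'^[^@\s]+@[^@\s]+\.[^@\s]+$') = 0;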


SQL Server Integration Services (SSIS) is a tool for developing Extract-Transform-Load (ETL) applications. An ETL process is always part of an analytical solution; we have to populate data warehouses somehow. Of course, we can use the ETL process and ETL tools for preparing master data for our MDS application as well. SSIS includes many components, called tasks and transformations, that are useful for analyzing, cleansing and merging data. For example, the Data Profiling task quickly gives us a good overview of the quality of our data. The Data Profiling task is shown in figure 10.

Figure 10: THE DATA PROFILING TASK
Besides the Data Profiling task, there are quite a few additional tasks and transformations that help us with the ETL process. For example, the Fuzzy Lookup transformation can join data from different sources based on string similarity, and the Fuzzy Grouping transformation helps find duplicates, again based on string similarity.


SQL Server Analysis Services (SSAS) is an analytical tool; it enables OLAP and Data Mining analyses. As an analytical tool, it does not seem useful for master data management at first glance. However, we can use OLAP cubes efficiently for data overview. Data Mining comprises a set of advanced algorithms that try to find patterns and rules in a data set, and some of these algorithms are very useful for finding bad data. For example, the Clustering algorithm tries to group rows into clusters based on the similarity of their attribute values. After we find the clusters, we can analyze how well a row fits into its cluster; a row that does not fit well into any cluster is a row with suspicious attribute values. Finally, we have SQL Server Reporting Services (SSRS) in the SQL Server suite. SSRS is useful for master data management indirectly: we can present our activities and data quality improvements through reports. We can publish reports on the default SSRS portal site, or on any other intranet portal, including the SharePoint sites we have in our enterprise.


Summary
To begin this book about SQL Server Master Data Services, we started with some theoretical introductions. We defined what master data is. Then we described master data management in general. We mentioned explicit actions regarding data governance, and the operators who manage the master data; these operators are called data stewards. We also introduced different approaches to master data management. Then we switched to master data sources and destinations. For sources, it is crucial that they take care of data integrity. We briefly discussed the Relational Model, the most important model for transactional databases. We also mentioned normalization, the process of unbundling relations, a formal process that leads to a desired state of a database in which each table represents exactly one entity. In this second part of the first chapter, we also introduced the Dimensional Model, a model used for analytical systems. Master data management always has to deal with data quality. Investments in an MDM solution make no sense if the quality of the data in our company does not improve. In order to measure data quality, we need to understand which data quality dimensions we have. In addition, we introduced the most important activities dedicated to data quality improvements. In the last part of the chapter, we introduced SQL Server Master Data Services and mentioned the MDS key concepts. We also explained how we can use other elements of the SQL Server suite for master data management and data quality activities. It is time now to start working with MDS.


References
• Bill Inmon: Building the Operational Data Store, 2nd Edition (John Wiley & Sons, 1999)
• C. J. Date: An Introduction to Database Systems, Eighth Edition (Pearson Education, Inc., 2004)
• Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, Second Edition (John Wiley & Sons, 2002)
• Danette McGilvray: Executing Data Quality Projects (Morgan Kaufmann, 2008)
• 5 Whys on Wikipedia


Chapter 2: Master Data Services Concepts and Architecture
Davide Mauri
We introduced master data, master data management and Master Data Services in the first chapter of the book; now it is time to show how we can work with this new product in the SQL Server suite. This chapter shows how we can create an MDS model with entities, attributes, hierarchies and collections. In addition, it shows how we can import and export data to and from the MDS Hub. We will also learn how to implement business rules for our attributes, and the concept of versioning master data. The chapter also starts with a quick introduction to Master Data Services setup, so that it can serve as a walk-through guide if you want to get hands-on with MDS but do not have it installed yet. Topics covered in this chapter include:
• Installation of Master Data Services (MDS)
• Using the MDS Web application
• Defining models, entities, attributes and hierarchies
• Constraining inputs with business rules
• Importing, exporting and managing data
• Managing versions


Master Data Services Setup
This section provides an overview of installing Master Data Services. The aim is to give you just a short description of the steps necessary to get a machine with MDS up and running, without going into too much detail. Complete coverage of an MDS installation is provided in SQL Server 2008 R2 Books Online. The installation and setup of Master Data Services is divided into three main steps:
• Installation of MDS components and tools
• Setup of the MDS database
• Setup of the MDS Web application

Installation of MDS Components and Tools
Installing the MDS components and tools is a straightforward operation. It's done using a typical Windows installer package, and you can simply accept all the default options. MDS Setup is included only in the 64-bit versions of the SQL Server 2008 R2 Developer, Enterprise and Datacenter editions. Operating system support is determined mainly by the edition of SQL Server you choose; for example, the Developer edition can be installed starting with Windows Vista (Business, Enterprise and Ultimate) and Windows 7 (Professional, Enterprise and Ultimate). The MDS components and tools require .NET Framework 3.5 Service Pack 1 to be installed; if it's not already installed, MDS Setup will install it before proceeding with the main installation. Furthermore, the Windows user account of the logged-on user executing MDS Setup needs to be a member of the local Administrators Windows group.


Although not strictly necessary to set up the components and tools, the PowerShell feature needs to be installed before proceeding with the remaining configuration steps. The other two installation steps require a specialized configuration program called Master Data Services Configuration Manager, which is installed along with the other components and tools in the first step.

Figure 1: MASTER DATA SERVICES CONFIGURATION MANAGER

Setup of MDS Database
Setting up a database for an MDS solution requires an already installed SQL Server instance and the provisioning of two Windows users that will be granted the appropriate permissions to log in to the instance and access the database.


The first Windows user will be used to provide the identity of the web application pool, and the second one will be used as the MDS System Administrator. The MDS System Administrator can access and update all models and all data in all functional areas for all web applications that use an MDS database. In this book, we assume that the MDS database created is named MDSBook. MDS Configuration Manager can then be used to connect to the instance and create the database, as shown in Figure 2.

Figure 2: DATABASE SETUP WITH MDS CONFIGURATION MANAGER

Setup of MDS Web Application
Setting up an MDS web application requires the installation of the Web Server (IIS) role along with some specific role services. As a quick guideline, the following role services are required:
• Common HTTP Features
  o Static Content
  o Default Document
  o Directory Browsing
  o HTTP Errors
• Application Development
  o ASP.NET
  o .NET Extensibility
  o ISAPI Extensions
  o ISAPI Filters
• Health and Diagnostics
  o HTTP Logging
  o Request Monitor
• Security
  o Windows Authentication
  o Request Filtering
• Performance
  o Static Content Compression
• Management Tools
  o IIS Management Console
Furthermore, the following features are required:
• .NET Framework Features
  o WCF Activation
    - HTTP Activation
    - Non-HTTP Activation
• Windows PowerShell
• Windows Process Activation Service
  o Process Model
  o .NET Environment
  o Configuration APIs


MDS Configuration Manager can then be used to create and configure the web application, as shown in Figure 3.

Figure 3: CONFIGURING WEB APPLICATION WITH MDS CONFIGURATION MANAGER


Master Data Manager Web Application
The entry point for any kind of human interaction with Master Data Services is the Master Data Manager web application. The web application is installed during the configuration of Master Data Services, as part of the initial system deployment, and allows managing everything related to a Master Data Management solution. From data model definition to system and security management, including data verification and manipulation, this is the central point from which master data can be managed. Figure 4 shows the five main functional areas, which are the entry points to all available features.


Figure 4: MASTER DATA MANAGER WEB APPLICATION

Explorer
Explorer is where all data stored inside Master Data Services can be managed. Models, entities, hierarchies and collections can be navigated and their data can be validated with the help of defined business rules. Direct updates or additions to the Members, the actual data stored in entities, can be done here. It’s also possible to annotate such data to explain why a change has been made, so that everything can be tracked and kept safe for future reference. It’s also possible to reverse transactions if some changes made to data have to be undone.

Version Management
Once the process of validating master data is finished, external applications can start to use the verified data stored in Master Data Services. Since applications will rely on that reference data, it's vital that it cannot be modified at all; still, data will need to change in order to satisfy business requirements. By creating different versions of the data, it's possible to manage all these different situations, keeping track of changes and creating stable and immutable versions of master data.

Integration Management
As the name implies, Integration Management allows managing the integration of the master data with the already existing data ecosystem. With Integration Management, it's possible to batch import data that has been put into staging tables and to monitor the import results. It is also possible to define how data is exposed to external applications via defined Subscription Views.


System Administration
System Administration is the central place where the work of defining a Master Data Model takes place. With System Administration it is possible to create models, entities, attributes, hierarchies, business rules and everything offered by Master Data Services in terms of data modeling.

User and Groups Permissions
Security plays a key role when a system manages data that represents a core asset of any business. With User and Groups Permissions, all topics regarding security can be administered and managed, configuring all the security clearances needed to access and modify master data.


Models
As explained in Chapter 1, Master Data Services organizes its data with the aid of different concepts that deal with different aspects of the data. Together, these concepts allow the definition of a data model. To start getting confident with Master Data Services concepts and technology, in the following sections we'll create a model to hold customer data. By walking through all the steps needed, we will end up with an example of a functional Master Data Services model.

Models
The highest level of data organization in Master Data Services is the Model. A model is the container of all other Master Data Services objects and can be thought of as the equivalent of a database in the Relational Model. As a result, the first step in creating any solution based on Master Data Services is to create a model. From the System Administration section of the Master Data Manager portal, select the element Model from the Manage menu.

Figure 5: OPENING THE MODEL MAINTENANCE PAGE


From the Model Maintenance page, it's possible to create a new model simply by clicking the Add Model button (the green plus icon). This opens the Add Model page, where it is possible to specify the name and other options that will automatically create some default objects for us. These options will be useful later, once we have a solid background on how Master Data Services works, in order to save some work; for now we'll uncheck them.

Figure 6: ADDING A MODEL
Since a model is just a simple container, all that's needed to create a model is its name. For our sample we're going to create customer master data, so we call our model Customer.

Entities and Attributes
After having created the Customer model, the entities of that model need to be defined. Entities are the objects a customer is made of, and they are conceptually very similar to the tables of a standard relational database. In our sample Customer model, each customer will obviously have a specific address. An address exists in a city, which belongs to a specific province (or state), within a country.


Cities, provinces and countries will then be entities in their own right, since they represent reference data too. In this way, the master data will contain one and only one definition of their member values, so that it will be possible to have a single version of the truth, avoiding human errors such as misspelled or incorrect data that hinder the achievement of high data quality standards. The process of deciding which objects have to be put into separate entities and which do not is similar to the normalization process for a relational database: the aim is to avoid redundancy and data duplication, which lead to data anomalies and inconsistencies. Back to our sample: to hold customer reference data, we'll create the following entities:
• Customer
• City
• StateProvince
• CountryRegion

This sample is based on the AdventureWorksLT2008R2 sample database, which can be downloaded from CodePlex. Entities are created through the Model -> Entities menu item on the System Administration page. After having defined which model an entity belongs to, only its name has to be specified in order to create it.


Figure 7: ADDING AN ENTITY
We haven't defined yet what hierarchies and collections are, so we don't want them in our model for now. It will be possible to change this option later, when we have a deeper understanding of these objects. After creating all the aforementioned entities, you'll see a page like the following:


Figure 8: ENTITY MAINTENANCE PAGE AFTER ADDING ALL ENTITIES
From here it's possible to enrich and complete the entities' definitions with the aid of attributes: entities are made of attributes, which are the analogue of columns in a relational database. Every entity is made of at least two attributes, which are mandatory and cannot be removed or renamed:
• Name
• Code

Attributes can be of three different types, which define the kind of values an attribute can handle:
• Free-Form: allows the input of free-form text, numbers, dates and links.
• File: allows you to store any generic file, for example documents, images or any kind of binary object.
• Domain-Based: the values of an attribute with the Domain-Based type are the values stored in another entity. It allows the creation of relationships between entities by imposing the rule that each value used here must be a value that exists in that other, specific entity.


As a result, this forces the user not to enter free-form data but to choose from a list of defined values, which helps a lot in avoiding errors and redundancy. The entity that stores the domain of possible values for the attribute is similar to a lookup table in the classical Relational Model. To manage attributes for an entity, we first need to select that entity on the Entity Maintenance page, so that a set of icons becomes visible:

Figure 9: SELECTING AN ENTITY IN THE ENTITY MAINTENANCE PAGE
Clicking on the pencil icon will open the Attribute Management page for the selected entity.


Figure 10: OPENING THE ATTRIBUTE MAINTENANCE PAGE
In the Leaf Attributes section it's possible to add as many attributes as we need. Attributes are called "leaf" to distinguish them from the consolidated attributes used in explicit hierarchies. In brief, a leaf attribute represents the lowest level of detail used to define the master data model. Moving forward in our Customer sample, we need to create an attribute for the StateProvince entity: a state or a province is contained within a country, and we need to model this situation in our solution. This kind of relationship between the StateProvince and CountryRegion entities can be represented using a Domain-Based attribute, as shown in Figure 11.


Figure 11: CREATING A DOMAIN-BASED ATTRIBUTE
It will probably now be obvious that a Domain-Based attribute also has to be defined for the City entity, so that it will be related to the state or province in which the city is located. For the Customer entity, we will need to define several attributes. Besides the mandatory Code and Name, we'll create the following:
• Firstname
• Lastname
• EmailAddress
• Phone
• Address
• City


For simplicity we'll create them all, with the exception of the City attribute, as Free-Form attributes with a maximum length of 100 characters. City has to be a Domain-Based attribute whose values are taken from the City entity. When a Free-Form attribute is created, it's possible to define the length of the data and the width, in pixels, of the textbox that allows a user to see and modify the attribute's data.

Figure 12: CREATING A FREE-FORM ATTRIBUTE
As the previous image shows, it's also possible to flag the Enable Change Tracking option. All kinds of attributes support this option. As the name suggests, if Change Tracking is enabled, Master Data Services will monitor the attribute's data for changes. If a change happens, a business rule can be configured to fire so that the necessary data validation can take place.

Chapter 2: Master Data Services Concepts and Architecture Attribute’s data type determines which constraints can be placed on the attribute itself to assure data consistency. For Number Data Type, it’s possible to define how many decimal places have to be used and the input mask.

Figure 13: NUMBER DATA TYPE OPTIONS
The DateTime data type also needs to have an input mask defined. A File attribute can constrain the type of files handled by specifying a file extension; just be aware that this is not meant to be a security mechanism. For example, nothing will check that the actual file content is what its extension says it should be. After having created all the entities and attributes, we have defined the model in which we'll store customer data. It should be clear at this point that creating entities and attributes is the key part of modeling master data. As with relational database modeling, it's vital for the success of a project to have a clear idea of all the entities that are needed, along with their attributes, so that the correct relationships can be created; changing them once data has been imported, though still possible, is a time and resource consuming task. Some similarities with a relational database emerge here. This is not unexpected: Master Data Services uses the power of the SQL Server Relational Database Engine to enforce constraints that assure data quality. Each time an entity is created, a corresponding table gets created too.


Each time a Domain-Based attribute puts two entities in a relationship, a corresponding Foreign Key constraint gets created between the two underlying tables. All entities and attributes are stored in the mdm.tblEntity and mdm.tblAttribute tables of the configured Master Data Services database. Members of an entity are stored in specific tables created ad hoc; two tables are created per entity. These tables follow a specific naming convention: mdm_tbl_<model_id>_<entity_id>_<table_type>. The EN table type holds entity members, while the MS table type holds security information. The mapping between these tables and the entity that uses them can be found in the mdm.tblEntity table. As can easily be supposed, in a real world solution an entity can have dozens of attributes, each one, or each group of them, describing a specific aspect of the entity. In this case it may make sense to group logically connected attributes together, so that it is easier for the final user to focus only on a specific group of data. The possibility to group attributes is provided by the Attribute Group feature. In our example it makes sense to group the Address and City attributes together, since they hold address information. The management page for Attribute Groups can be opened from the Model -> Attribute Groups menu. For each entity, Attribute Groups can be created and populated with the desired attributes. The Attribute Groups Maintenance page is shown in Figure 14.


Figure 14: ATTRIBUTE GROUPS MAINTENANCE PAGE
The Name and Code attributes are not visible since they are always available in any attribute group, and so they don't need to be added manually. Another Attribute Group we may want to create here is one that groups contact information together (FirstName, LastName, Phone, EmailAddress) to isolate it from address data. Besides helping users manage Master Data more easily, Attribute Groups are also a powerful security feature. Please refer to Books Online for more details on how effective permissions are determined from the permissions assigned to users, groups and MDS Model elements.


Hierarchies
Hierarchies are tree-based structures that allow you to organize master data in a coherent way, which also facilitates navigation and aggregation of the data itself.

Figure 15: HIERARCHY EXAMPLE
A hierarchy tells the user, or the consumer system that receives Master Data from Master Data Services, that it makes sense to aggregate data at all levels except the leaf level, which represents the actual un-aggregated data. For example, Customers may be the leaf level. Since customers live and work in cities, it makes sense to aggregate all the customers on a per-city basis, so that we can analyze how strong the presence of our company is in different cities. Cities are located in countries, so aggregating data by country is useful to analyze worldwide business. Having a common place where hierarchies are defined, like Master Data Services models, is key to having a standard way to aggregate data across all the applications used in a company.


For all those who have already developed a Business Intelligence solution, the concept of a hierarchy in a SQL Server Analysis Services Dimension is very close to the one used by Master Data Services. In fact, it is possible to create and populate an Analysis Services Dimension just using the data stored in a Master Data Services hierarchy.
Derived Hierarchy
A Derived Hierarchy is a hierarchy created using the relationships that exist between entities that use Domain-Based attributes. In our Customer sample, we have a relationship between City, StateProvince and CountryRegion as follows:

Figure 16: DERIVED HIERARCHY EXAMPLE
A Derived Hierarchy can be created using the Derived Hierarchy Maintenance page, reachable via the Manage menu. After having specified the name, Geography in our case, composing the hierarchy is all a matter of dragging the available entities from the left to the Current Level section in the middle of the page. After dropping an entity, Master Data Services will automatically detect at which level of the hierarchy the dropped entity has to be placed, by analyzing the relationships between it and the already placed entities.


Figure 17: DERIVED HIERARCHY MAINTENANCE PAGE
The Preview pane will show how Members are organized based on the created hierarchy. This feature is only useful once data has already been imported into Master Data Services, to check that the hierarchy organizes values in the expected way. In a Derived Hierarchy, all members of all used entities take part in the hierarchy. When a member of an entity is added, deleted or updated, Master Data Services automatically updates the related hierarchies so that data is always consistent. An entity member can be used only once in a hierarchy, and an entity cannot be placed in more than one level. Of course, an entity can be used in more than one hierarchy. A special case of Derived Hierarchies are Recursive Hierarchies. This is the case when you have an Entity with a Domain-Based attribute that uses values coming from the entity itself. A typical scenario here is the organizational chart of a company, where an employee can also be the manager of other employees. There is no particular difference between this kind of hierarchy and standard Derived Hierarchies, but there are a few limitations:
• Only one Recursive Hierarchy can exist per Entity
• Member Permission cannot be assigned
• Circular References are not allowed

Explicit Hierarchy
It is also possible to define a hierarchy without using the relationships between entities: maybe we need to create a hierarchy using some additional data not available in any system, by adding new data manually, directly into the Master Data Services database.


These special hierarchies are called Explicit Hierarchies. They can organize data coming from one entity only (as opposed to what happens with Derived Hierarchies) and allow the creation of Consolidated Members, which define the levels of the hierarchy itself. In our Customer example, we have CountryRegion data, which contains information about the nation where a customer resides. For business purposes we may decide that it makes sense to have the data grouped into Continents and Geographic Zones.

Figure 18: EXPLICIT HIERARCHY EXAMPLE
Explicit Hierarchies can be ragged, meaning that not all branches have to be equally deep. For example, we may decide to have a higher level of detail under North America, adding the levels Country and State, while omitting them for other elements. To create an Explicit Hierarchy, we first have to enable this feature at the Entity level. This can be done from the Entity Maintenance page, together with the definition of the hierarchy name.


Figure 19: ENABLING EXPLICIT HIERARCHIES
If the "Include all leaf members in a mandatory hierarchy" flag is checked, the newly created Explicit Hierarchy will contain all the entity's members. By default they will be placed under the "Root" node so that they can be moved into other levels later on. If the flag is not checked, not all of the entity's members have to be used in the hierarchy; the unused members will be placed in a special node named "Unused". After having enabled the Explicit Hierarchy, the Entity Maintenance page will look like the one in Figure 20.


Figure 20: ENTITY MAINTENANCE PAGE AFTER ENABLING EXPLICIT HIERARCHIES
Consolidated Attributes are the attributes used by Consolidated Members. As the figure shows, it's possible to have more than one Explicit Hierarchy per Entity. Collection Attributes are attributes used by Collections, which will be discussed later in this chapter. After having defined an Explicit Hierarchy, the Explorer functional area of Master Data Services has to be used to add Consolidated Members to that hierarchy and build its levels. From here, through the Hierarchies menu item, it's possible to access the Maintenance Page and choose the Explicit Hierarchy. The page is divided into two main sections. On the left, the hierarchy is visible and it's possible to navigate its levels or move items from one level to another (along with their sublevels) by dragging and dropping them or using the cut-n-paste feature.

Figure 21: MANAGING EXPLICIT HIERARCHIES – LEFT SIDE
On the right section of the page, there is the area where it's possible to create or delete items at each level. In the following sample, under the Europe level we need to add the Western, Central and Eastern zones. After having selected the Europe item on the left side of the page, it is possible to add Consolidated Members to that hierarchy level.

Figure 22: MANAGING EXPLICIT HIERARCHIES – RIGHT SIDE
It is also possible to add Leaf Members from this page. Remember that since a Leaf Member is considered the last item of a hierarchy, a leaf member cannot have any child elements. After clicking on the Add button (again the green "plus" icon) a new page will pop up. Here, all attributes for the chosen item can be entered, and it's also possible to select under which level the new member will be placed by clicking on the "Find Parent Node" button (indicated by a magnifying glass icon).


Figure 23: ADDING A CONSOLIDATED MEMBER
The manual addition of Consolidated Members is just one option. As we'll see next, it's possible to batch import such members, just as with Leaf Members.

Collections
Collections are nothing more than a group of Explicit Hierarchies and Collection Members. They are useful to create groups of members that are not necessarily grouped from a business perspective, but are grouped to make life easier for the user. For example, if someone is in charge of managing Customers from Spain, Portugal and South America, (s)he can do a more efficient job if (s)he can find all of their customers in a single group. Here's where Collections come into play. Collections can be created on a per-Entity basis and, just like Explicit Hierarchies, they can also have their own Consolidated Members. Again, member management is done in the Explorer functional area. After having created a Consolidated Member to represent the collection, it's possible to add members by clicking on the "Edit Members" menu item, visible after clicking on the drop-down arrow near the collection we want to manage.


Figure 24: EDITING A COLLECTION
A new page will be displayed where, using the same drag-n-drop or cut-n-paste technique used to manage Hierarchies, members of the collection can be added, removed or changed.

Business Rules
Business Rules are one of the most powerful features of Master Data Services. They allow the definition of rules that ensure the quality of data. A business rule is basically an If...Then sentence that specifies what has to be done if certain conditions are met. For example, let's say that an email address, to be considered valid, needs to contain at least the "at" (@) character. Business Rules are created in the System Administration functional area, by accessing the Business Rule Maintenance page through the "Manage" menu item. New rules can be created by clicking on the usual "Add" button, and the rule name can be set by double-clicking on the existing name that has to be changed. After having selected the entity for which the Business Rule has to be created, the rule can be defined by clicking on the third icon from the left:


Figure 25: BUSINESS RULE MAINTENANCE PAGE
The Edit Business Rule page will be displayed. This page has four main areas. In the top left section there are the Components of the business rule. Each component represents a piece of a sentence that can be put together to form a business rule. They are of three types: conditions, used for the If part of the rule sentence; actions, used for the Then part; and logical operators that connect two or more conditions. As said before, in our case we just want to be sure that the EmailAddress attribute values contain the "@" sign. Under the Validation Actions there is the "must contain the pattern" action. We don't need any conditions, since we always want to verify that an email address is formally correct. To put the desired elements under the proper section, we just have to drag-n-drop the element we want to use into the right place: Conditions or Actions.


Figure 25: EDITING A BUSINESS RULE
As soon as we drop the item, the Edit section of the page, on the bottom right, will require us to configure the used Condition or Action. Based on the selected item, we may be asked to specify which attribute we want to use and some other configuration options. In our example, we'll need to specify that the rule will be applied to the EmailAddress attribute and that the pattern we're looking for is %@%.

Figure 26: SPECIFYING DETAILS OF THE “MUST CONTAIN THE PATTERN” ACTION


Again, to select the attribute we want to use, we just have to drag-n-drop it under the "Edit Action" item. Once the rule is completed, it needs to be published before users can start to use it to validate data. From the Business Rules Maintenance page, select the rule that needs to be published and click on the "Publish Business Rules" button (second from the left).

Figure 27: PUBLISHING A BUSINESS RULE
After the Business Rule has been published, it can be used to validate Master Data.
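As a quick aside (our own illustration, not part of the MDS walk-through), the %@% pattern used by the "must contain the pattern" action follows the familiar LIKE wildcard syntax, so its behavior can be tried directly in T-SQL; the sample values below are invented:

-- Hypothetical values, just to show what the %@% pattern accepts and rejects
SELECT EmailAddress,
       CASE WHEN EmailAddress LIKE '%@%' THEN 'passes' ELSE 'fails' END AS PatternCheck
FROM (VALUES (N'user@example.com'), (N'not-an-email')) AS t(EmailAddress);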


Importing, Exporting and Managing Data
Master Data Services needs to collect data from other systems in order to consolidate and validate it and make it Master Data. Though data can be managed directly using the Master Data Management portal, it will be a common scenario to import and export data from and to an external system that produces and consumes reference data.

Import Data
Importing data into Master Data Services is a batch process based on three main tables:
• mdm.tblStgMember
• mdm.tblStgMemberAttribute
• mdm.tblStgRelationship

The first table – mdm.tblStgMember – is used to import members and their Code and Name system attributes. All other user-defined attributes have to be imported using the mdm.tblStgMemberAttribute table. If we need to move members into an Explicit Hierarchy or add members to a Collection, we'll use the mdm.tblStgRelationship table too. Without going into deep technical details, to import member data we first need to create the members, and then populate all the user-defined attributes, in that sequence. For the first attempt, we'll load data for the CountryRegion Entity, taking the values from the AdventureWorksLT2008R2 sample database. To extract and load data into mdm.tblStgMember we'll use the following T-SQL code:
USE AdventureWorksLT2008R2;
GO
WITH cte_source AS
(
    SELECT DISTINCT CountryRegion
    FROM SalesLT.[Address]
)
INSERT INTO MDSBook.mdm.tblStgMember
    (ModelName, HierarchyName, EntityName, MemberType_ID, MemberName, MemberCode)
SELECT
    ModelName = 'Customer',
    HierarchyName = NULL,
    EntityName = 'CountryRegion',
    MemberType_ID = 1,
    MemberName = CountryRegion,
    MemberCode = ROW_NUMBER() OVER (ORDER BY CountryRegion)
FROM cte_source;
GO

After executing the T-SQL batch from SQL Server Management Studio, we’ll have three rows imported in the member staging table of our MDSBook sample database. To notify Master Data Services that we want to start the import process, we have to use the Import feature available in the “Integration Management” functional area.

Figure 28: IMPORTING CUSTOMER DATA
By clicking on the Process Button (the only one above the Model label) we'll start the batch process which will queue the rows to be imported. Batches are processed every 60 seconds by default.


After that time the process will be completed and we'll be able to see the result on the same page:

Figure 29: AFTER CUSTOMER DATA IS IMPORTED
If errors occurred, we can get detailed information on them in a separate page, simply by clicking on the "Batch Details" icon, the first on the left.

Managing Data
Now that we have our data inside Master Data Services, we can use the Explorer functional area to explore and manage the data. Here we can select the Entity we want to manage simply by choosing the desired one from the "Entities" menu. Since we've just imported CountryRegion data, we may want to check it. The CountryRegion Entity page will be as shown in Figure 30.


Figure 30: THE COUNTRYREGION ENTITY PAGE
Here it is possible to add or remove members, or change existing ones simply by double-clicking on the attribute that we want to change. The yellow question mark visible on the left by the member's name indicates that the data has not yet been validated, which means that we have to execute business rules in order to validate the data. To do this, we simply have to select the members we want to validate by checking the relative checkboxes and then clicking on the "Apply Business Rules" button, the third from the right, just below the "CountryRegion" title. We haven't defined any business rule for CountryRegion, so the data will surely pass validation, changing the yellow question mark to a green tick mark. Since CountryRegion has an Explicit Hierarchy defined, we may also want to check Consolidated Members. As the image shows, we can do this by selecting the "Consolidated" option, or we can use the Hierarchies menu to access the Explicit Hierarchy and put the imported Leaf Members under the correct levels:


Figure 31: THE COUNTRYREGION EXPLICIT HIERARCHY PAGE
By using drag-n-drop or by using the cut-n-paste features we can move the "Canada" and "United States" leaf members under the "North America" Consolidated Member, and the "United Kingdom" leaf member under the "Western Europe" consolidated member.

Export Data
Now that we have imported, validated and consolidated our data, we may need to make it available to external applications. This can be done by creating a Subscription View from the “Integration Management” functional area.

Figure 32: CREATING A SUBSCRIPTION VIEW
A Subscription View needs to be defined with a name, a version of the data that has to be exported (we'll introduce versioning in the following section of this chapter) and, of course, the Entity or the Derived Hierarchy we want to export. The Format setting allows us to define which kind of data we want to export: Leaf Members, Consolidated Members, levels of an Explicit Hierarchy and so on. We want to export the CountryRegion Master Data along with the information of its Explicit Hierarchy: that's why we have to choose "Explicit Levels" in the Format combo-box and specify how many levels we want to export. Three levels are enough, since our Explicit Hierarchy is made up of the Continent – Zone – Country levels. After having saved the Subscription View, Master Data Services creates a SQL Server view for us. In the MDSBook database, we created the Customer_CountryRegion view, which is ready to be used by external applications to get that Master Data. Here is an example of Transact-SQL code used to query the view:
SELECT * FROM [MDSBook].[mdm].[Customer_CountryRegion];

The view, along with other columns, has “flattened” the hierarchy using three levels, as requested:

Figure 33: DATA FROM A SUBSCRIPTION VIEW


Multiple Versions of Data
Now that Master Data has been made available to other systems via Subscription Views, we must guarantee to those systems that the Master Data won't change suddenly and without notice. To "freeze" Master Data so that its values won't change, we have to introduce the concept of versions. Versions provide a way to create logical snapshots of the Master Data; this way, we can still make changes to Master Data while keeping track of the changes over time, and providing to the subscribing systems the version they expect to get. Versioning is completely integrated into the system and it is always active. At the very beginning, only one version of data exists and its status is set to Open. This first version is automatically created by Master Data Services as soon as the Model is created. "Open" means that everyone (who has the required permission) can make changes to the model and to its data.

Figure 34: THE FIRST VERSION OF DATA IS OPEN
Once the model is completed and has been filled with data, the administrator needs to start the validation process. Of course, to validate data, it has to be stable. To be sure that no one can change the data while someone is validating it, the version must be set to Locked. This operation is done in the Version Management functional area, shown in Figure 35.


Figure 35: LOCKING A VERSION
Once the version has been locked, validation can take place. To validate data all at once, the Validation Version page, reachable via the menu with the same name, runs all the defined Business Rules on all data:

Figure 36: VALIDATING A VERSION
After the data has been validated, it's finally possible to set the version to the Committed status, using the initially greyed-out button (the second in the top left of the image above).


Once the version has reached this status, its data cannot be changed anymore. If we need to change the Master Data, we'll also need to create a new version of it. New versions are created from the Manage Versions page, by selecting which committed version of the data we want to start working from.

Figure 37: THE MANAGE VERSIONS PAGE
Now that we have a new version, we can start to work on it freely, being sure that all reference data stored in the previous version will remain untouched, allowing the systems that rely upon it to work correctly and without problems.


MDS Database Schema
In the last part of this chapter, we are going to take a look at the MDS database schema. This should help us understand how it is built and how it works, since an intimate knowledge of the final store of our Master Data can help us create better solutions to handle it. Each time a new MDS solution is created using the MDS Configuration Manager tool, a new database is created too. This database will contain all the models we'll define in this specific MDS solution, besides all the system objects needed by MDS for its own functioning. All these objects are located in the mdm schema. Among all the tables contained here, several are particularly interesting and deserve more attention. The first of these tables is mdm.tblModel. Each time a new model is created within the selected MDS application, a new row is inserted here. This action also generates an ID value that uniquely identifies the model and that is used throughout the database. From now on, this ID will be referred to as model_id. A model's Entities also reside in tables; the main table where the list of all Entities is stored is mdm.tblEntity. Naturally, in this case each entity also has its own ID value, which will be called entity_id. The table contains all the Entities defined in the MDS application, and each Entity is related to the Model in which it has been defined through the model_id value. Each time a new Entity is added to the model, at least two dedicated tables are also created. The name of these tables is generated using this rule: mdm.tbl_<model_id>_<entity_id>_<member_type>


The first two placeholders, model_id and entity_id, have already been defined; the member_type can have six different values:
• EN: table that holds entity data
• MS: table that stores security data
• HR: table that stores Explicit Hierarchy data
• HP: table that stores relationships between entities in the Explicit Hierarchy
• CN: table that holds Collections defined on the Entity
• CM: table that keeps track of which Entity Member is in which Collection

Allowed member types are stored in the mdm.tblEntityMemberType table. As soon as an entity is created, only the "EN" and "MS" tables get created. All the other tables will be created only if the functionality they support is used. The following query shows an example of how to find all entities in an MDS database by querying the mdm.tblEntity table, with the results shown in the figure right after the query.
SELECT ID, Model_ID, Name, EntityTable, SecurityTable FROM mdm.tblEntity;

Figure 38: ENTITIES IN THE MDM.TBLENTITY TABLE
As we know, each Entity can have its own set of Attributes. These attributes are stored as columns in the "EN" entity table. Each time a new attribute is added to an entity, a new column is also added to the related table. The name of the created column follows a naming convention similar to the one already seen for table names:


uda_<model_id>_<attribute_id>
The only exception to this rule is for the Code and Name attributes. Since they are mandatory for any entity, they are mapped directly to columns with the same names. In the following figure, we can see sample results of a query over an example "EN" entity table.

Figure 39: ATTRIBUTES OF AN "EN" ENTITY TABLE
The attribute_id, like the other IDs mentioned so far, is generated automatically by the system each time a new row is inserted. The table that keeps track of all attributes in the database is mdm.tblAttribute. This table stores all the metadata needed by attributes, like the DisplayName and the mapping to the column name used in the entity table:

Figure 40: PARTIAL CONTENT OF THE MDM.TBLATTRIBUTE TABLE
When an entity has a Domain-Based attribute, a Foreign Key constraint is automatically created between the referenced and the referencing entities in order to support the relationship.
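These constraints are ordinary SQL Server foreign keys, so one hedged way to inspect them from the database side is through the standard catalog views (a generic sketch, not MDS-specific metadata):

-- List foreign keys in the MDS database; the MDS-generated ones link the
-- dynamically created entity tables described above.
SELECT fk.name AS ForeignKeyName,
       OBJECT_NAME(fk.parent_object_id) AS ReferencingTable,
       OBJECT_NAME(fk.referenced_object_id) AS ReferencedTable
FROM sys.foreign_keys AS fk
ORDER BY ReferencingTable;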


Staging Tables
As we have learned in the previous section, each time a new entity is created, an associated database table gets created as well. In such a dynamic and ever-changing environment, creating a standard solution to import data into the Master Data Services database so that it fits the defined model could be a nightmare. Luckily, Master Data Services comes to help, giving us three standard tables, all existing in the mdm schema, that have to be used to import data into entities, to populate their attributes and to define relationships of hierarchies. These three tables are called staging tables and are the following:
• mdm.tblStgMember
• mdm.tblStgMemberAttribute
• mdm.tblStgRelationship

By using them it's possible to import data into any model we have in our Master Data Services database. A fourth staging table exists, but it isn't used actively in the import process, since it just reports the status and the result of the batch process that moves data from the aforementioned staging tables into entities, attributes and hierarchies. This table is mdm.tblStgBatch. You can populate the staging tables with regular T-SQL inserts and bulk inserts. After the staging tables are populated, you can invoke the batch staging process from the Master Data Manager Web application. After the batch process is finished, you should check for possible errors. Please refer to Books Online for more details about populating your MDS database through the staging tables. Following this process, and with the aid of the three staging tables, it is possible not only to populate entities with fresh data (which means creating new members), but also to do the following (a small sketch of one of these operations follows the list):


• Update member values
• Delete and reactivate members
• Create collections
• Add members to collections
• Delete and reactivate collections
• Update attribute values
• Designate relationships in explicit hierarchies
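For example, updating (or initially loading) an attribute value goes through mdm.tblStgMemberAttribute. The following is only a sketch in the spirit of the member-staging code shown earlier; the column list and the example values are our assumptions about the 2008 R2 staging schema, so verify them against Books Online before use:

-- Hypothetical example: stage a value for the EmailAddress attribute of a
-- Customer member. MemberCode 'C001' is an invented example code.
INSERT INTO MDSBook.mdm.tblStgMemberAttribute
    (ModelName, EntityName, MemberType_ID, MemberCode, AttributeName, AttributeValue)
VALUES
    ('Customer', 'Customer', 1, 'C001', 'EmailAddress', 'jane.doe@example.com');

As with member staging, such rows simply wait in the staging table until the batch process is started from the Integration Management functional area.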


Summary
In this chapter, the quick walk-through of the Master Data Manager Web application gave us an overview of MDS capabilities. We also learned how to work with Master Data Manager. Of course, everything is not as simple as we have shown in this chapter. Nevertheless, this chapter helps us match the elements of a master data management solution with a practical implementation using Master Data Services. The first part of this chapter is just a quick guide to installing MDS. This should help readers start testing it and use this chapter as a quick walk-through of MDS. Then we defined MDS Models and the elements of a Model. We have also shown how to import and export data. Finally, we discussed versioning of our master data. We have to mention that for a real-life solution, we should expect much more work with importing data. Before importing data into an MDS model, we have to profile it, check it for quality, and cleanse it. Typical actions before importing data also include merging the data from multiple sources and de-duplicating it. In the next chapter, we are going to explain how we can check data quality in our existing systems by exploiting tools and programming languages included in and supported by the SQL Server suite. The last chapter of this book is dedicated to merging and de-duplicating data. We are not going to spend more time and space on the Master Data Services application in this book. This is a general MDM book, and MDS is just a part of a general MDM solution. In addition, the MDS that comes with SQL Server 2008 R2 is just a first version and is, in our opinion, not yet suitable for production usage in an enterprise. We suggest waiting for the next version of MDS for real-life scenarios. Nevertheless, the understanding of how MDS works in the current version that we gained in this chapter should definitely help us with successful deployment and usage of the next version.


References
• Installing and Configuring Master Data Services on MSDN
• MSFT database samples on CodePlex
• How Permissions Are Determined (Master Data Services) on MSDN
• Master Data Services Team Blog
• Master Data Services Database (Master Data Services) on MSDN


Chapter 3: Data Quality and SQL Server 2008 R2 Tools
Dejan Sarka
Before we start implementing any centralized MDM solution, we should understand the quality of our existing data. If we transfer and merge bad data into our MDM solution, we will have bad data from the start, and over time the situation can get worse. Therefore, intensive data profiling should always start an MDM project. Besides getting an impression of our data quality, we should also try to find the root cause of bad data. In chapter one of this book, we have already seen that the five whys is a well-known technique used to find the root cause. However, the five whys technique works well for soft data quality dimensions. In this chapter, we are going to focus on hard data quality dimensions, i.e. on measurable dimensions. Namely, we are going to deal with completeness, accuracy and information. We are going to show how we can use SQL Server 2008 R2 tools for measuring data quality. In addition, we are going to show how we can use these tools to find the root cause of bad data. We will also see that through data profiling we can draw some conclusions about schema quality. In this chapter, we are going to introduce the following:
• Measuring the completeness;
• Profiling the accuracy;
• Measuring information;
• Using other SQL Server tools for data quality.


Measuring the Completeness
In order to get some value out of the data, our data has to be complete. This means that the data should be deep enough to cover all of the business needs, and scoped for the business tasks. Completeness starts with schema completeness. Of course, if our schema does not have a place for a specific piece of data, we cannot insert that piece of data. Normalization of a relational schema is a formal mathematical process that guarantees the completeness of the schema. In addition, normalization eliminates redundancy. Nevertheless, we are not going to talk about data modeling here. We are going to show how we can measure the completeness data quality dimension, and find the root cause of incomplete data. Population completeness is the first kind of completeness that can be easily measured. We can use two different assumptions here:
• Closed world assumption: all tuples that satisfy the relation predicates are in the relation;
• Open world assumption: population completeness is defined on a reference relation.
If we have the reference relation, we can measure population completeness by just comparing the number of rows in our relation and in the reference relation; in fact, just having the number of rows in the reference relation is sufficient information to measure our population completeness (a minimal row-count sketch follows below). In a relational database, the presence of NULLs is what defines the completeness. NULLs are standard placeholders for the unknown. We can measure attribute completeness, i.e. the number of NULL values in a specific attribute, tuple completeness, i.e. the number of unknown attribute values in a tuple, and relation completeness, i.e. the number of tuples with unknown attribute values in the relation.
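As a minimal sketch of that row-count comparison (the table names dbo.Customer and dbo.CustomerReference are purely hypothetical; the examples in this chapter use other tables):

-- Population completeness under the open world assumption:
-- how many of the reference rows do we actually have?
SELECT
    (SELECT COUNT(*) FROM dbo.Customer) * 100.0 /
    (SELECT COUNT(*) FROM dbo.CustomerReference) AS PopulationCompletenessPct;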


In the relational model, we have only a single placeholder for unknown attribute values, the NULL. However, we can have NULLs because attributes are not applicable for a subset of rows, or because the values are really unknown. Not-applicable values can also be a signal that the schema is not of the best possible quality; namely, they can tell us that some subtypes should be introduced in the schema. Inadequate schema quality is one possible root cause of NULLs in our database. In XML data, we do not have NULLs. XML has a special xsi:nil placeholder for unknown values. In addition, a complete element or an attribute can be missing from an XML instance, and this is missing data as well. We have to use the XQuery language inside Transact-SQL (T-SQL), inside the XML data type methods, in order to find incomplete data inside XML. When we start data profiling, we meet a lot of suspicious or even bad data at the start. In order to find the root cause, we have to narrow down our data profiling activities and searches. The techniques shown in this chapter help us not only with data profiling, but also with narrowing down the problem in order to find the root cause.

Attribute Completeness
Let us start with a simple example – finding the number of NULLs in an attribute. We are going to analyze attributes from the Production.Product table of the AdventureWorks2008R2 demo database. The samples are based on the SQL Server 2008 R2 sample databases, which can be downloaded from CodePlex. You can execute the queries in SQL Server Management Studio (SSMS), the tool shipped with SQL Server. If you are not familiar with this tool yet, please refer to Books Online. First, let's find which columns are nullable, i.e. allow NULL values, in the Production.Product table with the following query that uses the INFORMATION_SCHEMA.COLUMNS view:

SELECT COLUMN_NAME, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = N'Production'
  AND TABLE_NAME = N'Product';

The nullable columns are Color, Size, SizeUnitMeasureCode, WeightUnitMeasureCode, Weight, ProductLine, Class, Style, ProductSubcategoryID, ProductModelID, SellEndDate and DiscontinuedDate. The next query shows an overview of the first 100 rows, with only the nullable columns and the columns that help you identify rows, namely ProductId and Name:
SELECT TOP 100
       ProductId
      ,Name
      ,Color
      ,Size
      ,SizeUnitMeasureCode
      ,WeightUnitMeasureCode
      ,Weight
      ,ProductLine
      ,Class
      ,Style
      ,ProductSubcategoryID
      ,ProductModelID
      ,SellEndDate
      ,DiscontinuedDate
FROM Production.Product;

Partial results, showing only a couple of columns and rows, are here:
ProductId  Name                   Color  Size  SizeUnitMeasureCode  …
1          Adjustable Race        NULL   NULL  NULL                 …
2          Bearing Ball           NULL   NULL  NULL                 …
3          BB Ball Bearing        NULL   NULL  NULL                 …
4          Headset Ball Bearings  NULL   NULL  NULL                 …
316        Blade                  NULL   NULL  NULL                 …
…          …                      …      …     …                    …

You can easily see that there are many unknown values in this table. With a simple GROUP BY query, we can find the number of nulls, for example in the Size column. By dividing the number of NULLs by the total number of rows in the Production.Product table, we can also get the proportion of NULLs.


The following queries show the absolute number of NULLs in the Size column, and the proportion of NULLs:
SELECT Size, COUNT(*) AS cnt
FROM Production.Product
WHERE Size IS NULL
GROUP BY Size;

SELECT 100.0 *
       (SELECT COUNT(*) FROM Production.Product
        WHERE Size IS NULL GROUP BY Size) /
       (SELECT COUNT(*) FROM Production.Product) AS PctNullsOfSize;

By running these two queries, we find out that there are 293 NULLs in the Size column, which is 58% of all values. This is a huge proportion; just from this percentage, we can conclude that size is not applicable for all products. For products, size is a common attribute, so we would expect mostly known values. We could continue checking other nullable attributes with similar queries; however, we will see later that the SQL Server Integration Services (SSIS) Data Profiling task is well suited for finding NULLs in a column.

XML Data Type Attribute Completeness
So far we have shown how to check whether the complete value is missing. However, if we have a column of the XML data type, just a single element inside the XML value could be missing. Of course, we can then define that the XML value is not complete. The question is how to find rows with a specific element missing. In SQL Server, the XML data type supports some useful methods:
• query(), which returns part of the XML data in XML format;
• value(), which returns a scalar value of an element or an attribute of an element;
• exist(), which returns 1 if an element or an attribute exists in an XML instance, 0 if it does not exist, and NULL if the XML data type instance contains NULL;
• modify(), which allows you to insert an element or an attribute, delete an element or an attribute, or update the value of an element or an attribute;
• nodes(), which allows you to shred an XML data type instance to relational data.
All of the XML data type methods accept XQuery as an argument. XQuery expressions allow us to traverse the nodes of an XML instance to find a specific element or attribute. The value() method accepts an additional parameter, the target SQL Server scalar data type. For the modify() method, XQuery is extended to allow modifications, and is called the XML Data Modification Language, or XML DML. You can learn more about XQuery expressions and XML DML in Books Online. For checking whether an element is present in the XML instance, the exist() method is the right one. First, let us create an XML instance in a query from the Production.Product table by using the FOR XML clause:
SELECT p1.ProductID
      ,p1.Name
      ,p1.Color
      ,(SELECT p2.Color
        FROM Production.Product AS p2
        WHERE p2.ProductID = p1.ProductID
        FOR XML AUTO, ELEMENTS, TYPE) AS ColorXml
FROM Production.Product AS p1
WHERE p1.ProductId < 319
ORDER BY p1.ProductID;

The subquery in the SELECT clause generates an XML data type column from the Color attribute. For the sake of brevity, the query is limited to return seven rows only. The first five have NULLs in the Color attribute, and the XML column returned does not include the Color element. The last two rows include the Color element. We are going to use the previous query inside a CTE in order to simulate a table with an XML column where the Color element is missing for some rows. The outer query uses the exist() method to check for the presence of the Color element.

WITH TempProducts AS
(
    SELECT p1.ProductID
          ,p1.Name
          ,p1.Color
          ,(SELECT p2.Color
            FROM Production.Product AS p2
            WHERE p2.ProductID = p1.ProductID
            FOR XML AUTO, ELEMENTS, TYPE) AS ColorXml
    FROM Production.Product AS p1
    WHERE p1.ProductId < 319
)
SELECT ProductID
      ,Name
      ,Color
      ,ColorXml.value('(//Color)[1]','nvarchar(15)') AS ColorXmlValue
FROM TempProducts
WHERE ColorXml.exist('//Color') = 0;

The outer query correctly finds the first five rows from the CTE. The XML specification also allows an element to be present but have no value. In such a case, a special attribute, xsi:nil, should appear inside the nillable (the XML term for nullable) element. Therefore, in order to find all incomplete XML instances, we have to check for the xsi:nil attribute as well. In order to check for the xsi:nil attribute, we have to create it in some rows first. We are going to slightly change the last query. In the CTE, we are going to include the XSINIL keyword in the FOR XML clause of the subquery. This will generate the Color element for every row; however, when the color is missing, this element will have an additional xsi:nil attribute. Then, with the outer query, we have to check whether this attribute appears in the Color element:
WITH TempProducts AS
(
    SELECT p1.ProductID
          ,p1.Name
          ,p1.Color
          ,(SELECT p2.Color
            FROM Production.Product AS p2
            WHERE p2.ProductID = p1.ProductID
            FOR XML AUTO, ELEMENTS XSINIL, TYPE) AS ColorXml
    FROM Production.Product AS p1
    WHERE p1.ProductId < 319
)
SELECT ProductID
      ,Name
      ,Color
      ,ColorXml.value('(//Color)[1]','nvarchar(15)') AS ColorXmlValue
FROM TempProducts
WHERE ColorXml.exist('//Color[@xsi:nil]') = 1;

Of course, the CTE query returns the same small sample (seven) rows of products with ProductID lower than 319. The outer query correctly finds the first five rows from the CTE.


Simple Associations among NULLs
It would be too easy if we could finish our research by just calculating the proportion of unknown values. In order to improve the quality of our data, we have to discover the cause of these NULLs. The first thing we can check is whether NULLs in one column lead to NULLs in another one. A relation between columns could indicate two things about the schema: the schema might not be in third normal form, i.e. we have a functional dependency between two non-key columns, or the values of a column are not applicable for all rows, and thus we should probably introduce subtypes. Let's first check whether NULLs in one column lead to NULLs in another. We need to decide which columns to check. The answer depends on the information we can get from the names of the columns and from the business experts. If the naming convention in the database is narrative, we can easily find the potential candidate columns for checking. If the naming convention is bad, we have to rely on business experts to narrow down our checking. If we do not have a business expert at hand and we cannot get a clue from the column names, then we have a problem; we have to check all pairs of columns. Fortunately, in the AdventureWorks2008R2 demo database, the naming convention is quite good. If we take a look at the Size and SizeUnitMeasureCode columns, their names tell us they are somehow related. Therefore, let's check whether NULLs are related in these two columns with the following query:
SELECT 1 AS ord
      ,Size
      ,SizeUnitMeasureCode
      ,COUNT(*) AS cnt
FROM Production.Product
GROUP BY Size, SizeUnitMeasureCode
UNION ALL
SELECT 2 AS ord
      ,Size
      ,SizeUnitMeasureCode
      ,COUNT(*) AS cnt
FROM Production.Product
GROUP BY SizeUnitMeasureCode, Size
ORDER BY ord, Size, SizeUnitMeasureCode;

Before showing the results of this query, let us add a comment on it. The result set unions two result sets; the first SELECT aggregates rows on Size and then on SizeUnitMeasureCode, while the second SELECT does the opposite and aggregates rows on SizeUnitMeasureCode first and then on Size. The first column of the result is just a constant used to guarantee the order of the combined result set. In SQL Server 2008 and 2008 R2, we can rewrite this query in a shorter way using the new GROUPING SETS subclause of the GROUP BY clause. It is not just the query that is shorter; SQL Server is usually able to find a more efficient execution plan as well, and thus execute the query faster. However, in order to guarantee the order of the result, the query with GROUPING SETS becomes more complicated, and also less efficient. As this is not a T-SQL programming book, we are going to use the query above, written without GROUPING SETS. Although the query is not as efficient as it could be, it is more readable, and thus more suitable for explaining the concepts of narrowing down the search for the reasons for NULLs. The abbreviated results are as follows:
ord  Size  SizeUnitMeasureCode  cnt
1    NULL  NULL                 293
1    38    CM                   12
…    …     …                    …
1    62    CM                   11
1    70    NULL                 1
1    L     NULL                 11
1    M     NULL                 11
1    S     NULL                 9
1    XL    NULL                 3
2    NULL  NULL                 293
2    38    CM                   12
…    …     …                    …
2    62    CM                   11
2    70    NULL                 1
2    L     NULL                 11
2    M     NULL                 11
2    S     NULL                 9
2    XL    NULL                 3

Look closely at the rows with NULLs. Whenever Size is NULL, SizeUnitMeasureCode is also NULL; however, the opposite is not true. In addition, SizeUnitMeasureCode is NULL for all sizes expressed as character codes, like L and M, and not NULL for numeric sizes, except for size 70. We can conclude that there is a strong relation between NULLs in these two columns. Of course, the size unit measure code tells us in which unit the size is measured if the size is numeric; for example, in the second row of the result set, we can see that size 38 is measured in centimeters. When the size is expressed as a character code, the measure unit makes no sense. However, we can see that something is wrong with size 70; this one should have a measure unit. Either the measure unit is missing, or size 70 should not be in the relation, as it is potentially erroneous. By researching unknown values, we can find potential errors. In addition, when looking for the root cause of the unknown values, we can omit the SizeUnitMeasureCode column; we already know where the NULLs in this column come from. Therefore, we can limit our research to the Size column only from this pair. If we do the same analysis for the Weight and WeightUnitMeasureCode columns, we will find that we can omit the WeightUnitMeasureCode column from further research as well. Finally, we can do the same thing for the ProductSubcategoryID and ProductModelID columns, and will find out that whenever ProductSubcategoryID is NULL, ProductModelID is NULL as well. Therefore, we can also omit ProductModelID from further completeness checking. How can we prevent missing size measure units? The answer lies in the schema. We can introduce a check constraint on the SizeUnitMeasureCode column that would not accept NULL values for numeric sizes (a sketch follows below), or we can create a trigger on the Production.Product table that checks whether the measure unit is known for numeric sizes. Finally, we can further normalize the schema. It is obvious that the SizeUnitMeasureCode and Size columns are functionally dependent. Figure 1 shows the normalized version of the design for size and size unit measure.
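Here is a minimal sketch of such a check constraint (our illustration, not code from the book; the ISNUMERIC test and the constraint name are assumptions, and WITH NOCHECK is used because the existing size 70 row would otherwise violate the constraint):

-- Reject new or updated rows where a numeric size has no measure unit
ALTER TABLE Production.Product WITH NOCHECK
ADD CONSTRAINT CK_Product_NumericSizeHasUnit
CHECK (Size IS NULL
       OR ISNUMERIC(Size) = 0
       OR SizeUnitMeasureCode IS NOT NULL);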

[Entity-relationship diagram: Products (PK ProductId, Name, FK SizeId, other columns) references Sizes (PK SizeId, Size, FK SizeTypeId), which references SizeTypes (PK SizeTypeId, SizeTypeName, MeasureUnit)]
Figure 1: SIZE AFTER NORMALIZATION
In a design like the one in the image above, once we insert the correct measure units for the appropriate size types, we cannot have erroneous or missing size measure units anymore. Of course, if the values are incorrect in the lookup tables, Sizes and SizeTypes, the measure units would be wrong for all products of a specific size. However, it is easier to maintain small lookup tables and have only correct values in them. The question here is whether it is worth changing the database design because of the problem with a missing size measure unit. Changing the database design could lead to upgrading one or more applications, which could be very expensive. We could prevent new incorrect values with a check constraint or a trigger instead. Even with a check constraint or a trigger, we could end up with problems in our applications. If our applications do not implement exception handling, we might end up with a crashed application when an erroneous size measure unit is inserted. Preventing errors should be our ultimate goal; however, in many cases, especially when we have to deal with legacy applications, it might be cheapest to simply find the errors and correct them on a regular schedule. Here, we just checked whether a NULL in one column leads to a NULL in another column. What if a known value in a column leads to a NULL in another column? This would mean that the second attribute is not applicable for all values of the first attribute. Subtype tables would probably be needed. In the Production.Product table, from the name of the ProductSubcategoryId column we can suspect that this column might be a potential candidate for introducing subtypes based on its values. Let's check if the Class attribute is applicable for all product subcategories using the following query:
WITH Classes AS
(
    SELECT ProductSubcategoryID
          ,Class
          ,COUNT(*) AS NRowsInClass
    FROM Production.Product
    GROUP BY ProductSubcategoryId, Class
)
SELECT c.ProductSubcategoryID
      ,c.Class
      ,c.NRowsInClass
      ,CAST(ROUND(100.0 * c.NRowsInClass /
            SUM(c.NRowsInClass) OVER(PARTITION BY c.ProductSubcategoryId), 0)
            AS decimal(3,0)) AS PctOfSubCategory
FROM Classes c
ORDER BY PctOfSubCategory DESC;

This query uses a Common Table Expression (CTE) to calculate the number of rows per subcategory and class. The outer query uses the CTE to calculate the percentage of rows of a class in a specific subcategory out of the total number of rows of the same subcategory by using the OVER clause. The result set is ordered by descending percentage. If the percentage is close to 100, it means that one class is prevalent in one subcategory. If the value of the class is NULL, it means that the class is probably not applicable for the whole subcategory. Here are partial results of this query.
ProductSubcategoryID  Class  NRowsInClass  PctOfSubCategory
6                     NULL   2             100
7                     NULL   1             100
9                     NULL   2             100
18                    NULL   3             100
…                     …      …             …

We can easily see that the Class attribute is not applicable for all subcategories. When further researching the reasons for NULLs in the Class attribute, we can exclude the rows where the values are not applicable. We can also check which other attributes are not applicable for some subcategories. Note that the subcategories leading to NULLs in Class are not necessarily the same as the subcategories leading to NULLs in other attributes. For example, subcategory 4 leads to NULLs in Color, but not to NULLs in the Class attribute. Therefore, in order to continue the research for the reason for NULLs with a narrowed-down number of rows, excluding the rows where the values are not applicable, we should do separate research for each nullable attribute. We will focus on the Class attribute and continue researching with rows where Class is applicable. The following query limits the rows to those where the Class attribute is applicable, and limits the column list to the interesting columns only.
SELECT ProductId
      ,Name
      ,Color
      ,Size
      ,Weight
      ,ProductLine
      ,Class
      ,Style
FROM Production.Product
WHERE (Color IS NULL
       OR Size IS NULL
       OR Weight IS NULL
       OR ProductLine IS NULL
       OR Class IS NULL
       OR Style IS NULL)
  AND (ProductSubcategoryId NOT IN
        (SELECT ProductSubcategoryId
         FROM Production.Product
         WHERE ProductSubcategoryId IS NOT NULL
         GROUP BY ProductSubcategoryId
         HAVING COUNT(DISTINCT Class) = 0));

Excluded columns are ProductModelId, because it is NULL whenever ProductSubcategoryId is NULL, and WeightUnitMeasureCode and SizeUnitMeasureCode, which are not needed, as they are NULL whenever Weight or Size are NULL. In addition, SellEndDate and DiscontinuedDate are not interesting for completeness checking, as from their names we can expect that most of the rows should have NULLs in these two columns, and we can easily imagine there is a business reason behind those NULLs. Of course, these two columns could be interesting from the accuracy perspective. Six interesting columns are left from the completeness perspective. The rows are limited by the WHERE clause to rows which have an unknown value in any of the six interesting columns, excluding rows where the Class column is not applicable. The final subquery in the WHERE clause is used to find all subcategories for which Class is not applicable – those without a single distinct class in the subcategory. Note that the COUNT(DISTINCT Class) aggregate function, unlike COUNT(*), eliminates NULLs from the aggregation. Therefore, if the count of distinct classes in a subcategory is zero, this means that there are only NULL classes in that subcategory.
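A tiny illustration of that behavior (our own query, not from the book):

-- COUNT(*) counts all rows; COUNT(DISTINCT Class) ignores NULLs,
-- so it returns 0 for subcategories where every Class is NULL.
SELECT ProductSubcategoryID,
       COUNT(*) AS TotalRows,
       COUNT(DISTINCT Class) AS DistinctClasses
FROM Production.Product
WHERE ProductSubcategoryID IS NOT NULL
GROUP BY ProductSubcategoryID
ORDER BY DistinctClasses, ProductSubcategoryID;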

Tuple and Relation Completeness
For tuple completeness, we can count how many columns have unknown values. For relation completeness, we can use the proportion of rows with NULLs out of the total number of rows. Before we start measuring, let's create a scalar user-defined function that accepts a parameter of the sql_variant data type, and returns one if the parameter is NULL and zero if it is not NULL.
CREATE FUNCTION dbo.ValueIsNULL (@checkval sql_variant)
RETURNS tinyint
AS
BEGIN
    DECLARE @retval int;
    IF (@checkval IS NULL)
        SET @retval = 1;
    ELSE
        SET @retval = 0;
    RETURN(@retval);
END;
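As a side note (our suggestion, not the book's approach), the same check can be written inline with a CASE expression, which avoids the per-row scalar function call:

-- Equivalent inline expression for a single column
SELECT ProductId,
       CASE WHEN Color IS NULL THEN 1 ELSE 0 END AS ColorIsNULL
FROM Production.Product;

The function keeps the following queries shorter, which is why the chapter sticks with it.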

With this function, it is easy to write a query that calculates the number of NULLs for each interesting row. Note that the following query refers to the interesting columns only, and limits the result set to rows with an applicable Class attribute only.
SELECT ProductId
      ,Name
      ,dbo.ValueIsNULL(Color) +
       dbo.ValueIsNULL(Size) +
       dbo.ValueIsNULL(Weight) +
       dbo.ValueIsNULL(ProductLine) +
       dbo.ValueIsNULL(Class) +
       dbo.ValueIsNULL(Style) AS NumberOfNULLsInRow
FROM Production.Product
WHERE (ProductSubcategoryId NOT IN
        (SELECT ProductSubcategoryId
         FROM Production.Product
         WHERE ProductSubcategoryId IS NOT NULL
         GROUP BY ProductSubcategoryId
         HAVING COUNT(DISTINCT Class) = 0))
ORDER BY NumberOfNULLsInRow DESC;

The abbreviated result from the query is as follows:
ProductId  Name     NumberOfNULLsInRow
802        LL Fork  5
803        ML Fork  5
804        HL Fork  5
…          …        …

We can say that tuples with more NULLs are less complete than tuples with fewer NULLs. We can even export this data to a staging table, repeat the tuple completeness measurement on a schedule, and compare the measures to track the improvement of tuples over time. Of course, this makes sense only if we can join measures on some common identification; we need something that uniquely identifies each row. In the Production.Product table, there is a primary key on the ProductId column. In a relational database, every table should have a primary key, and the key should not change if you want to make comparisons over time. For relation completeness, we can use two measures: the total number of NULLs in the relation and the number of rows with a NULL in any of the columns. The following query does both calculations for the Production.Product table, limited to rows with an applicable Class attribute only.
SELECT 'Production' AS SchemaName
      ,'Product' AS TableName
      ,COUNT(*) AS NumberOfRowsMeasured
      ,SUM( dbo.ValueIsNULL(Color) +
            dbo.ValueIsNULL(Size) +
            dbo.ValueIsNULL(Weight) +
            dbo.ValueIsNULL(ProductLine) +
            dbo.ValueIsNULL(Class) +
            dbo.ValueIsNULL(Style) ) AS TotalNumberOfNULLs
      ,SUM( CASE WHEN ( dbo.ValueIsNULL(Color) +
                        dbo.ValueIsNULL(Size) +
                        dbo.ValueIsNULL(Weight) +
                        dbo.ValueIsNULL(ProductLine) +
                        dbo.ValueIsNULL(Class) +
                        dbo.ValueIsNULL(Style) ) > 0
            THEN 1 ELSE 0 END ) AS NumberOfRowsWithNULLs
FROM Production.Product
WHERE (ProductSubcategoryId NOT IN
        (SELECT ProductSubcategoryId
         FROM Production.Product
         WHERE ProductSubcategoryId IS NOT NULL
         GROUP BY ProductSubcategoryId
         HAVING COUNT(DISTINCT Class) = 0));

The result of the relation completeness query is as follows:
SchemaName  TableName  NumberOfRowsMeasured  TotalNumberOfNULLs  NumberOfRowsWithNULLs
Production  Product    237                   222                 61

We can continue with such measurements for each table in the database. We can also store the results in a data quality data warehouse, as proposed in chapter 1 of this book. This way, we can measure improvements over time. After all, one of the most important goals when implementing an MDM solution is data quality improvement.
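A minimal sketch of such a storage table (the table and column names are our assumptions, not the design from chapter 1):

-- One row per table per measurement run
CREATE TABLE dbo.CompletenessLog (
    MeasureDate           date    NOT NULL,
    SchemaName            sysname NOT NULL,
    TableName             sysname NOT NULL,
    NumberOfRowsMeasured  int     NOT NULL,
    TotalNumberOfNULLs    int     NOT NULL,
    NumberOfRowsWithNULLs int     NOT NULL,
    CONSTRAINT PK_CompletenessLog
        PRIMARY KEY (MeasureDate, SchemaName, TableName)
);
-- The relation completeness query above can then be wrapped in an
-- INSERT INTO dbo.CompletenessLog ... SELECT CAST(GETDATE() AS date), ...;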

Multivariate Associations among NULLs
We have presented a way to find whether there is some association between missing values in two columns. A slightly more advanced question is whether there is an association among missing values in multiple columns; for example, if a value is missing in all or any of the Color, Class and Style columns of the Production.Product table, does this fact lead to a missing value in the Weight column?

There are many multivariate analyses known from statistics and data mining. SQL Server Analysis Services (SSAS) brings very strong data mining support; all of the most popular algorithms are implemented. For the problem mentioned, Naïve Bayes is an appropriate algorithm. It checks the associations between pairs of variables, and then calculates joint associations, i.e., associations between multiple input variables and a target variable. The target variable is called the predictable variable, because we can use the model to predict its values in new cases when we have values for the input variables. In SSAS, we can declare all variables as both input and predictable, thus getting one Naïve Bayes analysis per variable.

Microsoft also ships data mining viewers, controls that graphically show the knowledge learned by the algorithm. The Naïve Bayes Dependency Network viewer shows all links between all variables on a single screen; it shows all models simultaneously. This way we can easily find the most important links and thus understand which target variable has the strongest associations with which input variables.

Let's start working. First, we need a view in the AdventureWorks2008R2 database that selects only the interesting columns for applicable values of the Class column of the Production.Product table, as in the following code:
CREATE VIEW dbo.ProductsMining
AS
SELECT ProductId
      ,Name
      ,dbo.ValueIsNULL(Color) AS Color
      ,dbo.ValueIsNULL(Size) AS Size
      ,dbo.ValueIsNULL(Weight) AS Weight
      ,dbo.ValueIsNULL(ProductLine) AS ProductLine
      ,dbo.ValueIsNULL(Class) AS Class
      ,dbo.ValueIsNULL(Style) AS Style
FROM Production.Product
WHERE (ProductSubcategoryId NOT IN
       (SELECT ProductSubcategoryId
        FROM Production.Product
        WHERE ProductSubcategoryId IS NOT NULL
        GROUP BY ProductSubcategoryId
        HAVING COUNT(DISTINCT Class) = 0));

Next, we need to create an Analysis Services project in Business Intelligence Development Studio (BIDS). We can do this by completing the following steps:

1. Open BIDS.
2. In BIDS, create a new SSAS project. Name the solution MDSBook_Ch03 and the project MDSBook_Ch03_SSAS, and save the solution to any folder you want – we suggest the C:\MDSBook\Chapter03 folder.
3. In Solution Explorer, right-click on the Data Sources folder. Create a new Data Source for the AdventureWorks2008R2 database. Connect to the instance where you have deployed the demo databases. Select the AdventureWorks2008R2 database. Use the service account impersonation. Use the default name for the data source (Adventure Works2008R2).
4. In Solution Explorer, right-click on the Data Source Views folder. Create a new Data Source View based on the data source from the previous step. From the available objects, select only the ProductsMining (dbo) view you just created. Use the default name for the data source view (Adventure Works2008R2).
5. In Solution Explorer, right-click on the Mining Structures folder and select New Mining Structure.
6. Use the existing relational database or data warehouse.
7. Select the Microsoft Naïve Bayes technique.
8. Use the Adventure Works2008R2 data source view.
9. Specify ProductsMining as the case table.
10. Use ProductId as the key column (selected by default), and Class, Color, ProductLine, Size, Style and Weight as input and predictable columns, as in figure 2.

Figure 2: TRAINING DATA COLUMNS

11. In the Specify Columns' Content and Data Type screen, make sure the content type is discrete for all input and target variables. If needed, change the content to Discrete (in SQL 2008 R2 RTM, the wizard usually detects the content of these columns inappropriately as Discretized).
12. Use the default percentage (30%) of data for the test set (this is not important for the usage you are performing, as you have a single model only; this is important if you want to compare the accuracy of predictions of multiple models).
13. Name the structure Products Mining Completeness, and the model Products Mining Completeness Naïve Bayes.
14. Click Finish.
15. Save, deploy and process the project by right-clicking on the project in the Solution Explorer window and selecting Deploy.

16. When the processing is finished, click on the Mining Model Viewer tab. The default viewer is the Dependency Network viewer.
17. Lower the slider on the left to get the more important links. For example, put the slider on the left side of the viewer on the seventh position from the bottom, and then click on the Size attribute. You should see something like what figure 3 shows.

Figure 3: DEPENDENCY NETWORK VIEWER

18. You can easily see that ProductLine, Style and Weight predict Size, while Size also predicts Class. Color has no strong associations with other attributes.
19. Check the other viewers, for example the Attribute Discrimination viewer. Select the Size attribute, and compare value 0 with value 1, like in figure 4.

Figure 4: ATTRIBUTE DISCRIMINATION VIEWER

20. You can see that the Style attribute is strongly linked with Size; i.e., if Style is NULL, it is highly probable that Size is NULL as well (value 1 in the picture, as the dbo.ValueIsNULL function returns 1 for NULLs).

As we can see from this example, we can easily find associations between missing values by using data mining. Data mining can help you a lot in searching for the root cause. Before finishing with this example, we have to add a quick note. We can also use the Data Source View (DSV) for a quick overview of your data by completing the following steps:

21. In the DSV, right-click on the ProductsMining view, and select Explore Data.
22. Use the Chart tab and select the Class, Color, ProductLine, Size, Style and Weight columns.
23. Check the other views as well. When finished, close the Explore Data view. Do not close BIDS yet if you want to follow further examples in this chapter.

Profiling the Accuracy
Similarly to completeness, accuracy can be defined on the attribute, tuple or relation level. However, measuring accuracy is more complicated than measuring completeness. Sometimes we can easily define which data is inaccurate. The data can be duplicated, out of a predefined range, or inaccurate in some other way that can easily be spotted and discovered with simple queries. For example, it is clear that something is wrong if an employee's birth date is later than the employment date. However, is there something wrong if a twenty-year-old employee has a Ph.D. in medicine? This may be an error. Nevertheless, we cannot say so without checking further. With code, we cannot always find inaccurate data. Many times we can find only suspicious data, and then check this data manually. Therefore, it makes a lot of sense to find suspicious data programmatically. If we can narrow down manual checking from ten million rows to only ten thousand suspicious rows, we have made fantastic progress.

Finding inaccurate data differs depending on whether we are dealing with numbers, strings, or other data types. For completeness, we showed examples dealing with products. Another aspect of master data is customers. For profiling accuracy we are going to deal with customers, where we typically have many different types of data gathered. We are going to base our examples on the dbo.vTargetMail view from the AdventureWorksDW2008R2 demo database. This database, also part of the product samples, represents an analytical data warehouse. It is already denormalized, using a star schema. The dbo.vTargetMail view joins customers' demographic data from multiple tables of the AdventureWorks2008R2 database, and is thus more suitable for demonstrating techniques for finding inaccurate data than the source tables, where we would have to deal with multiple joins.

Because the sample data is accurate, or at least intended to be accurate, we are going to create a view based on the mentioned vTargetMail view. Our view selects a subset of rows from the original view and adds some erroneous rows. We have worked with the AdventureWorks2008R2 database so far; we can continue working in this context, in order to have less work with cleanup code after we finish with this chapter. The following code creates the view we are going to use in the AdventureWorks2008R2 database.
CREATE VIEW dbo.vTargetMailDirty
AS
SELECT TOP 500
 CustomerKey, FirstName, MiddleName, LastName, BirthDate,
 MaritalStatus, Gender, EmailAddress, YearlyIncome, TotalChildren,
 NumberChildrenAtHome, EnglishEducation AS Education,
 EnglishOccupation AS Occupation, HouseOwnerFlag, NumberCarsOwned,
 AddressLine1 AS Address, Phone, DateFirstPurchase, CommuteDistance,
 Region, Age, BikeBuyer
FROM AdventureWorksDW2008R2.dbo.vTargetMail
WHERE CustomerKey < 12000
UNION
-- misspelled MiddleName, repeated CustomerKey 11000
SELECT 99000, N'Jon', N'VeryLongMiddleName', N'Yang', '19660408', N'M', N'M',
 N'[email protected]', 90000, 2, 0, N'Bachelors', N'Professional', 1, 0,
 N'3761 N. 14th St', N'1 (11) 500 555-0162', '20010722', N'1-2 Miles',
 N'Pacific', 42, 1
UNION
-- duplicate PK, repeated CustomerKey 11000
SELECT 99000, N'Jon', N'V', N'Yang', '19660408', N'M', N'M',
 N'[email protected]', 90000, 2, 0, N'Bachelors', N'Professional', 1, 0,
 N'3761 N. 14th St', N'1 (11) 500 555-0162', '20010722', N'1-2 Miles',
 N'Pacific', 42, 1
UNION
-- wrong EmailAddress, repeated CustomerKey 11001
SELECT 99001, N'Eugene', N'L', N'Huang', '19650514', N'S', N'M',
 N'eugene10#adventure-works.com', 60000, 3, 3, N'Bachelors', N'Professional', 0, 1,
 N'2243 W St.', N'1 (11) 500 555-0110', '20010718', N'0-1 Miles',
 N'Pacific', 42, 1
UNION
-- BirthDate out of range, repeated CustomerKey 11001
SELECT 99002, N'Eugene', N'L', N'Huang', '18650514', N'S', N'M',
 N'[email protected]', 60000, 3, 3, N'Bachelors', N'Professional', 0, 1,
 N'2243 W St.', N'1 (11) 500 555-0110', '20010718', N'0-1 Miles',
 N'Pacific', DATEDIFF(YY, '18650514', GETDATE()), 1
UNION
-- misspelled Occupation, repeated CustomerKey 11002
SELECT 99003, N'Ruben', NULL, N'Torres', '19650812', N'M', N'M',
 N'[email protected]', 60000, 3, 3, N'Bachelors', N'Profesional', 1, 1,
 N'5844 Linden Land', N'1 (11) 500 555-0184', '20010710', N'2-5 Miles',
 N'Pacific', 42, 1
UNION
-- Phone written as 'Phone: ' + number
SELECT CustomerKey, FirstName, MiddleName, LastName, BirthDate,
 MaritalStatus, Gender, EmailAddress, YearlyIncome, TotalChildren,
 NumberChildrenAtHome, EnglishEducation AS Education,
 EnglishOccupation AS Occupation, HouseOwnerFlag, NumberCarsOwned,
 AddressLine1 AS Address, N'Phone: ' + Phone, DateFirstPurchase, CommuteDistance,
 Region, Age, BikeBuyer
FROM AdventureWorksDW2008R2.dbo.vTargetMail
WHERE CustomerKey > 12000
  AND CustomerKey % 500 = 3;

Numeric, Date and Discrete Attributes Profiling
We are going to start by finding inaccuracies in a single attribute. As we already mentioned, the techniques differ for different data types; we start with dates. For finding potentially inaccurate dates, the MIN and MAX T-SQL aggregate functions are very useful. The following query finds the oldest and the youngest person among our customers.
SELECT CustomerKey
      ,FirstName
      ,LastName
      ,BirthDate
FROM dbo.vTargetMailDirty
WHERE BirthDate = (SELECT MIN(BirthDate) FROM dbo.vTargetMailDirty)
   OR BirthDate = (SELECT MAX(BirthDate) FROM dbo.vTargetMailDirty);

The results reveal suspicious data: the oldest person was born in the year 1865.
CustomerKey  FirstName  LastName    BirthDate
99002        Eugene     Huang       1865-05-14
11132        Melissa    Richardson  1984-10-26

Finding suspicious data mostly translates to finding outliers, i.e., rare values that are far out of bounds. We can use a similar technique for continuous numeric values. With a couple of standard T-SQL aggregate functions, we can easily get an idea of the distribution of values, and then compare the minimal and maximal values with the average. In addition, the standard deviation tells us how spread out the distribution is in general. The less spread out it is, the more likely the outliers are inaccurate. The following query calculates these basic descriptive statistics for the Age column, which is calculated from the BirthDate column.
SELECT MIN(Age) AS AgeMin
      ,MAX(Age) AS AgeMax
      ,AVG(CAST(Age AS float)) AS AgeAvg
      ,STDEV(Age) AS AgeStDev
FROM dbo.vTargetMailDirty;

The result shows us something we already knew: the oldest person in the data probably has a wrong Age, and since Age is calculated, the BirthDate has to be wrong as well.
AgeMin  AgeMax  AgeAvg            AgeStDev
26      146     50.1018518518519  13.1247332792511
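Such outliers can also be listed directly. The following is only a sketch; the three-standard-deviations threshold is our own choice, explained in the interpretation that follows.

-- Sketch: rows whose Age lies more than three standard deviations from the mean
WITH AgeStats AS
(
 SELECT AVG(CAST(Age AS float)) AS AgeAvg
       ,STDEV(Age) AS AgeStDev
 FROM dbo.vTargetMailDirty
)
SELECT d.CustomerKey
      ,d.FirstName
      ,d.LastName
      ,d.Age
FROM dbo.vTargetMailDirty AS d
     CROSS JOIN AgeStats AS s
WHERE ABS(d.Age - s.AgeAvg) > 3 * s.AgeStDev;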

Before we move on to discrete attributes, let us mention how to interpret these basic descriptive statistics. We should expect a normal, Gaussian distribution of ages around the average age. In a normal distribution, around 68% of the data should lie within one standard deviation on either side of the mean, about 95% within two standard deviations, and about 99% within three standard deviations. The minimal age is less than two standard deviations from the mean (50 – 2 * 13 = 24 years), while the maximal age is more than seven standard deviations from the mean. There is a very, very low probability of data lying more than seven standard deviations from the average value; for example, there is already less than a 0.5% probability of data lying outside the interval that ends three standard deviations from the mean. Thus, we can conclude there is something wrong with the maximal value in the Age column.

How do we find outliers in discrete columns? No matter whether they are numeric, dates or strings, they can take a value from a discrete pool of possible values only. Let's say we do not know the pool in advance. We can still try to find suspicious values by measuring the frequency distribution of all values of an attribute. A value with a very low frequency is potentially an erroneous value. The following query finds the frequency distribution for the Occupation column.
WITH freqCTE AS
(
 SELECT v.Occupation
       ,AbsFreq = COUNT(v.Occupation)
       ,AbsPerc = CAST(ROUND(100. * (COUNT(v.Occupation)) /
                             (SELECT COUNT(*) FROM vTargetMailDirty)
                      ,0) AS int)
 FROM dbo.vTargetMailDirty v
 GROUP BY v.Occupation
)
SELECT c1.Occupation
      ,c1.AbsFreq
      ,CumFreq = (SELECT SUM(c2.AbsFreq)
                  FROM freqCTE c2
                  WHERE c2.Occupation <= c1.Occupation)
      ,c1.AbsPerc
      ,CumPerc = (SELECT SUM(c2.AbsPerc)
                  FROM freqCTE c2
                  WHERE c2.Occupation <= c1.Occupation)
      ,Histogram = CAST(REPLICATE('*', c1.AbsPerc) AS varchar(50))
FROM freqCTE c1
ORDER BY c1.Occupation;

The query uses a CTE to calculate the absolute frequency and the absolute percentage, and calculates the cumulative values in the outer query with correlated subqueries. Note again that this is not the most efficient query; however, we want to show the techniques we can use, and performance is not our main goal here. In addition, we typically do not execute these queries very frequently. In the result of the query, we can see the suspicious occupation. The "Profesional" value is present in a single row only. As it is very similar to the "Professional" value, we can conclude that this is an error.
Occupation      AbsFreq  CumFreq  AbsPerc  CumPerc  Histogram
Clerical        68       68       13       13       *************
Management      134      202      25       38       *************************
Manual          64       266      12       50       ************
Profesional     1        267      0        50
Professional    157      424      29       79       *****************************
Skilled Manual  116      540      21       100      *********************

Strings Profiling
Catching errors in unconstrained strings, like names and addresses, is one of the most challenging data profiling tasks. Because there are no constraints, it is not possible to say in advance what is correct and what is incorrect. Still, the situation is not hopeless. We are going to show a couple of queries we can use to find string inconsistencies. In a character column, strings have different lengths in different rows; however, the lengths typically follow either a nearly uniform or a normal distribution. In both cases, strings that are extremely long or short might be errors. Therefore, we are going to start our profiling by calculating the distribution of string lengths. The following example checks the lengths of middle names.
SELECT LEN(MiddleName) AS MNLength
      ,COUNT(*) AS Number
FROM dbo.vTargetMailDirty
GROUP BY LEN(MiddleName)
ORDER BY Number;

The vast majority of middle names are either unknown or one character long.
MNLength  Number
18        1
2         1
NULL      235
1         303

Of course, it is easy to find middle names that are more than one character long with the next query.

SELECT CustomerKey
      ,FirstName
      ,LastName
      ,MiddleName
FROM dbo.vTargetMailDirty
WHERE LEN(MiddleName) > 1;

We see that one middle name is definitely wrong (of course, this is the one we added intentionally). In addition, the middle name that is two characters long might be written inconsistently as well. It looks like middle names should be written with a single letter, without a dot after it.
CustomerKey  FirstName  LastName  MiddleName
11377        David      Robinett  R.
99000        Jon        Yang      VeryLongMiddleName
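If we adopt the convention that middle names should be stored as a single letter without a trailing dot, a quick check for the dotted variants could look like the following sketch.

-- Sketch: middle names written with a dot
SELECT CustomerKey
      ,FirstName
      ,LastName
      ,MiddleName
FROM dbo.vTargetMailDirty
WHERE MiddleName LIKE '%.%';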

Sometimes we know what strings should look like; this means we have patterns for the strings. We can check for basic patterns with the T-SQL LIKE operator. For example, we would not expect any letters in phone numbers. Let's check them with the following query.
SELECT CustomerKey
      ,FirstName
      ,LastName
      ,Phone
FROM dbo.vTargetMailDirty
WHERE Phone LIKE '%[A-Z]%';

From the abbreviated results, we can see that there are some phone numbers that include letters. It seems that some operator consistently uses the prefix "Phone: " when entering phone numbers.
CustomerKey  FirstName  LastName   Phone
12003        Audrey     Munoz      Phone: 1 (11) 500 555-0124
12503        Casey      Shen       Phone: 1 (11) 500 555-0148
13003        Jill       Hernandez  Phone: 1 (11) 500 555-0197
13503        Theodore   Gomez      Phone: 1 (11) 500 555-0167
14003        Angel      Ramirez    Phone: 488-555-0166
…            …          …          …

More advanced pattern matching can be done with regular expressions. Regular expressions can be treated as the LIKE operator on steroids. T-SQL does not support regular expressions out of the box. However, we are not powerless. From version 2005, SQL Server supports CLR objects, including functions, stored procedures, triggers, user-defined types and user-defined aggregates. A simple .NET function, written in either Visual C# or Visual Basic, can do the work. Going into the details of using CLR code inside SQL Server is out of the scope of this book. Nevertheless, the CLR project with this function is added to the accompanying code, and we can use it. Here is the Visual C# code for the function.
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

public partial class CLRUtilities
{
    [SqlFunction(DataAccess = DataAccessKind.None)]
    public static SqlBoolean IsRegExMatch(SqlString inpStr, SqlString regExStr)
    {
        return (SqlBoolean)Regex.IsMatch(inpStr.Value, regExStr.Value,
                                         RegexOptions.CultureInvariant);
    }
};

The Boolean function accepts two parameters: the string to check and the regular expression. It returns true if the string matches the pattern and false otherwise. Before we can use the function, we have to import the assembly into the SQL Server database, and create and register the function. The CREATE ASSEMBLY command imports an assembly; the CREATE FUNCTION command registers the CLR function. After the function is registered, we can use it like any built-in T-SQL function. The T-SQL code for importing the assembly and registering the function is as follows:
CREATE ASSEMBLY MDSBook_Ch03_CLR
FROM 'C:\MDSBook\Chapter03\MDSBook_Ch03\MDSBook_Ch03_CLR\bin\Debug\MDSBook_Ch03_CLR.dll'
WITH PERMISSION_SET = SAFE;
GO
CREATE FUNCTION dbo.IsRegExMatch
 (@inpstr AS nvarchar(max), @regexstr AS nvarchar(max))
RETURNS BIT
WITH RETURNS NULL ON NULL INPUT
EXTERNAL NAME MDSBook_Ch03_CLR.CLRUtilities.IsRegExMatch;

Note that this code expects the assembly to be in the "C:\MDSBook\Chapter03\MDSBook_Ch03\MDSBook_Ch03_CLR\bin\Debug\" folder. If you have it in another folder, change the path accordingly. After the function is registered in SQL Server, we can use it like any other T-SQL function. We are going to use it for checking e-mail addresses, as in the following code.
SELECT CustomerKey
      ,FirstName
      ,LastName
      ,EmailAddress
      ,dbo.IsRegExMatch(EmailAddress,
        N'(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})') AS IsEmailValid
FROM dbo.vTargetMailDirty
WHERE dbo.IsRegExMatch(EmailAddress,
       N'(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})') = CAST(0 AS bit);

In the results, we can see that we found an incorrect e-mail address.
CustomerKey  FirstName  LastName  EmailAddress                  IsEmailValid
99001        Eugene     Huang     eugene10#adventure-works.com  0

You can learn more about regular expressions on MSDN. In addition, there are many sites on the Web where developers freely exchange regular expressions.
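The same CLR function can, of course, check other patterns as well. For example, the following sketch flags phone numbers containing anything besides digits, spaces, parentheses, plus signs and dashes; the pattern is our own choice, made to match the format of the sample data.

-- Sketch: phone numbers that do not match a simple digits-and-punctuation pattern
SELECT CustomerKey
      ,FirstName
      ,LastName
      ,Phone
FROM dbo.vTargetMailDirty
WHERE dbo.IsRegExMatch(Phone, N'^[0-9()\s+\-]+$') = CAST(0 AS bit);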

Other Simple Profiling
It is also very simple to check for the uniqueness of rows in a table. If we do not use Primary Key and Unique constraints in our database, we can get duplicate rows. We have to check which columns can potentially form a key, i.e., we have to check candidate keys. If they are unique, they are useful as keys. Note that keys should not only be unique but also known; therefore, checking for completeness should be part of candidate key checks as well.
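Such a candidate key completeness check can be as simple as counting the missing values in the key column; the following is just a sketch for the CustomerKey column.

-- Sketch: completeness check for the CustomerKey candidate key
SELECT COUNT(*) AS NumberOfRows
      ,SUM(CASE WHEN CustomerKey IS NULL THEN 1 ELSE 0 END) AS NumberOfNULLKeys
FROM dbo.vTargetMailDirty;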

In the following example we are checking the uniqueness of rows over a single column only – the CustomerKey column. Nevertheless, the query would not be much more complicated if we were to check the uniqueness of a composite candidate key. We would only have to use all of the columns that form the candidate key in every place where we use CustomerKey.
SELECT CustomerKey
      ,COUNT(*) AS Number
FROM dbo.vTargetMailDirty
GROUP BY CustomerKey
HAVING COUNT(*) > 1
ORDER BY Number DESC;

Of course, we get the duplicated CustomerKey in the result:
CustomerKey  Number
99000        2

Let us also briefly mention how we can profile XML data. In the completeness part of this chapter, we mentioned the XML data type methods and showed how we can use them in T-SQL queries. For accuracy, we can use the .value() method of the XML data type to extract element and attribute values and represent them as scalar values of the built-in SQL Server data types. After that, we can use the same methods as we used for finding inaccuracies in attributes of built-in scalar data types. Therefore, dealing with XML data type attributes does not differ much from what we have seen so far in the accuracy part of this chapter. Finally, we have to mention that SQL Server supports validating a complete XML instance against an XML schema collection. In an XML schema collection, we can have multiple XML schemas in XSD format. An XML instance has to validate against one of the schemas from the collection; otherwise SQL Server rejects the instance. As we mentioned for check constraints, we should also try to use XML schema collection validation in our databases, in order to enforce the data integrity rules.
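A minimal sketch of both ideas follows; the element names and the schema are purely illustrative and not part of the sample database.

-- Sketch: extracting a scalar value from an XML value with the .value() method
DECLARE @resume xml = N'<Resume><Education>Bachelors</Education></Resume>';
SELECT @resume.value('(/Resume/Education)[1]', 'nvarchar(50)') AS Education;
GO
-- Sketch: an XML schema collection used to validate XML instances
CREATE XML SCHEMA COLLECTION dbo.ResumeSchemaCollection AS
N'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:element name="Resume">
     <xsd:complexType>
       <xsd:sequence>
         <xsd:element name="Education" type="xsd:string" />
       </xsd:sequence>
     </xsd:complexType>
   </xsd:element>
 </xsd:schema>';
GO
-- A typed XML variable (or column) rejects instances that do not conform
DECLARE @typedResume xml(dbo.ResumeSchemaCollection) =
 N'<Resume><Education>Bachelors</Education></Resume>';
GO
DROP XML SCHEMA COLLECTION dbo.ResumeSchemaCollection;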


Multivariate Accuracy Analysis
As with completeness, we also want to find the root cause of inaccurate data. We should ask ourselves whether specific values in some attributes lead to errors in other attributes. If we can find the attributes that influence errors in other attributes, we can focus just on them, and make huge improvements in data quality with minimal effort. Again, we are going to use data mining for this task. We could use the Naïve Bayes algorithm for finding the root cause of inaccurate data, as we used it for finding the causes of incompleteness. However, this time we are going to introduce another useful algorithm, the Decision Trees algorithm. Decision Trees work on discrete data. Note that SQL Server Decision Trees accept continuous attributes; however, the algorithm then changes to Regression Trees. In the following example, we are going to use discretized attributes only, in order to make sure we get Decision Trees.

Before we start exploring the data, let's briefly explain how Decision Trees work. Decision Trees is a directed technique. A target variable is the one that holds information about a particular decision, divided into a few discrete and broad categories (yes / no; liked / partially liked / disliked; etc.). We are trying to explain this decision using other gleaned information saved in other variables (demographic data, purchasing habits, etc.). The process of building the tree is recursive partitioning. The data is split into partitions using a certain value of one of the explaining variables; the partitions are then split again and again. Initially the data is in one big box. The algorithm tries all possible breaks of all input (explaining) variables for the initial split. The goal is to get purer partitions in terms of the target variable. The purity is related to the distribution of the target variable: we want the distribution of the target variable in each branch of the tree to be as pure as possible, i.e., dominated by a single value. The tree continues to grow using the new partitions as separate starting points and splitting them further.

We then have to stop the process somewhere; otherwise we could get a completely fitting tree with only one case in each class. Each class would be, of course, absolutely pure. This would not make any sense; the results could not be used for any meaningful prediction. This phenomenon is called "over-fitting". We are not going into details here; let's just mention that we can control the growth of the tree with algorithm parameters.

We are now going to analyze what leads to an incorrect phone number. The following view filters out the rows with a duplicate primary key (data mining models require a key in the source data) and adds a flag showing whether the phone number is valid or not.
CREATE VIEW dbo.TargetMailMining
AS
SELECT CustomerKey
      ,FirstName
      ,LastName
      ,MaritalStatus
      ,Gender
      ,YearlyIncome
      ,TotalChildren
      ,NumberChildrenAtHome
      ,Education
      ,Occupation
      ,HouseOwnerFlag
      ,NumberCarsOwned
      ,CommuteDistance
      ,Region
      ,Age
      ,BikeBuyer
      ,CASE WHEN Phone LIKE '%[A-Z]%'
            THEN N'False'
            ELSE N'True'
       END AS PhoneValid
FROM dbo.vTargetMailDirty
WHERE CustomerKey <> 99000;

To create the Decision Trees model, complete the following steps:
1. If you have already closed the project with the Naïve Bayes model, reopen it in BIDS.
2. In Solution Explorer, double-click on the Adventure Works2008R2 data source view in the Data Source Views folder to open the Data Source View designer.
3. Right-click in the empty space in the designer window and select the Add/Remove Tables option. Find the TargetMailMining view and add it to the Data Source View.

4. In Solution Explorer, right-click on the Mining Structures folder, and select New Mining Structure.
5. Use the existing relational database or data warehouse.
6. Select the Microsoft Decision Trees technique.
7. Use the Adventure Works2008R2 data source view.
8. Specify TargetMailMining as the case table.
9. Use CustomerKey as the key column and PhoneValid as the predictable column. Use Age, BikeBuyer, CommuteDistance, Education, Gender, HouseOwnerFlag, MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Occupation, Region, TotalChildren and YearlyIncome as input columns. Check whether your selection matches figure 5.

Figure 5: COLUMNS USAGE FOR THE DECISION TREES ANALYSIS

10. In the Specify Columns' Content and Data Type screen, click on the Detect button. Be sure that the content of the Bike Buyer attribute is discrete.
11. Still in the Specify Columns' Content and Data Type screen, make sure that the content of all attributes is discrete, except Age and YearlyIncome, which should be continuous, and CustomerKey, which is the key.
12. In the Create Testing Set window, accept the default split option to use 30% of the data for testing.
13. Name the structure Target Mail Mining Accuracy, and the model Target Mail Mining Accuracy Decision Trees. Click Finish.
14. In a real-life project, you should discretize continuous columns in advance, and define the discretization buckets from a business perspective. Nevertheless, you can discretize in SSAS as well. Discretize the Age and Yearly Income mining structure columns into five groups using the EqualAreas method. Use the Properties window of each column in the Mining Structure tab of the Data Mining Designer: change the Content property to Discretized, the DiscretizationBucketCount property to 5 and the DiscretizationMethod property to Automatic.
15. In the demo data, there is no strong rule for invalid phones based on other attributes. Although no pattern is a piece of information as well, for the following example let's force a deeper tree through the algorithm parameters. In the Data Mining Designer window, click on the Mining Models tab.
16. Right-click on the Decision Trees model and select Set Algorithm Parameters. Change the COMPLEXITY_PENALTY parameter to 0.1, and the MINIMUM_SUPPORT parameter to 5, as in figure 6.

Figure 6: SETTING THE ALGORITHM PARAMETERS FOR THE DECISION TREES ANALYSIS

17. Save, deploy and process the model (or the whole project).
18. Click on the Mining Model Viewer tab. Change the Background drop-down list to False. Now you can easily spot what leads to invalid phone numbers. Darker blue nodes mean nodes with more invalid phone numbers. In the example in figure 7, it seems that having House Owner Flag set to 0, Yearly Income between 75,650 and 98,550, and Commute Distance not between zero and one mile, is a good way to have an invalid phone number.

Figure 7: DECISION TREE MINING MODEL VIEWER

19. Check the Dependency Network viewer as well.
20. Save the project. Do not close BIDS if you want to try the next exercise.

Although for this example we had to tweak the Decision Trees parameters in order to force the growth of the tree and get at least some results, the example validly shows the process of finding root causes for inaccurate data.

Measuring Information
The next hard data quality dimension is information. The amount of information in data is not important from a correctness point of view; however, it makes sense to know it if we want to use our data for business intelligence analyses. In mathematics, in Information Theory, the measure for the amount of information is Entropy. Before quoting the formula, let's explain the idea behind Entropy.

In real life, information is the same thing as surprise. If a friend tells you something and you are not surprised, this means you already knew it. Now, let's say we have a discrete attribute. How many times can we be surprised by its value, i.e., its state? For a start, let's say we have two possible states with equal distribution. We would expect one state; in 50% of the cases we would be surprised, because we would get the other state. Now imagine we have a skewed distribution, and one state has 90% probability. We would expect this state, and be surprised in only 10% of the cases. With a totally skewed distribution, where all cases have the same single state, we would never have a chance to be surprised. Now let's say the attribute has three possible states with equal distribution. We still expect one specific state; however, in this case, we would be surprised in 67% of the cases. With four possible states and equal distribution, we would be surprised in 75% of the cases.

We can conclude the following: the more possible states a discrete variable has, the higher the maximal possible information in this attribute; and the more equally the distribution is spread across the possible values, the more actual information the attribute holds. The Entropy formula reflects this:
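In its standard (Shannon) form, for an attribute X with n possible states x_i, each occurring with relative frequency P(x_i), the formula is (this is exactly what the query below implements with LOG(Px)/LOG(2)):

H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)

The maximal possible entropy for n states is H_max(X) = \log_2 n, and the ratio H(X) / H_max(X) gives the normalized entropy used below.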

We do not have support for Entropy calculation out of the box in T-SQL. However, we can calculate the probabilities, i.e., the frequencies of the states, and from them the Entropy. The following query calculates the Entropy for the Education, Region and Gender columns:
WITH DistinctValues AS
(
 SELECT COUNT(DISTINCT Education) AS DistinctEducation
       ,COUNT(DISTINCT Region) AS DistinctRegion
       ,COUNT(DISTINCT Gender) AS DistinctGender
 FROM dbo.vTargetMailDirty
)
SELECT N'H(Education)' AS Measure
      ,(SELECT DistinctEducation FROM DistinctValues) AS DistinctVals
      ,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) AS Actual
      ,(1.0*LOG((SELECT DistinctEducation FROM DistinctValues))/LOG(2)) AS Maximal
      ,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) /
       (1.0*LOG((SELECT DistinctEducation FROM DistinctValues))/LOG(2)) AS Normalized
FROM (SELECT Education
            ,1.0*COUNT(Education) /
             (SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
      FROM dbo.vTargetMailDirty
      GROUP BY Education) AS Prob
UNION
SELECT N'H(Region)' AS Measure
      ,(SELECT DistinctRegion FROM DistinctValues) AS DistinctVals
      ,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) AS Actual
      ,(1.0*LOG((SELECT DistinctRegion FROM DistinctValues))/LOG(2)) AS Maximal
      ,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) /
       (1.0*LOG((SELECT DistinctRegion FROM DistinctValues))/LOG(2)) AS Normalized
FROM (SELECT Region
            ,1.0*COUNT(Region) /
             (SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
      FROM dbo.vTargetMailDirty
      GROUP BY Region) AS Prob
UNION
SELECT N'H(Gender)' AS Measure
      ,(SELECT DistinctGender FROM DistinctValues) AS DistinctVals
      ,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) AS Actual
      ,(1.0*LOG((SELECT DistinctGender FROM DistinctValues))/LOG(2)) AS Maximal
      ,SUM((-1.0)*Px*(LOG(Px)/LOG(2))) /
       (1.0*LOG((SELECT DistinctGender FROM DistinctValues))/LOG(2)) AS Normalized
FROM (SELECT Gender
            ,1.0*COUNT(Gender) /
             (SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
      FROM dbo.vTargetMailDirty
      GROUP BY Gender) AS Prob
ORDER BY Normalized;

The query calculates the number of distinct states in a CTE. This information is used to calculate the maximal possible Entropy. Then, in the derived tables in the FROM part of the outer queries, the query calculates the actual distribution for each of the analyzed columns separately. In the SELECT part of the outer query for each analyzed column, the actual Entropy is calculated. In addition, the normalized Entropy is calculated by dividing the actual by the maximal Entropy, to show which variable has relatively more information, or better said, which variable has a more equal distribution. Finally, all outer queries are combined into a single query with UNION operators. The result is as follows:
Measure       DistinctVals  Actual             Maximal           Normalized
H(Education)  5             2.14581759656851   2.32192809488736  0.924153336743447
H(Region)     3             1.54525421415102   1.58496250072116  0.974946860539558
H(Gender)     2             0.999841673765157  1                 0.999841673765157

From the results, we can see that we have the most information in the Education column; however, the Gender column has the most equal distribution. For continuous attributes, we can calculate the Entropy in the same way if we discretize them. We have to discretize them into classes (or bins) of equal width, in order to preserve the shape of the original distribution function as much as possible. We can try different numbers of bins.
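As a minimal sketch, and assuming our own choice of five bins, the following query bins the Age column into classes of equal width and reuses the entropy expression from the previous query.

-- Sketch: equal-width binning of Age into five classes, then the entropy of the bins
WITH AgeMinMax AS
(
 SELECT MIN(Age) AS MinAge, MAX(Age) AS MaxAge
 FROM dbo.vTargetMailDirty
),
AgeBins AS
(
 SELECT CASE WHEN d.Age = m.MaxAge THEN 4
             ELSE FLOOR(5.0 * (d.Age - m.MinAge) / (m.MaxAge - m.MinAge))
        END AS AgeBin
 FROM dbo.vTargetMailDirty AS d
      CROSS JOIN AgeMinMax AS m
),
Prob AS
(
 SELECT AgeBin
       ,1.0 * COUNT(*) / (SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
 FROM AgeBins
 GROUP BY AgeBin
)
SELECT SUM((-1.0) * Px * (LOG(Px) / LOG(2))) AS HAgeBinned
FROM Prob;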

Using Other SQL Server Tools for Data Profiling
So far, we have used T-SQL queries, enhanced with a CLR function, and data mining models in our data profiling activities. There are more useful tools in the SQL Server suite for this task. SSAS OLAP cubes are very useful for an overview of discrete columns. SSIS has a specialized Data Profiling task. Finally, we can also use the free Office Data Mining Add-ins. We are now going to show how we can use all of the mentioned tools for data profiling.

SSAS Cubes
SSAS cubes are typically built on a star schema, using multiple dimensions and fact tables, with business measures like quantities and amounts. Nevertheless, we can create an SSAS cube based on a single table, and use part of the columns for a dimension and part for a fact table. In fact, this way we are creating a star schema inside the Analysis Services cube, although we have a flattened (single-table) model in our data warehouse. For data profiling, the only measure we need is the count of rows. We can then use all other attributes for a dimension, and use them for analyzing the counts, i.e., the frequencies, of the different states of discrete attributes.

We are going to create a cube based on the TargetMailMining view we used for the Decision Trees analysis. In order to do this, you will need to complete the following steps:
1. If you closed the BI project from this chapter, reopen it in BIDS.
2. In Solution Explorer, right-click on the Cubes folder, and select New Cube.
3. Use the existing tables in the Select Creation Method window.
4. Use the Adventure Works2008R2 data source view. Select TargetMailMining for the measure group table.

5. In the Select Measures window, clear all selected measures except the Target Mail Mining Count measure. The wizard uses all numeric columns for measures, and automatically creates an additional measure with the COUNT aggregate function. This is exactly the measure we need.
6. Use the dimension selected by default, i.e., Target Mail Mining (it is, of course, the only dimension you can use).
7. Specify CustomerKey as the dimension key.
8. Name the cube Target Mail Cube and click Finish. The Cube Designer window should open.
9. In Solution Explorer, expand the Dimensions folder. Double-click on the Target Mail Mining dimension to open the Dimension Designer. The Cube wizard did not add any attributes to the dimension.
10. In the Dimension Designer, drag and drop columns from the rightmost of the three panes (the Data Source View pane) to the leftmost (the Attributes) pane. Drag all the columns we used as input columns for the Decision Trees model: Age, BikeBuyer, CommuteDistance, Education, Gender, HouseOwnerFlag, MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Occupation, Region, TotalChildren and YearlyIncome. Your dimension should look like the one in figure 8.

Figure 8: TARGET MAIL MINING DIMENSION

11. You have to discretize the Age and Yearly Income attributes again. Use the Properties window to change the DiscretizationBucketCount property to 5 and the DiscretizationMethod property to Automatic for both attributes.
12. Save the project and deploy it.
13. You can browse the cube directly from BIDS using the Browser tab in the Cube Designer. However, it is much nicer to use Microsoft Office Excel 2007 or 2010 as an OLAP client.
14. If you have access to Excel 2007 or 2010, open it.
15. Click on the Data tab to open the Data ribbon.
16. In the From Other Sources drop-down list, select From Analysis Services.
17. In the Data Connection Wizard, enter the name of your Analysis Services instance. Use Windows authentication. Select the MDSBook_Ch03_SSAS database and the Target Mail Cube cube. Finish the Data Connection Wizard. Click OK to import the data into a PivotTable report.

18. In the PivotTable Field List window, select the Target Mail Mining Count measure and the Occupation attribute. You should already see the distribution for Occupation in the pivot table. Click somewhere in the pivot table to make sure the PivotTable Tools ribbon is accessible.
19. Click on the Options tab of the PivotTable Tools ribbon. On the right side of the ribbon, click on PivotChart. Click OK to accept the default chart options.
20. Your Excel worksheet should look similar to the one in figure 9.

Figure 9: DISTRIBUTION OF THE OCCUPATION ATTRIBUTE IN EXCEL, DATA FROM SSAS CUBE

21. Play with other distributions. Try also to group over columns.
22. When finished, close Excel. Do not close BIDS yet.

As you can see, SSAS together with Excel is a powerful data profiling tool. This was an example of using a tool for a purpose other than its intended one. Later in this chapter, we are going to introduce a tool that is intended for data profiling – the SSIS Data Profiling task.

PowerPivot for Excel 2010
Microsoft SQL Server PowerPivot for Excel is a free add-in that became available at the same time as SQL Server 2008 R2. Besides the Excel version, a SharePoint version of PowerPivot exists as well. This tool provides so-called self-service BI: from Excel, advanced end users can create their own analytical solutions without the intervention of database specialists. You can download PowerPivot for Excel 2010 from the PowerPivot site.

PowerPivot is a completely new technology for data warehousing. In this version, as an add-in for Excel 2010, it is an in-memory relational database with column-oriented storage. Because it is relational, we work with tables that represent entity sets. Because it is in-memory, it is fast. And because the data, although we work with rows, is stored in sorted, compressed, column-by-column storage, it is even faster. Because the interface for creating the database, importing the data, and creating PivotTable and PivotChart reports is available in Excel, it is simple to use.

We can use PowerPivot instead of SSAS cubes for our data profiling tasks. Assuming you have already downloaded and installed PowerPivot for Excel 2010, complete the following steps to prepare an analysis of the TargetMailMining view similar to the one we did with SSAS:
1. Open Excel 2010. Click on the PowerPivot tab (this tab appears after PowerPivot is installed).
2. From the Launch group on the left side of the ribbon, click on the PowerPivot Window button. You have just launched your database development and administration interface for PowerPivot databases.
3. On the Home tab, in the Get External Data group, click on the From Database drop-down list, and select From SQL Server.
4. In the Table Import Wizard, on the Connect to a Microsoft SQL Server Database page, connect to your SQL Server instance with the AdventureWorks2008R2 sample database where the TargetMailMining view is created. Click Next.

5. In the Choose How to Import the Data page, use the default option – Select from a list of tables and views to choose the data to import. Click Next.
6. In the Select Tables and Views page, choose the TargetMailMining view. Do not click Finish yet – click on the Preview & Filter button first.
7. The Preview Selected Table window opens. Now you can select only the key and the columns we need for the analysis: CustomerKey, Age, BikeBuyer, CommuteDistance, Education, Gender, HouseOwnerFlag, MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Occupation, Region, TotalChildren and YearlyIncome. Actually, you have to de-select the other columns (FirstName, LastName and PhoneValid), because by default all columns are selected. Click OK. After you are returned to the Select Tables and Views page, click Finish.

Figure 10: THE PREVIEW SELECTED TABLE WINDOW

8. After the data is imported, on the Importing page, click Close. Your data is now in the PowerPivot database, in memory.
9. Discretizing continuous columns is not as simple in PowerPivot as it is in Analysis Services. You can create derived columns using expressions in the Data Analysis Expressions (DAX) language, the language that PowerPivot supports. Please refer to Books Online or to the PowerPivot book mentioned in the References section of this chapter for details about PowerPivot and DAX. For this brief overview, we are simply going to skip the continuous Age and YearlyIncome columns.
10. Let's start analyzing the data. In the Reports group of the PowerPivot Home window, click on the PivotTable drop-down list, and select PivotTable.
11. In the Create PivotTable pop-up window, select the Existing Worksheet option, and click OK.
12. In the PowerPivot Field List, select the CustomerKey and Occupation columns. CustomerKey should appear in the Values area, and Occupation in the Axis Fields (Categories) area. Note that the default aggregation function is Sum; therefore, your report shows the sum of CustomerKey over Occupation. Of course, this makes no sense.
13. In the Values area, click on the down arrow near the Sum of CustomerKey field. Select the Edit Measure option.
14. In the Measure Settings window, select the Count aggregate function. Click OK.

Figure 11: SETTING THE COUNT AGGREGATE FUNCTION

15. We can add a chart to the worksheet, like we did when we analyzed the SSAS data. However, this chart is not automatically connected to the PivotTable we already have in the sheet; we will have to select the fields for the analysis again.
16. In the Reports group, click on the PivotTable drop-down list, and select PivotChart. In the Create PivotChart pop-up window, select Existing Worksheet, and change the Location to 'Sheet1'!$A$10.
17. In the PowerPivot Field List, select the CustomerKey and Occupation columns.
18. In the Values area, click on the down arrow near the Sum of CustomerKey field. Select the Edit Measure option.
19. In the Measure Settings window, select the Count aggregate function. Click OK.
20. Now you should have the same analysis of the distribution of the Occupation column as you had when analyzing the SSAS cube.

Figure 12: DISTRIBUTION OF THE OCCUPATION ATTRIBUTE IN EXCEL, DATA FROM POWERPIVOT

21. Play with other distributions. Try also to group over columns.

22. When finished, close Excel.

SSIS Data Profiling Task
With the SSIS Data Profiling task available, the question arises why we developed all of the queries, mining models and cubes in this chapter so far. Well, everything we learned so far is still valid and useful. The Data Profiling task can do limited profiling only; we have already developed many more advanced profiling techniques. In addition, the Data Profiling task saves its result in XML form, which is not directly useful. We would have to write custom code in an SSIS Script task in order to consume the results in the same package. For an overview of the results, an application called Data Profile Viewer is included in the SQL Server suite. Let's explore what the Data Profiling task can do by completing the following steps.
1. If you closed BIDS, reopen the solution we have created so far.
2. Use the File – Add – New Project menu option to add a project to the solution. Use the Integration Services project template. Name the project MDSBook_Ch03_SSIS.
3. In Solution Explorer, in the SSIS Packages folder, right-click on the default package called Package.dtsx and rename it to SSISProfiling.dtsx. Confirm the renaming of the package object.
4. From the Toolbox, drag and drop the Data Profiling Task to the control flow area. Right-click on it and select Edit.
5. In the General tab, use the Destination property drop-down list, and select New File Connection.
6. In the File Connection Manager Editor window, change the usage type to Create file. In the File text box, type the file name you want to create (you can select the folder graphically with the Browse button). In order to follow the placement and naming we have used so far, type C:\MDSBook\Chapter03\SSISProfiling.xml.
7. When you are back in the Data Profiling Task Editor, General tab, change the OverwriteDestination property to True, in order to make it possible to re-execute the package multiple times (otherwise, you would get an error saying that the destination file already exists with the next execution of the package).
8. In the bottom-right corner of the Data Profiling Task Editor, General tab, click on the Quick Profile button.
9. In the Simple Table Quick Profiling Form, click on the New button to create a new ADO.NET connection. The Data Profiling Task accepts only ADO.NET connections.
10. Connect to your SQL Server and select the AdventureWorks2008R2 database. Click OK to return to the Simple Table Quick Profiling Form.
11. Select the dbo.TargetMailMining view in the Table or View drop-down list. Leave all the check boxes that are selected by default checked. Also select the Column Pattern Profile and Functional Dependency Profile check boxes. Click OK.
12. In the Data Profiling Task Editor window, in the Profile Type list on the right, select the Functional Dependency Profile Request. In the Request Properties window, select BikeBuyer as the dependent column, and CommuteDistance, Education and NumberCarsOwned as determinant columns. Deselect the star (all columns) entry as a determinant column. Your Data Profiling Task Editor should look like the one in figure 13. When finished, click OK to close the Data Profiling Task Editor window.

Figure 13: THE DATA PROFILING TASK CONFIGURED

13. Save the project. Right-click on the Data Profiling task and select the Execute Task option. Wait until the execution finishes. The task color should turn to green after a successful execution.
14. From the Debug menu, select the Stop Debugging option. Check whether the XML file has appeared in the folder.
15. From the All Programs, SQL Server 2008 R2, Integration Services menu, start the Data Profile Viewer application.
16. Click on the Open button. Navigate to the SSISProfiling.xml file and open it. Now you can start harvesting the results.
17. In the Profiles pane on the left, select, for example, the Column Value Distribution Profiles. In the top-right pane, select the Occupation column. In the middle-right window, you should see the distribution for the Occupation attribute. Click on the value that has a very low frequency (the Profesional value). At the top-right corner of the middle-right window, you will find the drill-down button. Click on it. In the bottom-right window, you get the row with this value for the Occupation column, like in figure 14.

Figure 14: THE OCCUPATION COLUMN VALUE DISTRIBUTION PROFILE

18. Check the other profiles generated.

The SSIS Data Profiling task can be a nice, automated tool for basic, quick profiling of our data. Besides some simple profiling, it has some more sophisticated options. For example, the Functional Dependency profile uses the Naïve Bayes algorithm internally to find the dependency between a single target variable and multiple determinant variables. Data mining algorithms are really useful for data profiling.

Excel Data Mining Add-Ins
The Microsoft Office 2007 Data Mining Add-ins are free to download from the Microsoft SQL Server 2008 R2 Feature Pack site (http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=ceb4346f-657f-4d28-83f5-aae0c5c83d52). The same add-ins work for Office 2010 as well. However, note that these add-ins currently exist in a 32-bit version only. After you install them, you should get a new Data Mining tab in Excel.

With the data mining add-ins, Excel does not become a data mining engine; it leverages an existing Analysis Services instance. Excel sends data from its worksheets or tables to the SSAS instance, and then gets the results back and displays them. If the data is formatted as Excel tables, data mining is simpler, through the more business-oriented options that appear in the Table Tools ribbon. If the data is organized in worksheets only, we can still use the Data Mining ribbon and mine it from a more technical perspective.

The Clustering data mining algorithm is especially useful for finding complex outliers. By complex outliers we mean outliers that are not due to the value of a single column; we are searching for outliers based on a combination of values from multiple columns. Therefore, with Clustering, we can perform multivariate analysis for outliers. The Clustering algorithm tries to group rows into clusters. In the Expectation-Maximization variant of the algorithm, which is the default in SSAS, each case belongs to each cluster with some probability. A case can belong to one cluster with a probability of 0.3, and to another with 0.05. However, the highest probability can be very low as well. For example, a case can belong to one cluster with probability 0.001, to another with 0.00003, and to other clusters with even lower probabilities. This means that the case does not fit well into any cluster, or, said differently, that this case is an exception. This is how we can find outliers with Clustering: we find the cases for which even the highest probability of belonging to a cluster is still very low.

With the data mining add-ins, Excel becomes a very powerful data mining tool. We can use it for nearly any data mining activity Analysis Services can perform. Excel can serve as a data mining client, because with the add-ins we also get in Excel the data mining viewers that are built into BIDS and SSMS. In addition, the add-ins also bring a couple of new templates for Visio. Nevertheless, this introduction to the Office Data Mining Add-ins should be enough for a Master Data Services book. Rather than continuing with theory and descriptions, let's use the add-ins! To get started, you will need to complete the following steps:
1. In order to use Excel as a data mining tool, your SSAS instance has to allow session mining models. In SSMS, in Object Explorer, connect to your SSAS instance. Right-click on it and select Properties. Change the Data Mining \ Allow Session Mining Models property to True, as shown in figure 15. Click OK.

Figure 15: ALLOW SSAS SESSION MINING MODELS

2. Open Excel. On the Data ribbon, click on the From Other Sources drop-down list. Select the From SQL Server option.
3. Connect to your SQL Server and select the AdventureWorks2008R2 database. In the Select Database and Table window of the Data Connection Wizard, make sure that the Connect to a specific table check box is checked. Select the dbo.TargetMailMining view and finish the wizard.

4. In the Import Data window, click OK. You should get all of the data in tabular format in your worksheet.
5. If part of your worksheet is formatted as a table in Excel, you get an additional Table Tools – Analyze ribbon with the data mining add-ins.
6. As Excel needs Analysis Services to perform data mining tasks, you need to click on the Connection button (by default it should show No connection). Connect to your Analysis Services instance, to the MDSBook_Ch03_SSAS database. Your screen should look like the one in figure 16.

Figure 16: TABLE ANALYSIS TOOLS AFTER CONNECTION TO SSAS

7. Click on the Highlight Exceptions button. In the pop-up window, click Run.
8. After the wizard finishes, you should get an additional sheet with summary information about the exceptions. If you select your original worksheet, the worksheet with the data, you will notice that some rows are highlighted in a dark yellow color, as shown in figure 17. In addition, some columns of the highlighted rows are highlighted in a lighter yellow color.

Figure 17: EXCEPTIONS HIGHLIGHTED IN THE ORIGINAL SHEET

9. These highlights show you suspicious rows. In addition, you can see which column makes a row an exception. Now it is easy to find errors.
10. After you finish, close Excel and BIDS.

Clean-Up
As this was the last profiling option introduced in this chapter, we can clean up the AdventureWorks2008R2 database with the following code:
USE AdventureWorks2008R2;
IF OBJECT_ID(N'dbo.ValueIsNULL', N'FN') IS NOT NULL
 DROP FUNCTION dbo.ValueIsNULL;
IF OBJECT_ID(N'dbo.ProductsMining', N'V') IS NOT NULL
 DROP VIEW dbo.ProductsMining;
IF OBJECT_ID(N'dbo.vTargetMailDirty', N'V') IS NOT NULL
 DROP VIEW dbo.vTargetMailDirty;
IF OBJECT_ID(N'dbo.IsRegExMatch', N'FS') IS NOT NULL
 DROP FUNCTION dbo.IsRegExMatch;
-- Drop the assembly only if it exists
IF EXISTS (SELECT * FROM sys.assemblies WHERE name = N'MDSBook_Ch03_CLR')
 DROP ASSEMBLY MDSBook_Ch03_CLR;
IF OBJECT_ID(N'dbo.TargetMailMining', N'V') IS NOT NULL
 DROP VIEW dbo.TargetMailMining;
GO


Summary
The old rule "garbage in – garbage out" is absolutely valid for MDM projects as well. Before starting to implement a centralized MDM solution, such as a SQL Server Master Data Services solution, we should have an in-depth understanding of the quality of our data. In addition, we should find the root causes of bad data. We have shown in this chapter how we can use tools from the SQL Server 2008 R2 suite for data profiling and for finding the root cause. We have used Transact-SQL queries. We have used XQuery and CLR code for controlling the data quality of XML data and strings. We used SQL Server Analysis Services intensively. The Unified Dimensional Model, or, if we prefer this expression, the OLAP cube, is a nice way to get a quick, graphical overview of the data. In addition, PowerPivot for Excel 2010 gives us the opportunity to achieve the same graphical overview even without Analysis Services. Data Mining helps us find interesting patterns, and it is thus a useful tool for finding root causes of bad data. With the Office Data Mining Add-Ins, Excel 2007 and 2010 became powerful data mining tools as well. The SQL Server Integration Services Data Profiling task is another quick tool for finding bad or suspicious data.
One of the most challenging tasks in preparing and maintaining master data is merging it from multiple sources when we do not have the same identifier in all sources. This means we have to do the merging based on similarities of column values, typically on the similarity of string columns like names and addresses. Even if we have a single source of master data, we can have duplicate rows for the same entity, like duplicate rows for the same customer. In the next chapter, we are going to tackle these two problems, merging and de-duplicating.


References
• Erik Veerman, Teo Lachev, Dejan Sarka: MCTS Self-Paced Training Kit (Exam 70-448): Microsoft SQL Server 2008 Business Intelligence Development and Maintenance (Microsoft Press, 2009)
• Thomas C. Redman: Data Quality: The Field Guide (Digital Press, 2001)
• Tamraparni Dasu, Theodore Johnson: Exploratory Data Mining and Data Cleaning (John Wiley & Sons, 2003)
• Arkady Maydanchik: Data Quality Assessment (Technics Publications, 2007)
• Itzik Ben-Gan, Lubor Kollar, Dejan Sarka, Steve Kass: Inside Microsoft SQL Server 2008: T-SQL Querying (Microsoft Press, 2009)
• Itzik Ben-Gan, Dejan Sarka, Roger Wolter, Greg Low, Ed Katibah, Isaac Kunen: Inside Microsoft SQL Server 2008: T-SQL Programming (Microsoft Press, 2010)
• Marco Russo, Alberto Ferrari: PowerPivot for Excel 2010 (Microsoft Press, 2011)


Chapter 4: Identity Mapping and De-Duplicating
Dejan Sarka
Two of the most challenging problems with maintaining master data are identity mapping and de-duplication. In an enterprise, we frequently have more than one source of master data. Sometimes we have to import master data from external sources. Different sources can include legacy applications; relational databases used by OLTP applications; analytical databases; semi-structured data, such as XML data from some Web service; and even non-structured data in Microsoft Excel worksheets and Microsoft Word documents. Typically, we do not have unique identification of entities in all these different sources. However, we would like to get a complete picture of an entity (e.g., of a customer) in order to properly support our applications, such as a Customer Relationship Management (CRM) application. Therefore, we have to identify which objects (e.g., rows in a table, XML instances) represent the same entity. We need to match the identities based on similarities of entity attributes.
Even if we have a single source of data, we are not without problems. Master data is often duplicated. For example, multiple operators could insert the same customer multiple times, each time with a slightly differently written name or address. Again, we would like to identify such duplicates and get rid of them.
In this chapter, we introduce details of both problems and show possible solutions. We focus on identity mapping, and we show that de-duplicating can actually be translated to the same problem. We discuss the following topics:
• Problems with identity mapping

• String similarity functions
• Reducing the search space
• Performing identity mapping with T-SQL
• Using SSIS Fuzzy Lookup transformation for identity mapping
• De-duplicating
• Using SSIS Fuzzy Grouping transformation for de-duplicating


Identity Mapping
The problem with identity mapping arises when data is merged from multiple sources that can update data independently. Each source has its own way of entity identification, or its own keys. There is no common key to make simple joins. Data merging has to be done based on similarities of strings, using names, addresses, e-mail addresses, and similar attributes. Figure 1 shows the problem: we have three rows in the first table, in the top left corner, and two rows in the second table, in the top right corner. Keys of the rows from the left table (Id1 column) are different from keys of the rows in the right table (Id2 column). The big table at the bottom shows the result of approximate string matching. Note that each row from the left table is matched to each row from the right table; similarities are different for different pairs of rows.

Figure 1: IDENTITY MAPPING PROBLEM


Problems
Many problems arise with identity mapping. First, there is no way to get a 100 percent accurate match programmatically; if you need a 100 percent accurate match, you must match entities manually. But even with manual matching, you cannot guarantee 100 percent accurate matches at all times, such as when you are matching people. For example, in a database table, you might have two rows for people named John Smith, living at the same address; we cannot know whether this is a single person or two people, maybe a father and son. Nevertheless, when you perform the merging programmatically, you would like to get the best matching possible. You must learn which method to use and how to use it in order to get the best possible results for your data. In addition, you might even decide to use manual matching on the remaining unmatched rows after programmatic matching is done. Later in this chapter, we compare a couple of public algorithms that are shipped in SQL Server 2008 R2, in the Master Data Services (MDS) database; we also add the SSIS Fuzzy Lookup transformation to the analysis.
The next problem is performance. For approximate merging, any row from one side, from one source table, can be matched to any row from the other side. This creates a cross join of the two tables. Even small data sets can produce huge performance problems, because a cross join is an algorithm with quadratic complexity. For example, a cross join of the 18,484 rows of the AdventureWorksDW2008R2 vTargetMail view (which we will use for testing) with itself means dealing with 341,658,256 rows! We discuss techniques for optimizing this matching (i.e., search space reduction techniques) later in this chapter.
Another problem related to identity mapping is de-duplicating. We deal briefly with this problem at the end of the chapter.

T-SQL and MDS String Similarity Functions
The T-SQL SOUNDEX() and DIFFERENCE() functions allow searches on similar sounding character strings. SOUNDEX() takes a word, such as a person’s name, as input and produces a character string that identifies a set of words that are (roughly) phonetically alike. Phonetically alike in this

case means US English phonetically alike. DIFFERENCE() returns an integer indicating the difference between the SOUNDEX() values. DIFFERENCE() returns a value between 0 and 4. The integer returned is the number of characters in the SOUNDEX() values that are the same. Therefore, a value of 0 indicates weak or no similarity, and a value of 4 indicates strong similarity or the same values.
SOUNDEX() is a very basic function, not suitable for solving real-life problems. There are many better algorithms that compare strings phonetically. In addition, all phonetic algorithms are language dependent. We focus on language-independent algorithms in this chapter, because we cannot say in advance in which language we will have to do the matching. If we install MDS in SQL Server 2008 R2, we get the following string matching algorithms:
• Levenshtein distance (also called edit distance)
• Jaccard index
• Jaro-Winkler distance
• Simil (longest common substring; also called Ratcliff/Obershelp)
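If you are curious which helper objects the MDS installation actually puts in the mdq schema, you can list them with a query like the following. This is just a sketch; the database name MDS is an assumption and should be replaced with whatever name you gave your MDS database during setup:
-- Sketch: list the mdq helper objects in your MDS database (database name is an assumption)
SELECT o.name, o.type_desc
FROM MDS.sys.objects AS o
 INNER JOIN MDS.sys.schemas AS s
  ON o.schema_id = s.schema_id
WHERE s.name = N'mdq'
ORDER BY o.name;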

All of these algorithms are well-known and publicly documented (e.g., on Wikipedia). They are implemented through a CLR function. Note that MDS comes only with the SQL Server 2008 R2 Enterprise and Datacenter 64-bit editions. If you are not running either of these editions, you can use any edition of SQL Server 2005 or later and implement these algorithms as CLR functions. Anastasios Yalanopoulos's "Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server" (http://anastasiosyal.com/archive/2009/01/11/18.aspx#soundex) provides a link to a publicly available library of CLR string matching functions for SQL Server.
Levenshtein (edit) distance measures the minimum number of edits needed to transform one string into the other. It is the total number of character insertions, deletions, or substitutions that it takes to convert one string to another. For example, the distance between kitten and sitting is 3:
• kitten → sitten (substitution of 's' for 'k')
• sitten → sittin (substitution of 'i' for 'e')

• sittin → sitting (insert 'g' at the end)

Similarity is then normalized between 0 and 1.
The Jaccard index (similarity coefficient) measures similarity between sample sets. When used for strings, it defines strings as sets of characters. The size of the intersection is divided by the size of the union of the sample sets, as the following formula shows. Note that the coefficient is normalized between 0 and 1 by its definition.

J(A, B) = |A ∩ B| / |A ∪ B|

The Jaro-Winkler distance is a variant of the Jaro string similarity metric. The Jaro distance combines matches and transpositions for two strings s1 and s2:

dj = 1/3 * ( m / |s1| + m / |s2| + (m - t) / m )

The symbol m means the number of matching characters and the symbol t is the number of transpositions, while |s| denotes the length of a string. In order to define characters as matching, their positions in the strings must be close together, which is defined by the character position distance (CPD); they should be no farther apart than the following formula calculates:

CPD = floor( max(|s1|, |s2|) / 2 ) - 1

The Jaro-Winkler distance uses a prefix scale p, which gives more favorable ratings to strings that match from the beginning:

dw = dj + ( l * p * (1 - dj) )

In this formula, the meaning of the signs is:
• dj is the Jaro distance
• l is the common-prefix length at the start of the string, up to a maximum of 4 characters
• p is a scaling factor for common prefixes; p should not exceed 0.25, otherwise the distance can become larger than 1 (usually p is equal to 0.1)

Finally, the Simil algorithm looks for the longest common substring in two strings. Then it removes this substring from the original strings. After that, it searches for the next longest common substring in the remainders of the two original strings, from the left and the right. It continues this process recursively until no common substrings of a defined length (e.g., two characters) are found. Finally, the algorithm calculates a coefficient between 0 and 1 by dividing the sum of the lengths of the substrings by the lengths of the strings themselves.

Preparing the Data
In order to compare the functions' efficiency, we need some data. The code in Listing 1 prepares sample data. First, it prepares two tables. The CustomersMaster table will be the master table, which is the table with the keys that we want to keep and transfer to the target table. The CustomersTarget table is the target of identity mapping; it will receive the keys from the CustomersMaster table. We fill both tables from the AdventureWorksDW2008R2.dbo.vTargetMail view. In the target table, we keep the original keys, multiplied by -1, in order to control the efficiency of merging. The target table also has an empty column to store the key from the master table (MasterCustomerId) and one additional column (Updated) that will be used only in the code that produces errors in the data. Besides the common columns (FullName, StreetAddress, and CityRegion), the two tables also have some different columns, to show that identical structures are not necessary. All we need is at least one character column in common, a column that is used to compare values.
Listing 1: Code to Prepare Sample Data
-- Assuming that MDSBook database exists
USE MDSBook;
GO


-- Creating sample data
-- Customers primary (master) table
IF OBJECT_ID(N'dbo.CustomersMaster', N'U') IS NOT NULL
  DROP TABLE dbo.CustomersMaster;
GO
CREATE TABLE dbo.CustomersMaster
(
  CustomerId int NOT NULL PRIMARY KEY,
  FullName nvarchar(200) NULL,
  StreetAddress nvarchar(200) NULL,
  CityRegion nvarchar(200) NULL,
  NumberCarsOwned tinyint NULL
);
GO
INSERT INTO dbo.CustomersMaster
SELECT c.CustomerKey AS CustomerId
      ,c.FirstName + ' ' + c.LastName AS FullName
      ,c.AddressLine1 AS StreetAddress
      ,g.City + ', ' + g.StateProvinceName + ', ' +
       g.EnglishCountryRegionName AS CityRegion
      ,c.NumberCarsOwned
FROM AdventureWorksDW2008R2.dbo.DimCustomer AS c
 INNER JOIN AdventureWorksDW2008R2.dbo.DimGeography AS g
  ON c.GeographyKey = g.GeographyKey;
GO
-- Customers secondary (merge target) table
IF OBJECT_ID(N'dbo.CustomersTarget', N'U') IS NOT NULL
  DROP TABLE dbo.CustomersTarget;
GO
CREATE TABLE dbo.CustomersTarget
(
  CustomerId int NOT NULL PRIMARY KEY,
  FullName nvarchar(200) NULL,
  StreetAddress nvarchar(200) NULL,
  CityRegion nvarchar(200) NULL,
  MaritalStatus nchar(1) NULL,
  Updated int NULL,
  MasterCustomerId int NULL
);
GO
INSERT INTO dbo.CustomersTarget
SELECT c.CustomerKey * (-1) AS CustomerId
      ,c.FirstName + ' ' + c.LastName AS FullName
      ,c.AddressLine1 AS StreetAddress
      ,g.City + ', ' + g.StateProvinceName + ', ' +
       g.EnglishCountryRegionName AS CityRegion
      ,c.MaritalStatus
      ,0 AS Updated
      ,NULL AS MasterCustomerId
FROM AdventureWorksDW2008R2.dbo.DimCustomer AS c
 INNER JOIN AdventureWorksDW2008R2.dbo.DimGeography AS g
  ON c.GeographyKey = g.GeographyKey;
GO

We can perform matching based on FullName, StreetAddress, and CityRegion and get 100 percent accurate results. Of course, we have the same data in both tables. In order to test the functions, we have to perform some updates in the target table and produce some errors in the

common columns. The code in Listing 2 performs these changes. This code makes changes with slightly controlled randomness, in a loop that repeats three times. In every loop, the code makes three updates, on 17 percent, 17 percent, and 16 percent (50 percent combined) of the rows. Rows to be updated are selected with Bernoulli's formula for sampling, which provides statistically good randomness. In the first two passes, the 50 percent of rows is selected from all rows; in the third pass, we select 50 percent of rows from the rows that were updated in the previous two passes only. This way, we get more errors on fewer rows; if we selected rows randomly from all rows in each pass, we would get fewer errors on more rows. From experience, we know that errors tend to pile up in the same rows and are not spread completely randomly in a table. For example, we nearly always get an error with the last name Sarka in Slovenia (it is a Hungarian last name, uncommon in Slovenia), written as Sraka (which is a Slovenian word), whereas we nearly never get an error with the last name Novak (the most common last name in Slovenia).
Listing 2: Code to Update the Target Table
DECLARE @i AS int = 0, @j AS int = 0;
WHILE (@i < 3) -- loop more times for more changes
BEGIN
 SET @i += 1;
 SET @j = @i - 2; -- control here in which step you want to update
                  -- only already updated rows
 WITH RandomNumbersCTE AS
 (
 SELECT CustomerId
       ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber1
       ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber2
       ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber3
       ,FullName
       ,StreetAddress
       ,CityRegion
       ,MaritalStatus
       ,Updated
 FROM dbo.CustomersTarget
 )
 UPDATE RandomNumbersCTE SET
   FullName = STUFF(FullName,
       CAST(CEILING(RandomNumber1 * LEN(FullName)) AS int), 1,
       CHAR(CEILING(RandomNumber2 * 26) + 96))
  ,StreetAddress = STUFF(StreetAddress,
       CAST(CEILING(RandomNumber1 * LEN(StreetAddress)) AS int), 2, '')
  ,CityRegion = STUFF(CityRegion,
       CAST(CEILING(RandomNumber1 * LEN(CityRegion)) AS int), 0,
       CHAR(CEILING(RandomNumber2 * 26) + 96) +
       CHAR(CEILING(RandomNumber3 * 26) + 96))
  ,Updated = Updated + 1
 WHERE RAND(CHECKSUM(NEWID()) % 1000000000 - CustomerId) < 0.17
   AND Updated > @j;
 WITH RandomNumbersCTE AS
 (
 SELECT CustomerId
       ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber1
       ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber2
       ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber3
       ,FullName
       ,StreetAddress
       ,CityRegion
       ,MaritalStatus
       ,Updated
 FROM dbo.CustomersTarget
 )
 UPDATE RandomNumbersCTE SET
   FullName = STUFF(FullName,
       CAST(CEILING(RandomNumber1 * LEN(FullName)) AS int), 0,
       CHAR(CEILING(RandomNumber2 * 26) + 96))
  ,StreetAddress = STUFF(StreetAddress,
       CAST(CEILING(RandomNumber1 * LEN(StreetAddress)) AS int), 2,
       CHAR(CEILING(RandomNumber2 * 26) + 96) +
       CHAR(CEILING(RandomNumber3 * 26) + 96))
  ,CityRegion = STUFF(CityRegion,
       CAST(CEILING(RandomNumber1 * LEN(CityRegion)) AS int), 2, '')
  ,Updated = Updated + 1
 WHERE RAND(CHECKSUM(NEWID()) % 1000000000 - CustomerId) < 0.17
   AND Updated > @j;
 WITH RandomNumbersCTE AS
 (
 SELECT CustomerId
       ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber1
       ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber2
       ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber3
       ,FullName
       ,StreetAddress
       ,CityRegion
       ,MaritalStatus
       ,Updated
 FROM dbo.CustomersTarget
 )
 UPDATE RandomNumbersCTE SET
   FullName = STUFF(FullName,
       CAST(CEILING(RandomNumber1 * LEN(FullName)) AS int), 1, '')
  ,StreetAddress = STUFF(StreetAddress,
       CAST(CEILING(RandomNumber1 * LEN(StreetAddress)) AS int), 0,
       CHAR(CEILING(RandomNumber2 * 26) + 96) +
       CHAR(CEILING(RandomNumber3 * 26) + 96))
  ,CityRegion = STUFF(CityRegion,
       CAST(CEILING(RandomNumber1 * LEN(CityRegion)) AS int), 2,
       CHAR(CEILING(RandomNumber2 * 26) + 96) +
       CHAR(CEILING(RandomNumber3 * 26) + 96))
  ,Updated = Updated + 1
 WHERE RAND(CHECKSUM(NEWID()) % 1000000000 - CustomerId) < 0.16
   AND Updated > @j;
END;

After the update, we can check how many rows are different in the target table from the original rows, which are still available in the master table. The maximum number of updates per row is 9; the more times a row was updated, the more its common attribute values differ from the original (and correct) ones. The probability that a row is updated many times drops quite quickly with higher numbers of updates. The query in Listing 3 compares full names and addresses after the update, selecting only rows with some changes and sorting them by the number of updates in descending order, so we get the rows with the maximal number of updates at the top.
Listing 3: Query to Compare Full Names and Addresses after Update
SELECT m.FullName
      ,t.FullName
      ,m.StreetAddress
      ,t.StreetAddress
      ,m.CityRegion
      ,t.CityRegion
      ,t.Updated
FROM dbo.CustomersMaster AS m
 INNER JOIN dbo.CustomersTarget AS t
  ON m.CustomerId = t.CustomerId * (-1)
WHERE m.FullName <> t.FullName
   OR m.StreetAddress <> t.StreetAddress
   OR m.CityRegion <> t.CityRegion
ORDER BY t.Updated DESC;

The partial result in Figure 2 shows that three rows were updated 7 times (note that in order to fit the number of updates of a row into the figure, the CityRegion columns before and after the update are omitted from the figure). Altogether, 7,790 rows were updated in this test. You should get different results every time you run this test, because the updates are done randomly (or rather, with controlled randomness, no matter how paradoxical this sounds). You can also see that the values in the rows that were updated many times differ quite a lot from the original values; therefore, our string matching algorithms are going to have a hard time finding similarities.


Figure 2: RESULTS OF NAME AND ADDRESS COMPARISON

Testing the String Similarity Functions
Although we mentioned that we are not going to use the T-SQL SOUNDEX() and DIFFERENCE() functions because they are language dependent, it makes sense to start with a quick test of these functions. This test shows another reason for not using the functions: They are very inefficient. Let's look at how they perform on full names. Because we retained the original key (although multiplied by -1) in the target table, we can make an exact join and compare the original (master) and changed (target) names, as Listing 4 shows.
Listing 4: SOUNDEX() and DIFFERENCE() Test on Full Names
SELECT m.CustomerId
      ,m.FullName
      ,t.FullName
      ,DIFFERENCE(m.FullName, t.Fullname) AS SoundexDifference
      ,SOUNDEX(m.FullName) AS SoundexMaster
      ,SOUNDEX(t.FullName) AS SoundexTarget
FROM dbo.CustomersMaster AS m
 INNER JOIN dbo.CustomersTarget AS t
  ON m.CustomerId = t.CustomerId * (-1)
ORDER BY SoundexDifference;

The results in Figure 3 show that the DIFFERENCE() function based on the SOUNDEX() code did not find any similarity for quite a few full names. However, as you can see from the highlighted row, the name Zoe Rogers was not changed much; there should be some similarity in the strings “Zoe Rogers” and “rZoeRogers.” This proves that the two functions included in T-SQL are not efficient enough for a successful identity mapping.


Figure 3: RESULTS OF TESTING THE SOUNDEX AND DIFFERENCE FUNCTIONS ON FULL NAMES
You could continue by checking the two T-SQL functions on the street address and city and region strings.
All four algorithms implemented in the mdq.Similarity function in the MDS database return similarity as a number between zero and one. A higher number means better similarity. The query in Listing 5 checks how the algorithms perform on full names. Again, because we retained the original key (although multiplied by -1) in the target table, we can make an exact join, compare the original (master) and changed (target) names, and visually evaluate which algorithm gives the highest score.
Listing 5: Code to Check Algorithm Performance on Full Names
SELECT m.CustomerId
      ,m.FullName
      ,t.FullName
      ,mdq.Similarity(m.FullName, t.Fullname, 0, 0.85, 0.00) AS Levenshtein
      ,mdq.Similarity(m.FullName, t.Fullname, 1, 0.85, 0.00) AS Jaccard
      ,mdq.Similarity(m.FullName, t.Fullname, 2, 0.85, 0.00) AS JaroWinkler
      ,mdq.Similarity(m.FullName, t.Fullname, 3, 0.85, 0.00) AS Simil
FROM dbo.CustomersMaster AS m
 INNER JOIN dbo.CustomersTarget AS t
  ON m.CustomerId = t.CustomerId * (-1)
ORDER BY Levenshtein;
For more information about the mdq.Similarity parameters, see SQL Server Books Online.
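To get a quick feeling for the four algorithms before running them against whole tables, you can also call mdq.Similarity on a pair of literal strings, for example the kitten/sitting pair from the Levenshtein example. This is only a sketch; the algorithm codes and the 0.85 and 0.00 parameter values are simply reused from Listing 5, and the exact numbers returned depend on each algorithm's internal normalization:
-- Sketch: compare the four MDS algorithms on a single pair of strings
SELECT mdq.Similarity(N'kitten', N'sitting', 0, 0.85, 0.00) AS Levenshtein
      ,mdq.Similarity(N'kitten', N'sitting', 1, 0.85, 0.00) AS Jaccard
      ,mdq.Similarity(N'kitten', N'sitting', 2, 0.85, 0.00) AS JaroWinkler
      ,mdq.Similarity(N'kitten', N'sitting', 3, 0.85, 0.00) AS Simil;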


The results (in Figure 4) are sorted by the Levenshtein coefficient, in ascending order (we could choose any of the four), in order to get the rows with the maximal dissimilarity at the top.

Figure 4: RESULTS OF CHECKING ALGORITHM PERFORMANCE ON FULL NAMES FOR THE MDQ.SIMILARITY FUNCTION
From the results, we can see that the Jaro-Winkler algorithm gives the highest scores in our example. You should check the algorithms on the street address and city and region strings as well. Although we cannot say that the Jaro-Winkler algorithm would always perform the best, and although you should always check how the algorithms perform on your data, we can say that this is not a surprise. Jaro-Winkler is one of the most advanced public algorithms for string matching. In addition, it forces higher scores for strings with the same characters at the beginning of the string. From experience, we noticed that errors in data produced by humans typically do not appear in the first few characters. Therefore, it seems that the Jaro-Winkler algorithm is the winner in this case, and we will use it to do the real matching. However, before we do the match, we need to discuss optimizing the matching by trying to avoid a full cross join.

Optimizing Mapping with Partitioning
As mentioned, identity matching is a quadratic problem. The search space dimension is equal to the cardinality of the Cartesian product A x B of the sets included in the match. There are multiple search space reduction techniques. A partitioning or blocking technique partitions at least one set involved in the matching into blocks. For example, you can take the target rows in batches and match each batch with the full master table.


With 10,000 rows in each table, we could have 10 iterations of joins with 1,000 x 10,000 = 10,000,000 rows instead of one big join with 10,000 x 10,000 = 100,000,000 rows. The important thing here is that we do not bloat the memory and go over our hardware resources. In order to prevent this, we can change the batch sizes appropriately.
A sorting neighborhood technique sorts both sets and then moves a window of a fixed size over both sets. We do the matching on that window only, one window after another. The problem with such a technique lies in the fact that objects from one set can be in a window that is not compared to the window from the other set where the objects that should actually be matched reside. The same problem can arise with blocking techniques implemented on both sets. We do not want to lose the matches we might find because of the optimization of the mapping methods.
Let's test the partitioning method. The first partition is, of course, the partition with exact matches. The query in Listing 6 updates the MasterCustomerId column in the target table with CustomerId from the master table based on exact matches on all three common string columns.
Listing 6: Query to Update MasterCustomerId Column
WITH Matches AS
(
SELECT t.MasterCustomerId
      ,m.CustomerId AS MCid
      ,t.CustomerId AS TCid
FROM dbo.CustomersTarget AS t
 INNER JOIN dbo.CustomersMaster AS m
  ON t.FullName = m.FullName
     AND t.StreetAddress = m.StreetAddress
     AND t.CityRegion = m.CityRegion
)
UPDATE Matches
SET MasterCustomerId = MCid;

In the next step, we select (nearly) randomly 1,000 rows from the target table and perform a match with all rows from the master table. We measure the efficiency of the partitioning technique and of the Jaro-Winkler string similarity algorithm. We select rows from the target (nearly) randomly in order to prevent potential bias that we might get by selecting, for example, rows based on CustomerId. We use the NEWID() function to simulate randomness. The values


generated by NEWID() actually do not satisfy statistical randomness, which is why we used the term "(nearly) randomly." In a real-life example, we would iterate through all rows block by block anyway, so it is not that important to use real statistical randomness. NEWID() should give us enough randomness to simulate one iteration (i.e., one block from the target table matched with the complete master table).
We measure the efficiency of the matching on the fly, while we are doing the match. The first part of the code simply declares two variables to store the number of rows updated in this pass and the start time of the UPDATE query. The actual UPDATE query starts with a CTE that selects 1,000 rows from the target table, as we said, nearly randomly, using the TOP operator on NEWID() values, as Listing 7 shows.
Listing 7: UPDATE Query
-- Variables to store the number of rows updated in this pass
-- and start time
DECLARE @RowsUpdated AS int, @starttime AS datetime;
SET @starttime = GETDATE();
-- Select (nearly) randomly 1000 rows from the target table
WITH CustomersTarget1000 AS
(
SELECT TOP 1000
       t.CustomerId
      ,t.MasterCustomerId
      ,t.FullName
      ,t.StreetAddress
      ,t.CityRegion
FROM CustomersTarget AS t
WHERE t.MasterCustomerId IS NULL
ORDER BY NEWID()
),

The next CTE, which Listing 8 shows, performs a cross join of the 1,000-row block from the target table with the full master table. It also compares the strings and adds the Jaro-Winkler similarity coefficient to the output. In addition, it adds to the output the row number, sorted by the Jaro-Winkler similarity coefficient in descending order and partitioned by the target table key. A row number equal to 1 will mean the highest Jaro-Winkler similarity coefficient for a specific target table key. This way, we mark the master table row that, according to the Jaro-Winkler algorithm, is the most similar to the target row. The last part of the UPDATE


statement, also shown in Listing 8, is the actual update. It updates target table rows with the key from the master table row with the highest similarity.
Listing 8: Code to Perform Cross Join and Actual Update
-- Full cross join
-- Adding Jaro-Winkler coefficient and row number
MasterTargetCross AS
(
SELECT t.CustomerId AS TCid
      ,m.CustomerId AS MCid
      ,t.MasterCustomerId
      ,mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
                      t.FullName + t.StreetAddress + t.CityRegion,
                      2, 0.85, 0.00) AS JaroWinkler
      ,ROW_NUMBER() OVER (PARTITION BY t.CustomerId
         ORDER BY mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
                                 t.FullName + t.StreetAddress + t.CityRegion,
                                 2, 0.85, 0.00) DESC) AS RowNo
FROM CustomersMaster AS m
 CROSS JOIN CustomersTarget1000 AS t
)
-- Actual update
UPDATE MasterTargetCross
SET MasterCustomerId = MCid
WHERE RowNo = 1;

Finally, in the last statements of the batch, we use the fact that we know which key from the master table we should receive in the target table. We are just counting how many rows got a wrong key and comparing this number to the number of rows we updated in this pass. In addition, we are also measuring the time needed to execute this update, as Listing 9 shows.
Listing 9: Code to Measure Efficiency
-- Measuring the efficiency
SET @RowsUpdated = @@ROWCOUNT;
SELECT @RowsUpdated AS RowsUpdated
      ,100.0 * @RowsUpdated / 7790 AS PctUpdated
      ,COUNT(*) AS NumErrors
      ,100.0 * COUNT(*) / @RowsUpdated AS PctErrors
      ,DATEDIFF(S, @starttime, GETDATE()) AS TimeElapsed
FROM dbo.CustomersTarget
WHERE MasterCustomerId <> CustomerId * (-1);

After we execute the update, we can harvest the results of the efficiency measurement, as Figure 5 shows.


Figure 5: RESULTS OF EFFICIENCY MEASUREMENT
The results are not too promising in this case. We updated only 1,000 out of the 7,790 rows we have to update, which is approximately 12.84 percent. With more than 6 percent of errors on that small subset, it seems that a pure Jaro-Winkler algorithm does not satisfy our needs, and it doesn't make sense to apply it to the full set. In addition, 133 seconds for only a small subset of rows is not too encouraging. This is the price of the cross join. Apparently, we need to improve both the accuracy of the matching and the performance. Before developing a better procedure, let's reset the MasterCustomerId column in the target table:
UPDATE dbo.CustomersTarget SET MasterCustomerId = NULL;

Optimizing Mapping with nGrams Filtering
In this section, we focus on a pruning (or filtering) method. The question is whether we can intelligently pre-select a subset of rows from the master table for a match with a single pre-selected batch from the target table. With intelligent selection, we would like to improve both the accuracy and the performance of the matching procedure.
The problem here is how to pre-select these rows. We need to be sure that we do not exclude a row that should be matched with one or more target rows. Besides that, we need reasonably sized subsets; if we pre-select a subset of 9,000 rows from a set of 10,000 rows, we have gained almost nothing, because we would get a nearly full cross join again. In addition, we would like to build some intelligence into selecting batches from the target table: let's first select the rows that have a better chance of matching well with master rows, then rows with a slightly lower chance, then rows with an even lower chance, and so on. What we need is, of course, a procedure that accepts parameters. Finally, we could leave some rows for manual matching. Of course, we would start with exact matching.


Without any further hesitation, let's introduce the pre-selecting algorithm we will use. We will tokenize strings from the master table into substrings of length n. These tokens are called nGrams. In the MDS database, there is already a function, mdq.NGrams, that does this for us. If we do not use MDS, we can write our own CLR or T-SQL function. A T-SQL solution is already written and can be found in SQL Server MVP Deep Dives. After we tokenize the strings from the master table and store these tokens together with the keys in a new table, we calculate the overall (absolute) frequency of the tokens and store it in a token statistics table. With these statistics, we can select for comparison, in each pass, rows with strings that have at least m common nGrams with less than p frequency. The m, n, and p are the parameters we use to select the rows to be matched in each pass. We can start by comparing the rows that have at least 3 (m) common 4(n)Grams with less than 20 (p) absolute frequency. The p parameter helps us start by comparing strings with rare tokens first, with tokens that have low absolute frequency. Then we can use lower values for m and n and higher values for p in the next pass, and so on.
Of course, this solution comes at a price. We have to create nGrams for both the source and target tables. In our example, we will store nGrams for the master (source) table and create nGrams for a batch of target rows on the fly. If we need to do the identity mapping continuously (e.g., from day to day), we could store nGrams for both tables permanently and update them with the help of triggers on the master and target tables. We also need to keep the token statistics table updated. This could be done either with another trigger on the master table or with the help of an indexed view on the tokens table (i.e., the token statistics table could be just an indexed view; a sketch of this option follows below). Triggers and indexes would slow down updates on the master and target tables. Nevertheless, because both tables hold master data, we do not expect too frequent changes there. In addition, the tokens and token statistics tables occupy some space. However, we do not consider this a huge problem. We need only a couple of tables like this, because we usually merge only a couple of entity sets using string matching (typically customers).
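The following is a minimal sketch of the indexed-view option just mentioned. It is not code from this chapter's solution; it assumes a tokens table shaped like the dbo.CustomersMasterNGrams table that Listing 10 creates later in this section, and the view and index names are only illustrative:
-- Sketch only: token statistics maintained automatically as an indexed view
CREATE VIEW dbo.CustomersMasterNGramsFrequencyIV
WITH SCHEMABINDING
AS
SELECT Token
      ,COUNT_BIG(*) AS Cnt -- COUNT_BIG(*) is required in an indexed view with GROUP BY
FROM dbo.CustomersMasterNGrams
GROUP BY Token;
GO
-- Materializing the view; SQL Server then maintains the counts on every change to the tokens table
CREATE UNIQUE CLUSTERED INDEX CIX_NGramsFrequencyIV
  ON dbo.CustomersMasterNGramsFrequencyIV (Token);
GO
Once indexed, such a view can be queried just like the dbo.CustomersMasterNGramsFrequency table created in Listing 11, at the cost of slightly slower modifications of the tokens table.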


In summary, pre-selecting comes with a price. The price is quite low if we do the merging continuously. For a single merge, it might be simpler to use a blocking technique to reduce the search space. But we don't need to guess, because we have developed a whole infrastructure for measuring the efficiency.
As mentioned, in our example we will show a solution that stores nGrams for the master table only and calculates nGrams for the target table on the fly. In addition, we will store the nGrams statistics in a table as well. The code in Listing 10 creates a table to store nGrams from the master table and then populates the new table. Note that the n parameter in this case is 4; we are using 4Grams. In addition, before actually extracting the nGrams, we standardize all strings to upper case.
Listing 10: Code to Create Table to Store nGrams from the Master Table
CREATE TABLE dbo.CustomersMasterNGrams
(
  CustomerId int NOT NULL,
  Token char(4) NOT NULL,
  PRIMARY KEY (Token, CustomerId)
);
GO
INSERT INTO dbo.CustomersMasterNGrams
SELECT DISTINCT -- Store one NGram for one customer only once
       m.CustomerId
      ,g.Token
FROM dbo.CustomersMaster AS m
 CROSS APPLY
  (SELECT Token, Sequence
   FROM mdq.NGrams(UPPER(m.FullName + m.StreetAddress + m.CityRegion), 4, 0)) AS g
WHERE CHARINDEX(',', g.Token) = 0 -- exclude tokens that contain a comma
  AND CHARINDEX(' ', g.Token) = 0; -- or a space
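If you want to see what the tokenizer returns for a single value before populating the table, a quick ad hoc query such as the following can help. This is just a sketch; it simply reuses the mdq.NGrams call and the Token and Sequence output columns from Listing 10 on a literal string:
-- Sketch: inspect the 4-character tokens produced for one string
SELECT Token, Sequence
FROM mdq.NGrams(UPPER(N'Zoe Rogers'), 4, 0)
ORDER BY Sequence;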

We do not show the code and results of checking the content of this table here. Let's directly create the nGrams frequency table and store the frequencies there, with the code in Listing 11.
Listing 11: Code to Create the nGrams Frequency Table
CREATE TABLE dbo.CustomersMasterNGramsFrequency
(
  Token char(4) NOT NULL,
  Cnt int NOT NULL,
  RowNo int NOT NULL,
  PRIMARY KEY (Token)
);
GO
INSERT INTO dbo.CustomersMasterNGramsFrequency
SELECT Token
      ,COUNT(*) AS Cnt
      ,ROW_NUMBER() OVER(ORDER BY COUNT(*)) AS RowNo
FROM dbo.CustomersMasterNGrams
GROUP BY Token;

The following query’s results display which nGrams values are the most frequent, as Figure 6 shows.
SELECT * FROM dbo.CustomersMasterNGramsFrequency ORDER BY RowNo DESC;

Figure 6: NGRAMS FREQUENCY
In order to fully comprehend the impact of storing the tokens and their frequency, let's check the space used by the original and tokens tables. The following code checks the space used; Figure 7 shows the results.
EXEC sp_spaceused 'dbo.CustomersMaster';
EXEC sp_spaceused 'dbo.CustomersTarget';
EXEC sp_spaceused 'dbo.CustomersMasterNGrams';
EXEC sp_spaceused 'dbo.CustomersMasterNGramsFrequency';


Figure 7: SPACE USED
From the results, we can see that the nGrams table actually occupies much more space, approximately ten times more, than the original table. Of course, our original CustomersMaster table does not have many columns. In a real-world situation, the tokens table would probably use approximately the same or even less space than the source table.
Now we have the entire infrastructure we need to start doing the actual matching. Again, we are starting with exact matches. The UPDATE code in Listing 12 performs the exact matching on all three common columns; this is the same query we already used in the partitioning procedure.
Listing 12: Code to Perform Matching on Common Columns
WITH Matches AS
(
SELECT t.MasterCustomerId
      ,m.CustomerId AS MCid
      ,t.CustomerId AS TCid
FROM dbo.CustomersTarget AS t
 INNER JOIN dbo.CustomersMaster AS m
  ON t.FullName = m.FullName
     AND t.StreetAddress = m.StreetAddress
     AND t.CityRegion = m.CityRegion
)
UPDATE Matches
SET MasterCustomerId = MCid;

Now we can filter out exact matches in all of the following steps. If we have a large number of rows, we can also use only a batch of still unmatched rows of the target table combined with nGrams pre-selecting. This means we could actually combine the nGrams filtering with the partitioning technique. However, because the number of rows to match is already small enough in this proof-of-concept project, we do not split the unmatched rows in the target table into batches here.
We will measure the efficiency of the matching on the fly, while we are doing the match. The first part of the code, as in the partitioning technique, declares two variables to store the number of rows updated in this pass and the start time. The actual merging query uses four CTEs. In the first one, we are tokenizing the target table rows into 4Grams on the fly. Because we are using 4Grams, the n parameter is equal to 4 in this pass. This part of the code is shown in Listing 13.
Listing 13: Variable Declaration and the First CTE
-- Variables to store the number of rows updated in this pass
-- and start time
DECLARE @RowsUpdated AS int, @starttime AS datetime;
SET @starttime = GETDATE();
-- Tokenize target table rows
WITH CustomersTargetNGrams AS
(
SELECT t.CustomerId AS TCid
      ,g.Token AS TToken
      ,t.MasterCustomerId
FROM dbo.CustomersTarget AS t
 CROSS APPLY
  (SELECT Token, Sequence
   FROM mdq.NGrams(UPPER(t.FullName + t.StreetAddress + t.CityRegion), 4, 0)) AS g
WHERE t.MasterCustomerId IS NULL
  AND CHARINDEX(',', g.Token) = 0 -- exclude tokens that contain a comma
  AND CHARINDEX(' ', g.Token) = 0 -- or a space
),

The next CTE selects only target rows with 4Grams with absolute frequency less than or equal to 20. Thus, the p parameter is 20 in this case. The code is shown in Listing 14.
Listing 14: The Second CTE

-- Target rows with 4Grams, with absolute frequency less than or equal to 20
NGramsMatch1 AS
(
SELECT tg.TCid
      ,tg.MasterCustomerId
      ,mg.CustomerId AS MCid
      ,tg.TToken
      ,f.Cnt
FROM CustomersTargetNGrams AS tg
 INNER JOIN dbo.CustomersMasterNGramsFrequency AS f
  ON tg.TToken = f.Token AND
     f.Cnt <= 20
 INNER JOIN dbo.CustomersMasterNGrams AS mg
  ON f.Token = mg.Token
),

In the third CTE, we are selecting only matches that have in common at least three less frequent 4Grams. Thus, the m parameter is 3 in this example, as shown in Listing 15.
Listing 15: The Third CTE
-- Matches that have in common at least three less frequent 4Grams
NGramsMatch2 AS
(
SELECT TCid
      ,MCid
      ,COUNT(*) AS NMatches
FROM NGramsMatch1
GROUP BY TCid, MCid
HAVING COUNT(*) >= 3
),

The last CTE then compares the strings and adds the Jaro-Winkler similarity coefficient to the output. In addition, it adds to the output the row number, sorted by the Jaro-Winkler similarity coefficient in descending order and partitioned by the target table key. A row number equal to 1 means the row with the highest Jaro-Winkler similarity coefficient for a specific target table key. This way, we mark the master table row that, according to the Jaro-Winkler algorithm, is the most similar to the target row. The code for the fourth CTE, which calculates the Jaro-Winkler coefficient and the row number, is shown in Listing 16. Fortunately, this is the last CTE in this long query; after this CTE, the only thing left to do is the actual UPDATE. It updates the target table rows with the key from the master table row with the highest similarity. The UPDATE statement is also shown in Listing 16.
Listing 16: The Fourth CTE and the UPDATE Statement

-- Adding Jaro-Winkler coefficient and row number
NGramsMatch3 AS
(
SELECT t.CustomerId AS TCid
      ,m.CustomerId AS MCid
      ,t.MasterCustomerId
      ,mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
                      t.FullName + t.StreetAddress + t.CityRegion,
                      2, 0.85, 0.00) AS JaroWinkler
      ,ROW_NUMBER() OVER (PARTITION BY t.CustomerId
         ORDER BY mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
                                 t.FullName + t.StreetAddress + t.CityRegion,
                                 2, 0.85, 0.00) DESC) AS RowNo
      ,ngm2.NMatches
FROM NGramsMatch2 AS ngm2
 INNER JOIN dbo.CustomersTarget AS t
  ON ngm2.TCid = t.CustomerId
 INNER JOIN dbo.CustomersMaster AS m
  ON ngm2.MCid = m.CustomerId
)
-- Actual update
UPDATE NGramsMatch3
SET MasterCustomerId = MCid
WHERE RowNo = 1;

We use the code in Listing 17 to measure the efficiency, and show the results in Figure 8.
Listing 17: Measuring the Efficiency
-- Measuring the efficiency
SET @RowsUpdated = @@ROWCOUNT;
SELECT @RowsUpdated AS RowsUpdated
      ,100.0 * @RowsUpdated / 7790 AS PctUpdated
      ,COUNT(*) AS NumErrors
      ,100.0 * COUNT(*) / @RowsUpdated AS PctErrors
      ,DATEDIFF(S, @starttime, GETDATE()) AS TimeElapsed
FROM dbo.CustomersTarget
WHERE MasterCustomerId <> CustomerId * (-1);

Figure 8: RESULTS OF MEASURING THE EFFICIENCY
The results are much better in this case. We updated 6,847 out of the 7,790 rows we have to update, which is approximately 87.89 percent. With only 0.48 percent of errors, we can see now that the complication with intelligent pre-selecting makes sense. In addition, the performance is orders of magnitude better than we achieved with the partitioning procedure. We needed only 7


seconds to match 6,847 rows, whereas we needed 133 seconds to match only 1,000 rows with the partitioning procedure. But is the nGrams filtering method always that efficient?

Comparing nGrams Filtering with Partitioning
In order to make a fairer comparison, we need to play with different parameters of the nGrams filtering method. We admit that in the section where we described the method, we used parameters that give us, according to many tests, very good results. Some of the most important results of multiple tests are aggregated in Table 1.
Table 1: Results of nGrams Filtering

Method             n    p    m    Rows to match   Rows matched   Percent matched   Number of errors   Percent of errors   Elapsed time (s)
Partitioning       NA   NA   NA   7790            1000           12.83             63                 6.30                133
Partitioning       NA   NA   NA   7790            2000           25.67             93                 4.65                262
nGrams filtering   3    20   3    7790            7391           94.88             232                3.14                223
nGrams filtering   3    20   2    7790            7717           99.06             322                4.17                264
nGrams filtering   4    50   3    7790            7705           96.39             48                 0.64                11
nGrams filtering   4    50   2    7790            7716           99.05             106                1.37                14
nGrams filtering   4    20   3    7790            6847           87.89             33                 0.48                7
nGrams filtering   4    20   2    7790            7442           95.53             145                1.95                7
nGrams filtering   4    10   2    7790            6575           84.40             188                2.86                4
nGrams filtering   4    10   3    7790            5284           67.83             28                 0.53                4
As you can see from the table, it is important to choose the correct values for the parameters of the nGrams filtering method. For example, when using two common 3Grams with absolute frequency less than or equal to 20, neither the performance (elapsed time 264 s) nor the accuracy (4.17 percent of errors) was impressive. Nevertheless, the table proves that the nGrams filtering method works. In all cases in the table, it is more efficient than the partitioning method. With proper values for the parameters, it gives quite astonishing results. In addition, note that in a real example, not as many errors would be present in the target table rows, and the results could be even better. We should also try matching with other algorithms, to check which one is the most suitable for our data. In addition, we could start with stricter values for the three parameters and have even fewer errors in the first pass.
So, did we find the best possible method for identity mapping? Before making such a strong conclusion, let's try the last option for matching we have in SQL Server out of the box, the SSIS Fuzzy Lookup data flow transformation.


Microsoft Fuzzy Components
The Fuzzy Lookup and Fuzzy Grouping transformations, which are included in Enterprise, Datacenter and Developer editions of SQL Server only, use a proprietary Microsoft matching algorithm. It is partially documented in SQL Server Books Online; more information can be found in MSDN and TechNet articles. In addition, Microsoft recently published a Fuzzy Lookup add-in for Excel 2010. This add-in’s documentation includes a good explanation of the Fuzzy algorithm. Even more details can be found in the documentation for the Microsoft Research Data Cleaning project.

Fuzzy Algorithm Description
Fuzzy transformations use a quite advanced algorithm for approximate string matching. It actually comprises some other algorithms that we already know. It starts with the Jaccard similarity coefficient. However, the Fuzzy transformations' version is much more advanced: It is actually a weighted Jaccard similarity over tokens. For example, the sets {a, b, c} and {a, c, d} have a Jaccard similarity of 2/4 = 0.5, because the intersection is {a, c} and the union is {a, b, c, d}. You can assign weights to each item in a set and define the weighted Jaccard similarity as the total weight of the intersection divided by the total weight of the union. For example, we added arbitrary weights to the elements of the sets from the previous example to get the weighted sets {(a, 2), (b, 5), (c, 3)} and {(a, 2), (c, 3), (d, 7)}. For these two sets, the weighted Jaccard similarity is (2 + 3) / (2 + 3 + 5 + 7) = 5/17 = 0.294.
Tokens are substrings of the original strings. Fuzzy transformations convert strings to sets of tokens before they calculate the weighted Jaccard similarity. The internal component used for this conversion is called a tokenizer. For example, the row {"Ruben Torres", "5844 Linden Land"} might be tokenized into the set {"Ruben", "Torres", "5844", "Linden", "Land"}. The default tokenizer is for English text. You can change the LocaleId property in the component properties. Note that this is an advanced property, and you need to open the Advanced Editor for the


transformation in order to get to this property (right-click on either the Fuzzy Lookup or the Fuzzy Grouping transformation in SSIS Designer and select the Show Advanced Editor option).
Fuzzy transformations assign weights to tokens. Tokens get higher weights if they occur infrequently and lower weights if they occur frequently. In database texts, for example, frequent words such as "database" might be given a lower weight, whereas less frequent words such as "broker" might be given a higher weight. In the Excel version of Fuzzy Lookup, you can even override the default token weights by supplying your own table of token weights.
Fuzzy components are additionally enhanced with token transformations. Tokens are converted from one string to another. There are many classes of such transformations that Fuzzy components handle automatically, such as spelling mistakes, string prefixes, and string merge/split operations. In the Excel version, you can also define a custom transformation table to specify conversions from one token to another. For example, you can specify that the "Inc" token has to be converted to the "Incorporated" token.
The Jaccard coefficient is further enhanced to Jaccard similarity under transformations. This is the maximum Jaccard similarity between any two transformations of each set. With a given set of transformation rules (either from your table in Excel, or only the built-in rules in SSIS), all possible transformations of the set are considered. For example, for the sets {a, b, c} and {a, c, d} and the transformation rules {b=>d, d=>e}, the Jaccard similarity is computed as follows:
• Variations of {a, b, c}: {a, b, c}, {a, d, c}
• Variations of {a, c, d}: {a, c, d}, {a, c, e}
• Maximum Jaccard similarity between all pairs:
• J({a, b, c}, {a, c, d}) = 2/4 = 0.5
• J({a, b, c}, {a, c, e}) = 2/4 = 0.5
• J({a, d, c}, {a, c, d}) = 3/3 = 1.0
• J({a, d, c}, {a, c, e}) = 2/4 = 0.5
• The maximum is 1.0.
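Going back to the plain weighted Jaccard example above, the arithmetic is easy to reproduce in T-SQL. The following is only a sketch with hypothetical table variables, not part of the Fuzzy components themselves; a FULL OUTER JOIN gives the union of the two weighted sets, and the rows present on both sides form the intersection:
-- Sketch: weighted Jaccard of {(a,2),(b,5),(c,3)} and {(a,2),(c,3),(d,7)}
DECLARE @s1 TABLE (Item char(1) PRIMARY KEY, Weight int);
DECLARE @s2 TABLE (Item char(1) PRIMARY KEY, Weight int);
INSERT INTO @s1 VALUES ('a', 2), ('b', 5), ('c', 3);
INSERT INTO @s2 VALUES ('a', 2), ('c', 3), ('d', 7);
SELECT CAST(SUM(CASE WHEN s1.Item IS NOT NULL AND s2.Item IS NOT NULL
                     THEN COALESCE(s1.Weight, s2.Weight) ELSE 0 END) AS float)
       / SUM(COALESCE(s1.Weight, s2.Weight)) AS WeightedJaccard -- 5 / 17 = 0.294
FROM @s1 AS s1
 FULL OUTER JOIN @s2 AS s2
  ON s1.Item = s2.Item;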


Fuzzy components also use the Edit (Levenshtein) distance. The algorithm for Edit distance was described earlier in this chapter. Fuzzy components include an additional internal transformation provider called EditTransformationProvider, which generates specific transformations for each string and creates a transformation from the token to all words in its dictionary that are within a given edit distance. The normalized edit distance is the edit distance divided by the length of the input string.
As you can see, Fuzzy transformations use quite an advanced algorithm, which combines many public algorithms and some internal components. But can we expect better results than we obtained with the public algorithms provided in the MDS database? Let's run some tests, starting with the SSIS Fuzzy Lookup transformation.

Configuring SSIS Fuzzy Lookup
To test the Fuzzy Lookup transformation, we need to create a new empty target table, as shown in Listing 18.
Listing 18: A New Table for Fuzzy Lookup Matching
CREATE TABLE dbo.FuzzyLookupMatches
(
  CustomerId int,
  FullName nvarchar(200),
  StreetAddress nvarchar(200),
  CityRegion nvarchar(200),
  MaritalStatus nvarchar(1),
  Updated int,
  MasterCustomerId int,
  MCid int
);

Of course, even Fuzzy Lookup cannot perform magic. If we try to match every row from the left table with every row from the right table, we still get a full cross join. Therefore, it is wise to do some optimization here as well. In the SSIS package, we start with exact matches again, using the Lookup transformation. The target table is the source, and the master table is the lookup table for the Lookup transformation. The rows that did not match (i.e., the “Lookup no Match Output”) are then
driven to the Fuzzy Lookup transformation. After the exact and approximate matches are done, we make a union of all rows and send them to the output. If you want to test Fuzzy Lookup yourself, follow these steps.
1. Open Business Intelligence Development Studio (BIDS).
2. In BIDS, create a new SSIS project. Name the solution MDSBook_Ch04, the project MDSBook_Ch04_SSIS, and save the solution to any folder you want (we suggest C:\MDSBook\Chapter04).
3. In Solution Explorer, right-click on the default package with the name Package.dtsx. Rename it to FuzzyLookup.dtsx. Click OK in the message box window to rename the package object.
4. From the Toolbox, drag the Data Flow Task to the Control Flow working area. In SSIS Designer, click on the Data Flow tab.
5. From the Toolbox, drag the OLE DB Source to the Data Flow working area. Right-click on it, then select the Rename option, and rename it to CustomersTarget.
6. Double-click on the source to open the OLE DB Source Editor. In the Connection Manager tab, create a new connection to your SQL Server system with the MDSBook database. Select the dbo.CustomersTarget table in the Name of the table or the view drop-down list. Your data source should be similar to the one shown in Figure 9.


Figure 9: CONFIGURING THE DATA SOURCE
7. Click on the Columns tab to get the columns mapping. Click OK to close the OLE DB Source Editor.
8. Drag the Lookup transformation to the canvas. Rename it to Exact Matches. Connect it with the source using the green arrow from the CustomersTarget data source. Double-click on it to open the editor.
9. In the Lookup Transformation Editor window, in the General tab, use the Specify how to handle rows with no matching entries drop-down list to select the option Redirect rows to no match output.
10. Use the Connection tab to specify the lookup table. Use the same connection manager as you created for the data source. Using the Use a table or view option, select the dbo.CustomersMaster table.


11. In the Columns tab, drag and drop the FullName column from the left table to the FullName column in the right table. Repeat this for the StreetAddress and CityRegion columns. These three columns will be used for matching rows. Click on the check box near the CustomerId column in the right table. This is the column we are adding to our CustomersTarget table. In the Output Alias cell, rename the CustomerId column to MCid. Your Columns tab of the Lookup Transformation Editor should be similar to the one in Figure 10. Click OK to close the editor.

Figure 10: CONFIGURING THE LOOKUP TRANSFORMATION
12. Drag and drop the Fuzzy Lookup transformation to the canvas. Rename it to Approximate Matches. Connect it with the Lookup transformation using Lookup no Match Output. Double-click on it to open the Fuzzy Lookup Transformation Editor.


13. In the Reference Table tab, use the same connection manager as for the Lookup transformation reference table, and again select the dbo.CustomersMaster table for the reference table name.
14. Check the Store new index check box. By storing the index, you can optimize subsequent executions (see Books Online and the sidebar "Storing an Index"). Your Reference Table tab should resemble the one in Figure 11.

Figure 11: CONFIGURING THE REFERENCE TABLE TAB

Storing an Index
When the package first runs the transformation, the transformation copies the reference table, adds a key with an integer data type to the new table, and builds an index on the key column. Next, the transformation builds an index, called a match index, on the copy of the reference table. The match index stores the results of tokenizing the values in the


transformation input columns, and the transformation then uses the tokens in the lookup operation. The match index is a table in a SQL Server 2000 or later database. When the package runs again, the transformation can either use an existing match index or create a new index. If the reference table is static, the package can avoid the potentially expensive process of rebuilding the index for repeat sessions of data cleaning.
15. Click on the Columns tab. Delete the connection between the left and right CustomerId columns. Check the box near the right CustomerId column. Rename the output column to MCid. If connections between the FullName, StreetAddress, and CityRegion columns from the left and the right do not exist, create them. These settings are the same as the column settings for the Lookup transformation.
16. Click on the Advanced tab. Set the similarity threshold to 0.50 (see the sidebar "Similarity Threshold"). Check that Maximum number of matches to output per lookup is set to 1. We are selecting the best match according to the Fuzzy Lookup only, like we did when we matched rows using the Jaro-Winkler algorithm. Your Advanced tab should be similar to the one in Figure 12. Click OK.


Figure 12: THE ADVANCED TAB OF THE FUZZY LOOKUP TRANSFORMATION EDITOR

Similarity Threshold
The closer the value of the similarity threshold is to 1, the closer the resemblance of the lookup value to the source value must be to qualify as a match. Increasing the threshold can improve the speed of matching because fewer candidate records need to be considered. You can optimize the matching by having multiple Fuzzy Lookups: first you match rows with a higher similarity threshold, then you lower the threshold, then lower it again, and so on. Because each pass has fewer rows left to match, you can control the performance.

17. Drag and drop the Union All transformation to the working area. Connect it with the Lookup Match Output from the Lookup transformation and with the default output from the Fuzzy Lookup transformation.


18. Drag and drop OLE DB Destination to the working area. Rename it to FuzzyLookupMatches. Connect it with the green arrow from the Union All transformation. Double-click on it to open the editor.
19. In the Connection Manager tab, use the same connection manager to the MDSBook database as in all sources and transformations in this exercise. Select the dbo.FuzzyLookupMatches table in the Name of the table or the view drop-down list.
20. Click on the Mappings tab to check whether the column mappings are correct. Because we used the same column names throughout the package and we used the same names in the destination table, the automatic mapping should be correct. Click OK to close the OLE DB Destination Editor. Your data flow of the package should be similar to the one that Figure 13 shows.


Figure 13: THE COMPLETE DATA FLOW FOR FUZZY LOOKUP

21. Save the project. Do not exit BIDS.

Testing SSIS Fuzzy Lookup
Now it is time to test how the Fuzzy Lookup transformation works. At this point, we need to execute the package. After execution, we need to measure the results. Let's start by calculating the number of matched rows or, more simply, the number of unmatched rows:
SELECT COUNT(*) FROM dbo.FuzzyLookupMatches WHERE MCid IS NULL;

From this number, it is easy to calculate the number of matched rows. In my case, the number of unmatched rows was 608. Because I had 7,790 rows to match, this means that Fuzzy Lookup actually matched 92.20 percent of the rows. With the following query, we can measure the number of errors:
SELECT * FROM dbo.FuzzyLookupMatches WHERE CustomerId <> MCid * (-1);

In my case, the result was fantastic: no errors at all. At first glance, it looked too good to be true to me. Therefore, I ran many additional tests, and only here and there got an error. Out of more than 20 tests, I got two errors only once! What about the execution time? After you execute the package, you can read the execution time on the Execution Results tab of the SSIS Designer. In my case, the execution (or elapsed) time was 15 seconds. However, this time included the time needed to read the data, perform exact matches with the Lookup transformation, run the Fuzzy Lookup transformation, union the data, and write the data to the destination.
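If you prefer to get the match rate and the error count in a single pass rather than calculating them by hand, a query along the following lines can be used. This is only a convenience sketch against the dbo.FuzzyLookupMatches destination table created in this exercise; it simply combines the two checks shown above.

-- A sketch: overall match rate and error count in one query
SELECT
    COUNT(*) AS RowsToMatch,
    SUM(CASE WHEN MCid IS NOT NULL THEN 1 ELSE 0 END) AS MatchedRows,
    100.0 * SUM(CASE WHEN MCid IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS MatchedPct,
    SUM(CASE WHEN MCid IS NOT NULL AND CustomerId <> MCid * (-1)
             THEN 1 ELSE 0 END) AS Errors   -- same error rule as the query above
FROM dbo.FuzzyLookupMatches;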


Let's compare the Fuzzy Lookup results with the best results we got using public algorithms. With public algorithms, we matched 87.89 percent of the rows to match, with 0.48 percent of errors, in 7 seconds (approximate match only). With Fuzzy Lookup, we matched 92.20 percent of the rows to match, with 0.00 percent of errors, in 15 seconds (exact and approximate match). As we could expect, Microsoft's Fuzzy algorithm really works. However, you should still test other algorithms on your data. There is no guarantee that Fuzzy Lookup will outperform all other algorithms on every kind of data. That's why we developed different methods using different algorithms for identity mapping.

Fuzzy Lookup Add-In for Excel
After you download and install the Fuzzy Lookup add-in for Excel, you get this power of identity mapping for Excel tables as well. There is not much to add from the algorithm perspective; it is the same Fuzzy Lookup algorithm we described earlier. Exhaustive instructions on how to use the add-in are provided in a PDF document and in a demo Excel file. Figure 14 shows a quick overview of the Excel demo file with Fuzzy Lookup prepared for configuration.


Figure 14: EXCEL FUZZY LOOKUP


De-Duplicating
De-duplicating is very similar to, or more accurately the same problem as, identity mapping. SSIS has a separate Fuzzy Grouping transformation for this task; it groups rows based on string similarities. Let's see how we can use either of the two Fuzzy transformations for both problems.

Let's start with using Fuzzy Grouping for identity mapping. This is a very simple task. We can just make a non-distinct union (in T-SQL, use UNION ALL rather than the UNION operator) of all rows from both the master and target tables, and then perform the grouping on this union. The result is the same as we would get with identity mapping (a short sketch of such a union appears after Listing 19).

To turn the problem around: How could we perform de-duplication with Fuzzy Lookup or another string similarity merging? We would use the same table twice, once as the master and once as the target table. We would immediately exclude exact matches on all the character columns with the same keys (i.e., matches of a row with itself). In the next step, we would perform the identity mapping of exact matches of character columns of rows with different keys. Then we would perform approximate matching on the rest of the rows. Finally, we could delete all the rows that got a match (i.e., the same identification) except one. The Excel Fuzzy Lookup add-in can actually be used for de-duplicating as well: for the left and the right table for matching, you can define the same Excel table.

Nevertheless, you might find it easier to perform the de-duplicating by using the SSIS Fuzzy Grouping transformation. Therefore, let's test it. Before testing, let's prepare some demo data. We will add all the rows from the CustomersTarget table to the CustomersMaster table and then try to de-duplicate them. We can add the rows with the INSERT query shown in Listing 19.

Listing 19: Adding Duplicate Rows to the CustomersMaster Table

INSERT INTO dbo.CustomersMaster
 (CustomerId
 ,FullName
 ,StreetAddress
 ,CityRegion
 ,NumberCarsOwned)
SELECT t.CustomerId
 ,t.FullName
 ,t.StreetAddress
 ,t.CityRegion
 ,NULL
FROM dbo.CustomersTarget AS t;
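For the opposite scenario mentioned above, identity mapping through Fuzzy Grouping, the non-distinct union of both tables could be prepared with a query along these lines. This is only a sketch using the demo tables from this chapter, and it assumes the tables as they were before Listing 19 added the duplicates; the negated CustomerId convention for the target table is kept so that the origin of each row stays recognizable.

-- A sketch: prepare a single input for Fuzzy Grouping-based identity mapping
SELECT CustomerId, FullName, StreetAddress, CityRegion
FROM dbo.CustomersMaster
UNION ALL                 -- keep all rows; UNION would remove exact duplicates
SELECT CustomerId, FullName, StreetAddress, CityRegion
FROM dbo.CustomersTarget;

Feeding this union into Fuzzy Grouping and reading the resulting groups gives the same pairing information as the identity mapping we performed with Fuzzy Lookup.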

Preparing for Fuzzy Grouping
Before we start using the Fuzzy Grouping transformation, we have to think about the algorithm. We need to optimize it similarly to how we optimized the identity mapping process, so we have to perform exact matches first. We will use the Aggregate transformation for this task and group by the FullName, StreetAddress, and CityRegion columns.

However, we need to decide which CustomerId to keep. For this example, let's say we keep the higher Id. This is quite a plausible scenario, because a higher Id could mean that the row came into our system later than the one with the lower Id and might therefore be cleaner. Because we have negative CustomerId values in the CustomersTarget table, this means we will retain the Id from the CustomersMaster table. (A T-SQL sketch of this exact-match step follows Listing 20.)

After we have the Id, we need to find the correct value for the NumberCarsOwned column. We will use the Lookup transformation and perform a lookup on the same table we are de-duplicating (i.e., the CustomersMaster table). Because we will have the higher Ids only in the aggregated table, we will get the correct match and read the value of the NumberCarsOwned attribute from the correct row.

First, we have to prepare the destination table. Besides the original columns, we are adding the columns that the Fuzzy Grouping transformation provides. We are using the default Fuzzy Grouping names, as shown in Listing 20.

Listing 20: Table for Fuzzy Grouping Matches
CREATE TABLE dbo.FuzzyGroupingMatches
(
 _key_in int NULL,
 _key_out int NULL,
 _score real NULL,
 CustomerId int NULL,
 FullName nvarchar(200) NULL,
 StreetAddress nvarchar(200) NULL,
 CityRegion nvarchar(200) NULL,
 NumberCarsOwned tinyint NULL,
 FullName_clean nvarchar(200) NULL,
 StreetAddress_clean nvarchar(200) NULL,
 CityRegion_clean nvarchar(200) NULL,
 _Similarity_FullName real NULL,
 _Similarity_StreetAddress real NULL,
 _Similarity_CityRegion real NULL
);
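As mentioned before Listing 20, the exact-match step that the Aggregate transformation performs can also be expressed in T-SQL. The following query is only an illustrative sketch of that logic, not part of the package; it groups on the three character columns and keeps the higher (master) CustomerId per group.

-- A sketch: the exact-match grouping the Aggregate transformation performs
SELECT
    MAX(CustomerId) AS CustomerId,   -- keep the higher Id (the CustomersMaster one)
    FullName,
    StreetAddress,
    CityRegion
FROM dbo.CustomersMaster             -- after Listing 19, this table holds both sets of rows
GROUP BY FullName, StreetAddress, CityRegion;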

There will be some additional work to do after Fuzzy Grouping finishes. The transformation adds the following columns to the output:
- _key_in, a column that uniquely identifies each row for the transformation;
- _key_out, a column that identifies a group of duplicate rows. The _key_out value for a group corresponds to the value of _key_in in the canonical data row. The canonical row is the row that Fuzzy Grouping identified as the most plausible correct row and used for comparison (i.e., the row used for standardizing data). Rows with the same value in _key_out are part of the same group. We could suppose that we should keep only the canonical row from each group;
- _score, a value between 0 and 1 that indicates the similarity of the input row to the canonical row. For the canonical row, _score has a value of 1.
In addition, Fuzzy Grouping adds columns used for approximate string comparison with clean values. Clean values are the values from the canonical row. In our example, these columns are the FullName_clean, StreetAddress_clean, and CityRegion_clean columns. Finally, the transformation adds columns with similarity scores for each character column used for approximate string comparison. In our example, these are the _Similarity_FullName, _Similarity_StreetAddress, and _Similarity_CityRegion columns.
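If you do decide to keep only the canonical row from each group, that is easy to express once the results are in the destination table. The following query is a sketch of the idea, assuming the dbo.FuzzyGroupingMatches table from Listing 20; canonical rows are the ones whose _key_in equals their _key_out.

-- A sketch: keep only the canonical row of each Fuzzy Grouping group
SELECT CustomerId, FullName, StreetAddress, CityRegion, NumberCarsOwned
FROM dbo.FuzzyGroupingMatches
WHERE _key_in = _key_out;        -- canonical rows reference themselves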


SSIS Fuzzy Grouping Transformation
We will create a new SSIS package for the de-duplicating problem. Use the following steps:
1. If you closed it, open BIDS and re-open the MDSBook_Ch04_SSIS project.
2. In Solution Explorer, right-click on the Packages folder and add a new package. Rename it to FuzzyGrouping.dtsx. Click OK in the message box window to rename the package object as well.
3. From the Toolbox, drag the Data Flow Task to the Control Flow working area. In SSIS Designer, click on the Data Flow tab.
4. From the Toolbox, drag the OLE DB Source to the Data Flow working area. Right-click on it, then select the Rename option and rename it to CustomersMaster.
5. Double-click on the source to open the OLE DB Source Editor. In the Connection Manager tab, create a new connection to your SQL Server system with the MDSBook database. Select dbo.CustomersMaster in the Name of the table or the view drop-down list.
6. Click on the Columns tab to get the columns mapping. Click OK.
7. Drag the Aggregate transformation to the working area. Rename it to Exact Matches. Connect it with the source using the green arrow from the CustomersMaster data source. Double-click on it to open the editor.
8. Configure the FullName, StreetAddress, and CityRegion columns for the Group by operation. Use the Maximum operation for the CustomerId column. Click OK. Figure 15 shows the Aggregate transformation configuration.


Figure 15: CONFIGURING THE AGGREGATE TRANSFORMATION

9. Drag the Lookup transformation to the canvas. Rename it to Adding NumberCarsOwned. Connect it with the Aggregate transformation using the green arrow from the Exact Matches (Aggregate) transformation. Double-click on it to open the editor.
10. In the Lookup Transformation Editor window, in the General tab, use the default settings.
11. Use the Connection tab to specify the lookup table. Use the same connection manager as you created for the data source. Using the Use a table or view option, select the dbo.CustomersMaster table.
12. Click on the Columns tab. Make sure that the only connection is between the CustomerId columns from the left and right. Use NumberCarsOwned as the lookup column, and make sure you use the same name for the output alias. Figure 16 shows the correct configuration.

Figure 16: COLUMN USAGE FOR THE LOOKUP TRANSFORMATION

13. Drag the Fuzzy Grouping transformation to the working area. Rename it to Approximate Matches. Connect it to the Lookup transformation using the Lookup Match Output. Double-click on it to open the Fuzzy Grouping Transformation Editor.
14. In the Connection Manager tab, use the same connection manager as in the previous steps.
15. In the Columns tab, use FullName, StreetAddress, and CityRegion as the grouping columns, and CustomerId and NumberCarsOwned as pass-through columns, as Figure 17 shows.


Figure 17: COLUMN USAGE FOR THE FUZZY GROUPING TRANSFORMATION

16. Click on the Advanced tab and check the advanced settings, especially the Input, Output, and Similarity Score column names. Make sure you are using the same names as in the dbo.FuzzyGroupingMatches table. Set the similarity threshold to 0.50 (as we did for the Fuzzy Lookup transformation when we did identity mapping). Click OK to close the Fuzzy Grouping Editor.
17. Drag and drop OLE DB Destination to the working area. Rename it to FuzzyGroupingMatches. Connect it with the green arrow from the Fuzzy Grouping transformation. Double-click on it to open the editor.
18. In the Connection Manager tab, use the same connection manager to the MDSBook database as in all sources and transformations in this exercise. Select the dbo.FuzzyGroupingMatches table in the Name of the table or the view drop-down list.
19. Click on the Mappings tab to check whether the column mappings are correct. Because we used the same column names throughout the package and the same names in the destination table, the automatic mapping should be correct. Click OK to close the OLE DB Destination Editor. Your data flow of the package should be similar to that in Figure 18.

Figure 18: THE COMPLETE DATA FLOW FOR FUZZY GROUPING
20. Save the project. Do not exit BIDS.


Testing SSIS Fuzzy Grouping
Now it’s time to execute the package! After the execution, we have to measure the results. Let’s start with execution time. In my test, it was about 57 seconds. Therefore, it seems like it was less efficient than Fuzzy Lookup. In addition, we did not finish yet. We have to check the content of the destination table. A very simple query gives us an overview of the results, which Figure 19 shows.
SELECT * FROM dbo.FuzzyGroupingMatches

Figure 19: RESULTS OF THE FUZZYGROUPINGMATCHES QUERY

At first glance, the results are quite satisfying. Rows with _key_out (the second column) equal to 1 (the first two rows) were correctly identified as duplicates, and the row with the positive CustomerId (the first row in the output, with CustomerId 14348) was identified as the canonical row (_key_in equal to 1). However, let's run some additional tests. In our next query, let's count the number of duplicate rows in each group (i.e., the number of rows with the same _key_out value).
SELECT _key_out
 ,COUNT(_key_out) AS NumberOfDuplicates
FROM dbo.FuzzyGroupingMatches
GROUP BY _key_out
ORDER BY NumberOfDuplicates DESC;


From our data, we would expect that the maximum number of rows in a group would be 2. However, the results in Figure 20 show a different picture.

Figure 20: RESULTS OF QUERYING THE NUMBER OF _KEY_OUT DUPLICATES

In this test, we got as many as 19 rows identified as duplicates in a single group! Of course, we could set a higher similarity threshold and get more accurate matches for duplicates. However, we set it to 0.50 to have a direct comparison of the efficiency of Fuzzy Grouping with the Fuzzy Lookup transformation. Let's check the group with the highest number of identified duplicates visually, with the query shown in Listing 21.

Listing 21: Checking Rows with a High Number of Duplicates
WITH NumberOfDuplicatesCTE AS
(
SELECT _key_out
 ,COUNT(_key_out) AS NumberOfDuplicates
FROM dbo.FuzzyGroupingMatches
GROUP BY _key_out
)
SELECT *
FROM dbo.FuzzyGroupingMatches
WHERE _key_out =
 (SELECT TOP 1 _key_out
  FROM NumberOfDuplicatesCTE
  ORDER BY NumberOfDuplicates DESC);


Figure 21 shows the results.

Figure 21: RESULTS OF CHECKING THE ROWS WITH A HIGH NUMBER OF DUPLICATES

From the results, we can see that the canonical row was not identified properly for all the rows. For example, the canonical row for the sixth row (i.e., the row with CustomerId -25321) should be the second row (i.e., the row with CustomerId 25321). Many correct rows with positive CustomerId values were identified incorrectly as duplicates of the first row (the row with CustomerId equal to 25320), which was identified as the canonical row for this group. Apparently, we would have to perform more manual work to finish de-duplicating by using the Fuzzy Grouping transformation than by using the Fuzzy Lookup transformation.

Of course, we should experiment more with Fuzzy Grouping, using different similarity threshold settings. We could perform a consecutive procedure, de-duplicating with a high similarity threshold first, then lowering it a bit, then lowering it more, and so on. Nevertheless, it seems that the Fuzzy Lookup transformation could be more suitable for de-duplicating than Fuzzy Grouping. Not only did it give us better results, but it also easily managed to outperform Fuzzy Grouping.
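One way to reduce the manual work is to treat the Fuzzy Grouping output as a candidate list and accept only the strongest groupings. The following query is just a sketch of that idea against the dbo.FuzzyGroupingMatches table; the 0.85 cutoff is an arbitrary example value you would tune on your own data.

-- A sketch: review only non-canonical rows whose overall score clears a stricter cutoff
SELECT _key_in, _key_out, _score,
       CustomerId, FullName, StreetAddress, CityRegion
FROM dbo.FuzzyGroupingMatches
WHERE _key_in <> _key_out        -- skip canonical rows
  AND _score >= 0.85             -- accept only high-confidence duplicates; tune as needed
ORDER BY _key_out, _score DESC;

Rows below the cutoff would be left for a later pass with a lower threshold or for manual review, which mirrors the consecutive procedure described above.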

Clean-Up
To clean up the MDSBook database, use the code from Listing 22.

Listing 22: Clean-Up Code
USE MDSBook;

IF OBJECT_ID(N'dbo.CustomersMaster', N'U') IS NOT NULL
 DROP TABLE dbo.CustomersMaster;
IF OBJECT_ID(N'dbo.CustomersTarget', N'U') IS NOT NULL
 DROP TABLE dbo.CustomersTarget;
IF OBJECT_ID(N'dbo.CustomersMasterNGrams', N'U') IS NOT NULL
 DROP TABLE dbo.CustomersMasterNGrams;
IF OBJECT_ID(N'dbo.CustomersMasterNGramsFrequency', N'U') IS NOT NULL
 DROP TABLE dbo.CustomersMasterNGramsFrequency;
IF OBJECT_ID(N'dbo.FuzzyLookupMatches', N'U') IS NOT NULL
 DROP TABLE dbo.FuzzyLookupMatches;
IF OBJECT_ID(N'dbo.FuzzyGroupingMatches', N'U') IS NOT NULL
 DROP TABLE dbo.FuzzyGroupingMatches;
GO


Summary
As you saw in this chapter, identity mapping and de-duplicating are not simple problems. We developed a custom algorithm for identity mapping using the functions from the MDS database. We tested it on data from the AdventureWorksDW2008R2 demo database. We made errors in the data in a controllable way, so we could measure the results of the tests throughout the chapter. Through the tests, we realized that the quality of the results and the performance of our algorithm are highly dependent on a proper selection of parameters.

After testing the manual procedure, we used the SSIS Fuzzy Lookup transformation for identity mapping. We also introduced the Fuzzy Lookup add-in for Excel, which brings the power of this transformation to advanced users on their desktops. We showed that de-duplicating is actually the same problem as identity mapping. Nevertheless, we also tested the SSIS Fuzzy Grouping transformation.

According to the tests in this chapter, the Fuzzy Lookup transformation is a clear winner. It gave us better results than any other option, including Fuzzy Grouping, with quite astonishing performance. Nevertheless, this does not mean you should always use Fuzzy Lookup for identity mapping and de-duplicating; you should test the other possibilities on your data as well. The tests presented in this chapter are quite exhaustive, so they should take much of the heavy work of identity mapping and de-duplicating off your shoulders.

This is the last chapter in this version of the book. However, stay tuned; there are many exciting new features coming with the next release of SQL Server, code-named Denali. Microsoft has already announced rewritten Master Data Services, improved Integration Services, and a completely new application called Data Quality Services. We will update this book to incorporate these new technologies when they become available.


References
- 53 Microsoft SQL Server MVPs: SQL Server MVP Deep Dives (Manning, 2010)
- Carlo Batini, Monica Scannapieco: Data Quality – Concepts, Methodologies and Techniques (Springer-Verlag, 2006)
- Beyond SoundEx – Functions for Fuzzy Searching in MS SQL Server
- Levenshtein distance on Wikipedia
- Jaccard index on Wikipedia
- Jaro-Winkler distance on Wikipedia
- Ratcliff/Obershelp pattern recognition on the National Institute of Standards and Technology site
- Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005, an MSDN article describing more details about the Fuzzy transformations than Books Online
- Fuzzy Lookup Add-In for Excel
- Microsoft Research Data Cleaning Project


Index
.NET, 7, 11, 47, 57, 60, 133 _key_in, 205, 211 _key_out, 205, 211, 212 a composite candidate key, 135 accuracy, 18, 38, 45, 105, 118, 123, 126, 135, 136, 139, 180, 189 accurate, 6, 16, 126, 166, 170, 212 administrator, 95 Adventure Works2008R2, 122, 138 AdventureWorks2008R2, 107, 112, 121, 122, 126, 127, 149, 154, 158, 160 aggregate functions, 128 Aggregate transformation, 204, 206, 207 aggregation, 30 Alberto Ferrari, 162 algorithm, 38, 40, 44, 53, 121, 136, 156, 166, 167, 173, 175, 176, 178, 186, 189, 190, 192, 201 Analysis Services, 11, 12, 53, 78, 121, 145, 147, 150, 157, 159, 161 analytical application, 13, 14, 16, 17, 40 Anastasios Yalanopoulos, 167 application, 6, 7, 13, 15, 18, 20, 21, 28, 33, 39, 40, 44, 45, 47, 50, 52, 56, 59, 61, 62, 63, 101, 103, 115, 153, 155, 163, 215 Arkady Maydanchik, 162 assessment, 42, 43 association, 23, 120, 121, 124, 125 Attribute Discrimination, 124, 125 Attribute Groups, 75, 76 Attribute Groups Maintenance, 75, 76 Attribute Management, 70 attributes, 48, 49, 66, 69, 71, 74, 75, 82, 99, 100, 128, 146 auditing, 16, 17, 19, 45 authoritative source, 18, 20 authority, 22 Axis Fields (Categories), 151 bias, 177 BIDS, 121, 122, 125, 137, 141, 145, 147, 148, 153, 157, 160, 193, 200, 206, 210 Bill Inmon, 33, 55 bill of material, 14, 15 bill of materials, 16 blocking, 176, 177, 182 Boolean, 133 business intelligence, 11, 142 business interest, 23 business needs, 42, 106 business problem, 10, 14, 29, 40, 42 business requirements, 41, 42, 63 Business Rule Maintenance, 85, 86 business rules, 6, 18, 38, 41, 47, 56, 63, 64, 92 Business Rules, 85, 88, 92, 96 C. J. Date, 24, 55 canonical row, 205, 211, 213 cardinality, 176 Carlo Batini, 216 Cartesian product, 176 central storage, 19, 20, 21, 45 Check constraint, 28 Circular References, 79 class libraries, 47 classify, 35

cleansing, 6, 18, 19, 20, 40, 44, 52 client tier, 28 closed world assumption, 37 Closed world assumption, 106 CLR, 51, 133, 134, 145, 160, 161, 167, 181 Clustering, 53, 157 Code attribute, 76 collation, 21 Collection Attributes, 82 collections, 50, 56, 63, 68, 102 Collections, 49, 50, 84, 99 committed version, 97 common columns, 169, 171, 184 Common Table Expressions, 116 common-prefix length, 169 communicate, 19, 46 completeness, 37, 40, 45, 105, 106, 107, 109, 114, 117, 118, 119, 120, 123, 126, 134, 135, 136 complexity, 16 COMPLEXITY_PENALTY, 139 compliance, 17, 41 components, 57, 58, 191, 192 compromise, 21 concept, 14 Concepts, 56, 216 Configuration Manager, 58, 59, 61, 98 Connection Manager, 153, 193, 199, 206, 208, 209 consistency, 39 consistent, 18, 79 Consolidated Attributes, 82 constant, 113 constraint, 6, 28, 37, 45, 74, 131, 134, 135 container, 49, 65, 66 continuous, 20, 21, 38, 128, 136, 139, 144, 150 continuous merge, 20, 21 contract, 14, 15 coordinated, 16 correcting, 44 corrective measures, 44 correctness, 40 CRM, 12, 15, 17, 33, 163 cross-system, 19 CTE, 110, 111, 116, 130, 143, 178, 185, 186 cumulative values, 130 current, 16, 33, 39, 103 customers, 6, 10, 14, 15, 16, 19, 24, 33, 39, 43, 44, 49, 65, 66, 67, 77, 84, 126, 128, 181 Danette McGilvray, 55 Data conflict, 22 Data Connection Wizard, 147, 158 Data Flow Task, 193, 206 data governance, 18, 45, 54 data integrity, 6, 7, 12, 17, 18, 23, 24, 28, 38, 54, 135 Data Mining, 11, 53, 139, 145, 156, 157, 158, 161, 162 data model, 6, 7, 12, 41, 62 data profiling, 36, 42, 105, 107, 131, 145, 148, 149, 156, 161 Data Profiling task, 52, 109, 145, 148, 153, 155, 161 Data Profiling Task Editor, 153, 154 data quality, 6, 7, 9, 12, 15, 17, 18, 21, 22, 24, 28, 34, 36, 37, 38, 42, 43, 44, 45, 46, 47, 51, 53, 54, 67, 74, 103, 105, 106, 120, 136, 142, 161 Data Source Views, 122, 137 data steward, 19 data stewardship, 19, 22, 45, 47, 50 data type, 21, 23, 33, 34, 35, 38, 74, 107, 109, 110, 118, 126, 128, 135, 196 data warehouse, 41, 45, 46, 49, 120, 122, 126, 138, 145

database, 11, 13, 14, 18, 19, 22, 24, 28, 31, 33, 35, 41, 47, 48, 51, 54, 57, 58, 59, 65, 67, 75, 79, 89, 90, 94, 98, 99, 100, 101, 104, 107, 112, 115, 120, 121, 122, 126, 127, 133, 134, 147, 149, 150, 154, 158, 159, 160, 166, 169, 175, 181, 191, 192, 193, 197, 199, 206, 209, 213, 215 DateTime, 74 Decision Trees, 136, 137, 138, 139, 140, 141, 145, 146 decomposition, 24, 25, 29 dedicated, 20, 54, 98, 103 de-duplicating, 164, 203 de-duplication, 163 definition, 15, 16, 17, 23, 33, 36, 62, 65, 67, 69, 80, 85, 168 Dejan Sarka, 6, 8, 11, 12, 56, 105, 162, 163 denormalized, 31, 126 Dependency Network, 121, 124, 141 deployment, 47, 50, 62, 103 derived columns, 32, 150 derived hierarchies, 49 Derived Hierarchy, 78, 79, 94 Derived Hierarchy Maintenance, 78 descriptive statistics, 38, 129 destinations, 32, 42, 47, 48, 54 DIFFERENCE(), 166, 174 Dimension, 14, 15, 31, 36, 146 Dimensional Model, 12, 14, 15, 23, 29, 30, 31, 32, 54, 161 discretize, 139, 144, 147 distinct states, 143 documentation, 22, 41 domain, 22, 23, 42, 43, 45, 49, 70 domain knowledge, 22 Domain-Based, 69, 71, 72, 73, 74, 78, 79, 100 duplicate primary key, 137 duplicate records, 43 duplication, 15, 44, 67, 163, 203 DW, 20, 31, 33, 48 ease of use, 39 Ed Katibah, 162 education, 43 efficiency, 169, 177, 178, 179, 180, 182, 185, 187, 212 employees, 14, 15, 29, 49, 79 Enable Change Tracking, 73 enforce, 6, 7, 18, 45, 74, 135 enforcing, 12, 23, 24 enterprise, 6, 12, 15, 18, 22, 23, 28, 33, 36, 42, 53, 103, 163 entities, 14, 15, 23, 24, 29, 45, 48, 49, 56, 63, 64, 67, 69, 71, 74, 75, 78, 79, 99, 100, 101, 163, 166 entity, 14, 15, 23, 40, 48, 49, 54, 70, 71, 73, 75, 78, 79, 80, 81, 85, 98, 99, 100, 101, 149, 161, 163, 165, 181 Entity Maintenance Page, 70 entropy, 38 Entropy formula, 142 entry point, 62 Erik Veerman, 162 erroneous, 114, 115, 127, 130 ETL, 52 evaluating the cost, 28 event, 14, 23 exact matching, 180 Excel, 147, 148, 149, 152, 153, 156, 157, 158, 159, 160, 161, 162, 163, 190, 191, 201, 202, 203, 215, 216 execution time, 200, 211 exist(), 109, 110

explicit hierarchies, 49, 102 Explicit Hierarchies, 80 Explicit Hierarchy, 79, 80, 81, 82, 89, 92, 93, 94, 99 Explorer, 63, 82, 84, 91, 122, 123, 137, 138, 145, 146, 153, 158, 193, 206 exporting, 56 Fact table, 14 Feature Pack, 156 File, 69, 74, 153 filtering, 180, 185, 188, 189 first pass, 189 five whys, 43, 44, 105 flag, 38, 39, 73, 81, 137 Free-Form, 69 Free-Text, 73 functional areas, 59, 62 functional dependency, 112 functionally dependent, 26, 115 Fuzzy Grouping, 52, 164, 190, 191, 203, 204, 205, 206, 208, 209, 210, 211, 212, 213, 215, 216 Fuzzy Lookup, 52, 164, 166, 189, 190, 191, 192, 193, 195, 197, 198, 200, 201, 202, 203, 204, 209, 211, 212, 213, 215, 216 Fuzzy transformations, 190, 191, 192, 216 generalization, 29 government, 17 governor, 19 Greg Low, 162 GROUPING SETS, 113 hard dimensions, 37, 42, 45 harmonizing, 17 header, 23 hierarchical data, 13 Hierarchies, 49, 71, 77, 79, 80, 81, 82, 83, 84, 85, 92, 99 historical data, 31 history, 16, 17, 33 HRM, 15 human resources, 31 identification, 15, 23, 119, 163, 165, 203 identity mapping, 7, 19, 21, 163, 164, 165, 166, 169, 174, 181, 189, 201, 203, 204, 209, 215 importing, 47, 103, 133, 149 Importing, 56, 89, 90, 150 improvement, 36, 43, 44, 45, 53, 54, 120, 136 improvement plan, 44 inaccuracies, 128, 135 inaccurate, 8, 15, 38, 44, 45, 126, 128, 129, 136, 141 incompleteness, 24, 136 inconsistent, 39, 44 independently, 19, 165 influence, 15, 24, 40, 136 information, 17, 18, 23, 25, 26, 29, 33, 34, 38, 39, 40, 44, 75, 76, 80, 91, 94, 105, 106, 112, 136, 139, 142, 143, 144, 159, 175, 190 Information Principle, 23 Information Theory, 38, 142 input, 69, 74, 121, 122, 123, 136, 138, 146, 166, 192, 197, 205 installation, 57, 58, 59 installer package, 57 installing, 57, 103 instance, 33, 38, 58, 59, 107, 109, 110, 122, 135, 149, 157, 158 integrating, 17 integration, 19, 40, 63 Integration Management, 63, 90, 93 Integration Services, 11, 12, 52, 109, 153, 155, 161, 215, 216 integrity rules, 28

interaction, 19, 39, 62 interactivity, 40 intersection, 21, 168, 190 interview, 37, 42, 45 Isaac Kunen, 162 Itzik Ben-Gan, 8, 162 Jaccard, 167, 168, 175, 190, 191, 216 Jaro-Winkler, 167, 168, 176, 177, 178, 179, 180, 186, 187, 197, 216 join, 28, 30, 126, 165, 177 key data, 6 keys mapping tables, 19 Leaf Attributes, 71 leaf level, 77 legacy applications, 115, 163 Levenshtein, 167, 175, 176, 192, 216 license, 14 life cycle, 15, 42, 44 LIKE, 132, 133, 137 list of defined values, 70 location, 14, 16 longest common substring, 167, 169 lookup table, 49, 70, 192, 194, 207 Lookup transformation, 192, 194, 195, 196, 197, 198, 200, 204, 207, 208, 213 low frequency, 38, 129, 155 Lubor Kollar, 162 magnifying glass, 83 Management Studio, 90, 107 managing data, 56 Mappings, 199, 210 Marco Russo, 162 master data, 6, 7, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 32, 35, 36, 40, 42, 43, 45, 47, 48, 49, 50, 51, 52, 53, 54, 56, 77, 103, 126, 161, 163, 181 Master Data Management, 7, 11, 12, 16, 17, 62, 89 Master Data Manager, 50, 62, 63, 65, 101, 103 Master Data Services, 9, 12, 21, 45, 47, 54, 56, 57, 58, 62, 63, 64, 65, 66, 73, 74, 77, 78, 79, 82, 85, 89, 90, 91, 94, 95, 101, 103, 104, 158, 161, 166, 215 matching, 133, 165, 166, 167, 168, 170, 173, 176, 177, 178, 180, 181, 184, 185, 189, 190, 194, 195, 198, 203 matching characters, 168 MDM solution, 19, 21, 22, 28, 33, 35, 36, 45, 47, 54, 103, 105, 120, 161 mdm.StgBatch, 101 mdm.tblAttribute, 75, 100 mdm.tblEntity, 75, 98, 99 mdm.tblEntityMemberType, 99 mdm.tblModel, 98 mdm.tblStagingMember, 89, 101 mdm.tblStgMemberAttribute, 89, 101 mdm.tblStgRelationship, 89, 101 mdq.NGrams, 181, 182, 185 mdq.Similarity, 175, 176, 179, 187 MDS, 12, 21, 22, 47, 48, 49, 50, 51, 52, 54, 56, 57, 58, 59, 61, 76, 98, 99, 101, 103, 166, 167, 175, 181, 192, 215 MDS Hub, 47, 51, 56 member, 49, 63, 75, 79, 80, 82, 83, 84, 92, 94 member values, 67 merging, 19, 21, 36, 44, 52, 103, 161, 165, 166, 169, 182, 185, 203 metadata, 13, 19, 20, 21, 22, 45, 47, 100 Microsoft, 7, 9, 11, 47, 121, 122, 133, 138, 147, 149, 156, 162, 163, 190, 201, 215, 216 Microsoft Research, 190, 216 middle tier, 28 minimalization, 41

MINIMUM_SUPPORT, 139 Mining Structures, 122, 138 missing, 38, 107, 109, 110, 111, 114, 115, 120, 125 Model, 14, 30, 60, 64, 65, 66, 67, 75, 76, 90, 95, 98, 99, 103, 124, 140, 141 modify(), 110 Monica Scannapieco, 216 mutually independent, 27 MVP, 9, 11, 181, 216 Naïve Bayes, 121, 122, 123, 136, 137, 156 Name attribute, 100 naming convention, 24, 75, 99, 112 natural hierarchies, 31 navigation, 77 NEWID(), 171, 172, 173, 177, 178 nGrams, 180, 181, 182, 183, 184, 185, 188, 189 nodes(), 110 non-transitively dependent, 27 normal distribution, 129, 131 normal form, 24, 25, 26, 27, 41, 112 normalization, 24, 28, 54, 106, 115 Normalization, 24, 67, 106 notification, 47 noun, 14, 15 NULL, 28, 29, 37, 38, 41, 44, 90, 106, 107, 108, 109, 110, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 125, 127, 131, 134, 160, 170, 178, 180, 182, 183, 185, 200, 204, 205, 214 nullable, 107, 108, 109, 111, 117 object, 11, 14, 35, 41, 67, 69, 153, 193, 206 ODS, 33 OLAP, 12, 13, 15, 30, 53, 145, 147, 161 OLAP cubes, 53 OLE DB Destination, 199, 209, 210 OLE DB Source, 193, 206 OLTP, 12, 13, 163 open world assumption, 37 Open world assumption, 106 operating system, 57 operational system, 17 optimization, 177, 192 order, 13, 15, 24, 27 OVER, 90, 116, 179, 183, 187 over-fitting, 137 package, 153, 154, 192, 193, 196, 197, 199, 200, 206, 210, 211 partitioning, 176, 177, 184, 185, 187, 189 pattern, 38 people, 14 People, 6 PivotChart, 148, 149, 152 PivotTable, 147, 148, 149, 151, 152 place, 14 policy, 16, 18, 22, 36 population, 20, 37, 106 portal, 50, 53, 65, 89 PowerPivot, 149, 150, 151, 152, 161, 162 PowerShell, 58, 60 predicate, 14, 23 predictable, 121, 122, 138 pre-select, 180 presentation quality, 40 preventing, 44 Preview, 79, 150 priority, 43 proactive, 6, 18 probability, 129, 142, 157, 173 Process Activation Service, 60 processing, 47, 49, 124 product categories, 15

products, 14, 15, 19, 26, 30, 49, 109, 111, 115, 126 proposition, 14, 23 pruning, 180 quality, 7, 10, 17, 18, 22, 36, 39, 40, 41, 42, 43, 45, 52, 54, 85, 103, 105, 107, 112, 117, 120, 161, 215 query(), 109 Ralph Kimball, 31, 55 randomness, 171, 173, 177 Ratcliff/Obershelp, 167, 216 RDBMS, 17, 18, 23, 28, 51 reactive, 6, 18 real world, 14, 37, 39, 41, 48, 75 Recursive Hierarchies, 79 recursive partitioning, 136 redundancy, 24, 27, 67, 70, 106 reference data, 63, 67, 89, 97 reference relation, 37, 106 Reference Table, 196 regular expressions, 51, 133, 134 relation, 23, 37, 106, 112, 114, 118, 119, 120, 126 relational database, 14, 15, 23, 35, 37, 67, 69, 74, 106, 119, 122, 138, 149 relational design, 29, 30 Relational Model, 6, 11, 12, 14, 23, 28, 31, 32, 54, 65, 70 relationship, 13, 14, 23, 29, 41, 49, 50, 69, 71, 74, 75, 78, 79, 99, 100, 101, 102 retention, 18, 42 re-usage, 17 reuse, 16, 17 Roger Wolter, 162 root cause, 36, 43, 44, 105, 106, 107, 114, 125, 136, 141, 161 Sakichi Toyoda, 43 sales representative, 14 sample data, 126, 169, 170 schema, 6, 13, 23, 24, 31, 32, 33, 34, 36, 40, 41, 42, 45, 98, 101, 105, 106, 107, 112, 114, 126, 135, 145 schema quality, 36, 40, 107 search space, 164, 166, 176, 182 search space reduction, 176 security, 18, 60, 62, 64, 74, 75, 76, 99 segment, 14 SELECT, 89, 90, 94, 99, 108, 109, 110, 111, 112, 116, 117, 118, 119, 120, 121, 127, 128, 129, 130, 131, 132, 134, 135, 137, 143, 144, 170, 171, 172, 173, 174, 175, 177, 178, 179, 182, 183, 184, 185, 186, 187, 200, 204, 211, 212 semi-structured data, 13 service, 15, 59 set, 14, 18, 36, 112, 149, 166, 168, 176, 177, 181, 190, 191 shared dimensions, 31 Simil, 167, 169, 175 Similarity Threshold, 197, 198 single copy of data, 45 soft dimensions, 37, 39, 43, 45 SolidQ, 8, 9, 10, 11 Solution Explorer, 122 sorting neighborhood, 177 SOUNDEX(), 166, 174 sources, 15, 22, 28, 32, 33, 40, 42, 44, 47, 48, 52, 54, 103, 161, 163, 165, 199, 209 specialization, 29, 30, 31, 41 spreadsheet, 34 SQL 2008 R2 sample databases, 107 SQL Server, 7, 9, 11, 12, 13, 21, 28, 34, 35, 45, 47, 51, 52, 53, 54, 56, 57, 58, 74, 78, 90, 94, 103, 105, 107, 109, 110, 113, 121, 133, 134, 135, 136, 145,

149, 153, 154, 155, 156, 158, 161, 162, 166, 167, 175, 181, 189, 190, 193, 197, 206, 215, 216 SSAS, 21, 53, 121, 122, 139, 145, 147, 148, 149, 152, 157, 158, 159 SSIS, 12, 21, 52, 109, 145, 148, 153, 156, 164, 166, 189, 191, 192, 193, 200, 203, 206, 211, 215 SSRS, 53 staging tables, 47 stakeholder, 42, 43, 45, 46 standard deviation, 128, 129 Star Schema, 14, 145 static, 16, 133, 197 Steve Kass, 162 string lengths, 131 string similarity, 164 string-matching, 44 strings profiling, 131 subject, 14 Subscription View, 93, 94 subscription views, 47 Subscription Views, 63, 95 subtype, 29, 41, 44, 107, 112, 116 System Administration Page, 67 Tamraparni Dasu, 162 taxonomy, 13, 40 Teo Lachev, 162 Testing Set, 139 text mining, 35 Theodore Johnson, 162 thing, 14 third normal form, 27 Thomas C. Redman, 162 timeliness, 39 timely, 6, 39 token, 181, 183, 184, 190, 191, 197 tokenizer, 190 Toolbox, 153, 193, 206 transactional, 13, 14, 15, 16, 17, 20, 41, 51, 54 transactional data, 13, 15 Transact-SQL, 37, 51, 94, 107, 161 transposition, 6 transpositions, 168 treatment, 16 trust, 40 T-SQL, 11, 89, 90, 101, 107, 113, 128, 132, 133, 134, 135, 143, 145, 162, 164, 166, 174, 175, 181, 203 tuple, 23, 37, 106, 118, 119, 126 unified, 19 uniform distribution, 142 UNION, 112, 127, 143, 144, 203 UNION ALL, 112, 203 Union All transformation, 198, 199 unique, 7, 18, 38, 134, 163 uniqueness, 134, 135 unstructured data, 13 UPDATE, 171, 172, 177, 178, 179, 180, 184, 186, 187 update anomalies, 26, 27, 28 User and Groups Permissions, 64 user-defined attributes, 89 validate, 33, 63, 92, 93, 96 validate data, 88, 95, 96 Validation Actions, 86 value(), 109, 135 verb, 14 version, 9, 50, 51, 67, 93, 95, 96, 97, 103, 115, 133, 149, 157, 190, 191, 215 versioning, 16, 17, 18, 19, 45, 50, 56, 93, 103 volatility, 16 vTargetMail, 126, 127, 128, 166, 169

Web, 18, 39, 47, 56, 57, 59, 61, 62, 63, 101, 103, 134, 163 WHERE, 108, 109, 110, 111, 117, 118, 119, 120, 121, 127, 128, 130, 132, 134, 137, 172, 173, 178, 179, 182, 185, 187, 200, 212 Wikipedia, 55, 167, 216 Windows, 57, 58, 59, 60, 147 workflow, 17, 18, 45, 47 worksheet, 148, 152, 159 XML, 13, 33, 34, 38, 107, 109, 110, 111, 135, 153, 155, 161, 163 XQuery, 107, 110, 161 XSD, 33, 135 xsi:nil, 107, 111
