ETL

It is estimated that as much as 75% of the effort spent on building a data warehouse can be attributed to back-end issues, such as readying the data and transporting it into the data warehouse (Atre, 1998). Data quality tools are used in data warehousing to ready the data and to ensure that clean data populates the warehouse, thus enhancing its usability. The objective of the effort is to develop a tool to support the identification of data quality issues and the selection of tools for addressing those issues. A secondary objective is to provide information on specific tools regarding price, platform, and unique features.

Example: a new airport in Hong Kong suffered catastrophic problems in baggage handling, flight information, and cargo transfer. The ramifications of the dirty data were felt throughout the airport. Flights took off without luggage, airport officials tracked flights with plastic pieces on magnetic boards, and airlines called confused ground staff on cellular phones to let them know where even more confused passengers could find their planes (Arnold, 1998). The new airport had depended on the central database being accurate; when it wasn't, the airport paid the price in customer satisfaction and trust.

Dirty data often originates at the source. For example, most payroll systems require a social security number when setting up an employee file. If no number is available when the file is set up, an incorrect number such as 111-11-1111 may be used in order to facilitate payroll processing.

Hundreds of tools are available to automate portions of the tasks associated with auditing, cleansing, extracting, and loading data into data warehouses. Most of these tools fall into the data extracting and loading classification, while only a small number would be considered auditing or cleansing tools.
You can use your ETL (extract, transform, and load) tool for detecting data problems, cleansing the data, and even maintaining data quality. A surprising number of organizations do not realize the humble ETL tool's potential as a data quality tool.

The Data Quality Pro piece warns that an ETL tool is not a replacement for a high-end data quality solution, but if you want better data and can't afford a full-blown investment, an ETL tool can be an excellent starting point for three reasons:

1. You already have one, or more, on hand, so you won't have to buy anything new.
2. An ETL tool can address nearly 70 percent of data quality requirements, according to Data Quality Pro, which used Arkady Maydanchik's "Data Quality Assessment" framework as a gauge. Of the remaining 30 percent, only 7 percent is completely non-compliant. It's not perfect, but it's better than 100 percent unsure, right?
3. You can use the ETL data quality approach to tackle low-hanging-fruit problems, and then turn any cost savings into a down payment for a full-blown tool. At least, that's what the Data Quality Pro piece recommends. I say it's your money, do what you want with it.

Scenario details
The data warehouse is fed daily with an orders extract which comes from a source OLTP system. Unfortunately, the data quality in that extract is poor, as the source system does not perform many consistency checks and there are no data dictionaries.

The data quality problems that need to be addressed are identified using two types of data quality tests: syntax tests and reference tests. The syntax tests report dirty data based on character patterns, invalid characters, incorrect lower or upper case, and so on. The reference tests check the integrity of the data against the data model; for example, a customer ID which does not exist in the data warehouse customers dictionary table will be reported. Both types of tests report on two severity levels: errors and warnings. When an error is encountered, the record is logged and not passed through to the output. Warnings are logged but still loaded into the data warehouse.

The data quality problems in this example are as follows:
- an order clearance date is earlier than the date when the order was placed
- an invalid order number, containing invalid characters or an incorrect number of characters
- an incorrect address (postal code and street name) which is not consistent with the dictionary
- a phone number not matching the expected pattern, for example when it contains blanks or dashes and the format defined in the data warehouse is supposed to be numbers only

Solution

The idea is to capture records containing invalid data in a rejects file for further data quality inspection and analysis. In practice, the ETL process design in any of the commercial ETL tools will be pretty straightforward and will include an input, a validation step, and two outputs: validated records and dirty data. The two validation transformations (character and reference DQ tests) do the following (a code sketch of the same rules follows this list):
- Compare the order clearance date with the order entry date: check whether the entry date is equal to or earlier than the clearance date. Reject records which do not meet this criterion. Severity: error.
- Validate the order ID: check whether it contains invalid characters; the order ID should be numeric. Severity: error.
- Address validation: the postal code is looked up from a data warehouse dictionary based on the city name. Discrepancies and non-matching records are reported. Severity: warning.
- Phone number correction: phone numbers are stored in the format "+prefix number". For a number such as +34 112223333, the following variations will be reported as warnings: (0034) 112223333, 11 2223333, +34 11-22-3333.

Such an ETL process for data quality and data cleansing can be built in Pentaho Kettle (Spoon).
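The Python sketch below expresses the same validation rules. It is not the actual Kettle transformation: the field names, the stand-in dictionary contents, and the assumed order-ID length of eight digits are all illustrative choices, not part of the scenario.

```python
import re
from datetime import date

# Stand-in for the data warehouse address dictionary table.
CUSTOMER_POSTAL_CODES = {"Madrid": "28001", "London": "EC1A 1BB"}

PHONE_PATTERN = re.compile(r"^\+\d+ \d+$")    # stored format: "+prefix number"
ORDER_ID_PATTERN = re.compile(r"^\d{8}$")     # assumption: order IDs are 8 digits

def validate(record):
    """Return (is_clean, issues). Any 'error' issue sends the record to the rejects file."""
    issues = []

    # Consistency test: the clearance date must not precede the entry date (error).
    if record["clearance_date"] < record["entry_date"]:
        issues.append(("error", "clearance date earlier than order date"))

    # Syntax test: the order ID must be numeric with the expected length (error).
    if not ORDER_ID_PATTERN.match(record["order_id"]):
        issues.append(("error", "invalid order number"))

    # Reference test: the postal code must match the dictionary entry for the city (warning).
    if CUSTOMER_POSTAL_CODES.get(record["city"]) != record["postal_code"]:
        issues.append(("warning", "address inconsistent with dictionary"))

    # Syntax test: the phone number must match the "+prefix number" pattern (warning).
    if not PHONE_PATTERN.match(record["phone"]):
        issues.append(("warning", "phone number does not match the expected pattern"))

    is_clean = not any(severity == "error" for severity, _ in issues)
    return is_clean, issues

record = {
    "order_id": "12345678",
    "entry_date": date(2024, 1, 10),
    "clearance_date": date(2024, 1, 12),
    "city": "Madrid",
    "postal_code": "28001",
    "phone": "+34 11-22-3333",   # reported as a warning, but still loaded
}
is_clean, issues = validate(record)
```

Records for which is_clean is False would be written to the rejects file; records with only warnings are logged and loaded anyway, matching the severity rules above.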

ETL process

The three-stage ETL process, and the ETL tools implementing the concept, can be a response to the needs described above. The abbreviation 'ETL' comes from 'extract, transform, and load', the words that describe the idea behind such a system. ETL tools were created to improve and facilitate data warehousing. The ETL process consists of the following steps:

1. Initiation
2. Build reference data
3. Extract from sources
4. Validate
5. Transform
6. Load into staging tables
7. Audit reports
8. Publish
9. Archive
10. Clean up

Sometimes these steps are supervised and performed by hand, but that is very time-consuming and may not be as accurate. The purpose of using ETL tools is to save time and make the whole process more reliable.
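As an illustration only, the skeleton below strings these steps together in Python. Every step is a stub with made-up data; a real implementation would read from source systems, write to a staging area, and publish to the warehouse.

```python
# Illustrative skeleton of the ETL steps listed above; all data and names are invented.
def run_etl():
    # 1. Initiation: parameters, logging, a run identifier.
    run_id = "2024-01-15-daily"

    # 2. Build reference data: dictionaries used later for validation.
    known_customers = {"C001", "C002"}

    # 3. Extract from sources (stubbed as an in-memory list).
    raw = [{"customer_id": "C001", "amount": 10.0},
           {"customer_id": "C999", "amount": -5.0}]

    # 4. Validate: split clean rows from rejects.
    valid = [r for r in raw if r["customer_id"] in known_customers and r["amount"] >= 0]
    rejects = [r for r in raw if r not in valid]

    # 5. Transform: derive columns, recode values, and so on.
    for r in valid:
        r["amount_cents"] = int(r["amount"] * 100)

    # 6. Load into staging tables (stubbed).
    staging = list(valid)

    # 7. Audit reports: counts used later for reconciliation.
    print(f"{run_id}: loaded={len(staging)} rejected={len(rejects)}")

    # 8-10. Publish, archive, and clean up would follow here.
    return staging, rejects

run_etl()
```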

ETL TOOLS
ETL (extract, transform, load) tools were created to simplify data management while reducing the effort involved. Depending on the needs of customers, there are several types of tools. Some of them perform and supervise only selected stages of the ETL process, such as data migration tools (EtL, or "small t" tools) and data transformation tools (eTl, or "capital T" tools). Others are complete ETL tools and offer many functions intended for processing large amounts of data or for more complicated ETL projects. Some, like server engine tools, execute many ETL steps at the same time from more than one developer, while others, like client engine tools, are simpler and execute ETL routines on the same machine on which they are developed. There are two more types. The first, called code-based tools, is a family of programming tools which allow you to work with many operating systems and programming languages. The second, called GUI-based tools, removes the coding layer and allows you to work without any knowledge (in theory) of coding languages.

How do the ETL tools work?

The first task is data extraction from internal or external sources. After queries are sent to the source system, data may go directly to the target database; usually, however, there is a need to monitor or gather more information, and the data then goes to a staging area. Some tools extract only new or changed information automatically, so we do not have to update it on our own.

The second task is transformation, which is a broad category:
- transforming data into the structure required to continue the operation (extracted data usually has a structure typical of the source)
- sorting data
- connecting or separating data
- cleansing
- checking quality

The third task is loading the data into the data warehouse. As you can see, ETL tools have many other capabilities (next to the main three of extraction, transformation, and loading), such as sorting, filtering, data profiling, quality control, cleansing, monitoring, synchronization, and consolidation.
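The point about extracting only new or changed information can be sketched with a high-water-mark timestamp, as below. The table, columns, and timestamps are invented for the example.

```python
import sqlite3

# Incremental extraction sketch: only rows updated after the previous run's
# high-water mark are pulled into the staging area. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("O1", 10.0, "2024-01-15T08:00:00"),   # changed since the last run
    ("O2", 20.0, "2024-01-14T08:00:00"),   # already loaded previously
])

last_run_ts = "2024-01-14T23:59:59"        # high-water mark stored by the previous run
changed_rows = conn.execute(
    "SELECT order_id, amount, updated_at FROM orders WHERE updated_at > ?",
    (last_run_ts,),
).fetchall()                               # only O1 is extracted into the staging area
```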

ETL Tools providers
Here is a list of the most popular commercial and freeware (open-source) ETL tools.

Commercial ETL tools:
- IBM Infosphere DataStage
- Informatica PowerCenter
- Oracle Warehouse Builder (OWB)
- Oracle Data Integrator (ODI)
- SAS ETL Studio
- Business Objects Data Integrator (BODI)
- Microsoft SQL Server Integration Services (SSIS)
- Ab Initio

Freeware, open-source ETL tools:
- Pentaho Data Integration (Kettle)
- Talend Integrator Suite
- CloverETL
- Jasper ETL

As you can see, there are many types of ETL tools, and all you have to do now is choose the appropriate one for you. Some of them are relatively expensive, and some may be too complex if you do not want to transform a lot of information, use many sources, or use sophisticated features. It is always necessary to start by defining the business requirements, then consider the technical aspects, and then choose the right ETL tool.

Free ETL tools

Pentaho Data Integration (PDI, Kettle)
According to Pentaho itself, it is a BI provider that offers ETL tools as a data integration capability. These ETL capabilities are based on the Kettle project. Pentaho is known for selling subscriptions such as support services and management tools. Focusing primarily on connectivity and transformation, Pentaho's Kettle project incorporates a significant number of contributions from its community. Community-driven enhancements include a web services lookup, a SAP connector, and the development of an Oracle bulk loader. The SAP connector, although integrated with Kettle, is not a free product; it is a commercially offered plug-in, but it is around ten times cheaper than SAP connectivity for Infosphere DataStage.

Talend
Talend is a startup of French origin that has positioned itself as a pure play in open source data integration and now offers its product, Open Studio. For vendors wishing to embed Open Studio capabilities in their products, Talend has an OEM license agreement. That is what JasperSoft has done, thus creating an open source BI stack to compete with Pentaho's Kettle. Talend is a commercial open source vendor which generates profit from support, training, and consulting services. Open Studio offers a user-friendly graphical modeling environment and provides both a traditional approach to performance management and pushdown optimization (an architectural approach). The latter allows users to bypass the cost of dedicated hardware to support an ETL engine and to leverage spare server capacity within both the source and target environments to power the transformations.

CloverETL

This project is directed by OpenSys, a company based in the Czech Republic. It is a Java-based, dual-licensed open source tool whose commercially licensed version offers warranty and support. Its small footprint makes it easy to embed by system integrators and ISVs. It aims at providing a basic library of functions, including mapping and transformations. Its enterprise server edition is a commercial offering.

KETL
This project is sponsored by Kinetic Networks, a professional services company. It started as a tool for customer engagements, as commercial tools were too expensive. Kinetic employees currently develop the code, but outside contributions are expected in the future. Additional modules, such as a data quality and profiling component, were also developed by Kinetic; they are not placed under the open source license. Initially, KETL was designed as a utility to replace custom PL/SQL code that would move large data volumes. It is a Java-based, XML-driven development environment which is of great use to skilled Java developers. KETL is currently limited by the lack of a visual development GUI.

Challenges
ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems. The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis can identify the data conditions that will need to be managed by transformation rule specifications, leading to an amendment of the validation rules explicitly and implicitly implemented in the ETL process.

Data warehouses are typically assembled from a variety of data sources with different formats and purposes. As such, ETL is a key process for bringing all the data together in a standard, homogeneous environment.

Design analysts should establish the scalability of an ETL system across the lifetime of its usage. This includes understanding the volumes of data that will have to be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data has to be processed in less time. Some ETL systems have to scale to process terabytes of data in order to update data warehouses holding tens of terabytes. Increasing volumes of data may require designs that can scale from daily batch to multiple-day micro-batch to integration with message queues or real-time change data capture for continuous transformation and update.

A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve the overall performance of ETL processes when dealing with large volumes of data.

ETL applications implement three main types of parallelism:

- Data: splitting a single sequential file into smaller data files to provide parallel access (sketched after this list).
- Pipeline: allowing the simultaneous running of several components on the same data stream, for example looking up a value on record 1 at the same time as adding two fields on record 2.
- Component: the simultaneous running of multiple processes on different data streams in the same job, for example sorting one input file while removing duplicates on another file.

All three types of parallelism usually operate in combination in a single job.

An additional difficulty is making sure that the data being uploaded is relatively consistent. Because multiple source databases may have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled against the contents of a source system or against the general ledger, establishing synchronization and reconciliation points becomes necessary.

Best practices
Four-layered approach for ETL architecture design
- Functional layer: core functional ETL processing (extract, transform, and load).
- Operational management layer: job-stream definition and management, parameters, scheduling, monitoring, communication, and alerting.
- Audit, balance and control (ABC) layer: job-execution statistics, balancing and controls, reject and error handling, codes management (sketched after this list).
- Utility layer: common components supporting all other layers.
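One possible way the layers separate in code is sketched below; the function names and statistics are illustrative, and the utility layer would hold shared helpers used by the others.

```python
import time

def record_job_stats(job_name, status, rows_in, rows_out, rows_rejected, elapsed):
    # Audit, balance and control (ABC) layer: persist job-execution statistics.
    print(f"{job_name}: {status} in={rows_in} out={rows_out} "
          f"rejected={rows_rejected} elapsed={elapsed:.2f}s")

def load_orders(rows):
    # Functional layer: the core extract/transform/load logic.
    valid = [r for r in rows if r["amount"] >= 0]
    rejected = [r for r in rows if r["amount"] < 0]
    return valid, rejected

def run_job(job_name, rows):
    # Operational management layer: parameters, scheduling, and alerting live here.
    start = time.time()
    valid, rejected = load_orders(rows)
    record_job_stats(job_name, "SUCCESS", len(rows), len(valid), len(rejected),
                     time.time() - start)

run_job("daily_orders", [{"amount": 10.0}, {"amount": -1.0}])
```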

Use file-based ETL processing where possible
- Storage costs relatively little.
- Intermediate files serve multiple purposes (see the sketch after this list):
  - used for testing and debugging
  - used for restart and recovery processing
  - used to calculate control statistics
- Helps to reduce dependencies and enables modular programming.
- Allows flexibility for job execution and scheduling.
- Better performance if coded properly, and can take advantage of parallel processing capabilities when the need arises.
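A minimal sketch of the idea: each stage writes an intermediate file, which a later stage (or a restarted run) can read back, and which also yields control statistics. The file and field names are invented.

```python
import csv

# Stage 1: the extract step writes an intermediate file.
extracted = [{"order_id": "O1", "amount": "10.0"},
             {"order_id": "O2", "amount": "-5.0"}]
with open("extracted.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
    writer.writeheader()
    writer.writerows(extracted)

# Stage 2: the validation step starts from the file, so a failed run can restart
# here without re-extracting from the source system.
with open("extracted.csv", newline="") as f:
    rows = list(csv.DictReader(f))
valid = [r for r in rows if float(r["amount"]) >= 0]

# Control statistics derived from the intermediate files.
print(f"extracted={len(rows)} valid={len(valid)} rejected={len(rows) - len(valid)}")
```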

Use data-driven methods and minimize custom ETL coding


- Parameter-driven jobs, functions, and job control.
- Code definitions and mappings kept in the database (sketched below).
- Consideration of data-driven tables to support more complex code mappings and business-rule application.
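The sketch below keeps a code-to-value mapping in a database table so that new codes require a data change rather than a code change. The table layout and codes are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE code_map (source_code TEXT PRIMARY KEY, target_value TEXT)")
conn.executemany("INSERT INTO code_map VALUES (?, ?)",
                 [("01", "Retail"), ("02", "Wholesale"), ("99", "Unknown")])

# The ETL job reads the mapping at run time instead of hard-coding it.
mapping = dict(conn.execute("SELECT source_code, target_value FROM code_map"))

records = [{"customer": "C001", "channel_code": "01"},
           {"customer": "C002", "channel_code": "07"}]    # "07" has no mapping yet

for r in records:
    # Unmapped codes fall back to the "Unknown" value instead of failing the job.
    r["channel"] = mapping.get(r["channel_code"], mapping["99"])
```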

Qualities of a good ETL architecture design
- Performance
- Scalable
- Migratable
- Recoverable (run_id, ...)
- Operable (completion codes for phases, re-running from checkpoints, etc.; see the sketch after this list)
- Auditable (in two dimensions: business requirements and technical troubleshooting)
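As one way to picture the recoverable and operable qualities, the sketch below records completed phases per run_id so that a re-run skips work that already finished. The table layout and phase names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE run_log (run_id TEXT, phase TEXT, status TEXT, "
             "PRIMARY KEY (run_id, phase))")

PHASES = ["extract", "validate", "transform", "load"]

def run(run_id):
    # Completion codes recorded per phase make the job restartable from a checkpoint.
    done = {phase for (phase,) in conn.execute(
        "SELECT phase FROM run_log WHERE run_id = ? AND status = 'OK'", (run_id,))}
    for phase in PHASES:
        if phase in done:
            continue                      # already completed in an earlier attempt
        print(f"{run_id}: running {phase}")
        conn.execute("INSERT OR REPLACE INTO run_log VALUES (?, ?, 'OK')", (run_id, phase))
        conn.commit()

run("2024-01-15")    # first attempt runs every phase
run("2024-01-15")    # a re-run with the same run_id skips the completed phases
```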

Handling of undesirable values (NULL values, erroneous values, etc.). See: Dealing With Nulls In The Dimensional Model (Kimball University)[2]
- NULL dimension values
- NULL fact values
- NULL primary and/or foreign key values (see the sketch after this list)
- Erroneous or undesirable values
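A common Kimball-style treatment for the foreign-key case, sketched below, is to never load a NULL key into the fact table and instead point the row at a dedicated "Unknown" dimension member. The keys and names are invented for the example.

```python
# Sketch: replace NULL (or unresolvable) foreign keys with a surrogate "Unknown" member.
UNKNOWN_CUSTOMER_KEY = -1

customer_dim = {
    UNKNOWN_CUSTOMER_KEY: {"customer_name": "Unknown"},
    101: {"customer_name": "Acme Ltd"},
}

fact_rows = [
    {"order_id": "O1", "customer_key": 101, "amount": 10.0},
    {"order_id": "O2", "customer_key": None, "amount": 20.0},   # key missing in the source
]

for row in fact_rows:
    key = row["customer_key"]
    if key is None or key not in customer_dim:
        row["customer_key"] = UNKNOWN_CUSTOMER_KEY   # the fact table never stores a NULL key
```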
