Best Informatica Interview Questions

Published on July 2016

Best Informatica Interview Questions & Answers

Deleting duplicate rows using Informatica
Q1. Suppose we have duplicate records in the source system and we want to load only the unique records into the target system, eliminating the duplicate rows. What will be the approach?
Ans. Let us assume that the source system is a relational database and the source table has duplicate rows. To eliminate the duplicate records, we can check the Distinct option of the Source Qualifier for the source table and load the target accordingly.
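With the Distinct option checked, the read from the source behaves like a SELECT DISTINCT query. The sketch below imitates that with Python's built-in sqlite3 module rather than Informatica itself; the table name src_customer, its columns and the sample rows are hypothetical.

```python
import sqlite3

def load_unique_rows(rows):
    """Load rows into an in-memory source table and read them back with SELECT DISTINCT."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical source table; note that it has no key, so duplicates are allowed.
    conn.execute("CREATE TABLE src_customer (cust_id INTEGER, cust_name TEXT)")
    conn.executemany("INSERT INTO src_customer VALUES (?, ?)", rows)
    unique = conn.execute(
        "SELECT DISTINCT cust_id, cust_name FROM src_customer ORDER BY cust_id"
    ).fetchall()
    conn.close()
    return unique

source_rows = [(1, "Linda"), (2, "Sam"), (1, "Linda"), (3, "Tom"), (2, "Sam")]
unique_rows = load_unique_rows(source_rows)
print(unique_rows)
```

Only rows that are identical in every selected column collapse into one, which is why this approach suits exact duplicates.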

Comparing Performance of SORT operation (Order By) in Informatica and Oracle

In this DWBI Concepts original article, we pit Oracle database against Informatica PowerCenter to find out which of them handles the data sorting operation faster. The result gives application developers useful input for making informed performance tuning decisions.

Which is the fastest? Informatica or Oracle?
Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. Oracle database, on the other hand, is arguably the most successful and powerful RDBMS, trusted since the 1980s in all sorts of business domains and across all major platforms. Both systems are among the best in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between them.

Think about a typical ETL operation used in enterprise-level data integration. A lot of the data processing can be directed either to the database or to the ETL tool. In general, both the database and the ETL tool are reasonably capable of doing such operations with almost the same efficiency. But to achieve optimized performance, a developer must carefully decide which system to trust with each individual processing task. In this article, we take a basic database operation, sorting, and put the two systems to the test to determine which does it faster, if at all.

Which sorts data faster? Oracle or Informatica?
As an application developer, you have the choice of either using ORDER BY at the database level to sort your data or using a SORTER TRANSFORMATION in Informatica to achieve the same outcome. The question is: which system performs this faster?

Test Preparation
We will perform the same test with different data points (data volumes) and log the results. We will start with 1 million records and double the volume for each subsequent data point. Here are the details of the setup we will use:

1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica set up on different physical servers using HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in the Oracle shared pool before it is read
6. There is no session-level partition in Informatica PowerCenter
7. There is no parallel hint provided in the extraction SQL query
8. The source table has 10 columns, and the first 8 columns are used for sorting
9. Informatica sorter has enough cache size

We have used two Informatica PowerCenter mappings created in Informatica PowerCenter Designer. The first mapping, m_db_side_sort, uses an ORDER BY clause in the source qualifier to sort data at the database level. The second mapping, m_Infa_side_sort, uses an Informatica sorter to sort data at the Informatica level. We executed these mappings with different data points and logged the results.

Result

The following graph shows the performance of Informatica and the database in terms of the time taken by each system to sort data. Time is plotted along the vertical axis and data volume along the horizontal axis.

Verdict
The above experiment demonstrates that Oracle database is faster than Informatica in the SORT operation by an average of 14%.

Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments

Note
This data can only be used for performance comparison; it cannot be used for performance benchmarking. The Informatica and Oracle performance comparison for the JOIN operation is covered in the next article.

Informatica Join Vs Database Join

In this, yet another DWBI Concepts original article, we test the performance of the Informatica PowerCenter 8.5 Joiner transformation versus an Oracle 10g database join. The result gives application developers useful input for making informed performance tuning decisions.

Which is the fastest? Informatica or Oracle?

In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle and found that, under our test conditions, Oracle sorts 14% faster than Informatica. This time we will look at the JOIN operation, not only because JOIN is the single most important data set operation but also because JOIN performance gives a developer crucial data for designing manual pushdown optimization.

Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. Oracle database, on the other hand, is arguably the most successful and powerful RDBMS, trusted since the 1980s in all sorts of business domains and across all major platforms. Both systems are among the best in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between them. This article will help them make an informed decision.

Which JOINs data faster? Oracle or Informatica?
As an application developer, you have the choice of either using join syntax at the database level to join your data or using a JOINER TRANSFORMATION in Informatica to achieve the same outcome. The question is: which system performs this faster?

Test Preparation
We will perform the same test with 4 different data points (data volumes) and log the results. We will start with 1 million records in the detail table and 0.1 million in the master table. Subsequently we will test with 2 million, 4 million and 6 million detail table records and 0.2 million, 0.4 million and 0.6 million master table records. Here are the details of the setup we will use:

1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica set up on different physical servers using HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in the Oracle shared pool before it is read
6. There is no session-level partition in Informatica PowerCenter
7. There is no parallel hint provided in the extraction SQL query
8. Informatica JOINER has enough cache size

We have used two Informatica PowerCenter mappings created in Informatica PowerCenter Designer. The first mapping, m_db_side_join, uses an INNER JOIN clause in the source qualifier to join data at the database level. The second mapping, m_Infa_side_join, uses an Informatica JOINER to join data at the Informatica level. We executed these mappings with different data points and logged the results. In addition to the above test, we will execute the m_db_side_join mapping once again, this time with proper database-side indexes and statistics, and log the results.
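To make the two approaches concrete, here is a rough analog in Python with the built-in sqlite3 module. It is not the test harness used above and says nothing about relative speed. The first function lets the database do the INNER JOIN; the second caches the smaller master set in memory and streams the detail rows against it, which is broadly how a joiner-style cache operates. The table names, columns and rows are hypothetical, and the second function assumes cust_id is unique in the master table.

```python
import sqlite3

def db_side_join(conn):
    """Join in the database, as an INNER JOIN in the source query would."""
    sql = (
        "SELECT d.order_id, m.cust_name, d.amount "
        "FROM detail d INNER JOIN master m ON d.cust_id = m.cust_id "
        "ORDER BY d.order_id"
    )
    return conn.execute(sql).fetchall()

def tool_side_join(conn):
    """Join outside the database: cache the master rows, then stream the detail rows."""
    # Assumes cust_id is unique in master, so a plain dict can serve as the cache.
    master_cache = dict(conn.execute("SELECT cust_id, cust_name FROM master"))
    joined = []
    detail_rows = conn.execute(
        "SELECT order_id, cust_id, amount FROM detail ORDER BY order_id"
    )
    for order_id, cust_id, amount in detail_rows:
        if cust_id in master_cache:  # inner join: skip detail rows with no master match
            joined.append((order_id, master_cache[cust_id], amount))
    return joined

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (cust_id INTEGER, cust_name TEXT)")
conn.execute("CREATE TABLE detail (order_id INTEGER, cust_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO master VALUES (?, ?)", [(1, "Linda"), (2, "Sam")])
conn.executemany(
    "INSERT INTO detail VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 2, 40.0), (12, 3, 15.0), (13, 1, 5.0)],
)
db_rows = db_side_join(conn)
tool_rows = tool_side_join(conn)
print(db_rows == tool_rows)
```

Both functions return the same rows; what the experiment measures is only where the work is done.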

Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join data. The average time is plotted along the vertical axis and the data points are plotted along the horizontal axis.

Data Point   Master Table Record Count   Detail Table Record Count
1            0.1 M                       1 M
2            0.2 M                       2 M
3            0.4 M                       4 M
4            0.6 M                       6 M

Verdict
In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner transformation without a database index, and 42% faster with a database index.

Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments

Note
1. This data can only be used for performance comparison; it cannot be used for performance benchmarking.
2. This data is only indicative and may vary under different testing conditions.

Informatica Reject File - How to Identify rejection reason

When we run a session, the Integration Service may create a reject file for each target instance in the mapping to store the rejected target records. With the help of the session log and the reject file we can identify the cause of data rejection in the session. Eliminating the cause of rejection leads to rejection-free loads in subsequent session runs. If the Informatica Writer or the target database rejects data for any valid reason, the Integration Service logs the rejected records in the reject file. Every time we run the session, the Integration Service appends the rejected records to the reject file.

Working with Informatica Bad Files or Reject Files
By default the Integration Service creates the reject files, or bad files, in the $PMBadFileDir process variable directory. It writes the entire rejected row to the bad file, although the problem may be in any one of the columns. The reject files have a default naming convention like [target_instance_name].bad. If we open the reject file in an editor, we will see comma-separated values consisting of some tags or indicators and some data values. There are two types of indicators in the reject file: the Row Indicator and the Column Indicator.

The best way to read the bad file is to copy its contents and save them as a CSV (Comma Separated Value) file. Opening the CSV file gives a spreadsheet-like look and feel. The first column in the reject file is the Row Indicator, which determines whether the row was destined for insert, update, delete or reject. It is basically a flag that determines the Update Strategy for the data row. When the Commit Type of the session is configured as User-defined, the row indicator indicates whether the transaction was rolled back due to a non-fatal error, or whether the committed transaction was in a failed target connection group.

List of Values of Row Indicators:

Row Indicator   Indicator Significance   Rejected By
0               Insert                   Writer or target
1               Update                   Writer or target
2               Delete                   Writer or target
3               Reject                   Writer
4               Rolled-back insert       Writer
5               Rolled-back update       Writer
6               Rolled-back delete       Writer
7               Committed insert         Writer
8               Committed update         Writer
9               Committed delete         Writer

Next come the column data values, each followed by its Column Indicator, which indicates the data quality of the corresponding column.

List of Values of Column Indicators:

Column Indicator: D
Type of data: Valid data or good data.
Writer treats as: Writer passes it to the target database. The target accepts it unless a database error occurs, such as finding a duplicate key while inserting.

Column Indicator: O
Type of data: Overflowed numeric data. Numeric data exceeded the specified precision or scale for the column.
Writer treats as: Bad data, if you configured the mapping target to reject overflow or truncated data.

Column Indicator: N
Type of data: Null value. The column contains a null value.
Writer treats as: Good data. Writer passes it to the target, which rejects it if the target database does not accept null values.

Column Indicator: T
Type of data: Truncated string data. String data exceeded a specified precision for the column, so the Integration Service truncated it.
Writer treats as: Bad data, if you configured the mapping target to reject overflow or truncated data.

Note also that the second column contains the column indicator flag value 'D', which signifies that the Row Indicator is valid. Now let us see what the data in a bad file looks like:
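As an illustration of the layout described above (row indicator, its own column indicator, then value and indicator pairs), here is a minimal Python sketch that decodes one reject-file line. The sample line is hypothetical and is not taken from a real session; a real bad file may need extra handling that this sketch does not attempt.

```python
import csv
import io

ROW_INDICATORS = {
    "0": "Insert", "1": "Update", "2": "Delete", "3": "Reject",
    "4": "Rolled-back insert", "5": "Rolled-back update", "6": "Rolled-back delete",
    "7": "Committed insert", "8": "Committed update", "9": "Committed delete",
}
COLUMN_INDICATORS = {
    "D": "Valid data",
    "O": "Overflowed numeric data",
    "N": "Null value",
    "T": "Truncated string data",
}

def parse_reject_line(line):
    """Split one bad-file line into the row type, its flag and (value, meaning) pairs."""
    fields = next(csv.reader(io.StringIO(line)))
    row_type = ROW_INDICATORS.get(fields[0], "Unknown")
    row_flag = fields[1]  # column indicator that belongs to the row indicator itself
    columns = []
    # The remaining fields alternate: data value, then its column indicator.
    for i in range(2, len(fields) - 1, 2):
        value, indicator = fields[i], fields[i + 1]
        columns.append((value, COLUMN_INDICATORS.get(indicator, "Unknown")))
    return row_type, row_flag, columns

# Hypothetical line: an insert row whose third column is null and fourth overflowed.
sample_line = "0,D,1001,D,Smith,D,,N,99999999,O"
row_type, row_flag, columns = parse_reject_line(sample_line)
for value, meaning in columns:
    print(repr(value), meaning)
```

Scanning the decoded pairs for anything other than valid data is usually the quickest way to spot which column caused the rejection.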

Implementing Informatica Incremental Aggregation
Using incremental aggregation, we apply captured changes in the source data (the CDC part) to aggregate calculations in a session. If the source changes incrementally and we can capture the changes, we can configure the session to process only those changes. This allows the Integration Service to update the target incrementally, rather than deleting the previously loaded data, processing the entire source and recalculating the same aggregates each time we run the session.
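The idea can be sketched in a few lines of Python: keep the aggregates from the previous run and fold in only the newly captured rows. This is a conceptual sketch and not how the Integration Service stores its aggregate cache; the store names and amounts are hypothetical, and it covers only a SUM over newly inserted rows.

```python
def apply_incremental_changes(aggregate_cache, changed_rows):
    """Fold only the newly captured rows into the saved per-group totals."""
    for store, amount in changed_rows:
        aggregate_cache[store] = aggregate_cache.get(store, 0) + amount
    return aggregate_cache

# Totals saved from the previous session run (hypothetical values).
saved_totals = {"Store1": 1600, "Store2": 2200}
# Rows captured as new since that run.
new_rows = [("Store1", 50), ("Store3", 75), ("Store1", 25)]

updated_totals = apply_incremental_changes(saved_totals, new_rows)
print(updated_totals)
```

Store2 is never touched because no change arrived for it, which is the saving incremental aggregation aims for.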

Using Informatica Normalizer Transformation
Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements. Learn how to use the Normalizer effectively here.

Using Normalizer Transformation
A Normalizer is an active transformation that returns multiple rows from a single source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. It can be used to transpose the data in columns to rows.

Normalizer effectively does the opposite of what Aggregator does!

Example of Data Transpose using Normalizer
Think of a relational table that stores four quarters of sales by store, where we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, as below. The following source rows contain four quarters of sales by store:

Source Table

Store    Quarter1   Quarter2   Quarter3   Quarter4
Store1   100        300        500        700
Store2   250        450        650        850

The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that identifies the quarter number:

Target Table

Store    Sales   Quarter
Store1   100     1
Store1   300     2
Store1   500     3
Store1   700     4
Store2   250     1
Store2   450     2
Store2   650     3
Store2   850     4
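The transposition above can be mimicked in plain Python to show what the Normalizer produces. This is only a sketch of the row-multiplying behaviour, with the occurrence number standing in for the GCID; the function name is hypothetical.

```python
def normalize_quarters(source_rows):
    """Turn one row per store into one row per store and quarter."""
    target = []
    for row in source_rows:
        store, *quarterly_sales = row
        # The 1-based position of each repeated column plays the role of the GCID.
        for gcid, sales in enumerate(quarterly_sales, start=1):
            target.append((store, sales, gcid))
    return target

source = [
    ("Store1", 100, 300, 500, 700),
    ("Store2", 250, 450, 650, 850),
]
target = normalize_quarters(source)
for row in target:
    print(row)
```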

How Informatica Normalizer Works
Suppose we have the following data in source:

Name   Month   Transportation   House Rent   Food
Sam    Jan     200              1500         500
John   Jan     300              1200         300
Tom    Jan     300              1350         350
Sam    Feb     300              1550         450
John   Feb     350              1200         290
Tom    Feb     350              1400         350

and we need to transform the source data and populate the target table as below:

Name   Month   Expense Type   Expense
Sam    Jan     Transport      200
Sam    Jan     House rent     1500
Sam    Jan     Food           500
John   Jan     Transport      300
John   Jan     House rent     1200
John   Jan     Food           300
Tom    Jan     Transport      300
Tom    Jan     House rent     1350
Tom    Jan     Food           350

and so on for the remaining rows. Below is the screenshot of a complete mapping that shows how to achieve this result using Informatica PowerCenter Designer.

Image: Normalization Mapping Example 1

I will explain the mapping further below.

Setting Up Normalizer Transformation Property
First we need to set the number of occurrences of the Expense head to 3 in the Normalizer tab of the Normalizer transformation, since we have Food, House Rent and Transportation. This in turn creates the corresponding 3 input ports in the Ports tab, along with the fields Individual and Month.

In the Ports tab of the Normalizer, the ports are created automatically as configured in the Normalizer tab. Interestingly, we will observe two new columns, namely:

• GK_EXPENSEHEAD
• GCID_EXPENSEHEAD

The GK field generates a sequence number starting from the value defined in the Sequence field, while GCID holds the value of the occurrence field, i.e. the column number of the input Expense head.

Here 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION.

Now the GCID tells us which expense corresponds to which field while converting columns to rows. Below is the screenshot of the expression to handle this GCID efficiently:

Image: Expression to handle GCID

This is how we will accomplish our task!
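Since the expression screenshot is not reproduced here, the following Python sketch shows the kind of decoding it has to perform: translate each GCID value into the expense type it stands for. It is an assumption about what the expression does, not a copy of it, and it leaves the GK column out. The port order (Food, House Rent, Transportation) follows the numbering given above.

```python
# GCID value -> expense type, following the numbering given in the text.
EXPENSE_TYPES = {1: "Food", 2: "House rent", 3: "Transport"}

def normalize_expenses(source_rows):
    """Emit one row per expense column, labelled by decoding the GCID."""
    target = []
    for name, month, food, house_rent, transport in source_rows:
        for gcid, expense in enumerate((food, house_rent, transport), start=1):
            target.append((name, month, EXPENSE_TYPES[gcid], expense))
    return target

# Input columns ordered as the three occurrences: Food, House Rent, Transportation.
source = [("Sam", "Jan", 500, 1500, 200), ("John", "Jan", 300, 1200, 300)]
target = normalize_expenses(source)
for row in target:
    print(row)
```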

Informatica Dynamic Lookup Cache
A Lookup cache does not change once built. But what if the data in the underlying lookup table changes after the lookup cache is created? Is there a way to keep the cache up to date even when the underlying table changes?

Dynamic Lookup Cache
Let's think about this scenario. You are loading your target table through a mapping. Inside the mapping you have a Lookup, and in the Lookup you are actually looking up the same target table you are loading. You may ask, "So? What's the big deal? We all do it quite often...". And yes, you are right. There is no "big deal", because Informatica (generally) caches the lookup table at the very beginning of the mapping, so whatever records get inserted into the target table through the mapping have no effect on the Lookup cache. The lookup will still hold the previously cached data, even if the underlying target table is changing.

But what if you want your Lookup cache to get updated as and when the target table changes? What if you want your lookup cache to always show the exact snapshot of the data in your target table at that point in time? Clearly this requirement will not be fulfilled if you use a static cache. You will need a dynamic cache to handle this.

But why would anyone need a dynamic cache?
To understand this, let's first understand a static cache scenario.

Informatica Dynamic Lookup Cache - What is Static Cache

STATIC CACHE SCENARIO
Let's suppose you run a retail business and maintain all your customer information in a customer master table (an RDBMS table). Every night, all the customers from your customer master table are loaded into a Customer Dimension table in your data warehouse. Your source customer table is a transaction system table, probably in 3rd normal form, and does not store history. That is, if a customer changes his address, the old address is overwritten with the new address. But your data warehouse table stores the history (maybe in the form of SCD Type-II).

There is a mapping that loads your data warehouse table from the source table. Typically you do a Lookup on the target (static cache) and check every incoming customer record to determine whether the customer already exists in the target. If the customer does not exist in the target, you conclude the customer is new and INSERT the record, whereas if the customer already exists, you may want to update the target record with this new record (if the record has changed). This is illustrated below. You don't need a dynamic Lookup cache for this.

Image: A static Lookup Cache to determine if a source record is new or updatable

Informatica Dynamic Lookup Cache - What is Dynamic Cache
DYNAMIC LOOKUP CACHE SCENARIO
Notice that in the previous example I mentioned that your source table is an RDBMS table. This ensures that your source table does not contain duplicate records (assuming it has a primary key). But what if you had a flat file as the source, with many duplicate records? Would the scenario be the same? No; see the illustration below.

Image: A Scenario illustrating the use of dynamic lookup cache
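A small Python sketch shows why a static cache breaks down in this scenario. The cache is built once from the target and never refreshed, so the second copy of a new customer still looks "new". The customer IDs and the function name are hypothetical.

```python
def load_with_static_cache(target_rows, source_rows):
    """Decide INSERT or UPDATE against a cache built once and never refreshed."""
    # Built before the first source row is read, like a static lookup cache.
    static_cache = {row["cust_id"] for row in target_rows}
    decisions = []
    for row in source_rows:
        action = "UPDATE" if row["cust_id"] in static_cache else "INSERT"
        decisions.append((row["cust_id"], action))
    return decisions

target_rows = [{"cust_id": 101, "name": "Sam"}]
# A flat file in which the new customer 102 appears twice.
source_rows = [
    {"cust_id": 101, "name": "Sam"},
    {"cust_id": 102, "name": "Linda"},
    {"cust_id": 102, "name": "Linda"},
]
decisions = load_with_static_cache(target_rows, source_rows)
print(decisions)
```

Customer 102 is flagged for INSERT twice, so the target would receive a duplicate row or a key violation.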

Here are some more examples of when you may consider using a dynamic lookup:

• Updating a master customer table with both new and updated customer information coming together, as shown above
• Loading data into a slowly changing dimension table and a fact table at the same time. Remember, you typically look up the dimension while loading the fact, so you load the dimension table before loading the fact table. But using a dynamic lookup, you can load both simultaneously.
• Loading data from a file with many duplicate records, eliminating duplicates in the target by updating a duplicate row, i.e. keeping the most recent row or the initial row
• Loading the same data from multiple sources using a single mapping. Just consider the previous retail business example. If you have more than one shop and Linda has visited two of your shops for the first time, the customer record for Linda will come twice during the same load.

Informatica Dynamic Lookup Cache - How does dynamic cache work
So, how does dynamic lookup work?

When the Integration Service reads a row from the source, it updates the lookup cache by performing one of the following actions:

• Inserts the row into the cache: If the incoming row is not in the cache, the Integration Service inserts the row in the cache based on input ports or a generated Sequence-ID. The Integration Service flags the row as insert.
• Updates the row in the cache: If the row exists in the cache, the Integration Service updates the row in the cache based on the input ports. The Integration Service flags the row as update.
• Makes no change to the cache: This happens when the row exists in the cache and the lookup is configured to insert new rows only; or the row is not in the cache and the lookup is configured to update existing rows only; or the row is in the cache but, based on the lookup condition, nothing changes. The Integration Service flags the row as unchanged.

Notice that the Integration Service actually flags the rows based on the above three conditions. And that's a great thing, because if you know the flag you can reroute the row to achieve different logic. This flag port is called NewLookupRow.

Using the value of this port, rows can be routed for insert, update or to do nothing. You just need to use a Router or Filter transformation followed by an Update Strategy. The actual values that you can expect in the NewLookupRow port are:

• 0 = Integration Service does not update or insert the row in the cache.
• 1 = Integration Service inserts the row into the cache.
• 2 = Integration Service updates the row in the cache.

When the Integration Service reads a row, it changes the lookup cache depending on the results of the lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the NewLookupRow port to indicate whether it inserts or updates the row in the cache, or makes no change.
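The three outcomes can be sketched as a small Python function that updates an in-memory cache as rows arrive and returns the corresponding NewLookupRow value. This is a simplified model: it ignores the insert-only and update-only configurations, and the cust_id key and the sample rows are hypothetical.

```python
NO_CHANGE, INSERTED, UPDATED = 0, 1, 2  # the NewLookupRow values listed above

def dynamic_lookup(cache, row, key="cust_id"):
    """Update the cache for one incoming row and return its NewLookupRow flag."""
    cached = cache.get(row[key])
    if cached is None:
        cache[row[key]] = dict(row)  # row not in cache: insert it
        return INSERTED
    if cached != row:
        cache[row[key]] = dict(row)  # row in cache but values differ: update it
        return UPDATED
    return NO_CHANGE                 # row in cache and identical

cache = {}
incoming = [
    {"cust_id": 102, "name": "Linda", "city": "Leeds"},
    {"cust_id": 102, "name": "Linda", "city": "Leeds"},
    {"cust_id": 102, "name": "Linda", "city": "York"},
]
flags = [dynamic_lookup(cache, row) for row in incoming]
print(flags)
```

A Router downstream would send flag 1 to the insert branch and flag 2 to the update branch, and drop flag 0.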

Informatica Dynamic Lookup Cache - Dynamic Lookup Mapping Example
Example of Dynamic Lookup Implementation
Ok, I have designed a mapping to show a dynamic lookup implementation, and a full screenshot of the mapping is given below.

And here are the screenshots of the lookup. First, the Lookup ports:

Image: Dynamic Lookup Ports Tab

And here is the Dynamic Lookup Properties Tab.

If you check the mapping screenshot, I have used a Router to reroute the INSERT group and the UPDATE group. The Router screenshot is also given below. New records are routed to the INSERT group and existing records are routed to the UPDATE group.

Router Transformation Groups Tab

Informatica Dynamic Lookup Cache - Dynamic Lookup Sequence ID
While using a dynamic lookup cache, we must associate each lookup/output port with an input/output port or a sequence ID. The Integration Service uses the data in the associated port to insert or update rows in the lookup cache. The Designer associates the input/output ports with the lookup/output ports used in the lookup condition. When we select Sequence-ID in the Associated Port column, the Integration Service generates a sequence ID for each row it inserts into the lookup cache. When the Integration Service creates the dynamic lookup cache, it tracks the range of values in the cache associated with any port using a sequence ID, and it generates a key for the port by incrementing the greatest existing sequence ID value by one when inserting a new row of data into the cache.

When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one and increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.
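The wrap-around rule can be sketched as follows. This is an interpretation of the description above, not the Integration Service's actual implementation: the class name is hypothetical, and max_id is deliberately tiny so the wrap-around is visible.

```python
class SequenceIdGenerator:
    """Sketch of sequence-ID generation for rows inserted into a dynamic cache."""

    def __init__(self, existing_ids, max_id):
        self.max_id = max_id
        # Track the range of IDs already present in the cache.
        self.smallest = min(existing_ids) if existing_ids else 1
        self.greatest = max(existing_ids) if existing_ids else 0
        self.wrapped_next = None  # next candidate once we have wrapped around

    def next_id(self):
        # Normal case: greatest existing ID plus one.
        if self.wrapped_next is None and self.greatest < self.max_id:
            self.greatest += 1
            return self.greatest
        # Maximum reached: start over at one.
        if self.wrapped_next is None:
            self.wrapped_next = 1
        # Stop at the smallest existing value minus one.
        if self.wrapped_next >= self.smallest:
            raise RuntimeError("out of unique sequence IDs; the session would fail")
        value = self.wrapped_next
        self.wrapped_next += 1
        return value

generator = SequenceIdGenerator(existing_ids=[3, 4], max_id=5)
generated = [generator.next_id() for _ in range(3)]
print(generated)
```

A fourth call in this example raises the error, mirroring the session failure described above.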

Informatica Dynamic Lookup Cache - Dynamic Lookup Ports
About the Dynamic Lookup Output Port
The lookup/output port output value depends on whether we choose to output old or new values when the Integration Service updates a row:

• Output old values on update: The Integration Service outputs the value that existed in the cache before it updated the row.
• Output new values on update: The Integration Service outputs the updated value that it writes in the cache. The lookup/output port value matches the input/output port value.

Note: We can configure the transformation to output old or new values using the Output Old Value On Update transformation property.

Informatica Dynamic Lookup Cache - NULL handling in LookUp
Handling NULL in dynamic LookUp
If the input value is NULL and we select the Ignore Null Inputs for Update property for the associated input port, the input value does not equal the lookup value or the value out of the input/output port. When you select the Ignore Null property, the lookup cache and the target table might become unsynchronized if you pass null values to the target. You must verify that you do not pass null values to the target.

When you update a dynamic lookup cache and target table, the source data might contain some null values. The Integration Service can handle the null values in the following ways:

• Insert null values: The Integration Service uses null values from the source and updates the lookup cache and target table using all values from the source.
• Ignore Null Inputs for Update property: The Integration Service ignores the null values in the source and updates the lookup cache and target table using only the non-null values from the source.

If we know the source data contains null values, and we do not want the Integration Service to update the lookup cache or target with null values, then we need to check the Ignore Null property for the corresponding lookup/output port. When we choose to ignore NULLs, we must verify that we output the same values to the target that the Integration Service writes to the lookup cache. We can configure the mapping based on the value we want the Integration Service to output from the lookup/output ports when it updates a row in the cache, so that the lookup cache and the target table do not become unsynchronized:

• New values. Connect only lookup/output ports from the Lookup transformation to the target.
• Old values. Add an Expression transformation after the Lookup transformation and before the Filter or Router transformation. Add output ports in the Expression transformation for each port in the target table and create expressions to ensure that we do not output null input values to the target.
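The difference between the two null-handling options can be sketched in Python, with None standing in for NULL. The function and column names are hypothetical, and the sketch models only the cache update, not the target load.

```python
def update_cached_row(cached_row, incoming_row, ignore_null_inputs):
    """Return the cache row after an update, with or without ignoring null inputs."""
    updated = dict(cached_row)
    for column, value in incoming_row.items():
        if value is None and ignore_null_inputs:
            continue  # keep the value already held in the cache
        updated[column] = value
    return updated

cached = {"cust_id": 102, "name": "Linda", "city": "Leeds"}
incoming = {"cust_id": 102, "name": "Linda", "city": None}

with_nulls = update_cached_row(cached, incoming, ignore_null_inputs=False)
without_nulls = update_cached_row(cached, incoming, ignore_null_inputs=True)
print(with_nulls["city"], without_nulls["city"])
```

If the cache keeps "Leeds" while the target receives the null from the input port, the two drift apart. That is why the values sent to the target must match what is written to the cache.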

Informatica Dynamic Lookup Cache - Other Details
When we run a session that uses a dynamic lookup cache, the Integration Service compares the values in all lookup ports with the values in their associated input ports by default. It compares the values to determine whether or not to update the row in the lookup cache. When a value in an input port differs from the value in the lookup port, the Integration Service updates the row in the cache.

But what if we don't want to compare all ports? We can choose the ports we want the Integration Service to ignore when it compares ports. The Designer only enables this property for lookup/output ports when the port is not used in the lookup condition. We can improve performance by ignoring some ports during comparison.

We might want to do this when the source data includes a column that indicates whether or not the row contains data we need to update. Select the Ignore in Comparison property for all lookup ports except the port that indicates whether or not to update the row in the cache and target table. Note: We must configure the Lookup transformation to compare at least one port; the Integration Service fails the session if we ignore all ports.

Pushdown Optimization In Informatica
Pushdown Optimization, a new concept in Informatica PowerCenter, allows developers to balance the data transformation load among servers. This article describes pushdown techniques.

What is Pushdown Optimization?
Pushdown optimization is a way of load-balancing among servers in order to achieve optimal performance. Veteran ETL developers often come across issues when they need to determine the appropriate place to perform ETL logic. Suppose an ETL logic needs to filter out data based on some condition. One can either do it in database by using WHERE condition in the SQL query or inside Informatica by using Informatica Filter transformation. Sometimes, we can even "push" some transformation logic to the target database instead of doing it in the source side (Especially in the case of EL-T rather than ETL). Such optimization is crucial for overall ETL performance.

How does Push-Down Optimization work?
One can push transformation logic to the source or target database using pushdown optimization. The Integration Service translates the transformation logic into SQL queries and sends the SQL queries to the source or the target database which executes the SQL queries to process the transformations. The amount of transformation logic one can push to the database depends on the database, transformation logic, and mapping and session configuration. The Integration Service analyzes the transformation logic it can push to the database and executes the SQL statement generated against the source or target tables, and it processes any transformation logic that it cannot push to the database.
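To make the idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the source/target database. It is not Informatica code: the table names and the sample rows are made up, and the "client" loop only imitates what happens when the logic is not pushed down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMP_SRC (EMPNO INTEGER, ENAME TEXT, DEPTNO INTEGER);
    CREATE TABLE TGT_PUSHDOWN (EMPNO INTEGER, ENAME TEXT, DEPTNO INTEGER);
    CREATE TABLE TGT_ROW_BY_ROW (EMPNO INTEGER, ENAME TEXT, DEPTNO INTEGER);
    INSERT INTO EMP_SRC VALUES (1, 'A', 10), (2, 'B', 50), (3, 'C', 60);
""")

# Pushed down: a single statement, the rows never leave the database.
conn.execute(
    "INSERT INTO TGT_PUSHDOWN "
    "SELECT EMPNO, ENAME, DEPTNO FROM EMP_SRC WHERE DEPTNO > 40"
)

# Not pushed down: the client reads every row, filters it, then writes it back.
rows = conn.execute("SELECT EMPNO, ENAME, DEPTNO FROM EMP_SRC").fetchall()
kept = [row for row in rows if row[2] > 40]
conn.executemany("INSERT INTO TGT_ROW_BY_ROW VALUES (?, ?, ?)", kept)

pushed = conn.execute("SELECT * FROM TGT_PUSHDOWN ORDER BY EMPNO").fetchall()
local = conn.execute("SELECT * FROM TGT_ROW_BY_ROW ORDER BY EMPNO").fetchall()
print(pushed == local)  # both targets hold the same two rows
```

Both paths load the same rows; the difference is where the filter runs and how much data has to travel between the database and the client.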

Using Pushdown Optimization
Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to pushdown optimization.

Let us take an example:

Image: Pushdown Optimization Example 1

Filter condition used in this mapping: DEPTNO > 40

Suppose a mapping contains a Filter transformation that filters out all employees except those with a DEPTNO greater than 40. The Integration Service can push the transformation logic to the database. It generates the following SQL statement to process the transformation logic:

    INSERT INTO EMP_TGT(EMPNO, ENAME, SAL, COMM, DEPTNO)
    SELECT EMP_SRC.EMPNO, EMP_SRC.ENAME, EMP_SRC.SAL, EMP_SRC.COMM, EMP_SRC.DEPTNO
    FROM EMP_SRC
    WHERE (EMP_SRC.DEPTNO > 40)

The Integration Service generates an INSERT SELECT statement and filters the data using a WHERE clause. The Integration Service does not extract data from the database at this time.

We can configure pushdown optimization in the following ways:

Using source-side pushdown optimization:

The Integration Service pushes as much transformation logic as possible to the source database. The Integration Service analyzes the mapping from the source to the target, or until it reaches a downstream transformation it cannot push to the source database. It then generates and executes the corresponding SELECT statement.

Using target-side pushdown optimization:
The Integration Service pushes as much transformation logic as possible to the target database. The Integration Service analyzes the mapping from the target to the source or until it reaches an upstream transformation it cannot push to the target database. It generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database and executes the DML.

Using full pushdown optimization:
The Integration Service pushes as much transformation logic as possible to both the source and target databases. If we configure a session for full pushdown optimization and the Integration Service cannot push all the transformation logic to the database, it performs source-side or target-side pushdown optimization instead. Also, the source and target must be on the same database.

The Integration Service analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it analyzes the target. When it can push all transformation logic to the database, it generates an INSERT SELECT statement to run on the database. The statement incorporates transformation logic from all the transformations in the mapping. If the Integration Service can push only part of the transformation logic to the database, it does not fail the session. It pushes as much transformation logic to the source and target database as possible and then processes the remaining transformation logic itself.

For example, a mapping contains the following transformations:

SourceDefn -> SourceQualifier -> Aggregator -> Rank -> Expression -> TargetDefn

Aggregator: SUM(SAL), SUM(COMM), group by DEPTNO
Rank: rank port on SAL
Expression: TOTAL = SAL + COMM

Image: Pushdown Optimization Example 2

The Rank transformation cannot be pushed to the database. If the session is configured for full pushdown optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator transformation to the source, processes the Rank transformation, and pushes the Expression transformation and target to the target database.

When we use pushdown optimization, the Integration Service converts the expression in the transformation or in the workflow link by determining equivalent operators, variables, and functions in the database. If there is no equivalent operator, variable, or function, the Integration Service itself processes the transformation logic. The Integration Service logs a message in the workflow log and the Pushdown Optimization Viewer when it cannot push an expression to the database. Use the message to determine the reason why it could not push the expression to the database.
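The way the Rank example is split into a source-side part, a part processed by the Integration Service, and a target-side part can be modelled with a small Python function. This is a deliberately simplified sketch of the behaviour described above (the real analysis also depends on the database and the session configuration); the function name and the dictionary keys are hypothetical.

```python
def split_for_full_pushdown(pipeline, not_pushable):
    """Split an ordered list of transformation names into three segments."""
    # Walk from the source until the first transformation that cannot be pushed.
    i = 0
    while i < len(pipeline) and pipeline[i] not in not_pushable:
        i += 1
    # Walk back from the target until the first transformation that cannot be pushed.
    j = len(pipeline)
    while j > i and pipeline[j - 1] not in not_pushable:
        j -= 1
    return {
        "source_side": pipeline[:i],
        "integration_service": pipeline[i:j],
        "target_side": pipeline[j:],
    }


pipeline = ["Source Qualifier", "Aggregator", "Rank", "Expression"]
plan = split_for_full_pushdown(pipeline, not_pushable={"Rank"})
print(plan["source_side"])          # ['Source Qualifier', 'Aggregator']
print(plan["integration_service"])  # ['Rank']
print(plan["target_side"])          # ['Expression']
```

When nothing is blocked, the whole pipeline ends up in `source_side`, which corresponds to the single INSERT SELECT case.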


How does Integration Service handle Push Down Optimization?
To push transformation logic to a database, the Integration Service might create temporary objects in the database. The Integration Service creates a temporary sequence object in the database to push Sequence Generator transformation logic to the database. The Integration Service creates temporary views in the database while pushing a Source Qualifier transformation or a Lookup transformation with a SQL override, an unconnected relational lookup, or a filtered lookup.

1. To push Sequence Generator transformation logic to a database, we must configure the session for pushdown optimization with Sequence.
2. To enable the Integration Service to create the view objects in the database, we must configure the session for pushdown optimization with View.
3. After the database transaction completes, the Integration Service drops the sequence and view objects created for pushdown optimization.

Configuring Parameters for Pushdown Optimization
Depending on the database workload, we might want to use source-side, target-side, or full pushdown optimization at different times. For that we can use the $$PushdownConfig mapping parameter. The settings in the $$PushdownConfig parameter override the pushdown optimization settings in the session properties. Create the $$PushdownConfig parameter in the Mapping Designer, select $$PushdownConfig for the Pushdown Optimization attribute in the session properties, and define the parameter in the parameter file. The possible values are:

1. None, i.e. the Integration Service itself processes all the transformations
2. Source [Seq View]
3. Target [Seq View]
4. Full [Seq View]

Pushdown Optimization Viewer

Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database. Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the corresponding SQL statement that is generated for the specified selections. When we select a pushdown option or pushdown group, we do not change the pushdown configuration. To change the configuration, we must update the pushdown option in the session properties.

Databases that support Informatica Pushdown Optimization
We can configure sessions for pushdown optimization with databases such as Oracle, IBM DB2, Teradata, Microsoft SQL Server, Sybase ASE, or databases that use ODBC drivers. When we use native drivers, the Integration Service generates SQL statements using native database SQL. When we use ODBC drivers, the Integration Service generates SQL statements using ANSI SQL. The Integration Service can generate more functions when it generates SQL statements using the native language instead of ANSI SQL.

Handling Error when Pushdown Optimization is enabled
When the Integration Service pushes transformation logic to the database, it cannot track errors that occur in the database. When the Integration Service runs a session configured for full pushdown optimization and an error occurs, the database handles the errors. When the database handles errors, the Integration Service does not write reject rows to the reject file. If we configure a session for full pushdown optimization and the session fails, the Integration Service cannot perform incremental recovery because the database processes the transformations. Instead, the database rolls back the transactions. If the database server fails, it rolls back transactions when it restarts. If the Integration Service fails, the database server rolls back the transaction.


Informatica Tuning - Step by Step Approach
This is the first in a series of weekly articles on Data Warehouse application performance tuning. This one is on Informatica performance tuning. Please note that this article is intended to be a quick guide. A more detailed Informatica performance tuning guide can be found here: Informatica Performance Tuning Complete Guide

Source Query/ General Query Tuning

1.1 Calculate original query cost
1.2 Can the query be rewritten to reduce cost?
    - Can an IN clause be replaced with EXISTS?
    - Can a UNION be replaced with UNION ALL if we are not using any DISTINCT clause in the query?
    - Is there a redundant table join that can be avoided?
    - Can we include an additional WHERE clause to further limit data volume?
    - Is there a redundant column used in GROUP BY that can be removed?
    - Is there a redundant column selected in the query but not used anywhere in the mapping?
1.3 Check if all the major joining columns are indexed
1.4 Check if all the major filter conditions (WHERE clause) are indexed
    - Can a function-based index improve performance further?
1.5 Check if any exclusive query hint reduces query cost
    - Check if a parallel hint improves performance and reduces cost
1.6 Recalculate query cost
    - If query cost is reduced, use the changed query
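Step 1.4 (checking that filter columns are indexed) can be tried out on a small scale. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in; SQLite's EXPLAIN QUERY PLAN is not the same thing as an Oracle query cost, and the table, index and helper names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP_SRC (EMPNO INTEGER, ENAME TEXT, DEPTNO INTEGER)")
query = "SELECT ENAME FROM EMP_SRC WHERE DEPTNO = 40"


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable plan detail.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


before = plan(query)  # no index yet: the plan is a full table scan
conn.execute("CREATE INDEX IDX_EMP_DEPTNO ON EMP_SRC (DEPTNO)")
after = plan(query)   # the plan now names IDX_EMP_DEPTNO

print(before)
print(after)
```

The same before/after comparison is what steps 1.1 and 1.6 ask for: capture the plan or cost, change one thing, and capture it again.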

Tuning Informatica LookUp
2.1 Redundant Lookup transformation
    - Is there a lookup which is no longer used in the mapping?
    - If there are consecutive lookups, can those be replaced inside a single lookup override?
2.2 Lookup conditions
    - Are all the lookup conditions indexed in the database? (Uncached lookup only)
    - An unequal condition should always be mentioned after an equal condition
2.3 Lookup override query
    - Should follow all guidelines from the Source Query part (section 1) above
2.4 There is no unnecessary column selected in the lookup (to reduce cache size)
2.5 Cached/Uncached
    - Carefully consider whether the lookup should be cached or uncached
    - General guidelines:
      - Generally don't use a cached lookup if the lookup table size is > 300 MB
      - Generally don't use a cached lookup if the lookup table row count is > 2,000,000
      - Generally don't use a cached lookup if the driving table (source table) row count is < 1000
2.6 Persistent cache
    - If the same lookup is cached and used in different mappings, consider a persistent cache
2.7 Lookup cache building
    - Consider "Additional Concurrent Pipeline" in the session properties to build the cache concurrently
    - "Prebuild Lookup Cache" should be enabled only if the lookup is surely called in the mapping

Tuning Informatica Joiner
3.1 Unless unavoidable, join database tables in the database only (homogeneous join) and don't use a Joiner
3.2 If an Informatica Joiner is used, always use sorted rows and try to sort them in the SQ query itself using ORDER BY (if a Sorter transformation is used, make sure the Sorter has enough cache to perform a 1-pass sort)
3.3 The smaller of the two joining tables should be the master
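The reason behind 3.3 is that the master rows are the ones held in the cache while the detail rows stream past them. A minimal Python sketch of that pattern (an ordinary in-memory hash join, not Informatica's implementation; all names and sample rows are made up):

```python
def join_with_master_cache(master_rows, detail_rows, key):
    """Cache the master rows by key, then stream the detail rows past the cache."""
    cache = {}
    for row in master_rows:
        cache.setdefault(row[key], []).append(row)
    for detail in detail_rows:
        # Detail rows without a matching master row are dropped (normal join).
        for master in cache.get(detail[key], []):
            yield {**master, **detail}


# Hypothetical data: the small department table is the master.
depts = [{"DEPTNO": 10, "DNAME": "SALES"}, {"DEPTNO": 20, "DNAME": "HR"}]
emps = [
    {"EMPNO": 1, "DEPTNO": 10},
    {"EMPNO": 2, "DEPTNO": 10},
    {"EMPNO": 3, "DEPTNO": 30},
]

joined = list(join_with_master_cache(depts, emps, "DEPTNO"))
print(len(joined))  # 2
```

Only the master side has to fit in memory, so choosing the smaller table as master keeps the cache small.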

Tuning Informatica Aggregator
4.1 When possible, sort the input for the Aggregator at the database end (ORDER BY clause)
4.2 If the input is not already sorted, use a Sorter. If possible, use the SQ query to sort the records.

Tuning Informatica Filter

5.1 Unless unavoidable, filter in the source query in the Source Qualifier
5.2 Place the Filter as close to the source as possible

Tuning Informatica Sequence Generator
6.1 Cache the sequence generator

Setting Correct Informatica Session Level Properties
7.1 Disable "High Precision" if not required (High Precision allows decimal precision of up to 28 digits)
7.2 Use "Terse" mode for tracing level
7.3 Enable pipeline partitioning (Thumb rule: Maximum no. of partitions = No. of CPUs / 1.2) (Also remember that increasing partitions will multiply the cache memory requirement accordingly)

Tuning Informatica Expression
8.1 Use variables to reduce redundant calculation
8.2 Remove the default value "ERROR('transformation error')" for output columns
8.3 Try to reduce code complexity, such as nested IFs
8.4 Try to reduce unnecessary type conversions in calculations

Implementing Informatica Partitions
Why use Informatica Pipeline Partition?
Identification and elimination of performance bottlenecks will obviously optimize session performance. After tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.

PowerCenter Informatica Pipeline Partition
Different Types of Informatica Partitions
We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.

Informatica Pipeline Partitioning Explained
Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the transformations and the target. When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel. A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage.

Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points we can define the partition type and add or delete partitions (number of partitions). We can set the following attributes to partition a pipeline:

Partition point: Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads. We cannot create partition points at Source instances or at Sequence Generator transformations.

Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline. The number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.

Partition types: The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points.
We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.

Database partitioning: The Integration Service queries the database system for table partition information. It reads partitioned data from the corresponding nodes in the database.

Pass-through: The Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in the partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.

Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where we want each partition to process approximately the same number of rows, i.e. load balancing.

Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key. The Integration Service uses all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.

Hash user keys: The Integration Service uses a hash function to group rows of data among partitions. We define the number of ports to generate the partition key.

Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.

We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning.

Add, delete, or edit partition points on the Partitions view on the Mapping tab of the session properties in Workflow Manager.

The PowerCenter® Partitioning Option increases the performance of PowerCenter through parallel data processing. This option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.
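How round-robin, hash and key range partitioning distribute rows can be illustrated with a short Python sketch. These functions only imitate the distribution rules described above; the function names, the CRC32 hash and the half-open range convention (start inclusive, end exclusive) are assumptions made for the example, not Informatica internals.

```python
import zlib


def round_robin(rows, n):
    """Deal rows out in turn so every partition gets about the same number."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_keys(rows, n, key_ports):
    """Rows with the same key values always land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        key = "|".join(str(row[p]) for p in key_ports)
        # zlib.crc32 is deterministic across runs, unlike Python's built-in hash() for str.
        parts[zlib.crc32(key.encode("utf-8")) % n].append(row)
    return parts


def key_range(rows, port, ranges):
    """ranges holds one (start, end) pair per partition; start inclusive, end exclusive."""
    parts = [[] for _ in ranges]
    for row in rows:
        for i, (start, end) in enumerate(ranges):
            if start <= row[port] < end:
                parts[i].append(row)
                break  # in this sketch, rows outside every range are simply skipped
    return parts


rows = [{"DEPTNO": d} for d in (10, 20, 30, 40, 50, 60)]
print([len(p) for p in round_robin(rows, 3)])                      # [2, 2, 2]
print([len(p) for p in key_range(rows, "DEPTNO", [(0, 30), (30, 100)])])  # [2, 4]
```

Round-robin balances row counts, hash partitioning keeps each group together (which is what Rank, Sorter and unsorted Aggregator need), and key range follows the boundaries we define.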

Implementing Informatica Persistent Cache
You must have noticed that the time Informatica takes to build the lookup cache can sometimes be too long, depending on the lookup table size/volume. Using a persistent cache, you may save a lot of time. This article describes how to do it.

What is Persistent Cache?
Lookups are cached by default in Informatica. This means that Informatica by default brings in the entire data of the lookup table from the database server to the Informatica server as a part of the lookup cache building activity during the session run. If the lookup table is huge, this is bound to take quite some time. Now consider this scenario: what if you are looking up the same table multiple times using different lookups in different mappings? Do you want to spend the time building the lookup cache again and again for each lookup? Of course not! Just use the persistent cache option. A lookup cache can be either non-persistent or persistent. The Integration Service saves or deletes lookup cache files after a successful session run based on whether the lookup cache is checked as persistent or not.

Where and when we shall use persistent cache:
Suppose we have a lookup table with the same lookup condition and return/output ports, and the lookup table is used many times in multiple mappings. Let us say a Customer Dimension table is used in many mappings to populate the surrogate key in the fact tables based on their source system keys. Now if we cache the same Customer Dimension table multiple times in multiple mappings, that would definitely affect the SLA loading timeline. There can be some functional reasons also for selecting to use a persistent cache. Please read the article Advantage and Disadvantage of Persistent Cache Lookup to know how a persistent cache can be used to ensure data integrity in long running ETL sessions where underlying tables are also changing.

So the solution is to use Named Persistent Cache.
In the first mapping we will create the Named Persistent Cache file by setting three properties in the Properties tab of Lookup transformation.

Lookup cache persistent: To be checked, i.e. a Named Persistent Cache will be used.
Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that will be used in all the other mappings using the same lookup table. Enter the prefix name only. Do not enter .idx or .dat.
Re-cache from lookup source: To be checked, i.e. the Named Persistent Cache file will be rebuilt or refreshed with the current data of the lookup table.

Next, in all the mappings where we want to use the same already built Named Persistent Cache, we need to set two properties in the Properties tab of the Lookup transformation.

Lookup cache persistent: To be checked, i.e. the lookup will use a Named Persistent Cache that is already saved in the Cache Directory. If the cache file is not there, the session will not fail; it will just create the cache file instead.
Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that was defined in the mapping where the persistent cache file was created.

Note:
If there is any Lookup SQL Override, then the SQL statement in all the lookups should match exactly. Even an extra blank space will fail the session that is using the already built persistent cache file. So if the incoming source data volume is high, the lookup table's data volume that needs to be cached is also high, and the same lookup table is used in many mappings, then the best way to handle the situation is to use a named persistent cache that is built once and then reused.
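The build-once, reuse-everywhere behaviour of the three properties above can be imitated in plain Python. This is only an analogy under stated assumptions: the cache is a pickled dictionary rather than Informatica's .idx/.dat files, and the function name, the ".cache" suffix and the sample customer keys are all hypothetical.

```python
import os
import pickle
import tempfile


def load_lookup_cache(cache_dir, prefix, read_lookup_source, recache=False):
    """Return (cache, built). Reuse the saved file unless it is missing or recache is set."""
    path = os.path.join(cache_dir, prefix + ".cache")  # hypothetical file name
    if not recache and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), False
    cache = read_lookup_source()  # the expensive step: reading the lookup table
    with open(path, "wb") as f:
        pickle.dump(cache, f)
    return cache, True


def read_customer_dim():
    # Stand-in for querying the lookup table: source system key -> surrogate key.
    return {"C001": 1, "C002": 2}


with tempfile.TemporaryDirectory() as cache_dir:
    # First mapping: persistent + re-cache from source, so the file is (re)built.
    _, built_first = load_lookup_cache(cache_dir, "customer_dim", read_customer_dim, recache=True)
    # Later mappings: persistent only, so the saved file is reused.
    cache, built_second = load_lookup_cache(cache_dir, "customer_dim", read_customer_dim)

print(built_first, built_second, cache["C002"])  # True False 2
```

As in the note above, a missing file does not cause a failure in this sketch either: the second call would simply build the cache itself.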

Aggregation without Informatica Aggregator
Since Informatica processes data row by row, it is generally possible to handle a data aggregation operation even without an Aggregator transformation. In certain cases, you may get a huge performance gain using this technique!

General Idea of Aggregation without Aggregator Transformation
Let us take an example: Suppose we want to find the SUM of SALARY for each department of the Employee table. The SQL query for this would be:

    SELECT DEPTNO, SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;

If we need to implement this in Informatica, it would be very easy, as we would obviously go for an Aggregator transformation. By taking the DEPTNO port as GROUP BY and one output port as SUM(SALARY), the problem can be solved easily.

Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator. We rely on a basic property of the Expression transformation: it can hold the value of an attribute from the previous row.

But wait... why would we do this? Aren't we complicating the thing here?
Yes, we are. But as it turns out, in many cases it might have a performance benefit (especially if the input is already sorted, or when you know the input data will not violate the order, for example when you are loading daily data and want to sort it by day). Remember that Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and it also defeats the normal row-by-row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space requirement and ease row-by-row processing. The mapping below shows how to do this.

Image: Aggregation with Expression and Sorter 1

Sorter (SRT_SAL) Ports Tab

Now I am showing a Sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use it, thereby increasing the performance benefit.

Expression (EXP_SAL) Ports Tab

Image: Expression Ports Tab Properties

Sorter (SRT_SAL1) Ports Tab

Expression (EXP_SAL2) Ports Tab

Filter (FIL_SAL) Properties Tab

This is how we can implement aggregation without using Informatica aggregator transformation. Hope you liked it!
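Since the screenshots of the port definitions are not reproduced here, the following Python sketch shows one way the Sorter -> Expression -> Sorter -> Expression -> Filter chain could work. It is an assumption about the port logic, not a copy of the original mapping: the first stage keeps a running total per department using "previous row" variables, and the later stages keep only the row that carries the full total. All function names and sample rows are hypothetical.

```python
def running_totals(sorted_rows):
    """Stage 1 (Expression): attach a per-department running total to every row."""
    prev_deptno, total = None, 0  # play the role of variable ports
    for seq, (deptno, salary) in enumerate(sorted_rows):
        total = salary if deptno != prev_deptno else total + salary
        prev_deptno = deptno
        yield deptno, seq, total


def keep_group_totals(rows_with_totals):
    """Stages 2-4: re-sort so each group's last row comes first, flag it, filter on the flag."""
    reordered = sorted(rows_with_totals, key=lambda r: (r[0], -r[1]))
    prev_deptno = None
    for deptno, _seq, total in reordered:
        is_first_in_group = deptno != prev_deptno
        prev_deptno = deptno
        if is_first_in_group:
            yield deptno, total


# Hypothetical (DEPTNO, SALARY) rows; sorted() stands in for the first Sorter.
emp_rows = sorted([(20, 300), (10, 100), (10, 250), (20, 50), (30, 75)])
totals = list(keep_group_totals(running_totals(emp_rows)))
print(totals)  # [(10, 350), (20, 350), (30, 75)]
```

The result matches what `SELECT DEPTNO, SUM(SALARY) ... GROUP BY DEPTNO` would return for the same rows, but no stage needs to hold a whole group in an aggregator cache.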

What are the differences between Connected and Unconnected Lookup?
Connected Lookup | Unconnected Lookup
Connected lookup participates in dataflow and receives input directly from the pipeline | Unconnected lookup receives input values from the result of a LKP: expression in another transformation
Connected lookup can use both dynamic and static cache | Unconnected lookup cache can NOT be dynamic
Connected lookup can return more than one column value (output port) | Unconnected lookup can return only one column value, i.e. output port
Connected lookup caches all lookup columns | Unconnected lookup caches only the lookup output ports in the lookup conditions and the return port
Supports user-defined default values (i.e. value to return when lookup conditions are not satisfied) | Does not support user-defined default values

What is the difference between Router and Filter?
Router | Filter
Router transformation divides the incoming records into multiple groups based on some condition. Such groups can be mutually inclusive (different groups may contain the same record) | Filter transformation restricts or blocks the incoming record set based on one given condition
Router transformation itself does not block any record. If a certain record does not match any of the routing conditions, the record is routed to the default group | Filter transformation does not have a default group. If a record does not match the filter condition, the record is blocked
Router acts like a CASE.. WHEN statement in SQL (or a Switch().. Case statement in C) | Filter acts like a WHERE condition in SQL
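The difference can be seen in a few lines of Python. This is a simplified model of the two behaviours described above, with made-up function names, group names and sample rows.

```python
def filter_rows(rows, condition):
    """Filter: one condition, rows that fail it are dropped."""
    return [row for row in rows if condition(row)]


def route_rows(rows, groups):
    """Router: every row is tested against every group; unmatched rows go to DEFAULT."""
    routed = {name: [] for name in groups}
    routed["DEFAULT"] = []
    for row in rows:
        matched = False
        for name, condition in groups.items():
            if condition(row):
                routed[name].append(row)
                matched = True
        if not matched:
            routed["DEFAULT"].append(row)
    return routed


rows = [
    {"EMPNO": 1, "DEPTNO": 10, "SAL": 6000},
    {"EMPNO": 2, "DEPTNO": 10, "SAL": 2000},
    {"EMPNO": 3, "DEPTNO": 20, "SAL": 1000},
]

kept = filter_rows(rows, lambda r: r["SAL"] >= 5000)
routed = route_rows(rows, {
    "HIGH_SAL": lambda r: r["SAL"] >= 5000,
    "DEPT_10": lambda r: r["DEPTNO"] == 10,
})

print([r["EMPNO"] for r in kept])               # [1]
print([r["EMPNO"] for r in routed["DEPT_10"]])  # [1, 2]
print([r["EMPNO"] for r in routed["DEFAULT"]])  # [3]
```

Employee 1 appears in both HIGH_SAL and DEPT_10 (mutually inclusive groups), and employee 3 is not lost but lands in the default group, whereas the Filter simply drops every row that fails its single condition.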

What can we do to improve the performance of Informatica Aggregator Transformation?
Aggregator performance improves dramatically if records are sorted before passing to the Aggregator and the "Sorted Input" option under Aggregator properties is checked. The record set should be sorted on those columns that are used in the Group By operation. It is often a good idea to sort the record set at the database level, e.g. inside a Source Qualifier transformation, unless there is a chance that the already sorted records from the Source Qualifier can become unsorted again before reaching the Aggregator.

What are the different lookup cache?
Lookups can be cached or uncached (no cache). A cached lookup can be either static or dynamic. A static cache is one which is not modified once it is built; it remains the same during the session run. On the other hand, a dynamic cache is refreshed during the session run by inserting or updating the records in the cache based on the incoming source data. A lookup cache can also be classified as persistent or non-persistent, based on whether or not Informatica retains the cache after the session run is complete.

How can we update a record in target table without using Update strategy?
A target table can be updated without using 'Update Strategy'. For this, we need to define the key in the target table in Informatica level and then we need to connect the key and the field we want to update in the mapping Target. In the session level, we should set the target property as "Update as Update" and check the "Update" check-box. Let's assume we have a target table "Customer" with fields as "Customer ID", "Customer Name" and "Customer Address". Suppose we want to update "Customer Address" without an Update Strategy. Then we have to define "Customer ID" as primary key in Informatica level and we will have to connect Customer ID and Customer Address fields in the mapping. If the session properties are set correctly as described above, then the mapping will only update the customer address field for all matching customer IDs.

Deleting duplicate row using Informatica
Q1. Suppose we have Duplicate records in Source System and we want to load only the unique records in the Target System eliminating the duplicate rows. What will be the approach? Ans. Let us assume that the source system is a Relational Database . The source table is having duplicate rows. Now to eliminate duplicate records, we can check the Distinct option of the Source Qualifier of the source table and load the target accordingly.

Source Qualifier Transformation DISTINCT clause

But what if the source is a flat file? How can we remove the duplicates from flat file source?

