Two-two

Published on May 2016 | Categories: Types, School Work | Downloads: 50 | Comments: 0 | Views: 734

http://blog.iadvise.eu/tag/etl/

Talend: Schema compatibility check
Posted on October 8, 2014 by Jessica Smets

Most of the time when talking about Talend jobs, people think of standard
ETL (Extract, Transform, Load). But in some cases you need to check
the incoming data before loading it into the target, rather than just
transforming it. We refer to this process as E-DQ-L
(Extract, Data Quality, Load).
One of the things that you might want to check before loading is schema
compatibility. For example: you expect to get a String that is at most 5
characters long. If you, for any reason, receive a String that is longer than
5 characters, it will generate an error. Or perhaps you expect a percentage
(as a BigDecimal like 0.19), but you receive it as a String (“19%”). This
will result in a failing job with an error saying “Type mismatch: cannot convert
from dataType to otherDataType”.
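The two checks from this example can be sketched in plain Java. This is only an illustration of the idea, not Talend's own implementation: the class name SchemaCheckSketch and both method names are made up for this post.

```java
import java.math.BigDecimal;

// Hypothetical helper (not part of Talend): mimics the two schema checks described above.
final class SchemaCheckSketch {

    // True when the value fits a String column with the given maximum length.
    static boolean fitsLength(String value, int maxLength) {
        return value != null && value.length() <= maxLength;
    }

    // True when the text can be read as a BigDecimal such as "0.19".
    static boolean isBigDecimal(String value) {
        if (value == null) {
            return false;
        }
        try {
            new BigDecimal(value.trim());
            return true;
        } catch (NumberFormatException e) {
            // "19%" ends up here: the percent sign is not part of a valid number.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsLength("ABCDE", 5));
        System.out.println(fitsLength("ABCDEF", 5));
        System.out.println(isBigDecimal("0.19"));
        System.out.println(isBigDecimal("19%"));
    }
}
```

A value that fails either check is a candidate for the 'corrupt' data store rather than the target.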
Before I continue this blog I would like to emphasize that all the solutions
below are possible with the Data Integration version of Talend, except for the
last one. The last option requires a Talend Data Quality license.
Let’s create an example case: we want to extract data on a regular basis
from a third-party source which we cannot fully trust in terms of schema
settings. We know how many columns to expect and we have a rough
idea of what they contain, but we do not trust the source to always deliver
compatible data. We want to load the records that are valid and we want to
separately store the ‘corrupt’ data for logging purposes. I’ve gathered
several solutions for this problem:
1. Use rejected flow on an input-component
One thing you can do is reject the records as soon as you import them.
Disable “Die on error” on the basic settings tab of your input component,
then right-click it and select “Reject”. The rows will be rejected based on the
schema of the file. In the example below we defined the phone number as an
integer, and as you can see 1 record is being rejected. This is because the phone
number contains non-numeric characters and therefore cannot be read as an
integer. If you do not disable the “Die on error” option, this component
makes the job fail.
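To make the behaviour concrete, here is a minimal Java sketch of what a reject flow does conceptually: rows that match the schema continue, rows that do not are set aside together with the reason. The class RejectFlowSketch, the semicolon-separated row layout and the sample data are assumptions for illustration only; the code Talend generates looks different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a main flow and a reject flow (not Talend-generated code).
final class RejectFlowSketch {

    // Rows whose phone number could be read as an integer.
    final List<String> accepted = new ArrayList<>();
    // Rows that failed, with the parser's message appended as the reject reason.
    final List<String> rejected = new ArrayList<>();

    // Each input row is {name, phone}; the schema expects phone to be an integer.
    void read(List<String[]> rows) {
        for (String[] row : rows) {
            String name = row[0];
            String phone = row[1];
            try {
                int parsed = Integer.parseInt(phone);
                accepted.add(name + ";" + parsed);
            } catch (NumberFormatException e) {
                // With "Die on error" enabled the job would stop here instead.
                rejected.add(name + ";" + phone + ";" + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        RejectFlowSketch flow = new RejectFlowSketch();
        flow.read(Arrays.asList(
                new String[] {"Alice", "5551234"},
                new String[] {"Bob", "555-CALL"}));
        System.out.println("accepted: " + flow.accepted);
        System.out.println("rejected: " + flow.rejected);
    }
}
```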

2. In case of the target being a database: use rejected links
You can also choose to directly input the data into your database, but to
reject any rows that would create an error. You can then create a separate
flow to determine what to do with these rejected records.
In your database output component (for example tOracleOutput) change the
following:

Basic settings: Uncheck “Die on error”
Advanced settings: Uncheck “Use batch size”

Now, right-click on your component and select “Row-Reject” and connect it
to an output-component. The output you’ll receive will be the rejected rows
and what error would have been generated if you tried inserting them, as
you can see in the picture below.

3. Use a tFilterRow-component
You can make the data go through a filter component before inserting it into
your target. You can (manually) decide what’s allowed to go through. This
can be useful when your destination is not a database, in which case option
2 is not available.

A tFilterRow-component also has the possibility to output the rejected rows,
including the reason why they got rejected. You can enable this by
right-clicking on your filter and selecting “Row-Reject”. An example of rejected
rows by the filter:

Note – You can also use self-defined routines in the tFilterRow-component by
checking “Use advanced mode”. This can be useful when you want to check
whether or not a conversion is possible. For example: you could define a
routine called “isInteger” that returns true if the conversion is valid and
false if it’s impossible.
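Such a routine could look roughly like the sketch below. The class name ConversionChecks is an assumption (in Talend Studio a user routine is a Java class with static methods), and treating null or blank input as "not an integer" is a choice made for this example.

```java
// Hypothetical user routine: checks whether a String can be converted to an int.
final class ConversionChecks {

    // Returns true if the value can be parsed as an int, false otherwise.
    static boolean isInteger(String value) {
        if (value == null || value.trim().isEmpty()) {
            return false;
        }
        try {
            Integer.parseInt(value.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

In advanced mode, the filter condition would then call the routine on the incoming column, for instance something like ConversionChecks.isInteger(input_row.phone), where the column name phone is made up for this example.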
4. Use a tSchemaComplianceCheck-component
Another way of making sure that your schema is compatible is by using the
tSchemaComplianceCheck-component. Unfortunately, this component is only
integrated in the Data Quality version of Talend.
It’s a very easy component to use. The only thing you have to do is connect
the incoming data to the tSchemaComplianceCheck-component and then
continue its flow to the destination. You can get the rejected rows the
same way as previously (by right-clicking on it and then selecting “Row>Reject”).

The rejected rows and their error message look like this:

That’s it for now. There are probably many other ways of checking schema
compatibility. Feel free to comment if you know any. Thank you for reading!
Posted in Talend | Tagged ETL, Talend

Talend: tips and tricks part 2
Posted on August 26, 2014 by Jessica Smets

In the first part of these entries we discussed how to test your expressions,
the importance of optimizing the appearance of a tLogRow component and
how to handle windows and views within Talend. This time around, we will be
talking about the different ways to get components into your job, how to
trace your dataflow and how to easily sync columns. As last time, this post
will be useful for both starting and experienced users.
4. Getting components into your job
There are many ways to get components into your job. Most people search
the palette (either with the search function or by manually exploring the
folders) and drag and drop the components into their job. You can achieve the
same thing by simply clicking anywhere in your job and then typing
the name of the component. Obviously this is only recommended once
you’re familiar with the different components and their names.

When working with metadata, you can use certain shortcuts to save a bit of
time. Usually people just click on the metadata and then drop it onto their
job. This will pop up a window allowing you to choose which type of
component you want to use. Holding the Control key while dragging the
metadata will directly create an Output-component. Holding Control+Shift
will result in an Input-component.
5. Syncing columns
Occasionally, you may have to change the schema of a certain component in
the middle of development. This might affect other components in your job.
In some cases, Talend asks if you want to propagate the changes you’ve
made (to the other components).

You may accidentally close this window, click “No” or not get this message at
all, resulting in the following error: “The schema from the input
link “youroutputlink” is different from the schema defined in the component”.

When this happens, you can go to the basic settings of the component that
has the error and click on “Sync columns”. The error should now be gone.

6. Tracing your dataflow (Debug Run)
Lastly, I would like to say a few words about the debug run. In some cases
we want to closely watch our dataflow in order to get a better understanding
of what exactly is happening. You can achieve this by running your job in
debug mode. To do so, open the Run window, click on
the “Debug Run” tab on the left side of the window and start it by clicking on
“Traces Debug”.

The moment you open the “Debug Run” tab, you’ll immediately see extra
icons in your job. These magnifying glass icons indicate that details will be
shown when you debug-run your job. The result should look something like
this:

You can Pause and Resume the run at any time. You can also add breakpoints
if you like. Do this by right-clicking on a dataflow and then selecting “Show
Breakpoint Setup”.

This brings you to the “Breakpoint” tab of the data flow you clicked on. You
can also go there by clicking on the specific flow and manually selecting
“Breakpoint”. Let’s add a breakpoint to pause our run whenever we come
across a record with “Bloom” as last name. First, make sure to check the
“Activate conditional breakpoint” option. After that, click on the plus icon
underneath the conditions. Then select the InputColumn we want to put our
condition on, in our case “Last_name”, and add a value (“Bloom” in
this example). The default Operation is “Equals”, which is the one we want.
You can select a different Operation if you need to, but that is unnecessary in
this case.

You can add multiple breakpoints if you like. Whenever you debug run your
job now, it will stop at a record where the Last_name is “Bloom” (if any
exist).
That’s it for now. Thank you for reading!
Posted in data integration, ETL, Talend, Tips and Tricks | Tagged ETL, Talend

Talend: tips and tricks part 1
Posted on August 4, 2014 by Jessica Smets

This blog contains some convenient tips and tricks that will make working
with the open source tool Talend for data integration a lot more efficient. This
blogpost will be especially useful for people who are just discovering this
amazing tool, yet I am sure that people who have been using it for a while
will also find it very helpful. This series of tips will be spread over multiple
blog entries so make sure to check back often for future tips!
1. Testing expressions in the tMap component
Using the tMap component, you have the possibility to test your expressions.
This way you can easily see whether or not the result is what you expected it
to be. You can also use this to determine whether or not your expression will
throw an error. Let’s create an example.
We’ve got details of employees as input for our tMap. We would like the first
name to be shown in uppercase. First of all, go into the expression builder by
clicking the ellipsis next to your expression.

To convert the first name to uppercase, we have to use the StringHandling
function “UPCASE”. This results in the following
expression: StringHandling.UPCASE(employee.First_name)
Fill in a test value for the input column, click on the “Test!” button and wait for
the result. If everything goes as expected, you should see your first name in
uppercase on the right side of the window.
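If you want to reason about the expected result outside the expression builder, the expression behaves much like Java's own String.toUpperCase. The sketch below is a stand-in written for this post, not Talend's source code: the class UpcaseSketch is hypothetical, and both the null handling and the fixed locale are assumptions.

```java
import java.util.Locale;

// Hypothetical stand-in for StringHandling.UPCASE(employee.First_name).
final class UpcaseSketch {

    // Returns the value in uppercase; null stays null (an assumption for this sketch).
    static String upcase(String value) {
        return value == null ? null : value.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Test value, comparable to the one you would enter in the expression builder.
        System.out.println(upcase("Jessica"));
    }
}
```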

2. Optimizing the appearance of the tLogRow component output

tLogRow is one of the most frequently used components. It is recommended
that you learn how to optimize its use. Firstly, make sure that you always
have the right appearance selected for your output. You can find this
property in the basic settings of your tLogRow-component.

There are three types of Modes that you can choose between:


Basic

Basic will generate a new line for each record, with the fields separated by the “Field
Separator” you’ve chosen (see image above). When using basic mode, I
highly recommend checking the “Print header” option when working with
multiple-column records or multiple outputs, purely for readability.



Table (print values in cells of a table)

The table mode shows the records and their headers in a table-format,
including the name of the component that generated this output (in our
case: “tLogRow_1”). This emphasizes the importance of properly naming
everything, especially when you have multiple components that generate
output. In this case, it would have been better to rename our component to
“EMPLOYEES”. Personally, I prefer this mode.



Vertical (each row is a key value/list)

Vertical mode will show a table for each one of your records.

The output mode you decide to use depends on what you’re trying to
visualize. For example, when your goal is to show a single string, I would
recommend using the basic mode. But when you have multiple table outputs
(for example: departments, customers and employees in a single output), I’m
certain the table mode would be the best option.
Sometimes your data is spread over multiple lines, resulting in an unclear
output, as shown in the image below.

To force the output to put all the data on one single line, you can uncheck the
“Wrap” option. This option is located underneath your output and will enable
a horizontal scrollbar.

Do you also want to be able to get data regarding tweets using Talend, as
shown in the image above? Read my previous blogpost and find out how!
3. Resetting windows and maximizing/minimizing them
Sometimes you accidentally close a window and have a hard time finding a
way to get it back. You can very easily reset your environment by clicking on
“Window” – “Reset Perspective”.

You can see all of the views by clicking on “Window” – “Show View” –
“Talend”. Some of the views are not shown by default, such as “Modules”.
Modules can be used to import .jar-files without having to restart your studio,
which will most likely save you some time.
Lastly, because Talend is Eclipse-based, you have the possibility to maximize
and minimize windows. I personally use this function when examining the
output of a tLogRow-component that contains a lot of data. You can achieve this
by either double-clicking on the window or by right-clicking on it and
selecting “Minimize”/“Maximize”.
That’s it for now. I hope you enjoyed reading this blog and make sure to
return soon for future blogs!
Posted in data integration, ETL, Talend, Tips and Tricks | Tagged ETL, Talend

Use of contexts within Talend
Posted on May 27, 2014 by Dieter Van Ransbeek

When developing jobs in Talend, it’s sometimes necessary to run them on
different environments. For other business cases, you need to pass values
between multiple sub-jobs in a project. To solve these kinds of issues, Talend
introduced the notion of “contexts”.
In this blogpost we elaborate on the usage of contexts for easily switching
between a development and a production environment by storing the
connection data in context variables. This allows you to determine on which
environment the job should run, at runtime, without having to recompile or
modify your project.
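Conceptually, a context group is a set of named variables with one value per environment. The Java sketch below mirrors that idea outside of Talend. Everything in it is made up for illustration: the class ContextSketch, the context names DEV and PROD, the variable names and the host names. Inside a Talend job you would simply reference the context variables in the component properties instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of context variables: one set of values per environment.
final class ContextSketch {

    // Context name -> (variable name -> value). All values are invented examples.
    static final Map<String, Map<String, String>> CONTEXTS = new HashMap<>();

    static {
        Map<String, String> dev = new HashMap<>();
        dev.put("host", "dev-db.example.com");
        dev.put("port", "1521");
        dev.put("sid", "HRDEV");
        CONTEXTS.put("DEV", dev);

        Map<String, String> prod = new HashMap<>();
        prod.put("host", "prod-db.example.com");
        prod.put("port", "1521");
        prod.put("sid", "HRPROD");
        CONTEXTS.put("PROD", prod);
    }

    // Builds an Oracle thin JDBC URL from the variables of the chosen context.
    static String jdbcUrl(String contextName) {
        Map<String, String> ctx = CONTEXTS.get(contextName);
        if (ctx == null) {
            throw new IllegalArgumentException("Unknown context: " + contextName);
        }
        return "jdbc:oracle:thin:@" + ctx.get("host") + ":" + ctx.get("port") + ":" + ctx.get("sid");
    }

    public static void main(String[] args) {
        // The environment is chosen at runtime; the job logic itself does not change.
        String contextName = args.length > 0 ? args[0] : "DEV";
        System.out.println(jdbcUrl(contextName));
    }
}
```

The point of the design is that only the selected context changes between runs, while the code that uses the variables stays identical.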
To start using contexts in Talend you have two possible scenarios:
1) you can create a new context group and its corresponding context
variables manually, or
2) you can export an existing connection as a context.
In this example we’ll go over exporting an existing Oracle connection as a
context.
Double-click an existing database connection to edit it and click Next.
Click Export as context.

NOTE There are some connections that don’t allow you to export them as a
context. In that case you’ll have to create the context group and its variables
manually, add the group/variables to your job, and use the variables in the
properties of the components of your job.

After you’ve clicked the Export as context button you’ll see the Create/Edit
context group screen. Enter a name, purpose and description and click Next.

Now you’ll see all the context variables that belong to this context group.
Notice that Talend has already created all the context variables that are
needed for the HR connection. If you want to change their names you can
simply click them and they become editable.
Click the Values as table tab.
