Running KNIME workflows concurrently like ETL batch jobs

Hi all,
Need some help. Can anyone point me to links on how to create batch jobs (.exe or .bat) that pick up files from an SFTP server, transform them, and then load them into an MS SQL database?

I need 4 batch jobs to execute at the same time, each picking up files from a different SFTP server, applying a different transformation, and loading into a different MS SQL database.

Which part(s) of this are you looking to execute in KNIME? That will determine the structure of your batch file.

I think you could do ALL of it in KNIME - download data, transform, and load into DB - for each of the four servers, within a single workflow. Then you could execute that workflow from the command line using a batch file, which would just be a few lines of text.

Here’s the link to our FAQ that talks about headless execution of KNIME via a batch file: https://www.knime.com/faq#q12
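
In case it helps, here is a minimal sketch of such a batch file. The install path and workflow directory are placeholders you would adjust for your own machine; the KNIME_BATCH_APPLICATION runs the workflow headlessly and exits when it finishes:

```
@echo off
REM Run a KNIME workflow headlessly. Paths are placeholders; adjust them.
REM -reset resets all nodes so the workflow re-executes from scratch each run.
"C:\Program Files\KNIME\knime.exe" -consoleLog -nosplash -reset ^
  -application org.knime.product.KNIME_BATCH_APPLICATION ^
  -workflowDir="C:\knime-workspace\SFTP_to_MSSQL_ETL"
```

If you did eventually want four separate workflows running as four concurrent processes, the Windows `start` command can launch each such line in parallel; just point each KNIME instance at its own workspace via `-data`, since two instances cannot share one workspace at the same time.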


Thank you Scott, 4 jobs is my current use case and I understand your suggestion.

How about a scenario of 150 ETL workflows, each with a different transformation and source-to-target mapping?

Is KNIME a suitable replacement for ETL tools like Pentaho, Talend, etc., which have batch job scheduling and management features to neatly handle ETL workflows?

In short: can I throw away my enterprise ETL tool and use KNIME plus some sort of task scheduler? What’s your take?

Hi @garyhow01 -

I think you could still do it in KNIME AP, but you would definitely have a large workflow (or set of workflows) to manage that many transformations and data transfers.

I can’t really speak much to the capabilities of other software packages, but our commercial product, KNIME Server, has scheduling features to deal specifically with workflow automation. This is done without batch files, as it’s all built into the product. KNIME Server also has some great features centered around collaboration, versioning, and deployment. If you’re interested, we’ll have a webinar in September that showcases KNIME Server features. You can register here:

https://www.knime.com/about/events/webinar-sharing-deploying-data-science-with-knime-server-september-2018

We also have a webinar from last week, just posted to YouTube, that talks about ETL and KNIME Server (as well as a bunch of other topics). You can watch that here:

Hope that helps!


Yes, it helps a lot! Good to know there is a scheduling feature in KNIME Server. If you can drop me further links or detailed documentation (e.g. installation, configuration, and operator guides for KNIME Server), I will be able to do the comparison with my data engineering team.

Hello @garyhow01,

I can send you the documentation you are requesting and talk to you about the options for evaluating KNIME Server. Please send your contact information to cynthia.padilla@knime.com.

Thanks,
Cynthia

Hi!

Currently I’m working with KNIME doing ETL, so I can share my experience.

So far I have managed to do everything I wanted with KNIME, and I didn’t need to use a lot of nodes. I even managed to put some of the functionality in a loop, so instead of creating 40 workflows I created only 2. I use a lot of flow variables and they are serving me fine (see the sketch below). The database integration nodes are missing some features, but a new database framework just came with version 3.6 of KNIME, so I expect some enhancements there. Updating databases is a bit slow, but optimization on my side can and should be done here; actually, optimization of the whole configuration is needed and has to be done properly.
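
On flow variables: they can even be overridden from the command line when KNIME runs in batch mode, which is part of what makes a single generic workflow reusable. A minimal sketch, assuming a workflow that reads its server from a String flow variable named sftp_host (the variable name, server, and paths here are made-up placeholders, not my actual setup):

```
REM Run the same generic workflow against a specific server by
REM overriding the sftp_host flow variable from the command line.
"C:\Program Files\KNIME\knime.exe" -nosplash -reset ^
  -application org.knime.product.KNIME_BATCH_APPLICATION ^
  -workflowDir="C:\knime-workspace\Generic_ETL" ^
  -workflow.variable=sftp_host,server1.example.com,String
```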

Generally, what I like most about KNIME is its simplicity and the possibility to do the same thing in a couple of ways. That gives you flexibility 🙂

The project I’m working on still needs some work before it goes into production, so we’ll see how it ends up.

Hope this helps 🙂

Br,
Ivan


Thank you Cynthia, I am already in email contact with a KNIME staff member about this. I am studying the guide with the team this week.

Hi Ivan, that’s good insight! Would you be able to share a little more:

  • You mentioned the database integration nodes are missing some features. What are some examples?
  • You said the update on databases is a bit slow, but optimization on your side can be done. What are some forms of optimisation, and what is the proper way to do them?
  • How do you handle scheduling, detection of a failed ETL run on a particular day, and restarting of the ETL program for that day?

Looking forward to hearing more from the community!

Hi @garyhow01,

Hope this answers some of your questions.

  • I said they are missing some features. For example, you cannot update a row based on its own value (you cannot use the same column in SET and WHERE); a merge node is missing (actually, it exists in the new framework mentioned above); and fetching metadata from a database within the Database Reader node doesn’t fetch the schema, so you need to write it yourself…
  • From outside of KNIME, there is a KNIME configuration file (knime.ini) where you set certain properties, such as how much memory KNIME uses, the number of cells KNIME keeps in memory, and database properties; you can find a bit more about this in KNIME under Help -> Help Contents -> KNIME -> KNIME runtime options (see the sketches after this list). Within KNIME, the batch size parameter in the Database Writer and Update nodes should be configured. The general logic is to do as much as possible on the database side, and only then pull your data into KNIME.
  • I actually won’t be doing scheduling myself, but I tried it out with both Windows Task Scheduler and Airflow, each calling KNIME in batch mode, and it worked fine. Regarding failure detection, I have a log table in my database which I will use. I will also use the KNIME log, together with nodes like Email for sending a notification when a run is done, and error handling nodes (on this I have to work a bit more). As there will be no scheduling, there will be no automatic restart.
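
To make the second bullet more concrete, here is a sketch of the kind of knime.ini entries I mean. The values are just examples; double-check the property names against the KNIME runtime options page in the help for your version:

```
-vmargs
-Xmx4096m
-Dorg.knime.container.cellsinmemory=1000000
-Dknime.database.fetchsize=10000
```

And the Windows Task Scheduler test was essentially a one-liner that registers the batch file calling KNIME (the task name, path, and time are placeholders):

```
REM Register a daily task at 02:00 that runs the KNIME batch file.
schtasks /Create /TN "KNIME_ETL_nightly" /TR "C:\jobs\run_knime_etl.bat" /SC DAILY /ST 02:00
```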

Br,
Ivan

