TF Performance improvement

Mink · May 11, 2018, 1:14pm

Hi, how is it possible to speed up the TF process?

Situation: I need to calculate TF (term frequency) for a single 150 MB text file.

Workflow:
Flat File Document Parser → Number Filter → Punctuation Erasure → Stop Word Filter → Bag Of Words Creator. This part is very fast and takes 2-3 seconds, but the TF node then takes over 10 hours.

The machine I am using is a gaming computer.
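For readers who do not know these nodes, the computation this chain performs can be sketched in plain Python. This is only an illustration of the idea, not how KNIME implements it: the function name `term_frequencies`, the tiny `STOP_WORDS` set, and the regex-based number removal are all assumptions made for the example.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}


def term_frequencies(text, relative=True):
    """Return term -> frequency for one document (a rough stand-in for the node chain)."""
    # Rough equivalent of a number filter: drop runs of digits.
    text = re.sub(r"\d+", " ", text)
    # Rough equivalent of punctuation erasure: replace punctuation with spaces.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Lowercase, tokenize on whitespace, and drop stop words.
    terms = [t for t in text.lower().split() if t not in STOP_WORDS]
    counts = Counter(terms)
    if not relative:
        return dict(counts)  # absolute TF: raw counts
    total = len(terms)
    # Relative TF: count divided by the number of terms in the document.
    return {term: n / total for term, n in counts.items()} if total else {}
```

For example, `term_frequencies("The cat and the dog. The cat, 42 times!")` gives `cat` a relative frequency of 0.5 and `dog` and `times` 0.25 each.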

What is a good way to speed up this execution?

Thank you!

Patrick1974 · May 13, 2018, 5:11am

Hey @Mink,

Assuming you run Windows, what does the Task Manager show under “Performance”?
Is the CPU maxed out, or the disk I/O, or the network…?
With that information, the community could narrow down some optimization options.

Kind regards,

Patrick.

Mink · May 14, 2018, 7:48am

Thanks for your answer. I use Ubuntu 16.04. The CPU cores are at 100% usage. In general, what is a fast way to convert a big text file, say several GB, to the KNIME Document type? That takes very long too. The nodes between Flat File Document Parser and TF are fast. Are there any settings that would help with this?

RolandBurger · May 14, 2018, 8:14am

Hi Mink,

Please have a look at this blog post to get some pointers for performance tweaks: https://www.knime.com/blog/optimizing-knime-workflows-for-performance

If you haven’t increased the default memory allocated to KNIME, that should be your first step. In addition, you could try to narrow down your feature space even further. For example, you could use a POS Tagger node first, and then a Tag Filter to select only nouns.

Cheers,
Roland

Mink · May 15, 2018, 7:22am

Hi Roland, thanks for your answer. I increased the allocated memory, but it is still slow. I found a way to get faster execution: I split the 150 MB file into 3200 files, and the TF node then takes 32 minutes to execute. This works for my task. Thanks.
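The post does not say how the file was split. One possible way to do it outside KNIME is sketched below in Python; the function name `split_text_file`, the `part_00000.txt` naming scheme, and the choice to cut only at line ends are assumptions for illustration, not the method actually used here.

```python
import math
import os


def split_text_file(src_path, out_dir, n_parts):
    """Split src_path into at most n_parts files, cutting only at line ends."""
    os.makedirs(out_dir, exist_ok=True)
    total_bytes = os.path.getsize(src_path)
    # Target size per part; a part is closed once it reaches this size.
    target = max(1, math.ceil(total_bytes / n_parts))

    paths = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:  # streams line by line, so the file is never fully in memory
            if out is None or written >= target:
                if out is not None:
                    out.close()
                # Hypothetical naming scheme: part_00000.txt, part_00001.txt, ...
                path = os.path.join(out_dir, f"part_{len(paths):05d}.txt")
                out = open(path, "wb")
                paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return paths
```

Calling something like `split_text_file("big.txt", "parts", 3200)` would produce up to 3200 smaller files that a document parser node could then read from the output folder. If the source has very long lines, fewer parts than requested may result, since lines are never cut in the middle.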

Patrick1974 · May 16, 2018, 5:30am

Hey @Mink,

Some quick questions, so that others might be able to reuse your approach:
Did you split the documents using one or more nodes, and if so, could you post the workflow?
Do you run the documents through the TF nodes sequentially or in parallel? If in parallel, how many TF nodes do you run at the same time?

Kind regards,

Patrick
