Processing hundreds of millions of records

Hi there

I have been trying to process about 168 million records with the workflow below:

It has been running for more than 4 days and is still not finished. How can I improve the workflow so that it runs faster?

Any ideas would be greatly appreciated.

Many thanks for your help.

Jennifer

Hi Jennifer

If I understand your workflow correctly, the loop can be replaced by a Joiner node. You need to join on the iteration.

Best, Iris


Hi Iris

Thank you for your advice.

The workflow actually needs to process over 900 text files and loop through more than 167 million records combined from those text files. Here is the workflow before streamed execution:

Should I still replace the loop in the streamed workflow with a Joiner node?

Many thanks
Jennifer

You could try a Chunk Loop with a Column Expressions node inside the loop instead of the 2 conversion nodes and the constant-value processing.

A few remarks since it is difficult to judge from the screenshots:

  • if you have to process that many files and lines, it could make sense to do it in chunks and have some method in place to log the current status so that you can restart at a certain point, e.g. write results to disk after every 50 of the 900 processed files. If you do not have very powerful infrastructure, compartmentalization might be the way to go (see the sketch after this list)
  • saving these intermediate steps can also function as a kind of progress bar: write the current status, and maybe some timing statistics, into a separate CSV file. Such tasks are often easier if you have an idea how long they might take (or can give someone who is waiting for the results an estimate)
  • see that the power you have is used in an optimal way: as Iris said, avoid unnecessary steps and keep each step manageable for the machine you have. Streaming might help; you will have to experiment with the size of the chunks
  • see if all the nodes can run in memory; that might speed things up
  • see if you can assign more RAM and maybe a higher number of threads working in parallel
  • if you might encounter changing formats, it could be good to import everything as strings first (and convert the numbers later)
  • accessing a large collection of files (of the same structure) simultaneously also sounds like a job for a Big Data environment (a Hive or Impala external table). If you really must, you could think about setting up such an environment on an AWS cluster or similar
  • also see that you follow all the advice on optimising your KNIME environment (https://www.knime.com/blog/optimizing-knime-workflows-for-performance, “Large data tables missing?”)
  • if your graphical KNIME interface gets stuck, you might think about running KNIME in batch mode (and combine it with the logging and restart ability mentioned above)
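
To make the chunking and restart idea a little more concrete, here is a rough sketch in plain Java, outside of KNIME. The folder names, file pattern and chunk size are all made up; inside KNIME you would build the equivalent with a loop over groups of files plus a CSV Writer for the status:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ChunkedProcessing {

    public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get("data/input");          // folder with the ~900 text files (assumption)
        Path doneLog  = Paths.get("data/processed.log");  // checkpoint: one processed file name per line
        int chunkSize = 50;                               // persist results and status every 50 files

        // Load the checkpoint so a restart skips files that are already done.
        Set<String> done = Files.exists(doneLog)
                ? new HashSet<>(Files.readAllLines(doneLog, StandardCharsets.UTF_8))
                : new HashSet<>();

        List<Path> pending;
        try (Stream<Path> files = Files.list(inputDir)) {
            pending = files.filter(p -> p.toString().endsWith(".txt"))
                           .filter(p -> !done.contains(p.getFileName().toString()))
                           .sorted()
                           .collect(Collectors.toList());
        }

        for (int i = 0; i < pending.size(); i += chunkSize) {
            List<Path> chunk = pending.subList(i, Math.min(i + chunkSize, pending.size()));
            long start = System.currentTimeMillis();

            for (Path file : chunk) {
                processFile(file);   // placeholder for the actual per-file work
            }

            // Persist the checkpoint after every chunk, so a crash costs at most one chunk of work.
            List<String> names = chunk.stream()
                                      .map(p -> p.getFileName().toString())
                                      .collect(Collectors.toList());
            Files.write(doneLog, names, StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);

            // Simple progress / timing output, the "progress bar" mentioned above.
            System.out.printf("chunk done: %d/%d files, %.1f s%n",
                    Math.min(i + chunkSize, pending.size()), pending.size(),
                    (System.currentTimeMillis() - start) / 1000.0);
        }
    }

    private static void processFile(Path file) {
        // The real transformation of one text file would go here.
    }
}
```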

Hi there

Thanks for the tip.

I could not find a node called Column Expressions. Could you please specify exactly which node I should use to replace the 2 Constant Value Column nodes used in the loop?

Many thanks
Jennifer


You may need to install it.

Hi there,

Thanks for the reply.

I installed the node and got it working to replace the Constant Value Column nodes.

Another question about your suggestion to use a Chunk Loop to process the files: did you mean using a Chunk Loop Start node before the loop to break the files up into chunks?

Many thanks
Jennifer

I assumed it would be used instead of the variable loop. But I could be wrong. Also, there is


That may help you increase the speed of processing.

Hi Iris

Could you please explain a bit more about using the joiner node instead of the loop in my workflow?

Many thanks
Jennifer

Hi Iris

That’s ok. I got the Joiner node working. I thought the loop might help speed up the workflow, but I had actually complicated the process unnecessarily.

Many thanks for pointing that out.

Jennifer

Hi there

A quick question: will it work to have a Row to Variable Loop node inside the Chunk Loop to speed up the workflow?

Many thanks
Jennifer

It would be simpler if you provided your workflow and a couple of records to process.

Hi there

I have attached the workflow and the associated data files below:

Sample Workflow.knwf (56.3 KB)

sample data.xlsx (7.9 KB)

sample Master tbls.xlsx (148.3 KB)

School & Public Holiday Dates Oct2013-2018.xlsx (56.1 KB)

Many thanks for your help

Jennifer

In my experience the recommended Column Expressions node is very slow and not useful for large numbers of rows.

In your sample workflow, the “Round Dates Match” metanode is the slow part. I could easily increase its speed a great deal:

  • Move Java snippets outside of the loop

  • Rule-based Row Filters

“MATCHES” does a regular expression match, but you are only looking for equality. Use = instead (see the small Java illustration below).

  • Replace the Column Expressions nodes with Constant Value Column nodes again

With these changes this metanode with loop already runs a lot faster.
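
In plain Java terms (just an analogy for what the two rule operators cost, not what the node does internally), the difference looks like this:

```java
public class MatchVsEquals {
    public static void main(String[] args) {
        String value = "2018-10-01";

        // MATCHES behaves like this: the argument is treated as a regular expression,
        // which has to be compiled and evaluated for every row.
        boolean viaRegex = value.matches("2018-10-01");

        // = behaves like this: a plain equality check, which is all that is needed here
        // and is much cheaper over 167 million rows.
        boolean viaEquals = value.equals("2018-10-01");

        System.out.println(viaRegex + " " + viaEquals); // both true for this value
    }
}
```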

As a general comment you also don’t need to attach the flow variables to each node separately. Flow variables are passed from node to node automatically.

BUT: Loops are very slow in KNIME and are best avoided

Hence I added a version of the metanode without a loop. Without the loop, execution is more or less immediate with the sample data.

And the workflow with improved “Round Dates Match” metanode:

Sample Workflow Improved.knwf (302.1 KB)


There are some very good tips & tricks here already, and it of course makes sense to identify the slow part, as beginner did, but a few general remarks:

  • Java Snippet and also Java Edit Variable are completely broken in loops. Do not ever use them there - just use the (simple) version of both; it can make the difference between days and minutes in execution time (see the sketch after this list).

  • Joining big tables is a pain in KNIME. If you can reduce the size of the tables beforehand without losing information, do it. Don't just hide columns (Column Filter); cache the table afterwards so the filtered-out columns are actually gone.

  • Don't keep redundant information. The way KNIME text processing is often implemented, you end up keeping two versions of the document, which is another sink hole for performance. Check that you only keep the text either within the document or as a string column, not both, unless it is absolutely necessary. For some strange reason the number of columns, especially columns that contain a lot of text, has a huge influence on joining speed even if those columns are not touched during the join.
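
As a rough illustration, the (simple) Java Snippet only takes a short method body. Something like the following, where the column names are made up and the exact placeholder syntax may differ between KNIME versions:

```java
// Hypothetical Java Snippet (simple) method body: build a join key from two
// String columns. $site_id$ and $date_str$ stand for column references and are
// assumptions - replace them with your own column names.
return $site_id$ + "_" + $date_str$;
```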

But honestly, getting rid of all java snippets should already make a world of difference here.

Alex


Hi Alex

Thanks very much for your advice.

If I get rid of all the Java Snippet nodes, how else can I use the date variables in the equation when they are stored as strings?

Thanks
Jennifer

Hi there,

Thanks so much for your suggestions. They have worked really well.

I am also a beginner using KNIME. I have misused loops and made my workflow slower without realising it.

Having modified parts of my workflow, I examined the beginning of it and could not see any other way of doing this:

Would you be able to suggest a better way of processing over 900 text files, extracting the required information from the text file names and joining them all together?

Thanks again
Jennifer

Well, in fact there is a way …
You can use Hive's external table feature from the local Big Data environment and point it at a folder that contains all the CSV files you want to aggregate. You can then refer to this external table and convert it into a regular table. All the CSV files will have to have the same structure, and you will have to specify that structure.

Please note I have used “|” as the separator in the 3 sample files.

OK and here come the quirks:

  • the Hive version KNIME is using respects neither the setting to skip the initial row (TBLPROPERTIES ('skip.header.line.count'='1', 'skip.footer.line.count'='4')) nor the setting for the footers
  • if you use a folder that is within the KNIME workspace, it will contain a workflowset.meta file, which will also be read and imported


So if you can live with having all the data as strings initially (you can convert them later), and you remove the lines that contain the headers and the XML info from the workflowset.meta file, you can import all the CSV files from one folder in one step by (ab)using the local Big Data environment. I am not sure whether there is a maximum (it might depend on the power of your machine).
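
To make the idea concrete, the statement behind such an external table looks roughly like the sketch below. I have wrapped it in plain Java/JDBC only to keep the examples in one language; inside KNIME the Big Data nodes would issue the equivalent for you, and the table name, columns, connection URL and folder path are all made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateExternalTable {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 connection - host, port and database are assumptions,
        // and the Hive JDBC driver would need to be on the classpath.
        String url = "jdbc:hive2://localhost:10000/default";

        // DDL sketch: an external table over a folder of "|"-separated text files.
        // Every file in the folder must have the same structure, declared here as strings.
        String ddl =
            "CREATE EXTERNAL TABLE IF NOT EXISTS raw_records ("
          + "  col1 STRING, col2 STRING, col3 STRING"
          + ") ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'"
          + " STORED AS TEXTFILE"
          + " LOCATION '/path/to/csv_folder'";

        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement()) {
            stmt.execute(ddl);
            // Afterwards, SELECT * FROM raw_records reads every file in the folder -
            // including, unfortunately, header lines and a workflowset.meta file if present.
        }
    }
}
```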

The upside is that the external table will update each time you add new files - but please be aware that you would often have Hive work together with Impala’s COMPUTE INCREMENTAL STATS, which is not available in the local Big Data environment of KNIME.

kn_example_bigdata_hive_csv_loader.knar (688.8 KB)


Hi there

Thanks a lot for your advice.

I am not sure I can use this because:

  • All my tab delimited text files are stored in a local drive

  • I am not using the Hive version of KNIME

  • I am not working in the local Big Data environment of KNIME

Thanks
Jennifer