I have 600 stores and 150 million rows. I am using the “Table Row to Variable Loop Start” node to run the flow, manipulate variables, and calculate lots of variables…
My tables are in a database, and I want parallel execution to save time, because this job is a daily operation and must run every day. I defined two database users and tried running them in parallel (like 300 + 300 stores). When I tried parallel execution for 2 stores, there was no performance increase: the calculation time was the same with one user or with two.
Please help, is there a solution to my problem?
You can use the Parallel Chunk Start node and the corresponding Parallel Chunk End node to parallelize a part of your workflow.
Regarding the loop, have you tried “Parallel Chunk Start”? You can use Table Row to Variable after the loop start to convert table attributes to flow variables. Just remember to set the chunk count equal to the number of rows in your data set.
Here is an example:
parallel_chunk_loop.knwf (30.9 KB)
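Conceptually, what Parallel Chunk Start does is split the incoming rows into chunks and run the loop body on each chunk concurrently. A minimal Python sketch of that idea (the row values and the per-row calculation are made-up placeholders, not anything from the workflow):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-row work: doubling a value stands in for the
# variable calculations done inside the KNIME loop body.
def process_row(value):
    return value * 2

rows = list(range(10))   # stand-in for the table rows
chunks = 5               # number of parallel chunks

with ThreadPoolExecutor(max_workers=chunks) as pool:
    results = list(pool.map(process_row, rows))

print(results)           # map() preserves the original row order
```

The key point mirrored here is that the results come back in the same order as the input rows, just as Parallel Chunk End reassembles the chunks.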
As you said, that works with same-size data, but my flow works based on partitions. For example:
I have 600 stores and I want 5 parallel flows with 120 stores in each. But every store has a different number of products, with different dates.
I want 5 stores to start in parallel, and their rows shouldn’t mix with each other.
How can I handle it?
If the operation for each group of stores is different, then split the data set into 5 parts and use the parallel chunk separately on each of them. But if the operation is the same and you want to divide by stores at the end, you can use the parallel chunk on the initial data set and then split the data set based on the stores.
If I’m missing something, you can provide a sample data set and we can investigate the issue further.
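Splitting the 600 stores into 5 balanced groups of 120 can be sketched like this (the store IDs are hypothetical placeholders):

```python
stores = [f"store_{i:03d}" for i in range(600)]   # 600 store IDs
n_branches = 5

# Round-robin split: every branch gets exactly 120 stores,
# which helps balance the load even when store sizes vary.
groups = [stores[i::n_branches] for i in range(n_branches)]

print([len(g) for g in groups])   # [120, 120, 120, 120, 120]
```

A round-robin split is only a rough balancer; if a few stores are much larger than the rest, grouping by estimated row count per store would distribute the work more evenly.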
I think the most important point is that the flow must be executed separately for each store.
Stores can include different numbers of rows because of products and dates, as I shared at first.
Suppose there are 600 stores and the flow executes for every store in the same way. There is a lot of data manipulation, and one store shouldn’t be mixed with another, and one store’s product shouldn’t be mixed with its other products.
Suppose I have a database script like the one below, and I am currently executing it like this:
ORDER BY product, Date
Note: there are 600 stores, and they loop with Table Row to Variable Loop Start, 600 times.
After I pull the data into KNIME there is a lot of data manipulation. I need to do it as 5×120, not 1×600.
If you only need 5 parallel branches, why don’t you just create 5 branches in your workflow and let KNIME do the parallelization? Workflow parts that are the same can be wrapped in a metanode or component and copied across the branches.
In this situation I would do the parallel chunk on just the list of stores. Then use Table Row to Variable Loop Start, and filter the data for each store inside the loop. There is some extra data copying happening, but if that isn’t a problem, it is the most maintainable solution.
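The pattern described above — parallelize over the store list and let each worker see only its own store’s rows — can be sketched roughly like this (the row layout and the per-store aggregation are made-up placeholders for whatever the real loop body does):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows: (store, product, qty)
rows = [("A", "p1", 1), ("A", "p2", 2),
        ("B", "p1", 3),
        ("C", "p1", 4), ("C", "p2", 5), ("C", "p3", 6)]

# Filter step: group rows by store so stores can never mix.
by_store = defaultdict(list)
for store, product, qty in rows:
    by_store[store].append((product, qty))

def process_store(item):
    store, store_rows = item
    # Placeholder calculation; the real per-store logic goes here.
    return store, sum(q for _, q in store_rows)

with ThreadPoolExecutor(max_workers=5) as pool:
    totals = dict(pool.map(process_store, by_store.items()))

print(totals)   # {'A': 3, 'B': 3, 'C': 15}
```

Because each worker receives only the rows of its own store, differing row counts per store are harmless; the workers just finish at different times.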
A promising alternative could be to use the Create Local Spark Context node and then use the Spark nodes to process your data. This takes a bit of coding but can really speed up a workflow, at least in my experience.
If you branch the workflow manually, be careful to keep the node settings synchronized across the branches. It’s easy to make a mistake that way.