Delete rows and continue with new table

Hello everybody,

I use a Rule-based Row Filter to filter specific data in a large file. I compare two values from different tables. If they match, the filtered rows are aggregated and written to a CSV file.
I repeat this several times within a loop.

However, the Rule-based Row Filter is very slow. Is it possible to use a Rule-based Row Splitter and use its second output (the non-matching rows) as the new input for the Rule-based Row Filter? This would reduce the amount of data after each loop iteration.

Thank you in advance :slight_smile:

Hi,

Would you please provide an example to better explain your goal?

:blush:

Hey @armingrudd,

this is my workflow. It works and I receive the file I want to have.


Date Recording is the date preparation of the two files. File one (connected to the Sorter node) has 7.9 million rows, file two (connected to the Chunk Loop Start) has 1,000 rows.
I configured the Rule-based Row Filter as follows:
$${SStartTime}$$ <= $TimeStamp$ AND $TimeStamp$ <= $${SEndTime}$$ => TRUE
Afterwards I aggregate the columns with the GroupBy node.
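In plain terms, each loop iteration does something like the following. This is only a rough Python/pandas sketch, not a KNIME node; the file names, the TimeStamp/StartTime/EndTime columns taken from the rule above, and the mean aggregation are assumptions for illustration.

```python
import pandas as pd

# Assumed file and column names, for illustration only.
data = pd.read_csv("Testsample1.csv", parse_dates=["TimeStamp"])
windows = pd.read_csv("StartAndEndtime.csv", parse_dates=["StartTime", "EndTime"])

aggregated = []
for _, w in windows.iterrows():
    # Rule-based Row Filter equivalent:
    # $${SStartTime}$$ <= $TimeStamp$ AND $TimeStamp$ <= $${SEndTime}$$ => TRUE
    in_window = data[(data["TimeStamp"] >= w["StartTime"])
                     & (data["TimeStamp"] <= w["EndTime"])]

    # GroupBy node equivalent: aggregate the matching rows (mean is just a placeholder).
    aggregated.append(in_window.mean(numeric_only=True))

pd.DataFrame(aggregated).to_csv("aggregated.csv", index=False)
```

Note that in this form every iteration scans the full table again, which is presumably why the filtering is slow.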

I want to speed up the filtering. Maybe I can delete all the rows I have already aggregated and use the reduced file as the new input for the Rule-based Row Filter? Or do you have a better idea?

Best regards!

Which output port of the Metanode has the “TimeStamp” column?

The first file (connected to the Sorter) with 7.9 million rows.

Ok,
The main approach you are following is the same as what I would do, except for these points:

:blush:

2 Likes

@armingrudd

Thanks for your help. I will adjust the workflow next week and post the result.

It is a great community and a nice tool :).

2 Likes

Hey @armingrudd,

one more question. I configured the Rule-based Row Filter as follows:


So I have a different StartTime and EndTime in every iteration of the loop.

How can I configure the Date&Time-based Row Filter the same way?


Best regards!

Go to the Flow Variables tab:

:blush:

2 Likes

Hey,
it works and I also improved my workflow with your ideas.

Thank you!

2 Likes

Hello,

is there a function to delete rows in a data table? I don't mean filtering; I really mean deleting rows from or updating the table in my workflow.

Best regards.

Hi,

What’s the problem with filtering?

In some cases you may need to use Domain Calculator. Check this topic as an example:

If you explain your case further, it would be possible to help you better.

:blush:

1 Like

Hello,

I still use a Date&Time-based Row Filter to filter my data. I use this filter in a loop with a Table Row To Variable Loop Start, as you previously recommended.

However, the Date&Time-based Row Filter takes a long time to filter a large amount of data. Once I have filtered my data table, I write the rows into a CSV file. Afterwards, I don't need the already filtered data anymore. To improve the running time, I would like to reduce/update the data table by removing the already filtered rows.

I don't have to update the data table in every iteration, but at least every 50 iterations. Is there a way to do this, or do you have another idea?


Best regards!

To have only the remaining rows in the next iterations, you can use a Recursive Loop.

Use the Recursive Loop Start node after the Sorter node. Use the current iteration number to filter the output of the Row Filter node and convert the output to a flow variable. Split the main table based on the dates using a Rule-based Row Splitter node. Use a Variable to Table Row node after the CSV Writer node and close the loop with the Recursive Loop End node. Pass the second port of the splitter to the second port of the Recursive Loop End node.
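To make this concrete, here is a rough sketch of the same idea outside KNIME (Python/pandas; the file names, column names and the per-window output are assumptions). The key point is that the second splitter output, i.e. the rows that have not matched yet, becomes the input of the next iteration, so the table shrinks as the loop runs:

```python
import pandas as pd

# Assumed file and column names, for illustration only.
remaining = pd.read_csv("Testsample1.csv", parse_dates=["TimeStamp"])
windows = pd.read_csv("StartAndEndtime.csv", parse_dates=["StartTime", "EndTime"])

for i, w in windows.iterrows():
    # Rule-based Row Splitter equivalent: first port = matching rows,
    # second port = everything else, which feeds the next iteration.
    mask = ((remaining["TimeStamp"] >= w["StartTime"])
            & (remaining["TimeStamp"] <= w["EndTime"]))
    matched = remaining[mask]
    remaining = remaining[~mask]

    # CSV Writer equivalent: write (or aggregate and then write) the matched rows.
    matched.to_csv(f"window_{i}.csv", index=False)
```

Here `remaining` gets smaller with every iteration, which is essentially what passing the splitter's second port into the Recursive Loop End achieves in KNIME.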

If you provide a sample dataset, I would build an example workflow for you.

:blush:

2 Likes

Hey @armingrudd

that would be awesome! Here are the two (smaller) sample datasets.
I used the data from "Start and Endtime" to filter the rows in "Testsample 1" (starttime < timestamp < endtime).
Instead of joining the filtered rows with another document (as in the screenshot in the last post), you can also aggregate the filtered values and write them into a CSV.

Start and Endtime.txt (59.3 KB)
Testsampel 1.txt (3.8 MB)

Note that “Testsample1” contains only 90,000 rows instead of millions; I had to reduce the size significantly to upload an example.

Best regards!

1 Like

Hi there @Phibu,

apart from optimizing the workflow with different nodes, here are some things you can also try.

For faster execution you can try the Streaming Extension in KNIME:
https://www.knime.com/blog/streaming-data-in-knime

For general tips & tricks on optimizing KNIME workflows, check this blog post:
https://www.knime.com/blog/optimizing-knime-workflows-for-performance

Additionally, if you are not using the latest KNIME version, 4.0.0, I highly recommend upgrading, as performance has been a major focus of this release:
https://www.knime.com/whats-new-in-knime-40#performance

Br,
Ivan

4 Likes

Sorry for the delay @Phibu,

Here is the example workflow:

recursive_filter.knwf (1.3 MB)

I hope this helps.

:blush:

2 Likes

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.