I need help grouping a table so that it returns only the latest “batch” or “set” of data.
The table below has duplicate entries for input_name and product_id (pink = undesirable, yellow = expected data) for the same activity_id. A higher id indicates a more recent data entry.
My best guess is to use the GroupBy node but I keep getting earlier entries that are not needed.
Hi @tb_g_23 , from your description I would also use the GroupBy node: have it group by activity_id and product_id, returning the maximum of the id column as the aggregation.
Join the output of the GroupBy back to the original table, joining on id. You could include activity_id and product_id in the join too, but they are superfluous if id is unique. This should then give you the latest data for each product and activity.
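In case it helps to see the logic outside KNIME, here is a minimal pandas sketch of the same idea (the column names and sample values are just assumptions based on your description, not your actual data):

```python
import pandas as pd

# Hypothetical sample data mirroring the described columns
df = pd.DataFrame({
    "id":          [1, 2, 3, 4, 5, 6],
    "activity_id": [10, 10, 10, 20, 20, 20],
    "product_id":  [100, 100, 200, 100, 100, 200],
    "input_name":  ["a", "a", "b", "a", "a", "b"],
})

# Equivalent of the GroupBy node: max id per (activity_id, product_id)
latest_ids = (df.groupby(["activity_id", "product_id"], as_index=False)["id"]
                .max())

# Equivalent of the Joiner node: inner join back on id keeps only the
# most recent row for each product within each activity
latest_rows = df.merge(latest_ids[["id"]], on="id", how="inner")
print(latest_rows)
```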
I’ve had a re-think and a re-read. I hadn’t spotted the additional features of your data, and I generally find it easier to analyse actual sample data rather than screenshots.
OK, so if we are to assume that every batch is marked by having an initial activity_input_order of 0, and that rows are listed in ascending order, so that the batches are always grouped together when your data is in id order, then this becomes a case of marking the batches. A fairly common way of doing this in KNIME is to mark each batch-start row (activity_input_order = 0) with a 1, and all other rows with a 0.
After this, a moving aggregation can be used to “cumulatively sum the ones”.
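Again, a rough pandas sketch of that flag-and-cumulative-sum technique, with assumed column names and made-up values, just to show the mechanics:

```python
import pandas as pd

# Hypothetical data: two batches for one activity, each starting at
# activity_input_order = 0, rows already in id order
df = pd.DataFrame({
    "id":                   [1, 2, 3, 4, 5, 6],
    "activity_id":          [10, 10, 10, 10, 10, 10],
    "activity_input_order": [0, 1, 2, 0, 1, 2],
})

# Mark batch starts with 1, all other rows with 0
# (in KNIME, e.g. a Rule Engine or Math Formula node)
df["batch_start"] = (df["activity_input_order"] == 0).astype(int)

# Cumulatively sum the ones to number the batches
# (the Moving Aggregation node's cumulative sum)
df["batch_number"] = df["batch_start"].cumsum()

# Keeping only the latest batch per activity then becomes
# the same max-per-group problem as before
latest = df[df["batch_number"] ==
            df.groupby("activity_id")["batch_number"].transform("max")]
print(latest)
```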
@takbb This really helped. I realise I could have elaborated more on the problem I’m trying to solve, but I think your suggestion solved it!
For some context, I’m trying to recreate an Excel export available via web from a Farm Ops Management program by pulling data from our database linked via API.
With your help and a few updates, this is by far the closest I’ve come to recreating it.
I updated the join to include the activity_id, and it still returned a higher input count than the Excel extract…
Hi @tb_g_23 , I’m glad to hear it’s moved you forward. Thank you for uploading your revised workflow. Since you put it there “for review”, I took a quick look, and here are my observations:
For a given “batch number”, there can only be a single activity_id (unless the activity_id can vary within batches, but if that were to occur, it would break the other logic that finds the “max” batch number for each activity), so I believe that including the activity_id in the join conditions on the Joiner should make no difference, other than perhaps for documentary purposes (i.e. the relationship may not be apparent when looking at the workflow in the future, so including it makes the intent clearer).
It isn’t clear to me what problem the Sorter, Rank, and Row Filter nodes are solving, because with the sample data they don’t change anything, but I am assuming that with your wider data set they have an effect.