I need help grouping a table so that it returns only the latest “batch” or “set” of data in the table.
The table below has duplicate entries for product_id (pink = undesirable, yellow = expected data) for the same activity_id. A higher id indicates a more recent entry.
My best guess is to use the GroupBy node but I keep getting earlier entries that are not needed.
Hi @tb_g_23, from your description I would also use the GroupBy node: have it group by product_id and return the maximum of id as the aggregation.
Then join the output of the GroupBy back to the original table, joining on id. You could include product_id in the join too, but it is superfluous if id is unique. This should then give you the latest data for each product and activity.
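For anyone following along outside KNIME, here is a minimal pandas sketch of that GroupBy-plus-Joiner idea. The sample data and column names (id, product_id) are assumed from the thread, not taken from the actual dataset:

```python
import pandas as pd

# Hypothetical sample data: a higher id means a more recent entry
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "product_id": ["A", "A", "B", "B"],
})

# GroupBy node: maximum id per product_id
latest = df.groupby("product_id", as_index=False)["id"].max()

# Joiner node: join back to the original table on id
result = df.merge(latest[["id"]], on="id")
```

The inner join on id keeps only the most recent row for each product.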
Hey! Thanks for answering.
I’ve tried this already. Two problems I’ve encountered with this solution:
- Not every id has a product_id, so I have to group by
- This solution gives me a product from an earlier “batch” that I do not need.
See below (green = good, pink = undesirable)
Hi @tb_g_23 ,
I’ve had a re-think and a re-read. I hadn’t spotted the additional features of your data, and I generally find it easier to analyse actual sample data rather than screenshots.
OK, so if we assume that every batch is marked by an initial activity_input_order of 0, that batches are listed in ascending order so their rows are always grouped together, and that your data is in id order, then this becomes a case of marking the batches. A fairly common way of doing this in KNIME is to mark each batch start (activity_input_order = 0) with a 1, and all other rows with a 0.
After this, a moving aggregation can be used to “cumulatively sum the ones”.
Start with a Rule Engine containing the following rules:
$activity_input_order$=0 => 1
TRUE => 0
This creates a column named “batchnumber”, with the following effect:
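The Rule Engine step is just a flag column. As a sketch in pandas (column names assumed from the thread), it would look like this:

```python
import pandas as pd

# Hypothetical data: two batches, each starting at activity_input_order 0
df = pd.DataFrame({"activity_input_order": [0, 1, 2, 0, 1]})

# Rule Engine equivalent: 1 marks a batch start, 0 everything else
df["batchnumber"] = (df["activity_input_order"] == 0).astype(int)
```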
After that, a Moving Aggregation will cumulatively sum the batch-number flags, which generates the following:
As you can see this has now assigned a unique value to each “batch”
Following that, a GroupBy can be used to find the maximum batch number for each activity_id, and the result joined back to the table using the max batch numbers.
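Putting the whole chain together (flag the batch starts, cumulatively sum them, take the max batch per activity, join back), a pandas sketch under the same assumptions would be:

```python
import pandas as pd

# Hypothetical data: two batches for one activity, rows already in id order
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "activity_id": [10, 10, 10, 10],
    "activity_input_order": [0, 1, 0, 1],
})

# Rule Engine: flag each batch start
flag = (df["activity_input_order"] == 0).astype(int)

# Moving Aggregation: cumulative sum assigns a unique number to each batch
df["batch"] = flag.cumsum()

# GroupBy + Joiner: keep only the rows from the latest batch per activity
latest = df.groupby("activity_id", as_index=False)["batch"].max()
result = df.merge(latest, on=["activity_id", "batch"])
```

Only the rows belonging to each activity's highest batch number survive the join.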
Grouping Batches.knwf (19.3 KB)
Hopefully that gives some ideas. If it still doesn’t quite work, please upload some sample data that we can have a play with.
@takbb This really helped. I thought I should have elaborated more on the problem I’m trying to solve, but your suggestion solved it!
For some context, I’m trying to recreate an Excel export available via web from a Farm Ops Management program by pulling data from our database linked via API.
With your help and a few updates, this is by far the closest I’ve been able to get to recreating it.
I updated the join to include the activity_id, and it still returned a higher input count than the Excel export
… so I assigned a rank to further filter and get what I needed.
Not sure if this is the most elegant solution, but it meets my criteria for this iteration.
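The rank-then-filter step described above could be sketched in pandas like this (hypothetical column names, assuming the goal is the single highest-id row per activity_id):

```python
import pandas as pd

# Hypothetical data: several rows per activity_id
df = pd.DataFrame({
    "id": [5, 6, 7],
    "activity_id": [1, 1, 2],
})

# Rank node: rank rows within each activity_id, newest (highest id) first
df["rank"] = df.groupby("activity_id")["id"].rank(
    method="first", ascending=False
)

# Row Filter node: keep only rank 1, i.e. the latest row per activity
result = df[df["rank"] == 1]
```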
I’ve attached a workflow with sample dataset for your review.
activity_id and input batches.knwf (85.5 KB)
Hi @tb_g_23 , I’m glad to hear it’s moved you forward. Thank you for uploading your revised workflow. Since you put it there “for review”, I took a quick look, and here are my observations:
For a given “batch number” there can only be a single activity_id (unless activity_id can vary within a batch, but if that occurred it would break the other logic that finds the “max” batch number for each activity). So including activity_id in the join conditions on the Joiner should make no difference, other than perhaps for documentary purposes: the assumption may not be apparent when looking at the workflow in the future, and spelling it out in the join makes it clearer.
It isn’t clear to me what problem the Sorter, Rank and Row Filter nodes solve, because with the sample data they don’t change anything, but I assume they have an effect on your wider data set.
Anyway, good to know this has helped.
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.