I need help grouping a table so that it returns only the latest “batch” or “set” of data.
The table below has duplicate entries for input_name and product_id (pink = undesirable, yellow = expected data) for the same activity_id. A higher id indicates a more recent data entry.
My best guess is to use the GroupBy node but I keep getting earlier entries that are not needed.
Hi @tb_g_23 , from your description I would also use the GroupBy node: have it group by activity_id and product_id, returning the maximum of the id column as the aggregation.
Join the output of the GroupBy back to the original table, joining on id. You could include activity_id and product_id in the join too, but they are superfluous if id is unique. This should then give you the latest data for each product and activity.
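In case it helps to see the logic outside KNIME, here is a minimal pandas sketch of the same idea (the column names and sample values are just assumptions based on your description, not your actual data):

```python
import pandas as pd

# Hypothetical sample data mirroring the described columns
df = pd.DataFrame({
    "id":          [1, 2, 3, 4, 5, 6],
    "activity_id": [10, 10, 10, 20, 20, 20],
    "product_id":  [100, 100, 200, 100, 100, 200],
    "input_name":  ["a", "a", "b", "a", "a", "b"],
})

# Equivalent of the GroupBy node: max id per (activity_id, product_id)
latest_ids = (df.groupby(["activity_id", "product_id"], as_index=False)["id"]
                .max())

# Equivalent of the Joiner node: inner join back on id keeps only the
# most recent row for each product within each activity
latest_rows = df.merge(latest_ids[["id"]], on="id", how="inner")
print(latest_rows)
```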
I’ve had a re-think and a re-read. I hadn’t spotted the additional features of your data, and I generally find it easier to analyse actual sample data rather than screenshots.
OK, so if we are to assume that every batch is marked by having an initial activity_input_order of 0, and that rows are listed in ascending order, so that the batches are always grouped together when your data is in id order, then this becomes a case of marking the batches. A fairly common way of doing this in KNIME is to mark each batch-start row (activity_input_order = 0) with a 1, and all other rows with a 0.
After this, a moving aggregation can be used to “cumulatively sum the ones”.
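Again, a rough pandas sketch of that flag-and-cumulative-sum technique, with assumed column names and made-up values, just to show the mechanics:

```python
import pandas as pd

# Hypothetical data: two batches for one activity, each starting at
# activity_input_order = 0, rows already in id order
df = pd.DataFrame({
    "id":                   [1, 2, 3, 4, 5, 6],
    "activity_id":          [10, 10, 10, 10, 10, 10],
    "activity_input_order": [0, 1, 2, 0, 1, 2],
})

# Mark batch starts with 1, all other rows with 0
# (in KNIME, e.g. a Rule Engine or Math Formula node)
df["batch_start"] = (df["activity_input_order"] == 0).astype(int)

# Cumulatively sum the ones to number the batches
# (the Moving Aggregation node's cumulative sum)
df["batch_number"] = df["batch_start"].cumsum()

# Keeping only the latest batch per activity then becomes
# the same max-per-group problem as before
latest = df[df["batch_number"] ==
            df.groupby("activity_id")["batch_number"].transform("max")]
print(latest)
```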
@takbb This really helped. I realise I could have elaborated more on the problem I’m trying to solve, but I think your suggestion solved it!
For some context, I’m trying to recreate an Excel export available via web from a Farm Ops Management program by pulling data from our database linked via API.
With your help and a few updates, this is by far the closest I’ve come to recreating it.
I updated the join to include the activity_id, and it still returned a higher input count than the Excel extract…
Hi @tb_g_23 , I’m glad to hear it’s moved you forward. Thank you for uploading your revised workflow. Since you put it there “for review”, I took a quick look, and here are my observations:
For a given “batch number”, there can only be a single activity_id (unless the activity_id can vary within batches, but if that were to occur, it would break the other logic that finds the “max” batch number for each activity), so I believe that including the activity_id in the join conditions on the Joiner should make no difference, other than perhaps for documentary purposes (i.e. the relationship may not be apparent when looking at the workflow in the future, so including it makes the intent clearer).
It isn’t clear to me what problem the Sorter, Rank, and Row Filter nodes are solving, because with the sample data they don’t change anything, but I am assuming that with your wider data set they have an effect.