MongoDB: Selecting first record of set of rows where recognised as a duplicate

mattlightbourn · August 8, 2016, 3:55am

I have a MongoDB with a a set of documents based upon when they were updated in the origin system. So, for example, a sales invoice with number 1044 may be received twice, the first time when the transaction_status is "UNPAID" and again 28 days later when the transaction_status is "PAID".

Using MongoChef, I use a SORT by transaction_reference (ascending) and object_creation_date (decending) followed by a $group and select the $first of each set to have the latest value.

$sort : 
{
"party_uuid":1, 
"connection_uuid":1,
"transaction_reference":1, 
"object_creation_date": -1
}

$group : 
{
_id: "$transaction_reference", 
"party_uuid" : {$first: "$party_uuid"},
"transaction_date" : {$first: "$transaction_date"},
"transaction_net_value" : {$first: "$transaction_net_value"},
"object_category" : {$first: "$object_category"},
"transaction_status" : {$first: "$transaction_status"},
"object_origin_category" : {$first: "$object_origin_category"},
"object_origin" : {$first: "$object_origin"},
"goods_and_services" : {$first: "$goods_and_services"}
}

I want to do a similar thing in a workflow in KNIME and I've done the following so far:

I've added a MongoDB Reader which I've configured and it is working. I have the following configuration in it:


Query : {object_class : "Goods & Services Transaction", 
     object_origin_category : "Bookkeeping"}

Projection : {_id:0, 
     transaction_reference :1, 
     object_creation_date:1,
     object_category:1, 
     object_type : 1, 
     object_origin: 1, 
     transaction_date : 1, 
     transaction_status :1, 
     transaction_gross_value : 1, 
     transaction_net_value :1, 
     transaction_tax_value :1, 
     party.party_name : 1, 
     party_uuid : 1}

Sort : {party_uuid : 1, 
     transaction_reference : 1, 
     object_creation_date : -1}

I've then added a JSON to Table to make it easy to read, which has worked (with exception to insisting on renaming my two date variables which is very annoying).

What I need help with is the following

From the table, I want to divide the set into two outputs - 1) the first row for each transaction_reference based upon the sort order above. 2) the last row for each transaction_reference based upon the sort order above.

I've tried groupBy, I've looked at Row Filter and others and cannot find one where I can add in the criteria of either MIN or MAX of object_creation_date for each transaction_reference.

Thanks, Matt

mattlightbourn · August 8, 2016, 4:47am

So I've done the following which seems to work, however appears to be a long winded way of doing what I wanted to do.

I setup a JSON Path to setup values in set properly, then JSON to Table. I then did a Uniquify IDs on transaction_references. 'post' sort (which I had done at the beginning in the MongoDB Reader), means earlier dated occurences of the same transaction. I used the Row Filter to exclude any row where the transaction_references has *_0 in it.

This left me with only the $first record found per transaction. What I've realised is, if more than one update has been applied, there maybe _1, _2, etc. Also, I cannot gaurantee that the origin system doesn't contain an underscore in the transaction_reference field.

So, I've managed to do an initial working proof of concept but believe that I have probably made heavier work of it than I needed to. Any help with this is greatly appreciated. Thanks, Matt

Iris · August 10, 2016, 12:14pm

Hi Matt,

just were thinking about your question. Just to get that right, you want to resolve the duplicates with native KNIME nodes.

This you can for example do with the GroupBy node. GroupBy all columns which are your identifiers and use First on all the others.

Sorry, but with MongoDB I cannot be of help.

Best, Iris

Shibumi · August 10, 2016, 9:29pm

Hi Matt,

I'm not familiar with how MongoDB works but in reading your question, I wondered if the SQL portion has a windowing function like in Oracle/SQL Server?

mattlightbourn · August 15, 2016, 1:26am

thank you for your answers, much appreciated. I will get back to this and try the GroupBy node for this. There doesn't appear to be any aggregation nodes for Mongo in Knime which is why I'm having issues with it. It only has query, projection and sort. Through using these, I can probably get very close to the result I want but with all duplicates coming through.

What I would need from the GroupBy node is to ensure that it only takes the values of the FIRST record it sees where more than one record related by group.

Thanks, Matt