Extracting matching words from another dataset and fill the missing place

Hi, I have 2 datasets named as Buyer dataset and the Seller dataset. I want to automate this. The buyer dataset has some missing values in cust_id. I want to fetch the disappeared cust_id from Buyer’s description or Seller_id in the Seller dataset. I tried using cross joiner and compared the values based on cust_id and seller_id. Could anyone please help with this? I am new to this platform and learning things. I am attaching the workflow with this question.
KNIME_project6.knwf (18.9 KB)

@anjeesanjeet Welcome to the forum!

I had a quick look at your workflow and I think there is too little information to refill the disappeared cust_id from the other dataset.

What I did is run a full outer join on the “Description” (which is the only column effectively “shared” between the two tables) with the joiner node which then shows the rows missing in either one of the input tables.

If that is just a fault in your example input data after the full outer join you could e.g. use the rule engine to fill missing values with the corresponding other ID

Example rule:

Hope that helps a bit

Cheers,
Lars

Hi @PTDataScientist thank you for your instant reply. If you see, in the buyer’s dataset there is cust_id. If you check in the cust_id where Krish is the seller, the cust_id is missing but in the description, there is a value which looks similar to other cust_id. I want to compare that with the seller_id and if found to be similar then I want to extract that cust_id from the buyer’s description to cust_id. Hope you understand my query.

OK, I see. But then I am afraid that might turn into a suicide mission if you don’t have a full list of valid IDs that you could match to the description.

What I have tried is extracting all available IDs in both columns and loop over them to match the resulting IDs against the description but that only matches the “NN-4260-34” in the description which has the ID set correctly. The workflow can’t find the 64836-N as there is no corresponding transaction in the other table…

From the given Customer/Seller-IDs I also don’t see a way I could come up with a regex or so that would not also match e.g. the “500 GB” or the “64 bit” which would mess up the data even further.

KNIME_project6_prediction.knwf (28.2 KB)

Thanks, @PTDataScientist, Is this possible to compare the (strings/text/every word) of the buyer’s description with Seller_id in the seller dataset and if the word in the Buyer’s dataset matches or looks anyway similar(Seller_id and Buyer’s description word), replace that matched word in the missing customer_id in Buyer’s dataset.

My main Aim is to find the values of the missing cust_id in the buyers’ dataset.

Did you try to modify my generic solution in the workflow which matches all IDs to a description to achieve what you want to do? Almost everything you need should be in there… Take the IDs - build a generic regular expression ".*($$cust_id$$).*" and match it against the description to generate a predicted id.

Happy to help if you have more specific questions on where you are struggling…

I usually use the dictionary rule node for this kind of thing. I use the column expressions or rule nodes to create dynamically built formulas. If they need to match row IDs as well (to achieve a more of a row level join), then I use the row ID node to turn the ID into a column and add the column match to the formula. You essentially gain the ability to do formula based highly complex conditional joins.

If you are not familiar with prepping formulas for the dictionary rule node, then you can see a workflow example here.

2 Likes

Thank you @iCFO for your help.

1 Like

Thank you @PTDataScientist for the help. I appreciate your time.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.