how to train a model to link 2 sheets between which no obvious link exists

crnico · May 19, 2020, 11:47am

Hello Knime community,

I’m looking for a way to train a model in Knime to link 2 sheet between which the link is not so obvious…

You’ll find attached an XL file within which there are 2 sheets and the columns on which I’d like to link the 2 sheets is highlighted in yellow.

As you’ll see these columns are not exactly the same and I need to train a model to enable Knime to understand which cell is more or less equal within the sheet 1 with the equivalent column in sheet 2…

Of course the workflow I have to build will manipulate a larger amount of data. I’m just trying to find a way to perform this operation.

Sample_data.xlsx (10.3 KB)

qqilihq · May 19, 2020, 11:57am

“Training a model” sounds like total overkill to me. You do not have any “learning” data, do you? I’d rather:

Define some normalization rules and then do an exact match join, or
Use some string similarity measure (see e.g. the String Similarity node from Palladian), define a threshold, and then join with the “most similar” candidate, or
combine 1. and 2.

– Philipp

crnico · May 19, 2020, 12:14pm

Hello Philipp,

Yes you’re probably right the word training maybe not appropriated… Anyway I just need a solution to make it work I’m going to try you’re solution through the palladian node you mentionned.
Thanks for this.

Nicolas

qqilihq · May 19, 2020, 12:20pm

Hey Nicolas,

I had a look at the sheet again, unless I’m mistaken the issue is just that you have some “garbage strings” before/after the identifiers? If this is the main issue, I’d rather suggest some simple string processing first where you remove these via some simple rules. Once set up, this is simple to understand and to optimize.

Only try to go for option (2) if you see that option (1) does not work

Fingers crossed and keep us posted!
– Philipp

qqilihq · May 19, 2020, 12:45pm

I’ve put a very simple example on my NodePit Space for you to get started (this is basically what I described in Step 1 above):

I use Palladian’s Regex Extractor to extract the CFE identifier in the second table, and then “canonicalize” the CFE identifiers by stripping spaces and hyphens, and lastly join but tables together.

Get it here:

Hope this helps!

crnico · May 19, 2020, 1:52pm

Hello Philipp,

This is exactly what I was expecting. This is really simple actually… I was imagining a way more complex solution…

This will do the job perfectly. Thanks a lot Philipp!
Nicolas

qqilihq · May 19, 2020, 1:53pm

Welcome! Glad to hear

system · May 26, 2020, 1:53pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.