how to train a model to link 2 sheets between which no obvious link exists

Hello Knime community,

I’m looking for a way to train a model in Knime to link 2 sheet between which the link is not so obvious…

You’ll find attached an XL file within which there are 2 sheets and the columns on which I’d like to link the 2 sheets is highlighted in yellow.

As you’ll see these columns are not exactly the same and I need to train a model to enable Knime to understand which cell is more or less equal within the sheet 1 with the equivalent column in sheet 2…

Of course the workflow I have to build will manipulate a larger amount of data. I’m just trying to find a way to perform this operation.

Sample_data.xlsx (10.3 KB)

“Training a model” sounds like total overkill to me. You do not have any “learning” data, do you? I’d rather:

  1. Define some normalization rules and then do an exact match join, or
  2. Use some string similarity measure (see e.g. the String Similarity node from Palladian), define a threshold, and then join with the “most similar” candidate, or
  3. combine 1. and 2.

– Philipp

Hello Philipp,

Yes you’re probably right the word training maybe not appropriated… Anyway I just need a solution to make it work :slight_smile: I’m going to try you’re solution through the palladian node you mentionned.
Thanks for this.

Nicolas

Hey Nicolas,

I had a look at the sheet again, unless I’m mistaken the issue is just that you have some “garbage strings” before/after the identifiers? If this is the main issue, I’d rather suggest some simple string processing first where you remove these via some simple rules. Once set up, this is simple to understand and to optimize.

Only try to go for option (2) if you see that option (1) does not work :slight_smile:

Fingers crossed and keep us posted!
– Philipp

I’ve put a very simple example on my NodePit Space for you to get started (this is basically what I described in Step 1 above):

I use Palladian’s Regex Extractor to extract the CFE identifier in the second table, and then “canonicalize” the CFE identifiers by stripping spaces and hyphens, and lastly join but tables together.

Get it here:

Hope this helps!

1 Like

Hello Philipp,

This is exactly what I was expecting. This is really simple actually… I was imagining a way more complex solution…

This will do the job perfectly. Thanks a lot Philipp!
Nicolas

2 Likes

Welcome! Glad to hear :slight_smile:

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.