I’m looking for a way to train a model in Knime to link 2 sheet between which the link is not so obvious…
You’ll find attached an XL file within which there are 2 sheets and the columns on which I’d like to link the 2 sheets is highlighted in yellow.
As you’ll see these columns are not exactly the same and I need to train a model to enable Knime to understand which cell is more or less equal within the sheet 1 with the equivalent column in sheet 2…
Of course the workflow I have to build will manipulate a larger amount of data. I’m just trying to find a way to perform this operation.
“Training a model” sounds like total overkill to me. You do not have any “learning” data, do you? I’d rather:
Define some normalization rules and then do an exact match join, or
Use some string similarity measure (see e.g. the String Similarity node from Palladian), define a threshold, and then join with the “most similar” candidate, or
Yes you’re probably right the word training maybe not appropriated… Anyway I just need a solution to make it work I’m going to try you’re solution through the palladian node you mentionned.
Thanks for this.
I had a look at the sheet again, unless I’m mistaken the issue is just that you have some “garbage strings” before/after the identifiers? If this is the main issue, I’d rather suggest some simple string processing first where you remove these via some simple rules. Once set up, this is simple to understand and to optimize.
Only try to go for option (2) if you see that option (1) does not work
I use Palladian’s Regex Extractor to extract the CFE identifier in the second table, and then “canonicalize” the CFE identifiers by stripping spaces and hyphens, and lastly join but tables together.