Search reference in different columns

Hello,

I have two files. The first one looks like this:

[image: first file]

The column “Number” should be filled with the numbers from the other file. The problem is that both files are not always completely filled in, and the values are sometimes mixed up. For example, the postal code ends up in the column “Name 3”.

Other file:

As you can see, it is a bit mixed up.
In the end I want the column “Number” of the first file to be filled in by several chained searches, so that it looks like this:

[image: desired result]

I hope someone understands what I want.

Many thanks in advance and best regards

Knimer17

Can you get some “unique identifier” into both tables?
My first idea was to join the tables and then use the Column Merger node to fill in the missing values.
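For illustration, here is a minimal Python/pandas sketch of that join-then-merge idea, assuming a shared key column exists in both tables (the “ID” column below is hypothetical; the original files may not have one):

```python
import pandas as pd

# Hypothetical tables with a shared "ID" key; the real files may not have one.
t1 = pd.DataFrame({"ID": [1, 2], "Street": ["arthur street", None], "Number": [None, None]})
t2 = pd.DataFrame({"ID": [1, 2], "Street": [None, "barn street"], "Number": ["A100", "B200"]})

# Join on the identifier, then merge each column pair: take the value from
# table 1 where present, otherwise fall back to table 2 (Column Merger logic).
joined = t1.merge(t2, on="ID", suffixes=("_1", "_2"))
for col in ["Street", "Number"]:
    joined[col] = joined[f"{col}_1"].combine_first(joined[f"{col}_2"])

print(joined[["ID", "Street", "Number"]])
```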
As a tip: if you can provide a sample, it will certainly be easier for the forum to support you.
br


Hi @Knimer17 , because the structure of your data is not standardized, the way I approached it is: I combined the columns of the 2nd table, except for Number, into a single text string.

I then search for each field from the 1st table, except for Number, in the text of the 2nd table, and count how many are found. I then filter for the rows whose count is the highest, meaning the most similar or closest match.

I haven’t used text search in KNIME much, so I don’t know if there are specialized nodes for this type of search, or if there could be a more optimized way to do it, but I ended up doing it this way.
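For anyone who wants to follow the logic outside of KNIME, here is a rough Python/pandas equivalent of the combine-search-count approach. The tables and column names are made up to resemble the screenshots, so treat this as a sketch rather than the actual workflow:

```python
import pandas as pd

# Made-up sample data mimicking the screenshots: values can land in any column.
t1 = pd.DataFrame({"Name": ["Mr Smith"], "Street": ["arthur street"], "City": ["London"]})
t2 = pd.DataFrame({"Name 1": ["arthur street"], "Name 2": ["Mr Smith"],
                   "Name 3": ["London"], "Number": ["A100"]})

# Step 1: combine every column of table 2 (except Number) into one text string.
t2["combined"] = t2.drop(columns="Number").astype(str).agg(" ".join, axis=1)

# Step 2: for each pairing of rows, count how many fields of table 1
# appear in the combined text of table 2 (the Cross Joiner part).
def match_count(row1, text):
    return sum(str(v) in text for v in row1)

scores = [
    (i, j, match_count(r1, t2.loc[j, "combined"]))
    for i, r1 in t1.iterrows() for j in t2.index
]

# Step 3: for each row of table 1, keep the table-2 row with the highest count.
best = (pd.DataFrame(scores, columns=["row1", "row2", "count"])
          .sort_values("count", ascending=False).groupby("row1").head(1))
print(best)
```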

This is what the workflow looks like:

My 2 input tables (same as yours):
[images: the two input tables]

And I get the same results as what you want:
[image: result table]

Here’s the workflow:
Search reference in different columns.knwf (24.9 KB)


Hey @bruno29a , many thanks for your response. Do you think there is another way to put the two files together, without the Cross Joiner? In the original data, both files together have about 400k rows; I ran the Cross Joiner for 3 hours and a whole 1% was done :confused:

You could try to use Bruno’s idea and merge the columns (I would probably do it using GroupBy and a sorted list) for both tables, and then try the Similarity Search node to find the closest match on string distance. That way you avoid squaring your row count. Maybe that works for you.
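A minimal Python sketch of that idea, using difflib’s built-in string matching as a stand-in for KNIME’s string-distance measures (all table and column names here are made up):

```python
import difflib
import pandas as pd

# Made-up addresses: concatenate each row's fields into one sorted string,
# so the same components match even when they sit in different columns.
t1 = pd.DataFrame({"a": ["arthur street"], "b": ["London"]})
t2 = pd.DataFrame({"x": ["London"], "y": ["arthur street"], "Number": ["A100"]})

def row_key(row):
    # Sorting the components makes the key order-independent,
    # similar to a GroupBy with a sorted-list aggregation.
    return " ".join(sorted(str(v) for v in row if pd.notna(v)))

t1["key"] = t1.apply(row_key, axis=1)
t2["key"] = t2.drop(columns="Number").apply(row_key, axis=1)

# For each row of table 1, find the closest table-2 key by string similarity
# instead of cross-joining every row with every other row.
for key in t1["key"]:
    closest = difflib.get_close_matches(key, t2["key"], n=1, cutoff=0.5)
    print(key, "->", closest)
```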
br


Hi @Knimer17

I made some changes to the workflow from @bruno29a : I introduced a loop. I’m not sure if it works with 400K records, but it is worth a try. Search reference diff coll.knwf (80.2 KB)


gr. Hans


Hi @Knimer17

I assumed that we couldn’t easily tell which data columns would be matched, and so effectively decided to find all matches between all columns and all rows. However, this shouldn’t be quite as painful as it sounds: the workflow first unpivots your data into one long table of “address component” values for each table, and then finds the potential matches.

After this, it reassembles the tables based on the possible matches between columns, and performs a “String Similarity” on the address as a whole for each “matched” address. It then filters the results to keep only those addresses which match above a certain threshold.

This won’t necessarily give you exact answers, but at the far end you should have a dataset containing a set of possible matches to which you could apply further rules, should you wish to improve the accuracy. You can also adjust the threshold for a match to be considered (I just arbitrarily chose 50%).
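For readers who prefer to see the shape of this approach as code, here is a rough Python/pandas sketch: unpivot both tables, join on the component values to get candidate row pairs, then score each pair on the whole address. difflib stands in for the String Similarity node, and all names are invented:

```python
import difflib
import pandas as pd

# Made-up sample tables; any value can sit in any column.
t1 = pd.DataFrame({"row1": [0], "c1": ["arthur street"], "c2": ["London"]})
t2 = pd.DataFrame({"row2": [0], "d1": ["London"], "d2": ["arthur street"],
                   "Number": ["A100"]})

# Unpivot both tables into long lists of (row id, address component).
long1 = t1.melt(id_vars="row1", value_name="component")[["row1", "component"]]
long2 = (t2.drop(columns="Number")
           .melt(id_vars="row2", value_name="component")[["row2", "component"]])

# Join on the component value: every shared component yields a candidate pair.
candidates = long1.merge(long2, on="component")[["row1", "row2"]].drop_duplicates()

def whole_address(df, id_col, row_id):
    # Reassemble the full address of one row, order-independent.
    parts = df.loc[df[id_col] == row_id, "component"].astype(str)
    return " ".join(sorted(parts))

# Score each candidate pair and keep matches above the 50% threshold.
for _, (r1, r2) in candidates.iterrows():
    sim = difflib.SequenceMatcher(
        None, whole_address(long1, "row1", r1),
        whole_address(long2, "row2", r2)).ratio()
    if sim >= 0.5:
        print(r1, r2, round(sim, 2))
```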

What comes out at the far end is a “best guess”, and it could be that it matches an address in Table 1 to more than one “similar” address in Table 2.

I would not expect this to take an exceptionally long time to process 400k records, but it partly depends on how many pieces of address it matches as to whether this is fully workable. There are potentially some additional tweaks/refinements that could be added if it doesn’t quite work for your full data set.

I took @bruno29a 's supplied sample data and added an additional row to check that it didn’t match anything. An area of concern that I noted in the annotation is what happens if an address differs just by building number, or something like that. You’d need to introduce other rules to handle this, but at least the data set at the point where you’d introduce those rules would be more manageable, and you’d just be applying the rules to already partially-matched data. You’d put such rules in where the nodes are, above the annotations to the far right of the flow. This could be done either by adding to the “Rule-based Row Filter” or by adding further nodes of your own.

Search reference in different columns-takbb 1.knwf (63.1 KB)

At the end, it includes the “String Similarity” value (from the String Similarity mentioned earlier by @Daniel_Weikert), which can be taken as a kind of “confidence level” (1 being a full match) that the addresses are the same:

[image: results with String Similarity values]

Edit:

As I posted this, I thought of a small potential improvement:

[image: the suggested improvement]
Having found all of the data elements on which the two tables match, maybe we should filter out “potential” matches where the rows only had one data value in common? This “threshold” could of course be manually adjusted to suit your needs, but fewer “potential” matches would improve the throughput of the subsequent joins, as you have filtered out some of the “noise”…
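In code terms, that improvement is just a count-and-filter over the matched components. A small pandas sketch (component_matches below is a hypothetical stand-in for the joined component table from the earlier sketch):

```python
import pandas as pd

# Hypothetical output of the component join: one row per matched data element.
component_matches = pd.DataFrame({
    "row1": [0, 0, 1],
    "row2": [0, 0, 5],
    "component": ["arthur street", "London", "B200"],
})

# Keep only row pairs that share more than one data value; pairs matching on a
# single component are likely noise and only slow down the later joins.
counts = component_matches.groupby(["row1", "row2"]).size().reset_index(name="n")
kept = counts[counts["n"] > 1]
print(kept)
```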

Search reference in different columns-takbb 2.knwf (64.3 KB)

[Further edit: fixed small bug in both uploaded flows]


Hi @Knimer17 , the Cross Joiner is meant so that each row from one file can be compared to every row of the other file.

Indeed, with 400k rows, that can be quite big and resource-consuming, and consequently slow :sweat_smile:

@HansS 's approach with a loop can help, in that it will still do what my workflow is doing, but instead of merging all records at once it will merge 1 row at a time from one of the tables, so it is less resource-consuming, and the Duplicate Row Filter will run much faster too, as it will have far fewer records to compare. You can give it a try; I’d be interested to hear what kind of performance difference it makes.
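As a rough illustration of why the loop helps, here is a Python/pandas sketch of chunked matching; loop_match and score_fn are invented names, and the chunk size is arbitrary:

```python
import pandas as pd

# Instead of materializing the full cross join (n1 x n2 rows), process table 1
# in chunks against table 2 and keep only each chunk's best matches, so memory
# stays bounded by chunk_size * len(t2). score_fn is whatever similarity you
# use, e.g. the match count from the earlier sketch.
def loop_match(t1, t2, score_fn, chunk_size=1000):
    best = []
    for start in range(0, len(t1), chunk_size):
        chunk = t1.iloc[start:start + chunk_size].copy()
        chunk["_row1"] = chunk.index            # remember the table-1 row
        pairs = chunk.merge(t2, how="cross")    # small cross join per chunk
        pairs["score"] = pairs.apply(score_fn, axis=1)
        # Keep the best-scoring table-2 row for each table-1 row in this chunk.
        best.append(pairs.sort_values("score", ascending=False)
                         .groupby("_row1").head(1))
    return pd.concat(best, ignore_index=True)
```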

However, it also depends on how many records each file has separately. When you say “In original both files together have about 400k rows”, it would be better to know how many rows each file has separately, as the Cross Joiner basically ends up with Number_of_records_from_file_1 x Number_of_records_from_file_2 rows.

For example, let’s say you have 6 rows in total for both files. Depending on how many each file has separately, the Cross Joiner could end up with:
1 x 5 = 5
2 x 4 = 8
3 x 3 = 9

So, you can see how you can end up with a medium-sized table or a much bigger one.
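To put the original 400k in perspective, a quick back-of-the-envelope check in Python (the splits below are hypothetical):

```python
# Hypothetical even split of the 400k rows across the two files:
n1, n2 = 200_000, 200_000
print(f"{n1 * n2:,} cross-joined rows")   # 40,000,000,000

# A lopsided split is far smaller, but still large:
n1, n2 = 390_000, 10_000
print(f"{n1 * n2:,} cross-joined rows")   # 3,900,000,000
```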

The Similarity Search is also another approach. I actually started in that direction originally, but backed off of it because I felt like I did not have “control” over what constitutes a “similarity” here. That is not to say that going with Similarity Search is wrong; this is just a personal opinion and preference on my part (essentially I implemented my own similarity search in my approach). You should still try @takbb 's solution and see if you get the same results (you should), and also his approach is different. It would be interesting to see what the performance difference is.

You could also share your 2 files if they’re not confidential, and we could test the performance on our side :slight_smile:


… and we can always mix and match different solutions, @Knimer17 :wink:

Here is my earlier solution, but with the String Similarity nodes removed, and the “matching nodes” from @bruno29a 's solution retro-fitted in their place… (marked with “teal” annotation boxes)

So here, it uses my earlier approach to narrow down the possible matches, but then uses @bruno29a’s logic to determine from this subset which ones to keep…

I think this works, and assuming it does, it shows how we can grow our ideas from each other :slight_smile:

I agree it would be really interesting to know how well each of the solutions scales with a larger dataset. One of the challenges with suggesting solutions like this is that we don’t get to try them with representative data sets, so they all work in theory, but sometimes we find our suggestions aren’t practical when put to the final test…

Search reference in different columns-takbb 3.knwf (67.6 KB)


Hello @Knimer17 and others,

joining the party a bit late, but if values are always exactly the same across both tables (e.g. arthur street is always arthur street and not arthur st.), doesn’t it make sense to first clean the other file so that all street names are in the same column (the same goes for the other columns) and then “simply” use the Joiner node? That might be easier, faster and more accurate in the end.
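A small Python sketch of what such cleaning could look like. The classification rules below (e.g. a five-digit regex for postal codes) are made-up placeholders, and as discussed further down, real rules are much harder to get right:

```python
import re
import pandas as pd

# Made-up classification rules; real data would need far more robust patterns.
def classify(value):
    v = str(value).strip()
    if re.fullmatch(r"\d{5}", v):                       # naive postal-code pattern
        return "Postal code"
    if re.search(r"\b(street|avenue|st\.?)\b", v, re.I):
        return "Street"
    return "Name"

# Reassign each value of the messy table to its proper column...
messy = pd.DataFrame({"Name 1": ["12345"], "Name 2": ["arthur street"], "Number": ["A100"]})
cleaned = {}
for col in messy.columns.drop("Number"):
    for i, v in messy[col].items():
        cleaned.setdefault(classify(v), {})[i] = v
clean = pd.DataFrame(cleaned).assign(Number=messy["Number"])

# ...after which a plain key-based join becomes possible.
print(clean)
```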

Br,
Ivan


Hey @ipazin , the issue was not so much arthur street vs arthur st, but rather that arthur street could be in different columns.


Hello @bruno29a,

I got that, and was saying that it seems easier to me to clean the data so that arthur street can only be in one column.

Ivan


Hi @ipazin , yes, if cleaning is possible, then of course it would be the better way. Based on my experience dealing with addresses, though, this can be very difficult when you do not have control over the input data.

If you look at the data from @Knimer17 , while we can see that arthur street and possibly barn street would require the same kind of cleaning, the postal code of airplane avenue would require a different type of cleaning. And in that particular case, how do you even decide that it’s a postal code? It could have been the house number. On top of that, postal code formats vary per country, which makes it almost impossible to determine that this value in “Name 3” is actually a postal code.

It would take much more effort to try to clean the data, and you still wouldn’t be 100% sure that you’ve covered all the different “dirty” cases.

