Merge tables by column similarity

karelman · September 23, 2015, 9:33pm

I want to merge two huge tables that the only information that have in "common" is the title of the document.
The problem is that the title is not always written exactly in the same way in all cases.
My idea is to do a clustering based on strings similarity using a distance method and then find the "pairs" in both tables.
The problem is that in any of the documentation I had found, you have to have a single table to calculate the distance measure.

Example:

Table 1
Columns
Feature 1, Feature 2, PojName, Feature 3

Table 2
Columns
Feature X, Feature Y, Feature Z, PojectName

Expected Table Result;
Columns
Feature 1, Feature 2, PojName, Feature 3, Feature X, Feature Y, Feature Z, PojectName.
In wich, the string in cells of PojName, and the one in PojectName are identical or mainly the same.

Geo · September 25, 2015, 1:35am

How about the node Similarity Search fed by the String Distances node ? The Similarity Search node finds the nearest neighbor in the reference table for each observation in your table based on the measure (in port or configure screen, depends on the function you are looking for). Jaro-Winkler, n-gram, Levenshtein, etc. all is there.

karelman · September 28, 2015, 11:58am

Yes, the use a similarity measure is the way, but the problem is that a node like string distance demands the use of ONE table, and I have TWO tables among I need to compare just by one column. In other words the distance node must have two input channels-

Geo · September 28, 2015, 10:24pm

The trick won't work if you have two different column names, i.e. PojName and PojectName. You need to rename one of the two columns, so that the names match. Then use the string distance node with your reference table and feed it into Similarity Search. After that connect the reference table and the other table. Now, the function from String distance node will be used - it will be mentioned in the configuration screen and you won't be able to choose any other distance function than the one that you have fed. Alternatively, you can feed no string distance function and you'll be served with a drop down menu to choose distance functions - of course, your desired candidate distance function may not be there.

The key step here is to have one string column name which is identical in both tables.

P.S.: The way I understand the distance function nodes is that they only define the parameters (distance type, distance parameters, variables involved in the calculation, etc.), however, nothing is calculated until you connect them to a relevant node, such as Similarity Search.

petermeissner · October 28, 2019, 3:02pm

To pick up on this again:

Using Similarity Search you can ‘Join’ two tables.
The similarities will be calculated on a column of your choosing.
The column name has to be the same for both tables.
The number of matches as well as upper and lower bounds for similarities can be specified.
Last but not least one can choose a column of the second table that will be appended to the joined table (e.g. an ID column to identify the rows of table 2 that have been joined).

Stay calm, the interface is a bit confusing.