Find similar rows

Hello,
If I have a row and would like to pull all similar rows out of a table, what should I do?

Malik

Hi Malik,

I guess you are looking for the “Reference Row Filter” node.

Best,
Armin

Hi @Armin,
I'm looking for the exact rows, meaning rows that have the same values, not rows matched by ID.
For example, if I have one row as:

and a table,
I want to get all the rows that have the same values as that one row.

Best
Malik

So in this case I suggest the following:

Use a “Table Row to Variable” node right after the node that produces that single row.

Now you have all the columns and values of that single row as flow variables.

Then use a “Rule-based Row Filter” node on the main dataset and connect the output port of the “Table Row to Variable” node to it. When defining the rules, compare each column of your main dataset with the corresponding flow variable and filter on those comparisons.
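
For readers who think in code, the logic of this filter can be sketched in pandas. This is only an illustration of the idea, not how the KNIME nodes are implemented; the helper `filter_by_reference` and the columns `city` and `amount` are made up for the example.

```python
import pandas as pd


def filter_by_reference(table: pd.DataFrame, reference: pd.Series) -> pd.DataFrame:
    """Keep the rows of `table` whose values equal `reference` in every column.

    `reference` plays the role of the flow variables: one value per column name.
    """
    mask = pd.Series(True, index=table.index)
    for column, value in reference.items():
        # One "rule" per column: the column must equal the reference value.
        mask &= table[column] == value
    return table[mask]


# Hypothetical data, only for illustration.
table = pd.DataFrame(
    {"city": ["Berlin", "Paris", "Berlin"], "amount": [10, 10, 10]},
    index=["Row0", "Row1", "Row2"],
)
reference = pd.Series({"city": "Berlin", "amount": 10})

matches = filter_by_reference(table, reference)
print(matches)
```

As with the rules in the filter node, the comparison is tied to the column names, so a change in the table structure means a change in the reference as well.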


variable row filter.knwf (15.9 KB)

Best,
Armin


I took the liberty of adding an example with a collection column. You define a pattern for the columns that are to be combined into a collection column and then use the row reference to determine which rows are identical and which are not.

kn_example_var_row_filter.knar (34.3 KB)
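
The collection-column idea can be sketched in pandas roughly as follows. This is an assumption-laden illustration rather than a copy of the attached workflow: the helper `collection_key`, the pattern `^val_`, and the sample columns are all hypothetical.

```python
import re

import pandas as pd


def collection_key(table: pd.DataFrame, pattern: str) -> pd.Series:
    """Combine all columns whose name matches `pattern` into one tuple per row."""
    columns = [name for name in table.columns if re.search(pattern, name)]
    rows = table[columns].itertuples(index=False, name=None)
    return pd.Series(list(rows), index=table.index)


# Hypothetical data, only for illustration.
table = pd.DataFrame(
    {"id": [1, 2, 3], "val_a": ["x", "y", "x"], "val_b": [1.0, 2.0, 1.0]}
)
reference = pd.DataFrame({"val_a": ["x"], "val_b": [1.0]})
pattern = r"^val_"

# Build the combined key on both sides, then keep rows whose key
# also occurs in the reference table.
reference_keys = set(collection_key(reference, pattern))
is_identical = collection_key(table, pattern).map(lambda key: key in reference_keys)

identical = table[is_identical]
not_identical = table[~is_identical]
print(identical)
```

Because the columns are selected by pattern, nothing has to be reconfigured when a matching column is added, at the cost of building one combined value per row.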


Hi Markus, (I hope I’m not wrong this time… :sweat_smile:)

@mlauber71, that's nice! Thanks for sharing the idea. I think your solution is easier and better: if the data structure changes in the future, there is no need to change the configuration of the nodes, whereas in my solution one has to change the rules.

Best,
Armin


Your solution might work faster if one has really huge amounts of data and only a small number of matching rows to detect. The solution with the combined columns would take some RAM and disk space if we deal with really huge datasets.

One solution might be to assign artificial IDs first, loop over the reference columns only to record which IDs match, and later join that back, but that would depend on the precise task at hand.
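
A rough pandas sketch of that ID-based idea is below. The column name `row_id` and the sample data are hypothetical, and whether this actually saves memory would depend on the data and the task.

```python
import pandas as pd

# Hypothetical data, only for illustration. "note" is not a reference column.
table = pd.DataFrame(
    {
        "city": ["Berlin", "Paris", "Berlin", "Berlin"],
        "amount": [10, 10, 20, 10],
        "note": ["a", "b", "c", "d"],
    }
)
reference = pd.DataFrame({"city": ["Berlin"], "amount": [10]})

# Step 1: assign an artificial ID to every row.
table["row_id"] = range(len(table))

# Step 2: loop over the reference columns only and narrow down the matching IDs.
matching_ids = set(table["row_id"])
for column in reference.columns:
    value = reference[column].iloc[0]
    hits = table.loc[table[column] == value, "row_id"]
    matching_ids &= set(hits)

# Step 3: join the surviving IDs back to the full table.
id_table = pd.DataFrame({"row_id": sorted(matching_ids)})
result = id_table.merge(table, on="row_id", how="inner")
print(result)
```

Only the small set of IDs is carried through the loop; the full rows are touched again just once, in the final join.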


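I have also solved it using a Python script:
import pandas as pd

# Reference row: the first row of the second input table
t = input_table_2.iloc[0, :]
matches = []
for index, row in input_table_1.iterrows():
    # Compare the row with the reference row, column by column
    flag = 1
    for i in range(row.size):
        if t.iloc[i] != row.iloc[i]:
            flag = 0
            break
    if flag == 1:
        matches.append(index)
# Keep only the rows that are identical to the reference row
output_table = input_table_1.loc[matches]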


KNIME 4.2 does not start up anymore after installing the required plugins. There is probably an issue with the plugins at the moment.