Remove rows that are contained (as sub-strings) in other rows

Hello all,

I have a table as follows:
This is a cat
This is a pet
This is a dog
is
a
is a
is
is a dog
That is a dog

I would like to end up with a table as follows:
This is a cat
This is a pet
This is a dog
That is a dog

Essentially remove the rows that are already contained (as sub-strings) in other rows.

Thanks
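
For reference, the requirement above can be sketched in plain Python, outside of any KNIME node. This is only a minimal illustration, not a KNIME solution: the function name `drop_contained_rows` is hypothetical, and it assumes exact duplicate rows should collapse to a single row. It compares every row with every other row, so it is quadratic in the number of rows.

```python
def drop_contained_rows(rows):
    """Keep only rows that are not substrings of another, different row."""
    # Assumption: exact duplicates collapse to one row (first-seen order is kept).
    unique = list(dict.fromkeys(rows))
    return [
        row
        for row in unique
        if not any(row != other and row in other for other in unique)
    ]


rows = [
    "This is a cat",
    "This is a pet",
    "This is a dog",
    "is",
    "a",
    "is a",
    "is",
    "is a dog",
    "That is a dog",
]
result = drop_contained_rows(rows)
print(result)
```

On the sample table this keeps the four full sentences and drops the fragments.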

Hi,

Considering the examples, you can use a Rule-based Row Filter node and this expression:
$column$ LIKE "* is a *" => TRUE

Best,
Armin

Hi there!

I think @saqib is looking for a general approach to his problem. Clearer requirements and some sample input might help others suggest a solution :wink: Anyway, I saw you got some answers here which seem pretty good, so I will assume you got what you were looking for :slight_smile:

Br,
Ivan

@ipazin,

Yup, I did get some solutions on Stack Overflow, but they don’t seem to scale to large datasets. I am working with a large dataset, so I am looking for something more optimized.

Any ideas?

Thanks.
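
One possible way to cut down the number of comparisons, sketched here in plain Python as an assumption rather than a tested KNIME workflow: process rows from longest to shortest and compare each row only against the rows already kept. This works because containment is transitive: a row contained in a dropped row is also contained in the kept row that covers the dropped one. The function name `drop_contained_rows_sorted` is hypothetical, and the worst case is still quadratic when almost nothing can be removed.

```python
def drop_contained_rows_sorted(rows):
    """Drop rows that are substrings of another row, longest rows first."""
    # Assumption: exact duplicates collapse to one row (first-seen order is kept).
    unique = list(dict.fromkeys(rows))
    kept = []
    for row in sorted(unique, key=len, reverse=True):
        # Only rows at least as long as this one have been kept so far.
        if not any(row in longer for longer in kept):
            kept.append(row)
    kept_set = set(kept)
    # Restore the original row order for the output.
    return [row for row in unique if row in kept_set]


rows = [
    "This is a cat",
    "This is a pet",
    "This is a dog",
    "is",
    "a",
    "is a",
    "is",
    "is a dog",
    "That is a dog",
]
result = drop_contained_rows_sorted(rows)
print(result)
```

When most rows are short fragments of a few long rows, the kept list stays small and each row is checked against only a handful of candidates.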

If you have frequency information on the expected correct phrases, then this method can be used.

@izaychik63,

I don’t have the frequency information.

Thanks,
Saqib

Hi there,

I would first try optimizing the workflow rather than looking for a new approach. This is a great blog post about it: https://www.knime.com/blog/optimizing-knime-workflows-for-performance

Additionally, the KNIME summer release (3.8) should bring faster workflow execution, so count on that as well :wink:

Br,
Ivan

I think this might get you started. It is not a complete solution, but it is the start of one, and in fact very close to one. Hint: just add more regex to the filter.

Testing1.knwf (17.7 KB)
