Filter rows using list of words with wildcards

PLS_KN · May 3, 2022, 12:22pm

Hi,

I have a file with about 2M rows and several columns. 1 column contains the description of items:
e.g. “T-shirt made from recycled cotton”, or “laptop computer year 2019” or “salad bowl made from glass and iron”.

I want to keep only rows that are of interest to me, so I created a list of “words to remove” and I want to remove all rows that contain 1 or more of these words. The item descriptions are very long and can differ slightly so I used “table creator” nodes where each row contains word(s) including wildcards. For example “cotton” is on the list, so I want to remove rows that contain “cotton” in the description.

Reference row filter did not work: it seems it does not work well with wildcards (0 records were removed). In the reference table I have “cotton”, but description are like " 20 white cotton t-shirts" so it does not work.
Individual Row filter nodes would work but my list contains about 55 words to remove to it’s quite tedious to do it that way (would need 55 nodes).
Rule based row filter would also work but I would have to write 55 statements such as $Description$ LIKE “Cotton” or $Description$ LIKE “papaya” ETC 55 times as well so it becomes impossible to read and making changes would lead to errors probably.
Loop: using a table creator (for my list) & row filter node (select the option “exclude the rows by attribute values”) and loop through each item of my list. However this seems to have de-multiplied the data and the results are not correct.
Loop and row filter value: this time “include” the rows by attribute value, and then join the main data set with the “intermediary results” and do an anti-join. This works but is not ideal.

Any recommendations on how to proceed?
thanks!
pls

badger101 · May 3, 2022, 2:15pm

Hi @PLS_KN , if your dataset doesn’t contain sensitive info, share it here so I can work out a workflow for you. If it does contain sensitive info, either remove the sensitive columns or do some modification.

I will also need the list of your filter words.

Thanks.

elsamuel · May 3, 2022, 2:25pm

Personally I’d go with the Rule-Based Row Filter (Dictionary) node.

Since you already have the list of words, you can automatically create the rules using a String Manipulation node

PLS_KN · May 3, 2022, 3:06pm

Hi @badger101
unfortunately I work with sensible data so I’ve create two dummy tables. Let me know if you think of anything! here’s the workflow
KNIME_project.knwf (27.4 KB)

thanks again
PLS

PLS_KN · May 3, 2022, 3:08pm

thanks @elsamuel I will give that a try!

badger101 · May 3, 2022, 3:31pm

Elsa’s suggestion should work for you.

Cheers.

duristef · May 3, 2022, 4:37pm

hi @PLS_KN
you can use regular expressions, like in this workflow

The Stopwords node contains the “words to remove”. They are “packed” into a single regular expression, so the matching happens only once per Description. The words, in fact, are “patterns” belonging to a regex and they are meant to match exactly, so if you want to include both singulars and plurals you must use e.g. computers*. For the same reason “iron” won’t match “ironing”, unless you use iron(ing)*.
immagine

The complete regex, once calculated by the String Manipulation node, is passed as a variable to the Regex Extractor, which creates a column with matched words (if any) or missing values
immagine

The last step is to filter out the “matching” rows and the Group column
Result

20220503_filter_rows_stopwords.knwf (18.9 KB)

PLS_KN · May 4, 2022, 9:03am

thanks! I will try that one too; it seems a bit longer to set up but will execute faster.

cheers
PLS

PLS_KN · May 4, 2022, 9:08am

where did you find the "REGEX extractor"node? It’s not available and knime did not find the necessary extension for this node.

Thanks

takbb · May 4, 2022, 9:22am

Hi @PLS_KN , regex extractor is part of the Palladian nodes.

Regex Extractor — NodePit.

See the bottom of that page for installation instructions

gonhaddock · May 4, 2022, 11:06am

Hello @PLS_KN
I couldn’t avoid to give this a try. It is based in a Logical Indexing methodology. It doesn’t use the Regex Extractor component (I cannot download it from my network today).

With the mentioned logical indexing matrix you can give this workflow some other LOGICAL uses like: when does a word happens when some other is present before…

This is an example of the excluded words in your final filter:

I hope you find useful this workflow. BR

system · August 2, 2022, 11:06am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.