Delete duplicates

Wilko1806 · May 13, 2020, 4:53pm

I’m looking to use KNIME to automatically sort through emails I have scraped from Let’s Extract Market Studio.

The objective is this: if a domain occurs three times or more, delete all emails under that domain. It is sort of like deleting duplicates; however, some duplicates are necessary, as some websites have info@website.com and also contact@website.com, for example. What I am trying to get rid of are the masses of emails collected when my tool goes onto Deliveroo, the Guardian or the BBC and collects all the emails from there because those sites use a keyword I’m searching with.

Currently I’ve put this in place to split my keywords three ways, as I used multiple keywords in one search. Now, as I said, I am looking to delete irrelevant emails (in my head the easiest way seemed to be: domain count >= 3 = delete all).

ipazin · May 13, 2020, 5:17pm

Hi there @Wilko1806,

welcome to KNIME Community Forum!

Well, it seems to me this could be one way:

1. Extract the domain using the regex function from the String Manipulation node.
2. Use GroupBy to count domain occurrences.
3. Use the Row Filter node to keep only the domains you want to exclude (those that occur three times or more).
4. Use Reference Row Filter to exclude the rows with those domains from the table obtained after the first step.
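
For readers who find the logic easier to follow in code, the steps above can be sketched in plain Python. This is only an illustration of the idea and not the KNIME workflow itself: the sample emails, the regular expression, and the helper names `extract_domain` and `drop_frequent_domains` are all assumptions made up for this sketch.

```python
import re
from collections import Counter

# Hypothetical sample data standing in for the scraped email list.
emails = [
    "info@website.com",
    "contact@website.com",
    "news@bbc.co.uk",
    "sport@bbc.co.uk",
    "help@bbc.co.uk",
    "hello@smallshop.org",
]

# Assumed pattern: everything after the last "@" is treated as the domain.
DOMAIN_RE = re.compile(r"@([^@\s]+)$")


def extract_domain(email):
    """Step 1: pull the domain out of an address (None if there is none)."""
    match = DOMAIN_RE.search(email.strip())
    return match.group(1).lower() if match else None


def drop_frequent_domains(emails, threshold=3):
    """Steps 2-4: count domains, pick the frequent ones, exclude their rows."""
    domains = [extract_domain(e) for e in emails]
    # Step 2: count how often each domain occurs (like GroupBy + Count).
    counts = Counter(d for d in domains if d is not None)
    # Step 3: the domains to exclude are those at or above the threshold.
    excluded = {d for d, n in counts.items() if n >= threshold}
    # Step 4: keep only rows whose domain is not in the excluded set.
    # Rows without a recognisable domain are kept unchanged.
    return [e for e, d in zip(emails, domains) if d not in excluded]


kept = drop_frequent_domains(emails)
print(kept)
```

With this sample data, the three bbc.co.uk addresses are dropped, while both website.com addresses survive because that domain occurs only twice.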

Hope this helps!

Br,
Ivan

mlauber71 · May 14, 2020, 8:18pm

You could use a SQL function with row numbers and keep only two rows per domain. The mentioned example is from a big data workflow, but it seems H2 also allows the row_number() function *1).

*1) http://www.h2database.com/html/changelog.html
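
As a rough illustration of the row-number idea, here is a minimal sketch that runs the query from Python against an in-memory SQLite database. SQLite is used here only as a stand-in for H2 (it needs SQLite 3.25 or newer for window functions), and the table name, column names and sample rows are assumptions for the example. Note that this keeps up to two emails per domain, which is a different rule from deleting every domain that occurs three or more times.

```python
import sqlite3

# Hypothetical sample rows: (email, domain). The domain column is assumed
# to have been extracted already.
rows = [
    ("info@website.com", "website.com"),
    ("contact@website.com", "website.com"),
    ("news@bbc.co.uk", "bbc.co.uk"),
    ("sport@bbc.co.uk", "bbc.co.uk"),
    ("help@bbc.co.uk", "bbc.co.uk"),
    ("hello@smallshop.org", "smallshop.org"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (email TEXT, domain TEXT)")
conn.executemany("INSERT INTO emails VALUES (?, ?)", rows)

# Number the rows within each domain, then keep only the first two.
query = """
SELECT email, domain
FROM (
    SELECT email, domain,
           ROW_NUMBER() OVER (PARTITION BY domain ORDER BY email) AS rn
    FROM emails
) AS numbered
WHERE rn <= 2
ORDER BY domain, email
"""
kept_two = conn.execute(query).fetchall()

for email, domain in kept_two:
    print(domain, email)
```

Which two rows survive per domain depends on the ORDER BY inside the window; ordering alphabetically by email is an arbitrary choice made for this sketch.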

Wilko1806 · May 16, 2020, 1:28pm

I am very inexperienced with KNIME. I don’t know what to put in as an expression for the String Manipulation node.
I don’t understand how to use GroupBy to count the domains. I went to Manual Aggregation, entered Domain and used the Count aggregation.
I don’t know how to keep the domains I want.
I can’t get to the Reference Row Filter point.

Is there anywhere I can learn KNIME, such as free online lessons?

Wilko1806 · May 16, 2020, 1:30pm

I’m not sure which SQL function to use.

ipazin · May 18, 2020, 2:34pm

Hi there @Wilko1806,

Sure, there are ways to learn KNIME better! Check this topic for more info: How to learn more ?
Also this one can help: Update Database output

Now to your use case. For better understanding I have created an example workflow, which is attached. Check it out, and if you have any questions feel free to ask. If you don’t know regex, you may want to check with an expert whether the expression covers all cases.

2020_05_18_Delete_Duplicates_3_or_more.knwf (17.1 KB)

Br,
Ivan

Wilko1806 · July 1, 2020, 3:21pm

I never replied because I sort of gave up. I just used your workflow, something clicked in my head, and I got it to work for mine. Thanks a lot for helping me, and I’m sorry I am saying this over 40 days later. I had moved past this project for a little while until it became relevant again.

mlauber71 · July 1, 2020, 3:44pm

@Wilko1806 in case you want to explore the topic of duplicates further, I wrote a condensed article and built a demo workflow:
