Delete duplicates

I’m looking to use KNIME to automatically sort through emails I have scraped from Let’s Extract Market Studio.

The objective is that if a domain occurs three times or more, all emails under that domain get deleted. It’s sort of like removing duplicates, except that some duplicates are necessary, since a website may have both info@website.com and contact@website.com, for example. What I’m trying to get rid of are the masses of emails collected when my tool lands on sites like Deliveroo, the Guardian or the BBC and scrapes every address there because the page matches a keyword I’m searching with.

Currently I’ve set this up to split my keywords three ways, as I ran multiple keywords in one search. Now, as I said, I’m looking to delete the irrelevant emails, and the easiest rule I could think of is: if a domain appears three or more times, delete all of its rows.
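
To make the rule concrete, here it is as a small pandas sketch (just the logic, not a KNIME workflow; the Email column name and the sample addresses are made up):

```python
import pandas as pd

# Toy data standing in for the scraped list; the column name "Email" is an assumption.
emails = pd.DataFrame({"Email": [
    "info@smallshop.com", "contact@smallshop.com",          # 2 hits -> keep both
    "news@bbc.co.uk", "sport@bbc.co.uk", "jobs@bbc.co.uk",  # 3 hits -> drop all
]})

# Extract the domain (everything after the @) and count how often each one occurs.
emails["Domain"] = emails["Email"].str.split("@").str[-1].str.lower()
counts = emails["Domain"].value_counts()

# Keep only rows whose domain appears fewer than three times.
keep = emails[emails["Domain"].map(counts) < 3]
print(keep)
```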

Hi there @Wilko1806,

welcome to the KNIME Community Forum!

Well, it seems to me this could be one way:

Hope this helps!

Br,
Ivan

You could use a SQL window function with row numbers and keep only two rows per domain. The example I have in mind is from a big data workflow, but it seems H2 also allows the use of the

row_number()

function *1).

*1)
http://www.h2database.com/html/changelog.html
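
As a sketch of that keep-at-most-two-per-domain idea in Python rather than SQL, with pandas’ cumcount() playing the role of row_number() over a domain partition (column names and sample data are assumptions):

```python
import pandas as pd

emails = pd.DataFrame({"Email": [
    "info@smallshop.com", "contact@smallshop.com",
    "news@bbc.co.uk", "sport@bbc.co.uk", "jobs@bbc.co.uk",
]})
emails["Domain"] = emails["Email"].str.split("@").str[-1].str.lower()

# cumcount() numbers the rows within each domain starting at 0,
# the same role row_number() plays per partition in SQL.
emails["rn"] = emails.groupby("Domain").cumcount()

# Keep at most the first two rows of every domain.
at_most_two = emails[emails["rn"] < 2].drop(columns="rn")
print(at_most_two)
```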

I am very inexperienced with KNIME. I don’t know what to put in as the expression for the String Manipulation node.
I don’t understand how to use GroupBy to count domains. I went to Manual Aggregation, added Domain and used the Count aggregation.
I don’t know how to keep the domains I want.
I can’t get to the Reference Row Filter step.

Is there anywhere I can learn KNIME, like free online lessons?

I’m not sure which SQL function to use.

Hi there @Wilko1806,

sure, there are ways to learn KNIME better! Check this topic for more info: How to learn more ?
This one can also help: Update Database output

Now to your use case. For better understanding I have created an example workflow, which is attached. Check it out, and if you have any questions, feel free to ask. If you don’t know regex, you may want to check with an expert whether it covers all cases :wink:

2020_05_18_Delete_Duplicates_3_or_more.knwf (17.1 KB)
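
For anyone reading along without the attachment, the steps discussed in this thread (extract the domain with a regex in String Manipulation, count per domain with GroupBy, keep domains with fewer than three hits, then filter the original table as a Reference Row Filter would) look roughly like this as a pandas sketch; the regex and column names are illustrative assumptions, not taken from the workflow:

```python
import pandas as pd

# Illustrative input; in the real workflow this would be the scraped email table.
emails = pd.DataFrame({"Email": [
    "info@smallshop.com", "contact@smallshop.com",
    "news@bbc.co.uk", "sport@bbc.co.uk", "jobs@bbc.co.uk",
]})

# Step 1 (String Manipulation): pull the domain out of the address with a regex.
# This simple pattern is an assumption -- it just grabs everything after the "@".
emails["Domain"] = emails["Email"].str.extract(r"@(.+)$")[0].str.lower()

# Step 2 (GroupBy, Count aggregation): how many rows does each domain have?
counts = emails.groupby("Domain").size().rename("Count").reset_index()

# Step 3 (Row Filter): keep only domains that occur fewer than three times.
good_domains = counts.loc[counts["Count"] < 3, "Domain"]

# Step 4 (Reference Row Filter): keep the original rows whose domain is in that list.
result = emails[emails["Domain"].isin(good_domains)]
print(result)
```

Each commented step roughly corresponds to one of the nodes mentioned above, so the sketch can be read as a map of the node chain rather than a replacement for it.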

Br,
Ivan

I never replied because I sort of gave up. I just used your workflow, something clicked in my head, and I got it to work for my case. Thanks a lot for helping me, and I’m sorry I’m saying this over 40 days later; I moved past this project for a little while until it became relevant again.

@Wilko1806 in case you want to explore the topic of duplicates further, I wrote a condensed article and built a demo workflow:
