I’m looking to use KNIME to automatically sort through emails I have scraped from Let’s Extract Market Studio.
The objective would be that if the Domain occurs three times or more then delete all under that domain. Sort of like deleting duplicates however some duplicates are necessary as some websites have info@website.com and then contact@website.com for example. The things that I am trying to get rid of are the masses of website emails collected when my tool goes onto Deliveroo or the Guardian or BBC and it collects all the emails from there as they use a keyword I’m searching with.
Currently I’ve put this in place to split my keywords three ways as I did multiple keywords in one search. Now as I said, I am looking to delete irrelevant emails (which in my head seemed easiest way to do it: with the domain => 3 = delete all).
You could use a SQL function with row numbers and only keep two. The mentioned example is from a big data workflow. But it seems H2 is also allowing to use
I am very inexperienced with KNIME. I don’t know what to put in as an expression for the String Manipulation Node.
I don’t understand how to use GroupBy to count domain. I went to Manual Aggregation and entered Domain and used the Count Aggregation.
I don’t know how to keep the domains I want.
I can’t get to the reference row filter point.
is there anywhere I can learn KNIME as like free online lessons?
Now to your use case. For better understanding I have created an example workflow which is attached. Check it out and if any questions feel free to ask. If you don’t know regex wanna check with some expert does it covers all cases
I never replied because I sort of gave up. I just used your Workflow and something just clicked in my head and I got it to work for mine. Thanks a lot for helping me and I’m sorry I am saying this over 40 days later I just moved past this project for a little while until it became relevant again.