Comparing text data in two columns in separate tables to see if one table contains a string from the other

Hi everyone, and thank you for your help.

I've been working on a project that reviews web logs from a variety of sources and compares the web traffic to a known blacklist of websites. What I am trying to do is as follows:

1. Take weblog data of visited websites (done - csv is imported with the urls in a column)

2. Import a blacklist from a csv (also done).

3. For each website visited in the weblog (these are all stored in one column), see if it has a corresponding entry in the blacklist (also stored in one column). This is where I'm stuck. A few details about this: the blacklist contains parent domains only, so it is common for a visited website to fall within a parent domain on the blacklist while containing more characters than just the parent domain (for example, the parent domain may be in the blacklist, but the visited URL may include a subdomain or path on top of it). Thus, I need to see if an entry in the weblog contains any of the entries in the blacklist, even though it will not necessarily match any blacklist entry exactly.

4. Once an entry is matched, flag it with a 1 in a new column on the weblog csv (I assume this is trivial, but I've spent so much time working on step 3 that I haven't gotten to this point yet).
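For intuition only, the whole four-step pipeline could be sketched in plain Python; the file and column names here are invented for illustration:

```python
import csv

def load_column(path, column):
    """Steps 1 and 2: read one column out of a CSV file."""
    with open(path, newline="") as f:
        return [row[column] for row in csv.DictReader(f)]

def flag_rows(weblog_rows, blacklist):
    """Steps 3 and 4: flag each visited URL that contains any blacklisted domain."""
    domains = [d.strip().lower() for d in blacklist]
    for row in weblog_rows:
        url = row["url"].lower()
        # Substring containment, not exact equality.
        row["flag"] = 1 if any(dom in url for dom in domains) else 0
    return weblog_rows
```

With a million blacklist entries this naive any() scan is slow, but it states the matching rule precisely.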

That said, any help resolving step 3 would be greatly appreciated. I've spent hours trying to figure it out.





Hey Drew,

Interesting project! I would solve step 3 in four steps:

  1. create a new column which contains the extracted domain
  2. use the Reference Row Splitter to split your data into rows that are in the blacklist and rows that are not
  3. the former gets a constant value of true, the latter false
  4. concatenate the two tables again
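A rough Python sketch of those four steps, for intuition only; the domain extraction here is deliberately naive (last two dot-separated labels), so it would mishandle suffixes like .co.uk:

```python
from urllib.parse import urlparse

def extract_domain(url):
    """Step 1: pull an approximate parent domain out of a full URL."""
    host = urlparse(url).hostname or url   # tolerate bare hosts without a scheme
    parts = host.split(".")
    return ".".join(parts[-2:])            # naive: keep the last two labels

def split_and_flag(urls, blacklist):
    """Steps 2-4: split by blacklist membership, tag each side, recombine."""
    blacklist = set(blacklist)
    hits = [(u, True) for u in urls if extract_domain(u) in blacklist]
    misses = [(u, False) for u in urls if extract_domain(u) not in blacklist]
    return hits + misses
```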

Hope that helps.


You can create rulesets based on your blacklist of websites and then use the Rule Engine (Dictionary) node to filter your visited websites or assign numbers or do whatever you want.


Thanks guys. I don't know the Rule Engine very well, but I did make some progress using the Reference Row Splitter. I was able to strip out the protocol info (http://, https://) easily, along with a lot of other information. The problem I'm now facing is how to strip out the "www" part of the URL. Let me give an example, as that will be easier.

The blacklist presents all urls like this: "" or "". 

However, the weblog presents urls like "" or ""

I've been able to effectively remove everything to the left of the "//" (like http:// and https://) and everything to the right of the colon (:443) or second directory slash (/folder). The problem is that, while most websites use "www." after the "http://" part, not all do. Some have "www1." or "www2." or "www2000." There are a lot of these variations, more so than I feel comfortable coding manually. And some websites don't have a "www" of any form, so I can't just strip off the first set of characters before a period.

Does anyone have any advice on what to do to strip out all the varying versions of "www"? 


I am unfortunately not that great with Regex, though it sounds like this should be solvable via regex searches.

e.g. with the String Manipulation node, use regexReplace.

something like

^https?:\/\/www    (the "s?" covers both http and https), additionally, and this I am really not 100% sure:

[0-9]*\.    (i.e. ^https?:\/\/www[0-9]*\. ).

or something like that would cover "www" plus any digits up until the first dot?

(Basically removing all that, so you would end up with baddomain:123/folder)

then another regex that covers the colon and everything after it, or alternatively only the slash and everything after it (at this point a standard wild-card replace should work if you are more comfortable with that; well, I am at least)
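For illustration, that two-stage stripping could look like this in Python; the www\d* part is an assumption meant to cover www, www1, www2000 and so on, and plain domains without any www prefix:

```python
import re

def strip_to_domain(url):
    # Stage 1: drop the scheme plus any "www", "www1", "www2000"-style prefix.
    url = re.sub(r"^https?://(?:www\d*\.)?", "", url)
    # Stage 2: drop the port and path: everything from the first ":" or "/".
    return re.split(r"[:/]", url, maxsplit=1)[0]
```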


See attached a workflow demonstrating how I would filter the pages. Short explanation:

The String Manipulation node creates a list of rules out of your list of blacklisted websites. Please note that $URL$ is the column name of the table with the visited websites; you should substitute your own column name.

The Rule Engine (Dictionary) node just applies that list of rules to your list of visited websites, so you don't need any fancy RegEx or fairy dust to find forbidden URLs.
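For intuition, the generated rule list fed into Rule Engine (Dictionary) would be one line per blacklist entry, along these lines (the column name and domains are invented for illustration):

```
$URL$ LIKE "*baddomain.com*" => "bad"
$URL$ LIKE "*otherbaddomain.net*" => "bad"
```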

Let me know if you have more questions.


ImNotGoodSry - this looks like exactly what I need. It sounds like I need to research KNIME's whole philosophy of rule engines (I'd have thought to use regex as well). I'm going to start working this into my workflows now.

DocMinus - thank you as well for taking the time to help me. I appreciate it.






I've spent several hours today messing around with String Manipulation and Rule Engine (Dictionary). The problem I'm facing is that, while the workflow you offered (once customized to fit column names and such) does a great job of flagging websites from the blacklist file, it is also flagging a lot of non-blacklisted sites (like 10%-20% of all sites). I can't make heads or tails of why it is generating the false positives that it is: some of the flagged websites are not in the blacklist and don't appear to have any reference anywhere in it. There are over a million sites listed in my blacklist file, but I'm using vim to search them for matches, and I can verify that the false positives I'm reviewing are not in the list.

I'm scratching my head on this one. I can't find any pattern why some things are getting flagged incorrectly. I haven't made any meaningful changes to the script you shared. It seems logical to me, although I've tried rewriting it to include "MATCHES" and "IN" instead of "LIKE", and none of my attempts are working either (they all return no matches). 

Any help will be appreciated.




I think the blacklist might contain *s or ?s, which would explain the false positives when used with LIKE (you might want to filter them out using Row Splitter nodes). For MATCHES, you might want to escape the blacklist content to prevent wrong expansions, wrapping each entry in the Java regular expression quoting markers, like this: \QyourBlackListElement\E (you can create that with String Manipulation). Please note that the MATCHES predicate only matches whole expressions, so you might need something like this:

join(".*?", join("\\Q", $blackListColumn$, "\\E"), ".* => FALSE")
//or with one join
join(".*?\\Q", $blackListColumn$, "\\E.* => FALSE")
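Python's re module has no \Q...\E, but re.escape plays the same role, so the escaping problem can be demonstrated outside KNIME. This sketch (domain invented for illustration) shows the kind of false positive an unescaped metacharacter causes: a bare "." in a blacklist entry matches any character.

```python
import re

entry = "ab.com"                          # the "." is a regex wildcard if unescaped

raw = ".*?" + entry + ".*"
assert re.fullmatch(raw, "grabXcom.net")  # false positive: "." matched the "X"

safe = ".*?" + re.escape(entry) + ".*"    # re.escape is Python's \Q...\E equivalent
assert not re.fullmatch(safe, "grabXcom.net")
assert re.fullmatch(safe, "www.ab.com/page")
```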

Cheers, gabor

I appreciate the thought. The list doesn't contain any *'s or ?'s, but it may contain other characters messing things up... I am using the Row Splitter to get rid of quotes (there are only 4 in the list of 1.03 million, but they still screw up processing in the Rule Engine).

I'm working at it some more tonight and will report back if I find anything else out. Right now I'm trying to figure out which entries in the blacklist are causing the false positives, so I'm replacing "bad" with the URL that it matched to. The results are pretty interesting: matches to URLs that aren't in the blacklist at all. I'll update if/when I make sense of it. I'm guessing other characters are throwing it off...


The "-" character isn't the problem - I've filtered every website with a dash in it out of the blacklist and the problem persists.

I've pasted below a sample from 100 web searches. The results are generated from a String Manipulation node using the following expression: join("$ClickURL$ LIKE \"*", $URL$, "*\" => \"bad\"") . I've changed "bad" to be the $URL$ variable, so I get the URL that matches instead of just "bad" (see below) - but the rule identifies the same searches.

I'm using an AOL search corpus from 2006 as my test and validation data for this work. The first column is what the user clicked (variable $ClickURL$). The second is the URL that was matched in the blacklist ($URL$). The third is a column I made: I used vim to go through and check whether each matched URL actually exists in the blacklist or not. I do apologize for the suggestive nature of some of the searches. I wanted to clean them up, but did not, because doing so might remove information necessary to diagnose the problem. I appreciate everyone's understanding.

Clicked link | What the rule engine matched from the blacklist | Is this URL actually in the blacklist?

The URL values themselves are omitted here; the third-column verdicts, row by row, were:

No
Yes
Yes
No
No
No
No (no derivative of this URL is in the blacklist; three blogspot URLs are in the blacklist, but those contain more to the URL)
No (no derivative of this URL is in the blacklist)
Yes
No
No
No
No
No
No

Any thoughts are appreciated. I'm running a larger sample on my server tonight to see if I can find more patterns.



Hi Drew,

please, no links to adult websites :) I had to delete them, and now I have published your post.

Best, Iris 

Hi Drew,

Something like this might work better for you:

join("$ClickURL$ MATCHES /(?:.*?(?<=[./]))?\\Q", $URL$, "\\E.*/ => \"bad\"")

Take a look at it here in action:

(You can use / / instead of " " in rule expressions; note that the backslashes have to be escaped inside the String Manipulation expression, which is why \Q appears as \\Q.)

Hope this helps, gabor
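As a rough Python translation of that pattern (Python's re lacks \Q...\E, so re.escape stands in for the quoting), this sketch shows the key property: the blacklist entry must start either at the beginning of the string or right after a dot or slash, so it hits subdomains but not mid-word occurrences. The domains are invented for illustration.

```python
import re

def matches_blacklist(click_url, domain):
    # Python analogue of: (?:.*?(?<=[./]))?\Qdomain\E.*
    pattern = r"(?:.*?(?<=[./]))?" + re.escape(domain) + r".*"
    return re.fullmatch(pattern, click_url) is not None

assert matches_blacklist("http://www.bad.com/page", "bad.com")  # subdomain hit
assert not matches_blacklist("http://notbad.com/", "bad.com")   # no mid-word hit
```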



Thank you for the regex. I've only today had time to work on this project again, so forgive my delay in getting back. The "Rule Engine (Dictionary)" node does not like the use of / instead of ", so I replaced the two /s with \" and it worked fine. This is a wonderful regex and I appreciate it!

So far it is working worlds better than the previous code I had put together. It does still seem confused by periods at times, however. For example, someone will click on a URL, and it gets flagged because it matches a different domain in the blacklist. I haven't been able to figure out what in the regex is causing this behavior. Thoughts?

Regardless, I greatly appreciate your support in this endeavor. Thank you.
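One guess at the period behavior, sketched in Python with re.escape standing in for \Q...\E: the pattern anchors the left edge of the blacklist entry (the lookbehind for "." or "/") but puts no boundary on the right, so a shorter blacklisted domain can match the front of a longer, unrelated one. A hypothetical fix adds a lookahead requiring ":", "/", "." or end-of-string after the entry. The domains here are made up for illustration.

```python
import re

def flagged(click_url, domain, right_anchor=False):
    pattern = r"(?:.*?(?<=[./]))?" + re.escape(domain)
    # Optional right boundary: the entry must be followed by ":", "/", "." or the end.
    pattern += r"(?=[:/.]|$).*" if right_anchor else r".*"
    return re.fullmatch(pattern, click_url) is not None

# Without a right boundary, blacklist entry "site.co" also flags "site.com":
assert flagged("www.site.com/page", "site.co")
# With the lookahead, it no longer does, but real hits still match:
assert not flagged("www.site.com/page", "site.co", right_anchor=True)
assert flagged("www.site.co/page", "site.co", right_anchor=True)
```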