I've been working on a project that reviews web logs from a variety of sources and compares the web traffic to a known blacklist of websites. What I am trying to do is as follows:
1. Take weblog data of visited websites (done - the CSV is imported with the URLs in a column).
2. Import a blacklist from a CSV (also done).
3. For each website visited in the weblog (these are all stored in one column), see if it has a corresponding entry in the blacklist (also stored in one column). This is where I'm stuck. A few details about this: first, the blacklist contains parent domains only, so it is common for a visited website to fall within a blacklisted parent domain while containing more characters than just the parent domain (for example, facebook.com may be in the blacklist, but the visited URL may be www.facebook.com/user/login or something like that). Thus, I need to see if an entry in the weblog contains any of the entries in the blacklist, even though it will not necessarily match any entry perfectly.
4. Once an entry is matched, flag it with a 1 in a new column on the weblog CSV (I assume this is trivial, but I've spent so much time working on step 3 that I haven't gotten to this point yet).
That said, any help resolving step 3 would be greatly appreciated. I've spent hours trying to figure it out.
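To make step 3 concrete, here is roughly the logic I'm after, sketched in plain Python (the file and column names are placeholders, not my real data; the actual work is all happening in KNIME):

import csv

# Placeholder file and column names, for illustration only.
with open("blacklist.csv", newline="") as f:
    blacklist = [row["domain"].strip().lower() for row in csv.DictReader(f)]

flagged_rows = []
with open("weblog.csv", newline="") as f:
    for row in csv.DictReader(f):
        visited = row["url"].lower()
        # Steps 3 and 4: flag the row if the visited URL contains any blacklisted parent domain.
        row["blacklisted"] = 1 if any(domain in visited for domain in blacklist) else 0
        flagged_rows.append(row)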
You can create rulesets based on your blacklist of websites and then use the Rule Engine (Dictionary) node to filter your visited websites or assign numbers or do whatever you want.
Thanks guys. I don't know the Rule Engine very well, but I did make some progress using the Reference Row Splitter. I was able to strip out the protocol info (http://, https://) easily, along with a lot of other information. The problem I'm now facing is how to strip out the "www" part of the URL. Let me give an example, as it will be easier.
The blacklist presents all urls like this: "baddomain.com" or "otherdomain.net".
However, the weblog presents urls like "http://www.baddomain.com:443/folder" or "https://www1.otherdomain.net"
I've been able to effectively remove everything to the left of the "//" (like http:// and https://) and everything to the right of the colon (:443) or the slash that starts the path (/folder). The problem is that, while most websites use "www." after the "http://" part, not all do. Some have "www1." or "www2." or "www2000." There are a lot of these variations, more than I feel comfortable coding for manually. Some websites don't have a "www" of any form, so I can't just strip off the first set of characters before a period.
Does anyone have any advice on what to do to strip out all the varying versions of "www"?
A regex that covers any letter and number up until the first dot, or something like that, should do the trick.
(Basically removing all of that, so you would end up with baddomain.com:123/folder.)
Then a second regex that covers the colon and everything after it, or alternatively just the slash and everything after it (at that point a standard wildcard replace should also work, if you are more comfortable with that; I am, at least).
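If it helps, here is the same idea written out in plain Python, just to show the patterns (the exact regexes are only a suggestion; in KNIME you would put them into a String Manipulation or String Replacer node):

import re

url = "http://www1.otherdomain.net:443/folder"

host = re.sub(r"^https?://", "", url)    # drop the protocol prefix
host = re.sub(r"^www\d*\.", "", host)    # drop www., www1., www2000., and so on
host = re.sub(r"[:/].*$", "", host)      # drop the :port and everything from the first slash on
print(host)                              # -> otherdomain.net

Restricting the second pattern to "www" plus optional digits means hosts that have no "www" prefix at all are left untouched.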
See attached a workflow demonstrating how I would filter the pages. Short explanation:
The String Manipulation node creates a list of rules out of your list of blacklisted websites. Please note that $URL$ is the column name of the table with the visited websites; substitute your own column name there.
The Rule Engine (Dictionary) node just applies the list of rules to your list of visited websites. So you don't need any fancy RegEx or fairy dust to find forbidden URLs.
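To make explicit what the String Manipulation node generates, the equivalent logic in plain Python looks something like this (one rule string per blacklist row; $URL$ and the "bad" outcome are simply the names used in the example workflow):

# Each blacklist row becomes one line of input for the Rule Engine (Dictionary) node.
blacklist = ["baddomain.com", "otherdomain.net"]
rules = ['$URL$ LIKE "*{}*" => "bad"'.format(domain) for domain in blacklist]
for rule in rules:
    print(rule)
# $URL$ LIKE "*baddomain.com*" => "bad"
# $URL$ LIKE "*otherdomain.net*" => "bad"

The Rule Engine (Dictionary) node then reads those lines from the rule column and applies the first rule that matches each visited URL.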
ImNotGoodSry - this looks like exactly what I need. It looks like I need to research KNIME's whole philosophy of rule engines (I'd have thought to use regex as well). I'm going to start working this into my workflows now.
DocMinus - thank you as well for taking the time to help me; I appreciate it.
I've spent several hours today messing around with String Manipulation and Rule Engine (Dictionary). The problem I'm facing is that, while the workflow you offered (once customized to fit column names and such) does a great job of flagging websites from the blacklist file, it is also flagging a lot of non-blacklisted sites (roughly 10%-20% of all sites). I can't make heads or tails of why it is generating these false positives (some of the flagged websites are things like netgear.com, w3schools.com, and wss.collutions.com - sites that are not in the blacklist and don't appear to be referenced anywhere in it). There are over a million sites listed in my blacklist file, but I'm using vim to search it for matches, and I can verify that the false positives I'm reviewing are not in the list.
I'm scratching my head on this one. I can't find any pattern as to why some things are getting flagged incorrectly. I haven't made any meaningful changes to the script you shared. It seems logical to me, although I've tried rewriting it to use "MATCHES" and "IN" instead of "LIKE", and none of those attempts work either (they all return no matches).
I think the blacklist might contain *s or ?s, which would explain the false positives with LIKE (you might want to filter them out using a Row Splitter node). For MATCHES, you might want to escape the blacklist content to prevent unintended expansions, e.g. by wrapping each entry in the Java regular expression quoting \QyourBlackListElement\E (you can create that with String Manipulation). Please note that the MATCHES predicate only matches whole expressions, so you might need something like this:
join(".*?", join("\\Q", $blackListColumn$, "\\E"), ".* => FALSE")
//or with one join
join(".*?\\Q", $blackListColumn$, "\\E.* => FALSE")
I appreciate the thought. The list doesn't contain any *'s or ?'s, but it may contain other characters messing things up... I am using a Row Splitter to get rid of quotes (there are only 4 of them in the list of 1.03 million entries, but they still screw up processing in the Rule Engine).
I'm working at it some more tonight. Will report back if I find anything else out. I'm trying to figure out right now what entries in the blacklist are causing the false positives - so I'm replacing "Bad" with the URL that it matched to. The results are pretty interesting (things like es.com and t.com - which aren't in the blacklist). I'll update if / when I make sense of it. I'm guessing other characters are throwing it off...
UPDATE
The "-" character isn't the problem - I've filtered all websites out of the blacklist with a dash in them and the problem persists.
I've pasted below a sample from 100 web searches. The results are generated from a String Manipulation node using the following expression: join("$ClickURL$ LIKE \"*", $URL$, "*\" => \"bad\""). I've changed "bad" to be the $URL$ variable, so I get the URL that matched instead of just "bad" (see below), but the rule identifies the same searches.
I'm using an AOL search corpus from 2006 as my test and validation data for this work. The first column is what the user clicked (variable $ClickURL$). The second is the URL that was matched in the blacklist ($URL$). The third is a column I made: I used vim to go through and check whether the matched URL is actually in the blacklist. I do apologize for the suggestive nature of some of the searches - I wanted to clean them up, but doing so might remove information needed to diagnose the problem. I appreciate everyone's understanding.
Clicked link | What the rule engine matched from the blacklist | Is this URL actually in the blacklist?
http://gonyc.about.com | t.com | No
http://www.adulttoyexpress.com | adulttoyexpress.com | Yes
http://www.babeland.com | babeland.com | Yes
http://www.geocities.com | es.com | No
http://quotations.about.com | t.com | No
http://www.batchmates.com | es.com | No
http://www.hellodarling.net | hellodarling.net | No (no derivative of this URL is in the blacklist)
http://quotes4love.blogspot.com | blogspot.com | No ("blogspot.com" itself is not; three blacklist entries are blogspot URLs, but each contains more than the parent domain, e.g. badsite.blogspot.com)
http://www.dolliecrave.com | dolliecrave.com | No (no derivative of this URL is in the blacklist)
http://www.netglimse.com | netglimse.com | Yes
http://www.netgear.com | r.com | No
http://www.netgear.com | r.com | No
http://www.activisionvalue.com | ue.com | No
http://computer.howstuffworks.com | s.com | No
http://www.nycroads.com | s.com | No
http://www.awesomeguides.com | es.com | No
http://web.singnet.com.sg | t.com | No
Any thoughts are appreciated. I'm running a larger sample on my server tonight to see if I can find more patterns.
Thank you for the regex. I've only today had time to work on this project again, so forgive my delay in getting back. The "Rule Engine (Dictionary)" node does not like the use of / instead of ", so I replaced the two / with \" and it worked fine. This is a wonderful regex and I appreciate it!
So far it is working worlds better than the previous code I had put together. It does still seem confused by periods at times, however. For example, I have a few domains such as baddomain.blogspot.com. Someone will click on gooddomain.blogspot.com, and it gets flagged because it matches blogspot.com. I haven't been able to figure out what in the regex is causing this behavior yet. Thoughts?
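In case it helps to show what I mean, here is the stricter matching I think I need, sketched in plain Python (the pattern is only my guess; I have not tried translating it into the Rule Engine yet):

import re

def pattern_for(entry):
    # Match only when the blacklist entry is the whole host or a dot-separated
    # suffix of it, so baddomain.blogspot.com does not fire on gooddomain.blogspot.com,
    # while facebook.com still fires on www.facebook.com/user/login.
    return r"^https?://(?:[^/:]+\.)?" + re.escape(entry) + r"(?::\d+)?(?:/.*)?$"

print(bool(re.match(pattern_for("baddomain.blogspot.com"),
                    "http://gooddomain.blogspot.com")))             # False
print(bool(re.match(pattern_for("baddomain.blogspot.com"),
                    "http://baddomain.blogspot.com/some/post")))    # True
print(bool(re.match(pattern_for("facebook.com"),
                    "http://www.facebook.com/user/login")))         # True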
Regardless, I greatly appreciate your support in this endeavor. Thank you.