I've been working on a project that reviews web logs from a variety of sources and compares the web traffic to a known blacklist of websites. What I am trying to do is as follows:
1. Take weblog data of visited websites (done - the CSV is imported with the URLs in a column).
2. Import a blacklist from a CSV (also done).
3. For each website visited in the weblog (these are all stored in one column), see if it has a corresponding entry in the blacklist (also stored in one column). This is where I'm stuck. A few details about this: first, the blacklist contains parent domains only, so it is common for a visited website to fall within a blacklisted parent domain while containing more characters than just the parent domain (for example, facebook.com may be in the blacklist, but the visited URL may be www.facebook.com/user/login or something like that). Thus, I need to see if an entry in the weblog contains any of the entries in the blacklist, even though it will not necessarily match any entry perfectly.
4. Once an entry is matched, flag it with a 1 in a new column on the weblog CSV (I assume this is trivial, but I've spent so much time working on step 3 that I haven't gotten to this point yet).
That said, any help resolving step 3 would be greatly appreciated. I've spent hours trying to figure it out.
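To make step 3 concrete, here is roughly the logic I'm after, sketched in plain Python (the file and column names are placeholders, not my real data; the actual work is all happening in KNIME):

import csv

# Placeholder file and column names, for illustration only.
with open("blacklist.csv", newline="") as f:
    blacklist = [row["domain"].strip().lower() for row in csv.DictReader(f)]

flagged_rows = []
with open("weblog.csv", newline="") as f:
    for row in csv.DictReader(f):
        visited = row["url"].lower()
        # Steps 3 and 4: flag the row if the visited URL contains any blacklisted parent domain.
        row["blacklisted"] = 1 if any(domain in visited for domain in blacklist) else 0
        flagged_rows.append(row)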
You can create rulesets based on your blacklist of websites and then use the Rule Engine (Dictionary) node to filter your visited websites or assign numbers or do whatever you want.
Thanks guys. I don't know the Rule Engine very well, but I did make some progress using the Reference Row Splitter. I was able to strip out the protocol info (http://, https://) easily, along with a lot of other information. The problem I'm now facing is how to strip out the "www" part of the URL. Let me give an example, as it will be easier.
The blacklist presents all urls like this: "baddomain.com" or "otherdomain.net".
However, the weblog presents urls like "http://www.baddomain.com:443/folder" or "https://www1.otherdomain.net"
I've been able to effectively remove everything to the left of the "//" (like http:// and https://) and everything to the right of the colon (:443) or the slash that starts the path (/folder). The problem is that, while most websites use "www." after the "http://" part, not all do. Some have "www1." or "www2." or "www2000." There are a lot of these variations, more than I feel comfortable coding for manually. Some websites don't have a "www" of any form, so I can't just strip off the first set of characters before a period.
Does anyone have any advice on what to do to strip out all the varying versions of "www"?
A regex that covers any letter and number up until the first dot, or something like that, should do the trick.
(Basically removing all of that, so you would end up with baddomain.com:123/folder.)
Then a second regex that covers the colon and everything after it, or alternatively just the slash and everything after it (at that point a standard wildcard replace should also work, if you are more comfortable with that; I am, at least).
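If it helps, here is the same idea written out in plain Python, just to show the patterns (the exact regexes are only a suggestion; in KNIME you would put them into a String Manipulation or String Replacer node):

import re

url = "http://www1.otherdomain.net:443/folder"

host = re.sub(r"^https?://", "", url)    # drop the protocol prefix
host = re.sub(r"^www\d*\.", "", host)    # drop www., www1., www2000., and so on
host = re.sub(r"[:/].*$", "", host)      # drop the :port and everything from the first slash on
print(host)                              # -> otherdomain.net

Restricting the second pattern to "www" plus optional digits means hosts that have no "www" prefix at all are left untouched.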
See attached a workflow demonstrating how I would filter the pages. Short explanation:
The String Manipulation node creates a list of rules out of your list of blacklisted websites. Please note that $URL$ is the column name of the table with the visited websites; substitute your own column name there.
The Rule Engine (Dictionary) node just applies the list of rules to your list of visited websites. So you don't need any fancy RegEx or fairy dust to find forbidden URLs.
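To make explicit what the String Manipulation node generates, the equivalent logic in plain Python looks something like this (one rule string per blacklist row; $URL$ and the "bad" outcome are simply the names used in the example workflow):

# Each blacklist row becomes one line of input for the Rule Engine (Dictionary) node.
blacklist = ["baddomain.com", "otherdomain.net"]
rules = ['$URL$ LIKE "*{}*" => "bad"'.format(domain) for domain in blacklist]
for rule in rules:
    print(rule)
# $URL$ LIKE "*baddomain.com*" => "bad"
# $URL$ LIKE "*otherdomain.net*" => "bad"

The Rule Engine (Dictionary) node then reads those lines from the rule column and applies the first rule that matches each visited URL.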
ImNotGoodSry - this looks like exactly what I need. It looks like I need to research KNIME's whole philosophy of rule engines (I'd have thought to use regex as well). I'm going to start working this into my workflows now.
DocMinus - thank you as well for taking the time to help me; I appreciate it.
I've spent several hours today messing around with String Manipulation and Rule Engine (Dictionary). The problem I'm facing is that, while the workflow you offered (once customized to fit column names and such) does a great job of flagging websites from the blacklist file, it is also flagging a lot of non-blacklisted sites (roughly 10%-20% of all sites). I can't make heads or tails of why it is generating these false positives (some of the flagged websites are things like netgear.com, w3schools.com, and wss.collutions.com - sites that are not in the blacklist and don't appear to be referenced anywhere in it). There are over a million sites listed in my blacklist file, but I'm using vim to search it for matches, and I can verify that the false positives I'm reviewing are not in the list.
I'm scratching my head on this one. I can't find any pattern as to why some things are getting flagged incorrectly. I haven't made any meaningful changes to the script you shared. It seems logical to me, although I've tried rewriting it to use "MATCHES" and "IN" instead of "LIKE", and none of those attempts work either (they all return no matches).
I think the blacklist might contain *s or ?s, which would explain the false positives with LIKE (you might want to filter them out using a Row Splitter node). For MATCHES, you might want to escape the blacklist content to prevent unintended expansions, e.g. by wrapping each entry in the Java regular expression quoting \QyourBlackListElement\E (you can create that with String Manipulation). Please note that the MATCHES predicate only matches whole expressions, so you might need something like this:
join(".*?", join("\\Q", $blackListColumn$, "\\E"), ".* => FALSE")
//or with one join
join(".*?\\Q", $blackListColumn$, "\\E.* => FALSE")
I appreciate the thought. The list doesn't contain any *'s or ?'s, but it may contain other characters messing things up... I am using a Row Splitter to get rid of quotes (there are only 4 of them in the list of 1.03 million entries, but they still screw up processing in the Rule Engine).
I'm working at it some more tonight. Will report back if I find anything else out. I'm trying to figure out right now what entries in the blacklist are causing the false positives - so I'm replacing "Bad" with the URL that it matched to. The results are pretty interesting (things like es.com and t.com - which aren't in the blacklist). I'll update if / when I make sense of it. I'm guessing other characters are throwing it off...
UPDATE
The "-" character isn't the problem - I've filtered all websites out of the blacklist with a dash in them and the problem persists.
I've pasted below a sample from 100 web searches. The results are generated from a String Manipulation node using the following expression: join("$ClickURL$ LIKE \"*", $URL$, "*\" => \"bad\""). I've changed "bad" to be the $URL$ variable, so I get the URL that matched instead of just "bad" (see below), but the rule identifies the same searches.
I'm using an AOL search corpus from 2006 as my test and validation data for this work. The first column is what the user clicked (variable $ClickURL$). The second is the URL that was matched in the blacklist ($URL$). The third is a column I made: I used vim to go through and check whether the matched URL is actually in the blacklist. I do apologize for the suggestive nature of some of the searches - I wanted to clean them up, but doing so might remove information needed to diagnose the problem. I appreciate everyone's understanding.
Clicked link | What the rule engine matched from the blacklist | Is this URL actually in the blacklist?
http://gonyc.about.com | t.com | No
http://www.adulttoyexpress.com | adulttoyexpress.com | Yes
http://www.babeland.com | babeland.com | Yes
http://www.geocities.com | es.com | No
http://quotations.about.com | t.com | No
http://www.batchmates.com | es.com | No
http://www.hellodarling.net | hellodarling.net | No (no derivative of this URL is in the blacklist)
http://quotes4love.blogspot.com | blogspot.com | No ("blogspot.com" itself is not; three blacklist entries are blogspot URLs, but each contains more than the parent domain, e.g. badsite.blogspot.com)
http://www.dolliecrave.com | dolliecrave.com | No (no derivative of this URL is in the blacklist)
http://www.netglimse.com | netglimse.com | Yes
http://www.netgear.com | r.com | No
http://www.netgear.com | r.com | No
http://www.activisionvalue.com | ue.com | No
http://computer.howstuffworks.com | s.com | No
http://www.nycroads.com | s.com | No
http://www.awesomeguides.com | es.com | No
http://web.singnet.com.sg | t.com | No
Any thoughts are appreciated. I'm running a larger sample on my server tonight to see if I can find more patterns.
Thank you for the regex. I've only today had time to work on this project again, so forgive my delay in getting back. The "Rule Engine (Dictionary)" node does not like the use of / instead of ", so I replaced the two / with \" and it worked fine. This is a wonderful regex and I appreciate it!
So far it is working worlds better than the previous code I had put together. It does still seem confused by periods at times, however. For example, I have a few domains such as baddomain.blogspot.com. Someone will click on gooddomain.blogspot.com, and it gets flagged because it matches blogspot.com. I haven't been able to figure out what in the regex is causing this behavior yet. Thoughts?
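In case it helps to show what I mean, here is the stricter matching I think I need, sketched in plain Python (the pattern is only my guess; I have not tried translating it into the Rule Engine yet):

import re

def pattern_for(entry):
    # Match only when the blacklist entry is the whole host or a dot-separated
    # suffix of it, so baddomain.blogspot.com does not fire on gooddomain.blogspot.com,
    # while facebook.com still fires on www.facebook.com/user/login.
    return r"^https?://(?:[^/:]+\.)?" + re.escape(entry) + r"(?::\d+)?(?:/.*)?$"

print(bool(re.match(pattern_for("baddomain.blogspot.com"),
                    "http://gooddomain.blogspot.com")))             # False
print(bool(re.match(pattern_for("baddomain.blogspot.com"),
                    "http://baddomain.blogspot.com/some/post")))    # True
print(bool(re.match(pattern_for("facebook.com"),
                    "http://www.facebook.com/user/login")))         # True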
Regardless, I greatly appreciate your support in this endeavor. Thank you.