Parsing web log files in KNIME

marketingthema · March 14, 2015, 12:02am

Hello everybody,

I have a weblog file extracted from a reverse proxy web server that looks like this:

Timestamp, UserIP, Size_of_response_in_bytes, Ressource, Url_request, Http_status_code, The_value_of_request_header, %0, %0,%0

Example:

1420066889,105.157.31.108,41229,dalloz,/http/www.dalloz-bibliotheque.fr/index.php?subpage=search&q_collection=4,200,"text/html; charset=utf-8", textcolumn0, textcolumn1, textcolumn2

I used the weblog reader in KNIME to parse the file with the following structure:

%t , %h , %b , %0, "%r", %>s , %0, %0, %0, %0

But KNIME tells me that the first line does not match the pattern! The problem is that I'm not sure whether the weblog comes from an Apache server. The file uses commas as separators (like a CSV), but the KNIME node is unable to parse it!

Thank you for your help!

thor · March 14, 2015, 7:59pm

The format doesn't look like an Apache log file. You cannot read it with the weblog reader in this case. You can try to use the normal File Reader instead.
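
For readers who want to check the file outside KNIME first, here is a minimal Python sketch of the same idea: treating the log as plain comma-separated text. The column names are adapted from the header in the question; `text0` to `text2` are placeholder names, and reading the first field as Unix epoch seconds is an assumption, not something the thread confirms.

```python
import csv
from datetime import datetime, timezone

# Column names adapted from the header in the question. The last three
# names are placeholders for the free-text columns (an assumption).
COLUMNS = [
    "timestamp", "user_ip", "response_bytes", "resource", "url_request",
    "http_status", "content_type", "text0", "text1", "text2",
]


def parse_line(line):
    """Split one comma-separated log line into a dict keyed by COLUMNS."""
    # skipinitialspace drops the blank that follows some commas; the csv
    # module also keeps the quoted header value together as one field.
    row = next(csv.reader([line], skipinitialspace=True))
    if len(row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
    record = dict(zip(COLUMNS, row))
    # Assumption: the timestamp is Unix epoch seconds.
    record["timestamp"] = datetime.fromtimestamp(
        int(record["timestamp"]), tz=timezone.utc
    )
    record["response_bytes"] = int(record["response_bytes"])
    record["http_status"] = int(record["http_status"])
    return record


# The example line from the question, unchanged.
sample = (
    '1420066889,105.157.31.108,41229,dalloz,'
    '/http/www.dalloz-bibliotheque.fr/index.php?subpage=search&q_collection=4,'
    '200,"text/html; charset=utf-8", textcolumn0, textcolumn1, textcolumn2'
)

record = parse_line(sample)
print(record["user_ip"], record["http_status"], record["url_request"])
```

If the file parses cleanly this way, the File Reader with a comma as the column delimiter should be able to read it as well.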

marketingthema · March 15, 2015, 1:17am

Thank you, Thor, for the help!

I used the File Reader, but the problem is that I want an automatic parser, because I have many records for the same URL request, where some records contain .js, .jpeg, .html, etc. I have to filter the different elements that make up a visited web page, and sometimes it's hard to know which element is part of a page and which is not; this can produce misleading statistical results! A parser node or algorithm can generally help filter and group results by user or request.

Otherwise, is there a way or a node to convert a "non-Apache" log to the "Apache log format"?

Thx

thor · March 16, 2015, 8:13am

The WebLog reader does not perform any automatic grouping of requests either. However, if you can read it with a File Reader you can do the pre-processing with other nodes, such as String Manipulation or GroupBy.
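
As a rough illustration of what that grouping step computes, here is a small Python sketch that groups requests by user IP and counts them. It is only an analogue of a GroupBy on the user column, not KNIME code; the rows are hypothetical (the first URL is the one from the question, the others are invented for the example).

```python
from collections import defaultdict

# Hypothetical (user_ip, url_request) pairs, as they might look after the
# file has been read into columns. Only the first URL comes from the thread.
rows = [
    ("105.157.31.108",
     "/http/www.dalloz-bibliotheque.fr/index.php?subpage=search&q_collection=4"),
    ("105.157.31.108", "/http/www.dalloz-bibliotheque.fr/style.css"),
    ("192.0.2.7", "/http/www.dalloz-bibliotheque.fr/index.php"),
]


def group_by_user(rows):
    """Collect the requested URLs of each user IP, keeping input order."""
    grouped = defaultdict(list)
    for user_ip, url in rows:
        grouped[user_ip].append(url)
    return dict(grouped)


requests_per_user = group_by_user(rows)
# Number of requests per user, the equivalent of a count aggregation.
hits_per_user = {ip: len(urls) for ip, urls in requests_per_user.items()}
print(hits_per_user)
```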

marketingthema · March 25, 2015, 1:22pm

Thank you Thor,

The difficulty is parsing multiple lines and knowing what to select! I have seen the text nodes, but I still wonder whether there is a clean and reliable way to parse the data and be sure of which hits should be counted, for example with rules such as counting only html + htm + pdf + php! This is the hard part!

Anyway, I will continue searching! Thanks again!
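
The rule mentioned above (count only html, htm, pdf, and php hits) could be sketched in Python as follows. The helper name `is_page_hit` and the sample URLs other than the one from the question are hypothetical. The sketch also assumes that requests without a file extension are not counted, which may or may not be what is wanted.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions that count as a page hit, taken from the rule in the post.
COUNTED_EXTENSIONS = {".html", ".htm", ".pdf", ".php"}


def is_page_hit(url_request):
    """Return True if the request path ends in one of the counted extensions."""
    # urlsplit separates the query string, so "index.php?x=1" yields a
    # path ending in "index.php".
    path = urlsplit(url_request).path
    extension = posixpath.splitext(path)[1].lower()
    # Assumption: paths without an extension (e.g. ending in "/") are skipped.
    return extension in COUNTED_EXTENSIONS


# Hypothetical requests; only the first one appears in the thread.
urls = [
    "/http/www.dalloz-bibliotheque.fr/index.php?subpage=search&q_collection=4",
    "/http/www.dalloz-bibliotheque.fr/js/app.js",
    "/http/www.dalloz-bibliotheque.fr/img/logo.jpeg",
    "/http/www.dalloz-bibliotheque.fr/doc/guide.PDF",
]

page_hits = [url for url in urls if is_page_hit(url)]
print(len(page_hits), "of", len(urls), "requests counted as page hits")
```

In KNIME itself, a similar rule could probably be built from an extracted extension column followed by a row filter, but the exact node configuration is not covered in this thread.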