I’ve been trying to use KNIME’s text processing features for analysing log files. I’ve read through some posts, like here, and the technical report on text processing from 2015, but I can’t figure out why my TF node produces unexpected numbers…
In the most basic approach I am looking only at the message type (each row/message in the log comes with a message type, which can take one of 5 values, e.g. INFO or WARNING).
I’ve used a window loop to slice the log into 100-second windows, and the result is a table with one row per window. It has two columns: the window number and a string column containing all the types observed in that window, separated by spaces:
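Outside KNIME, the windowing step can be sketched like this (a minimal Python sketch; the tuple layout and field names are illustrative assumptions, not the actual log format):

```python
# Minimal sketch of the windowing step: group log messages into
# 100-second windows and join the message types with spaces.
# The (timestamp_seconds, msg_type) layout is an illustrative assumption.

def window_types(messages, window_size=100):
    """messages: list of (timestamp_seconds, msg_type) tuples.
    Returns {window_number: "TYPE TYPE ..."} with types joined by spaces."""
    windows = {}
    for ts, msg_type in messages:
        win = ts // window_size          # which 100-second window this falls into
        windows.setdefault(win, []).append(msg_type)
    return {win: " ".join(types) for win, types in sorted(windows.items())}

log = [(5, "INFO"), (50, "WARNING"), (120, "INFO"), (180, "SEVERE")]
print(window_types(log))
# {0: 'INFO WARNING', 1: 'INFO SEVERE'}
```

Each dictionary value then corresponds to one row of the string column described above.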
Then I’m using the Strings to Document node (see below), followed by a RowID node (to make the Window column the new RowID column) and the BoW node (on the Document column):
Because the BoW node produces a term carrying the document name for each document, I’m using a Row Filter to remove all of these rows before applying a TF node:
And now comes the funny result, for example document “Row9” contains 13 terms in total:
In my mind this should give a relative term frequency for “SEVERE” of 2/13 ≈ 0.1538. However, I’m getting 0.143, which is 2/14:
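The arithmetic behind this discrepancy can be reproduced in a few lines (the exact breakdown of the 13 terms is an illustrative assumption; only the totals matter):

```python
# Sketch of why the relative TF comes out as 2/14 instead of 2/13:
# the document title ("Row9") is parsed as an extra term, so it still
# counts toward the document's total term count, even after the
# corresponding row is filtered out of the BoW table.

def rel_tf(terms, term):
    """Relative term frequency: occurrences of `term` / total terms."""
    return terms.count(term) / len(terms)

# Illustrative 13-term document with two occurrences of "SEVERE":
doc_terms = ["INFO"] * 9 + ["WARNING"] * 2 + ["SEVERE"] * 2

print(round(rel_tf(doc_terms, "SEVERE"), 4))            # 0.1538  (2/13)

# With the title still counted as a 14th term:
doc_with_title = doc_terms + ["Row9"]
print(round(rel_tf(doc_with_title, "SEVERE"), 4))       # 0.1429  (2/14)
```

Filtering the “Row9” row from the BoW table removes it from the output, but not from the Document the TF node actually counts over.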
Is it possible that the term “Row9”, which I believe I had filtered out, is still being considered for some reason? And if so, how can I avoid it being considered, or better yet, prevent it from being generated as a term in the first place when running the BoW node?
Thanks a lot in advance!
Hi @M_Herrmann -
It’s definitely true that your Row Filter operation isn’t affecting the calculation of the term frequencies. The TF node operates on the Document itself, not on your filtered table.
To get around this, I would make the following changes in your Strings to Document node:
- Change the Title from RowID to Empty String, to avoid having “RowX” being parsed as a term
- Check the “Use sources from column” box and set it to Window (you may have to change the Window variable from an integer to string first)
Later on, after you have calculated the term frequencies, use the Document Data Extractor node on the document source to relabel your documents.
Thanks for the prompt reply! I always felt that having to filter out the “RowX” terms was a bit odd, but I never got it to work with “Empty String” as a title. Changing “Window” from integer to string was the trick that allowed me to do that!
I’m not sure, though, how to extract the original document title using the Document Data Extractor node. Using the output from the Strings to Document node as input, I was trying to regenerate the Window information by extracting “Meta info”. But all I get is a column “Meta info” with no (visible) content, and I’m not sure how to use it for joining with the output of the TF node:
What’s the trick?
Here I wouldn’t use the document title at all. If I remember right, the title is parsed by the Bag of Words Creator along with the text itself, which you probably don’t want. Instead, if you assign the window variable to the source, you can bypass that issue.
For the Document Data Extractor node, if you select Meta info, a column of type List is returned. You probably don’t want that either, so I would suggest extracting the source instead (after you assigned it as I described above, of course), which will just return a single string.
Also, using this approach, there is no need to join anything. The source actually exists as part of the Document field all along, so you end up just pulling it out and assigning it as a labeling field after the TF node.
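The idea that the source travels inside the document, so no join is needed, can be illustrated with a toy model (this is only an illustration, not KNIME’s actual Document API):

```python
# Toy model of the suggestion above: the source is stored inside the
# document object itself, so after the TF step it can simply be pulled
# out as a label, with no join against another table.
# This is an illustration, not KNIME's actual Document API.

class Doc:
    def __init__(self, text, source):
        self.text = text
        self.source = source  # e.g. the window number, stored as a string

docs = [Doc("INFO WARNING INFO", "0"), Doc("SEVERE INFO", "1")]

# After computing term frequencies, "extract the source" to relabel rows:
labels = [d.source for d in docs]
print(labels)  # ['0', '1']
```

Because the source was assigned when the documents were created, it is available at any later point in the workflow.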
I’m no longer using the title, as per your suggestion, and instead assign the “Window” variable to Meta Information / Source. What I meant was that I couldn’t figure out how to extract that Meta Information / Source (which used to be the document title before I made the changes you suggested). Sorry for not being clear…
Anyway, I was extracting the Meta info and not the Source, which is why I didn’t get the expected Window information. So problem solved, thanks a lot!