Text analysis question

I'm trying to use keyword analysis on a set of XML files as a classification exercise, but I have got as far as guesswork can take me. Any tips appreciated.

I have a table where each row represents a separate XML file (web page). Each page belongs to one of several categories (website sections), so my StringValue columns are Section and Text.

I apply Strings To Document to the 'Text' column so that I can then use various Transformation/Preprocessing nodes (BoW creator, Punctuation Erasure, etc).

So I end up with a long list of terms used across all the documents. What I need to do now is measure the occurrence of each term, but crucially I need to measure it per Section rather than across the whole set.

I'm not sure how to go about this. Once I've used the BoW creator node, my output table contains just the columns 'Terms' and 'Document', i.e. the 'Section' column is no longer shown (and I imagine I'll need it to be able to sort results by section).

Any tips gratefully received :-)

 

Hi marmot,

Parsing XML: have you already parsed the text out of the XML files and got rid of the XML tags? To do this, use the KNIME XML extension. It provides an XPath node that extracts values/fields from an XML document.
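For anyone who wants to see what that extraction step amounts to outside KNIME, here is a minimal Python sketch using the standard library. It is not what the XPath node runs internally; the sample page, the element names (page, body, p) and the helper extract_text are all made up for illustration, so adjust the path to whatever your files actually contain.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the real element names depend on your files.
SAMPLE = """<page>
  <title>Opening hours</title>
  <body><p>We are open daily.</p><p>Closed on holidays.</p></body>
</page>"""


def extract_text(xml_string, path=".//body"):
    """Return the tag-free text under every element matching `path`."""
    root = ET.fromstring(xml_string)
    chunks = []
    for element in root.findall(path):
        # itertext() yields the text of the element and all its descendants
        chunks.extend(element.itertext())
    # Collapse whitespace left over from the markup layout
    return " ".join(" ".join(chunks).split())


text = extract_text(SAMPLE)
print(text)
```

The path argument plays the role of the XPath query you would enter in the node dialog; ElementTree only supports a subset of XPath, which is enough for simple "all text under this element" cases.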

Counting words: use the TF node, which can compute either relative or absolute term frequencies. The IDF node calculates the inverse document frequencies.
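To make the two quantities concrete, here is a small Python sketch. It assumes the textbook definitions: absolute TF is the raw count of a term in a document, relative TF is that count divided by the number of terms in the document, and IDF is log(N / df). The IDF node may use a different variant of the formula, so treat the numbers as illustrative only. The documents and function names are made up.

```python
import math
from collections import Counter

# Hypothetical documents, already stripped of markup and punctuation.
docs = {"page1": "red apple red", "page2": "green apple"}


def term_frequencies(text, relative=True):
    """Count terms in one document, as raw counts or as shares of its length."""
    terms = text.lower().split()
    counts = Counter(terms)
    if not relative:
        return dict(counts)
    return {term: count / len(terms) for term, count in counts.items()}


def inverse_document_frequencies(documents):
    """Textbook IDF: log(number of documents / documents containing the term)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents.values():
        # set() so that a term is counted once per document
        doc_freq.update(set(text.lower().split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


idf = inverse_document_frequencies(docs)
tf_idf = {
    name: {term: tf * idf[term] for term, tf in term_frequencies(text).items()}
    for name, text in docs.items()
}
```

A term that appears in every document (here "apple") gets an IDF of zero under this definition, so its TF * IDF score is zero no matter how often it occurs.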

Extracting class labels (website sections): use the Strings To Document node to create the documents. In the node's dialog, specify the column containing the website sections as the Category column (Source and Category, second checkbox). This information can be extracted later on with the Document Data Extractor node; in that node's dialog, select the category information to be extracted.

Attached is a workflow that creates documents with the website section as category, counts words, computes TF * IDF values for each term, and extracts the website section information at the end.
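In case it helps to see the end result in plain code: once the section label travels with each document, counting terms per section is just a grouped count. The Python sketch below illustrates that idea only; it is not a transcript of the attached workflow, and the rows and the helper term_counts_by_section are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical (section, text) rows, one per web page.
rows = [
    ("news", "budget vote budget"),
    ("sport", "cup final"),
    ("news", "vote result"),
]


def term_counts_by_section(table):
    """Sum absolute term counts over all pages that share a section."""
    counts = defaultdict(Counter)
    for section, text in table:
        counts[section].update(text.lower().split())
    return counts


by_section = term_counts_by_section(rows)
for section, counter in sorted(by_section.items()):
    print(section, counter.most_common(3))
```

In KNIME terms, this roughly corresponds to grouping the term counts on the extracted category column.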

Cheers, Kilian