term features for Bayesian Learner

ckevinhill · January 27, 2015, 8:21pm

I have a training dataset for Search Data that looks something like:

Phrase	Value 1	Value 2	Value 3	Classification
abc	100	50	17	Brand Term
abc def	99	99	10	Category Term

There is a mix of numerical values associated with the delivery of the search phrase (cost, position, click rate, etc.), as well as the phrase itself. I have manually classified 2% of the terms as either Brand Term or Category Term and want to build a Learner to classify the other 98%.

I would like to use the terms in Phrase together with the numeric values to build a classifier that predicts the classification as accurately as possible. For instance, whenever the brand name appears in Phrase, the row is always classified as Brand Term, so this seems like a valuable feature to carry over.

My problem is that I am not sure how to convert the "Phrase" column into a feature set for the Bayesian Learner. I assume I need to create either a Bit Vector or Term Vector, or an individual column per unique term within Phrase (example below):

Phrase	Value 1	Value 2	Value 3	abc	def	Classification
abc	100	50	17	1	0	Brand Term
abc def	99	99	10	1	1	Category Term
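
To make the target layout concrete, here is a minimal sketch outside KNIME, in plain Python, of the "one column per unique term" idea. The row dictionaries, the `add_term_columns` helper, and the `term_` prefix are all hypothetical and only there for illustration; it also assumes terms are separated by whitespace and compared case-insensitively, which may not match how a real tokenizer splits phrases.

```python
# Hypothetical rows mirroring the example table above.
rows = [
    {"Phrase": "abc", "Value 1": 100, "Value 2": 50, "Value 3": 17,
     "Classification": "Brand Term"},
    {"Phrase": "abc def", "Value 1": 99, "Value 2": 99, "Value 3": 10,
     "Classification": "Category Term"},
]


def add_term_columns(rows):
    """Return (vocabulary, copies of rows with one 0/1 column per term)."""
    # Assumption: terms are whitespace-separated and case-insensitive.
    vocabulary = sorted({t for r in rows for t in r["Phrase"].lower().split()})
    expanded = []
    for r in rows:
        present = set(r["Phrase"].lower().split())
        new_row = dict(r)
        for term in vocabulary:
            # The "term_" prefix (an illustrative choice) avoids clashing
            # with existing column names such as "Phrase".
            new_row["term_" + term] = 1 if term in present else 0
        expanded.append(new_row)
    return vocabulary, expanded


vocabulary, expanded = add_term_columns(rows)
for r in expanded:
    print(r["Phrase"], [r["term_" + t] for t in vocabulary], r["Classification"])
```

Each phrase ends up with a 1 in the column of every term it contains and a 0 elsewhere, which is the same shape as the table above.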

Any good examples of doing something like this either via Text Processing nodes or other KNIME nodes?

ckevinhill · January 27, 2015, 11:44pm

This approach seems to work, but I'm happy to hear if someone knows of a better method:

Strings to Document (on phrase), set Title to ID, Content to phrase
Bag of Words Creator (on document)
Document Data Extractor (convert Title/ID back to String column)
Term to String (convert individual Terms to String column)
Rule-based Row Filter (remove "extra" rows where Title = Term)
Column Filter (to just String Title and Term columns)
Constant Value Column ( =1 for next Pivot step)
Pivot ( group = Title, pivot = Term, agg = first of constant column )
Missing Value ( = 0)
Create Bit Vector (from Int columns)
Column Filter (to just Title and Bit Vector)
Joiner (back on initial data set by Title/ID)

The result is the initial data set with an appended Bit Vector column representing the term space. The Naive Bayes Learner seems happy with this input (assuming PMML compatibility is turned off).
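
For anyone who wants to sanity-check the logic of the node sequence above, here is a rough plain-Python sketch of the same data flow: long-format (ID, term) rows, a constant 1, a pivot, zero-fill, packing into a bit string, and a join back by ID. It is not KNIME code; the `phrases` input, the IDs, and the string form of the bit vector are assumptions made for illustration, and KNIME's own Bit Vector cells may order or display bits differently.

```python
from collections import defaultdict

# Hypothetical input: (ID, phrase) pairs standing in for Title and Content.
phrases = [("row1", "abc"), ("row2", "abc def")]

# Bag of words: one (ID, term) row per term occurrence (long format).
long_rows = [(row_id, term) for row_id, text in phrases for term in text.split()]

# Constant value column plus pivot: group by ID, one column per term, value 1.
pivot = defaultdict(dict)
for row_id, term in long_rows:
    pivot[row_id][term] = 1

# Missing value = 0, then pack the columns into a bit string, using a fixed
# (sorted) term order so every row uses the same bit positions.
terms = sorted({term for _, term in long_rows})
bit_vectors = {
    row_id: "".join(str(cols.get(term, 0)) for term in terms)
    for row_id, cols in pivot.items()
}

# Joiner: attach each bit vector back to the original rows by ID; rows with
# no terms at all would get an all-zero vector.
joined = [
    (row_id, text, bit_vectors.get(row_id, "0" * len(terms)))
    for row_id, text in phrases
]

print(terms)
print(joined)
```

The fixed term order matters: the learner can only treat bit positions as features if position k means the same term in every row.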