StringToWordVector

#1

Hello.
I would like to process a table that combines numeric and text attributes. For the text attributes, I would like to a) parse them into 1-, 2-, and 3-word n-grams, b) create a new numeric column for each n-gram (header = n-gram), and c) populate each column with a 1 in every row whose text contains the n-gram and a 0 otherwise.

If you are familiar with Weka, this is what its “String To Word Vector” filter does. Unfortunately, although KNIME includes many Weka data mining algorithms, it does not provide access to Weka filters directly.

To illustrate:

Input table
Row  X1   X2   X3             Y
1    2.1  1.2  “apple sauce”  5.1
2    2.2  4.5  “apple juice”  3.6

Output table
Row  X1   X2   Y    apple  sauce  juice
1    2.1  1.2  5.1  1      1      0
2    2.2  4.5  3.6  1      0      1
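
For reference, the transformation described above can be sketched outside KNIME in plain Python. This is only an illustration under stated assumptions, not Weka's or KNIME's implementation: the helper names `ngrams` and `string_to_word_vector` are hypothetical, tokenization is a simple lowercase whitespace split, and n-gram headers are assumed not to collide with existing column names.

```python
def ngrams(text, max_n=3):
    """Return the set of 1- to max_n-word n-grams in text.

    Assumption: tokens are produced by lowercasing and splitting on whitespace.
    """
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def string_to_word_vector(rows, text_col, max_n=3):
    """Replace text_col with one 0/1 indicator column per n-gram."""
    grams_per_row = [ngrams(row[text_col], max_n) for row in rows]
    # Vocabulary: every n-gram seen in any row, sorted for a stable column order.
    vocab = sorted(set().union(*grams_per_row))
    out = []
    for row, grams in zip(rows, grams_per_row):
        # Keep all original columns except the text column.
        new_row = {k: v for k, v in row.items() if k != text_col}
        for gram in vocab:
            new_row[gram] = 1 if gram in grams else 0
        out.append(new_row)
    return out


# The input table from the post above.
table = [
    {"Row": 1, "X1": 2.1, "X2": 1.2, "X3": "apple sauce", "Y": 5.1},
    {"Row": 2, "X1": 2.2, "X2": 4.5, "X3": "apple juice", "Y": 3.6},
]

result = string_to_word_vector(table, "X3", max_n=3)
for row in result:
    print(row)
```

With `max_n=1` the indicator columns reduce to the single words shown in the output table (apple, sauce, juice); with `max_n=3` the two-word n-grams "apple sauce" and "apple juice" get columns as well.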

I would appreciate any clues you could provide me.

Bill

#2

It looks like you can split column X3 and then pivot on it. Because pivoting requires a field to count,
you may need to add one, say RowID, to count on later.
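
The split-then-pivot idea can be sketched in plain Python as follows. This is a rough illustration of the logic rather than the behavior of any particular KNIME node: the helpers `split_to_long` and `pivot_count` are hypothetical names, and splitting is assumed to be on whitespace.

```python
from collections import Counter


def split_to_long(rows, id_col, text_col):
    """Split the text column into words, giving one (row id, word) pair per word."""
    return [(row[id_col], word) for row in rows for word in row[text_col].split()]


def pivot_count(pairs):
    """Pivot (row id, word) pairs into {row id: {word: count}}.

    The row id is the field being counted; words become the pivoted columns.
    """
    counts = Counter(pairs)
    words = sorted({word for _, word in pairs})
    ids = sorted({row_id for row_id, _ in pairs})
    # Counter returns 0 for (row id, word) combinations that never occurred.
    return {row_id: {word: counts[(row_id, word)] for word in words} for row_id in ids}


# Only the row id and the text column are needed for this step.
table = [
    {"Row": 1, "X3": "apple sauce"},
    {"Row": 2, "X3": "apple juice"},
]

long_form = split_to_long(table, "Row", "X3")
pivoted = pivot_count(long_form)
print(pivoted)
```

The pivoted counts are keyed by row id, so they can be joined back to the original table on that id.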

#3

Thank you, that works: pivot the bag of words and then rejoin it to the main table.

Bill

#4

Hi @Bill_Bane -

You can do this sort of parsing in KNIME using the Textprocessing extension. Once you have converted your text to documents, there are nodes for dealing with n-grams, bags of words, document vectors, term frequencies, and so forth. Check it out; it’s a very useful extension.
