POS Tagger + Bag of Words nodes throw away important information

Hello everyone,

I'm using the POS Tagger. The input for this node is a table that consists of several StringCell columns plus the documentCell column that the POS Tagger requires. After the node runs, the output table contains only the adjusted documentCell column, so the important information in the StringCell columns is lost.

I have tried to separate these StringCell columns from the rest of the table and then join them back based on the documentCell columns. (I cannot use the row IDs, because the BoW node also throws away the StringCell columns and increases the number of rows in the table. So the only possibility seems to be joining the tables on the documentCell columns.)

But when I try to join them on the documentCell columns, the output table is empty. I suspect this is caused by the POS tags that the POS Tagger adds to the documents: the documents in the two tables are no longer equal and therefore cannot be matched.
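
To make the suspected cause concrete, here is a minimal sketch in Python with pandas rather than KNIME itself. It assumes the guess above is right, i.e. that a tagged document no longer compares equal to its untagged original. The column names and the "token/TAG" notation are made up for illustration and are not how KNIME stores documents.

```python
import pandas as pd

# Hypothetical side table: untagged documents plus a column we want to keep.
extra = pd.DataFrame(
    {
        "document": ["the cat sleeps", "dogs bark"],
        "author": ["alice", "bob"],
    }
)

# Hypothetical tagger output: same texts, but each token now carries a tag,
# so the document values differ from the untagged originals.
tagged = pd.DataFrame(
    {"document": ["the/DT cat/NN sleeps/VBZ", "dogs/NNS bark/VBP"]}
)

# An inner join on the document column finds no equal keys.
joined = tagged.merge(extra, on="document", how="inner")
print(len(joined))
```

Joining on a key that the tagger does not modify avoids this mismatch.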

Does anyone have an idea how I could keep the StringCell columns in the table?

KNIME-Learner

Hi,

yes, this is an open issue that was discussed earlier. There is a workaround for joining the additional columns back onto the BoW data table later on:

  1. Generate a unique ID for each row of the data set you are starting with and store it in an additional string column, e.g. by extracting the row ID with the String Manipulation node.
  2. Set this ID column as the category column in the Strings to Document node when creating the documents.
  3. Run the preprocessing and create the BoW.
  4. Extract the category of the documents using the Document Data Extractor.
  5. Join the additional columns based on the ID.
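
The same idea, sketched in Python with pandas instead of KNIME nodes. This is only an illustration of the join logic under stated assumptions: the column names (`doc_id`, `author`, `source`) are hypothetical, and splitting the text on whitespace is a stand-in for the real preprocessing and BoW creation.

```python
import pandas as pd

# Hypothetical input: one text column plus extra string columns to keep.
rows = pd.DataFrame(
    {
        "text": ["the cat sleeps", "dogs bark loudly"],
        "author": ["alice", "bob"],
        "source": ["blog", "forum"],
    },
    index=["Row0", "Row1"],
)

# Step 1: copy the row ID into an ordinary string column.
rows["doc_id"] = rows.index.to_list()

# Step 2: only the text and the ID travel with the "document"
# (the ID plays the role of the document category).
documents = rows[["doc_id", "text"]]

# Step 3: stand-in for preprocessing + BoW, one row per (document, term).
bow = (
    documents.assign(term=documents["text"].str.split())
    .explode("term")
    .drop(columns="text")
    .reset_index(drop=True)
)

# Steps 4 and 5: the ID is still attached to every BoW row, so the
# dropped columns can be joined back even though the row count grew.
restored = bow.merge(rows[["doc_id", "author", "source"]], on="doc_id", how="left")
print(restored)
```

Because the ID is not altered by tagging or filtering, the join still matches after the documents themselves have changed.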

Cheers, Kilian

Hello Kilian,

Thank you for your answer. I only saw it half an hour ago.

Unfortunately, I run into a problem when I follow your instructions:
I create a unique document column based on the row ID. That leaves me with two document columns, because I need a second one containing the text I want to preprocess. The POS Tagger then tells me that only one document column is allowed.

Am I doing something wrong? If so, could you please post an example workflow?

Thank you for your answer.

KNIME-Learner

All tagger nodes accept only one document column. This is a shortcoming of the tagger nodes. You need to filter out the second document column before the tagger and join it back later on.

Cheers, Kilian

Hello Kilian,

sorry, I forgot to write back. I kept the original document column in the BoW node and the Kuhlen Stemmer node. That way I was able to join the necessary information back onto the documents later, based on the original document column.

Earlier in the workflow, I routed the necessary information around the POS Tagger node and used the Column Appender node to attach it to the rest of the data again.

So I have two workarounds, but it works.

To illustrate this, I have attached a picture.

Thank you for your help.

KNIME-Learner

Hi KNIME-Learner,

thank you for sharing your solution. I appreciate it. Just one comment: you could apply the Kuhlen Stemmer directly after the Stop Word Filter instead of applying it to the BoW. Preprocessing on lists of documents is much faster than on BoWs.

Cheers, Kilian