Strings to Document - Title in full text

There is a problem with how the title is injected into the full text when creating a document from strings.
The preprocessing and transformation nodes often act on the full text, so they act on the title as well.

Examples:

- The Number Filter will remove the title if the title consists of numbers.

- The Bag of Words Creator will include the title as a term. If you choose not to have a title, the Strings to Document node will insert the Row ID (Row0 and so on) as the title, and you will get "Row" as a term.
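The two examples above can be reproduced outside KNIME with a minimal Python sketch. It assumes the full text is simply the title followed by the body, and it uses a hypothetical regex tokenizer that splits letters from digits. Neither is KNIME's actual implementation, and all function names are made up for illustration.

```python
import re
from collections import Counter

def full_text(title, body):
    # Assumption: the title is prepended to the body to form the full text.
    return f"{title} {body}".strip()

def tokenize(text):
    # Hypothetical tokenizer: runs of letters and runs of digits
    # become separate tokens, so "Row0" yields "Row" and "0".
    return re.findall(r"[A-Za-z]+|\d+", text)

def number_filter(tokens):
    # Drop tokens that consist only of digits.
    return [t for t in tokens if not t.isdigit()]

def bag_of_words(tokens):
    # Count how often each term occurs.
    return Counter(tokens)

# A numeric title disappears after the number filter.
numeric_title_tokens = number_filter(tokenize(full_text("2017", "great movie")))

# A Row ID used as title leaves "Row" behind as a term.
row_id_terms = bag_of_words(number_filter(tokenize(full_text("Row0", "great movie"))))
```

Under these assumptions, `numeric_title_tokens` contains only the body tokens, while `row_id_terms` contains `Row` next to `great` and `movie`.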

If you look at the Sentiment Classification example (08_Other_Analytics_Types/01_Text_Processing/03_Sentiment_Classification) and run it as is, you will get no titles from the Strings to Document node (even though the title option is activated), and you will get a nice ROC curve with a score of 0.94.
If you then open the configuration of the Strings to Document node, deactivate "Use title from column", and directly activate it again, you will get a ROC curve with a score of 1.0 after running the workflow. This is because the title now carries the document classification, so the model is trained on the class label itself and will always be correct.
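A toy Python sketch of why the score jumps to 1.0, assuming (as described above) that the title holds the class label and is prepended to the full text. The "model" here is a deliberately trivial lookup, not the learner used in the example workflow, and the reviews are made-up data.

```python
def make_document(title, body):
    # Assumption: the full text is the title followed by the body.
    return f"{title} {body}".strip()

def predict(text, labels):
    # Toy "model": return the first known label that appears as a token.
    tokens = text.split()
    for label in labels:
        if label in tokens:
            return label
    return None

def accuracy(documents, truth, labels):
    # Fraction of documents for which the prediction matches the true label.
    hits = sum(predict(doc, labels) == y for doc, y in zip(documents, truth))
    return hits / len(truth)

# Made-up reviews; the first element is the class label.
reviews = [
    ("POS", "a slow start but a fine film"),
    ("NEG", "a fine cast but a slow film"),
    ("POS", "fine acting and a fine script"),
    ("NEG", "slow and dull from start to end"),
]
labels = ["POS", "NEG"]
truth = [label for label, _ in reviews]

# Title = class label: the label leaks into the full text.
with_title = [make_document(label, body) for label, body in reviews]
# Empty title: the full text is the body only.
without_title = [make_document("", body) for _, body in reviews]

acc_with = accuracy(with_title, truth, labels)
acc_without = accuracy(without_title, truth, labels)
```

With the label in the title, the lookup is always right (`acc_with` is 1.0); without it, the same lookup finds nothing (`acc_without` is 0.0). A real classifier would pick up the leaked term in the same way, only less visibly.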

Is injecting the title into the full text correct, or is it a bug or a not-so-good feature?

By the way, I'm running KNIME v3.4.1.

Best regards,

Max

Hi Max,

The title is part of the full text. This is fully intentional. Having RowIds as titles, and thus as part of the full text, is not a good idea, especially if you create document vectors for classification. In the next release (6th Dec. 2017) we will add a new feature to the Strings to Document node that allows you to create documents with an empty title. You will then have the choice of using a column, the RowId, or an empty string as the title. This solves the problem of having no column to use as the title but not wanting to use the RowId either.
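The three title choices can be pictured with a small Python sketch. The names `TitleMode` and `resolve_title` are hypothetical and only illustrate the selection logic; they are not the node's actual settings or API.

```python
from enum import Enum

class TitleMode(Enum):
    # Hypothetical names for the three options described above.
    COLUMN = "column"
    ROW_ID = "row_id"
    EMPTY = "empty"

def resolve_title(mode, row_id, row, title_column=None):
    # Pick the document title according to the selected mode.
    if mode is TitleMode.COLUMN:
        return str(row[title_column])
    if mode is TitleMode.ROW_ID:
        return row_id
    # EMPTY: nothing is added to the full text.
    return ""

# Made-up input row for illustration.
row = {"headline": "Great movie", "text": "I liked it"}
titles = {mode: resolve_title(mode, "Row0", row, "headline") for mode in TitleMode}
```

Here `titles` maps the column mode to "Great movie", the RowId mode to "Row0", and the empty mode to an empty string, so only the last choice keeps the full text free of extra terms.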

Cheers, Kilian

Thanks a lot!

Looking forward to this release. It will make working with sentiment analysis less problematic, since you won't have to track what data you injected into the training data.

Thank you for a versatile application!

Max

 
