Example Workflow "Document Classification"

Hello KNIME Community,

I have a question regarding your example workflow “Document Classification”.


I didn't understand this part of the workflow. What is happening here? Could you give me a more detailed description of that?

Thank you very much.

Kind regards,
Canan

Hi Canan,

This part counts how many documents each term occurs in and then filters out terms that occur in fewer than x% of the documents.
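To make the idea concrete outside of KNIME, here is a minimal sketch in plain Python. It is not the workflow's actual implementation; the function name `filter_by_document_frequency`, the `min_fraction` parameter, and the toy documents are assumptions made up for illustration.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_fraction):
    """Return the terms that occur in at least min_fraction of the documents.

    docs is a list of documents, each given as a list of term strings
    (a hypothetical stand-in for a bag of words).
    """
    doc_freq = Counter()
    for terms in docs:
        # set() counts a term once per document: document frequency,
        # not the total number of occurrences.
        doc_freq.update(set(terms))
    min_docs = min_fraction * len(docs)
    return {term for term, n in doc_freq.items() if n >= min_docs}


# Toy data (assumed): four tiny documents.
docs = [
    ["knime", "text", "mining"],
    ["knime", "workflow"],
    ["text", "classification", "knime"],
    ["knime", "node"],
]

kept = filter_by_document_frequency(docs, 0.5)
print(sorted(kept))  # ['knime', 'text']
```

With a threshold of 50%, only "knime" (4 of 4 documents) and "text" (2 of 4) survive; every other term occurs in a single document and is dropped.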

Cheers, Kilian

Hey Kilian,

Thanks for your answer, but why do we first have to convert the documents to a “bag of words” and then use the “Term to String” node?
And is the whole process of term filtering the same as the Keygraph keyword extraction? What is the difference between them?

Another question: do you think that it is better to use the lemmatizer before the other preprocessing steps?

Thanks and kind regards,
Canan

Hi Canan,

On a bag of words (BoW) you can perform a group-by to group by terms and count the documents. Grouping on terms directly is also possible, but this will take the assigned tags into account as well. You may have identical words with different tags, which would end up in different groups. This is why we converted the terms to strings: to get rid of the tags. However, you can of course also group directly on terms.
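The difference between the two groupings can be sketched in plain Python. This only illustrates the idea and is not how KNIME represents terms internally; the `bow` rows, the tag values, and the `count_documents` helper are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical bag of words: one (document id, word, tag) row per term.
bow = [
    (1, "book", "NN"),
    (2, "book", "VB"),
    (3, "book", "NN"),
    (3, "flight", "NN"),
]


def count_documents(rows, key):
    """Group rows by key(word, tag) and count the distinct documents per group."""
    groups = defaultdict(set)
    for doc_id, word, tag in rows:
        groups[key(word, tag)].add(doc_id)
    return {k: len(doc_ids) for k, doc_ids in groups.items()}


# Grouping on the full term: the tag is part of the group key.
by_term = count_documents(bow, lambda word, tag: (word, tag))

# Grouping on the plain string: the tag is ignored.
by_string = count_documents(bow, lambda word, tag: word)

print(by_term)    # {('book', 'NN'): 2, ('book', 'VB'): 1, ('flight', 'NN'): 1}
print(by_string)  # {'book': 3, 'flight': 1}
```

Here "book" tagged as a noun and "book" tagged as a verb form separate groups when grouping on terms, but collapse into a single group with a document count of 3 once only the string is used.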

Using the lemmatizer directly after POS tagging makes sense since it relies on tagged documents.

Cheers, Kilian

Hi Kilian,

Thanks, but do you think it also makes sense to use the lemmatizer before the stemmer or the other preprocessing steps, such as punctuation erasure?

Is term filtering a part of preprocessing or transformation?

Thank you :slight_smile:
Canan

Using both the Lemmatizer and the Stemmer does not make much sense. But using the Lemmatizer before the filtering etc. is useful. Filtering is part of preprocessing.

Cheers, Kilian

Thank you so much Kilian :slight_smile: Then I will try it with both and compare the results with each other… maybe the results will change when I use the lemmatizer instead of stemming.

Hi @kilian.thiel, where can I set the x% threshold for filtering out terms? I have found nothing in the settings for this.

Thank you and regards,
Canan