Example Workflow "Document Classification"

anon33357744 · April 29, 2018, 4:38pm

Hello KNIME Community,

I have a question regarding your example workflow “Document Classification”.

I didn't understand this part of the workflow. What is happening here? Could you give me a more detailed description of it?

Thank you very much.

Kind regards,
Canan

kilian.thiel · April 30, 2018, 6:24am

Hi Canan,

This part counts how many documents each term occurs in and then filters out terms that occur in less than x% of the documents.

Cheers, Kilian
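
For readers following along outside KNIME, the counting and filtering step described above can be sketched in plain Python. This is only an illustration under assumptions, not how the workflow nodes are implemented: the function name `filter_rare_terms`, the `min_fraction` parameter (standing in for the x% threshold) and the toy documents are all made up for this example.

```python
from collections import Counter


def filter_rare_terms(docs, min_fraction):
    """Drop terms that occur in fewer than min_fraction of the documents.

    docs is a list of token lists (hypothetical input format).
    Returns the filtered documents and the document-frequency counter.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # set() so a term is counted once per document, not once per occurrence
        doc_freq.update(set(tokens))
    keep = {term for term, df in doc_freq.items() if df / n_docs >= min_fraction}
    filtered = [[t for t in tokens if t in keep] for tokens in docs]
    return filtered, doc_freq


# Toy data, invented for illustration
docs = [
    ["knime", "text", "mining"],
    ["text", "classification"],
    ["text", "mining", "workflow"],
    ["document", "classification"],
]
filtered, doc_freq = filter_rare_terms(docs, min_fraction=0.5)
print(doc_freq["text"])  # number of documents containing "text"
print(filtered)
```

With a threshold of 0.5 on these four documents, only terms that appear in at least two documents survive.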

anon33357744 · April 30, 2018, 9:44am

Hey Kilian,

thanks for your answer. But why do we first have to convert the documents to a “bag of words” and then use the “term to string”?
And is the whole term filtering process the same as the Keygraph keyword extraction? What is the difference between them?

Another question: do you think it is better to use the lemmatizer before the other preprocessing steps?

Thanks and kind regards,
Canan

kilian.thiel · April 30, 2018, 10:04am

Hi Canan,

On a bag of words (BoW) you can perform a group-by on the terms and count the documents. Grouping on terms directly is also possible, but this takes the assigned tags into account as well. You may have identical words with different tags, which results in different groups. This is why we converted the terms to strings: to get rid of the tags. Of course, you can also group directly on terms.

Using the lemmatizer directly after POS tagging makes sense since it relies on tagged documents.

Cheers, Kilian
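
The effect of the tags on grouping can be shown with a small plain-Python sketch. It assumes a bag of words represented as (document id, word, tag) rows; that representation, the helper `doc_counts` and the sample rows are hypothetical and only mimic the idea, not KNIME's internal term type.

```python
from collections import defaultdict

# Hypothetical bag of words: one row per (document id, word, POS tag)
bow = [
    (0, "book", "NN"),
    (1, "book", "VB"),
    (2, "book", "NN"),
    (2, "flight", "NN"),
]


def doc_counts(rows, key):
    """Group rows by key(word, tag) and count distinct documents per group."""
    docs_per_group = defaultdict(set)
    for doc_id, word, tag in rows:
        docs_per_group[key(word, tag)].add(doc_id)
    return {group: len(doc_ids) for group, doc_ids in docs_per_group.items()}


# Grouping on the full term (word plus tag): "book" splits into two groups
by_term = doc_counts(bow, key=lambda word, tag: (word, tag))

# Grouping on the plain string: the tag is ignored, so "book" is one group
by_string = doc_counts(bow, key=lambda word, tag: word)

print(by_term)
print(by_string)
```

Grouping on the tagged term gives "book" a document count of 2 as a noun and 1 as a verb, while grouping on the string gives a single count of 3, which is the number a document-frequency filter would normally want.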

anon33357744 · April 30, 2018, 10:16am

Hi Kilian,

thanks, but do you think it also makes sense to use the lemmatizer before the stemmer or the other preprocessing steps, like punctuation erasure etc.?

Is term filtering a part of preprocessing or transformation?

Thank you,
Canan

kilian.thiel · April 30, 2018, 11:32am

Using both the Lemmatizer and the Stemmer does not make much sense. But using the Lemmatizer before the filtering etc. is useful. Filtering is part of preprocessing.

Cheers, Kilian

anon33357744 · April 30, 2018, 11:44am

Thank you so much, Kilian. Then I will try it with both and compare the results with each other. Maybe the results will change when I use the lemmatizer instead of stemming.

anon33357744 · April 30, 2018, 11:58am

Hi @kilian.thiel, where can I define that terms should be filtered out by x%? I have found nothing in the settings in this regard.

Thank you and regards,
Canan
