Document Scrubber

sjporter · November 15, 2019, 12:02pm

Hey all, I created a Document Scrubber component. It’s similar to the Document Pre-processing component created by a KNIME team member, but it adds some additional options and supports three languages (English, German, and Spanish) for POS tagging, stopword filtering, and stemming.

The options include:

Stopword filtering
Part-of-Speech Tagging
Convert to lowercase
Number filtering (default)
Number filtering (extended)
Punctuation filtering
Diacritic mark filtering
Stemming
Minimum characters per term

It doesn’t assume that you want to do anything aside from cleaning up text, so subsequent nodes like Bag-of-Words will have to be implemented separately.
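For anyone who wants a feel for what these cleaning steps do outside of KNIME, here is a rough sketch in plain Python. It is not how the component is built (that is a KNIME workflow, not a script). The `scrub` function, its parameter names, and the tiny stopword lists are all made up for illustration, and POS tagging and stemming are left out because they need a real NLP library.

```python
import string
import unicodedata

# Tiny illustrative stopword lists (assumption: real lists are far longer).
STOPWORDS = {
    "en": {"the", "a", "an", "and", "of", "to", "is"},
    "de": {"der", "die", "das", "und", "ist"},
    "es": {"el", "la", "los", "y", "de", "es"},
}

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def scrub(text, lang="en", lowercase=True, strip_diacritics=True,
          drop_punctuation=True, drop_numbers=True,
          drop_stopwords=True, min_chars=3):
    """Clean raw text and return the surviving terms in their original order."""
    if lowercase:
        text = text.lower()
    if strip_diacritics:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    if drop_punctuation:
        text = text.translate(_PUNCT_TO_SPACE)

    terms = text.split()
    if drop_numbers:
        # Drop terms made up of digits only.
        terms = [t for t in terms if not t.isdigit()]
    if drop_stopwords:
        stop = STOPWORDS.get(lang, set())
        terms = [t for t in terms if t.lower() not in stop]
    # Minimum characters per term.
    return [t for t in terms if len(t) >= min_chars]


if __name__ == "__main__":
    print(scrub("The Café opened in 2019, and it is GREAT!"))
```

Each option is a separate switch, so you can turn off whatever you don't need. As with the component, the output is just cleaned terms; anything like Bag-of-Words would come afterwards.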

It is based on the TIKA Language Parser node, which has mixed results on some text. I may end up making the language support optional and adding a default processing path if that turns out to be a problem.
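To make the "default processing path" idea concrete, here is a minimal sketch of one way the routing could work. Everything in it is an assumption: the `choose_path` name, the idea that the detector hands back a language code plus a confidence score, and the 0.8 cutoff are all hypothetical and not taken from the component.

```python
# Languages the language-specific steps support (from the post: English, German, Spanish).
SUPPORTED = {"en", "de", "es"}


def choose_path(detected_lang, confidence, min_confidence=0.8):
    """Return a language code for the language-specific path, or None for the default path.

    Assumption: the detector yields a language code and a confidence in [0, 1].
    """
    if detected_lang in SUPPORTED and confidence >= min_confidence:
        return detected_lang
    # Unsupported language or shaky detection: skip POS tagging, stopword
    # filtering, and stemming, and run only the language-neutral steps.
    return None


if __name__ == "__main__":
    print(choose_path("de", 0.95))
    print(choose_path("fr", 0.99))
    print(choose_path("en", 0.40))
```

The point of returning `None` instead of guessing is that language-neutral steps (lowercasing, number and punctuation filtering, minimum term length) are still safe to run when the detection is unreliable.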

Check it out and let me know what you think!

ScottF · November 15, 2019, 4:42pm

Moving this topic to the main AP forum for better visibility. Thanks SJ!
