Hey all, I created a Document Scrubber component. It’s similar to the Document Pre-processing component created by a KNIME team member, but it adds some additional options and supports three languages (English, German, and Spanish) for POS tagging, stopword filtering, and stemming.
The options include:
- Stopword filtering
- Part-of-Speech Tagging
- Convert to lowercase
- Number filtering (default)
- Number filtering (extended)
- Punctuation filtering
- Diacritic mark filtering
- Stemming
- Minimum characters per term
It doesn’t assume that you want to do anything aside from cleaning up text, so subsequent nodes like Bag-of-Words will have to be implemented separately.
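For anyone who wants a feel for what these cleanup options do to a piece of text, here is a rough Python sketch. It is not how the component is built (that is all KNIME nodes); it is only an illustration using the standard library. The function name `scrub`, its parameters, the tiny stopword lists, and the order of the steps are all assumptions made for the example. POS tagging and stemming are left out because they need a language-specific library, and the number filter shown is a simple digits-only check, not either of the component's two number filtering modes.

```python
import string
import unicodedata

# Tiny illustrative stopword lists (assumed for this sketch; real lists are much longer).
STOPWORDS = {
    "en": {"the", "a", "an", "and", "of", "to", "in", "is"},
    "de": {"der", "die", "das", "und", "ist", "in", "zu"},
    "es": {"el", "la", "los", "las", "y", "de", "en", "es"},
}

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_diacritics(text):
    """Decompose characters, then drop the combining marks (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def scrub(text, lang="en", lowercase=True, drop_numbers=True,
          drop_punctuation=True, drop_diacritics=True,
          drop_stopwords=True, min_chars=3):
    """Return the cleaned tokens of `text`; each flag toggles one cleanup step."""
    if lowercase:
        text = text.lower()
    if drop_diacritics:
        text = strip_diacritics(text)
    if drop_punctuation:
        text = text.translate(PUNCT_TO_SPACE)
    tokens = text.split()
    if drop_numbers:
        # Simple filter: removes tokens made up only of digits.
        tokens = [t for t in tokens if not t.isdigit()]
    if drop_stopwords:
        stop = STOPWORDS.get(lang, set())
        tokens = [t for t in tokens if t.lower() not in stop]
    # Minimum characters per term.
    return [t for t in tokens if len(t) >= min_chars]


print(scrub("The café served 12 croissants, and 3 teas!"))
print(scrub("Der Hund ist in der Küche", lang="de"))
```

Each flag can be switched off independently, which mirrors the idea of picking only the cleanup steps you want before handing the text to whatever comes next.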
It is based on the TIKA Language Parser node, which has mixed results on some text. If that turns out to be a problem, I may end up making the language support optional and adding a default processing path.
Check it out and let me know what you think!