Document Scrubber

Hey all, I created a Document Scrubber component. It’s similar to the Document Pre-processing component created by a KNIME team member, but it adds some additional options and supports three languages (English, German, and Spanish) for POS tagging, stopword filtering, and stemming.

The options include:

  • Stopword filtering
  • Part-of-speech tagging
  • Convert to lowercase
  • Number filtering (default)
  • Number filtering (extended)
  • Punctuation filtering
  • Diacritic mark filtering
  • Stemming
  • Minimum characters per term
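
For anyone who wants a feel for what these options do outside of KNIME, here is a rough Python sketch of a few of them (lowercasing, diacritic removal, punctuation filtering, basic number filtering, stopword filtering, and minimum term length). This is not how the component is built; the function names, the tiny stopword list, and the order of the steps are all assumptions made for illustration. POS tagging and stemming are left out because they need a language-specific library.

```python
import string
import unicodedata
from typing import List, Set

# Hypothetical, deliberately tiny English stopword list (illustration only).
STOPWORDS: Set[str] = {"the", "a", "an", "and", "of", "to", "in", "is"}


def strip_diacritics(text: str) -> str:
    """Decompose characters and drop combining marks, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def scrub(text: str, min_chars: int = 3, stopwords: Set[str] = STOPWORDS) -> List[str]:
    """Apply a handful of cleaning steps and return the surviving terms."""
    # Convert to lowercase
    text = text.lower()
    # Diacritic mark filtering
    text = strip_diacritics(text)
    # Punctuation filtering: replace ASCII punctuation with spaces
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.translate(table).split()
    # Number filtering (basic): drop terms made up only of digits
    tokens = [t for t in tokens if not t.isdigit()]
    # Stopword filtering
    tokens = [t for t in tokens if t not in stopwords]
    # Minimum characters per term
    return [t for t in tokens if len(t) >= min_chars]


if __name__ == "__main__":
    print(scrub("The café sold 42 crêpes, and 3 teas in an hour!"))
    # prints ['cafe', 'sold', 'crepes', 'teas', 'hour']
```

The order of the steps matters: lowercasing before stopword filtering lets a lowercase stopword list match "The", and stripping punctuation before splitting keeps "crêpes," from surviving as a separate term.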

It doesn’t assume that you want to do anything aside from cleaning up text, so subsequent nodes like Bag-of-Words will have to be implemented separately.

Language detection is based on the TIKA Language Parser node, which gives mixed results on some text. If that turns out to be a problem, I may make the language support optional and add a default processing path.

Check it out and let me know what you think!


Moving this topic to the main AP forum for better visibility. Thanks SJ!
