Stop Word Filter treats "can't" and "Can't" differently (some entries are not filtered)

Hi, I’m using the Stop Word Filter node with a custom list to remove some verbs. The node only filters some of the entries, and I don’t understand why. I import both the .xls file (the source of the comments) and the .txt file (the stop word list) with UTF-8 encoding; I also tried US-ASCII.
Here are the outputs:

Here is a snapshot of the stop word list:

FYI, I use the same apostrophe for “Can’t” and “can’t”. Also interesting: “I’ve” and “i’ve” are not filtered in either case. Worse, “An’t” and “an’t” do get filtered! (I just copy-pasted “can’t”, deleted the “c”, and for one of them replaced “a” with “A”.)

I tried with “case sensitive” both checked and unchecked, with the same results.

Does someone have an idea why this could be happening?
Thanks

P.S. The Dictionary Filter node behaves the same way:

postStopWords.txt (72 Bytes)
Test.txt (597 Bytes)

Hi,
Use the OpenNLP WhitespaceTokenizer in the Strings To Document node, and it will work as you expect.
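
A likely reason the tokenizer matters, sketched outside KNIME in plain Python: a stop word list is compared against whole tokens, so if the tokenizer breaks a contraction into pieces, the entry "can't" never sees a matching token. The apostrophe-splitting tokenizer below is a hypothetical stand-in, not the actual tokenizer used by the node, and the function names are made up for illustration. It does not reproduce the exact pattern from the question (for example why “An’t” was filtered), only the general mechanism.

```python
import re

# Illustrative stop word list (plain ASCII apostrophes for simplicity).
STOP_WORDS = {"can't", "i've"}


def whitespace_tokenize(text):
    """Split on whitespace only, so contractions stay in one piece."""
    return text.split()


def apostrophe_splitting_tokenize(text):
    """Hypothetical tokenizer that breaks a contraction at the apostrophe."""
    return re.findall(r"[A-Za-z]+|'[A-Za-z]+", text)


def filter_stop_words(tokens, stop_words, case_sensitive=False):
    """Drop every token that exactly matches an entry in the stop word list."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]


text = "I can't say I've tried"

# Contractions survive as single tokens and match the list entries.
print(filter_stop_words(whitespace_tokenize(text), STOP_WORDS))

# Contractions are split into pieces, so no token equals "can't" or "i've".
print(filter_stop_words(apostrophe_splitting_tokenize(text), STOP_WORDS))
```

With whitespace tokenization the contractions are removed; with the splitting tokenizer the fragments are all kept, regardless of the case-sensitivity setting.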


Thanks! It works now.

