I’ve noticed a glitch in the behaviour of the Dictionary Replacer node. I’m sure it’s more general than what I will describe, but I haven’t tested other variations. In my case, I am working with Twitter data, and have tagged retweeted usernames as single terms, such as “RT @TwitterUser”. Later, I want to strip this down to just the username “@TwitterUser”. To do this, I need to create a table of replacements (e.g. “RT @TwitterUser” to “@TwitterUser”) and apply them using the Dictionary Replacer. (I would do this much more efficiently with the Replacer node, but unless something has changed, it cannot handle multipart terms, as per this thread.)
However, after this operation, I find that most of the tagged usernames now have a space between the @ and the name, as in “@ TwitterUser”. I can also see spaces appearing after underscores.
I assume this all has something to do with the underlying tokenisation model, and if I weren’t performing other analyses on the same texts, I might solve it by using the whitespace tokeniser. But surely this is a case where the tokenisation should be overridden. Is that much more difficult to achieve than it sounds?
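For anyone wanting to reproduce this, the replacement table described above can be sketched outside KNIME in a few lines of Python. This is only an illustration of the mapping I am feeding to the Dictionary Replacer, not how the node works internally; the function name `build_replacements` and the sample terms are made up for the example.

```python
import re

def build_replacements(tagged_terms):
    """Map each tagged retweet term, e.g. 'RT @TwitterUser', to its bare handle.

    Hypothetical helper: terms that do not look like 'RT @handle' are skipped.
    """
    pattern = re.compile(r"^RT\s+(@\w+)$")
    replacements = {}
    for term in tagged_terms:
        match = pattern.match(term)
        if match:
            # Keep only the '@handle' part, with no space after the '@'.
            replacements[term] = match.group(1)
    return replacements

# Sample terms, invented for illustration.
terms = ["RT @FredFlintstone", "RT @_MaryJane", "RT @megacorp", "RT @john_smith", "hello"]
table = build_replacements(terms)

for original, replacement in table.items():
    print(f"{original!r} -> {replacement!r}")
```

Each right-hand value is a single string with no internal whitespace, so any space that shows up after the replacement must be introduced somewhere downstream of this table.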
I only get this issue if I use the OpenNLP SimpleTokenizer. It works fine when using the EnglishTokenizer, PTBTokenizer or the Whitespace tokenizer. Is it the same for you?
I will create a ticket to have a deeper look into the issue with the SimpleTokenizer.
Nope, I am seeing this with the EnglishTokenizer, as per the attached workflow. Tagging_test.knwf (30.3 KB)
As you can see, it doesn’t happen in every instance. In this case, @FredFlintstone became ‘@ FredFlintstone’ and @_MaryJane became ‘@_Mary Jane’, but @megacorp and @john_smith were not changed.
You would need more data to be sure, but perhaps the capitalisation has something to do with it.
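A quick way to check the capitalisation idea against more data would be to export the expected and actual terms and tabulate them. The sketch below does this in Python for the four handles mentioned above; the `summarise` helper is hypothetical, and four examples are obviously far too few to draw a conclusion from.

```python
# Expected handle -> term actually produced, taken from the examples in this thread.
observed = {
    "@FredFlintstone": "@ FredFlintstone",
    "@_MaryJane": "@_Mary Jane",
    "@megacorp": "@megacorp",
    "@john_smith": "@john_smith",
}

def summarise(pairs):
    """Record, for each handle, whether it contains uppercase and whether it was altered."""
    rows = []
    for expected, actual in pairs.items():
        rows.append({
            "handle": expected,
            "has_uppercase": any(c.isupper() for c in expected),
            "changed": actual != expected,
        })
    return rows

summary = summarise(observed)
for row in summary:
    print(row)
```

In these four cases the handles that were altered are exactly the ones containing uppercase letters, which fits the guess but does not confirm it.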