I’ve noticed a glitch in the behaviour of the Dictionary Replacer node. It is probably more general than the case I describe here, but I haven’t tested other variations. I am working with Twitter data, and have tagged retweeted usernames as single terms, such as “RT @TwitterUser”. Later, I want to strip each of these down to just the username, “@TwitterUser”. To do this, I need to create a table of replacements (e.g. “RT @TwitterUser” to “@TwitterUser”) and apply them with the Dictionary Replacer. (I would do this much more efficiently with the Replacer node, but unless something has changed, it cannot handle multipart terms, as per this thread.)
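For anyone wanting to reproduce this, here is a minimal sketch of how such a replacement table could be generated outside KNIME. It is only an illustration: the function names, the sample terms and the tab-separated layout are my assumptions, so check the separator your Dictionary Replacer configuration actually expects.

```python
import csv
import io
import re

# Matches a tagged retweet term such as "RT @TwitterUser" and captures the username.
RT_TERM = re.compile(r"RT (@\w+)")


def build_replacements(terms):
    """Return (original, replacement) pairs for every 'RT @user' term."""
    pairs = []
    for term in terms:
        match = RT_TERM.fullmatch(term)
        if match:
            pairs.append((term, match.group(1)))
    return pairs


def to_dictionary_text(pairs, delimiter="\t"):
    """Serialise the pairs one per line; the tab separator is an assumption."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(pairs)
    return buffer.getvalue()


# Hypothetical sample terms, not taken from real data.
terms = ["RT @TwitterUser", "RT @some_user", "@NotARetweet", "hello"]
pairs = build_replacements(terms)
dictionary_text = to_dictionary_text(pairs)
print(dictionary_text, end="")
```

Terms that are not tagged retweets are simply skipped, so the same list of terms can be passed in unfiltered.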
However, after this operation, I find that most of the tagged usernames now have a space between the @ and the name, as in “@ TwitterUser”. I can also see spaces appearing after underscores.
I assume this all has something to do with the underlying tokenisation model, and if I weren’t performing other analyses on the same texts, I might solve it by using the whitespace tokeniser. But surely this is a case where the tokenisation should be overridden. Is that much more difficult to achieve than it sounds?
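To show the kind of effect I suspect, here is a toy sketch in Python. It is not KNIME's actual tokeniser: `punctuation_tokenise` is a hypothetical stand-in that breaks a string after each punctuation character, and the usernames are made up. If a replacement string were re-tokenised like this and the tokens were then joined with spaces, the result would look like what I am seeing.

```python
import re


def whitespace_tokenise(text):
    """Split on whitespace only, so '@TwitterUser' stays one token."""
    return text.split()


def punctuation_tokenise(text):
    """Toy tokeniser (an assumption, not KNIME's model): end a token after
    every punctuation character, otherwise keep runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]*[^A-Za-z0-9\s]|[A-Za-z0-9]+", text)


def rejoin(tokens):
    """Rebuild a term by joining its tokens with single spaces."""
    return " ".join(tokens)


for replacement in ["@TwitterUser", "@Twitter_User"]:
    kept = rejoin(whitespace_tokenise(replacement))
    split = rejoin(punctuation_tokenise(replacement))
    print(f"{replacement!r}: whitespace -> {kept!r}, toy tokeniser -> {split!r}")
```

With whitespace splitting the replacement survives intact, which is why switching tokeniser would presumably hide the problem rather than fix it.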