Dictionary Replacer inserts spaces into tagged terms

AngusVeitch · August 31, 2020, 12:13am

I’ve noticed a glitch in the behaviour of the Dictionary Replacer node. I suspect it’s more general than what I describe here, but I haven’t tested other variations. In my case, I am working with Twitter data and have tagged retweeted usernames as single terms, such as “RT @TwitterUser”. Later, I want to strip each of these down to just the username, “@TwitterUser”. To do this, I need to create a table of replacements (e.g. “RT @TwitterUser” to “@TwitterUser”) and apply them using the Dictionary Replacer. (I would do this much more efficiently with the Replacer node, but unless something has changed, it cannot handle multipart terms, as per this thread.)
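For illustration, a replacement table like the one described above could be generated outside the workflow roughly as follows. This is only a sketch: the function name `build_replacements`, the regular expression, and the sample terms are assumptions made for the example, and the tab-separated output is just one plausible layout for a dictionary file, not a confirmed input format for the node.

```python
import re

# Matches a whole retweet tag such as "RT @TwitterUser" and captures the handle.
# Assumption: handles consist only of letters, digits and underscores.
RETWEET_TAG = re.compile(r"^RT\s+(@\w+)$")


def build_replacements(terms):
    """Map each 'RT @user' term to its bare '@user' form (hypothetical helper)."""
    table = {}
    for term in terms:
        match = RETWEET_TAG.match(term)
        if match:
            table[term] = match.group(1)
    return table


# Sample terms invented for the example.
tagged_terms = ["RT @TwitterUser", "RT @john_smith", "@megacorp"]
replacements = build_replacements(tagged_terms)

# Print one "original<TAB>replacement" pair per line.
for original, replacement in replacements.items():
    print(f"{original}\t{replacement}")
```

Terms that are not retweet tags (such as “@megacorp” here) are simply left out of the table.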

However, after this operation, I find that most of the tagged usernames now have a space between the @ and the name, as in “@ TwitterUser”. I can also see spaces appearing after underscores.

I assume this all has something to do with the underlying tokenisation model, and if I weren’t performing other analyses on the same texts, I might solve it by using the whitespace tokeniser. But surely this is a case where the tokenisation should be overridden. Is that much more difficult to achieve than it sounds?

julian.bunzel · August 31, 2020, 12:26pm

Hey @AngusVeitch,

This sounds odd. Normally, tokenisation should leave the text as it is and should not alter the strings, for example by adding whitespace.

I will have a look at this and come back to you soon!

Cheers,
Julian

julian.bunzel · September 1, 2020, 2:45pm

Hey again,

I only get this issue if I use the OpenNLP SimpleTokenizer. It works fine when using the EnglishTokenizer, PTBTokenizer or the Whitespace tokenizer. Is it the same for you?
I will create a ticket to have a deeper look into the issue with the SimpleTokenizer.

Cheers,

Julian
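To make the symptom concrete, here is a small sketch of how a tokeniser that splits on character-class changes could introduce such spaces if the resulting tokens are later rejoined with single spaces. It is not the OpenNLP SimpleTokenizer implementation and not the node's actual code: `char_class` and `simple_tokenize` are hypothetical helpers, and the idea that the replaced term is re-tokenised and rejoined this way is an assumption used only for illustration.

```python
def char_class(ch):
    """Classify a character as letter, digit, whitespace or other."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "other"


def simple_tokenize(text):
    """Start a new token whenever the character class changes (sketch only)."""
    tokens = []
    current = ""
    current_class = None
    for ch in text:
        cls = char_class(ch)
        # A class change closes the token collected so far.
        if cls != current_class and current:
            tokens.append(current)
            current = ""
        # Whitespace separates tokens but is never part of one.
        if cls != "space":
            current += ch
        current_class = cls
    if current:
        tokens.append(current)
    return tokens


term = "@TwitterUser"
class_tokens = simple_tokenize(term)   # "@" and the name become separate tokens
whitespace_tokens = term.split()       # whitespace splitting keeps the term whole

print(" ".join(class_tokens))
print(" ".join(whitespace_tokens))
```

Under these assumptions the class-based split separates the “@” from the name, so rejoining with spaces changes the string, while whitespace splitting leaves it untouched.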

AngusVeitch · September 1, 2020, 11:07pm

Nope, I am seeing this with the EnglishTokenizer, as per the attached workflow.
Tagging_test.knwf (30.3 KB)

As you can see, it doesn’t happen in every instance. In this case, @FredFlintstone became ‘@ FredFlintstone’ and @_MaryJane became ‘@_Mary Jane’, but @megacorp and @john_smith were not changed.

You would need more data to be sure, but perhaps the capitalisation has something to do with it.

julian.bunzel · September 2, 2020, 7:34am

Okay, thank you for the workflow. I think it’s enough to determine the issue. I’ll attach it to the ticket.

Thanks again for reporting!!

Cheers,

Julian

saqib · November 19, 2020, 3:10pm

Hi @julian.bunzel. Any updates on this issue?
