Can the Replacer node handle spaces in terms?

AngusVeitch · May 14, 2020, 1:32am

I’m having some trouble with the Replacer node (Text Processing). I was hoping that I could apply this node to tagged terms that contain spaces, so that, for example, I could convert ‘Queen street’ (already tagged) to ‘Queen Street’ without having to extract a bag of words and use the Dictionary Replacer. But this doesn’t seem to work.

In this case, I tried to replace the regular expression “([A-Z][a-z]{2,}) street” with “$1 Street”, but it didn’t work (see attached workflow). On the other hand, the same approach successfully replaces ‘Queen-street’ with ‘Queen Street’.
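For reference, the pattern and replacement themselves are sound when applied to a plain string. The following is a minimal Python sketch of that check, not a model of the Replacer node: `fix_street` is a hypothetical helper name, and Python's `re` module writes the backreference as `\1` where the post uses `$1`.

```python
import re

# Pattern from the post: a capitalised word of 3+ letters followed by " street".
PATTERN = re.compile(r"([A-Z][a-z]{2,}) street")

def fix_street(text: str) -> str:
    """Capitalise 'street' when it follows a capitalised word (hypothetical helper)."""
    # \1 is Python's spelling of the $1 backreference used in the post.
    return PATTERN.sub(r"\1 Street", text)

print(fix_street("Queen street"))  # Queen Street
```

So the regex is not the problem; the question is what text the node actually applies it to.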

Is the Replacer node not designed to handle spaces in tagged terms? Or is this an unintentional limitation? I know that I could work around this by replacing the spaces with something else, but that would defeat the purpose of using the Replacer: I would have to create a bag of words and use the Dictionary Replacer, at which point I could make the necessary replacements anyway, and all of this adds processing time. I was hoping that the Replacer would be a more efficient approach.

Replacer_test.knwf (20.9 KB)

julian.bunzel · May 15, 2020, 9:40am

Hey @AngusVeitch,

You are correct. The Replacer node doesn’t work properly in this case, and I’m not sure the Dictionary Replacer would solve the issue either. The problem is that tokenization splits the text into two terms: Queen and street. After tagging, these two terms are combined into one term, which internally still holds two words. The Replacer node matches against these single words and not against the whole term, which is no problem for terms that aren’t multi-word terms. That’s why it works for Queen-street: it was always handled as just one word.
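The difference between matching per word and matching on the whole term can be illustrated with a toy model in Python. This is only a sketch under assumptions: a term is modelled as a plain list of words, and `replace_per_word` and `replace_whole_term` are hypothetical names, not KNIME's actual code.

```python
import re

PATTERN = re.compile(r"([A-Z][a-z]{2,}) street")
REPLACEMENT = r"\1 Street"

def replace_per_word(words):
    """Apply the regex to each word separately (the behaviour described above)."""
    return [PATTERN.sub(REPLACEMENT, word) for word in words]

def replace_whole_term(words):
    """Join the words first, so the space between them is visible to the regex."""
    return PATTERN.sub(REPLACEMENT, " ".join(words)).split(" ")

# Toy model of a tagged multi-word term: one term holding two words.
multi_word_term = ["Queen", "street"]

print(replace_per_word(multi_word_term))    # ['Queen', 'street']
print(replace_whole_term(multi_word_term))  # ['Queen', 'Street']
```

In the per-word version, neither "Queen" nor "street" contains a space, so a pattern that requires one can never match.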

Thank you for bringing this up. I will create a ticket, so that we can implement multi-word support for the Replacer (and probably also the Dictionary Replacer) node.

Best,

Julian

AngusVeitch · May 15, 2020, 10:47am

Thanks for that explanation. I’m often unsure about when a term is just a word and when it is more. Multi-word support in the Replacer would be great. And yes, the Dictionary Replacer on its own is no help, but it does the job if preceded by the Dictionary Tagger. I work with messy historical texts, so I often use these two nodes in combination to correct OCR errors or tokenisation problems (e.g. trailing commas or periods being attached to terms). Being able to rely on one-step regex formulas instead, or being able to skip the tagging process for multi-part terms, would be a great help.
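The kind of one-step regex cleanup meant here could look like the following in plain Python. It is a sketch only: `strip_trailing_punctuation` is a hypothetical helper, the choice of commas and periods follows the example in the post, and nothing here uses a KNIME API.

```python
import re

# One or more commas or periods at the very end of a term.
TRAILING_PUNCT = re.compile(r"[.,]+$")

def strip_trailing_punctuation(term: str) -> str:
    """Remove commas or periods left attached to the end of a term."""
    return TRAILING_PUNCT.sub("", term)

terms = ["Queen", "Street,", "Brisbane."]
print([strip_trailing_punctuation(t) for t in terms])
```

Note that a rule this blunt would also strip the period from genuine abbreviations, so a real cleanup would need exceptions.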

Cheers.

mpenalver · April 20, 2022, 9:47am

Thank you @AngusVeitch for the hint about using Dictionary Tagger before Dictionary Replacer. I was stuck after finding out that the latter doesn’t support multi-word strings. I hope it will soon.
