Dictionary Replacer won't work

My goal is to remove symbols that I identify bottom-up from the text itself (the Document type column), since the built-in nodes don’t handle all the symbols I have in the text (they work more or less for English only).
I built this chain and I’m at the point where I need to replace the symbols (with the least effort) according to two mapped columns, from Origin to Replacer. The problem is that when I run the check again via the Java snippet (I use c_PreprocessedDocument.split(""); to split the text into separate symbols, then group them to get the distinct ones), I still find the same symbols I had set to be replaced by a space.
Please share any ideas on how to do this. And should we report a bug for that node?

Hi @deicide_bg

Thanks for reaching out.
Would it be possible for you to share the workflow with a bit of data, so we can have a closer look and see if it’s a bug?

It’s a rather big one, so I would have to recreate it for you; you could just as well try creating a small one yourself. Just convert some strings to documents, then try to replace symbols based on a dictionary with two columns: one to detect the symbol and one to replace it with. If needed, I will try to provide a sample, but right now I’m trying to work around the issue.

Hi @deicide_bg,

can you please let me know what kind of symbols you are trying to replace?

  1. Are these different characters, which are part of a word/term, e.g. replace ä with ae?
  2. Or are these symbols which are surrounded by white spaces?

The Dictionary Replacer node can only replace whole terms. This means that replacing characters that are part of a term is not possible with the Dictionary Replacer node. This example workflow shows how you can replace multiple characters based on a dictionary.

In the second case, it might be enough to change the tokeniser.
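
To illustrate the difference between the two cases outside of KNIME, here is a minimal Python sketch. It is not how the Dictionary Replacer node is implemented; the `mapping` dictionary, the sample text, and both helper functions are made up for illustration, and whitespace splitting stands in for a real tokeniser.

```python
# Hypothetical two-column dictionary: Origin -> Replacer.
mapping = {"ä": "ae", "©": " ", "“": " ", "”": " "}

def replace_terms(text, mapping):
    """Term-level: only whole whitespace-delimited tokens are replaced."""
    return " ".join(mapping.get(token, token) for token in text.split())

def replace_chars(text, mapping):
    """Character-level: every occurrence is replaced, even inside a word."""
    return text.translate(str.maketrans(mapping))

sample = "“Bär” © 2020"
term_level = replace_terms(sample, mapping)   # quotes and ä inside the word survive
char_level = replace_chars(sample, mapping)   # all mapped characters are replaced
```

Term-level replacement only catches the stand-alone ©, because the quotes and the ä are attached to a word. Character-level replacement catches all of them.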

Best
Kathrin


I came up with the idea of replacing anything that is not in the alphabets (Bulgarian and English), so that I could clean the text in one go. Since RegEx does work only with a single character, the Dictionary Replacer seemed a logical approach. Yet after running it, I checked the characters again and the old unwanted ones were still there.
Finally, I went to Python to do that: I treated each document (row) as a string and then as a set of symbols, and found the unique symbols across the entire corpus. It did the job.
So the symbols are things like “ ” © and more, some of them non-UTF-8.
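
Roughly, the Python approach looks like the sketch below. This is not my exact script: the `documents` list is a made-up stand-in for the document column, and the whitelist (basic Latin letters, Cyrillic А-я, digits, whitespace) is an assumption you would adjust to your own alphabets.

```python
import re

# Hypothetical stand-in for the document column, one string per row.
documents = ["“Здравей” © world!", "Hello, свят!"]

# Collect the distinct characters used anywhere in the corpus.
unique_chars = set().union(*documents)

# Assumed whitelist: basic Latin letters, Cyrillic А-я, digits, whitespace.
allowed = re.compile(r"[A-Za-zА-Яа-я0-9\s]")
unwanted = sorted(ch for ch in unique_chars if not allowed.fullmatch(ch))

# Replace every unwanted character with a space in a single pass.
table = str.maketrans({ch: " " for ch in unwanted})
cleaned = [doc.translate(table) for doc in documents]
```

Re-running the unique-character check on `cleaned` should then show only whitelisted characters.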

Congrats, but what do you mean by “RegEx does work only with a single character”?

Yeah, sorry, I meant that the Replacer node’s option for using RegEx seems to act only on a single character (or I just couldn’t find out how to write a set of characters to be replaced with the same symbol). For example, I have 20 chars that I want to replace with a space (" "). It should read “()!@$(!&@$(!&” and replace them with " ". Well, this didn’t work.
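
For comparison, this is what I was trying to express, written as a plain Python sketch rather than as a Replacer node setting (I have not confirmed what the node itself accepts). The idea is a regex character class, `[...]`, that matches any one of the listed characters; the sample `text` is made up.

```python
import re

# The literal characters from the post; re.escape keeps them from being
# interpreted as regex syntax, and the brackets turn them into a character class.
unwanted = "()!@$(!&@$(!&"
pattern = re.compile("[" + re.escape(unwanted) + "]")

text = "price (net) is $5 & up!"
cleaned = pattern.sub(" ", text)  # each listed character becomes one space
```

Without the surrounding brackets the pattern would only match the whole sequence of characters in that exact order, which may be why a plain list of symbols appears to do nothing.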
