My goal is to remove symbols that I identify bottom-up from the text itself (a Document type column), since the built-in nodes don’t handle all the symbols I have in the text (they barely work, and for English only).
I built this chain and I’m at the point where I need to replace the symbols, with the least effort, according to two mapped columns: from Origin to Replacer. The problem is that when I run the check again via the Java Snippet (I use c_PreprocessedDocument.split(""); to get the separate symbols, then group them to see the distinct ones), I still find the same symbols that I set to be replaced by a space.
Any ideas on how to do this? And should we report a bug for that node?
Hi @deicide_bg
Thanks for reaching out.
Would it be possible for you to share the workflow with a bit of data, so we can have a closer look and see if it’s a bug?
It’s a rather big one and I would have to recreate it for you, so you might just as well create a small one yourself. Just convert some strings to documents, then try to replace symbols based on a dictionary with two columns: one to detect the symbol and one to replace it with. If needed, I will try to provide a sample, but for now I’m trying to work around the issue.
Hi @deicide_bg,
can you please let me know what kind of symbols you are trying to replace?
- Are these different characters, which are part of a word/term, e.g. replace ä with ae?
- Or are these symbols which are surrounded by white spaces?
The Dictionary Replacer node can only replace whole terms. This means that replacing characters that are only part of a term is not possible with the Dictionary Replacer node. This example workflow shows how you can replace multiple characters based on a dictionary.
In the second case, it might be enough to change the tokeniser.
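For the first case, the idea of character-level replacement driven by a two-column dictionary can be illustrated outside KNIME. The following is only a minimal Python sketch, not what the Dictionary Replacer node or the example workflow does internally; the mapping contents and the function name replace_chars are made up for illustration, with the Origin and Replacer column names borrowed from the question.

```python
# Hypothetical two-column dictionary: Origin character -> Replacer string.
mapping = {"ä": "ae", "©": " ", "“": " ", "”": " "}


def replace_chars(text, mapping):
    """Replace every Origin character, even when it sits inside a word."""
    # str.maketrans requires single-character keys; values may be longer strings.
    table = str.maketrans(mapping)
    return text.translate(table)


print(replace_chars("“Bär” © 2020", mapping))
```

Because the replacement works on characters rather than on tokenised terms, it does not depend on where the tokeniser puts the term boundaries.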
Best
Kathrin
I came up with the idea of replacing anything that is not part of the alphabets (Bulgarian and English), so that I could clean it in one pass. Since RegEx does work only with a single character, the Dictionary Replacer seemed the logical approach. Yet after running it, I checked the characters again and the old unwanted ones were still there.
Finally, I went to Python to do that, treating each document (row) as a string and then as a set of symbols, then finding the unique symbols across the entire corpus. It did the job.
The symbols are things like “ ” © and more, some of them non-UTF-8.
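The "keep only the two alphabets" idea can be expressed as a negated character class. This is a rough Python sketch rather than anything KNIME-specific. It assumes that "the alphabets" means Latin A-Z plus the Cyrillic range А-я (with Ѝ/ѝ added, since they fall outside that range) and that whitespace should be kept; digits and punctuation are replaced too, so the class would need adjusting if those should survive. The name keep_letters is hypothetical.

```python
import re

# Assumption: allowed characters are Latin letters, Cyrillic А-я, Ѝ/ѝ and whitespace.
NOT_LETTER = re.compile(r"[^A-Za-zА-Яа-яЍѝ\s]")


def keep_letters(text):
    """Replace every character outside the allowed alphabets with a space."""
    return NOT_LETTER.sub(" ", text)


print(keep_letters("“Hello”, свят! ©"))
```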
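Roughly, the Python step looks like the sketch below. The sample documents and the function name unique_symbols are made up for illustration; the actual script may differ in detail.

```python
def unique_symbols(documents):
    """Collect the distinct characters that occur anywhere in the documents."""
    seen = set()
    for doc in documents:
        seen |= set(doc)  # each document (row) viewed as a set of symbols
    return seen


# Hypothetical sample rows standing in for the real corpus.
docs = ["“Hello” ©", "Здравей!"]
symbols = unique_symbols(docs)

# Listing only non-letter, non-space symbols makes the unwanted ones easy to spot.
suspects = sorted(ch for ch in symbols if not ch.isalpha() and not ch.isspace())
print(suspects)
```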
Congrats, but what do you mean by “RegEx does work only with a single character”?
Yeah, sorry, I meant that the Replacer node’s option for using RegEx seems to act only on a single character (or I just couldn’t find how to write a list of characters to be replaced with the same symbol). I.e. I have 20 chars which I want to replace with a space (" "). It should read “()!@$(!&@$(!&” and replace each of them with " ". Well, this didn’t work.
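For what it's worth, in plain regular expressions a list of characters is normally written as a character class in square brackets, with special characters escaped. The sketch below shows that in Python; whether the Replacer node's RegEx option accepts the same kind of pattern is an assumption I have not verified, and the character list and the name blank_out are just examples.

```python
import re

# Example list of unwanted characters; re.escape guards the ones special to regex.
unwanted = "()!@$&"
pattern = re.compile("[" + re.escape(unwanted) + "]")


def blank_out(text):
    """Replace each occurrence of any unwanted character with a single space."""
    return pattern.sub(" ", text)


print(blank_out("price (USD) & tax: $5!"))
```

Without the brackets, the string of characters is matched as one literal sequence (or fails to compile because of the unescaped parentheses), which could explain why nothing was replaced.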