Remove special characters from the text

DmitryIvanov76 · August 18, 2021, 10:32pm

This never happened before, but once again I’m asking for your help with a very simple task.
I have a text parsed by the Tika Parser URL Input node. The source is a PDF file. The file itself is rather simple, but it contains lists like:

• Advocate for change and work with the company to adopt • Partnered with a local domestic violence shelter to provide • Participated in an EMEA-wide Diversity Dialogue
new policies and services.

And like Reducing our environmental footprint. [the character isn’t printed, it’s a white square]

And, of course, white question marks in diamonds.

I’d like to replace these special characters with something meaningful to process the document later with some formatting in html (make a question mark a paragraph, make lists with appropriate tags, etc.)

I was sure that I could easily achieve this with the String Manipulation node (string replacer), using the Unicode values of the appropriate characters. It didn’t work.

Then I tried a regex replace in the same node - no luck.

I had many exciting results, like words without punctuation, text without page breaks, but these special characters I can’t catch.

Please help me - I really tried hard but failed…

bruno29a · August 19, 2021, 3:25am

Hi @DmitryIvanov76 , I see you tried doing so using Unicode, but it did not work.

Have you tried using Hex instead? Hex never fails.

You can take a look at this thread where I explained and demonstrated how to do this with Hex:

Also, if you could share some sample data, that would be useful
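To illustrate why looking at the raw values helps: a character that renders as a white square or a question mark in a diamond still has a definite code point, and dumping it shows what you actually need to replace. Below is a minimal Java sketch of that idea. The class name CodePointDump and the sample string are made up for illustration; this is not the code from the linked thread.

```java
final class CodePointDump {
    // Returns one "U+XXXX" token per code point, separated by spaces.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample input: a bullet, a space, the letter A and the
        // Unicode replacement character (the "question mark in a diamond").
        System.out.println(describe("\u2022 A\uFFFD"));
        // prints: U+2022 U+0020 U+0041 U+FFFD
    }
}
```

Once the code point is known, a plain `text.replace("\u2022", "<li>")` style replacement can target it directly.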

claudeostermann · August 19, 2021, 10:02am

Hi,

Sorry for my very late reply to all the suggestions.
Indeed I tried with Hex and I was able to remove my hidden characters.

Thank you for the different hints which helped me a lot.

Best,

Claude.

bruno29a · August 19, 2021, 12:34pm

Hi @claudeostermann , I think you replied to the wrong thread

This is a thread where I referred to your thread because it looks like a similar issue, and you probably got tagged because of that. It’s the other thread that needs a reply.

DmitryIvanov76 · August 19, 2021, 2:10pm

Dear Bruno,

Thank you very much - HEX certainly should work in my case!

Can you please have a look at the attached workflow. I made it on purpose to show where I’m lost with this task.

Special_char_rsults.knwf (49.7 KB)

When I try to process the entire corpus of text, I get no result.

When I process the text sentence by sentence, there are errors, likely related to the hex-to-text node (e.g. ‘ leads to a misinterpretation of the HEX, but some errors are not so obvious).

I thought that this was because of my attempts to replace special characters instead of just removing them, but it turned out that this is not the cause.

When I process a test string, which is very simple (the last Table Creator node), the result is perfect.

If you have a minute, can you please have a look. I have to confess that alone I have no chance to succeed in this quest.

Have a great day and thank you very much for your help!

bruno29a · August 19, 2021, 10:53pm

Hi @DmitryIvanov76 , I think it’s going to be a relatively long explanation, so I’ll just say it before you go into the details: It’s FIXED!

For the sake of the discussion, I’ll be talking only about the first workflow, which has 3 parts in it. The second workflow, I did not even see it until much later lol, and we know it works.

First of all, I had a few problems with the workflow, because it was quite resource consuming, especially the 2nd and 3rd flow, more precisely in Hex to String part. I ran all 3 at the same time, but these 2 took several minutes to run, and somehow it used up a lot of memory too, so my computer became very slow. I could not do much after that. Had to close everything, and re-open. I also did not dare save the data.

I then tried again, but running only one at a time, that is 1 after the other. Unfortunately, same results, computer too slow to do anything after that. Had to close everything again, and re-open.

This time, I’d run one at a time, but resetting the rest and running only 1 part of the workflow only. Was a bit better, but still relatively slow, could not easily compare the results - and by compare I mean compare original string vs processed string, not comparing results of a flow vs another.

I managed to eventually optimize the Hex to String code using StringBuilder instead:

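// Hex string produced earlier in the workflow
String hex = c_newHex;

// StringBuilder appends in place instead of creating a new String each time
StringBuilder result = new StringBuilder();
// Each pair of hex digits is one value
for (int i = 0; i < hex.length(); i = i + 2) {
  String s = hex.substring(i, i+2);
  int n = Integer.parseInt(s, 16);
  result.append((char)n);
}
out_newresult = result.toString();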

That ran under 1 sec, and had no problem running all of the 3 at the same time and keeping all 3 in green state!

I was finally able to compare the results. When looking at the processed data, I thought it was the conversion from Hex to String that was the issue. However, I tried a few of the processed Hex data via online converters and I was getting the same results, so it looked like the conversion was working.

That meant that it was either the conversion from string to hex that was not correct or the string manipulation was breaking something when doing the replace.

I added a Rule Engine to check for any difference, and for the first workflow none of the sentences was modified. This can happen, depending on how the Sentence Extractor works. Nevertheless, the processed data was still coming out as modified. In fact, some of the hex sentences could not even be converted back to a string for some reason.

That meant that there were issues with the conversion from String to Hex, and potentially still some issues with the Hex to String conversion too, since in some cases it could not convert back from Hex to String, but online converters could.

In the end I found that the Apache Commons Codec library has the encodeHex() and decodeHex() methods in its Hex class:
https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Hex.html

I implemented the encodeHex() first, and still used my code for converting back to String. The best way to test this was to create a Test node where I would convert from String to Hex into a variable, and then convert the variable back from hex to String, and then compare the original data and the processed data. Both should be the same.

Unfortunately, the result was still not 100% the same, though it was a huge improvement over the original method. Based on what I was seeing, it looked like it was mostly a character encoding issue at that point. So I set the charset to UTF-8 using StandardCharsets.UTF_8 from the class java.nio.charset.StandardCharsets.

So, the code became:
org.apache.commons.codec.binary.Hex.encodeHex(c_result.getBytes(StandardCharsets.UTF_8))
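For readers without the commons-codec jar at hand, here is a rough JDK-only sketch of the same idea: take the UTF-8 bytes of the string first, then write each byte as two hex digits. The class and method names (Utf8HexEncode, toHex) are hypothetical, and this is a stand-in for illustration, not the code in the workflow.

```java
import java.nio.charset.StandardCharsets;

final class Utf8HexEncode {
    // Encodes the UTF-8 bytes of the text as lowercase hex, two digits per byte.
    static String toHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative byte values do not print as 8 digits.
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // "A" is one byte in UTF-8, the trademark sign (U+2122) is three.
        System.out.println(toHex("A\u2122"));
        // prints: 41e284a2
    }
}
```

The point to notice is that a single character such as “™” becomes three bytes, so the hex string no longer has one pair per character.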

Comparing with online converters, it looked like that did the trick for converting from String to Hex, but it was still not converting back properly from Hex to String. It was still a character set issue: “special” characters such as “™” were not displaying properly. I ended up using the decodeHex() method, for two reasons: (i) since I was already using the Hex class and its encodeHex(), it is better to use its decodeHex() as well, because the developers would probably make sure that each of them is the exact reverse of the other; (ii) just as I was able to use StandardCharsets.UTF_8 with getBytes() when converting to hex, decodeHex() returns a byte array, which I can then pass to the String constructor together with StandardCharsets.UTF_8.

So, I ended up re-writing both conversions.
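As an illustration of why the byte array matters, the following JDK-only sketch contrasts the two decoding approaches: collecting all bytes and decoding them as UTF-8 in one go, versus casting each hex pair to a char on its own. The names (Utf8HexDecode, fromHex, fromHexPerPair) are hypothetical, the input is assumed to have an even number of hex digits, and this is not the code from the workflow.

```java
import java.nio.charset.StandardCharsets;

final class Utf8HexDecode {
    // Turns hex pairs into bytes first, then decodes all bytes as UTF-8.
    static String fromHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Casts every hex pair to its own char, for comparison only.
    static String fromHexPerPair(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hex.length(); i += 2) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 2), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hex = "e284a2"; // UTF-8 bytes of the trademark sign (U+2122)
        System.out.println(fromHex(hex).length());        // prints: 1
        System.out.println(fromHexPerPair(hex).length()); // prints: 3
    }
}
```

For plain ASCII text both methods give the same result, which would explain why a very simple test string round-trips perfectly while real PDF text does not.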

I tested in my Test node, and it works, no difference between the original and processed data, as expected.

I ran them in the 3 workflows. In the first workflow there was no difference, as none of the sentences was changed.
In the second workflow, when I compared the original string and the processed string, I found this difference:
In 2019, approximately 8% of our suppliers 5. Do nothing – leave supplier in "non- 4. Bi-annual supplier performance
vs
In 2019, approximately 8% of our suppliers 5. Do nothing – leave supplier in<p>-non- 4. Bi-annual supplier performance

The third workflow did not have any difference either.

Here’s the workflow (I also included the apache commons codec jar file):
Special_char-Bruno.knwf (343.9 KB)

Note: I made some modifications to the Tika Parser by removing unnecessary columns - unnecessary for the troubleshooting. You can use your original Tika Parser, it will still work.

DmitryIvanov76 · August 19, 2021, 11:16pm

Hi Bruno!
I have no words to explain my gratitude! Bravo! Seriously, it is outstanding. Thank you very much!
