Remove special characters from the text

Hi @DmitryIvanov76, I think it's going to be a relatively long explanation, so I'll say it upfront before you go into the details: It's FIXED!

For the sake of the discussion, I'll be talking only about the first workflow, which has 3 parts in it. The second workflow I did not even see until much later lol, and we know it works.

First of all, I had a few problems with the workflow because it was quite resource-consuming, especially the 2nd and 3rd flows, more precisely in the Hex to String part. I ran all 3 at the same time, but these 2 took several minutes to run and somehow also used up a lot of memory, so my computer became very slow. I could not do much after that; I had to close everything and re-open. I also did not dare save the data.

I then tried again, running only one at a time, that is 1 after the other. Unfortunately, same results: the computer was too slow to do anything afterwards. I had to close everything again and re-open.

This time I ran one at a time, resetting the rest and running only 1 part of the workflow. It was a bit better, but still relatively slow, and I could not easily compare the results - and by compare I mean comparing the original string vs the processed string, not comparing the results of one flow vs another.
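For context, the slow version presumably built the result with plain String concatenation (an assumption on my part - the original snippet is not shown here). Each concatenation allocates a brand-new String and copies all previous characters, so the loop is quadratic in the length of the hex input, which would explain the multi-minute runtimes and the memory pressure:

```java
// Hypothetical pre-optimization version (assumption - the original
// code is not shown in this post). `result + (char) n` creates a
// new String and copies everything on every iteration: O(n^2).
public class SlowHexToString {
    public static String convert(String hex) {
        String result = "";
        for (int i = 0; i < hex.length(); i += 2) {
            String s = hex.substring(i, i + 2);
            int n = Integer.valueOf(s, 16);
            result = result + (char) n; // full copy on every iteration
        }
        return result;
    }
}
```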

I eventually managed to optimize the Hex to String code by using a StringBuilder instead:

String hex = c_newHex;

StringBuilder result = new StringBuilder();
// Walk the hex string two characters at a time; each pair is one byte.
for (int i = 0; i < hex.length(); i += 2) {
  String s = hex.substring(i, i + 2);
  int n = Integer.valueOf(s, 16);
  result.append((char) n);
}
out_newresult = result.toString();

That ran in under 1 sec, and I had no problem running all 3 at the same time and keeping all 3 in a green state!

I was finally able to compare the results. When looking at the processed data, I thought it was the conversion from Hex to String that was the issue. However, I tried a few of the processed Hex values in online converters and got the same results, so it looked like that conversion was working.

That meant that either the conversion from String to Hex was not correct, or the string manipulation was breaking something when doing the replace.

I added a Rule Engine to check for any difference, and for the first workflow, none of the sentences was modified. This can happen, depending on how the Sentence Extractor works. Nevertheless, the processed data was still coming out as modified. In fact, some of the hex sentences could not even be converted back to a string for some reason.

That meant there were issues with the conversion of String to Hex, and potentially still some issues with the Hex to String conversion too, since in some cases it could not convert back from Hex to String even though online converters could.
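As an illustrative guess (the post does not show the broken String to Hex code), one classic way a hand-rolled conversion produces hex that cannot be decoded back is building it from Integer.toHexString(), which drops the leading zero for values below 0x10, so the output can no longer be split into clean two-character pairs:

```java
public class HexPitfall {
    // Buggy: Integer.toHexString(10) returns "a", not "0a", so any
    // character below 0x10 corrupts the pairwise structure of the hex.
    public static String buggyToHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(Integer.toHexString(c)); // missing zero-padding
        }
        return sb.toString();
    }

    // Correct for this aspect: pad each value to two hex digits.
    public static String paddedToHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("%02x", (int) c));
        }
        return sb.toString();
    }
}
```

This is only one possible failure mode, of course; encoding mismatches (next section) are another.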

In the end I found that the Apache Commons Codec library has encodeHex() and decodeHex() methods in its Hex class:
https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Hex.html

I implemented encodeHex() first, and still used my own code for converting back to String. The best way to test this was to create a Test node where I would convert from String to Hex into a variable, then convert that variable back from Hex to String, and finally compare the original data with the processed data. Both should be identical.
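The round-trip check in the Test node boils down to: encode the original string to hex, decode it back, and compare. A stdlib-only sketch of that idea (the actual node used the commons-codec encodeHex(); method and variable names here are illustrative, not the real KNIME snippet variables):

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    // String -> Hex: encode each UTF-8 byte as exactly two hex digits.
    public static String toHex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Hex -> String: decode two digits at a time into bytes, then UTF-8.
    public static String fromHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < hex.length(); i += 2) {
            bytes[i / 2] = (byte) Integer.parseInt(hex.substring(i, i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // The Test node comparison: original and processed must match.
    public static boolean roundTripOk(String original) {
        return original.equals(fromHex(toHex(original)));
    }
}
```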

Unfortunately, the results were still not 100% identical, though it was a huge improvement over the original method. Based on what I was seeing, it looked like it was mostly a character encoding issue at that point. So I set the charset to UTF-8 using StandardCharsets.UTF_8 from java.nio.charset.StandardCharsets.

So, the code became:
org.apache.commons.codec.binary.Hex.encodeHex(c_result.getBytes(StandardCharsets.UTF_8))

Comparing with online converters, it looked like that did the trick for converting from String to Hex, but it was still not converting back properly from Hex to String. It was still a character set issue: special characters such as “™” were not displaying properly. I ended up using the decodeHex() method, since I figured that (i) if I used the Hex class’s encodeHex(), it’s better to use its decodeHex() as well, since they’d use the reverse logic - the developers would presumably make sure that each is the inverse of the other; and (ii) just as I could use StandardCharsets.UTF_8 with getBytes() when converting to hex, decodeHex() returns a byte array, which I can then pass to the String constructor along with StandardCharsets.UTF_8.

So, I ended up re-writing both conversions.
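Putting it together, the rewritten pair of conversions, using the Hex class and pinning UTF-8 on both sides, looks roughly like this (a sketch of the approach described above, not the exact snippet-node code; commons-codec must be on the classpath):

```java
import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;

public class CodecConversions {
    // String -> Hex: get the UTF-8 bytes, let commons-codec encode them.
    public static String encode(String text) {
        return new String(Hex.encodeHex(text.getBytes(StandardCharsets.UTF_8)));
    }

    // Hex -> String: decodeHex() gives back the byte[], which we
    // interpret as UTF-8 - the exact inverse of encode().
    public static String decode(String hex) {
        try {
            return new String(Hex.decodeHex(hex.toCharArray()), StandardCharsets.UTF_8);
        } catch (DecoderException e) {
            throw new IllegalArgumentException("Not a valid hex string", e);
        }
    }
}
```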

I tested it in my Test node, and it works: no difference between the original and processed data, as expected.

I ran them in the 3 flows. The first flow showed no difference, as there was no change in any of the sentences.
Second flow, when I compared the original string and processed string, I found this difference:
Second workflow, when I compare the original string and processed string, I found this difference:
In 2019, approximately 8% of our suppliers 5. Do nothing – leave supplier in "non- 4. Bi-annual supplier performance
vs
In 2019, approximately 8% of our suppliers 5. Do nothing – leave supplier in<p>-non- 4. Bi-annual supplier performance

The third flow did not show any difference either.

Here’s the workflow (I also included the apache commons codec jar file):
Special_char-Bruno.knwf (343.9 KB)

Note: I made some modifications to the Tika Parser by removing columns that were unnecessary for the troubleshooting. You can use your original Tika Parser; it will still work.
