Java Snippet: Convert Unicode to String not working all the time

Hi,

I am testing something and struggling to understand why some Unicode code points, such as “\u1E900” (which should translate to 𞤀), are converted to Ẑ0 by the mechanism below.

Test Data

|Code Point in 4-Digit Hexadecimal|Unicode Character UTF-8|Result|
|---|---|---|
|\u1E900|𞤀|Ẑ0|
|\u1E922|𞤢|Ẓ2|
|\u1E922|𞤢|Ẓ2|
|\u1E900|𞤀|Ẑ0|
|\u1E901|𞤁|Ẑ1|
|\u1E923|𞤣|Ẓ3|
|\u1E923|𞤣|Ẓ3|
|\u0050|P|P|
|\u0070|p|p|
|\u0070|p|p|
|\u0050|P|P|

Java Code

// system imports
import org.knime.base.node.jsnippet.expression.AbstractJSnippet;
import org.knime.base.node.jsnippet.expression.Abort;
import org.knime.base.node.jsnippet.expression.Cell;
import org.knime.base.node.jsnippet.expression.ColumnException;
import org.knime.base.node.jsnippet.expression.TypeException;
import static org.knime.base.node.jsnippet.expression.Type.*;
import java.util.Date;
import java.util.Calendar;
import org.w3c.dom.Document;


// Your custom imports:
import org.apache.commons.text.StringEscapeUtils;
// system variables
public class JSnippet extends AbstractJSnippet {
  // Fields for input columns
  /** Input column: "Code Point in 4-Digit Hexadecimal" */
  public String c_CodePointin4DigitHexadecimal;

  // Fields for output columns
  /** Output column: "Result" */
  public String out_Result;

// Your custom variables:

// expression start
    public void snippet() throws TypeException, ColumnException, Abort {
// Enter your code here:

out_Result = StringEscapeUtils.unescapeJava(c_CodePointin4DigitHexadecimal);


// expression end
    }
}

Many thanks in advance!
Mike
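
A likely explanation, sketched under the assumption that unescapeJava follows the Java language rule for Unicode escapes: an escape is always read as exactly four hexadecimal digits, so \u1E900 becomes \u1E90 (Ẑ) followed by a literal 0. Code points above U+FFFF, such as U+1E900, need either a surrogate pair (\uD83A\uDD00) or a direct code point conversion. The class and method names below are hypothetical, and the regex only mimics the four-digit rule instead of calling commons-text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FourDigitEscapeDemo {
    // Matches a backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Mimics the four-digit rule: any fifth hex digit is left behind as plain text.
    static String unescapeFourDigits(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            char unit = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(unit)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Converts a bare hex code point of any length (for example 1E900) to a string.
    static String fromCodePoint(String hex) {
        return new String(Character.toChars(Integer.parseInt(hex, 16)));
    }

    public static void main(String[] args) {
        System.out.println(unescapeFourDigits("\\u1E900"));       // U+1E90 followed by the digit 0
        System.out.println(unescapeFourDigits("\\uD83A\\uDD00")); // surrogate pair for U+1E900
        System.out.println(fromCodePoint("1E900"));               // U+1E900 built directly
    }
}
```

If the input column can be changed, dropping the escape prefix and passing the bare hex value to Character.toChars avoids the four-digit limit altogether.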

Adding to this, the reverse direction does not fully work either, using:
out_UTF8to4DigitHex = StringEscapeUtils.escapeJava(c_UnicodeCharacterUTF8);

|Code Point in 4-Digit Hexadecimal|Unicode Character UTF-8|UTF-8 to 4-Digit Hex|
|---|---|---|
|\u1E900|𞤀|\uD83A\uDD00|
|\u1E904|𞤄|\uD83A\uDD04|
|\u1E907|𞤇|\uD83A\uDD07|
|\u1E915|𞤕|\uD83A\uDD15|
|\u1E901|𞤁|\uD83A\uDD01|
|\u2866||\u2866|
|\u28A6||\u28A6|
|\u2826||\u2826|
|\u28C6||\u28C6|

The Braille characters above convert back correctly, but the others do not.
|\u2846|⡆|\u2846|
|\u2886|⢆|\u2886|
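
The surrogate pairs in the table are expected output and not a fault: Java strings are UTF-16, so escapeJava emits one escape per 16-bit code unit, and every character above U+FFFF (such as 𞤀) occupies two units. The Braille patterns lie below U+FFFF and fit in a single unit. To recover the code point itself, one option is to iterate over code points instead of chars. A minimal sketch; the class and method names are made up for illustration:

```java
import java.util.stream.Collectors;

final class CodePointHex {
    // Formats each code point (not each UTF-16 code unit) as uppercase hex,
    // padded to at least four digits and separated by spaces.
    static String toHex(String text) {
        return text.codePoints()
                   .mapToObj(cp -> String.format("%04X", cp))
                   .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(toHex("\uD83A\uDD00")); // 1E900
        System.out.println(toHex("\u2866"));       // 2866
    }
}
```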

I got it to work … by putting ChatGPT to the test. That said, I had not converted the hex representation to Unicode, which might have been my mistake from the beginning. The regex from ChatGPT also required some adjustments.

if (c_CodePointin4DigitHexadecimal == null) {
    out_4DigitHextoUTF8 = "Error: Input string is null";
} else if (!c_CodePointin4DigitHexadecimal.matches("([0-9A-Fa-f]{4,}[^0-9A-Fa-f]?)+")) {
    out_4DigitHextoUTF8 = "Error: Input string is not a valid representation of hexadecimal values separated by spaces";
} else {
    String[] hexValues = c_CodePointin4DigitHexadecimal.split(" ");
    StringBuilder result = new StringBuilder();
    try {
        for (String hexValue : hexValues) {
            int codePoint = Integer.parseInt(hexValue, 16);
            result.append(Character.toChars(codePoint));
        }
        out_4DigitHextoUTF8 = result.toString();
    } catch (NumberFormatException e) {
        out_4DigitHextoUTF8 = "Error: Invalid hexadecimal value";
    } catch (IllegalArgumentException e) {
        out_4DigitHextoUTF8 = "Error: Invalid code point";
    }
}

Since I am not a Java expert, and giving kudos to the source, I can live with that … but I still feel a bit filthy. I’d like to hear anybody’s point of view on using ChatGPT, and maybe some feedback on the generated code as well.
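
As one piece of feedback on the generated snippet, offered as a sketch and not a definitive fix: the validation regex accepts any single non-hex character as a separator, while split(" ") only splits on spaces, so an input such as "1E900,0050" passes validation and then ends up in the NumberFormatException branch. Splitting on whitespace and checking each value with Character.isValidCodePoint keeps validation and parsing consistent. The class name HexToText and its decode method are hypothetical:

```java
final class HexToText {
    // Decodes whitespace-separated hex code points such as "1E900 0050" into a string.
    static String decode(String input) {
        if (input == null || input.isBlank()) {
            throw new IllegalArgumentException("Input is null or empty");
        }
        StringBuilder result = new StringBuilder();
        for (String token : input.trim().split("\\s+")) {
            // parseInt throws NumberFormatException (a subclass of
            // IllegalArgumentException) when the token is not valid hex.
            int codePoint = Integer.parseInt(token, 16);
            if (!Character.isValidCodePoint(codePoint)) {
                throw new IllegalArgumentException("Not a Unicode code point: " + token);
            }
            result.appendCodePoint(codePoint);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("0050 0070")); // Pp
    }
}
```

In a Java Snippet node, the exception message could be written to the output column in the same way the original code writes its error strings.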

Hi @mwiegand -

Interesting that you found a way using ChatGPT. It seems this is becoming more and more common!

I must confess that I know very little about encoding, but just to clarify, is this issue that you found specific to KNIME, or more related to Java generally?

Hi @ScottF,

It is hard to say with certainty whether this is a KNIME-exclusive issue. That said, the Java version used by KNIME (or by my OS) is Java 17, whilst the most recent one is 19. I also found this quite interesting to read:

I am currently working on the UTF-8 to hex conversion. The Unicode name lists aren’t quite clean, so I share the results with Unicode. My actual goal is a kind of data test bench that identifies invisible characters (formatting classes) to ensure ingress data is clean, since invisible characters were often found to break processing. I am going a bit overboard, though, as I want to test every regex Unicode category against every Unicode character … because, why not xD

https://www.regular-expressions.info/unicode.html#category
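
For the invisible-character check described above, here is a minimal sketch, assuming that "formatting classes" refers to the Unicode general category Cf (Format), which is what \p{Cf} matches in a regex. The class and method names are hypothetical. It walks the input by code point, so characters above U+FFFF are handled as well, and reports every Cf character with its position:

```java
import java.util.ArrayList;
import java.util.List;

final class InvisibleCharFinder {
    // Lists every code point of general category Cf together with its char index.
    static List<String> findFormatChars(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.getType(cp) == Character.FORMAT) {
                hits.add(String.format("U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return hits;
    }

    public static void main(String[] args) {
        // The literal contains a ZERO WIDTH SPACE between a and b.
        System.out.println(findFormatChars("a\u200Bb")); // [U+200B at index 1]
    }
}
```

The same loop can test other categories by comparing against a different Character constant, for example Character.CONTROL for Cc.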

Also, the display sometimes seems off. I am now checking all, or at least the vast majority, of the Unicode characters, and a font, even OTF, can only hold up to 16 bits' worth of glyphs, which is around 65k. At least I can check data uniformity by inspecting the “raw data”.

Best
Mike

I did some testing around the default font used by KNIME, since I identified some irregularities like:

|Character Short Schematic Name|Unicode Character UTF-8|UTF-8 to 4-Digit Hex|Code Point in 4-Digit Hexadecimal|4-Digit Hex to UTF-8|4-Digit Hex to UTF-8 to Hex|Font Support|
|---|---|---|---|---|---|---|
|CJK COMPATIBILITY IDEOGRAPH-FAD7: [FB84 FED3 - 0020 - 0002 ]|𧻓|\ud85f \uded3|FAD7||FAD7|Character 𧻓 is supported by font Noto Sans Adlam|

Column Explanation

  1. “Character Short Schematic Name” (Extracted as is from data)
  2. “Unicode Character UTF-8” (Extracted as is from data)
  3. “UTF-8 to 4-Digit Hex” conversion of literal to hex
  4. “Code Point in 4-Digit Hexadecimal” (Extracted as is from data)
  5. “4-Digit Hex to UTF-8” conversion of hex to literal
  6. “4-Digit Hex to UTF-8 to Hex” back conversion to cross verify process
  7. “Font Support” verification if string in column no. 5 “4-Digit Hex to UTF-8” is supported by font

By using the following code (sponsored by ChatGPT), I extracted the default font and attempted to cross validate that with the font settings.

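import java.awt.Font;
import java.awt.GraphicsEnvironment;

// Caveat: getAllFonts()[0] is only the first entry in the list of available fonts, not necessarily the font KNIME renders with.
GraphicsEnvironment ge = GraphicsEnvironment.getLocalGraphicsEnvironment();
String defaultFontName = ge.getAllFonts()[0].getFontName();
Font defaultFont = Font.decode(defaultFontName);
out_Font = "Default font: " + defaultFont.getFontName();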

Default font: Serif

Since this is not an actual font but a generic (logical) font family, there might be a mismatch that manifests as character display issues like the one above.

Here is the HTML from which I extracted the columns (URL):

<?xml version="1.0" encoding="UTF-8"?>
<td class="p" title="CJK COMPATIBILITY IDEOGRAPH-FAD7: [FB84 FED3 | 0020 | 0002 |]">𧻓<br>
    </br>
    <tt>FAD7</tt>
</td>

The character isn’t rendered there either, but the more striking question is why, in the screenshot above, the character is rendered correctly when converting from hex to literal.
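
One possible explanation, stated as an assumption and not verified against the source page: CJK compatibility ideographs carry canonical decompositions, so any Unicode normalization step on the way from the HTML to the table would replace them with a different code point. That would fit the row above, where the literal extracted from the page escapes to \ud85f \uded3 (U+27ED3) although the title says FAD7, while the character built directly from the hex value never passed through normalization. The sketch below uses java.text.Normalizer to compare a code point before and after NFC; the class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

final class NormalizationCheck {
    // Builds the character for the given hex code point, applies NFC,
    // and returns the resulting code points as space-separated hex.
    static String nfcHex(String hex) {
        String original = new String(Character.toChars(Integer.parseInt(hex, 16)));
        String normalized = Normalizer.normalize(original, Normalizer.Form.NFC);
        return normalized.codePoints()
                         .mapToObj(cp -> String.format("%04X", cp))
                         .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // ANGSTROM SIGN is a well-known case: NFC maps it to A WITH RING ABOVE.
        System.out.println(nfcHex("212B")); // 00C5
        // Compare this result with the "UTF-8 to 4-Digit Hex" column in the table.
        System.out.println(nfcHex("FAD7"));
    }
}
```

If the second line prints something other than FAD7, the page or the extraction step most likely normalized the text, and the two columns would legitimately differ.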

Code for the font compatibility check is as follows (again, sponsored by ChatGPT):

import java.awt.Font;

String hexString = c_CodePointin4DigitHexadecimal;
String[] hexValues = hexString.split(" ");
String[] fontNames = {"Noto Sans Adlam", "Unifont", "Inter", "Inter V", "Everson Mono"};

for (String hexValue : hexValues) {
  try {
      int codePoint = Integer.parseInt(hexValue, 16);
      char[] chars = Character.toChars(codePoint);
      String string = new String(chars);

      boolean isSupported = false;
      for (String fontName : fontNames) {
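          // Note: new Font(...) falls back to a default font if fontName is not installed.
          Font font = new Font(fontName, Font.PLAIN, 12);
          // Test the full code point; charAt(0) would only see the high surrogate of a supplementary character.
          if (font.canDisplay(codePoint)) {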
              out_FontSupport = "Character " + string + " is supported by font " + fontName;
              isSupported = true;
              break;
          }
      }

      if (!isSupported) {
          out_FontSupport = "Character " + string + " is not supported by any of the specified fonts";
      }
  } catch (NumberFormatException e) {
      out_FontSupport = "Invalid hex string: " + hexValue;
  }
}
