I am testing something out and am struggling to identify why some Unicode characters, like “\u1E900”, which should translate to 𞤀, are converted to Ẑ0 by the following mechanism instead.
Test Data
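| Code Point in 4-Digit Hexadecimal | Unicode Character UTF-8 | Result |
|---|---|---|
| \u1E900 | 𞤀 | Ẑ0 |
| \u1E922 | 𞤢 | Ẓ2 |
| \u1E922 | 𞤢 | Ẓ2 |
| \u1E900 | 𞤀 | Ẑ0 |
| \u1E901 | 𞤁 | Ẑ1 |
| \u1E923 | 𞤣 | Ẓ3 |
| \u1E923 | 𞤣 | Ẓ3 |
| \u0050 | P | P |
| \u0070 | p | p |
| \u0070 | p | p |
| \u0050 | P | P |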
Java Code
// system imports
import org.knime.base.node.jsnippet.expression.AbstractJSnippet;
import org.knime.base.node.jsnippet.expression.Abort;
import org.knime.base.node.jsnippet.expression.Cell;
import org.knime.base.node.jsnippet.expression.ColumnException;
import org.knime.base.node.jsnippet.expression.TypeException;
import static org.knime.base.node.jsnippet.expression.Type.*;
import java.util.Date;
import java.util.Calendar;
import org.w3c.dom.Document;

// Your custom imports:
import org.apache.commons.text.StringEscapeUtils;

// system variables
public class JSnippet extends AbstractJSnippet {
    // Fields for input columns
    /** Input column: "Code Point in 4-Digit Hexadecimal" */
    public String c_CodePointin4DigitHexadecimal;

    // Fields for output columns
    /** Output column: "Result" */
    public String out_Result;

    // Your custom variables:

    // expression start
    public void snippet() throws TypeException, ColumnException, Abort {
        // Enter your code here:
        out_Result = StringEscapeUtils.unescapeJava(c_CodePointin4DigitHexadecimal);
        // expression end
    }
}
I got it to work by putting ChatGPT to the test. Though I did not convert the hex representation to Unicode, which might have been my mistake from the beginning. The RegEx from ChatGPT required some adjustments as well.
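if (c_CodePointin4DigitHexadecimal == null) {
    out_4DigitHextoUTF8 = "Error: Input string is null";
} else if (!c_CodePointin4DigitHexadecimal.matches("([0-9A-Fa-f]{4,}[^0-9A-Fa-f]?)+")) {
    // The pattern accepts any single non-hex separator, but only spaces are split on below;
    // other separators end up in the NumberFormatException branch.
    out_4DigitHextoUTF8 = "Error: Input string is not a valid representation of hexadecimal values separated by spaces";
} else {
    String[] hexValues = c_CodePointin4DigitHexadecimal.split(" ");
    StringBuilder result = new StringBuilder();
    try {
        for (String hexValue : hexValues) {
            // Parse the hex digits as a code point; toChars yields a surrogate pair above U+FFFF.
            int codePoint = Integer.parseInt(hexValue, 16);
            result.append(Character.toChars(codePoint));
        }
        out_4DigitHextoUTF8 = result.toString();
    } catch (NumberFormatException e) {
        // Must come first: NumberFormatException is a subclass of IllegalArgumentException.
        out_4DigitHextoUTF8 = "Error: Invalid hexadecimal value";
    } catch (IllegalArgumentException e) {
        // Thrown by Character.toChars for values that are not valid code points.
        out_4DigitHextoUTF8 = "Error: Invalid code point";
    }
}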
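A possible explanation for the original Ẑ0 result, offered as an assumption rather than something confirmed in this thread: a Java-style “\u” escape consumes exactly four hex digits, so “\u1E900” is read as U+1E90 (Ẑ) followed by a literal 0. Characters above U+FFFF need either a surrogate pair or a conversion from the numeric code point, as in the working snippet. The class and method names in this minimal sketch are made up for illustration.

```java
public class EscapeDemo {
    // Hypothetical helper: turns one hex code point such as "1E900" into a String.
    public static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // Only the first four hex digits belong to the escape; the trailing 0 is plain text.
        String fourDigitEscape = "\u1E900";
        System.out.println(fourDigitEscape.length());                  // 2
        System.out.println(fourDigitEscape.charAt(0) == 0x1E90);       // true

        // The same Adlam letter written as a surrogate pair and built from its code point.
        String surrogatePair = "\uD83A\uDD00";
        System.out.println(surrogatePair.equals(fromHex("1E900")));    // true
        System.out.println(Integer.toHexString(surrogatePair.codePointAt(0))); // 1e900
    }
}
```

So an escaped input would have to look like “\uD83A\uDD00” to come out as 𞤀, whereas the plain hex value 1E900 can be parsed directly.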
Since I am not a Java expert, and giving kudos to the source, I can live with that … but I still feel a bit filthy. I’d like to hear anybody’s point of view on using ChatGPT, and maybe some feedback about the generated code as well.
Interesting that you found a way using ChatGPT. It seems this is becoming more and more common!
I must confess that I know very little about encoding, but just to clarify, is this issue that you found specific to KNIME, or more related to Java generally?
It is hard to say with certainty whether this is a KNIME-exclusive issue. That said, the current Java version of KNIME, or of my OS, is Java 17, whilst the most recent one is 19. I also found this quite interesting to read:
I am currently working on the UTF-8 to hex conversion. The Unicode name lists aren’t quite clean, so I share the results with Unicode. My actual goal is a kind of data test bench to identify the presence of invisible characters (formatting classes) to ensure ingress data is clean, since the invisible characters were often found to break processing. However, I am going a bit borderline, as I want to test every RegEx Unicode category against any Unicode character … because, why not xD
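As a rough illustration of what such a test bench could do (a sketch under my own assumptions, not the actual KNIME workflow), the following lists the code points of a string in hex and flags those in the Unicode general category Cf (Format), which covers many invisible characters. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class InvisibleCharCheck {
    // Lists every code point of the input as "U+XXXX".
    public static List<String> toHex(String text) {
        List<String> hex = new ArrayList<>();
        text.codePoints().forEach(cp -> hex.add(String.format("U+%04X", cp)));
        return hex;
    }

    // Returns the code points whose general category is Cf (Format).
    public static List<String> formatChars(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints()
            .filter(cp -> Character.getType(cp) == Character.FORMAT)
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // Sample: a, ZERO WIDTH SPACE, b, ADLAM CAPITAL LETTER ALIF
        String sample = "a\u200Bb\uD83A\uDD00";
        System.out.println(toHex(sample));
        System.out.println(formatChars(sample));
        // The same category is available to regular expressions as a property class.
        System.out.println(sample.matches(".*\\p{Cf}.*"));
    }
}
```

Iterating over code points rather than char values keeps characters above U+FFFF intact, which matters for the Adlam test data.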
Also, the display sometimes seems off. I am now checking all, or at least the vast majority, of Unicode characters, and a font, even OTF, can only hold up to 16 bits’ worth of glyphs, which equals around 65k. At least I can check data uniformity by inspecting the “raw data”.
Since it is not an actual font but rather a generic font type definition, there might be a mismatch that manifests in character display issues like the one above.
Here is the HTML from which I extracted the columns (URL):
The character isn’t rendered there either, but the more striking question is why, in the screenshot above, the character is rendered correctly when converting from hex to literal.
The code for the font compatibility check is as follows (again, sponsored by ChatGPT):
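import java.awt.Font;

String hexString = c_CodePointin4DigitHexadecimal;
String[] hexValues = hexString.split(" ");
String[] fontNames = {"Noto Sans Adlam", "Unifont", "Inter", "Inter V", "Everson Mono"};
// Note: out_FontSupport is overwritten on each pass, so only the last hex value's result is kept.
for (String hexValue : hexValues) {
    try {
        int codePoint = Integer.parseInt(hexValue, 16);
        String string = new String(Character.toChars(codePoint));
        boolean isSupported = false;
        for (String fontName : fontNames) {
            // A font name that is not installed is mapped to a default font,
            // so a positive result may come from that fallback.
            Font font = new Font(fontName, Font.PLAIN, 12);
            // Pass the full code point: string.charAt(0) would only be the high
            // surrogate for characters above U+FFFF such as the Adlam letters.
            if (font.canDisplay(codePoint)) {
                out_FontSupport = "Character " + string + " is supported by font " + fontName;
                isSupported = true;
                break;
            }
        }
        if (!isSupported) {
            out_FontSupport = "Character " + string + " is not supported by any of the specified fonts";
        }
    } catch (NumberFormatException e) {
        out_FontSupport = "Invalid hex string: " + hexValue;
    } catch (IllegalArgumentException e) {
        // Character.toChars rejects values that are not valid code points.
        out_FontSupport = "Invalid code point: " + hexValue;
    }
}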
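One detail worth keeping in mind for checks like this: characters outside the Basic Multilingual Plane, such as the Adlam letters tested here, occupy two char values in a Java String, so it matters whether a check receives a single char or the full code point. java.awt.Font offers canDisplay(int codePoint) and canDisplayUpTo(String) for that purpose. The sketch below avoids fonts entirely and only shows the char versus code point difference; the class and method names are hypothetical.

```java
public class CodePointVsChar {
    // Hypothetical helper: true if the first char is only one half of a surrogate pair.
    public static boolean firstCharIsSurrogate(String text) {
        return Character.isSurrogate(text.charAt(0));
    }

    public static void main(String[] args) {
        String adlamAlif = new String(Character.toChars(0x1E900));
        System.out.println(adlamAlif.length());                            // 2 UTF-16 code units
        System.out.println(firstCharIsSurrogate(adlamAlif));               // true
        System.out.println(Integer.toHexString(adlamAlif.codePointAt(0))); // 1e900
        System.out.println(firstCharIsSurrogate("P"));                     // false
    }
}
```

A check based on charAt(0) therefore asks about a lone surrogate for these characters, while codePointAt(0) or the parsed integer refers to the character itself.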