Clean non-ASCII characters from Strings or Columns to be converted to XML

Dear Developers,
I am making heavy use of the XPath, Web services and String to XML nodes, and I've found that they cannot cope with characters such as ©, which they apparently do not accept as properly formatted XML input.

The problem is reproducible.

Perhaps you could introduce an "XMLCleaner" node? One implementation is available in the Atlassian code (I have not used it):

 http://repository.atlassian.com/atlassian-xml-cleaner/jars/atlassian-xml-cleaner-0.1.jar

Currently the only way to do this is with a Java Snippet, e.g. by embedding the following class:

http://code.google.com/p/springapps/source/browse/dom4j-test/src/main/java/com/studerb/dom4j_test/XMLCleaner.java?spec=svn160&r=160

...

Any other suggestions are welcome!

Here is my minimal code for the Java Snippet XMLCleaner, derived from http://code.google.com/p/springapps/source/browse/dom4j-test/src/main/java/com/studerb/dom4j_test/XMLCleaner.java?spec=svn160&r=160
and written according to the spec (http://www.w3.org/TR/xml/#charsets).

It should work; however, its output still does not make String to XML happy.

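// Keep only the code points allowed by the XML 1.0 character ranges
// (http://www.w3.org/TR/xml/#charsets). $text$ is the Java Snippet
// placeholder for the input text.
StringBuilder out = new StringBuilder();
int codePoint;
int i = 0;
while (i < $text$.length()) {
 codePoint = $text$.codePointAt(i);
 if ((codePoint == 0x9) ||
     (codePoint == 0xA) ||
     (codePoint == 0xD) ||
     ((codePoint >= 0x20) && (codePoint <= 0xD7FF)) ||
     ((codePoint >= 0xE000) && (codePoint <= 0xFFFD)) ||
     ((codePoint >= 0x10000) && (codePoint <= 0x10FFFF))) {
      out.append(Character.toChars(codePoint));
 }
 // Advance by one or two chars, depending on whether this is a surrogate pair.
 i += Character.charCount(codePoint);
}
return out.toString();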

Fixed! Thanks to the table (http://www.ssec.wisc.edu/~tomw/java/unicode.html#x0000)
I've adapted the code to keep only tab, line feed, carriage return and printable ASCII (0x20 to 0x7E), and now the output is "digested" properly by String to XML.

// Keep only tab, line feed, carriage return and printable ASCII; drop everything else.
StringBuilder out = new StringBuilder();
int codePoint;
int i = 0;
while (i < $text$.length()) {
 codePoint = $text$.codePointAt(i);
 if ((codePoint == 0x9) ||
     (codePoint == 0xA) ||
     (codePoint == 0xD) ||
     ((codePoint >= 0x20) && (codePoint <= 0x7E))
    ) {
      out.append(Character.toChars(codePoint));
 }
 i += Character.charCount(codePoint);
}
return out.toString();


Perhaps the snippet above could also be useful to others, so I am posting it here; however, I do not commit to maintaining it or making it more general.
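For anyone who wants to try both filters outside KNIME, here is a minimal standalone sketch. The class and method names (XmlCleanerSketch, isXml10Char, isPrintableAscii, filter) are made up for illustration and are not part of KNIME or of the linked XMLCleaner class; the ranges are the same ones used in the two snippets above.

```java
import java.util.function.IntPredicate;

// Hypothetical helper, not part of KNIME: standalone versions of the two filters.
final class XmlCleanerSketch {
    private XmlCleanerSketch() {}

    // Same ranges as the first snippet (XML 1.0 character ranges).
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Same ranges as the second snippet (tab, LF, CR, printable ASCII).
    static boolean isPrintableAscii(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD || (cp >= 0x20 && cp <= 0x7E);
    }

    // Rebuild the string from the code points accepted by the predicate.
    static String filter(String text, IntPredicate keep) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().filter(keep).forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "a\u0001b \u00A9 c"; // control char U+0001 and the copyright sign
        System.out.println(filter(sample, XmlCleanerSketch::isXml10Char));
        System.out.println(filter(sample, XmlCleanerSketch::isPrintableAscii));
    }
}
```

Note that the first filter keeps © (U+00A9) and only drops the control character, while the second one drops both.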

It's a bit strange, since Strings in Java are already Unicode, so the XML parser should in principle not have any problems understanding them. Can you send an example that doesn't work?

Dear Thor, thank you for your follow-up. I was surprised as well; however, with that workaround things work well. The example is already in the first post: put those characters in a text and you will see the error.

If I read in a file with "<bla>©</bla>" with the File Reader and then use String To XML it works as expected. Can you construct a simple example file where you get the error and send it to me?
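The same check can be sketched outside KNIME with the DOM parser that ships with the JDK. This is only an illustration under assumptions: the class and method names are hypothetical, and nothing in this thread says which parser String To XML uses internally. Because the XML is passed as a Java String through a Reader, no byte encoding is involved at all.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parse an XML string with the JDK's DOM parser.
final class ParseCopyrightSketch {
    private ParseCopyrightSketch() {}

    // Returns the text content of the root element, or fails if the XML is rejected.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // \u00A9 is the copyright sign; it is written as an escape so the
        // source file itself stays pure ASCII.
        String text = rootText("<bla>\u00A9</bla>");
        System.out.println(text.equals("\u00A9")); // true if the character survived parsing
    }
}
```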

We just found a problem with reading XML files on systems where the default encoding is not UTF-8 (Windows and presumably also Mac). Since I'm using Linux I could not reproduce your problem but I believe it is fixed now in the upcoming 2.4.2 release.
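To illustrate the general kind of problem (not the actual KNIME fix, which is not shown in this thread), here is a small sketch of what happens when UTF-8 bytes are decoded with a different charset. The class and method names are hypothetical, and ISO-8859-1 merely stands in for a non-UTF-8 platform default.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of an encoding mismatch; not KNIME code.
final class EncodingMismatchSketch {
    private EncodingMismatchSketch() {}

    // The bytes as they would be stored in a UTF-8 encoded file.
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Correct: decode with the charset the file was written in.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Mismatch: decode the same bytes as ISO-8859-1 (assumed stand-in
    // for a non-UTF-8 default encoding).
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String xml = "<bla>\u00A9</bla>";
        byte[] bytes = utf8Bytes(xml);
        // The copyright sign is two bytes in UTF-8 (0xC2 0xA9), so the
        // mismatched decode yields two characters: U+00C2 followed by U+00A9.
        System.out.println(decodeAsUtf8(bytes).equals(xml));
        System.out.println(decodeAsLatin1(bytes).equals("<bla>\u00C2\u00A9</bla>"));
    }
}
```

Passing an explicit charset when reading files avoids depending on the platform default, which differs between systems.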