Unicode with JPython nodes

rxuriguera · January 23, 2012, 3:31pm

Hi all,
I've been trying to use JPython nodes to do some operations on strings but I run into problems with non-ascii characters. The following error shows up in the logs, when trying to cast a string from the table to Unicode:
UnicodeError: ascii decoding error: ordinal not in range(128)

I attach a very simple three-node workflow to show the problem:
Snapshot: http://i.imgur.com/Urqtt.png
Workflow zip file:

First there is a Table Creator node that creates a 1x1 table containing a non-ascii string (e.g. làlà lóló lülü çeçe ñiñi). Then there are two JPython Script 1:1 nodes. The top one just shows the type of the string object when it is passed to JPython, which is javainstance. The bottom one tries to convert the content of the cell to Unicode, which is where the aforementioned error is thrown.

This is the code:
iterator = inData0.iterator()
while iterator.hasNext():
    row = iterator.next()
    cell_01 = row.getCell(0)
    # Both the following lines result in the same error
    cell_01_unicode = unicode(cell_01)
    cell_01_unicode = unicode(cell_01.toString())
    ...

Hope someone can help. Thanks!

potts · January 23, 2012, 5:56pm

Hi rxuriguera --

If I understand your issue correctly, I am able to provoke a similar behavior in straight python (as well as jython) outside of KNIME:

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('làlà lóló lülü çeçe ñiñi')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
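
The traceback can be unpacked into its individual steps. The sketch below assumes the terminal handed the literal to Python as UTF-8 bytes (the 0xc3 at position 1 is consistent with that). It is written for Python 3, where bytes plays the role of a Python 2 str; the names raw and decodes_as_ascii are illustrative only.

```python
# UTF-8 bytes for the sample text, i.e. what a Python 2 byte string would hold
# (assumption: the terminal encoding was UTF-8).
raw = u'l\xe0l\xe0 l\xf3l\xf3 l\xfcl\xfc \xe7e\xe7e \xf1i\xf1i'.encode('utf-8')


def decodes_as_ascii(data):
    """Return True if the byte string can be decoded with the ascii codec."""
    try:
        data.decode('ascii')
        return True
    except UnicodeDecodeError:
        return False


# In Python 2, unicode(some_str) without an encoding argument performs
# this ascii decode implicitly, which is where the error comes from.
print(decodes_as_ascii(raw))      # the non-ASCII bytes are rejected
print(decodes_as_ascii(b'lala'))  # plain ASCII passes
```

The first offending byte sits at index 1 (the first byte of the encoded à), which matches the position reported in the error message.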

Of course, when the string is a literal, the short and sweet fix is to add the u specifier so that the supplied string is interpreted as unicode:

>>> unicode(u'làlà lóló lülü çeçe ñiñi')
u'l\xe0l\xe0 l\xf3l\xf3 l\xfcl\xfc \xe7e\xe7e \xf1i\xf1i'

A more general solution that I believe will do what you want is this:

>>> unicode('làlà lóló lülü çeçe ñiñi'.decode('UTF8'))
u'l\xe0l\xe0 l\xf3l\xf3 l\xfcl\xfc \xe7e\xe7e \xf1i\xf1i'

There are sometimes subtle differences to consider with older versions of the python specification -- our JPython nodes leverage jython 2.2.1, considerably older than my above use of 2.7.1. Thankfully, trying the above with jython 2.2.1 in an interactive shell yielded similar results (you'll see a few subtle differences that should be innocuous if you try it yourself).
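
As a minimal sketch of the explicit-decode approach, the helper below decodes only when it is given a byte string and leaves already-decoded text alone. It targets Python 3 (bytes standing in for a Python 2 str), has not been tried inside a JPython node, and to_text is a hypothetical name rather than anything provided by KNIME or jython.

```python
def to_text(value, encoding='utf-8'):
    """Decode byte strings with an explicit codec; return text unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Sample text from the thread, written with escapes so the source stays ASCII.
sample = u'l\xe0l\xe0 l\xf3l\xf3 l\xfcl\xfc \xe7e\xe7e \xf1i\xf1i'
raw = sample.encode('utf-8')  # stands in for the undecoded byte string

print(to_text(raw) == sample)     # the right codec recovers the text
print(to_text(sample) is sample)  # decoded text passes through untouched
```

Naming the codec explicitly avoids the implicit ascii default, and the pass-through branch avoids decoding a value twice.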

Hope this helps,

Davin

rxuriguera · January 23, 2012, 6:56pm

Thank you for your response, Davin! Unfortunately, this solution does not seem to work. While your code does work in the Python interpreter, using it inside a KNIME JPython node yields the following error:
UnicodeError: utf-8 decoding error: invalid data
    at org.python.core.Py.UnicodeError(Unknown Source)
    at org.python.core.codecs.decoding_error(Unknown Source)

The exact code I've put in the node is the following:

iterator = inData0.iterator()
while iterator.hasNext():
    row = iterator.next()
    outContainer.addRowToTable(row)
    print unicode('làlà lóló lülü çeçe ñiñi'.decode('UTF8'))

Thanks

potts · January 23, 2012, 10:20pm

Ah, interesting. It looks as though jython 2.2.1 is trying to be too smart about its handling of Java strings containing wide chars.

I can reproduce that same misbehavior in the jython 2.2.1 interactive shell when trying to print what appears to otherwise be a kosher string:

Jython 2.2.1 on java1.6.0_29
Type "copyright", "credits" or "license" for more information.
>>> unicode('làlà lóló lülü çeçe ñiñi'.decode('UTF8'))
u'l\xE0l\xE0 l\xF3l\xF3 l\xFCl\xFC \xE7e\xE7e \xF1i\xF1i'
>>> print unicode('làlà lóló lülü çeçe ñiñi'.decode('UTF8'))
Traceback (innermost last):
File "<console>", line 1, in ?
UnicodeError: ascii encoding error: ordinal not in range(128)

Bizarre! What does jython 2.2.1 think it is doing?
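
One plausible reading, which nothing in this thread confirms, is that print has to encode the unicode object for the console and falls back to the ascii codec. The sketch below only illustrates that encoding step in isolation, using Python 3; encodable is a hypothetical helper name.

```python
# The decoded sample text (escapes keep the source file pure ASCII).
text = u'l\xe0l\xe0 l\xf3l\xf3 l\xfcl\xfc \xe7e\xe7e \xf1i\xf1i'


def encodable(s, encoding):
    """Return True if the text can be encoded with the given codec."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(encodable(text, 'ascii'))  # ascii cannot represent the accented letters
print(encodable(text, 'utf-8'))  # utf-8 can represent all of them
```

If that reading is right, encoding explicitly before printing (for example with .encode('utf-8')) might sidestep the error in the 2.2.1 shell, though that has not been verified here.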

"Does jython 2.5.2 do the same thing?" I wondered. So I tried the same in the jython 2.5.2 interactive shell:

Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[Java HotSpot(TM) 64-Bit Server VM (Apple Inc.)] on java1.6.0_29
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode('làlà lóló lülü çeçe ñiñi'.decode('UTF8'))
u'l\xe0l\xe0 l\xf3l\xf3 l\xfcl\xfc \xe7e\xe7e \xf1i\xf1i'
>>> print unicode('làlà lóló lülü çeçe ñiñi'.decode('UTF8'))
làlà lóló lülü çeçe ñiñi

So it seems that jython 2.5.2 does a better job. Perhaps this was a known bug in 2.2.1 that got fixed in the 2.5.x releases, but I do not know that for certain.

Before declaring that we should immediately update the JPython nodes to use jython 2.5.2, I should first ask what you were ultimately trying to do but could not.

I ask this because when I use the JPython node to do this on your 1x1 table (via Table Creator):

print len(row.getCell(0).toString())

I get "24" as the result. When I do similarly in jython 2.2.1 and 2.5.2, I also get "24" from:

>>> len('làlà lóló lülü çeçe ñiñi'.decode('UTF8'))

This leads me to think that 'row.getCell(0).toString()' may already give an appropriately coded string, in which case calls to unicode() or decode() might be unnecessary, depending upon what you really want to do.
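
The length check works as a diagnostic because the character count and the UTF-8 byte count differ for this sample. A small sketch in Python 3, using only the sample text from the thread (variable names are illustrative):

```python
# Decoded sample text: 24 characters, 10 of them non-ASCII.
text = u'l\xe0l\xe0 l\xf3l\xf3 l\xfcl\xfc \xe7e\xe7e \xf1i\xf1i'

# The same text as undecoded UTF-8 bytes; each accented letter takes 2 bytes.
utf8_bytes = text.encode('utf-8')

print(len(text))        # 24
print(len(utf8_bytes))  # 34
```

A result of 24 therefore suggests the value is already a sequence of characters, whereas undecoded UTF-8 bytes would have reported 34.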

Davin