File Reader problem with UTF-8 Encoded Files

Edlueze · March 20, 2015, 1:54am

Hi Folks:

I'm developing a node and I've encountered a small issue with reading in UTF-8 Encoded Files with the File Reader node. In my configure() routine I check that the input table from the upstream File Reader has a column called "Location" and a column called "Population". My locations are all Chinese city names as such:

Location	Population
上海	2301.91
北京	1961.24
吉林	441.47

Those are the populations of Shanghai, Beijing and Jilin in units of ten thousand, for the curious.

While the second, third, fourth, etc. columns always seem to match, the first column is never matched. But when I use an ANSI file like the one below, everything works fine.

Location	Population
Sydney	300.0
Melbourne	250.0
Canberra	20.0

Digging deep into the code, I found the KNIME routine inSpecs[0].getColumnSpec("Location") is failing for column 0 and returning a null.

The reason appears to be that the upstream File Reader prefixes a non-visible character to the beginning of the name field of column 0 when reading in UTF-8 files so that the name of column 0 becomes " Location" instead of "Location". I've attached a picture of the debugger for reference.
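
If the invisible character is a UTF-8 byte order mark decoded as U+FEFF (an assumption; the debugger screenshot is the only evidence here), a downstream node could tolerate it by comparing names with that character stripped. Below is a minimal sketch in plain Java with no KNIME classes. `ColumnNames`, `stripLeadingBom` and `indexOfIgnoringBom` are hypothetical helper names, and the string array stands in for the column names taken from the input spec:

```java
// Hypothetical helper, not part of KNIME: tolerates a leading U+FEFF in column names.
final class ColumnNames {
    // The character a UTF-8 BOM (bytes EF BB BF) decodes to.
    private static final char BOM = '\uFEFF';

    private ColumnNames() {}

    /** Returns the name without a leading U+FEFF, or the name unchanged if there is none. */
    static String stripLeadingBom(String name) {
        if (name != null && !name.isEmpty() && name.charAt(0) == BOM) {
            return name.substring(1);
        }
        return name;
    }

    /** Returns the index of the first name equal to wanted after BOM stripping, or -1. */
    static int indexOfIgnoringBom(String[] names, String wanted) {
        for (int i = 0; i < names.length; i++) {
            if (wanted.equals(stripLeadingBom(names[i]))) {
                return i;
            }
        }
        return -1;
    }
}
```

In a real node the names would come from the input table spec; the sketch only shows the comparison itself.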

My workaround is to ensure that column 0 always contains something harmless like the RowID.

Just thought you should know.

01_unicode_location.png

aborg · April 1, 2015, 8:05pm

Did the input UTF-8 file contain a BOM (byte order mark)? Maybe that is causing it.
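
One way to check is to look at the first three bytes of the file: a UTF-8 BOM is the byte sequence EF BB BF. Java's standard UTF-8 decoder does not drop it, which would explain a U+FEFF in front of the first header if the reader does not strip it itself. Here is a rough sketch, assuming Java 9 or later; `BomCheck` and its methods are made-up names for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reports whether data begins with the UTF-8 BOM bytes.
final class BomCheck {
    private BomCheck() {}

    /** True if the array starts with EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Reads at most the first three bytes of the file and checks them. */
    static boolean hasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[3];
            int n = in.readNBytes(head, 0, 3); // readNBytes(byte[], int, int) needs Java 9+
            return n == 3 && startsWithUtf8Bom(head);
        }
    }
}
```

Calling `BomCheck.hasUtf8Bom(Path.of("cities.csv"))` on the file in question (the file name here is only a placeholder) would answer the question without opening a hex editor.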

Edlueze · April 3, 2015, 3:38am

There may be something to that as it was not something I considered.

I keep my original list in Excel, but because Excel can't save UTF-8 CSV files I manually copy-and-paste into a Notepad++ file (having an Excel formula insert the commas). I just checked Notepad++ and it's now set to "Encode in UTF-8 without BOM", but I rather suspect that I simply used the less scary-looking Notepad++ setting "Encode in UTF-8" when I created the file.