BUG: File Reader mishandles UTF-8 BOM

dandl · March 19, 2020, 6:56am

A file in UTF-8-BOM encoding starts with the 3-byte sequence 0xEF, 0xBB, 0xBF. A file reader should recognise this, skip it, and read the remainder of the file as UTF-8. See https://en.wikipedia.org/wiki/Byte_order_mark.
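
As a rough illustration of the "recognise, skip, read the rest as UTF-8" behaviour described above, here is a minimal Python sketch. The function name `read_text_skipping_bom` and the sample bytes are made up for this example; it says nothing about how the KNIME File Reader is implemented.

```python
import codecs


def read_text_skipping_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # codecs.BOM_UTF8 is the byte string b"\xef\xbb\xbf"
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")


# Hypothetical file content: BOM followed by a small CSV
sample = codecs.BOM_UTF8 + "id,name\n1,Ana\n".encode("utf-8")
text = read_text_skipping_bom(sample)
print(repr(text))  # 'id,name\n1,Ana\n'
```

Python's built-in `utf-8-sig` codec does the same thing when decoding, so `sample.decode("utf-8-sig")` gives the same result; files without a BOM pass through unchanged either way.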

The KNIME File Reader node does not recognise it, and appears to replace the sequence with a single NUL. The consequence for a CSV file is that the first column has a name starting with an (invisible) NUL. This is a very hard bug to track down. It took me most of a day.
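
The same class of problem can be reproduced outside KNIME. The sketch below is only an analogy in Python, with made-up sample data and a hypothetical helper `header_names`: decoding with plain `utf-8` leaves an invisible U+FEFF (rather than the NUL reported above) at the front of the first column name, while `utf-8-sig` strips it.

```python
import csv
import io

# Hypothetical CSV bytes that start with the UTF-8 BOM
raw = b"\xef\xbb\xbfid,name\r\n1,Ana\r\n"


def header_names(data: bytes, encoding: str) -> list:
    """Return the first CSV row after decoding with the given encoding."""
    text = data.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""))
    return next(reader)


naive = header_names(raw, "utf-8")      # BOM survives as U+FEFF
clean = header_names(raw, "utf-8-sig")  # BOM is stripped

print([ascii(name) for name in naive])  # ["'\\ufeffid'", "'name'"]
print(clean)                            # ['id', 'name']
print(naive[0] == "id")                 # False, although both print alike
```

Printing the names through `ascii()` (or inspecting the first code point) is a quick way to spot this kind of invisible prefix when a column lookup by name fails for no visible reason.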

Please fix!

bjoern.lohrmann · March 26, 2020, 1:27pm

Hi @dandl

Sorry that this issue cost you so much time; the behavior is indeed not good. We have opened a ticket internally. We already plan to release a revised version of the File Reader, which contains many bug fixes and new features. We are planning to fix the UTF-8 BOM issue with the revised File Reader, which will hopefully be published with 4.2.

Best,
Björn

dandl · March 26, 2020, 1:58pm

It’s software. It has bugs. Over time, hopefully fewer bugs. Good to hear we’re moving in the right direction.

This is probably more of a problem for Windows developers, where UTF-16, ANSI, code pages and UTF-8-BOM are a lot more common. For me they are a fact of life and this is a problem I’ll be glad to do without.

I hope you’ve got tab-delimited and Excel-style quoting and 1900 dates covered too. They’re pretty common out in the (Windows) wild.
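
For readers unfamiliar with these formats, here is a small Python sketch of what they involve. It assumes "Excel-style quoting" means fields wrapped in double quotes with embedded quotes doubled, and "1900 dates" means Excel's 1900 date system of day serial numbers. The sample data and the function `excel_serial_to_date` are invented for illustration and are not taken from KNIME.

```python
import csv
import io
from datetime import date, timedelta

# Tab-delimited text with an Excel-style quoted field (embedded quotes doubled)
tsv = 'name\tnote\r\nAna\t"said ""hi"""\r\n'
rows = list(csv.reader(io.StringIO(tsv, newline=""), dialect="excel-tab"))
print(rows)  # [['name', 'note'], ['Ana', 'said "hi"']]


def excel_serial_to_date(serial: int) -> date:
    """Convert a whole-day serial from Excel's 1900 date system to a date.

    Serial 1 is 1900-01-01. Excel also counts a non-existent 1900-02-29
    as serial 60, so later serials are shifted by one day.
    """
    if serial < 1:
        raise ValueError("serials below 1 are not dates in the 1900 system")
    if serial == 60:
        raise ValueError("serial 60 is Excel's fictitious 1900-02-29")
    if serial < 60:
        return date(1899, 12, 31) + timedelta(days=serial)
    return date(1899, 12, 30) + timedelta(days=serial)


print(excel_serial_to_date(1))      # 1900-01-01
print(excel_serial_to_date(61))     # 1900-03-01
print(excel_serial_to_date(36526))  # 2000-01-01
```

The two base dates exist only to absorb the phantom leap day; fractional serials (times of day) and the separate 1904 date system are left out of this sketch.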
