This is a feature request for discussion. It was triggered by a recent post where the user was having difficulty converting a web-scraped byte stream to a document in the local computer’s locale file format. Several other packages have already, or are considering, ignoring the default machine locale and using UTF-8 as the default to encode user content in files. Setting the default file encoding to UTF-8 would encourage users to utilise a more modern encoding which does not have the limitations of ANSI Code Page encoded files.
Way back when computers were steam powered, the ASCII file format was created to encode text files. This worked beautifully whilst computers were mostly used to process western languages with a limited number of characters. However, as computing was democratised and spread globally, a need arose to support a wider range of languages and symbols, from Greek letters to Wingdings. This led to a situation where different languages had their own slightly different sets of ANSI Code Page encodings for characters - some languages have multiple coding schemes.
Rather than having to determine the file format every time a file was opened or closed, the operating system had a configuration option, the locale, which determined the standard encoding of a user's file contents. Applications just had to read and write what was determined by the locale and everything would be OK.
Well, it was OK whilst people were not sharing files, or were only sharing files with people using the same file encoding. But some fool invented the internet, and we all started sharing files with each other and with people in different locales.
This created a problem: it was difficult to merge documents with different text encodings, as information would be lost, so Unicode was created to standardise the encoding of users' text data within a file. So far, so good, and we are now at the turn of the Millennium; computers are still quite large but at least they use electricity.
Great, you say, we should all switch to using Unicode (UTF-8) and life will be wonderful. But what about all those legacy documents and applications that don't use UTF-8? How are you going to manage the transition from a world where most documents are stored in the user's locale encoding, and applications understand that, to one exclusively using Unicode? Plus, Unicode is complex and not all encoders / decoders work reliably.
So, most computers continued to read and save files using the locale encoding by default, and those that wanted to use Unicode could do so. Now, I should mention that operating systems have over time moved to using Unicode for file names and system functions without people realising it. Mostly because it is within the software development community, and anything that makes life easier is OK with them (even if they do seem to get passionate arguing the finer points of detail).
Time has now moved on and Unicode is increasingly being used to store more and more user content. Particularly so as a lot more content is destined for the World Wide Web and HTML-formatted documents (for which the current HTML standard requires UTF-8). Many modern applications recognise this and default to using Unicode for storing data. However, there are still many that default to the locale encoding when they could default to Unicode (UTF-8).
KNIME is one such application and my feature request is to include a configuration option to override the default locale encoding and set it to UTF-8. Therefore, whenever a reader or writer reads or writes a file it defaults to UTF-8 instead of the encoding determined by the locale. Also, and this is where it gets a little more difficult, all extensions and third party tools would also need to respect this configuration option (including Python scripts and extensions).
As an additional note, the behaviour of the locale default is platform-dependent. In Python, for instance, the default text encoding follows the locale, which on most POSIX platforms is UTF-8, but on Windows is the system's ANSI code page (unless Python's UTF-8 mode is enabled). For cross-platform software there is a large risk of chaos and confusion if users are not pushed to standardise on Unicode.
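To illustrate the platform dependence, here is a minimal Python sketch (the file name is hypothetical). `locale.getpreferredencoding()` reports what `open()` would use with no `encoding` argument; passing `encoding="utf-8"` explicitly removes the dependence entirely:

```python
import locale
import os
import tempfile

# What open() uses when no encoding is given: typically a legacy ANSI
# code page (e.g. "cp1252") on Windows, "utf-8" on most POSIX systems.
print(locale.getpreferredencoding(False))

text = "Grüße, 世界"  # mixed Latin and CJK characters

# Writing and reading with an explicit encoding sidesteps the locale.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    assert f.read() == text  # round-trips on every platform
```

The same explicit-encoding discipline is what a UTF-8 default in KNIME would effectively enforce for its readers and writers.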
KNIME is aimed at the data science community. On the whole, data scientists source data from a wide range of sources and are more likely to work with data from other international organisations. With the rise of the Internet of Things it is more likely that data will be stored in a Unicode format than in the end analyst's computer's locale encoding. Since documents do not originate from the consuming computer, it is highly unlikely that they will use that computer's locale for encoding.
For the majority of Western-language speakers this is not too much of a problem, as there is significant overlap between the characters supported by different locale encodings, and documents can be trans-encoded with little loss of information (though there is no way to identify corruption in the data other than information loss). For the increasing number of users whose languages have thousands of characters (such as Chinese, Japanese, Korean, etc.) there is a significant problem. If we also add in that emojis are now a standard part of most languages, there is a strong argument that all documents should use Unicode encoding.
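That loss of information is easy to demonstrate in Python (a sketch; cp1252 stands in here for a typical Western locale code page):

```python
text = "Résumé 😀 日本語"

# A strict encode to a Western code page fails outright on CJK and emoji...
try:
    text.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok

# ...while a lossy encode silently replaces them with "?", corrupting the
# data with no record that anything was lost.
lossy = text.encode("cp1252", errors="replace").decode("cp1252")
assert lossy != text and "?" in lossy

# UTF-8 round-trips the full text without loss.
assert text.encode("utf-8").decode("utf-8") == text
```

Note that the accented Latin characters survive the cp1252 round-trip, which is exactly the "significant overlap" that hides the problem from many Western users.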
Data Scientists should be at the forefront of pushing to drop the use of locale encoded documents and standardise on Unicode. This is particularly so within multi-national organisations where data needs to flow freely between workflows without risk of corruption.
I don’t believe that any change should be forced on users without their consent. However, KNIME should provide a configuration option to ignore the computer’s default locale encoding and use Unicode as the default within all nodes where text encoding takes place (mostly file nodes). This default should also apply to all extensions and scripting nodes, such as Python, which defaults to the computer’s locale encoding unless instructed otherwise. Note that the tab which allows the encoding to be chosen should remain; the option only sets the default choice of encoding.
This is a major change - most users are unaware that their files are stored encoded according to their system’s locale, and defaulting to UTF-8 may mean they cannot open their text files in other applications that expect them to be locale encoded. So a configuration option is sensible: it will reveal areas of the platform and the user’s environment that still default to the computer’s encoding, making it apparent where further change is required.
There may be a need, when loading previously created workflows, to emit a warning that a node is reading or writing using the locale default encoding. It may also be appropriate to warn whenever the default locale encoding is in use, to prompt people to consider using UTF-8. I mention these because any change would have a significant impact, both technically and behaviourally.
This is a proposal, so comments are welcome.