SDF Writer US-ASCII charset limit

Hi

We've been having some issues writing references out to an SDF where the reference contains an accented character.

Looking into the SDF Writer node, the DefaultSDFWriter#openOutputWriter method specifies US-ASCII as the Charset. I can't see anything in the SDF specification (http://media.accelrys.com/downloads/ctfile-formats/ctfile-formats.zip) saying that the data has any character restriction other than a maximum length in characters.

Making the following change:

 

        if (m_settings.fileName().endsWith(".gz")) {
            return new BufferedWriter(new OutputStreamWriter(new GZIPOutputStream(os), StandardCharsets.ISO_8859_1));
        } else {
            return new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.ISO_8859_1));
        }

 

Appears to enable us to write out the references. Have I overlooked something, or would it be safe to make this change?
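
For illustration, here is a minimal standalone sketch (not the KNIME source) of why the charset passed to OutputStreamWriter matters. The class name CharsetWriteDemo and the sample reference string are made up for the example. An OutputStreamWriter constructed with a Charset silently substitutes a replacement for characters it cannot encode, so US-ASCII turns both the umlaut and the alpha into "?", ISO-8859-1 keeps the umlaut but still loses the alpha, and UTF-8 keeps both.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of KNIME.
class CharsetWriteDemo {

    /** Pushes text through an OutputStreamWriter, then decodes the bytes with the same charset. */
    static String roundTrip(String text, Charset charset) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(os, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and flushed) here, so all bytes are in the buffer.
        return new String(os.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // "Müller, α-form" written with Unicode escapes so the source file stays ASCII.
        String reference = "M\u00fcller, \u03b1-form";
        Charset[] charsets = {
            StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + " -> " + roundTrip(reference, cs));
        }
    }
}
```

So ISO-8859-1 would cover accented Latin characters, but anything outside that range (Greek letters, primes) would still be replaced on write.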

Cheers

Sam

One thing I overlooked: can the SDF Reader read it back in? Answer: nope

Hrm, maybe CT Files are expected to be ASCII?

I also didn't find a reference to what charset is acceptable in SDF files. Therefore we decided to stick to ASCII (which probably was the only one around when SDF was invented...).

This has been a serious problem for me, as SDFs I encounter in the wild can be ASCII, cp-1252, or UTF-8.

Is there any chance of this functionality being added to the SDF reader/writer? I use knime to run various filters to weed out data quality problems, and it's really not great if in doing so all the alphas, betas, primes etc end up getting scrambled in the process.

Recently I encountered a similar problem: Unicode characters in the SDF content or in the file path cause problems when reading or writing those files. Either a file whose path contains non-ASCII characters can’t be read, or the non-ASCII characters aren’t decoded correctly when reading/writing the SDF file.
For the file path problem, I found that adding “-Dfile.encoding=UTF-8” to knime.ini resolves it (don’t forget to restart KNIME after the change if it is running).
For the problem of writing SDF with non-ASCII characters, I came up with a workaround. The idea is to write the output file via a binary object (NOT with the “SDF Writer” node). The workflow in brief:
(1) Read the SDF file with the “SDF Reader” node (you need to add “-Dfile.encoding=UTF-8” to knime.ini so that the non-ASCII characters are decoded correctly).
(2) Do your processing (add whatever nodes you need).
Once the processing is complete, you will want to write out the SDF. I didn’t use the “SDF Writer” here because it can’t handle non-ASCII characters. Here is the trick:
(3) Use the “SDF Inserter” node if you want to write properties back into the output file.
(4) Concatenate the molecules if there is more than one (with the “GroupBy” node, setting the “Value delimiter” to empty).
(5) Convert the string to a binary object (with “Strings to Binary Objects”).
(6) Write out the SDF file from the binary object (with “Binary Object to Files”).
(7) DONE
I suggest that the KNIME team add UTF-8 support to the SDF Reader/Writer.
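
Outside of KNIME, steps (4) to (6) boil down to roughly the following. This is only a sketch of the concept in plain Java, not what the nodes do internally; the class and method names (SdfBytesWriter, toUtf8Bytes, write) are invented for the example, and each record is assumed to already end with its "$$$$" line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not a KNIME class.
class SdfBytesWriter {

    /** Step 4 + 5: join the records with an empty delimiter and encode the result as UTF-8. */
    static byte[] toUtf8Bytes(List<String> sdfRecords) {
        return String.join("", sdfRecords).getBytes(StandardCharsets.UTF_8);
    }

    /** Step 6: write the bytes to disk unchanged, so no further charset conversion happens. */
    static void write(Path target, List<String> sdfRecords) {
        try {
            Files.write(target, toUtf8Bytes(sdfRecords));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that the whole file is held in memory as one string and again as one byte array, so this approach is only practical for files that fit comfortably in the available heap.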

The example workflow is attached (13.3 KB).
I tested this workflow on KNIME v3.3.1 and v3.5.1.

We have an approach in the Vernalis file readers for allowing the user to select encoding, or let KNIME guess. The code for that is here - https://community.knime.org/svn/nodes4knime/trunk/com.vernalis/com.vernalis.knime.core/src/com/vernalis/io/

In the UI for the node configuration, it presents as (see e.g. the Load Text Files or Load Local Mol Files nodes):

image

(That’s a bit ugly with the truncated title - maybe we should change that!) The same code is called without user exposure to the options in e.g. our PDB Downloader node. Maybe the SDF reader could be modified in some way similar?

Steve

I’d like to see this change in the core KNIME SDF readers. At Lhasa we’ve got a custom version of the nodes with a different hard coded Charset which has resolved the specific issue we were having. Your solution would be better.

@thor If a change along the lines of that proposed by @s.roughley would be ok I’d be happy to look into implementing a charset selection in the dialog and make a pull request / publish the code?

Cheers

Sam

@swebb - Might be worth holding off a little, as I have done some re-working of the FileHelpers class internally. I hope to push it with some new nodes in the not too distant future, and that will expose a new method,

public static String guessEncoding(FileEncodingWithGuess fileEncoding, URLConnection uc,
		InputStream is) throws IOException {...}

It simply pulls out some of the code embedded in the existing methods but would not require a complete read of the file!
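
To give a flavour of what guessing from a partial read can look like, here is a generic sketch. It is NOT the Vernalis implementation and does not use FileEncodingWithGuess; the class EncodingGuesser and its guess method are hypothetical. It assumes the caller has already read the first few kilobytes of the stream into a byte array, and it falls back to ISO-8859-1 when the sample is not valid UTF-8 (cp-1252 would be another reasonable fallback).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, unrelated to the Vernalis FileHelpers class.
class EncodingGuesser {

    /** Guesses a charset from a sample of the file's leading bytes. */
    static Charset guess(byte[] sample) {
        // A UTF-8 byte order mark (EF BB BF) settles the question immediately.
        if (sample.length >= 3 && (sample[0] & 0xFF) == 0xEF
                && (sample[1] & 0xFF) == 0xBB && (sample[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // If no byte has the high bit set, the sample is plain ASCII.
        boolean ascii = true;
        for (byte b : sample) {
            if (b < 0) {
                ascii = false;
                break;
            }
        }
        if (ascii) {
            return StandardCharsets.US_ASCII;
        }
        // Strict decoding: any malformed sequence raises an exception instead of being replaced.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(sample));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Assumption: treat everything else as a single-byte Latin encoding.
            return StandardCharsets.ISO_8859_1;
        }
    }
}
```

One known weakness of this sketch: if the sample is cut off in the middle of a multi-byte UTF-8 sequence, the strict decode fails and the file is misclassified, so a real implementation would need to handle the sample boundary.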

Steve

If any of you wants to contribute patches, you are very welcome! Unfortunately there is a bit of paperwork involved, see https://bitbucket.org/KNIME/knime-sdk-setup/src/master/CONTRIBUTING.MD for details.

@htsn01 I tried something similar, but unfortunately I do not have local admin rights on my work computer. knime.ini is read-only for me.

One of the key advantages of KNIME for me is the level of flexibility it gives me to make tools to solve problems without needing admin rights. It would be great if this functionality could be integrated into the SDF reader/writer node itself, rather than having to edit ini files.

@WildCation Yes, I also hope that UTF-8 support can be integrated into the SDF read/write nodes.
Which OS do you use? I use Linux, where I can untar the KNIME archive into my home folder (so I have enough permissions to edit anything there) and run it.
If you are using Windows, you could try downloading the zip version of KNIME, unzipping it into a folder where you have sufficient rights, and running it from there.

I have just tested a new ‘Load SD-Files (SDF)’ node in the family of ‘Load… files’ from Vernalis. It looks to work OK in initial testing but needs some further testing before being released into the wild! This follows the same file encoding pattern as shown in our previous comment above. The node doesn’t have the ability to extract properties from the SDF blocks (you can use the ‘SDF Extractor’ node to do that anyway), but it does attempt to parse the header block and count lines. Not all SDF writers follow the original format guide very strictly, however, so the output is sometimes nonsense (although the node never actually fails to read the SDF… yet!)

Steve

I look forward to seeing it!


I’ve just updated the nightly build to v1.17.0, which has this node added. Please do try it and let me know if it works for you.

Steve

So far so good. It seems like the node adds a bunch of metadata columns which I don’t need, but it’s easy enough to filter them out. Now all I need is a UTF-8 compatible SDF writer - the files I process tend to be large enough that @htsn01’s workaround ends up crashing because the Java heap space fills up (which I can’t expand, because I can’t edit knime.ini…)

Thank you so much for implementing this. I will continue testing it and will let you know if I run into any problems.

Thanks - glad it is working. I will think about whether we can easily provide options to turn the extra parsing on and off in the node dialogue without breaking any of the other nodes sharing the underlying code

Steve

I’ve just updated with a few minor enhancements - you should now be able to choose which bits of the output you want. Also, it is now possible to extract the mol block directly if you so wish. See Update to v 1.17.2 (Nightly only) for details. I will push this onto the stable builds in the next week or so.

Steve

That’s great, thanks.

My inelegant solution to writing SDFs as UTF-8 without running out of Java heap space:

  • SDF Inserter to add property columns to the SDF column
  • String Manipulation to remove extraneous line breaks from the end of each SDF block
  • Column filter to remove all other columns from the table except for the SDF
  • CSV writer - UTF-8 encoding, no column delimiter, quote mode “never”, and no quote chars. Output with extension .sdf

It is a bit slower than using the standard KNIME SDF writer node because of the additional steps in the process - however, it works much better for me than using the String to Binary node, which almost always runs out of java heap space and falls over before completion. And best yet, thanks to this and your new SDF reader node, no scrambled special characters.
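
The reason the row-by-row approach avoids the heap problem can be sketched in plain Java as follows. This is an illustration only, not what the CSV Writer node does internally; StreamingSdfWriter and its methods are hypothetical names. Each record is trimmed of trailing line breaks and written straight to a UTF-8 BufferedWriter, so only one record needs to be in memory at a time.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not a KNIME class.
class StreamingSdfWriter {

    /** Removes trailing CR/LF characters, mirroring the String Manipulation step. */
    static String stripTrailingLineBreaks(String record) {
        int end = record.length();
        while (end > 0 && (record.charAt(end - 1) == '\n' || record.charAt(end - 1) == '\r')) {
            end--;
        }
        return record.substring(0, end);
    }

    /** Writes one record per "row", each followed by a single line break, encoded as UTF-8. */
    static void write(Path target, Iterable<String> records) {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String record : records) {
                writer.write(stripTrailingLineBreaks(record));
                writer.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the records are consumed through an Iterable, they could come from a lazily read source rather than a fully materialised list.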
