I have 30 Structure Data Files that I want to aggregate for a total of 3 million rows.
I have almost succeeded at making my workflow efficiently and elegantly append the content of each SDF:
1. List Files
2. TableRow To Variable Loop Start
3. SDF Reader
4. [column renaming logic]
5. SDF Writer
6. SDF Reader
7. CSV Writer
8. Variable Loop End
Why steps 6 and 7? I use the 'Append' option of the CSV Writer so that the workflow only handles a fraction of the total number of rows at a time and appends the data to the existing file. This file contains the mol blocks as written to a temp SDF in step 5.
The issue? CSV Writer adds a line feed for each row of the input table. The resulting SDF then fails parsing. Despite the obvious overhead, I tried transposing rows into columns, which fixes the internal extra line feeds, but not the extra ones at the end of each iteration.
How can I remove this line feed?
Do you have another workflow that would serve my purpose, other than aggregating the 3 million rows inside the KNIME workflow?
I'm not sure I understand what you are trying to do here. What are steps 5-7 trying to accomplish?
Would it be possible to see your workflow with some public data such as the SD file from the Open Source Malaria project? It should only have around 150 structures in it but if you split it up right it ought to demonstrate the problem, no?
Thanks for your reply.
If I go straight from 4 to 7, the data written by CSV Writer is not in an SDF format.
If, from 4, I write to SDF, there is no way to append data to my existing SDF, as the SDF Writer only offers a choice between overwriting the file or not (there is no append option as in the CSV Writer).
I have downloaded the Malaria file and made two copies of it (Malaria_1.sdf and Malaria_2.sdf) in a C:\OriginalSDFs folder. The temporary and aggregated SDFs are saved in a C:\AggregatedSDF folder.
The dictionary used in the String Replace (Dic) node contains one line: "CATALOG_NBR, CATALOG NO.". For the purpose of this explanation, I saved it in the same C:\OriginalSDFs folder.
The resulting All.sdf will display errors if you try to open it with an SDF Reader. This is due to the extra line feeds inserted by the CSV Writer.
Please do not hesitate to let me know if you think of a different approach.
I've got the workflow, just haven't had a chance to look at it yet, sorry. Will try to get to it tomorrow.
If I understand correctly, you are trying to build up a large SDF without aggregating the entire thing at once in KNIME?
I have exactly the same problem. It would be really useful for the SDF Writer node to have options similar to those of the CSV Writer for what to do if the file exists (i.e. fail, overwrite or append). I can't imagine this would be a hugely difficult modification to make.
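Until the SDF Writer has an append option, one possible workaround is to let each loop iteration write its own SDF and join the pieces outside KNIME. An SD file is just a sequence of records, each ending with a "$$$$" line, so the pieces can be concatenated directly. The following is only a minimal Python sketch of that idea, not something taken from the workflow in this thread. The function name concatenate_sdfs and the per-iteration file names are hypothetical.

```python
import shutil


def concatenate_sdfs(part_paths, out_path):
    """Write the contents of each part file to out_path, in the order given.

    Files are copied in chunks, so memory use stays small regardless of size.
    """
    with open(out_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
                # Assumption: a part may lack a trailing newline. If so, add
                # one so the next record does not start on the "$$$$" line.
                if src.tell() > 0:
                    src.seek(-1, 2)  # move to the last byte of the part
                    if src.read(1) != b"\n":
                        out.write(b"\n")
```

For example, if each iteration wrote to a hypothetical part_1.sdf, part_2.sdf, ... in C:\AggregatedSDF, passing that list of paths together with C:\AggregatedSDF\All.sdf as out_path would build the aggregated file without KNIME ever holding all the rows at once.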
Hello Steve, you understand correctly.
Can you check the "pending review" queue? I sent you an example with a workflow attached yesterday and I still do not see it here.
In the meantime, I wanted to explore a VBScript to post-process my SDF:
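' Collapse the blank line that the CSV Writer leaves after each "$$$$" delimiter.
Const ForReading = 1
Const ForWriting = 2
Set objFSO = CreateObject("Scripting.FileSystemObject")
' Read the whole aggregated SDF into memory.
Set objFile = objFSO.OpenTextFile("G:\1_COMPOUND_ACQUISITIONs\1_2014\June campaign\KNIME\All.sdf", ForReading)
strText = objFile.ReadAll
objFile.Close
strNewText = Replace(strText, "$$$$" & vbCrLf & vbCrLf, "$$$$" & vbCrLf)
' Write the trimmed text to a new file (True = create the file if it is missing).
Set objFile = objFSO.OpenTextFile("G:\1_COMPOUND_ACQUISITIONs\1_2014\June campaign\KNIME\All_Trimmed.sdf", ForWriting, True)
objFile.Write strNewText
objFile.Close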
It works well on a small file but, as one would expect, hits VBScript's limits on a 5 GB file, since ReadAll loads the entire file into memory.
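One way around the size limit is to stream the file line by line instead of reading it all at once. The sketch below is in Python rather than VBScript and is only an illustration. The names strip_blank_after_delimiter and trim_sdf are hypothetical. It mirrors the Replace call above by dropping one blank line directly after each "$$$$" line. Like the script, it assumes that no record legitimately starts with a blank molecule name line.

```python
def strip_blank_after_delimiter(lines):
    """Yield lines, skipping one blank line that directly follows '$$$$'."""
    after_delimiter = False
    for line in lines:
        stripped = line.rstrip(b"\r\n")
        if after_delimiter and stripped == b"":
            # Extra blank line after the record delimiter: drop it.
            after_delimiter = False
            continue
        after_delimiter = stripped == b"$$$$"
        yield line


def trim_sdf(src_path, dst_path):
    """Stream src_path to dst_path, removing the extra blank lines."""
    # Binary mode keeps the original line endings and avoids encoding issues.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.writelines(strip_blank_after_delimiter(src))
```

Only one line is held in memory at a time, so the size of the input should not matter. Blank lines elsewhere in a record (for example after a data field) are left untouched.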