problem in CSV writer produce large file in size than input

ah123 · June 11, 2021, 4:14pm

CSV writer node produces a file with about more than 2 G in size for (11129 rows, 3 columns) However the input only 90M excel (11129,2 column)

the nodes I uses :
1- CSV reader
2- Molecule typecast
3- RDKITfingerprint
4-fingerprint similarity between molecules themselves (list consist of 11129 elements but in one column)
5- column filter
6- column rename
7- CSV writer

is there anyway to reduce the size of the produced excel file ?

elsamuel · June 11, 2021, 4:47pm

Welcome to the forum @ah123.

Are you calculating a matrix of pair-wise similarities between all 11129 molecules?
What columns are you filtering out?

Can you share your workflow so that we can open it and see what’s happening?

ah123 · June 11, 2021, 5:36pm

@elsamuel , thanks for your reply
I calculated the similarity between the molecules themselves so I chose Matrix in the finger print similarity between all 11129

the column filtering I remove columns to only remain the Tanimoto similarity and Row ID

KNIME_project.knwf (18.1 KB)

elsamuel · June 11, 2021, 5:59pm

If you generated and kept an 11,129 x 11,129 matrix of similarity scores, it’s not surprising that the resulting file size is much larger than the starting file.

ah123 · June 11, 2021, 6:24pm

the result consists of 2columns ( the column of similarity consist of list of collection of 11129 numbers of similarity) the matrix itself still ( 11129*2)

bruno29a · June 11, 2021, 6:29pm

Hi @ah123 , are you able to include the csv file too?

elsamuel · June 11, 2021, 6:58pm

the result consists of 2columns ( the column of similarity consist of list of collection of 11129 numbers of similarity) the matrix itself still ( 11129*2)

I don’t really understand this explanation.

Can you share a screenshot of the output of the Fingerprint Similarity node and the Column Rename nodes?

As far as I can tell, you set up the Fingerprint Similarity node to calculate pair-wise similarities between each molecule and every other molecule, creating an 11,129 x 11,129 matrix of similarity scores.

At the end of this workflow, each cell in the similarity column contains 11,129 values.

When you export this to a csv file, it will be massive.

ah123 · June 11, 2021, 7:23pm

the Tanimoto column consists of a list of 11197 elements in each row

ah123 · June 11, 2021, 7:25pm

Hi @bruno29a
no problem, I tried but here not allowable with CSV files

takbb · June 11, 2021, 7:47pm

Hi @ah123, if you rename the file to .txt you should then be able to upload it.

elsamuel · June 11, 2021, 8:23pm

Like I said before, the Tanimoto column consists of 11,000+ cells with 11,000+ values per cell, each with a dozen or more digits…

If you try to save this as a text file, the file size will be very large.

SimonS · June 12, 2021, 6:14am

Excel files are compressed which can also explain why the input file is much smaller. Try writing it out with the Excel Writer. If the table contains many similar values, the written Excel file may also be much smaller than the csv file.

Best,
Simon

takbb · June 12, 2021, 7:52am

Hi @ah123, I’m in agreement with @elsamuel. I don’t know the nodes in question but if from your input set of 11200 rows you are generating a table where every element is linked to every other element, then you have a column consisting of 11,200 x 11,200 elements. If each of those values has a length of say 20 characters (which is about what it appears in the screenshot) , then you are asking the csv writer to produce an output which will be 11,200 x 11,200 x 20 bytes in size (plus the other associated data).

So after that it’s a simple equation…

11,200 x 11,200 x 20 = 2,508,800,000 = over 2GB

Given the file size you quoted, that doesn’t to me indicate any problem in the csv writer, as it appears to be doing exactly what is asked of it. The only way to make it write a smaller csv file is to reduce the data set that it is being asked to write…

Yes, writing to excel format instead might reduce the size a bit as an XLSX is actually ZIPped XML, but you’ll probably just be moving the problem as at some point Excel would have to deal with it and I don’t think it would be pretty…

So I think fundamentally you need to decide if what you are asking it to write is what you actually want it to write.

system · December 11, 2021, 7:52pm

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.