Adding column to dataframe takes 8 Minutes

mereep · October 23, 2018, 1:03pm

Hello community,

I was trying to use KNIME for some basic image processing.
I loaded the MNIST training data via ImageReader. To append the correct labels, I used a Python node with very simple code:

import os

# Copy input to output
output_table = input_table.copy()

# Use the name of each image's parent folder as its label
labels = []
for path in input_table['Source']:
    labels.append(path.split(os.sep)[-2])

output_table.loc[:, 'label'] = labels

This code needs about 8(!) minutes to execute and creates tons of files in the temp directory (why?).
I have attached the corresponding lines of the log file: python_script_log.log (58.3 KB)
I had to crop most of the lines (they all show the same output) because of the 4 MB upload limit.

What could be going wrong here, and how can I overcome it? Is the “copy” instruction at the beginning to blame? Is it necessary at all?

MarcelW · October 23, 2018, 2:10pm

Hi mereep,

How many images do you send to Python? The transferred data is copied at least once per direction (Java to Python, Python to Java). The transfer also includes converting the KNIME internal image format to TIFF, which is the image format we use on the Python side (and vice versa). The generated temp files are needed to store the converted images on the way back to KNIME.
In any case, I’d suggest performing as much (pre-)processing as possible using native KNIME nodes.

Marcel

christian.birkhold · October 23, 2018, 2:20pm

One more addition: If you can avoid the copy, do so. You don’t need to create an extra copy, as the input data to Python is already a copy of the data coming from the KNIME table.
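
A minimal sketch of the script without the extra copy might look like the following. The sample DataFrame and its paths are made up so that the snippet runs outside KNIME; inside the Python node, `input_table` is already provided and only the last two statements would be needed. Note that this only saves the in-Python copy and would not remove the image transfer cost described above.

```python
import os
import pandas as pd

# Stand-in for the table KNIME provides as `input_table`
# (hypothetical sample paths, for illustration only).
input_table = pd.DataFrame({
    "Source": [
        os.path.join("mnist", "training", "3", "img_0001.png"),
        os.path.join("mnist", "training", "7", "img_0002.png"),
    ]
})

# The label is the name of the folder that directly contains the image.
input_table["label"] = [
    os.path.basename(os.path.dirname(path)) for path in input_table["Source"]
]

# Hand the same DataFrame back to KNIME; no extra copy is made.
output_table = input_table
```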

mereep · October 23, 2018, 2:31pm

Thanks for your answers.

Actually, I don’t want to touch the images in Python at all at the moment. I am only interested in the table column that holds the file path (Source in this case), from which I extract the folder name as the label (1, 2, 3, etc.). That’s working fine (except for the processing overhead mentioned above).

So as I see it, the problem is that the image is converted to some Python-specific format in the process, even though I don’t need it. To solve the problem, I should basically fork the table into the part that contains the image and the part that contains only the columns Python needs, and rejoin the two after the Python node.
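
In KNIME itself the split and rejoin would be done with native nodes (for example Column Filter and Joiner), but the data flow can be sketched in pandas. The table contents, the `Image` column name, and the row IDs below are hypothetical placeholders:

```python
import os
import pandas as pd

# Hypothetical stand-in for the full KNIME table; "Image" holds placeholder bytes.
full_table = pd.DataFrame(
    {
        "Source": [
            os.path.join("mnist", "training", "3", "a.png"),
            os.path.join("mnist", "training", "7", "b.png"),
        ],
        "Image": [b"<image bytes>", b"<image bytes>"],
    },
    index=["Row0", "Row1"],
)

# Fork: only the lightweight path column would be sent to the Python node.
paths_only = full_table[["Source"]]

# Inside the Python node: derive the label from the parent folder name.
labels = paths_only["Source"].map(lambda p: os.path.basename(os.path.dirname(p)))
labels = labels.rename("label").to_frame()

# Rejoin: attach the labels to the untouched image column via the row index.
rejoined = full_table.join(labels)
```

Because the image column never passes through the Python node in this layout, no image conversion should be needed for it.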

Maybe this heavy lifting for Python should only happen if I actually touch the image, meaning the conversion would be evaluated lazily, only when I actually access the field?

MarcelW · October 26, 2018, 4:23pm

Yes, unfortunately splitting the table and rejoining it after the Python node is the only possible approach at the moment.

Lazy conversion would certainly be useful. We have already thought about something similar. However, it’s not a trivial feature to add, so I can’t promise that it will ever make it into a release.
