Python node takes ages to transfer from pd dataframe to output table

I assembled a pandas DataFrame of about 200k rows and 10 columns via REST API, which takes about 1 minute.

If I then want to convert it to the KNIME output table by doing knio.output_tables[0] = knio.Table.from_pandas(final_table), it takes forever (I quit after 30 minutes; this only takes seconds when doing it outside of KNIME).

Is this normal? If so, Python nodes seem unusable for working with larger datasets.

Hey there,

that is a bit odd - can you share some more details?

I tried to replicate with some sample data, including timing logs (you should be able to copy & paste this into a Python Script node):

import knime.scripting.io as knio

# This example script creates an output table of randomly drawn doubles
# using numpy and pandas, logging the duration of each step.

import time

import numpy as np
import pandas as pd

# Function to display durations
def log_duration(start_time, step):
    duration = time.time() - start_time
    print(f"{step} took {duration:.4f} seconds")

# Step 1: Start the timer for array creation
start_time = time.time()

# Creating a numpy array with 200k rows and 10 columns
array = np.random.randn(200000, 10)
log_duration(start_time, "Array creation")

# Step 2: Start the timer for DataFrame conversion
start_time = time.time()

# Converting the numpy array to a pandas DataFrame
df = pd.DataFrame(array, columns=[f'Column_{i+1}' for i in range(10)])
log_duration(start_time, "DataFrame conversion")

# Displaying DataFrame structure for confirmation
print(df.head(), df.shape)

# Step 3: Time the DataFrame -> KNIME table conversion
start_time = time.time()

output = knio.Table.from_pandas(df)
log_duration(start_time, "DF to KNIO conversion")

knio.output_tables[0] = output

Here’s the console output:

Array creation took 0.0255 seconds
DataFrame conversion took 0.0010 seconds
DF to KNIO conversion took 0.0600 seconds

Which version are you running? Could you check whether the above example also takes that long?

That script works fine:

Array creation took 0.0718 seconds
DataFrame conversion took 0.0010 seconds
DF to KNIO conversion took 0.1991 seconds

I am running KNIME 5.2.5.

The DataFrame that I created contains both strings and ints, not just doubles. Could that have something to do with it?

Hmm. I changed the example to include strings / ints as well:

import knime.scripting.io as knio

# This example script creates an output table of mixed types (doubles,
# ints, strings) using numpy and pandas, logging the duration of each step.

import time

import numpy as np
import pandas as pd

# Function to display durations
def log_duration(start_time, step):
    duration = time.time() - start_time
    print(f"{step} took {duration:.4f} seconds")

# Step 1: Start the timer for array creation with different data types
start_time = time.time()

# Creating a numpy array with 200k rows for numeric columns
numeric_array = np.random.randn(200000, 5)  # 5 columns of type double (float64)

# Creating integer columns
int_array = np.random.randint(0, 100, size=(200000, 3))  # 3 columns of type int

# Creating string columns (randomly chosen from five single-character options)
string_array = np.random.choice(['A', 'B', 'C', 'D', 'E'], size=(200000, 2))  # 2 columns of type string

log_duration(start_time, "Array creation for mixed types")

# Step 2: Start the timer for DataFrame conversion
start_time = time.time()

# Converting the arrays into a pandas DataFrame
# (np.hstack upcasts the mixed arrays to a common string dtype,
# so every column must be re-cast below)
df_mixed = pd.DataFrame(np.hstack([numeric_array, int_array, string_array]),
                        columns=[f'Float_Column_{i+1}' for i in range(5)] +
                                [f'Int_Column_{i+1}' for i in range(3)] +
                                [f'String_Column_{i+1}' for i in range(2)])

# Ensuring appropriate types
df_mixed[[f'Float_Column_{i+1}' for i in range(5)]] = df_mixed[[f'Float_Column_{i+1}' for i in range(5)]].astype(float)
df_mixed[[f'Int_Column_{i+1}' for i in range(3)]] = df_mixed[[f'Int_Column_{i+1}' for i in range(3)]].astype(int)
df_mixed[[f'String_Column_{i+1}' for i in range(2)]] = df_mixed[[f'String_Column_{i+1}' for i in range(2)]].astype(str)

log_duration(start_time, "DataFrame conversion for mixed types")

# Step 3: Time the DataFrame -> KNIME table conversion
start_time = time.time()

output = knio.Table.from_pandas(df_mixed)
log_duration(start_time, "DF to KNIO conversion")

knio.output_tables[0] = output

A little slower, but not to the extent that you are experiencing:

Array creation for mixed types took 0.0210 seconds
DataFrame conversion for mixed types took 0.8026 seconds
DF to KNIO conversion took 0.2821 seconds

Is there anything else going on in your Python code?

@Zvereec1 you could try storing the data in Parquet files inside the Python node and then importing the data back. That might give an idea of what is going on.

Hm yes that cannot be it then.

I have a temporary list that I keep appending entries to:

entry = pd.DataFrame.from_dict({…
…
qcstemp.append(entry)

And at the end I concatenate all those entries into a single pandas DataFrame:
qcsresults = pd.concat(qcstemp)

The dimensions of which are: [199111 rows x 20 columns]

That all runs fine though; it's just the final line where it gets stuck:

knio.output_tables[0] = knio.Table.from_pandas(qcsresults)
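
For reference, a self-contained sketch of that pattern (the column names and values here are made up; the real entries come from the REST API, and the real workflow accumulates ~199k of them):

import pandas as pd
import knime.scripting.io as knio

# Hypothetical stand-in for the per-entry DataFrames built from the REST API
qcstemp = []
for i in range(1000):
    entry = pd.DataFrame.from_dict({"id": [i], "value": [i * 0.5]})  # placeholder columns
    qcstemp.append(entry)

# Combine all entries into one DataFrame at the end
qcsresults = pd.concat(qcstemp)

# This is the line that hangs in the original workflow
knio.output_tables[0] = knio.Table.from_pandas(qcsresults)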

Perhaps I am misunderstanding, but wouldn't that require the Python script to finish running in the first place? That is where I am generating my data.

Hmm.

Not sure what is going on - some more things to look into:

  1. Go to your knime.ini (in the root folder of your KNIME installation) and look for how much RAM is allocated. The default should say -Xmx2048m in the .ini file.
    If it is not the default it could be e.g. -Xmx4096m or something like -Xmx4g … the number always indicates how much RAM KNIME may use. If it is still the default, change it to roughly half of your available RAM, e.g. -Xmx4g if you have 8 GB of RAM (see the knime.ini example below).
  2. Go to Preferences => General and check “Show Heap Status”.

That should show a bar at the bottom of your screen indicating how much memory is being “utilised”.

If the box is already checked and you don’t see the bar at the bottom of the screen, you may have to uncheck it, save & apply, then check it again.

When you then run your Python script you can observe how that number changes / increases; maybe something maxes out and the remaining procedure then slows down substantially.
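
For illustration, the relevant part of knime.ini might look like this (a minimal excerpt, assuming a machine with 16 GB of RAM; your file will contain more entries than shown):

-vmargs
-Xmx8g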


@Zvereec1 you would save the DataFrame from within the Python node as a Parquet file and then read it back in KNIME.
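
A minimal sketch of that idea, assuming pyarrow (or fastparquet) is available in the node's Python environment; the path and DataFrame here are placeholders:

import pandas as pd

# Placeholder DataFrame standing in for qcsresults from the workflow
qcsresults = pd.DataFrame({"id": range(1000), "label": ["x"] * 1000})

# Write the DataFrame to disk from inside the Python Script node
qcsresults.to_parquet("/tmp/qcsresults.parquet")  # placeholder path

# The file can then be read back independently, e.g. with KNIME's
# Parquet Reader node, or checked from plain Python:
df_check = pd.read_parquet("/tmp/qcsresults.parquet")
print(df_check.shape)

If writing the Parquet file is fast but from_pandas still hangs, that would point at the table handoff rather than the DataFrame itself.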

I am running the workflow on the corporate server though, so the allocated RAM of my local installation should not matter!

Apologies, I did not understand that by “via rest api” you meant you are using KNIME Server / Hub. I thought you were pulling the data from a REST API, converting it to pandas, and then to knio.Table.

You are right, then it does not make sense to check your local installation; rather, the utilisation of cores / RAM needs to be checked on the server / Hub side…

Afraid I may not be the best person to guide you on that…


I am not familiar with Parquet files; I am just trying to find out why the DataFrame → KNIME output table conversion is taking so long.

@Zvereec1 is it different on your local machine, and what is the RAM configuration on the server? Also: are you sure it is Python and not the transfer from the server to your local machine?

No, you understood correctly, that is exactly what I am doing, but on a server. Either way, I think I solved it: when I clear my temporary list in Python, the node runs within minutes. Perhaps KNIME is trying to clean that up somehow?
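
For anyone landing here later, a minimal sketch of that fix, assuming the temporary list is cleared after pd.concat and before the conversion (names are placeholders matching the snippets above):

import pandas as pd
import knime.scripting.io as knio

# Placeholder for the list of per-entry DataFrames built from the REST API
qcstemp = [pd.DataFrame({"id": [i], "value": [i * 0.5]}) for i in range(1000)]

# Combine all entries into a single DataFrame
qcsresults = pd.concat(qcstemp)

# Free the temporary list before handing the DataFrame to KNIME;
# this is what made the node finish within minutes
qcstemp.clear()

knio.output_tables[0] = knio.Table.from_pandas(qcsresults)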


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.