I assembled a pandas DataFrame of about 200k rows and 10 columns via REST API, which takes about 1 minute.
If I then want to convert it to the KNIME output table by doing knio.output_tables[0] = knio.Table.from_pandas(final_table), it takes forever (I quit after 30 minutes; this only takes seconds when doing it outside of KNIME).
Is this normal? Python nodes seem useless for larger datasets in this case.
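For context, the script boils down to something like this (the endpoint and response handling here are simplified placeholders, not my actual code):
import knime.scripting.io as knio
import pandas as pd
import requests

# Placeholder endpoint - the real script pages through a REST API for about a minute
response = requests.get("https://api.example.com/records")
final_table = pd.DataFrame(response.json())  # ~200k rows, 10 columns

# This is the line that never finishes inside KNIME
knio.output_tables[0] = knio.Table.from_pandas(final_table)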
that is a bit odd - can you share some more details?
I tried to replicate it with some sample data, including timing logs (you should be able to copy & paste this into a Python Script node):
import knime.scripting.io as knio
# This example script creates an output table containing randomly drawn doubles using numpy and pandas.
import numpy as np
import pandas as pd
import time
# Function to display timestamps and durations
def log_duration(start_time, step):
    duration = time.time() - start_time
    print(f"{step} took {duration:.4f} seconds")
# Step 1: Start the timer for array creation
start_time = time.time()
# Creating a numpy array with 200k rows and 10 columns
array = np.random.randn(200000, 10)
log_duration(start_time, "Array creation")
# Step 2: Start the timer for DataFrame conversion
start_time = time.time()
# Converting the numpy array to a pandas DataFrame
df = pd.DataFrame(array, columns=[f'Column_{i+1}' for i in range(10)])
log_duration(start_time, "DataFrame conversion")
# Displaying DataFrame structure again for confirmation (print is needed in a script, unlike in a notebook)
print(df.head(), df.shape)
# log conversion
start_time = time.time()
output = knio.Table.from_pandas(df)
log_duration(start_time, "DF to KNIO conversion")
knio.output_tables[0] = output
Here’s the console output:
Array creation took 0.0255 seconds
DataFrame conversion took 0.0010 seconds
DF to KNIO conversion took 0.0600 seconds
Which version are you running? Could you check whether the above example also takes that long for you?
Hmm. I changed the example to include string / int columns as well:
import knime.scripting.io as knio
# This example script creates an output table containing a mix of double, int, and string columns using numpy and pandas.
import numpy as np
import pandas as pd
import time
# Function to display timestamps and durations
def log_duration(start_time, step):
    duration = time.time() - start_time
    print(f"{step} took {duration:.4f} seconds")
# Step 1: Start the timer for array creation with different data types
start_time = time.time()
# Creating a numpy array with 200k rows for numeric columns
numeric_array = np.random.randn(200000, 5) # 5 columns of type double (float64)
# Creating integer columns
int_array = np.random.randint(0, 100, size=(200000, 3)) # 3 columns of type int
# Creating string columns (randomly generated strings of length 5)
string_array = np.random.choice(['A', 'B', 'C', 'D', 'E'], size=(200000, 2)) # 2 columns of type string
log_duration(start_time, "Array creation for mixed types")
# Step 2: Start the timer for DataFrame conversion
start_time = time.time()
# Converting the arrays into a pandas DataFrame
df_mixed = pd.DataFrame(
    np.hstack([numeric_array, int_array, string_array]),
    columns=[f'Float_Column_{i+1}' for i in range(5)] +
            [f'Int_Column_{i+1}' for i in range(3)] +
            [f'String_Column_{i+1}' for i in range(2)],
)
# Restoring appropriate types (np.hstack upcasts everything to strings, including the float columns)
df_mixed[[f'Float_Column_{i+1}' for i in range(5)]] = df_mixed[[f'Float_Column_{i+1}' for i in range(5)]].astype(float)
df_mixed[[f'Int_Column_{i+1}' for i in range(3)]] = df_mixed[[f'Int_Column_{i+1}' for i in range(3)]].astype(int)
df_mixed[[f'String_Column_{i+1}' for i in range(2)]] = df_mixed[[f'String_Column_{i+1}' for i in range(2)]].astype(str)
log_duration(start_time, "DataFrame conversion for mixed types")
# log conversion
start_time = time.time()
output = knio.Table.from_pandas(df_mixed)
log_duration(start_time, "DF to KNIO conversion")
knio.output_tables[0] = output
A little slower, but not to the extent that you are experiencing:
Array creation for mixed types took 0.0210 seconds
DataFrame conversion for mixed types took 0.8026 seconds
DF to KNIO conversion took 0.2821 seconds
Is there anything else going on in your Python code?
Perhaps I am misunderstanding, but wouldn’t that require the Python script to finish running in the first place? This is where I am generating my data.
Not sure what is going on - some more things to look into:
Go to your knime.ini (in the root folder of your KNIME installation) and look for how much RAM is allocated. The default should say -Xmx2048m in the .ini file.
If it is not the default, it could be e.g. -Xmx4096m or something like -Xmx4g … the number indicates how much RAM KNIME may use. If it is the default, change it to roughly half of your available RAM - e.g. if you have 8 GB of RAM, change it to -Xmx4g.
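For illustration, on a machine with 8 GB of RAM the relevant line in knime.ini would change from
-Xmx2048m
to
-Xmx4g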
Go to Preferences => General and check “Show Heap Status”.
If the box is already checked and you don’t see the bar at the bottom of the screen, you may have to uncheck it, save & apply, then check it again.
When you then run your Python script, you can observe how that number changes / increases - maybe something maxes out and the rest of the procedure then slows down substantially.
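The heap bar only shows the Java side; if you also want to watch the Python process itself, a minimal sketch along these lines could help (assuming psutil is installed in the environment the node uses, which is not guaranteed):
import knime.scripting.io as knio
import numpy as np
import pandas as pd
import psutil  # assumption: available in the configured Python environment

def log_rss(step):
    # Resident set size of the Python process, in megabytes
    rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
    print(f"{step}: process RSS = {rss_mb:.0f} MB")

# Stand-in data for your real DataFrame
df = pd.DataFrame(np.random.randn(200000, 10), columns=[f"c{i}" for i in range(10)])

log_rss("before from_pandas")
knio.output_tables[0] = knio.Table.from_pandas(df)
log_rss("after from_pandas")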
Apologies, I did not understand that “via REST API” meant you are using KNIME Server / Hub. I thought you were pulling the data from a REST API, converting it to pandas, and then to a knio.Table.
You are right, then it does not make sense to check your local installation; rather, the utilisation of cores / RAM needs to be checked on the server / Hub side…
Afraid I may not be the best person to guide you on that…
@Zvereec1 is it different on your local machine, and what is the RAM config on the server? Also: are you sure it is Python and not the transfer from the server to your local machine?
No, you understood correctly - that is exactly what I am doing, but I was doing it on a server. Either way, I think I solved it: when I clear my temporary array in Python, the node runs within minutes. Perhaps KNIME is trying to clean that up somehow?
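In case it helps others, the fix amounts to something like this (temp_rows is an illustrative name standing in for my intermediate array):
import gc
import knime.scripting.io as knio
import pandas as pd

# Stand-in for the intermediate data accumulated from the REST calls
temp_rows = [{"value": i} for i in range(200000)]
final_table = pd.DataFrame(temp_rows)

# Clearing the temporary structure before the conversion is what made
# the node finish within minutes instead of hanging
del temp_rows
gc.collect()

knio.output_tables[0] = knio.Table.from_pandas(final_table)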