Python scripts in Loops

Hi there,

I use a Python node for some visualisations inside a loop with many iterations (around 70,000). Even though the amount of data per iteration is quite small (fewer than 200 rows, 4 columns), the node always gets stuck for about a second at 30%.

Could it be that the libraries are reloaded on every call? Is that the root cause of the slow execution? Is there a way to preload all needed Python libraries at the start of a workflow, or in KNIME itself?

from io import BytesIO
import seaborn as sns
import matplotlib.pyplot as plt

# Some numerical stuff
df = input_table
df["x"] = df.reset_index().index - 10
df["hue"] = df["hue"].astype(str)


# plotting
f, ax = plt.subplots(2, 1, figsize=(8, 6), sharex=True)

sns.scatterplot(data=df, x="x", y="y", hue="hue", palette="tab10", ax=ax[0])
ax[0].axvline(x=0, color='k', lw=1, label='Meas Start', linestyle=":")

sns.lineplot(data=df, x="x", y="y_line", ax=ax[1], label="Temperature")
ax[1].axvline(x=0, color='k', lw=1, label='Meas Start', linestyle=":")
ax[1].legend()

# write it into the buffer
buffer = BytesIO()
plt.savefig(buffer, format='png')
# The output is the content of the buffer
output_image = buffer.getvalue()

@ActionAndi I think if you wrap a loop around a Python node, the node will start a new Python session on every iteration, which takes some time. You might only shorten that to some extent by creating a small conda environment or using the bundled one (How to Set Up Your Python Extensions | KNIME), and if data is being transferred, make sure to use the new columnar backend (Data Transfer between KNIME & Python Just Got Faster | KNIME).
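
To get a feeling for how much of that per-iteration second is plain import time, you can time the imports in a fresh Python session (a minimal sketch; the numbers depend entirely on your machine and environment):

import time

start = time.perf_counter()
import seaborn            # imported here only to measure the load time
import matplotlib.pyplot
elapsed = time.perf_counter() - start
print(f"cold imports took {elapsed:.2f} s")

# Note: this measures only the first import in a session; repeated imports
# are served from sys.modules and are essentially free.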

Parallelisation might help, but probably only to a certain extent, and I have not tried that with Python (Parallel Chunk Start – KNIME Community Hub).

Other than that, you might want to consider doing the loop within Python itself, so the setup only occurs once. You would then export the PNG files directly from within Python, along the lines of the sketch below:
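
A minimal sketch of that idea using knime.scripting.io (the grouping column "item_id", the x/y column names and the output folder are hypothetical placeholders; adjust them to your data):

import knime.scripting.io as knio
import matplotlib.pyplot as plt

df = knio.input_tables[0].to_pandas()

# Imports and session startup are paid once, not once per iteration.
# "item_id" is a hypothetical column identifying each of the ~70,000 items.
for name, group in df.groupby("item_id"):
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(group["x"], group["y"])
    fig.savefig(f"/path/to/plots/{name}.png")  # hypothetical output folder
    plt.close(fig)  # free the figure to keep memory flat across iterations

knio.output_tables[0] = knio.Table.from_pandas(df)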

Also: with such a large number of items it might make sense to store the ones you have already processed in a table, so you can restart the process if something goes wrong.

@ActionAndi a solution could look something like this:

kn_example_python_loop_graphic_restart.knwf (164.6 KB)

The script would look like this:

import knime.scripting.io as knio

# Loop over all rows, export one PNG per row, and keep a parquet
# checkpoint of the rows that have already been processed.

import pandas as pd

import matplotlib.pyplot as plt

var_data_path = knio.flow_variables['context.workflow.data-path']

# Load your data into a Pandas DataFrame
# df = pd.read_csv(var_data_path + "your_data.csv")

df = knio.input_tables[0].to_pandas()

# Load the parquet file containing previously processed rows
try:
    processed_df = pd.read_parquet(var_data_path + "output.parquet")
except FileNotFoundError:
    processed_df = pd.DataFrame(columns=["row_id", "name_column", "timestamp"])


# Loop through each row in the DataFrame
for index, row in df.iterrows():

    # Check if this row has already been processed
    if any(processed_df["row_id"] == index):
        continue

    # Create a plot based on the data in this row
    x = row["x_values"]
    y = row["y_values"]
    plt.plot(x, y)

    # Set the file name based on a column in the DataFrame
    file_name = f"{row['name_column']}.png"

    # Save the plot to disk with the file name
    plt.savefig(var_data_path + file_name)

    # Save the row ID to the parquet file
    new_row = pd.DataFrame({"row_id": [index], "name_column": [row["name_column"]],
                            "timestamp": [pd.Timestamp.now()]})
    processed_df = pd.concat([processed_df, new_row], ignore_index=True)

    # Save the processed rows to the parquet file
    processed_df.to_parquet(var_data_path + "output.parquet", index=False)

    # Close the plot
    plt.close()

knio.output_tables[0] = knio.Table.from_pandas(df)
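
Note that the parquet checkpoint is rewritten after every row. That is deliberate: if the workflow is interrupted, at most one row has to be redone and the loop resumes where it left off. If the per-iteration overhead becomes noticeable, you could write the checkpoint only every N rows instead.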


Thank you so much! That did the trick!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.