Same code, different results in Jupyter notebook and Python Script node

kevin_zhao · June 8, 2024, 1:30pm

The same Python code produces different results. I want to compute a cumulative sum within each run of consecutive rows that share the same category.
The result in Jupyter notebook is correct, but it is incorrect in the KNIME Python Script node. It seems to be caused by the missing value that appears when shifting down by one row. Is there any reason for this? Any tips are greatly appreciated.
(Python 3.11, KNIME 5.2)

import pandas as pd
data = {'Category': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'A', 'A'],
        'Value': [1, 2, 3, 5, 1, 2, 1, 8, 2]}
df = pd.DataFrame(data)
df['Cumulative Sum'] = df.groupby((df['Category'] != df['Category'].shift(1)).cumsum())['Value'].cumsum()
print(df)
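For readers following along, the one-liner above can be split into its steps. This is only a sketch of the same logic with intermediate names (`changed`, `run_id`) that are not in the original post:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B", "C", "C", "A", "A"],
    "Value": [1, 2, 3, 5, 1, 2, 1, 8, 2],
})

# Flag rows whose category differs from the previous row. The first row is
# compared against the missing value that shift(1) introduces.
changed = df["Category"] != df["Category"].shift(1)

# A running count of the flags gives one id per run of consecutive equal rows.
run_id = changed.cumsum()

# Cumulative sum of Value, restarted at the beginning of every run.
df["Cumulative Sum"] = df.groupby(run_id)["Value"].cumsum()
print(df)
```

The second block of "A" rows gets its own run id, so its sum restarts instead of continuing from the first block.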

tomljh · June 8, 2024, 2:49pm

Could it be that the Python version used in KNIME differs from the one used in Jupyter, resulting in different default pandas behavior?

kevin_zhao · June 8, 2024, 3:40pm

In the Python Script node, the version is 3.11.6 (Python Integration environment), and something seems to go wrong there.
When I use Anaconda to create a virtual environment with the same version, pass it to the node with the Conda Environment Propagation node, and re-execute the same code, the result is correct.

It is strange: with the same versions (Python 3.11.6, pandas 2.0.3) and only the environment being different, I get different results.

Thank you for your suggestion.

kevin_zhao · June 9, 2024, 1:17am

I will rephrase the question, as it might not have been clear before.
Everything works when the DataFrame is created from a dictionary inside the Python Script node. But when the table is created with the Table Creator node and then passed to the Python Script node, the result is wrong. Maybe the Python Script node handles the missing value differently.

So I had to change my code as shown below, and then it works:

import knime.scripting.io as knio
import pandas as pd

df = knio.input_tables[0].to_pandas()

# Fill the missing value in the first row with True
df["group_id"] = (df['Category'] != df['Category'].shift(1)).fillna(True).cumsum()

df['Cumulative Sum'] = df.groupby("group_id")['Value'].cumsum()

knio.output_tables[0] = knio.Table.from_pandas(df)

|— | — | — | —|

|A | 1 | | |
|A | 2 | | |
|A | 3 | | |
|A | 8 | | |
|A | 2 | | |
|B | 5 | | |
|B | 1 | | |
|C | 2 | | |
|C | 1 | | |
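One possible explanation, offered as an assumption rather than something confirmed in this thread: if `to_pandas()` hands over `Category` as a nullable `string` dtype instead of plain `object`, then comparing against the shifted column yields a missing value in the first row rather than `True`. The sketch below reproduces that difference with pandas alone; the `dtype="string"` series stands in for the KNIME input, and `.astype(bool)` is an extra step not in the code above.

```python
import pandas as pd

categories = ["A", "A", "A", "B", "B", "C", "C", "A", "A"]
values = [1, 2, 3, 5, 1, 2, 1, 8, 2]

# Plain object dtype, comparable to a DataFrame built from a dict of lists.
obj = pd.Series(categories, dtype=object)
# Nullable string dtype (assumption: similar to what the node receives).
nullable = pd.Series(categories, dtype="string")

obj_changed = obj != obj.shift(1)
nullable_changed = nullable != nullable.shift(1)

print(obj_changed.iloc[0])       # first row counts as "changed"
print(nullable_changed.iloc[0])  # first row is a missing value

# Filling the missing flag restores the intended grouping key.
df = pd.DataFrame({"Category": nullable, "Value": values})
run_id = nullable_changed.fillna(True).astype(bool).cumsum()
df["Cumulative Sum"] = df.groupby(run_id)["Value"].cumsum()
print(df)
```

Printing `df.dtypes` in both nodes would show whether the `Category` column really differs in this way.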

mlauber71 · June 9, 2024, 7:27am

Maybe you can share the whole workflow. I do not see any missing values in your initial data, and the Python code should behave the same in both cases.

kevin_zhao · June 10, 2024, 1:58pm

Sorry for not expressing the problem clearly. I have taken a screenshot and uploaded the workflow. Thank you all for your attention and answers.

Cumulative_calculation.knwf (10.6 KB)

First Python Script

import knime.scripting.io as knio
import pandas as pd

# Create DataFrame

df = knio.input_tables[0].to_pandas()

df['Cumulative Sum'] = df.groupby((df['Category'] != df['Category'].shift(1)).cumsum())['Value'].cumsum()

knio.output_tables[0] = knio.Table.from_pandas(df)

Second Python Script

import knime.scripting.io as knio
import pandas as pd
data = {'Category': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'A', 'A'],
        'Value': [1, 2, 3, 5, 1, 2, 1, 8, 2]}
df = pd.DataFrame(data)
df['Cumulative Sum'] = df.groupby((df['Category'] != df['Category'].shift(1)).cumsum())['Value'].cumsum()
knio.output_tables[0] = knio.Table.from_pandas(df)

mlauber71 · June 10, 2024, 5:28pm

@kevin_zhao the whole thing is something of a mystery. The only difference I can find is that the column type is int32 when the data comes in from KNIME, while it is int64 when the DataFrame is created via pandas inside the Python node. Changing the type afterwards does not change the outcome. So this seems to be some Python / pandas problem.
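The observation that the integer width does not matter can be checked outside KNIME. A minimal sketch, assuming the same sample data as in the question and using plain pandas only (the names `as_int64` and `as_int32` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B", "C", "C", "A", "A"],
    "Value": [1, 2, 3, 5, 1, 2, 1, 8, 2],
})

# Grouping key: one id per run of consecutive equal categories.
run_id = (df["Category"] != df["Category"].shift(1)).cumsum()

# Same computation on the Value column with two integer widths.
as_int64 = df["Value"].astype("int64").groupby(run_id).cumsum()
as_int32 = df["Value"].astype("int32").groupby(run_id).cumsum()

print(as_int64.tolist() == as_int32.tolist())
```

Since both widths give the same sums here, the dtype of the `Category` column may be the more interesting thing to compare between the two nodes.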

system · September 8, 2024, 5:28pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.