This may be an easy question.
Will the release of Pandas 2.0 have an impact on Python scripting nodes?
The next major release of Pandas (version 2.0) will introduce some significant changes. The most significant appears to be the introduction of PyArrow-backed dtypes in addition to the traditional NumPy-backed dtypes.
The Python nodes in KNIME have several adaptations that allow Pandas DataFrames to access the KNIME PyArrow data store (if I understand correctly, though I haven’t studied the architecture fully). Pandas 2.0 appears to offer some benefits for KNIME, but it may also introduce risk and incompatibility.
Has anyone done an assessment of the impact of Pandas 2.0 on KNIME? If so, what do we need to be aware of?
Very good question. No, we haven’t done a proper assessment yet. But since Pandas and PyArrow were both created by Wes McKinney, and many of the developers actually work on both projects at the same time, I hope the Pandas ↔ PyArrow conversion will become even faster without us having to change a lot.
KNIME tables are stored using Arrow under the hood, and when you use KNIME’s to_pandas() method we use PyArrow’s conversion method to produce a Pandas DataFrame. I expect this to benefit from the improvements in Pandas out of the box. For special KNIME types (e.g. geospatial data) we have our own Pandas ExtensionTypes, which should not be affected by this change.
One nice feature is that the PyArrow→Pandas conversion can use nullable types by default. Right now, we don’t use those when coming from KNIME (we have, e.g., a sentinel parameter in from_pandas() that allows replacing missing values in integer columns). We can add compatibility for that in the future, but only when users provide a special flag, as otherwise existing Python scripts might break.
So much for my current insight. If you have more ideas about what could be useful → keep them coming!
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.