Error with pandas in the Labs Python Integration

Hi all,

I am trying to run a script using the Labs Python Integration node and I encounter an error that does not happen when I execute the code in pure Python.

Code:

# the column 'last' is a Local Date in KNIME
dfc[fd_revw] = 3
dfc['expected'] = dfc['last'] + pd.to_timedelta(dfc['estimate'] * 365, unit='D')

I get the following exception:

Executing the Python script failed: Traceback (most recent call last):
  File "<string>", line 18, in <module>
  File "C:\Users\md45qh\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\ops\common.py", line 69, in new_method
    return method(self, other)
  File "C:\Users\md45qh\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\arraylike.py", line 92, in __add__
    return self._arith_method(other, operator.add)
  File "C:\Users\md45qh\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\series.py", line 5526, in _arith_method
    result = ops.arithmetic_op(lvalues, rvalues, op)
  File "C:\Users\md45qh\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\ops\array_ops.py", line 218, in arithmetic_op
    res_values = op(left, right)
TypeError: unsupported operand type(s) for +: 'KnimePandasExtensionArray' and 'TimedeltaArray'

Does anyone know what might be going on?

Another error occurs when performing a join using PyArrow, which also seems to be related to a KNIME-specific data type.

This happens whenever I have a Local Date field in my data and use it in the join. When I remove these fields, the join runs without problems. Looking at the data schema, it seems that Local Date fields get this KNIME Logical Type rather than an actual data type (e.g. string):

[screenshot: data schema showing the Local Date column with a KNIME Logical Type]

@toscanomatheus could you give us an example where this happens that one might be able to reproduce? Is the error within the Python node or when trying to bring the data back to KNIME?

Hi @mlauber71 ,

Thanks for responding. The error occurs during execution of the Python node (the Labs version, with the new knio module for serialisation). I think you can reproduce it as follows:

Create a table (via Table Creator) and make sure it contains a Local Date field. Then try to use that field in Python as follows:

  1. Using pandas, add days to a date, similar to the code I shared.
  2. Using pyarrow, do a join with another table.

Let me know if you get different results than I do.

@toscanomatheus I think it would be best if you could provide an actual workflow so we can see what is happening and what to maybe do about it.

@mlauber71
Here it is, with one example for each of the issues I described. I’m curious whether you will get the same results…

python_script_date_error.knwf (11.4 KB)


Hi @toscanomatheus,

Thanks for the feedback. This is indeed a KNIME-specific data type, for which we built a pandas extension to represent it as efficiently as possible. That’s the KnimePandasExtensionArray.

It looks like we have to implement additional operators for this extension array. I’ll create a development ticket to make sure we add that soon. So there’s nothing wrong with your code; our KNIME type simply doesn’t support that operation yet. Sorry.

For the time being, you could try to convert dfc['last'] to a Python list or numpy array (tolist() or to_numpy()). Each element should then have the type datetime.date. From that you could create a new pandas Series and perform the addition.
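A sketch of that workaround (the column names and sample data here are made up, standing in for the KNIME input table):

```python
import datetime
import pandas as pd

# Stand-in for the KNIME input: 'last' holds plain datetime.date values,
# as you would get from the KNIME extension array via tolist().
dfc = pd.DataFrame({
    "last": [datetime.date(2021, 1, 1), datetime.date(2021, 6, 15)],
    "estimate": [1, 2],
})

# Rebuild the column as a regular datetime64 Series (dropping the KNIME
# extension dtype); the timedelta addition then works as usual.
plain_last = pd.Series(pd.to_datetime(dfc["last"].tolist()), index=dfc.index)
dfc["expected"] = plain_last + pd.to_timedelta(dfc["estimate"] * 365, unit="D")
print(dfc["expected"].tolist())
```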

For the join problem, it looks like PyArrow is missing the functionality to use ExtensionTypes in its Table.join() method. There is already a corresponding ticket for R and C++, and the C++ library is used under the hood in Python: [ARROW-16695] [R][C++] Extension types are not supported in joins - ASF JIRA

Another workaround suggestion:
You could convert the column with the extension type to its “storage type” by accessing .storage on the contained array (if it is a ChunkedArray, you’ll have to do that for each chunk). Then perform the join and, at the end, restore the extension type on the column. Unfortunately, this requires some slightly more involved pyarrow array wrapping… Here is some pythonic pseudocode:

import pyarrow as pa

def _apply_to_array(array, func):
    """ helper method to support chunked arrays """
    if isinstance(array, pa.ChunkedArray):
        return pa.chunked_array([func(chunk) for chunk in array.chunks])
    else:
        return func(array)

extension_col = my_table.column(<local_date_column>)
extension_type = extension_col.type
unwrapped_col = _apply_to_array(extension_col, lambda a: a.storage)
unwrapped_table = pa.Table.from_arrays([..., unwrapped_col, ...], names=[...])

result_table = unwrapped_table.join(other, keys=[...])

# note: ExtensionArray.from_storage takes the type first, then the storage array
wrapped_col = _apply_to_array(
    result_table[<local_date_column>],
    lambda a: pa.ExtensionArray.from_storage(extension_type, a),
)
result_table = pa.Table.from_arrays([..., wrapped_col, ...], names=[...])

Please let us know whether that helps. Best,
Carsten


@toscanomatheus this might be an interesting case. I will have to have a closer look at your specific file.

I started to look into the question of the Python Labs nodes and Date and Time variables (again), and indeed there seems to be a problem with the handling of types. In a previous discussion I toyed around with several formats, exported the data through Parquet files, and brought the data back to KNIME, which worked (mostly):

This time I tried two things with string formats exported from KNIME to Python, where I transformed them into timestamps with either Pandas or PyArrow. Again, this worked within Python, and you could export the result with the help of Parquet, but the date-time formats would fail when you wanted to bring them back to KNIME.

The Date and Time variables brought from KNIME to Python do have some very specific formats and I am not sure they can be used in Python in a good way:

What does work is converting specific strings to Python-readable formats. Shown here in a Jupyter notebook, but it also works within the Python KNIME node:

# using Pandas
# from KNIME yyyy-MM-dd;HH:mm:ss
df['Local Date Time (DateTime)'] = pd.to_datetime(df['Local Date Time (String)'], format='%Y-%m-%d;%H:%M:%S')

The resulting “datetime64[ns]” or “datetime64[ns, Europe/Berlin]” dtype does not seem to be supported for conversion back to KNIME, though it can be stored as Parquet.
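To make the conversion above reproducible, here is a self-contained version with made-up sample data that shows the resulting dtype:

```python
import pandas as pd

# KNIME exports Local Date Time strings in the layout yyyy-MM-dd;HH:mm:ss
df = pd.DataFrame({"Local Date Time (String)": ["2021-03-01;13:45:00"]})
df["Local Date Time (DateTime)"] = pd.to_datetime(
    df["Local Date Time (String)"], format="%Y-%m-%d;%H:%M:%S"
)
print(df["Local Date Time (DateTime)"].dtype)  # datetime64[ns]
```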

The same is true if you employ PyArrow to convert the strings to Date and Time:

# using PyArrow (df is a pyarrow Table here)
# from KNIME yyyy-MM-dd;HH:mm:ss
import pyarrow.compute as pc

df = df.append_column(
  "Local Date Time (DateTime)",
  pc.strptime(df.column("Local Date Time (String)"), format='%Y-%m-%d;%H:%M:%S', unit='s')
)

@carstenhaubold I think the handling of date and time variables between KNIME and Python might have to be improved. Standard timestamps from Pandas and PyArrow should be supported, so you could bring results back in such a format. Otherwise you have to resort to Parquet files or string variables.

In the sub-folder /data/ there are two Jupyter notebooks that try a few things with date and time variables:

Pandas: knime_py_pandas_date_time_columns.ipynb
PyArrow: knime_py_pyarrow_date_time_columns.ipynb


Thanks for the detailed analysis @mlauber71, and thanks to both of you for sharing workflows and experiments to reproduce the situation!

You’re right, we need to have another look at our date and time support in the Python (Labs) nodes. We’ll get back to you soon.


@mlauber71 @carstenhaubold
Thanks both for the detailed explanations. For now, that answers my questions, and I look forward to the updates to the nodes.


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.