Pandas Node: Support factorized data type

DiaAzul · March 29, 2023, 1:53pm

Background
PM4PY is a Python process mining library. It is used to mine event logs to identify business processes. It should be a good fit with KNIME, which is well suited for sourcing and preparing event logs; however, the PM4PY library fails when used within a KNIME scripting node. Process mining has existed for some time, but is gaining more prominence as organisations seek to identify process improvements.

Problem
A sample workflow is included here.

The conda environment also needs the PM4PY library, which can be installed with pip (not conda):

    pip install pm4py

Executing the Python script produces the following error:

Executing the Python script failed: Traceback (most recent call last):
  File "<string>", line 9, in <module>
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pm4py\discovery.py", line 372, in discover_petri_net_heuristics
    return heuristics_miner.apply_pandas(log, parameters=parameters)
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pm4py\algo\discovery\heuristics\variants\classic.py", line 119, in apply_pandas
    heu_net = apply_heu_pandas(df, parameters=parameters)
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pm4py\algo\discovery\heuristics\variants\classic.py", line 271, in apply_heu_pandas
    dfg = df_statistics.get_dfg_graph(df, case_id_glue=case_id_glue,
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pm4py\algo\discovery\dfg\adapters\pandas\df_statistics.py", line 97, in get_dfg_graph
    df = df.sort_values([case_id_glue, start_timestamp_key, timestamp_key])
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\util\_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\core\frame.py", line 6902, in sort_values
    indexer = lexsort_indexer(
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\core\sorting.py", line 350, in lexsort_indexer
    cat = Categorical(k, ordered=True)
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\core\arrays\categorical.py", line 441, in __init__
    codes, categories = factorize(values, sort=True)
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\core\algorithms.py", line 785, in factorize
    codes, uniques = values.factorize(  # type: ignore[call-arg]
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\core\arrays\base.py", line 1092, in factorize
    uniques_ea = self._from_factorized(uniques, self)
  File "C:\Program Files\KNIME\plugins\org.knime.python3.arrow_4.7.1.v202301311311\src\main\python\knime\_arrow\_pandas.py", line 541, in _from_factorized
    raise NotImplementedError(
NotImplementedError: KnimePandasExtensionArray cannot be created from factorized yet.

The apparent cause is that KnimePandasExtensionArray in the KNIME Python library does not implement _from_factorized, so pandas cannot factorize columns of that type when sorting.
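As a rough illustration of why a multi-column sort_values ends up in factorize, the sketch below mimics the step visible in the traceback (Categorical(k, ordered=True)) using plain pandas columns rather than KNIME extension arrays. The column names and data are made up, and this is a simplification of what pandas does internally, not its actual implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature event log; column names are illustrative only.
df = pd.DataFrame({
    "case_id": ["c2", "c1", "c2", "c1"],
    "timestamp": pd.to_datetime(
        ["2023-01-02", "2023-01-03", "2023-01-01", "2023-01-01"]
    ),
})

sort_keys = ["case_id", "timestamp"]

# Each sort key is turned into integer codes via an ordered Categorical.
# Building the Categorical is the point at which the column gets factorized.
codes_per_key = [pd.Categorical(df[col], ordered=True).codes for col in sort_keys]

# np.lexsort treats the LAST key as the primary one, so reverse the list.
order = np.lexsort(codes_per_key[::-1])

manual = df.iloc[order]
expected = df.sort_values(sort_keys)
print(manual.equals(expected))
```

With an extension-array column, building the Categorical calls the array's own factorize, which in turn needs _from_factorized; that is where the KNIME array raised NotImplementedError.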

Request
Add support for factorized data types within the KNIME Pandas library (and Python Script node) so that the node can handle KNIME Sets and other array-like data types. This would also enable support for packages such as PM4PY that rely on factorization.

DiaAzul
LinkedIn | Medium | GitHub

DiaAzul · April 2, 2023, 12:55pm

@carstenhaubold

I’ve got a fix that works for me (it would be nice if you could build a correct solution into the main KNIME release so that I don’t need to patch this with every upgrade).

In the file KNIME\plugins\org.knime.python3.arrow_4.7.1.v202301311311\src\main\python\knime\_arrow implement _from_factorized():

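    @classmethod
    def _from_factorized(cls, values, original):
        # Required by the pandas ExtensionArray API: rebuild an array of this
        # type from the unique values that factorize() produced.
        # `pa` and `katy` are assumed to be imported by the surrounding module.
        arr = pa.array(values)
        converted_data = katy._wrap_primitive_array(arr, False, "dummy")
        # Reuse the original array's type metadata; no type checking is done.
        return cls(
            original._storage_type,
            original._logical_type,
            original._converter,
            converted_data,
        )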

The purpose of the function is to ensure that the uniques array returned by factorize has the same data type as the original data, so that it can be used as a lookup table when given a code from the codes array. The code above does not do any type checking and assumes that the converted data is in the correct format. You will need to make the code more robust; I had a lot of difficulty following the _types.py file, as there is minimal documentation (understandable, as it is not intended for public consumption).
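For anyone unfamiliar with the codes/uniques relationship, here is a minimal sketch using plain pandas (not the KNIME extension array). The sample values are made up.

```python
import pandas as pd

values = pd.Series(["b", "a", "b", "c"])

# factorize returns integer codes plus the unique values they point into.
codes, uniques = pd.factorize(values, sort=True)

# Looking up each code in uniques reproduces the original values, which is
# why uniques must have the same data type as the original data.
restored = uniques.take(codes)

print(codes.tolist())     # positions into uniques
print(list(uniques))      # sorted unique values
print(list(restored))     # same values as the input
```

For an extension array, pandas builds uniques by calling _from_factorized, so that method is responsible for returning an array of the original type.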

Hope that helps
DiaAzul

carstenhaubold · April 3, 2023, 7:22am

Hi @DiaAzul,

Thanks for the request and the suggested fix. Impressive that you dug through the depths of our code base to implement a workaround!

I’ve created a development ticket so that we implement this missing functionality with the next KNIME release.

Best,
Carsten

DiaAzul · April 3, 2023, 9:07am

@carstenhaubold

Thanks for raising the ticket.

It’s always good to read other people’s code to understand where I can improve my own practice. I come from a strongly typed Java/C# background, so code with lots of if-then statements, unbound functions and weak typing is harder to follow. It was useful to study how pandas extension types are used. With pandas 2.0 they are likely to become more prevalent, which will create a multitude of compatibility issues if not managed carefully.

system · July 2, 2023, 9:07am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.