Background
PM4PY is a Python process mining library used to discover business processes from event logs. It should be a good fit for KNIME, which is well suited to sourcing and preparing event logs; however, PM4PY fails when it is called from a KNIME Python Script node. Process mining has existed for some time but is gaining prominence as organisations look for process improvements.
Problem
A sample workflow is included here.
The conda environment also needs the PM4PY library, which is installed with pip rather than conda:
pip install pm4py
Executing the Python script produces the following error:
Executing the Python script failed: Traceback (most recent call last):
  File "<string>", line 9, in <module>
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pm4py\discovery.py", line 372, in discover_petri_net_heuristics
    return heuristics_miner.apply_pandas(log, parameters=parameters)
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pm4py\algo\discovery\heuristics\variants\classic.py", line 119, in apply_pandas
    heu_net = apply_heu_pandas(df, parameters=parameters)
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pm4py\algo\discovery\heuristics\variants\classic.py", line 271, in apply_heu_pandas
    dfg = df_statistics.get_dfg_graph(df, case_id_glue=case_id_glue,
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pm4py\algo\discovery\dfg\adapters\pandas\df_statistics.py", line 97, in get_dfg_graph
    df = df.sort_values([case_id_glue, start_timestamp_key, timestamp_key])
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\util\_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\core\frame.py", line 6902, in sort_values
    indexer = lexsort_indexer(
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\core\sorting.py", line 350, in lexsort_indexer
    cat = Categorical(k, ordered=True)
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\core\arrays\categorical.py", line 441, in __init__
    codes, categories = factorize(values, sort=True)
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\core\algorithms.py", line 785, in factorize
    codes, uniques = values.factorize( # type: ignore[call-arg]
  File "D:\mambaforge\envs\knime_process\lib\site-packages\pandas\core\arrays\base.py", line 1092, in factorize
    uniques_ea = self._from_factorized(uniques, self)
  File "C:\Program Files\KNIME\plugins\org.knime.python3.arrow_4.7.1.v202301311311\src\main\python\knime\_arrow\_pandas.py", line 541, in _from_factorized
    raise NotImplementedError(
NotImplementedError: KnimePandasExtensionArray cannot be created from factorized yet.
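The apparent cause is that KnimePandasExtensionArray in the KNIME Python library does not implement factorization (_from_factorized), which pandas calls when sort_values sorts on a column backed by that array type.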
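Until that is supported, one possible stopgap is to convert extension-array columns to plain object dtype before handing the DataFrame to PM4PY, so that pandas no longer routes the sort through the extension array. The sketch below is an assumption, not a confirmed fix: the helper name to_plain_pandas is hypothetical, it relies only on standard pandas calls, and it has not been tested against KnimePandasExtensionArray, which may or may not convert cleanly with astype(object).

```python
import pandas as pd


def to_plain_pandas(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose extension-array columns are stored as object dtype.

    Hypothetical helper: assumes every extension-backed column can be
    converted with astype(object) and that column names are unique.
    """
    out = df.copy()
    for name in out.columns:
        if pd.api.types.is_extension_array_dtype(out[name].dtype):
            # Object dtype is backed by a NumPy array, so sorting does not
            # call the extension array's factorize/_from_factorized.
            out[name] = out[name].astype(object)
    return out
```

In a script node this would be applied to the DataFrame obtained from the input table, before the PM4PY call. Timestamp columns converted this way may still need pd.to_datetime afterwards, since object dtype carries no datetime semantics.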
Request
Add support for factorization to the KNIME pandas integration (and therefore the Python Script node) so that the node can handle KNIME Sets and other array-like data types. This would also enable packages such as PM4PY that rely on this operation.