"Python Source" Node: ERROR Python Source Execute failed: fill value must be in categories

I’m calling a Python module via the “Python Source” node; it works fine from PyCharm. From the “Python Source” node, I get the error:

 ERROR Python Source        0:5        Execute failed: fill value must be in categories

Any way to debug this? The dataframe returned to the “Python Source” node has only one category column, and it has all valid values: it is a _merge column, populated with values by pandas.

Here is my source code in the “Python Source” node:

import sys
sys.path.append('/Homebase/Source Code/PycharmProjects/Fuzzy5')
import main_fuzzy_match_v1 as fm
output_table = fm.fnFuzzyMatch('Silver',
    'Test', 'SELECT [LU], [nxt1], [nxt2] FROM [SUBJECT_test3]', 'LU',
    'Test', 'SELECT [LU], [cola], [colb], [colc] FROM [REFERENCE_test3]', 'LU',
    'SG_Stage_1_Simpledata')

The module “main_fuzzy_match_v1” works great called from PyCharm. It also logs “stdout” and “stderr” to log files: no errors in my code. Something is happening when the dataframe is passed to the “Python Source” node…

The error message you shared matches one that pandas is known to report. While not the only cause, this can occur when attempting a fillna() on a pandas.DataFrame’s categorical column without first adding the fill value to that column’s category definition. As to whether the exception is triggered inside your call to main_fuzzy_match_v1.fnFuzzyMatch() or somewhere later in KNIME’s handling of the pandas.DataFrame stored in output_table, I cannot tell without seeing more of your KNIME error log.

You might try converting the type of the one categorical column in your output DataFrame to str. Your categories will still exist, but as their string representations. KNIME should have no difficulty understanding a column of strings, and it would avoid any side effects of the constraints placed on that column because it is categorical. That is, try adding this line (substituting in the correct column name):

output_table['that_column_name'] = output_table['that_column_name'].astype(str)

One more thing to try: since this error is often triggered in the way I described before, try filling in the missing values in your pandas.DataFrame with some sentinel value before handing it over to KNIME. That way, you will have total control over how your missing values are handled and at the same time, hopefully side-step the problem. That is:

output_table['that_column_name'].fillna('MISSINGVALUE', inplace=True)
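One caveat worth hedging: on a categorical column, that exact fillna() call would itself raise the “fill value must be in categories” error unless the sentinel is registered as a category first. A minimal sketch (the sentinel name is just the placeholder used above):

```python
import pandas as pd

# A categorical column with a genuine missing value.
s = pd.Series(["x", None, "y"], dtype="category")

# Register the sentinel as a category before filling with it;
# calling fillna("MISSINGVALUE") directly would raise instead.
s = s.cat.add_categories(["MISSINGVALUE"]).fillna("MISSINGVALUE")
print(list(s))  # ['x', 'MISSINGVALUE', 'y']
```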

I hope one of these helps. Otherwise, you might need to share more of your error log.

First, thanks for the reply!!!

I do use fillna(), but it works fine:

  • Via a small test program in PyCharm, calling my source module also developed and debugged in PyCharm.
  • Called from “Python Source” node. Most of my code is a .py file (which loads a couple of other .py files): the source-code in “Python Source” simply calls my code, stored in a PyCharm project directory. My source module (called from “Python Source” node) outputs to files for both stdout and stderr, and runs to completion, without error. My code is rather well tested on very large datasets; I’m just running on a small test dataset, in hopes that KNIME will be a good “front end”.

The error occurs when the final dataframe result is passed to the “Python Source” node. All missing values are replaced via fillna().

I confirmed that there is only one category column in my dataframe, and it has no values populated by fillna(). All values in the column “_merge” from the pandas merge function are valid and populated by pandas itself when the merge function is executed.
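For reference, this is how that column comes about: when pandas performs a merge with indicator=True, it adds a _merge column that is always categorical, with the fixed categories left_only / right_only / both. A small sketch (column names here are illustrative, not from the actual workflow):

```python
import pandas as pd

subject = pd.DataFrame({"LU": ["a", "b", "c"]})
reference = pd.DataFrame({"LU": ["a", "b"]})

# indicator=True adds the categorical _merge column.
merged = subject.merge(reference, on="LU", how="left", indicator=True)
print(merged["_merge"].dtype)                    # category
print(merged["_merge"].cat.categories.tolist())  # ['left_only', 'right_only', 'both']
```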

One thing you mentioned is the datatype of columns in the dataframe. All of my non-numeric columns are of type “Object”. I’ll try converting all the “Object” columns to “str”, and report back.

Thanks again! Fingers crossed…

Well, changing the dtype of the object columns to string does not seem to change anything in pandas. Apparently, object is the dtype pandas uses for strings.

After executing my module, called from a “Python Source” node, the dataframe generated, per .dtypes, is as follows:
LU_Subject object
nxt1 object
nxt2 object
master_side float64
dupe_side int64
similarity float64
_merge category
LU_Reference object
cola object
colb object
colc object
ID int32

I obtained the above by executing print(my_dataframe.dtypes) within my source module, which is called from the “Python Source” node via the following code in the node:
import sys
sys.path.append('/Homebase/Source Code/PycharmProjects/Fuzzy5')
import main_fuzzy_match_v1 as fm
output_table = fm.fnFuzzyMatch('Silver',
    'Test', 'SELECT [LU], [nxt1], [nxt2] FROM [SUBJECT_test3]', 'LU',
    'Test', 'SELECT [LU], [cola], [colb], [colc] FROM [REFERENCE_test3]', 'LU',
    'SG_Stage_1_Simpledata')

Values in the “_merge” (category) column (obtained by printing to stdout from my source module) are as follows:
0 both
1 both
2 both
3 both
4 both
5 both
6 both
7 left_only
8 both
9 left_only

The error in the KNIME console is:
ERROR Python Source 0:5 Execute failed: fill value must be in categories

What appears to be the relevant portion of the log-file, is:
2020-10-06 17:57:15,585 : ERROR : KNIME-Worker-14-Python Source 0:5 : : Node : Python Source : 0:5 : Execute failed: fill value must be in categories
org.knime.python2.kernel.PythonIOException: fill value must be in categories
at org.knime.python2.util.PythonUtils$Misc.executeCancelable(PythonUtils.java:297)
at org.knime.python2.kernel.PythonKernel.waitForFutureCancelable(PythonKernel.java:1682)
at org.knime.python2.kernel.PythonKernel.getDataTable(PythonKernel.java:993)
at org.knime.python2.nodes.source.PythonSourceNodeModel.execute(PythonSourceNodeModel.java:96)
at org.knime.core.node.NodeModel.execute(NodeModel.java:747)
at org.knime.core.node.NodeModel.executeModel(NodeModel.java:576)
at org.knime.core.node.Node.invokeFullyNodeModelExecute(Node.java:1236)
at org.knime.core.node.Node.execute(Node.java:1016)
at org.knime.core.node.workflow.NativeNodeContainer.performExecuteNode(NativeNodeContainer.java:558)
at org.knime.core.node.exec.LocalNodeExecutionJob.mainExecute(LocalNodeExecutionJob.java:95)
at org.knime.core.node.workflow.NodeExecutionJob.internalRun(NodeExecutionJob.java:201)
at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:117)
at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:334)
at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:210)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:246)
Caused by: org.knime.python2.kernel.PythonIOException: fill value must be in categories
at org.knime.python2.kernel.messaging.AbstractTaskHandler.handleFailureMessage(AbstractTaskHandler.java:146)
at org.knime.python2.kernel.messaging.AbstractTaskHandler.handle(AbstractTaskHandler.java:92)
at org.knime.python2.kernel.messaging.DefaultTaskFactory$DefaultTask.runInternal(DefaultTaskFactory.java:256)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: Traceback (most recent call last):
File "C:\Program Files\KNIME\plugins\org.knime.python2_4.2.2.v202009241055\py\messaging\RequestHandlers.py", line 96, in _handle_custom_message
response = self._respond(message, response_message_id, workspace)
File "C:\Program Files\KNIME\plugins\org.knime.python2_4.2.2.v202009241055\py\messaging\RequestHandlers.py", line 218, in _respond
data_bytes = workspace.serializer.data_frame_to_bytes(data_frame_chunk, start)
File "C:\Program Files\KNIME\plugins\org.knime.python2_4.2.2.v202009241055\py\Serializer.py", line 205, in data_frame_to_bytes
data_bytes = self._serialization_library.table_to_bytes(table)
File "C:\Program Files\KNIME\plugins\org.knime.python2.serde.flatbuffers_4.2.0.v202006261130\py\Flatbuffers.py", line 685, in table_to_bytes
col.fillna(value='', inplace=True)
File "C:\Users\paper\anaconda3\envs\py3_knime_auto\lib\site-packages\pandas\core\series.py", line 3425, in fillna
**kwargs)
File "C:\Users\paper\anaconda3\envs\py3_knime_auto\lib\site-packages\pandas\core\generic.py", line 5408, in fillna
downcast=downcast)
File "C:\Users\paper\anaconda3\envs\py3_knime_auto\lib\site-packages\pandas\core\internals.py", line 3708, in fillna
return self.apply('fillna', **kwargs)
File "C:\Users\paper\anaconda3\envs\py3_knime_auto\lib\site-packages\pandas\core\internals.py", line 3581, in apply
applied = getattr(b, f)(**kwargs)
File "C:\Users\paper\anaconda3\envs\py3_knime_auto\lib\site-packages\pandas\core\internals.py", line 2006, in fillna
values = values.fillna(value=value, limit=limit)
File "C:\Users\paper\anaconda3\envs\py3_knime_auto\lib\site-packages\pandas\util\_decorators.py", line 178, in wrapper
return func(*args, **kwargs)
File "C:\Users\paper\anaconda3\envs\py3_knime_auto\lib\site-packages\pandas\core\arrays\categorical.py", line 1756, in fillna
raise ValueError("fill value must be in categories")
ValueError: fill value must be in categories

I wonder…
…does the “Python Source” node itself do a .fillna() on the dataframe returned?

I removed the category column ("_merge", described above), and the “Python Source” node works OK.

It really seems like the “Python Source” node itself is doing a .fillna() on all the columns in the dataframe, which causes problems, when one of the columns is of type category.
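The traceback above does point at exactly that: the Flatbuffers.py frame shows col.fillna(value='', inplace=True) being run on every column during serialization. A minimal sketch reproducing what that call does to a categorical column (hypothetical stand-in data; '' is not one of the categories, so pandas refuses):

```python
import pandas as pd

# Stand-in for the column the serializer processes.
col = pd.Series(["both", "left_only", None], dtype="category")

try:
    # What the traceback shows the serializer doing.
    col.fillna("")
except (ValueError, TypeError) as exc:
    # Older pandas raises ValueError("fill value must be in categories"),
    # newer pandas raises a TypeError with similar meaning.
    print(type(exc).__name__, exc)

# What it would take to make the fill legal: register '' as a category.
fixed = col.cat.add_categories([""]).fillna("")
print(list(fixed))  # ['both', 'left_only', '']
```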

This would be very strange. One thing you could do is save your result as a Parquet file or into a SQLite DB and read it back into KNIME. Not the most elegant way, but you could see whether that works.

For now, I will just assume that the Python Source node does not “like” category columns.

The error seems to be something unique to the Python Source node. My code works fine when run external to KNIME, and even within KNIME (per my output to stderr and stdout), except for the final “handoff” of the dataframe to KNIME.

Thanks to all!

Maybe you could provide a sample workflow that reproduces the error, plus a log file with the log level set to DEBUG, so further investigation is possible.

I happily would, but my Python code consists of three or so modules (the “Python Source” node calls the main module, which loads the others) and depends on having SQL Server available.

For now, I just drop the category column before passing the result to Python Source, and it works in KNIME.
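That workaround can be sketched in one line (the surrounding frame here is illustrative; only the _merge column name comes from the thread):

```python
import pandas as pd

df = pd.DataFrame({"LU": ["a", "b"],
                   "_merge": pd.Categorical(["both", "left_only"])})

# Drop the categorical indicator column before handing the frame to KNIME.
df = df.drop(columns=["_merge"])
print(df.columns.tolist())  # ['LU']
```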

OK, I was able to create a VERY simple workflow, that does nothing but load the dataframe (from a pickle file), and demonstrates the “Python Source” node error. I had some version issues between the Anaconda environment I normally use in PyCharm, and the Python 3 environment KNIME creates, so I saved the pickle file using the Python 3 environment automatically created when configuring the Python Extension.

I’m providing the workflow, along with the dataframe (in the form of a pickle file) that works fine external to KNIME (e.g. PyCharm), but gets the category error from within KNIME.

Note: In order to include the datafile (file df_pickle) in the KNIME export, I had to place the file in the workflow group. UNFORTUNATELY, I do not know how to reference the data-file this way (normally, I use an absolute path on my hard-disk). THUS, for this example to work, you would need to fix the file reference in this line, in the only node in the workflow:

output_table = pd.read_pickle("df_pickle")

Sorry for having the file reference wrong in my example, but most of my KNIME work has been with SQL Server. Other than that, it should work fine with the Python 3 Anaconda environment automatically created by the latest Python Extension: except for what I believe to be a “Python Source” node bug.

Python Source Issue.knar (7.6 KB)

BTW, if someone can let me know how to correctly enter the path to file df_pickle in this example, I won’t have to embarrass myself again next time… :roll_eyes:

@bassman this is a strange issue. There seems to be something going on with the “_merge” column. If I drop it from your file everything is OK. I can export it to parquet and bring it back to KNIME without problems. But if I try to keep it or just rename it the problems start.

When you check the data types of the columns it comes back with the type “Category” (instead of Object?). So this seems to be the issue. @ScottF / @Iris - maybe you can check if the Python to KNIME conversion does something special with this type of data.

(sorry for the rendering of the SVG - KNIME has promised to work on that)

Yes, the “_merge” column.

There are only about 9 rows in the dataframe, and the column was created by pandas itself (when doing a merge). All values were populated by pandas: no missing values that I replaced (no fillna() used).

My earlier suggestion to convert a column to str was only ever intended for the categorical column. Indeed, pandas generically uses a dtype of object for columns containing str so attempting to convert the other columns that are not problematic would likely not have much impact. Knowing the name of your categorical column now, my earlier suggestion could be updated to:

output_table['_merge'] = output_table['_merge'].astype(str)

Your experiment with deleting this column from your DataFrame happily confirms that this one column alone is responsible for the heartburn. If you do not need this column’s information, deleting it sounds like a winning resolution. If you do need to keep this column, my above suggestion is still available.

As to why this column caused problems, it is likely triggered by the serialization process that takes place to exchange data between Python and Java. It is this serialization step that must resolve how to represent missing values and the existence of missing values in a categorical column triggers the problem. Just to be clear, there is nothing wrong with using a categorical column in a pandas.DataFrame and you should be able to pass a categorical column from Python to Java. It’s the missing value in a categorical column that causes problems for the serialization step. Exporting the pandas.DataFrame to CSV or Parquet or some other format that pandas supports would force a reckoning with how to represent missing values and thus also solve your problem indirectly. Converting the one problematic column to str (or otherwise deciding how you want to fill-in missing values in that categorical column) should provide a reusable strategy that is more efficient than converting/exporting to other formats temporarily.


What is indeed strange is that, as far as I can see, the category column “_merge” does not contain any missing values, and it still seems to trigger this error. The question is whether KNIME could do anything to help bring categories into KNIME tables as strings.

Converting it to string in pandas, dropping the column, or using Parquet as a workaround is always possible, but not that elegant.
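Until then, a reusable version of the pandas-side workaround could look like this sketch: convert every categorical column to plain strings in one pass before handing the frame to KNIME (the helper name decategorize is made up for illustration):

```python
import pandas as pd

def decategorize(df: pd.DataFrame) -> pd.DataFrame:
    """Convert every categorical column to plain strings (hypothetical helper)."""
    out = df.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].astype(str)
    return out

df = pd.DataFrame({"_merge": pd.Categorical(["both", "left_only"]),
                   "ID": [1, 2]})
print(decategorize(df).dtypes["_merge"])  # object
```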