Python 1=>2 node display error for molecules

python
#1

Hi,
I was trying to use a Python 1=>2 Scripting node to transform Molecules and then output them into two tables.
KNIME crashed multiple times while trying to display the table. The behavior was not reproducible: it showed either the desired table or just weird numbers, but each time with a pointer error.

I tried to go back to a very easy script and found a (or the?) problem:

First, if I create the two tables separately, e.g.
output_table_1 = pd.DataFrame({"a": [molecule, molecule], "b": [1, 2]})
output_table_2 = pd.DataFrame({"a": [molecule, molecule], "b": [1, 2]})

I see the two tables with the RDKit Molecule and everything works fine.

If I do something like

x = pd.DataFrame({"a": [molecule, molecule], "b": [1, 2]})
output_table_1 = x
output_table_2 = x

The first table is displayed correctly, but the second one shows weird numbers instead of the molecules.
[Screenshot from 2019-04-12 15-05-35]

This happens with both the Apache Arrow and the Flatbuffers serialization.

I have attached parts of the error messages (I can attach the whole log if needed) and a simple workflow to reproduce the error.

I am not sure whether I am doing something weird or whether there is a communication problem between RDKit, Python and KNIME. (It works without the RDKit molecules.)

System: Ubuntu 16.04 LTS
KNIME: 3.7.1
RDKit KNIME integration 3.6.0.v201903281548

Python: Anaconda environment with Python 3.6.7, rdkit 2019.03.1.0 (the error also occurred with the latest 2018 release; I just upgraded today) and pyarrow 0.11.0

I am happy to provide more information if needed.

Thanks in advance for your ideas!

jennifer

Python_fail.knwf (22.3 KB)
errs.txt (4.4 KB)

#2

Have you tried

output_table_2 = x.copy()
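
To illustrate the difference, here is a minimal sketch. Plain strings stand in for the RDKit molecules, and the names table_1, table_2 and table_3 are made up for this example. Plain assignment binds both names to the same DataFrame object, while copy() creates an independent one. Whether the shared object is really what confuses the node is only an assumption on my side.

```python
import pandas as pd

# Placeholder strings stand in for the RDKit molecules from the first post.
x = pd.DataFrame({"a": ["mol", "mol"], "b": [1, 2]})

# Plain assignment creates no new DataFrame: both names refer to one object.
table_1 = x
table_2 = x
same_object = table_1 is table_2

# copy() (deep by default) returns an independent DataFrame with equal content.
table_3 = x.copy()
equal_content = table_3.equals(x)

# Changing the copy leaves the original untouched.
table_3.loc[0, "b"] = 99
original_value = x.loc[0, "b"]
```

If the copy is independent, same_object is True for the two assigned names, while edits to table_3 do not show up in x.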

Also, with pandas it can make sense to make some transformations permanent with inplace=True.

Or it could be something else, such as KNIME currently not supporting Python >= 3.7.

#3

Hi,
thanks for the suggestions. copy() indeed helps.

Still, I would assume that this should not be needed and most certainly should not crash KNIME in some cases?
For me, copy() becomes problematic with larger datasets (I currently have loads of RAM, but I would like to ship that metanode and I think it should be as efficient as possible).

What I am doing is basically creating a DataFrame and then filtering the data into two subsets based on some criteria. Hence inplace is not possible (only copy() works, but that is not very efficient, I guess).
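
As a rough sketch of the filtering I have in mind (the column "name" and the variables mask, kept and rejected are placeholders for this example, and strings replace the molecules): boolean indexing already returns a new DataFrame for each subset, so a full copy() of the input beforehand may not be needed. I have not checked how this behaves inside the KNIME node.

```python
import pandas as pd

# Strings stand in for the RDKit molecules; "A" is the filter criterion.
df = pd.DataFrame({
    "name": ["m1", "m2", "m3", "m4"],
    "A": ["Yes", "No", "Yes", "Yes"],
})

# One boolean mask splits the rows into two complementary subsets.
mask = df["A"] == "Yes"
kept = df[mask]        # rows that fulfil the criterion
rejected = df[~mask]   # all remaining rows

# Each subset is a new DataFrame object; the input keeps all of its rows.
kept_rows = kept.index.tolist()
rejected_rows = rejected.index.tolist()
```

Both subsets keep the row labels of the input, so every row ends up in exactly one of them.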

I am using python 3.6 so it should not be a compatibility issue here.

Edit:
I just tried to recreate what I am trying to do. Interestingly, I can execute it inside the node, but executing the whole node gives the error: Execute failed: 'Series' object has no attribute 'ToBinary', even though all objects are DataFrames. I am not sure if I am totally confused here or if the Python 1=>2 node is.

from rdkit import Chem
import pandas as pd

#flow_variables = {}
flow_variables['Keep_all'] = "No"
flow_variables['keep_mixtures'] = "No"
flow_variables['keep_nonorganic'] = "No"

mol1 = Chem.MolFromSmiles('Cc1ccccc1')
mol2 = Chem.MolFromSmiles('Cc1ccccc1')
mol3 = Chem.MolFromSmiles('Cc1ccccc1')
mol4 = Chem.MolFromSmiles('Cc1ccccc1')

all = pd.DataFrame({'col1': [mol1, mol2, mol3, mol4], 'A': ["Yes", "No", "Yes", "Yes"],
                    'Mixture': ["Yes", "No", "No", "No"], 'Nonorganic': ["Yes", "No", "No", "No"]})

output_table_1 = all.copy()

x2 = all.copy()

if flow_variables['Keep_all'] == "No":
    output_table_1 = output_table_1[output_table_1['A'] == "Yes"]
    out1 = x2[x2['A'] != "Yes"]
else:
    out1 = pd.DataFrame()

if flow_variables['keep_mixtures'] == "No":
    output_table_1 = output_table_1[output_table_1['Mixture'] == "No"]
    out2 = x2[x2['Mixture'] != "No"]
else:
    out2 = pd.DataFrame()

if flow_variables["keep_nonorganic"] == "No":
    output_table_1 = output_table_1[output_table_1['Nonorganic'] == "No"]
    out3 = x2[x2['Nonorganic'] != "No"]
else:
    out3 = pd.DataFrame()

output_table_2 = pd.concat([out1, out2, out3])

I would really appreciate any input from your side.
Thanks

#4

Okay.
I figured out the following:
The RDKit molecules make KNIME crash, as described here: Forum post

Nevertheless

  1. the rendering issue for the second output is still weird and
  2. the above error remains: Execute failed: 'Series' object has no attribute 'ToBinary'

Sorry if the post is messy. I am willing to split, move or rewrite it if necessary.

#5

Hi,

This appears to be a bug in the way KNIME is translating things from Python->KNIME.
@christian.dietz: could you please ask someone to take a look at this?
There’s a lot of not-necessarily relevant info here; the key piece is the contents of the second output port of the Python Scripting node in the Python_fail.knwf workflow that @jenniferh attaches above.
The two output tables should be identical to each other since they are derived from the same DataFrame, but something happens so that the RDKit molecules are not deserialized in the second table.

#6

Hi jenniferh and greglandrum,

I was able to reproduce both the “weird numbers” and “‘Series’ object has no attribute ‘ToBinary’” problems and submitted an internal bug report for each of these. I’ll let you know as soon as these are fixed.

Marcel

#7

Following up on this problem. It seems to be caused by duplicate row keys/index entries in your output table, probably introduced by the pd.concat(...) line in the script.
We created a patch for this issue that corrects the displayed error message. However, you will have to manually ensure that the index of the data table is unique. The patch will be shipped with the next version of KNIME.
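
For reference, a minimal sketch of how the duplicates arise and how they could be avoided. It reuses the filter columns from the script above without the molecule column; the variable names data, combined, renumbered and deduplicated are made up for this example, and which of the two options suits your workflow is an assumption you would need to check.

```python
import pandas as pd

# Same filter columns as in the script above, without the molecule column.
data = pd.DataFrame({
    "A": ["Yes", "No", "Yes", "Yes"],
    "Mixture": ["Yes", "No", "No", "No"],
    "Nonorganic": ["Yes", "No", "No", "No"],
})

out1 = data[data["A"] != "Yes"]          # selects row 1
out2 = data[data["Mixture"] != "No"]     # selects row 0
out3 = data[data["Nonorganic"] != "No"]  # selects row 0 again

# concat keeps the original row labels, so row 0 appears twice.
combined = pd.concat([out1, out2, out3])
has_unique_index = combined.index.is_unique

# Option 1: discard the old labels and renumber the rows.
renumbered = pd.concat([out1, out2, out3], ignore_index=True)

# Option 2: keep the old labels but retain each original row only once.
deduplicated = combined[~combined.index.duplicated(keep="first")]
```

Checking index.is_unique before assigning the output table is a cheap way to catch this case early.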

We are also working on your first problem at the moment but haven’t finished development there yet. I’ll update this post once it’s done.

Marcel

#8

Thanks for the update on this @MarcelW!

-greg

#9

@MarcelW thanks for the update, checking for duplicates is no problem!
