Cast column to different type or check column type

jenniferh · May 9, 2019, 1:46pm

Dear all,

I have a workflow and need to account for a column which can either be a String column or an RDKit Molecule column.
What I know is that if the column is empty, it is a string column, otherwise it is an RDKit column.
I came up with two ideas:

a) Check if column is empty:

Approach: use the function which is proposed by the Java Snippet node:
c_column.isEmpty()

Unfortunately this works only if the column is a String column, not for the RDKit Mol column.
I could try it and catch the exception, but that seems a bit weird.

b) Check if the column is of type String:

Approach: use the isType(name, type)* function as described here:
isType("MyColumn", tString)

Yet, that always returns true, so I guess I somehow have to enter the Column name differently, but I have no clue how.
If I input the column itself I get the error
“The method isType(java.lang.String, T) in the type org.knime.base.node.jsnippet.expression.AbstractJSnippet is not applicable for the arguments (org.RDKit.ROMol, java.lang.String)”

c) The nodes I have also tried so far:

The Table Validator fails because it cannot cast a String cell to an RDKit cell.
The Extract Table Spec node is an option, but it feels hacky and I hope there is a more elegant solution.

The cherry on the cake would be if I could cast the String column to RDKit and be done… but I would also be glad if I could get solution a or b working (with a node, some code snippet or whatever works).

I would be glad to get some help here.

Thanks in advance,
Jenny

PS: I have not posted it to the RDKit Forum since I think that this task could also apply to other column types which need to be cast and/or checked.

johannes.schweig · May 9, 2019, 2:30pm

Hi @jenniferh,

a) you could use the Empty Table Switch to check if your column is empty. The node passes the output to the first output port if there is something in the table and to the second output port if there is nothing in the table (no rows, no rowids). I would single out the column with a column filter, then remove missing values with a row filter and then connect it to the Empty Table Switch. Then you could do your casting and in the end join it back to the original table.

Cheers,
Johannes

jenniferh · May 9, 2019, 2:43pm

Thanks for the reply. The problem is that the table itself is not empty; it’s really just about these two columns, which might or might not be empty. In addition, one of the columns can be empty while the other might still contain entries.

Cheers,
Jennifer

Alec · May 9, 2019, 2:48pm

Hi Jennifer,

when talking about empty, do you mean containing missing values or containing empty strings? In case of missing values, an alternative approach would be to use a Missing Value Column Filter and check if your column still pertains to the table after applying it.

Best,
Alec

jenniferh · May 9, 2019, 3:37pm

Hi, thanks to you as well. They are missing values, sorry for not being specific on that. I could do that, but do you have an idea how to proceed after that?
If the columns are there, I need to cast them from RDKit to SDF. This node fails if it does not detect any useful input (i.e. a molecule cell).
So basically I have 4 options to account for:

1) both cols are empty
2/3) Either column 1 or 2 is present
4) both are present with molecules

Now for 1 or 2/3 I have no idea how to proceed, since I would need to check if these columns exist, and then go on, and that is also a task for which I haven’t found anything (apart from trying to use them and catching the exception in a snippet maybe)
If you have more ideas than I do I am open for suggestions!
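
For reference, the four cases can be told apart from the set of column names that is still present after all-missing columns have been filtered out. A minimal sketch in plain Java (not a KNIME node API), assuming the two molecule columns are named "Mol" and "Mol2"; the enum and the `classify` method are hypothetical names used only for illustration:

```java
import java.util.Set;

/** Hypothetical labels for the four situations listed above. */
enum MolColumnCase { NONE_PRESENT, ONLY_FIRST, ONLY_SECOND, BOTH_PRESENT }

final class MolColumnCases {
    /**
     * Classifies a table by which of the two molecule columns are still
     * present once columns that contain only missing values were removed.
     */
    static MolColumnCase classify(Set<String> remainingColumns, String first, String second) {
        boolean hasFirst = remainingColumns.contains(first);
        boolean hasSecond = remainingColumns.contains(second);
        if (hasFirst && hasSecond) {
            return MolColumnCase.BOTH_PRESENT;
        }
        if (hasFirst) {
            return MolColumnCase.ONLY_FIRST;
        }
        if (hasSecond) {
            return MolColumnCase.ONLY_SECOND;
        }
        return MolColumnCase.NONE_PRESENT;
    }
}
```

Each case could then be routed to its own branch, so the conversion only runs on columns that actually exist.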

Best,
Jennifer

Alec · May 9, 2019, 3:52pm

You could find out if the columns exist by using a Reference Column Splitter after the Missing Value Column Filter. The second out-port will deliver you the columns which contain the missing values. It would probably look similar to this:

(Replace the Number To String with whatever you need)

Best,
A

PS: maybe a toy workflow would help in order to show where you wish to go after that…

quaeler · May 9, 2019, 4:01pm

Unfortunately isType is incorrectly documented - it is actually checking as to whether the given column can provide data in the specified type - and since just about everything under the sun can provide data as a String type, it will return true.

I think there’s nothing weird or hacky about using a try-catch block in Java code, so i would recommend that path (or at least it’s a lot less hacky than the path i would really recommend which would involve using reflection to get at the incoming data spec object.) The thing that seems to be the problem with that approach is that it requires that the Java Snippet node be referring to a specific incoming column - which means it knows about the data table spec, which means it already knows what kind of column it is (String, Integer, RDKit Mol, etc etc) - so there is no casting to be done here (i.e no “if this column is of type X then do A” because you know at the moment you’re writing the code the type of the column…)

From the perspective of casting a String to an RDKit Mol - you can create a new RDKIT Mol cell in the Java Snippet code by doing

out_myNewCell = RWMol.MolFromMolBlock(c_myStringValue);

and including the import of import org.RDKit.RWMol;
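
A guarded version of that conversion could look roughly like the sketch below. It is plain Java so that it runs outside KNIME: the `parser` argument is only a stand-in for a call such as `RWMol.MolFromMolBlock`, the names `ConversionGuard` and `tryConvert` are hypothetical, and a missing cell is assumed to arrive as `null`.

```java
import java.util.Optional;
import java.util.function.Function;

final class ConversionGuard {
    /**
     * Tries to convert a string cell with the given parser.
     * Returns an empty Optional for missing or empty input, and also when
     * the parser throws or returns null, instead of failing the whole run.
     */
    static <T> Optional<T> tryConvert(String text, Function<String, T> parser) {
        if (text == null || text.isEmpty()) {
            return Optional.empty(); // nothing to convert
        }
        try {
            return Optional.ofNullable(parser.apply(text));
        } catch (RuntimeException e) {
            return Optional.empty(); // treat a parse failure like a missing value
        }
    }
}
```

The caller can then write the converted value to the output column when it is present and leave the output cell missing otherwise.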

jenniferh · May 10, 2019, 9:34am

Thanks to all.
@Alec Thanks, but the problem here is that after the splitter the columns are separated, yet that still runs me into the conversion issue.

@quaeler Well that is really unfortunate…I was really betting on me not handling the input well =/
This option sounds like a cool idea; the problem is that I cannot import org.RDKit.RWMol. It complains about not finding it. Adding the RDKit bundles also does not really help. Could it be that you installed the RDKit Java wrappers?

I have a solution splitting the column based on empty rows and then casting the non empty and concatenating, and again casting and then concatenating and grouping. But this seems like a huge overhead.

I have attached the workflow with the original problem and two mock tables and the solution which I have implemented.

Nevertheless, I still cannot believe that there is no option to check for a column datatype (since the snippet knows them!) or if a column is empty. At least some isinstanceof should work, assuming I do have a list of types the columns can be.

In case anybody has an easier solution or knows how to achieve checking for types or empty columns, I would be happy to get to know that!

Best,
Jennifer

check_RDKitCols.knwf (130.4 KB)

quaeler · May 10, 2019, 5:34pm

I downloaded and looked at your workflow. I think there’s a couple issues / confusions / nomenclature-isms / … afoot here.

"Empty" columns:
‘Empty’ is a data-type subjective, and use-case subjective, declaration - what might be empty to a String data type, is not necessarily empty to an Integer data type. This is why there is the notion of a ‘missing cell’ - which is what are in all the rows for columns “Mol” and “Mol2” being emitted by your Table Creator - Node 2. If you look at the table output dialog from this node, you’ll see the cells’ contents being rendered as a colored “?” to denote a Missing Cell™. In your Java Snippet node connected to this, you want to check via isMissing("my column name"); – the reason this Java Snippet was throwing an exception is that you were doing the equivalent of calling a method (isEmpty()) on a null object since the cell is “missing.”
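
The failure mode can be reproduced in plain Java. This is only a sketch, assuming (as described above) that a missing cell reaches the code as a `null` reference; `MissingAwareCheck` and its methods are hypothetical names, not part of the Java Snippet API.

```java
import java.util.List;

final class MissingAwareCheck {
    /** Null-safe variant of value.isEmpty(): true for a missing (null) or empty string. */
    static boolean isMissingOrEmpty(String value) {
        // Checking for null first avoids the NullPointerException that
        // value.isEmpty() throws when the cell is missing.
        return value == null || value.isEmpty();
    }

    /** True if every cell in the column is missing or empty, i.e. there is nothing to convert. */
    static boolean hasNoContent(List<String> column) {
        for (String cell : column) {
            if (!isMissingOrEmpty(cell)) {
                return false;
            }
        }
        return true;
    }
}
```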

ROMol / RWMol:
if ya got ROMol, which you do, then ya got RWMol - they’re in the same package, in the same jar, the latter subclassing the former.

And the more Kafka-esque problem, "checking column types:"
i still don’t understand what you’re trying to do here. If you’re using the Java Snippet node, you can add a variable to represent an input and once you’ve added that variable - you know the column type (e.g a string data type column is represented by a String class instance, RDKIT molecule - an ROMol class instance, etc etc.) So i’m confused as to where, and for what reason, you’re wanting to figure out a column type but are being stymied… ?

jenniferh · May 13, 2019, 9:14am

@quaeler
Thanks for the long response and clarifications about “Empty” and Missing.

RWMol: In case I have an RDKit column and use it in the snippet, it works fine. But as soon as I do not have this column the import does not work, i.e. for the case where both molecule columns are strings. (Error: The import org.RDKit cannot be resolved)
Sorry, my Java knowledge comes down to one lecture about OOP two years ago…

As for the Kafka-esque problem:

So in general I need a way to figure out if one of the Molecule columns I have will result in a failure of the RDKit node because there is no Molecule column present. The reason is that I want to generate a Metanode which will be used by our group and the users should not need to manually change it to ensure that they can use the link to update their node.
So my go at an “easy” solution was to either

a) check if a column is empty --> then I know that there is no Molecule in the column and thus I would have to skip the RDKit to Molecule node since it fails if I do not have a molecule column
   --> since I now know that there is an isMissing() function I do have a nice and short alternative to my workflow.
b) check if the molecule column has type string --> then I know as well that I cannot cast the column to an sdf cell via the RDKit to Molecule node.

Hope that makes it a bit less Kafka-esque?

Thanks for the help!

quaeler · May 13, 2019, 3:09pm

Thanks for the further insights!

RWMol: it’s true, were you wanting to use the RWMol class in some intermediary code but ultimately not inputting and also not outputting an RDKit Molecule column, the onus would be on you to hunt down that jar file and specify it to the node in Additional Libraries (and woe be unto the person not versed in the arcane to find that jar.)
A nice improvement to the Java Snippet node would be to have an Eclipse-like “i know about all of these classes, start typing and i’ll find the class and then import its containing jar.”
In your case though, it seems like you’ll be outputting an RDKit Molecule column and so the J.S node should take care of putting that jar on the classpath for you.

Kafka: Ok, now i think i’ve got it… maybe. Were i to phrase it, i would say

there is an incoming data table which has a String column; this String column may be empty, might be missing, or might have valid SMILES (or might have invalid SMILES.) If it is neither empty nor missing, you want to create an RDKit Molecule representation out of it.

(You wrote “sdf” above, but in the workflow you provided, the Table Creator nodes’ contents are SMILES; so i wrote SMILES here.)

Is that correct?

jenniferh · May 15, 2019, 8:22am

definitely +1 for that idea, that would really help!

Kafka:
not quite: there is an input table of which two columns can be either a String or RDKit molecules. These two columns should finally be cast to an sdf (or left as a string column since I know that if there is a string column, it does not contain molecules).
For casting I use the RDKit to molecule node. This node fails if the two columns are string columns (Complaining that no RDKit sdf or whatsoever molecule type columns are found).

(This problem originates from a python snippet which outputs string columns if the molecule column is empty)

I have reformatted and rephrased the workflow a bit; maybe it is clearer now? So sorry if my explanations are confusing, thank you so much for bearing with me!

check_RDKitCols.knwf (191.0 KB)

quaeler · May 15, 2019, 3:20pm

Kafka:
I think the “kafka” part of this is that there might be a misunderstanding of the pre-conditions under which a node can be configured in KNIME. For nodes that rely on knowing about data arriving on their inport, the node must know about the incoming data’s tablespec (how many columns? what data type for each column? …) - so in that light, it doesn’t make any sense to talk about a node that dynamically (during workflow execution) detects whether a column named X is data type A or data type B, because before that node could be executed, that node must have been configured, and as part of that configuration, the node already has to know whether column X is data type A or data type B.
As a concrete example from your workflow, your Table Creator “Case 1” emits two Smiles columns and one Integer column; the RDKit From Molecule node which is attached and receives the input could not be correctly configured (and so could not be executed) without knowing about those columns and column types.
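
One way to picture this: the column type lives in the table spec, which exists before any row is processed. The sketch below models a spec as a simple map from column name to Java class. `TableSpecSketch`, its methods and the `MoleculePlaceholder` class are all hypothetical stand-ins for illustration; this is not KNIME’s actual `DataTableSpec` API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Placeholder for a molecule type; stands in for something like ROMol. */
final class MoleculePlaceholder {}

/** Toy model of a table spec: column names and types, no rows at all. */
final class TableSpecSketch {
    private final Map<String, Class<?>> columns = new LinkedHashMap<>();

    /** Adds a column definition and returns this spec for chaining. */
    TableSpecSketch withColumn(String name, Class<?> type) {
        columns.put(name, type);
        return this;
    }

    /** True if the spec contains a column with this name. */
    boolean hasColumn(String name) {
        return columns.containsKey(name);
    }

    /** True only if the column exists and is declared with exactly this type. */
    boolean isColumnOfType(String name, Class<?> type) {
        return type.equals(columns.get(name));
    }
}
```

In this model, a question like "is Mol a String column?" is answered from the spec alone, which is why it is already settled at configuration time.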

I think what’s further confusing to me is that i don’t understand the actual use case here; in the workflow you’ve provided, you’ve hand entered (via Table Creator) all of your data - so naturally you’ll know the data types; if you read these from an external data source (flat file, database, …) all of these things will generate columns of known data type.
It sounds like (from “This problem originates from a python snippet which outputs string columns if the molecule column is empty”) that all of your data will show up in the workflow as Strings - not some as Strings and some as RDKit Molecules (and if it does show up as the latter, could you include the node which produces this indeterminate data table specification in your workflow?)

If you could clarify how the data arriving in your workflow has no fixed data type, then maybe the use case could be better addressed?

jenniferh · May 21, 2019, 3:48pm

Hi,

sorry for the late reply:
I understand that a node has to know the input table spec to know if it can actually execute or not. But I think your explanation definitely helps me to understand KNIME a bit better.

Okay so I’ll try again:
I have a Metanode, which receives user input.
This input consists of Molecules, but it can also happen that I have invalid Molecules, which are transformed by RDKit to a “None” Type.
The Molecules are transformed by a Python Script (Python Snippet node)
Within this snippet the molecules are also filtered based on the user criteria.
The Molecules the user wants are in table 1.
–> This table usually contains molecules, no issue here
The Molecules the user does not want are in table 2.
–> Depending on the user choice this table might have a Molecule column where the molecules were invalid, i.e. None’s.
Both tables are then output from the Python snippet and further used in the node. The problem with the output is that the second table’s Molecule column is now a String column due to the presence of only None-types.
For the user I want to cast the RDKit Molecules back to an sdf format, since this is our standard format for molecules (plus we have the annoying problem that KNIME crashes as soon as you try to view an RDKit column)
Trying this I run into the problem which I described (not being able to cast the String Molecule columns). If the second table has only None’s in the Molecule column this column is a String column.

Thanks again for your input!

system · November 20, 2019, 3:48am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.