Nodes in languages other than Java?

Hi,

Let me start by saying thanks for providing KNIME. The new version looks like a great step forward!

I'm very interested in exploring the possibility of exposing to KNIME a set of functionality that is not written in Java (it's C++ wrapped in Python). My current interest is in exposing an open-source cheminformatics and machine learning toolkit, but I could imagine other non-Java projects as well.

I'm not a Java programmer, so I can't assess how difficult my request is. Any ideas? Is this an idea whose time has not yet come?

Thanks again,
-greg

Well, it is possible - we are doing something like this with an old C executable (Quinlan's c4.5r8) ourselves over here. Since Quinlan's licensing terms are a bit odd, we cannot put that node into the public release, though. However, I could send you the code for that node as an example if you are interested.

In general we do not quite like this idea because it makes it very hard to keep KNIME "in control". Canceling external executables, for instance, can be very hard if those external tools don't cooperate. Also, adding interactive views will be tough, since you won't have access to a lot of the HiLiting functionality. But then again, adding new nodes has large value in itself!
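
To make the cancellation point concrete, here is a minimal sketch in plain Java of launching an external executable and polling a cancel flag while it runs. The class name `ExternalToolRunner` and the `cancelled` callback are hypothetical names chosen for illustration and are not part of the KNIME API; in a real node the flag would presumably come from whatever cancel mechanism the framework provides.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not KNIME code: runs an external tool and lets the
// caller cancel it.
public class ExternalToolRunner {

    /**
     * Starts the command and waits for it, checking the cancel flag
     * every 200 ms. Returns the tool's exit code, or -1 if cancelled.
     */
    public static int run(List<String> command, BooleanSupplier cancelled)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Pass the tool's output through so it cannot block on a full pipe.
        builder.inheritIO();
        Process process = builder.start();

        while (!process.waitFor(200, TimeUnit.MILLISECONDS)) {
            if (cancelled.getAsBoolean()) {
                // Ask politely first; a tool that ignores the request
                // has to be killed forcibly.
                process.destroy();
                if (!process.waitFor(2, TimeUnit.SECONDS)) {
                    process.destroyForcibly();
                }
                return -1;
            }
        }
        return process.exitValue();
    }
}
```

Even with this pattern, a forcibly killed tool may leave temporary files or half-written output behind, which is part of why external executables are hard to keep "in control".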

Which tool are you thinking about?

The system I'd like to provide access to is actually a "full" cheminformatics suite (rdkit.sourceforge.net), analogous to the CDK, only written in C++ with a Python wrapper.

Maybe a look at the code you've used for c4.5 would be helpful.

I probably also need to look into the possibility of putting a Java wrapper around some of the RDKit classes using SWIG.

thanks for the answer,
-greg

Ok, if you send me your email address, I'll drop you the node that calls the C-stuff. Wrapping things is not that complicated. Wrapping stuff nicely is harder...

I tried to find more information on RDKit, without much success. Sourceforge has lots of links but none seem to carry much info. Can you elaborate a bit on what functionality you have in there?

Somehow I feel dumber every day on this forum: what is SWIG? Nothing (Java related) to find on the web...

berthold wrote:

I tried to find more information on RDKit, without much success. Sourceforge has lots of links but none seem to carry much info. Can you elaborate a bit on what functionality you have in there?

Yes, the online presence of the toolkit is very minimal; I have been lax in putting together any sort of introduction to the software or its capabilities. It's something I'm starting to seriously think about, but it hasn't happened yet.

The RDKit is a toolkit for cheminformatics, descriptor calculation, and (to a lesser extent) machine learning that was developed at my former company. On the chemistry side it includes the standard suite of things for reading and writing molecules, substructure searching, similarity calculations, molecular depiction, etc. There's also some 3D functionality (2D->3D conversion, conformational analysis, pharmacophore searching), and a set of calculators for molecular descriptors.

berthold wrote:

Somehow I feel dumber every day on this forum: what is SWIG? Nothing (Java related) to find on the web...

More likely it's bad/inadequate communication on my part!
SWIG, the "Simplified Wrapper Interface Generator" (http://www.swig.org/), is a tool for producing wrappers around C/C++ code to enable it to be used in other languages. SWIG is primarily designed for scripting languages, but it does have some Java support.

-greg

Ok, just sent you the c4.5 sources (for our nodes only, of course) in a separate email.
Let me know if you never got it.

I am curious to see what kind of functionality your RDKit offers - do the sources
contain documentation or some more details? Man pages, maybe?

If you are using some sort of generic backbone for your toolkit, wrapping this
in Java should not be too complicated. We usually go through a file-based
I/O, using reasonably well defined CSV formats. Not optimal, of course, since
you end up storing the data twice but I am pretty sure you do not want to
read the KNIME internal cache format :-)
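
The file-based exchange described here can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: the class `CsvExchange` and its methods are hypothetical names, the CSV dialect (comma separator, double-quote escaping, no line breaks inside fields) is an assumption, and it is not the format used by the actual c4.5 node.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: writes a table to CSV for an external tool and
// reads a CSV result back. Assumes fields contain no line breaks.
public class CsvExchange {

    // Quote a field only if it contains a comma or a double quote.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void write(Path file, List<List<String>> rows) throws IOException {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            List<String> escaped = new ArrayList<>();
            for (String field : row) {
                escaped.add(escape(field));
            }
            lines.add(String.join(",", escaped));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // Split one CSV line, honouring quoted fields and doubled quotes.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"');
                    i++; // skip the second quote of an escaped pair
                } else if (c == '"') {
                    inQuotes = false;
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static List<List<String>> read(Path file) throws IOException {
        List<List<String>> rows = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            rows.add(parseLine(line));
        }
        return rows;
    }
}
```

A wrapper node would write its input table with `write`, start the external tool on that file, and load the tool's output with `read`, which is where the "storing the data twice" cost comes from.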

berthold wrote:
Ok, just sent you the c4.5 sources (for our nodes only, of course) in a separate email.
Let me know if you never got it.

It hasn't made it to me yet.

berthold wrote:

I am curious to see what kind of functionality your RDKit offers - do the sources
contain documentation or some more details? Man pages, maybe?

The C++ does have some docs in the headers that I periodically extract with doxygen. To save you the trouble, I put these up at:
http://www.rdkit.org/C++_Docs/
You can get the whole thing as one file at:
http://www.rdkit.org/RDKit_C++_Docs.Feb2007.tgz

I also combined the slides from two presentations that provide something of an overview of the code and put them up here:
http://www.rdkit.org/RDKit_Overview.pdf

There is also automatically generated documentation for the Python code, but that's not as complete or useful due to limitations of the doc-generation tools.

berthold wrote:

If you are using some sort of generic backbone for your toolkit, wrapping this
in Java should not be too complicated. We usually go through a file-based
I/O, using reasonably well defined CSV formats. Not optimal, of course, since
you end up storing the data twice but I am pretty sure you do not want to
read the KNIME internal cache format :-)

Avoiding having to read someone else's cache format sounds good to me. :-)
This does bring up a question though: when dealing with molecules from standard molecular file formats there's typically a lot of computation that has to go on up front to convert the text into "chemistry" (for want of a better word). Like many other chemistry toolkits, the RDKit has an internal binary representation of molecules that can be dumped out to files/strings in order to save time (lots of time) when rebuilding molecules. This is the format I use when I, for example, store molecules in BLOB columns in databases. Do the KNIME I/O formats support binary strings (e.g. a length followed by a chunk of bytes)?
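
The "length followed by a chunk of bytes" layout mentioned above can be sketched in plain Java as follows. The class `LengthPrefixed` is a hypothetical name, and the 4-byte big-endian length is an assumption made for the example; this is neither the RDKit binary format nor any KNIME I/O format.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper: stores opaque binary records as a 4-byte length
// followed by exactly that many bytes.
public class LengthPrefixed {

    public static void writeBlob(DataOutputStream out, byte[] blob) throws IOException {
        out.writeInt(blob.length); // big-endian 32-bit length
        out.write(blob);           // raw bytes, no escaping needed
    }

    public static byte[] readBlob(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative blob length: " + length);
        }
        byte[] blob = new byte[length];
        in.readFully(blob); // throws EOFException if the stream is truncated
        return blob;
    }
}
```

Because the reader knows the length in advance, the bytes themselves may contain anything, including zero bytes and line breaks, which a text format such as CSV cannot carry without an extra encoding step.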

Best Regards,
-greg

Thanks for the links, I'll look into those later today.

As for the format questions - that's a good one! We decided against inventing
our own, internal molecular representation (or using someone else's such as
the CDK molecular format) but we keep what we read. So if we read SDF, the
data cell in KNIME holds the string representation. Same for Mol2, Smiles, ...
If you explicitly convert it to a CDK-cell, this is what you will have and propagate
to the next node. This way the user knows what (s)he does at any given
step of the pipeline.

So, if you were to add your own type, you would add a translator to (and from)
this type and then wrap your representation in a KNIME DataCell. And yes, we
have the concept of a BlobDataCell, which allows you to efficiently stream a lot of
data into (and out from :-) our cache format. Actually, it does not need to be a
BlobDataCell; all other cell types can also provide their own (de)serialization, but Blobs
have the added advantage of being stored only once, no matter how often they
appear in the pipeline.

Do you have any example code that I could use to test my Java snippet?

I don't think it is working as it should.

Many thanks,

Sajan

return 2;

A bit more complex:
return "Row" + $$ROWINDEX$$;

Even more complex:
return "C:/test_" + $params$.replace("/", "").replace("\\", "") + ".xls";