Removing/filtering duplicate molecules from an input

Hi, I am a new user of KNIME and I've been looking through the forums for an answer, but I couldn't tell whether what I found was what I wanted, or whether it is possible at all.

I have a few molecular databases I've created, and I wish to combine two at a time as input, then filter out the entries that are duplicates, leaving the ones that are not in a separate data stream if possible.

I'm not sure whether I can use a component with this feature already included, or whether I need to write a piece of code for another component to carry out the task.

An example of what I'm thinking of: two databases with some molecules in common.

Database_1
Molecule_no   Structure
1             CCC
2             COCC
3             CONC

Database_2
Molecule_no   Structure
11            CCCCCCC
12            CCC
13            COCOCO

Filtering the duplicate molecules (1 and 12) so that they end up in a separate data stream from the non-duplicates:

Datastream_1 (duplicates)
Molecules 1 and 12

Datastream_2 (non-duplicates)
Molecules 2, 3, 11 and 13
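
The split described above can be sketched in a few lines of plain Python. This is only an illustration of the intended result, not a KNIME node: the function name `split_duplicates` and the tuple layout are made up for the example, and structures are compared as exact strings.

```python
from collections import Counter


def split_duplicates(*tables):
    """Split rows from several tables into (duplicates, non_duplicates).

    Each table is a list of (molecule_no, structure) tuples. A row counts as
    a duplicate when its structure string occurs more than once overall.
    Hypothetical helper for illustration; exact string comparison only.
    """
    rows = [row for table in tables for row in table]
    counts = Counter(structure for _, structure in rows)
    duplicates = [row for row in rows if counts[row[1]] > 1]
    non_duplicates = [row for row in rows if counts[row[1]] == 1]
    return duplicates, non_duplicates


database_1 = [(1, "CCC"), (2, "COCC"), (3, "CONC")]
database_2 = [(11, "CCCCCCC"), (12, "CCC"), (13, "COCOCO")]

duplicates, non_duplicates = split_duplicates(database_1, database_2)
print([no for no, _ in duplicates])      # [1, 12]
print([no for no, _ in non_duplicates])  # [2, 3, 11, 13]
```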

Hope my explanation is understandable!

Thanks in advance,
Dan

Hi Dan,

dan_mason wrote:
I have a few molecular databases I've created, and I wish to combine two at a time as input, then filter out the entries that are duplicates, leaving the ones that are not in a separate data stream if possible.

I'm not sure whether I can use a component with this feature already included, or whether I need to write a piece of code for another component to carry out the task.


I cannot think of any node in the current version that can accomplish this, not even for "simple" data types like strings or numbers. For molecules it would be even more complicated, I guess, because SMILES is not unique: the same molecule can be written as several different strings.
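
To make the SMILES point concrete, here is a tiny Python illustration (not KNIME code). The three strings are all valid SMILES for ethanol, yet a plain string comparison treats them as three different entries.

```python
# Three spellings of ethanol; each is a valid SMILES for the same molecule.
ethanol_spellings = ["CCO", "OCC", "C(O)C"]

# Exact string comparison sees three different values, so a string-based
# duplicate filter would not recognise these rows as the same molecule.
distinct_strings = set(ethanol_spellings)
print(len(distinct_strings))  # 3
```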

Regards,

Thorsten

OK, thanks for your help. In that case I'll try to think of another way to achieve this.

Cheers,
Dan

This can be achieved quite easily with a few simple Java commands. I'll post them here in the near future.

Dan

I was just looking for a simple way of filtering duplicates out of datasets in a KNIME workflow, exactly in the way Dan asked for. Dan, did you post the Java commands you mentioned in your last message? They would be really helpful for me!

Thanks in advance,
Sebastian

If the molecules have unique values in a column of both tables, you can use the Row ID node to put those values into the virtual Row ID column, and then use the Reference Row Filter to include or exclude all row IDs present in the second (reference) table. This node is new in KNIME 1.3.5.
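
The logic of that two-step approach can be sketched in plain Python. This is a rough model, not the KNIME API: `set_row_id` and `reference_row_filter` are hypothetical names, tables are modelled as dicts keyed by row ID, and the column is assumed to hold unique values within each table.

```python
def set_row_id(rows, column):
    """Key each row by the value in `column` (stand-in for the Row ID step)."""
    table = {}
    for row in rows:
        key = row[column]
        if key in table:
            # The approach relies on unique values within each table.
            raise ValueError(f"duplicate row ID: {key!r}")
        table[key] = row
    return table


def reference_row_filter(table, reference, include=True):
    """Keep rows whose ID is (include=True) or is not (include=False) in `reference`."""
    return {rid: row for rid, row in table.items() if (rid in reference) == include}


db1 = set_row_id(
    [{"no": 1, "smiles": "CCC"}, {"no": 2, "smiles": "COCC"}, {"no": 3, "smiles": "CONC"}],
    "smiles",
)
db2 = set_row_id(
    [{"no": 11, "smiles": "CCCCCCC"}, {"no": 12, "smiles": "CCC"}, {"no": 13, "smiles": "COCOCO"}],
    "smiles",
)

shared = reference_row_filter(db1, db2, include=True)
only_in_db1 = reference_row_filter(db1, db2, include=False)
print([row["no"] for row in shared.values()])       # [1]
print([row["no"] for row in only_in_db1.values()])  # [2, 3]
```

Swapping the two arguments gives the rows of the second table that do or do not appear in the first.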

Thanks Gabriel, I hadn't worked with the Reference Row Filter yet, but indeed, it works quite well.

At least as long as the molecules aren't in different protonation states: I use SMILES as the reference row, so different protonation states of the same molecule are not recognized as duplicates. Nevertheless, if I put all molecules into the same protonation state with *other* programs, the problem is solved. Or is there a node in KNIME to change the protonation state of molecules?

Regards,
Sebastian

Hi there,

I have written a node which filters duplicates; as soon as I have access to the nodes4knime project, I will publish it there. The node takes two input streams and compares them. If a match is found, it returns that row from the selected input port into the +ve stream; otherwise it places the row in the -ve stream (no match found). The node works by string comparison. I would not recommend using it for SMILES comparison unless you have standardised the SMILES strings so that identical molecules give identical strings, and even then I wouldn't do it.

Best regards,

Stanage.

Use the InChIKey and then filter on a simple string.
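
A minimal sketch of that idea in Python, outside KNIME. It assumes RDKit (built with InChI support) is available for the key generation; the helper names `rdkit_inchikey` and `group_by_key` are made up for this example, and the grouping itself is just a string comparison on whatever key the function returns.

```python
def rdkit_inchikey(smiles):
    """Return the InChIKey for a SMILES string, or None if it cannot be parsed.

    Assumption: RDKit with InChI support is installed. The import is done
    lazily so the grouping helper below can be used without it.
    """
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToInchiKey(mol) if mol is not None else None


def group_by_key(smiles_list, key_fn=rdkit_inchikey):
    """Group SMILES strings by the string key that key_fn returns."""
    groups = {}
    for smiles in smiles_list:
        groups.setdefault(key_fn(smiles), []).append(smiles)
    return groups
```

With RDKit installed, `group_by_key(["CCO", "OCC", "CCC"])` should place the two ethanol spellings under one key; any group with more than one entry is then a set of duplicates.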