The text processing capabilities of KNIME are great, and the Abner tagger works well for biological terms.
Is there an easy way to create a similar tagger for chemistry, one that tags chemical names in a document? This would be very powerful, as there is no easy way to do this at the moment.
The OSCAR software can parse text and find chemical names:
It would be nice if these tools could be wrapped as KNIME nodes. I don't know if anyone has done this, but I believe the OSCAR code is in Java.
Thanks for the hint, I hadn't thought about a chemistry tagger so far. OSCAR is a good suggestion, and the node is now on my feature request list.
A chemistry tagger has been integrated in the new version, which uses the Oscar tagging library.
Many many thanks for implementing this. A great job has been done here to allow Oscar tagging of chemical names, and Oscar tag filtering to just get the chemical names as output.
It is now really easy to get molecular names out of papers and patents using the excellently implemented PDF reader too, a much needed addition which adds a lot more to KNIME's utility.
And to make things even better, there is the Term to Structure node. This is absolutely brilliant and will be getting used an awful lot. It is a good way to get structures out of papers; these can then be reduced to Murcko scaffolds etc. using RDKit/Indigo and counted by frequency using the GroupBy node. KNIME has no limits!
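For anyone who wants to see what the final counting step amounts to outside KNIME, here is a minimal Python sketch. It assumes the scaffolds have already been computed (for example by the RDKit or Indigo nodes) and only mimics what a GroupBy on the scaffold column with a count aggregation would give. The function name `scaffold_frequencies` and the sample rows are made up for illustration; the scaffold SMILES are hand-written placeholders, not output from any tool.

```python
from collections import Counter


def scaffold_frequencies(rows):
    """Count how often each scaffold occurs, most frequent first.

    rows: iterable of (chemical_name, scaffold_smiles) pairs.
    Returns a list of (scaffold_smiles, count) tuples.
    """
    counts = Counter(scaffold for _name, scaffold in rows)
    return counts.most_common()


# Hypothetical rows: names as a tagger might extract them, paired with
# placeholder scaffold SMILES (assumed to be computed elsewhere).
rows = [
    ("aspirin", "c1ccccc1"),
    ("paracetamol", "c1ccccc1"),
    ("pyridine", "c1ccncc1"),
]

freqs = scaffold_frequencies(rows)
for scaffold, count in freqs:
    print(scaffold, count)
```

In a real workflow the rows would come from the tagger and the Term to Structure node rather than being typed in by hand.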
I am so pleased with these implementations, I really like them, and they are very impactful and useful.
One minor point: would it not make sense to have the Text and Image processing node directories in the root of the node repository, so people see them more easily? They are a bit hidden, and people tend to overlook them and how powerful they are.
Whilst I am on the subject of node repository structure, I do think it would be good to move the RDKit/Indigo/Erlwood/CDK nodes to the Chemistry category. Could we not just put a symbol at the end of these names (e.g. a silhouette of people) to indicate that they are community nodes, rather than keeping them in a community directory? As the number of great nodes increases, it would make sense to have them all organised in their logical locations. I feel a community folder doesn't help when you are trying to find nodes. There could also be another folder for scripting, to encompass R, Octave, Python, Groovy, MATLAB etc.