I have a little question about really large KNIME jobs. Imagine a simple workflow:
1. A Molreader node which reads 1,000,000,000 molecules.
2. A Manipulator node which does something with the molecules.
3. Finally, a simple table (or anything else) which shows the results.
If I start the workflow now, it takes a very long time until I see any results, because even the first node (the Molreader) takes "forever" to read the 1,000,000,000 molecules, and manipulating them with our second node, the Manipulator, takes even longer.
So there could be a very long stretch where I don't see any result at all.
Is there a way to do some parallel execution?
Imagine the Molreader (1) reads 100 molecules and sends them to the Manipulator. The Manipulator processes those 100 molecules and sends them directly to the result table (3). Then the Molreader (1) reads the next 100 molecules, and so on.
That way I would see results on a human timescale even if I start a gigantic job.
Is that possible? Or does every node have to finish its job completely before the next node is executed?
PS: I know 1,000,000,000 is really a lot :) But it's easier to explain with large numbers ;)
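The chunked hand-off described above (reader → manipulator → result table, 100 rows at a time) is essentially a producer-consumer pipeline. A minimal sketch outside of KNIME, with a `toUpperCase` call as a placeholder for the real manipulation and the class/method names being made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ChunkPipeline {
    // Sentinel chunk that marks the end of the stream
    private static final List<String> POISON = new ArrayList<>();

    public static List<String> run(List<String> molecules, int chunkSize) {
        // Bounded queue: the reader cannot race arbitrarily far ahead of the consumer
        BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>(4);
        List<String> results = new ArrayList<>();

        // The "Molreader": emits the input in chunks of chunkSize rows
        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < molecules.size(); i += chunkSize) {
                    queue.put(new ArrayList<>(
                        molecules.subList(i, Math.min(i + chunkSize, molecules.size()))));
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        // The "Manipulator" + "Result Table": consumes each chunk as soon as it arrives,
        // so partial results exist long before the reader is done
        try {
            List<String> chunk;
            while ((chunk = queue.take()) != POISON) {
                for (String mol : chunk) {
                    results.add(mol.toUpperCase()); // placeholder manipulation
                }
            }
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }
}
```

The key point is the bounded queue: memory stays proportional to a few chunks, not to the whole table, which is exactly what makes streaming attractive for a billion rows.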
...that would be one of the few examples where a row-based pipeline has an advantage over table-based processing. KNIME currently only supports the latter, however, so you would need to wait a while to process those molecules. Note that you could use your 1,000,000-core machine to cut this time down massively: KNIME uses a ThreadedNodeModel which distributes data-parallel operations (such as your molecular property calculation) across multi-processor machines. So in this case you would only wait 1/1,000,000 as long ;-)
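The data-parallel execution mentioned above (independent per-row calculations spread over all cores) can be sketched with a plain thread pool; `computeProperty` here is a hypothetical stand-in for the actual molecular property calculation, not KNIME's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelManipulator {
    // Hypothetical stand-in for a per-row molecular property calculation
    static int computeProperty(String molecule) {
        return molecule.length();
    }

    public static List<Integer> processAll(List<String> molecules) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            // Rows are independent of each other, so each one can run on any core
            List<Future<Integer>> futures = new ArrayList<>();
            for (String m : molecules) {
                futures.add(pool.submit(() -> computeProperty(m)));
            }
            // Collect in submission order so the output keeps the table's row order
            List<Integer> out = new ArrayList<>();
            for (Future<Integer> f : futures) {
                try {
                    out.add(f.get());
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }
}
```

This speeds up the computation itself, but note the difference from streaming: every node still has to finish its whole table before the next node starts.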
We have been thinking about enabling a way to stream data through KNIME pipelines with the new workflow manager, but I doubt this will happen anytime soon (simply for lack of resources, not lack of interest). I see this as low priority because the cases you describe are rather rare, and license balancing is the only other reason I can see for requiring this type of processing.
PS: Have you thought of throwing out the duplicates in your data file? Or are those 1,000,000,000 artificially generated molecular structures?
Thank you very much for your reply.
Yes, I thought about throwing out "uninteresting" rows. But our core problem is that it has to be "pseudo-parallel" by design :)
But thank you anyway for the information!
Now I know how KNIME works, and we can think about future steps regarding our node.
Is this still an issue for you?
I may have a truly parallel data-mining alternative.
This might not be an issue for imax, but it's an issue for me.
Can you, or somebody else, explain to me a way of doing multithreading in KNIME?