parallel chunks - use full server power

j_wollenhaupt · July 11, 2022, 4:36pm

Dear KNIME users and developers,

how can I make KNIME use more ressources with parallel chunks?

let’s say the process inside the chunk is a 1 CPU job.
No matter if I use 20 or 60 parallel chunks in the parallel chunk loop, only 20 CPUs are fully busy and also the resulting speed is merely the same.

I have a 192 cpu server (and 1 TB RAM or so…)

in KNIME settings
“maximum working threads for all nodes” is already at 384 (was default)

if I start several parallel chunk nodes with e.g. 20 chunks each … also the CPU usage is not increased and everythings takes longer … plus the GUI slows down a lot and becomes less reactive.

is this an inherent problem of KNIME and how can I get it faster?

Do two KNIME installations (different folders) and open two instances of the GUI which each work on half of my large table?

happy about any suggestions.

best regards,
Jan

linh · July 14, 2022, 4:10pm

Hi @j_wollenhaupt ,

Adjusting the threads in the KNIME settings is a good starting point to look at this performance issue. Maybe you can find additional useful information in this forum post: Does Parallel Chunk Start node uses the threads available?

Are you referring to a server with a KNIME license here?

What you can also try is to allocate your machine memory by modifying your knime.ini file or try the new table backend.

Since you are contemplating to split the data set in half and run two independent KNIME instances, you could also do the same in one KNIME instance, just making sure the results of both branches don’t need to be merged in the parallel chunk end (this is expensive, all data is copied), for instance by use of file readers/writers etc.

If you want, you can also provide a “jstack” from the KNIME process and share it with us, so that our developers can have a look.

Best regards,
Linh

kienerj · July 15, 2022, 10:51am

parallel chunks is a crutch at best and especially with lots of rows becomes problematic you will spend a lot of time in the actual data splitting and concatenation.

The main question is, what the actual workflow is doing, what nodes it is using. Standard preprocessing like filtering or any other IO bound things usually don’t benefit much from more cores. In these cases you are better of to make components and use streaming execution.

I do feel however KNIME could profit by making many more nodes natively multi-threaded. This also applies to 3rd party nodes. This would eliminate the need for any “crutches”. RDKit is a good example with many natively multi-threaded nodes.

Still using all 192 cores isn’t easy. Amdahls law applies. At some point the added benefit becomes very small. such a machine is only really gonna be fully occupied if you have very heavy and long running calculations like Quantum chemistry or such things or if you train neural networks with CPUs. But for “standard KNIME operations” it is likley almost always overkill.

Daniel_Weikert · July 15, 2022, 4:34pm

Would indeed be very interesting what kind of datasize and type you are working on. The compute power sound quite heavy so what is the goal?

j_wollenhaupt · July 17, 2022, 1:51am

thanks all for your interest and good feedback,

I will try to answer in one post:

@linh

I use the KNIME GUI in an X-Session of a 192-cpu server.
I alreade checked the threads and increased the memory in the knime.ini by changing
-Xmx2048m
to
-Xmx16g

can I do any further tweaks using the knime.ini?

I don’t know what a jstack is, but I shared an Example workflow further below.

@kienerj

my two main application is filtering of chemical molecules using the “substructure matcher” node. In this case 4.5 billion of rows with molecules are compared to 1 query molecule. So the operation on every row takes only a fraction of a second, and the rows are independend, thus the table can be split by parallel chunking etc. No matter if I split the table beforehand to 50 parts or use a parallel chunk, KNIME will never use more than aprox 20 CPUs at a time. This operation takes days then … from the “computational effort” this should be doable in a few hours or less with 192 cores.
So if I take one node, KNIME uses 2-3 CPUs and gets 300k rows done in 1min. As I have 60x more CPUs it should be doable in a second per 300k (on average), thus the whole shabang should be over in 2-3 hours.

Funny that you mention RDKit, my 2nd application is with their “Molecule Substructure Filter”, which, same as the “substructure matcher” mentioned above, is not multi-threaded, unfortunately. Takes about the same time as the substructure matcher but is more difficult to use due to the format requirements of the input, queries need to be variables etc.
(I noticed when the chunks are set very high, e.g. 250-300, then I see in case of the RDKit matcher some spikes where 30-35% of CPUs are used … but still by far not the “full force” )

Anyway, in my 2nd case I have a list of 1-10 million molecules (rows) and match them to 200-700 queries.
Also here splitting the table or using parallel chunks does not get KNIME to do things in parallel and it takes a whole working day to run this calculation. Should be doable in 30min or so with 192 cpus.

@Daniel_Weikert

as I cannot upload the 130GB of data for the 4.5 billion rows (SMILES format), I simulated this with a “one to many rows” node. I did three typical examples in there, 2 with one query and one with 500 queries. maybe this helps.

here the example workflow:

test_chunks_.knwf (59.8 KB)

kienerj · July 18, 2022, 5:13am

This means KNIME can use only 16 GB of your 1 TB of RAM. So you should set it to something as high as possible to not block other things running on the server. Say 500 GB or even 950 GB.

I think this is the main cause for the slowness with your dataset. 16 GB ram is simply not enough for the size of your data. It means you are hammering your disk which I hope is a very fast nvme ssd or else this would be another topic to look at. Your 192 core server will be slowed down if you are running on 15k SAS harddisk vs ssds especially when you are giving knime this little amount of RAM.

Therefore greatly increasing the RAM KNIME can use is your next step and if it doesn’t help we can further speculate what might be happening. (About RDKit, you can always go to the github of the nodes and enter an issue like lack of multi-threading. Then they are at least aware of the issue and that users care)

You could use the substructure counter for this at it has a second input port for all queries and then filter on the result table. Not sure if it is multi threaded but theoretically each row can be worked on interdependently.

Again I would change the RAM in the ini and if that doesn’t help, further investigate like RDKit is C++ at heart and I have no idea if there are any limits in the interface to Java regarding threading.

Another option if more RAM doesn’t help is to simply drop down to python entirely for this step, probably completely outside knime if you can’t get the performance in place. There via multiprocessing you will have much more control on resource usage.

j_wollenhaupt · July 20, 2022, 9:50pm

@kienerj

the big increase in RAM was a good idea. Actually I have 800GB, so I put it to 300GB now.
It helped but did not solve the issue. With RDKit node worked better that with substructure matcher, but apparently one then also has to find a “sweet spot” in parallel chunks. Automatically it went to 288, but actually 24 was the sweet spot to get at least 70% of parallel CPU usage, more chunks then again decreased.

Also substructure counter was a good suggestion, but with a list of 200-700 queries the resulting table becomes difficult to filter (better to say: cumbersome to build the filter for me)

But somehow your hint made me search further and I found a great node, the CDK Smarts Query node. This one is fully multithreaded

So this solves definetly my 1st case.
With this I could search 350mio molecules for one query in 150 min , plus 120 min transforming SMILES to CDK, this is a great timing.

The one disadvantage is that the SMARTS have to be constructed in a special way that I have to figure out yet, so that it servers the purpose of substructure searches.

In case that can be worked out (see my post from just now in the CDK part of the board … which btw looks a bit abandoned, not many topics last year, I hope some people still read it^^)
then this could also potentially solve my 2nd case.

there, speedwise I got 22min for 120k molecules and 178 Smarts as queries. Would also be great.

aworker · July 21, 2022, 2:16pm

Hi @j_wollenhaupt

I have downloaded your workflow and made a few modifications to make it work.
I have run the workflow on a laptop with 6 cores (12 threads) & 128 Gigabytes memory and it took 10 hours to run up to successfully execute your problem A) with the same number of rows (100 Million). The solution implemented is the following:

As one can see from the image, all the parallelism is achieved by the -RDKit Substructure Filter- node. This node handles itself the parallelism and does not need any further parallelism to be added around it. In fact, if parallelism is added using -Parallel Chunk Loop- nodes, then the two parallelism schemes fight each other against resources and this is most probably what was happening with your implementation. In other words, it is not recommended to encapsulate two parallel solutions because it generates competition for resources. It is neither recommended to run in parallel two parallelized branches in a workflow for the same reason.

As you can see below, the solution is fully using all the cores in my computer.

I have modified “Problem B)” in the same way to show how to implement a possible solution too. It should work with your initial configuration too and much faster than in my PC given the hardware of your server:

Still there is the question of why your server only runs a subset of its cores when you execute your workflow. Maybe it happens because it has been configured in this way by its administrator to limit the maximum number of processes per user. Nevertheless, the solution I’m providing here should work quite faster in your server and even better now that you have extended the memory upper limit.

The executed workflow can be downloaded from the HUB here:

The uploaded workflow has been configured and executed only with 10 Million molecules to limit the total amount of occupied disk in the hub but it has been verified and executed locally on 100 M.

Hope this helps. Please reach out again if you need further help.

Best
Ael

j_wollenhaupt · July 21, 2022, 10:46pm

@aworker thank you so much for putting in the work,

on the big server it looks a bit different, there “only” 60-70% of cpus used (in max … it is somehow oszillating between 10 and 70 … maybe not all steps of what the node does is parallel),
but anyway much more than we saw with a single node, probably our limit before has been the way too little memory that we had allocated, therefore we saw the RDKit node never performing so well.

So I think your workflow is still a large improvement and showes that my version misused the parallel chunking. (i.e. several multi-threaded nodes fighting with each other)

However, bottom line, the Smarts Query Node ist still much faster and also internally parallelized. About 9 times faster in total if you calculate the time they need on some data. (or 3 times if you factor in the molecule to CDK transformation, but this you only need to do once and then can recycle for future searches)
strangely today with some real-life data, i.e. 2mio different rows, it never never got to this 100% cpu usage as for the test data … but maybe there are other factors in these computing questions that I do not grasp^^ … for me mostly the total runtime counts…

few last (hopefully not too stupid) question on the memory settings before we close the issue:

regarding the serial chunk loops:
why are they there and at which point do they bring an advantage?
Would you still use them if all the data still fits in the RAM?
(my Smarts Query Node ran fine with 350mio rows, 100GB data or so)

I assume this is for cases, where the data is larger than the RAM … if so, should one then put special settings, e.g. the chunk loop memory settings to “save to disk” but the central node (Smarts Query or RDKit substructure) set to “save in memory” … KNIME is anyway writing to disk when RAM / heap space specified in knime.ini becomes limiting, right?

kienerj · July 22, 2022, 4:45am

Keep in mind that RDKit does this transformation to but transparently if you do not feed it an rdkit molecule. If the goal is to reuse the rdkit molecule it can be an advanatge to use RDKit from Molecule before any other rdkit node. That will also split of invalid molecules.

As for the cpu usage, is it 192 thread or 192 cores? I suspect threads so it is really “only” 96 cores which like means dual socket cascade-lake AP server. (48 cores per CPU * 2 CPUs * 2 thread = 192 threads). I guess the hyperthreading can’t really be fully used anymore in such a configuration. Also I wonder how well the OS (is it Windows or linux?) can still cope and schedule correctly with this many threads. Diminishing returns. It’s not really surprising to not see 100% usage.

Having said that and since there seems to be money around given this nice hardware I wonder if it wouldn’t make sense to also look into paid software namely Patsy (or their Arthor product) (I have no affiliation, haven’t actually used it myself but I know it exists)

Another option is to try the RDKIt SubstructLibrary in python.

Documentation:
https://www.rdkit.org/docs/source/rdkit.Chem.rdSubstructLibrary.html
Example:
https://greglandrum.github.io/rdkit-blog/tutorial/substructure/2021/08/03/generalized-substructure-search.html

Maybe it will work with your huge amount of RAM and will likley be a lot faster as everything will be in memory. Worth a try if speed is of the essence.

j_wollenhaupt · July 22, 2022, 10:44am

Even with doing the tranformation to RDKit beforehand, the CDK node is 2 -2.5 times faster. As in Case A, I don’t need either RDkit or CDK later on, the time counts, so 3x faster for CDK

could be that it’s 48 cores, or 24 cores per CPU, all I know is when I use top or htop I see 192 threads.
But your 2nd statement cannot be fully true as I saw 100% for the CDK node and also see that for nodes for virtual screening regularly.

maybe you got a wrong impression. We are an academic group and have to share this hardware (which is several years old) for a number of purposes. Just for these weeks I can fully use it, so I try to make it as efficient as possible.

As I am not a programmer and do not know any python, this would not fit for me. But thanks for the suggestion.

j_wollenhaupt · July 22, 2022, 10:45am

@kienerj

could you say something about why you used the serial chunks in your example and what settings inside these nodes you would recommend?

j_wollenhaupt · July 22, 2022, 12:36pm

sorry, this was directed at @aworker … sorry for confusion

j_wollenhaupt · July 22, 2022, 12:38pm

btw, for problem B, i.e. many smarts to look for, the cdk node is 20-25times faster that the variable loop with the rdkit node (in both cases molecules pre-calculated)

It does not need any loops but takes the list of smarts via it’s 2nd input and does some internal multi-threading, which always uses 100% cpus.

So I think for this thread, to sum up:

If one has to stick to the RDkit node, then the workflow of @aworker is great.
i.e. do not put this RDKit node in parallel loops.
If you can change to the CDK smarts query, even better, but the input SMARTS are a bit more tricky t get right.

plus setting very high setting for RAM in the knime.ini is also essential

aworker · July 22, 2022, 2:23pm

Hi @j_wollenhaupt

No problem at all

Sorry for my late reply. I’m currently under a rush of work. I will try to answer the questions you asked in your two last posts later. In the meanwhile, please find here a modified solution for your case B) which uses the -RDKit Molecule Substructure- node. This one has 2 entries, the 1st one for the list of molecules and the 2nd one for the queries which can even be under different molecular formats (SMILES, SMARTS, SDF MOL, RDKit).

Maybe this can help too to accelerate your parallelized job.

Best
Ael

kienerj · July 23, 2022, 11:00am

One thing I forgot is that if the patterns are either immediately a yes or no decision (if pattern is there either include or exclude) then you could potentially massively improve performance if you exclude the matches for the next query. Say you start with 4.5 billion and 100 million match the first pattern, the next round would then only happen on 4.4 billion rows and so forth. Each structure search would happen on less and less molecules.
(Recursive loop, iteration 1 uses query 1, iteration 2 query 2 and so forth). Inside knime the looping can be slow any maybe this won’t be faster but worth a try. Outside knime with more control I’m pretty sure this could be massively faster assuming you are making decision based on a single substructure match.

j_wollenhaupt · July 24, 2022, 9:36pm

@kienerj also great good idea, but in my scase, usually about 5-10% are filtered out, so it would not make a huge difference

j_wollenhaupt · July 24, 2022, 10:32pm

@aworker

your work is impressive, thank you

this new RDKit node with the two inputs is super fast, even a bit faster than the CDK solution when there is many patterns

it does not deliver 100% the same results (quality wise, i.e. not exactly the same number of matched molecules) but this is something that’s hard to track down, probably due to some SMARTS in my lists not being ideal for the search. Also cannot say which search is more “correct” either from the new or old RDKit node (or neither, haha), I never manually checked in detail but trusted them that they do the searches as intended and that the SMARTS lists I inherited are fully correct/functional.

However I would definetly call this the solution of Problem “B)” in the workflows mentioned. So many queries. If it’s just one query the CDK smarts query seems the fastest in my hands.