FeatureFinderMetabo takes too long with real data

Hi everyone!
I tried FeatureFinderMetabo with the tutorial data (2012_02_03_PStd_10_1.mzML, 2012_02_03_PStd_10_2.mzML, 2012_02_03_PStd_10_3.mzML, 2012_02_03_PStd_050_1.mzML, 2012_02_03_PStd_050_2.mzML, 2012_02_03_PStd_050_3.mzML) and optimized parameters, and it takes only 3-5 minutes, even though each file is >200 MB. However, when I try it with different experimental files (5 mzML files, or 5 mzXML files with a FileConverter node), which are ~70 MB each, it takes from 2 up to 4 hours. I understand that the quality and maybe the size of the data matter, but it seems odd that much smaller files take so much longer to process. I would expect smaller files to take less time.

Hi!

Did you check if your data is centroided and that there are no zero-intensity peaks?
Sometimes you need to do some cleanup.
If you want (and are allowed), you can send us or upload a test file, and my colleague (who is more proficient with the internals of this tool) and I will have a look.
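
As a rough illustration of the zero-intensity check, here is a minimal Python sketch. It assumes the m/z and intensity arrays of each spectrum have already been read from the mzML file with a library of your choice (that loading step is not shown), and the function name and toy data are made up for this example.

```python
from typing import Iterable, Sequence, Tuple

# One spectrum as a pair of parallel arrays: (m/z values, intensities).
Spectrum = Tuple[Sequence[float], Sequence[float]]


def count_zero_intensity_peaks(spectra: Iterable[Spectrum]) -> Tuple[int, int]:
    """Return (zero_peaks, total_peaks) summed over all spectra."""
    zero = 0
    total = 0
    for mz, intensity in spectra:
        if len(mz) != len(intensity):
            raise ValueError("m/z and intensity arrays differ in length")
        total += len(intensity)
        # Count peaks whose intensity is zero (or negative, which is also suspicious).
        zero += sum(1 for value in intensity if value <= 0.0)
    return zero, total


# Hypothetical toy data: two spectra with three zero-intensity peaks in total.
example = [
    ([100.0, 100.1, 100.2], [0.0, 523.4, 0.0]),
    ([200.0, 200.5], [0.0, 87.1]),
]
zero_peaks, total_peaks = count_zero_intensity_peaks(example)
print(f"{zero_peaks} of {total_peaks} peaks have zero intensity")
```

If a large share of the peaks turns out to have zero intensity, removing them before feature finding is the kind of cleanup meant above.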

Hi!
Thank you very much!
Yes, they are publicly available (I downloaded them from the MetaboLights database): https://drive.google.com/open?id=1f4z4_QNagyhqdnfM409otQRJbbZ5fKlz for the 5 mzML files and https://drive.google.com/open?id=1qWsO8VeK_a0CH-3KEHKyu8Evrs7hWnju for the 5 mzXML files.
With optimized parameters (as proposed in the OpenMS handout https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Tutorials/Handout/master/handout.pdf ) it takes 2 hours for the 5 mzML files and 3 hours and 40 minutes for the 5 mzXML files. How could I check if they have zero intensities? You mean after FeatureFinderMetabo has processed them? Also, I opened them and they say they are centroided.

Hi!
Do we have any updates on this?

Hi,

“max_trace_length” seems to be the parameter that prolongs the runtime significantly.
Can you try setting it to a smaller value or to “-1”?
We don’t yet know why this happens. I assume it is some kind of bug or inefficiency, since it should not happen.

I also suggest having a look at the data in TOPPView to get a better feeling for the best parameters for your dataset/machine anyway.

Hi!
Thank you very much! Yes, that does reduce the time a lot. I also noticed that although some files are really small (60 or 90 MB), they may be compressed (“zlib compression”), which means they hold much more data than their size suggests and consequently may take more time.
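
To illustrate why the file size on disk can understate the amount of data, here is a small Python sketch. It assumes the usual mzML layout, in which peak arrays are stored as base64 text and may be zlib-compressed first. The helper names and the intensity values are made up for this example.

```python
import base64
import struct
import zlib


def encode_array(values, compress):
    """Pack floats as little-endian 64-bit, optionally zlib-compress, then base64-encode."""
    raw = struct.pack(f"<{len(values)}d", *values)
    payload = zlib.compress(raw) if compress else raw
    return base64.b64encode(payload)


def decode_array(text, compressed):
    """Inverse of encode_array: base64-decode, optionally decompress, unpack floats."""
    payload = base64.b64decode(text)
    raw = zlib.decompress(payload) if compressed else payload
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# Hypothetical intensity array with many repeated values, which compresses well.
intensities = [0.0, 0.0, 1500.0, 0.0, 0.0, 320.5] * 500

plain = encode_array(intensities, compress=False)
packed = encode_array(intensities, compress=True)

print(len(plain), "bytes uncompressed,", len(packed), "bytes with zlib")
```

Both encodings decode to the same 3000 values, so a tool that reads the file has the same amount of data to process either way. Real spectra compress far less than this repetitive toy array.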

So, I am using FeatureFinderMetabo in combination with the Parallel Chunk Start and Parallel Chunk End nodes to reduce runtime (so that one file is processed per execution, in parallel). However, if I increased the threads of FeatureFinderMetabo to 2, shouldn’t that be faster? I did change it to 2 and I am processing 2 files in parallel, but it takes exactly the same time whether I set 1 or 2 threads in the FeatureFinderMetabo node. For reference, my laptop has 8 threads and everything except KNIME is closed.

Hi!

That is true. Especially with a compressed mzML or an otherwise complex sample (e.g. many compounds and features to be found), the times can vary quite a bit. The difference should not be as pronounced as you reported, however, especially since “-1” is supposed to mean “no maximum cutoff”. We are looking into this.
Regarding parallelism, I would prefer the Parallel Chunk Loop as long as you have enough files to process. The speedup from parallelism inside the algorithm is limited, since most of the time it has to look at the data “as a whole”.
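
Outside of KNIME, the same "one file per worker" idea can be sketched in Python as below. This is only a sketch under assumptions: it presumes FeatureFinderMetabo is on the PATH and accepts the common TOPP options `-in`, `-out` and `-threads`, and the file names are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(mzml_path, threads=1):
    """Build a FeatureFinderMetabo call for one file (flag names are assumptions)."""
    source = Path(mzml_path)
    target = source.with_suffix(".featureXML")
    return [
        "FeatureFinderMetabo",
        "-in", str(source),
        "-out", str(target),
        "-threads", str(threads),
    ]


def process_files(paths, workers, run=subprocess.run):
    """Run one tool process per file, with at most `workers` running at once."""
    commands = [build_command(path) for path in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker thread only waits for its external process to finish.
        # Note: subprocess.run does not raise on a non-zero exit code by default.
        list(pool.map(run, commands))
    return commands


# Hypothetical file names; this only prints the commands and does not run the tool.
for path in ["sample_1.mzML", "sample_2.mzML"]:
    print(" ".join(build_command(path)))
```

Calling `process_files(paths, workers=4)` would then start up to four single-threaded tool processes side by side, which is the file-level parallelism the Parallel Chunk Loop provides inside KNIME.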

Hi!
Thank you very much! Do you mean you would prefer the Parallel Chunk Start and Parallel Chunk End nodes over increasing the threads? And that the thread setting, despite suggesting parallelism, brings little speedup in practice? I tried combining both, the parallel nodes and more threads, but increasing the threads made absolutely no difference.

Hi!
So, regarding the max_trace_length parameter: if I decrease it from 600 to 300 it takes more time, and if I decrease it to 200 it takes even more time. Also, the more I decrease it, the more results I get, and I am not sure it is supposed to work like that, since it is meant to be a cutoff/threshold, right? Unless decreasing it shrinks the time window (in seconds) that is checked, which leaves room for more checks and more rows/features to be detected, and therefore more time.
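
To make that guess concrete, here is a toy Python model. It is not the OpenMS implementation: it simply assumes that a long run of consecutive scans gets cut into several traces once a trace reaches the limit, and it counts scans rather than seconds. The function name and the run length of 1000 are made up.

```python
def split_run(run_length, max_points):
    """Split one run of consecutive scans into traces of at most max_points scans."""
    if max_points <= 0:
        # Treat a non-positive limit as "no cutoff", like the "-1" setting.
        return [run_length]
    full, rest = divmod(run_length, max_points)
    return [max_points] * full + ([rest] if rest else [])


# Smaller limits cut the same run into more pieces.
for limit in (-1, 600, 300, 200):
    traces = split_run(1000, limit)
    print(limit, len(traces), traces)
```

In this toy model the same run yields 1, 2, 4 and 5 traces for the limits -1, 600, 300 and 200, which would match seeing more results as the limit gets smaller. Whether FeatureFinderMetabo really behaves this way is exactly my question.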