OpenMS nodes for LC-MS/MS data in combination with MetaboliteSpectralMatcher

I am not sure whether by e.g. 20 ppm you mean simply 20/1000000, i.e. basically m/z plus/minus 0.00002. This is my Java Snippet:

// fixed absolute window of +/- 0.00002 around the experimental m/z (not a ppm window)
if (c_exp_mass_to_charge - 0.00002 <= c_mz && c_mz <= c_exp_mass_to_charge + 0.00002) {
    if (c_retention_time >= c_rt_start && c_retention_time <= c_rt_end) {
        out_newrt = c_rt;
        out_newmz = c_mz;
    }
}

But that gives me only 10 metabolites.

I must be missing something there. With a window of around plus/minus 0.01 I get 175 metabolites.

A ppm tolerance scales with the mass, so you usually check: theoretical_mz - theoretical_mz * tolerance * 1/1000000 <= observed_mz && observed_mz <= theoretical_mz + theoretical_mz * tolerance * 1/1000000.
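
For illustration, a minimal standalone sketch of that check; the class, method and variable names are made up for this example (in the KNIME Java Snippet you would use the bound c_ columns instead):

// Minimal sketch of a ppm tolerance check; names are illustrative only.
public class PpmMatch {

    // true if observedMz lies within +/- tolerancePpm of theoreticalMz
    static boolean withinPpm(double theoreticalMz, double observedMz, double tolerancePpm) {
        double delta = theoreticalMz * tolerancePpm / 1e6; // absolute window in m/z
        return observedMz >= theoreticalMz - delta && observedMz <= theoreticalMz + delta;
    }

    public static void main(String[] args) {
        // at m/z 500, a 20 ppm window is +/- 0.01, far wider than +/- 0.00002
        System.out.println(withinPpm(500.0, 500.008, 20.0)); // true
        System.out.println(withinPpm(500.0, 500.02, 20.0));  // false
    }
}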

Thank you very much for your help!
I changed it to:

// 20 ppm window around the experimental m/z
if (c_exp_mass_to_charge - c_exp_mass_to_charge * 20 * 1/1000000 <= c_mz && c_mz <= c_exp_mass_to_charge + c_exp_mass_to_charge * 20 * 1/1000000) {
    if (c_retention_time >= c_rt_start && c_retention_time <= c_rt_end) {
        out_newrt = c_rt;
        out_newmz = c_mz;
    }
}

and now I get 76 metabolites.

That does not sound like much, but in my opinion anything looser would be too likely to produce false positives.
The tolerance depends a bit on your instrument resolution.

Which spectral database did you use?

It is not that many indeed. I used the latest MBSpectra.mzml file, which I built according to https://github.com/OpenMS/MassBankUpdate. The truth is, I am not aiming at a specific instrument but rather at a globally acceptable tolerance value. So maybe I will tweak it a bit.

P.S.: I didn't know how to update the MB2HMDBMapping.csv, so I left it as it was: https://forum.knime.com/t/openms-updating-mb2hmdbmapping-file-to-create-the-mbspectra-mzml-for-metabolitespectralmatcher-node/37097.

Do you think it would make sense to use MAPC and FeatureLinker in the separate branch where the FFM is? I would get mz_cf and rt_cf and then join these values with the exp_mass_to_charge and retention_time of every file from MSM?

Or would it make more sense to join the separate values rt_0, mz_0, etc. from MAPC and FeatureLinker with the exp_mass_to_charge and retention_time of every file from MSM?

I would do the mapping from MSM to features per file, i.e. in two branches of the same loop, because after MapAlignment the RT values might be shifted slightly.

But yes, we need to find something to map the IDs of the features from the single runs to the consensus run. I have to think about that a bit.
Internally there should be a unique ID for each feature, maybe we can access them in the consensus somehow.

Yes, I think I have done something similar here.

In the Java Snippet in this example, I am using the rt_cf, mz_cf (from MAPC and FeatureLinker) and the retention_time, exp_mass_to_charge values for the tolerances.

// ppm window around the experimental m/z and a +/- 30 s window around the consensus RT
if (c_exp_mass_to_charge - c_exp_mass_to_charge * 25 * 1/1000000 <= c_mz_cf && c_mz_cf <= c_exp_mass_to_charge + c_exp_mass_to_charge * 25 * 1/1000000) {
    if (c_retention_time >= c_rt_cf - 30 && c_retention_time <= c_rt_cf + 30) {
        out_newrt = c_rt_cf;
        out_newmz = c_mz_cf;
    }
}

However, wouldn't it be more accurate to use the rt, mz of every file separately for the tolerances in the Java Snippet (i.e. rt_0, mz_0 with retention_time, exp_mass_to_charge, and then rt_1, mz_1 with retention_time, exp_mass_to_charge again)?

Yes, I think it would be more accurate if you compared against the values of the file the MSM matches came from.

Preferably even before MapAligner, as mentioned, because alignment will introduce RT shifts.
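
For illustration, a per-file version of the check might look like the snippet below. The column names c_rt and c_mz for the un-aligned feature values are assumptions and depend on how your table is named, and the 20 ppm / +/- 30 s windows are just example values:

// Sketch only: compare the MSM hit against the un-aligned feature rt/mz of the same file
if (c_exp_mass_to_charge - c_exp_mass_to_charge * 20 * 1/1000000 <= c_mz && c_mz <= c_exp_mass_to_charge + c_exp_mass_to_charge * 20 * 1/1000000) {
    if (c_retention_time >= c_rt - 30 && c_retention_time <= c_rt + 30) {
        out_newrt = c_rt;
        out_newmz = c_mz;
    }
}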

So basically, put MAPC after the Cross Joiner and the Java Snippet where I check for tolerances. But that means using MAPC after the join/comparison has taken place, i.e. running MAPC on the ~66 metabolites (in this case) that were joined. Is that right?

No, I mean associating/cross-joining features with identifications in the same loop. Then align and link. Then re-associate all identifications from the single runs to the consensus.

OK, I think I am missing something again. How could I use MAPC after the loop if I already do the cross-join first inside the loop? There are no featureXMLs from MSM (it exports only mzTab); there is only the cross-joined table. How could I align? Do you also mean checking for tolerances between the Cross Joiner and the loop end?

Metabolite_SpectralID.knwf (72.8 KB)

I think you may mean something like the picture below.

However, it is impossible to perform the second cross-join (after branch 1 and branch 2) because branch 1 creates a table with ~43,000 rows and branch 2 a table with ~14,000,000. The cross-join will take forever because the tolerances (Java Snippet) are only applied later. But I guess that would, in theory, be a way to include correction of retention time distortions.

Why does branch 2 have 14,000,000 rows? You only need to keep the associations where the ID was in tolerance.

In branch 2 you have to check for tolerances.

When merging the branches, you should be able to join exactly based on rt, mz, iteration of branch 2 and rt_n, mz_n of the consensus branch 1, where n is the iteration/file number.
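
One way to keep branch 2 small is sketched below with assumed names (c_rt / c_mz from the per-file features, c_retention_time / c_exp_mass_to_charge from MSM; the out_ columns are new output columns you would define in the snippet; 20 ppm and +/- 30 s are example values): flag each cross-joined row that is within tolerance, drop everything else with a Row Filter, and carry the feature's rt and mz forward as join columns for the final Joiner against rt_n / mz_n of the consensus table. The iteration number can be appended from the loop's currentIteration flow variable.

// Sketch only: flag in-tolerance associations so a downstream Row Filter can remove the rest
double ppmWindow = c_exp_mass_to_charge * 20.0 / 1000000; // 20 ppm, example value

boolean mzOk = c_mz >= c_exp_mass_to_charge - ppmWindow && c_mz <= c_exp_mass_to_charge + ppmWindow;
boolean rtOk = c_retention_time >= c_rt - 30 && c_retention_time <= c_rt + 30; // +/- 30 s, example value

out_in_tolerance = mzOk && rtOk;
out_join_rt = c_rt; // to be joined against rt_n of the consensus branch
out_join_mz = c_mz; // to be joined against mz_n of the consensus branch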


Thank you very much! I believe it is as accurate as it can get if I want MAPC. It is a bit expensive because of the cross-join, but it is feasible! Now I get around 450 metabolites in total (from 2 replicates) with corrected RTs, which is more reasonable!

So, after trying negative and positive ionization mode with the corresponding files, it is possible that in the last join (joining exactly on rt, mz, iteration of branch 2 and rt_n, mz_n of the consensus branch 1) I get an empty table. In positive mode I do get 225 metabolites, but in negative mode nothing.

P.S.: The weird thing is that from the 2 negative ionization files I get 160 identified metabolites from the MSM branch (branch 2), which I join exactly on rt_n, mz_n and iteration. Joining on rt_0, mz_0 and iteration I get an empty table, and joining on rt_1, mz_1 and iteration I get 83 rows. The other 77 rows (of the 160) belong to the 1st file with rt_0, mz_0, which is kind of weird. But I imagine it has to do with the fact that the RTs of the 1st file are actually shifted?

So, with the two files in negative ionization mode, if in the last join (branch 1 against branch 2) I use mz, iteration and intensity of branch 2 with mz_n, iteration and intensity_n of the consensus branch 1 (where n is the iteration/file number) instead of rt, I get the expected results: 77 rows from the 1st replicate and 83 from the 2nd replicate, a total of 160 metabolites. However, even if it is correct now, I am afraid it is not reliable, and I am not sure it makes sense…

Hmm, I think you are right: rt_0, rt_1, etc. are already the shifted times. Only the reference map will have the original values. I have to think about it more.