Parse JCAMP file?

Is there a way to read/parse JCAMP files in KNIME? Here is a sample of what the file looks like:

##TITLE=Entry 1 xyz abcd
##JCAMPDX=Revision 4.10
##DATA TYPE=MASS SPECTRUM
##NAMES=me (C) 2023
##CAS NAME=name here
##SAMPLE DESCRIPTION=sample, sample 2, sample 3
##MOLFORM=H2O
##CAS REGISTRY NO=10282-20-1
##MW=183
##$RETENTION INDEX=776
##$CONDENSED SPECTRUM=NO
##NPOINTS=3
##XYDATA=(XY…XY)
1 28
2 83
39 1
##TITLE=Entry 2 xyz abcd
##JCAMPDX=Revision 4.10
##DATA TYPE=MASS SPECTRUM
##NAMES=me (C) 2023
##CAS NAME=name here
##SAMPLE DESCRIPTION=sample, sample 2, sample 3
##MOLFORM=H2O2
##CAS REGISTRY NO=10283-20-1
##MW=152
##$RETENTION INDEX=77
##$CONDENSED SPECTRUM=NO
##NPOINTS=4
##XYDATA=(XY…XY)
1 48
3 96
68 38
69 46

I want one row per entry, with each data field preceded by ## as its own column.

Hi @zhuma

Just to be clear, are you looking for something like this?

Yes exactly! Sorry for being unclear.

Alright @zhuma! There are various ways to approach this, but here is one that does the trick.

What it does:

  • Use a Line Reader node to import the file.

  • Since each entry has a variable number of rows, I first determine the start and end point of each group. I look for a row that contains ##TITLE, assuming that this marks a new entry, using

if (indexOf(column("Column_Arr[0]"),"##TITLE") > -1 ) {
    rowIndex()
} else {
    null
}

  • Using rowIndex() as the identifier ensures that it’s always unique. With a Missing Value node you can then fill in this group designator on the remaining rows, so that every row belongs to a group.

  • If you start a Group Loop with the group column as the included column, it will process each group as a whole.

  • The main issue here is that you have rows that do not start with ##, specifically the data rows that follow ##XYDATA. Based on your sample I had to assume that this is the only case. So what I did is: when Arr[1] is null, replace it with the value of Arr[0]; under the same condition, replace Arr[0] with the last known column name, and then group the XYDATA coordinates.

  • You can clean up the ##XYDATA=(XY…XY) row before the loop starts if it’s not a required record.

  • Next is the actual transpose of the data from row-based to column-based.

  • All actions related to the RowID happen here, together with an Insert Column Header node to make sure the correct column names are generated.

  • Since Row0 is no longer required, you can use a Row Filter to exclude the range 1 to 1.

  • The loop end will merge all groups naturally. Final output:

See WF:
parse JCAMP file.knwf (87.4 KB)
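
If you prefer scripting over nodes, the same logic can be sketched in Python (for example inside a Python Script node). This is only a rough sketch and not what the workflow above does internally. It makes the same assumptions as the steps above: every ##TITLE line starts a new entry, and the only lines without a ## prefix are the data lines after ##XYDATA. The function name parse_jcamp and the XY_POINTS key are made up for this illustration.

```python
def parse_jcamp(text):
    """Return one dict per entry from concatenated JCAMP-style text."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##"):
            # Labelled line "##LABEL=value"; the value may be empty.
            label, _, value = line[2:].partition("=")
            label = label.strip()
            if label == "TITLE":
                # Assumption: ##TITLE always marks a new entry.
                current = {}
                records.append(current)
            if current is not None:
                current[label] = value.strip()
        elif current is not None:
            # Assumption: unlabelled lines are XYDATA coordinates.
            # XY_POINTS is a hypothetical key, not a JCAMP label.
            current.setdefault("XY_POINTS", []).append(line)
    return records


sample = """##TITLE=Entry 1 xyz abcd
##MW=183
##$RETENTION INDEX=776
##XYDATA=(XY…XY)
1 28
2 83
##TITLE=Entry 2 xyz abcd
##MW=152
##$RETENTION INDEX=
##XYDATA=(XY…XY)
1 48
3 96
"""

for row in parse_jcamp(sample):
    print(row["TITLE"], repr(row["$RETENTION INDEX"]), row["XY_POINTS"])
```

Each dict corresponds to one output row and each key to one column. A blank field such as an empty ##$RETENTION INDEX simply becomes an empty string, so it does not break the grouping.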

Hope this helps!

Thank you so much! Is there a way to account for records which may not have all of the data fields available? For example:

##TITLE=Entry 3 fjk jsop
##JCAMPDX=Revision 4.10
##DATA TYPE=MASS SPECTRUM
##NAMES=me (C) 2023
##CAS NAME=Test
##SAMPLE DESCRIPTION=sample; sample2; sample3
##MOLFORM=CH3
##CAS REGISTRY NO=444-21-1
##MW=37
##$RETENTION INDEX=
##$CONDENSED SPECTRUM=NO
##NPOINTS=2
##XYDATA=(XY…XY)
1 38
3 84

Here the ##$RETENTION INDEX field is blank.

Actually, never mind. I think this is a setting in Loop End.

If I need to then print the results back in the same format as the original, would I just do the same steps, but backwards? Thank you!

That can be done more quickly by prefixing each field with its column name and a separator, creating a list for each row, ungrouping it, and then splitting the fields again. But its feasibility really depends on how dynamic the columns are. If you always expect these column names, then it should be doable.
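
To illustrate the idea outside KNIME, here is a minimal Python sketch of writing records back out. It assumes each record is a dict whose keys are the original labels in their original order, that missing values are None, and that the coordinates sit in a list under a key called XY_POINTS. The function name format_jcamp and that key are hypothetical and chosen only for this example.

```python
def format_jcamp(records, points_key="XY_POINTS"):
    """Turn a list of record dicts back into JCAMP-style text."""
    lines = []
    for record in records:
        for label, value in record.items():
            if label == points_key:
                continue  # coordinates are written right after ##XYDATA
            # Prefix the value with its column name and "=" as separator;
            # a missing value is written as an empty field.
            text = "" if value is None else str(value)
            lines.append(f"##{label}={text}")
            if label == "XYDATA":
                lines.extend(record.get(points_key, []))
    return "\n".join(lines) + "\n"


records = [
    {
        "TITLE": "Entry 3 fjk jsop",
        "MW": 37,
        "$RETENTION INDEX": None,
        "NPOINTS": 2,
        "XYDATA": "(XY…XY)",
        "XY_POINTS": ["1 38", "3 84"],
    }
]

print(format_jcamp(records), end="")
```

This relies on the fields being written in the order they are stored, so it only reproduces the original layout if the columns keep their original order.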
