Expand key-value pairs from a cell to multiple columns

mpreusse · September 17, 2021, 3:02pm

I have a table that contains a field with key value pairs (it’s a gff file).

Each row contains a different set of key value pairs in that field.

Parent=transcript:ENST00000417324;Name=ENSE00001727627;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00001727627;rank=3;version=1
ID=CDS:ENSP00000493376;Parent=transcript:ENST00000641515;protein_id=ENSP00000493376

I would like to map the values into new columns with the keys as headers. I tried the ‘Cell Splitter’ already but that works best for a fixed split with a fixed output.

Do you have any pointers how to approach this?

bruno29a · September 17, 2021, 8:23pm

Hi @mpreusse and welcome to the Knime community.

You can use Cell Splitter, but there are more operations to be added to accomplish what you want to do.

The main challenge here is that your records might not all have the same keys. For this type of structure, I usually use JSON, as that’s exactly how a JSON “structure” is, and also because converting JSON to a table, Knime will automatically create the necessary columns dynamically.

If I look at your data, I am assuming that each pair is delimited by a semi-colon (;), and that a pairing of key value is determined by key=value format. I am assuming it’s 2 lines, therefore 2 records, is that correct? I’m making this assumption because I see the key Parent repeating.

I have put something together based on these assumptions, and here are the results:

The workflow looks like this:

And I used the input that you gave:

Here’s the workflow: Expand key-value pairs into columns.knwf (10.4 KB)

mpreusse · September 18, 2021, 8:02am

Thanks @bruno29a, that is a really elegant solution!

I am new to KNIME and did not know about the JSON nodes before.

Next step is to run it in parallel because I have around 3 million rows with the key value pairs

bruno29a · September 18, 2021, 3:17pm

No problem @mpreusse , it should still work with 3 million rows. It’ll just require some memory, and will take longer to run.

mpreusse · September 19, 2021, 9:41am

I tried it without chunking or parallelization and it was really fast, took a few minutes. Really impressive!

bruno29a · September 19, 2021, 1:58pm

Hi @mpreusse , yes, it should be ok without chunking/batches or parallelization. There’s really not much happening there. They’re straight forward operations.

Glad that it worked very well for you.

system · September 26, 2021, 1:58pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.