XPath node: How to repeat the data from a parent tag

Thoralf · July 16, 2024, 1:36pm

For reasons of data protection, I cannot use original data and will therefore create an example with the same problem.

Let’s imagine that the contents of a DJ’s suitcase are delivered in an XML file. At song level. It would look something like this:

<box>
  <box_number>1</box_number>
    <album>
      <album_title>The Dark Side of the Moon</album_title>
      <album_artist>Pink Floyd</album_artist>
      <label>Capitol Records</label>
      <music_genre>Progressive Rock</music_genre>
      <song>
        <song_number>1</song_number>
        <song_title>Speak to Me</song_title>
        <duration>3:57</duration>
        <song_composer>Nick Mason</song_composer>
        <song_texter></song_texter>
      </song>
      <song>
        <song_number>2</song_number>
        <song_title>Breathe</song_title>
        <duration>3:57</duration>
        <song_composer>Roger Waters, David Gilmour, Richard Wright</song_composer>
        <song_texter>Roger Waters</song_texter>
      </song>
      <song>
             ...
      </song>
    </album>
</box>

Now I would like to transfer this structure into an Excel file using the XPath node. To do this, I read out the album data and sort it as a String cell and in the tag options as Multiple Rows in columns. I will leave the songs as a Node cell for now, because I will read them out in a second step. I think this is how the KNIME developers have planned the work process.

But here is my question:
If I then display the songs, they would fill 10 lines. This means that only the line with the first song contains the complete data set. The other 9 rows would have missing values in the album columns. As the photo from my real use case shows:

Xpath_reading

If a second album were to be added now, this data would be displayed in the second row, even though the song columns there contain the second song from the first album. This is then in fact an incorrect data set.

So how can I repeat the album data in the rows until all the songs have been written to the Excel table? So that each song also contains the data of the corresponding album? And only then can the data from the second album be written. How does that work?

ArjenEX · July 16, 2024, 1:55pm

Hi @Thoralf

If I understood it correctly you want to have the album information also associated with each song within that album.

My approach would to first extract all albums as nodes in multiple rows config.

Thereafter, you can dynamically retrieve the entire content of each album. Main split is here between the header information of the album and each song. For this you can go for a SingleCell option. For the songs it would be MultipleRows with a wildcard on the song identifier, like /album/song[*]/song_number

This will retrieve all information of each song and to which album it belongs.

Hope this helps!

Thoralf · July 17, 2024, 10:07am

Hey @ArjenEX, thank you very much for your answer! It works like a dream.

But I simplified the structure of the example XML a bit because I thought it wouldn’t be a problem. The real structure is:

...
   <song>
        <song_number>2</song_number>
        <song_title>Breathe</song_title>
        <duration>3:57</duration>
        <song_composer>
           <name>Roger Waters</name>
        </song_composer>
        <song_composer>
           <name>David Gilmour</name>
        </song_composer>
        <song_composer>
           <name>Richard Wright</name>
        </song_composer>
        <song_texter>Roger Waters</song_texter>
   </song>
...

The composer should be in a cell at the end, like: [Roger Waters, David Gilmour, Richard Wright]

I thought this would work by reading the <song_composer> as a node cell in the XPath data types and in the tag options as multiple rows and in a second step converting these node cells, as you once explained to me yourself: How do I bring repeating tags from an XML file into a single table cell? - #2 by ArjenEX

But by using the wildcard on the song identifier …/song[*]/… the composers are unfortunately not read in one cell, instead each composer is read in one row, which again invalidates the entire row data set.

So how can I use your good solution and still have all the composers of a song in the corresponding cell?

ArjenEX · July 17, 2024, 10:19am

Hi @Thoralf

In that case I would get the album as whole, then album details plus song as a node and then song details with composer and texter as collection.

song as Node(MultipleRows)

song_composer and song_texter as String(CollectionCell)

Thoralf · July 17, 2024, 11:45am

Thanks for your help, @ArjenEX!
it works like it should

system · July 24, 2024, 11:45am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.