extract html code from xml column

mobmsc · July 14, 2017, 11:07am

I have a process that extracts a particular HTML table from an XML document using XPath (returning a node cell, not a string), because I need to keep the table structure in order to extract the rows and columns afterwards.

My problem is that the XPath output is another XML document, which I can't parse with the HTML parser, and if I output it as a string I lose the structure. How do I change the column value so that I can extract the table and keep the row and column values?

Example output of the XPath:

<?xml version='1.0' encoding='UTF-8'?>
<table align="center" border="1" xmlns="http://www.w3.org/1999/xhtml">
    <tbody>
        <tr>
            <td>Week commencing</td>
            <td>Teaching Week</td>
            <td>Timetable Week</td>
        </tr>
        <tr>
            <td>28 Jan</td>
            <td>1</td>
            <td>1</td>
        </tr>
        <tr>
            <td>4 Feb</td>
            <td>2</td>
            <td>2</td>
        </tr>
        <tr>
            <td>11 Feb</td>
            <td>3</td>
            <td>3</td>
        </tr>
        <tr>
            <td>18 Feb</td>
            <td>4</td>
            <td>4</td>
        </tr>
        <tr>
            <td>25 Feb</td>
            <td>5</td>
            <td>5</td>
        </tr>
        <tr>
            <td>4 Mar</td>
            <td>6</td>
            <td>6</td>
        </tr>
        <tr>
            <td>11 Mar</td>
            <td>7</td>
            <td>7</td>
        </tr>
        <tr>
            <td>18 Mar</td>
            <td>8</td>
            <td>8</td>
        </tr>
        <tr>
            <td>25 Mar</td>
            <td>Easter Week</td>
            <td>9</td>
        </tr>
        <tr>
            <td>1 Apr</td>
            <td>9</td>
            <td>10</td>
        </tr>
        <tr>
            <td>8 Apr</td>
            <td>10</td>
            <td>11</td>
        </tr>
        <tr>
            <td>15 Apr</td>
            <td>11</td>
            <td>12</td>
        </tr>
        <tr>
            <td>22 Apr</td>
            <td>12</td>
            <td>13</td>
        </tr>
        <tr>
            <td>29 Apr</td>
            <td>13</td>
            <td>14</td>
        </tr>
    </tbody>
</table>

mobmsc · July 14, 2017, 3:25pm

So the following XSLT will output the values within the table, but I can't figure out how to get the entire HTML code plus the values:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <body>
        <tbody>
          <xsl:value-of select="/"/>
        </tbody>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
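
A note on why only the values come out: in XSLT 1.0, xsl:value-of returns the string value of the selected node, i.e. the concatenated text of its descendants, whereas xsl:copy-of copies the nodes together with their markup. The same distinction can be illustrated outside KNIME with a small Python sketch. It is only an analogy, not the KNIME XSLT node itself; it assumes a shortened two-row copy of the table above, and the names TABLE_XML, text_only and markup are made up for the illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Assumption: a shortened copy of the table from the first post.
TABLE_XML = """<table align="center" border="1" xmlns="http://www.w3.org/1999/xhtml">
    <tbody>
        <tr><td>Week commencing</td><td>Teaching Week</td></tr>
        <tr><td>28 Jan</td><td>1</td></tr>
    </tbody>
</table>"""

# Serialise XHTML as the default namespace instead of an ns0: prefix.
ET.register_namespace("", XHTML_NS)
table = ET.fromstring(TABLE_XML)

# Comparable to xsl:value-of: only the text survives, the tags are gone.
text_only = "".join(table.itertext())

# Comparable to xsl:copy-of: the element is kept with all its descendants.
markup = ET.tostring(table, encoding="unicode")

print(text_only.split())
print(markup)
```

The first print shows only the cell texts; the second shows the table with its tr and td elements intact, which is the form needed to split it into rows and columns later.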

Iris · July 26, 2017, 8:42am

Hi Mob

I am not sure how far you got; here are my thoughts:

The Palladian community contribution has an HTML Parser node.

A string cell can be converted to an XML cell (String to XML) and can then be parsed using the XPath nodes.
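
One thing to watch when querying the table with XPath: the example output declares xmlns="http://www.w3.org/1999/xhtml", so every element lives in the XHTML namespace, and in XPath 1.0 an unprefixed name such as //tr only matches elements in no namespace. A prefix has to be bound to that namespace and used in the query; the XPath node in KNIME will most likely need the same, though its exact settings are not covered here. A minimal Python sketch of the idea follows. It assumes a shortened copy of the table, and the prefix x and the helper table_rows are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: "x" is an arbitrary prefix bound to the XHTML namespace.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Assumption: a shortened copy of the table from the first post.
TABLE_XML = """<table align="center" border="1" xmlns="http://www.w3.org/1999/xhtml">
    <tbody>
        <tr><td>Week commencing</td><td>Teaching Week</td><td>Timetable Week</td></tr>
        <tr><td>28 Jan</td><td>1</td><td>1</td></tr>
        <tr><td>25 Mar</td><td>Easter Week</td><td>9</td></tr>
    </tbody>
</table>"""


def table_rows(xml_text):
    """Return the table as a list of rows, each row a list of cell texts."""
    root = ET.fromstring(xml_text)
    rows = []
    # The prefixed path matches tr elements in the XHTML namespace.
    for tr in root.iterfind(".//x:tr", NS):
        cells = [(td.text or "").strip() for td in tr.iterfind("x:td", NS)]
        rows.append(cells)
    return rows


rows = table_rows(TABLE_XML)
header, data = rows[0], rows[1:]
print(header)
for row in data:
    print(dict(zip(header, row)))
```

Here the first row is treated as the header and each remaining row is paired with it, so the row and column positions are preserved.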

Finally, we also have an XSLT node.

Best, Iris

Iris · July 26, 2017, 8:45am

And this example might be useful for you: https://www.knime.org/nodeguide/applications/forum-analysis-of-the-knime-forum/parsing-the-knime-forum
