Fetching HTML tables from webpages

Hi Team,

I have hundreds of webpage URLs from which I need to fetch only the HTML tables and store them in KNIME tables.
Any starting points/examples would be helpful.

Thanks
Ravikiran

I found some examples using HTTP Retriever → HTML Parser → XPath nodes.
Is this the right way to go? If there is a better way, please let me know.

That’s a good starting point, yes! If it gets more complex, the Selenium nodes can be of use.
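
For anyone who wants to prototype the same retrieve → parse → extract idea outside KNIME, here is a minimal Python sketch using only the standard library. The names `TableExtractor`, `extract_tables`, `fetch_html` and the sample HTML are made up for illustration; it assumes simple, non-nested tables and is not how the KNIME nodes work internally.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def extract_tables(html_text):
    """Return all tables found in an HTML string (nested tables are not handled)."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.tables


def fetch_html(url):
    """Download a page as text; assumes UTF-8 and does no error handling."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample page so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
<p>Intro text that should be ignored.</p>
<table>
  <tr><th>Program</th><th>Phase</th></tr>
  <tr><td>Example A</td><td>1</td></tr>
</table>
</body></html>
"""

tables = extract_tables(SAMPLE_HTML)
print(tables)
```

For a list of real URLs you would loop over them and call `extract_tables(fetch_html(url))`. Pages that build their tables with JavaScript will not work this way, which is the case where the Selenium nodes come in.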

Sure, thanks for the pointer. I will look into the Selenium nodes also.

In an input XML, I need to select everything other than a few elements using the XPath node.

From the XML input below, I need to select everything other than ‘div’ and ‘span’.

<?xml version="1.0" encoding="UTF-8"?>
<th class="border-right vertical-center height-100" scope="row" xmlns="http://www.w3.org/1999/xhtml">
    <div class="hide-desktop rights-label">
        <span>Rights</span>
        <img src="https://www.schrodinger.com/sites/default/files/schrodinger_logo_horizontal.png">
        </img>
    </div>
    <span class="hide-desktop program-label">Program</span>
    <strong>SGR-1505 (MALT1)</strong>
    <small>Relapsed /
            Resistant<br>
        </br>Non-Hodgkin's Lymphoma </small>
</th>

The query is:

//dns:th/*[not(self::div or self::span)]

Expected output:

    <strong>SGR-1505 (MALT1)</strong>
    <small>Relapsed /
            Resistant<br>
        </br>Non-Hodgkin's Lymphoma </small>

Actual output from XPath node:

    <div class="hide-desktop rights-label">
        <span>Rights</span>
        <img src="https://www.schrodinger.com/sites/default/files/schrodinger_logo_horizontal.png">
        </img>
    </div>

I am new to XPath so any help is appreciated.

Additional info:
When I use the same input and query in the XPath Expression Testbed, it works as expected.

Thanks
Ravikiran

Which website(s), which tables?
br

Hi @ravikiran

I must admit that this exclusion route is quite a complex approach, but you can achieve it with: //*[not((name()='img') or (name()='span') or (name()='th') or (name()='div'))]

Output type String(CollectionCell) gets you the direct text output which you can subsequently process to your liking.
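
One likely reason the original query did not exclude anything: in XPath 1.0 an unprefixed name test such as self::div only matches elements in no namespace, while every element in this input is in the XHTML default namespace. A prefixed variant along the lines of //dns:th/*[not(self::dns:div or self::dns:span)] might therefore behave as intended, though I have not tested it in the XPath node. The Python sketch below shows the same effect with the standard library, which stores each tag together with its namespace; `local_name` and `children_except` are helper names invented for this example.

```python
import xml.etree.ElementTree as ET

# The input from the question, without the XML declaration.
SAMPLE = """<th class="border-right vertical-center height-100" scope="row" xmlns="http://www.w3.org/1999/xhtml">
    <div class="hide-desktop rights-label">
        <span>Rights</span>
        <img src="https://www.schrodinger.com/sites/default/files/schrodinger_logo_horizontal.png">
        </img>
    </div>
    <span class="hide-desktop program-label">Program</span>
    <strong>SGR-1505 (MALT1)</strong>
    <small>Relapsed /
            Resistant<br>
        </br>Non-Hodgkin's Lymphoma </small>
</th>"""


def local_name(element):
    """Strip the '{namespace}' part that ElementTree prepends to tags."""
    return element.tag.rsplit("}", 1)[-1]


def children_except(xml_text, excluded):
    """Return the direct children of the root whose local name is not excluded."""
    root = ET.fromstring(xml_text)
    return [child for child in root if local_name(child) not in excluded]


root = ET.fromstring(SAMPLE)
# Tags carry the namespace, so a plain comparison against "div" never matches.
raw_tags = [child.tag for child in root]
print(raw_tags[0])

kept = children_except(SAMPLE, {"div", "span"})
kept_names = [local_name(e) for e in kept]
# Concatenate each element's text and collapse whitespace.
kept_text = [" ".join("".join(e.itertext()).split()) for e in kept]
print(kept_names)
print(kept_text)
```

Here only the strong and small children remain, which matches the expected output in the question.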

Hope this helps!
