Extract tbody section from HTML file

ArjenEX · May 26, 2021, 11:57am

Hello,

I am having some difficulty using KNIME to extract data from an HTML table.

Situation:
I have a directory of HTML files with a similar format, each containing a table with information that needs to be extracted.

Which is created by:

Goal:
Extract the <tbody> section of the HTML and extract the data for each <tr> and have all <td> populate a column.

Issue:
I cannot get KNIME to convert the HTML code properly so that I can reach the tbody with XPath (which I believe is the proper method for this).

Steps so far:

Use Vernalis Load Text Files node which is able to read the entire HTML and output it as File Contents column.
Use Column to XML, but it cannot handle the opening and closing angle brackets of the HTML tags and replaces them.
Use XPath to get to the <tbody>. I defined all namespaces that are mentioned in the HTML file.
The XPath editor itself also does not recognize any tag or attribute.

No success with HTML Node to Text either (error: No suitable column for org.knime.core.data.xml.XMLValue found)

Some guidance would be greatly appreciated.

Arjen

zioludo · May 26, 2021, 12:12pm

I believe the issue is that the “<” and “>” are missing in your XML column… in other words, if I am not wrong, even though the column is of type XML, what you really have is one big string.

I am not familiar with Vernalis, but try using the Palladian extension to parse the files.

Then for the XPATH node I suggest you look for “//tbody” (find tags regardless of hierarchy) and select node as output. You will then get another XML column from where you could parse the rows and the single data cells…
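
For illustration only, here is a minimal Python sketch of the same “//tbody” idea outside KNIME. The sample markup and variable names are hypothetical, and it assumes the HTML is well-formed enough to be read as XML; it is not what the KNIME XPath node does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample standing in for one of the HTML files.
SAMPLE = """<html><body><table>
<thead><tr><th>Name</th><th>Value</th></tr></thead>
<tbody>
<tr><td>alpha</td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
</tbody>
</table></body></html>"""

root = ET.fromstring(SAMPLE)

# ".//tbody" is ElementTree's spelling of the XPath "//tbody":
# find the tag regardless of where it sits in the hierarchy.
tbody = root.find(".//tbody")

# One list per <tr>, one entry per <td>, so each <td> ends up in its own column.
rows = [
    [(td.text or "").strip() for td in tr.findall("td")]
    for tr in tbody.findall("tr")
]

for row in rows:
    print(row)
```

The two-stage structure mirrors the suggestion above: first select the tbody node, then parse the rows and the single data cells from it.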

Hope it’s clear

Ludovico

L.

ArjenEX · May 26, 2021, 12:22pm

Thanks, Ludovico, for the quick response. I have tried HTML Node to Text from Palladian as mentioned, but that gives me a “No suitable column for XMLValue found” error. The HTML Parser from Palladian also requires an input. Do you know what the proper input step should be?

bruno29a · May 26, 2021, 2:15pm

Hi @ArjenEX, the Column to XML node takes the whole content as data. It treats the “<” and “>” as part of that data and therefore “escapes” them, converting them to their HTML entity codes.
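
To show what that kind of escaping looks like, here is a small Python sketch using the standard `html` module. It is only an analogy for the behaviour described above, not KNIME's actual implementation, and the sample string is made up.

```python
from html import escape, unescape

# Hypothetical fragment of the kind of markup found in the files.
raw = "<tbody><tr><td>1</td></tr></tbody>"

# Escaping treats the markup as plain data: "<" becomes "&lt;" and ">" becomes "&gt;".
escaped = escape(raw)
print(escaped)

# The tags are no longer tags, so an XPath query has nothing to match.
# Unescaping restores the original string.
restored = unescape(escaped)
print(restored == raw)
```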

I do not have any sample HTML similar to your structure (note: it is always best to give us sample data; if the data is sensitive, you can “sanitize” it by replacing it with fake data), so I used the HTML code of THIS page itself.

The HTML Parser takes the HTML string and converts it to HTML objects, which you can then access via XPath.

Here’s what my workflow looks like:

I extracted the title tag and the 2nd meta tag, and I extracted them as both XML format and string/value format for this example via XPath:

For the data from the meta tag, the meta tag does not enclose any data, but rather its “content” attribute is what has data, and you can see how I extracted that.
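
As a rough sketch of that difference between element text and attribute values, here is a Python example. The sample page and its values are hypothetical, and it assumes markup that is well-formed enough to parse as XML. The corresponding XPath expressions would be along the lines of `//title/text()` and `(//meta)[2]/@content`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
SAMPLE = """<html><head>
<title>Example page</title>
<meta name="generator" content="Example 1.0"/>
<meta name="description" content="A sample description"/>
</head><body/></html>"""

root = ET.fromstring(SAMPLE)

# The title tag encloses its data, so the value is the element's text.
title_text = root.findtext(".//title")

# The meta tag encloses nothing; its data lives in the "content" attribute.
second_meta = root.findall(".//meta")[1]
meta_text = second_meta.text               # no enclosed text
meta_content = second_meta.get("content")  # value of the attribute

print(title_text)
print(meta_text)
print(meta_content)
```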

You can play around with the XPath with your HTML content to extract tbody and values within the tr and td

Here’s the workflow: html to table.knwf (34.0 KB)

ArjenEX · May 26, 2021, 5:26pm

Thanks a lot Bruno!
It makes a lot more sense now, and I got it working the way I wanted.

zioludo · May 27, 2021, 11:54am

Hi Arjen

Let me add this reference suggestion on Xpath to Bruno’s good answer:

Have fun!

Ludovico
