How to insert data tables from web pages?

MoBa · November 17, 2021, 7:27am

Hello all,
I didn’t find a convenient way to access and insert data tables from web pages.
Here are two simple examples:

Table with Nutrition Facts
Historical Nasdaq data

It is quite easy to import such tables into Excel, but I would like to have them in KNIME in order to transform and combine several tables.
I’m sure there must be an easy way!?
I’d highly appreciate any suggestions and support.
Many thanks in advance!

qqilihq · November 17, 2021, 11:02am

You can use the Table Extractor from the Selenium Nodes for this:

Some more background and example:

MoBa · November 17, 2021, 12:18pm

Hi @qqilihq,
thanks a lot for your quick reply!
This totally looks like the solution I was looking for.
However, since I’m using KNIME for private rather than commercial purposes, I don’t have access to the Selenium Nodes.
And the annual licence fee is quite high, so I’m afraid this wouldn’t be a realistic option for me.
Are there any alternative nodes / workflows free of charge for this task?
Thank you!

qqilihq · November 17, 2021, 12:30pm

Hi MoBa,

if it’s a one-time project, there’s a free 30-day trial of the Selenium Nodes which you can use. It gives you access to all functionality, and there are no obligations or subscriptions involved. Please feel invited to give it a try.

Alternatively, you can of course also do this “by hand”, which means replicating what the Table Extractor node does for you. The web page which you show could also be scraped using the simple HTTP Retriever and HTML Parser nodes from Palladian, which is entirely free of charge for free KNIME versions:

Use the HTTP Retriever to download the web page in question, and the HTML Parser to build a clean DOM model of the HTML page. To extract the table structure, you’d then need to employ some XPath nodes to transform the <table> structure step by step into a KNIME table.
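To make the XPath steps more concrete, here is a minimal sketch of the same idea outside KNIME, using only Python's standard library. It is not the Palladian workflow itself: the sample markup, its values, and the function name `table_to_rows` are made up for illustration, and `xml.etree.ElementTree` only accepts well-formed markup, so a real page would first need a lenient HTML parser (the role the HTML Parser node plays above).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Nutrient</th><th>Amount</th></tr>
  <tr><td>Total Fat</td><td>8 g</td></tr>
  <tr><td>Sodium</td><td>160 mg</td></tr>
</table>
</body></html>
"""


def cell_text(cell):
    """Collect all text inside a cell, including text of nested tags."""
    return "".join(cell.itertext()).strip()


def table_to_rows(markup):
    """Return (header, rows) for the first <table> found in the markup."""
    root = ET.fromstring(markup.strip())
    table = root.find(".//table")            # step 1: locate the table
    if table is None:
        raise ValueError("no <table> element found")
    header, rows = [], []
    for tr in table.findall(".//tr"):        # step 2: one element per row
        th = [cell_text(c) for c in tr.findall("th")]
        td = [cell_text(c) for c in tr.findall("td")]
        if th and not header:
            header = th                      # step 3a: header cells
        elif td:
            rows.append(td)                  # step 3b: data cells
    return header, rows


header, rows = table_to_rows(SAMPLE_HTML)
print(header)
for row in rows:
    print(row)
```

The three path expressions (`.//table`, `.//tr`, then `th` / `td`) correspond roughly to what a chain of XPath nodes would do: first isolate the table, then split it into rows, then split each row into columns.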

This is totally doable, especially if you tailor it to one specific type of table / website. What the Table Extractor from the commercial Selenium Nodes does is provide all this as a convenient, ready-to-use node to save you this manual labor.

Fingers crossed!

–Philipp

MoBa · November 17, 2021, 5:32pm

Hi Philipp,
many thanks for all your helpful input!
Since I’d like to update the web data on a regular basis, I’m afraid the 30-day free trial for the Selenium Nodes wouldn’t help in the long term.
Therefore, I will definitely give it a try with the proposed option using Palladian Nodes.
Fingers crossed

Daniel_Weikert · November 18, 2021, 9:04am

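Use a Python Source node and read the tables directly from the web page:

import pandas as pd
# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html("https://www.fda.gov/food/new-nutrition-facts-label/how-understand-and-use-nutrition-facts-label")
output_table = tables[0]  # hand the first table to the node's output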
