Help with getting data (table) from a webpage

Felipereis50 · November 23, 2024, 2:31pm

Hello friends,

I would like some help extracting a table from a website.
I am not familiar with Knime for extracting data from websites.

The only tool I have used is POWER BI, which automatically identifies the data.
It makes the task very simple.

I would like to learn how to do this in Knime.

If possible, could someone create a workflow for extracting the table from this site? Ranking de Fundos Imobiliários | FundsExplorer

Based on the workflow, I can study how it was done.

Table

----------------------Tried--------------------------------
I tried using this component “Table from XHTML,” but it did not return the table.
I believe I would need to customize it.

https://hub.knime.com/alexanderfillbrunn/spaces/Public/Components/Table%20from%20XHTML~CiE7hTN611IQMXBX/current-state

qqilihq · November 23, 2024, 2:56pm

Hi Felipereis50,

You can do it with the Table Extractor node:

Best regards,
Philipp

Felipereis50 · November 23, 2024, 3:07pm

Great

I’m configuring NodePit to download the nodes.
I’ll be back in no time!!!

Felipereis50 · November 23, 2024, 3:41pm

Hi friend,

I think those “Selenium nodes” are paid, can you confirm?

qqilihq · November 23, 2024, 4:29pm

Yes. There’s a free trial; afterwards there are paid licenses.

Felipereis50 · November 23, 2024, 4:38pm

What a pity,

I’ll wait to see if someone can help me using the available nodes.

rfeigel · November 23, 2024, 4:44pm

Try this for some ideas:

Felipereis50 · November 23, 2024, 9:45pm

Hi friend

Well…
I’m not even close to being a student of HTML.

My steps

I went to Google and used the HTML inspector.

I searched for “table” to find a clue about where to start, and I found all the values in the HTML table.

If I open the “child”, I can see all the code for the first value, and so on…

I tried to copy the XPath or Full XPath and paste it into the Knime XPath node.

But no success

Next Step (help)

I’ll need some help with the expression that I have to use in the XPath node.

rfeigel · November 24, 2024, 2:02am

Can you share the workflow you have?

Felipereis50 · November 24, 2024, 10:30am

Of course

here
Funds_state.knwf (76.8 KB)

But it is very, very simple. I have almost nothing.

---------------My Experience with Xpath-----------------
I have a little experience with Xpath from reading XML.
I’ve created this workflow and achieved the desired result.
But with HTML I’m lost.

mlauber71 · November 24, 2024, 1:42pm

@Felipereis50 the nodes do work, although there is a problem: the website does not seem to provide the numbers under certain circumstances, so the cells are always empty.

Here is some Python code that tries to deal with that. It is built to extract all tables into Parquet files in a directory you can specify.

Felipereis50 · November 24, 2024, 2:47pm

Thank you very much, mLauber.

I tried starting the installation of CONDA, but I couldn’t do it.
I’m using a corporate computer, and there are restrictions on installing programs.
I won’t be able to complete it.

Based on your analysis, Xpath wouldn’t be ideal, correct?
I searched on YouTube and found some tutorials on how to perform web scraping, specifically for the site I mentioned, and I only found examples using Python.
Many of them use Python libraries (Beautiful Soup).

Perhaps I could replicate it using Python Node based on the tutorial, but if I need to install any library on my computer, I won’t be able to proceed.
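
For the case where no extra library can be installed, here is a minimal sketch that reads tables using only Python's standard library (`html.parser`). The class name `TableParser`, the helper `parse_tables`, and the sample HTML are assumptions made for illustration; they are not taken from any tutorial or from this site. It only sees what is in the HTML text it is given, and it does not handle nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects every <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def parse_tables(html):
    """Hypothetical helper: return all tables found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample markup; the real page will look different.
SAMPLE_HTML = (
    "<p>intro<br><table>"
    "<tr><th>Fund</th><th>Price</th></tr>"
    "<tr><td> AAAA11 </td><td>10.50</td></tr>"
    "</table>"
)
tables = parse_tables(SAMPLE_HTML)
```

In a KNIME Python Script node the HTML string would come from an input column instead of `SAMPLE_HTML`; that wiring is left out here.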

In any case, if I can’t manage it, I’ll have to resort to using Power BI as a source to capture the table.

Thank you in advance, and I’ll consider the thread closed.

mlauber71 · November 24, 2024, 4:26pm

@Felipereis50 having the ability to install software (namely Python) might be crucial to actually using analytics tools. In this case the Webpage Retriever and so on do work in principle, but the data itself seems to be provided dynamically by some sort of sub-page which is not itself accessible.

In principle you can try a cascade of XPath and JSON Path to extract the elements you want.
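
As a rough illustration of such a cascade, the sketch below first uses an XPath-style lookup to find an element that carries JSON, then walks the JSON the way a JSON Path expression would. The markup, the `data` id, the `funds` key, and the tickers are all made up; the real page has to be inspected to find where (and whether) the data is embedded.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML with an embedded JSON payload.
PAGE = (
    "<html><body>"
    '<script id="data" type="application/json">'
    '{"funds": [{"ticker": "AAAA11", "price": 10.5},'
    ' {"ticker": "BBBB11", "price": 20.0}]}'
    "</script>"
    "</body></html>"
)


def extract_embedded_json(xhtml, script_id):
    """Step 1 (XPath): locate the script element and parse its text as JSON."""
    root = ET.fromstring(xhtml)
    node = root.find(f".//script[@id='{script_id}']")
    if node is None or not node.text:
        return None
    return json.loads(node.text)


def tickers(payload):
    """Step 2 (JSON Path): rough equivalent of $.funds[*].ticker."""
    return [fund["ticker"] for fund in payload.get("funds", [])]


payload = extract_embedded_json(PAGE, "data")
found = tickers(payload)
```

In KNIME the same two steps would map to an XPath node followed by a JSON Path node.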

The "Table from XHTML" component searches for the (first) table element //table[1] and then for the data rows and so on, if they are there. The path syntax can be somewhat confusing at first, but with a little trial and error you can manage … given that the data is actually there in the retrieved html/xml document.
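
The same idea can be sketched outside KNIME with Python's standard `xml.etree.ElementTree`, which supports a subset of XPath. This is not the component's actual implementation, only an approximation of "find the first table, then its rows and cells". The sample document and the function name `first_table_rows` are assumptions, and the input must be well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Made-up XHTML; note the empty cell, which is what you get
# when the page does not deliver the numbers.
SAMPLE = """<html><body>
<table>
  <thead><tr><th>Fund</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>AAAA11</td><td>10.50</td></tr>
    <tr><td>BBBB11</td><td></td></tr>
  </tbody>
</table>
</body></html>"""


def first_table_rows(xhtml):
    """Return the rows of the first table in document order as lists of strings."""
    root = ET.fromstring(xhtml)
    table = root.find(".//table")  # first <table> anywhere below the root
    if table is None:
        return []
    rows = []
    for tr in table.iter("tr"):
        # Direct th/td children of the row; itertext() also picks up nested tags.
        cells = ["".join(c.itertext()).strip() for c in tr if c.tag in ("th", "td")]
        rows.append(cells)
    return rows


rows = first_table_rows(SAMPLE)
```

If the rows come back with empty strings, the path is fine and the data is simply missing from the retrieved document.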

ricciV1 · November 28, 2024, 9:44am

Hey @Felipereis50,

have you considered using the KNIME Web Interaction nodes?
They are developed by KNIME and therefore free to use.

I have attached the workflow that I used.

Kind regards Ricci

webInteraction_example.knwf (54.4 KB)

Felipereis50 · November 29, 2024, 5:02pm

Hi friend

I’m trying, but now I’m getting an error from the Web Interaction Start node.

I’m looking for some help in the forum.

I set the path, but with no success:

Execute failed: Message: Could not locate chromedriver at path: C:\Program Files\Google\Chrome\Application\chrome.exe

Felipereis50 · November 29, 2024, 10:02pm

I managed it using Firefox.

Result: Thanks.
@ricciV1
Worked perfectly.

Thank you, @mlauber71, for the support, but the tip from @ricciV1 was simpler.

mlauber71 · November 29, 2024, 10:04pm

@Felipereis50 I have never used the KNIME web interaction nodes, but I will keep them in mind. They seem to be more elegant than Python code.

system · December 6, 2024, 10:04pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.