Find content / column with XPath

sanderlenselink · January 26, 2025, 4:31pm

Hi,

can someone help with XPath?

Here my simplified workflow . . .

I want to retrieve the column of symbols (tickers). I suppose something has changed; I have tried a lot, but the ticker column stays empty.
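For readers following along outside KNIME, the kind of extraction described here can be sketched in Python. This is only an illustration under assumptions: the table markup below is made up (the real page structure is not shown in this thread), the tickers AAA and BBB are placeholders, and `extract_first_column` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The real page structure is not shown here.
SAMPLE_HTML = """
<html><body>
  <table>
    <thead><tr><th>Symbol</th><th>Company</th></tr></thead>
    <tbody>
      <tr><td>AAA</td><td>Example Fund A</td></tr>
      <tr><td>BBB</td><td>Example Fund B</td></tr>
    </tbody>
  </table>
</body></html>
"""


def extract_first_column(markup: str) -> list[str]:
    """Return the text of the first cell of every table body row."""
    root = ET.fromstring(markup.strip())
    # td[1] selects the first <td> child of each row (positions start at 1).
    cells = root.findall(".//tbody/tr/td[1]")
    return [(cell.text or "").strip() for cell in cells]


tickers = extract_first_column(SAMPLE_HTML)
print(tickers)
```

If a function like this returns an empty list, the expression matched nothing in the markup that was actually received, so it is worth inspecting the retrieved page before changing the XPath itself.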

I am probably making a stupid mistake, but I don’t see it.

Many thanks in advance

Xpath_1.knwf (10.7 KB)

MartinDDDD · January 26, 2025, 5:30pm

Hey there,

I think there are two issues at play - one minor and probably an oversight and one bigger one…

The small one: as far as I can tell from your workflow, you are currently getting the google.com homepage as the response. You may want to select your URL column rather than having the default google.com address scraped.

That said, it looks like Yahoo does not like to be pinged this way: the node responds with a 503 error.
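Both points can be checked before any XPath is applied. Below is a minimal Python sketch using only the standard library. It assumes nothing about how the KNIME node works internally, and `fetch` and `worth_parsing` are hypothetical helper names.

```python
import urllib.error
import urllib.request


def fetch(url: str, timeout: float = 10.0) -> tuple[int, str]:
    """Return (status_code, body_text). HTTP errors are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as HTTPError; keep the status for inspection.
        return err.code, err.read().decode("utf-8", errors="replace")


def worth_parsing(status: int, body: str) -> bool:
    """Only attempt XPath extraction on a successful, non-empty response."""
    return 200 <= status < 300 and bool(body.strip())
```

With a 503 response, `worth_parsing` returns False, so an empty ticker column would be traced back to the request rather than to the XPath expression.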

My gut feeling is that you may have to use the KNIME Web Interaction Extension to have KNIME open the website in a browser and then grab the data. There was a Just KNIME It! challenge about extracting economic information from Yahoo Finance using exactly this extension.

Here’s the solution thread with plenty of options to pick from to see how it can work:

https://forum.knime.com/t/solutions-to-just-knime-it-challenge-9-season-3/81017/30

Here is my solution:

sanderlenselink · January 26, 2025, 6:28pm

Hi MartinDDDD,

many thanks for your very quick response

regarding your first remark, you are fully right . . . sorry for wasting your time . . . I was careless when constructing my basic example flow . . . mea culpa

I tried to run your suggestion and filled in the second node, Navigator (Labs), with this URL: (link titled "Yahoo is part of the Yahoo family of brands")

That produces the following error . . .
ERROR Navigator (Labs) 5:2 Execute failed: HTTPConnectionPool(host='localhost', port=30459): Max retries exceeded with url: /session/a72344c9fef2b061b15aa84e3c85a12f/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000023659746AA0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))

Strange, because the mentioned URL is an existing page. I also tried “Refresh”.
Any idea why it produces this error?

When I run your unchanged example (with the URL titled "Yahoo is part of the Yahoo family of brands"), the results are missing values.

I hope you (or someone else) can help

Thanks in advance

rfeigel · January 27, 2025, 3:35am

Take a look at this. It pulls single ticker data but can be modified to pull other data.

I’ve modified the original to include conda propagation for the required Python environment as well as writing an output. You’ll need to change the location of the Excel Writer in the String Manipulation node.

sanderlenselink · January 27, 2025, 11:43am

Hi rfeigel,

thanks . . . I managed to run your script.

But I will have some challenges:
(1) modify it for these pages: (link titled "Yahoo is part of the Yahoo family of brands")
and 2025-01-01 / 2025-01-02 etc.

(2) run it for 500+ funds

In general . . . I don’t understand why my basic flow doesn’t work anymore. With Webpage Retriever and XPath life was so simple.

What changed at Yahoo Finance?

Thanks

rfeigel · January 28, 2025, 2:54am

Could you explain in more detail exactly what you’re trying to do?

sanderlenselink · January 28, 2025, 9:59am

Hi @rfeigel

I am building a financial model, and as part of it I check whether there have been any stock splits for the funds/companies I analyse.

As input I use e.g. this page: (link titled "Yahoo is part of the Yahoo family of brands")

I carry out this check every month. So I download the Yahoo Finance split pages for every day of a specific month, e.g.:
…calendar/splits/?day=2025-01-01
…calendar/splits/?day=2025-01-02
…
…calendar/splits/?day=2025-01-31
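The list of daily pages can be generated rather than typed out. Here is a minimal Python sketch, assuming the `?day=YYYY-MM-DD` pattern shown above. The host in `BASE_URL` is a placeholder, since the real base address is elided in this post, and `daily_split_urls` is a hypothetical helper name.

```python
import calendar
from datetime import date

# Placeholder only: the real base address is elided in the post above.
BASE_URL = "https://example.invalid/calendar/splits/"


def daily_split_urls(year: int, month: int, base_url: str = BASE_URL) -> list[str]:
    """Build one URL per calendar day of the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return [
        f"{base_url}?day={date(year, month, day).isoformat()}"
        for day in range(1, days_in_month + 1)
    ]


urls = daily_split_urls(2025, 1)
print(len(urls), urls[0], urls[-1])
```

Using `calendar.monthrange` avoids hard-coding month lengths, so February and leap years are handled without special cases.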

Later on in the KNIME flow I check, for all the funds in my portfolio, whether there was a stock split. But that’s not relevant to my actual problem.

Until recently the flow worked fine with Webpage Retriever and XPath.
Attached is the relevant part of my workflow:
Xpath_1_extended_example.knwf (102.6 KB)

I suppose something changed at Yahoo Finance, but I don’t know what.

So my problem is how to download the tables for every day. See screenshot.

rfeigel · January 29, 2025, 1:46am

Try this. Feed it a list of tickers. If any of the tickers split in the past month, it returns the ticker, split date and split ratio.

sanderlenselink · January 30, 2025, 6:27pm

@rfeigel . . .

thanks for your latest contribution . . . VERY VERY GREAT . . . it gave me a lot of insight and I learned a lot from it

Your approach is to first collect all the split data for the portfolio. Something you could not know is that my model is only interested in the most recent splits.

Furthermore, it takes a lot of running time to collect all the split data for my portfolio (about 800 funds, which takes 45+ minutes). It is a well-known problem that Python loops are very slow.

By accident I discovered the HTTP Retriever node, which reads all the data of pages such as the one titled "Yahoo is part of the Yahoo family of brands".

With your contribution in mind I started tweaking . . . and developed the attached flow.

My approach is to collect all the split data over e.g. the last 60 days. Of course this comes with a lot of redundancy (funds not in the portfolio). Later on in the flow (not attached because it is ordinary KNIME) I join all the split funds with my portfolio (joined by Yahoo ticker).
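The collect-then-join idea can be sketched in a few lines of Python. This is an illustration only, not the attached KNIME flow: the rows, the tickers AAA, BBB and ZZZ, the field names and the helper `recent_portfolio_splits` are all made up for the example.

```python
from datetime import date, timedelta

# Hypothetical sample rows, as they might look after scraping the daily pages.
SAMPLE_SPLITS = [
    {"ticker": "AAA", "date": date(2025, 1, 15), "ratio": "2:1"},
    {"ticker": "ZZZ", "date": date(2025, 1, 20), "ratio": "3:1"},   # not in portfolio
    {"ticker": "BBB", "date": date(2024, 6, 1), "ratio": "1:10"},   # too old
]
SAMPLE_PORTFOLIO = ["AAA", "BBB"]


def recent_portfolio_splits(splits, portfolio, today, window_days=60):
    """Keep split rows for portfolio tickers dated within the last window_days."""
    cutoff = today - timedelta(days=window_days)
    wanted = set(portfolio)  # set lookup keeps the join cheap for large portfolios
    return [
        row for row in splits
        if row["ticker"] in wanted and cutoff <= row["date"] <= today
    ]


matches = recent_portfolio_splits(SAMPLE_SPLITS, SAMPLE_PORTFOLIO, date(2025, 1, 30))
print(matches)
```

The number of page requests here depends on the number of days in the window, not on the number of funds in the portfolio.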

The advantages of this approach are:

runtime about 15 seconds
less redundant split data

I could only get to this thanks to your feedback . . . 1000x thanks

REMAINING QUESTION . . .
Does anyone have an idea why the HTTP Retriever node reads all the data on a certain page (the above-mentioned page), while Webpage Retriever does not?

. . . maybe this is a question to someone close to the KNIME development team

Enclosed is my flow . . .
Xpath_3_DEFDEF.knwf (89.0 KB)

system · April 30, 2025, 6:27pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.