Extract HTML links from a webpage

Hi everyone,

I would like to extract all (relevant) URLs from a website, e.g. https://www.stepstone.de/jobs/data-science, to read out the text with the ‘Web Text Scraper’ node.

The ‘Get Request’ and ‘Webpage Retriever’ nodes did not work in my tests (error: ‘Read timed out’; timeout (s): 20).

Thanks a lot for your help!

Dear @mkbuennem,

I just tried to open the page in a browser and the request timed out as well. So one explanation is that the server isn’t responding in time.

Looking at the response with Postman, however, suggests that the page is using JavaScript to prevent scraping. If you’d like to continue building your workflow, I’d suggest running the search manually in a browser and saving the page as HTML. You can then read the saved page in KNIME with the XML Reader and build the downstream processing from there. Your best option for getting the page content directly might be the Python Integration together with Beautiful Soup.
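
For the link extraction step itself, here is a minimal sketch of what the Python side could look like once you have the HTML as a string (e.g. from the manually saved page). It is only an illustration, not a tested workflow: it uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra installs, and the names `LinkCollector` and `extract_links`, the sample HTML, and the `/job/` path filter are all assumptions made up for the example, not the real structure of the site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links such as "/job/1" into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html_text, base_url, must_contain=""):
    """Return unique absolute links, optionally filtered by a substring."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    unique = list(dict.fromkeys(parser.links))  # drop duplicates, keep order
    return [url for url in unique if must_contain in url]


# Hypothetical sample standing in for the saved page; in practice you would
# read the saved file instead, e.g. open("saved_page.html", encoding="utf-8").
sample_html = """
<html><body>
  <a href="/job/example-1">Job 1</a>
  <a href="/job/example-1">Job 1 (duplicate)</a>
  <a href="https://www.stepstone.de/about">About</a>
  <a name="no-href">Anchor without a link</a>
</body></html>
"""

base_url = "https://www.stepstone.de/jobs/data-science"
job_links = extract_links(sample_html, base_url, must_contain="/job/")
print(job_links)
```

The `must_contain` filter is one simple way to keep only the "relevant" URLs; which substring actually identifies a job posting has to be checked against the saved page. The resulting list could then be passed on as a table column to the ‘Web Text Scraper’ node.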

Best,
Stefan


Hi
In addition to @stelfrich’s great post, I would suggest having a look at their API. A first search result suggests that they have one, so that would be my preferred way to go.
Best regards

Hi @mkbuennem

Have a look here to get inspired by @kowisoft, who did something quite similar:

Hope this helps!
