Extract HTML links from a webpage

mkbuennem · October 10, 2022, 8:07am

Hi everyone,

I would like to extract all (relevant) URLs from a website, e.g. https://www.stepstone.de/jobs/data-science, so that I can then read out the text of each linked page with the ‘Web Text Scraper’ node.

The ‘Get Request’ and ‘Webpage Retriever’ nodes did not work in my tests (error: ‘Read timed out’; timeout (s): 20).

Thanks a lot for your help!

stelfrich · October 11, 2022, 2:14pm

Dear @mkbuennem,

I just tried to open the page in a browser and the request timed out as well. So one explanation is that the server isn’t responding in time.

Looking at the response with Postman, however, suggests that the page is using JavaScript to prevent scraping. If you’d like to continue building your workflow in the meantime, I’d suggest running the search manually in a browser and saving the resulting page as HTML. You can then read the saved page in KNIME with the XML Reader and build the downstream processing on it. Your best option for getting the page content directly might be the Python Integration together with Beautiful Soup.
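To illustrate the link-extraction step on a manually saved page, here is a minimal sketch. It is only an assumption of how this could look, not a tested KNIME workflow: it uses Python's standard-library `html.parser` instead of Beautiful Soup so that it runs without extra packages (with Beautiful Soup, the equivalent would be iterating over `soup.find_all("a", href=True)`). The names `LinkCollector`, `extract_links` and the sample HTML are made up for the example and do not reflect the real page structure.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute ones and skip duplicates
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique absolute link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample standing in for the content of a manually saved page
sample_html = """
<html><body>
  <a href="/jobs/example-1">Job 1</a>
  <a href="https://example.org/about">About</a>
  <a href="/jobs/example-1">Job 1 again</a>
  <a name="no-href">anchor without a link</a>
</body></html>
"""

links = extract_links(sample_html, "https://www.stepstone.de/jobs/data-science")
for link in links:
    print(link)
```

For a real saved file, the HTML string would be read from disk first, and the resulting list would still need to be filtered down to the relevant URLs and turned into an output table inside the Python Script node.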

Best,
Stefan

Daniel_Weikert · October 11, 2022, 3:41pm

Hi,
in addition to @stelfrich’s great post, I would suggest having a look at their API. A first search result suggests that they have one, so that would be my preferred way to go.
br

qqilihq · October 11, 2022, 6:41pm

Hi @mkbuennem

Have a look here to get inspired by @kowisoft, who did something quite similar:

Hope this helps!
