I’m trying to scrape the agendas off city government websites, all of which use a vendor called Laserfiche. My plan is to pull the list of URLs for all meetings in 2022 and then 2023 and then loop through those URLs to grab the individual agenda items.
I can see the agendas just fine in my browser, but KNIME's Webpage Retriever node appears to be getting different code back. It looks to me like the site may be refusing to return the full page because cookies aren't accepted. I've tried turning on the option to collect the cookies in a separate column to see if that would trigger a cookie response, but nothing changed.
I suspect the problem is KNIME is acting like a browser that doesn’t accept cookies. Anyone know of a way around this? Or am I misdiagnosing the problem?
I should've added that the links to the agendas (and the text itself) show up when I open Developer Tools in my browser. But the code looks completely different in KNIME.
What you see in KNIME is the static webpage content, not the content that gets added dynamically by JavaScript. Cookies are not really the issue here. You can check these posts for an idea:
Alternatively, have a look at the endpoint https://lf.hopkinsmn.com/WebLink/FolderListingService.aspx/GetFolderListing2 in your browser's dev tools, which returns the directory content in response to a POST request.
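If you want to try that endpoint outside KNIME, here is a minimal sketch in Python (standard library only). The payload field names below are placeholders, not the real WebLink API — copy the exact JSON body your browser sends (visible in the dev-tools Network tab for the GetFolderListing2 request) into `payload`, since the required fields depend on the Laserfiche WebLink version:

```python
import json
import urllib.request

URL = ("https://lf.hopkinsmn.com/WebLink/"
       "FolderListingService.aspx/GetFolderListing2")

# Hypothetical payload: replace these keys/values with the exact JSON
# body shown in your browser's dev tools for this request.
payload = {"folderId": 0, "getNewListing": True}

# ASP.NET AJAX-style endpoints like this one expect a JSON body,
# so set the Content-Type header accordingly.
req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to actually send the request and parse the folder listing:
# with urllib.request.urlopen(req) as resp:
#     listing = json.loads(resp.read())
#     print(listing)

print(req.get_method(), req.get_header("Content-type"))
```

Once you can see the JSON coming back, looping over the 2022/2023 meeting entries becomes a matter of iterating over that response instead of scraping rendered HTML.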