I’m trying to scrape the agendas off city government websites, all of which use a vendor called Laserfiche. My plan is to pull the list of URLs for all meetings in 2022 and then 2023 and then loop through those URLs to grab the individual agenda items.
I can see the agendas just fine in my browser, but KNIME's Webpage Retriever node appears to be getting different code back. It looks to me like the site may be refusing to return the full page because cookies aren't accepted. I've tried turning on the option to collect the cookies in a separate column to see if that would trigger a cookie response, but nothing changed.
I suspect the problem is KNIME is acting like a browser that doesn’t accept cookies. Anyone know of a way around this? Or am I misdiagnosing the problem?
I should've added that the links to the agendas (and the text itself) show up when I open Developer Tools in my browser. But the code looks completely different in KNIME.
What you see in KNIME is the static webpage content, not the content that gets added dynamically by JavaScript. Cookies are not really the issue here. You can check these posts for an idea:
Alternatively, have a look at the endpoint https://lf.hopkinsmn.com/WebLink/FolderListingService.aspx/GetFolderListing2 in your browser's dev tools, which returns the directory content in response to a POST request.
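If you want to try that endpoint outside KNIME, here is a minimal sketch in Python (standard library only). The payload field names below are placeholders, not the real WebLink API — copy the exact JSON body your browser sends (visible in the dev-tools Network tab for the GetFolderListing2 request) into `payload`, since the required fields depend on the Laserfiche WebLink version:

```python
import json
import urllib.request

URL = ("https://lf.hopkinsmn.com/WebLink/"
       "FolderListingService.aspx/GetFolderListing2")

# Hypothetical payload: replace these keys/values with the exact JSON
# body shown in your browser's dev tools for this request.
payload = {"folderId": 0, "getNewListing": True}

# ASP.NET AJAX-style endpoints like this one expect a JSON body,
# so set the Content-Type header accordingly.
req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to actually send the request and parse the folder listing:
# with urllib.request.urlopen(req) as resp:
#     listing = json.loads(resp.read())
#     print(listing)

print(req.get_method(), req.get_header("Content-type"))
```

Once you can see the JSON coming back, looping over the 2022/2023 meeting entries becomes a matter of iterating over that response instead of scraping rendered HTML.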