Workaround for 403 errors?

Thanks Phillipp, I’m trying both of the suggestions!

The first method I tried was switching user agents. Some sites seem to accept one agent but reject another, and no single agent works best in all cases. I was able to reduce the number of 403 responses by running two HTTP Retrievers in a chain: the first uses the Chrome user agent, and the second retries any URLs that returned an error using the Googlebot user agent.
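For anyone who wants to try the same idea outside KNIME, here is a rough Python sketch of the fallback chain. It is not what the HTTP Retriever nodes do internally. The user-agent strings are illustrative guesses, and `fetch_status` and `fetch_with_fallback` are hypothetical helper names I made up for this example.

```python
import urllib.error
import urllib.request

# Illustrative user-agent strings (assumptions, not the exact values
# the KNIME nodes send). They are tried in this order.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code for url when requested with user_agent.

    Network-level failures (DNS errors, timeouts) are not handled here
    and will raise urllib.error.URLError.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; keep only the code.
        return error.code


def fetch_with_fallback(url, user_agents=USER_AGENTS, fetch=fetch_status):
    """Try each user agent in order and stop at the first non-403 status.

    Returns (status, user_agent) for the last attempt that was made.
    """
    status, agent = None, None
    for agent in user_agents:
        status = fetch(url, agent)
        if status != 403:
            break
    return status, agent
```

Calling `fetch_with_fallback("https://example.com/")` returns the status code together with the agent that produced it, so URLs that still return 403 after every agent can be filtered out afterwards. The `fetch` parameter is only there so the request function can be swapped out when testing.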

I am also trying the Selenium nodes, and I’ve gotten them mostly working. I’m attaching my workflow: 403 sites using Selenium.knwf (36.8 KB)

I have a few questions:

  • The Selenium nodes are much slower, especially when extracting elements and attributes. Is there anything I can do to speed them up?
  • While processing a large number of sites, I received the following error message: “Execute failed: timeout: Timed out receiving message from renderer: 1.171”. Is there a way to avoid this?
  • Is there any way to check response codes with the Selenium nodes? When I’m using the Palladian HTTP Retriever, I can append a column with the HTTP status code and then skip the later steps for URLs that return an error code. Any suggestions for doing something similar with Selenium?
  • Also, using the Selenium nodes, is there a way to extract links that are part of JavaScript? I posted a question about this in the main forum, but I haven’t received any suggestions yet: Extract links from Javascript content on webpage