Workaround for 403 errors?

stevelp · May 21, 2020, 7:06pm

Thanks Phillipp, I’m trying both of the suggestions!

The first method I attempted was to use different user agents. Some sites seem to prefer one agent over another, but no single agent works best in all cases. I was able to reduce the number of 403 responses by running two HTTP Retrievers in a chain: the first uses the Chrome user agent, and the second retries the URLs that returned an error using the Googlebot user agent.
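
For anyone who wants to try the same idea outside KNIME, here is a rough Python sketch of the chained user-agent approach. It is not what the HTTP Retriever nodes do internally. The user-agent strings and the function names (`fetch_status`, `fetch_with_fallback`) are assumptions made up for illustration, and treating any status of 400 or above as "an error" is my own simplification.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional, Tuple

# Illustrative user-agent strings (assumed values, not taken from the workflow).
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
)
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Request url with the given User-Agent header and return the HTTP status.

    HTTP error responses (such as 403) are returned as codes rather than raised.
    Connection-level failures (urllib.error.URLError) still propagate.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def fetch_with_fallback(
    url: str,
    user_agents: Iterable[str],
    fetch: Callable[[str, str], int] = fetch_status,
) -> Tuple[Optional[int], Optional[str]]:
    """Try each user agent in order and stop at the first status below 400.

    Returns (status, agent) for the last attempt, or (None, None) if no
    user agents were supplied.
    """
    last_status: Optional[int] = None
    last_agent: Optional[str] = None
    for agent in user_agents:
        last_status = fetch(url, agent)
        last_agent = agent
        if last_status < 400:
            break
    return last_status, last_agent
```

Calling `fetch_with_fallback(url, [CHROME_UA, GOOGLEBOT_UA])` mirrors the two-node chain: the second agent is only tried for URLs where the first one got an error. The `fetch` parameter exists so the retry logic can be exercised with a stub instead of real network requests.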

I am also trying the Selenium nodes, and I’ve gotten them to mostly work. I’m attaching my workflow: 403 sites using Selenium.knwf (36.8 KB)

I have a few questions:

1. The Selenium nodes are much slower, especially when extracting elements and attributes. Is there anything I can do to speed them up?
2. I received the following error message while trying to process a large number of sites: “Execute failed: timeout: Timed out receiving message from renderer: 1.171”
3. Is there any way to look for response codes with the Selenium nodes? When I’m using the Palladian HTTP Retriever, I can append a column with the HTTP status code and then skip later steps for the rows that get invalid codes. Any suggestions for doing something similar with Selenium?
4. Also, using the Selenium nodes, is there a way to extract links that are generated by JavaScript? I posted a question about this in the main forum, but I haven’t received any suggestions yet: Extract links from Javascript content on webpage