Workaround for 403 errors?

Hi,

I’ve been getting 403 errors when scraping certain websites using the HTTP Retriever. Any known workarounds for these?

I’m attaching a sample workflow that shows a few sample pages that give me a 403 error.

403 sites.knwf (7.7 KB)

Hi stevelp,

Unfortunately, these sites use measures to prevent automated crawling (the last URL uses mechanisms from Cloudflare; the others don’t say). It’s a little ridiculous, as these measures are obviously not triggered by high-volume requests, but simply by how your user agent presents itself. E.g. it wouldn’t be possible to open those pages with browsers such as Lynx either.

You could play with different user agent strings. E.g. try the following:

  • simulate a Google Bot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • simulate Internet Explorer: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; AS; rv:11.0) like Gecko

This helps in some cases, but definitely not all, and in the end it’s trial and error.

If you’re willing to give the Selenium Nodes a try: I was able to retrieve all three test URLs using them. Keep in mind that those sites use some aggressive blocking, so eventually you will likely run into limits as well!

Updated example workflow:

https://nodepit.com/workflow/com.nodepit.space%2Fqqilihq%2Fpublic%2Fforum%2F403-sites-23919.knwf


Thanks Philipp, I’m trying both of the suggestions!

The first method I attempted was to use different user agents. It seems some sites like one agent over another, but no agent is better in all cases. I was able to reduce the number of 403 responses by running two HTTP Retrievers in a chain (using the Chrome user agent first, and for those that had an error, I tried using the Google Bot agent).

I am also trying the Selenium nodes, and I’ve gotten them to mostly work. I’m attaching my workflow - 403 sites using Selenium.knwf (36.8 KB)

I have a few questions:

  • The Selenium nodes are much slower, especially when extracting elements and attributes. Is there anything I can do to speed them up?
  • I received the following error message while trying to process a large number of sites: “Execute failed: timeout: Timed out receiving message from renderer: 1.171”
  • Is there any way to look for response codes with the Selenium nodes? When I’m using the Palladian HTTP retriever, I can append a column with an HTTP status code, and then filter out future steps for those that get invalid codes. Any suggestion for doing something similar with Selenium?
  • Also, using the Selenium nodes, is there a way to extract links that are part of JavaScript? I posted a question about this in the main forum, but I don’t have any suggestions yet – Extract links from Javascript content on webpage

Hi Stevelp,

You could probably optimize this further by having a list of User Agent strings and then using loop nodes and some flow variables, where you keep trying until you receive a non-403 response (this will probably increase the complexity of your workflow considerably, not sure if it’s worth it).
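To make the retry idea concrete, here is a plain-JavaScript sketch of that loop logic (in KNIME you would express this with loop nodes and flow variables instead). The function names and the injectable `fetchStatus` parameter are my own illustration, not part of any node’s API – injecting the request function just makes the logic easy to test without hitting the network:

```javascript
// Try a URL with each user agent in turn until one gets a
// non-403 response. fetchStatus is an injectable function
// (url, userAgent) -> HTTP status code.
function fetchWithFallback(url, userAgents, fetchStatus) {
  for (const agent of userAgents) {
    const status = fetchStatus(url, agent);
    if (status !== 403) {
      return { agent, status }; // first agent that gets through
    }
  }
  return null; // every agent was blocked
}
```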

First: In general, the Selenium Nodes will always be roughly an order of magnitude slower than a simple HTTP GET, e.g. with the HTTP Retriever. The latter simply requests a file, while the Selenium Nodes act exactly like a browser (this means: downloading all referenced images and scripts, parsing the DOM, rendering it visually, etc.)

However, there are several ways to speed things up:

  1. Start the browser outside the loop – i.e. move the Start WebDriver before the loop start (as far as I see, you’ve already done this).

  2. Once you’re finished with the “design phase” of your workflow, switch to a headless browser. Last time I tested, this was faster than a UI browser, even though your mileage may vary.

  3. Disable image loading. There’s an existing “Snippet” for this in the configuration. Just select it and click the “Merge” button.

  4. Regarding your specific workflow: You’re currently extracting all <a> elements from the pages by first running a Find Elements node and then the Extract Attribute. As webpages usually contain many, many links, there’s quite some overhead involved – e.g. the Selenium Nodes produce a user-friendly cell preview of all the extracted elements (the <a> in your case), and this takes some time.

    There are two things you can do:

    1. Go to Preferences → Selenium and disable the option Save source snapshots in cells. With this setting, the source previews will no longer be created, which should speed up the workflow processing.

    2. As a further optimization you can replace the Find Elements and Extract Attribute nodes with a single Execute JavaScript node. In the JS node you can perform the XPath query and extract the href attributes in one step, and then simply return a JSON array with the hrefs:

      // find all <a> elements via XPath and return an iterator
      const iterator = document.evaluate(
      	'//a', // XPath
      	document, // context
      	null, // namespace resolver (not needed here)
      	XPathResult.ORDERED_NODE_ITERATOR_TYPE, // result type
      	null // result reuse (not applicable here)
      );
      
      const hrefs = [];
      // iterate the results and extract the `href`
      let node;
      while (node = iterator.iterateNext()) {
      	hrefs.push(node.href);
      }
      // create a JSON string for the result
      // (you can parse this using KNIME’s JSON nodes)
      return JSON.stringify(hrefs); 
      

      The resulting workflow should be considerably faster. Find it on my NodePit Space: https://nodepit.com/workflow/com.nodepit.space%2Fqqilihq%2Fpublic%2Fforum%2F403-sites-using-selenium-23919.knwf
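As a side note, for this particular query you don’t even need XPath – `querySelectorAll` does the same job. This is my own sketch, not from the workflow above; the function takes the document as a parameter so the logic can be tested outside a browser, and in the Execute JavaScript node you would simply `return collectHrefs(document);`:

```javascript
// Map every <a> element on the page to its href and serialize
// the result as a JSON string (parseable with KNIME's JSON nodes).
function collectHrefs(doc) {
  return JSON.stringify(Array.from(doc.querySelectorAll('a'), a => a.href));
}
```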

Would you be able to share a reproducible workflow somehow? I’d like to investigate this. Otherwise, a detailed log output and/or any further details would also be very helpful.

Unfortunately, there’s currently no real workaround: the Selenium API just does not give any HTTP status back. I have some ideas in the backlog for how to implement this, but at the moment it’s still an open todo.
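One interim idea (my own suggestion, not a Selenium feature) is to pre-check each URL with a plain HTTP request and only send URLs that answer with a non-error status to the Selenium branch. Caveat: as this very thread shows, a site may return 403 to a plain request but load fine in a browser, so this is mainly useful for weeding out genuinely dead pages (404s, 500s). The `headStatus` parameter is injectable for testing:

```javascript
// Keep only URLs whose pre-check status is below 400.
// headStatus is an injectable function (url) -> HTTP status code,
// e.g. backed by a HEAD request or a Palladian HTTP Retriever column.
function filterReachable(urls, headStatus) {
  return urls.filter(url => headStatus(url) < 400);
}
```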

I’ll have a look at this later!

– Philipp


Thanks @qqilihq for all the tips! This definitely helps speed things up a lot. I sent the log files and workflow to mail@seleniumnodes.com so you can take a look at some of the issues I was facing.

There’s one more question that I have regarding speed improvements. Is it possible to open multiple browser tabs at once and run them in parallel using the Selenium nodes?

Hey @stevelp,

Thanks for that, I’ll get back about these separately. This might take a bit – apologies in advance!

Not really, unfortunately. The Selenium/WebDriver API does not allow parallelization. Quite the contrary: when building the Selenium KNIME Nodes, we had to add quite some synchronization magic to prevent potential concurrency issues.

What might work for you:

  1. You could try to parallelize your workflow using the Parallel Chunk Start and Parallel Chunk End nodes and make sure to use a separate browser instance in each chunk (use the Get Pooled WebDriver node to start the browser in this case – it keeps a pool of active browsers and thus avoids launching a new one in each iteration).

  2. What I once did for a client’s large crawling project: Split up the workload in advance and then run n KNIME command-line instances in parallel, each controlling one browser.

Note that both of these solutions will require plenty of memory – so I suggest starting with a moderate number of parallel chunks, observing resource consumption, and then scaling up if needed.
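For option 2, splitting the workload in advance can be as simple as partitioning the URL list into n round-robin chunks, one per KNIME/browser instance. A minimal sketch (the helper name is my own, not a KNIME node):

```javascript
// Partition a list of URLs into n round-robin chunks,
// one chunk per parallel worker instance.
function partition(urls, n) {
  const chunks = Array.from({ length: n }, () => []);
  urls.forEach((url, i) => chunks[i % n].push(url));
  return chunks;
}
```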

– Philipp