HTTP Retriever custom proxy server config causes node to fail

Hi,

I’m trying to override the proxy settings using host and port flow variables. This causes the node to fail with an “Execute failed: (“NullPointerException”): null” error. Entering the host and port manually in the “Proxy” tab causes the exact same error.

Here is an example workflow:
Proxy_Example.knwf (54.7 KB)

Thanks,
Nancy.

This is fixed in the latest build of the 2.4-dev. branch.

–Philipp

@Nancyjay Let me know if this works.

It is working with the 2.4 dev branch. Thank you!

Great – thanks for letting me know!

Hi again!

I believe the proxy config for the HTTP Retriever node is still not working as it should. I initially thought it was working because the response code was “200”, but the only explanation I can find for the behaviour of the following workflow is an issue with the proxies.

When I make a request without a proxy (i.e. when the useProxy bool is false), the request succeeds and returns the desired page. However, the second and subsequent iterations do use a proxy, and while the response code is still “200”, the page returned is a “maintenance” page even though the site is not under maintenance. This only happens under two conditions:

  1. the proxy config is enabled and uses the IP Address and Proxy flow variables
  2. the “accept” header is not included.

The site has a lot of anti-scraping mechanisms, so maybe I’m wrong, but I think it is the proxies. Here is the workflow:
Proxy_Example.knwf (52.6 KB)

Thanks,
Nancy.

Hi Nancy,

I’m rather sure that this has nothing to do with the HTTP Retriever’s proxy implementation.

The request obviously goes through successfully (i.e. a response is returned, even a 200). My gut feeling: the server detects that you’re using a proxy and thus returns the maintenance page.

The proxies you use are obviously from a public list, thus they can be scored and filtered rather easily. (You can check this yourself through an IP reputation service, e.g. IPVoid.)

– Philipp

Hi Philipp,

Thanks for your reply.

I had considered this, and yes, in the example I posted I am using public IPs, but I have also bought some IPs, which did not work either and behaved exactly the same as the public ones. I also copied and pasted my own IP address and port number and passed those in as flow variables to see if they would work, but they caused the exact same behaviour too.

I am able to scrape this website with Python using rotating proxies, so it does seem to be an issue exclusive to KNIME, namely the HTTP Retriever node, as the same request also works with the Python Script node.
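
For anyone who wants to run the same comparison outside KNIME, here is a minimal sketch of the check described above, using only the Python standard library. It is not the script from my workflow; the helper names (`make_opener`, `make_request`, `fetch`) and the `User-Agent` value are made up for illustration, and the URL, proxy host and port have to be filled in.

```python
import urllib.request


def make_opener(proxy_host=None, proxy_port=None):
    """Build an opener that goes through the given proxy, or direct if none is given."""
    if proxy_host and proxy_port:
        proxy = f"http://{proxy_host}:{proxy_port}"
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    else:
        # An empty mapping disables proxies, including ones set via environment variables.
        handler = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(handler)


def make_request(url, accept=None):
    """Build a GET request, optionally with an explicit Accept header."""
    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder value, an assumption
    if accept:
        headers["Accept"] = accept
    return urllib.request.Request(url, headers=headers)


def fetch(url, proxy_host=None, proxy_port=None, accept=None, timeout=15):
    """Return (status code, body bytes) for one request."""
    opener = make_opener(proxy_host, proxy_port)
    with opener.open(make_request(url, accept), timeout=timeout) as resp:
        return resp.status, resp.read()
```

Calling `fetch` four times for the same URL (with and without a proxy, with and without `accept="text/html"`) and comparing the returned bodies should show whether the maintenance page depends on the proxy, on the missing header, or on both.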

Hi Nancy,

Tough one! I just did a test request using the pure Palladian lib (without the surrounding KNIME) and couldn’t reproduce this. Maybe it’s caused by the archaic JVM version KNIME uses? I have no clue at the moment, to be honest, and it definitely smells like a rabbit hole.

You could enable DEBUG logging and check all the requests in the console to see whether there’s some clue about what might be wrong. If this still does not help, one would probably need to investigate this via Wireshark or the like.
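
A lighter-weight option before reaching for Wireshark might be to point the proxy host and port at a small local server that just records what it receives. The following is only a sketch under a few assumptions: the names `HeaderLogger`, `start_logger` and `probe` are made up for illustration, it only handles plain `http://` GET requests (for `https://` targets a client sends `CONNECT` to the proxy, which this does not handle), and it answers every request itself instead of forwarding it.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeaderLogger(BaseHTTPRequestHandler):
    """Pretend to be a proxy: record each request, reply with a fixed body."""

    seen = []  # one (request line, lower-cased headers dict) tuple per request

    def do_GET(self):
        headers = {k.lower(): v for k, v in self.headers.items()}
        HeaderLogger.seen.append((self.requestline, headers))
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the console quiet; everything of interest is in `seen`


def start_logger(port=0):
    """Start the logger on 127.0.0.1 in a background thread (port 0 = any free port)."""
    server = HTTPServer(("127.0.0.1", port), HeaderLogger)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(proxy_port, url, accept=None):
    """Send one GET through the local logger, as a reference client for comparison."""
    proxy = f"http://127.0.0.1:{proxy_port}"
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({"http": proxy}))
    headers = {"Accept": accept} if accept else {}
    with opener.open(urllib.request.Request(url, headers=headers), timeout=5) as resp:
        return resp.status
```

With the logger running, set the node's proxy host to `127.0.0.1` and the port to `server.server_address[1]`, execute it against an `http://` URL, and inspect `HeaderLogger.seen`. A client that really uses the proxy sends the full URL in the request line, and the recorded headers can be compared with what `probe` sends from Python.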

–Philipp