HTTP Retriever custom proxy server config causes node to fail

Nancyjay · January 14, 2021, 10:41am

Hi,

I’m trying to override the proxy settings using host & port flow variables. This causes the node to fail with an “Execute failed: (“NullPointerException”): null” error. Entering the host and port manually in the “Proxy” tab causes the exact same error.

Here is an example workflow:
Proxy_Example.knwf (54.7 KB)

Thanks,
Nancy.

qqilihq · January 15, 2021, 4:14pm

This is fixed in the latest build of the 2.4-dev. branch.

–Philipp

qqilihq · January 16, 2021, 9:40am

@Nancyjay Let me know if this works.

Nancyjay · January 16, 2021, 12:55pm

It is working with the 2.4 dev branch. Thank you!

qqilihq · January 16, 2021, 3:43pm

Great – thanks for letting me know!

Nancyjay · January 19, 2021, 2:55pm

Hi again!

I believe the proxy config for the HTTP Retriever node is still not working as it should. I initially thought it was working because the response code was “200”, but I cannot explain the behaviour of the following workflow other than as an issue with the proxies.

When I make a request without a proxy (i.e. when the useProxy bool is false), the requests succeed and return the desired page. However, the second and subsequent iterations do use a proxy, and while the response code is “200”, the page is actually a “maintenance” page even though the site is not under maintenance. This only happens under two conditions:

the proxy config is enabled and uses the IP Address and Proxy flow variables
the “accept” header is not included.
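
Since the status code is “200” in both cases, the status alone cannot tell the real page from the maintenance page. Below is a minimal Python sketch of a body check that could be run outside KNIME. The function name `looks_like_maintenance` and the marker strings are hypothetical and would need to be adapted to what the site actually returns:

```python
def looks_like_maintenance(status_code, body, markers=("maintenance",)):
    """Return True if a 200 response appears to be a placeholder page.

    `markers` are hypothetical substrings; adjust them to the real page.
    """
    if status_code != 200:
        return False  # non-200 responses are a different failure mode
    text = body.lower()
    return any(marker.lower() in text for marker in markers)
```

Applying such a check to every iteration would make it easier to see which proxy and header combinations trigger the maintenance page.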

The site has a lot of anti-scraping mechanisms, so maybe I’m wrong, but I think it is the proxies. Here is the workflow:
Proxy_Example.knwf (52.6 KB)

Thanks,
Nancy.

qqilihq · January 19, 2021, 6:55pm

Hi Nancy,

I’m rather sure that this has nothing to do with the HTTP Retriever’s proxy implementation.

The request obviously goes through successfully (i.e. a response is returned, even a 200). My gut feeling: Server detects that you’re using a proxy and thus returns the maintenance page.

The proxies you use are obviously from a public list, so they can be scored and filtered rather easily (you can check this yourself through an IP reputation service, e.g. IPVoid).

– Philipp

Nancyjay · January 20, 2021, 9:52am

Hi Philipp,

Thanks for your reply.

I had considered this, and yes, in the example I posted I am using public IPs, but I have also bought some IPs, which did not work either and behaved exactly the same as the public ones. I also copied and pasted my own IP address & port number and used those as flow variables to see if they would work, but they caused the exact same behaviour too.

I am able to scrape this website with Python using rotating proxies, so it does seem to be an issue exclusive to KNIME, specifically the HTTP Retriever node, as it also works with the Python Script node.
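
For reference, this is roughly what a rotating-proxy request looks like in Python. It is only a sketch using the standard library, not the actual script: the proxy addresses are placeholder documentation addresses, and the helper names `rotating_openers` and `fetch` are made up for illustration.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints (documentation-only addresses); replace them.
PROXIES = ["192.0.2.10:8080", "192.0.2.11:3128"]

def rotating_openers(proxies):
    """Yield (proxy, opener) pairs forever, cycling through the proxy list."""
    for proxy in itertools.cycle(proxies):
        handler = urllib.request.ProxyHandler(
            {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        )
        yield proxy, urllib.request.build_opener(handler)

def fetch(opener, url, accept=None, timeout=10):
    """Fetch url through the opener, optionally sending an Accept header."""
    headers = {"Accept": accept} if accept else {}
    request = urllib.request.Request(url, headers=headers)
    with opener.open(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Calling `fetch` once with and once without `accept` through the same proxy would mirror the two conditions described above.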

qqilihq · January 20, 2021, 11:09am

Hi Nancy,

Tough one! I just did a test request using the pure Palladian lib (without the surrounding KNIME) and couldn’t reproduce this. Maybe it’s caused by the archaic JVM version KNIME uses? I have no clue at the moment, to be honest, and it definitely smells like a rabbit hole.

You could enable DEBUG logging and check all the requests in the console to see whether there’s some clue as to what might be wrong. If this still does not help, one would probably need to investigate this via Wireshark or the like.
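
A lighter-weight option before reaching for Wireshark could be a small local endpoint that records exactly what a client sends. The following is a minimal sketch under assumptions, not a tested recipe for the HTTP Retriever: `RecordingHandler`, `start_recorder`, and `RECORDED` are made-up names, and it only handles plain HTTP GET requests.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

RECORDED = []  # one (request_target, headers_dict) entry per GET request

class RecordingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Store the request target and headers exactly as received.
        RECORDED.append((self.path, dict(self.headers.items())))
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default stderr logging

def start_recorder(port=0):
    """Start the recorder on localhost; port=0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), RecordingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

if __name__ == "__main__":
    server = start_recorder()
    url = f"http://127.0.0.1:{server.server_port}/probe"
    # Empty ProxyHandler: ignore proxy environment variables for this check.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    opener.open(urllib.request.Request(url, headers={"Accept": "text/html"})).read()
    for target, headers in RECORDED:
        print(target, headers)
    server.shutdown()
```

Sending the same request from the HTTP Retriever and from the Python script to this endpoint and comparing the recorded headers might show a difference. A standard HTTP proxy client configured with this host and port should deliver plain `http://` requests here with the absolute URL as the request target, but whether the HTTP Retriever does so is an assumption.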

–Philipp

system · April 21, 2023, 9:38pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.