How to crawl a webpage that requires login using HttpRetriever

zwang1986 · June 4, 2014, 9:09am

Hi all,

I am trying to use HttpRetriever to download the content of a webpage, but what I get is the HTML of the login page. I have already registered on this website. What should I do to get the actual content instead of the login prompt?

Best regards

qqilihq · June 5, 2014, 2:05am

Hi zwang,

Are we talking about form authentication here? Our nodes cannot currently handle this.

Best,
Philipp

zwang1986 · June 5, 2014, 3:32am

Thanks Phil,

Yes, it's about form authentication. I will try other tools then.
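
For anyone landing here later, form authentication outside of KNIME can be sketched roughly as below, using only the Python standard library. This is a minimal sketch under assumptions, not a recipe for any particular site: the form field names `username` and `password`, the URLs, and the helper names are all hypothetical, and many sites also require hidden form fields or CSRF tokens that this does not handle.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Create an opener that stores cookies, so the login session is kept."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_login_request(login_url, username, password):
    """Build the POST request that submits the login form.

    The field names "username" and "password" are assumptions; check the
    actual <input name="..."> attributes in the site's login form.
    """
    form = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("utf-8")
    return urllib.request.Request(login_url, data=form, method="POST")


def fetch_after_login(login_url, page_url, username, password):
    """Log in first, then fetch the protected page with the same cookies."""
    opener = make_opener()
    with opener.open(build_login_request(login_url, username, password)) as resp:
        resp.read()  # the session cookie is now stored in the opener's jar
    with opener.open(page_url) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The key point is that both requests go through the same opener, so the session cookie set by the login response is sent along with the second request.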

Cheers,

Zhi

Scott_Snyder · June 5, 2014, 5:28pm

Perhaps it would be better to use the HTTP Connection node from KNIME Labs, which supports Basic Authentication as part of the node.

Also the REST nodes from the Community Nodes support HTTP requests with login (username, password).
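
To illustrate what Basic Authentication means at the HTTP level, here is a small Python sketch using only the standard library. Note that this is a different mechanism from a form login: the credentials travel in an `Authorization` header on every request, so it only helps if the site actually accepts Basic Authentication. The helper names and the example URL are hypothetical.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of the Authorization header for HTTP Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_request(url, username, password):
    """Build (but do not send) a GET request carrying Basic credentials."""
    req = urllib.request.Request(url)
    req.add_header("Authorization", basic_auth_header(username, password))
    return req


# Usage (sending is left to the caller):
# req = build_request("https://example.com/protected", "user", "pass")
# html = urllib.request.urlopen(req).read()
```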

-- Scott

troy.smith · December 19, 2014, 12:00am

Hi,

Was just looking to do this and came across this post. Got me thinking and I found a solution (that works for me at least).

1. Install the Mozilla Firefox web browser if it is not already installed.
2. Add the iMacros add-on.
3. Record a macro where you go to the site, log in, go to the page you want (I clicked a PDF I needed to download), and save the page as a local HTML file if you'd like to parse the HTML.
4. In your KNIME workflow, add an External Tool node.
5. Add the Firefox.exe path as your executable.
6. Add the iMacros command line argument to run the macro file you created in step 3 (see attached image as an example).

Downstream from the External Tool node you should now find your page saved wherever you chose, so you can read that copy and parse it as needed.
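
For reference, the same launch can be sketched as a small Python script instead of the External Tool node. This is only a sketch under assumptions: it assumes the `imacros://run/?m=<macro file>` URL form that iMacros for Firefox accepted on the command line, and the paths, macro name, and helper names are hypothetical placeholders. It also assumes the macro itself saves the page to `saved_page`.

```python
import subprocess
from pathlib import Path


def build_imacros_command(firefox_exe, macro_name):
    """Build the command line that asks Firefox to run a recorded macro.

    Assumes the iMacros "imacros://run/?m=..." URL scheme is available.
    """
    return [firefox_exe, f"imacros://run/?m={macro_name}"]


def run_macro_and_read(firefox_exe, macro_name, saved_page, timeout=120):
    """Run the macro, then read the HTML file the macro saved to disk.

    subprocess.run blocks until Firefox exits, so the macro (or the user)
    has to close the browser; otherwise the timeout raises TimeoutExpired.
    """
    subprocess.run(
        build_imacros_command(firefox_exe, macro_name),
        check=True,
        timeout=timeout,
    )
    return Path(saved_page).read_text(encoding="utf-8", errors="replace")


# Usage (hypothetical paths):
# html = run_macro_and_read(r"C:\path\to\firefox.exe", "login.iim",
#                           r"C:\path\to\saved_page.html")
```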

Hope this helps,

Troy

Since iMacros is browser-based, this technique could also be used to get pages loaded dynamically by JavaScript, as discussed in this post.

externaltoolconfig.png
