Current Sample Request / HttpRetriever use on cookie-required websites

Hi,
When I try to retrieve data from the sub-page URLs after the main link page, I only get the HTML of the main page back. Is there an example workflow showing how to solve this problem?

Thank you for your help

sample url
https://ted.europa.eu/TED/search/searchResult.do?page=2

No answer from anyone?

A helpful answer would be much appreciated. Isn't there someone who can help in this regard?

Hi @umutcankurt,

can you elaborate a little more on what you want to achieve? A request to the provided URL gives me an error.

Best,
Marten

Hi,

https://ted.europa.eu/TED/browse/browseByBO.do

I want to get the data from the pages below, but my attempts so far have not worked.
I think the site carries a session over from the home page via cookies; when I try this method, it only returns the HTML of the main page.

I couldn't figure out how to build a workflow that retrieves the page data.
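As a sketch of the cookie idea in plain Python (standard library only; the URLs come from the posts in this thread, and the actual network requests are left commented out so the sketch stays self-contained): the key point is that the home page and the sub-pages must share one cookie store.

```python
import http.cookiejar
import urllib.request

# Build an opener with a cookie jar, so cookies set by the first
# response are automatically sent along with every later request.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# 1) Visit the home page first -- the server sets its session cookies here.
# home = opener.open("https://ted.europa.eu/TED/browse/browseByBO.do")

# 2) Reuse the SAME opener for the sub-pages, so the session cookies
#    collected in step 1 travel with the request.
# page2 = opener.open("https://ted.europa.eu/TED/search/searchResult.do?page=2")
```

Requesting the sub-page with a fresh client (no shared jar) is exactly the situation where the server falls back to serving the main page.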


KNIME_TEST_ted_project.knwf (32.4 KB)

This is the trial workflow I can't get results from. Something is wrong, but I can't find it.

Hi @Marten_Pfannenschmidt,

I have a similar problem with this URL. I would be very glad to get help and support to solve it; I'm stuck here.

The cookie issue is the problem I have to solve; I'm counting on everyone's support. Thanks.

sample two
Home page
https://irl.eu-supply.com/ctm/supplier/publictenders

Parse page
https://irl.eu-supply.com/ctm/supplier/1
https://irl.eu-supply.com/ctm/supplier/2
https://irl.eu-supply.com/ctm/supplier/3
.
.
.
.

Hi @qqilihq,
Do you have a solution as a developer, or what is your feedback?

Hi umutcankurt,

I looked at the example above. As it's pulling in data via JS/AJAX/XHR, there's no easy way to use GET Request or HttpRetriever; instead you'll need a full browser as provided via the Selenium Nodes. Please see this reply for an explanation:

A simple way to detect this:

Disable JS in your web browser and try loading the page. If the desired content does not show up, you'll need a “real” web browser as provided e.g. via the Selenium Nodes.
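The same check can be expressed programmatically. A minimal sketch (the HTML snippet and the expected text are made up for illustration): fetch the page without any JS execution and test whether the content you are after is present in the raw HTML.

```python
def needs_real_browser(static_html: str, expected_text: str) -> bool:
    """True if the desired content is absent from the raw (JS-free) HTML,
    i.e. it is injected client-side and a plain GET Request /
    HttpRetriever will never see it."""
    return expected_text not in static_html

# Made-up example: the tender list is filled in via XHR, so the
# static page only contains an empty placeholder.
static_html = '<div id="tender-list"></div><script src="app.js"></script>'
print(needs_real_browser(static_html, "Public tender 12345"))  # True
```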

– Philipp


Thanks for the answer. I'm thinking of buying the Palladian nodes, but I still have a question mark in my head. I expect retrieving this data to take much longer, because a web page has to be opened and multiple pages scanned.
Do you think it is possible to run this at scale with the Palladian nodes when I want to scan a very large number of web pages (opening the browser / working in the background)?

Hi there,

to avoid confusion:

  • the Palladian nodes are free (for use in free KNIME versions)
  • the Selenium nodes are paid

In case you’re wondering whether the Selenium Nodes are the right tool for your task, I invite you to give the free 30-day trial a go.

From my experience:

I've used the Selenium Nodes several times to crawl large numbers of pages. Of course, there is a bigger performance overhead compared to a pure “download page” approach like with Palladian, but you can often optimize, parallelize, etc. Still, your throughput will always be lower with the Selenium Nodes, as they use a real web browser. But often, that's the only way to access modern web pages and web apps.
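The "parallelize" point can be sketched as follows (the fetch function is a stub standing in for one Selenium-driven browser session, and the supplier URLs are taken from the eu-supply example earlier in this thread):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_with_browser(url: str) -> str:
    # Stub for a single Selenium browser session; a real implementation
    # would navigate to the URL and return the rendered page source.
    return f"<html>content of {url}</html>"

# Sub-page URLs from the eu-supply example.
urls = [f"https://irl.eu-supply.com/ctm/supplier/{i}" for i in range(1, 4)]

# A small worker pool keeps several browser instances busy at once,
# which amortizes the per-page overhead of driving a real browser.
with ThreadPoolExecutor(max_workers=3) as pool:
    pages = list(pool.map(fetch_with_browser, urls))

print(len(pages))  # 3
```

In practice the pool size is bounded by how many browser instances your machine can comfortably run.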

My suggestion: try out the trial version and see whether it works for your problem. Feel free to get back if you need any advice regarding optimization.

Best,
Philipp


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.