HTTP Retriever - problem with some URLs?

kilian.thiel · July 21, 2015, 11:17am

Hi,

I have one URL for which the HTTP Retriever does not download the correct content. When I access the URL in a browser, the result differs from the HTTP Retriever result. The URL is an RSS feed:

https://idw-online.de/de/pressreleasesrss?country_ids=35&country_ids=36&country_ids=46&country_ids=188&country_ids=65&country_ids=66&country_ids=68&country_ids=95&country_ids=97&country_ids=121&country_ids=126&country_ids=146&country_ids=147&country_ids=180&category_ids=10&category_ids=7&field_ids=100&field_ids=101&field_ids=401&field_ids=603&field_ids=600&field_ids=400&field_ids=606&field_ids=204&field_ids=102&field_ids=306&langs=de_DE&langs=en_US

In the browser, the RSS content is shown. When I use the HTTP Retriever node followed by the Feed Parser, the RSS content cannot be extracted. Attached is a workflow that shows the problem. Any ideas?

Cheers, Kilian

rssfeedurl-testing.zip

qqilihq · July 21, 2015, 12:20pm

Hi Kilian,

I checked your workflow. However, the issue does not seem to be the HTTP Retriever but the Feed Parser, which parses the feed's meta information but not its items. The problem seems to be specific to the KNIME nodes, as the same feed is parsed correctly when I run it directly from code using the Palladian lib.

I will investigate this further when I have some spare time and get back to you.

[edit] Are there any further feeds where you encountered the problem?

Best,
Philipp

kilian.thiel · July 21, 2015, 1:12pm

Hi Philipp,

thank you for your answer. When I try a shorter URL it seems to work.

https://idw-online.de/de/pressreleasesrss?country_ids=35&country_ids=36&field_ids=306

However, I assume the problem is the HTTP Retriever, at least for the long URL. When I try the long URL, the result of the HTTP Retriever is not a valid RSS feed. It is XML but contains no news items. I believe that for some URLs (maybe very long URLs?) the HTTP Retriever has problems.

Cheers, Kilian

qqilihq · July 21, 2015, 2:49pm

Hi Kilian,

thanks for getting back, I'll have a look.

Philipp

qqilihq · July 22, 2015, 10:09am

Hi Kilian,

thanks for spotting that issue. This was indeed a regression introduced recently in the HTTP-specific Palladian code, which did not correctly parse URLs containing several query parameters with the same name. The problem is now fixed in the latest build.
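
To illustrate this class of bug, here is a minimal sketch in Python using only the standard library. It is not the Palladian code (which is Java) and makes no claim about how the actual regression was implemented; it only shows, under the assumption that the query string is stored in a one-value-per-name map, how repeated parameters such as `country_ids` get lost when the URL is rebuilt. The variable names are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Shortened feed URL from this thread, with a repeated parameter name.
url = ("https://idw-online.de/de/pressreleasesrss"
       "?country_ids=35&country_ids=36&field_ids=306")
query = urlsplit(url).query

# Lossy: a dict holds one value per name, so later values overwrite earlier ones.
collapsed = dict(parse_qsl(query))

# Lossless: a list of (name, value) pairs keeps duplicates and their order.
pairs = parse_qsl(query)

print(urlencode(collapsed))  # country_ids=36&field_ids=306
print(urlencode(pairs))      # country_ids=35&country_ids=36&field_ids=306
```

A request rebuilt from the collapsed form asks the server for a different filter than the browser does, which would be consistent with receiving valid XML that contains no news items.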

Best,
Philipp

kilian.thiel · July 22, 2015, 3:01pm

Hi Philipp,

thank you for fixing it.

Cheers, Kilian
