FeedParser: Problems with Yahoo's feeds (and other feeds)

Hi

We experienced some problems with the FeedParser node in a workflow like TableCreator (with URL) > HttpRetriever > FeedParser.

The FeedParser stops with an error when reading Yahoo's feeds, e.g. http://news.yahoo.com/rss/education. The error looks like this:


Execute failed: ws.palladian.retrieval.parser.ParserException: org.xml.sax.SAXParseExceptionpublicId: -//W3C//DTD HTML 4.01 Transitional//EN; systemId: http://www.w3.org/TR/html4/loose.dtd; lineNumber: 31; columnNumber: 3; The declaration for the entity "HTML.Version" must end with '>'


This seems to be a problem with all Yahoo feeds, but also with some other feeds. Is there a way to avoid it?

 

Frank

Hi again :)

I'm currently on holiday and cannot reproduce this issue, but from your description I assume that you are not getting an RSS/Atom feed but an HTML page. The URL you provided points to a valid feed, though.

I will have a look at the problem and get back to you.

Best,
Philipp

Hi Frank,

I just tried the URL you provided and it works fine for me (sample attached). I suspect that for some reason you are getting an error page instead of the feed (maybe because of internal proxy issues, WLAN authentication, ...).

It should already help to examine the result retrieved by HttpRetriever to debug this issue.
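As a rough illustration of that check outside KNIME, here is a minimal Python sketch that tells a feed apart from an HTML or error page by looking at the root element of the retrieved body. The function names and the set of accepted root elements are my own assumptions for this example, not part of Palladian.

```python
import xml.etree.ElementTree as ET

# Root element local names treated as feeds in this sketch (an assumption):
# "rss" for RSS 2.0, "feed" for Atom, "RDF" for RSS 1.0.
FEED_ROOTS = {"rss", "feed", "RDF"}


def root_element_name(body):
    """Return the local name of the root element, or None if the body
    is not well-formed XML (e.g. a typical HTML error page)."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None
    # Namespaced tags look like "{namespace}name"; keep only the name.
    return root.tag.rsplit("}", 1)[-1]


def looks_like_feed(body):
    """True if the body parses as XML and its root is a known feed element."""
    return root_element_name(body) in FEED_ROOTS
```

If this returns False for the content coming out of HttpRetriever, the FeedParser is most likely being handed something other than the feed, for example a proxy's login or error page.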

Hope this helps,
Philipp

Hi,

As you suggested, I examined the result in the HttpRetriever node. It is a response from our proxy saying that the authentication is missing.

 

Therefore it seems to be the same problem we have with the WebSearcher node behind our proxy server. See the post: http://tech.knime.org/forum/palladian/knimes-proxy-settingsauthentification-details-ignored-by-websearcher-urlresolver-nod

This is the case for all feeds. (My first example was our company's webpage and I thought that this was an "external" resource, not intranet.)

 

Maybe you can add this feature request regarding proxy authentication to your list for a future update as well, as already mentioned for the WebSearcher node.

 

Normally I define the network connections via File > Preferences > Network Connections together with the authentication details (user/password). These details are then used by nodes like FileReader and XMLReader when accessing external resources.
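For cross-checking outside KNIME whether a feed is reachable through an authenticating proxy at all, something like the following Python sketch could be used. It only relies on the standard library's `urllib.request`; the proxy address and credentials are placeholders I made up, and the helper name `build_proxy_opener` is hypothetical.

```python
import urllib.request


def build_proxy_opener(proxy_url, user, password):
    """Build a urllib opener that sends requests through proxy_url and
    answers the proxy's Basic authentication challenge (HTTP 407)."""
    # Route both plain and TLS traffic through the same proxy.
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    # Realm None means: use these credentials for any realm the proxy names.
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, proxy_url, user, password)
    auth_handler = urllib.request.ProxyBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(proxy_handler, auth_handler)


# Placeholder values (assumptions); replace with the real proxy settings.
opener = build_proxy_opener("http://proxy.example.com:8080", "frank", "secret")
# body = opener.open("http://news.yahoo.com/rss/education").read()
```

The actual request is left commented out because it depends on the local network. If it returns the feed while the KNIME workflow returns the proxy's error page, that would point to the node not picking up the proxy credentials.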

 

Frank

 

Hi Frank, thanks for the feedback. Authenticated proxy support will be added to all Palladian nodes soon. Thanks for your patience so far :)

Best,
Philipp

Hi folks,

unfortunately, I'm having the same problem using an identical workflow: Table Creator -> HttpRetriever -> FeedParser in KNIME 3.1.1.

Everything seems fine until the process hits the FeedParser node. The console shows me the following error: "Execute failed: Unexpected input type: MissingCell".

I'm trying to parse the RSS feed from http://www.heise.de/newsticker/heise-top-atom.xml.

What's happening here?

The HttpRetriever returned a Missing Cell, which means the download was not successful. Have a look at the console output, which should show you an error message. You can also enable DEBUG logging to get a more detailed description and post the output here.