Feed Parser

Hi

I'm having problems with the FeedParser node in a workflow based on the example "Palladian 03: Parse an RSS feed". My Table Creator contains 20 links as input:

http://www.thechilicool.com/

http://www.myfantabulousworld.com/

http://www.imperfecti.com/

http://www.namelessfashionblog.com/

http://www.it-girl.it/

http://www.theladycracy.it/

http://www.thecutielicious.com/

http://babywhatsup.com/

http://wildflowergirlstyle.blogspot.it/

http://www.laragazzadaicapellirossi.com/

http://www.2fashionsisters.com/

http://www.noemiguerriero.com/

http://truccaticoneva.altervista.org/blog/

http://www.thegummysweet.com/

http://www.lapinella.com/

http://www.insideme.it/

http://eniwherefashion.blogspot.it/

http://www.lostilediartemide.it/

http://www.myurbanbonton.com/

http://www.patchworkporter.com/

Everything works until the FeedParser node, which then gives me this error:

Execute failed: ws.palladian.retrieval.parser.ParserException: org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 312; The entity name must immediately follow the '&' in the entity reference.

Where is the problem? I can't find it! Please help me!

Hi MarziaB,

none of these are RSS/Atom feed links; they are links to regular webpages. What problem are you trying to solve exactly? It sounds like you want to extract data from HTML pages. For that, you would typically use an HtmlParser node instead of a FeedParser. The workflow would look like this:

URL input → HttpRetriever → HtmlParser → ContentExtractor, XPath etc. (latter available in XML processing nodes)
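
As a side note on the error message itself: a SAXParseException complaining about '&' is typical of what an XML parser reports when it is handed an ordinary HTML page, where a bare '&' (for example in a link) is not escaped. The minimal Python sketch below reproduces the same kind of failure with the standard library. The HTML snippet and the helper name try_parse_as_xml are made up for illustration, and this is not Palladian code.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a regular HTML page: the '&' in the URL
# is not escaped as '&amp;', which is common in HTML but invalid in XML.
html_snippet = '<html><body><a href="page?a=1&b=2">link</a></body></html>'


def try_parse_as_xml(text):
    """Return None if text is well-formed XML, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)


# Parsing the raw HTML as XML fails at the bare '&'.
error = try_parse_as_xml(html_snippet)
# Once the '&' is escaped, the same text is well-formed XML.
escaped_error = try_parse_as_xml(html_snippet.replace("&", "&amp;"))

print("raw HTML:", error)
print("escaped :", escaped_error)
```

A feed parser expects well-formed XML, so it fails in this way on most regular webpages; an HTML parser is tolerant of such markup.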

If you really want to access feeds, you need to specify the explicit feed URL. You can use the FeedDiscovery node to find feeds within specified HTML pages.
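
To illustrate what "finding feeds within an HTML page" usually means: many sites advertise their feed through a link element with rel="alternate" and an RSS or Atom MIME type in the page head. The Python sketch below looks for such elements using only the standard library. It is an illustration of that common convention, not a description of how the FeedDiscovery node is implemented; the class and function names and the sample HTML are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise feeds in <link> elements.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate" type="..."> (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        mime_type = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if "alternate" in rel and mime_type in FEED_TYPES and href:
            # Resolve relative feed links against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


# Made-up page head for demonstration purposes.
sample = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" href="/feed/">'
    '<link rel="stylesheet" href="/style.css">'
    '</head><body></body></html>'
)
feeds = discover_feeds(sample, "http://www.example.com/")
print(feeds)
```

The resulting feed URLs, rather than the blog home pages, are what a feed parser needs as input.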

Does that help?

Best,
Philipp

Hi, thanks for the answer, I've just read it!

I'll do some experiments now and let you know afterwards if I have problems.

Thanks Philipp!

Now I have built Table Creator → UrlNormalizer → HttpRetriever → HtmlParser → ContentExtractor → XPath, and it gives me this:

ERROR     HtmlParser                        Execute failed: org.knime.core.data.MissingCell cannot be cast to ws.palladian.nodes.retrieval.HttpResultValue

Where is the problem??

Hi, I tried the following workflow, but it gives me this error in the HtmlParser:

Table Creator → HttpRetriever → HtmlParser → XPath → XPath

ERROR     HtmlParser                         Execute failed: org.knime.core.data.MissingCell cannot be cast to ws.palladian.nodes.retrieval.HttpResultValue

Or, if I use this workflow:

Table Creator → UrlNormalizer → HttpRetriever → HtmlParser → XPath → Java Snippet → Strings to Document

the Strings to Document output has 2 columns containing only ???? and "".

Please help me!

Hi Marzia,

it looks like something about this thread is messed up. The "MissingCell" issue you mentioned no longer occurs with current versions of the nodes. Please run the update procedure to get the current version of the Palladian nodes (File → Update KNIME, and make sure that you install the Palladian Nodes).

If there are further issues, please post an example workflow here, outlining your problem. This will make it easier for me to understand your problem.

Best,
Philipp

I only installed the program yesterday on a new PC, but I'll run the update now.

I don't know; I'm trying to parse a webpage, but I get nothing. I have to analyze 50 webpages (see above) and do text mining: find brand names and how many times each brand is mentioned. How can I do this? The problem is always that Strings to Document gives me columns containing "" or ???.

Hi Marzia,

I'm trying to understand your problem, and I assume we can find a simpler solution than your approach. Just to make sure I understood you correctly: you want to find out how often a certain brand name (e.g. 'calzedonia') is mentioned on a certain site (e.g. http://www.thechilicool.com/). Is that right?

Philipp

Yes, exactly, across all the webpages together!

I thought it would be easy, but when I built a workflow, the result either gave me problems or was not what I wanted. I'll wait for your instructions! I also need to find out whether, for example, t-shirts or jeans are mentioned more often; I thought I would have to use a dictionary for this. In the end, I would like a tag cloud containing all this information to visualize what I find.

Thanks a lot for your help and your availability.

Marzia

I also read the article "Analyzing the Web from Start to Finish", but it is very difficult to understand, and when I open the example in KNIME I don't know how to create my document. Still, it is an interesting analysis that I can try once I have solved my problems!

Good evening,

I understand. Do you have a pre-compiled list with the brand-names you want to analyze or do you need to create that as well?

Concerning your private message: Have you already checked this article by Kilian Thiel about sentiment analysis?

Best,
Philipp

Hi, yes, I have a list of brands, because I don't know how it could recognize that "calzedonia" is a brand... can it do that? And yes, I have read that article.

Marzia

If you have that list, I would recommend using Palladian's WebSearcher node and performing queries such as: "calzedonia site:thechilicool.com". This will return the pages on the given site which contain the brand's name. Tick the option "Append total result count column" in the searcher's configuration and you'll get an appended column with the number of matches for each query.

As a search engine, use Bing (you need to set up an API key at Microsoft and specify the key in KNIME's preferences under "Palladian Web Searcher"):
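
To make the query pattern concrete, here is a minimal Python sketch of how one query per brand and site could be generated, which corresponds to cross-joining a brand table with a URL table and concatenating the two values. The function names and the sample lists are assumptions for illustration; in KNIME the same strings would be built with nodes rather than with this code.

```python
from itertools import product
from urllib.parse import urlparse


def site_domain(url):
    """Reduce a page URL to the host name used after 'site:' (drops a leading 'www.')."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def build_queries(brands, urls):
    """Create one '<brand> site:<domain>' query for every brand/site combination."""
    return [f"{brand} site:{site_domain(url)}" for brand, url in product(brands, urls)]


# Sample inputs; the brand list here is only an example.
brands = ["calzedonia", "zara"]
urls = ["http://www.thechilicool.com/", "http://babywhatsup.com/"]

queries = build_queries(brands, urls)
for query in queries:
    print(query)
```

Note that this sketch keeps only the host name, so for a blog that lives under a path (such as http://truccaticoneva.altervista.org/blog/) the query would cover the whole host.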

https://datamarket.azure.com/dataset/bing/search

[edit] Sample workflow attached.

Hi, I finally got my workflow working. Without your help it would have been impossible, including your suggestion about the API key.

So thanks a lot for all your help, patience, and knowledge. If you ever come to Rome, you have a friend! ;)

Bye

Marzia

 

Hi!

Is it possible to also extract the date in this workflow? It seems to be more difficult. I have to create a bag of words with each brand and the respective year of the comment, but I don't have this in my workflow.

I also want to create a tag cloud with the most frequently mentioned brand names. How can I do this? Without Strings to Document and a bag of words, I think it isn't possible.

Thanks!

I'm counting on your fantastic help!

Marzia

Hi Marzia,

for extracting dates from text, we have a DateExtractor node within the Palladian nodes. In case a date is mentioned explicitly in the text, this node should be helpful.

I assume you want to create a tag cloud with the brand names? I would recommend heading over to the KNIME Textprocessing subforum. The text processing plugins contain a node for creating tag clouds.

Best,
Philipp

Hi Philipp, no, I think I found a way, but I don't know why it isn't right. I've attached a screenshot so you can understand:

Zara site:thechilicool.com ?
Row0 
Zara site:thechilicool.com   Date ? ?

The workflow is: two Table Creators (one with brands, one with webpages), a Cross Joiner, a Java Snippet, a WebSearcher, and a Column Filter.

What's the problem? For every mention of Zara, I need the date when it appeared.

Thanks

Marzia

Hi Marzia,

now I see. Unfortunately, not every search engine provides date information for the found results. Bing, for example, has no date field in its response. Finding out when a web page appeared is unfortunately anything but trivial. Palladian has a mechanism for this, but it is outdated and currently not available as a node.

You may want to check whether you can extract date information from each page's source. HTML5 provides dedicated date and time elements, so if the pages you are analyzing make use of current web standards, extracting this information should not be too difficult using the DocumentParser and XPath nodes.
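
As a rough illustration of the idea: HTML5 pages can mark up dates with a time element whose datetime attribute holds a machine-readable value. The Python sketch below collects those values with the standard library. The class name, function name, and sample markup are assumptions; with the XPath node, an expression along the lines of //time/@datetime would be the analogous query, assuming the pages actually use this element.

```python
from html.parser import HTMLParser


class TimeElementFinder(HTMLParser):
    """Collects the 'datetime' attribute of every <time> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.datetimes = []

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            value = dict(attrs).get("datetime")
            if value:
                self.datetimes.append(value)


def extract_datetimes(html):
    finder = TimeElementFinder()
    finder.feed(html)
    return finder.datetimes


# Made-up blog post markup for demonstration purposes.
sample = (
    '<article><h1>My new outfit</h1>'
    '<time datetime="2014-03-21">21 March 2014</time>'
    '<p>Today I am wearing ...</p></article>'
)
dates = extract_datetimes(sample)
print(dates)
```

Pages that do not use the time element return an empty list here, in which case the date would have to be located in some site-specific part of the markup instead.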

Hope that helps,
Philipp