Extract content of websites

Hi,

I am trying to extract the content of websites with the Content Extractor node. The workflow is as follows (also attached):

Table creator --> httpRetriever --> htmlParser --> Content Extractor --> Document Data Extractor --> Column Filter --> Document Viewer

The Table Creator contains the three websites:

http://cordis.europa.eu/project/rcn/110738_en.html
http://cordis.europa.eu/project/rcn/191258_en.html
http://cordis.europa.eu/project/rcn/106271_en.html

The three websites are very similar to each other; nevertheless, the content of one of them (http://cordis.europa.eu/project/rcn/106271_en.html) cannot be fully extracted by the Content Extractor (see attached workflow).

Are there any solutions?

Many thanks in advance!
Best
Simon

Hi Simon,

the Content Extractor tries to "guess" which sections of a website are content and which are menus (the latter are not extracted). If the fields you want to extract from the website are always the same, a better solution could be to use the XPath node. Convert the HTTP result of the HTTP Retriever into HTML with the HTML Parser; this node creates XML cells. Then use the XPath node to extract the fields from the XML/HTML.
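Outside of KNIME, the same XPath approach can be sketched in a few lines of Python with lxml. The HTML snippet and the XPath expressions below are only illustrative assumptions; the real CORDIS pages use different markup, so you would inspect the page source and adjust the expressions accordingly:

```python
# Sketch of XPath-based field extraction, assuming a hypothetical page
# layout with the title in <h1> and the objective in a div with
# class "objective". Adjust the XPath to the actual page structure.
from lxml import html

# A tiny stand-in for the HTML the HTTP Retriever would fetch.
page = """
<html><body>
  <div id="menu"><a href="/">Home</a></div>
  <h1>Example project title</h1>
  <div class="objective"><p>Project objective text.</p></div>
</body></html>
"""

tree = html.fromstring(page)

# XPath picks out exactly the fields we want, ignoring the menu div.
title = tree.xpath("//h1/text()")[0]
objective = tree.xpath("//div[@class='objective']/p/text()")[0]

print(title)
print(objective)
```

Because XPath targets the fields explicitly, it is robust against the "guessing" problem: menu sections are simply never matched.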

Cheers, Kilian
