Extract content of websites

Hi,

I am trying to extract the content of websites with the Content Extractor node. The workflow is as follows (also attached):

Table creator --> httpRetriever --> htmlParser --> Content Extractor --> Document Data Extractor --> Column Filter --> Document Viewer

The Table Creator contains the three websites:

http://cordis.europa.eu/project/rcn/110738_en.html
http://cordis.europa.eu/project/rcn/191258_en.html
http://cordis.europa.eu/project/rcn/106271_en.html

The three websites are very similar to each other; nevertheless, the content of one of them (http://cordis.europa.eu/project/rcn/106271_en.html) cannot be fully extracted by the Content Extractor (see attached workflow).

Are there any solutions?

Many thanks in advance!
Best
Simon

Hi Simon,

the Content Extractor tries to "guess" which sections of a website are content and which are menus (the latter are not extracted). If the fields you want to extract from the website are always the same, a better solution could be to use the XPath node. Convert the HTTP result of the HTTP Retriever into HTML with the HTML Parser; this node creates XML cells. Then use the XPath node to extract the fields from the XML/HTML.
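Outside of KNIME, the same XPath approach can be sketched in a few lines of Python with lxml. The HTML snippet and the XPath expressions below are only illustrative assumptions; the real CORDIS pages use different markup, so you would inspect the page source and adjust the expressions accordingly:

```python
# Sketch of XPath-based field extraction, assuming a hypothetical page
# layout with the title in <h1> and the objective in a div with
# class "objective". Adjust the XPath to the actual page structure.
from lxml import html

# A tiny stand-in for the HTML the HTTP Retriever would fetch.
page = """
<html><body>
  <div id="menu"><a href="/">Home</a></div>
  <h1>Example project title</h1>
  <div class="objective"><p>Project objective text.</p></div>
</body></html>
"""

tree = html.fromstring(page)

# XPath picks out exactly the fields we want, ignoring the menu div.
title = tree.xpath("//h1/text()")[0]
objective = tree.xpath("//div[@class='objective']/p/text()")[0]

print(title)
print(objective)
```

Because XPath targets the fields explicitly, it is robust against the "guessing" problem: menu sections are simply never matched.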

Cheers, Kilian
