Content Extractor

karl_0 · September 28, 2015, 9:34pm

Hi!

I am trying to extract the content of web pages using the Palladian Content Extractor, but whatever I try, I only get the headline.

Does anyone have the same problem, or any suggestion as to what I am doing wrong?

Thanks in advance, Karl.

qqilihq · September 29, 2015, 12:24pm

Hi Karl,

The ContentExtractor node appends a TextDocument cell, which behaves somewhat differently from an ordinary StringCell. Use a Document Data Extractor node (available in the Text Processing section) and configure it to append the document's text as a dedicated string column. I'm attaching an example workflow.
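
To illustrate the idea only, here is a rough Python analogy. It is not KNIME or Palladian code: the `TextDocument` class and the `extract_text` helper are hypothetical names, and the assumption that the default rendering shows just the title is inferred from the symptom described above.

```python
from dataclasses import dataclass


@dataclass
class TextDocument:
    """Hypothetical stand-in for a document cell holding a title and a body."""
    title: str
    body: str

    def __str__(self) -> str:
        # Assumption: the default rendering exposes only the title,
        # which is why only the headline appears in the table.
        return self.title


def extract_text(doc: TextDocument) -> str:
    """Hypothetical equivalent of pulling the document's text into a plain string."""
    return doc.body


doc = TextDocument(title="Headline", body="Full article text.")
rendered = str(doc)          # what a plain rendering of the cell would show
extracted = extract_text(doc)  # what a dedicated extraction step would return
```

The full text is still inside the document cell; it just needs an explicit extraction step to become a string column.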

Best,
Philipp

contentextractortostring.zip

karl_0 · September 29, 2015, 1:30pm

Hi Philipp,

Thank you very much for your help!
Now I understand how to use the ContentExtractor - my first test worked properly :-)

Best wishes, Karl.

system · April 21, 2023, 9:40pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.