Create a dataset of extracted text and associated columns without losing columns along the workflow

Hi all,

I'm trying to create a combined dataset comprising the text from web documents and a number of associated columns for those documents (Location, checksum, Title).

My workflow creates the joined table of these columns with the URL column, but when I pass the URL column that contains the text I want to process through the HTML Parser -> ContextExtractor, I end up losing the Location, checksum and Title columns.

Is there a way to keep the association between the text and the columns as I continue through the workflow?

Michael

Hi Michael,

The HTML Parser node is supposed to receive the output of an HTTP Retriever node as its input. My tests show that the HTML Parser node, when used that way, doesn't remove any extra columns from the input table, so I'm not sure what is going on in your workflow. Can you share it so we can have a look?

Cheers,
Marco.

Looking at your workflow screenshot, I think you mean the ContentExtractor, which does not return the original columns and also resets the RowIDs. However, its output is returned in the same order as its input, even if there are missing values, so the simplest way to re-associate it with the input is to use the Column Appender node to join the output of the HTML Parser with the output of the ContentExtractor.
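
To make the idea concrete, here is a minimal sketch of what a positional append does. It is only an analogy in plain Python, not KNIME code: the lists of dicts stand in for the two tables, and the function name `append_columns`, the sample rows and the generated `Row0`, `Row1`, ... keys are all made up for illustration.

```python
# Analogy only: plain Python lists stand in for the two KNIME tables.
# All names and sample values here are hypothetical.

# Output of the HTML Parser branch: still carries the original columns.
parser_rows = [
    {"Location": "/docs/a.html", "checksum": "c1", "Title": "Page A"},
    {"Location": "/docs/b.html", "checksum": "c2", "Title": "Page B"},
    {"Location": "/docs/c.html", "checksum": "c3", "Title": "Page C"},
]

# Output of the extractor branch: same row order, original columns gone,
# and a missing value (None) where no text could be extracted.
extractor_rows = [
    {"Text": "Body of page A"},
    {"Text": None},
    {"Text": "Body of page C"},
]


def append_columns(left, right):
    """Pair rows purely by position and generate fresh row keys."""
    if len(left) != len(right):
        # A positional append is only safe when both tables have the
        # same number of rows in the same order.
        raise ValueError("row counts differ; rows would be misaligned")
    return [
        {"RowID": f"Row{i}", **l, **r}
        for i, (l, r) in enumerate(zip(left, right))
    ]


combined = append_columns(parser_rows, extractor_rows)
for row in combined:
    print(row)
```

The row with the missing text still lines up with its Location, checksum and Title, which is why the append by position works here. It relies entirely on both tables having the same number of rows in the same order, so it is worth checking the row counts before appending.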

David

 

Hi Marco & David,

Thanks for responding,

Marco, I couldn't use the HTTP Retriever node because the HTML files were on disk. On the KNIME forum I found a reference to using the ContentExtractor to turn the XML output into readable text, but I am open to alternative ways of ending up with a text document. (The pages I'm processing are text heavy and don't share a common structure, such as a list of items, from which I could extract distinct columns.)

David, you were correct: the columns from the HTML Parser needed to be appended to the ContentExtractor output. I chose to generate new row keys when configuring the appender node.

If I could ask a further question of the two of you: when I view the text via the Document Viewer node, the Document Info pane doesn't include the file name. Instead it shows a path to NoFileSpecified.txt, and the Title is missing.

Is that to be expected, and could I update the document info, or should I be using a different node to explore the text content of my rows?

 

Hi Michael,

To my knowledge it is not possible to edit the Document Info; in other words, there is no setter counterpart to the Document Data Extractor node. The document info is meant to track the origin of the document, so from a certain point of view it makes sense that it cannot be changed. For documents created from KNIME string tables through the Strings to Document node, the file name is always NoFileSpecified.txt, and the Title is empty if it is not mapped to a specific column.

That said, you can always use the Meta Info Inserter node to attach extra information to your documents in the form of <key, value> pairs, and use the Meta Info Extractor node to read it back.
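
As a rough mental model of the difference between the fixed document info and the meta info pairs, here is a small Python sketch. It is not KNIME's API: the `Document` class, its fields, and the `insert_meta_info` / `extract_meta_info` helpers are hypothetical names chosen to mirror the two nodes.

```python
# Conceptual model only, not KNIME's API. All names are hypothetical.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Document:
    """A document whose origin fields cannot be changed after creation."""
    text: str
    title: str = ""                          # empty unless mapped to a column
    file_name: str = "NoFileSpecified.txt"   # default for string-table documents
    meta_info: tuple = ()                    # (key, value) pairs


def insert_meta_info(doc, key, value):
    """Return a copy of the document with one more (key, value) pair."""
    return replace(doc, meta_info=doc.meta_info + ((key, value),))


def extract_meta_info(doc):
    """Read the stored pairs back as a dictionary."""
    return dict(doc.meta_info)


doc = Document(text="Body of page A")
doc = insert_meta_info(doc, "Location", "/docs/a.html")
doc = insert_meta_info(doc, "checksum", "c1")
print(extract_meta_info(doc))
```

In this model the file name and title stay as they were at creation time, while Location and checksum travel with the document as extra pairs that can be read back later.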

Hope this helps.

Cheers,
Marco.