Document Data Extractor does not extract abstract from scientific papers

clonedapple · July 13, 2021, 5:42pm

I am trying to extract abstracts from scientific papers using the Document Data Extractor node. However, the node does not return the abstract. Initially I thought this was because the node could not identify the abstract section in the papers, so I tried some papers where the abstract is clearly labeled. It still failed. I would like to know how the node works and why it is failing. Is there another node I could use for abstract extraction?
Thank you!

ScottF · July 13, 2021, 8:24pm

Hi @clonedapple -

The Document Data Extractor node is intended for a very specific purpose - to pull out metadata that has been previously stored inside a document. This generally includes fields like author, dates, categories, sources, and so on, in addition to the text itself.

This node is usually introduced after a document has already been created in the workflow, most often with the Strings To Document node, and has had additional metadata added to it with the Document Data Assigner.
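To make the idea concrete, here is a toy Python sketch of that relationship. It is only a conceptual model, not KNIME's actual API: the `Document` class and the `extract_document_data` function are hypothetical names made up for illustration. The point it tries to show is that an extractor of this kind only hands back fields that were stored on the document earlier, and never derives new ones (such as an abstract) from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a document: body text plus stored metadata."""
    text: str
    title: str = ""
    authors: List[str] = field(default_factory=list)
    category: str = ""
    source: str = ""


def extract_document_data(doc: Document, fields: List[str]) -> Dict[str, object]:
    """Return the requested stored fields; nothing is inferred from doc.text."""
    return {name: getattr(doc, name) for name in fields}


# The metadata was assigned when the document was built, so it can be read back.
doc = Document(
    text="Abstract: We study ... 1 Introduction ...",
    title="Example paper",
    authors=["A. Author"],
)
row = extract_document_data(doc, ["title", "authors", "category"])
print(row)
```

Note that there is no "abstract" field to ask for here: the abstract is just part of `text`, which is why a metadata extractor cannot single it out.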

Given that, the Document Data Extractor probably doesn’t have anything to work with in your case, and a different approach is needed to ingest the abstracts. Can you post an example of the type of text you are trying to import, along with the workflow you’re using? Then maybe someone can give you some more specific pointers.
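In the meantime, here is a rough sketch of one possible approach in plain Python, assuming the paper's text is already available as a string. It is a heuristic and not a feature of any KNIME node: it assumes the abstract starts at a line beginning with "Abstract" and ends at the next "Introduction", "Keywords", or "Index Terms" heading. The names `ABSTRACT_RE` and `extract_abstract` are made up for this example, and real papers will often need a different pattern.

```python
import re
from typing import Optional

# Heuristic: an "Abstract" heading at the start of a line, then everything up
# to the next Introduction / Keywords / Index Terms heading (or end of text).
ABSTRACT_RE = re.compile(
    r"^\s*abstract\b[\s:.\-]*"
    r"(?P<body>.*?)"
    r"(?=^\s*(?:\d+\.?\s*)?(?:introduction|keywords?|index terms)\b|\Z)",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def extract_abstract(text: str) -> Optional[str]:
    """Return the abstract as a single line of text, or None if no heading is found."""
    match = ABSTRACT_RE.search(text)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces left over from the PDF layout.
    body = " ".join(match.group("body").split())
    return body or None


sample = (
    "A Great Paper\nJane Doe\n\n"
    "Abstract\nWe study things.\nResults are good.\n\n"
    "1. Introduction\nText here."
)
print(extract_abstract(sample))
```

If the heading is missing or the text was extracted in a different reading order, the function returns `None` or the wrong span, so it is worth checking the results on a few papers by hand.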
