Term co-occurrence count on paragraphs

mpenalver · September 23, 2020, 1:35am

Is my assumption correct that the count of co-occurrences on a paragraph provided by the Term Co-occurrence Counter node is only meaningful on documents for which the original format provides paragraph markers (e.g. Word but not PDF)?

ScottF · September 23, 2020, 6:00pm

This is a good question! I’m not sure, so I’ve asked one of the developers of the TextProcessing extension to chime in.

mpenalver · September 23, 2020, 7:51pm

Thank you, Scott. In my test with a PDF, the co-occurrence counts come out identical at the document, section and paragraph levels (so no sections or paragraphs seem to be identified), and they differ, as expected, only at the sentence level.

It would be helpful to know how this node identifies neighbors (immediately consecutive words, I suppose), sentences, paragraphs and sections.

Daniel_Weikert · September 24, 2020, 5:26pm

Can you share your workflow?

julian.bunzel · September 25, 2020, 9:08am

Hey @mpenalver,

This is indeed correct. Although we have implemented sections and paragraphs for the document structure, they are only used in a few parser nodes (e.g. Word Parser). Text read by the Tika Parser / PDF Parser is simply put into one section/paragraph.
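To illustrate the effect, here is a minimal Python sketch. It is not the node's actual implementation: the function names, the naive sentence splitter and the "count the units that contain both terms" rule are all assumptions made for the example. It only shows why the paragraph-level count collapses to the document-level count once a parser puts all the text into a single paragraph, while the sentence-level count is unaffected.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurrence_count(units, term_a, term_b):
    # Number of text units containing both terms (whole word, case-insensitive).
    def has(term, unit):
        pattern = r"\b" + re.escape(term) + r"\b"
        return re.search(pattern, unit, re.IGNORECASE) is not None

    return sum(1 for u in units if has(term_a, u) and has(term_b, u))


def counts_by_level(paragraphs, term_a, term_b):
    # The document is the concatenation of its paragraphs;
    # sentences are split within each paragraph.
    document = [" ".join(paragraphs)]
    sentences = [s for p in paragraphs for s in split_sentences(p)]
    return {
        "document": cooccurrence_count(document, term_a, term_b),
        "paragraph": cooccurrence_count(paragraphs, term_a, term_b),
        "sentence": cooccurrence_count(sentences, term_a, term_b),
    }


text_paragraphs = [
    "Cats chase mice. Dogs chase cats.",
    "Mice fear cats. Cats and mice rarely get along.",
]

# Parser that keeps paragraph boundaries (Word-like behaviour).
with_paragraphs = counts_by_level(text_paragraphs, "cats", "mice")

# Parser that flattens everything into one paragraph (PDF/Tika-like behaviour).
flattened = counts_by_level([" ".join(text_paragraphs)], "cats", "mice")

print(with_paragraphs)
print(flattened)
```

With paragraph boundaries preserved, the paragraph count differs from the document count. With the flattened input the two are necessarily equal, which matches what was observed with the PDF.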

I will note it down and check if Tika offers an option to detect paragraphs, so that we can parse it properly.

Cheers,
Julian

mpenalver · September 25, 2020, 9:10am

That’s great @julian.bunzel. Thank you very much.
