Term co-occurrence count on paragraphs

Is my assumption correct that the paragraph-level co-occurrence count provided by the Term Co-occurrence Counter node is only meaningful for documents whose original format provides paragraph markers (e.g. Word but not PDF)?

This is a good question! I’m not sure, so I’ve asked one of the developers of the TextProcessing extension to chime in.

Thank you, Scott. In my test with a PDF, the co-occurrence counts are identical at the document, section and paragraph levels (so no sections or paragraphs seem to be identified), which is not what I expected. Only the sentence-level counts differ, as they should.

It would be helpful to know how this node identifies neighbors (immediately consecutive words, I suppose), sentences, paragraphs and sections.

Can you share your workflow?

Hey @mpenalver,

this is indeed correct. Although we have implemented sections and paragraphs for the document structure, they are only used in a few parser nodes (e.g. Word Parser). Text read by the Tika Parser / PDF Parser is simply put into a single section/paragraph.

I will note it down and check if Tika offers an option to detect paragraphs, so that we can parse it properly.
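
To illustrate why the counts collapse, here is a minimal Python sketch. It is not the TextProcessing extension's actual code: the nested-list document model, the `units` and `cooccurrence_count` helpers, and the counting rule (one co-occurrence per unit that contains both terms) are all assumptions made for illustration. It compares a document with real section/paragraph structure against the same sentences placed into a single section and paragraph, which is roughly what happens to text read by the Tika Parser / PDF Parser.

```python
from typing import List, Set

# Hypothetical document model (an assumption, not the KNIME data structure):
# document -> sections -> paragraphs -> sentences -> terms.
Sentence = List[str]
Paragraph = List[Sentence]
Section = List[Paragraph]
Document = List[Section]

LEVELS = ["document", "section", "paragraph", "sentence"]


def units(doc: Document, level: str) -> List[Set[str]]:
    """Return the set of terms found in each unit at the given level."""
    if level == "document":
        return [{t for sec in doc for par in sec for sen in par for t in sen}]
    if level == "section":
        return [{t for par in sec for sen in par for t in sen} for sec in doc]
    if level == "paragraph":
        return [{t for sen in par for t in sen} for sec in doc for par in sec]
    if level == "sentence":
        return [set(sen) for sec in doc for par in sec for sen in par]
    raise ValueError(f"unknown level: {level}")


def cooccurrence_count(doc: Document, a: str, b: str, level: str) -> int:
    """Assumed counting rule: number of units that contain both terms."""
    return sum(1 for u in units(doc, level) if a in u and b in u)


# A document whose source format carries sections and paragraphs.
structured: Document = [
    [  # section 1
        [["tika", "parses"], ["pdf", "files"]],                       # paragraph 1
        [["tika", "reads", "pdf"], ["tika", "extracts", "pdf", "text"]],  # paragraph 2
    ],
    [  # section 2
        [["tika", "is", "a", "library"], ["pdf", "is", "a", "format"]],  # paragraph 3
    ],
]

# The same sentences, all placed into one section containing one paragraph.
flat: Document = [[[sen for sec in structured for par in sec for sen in par]]]

structured_counts = [cooccurrence_count(structured, "tika", "pdf", lv) for lv in LEVELS]
flat_counts = [cooccurrence_count(flat, "tika", "pdf", lv) for lv in LEVELS]

print(dict(zip(LEVELS, structured_counts)))
# {'document': 1, 'section': 2, 'paragraph': 3, 'sentence': 2}
print(dict(zip(LEVELS, flat_counts)))
# {'document': 1, 'section': 1, 'paragraph': 1, 'sentence': 2}
```

With a single section and paragraph, the document, section and paragraph levels all cover the same text, so their counts must be equal under any counting rule. Only the sentence level still varies, because sentence detection works on the text itself rather than on structure carried over from the source format.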



That’s great @julian.bunzel. Thank you very much.

