I have a specific question about how to access the document structure in a Word document. I'd like to mine several hundred reports in *.doc format; I can already read them, enrich them and continue with further analysis. The documents all have nearly the same structure (3-4 versions with minor modifications that I will address individually), e.g. repeating headlines or tables after certain chapters.
It is very important to know which text comes from which part of the document, i.e. from which chapter. In addition, there is a fairly large amount of information I would like to skip during analysis. To my mind, a good solution would be to specifically extract the “interesting document parts” and then continue with the analysis on those parts. I would like to split such a document into e.g. “Introduction”, “Methods” and “Results” and then continue the analysis. Later on I’d like to be able to filter out specific text parts for further analysis, e.g. compare term frequencies in the “Introduction” section with term frequencies in the “Methods” section across all documents.
In my current strategy I extract sentences and then assign rule-based start/stop marks in an extra column, depending on the headlines that repeat within one document version. In the example you can see the column "CutMarks" with the values "StartCut_Decr" and "EndCut_Descr"; I'd like to combine the cells between each such pair of marks.
Because the documents are so consistent, this works quite well; however, I do not know how to go on from here. I guess ideally I could generate one new “daughter document” specifically from the lines identified by my approach.
How could I do this?
I’d also appreciate any alternative ideas for addressing this problem :)
I don't have the exact answer to your problem, but I think you can address this by regex-filtering the text between two known headlines. Unfortunately, I am not an expert on regex strings, but others might be able to help you there. You would then end up with the individual parts of the document neatly arranged in columns, from which you can do further processing.
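To illustrate the idea of filtering between two known headlines, here is a minimal sketch in plain Java. The class name `SectionExtractor`, the method `between` and the sample headlines are made up for illustration, and the sketch assumes the whole document is available as one string rather than as one sentence per row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionExtractor {

    /**
     * Returns every block of text found between startHeadline and endHeadline.
     * Both headline strings are hypothetical placeholders for the real headings.
     */
    public static List<String> between(String text, String startHeadline, String endHeadline) {
        // Pattern.quote treats the headlines as literal text, so only the
        // headlines must be known; the text in between can be arbitrary.
        // DOTALL lets "." match line breaks, and the reluctant ".*?" stops
        // at the first end headline after each start headline.
        Pattern p = Pattern.compile(
                Pattern.quote(startHeadline) + "(.*?)" + Pattern.quote(endHeadline),
                Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> sections = new ArrayList<>();
        while (m.find()) {
            sections.add(m.group(1).trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        String report = "Introduction\nSome background.\nMethods\nWe measured things.\nResults\n";
        System.out.println(between(report, "Introduction", "Methods"));
    }
}
```

Only the two headlines need to match a pattern here, so the section body does not have to follow any rule. This only works if each headline string does not also occur inside the running text.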
An alternative would be to use a Java Snippet node and append a column with a number. For each "EndCut-Desc" the number is incremented. You could then group by the number column and aggregate all strings, words or terms that have the same number assigned. The Group By node can aggregate strings by concatenating them with a given separator.
Attached you will find a workflow that uses the Java Snippet node to count specific marker strings.
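The counting and grouping idea can be sketched in plain Java as follows. This is not the code from the attached workflow: the class `BlockCounter` and its methods are hypothetical, the marker string is passed in as a parameter, and the sketch assumes that the row carrying the end mark still belongs to the block it closes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BlockCounter {

    /** Assigns one block number per row; the number increases after each end-mark row. */
    public static List<Integer> numberRows(List<String> cutMarks, String endMark) {
        List<Integer> numbers = new ArrayList<>();
        int counter = 0;
        for (String mark : cutMarks) {
            numbers.add(counter);
            // endMark.equals(mark) is also safe for rows without a mark (null)
            if (endMark.equals(mark)) {
                counter++; // the next row starts a new block
            }
        }
        return numbers;
    }

    /** Mimics grouping by the number column and concatenating the strings. */
    public static Map<Integer, String> concatByNumber(
            List<String> sentences, List<Integer> numbers, String separator) {
        Map<Integer, String> grouped = new LinkedHashMap<>();
        for (int i = 0; i < sentences.size(); i++) {
            grouped.merge(numbers.get(i), sentences.get(i), (a, b) -> a + separator + b);
        }
        return grouped;
    }
}
```

In a workflow, the first method corresponds roughly to what the snippet would compute row by row with a counter field, and the second to the aggregation step in the Group By node.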
Hi Jerry, thanks a lot for your answer. The CutMarks in the table are already set by RegEx filtering; my difficulty is how to address the lines in between the CutMarks, e.g. by a RegEx. I'd like the text in between to be completely arbitrary, which is why it might become difficult to stick to RegEx, as far as I understand it so far...
Thank you for your quick and custom-made code reply!
I'll definitely use the Group By node to aggregate the text as you proposed; this sounds like a good and simple approach.
Concerning the code you provided: unfortunately my Java is not good enough to adapt it fully to my needs. Currently the code changes the number every time it finds a mark. However, this means every line ends up with a number, which is not exactly what I was looking for. I'd like to discard all lines that lie outside the Start-End mark borders (i.e. in End-Start blocks, before the first Start mark, or after the last End mark).
In addition, since I'm using different "subclasses" that occur repeatedly in every document (let's call them parts A, B, C), it would be ideal to assign these identical "subclasses" to the document parts.
Below you can find your workflow, which I modified to explain a little better what I am looking for (see "desired outcome" in the input table).
How could I adapt your snippet to obtain the desired output?
Attached you will find the customized Java Snippet node that produces your desired output.
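The attached node itself is not reproduced in this thread. As a rough sketch of logic that would give this kind of output, the following plain Java labels only the rows between a start mark and the matching end mark. It rests on assumptions: the marks are taken to have the form "StartCut_X" and "EndCut_X", the suffix X is used as the part label, and the class `BlockLabeler` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockLabeler {

    // Assumed marker prefixes; the text after the prefix is used as the part label.
    static final String START = "StartCut_";
    static final String END = "EndCut_";

    /**
     * Returns one label per row: the part label for rows inside a start/end
     * block (mark rows included), and null for rows outside any block.
     */
    public static List<String> labelRows(List<String> cutMarks) {
        List<String> labels = new ArrayList<>();
        String current = null; // null means "outside any block"
        for (String mark : cutMarks) {
            if (mark != null && mark.startsWith(START)) {
                current = mark.substring(START.length()); // block opens on this row
            }
            labels.add(current);
            if (mark != null && mark.startsWith(END)) {
                current = null; // block closes after this row
            }
        }
        return labels;
    }
}
```

Rows with a null label are the ones before the first start mark, between an end and the next start, or after the last end mark. They could then be dropped, for example by filtering out rows with a missing label, before grouping by the label.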
Works like a charm…
Thanks Kilian for the excellent support!