Differentiate/Extract Bold & normal text from input files

Hi everyone,
I’m doing a text analysis project and would like to ask if you have any suggestions on how to differentiate bold from normal text. The details are below:

  • Input files: I have multiple Word documents (transcripts of interviews). The questions are formatted in bold text, and the answers are in normal text.
  • Output: I want an Excel output with each question in one column and the corresponding answer in another column.

I use the Word Parser and Sentence Extractor as the first two nodes, and I want to write a rule to split bold from normal text. But I’m not sure if this is possible, because when I use the Word Parser node, I think the formatting is not taken into account.
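If the formatting really is lost at the parsing step, one option is to read the bold flag directly from the .docx file in a standalone Python script. The following is only a minimal sketch under stated assumptions: it uses the Python standard library, it only detects bold applied directly to a run (a `<w:b>` element), so bold that comes from a paragraph or character style would be missed, and it writes a CSV file that Excel can open rather than a native Excel workbook. The function names `read_runs`, `pair_questions_answers` and `write_csv` are made up for this illustration.

```python
import csv
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def is_bold(run):
    """True if the run has a direct <w:b> flag that is not switched off."""
    rpr = run.find(W + "rPr")
    if rpr is None:
        return False
    b = rpr.find(W + "b")
    if b is None:
        return False
    return b.get(W + "val", "true") not in ("0", "false", "off")


def read_runs(docx_file):
    """Yield (text, bold) per run; (" ", None) marks a paragraph break."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    for para in root.iter(W + "p"):
        for run in para.iter(W + "r"):
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            if text:
                yield text, is_bold(run)
        yield " ", None


def clean(parts):
    """Join text fragments and collapse repeated whitespace."""
    return " ".join("".join(parts).split())


def pair_questions_answers(runs):
    """Group bold text as questions and the normal text that follows as answers."""
    pairs, question, answer = [], [], []
    for text, bold in runs:
        if bold is None:
            # Paragraph break: attach it to whichever part is being built.
            (answer if answer else question).append(text)
            continue
        if bold and answer:
            # Bold text after an answer means a new question starts here.
            pairs.append((clean(question), clean(answer)))
            question, answer = [], []
        (question if bold else answer).append(text)
    if clean(question) or clean(answer):
        pairs.append((clean(question), clean(answer)))
    return pairs


def write_csv(pairs, csv_path):
    """Write one question/answer pair per row; utf-8-sig helps Excel read it."""
    with open(csv_path, "w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Question", "Answer"])
        writer.writerows(pairs)
```

A call would look like `write_csv(pair_questions_answers(read_runs("interview.docx")), "interview.csv")`, where both file names are placeholders.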

Hello @JinnyLe,

Do you maybe have a question mark, double colon, or any other character that you can use in the Cell Splitter node to separate the question from the answer?

Br,
Ivan

Hi Ivan,
The template that I have is this:

MOD - abcefg. Tmshzmshr? -> Every question (it can be a sentence or a paragraph) starts with MOD, is in bold text, and ends with a question mark. So based on your suggestion, I think MOD can be used? But if I use the Sentence Extractor, I’m not sure how to use the Cell Splitter.

gjskkksghj. pmmhggn. -> The answer is in normal text. It can be a sentence or a paragraph.

Thank you!
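The template described above can also be split on plain text without any formatting information. The following is a rough sketch in Python, assuming that every question starts with MOD at the beginning of a line and that the first question mark in a block ends the question. A question that contains more than one question mark would be cut early, and the function name `split_transcript` is made up for this illustration.

```python
import re

# One block = from a line starting with "MOD" up to the next such line (or the end).
BLOCK = re.compile(r"^MOD\b.*?(?=^MOD\b|\Z)", re.MULTILINE | re.DOTALL)


def split_transcript(text):
    """Return (question, answer) pairs from a MOD-prefixed transcript."""
    pairs = []
    for match in BLOCK.finditer(text):
        # Everything up to the first "?" is the question, the rest is the answer.
        question, mark, answer = match.group(0).partition("?")
        # Drop the "MOD -" prefix and tidy up whitespace.
        question = re.sub(r"^MOD\s*-?\s*", "", question + mark)
        pairs.append((" ".join(question.split()), " ".join(answer.split())))
    return pairs
```

Calling `split_transcript` on the full text of one transcript gives a list of pairs that can then be written out one row per question.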

Hello @JinnyLe,

I’m not exactly sure what your output from the Sentence Extractor node looks like, but you could try the question mark as the delimiter in the Cell Splitter node?

Br,
Ivan