Differentiate/Extract Bold & normal text from input files

Hi everyone,
I’m doing a text analysis project and i would like to ask if you have suggestion on how to differentiate Bold vs Normal text. The details is below:

  • Input files: i have multiple Word documents (transcript from interviews). The questions are formatted in BOLD text, and the answers are in normal text.
  • Output: i would want to have an excel output with Question in one column and corresponding answer for that question in 1 column.

I use the words parser and sentence extractor as first 2 nodes, and want write a rule to split Bold vs. normal text. But i’m not sure if this possible, as when i use the word parser note, i think the formatting is not taken into account.

image

Hello @JinnyLe,

do you maybe have a question mark, double colon or any other character which you can use in Cell Splitter node to separate question from answer?

Br,
Ivan

Hi Ivan,
The template that i have is that:

MOD - abcefg. Tmshzmshr? -> All the question (can be a sentence or a paragraph) will start with MOD and in BOLD text, with question mark at the end. So based on your suggestion, i think the MOD can be used? But if i use the sentence extractor, i’m not sure how to use the cell splitter?

gjskkksghj. pmmhggn. -> The anser is in normal text. It can be a sentence or paragraph.

Thank you!

Hello @JinnyLe,

I’m not exactly sure how does your output from Sentence Extractor node looks like but you could try question mark sign as delimiter in Cell Splitter?

Br,
Ivan

Hi Ivan,
I tried it too but still doesn’t work… The reason is My question is usually a paragraph with many sentences, and there is only 1 question mark. it ends up only the sentence with question mark is classified as Question, the rest doesn’t.

Before i can use Cell splitter, i need to Convert the word document into different sentences stored in a column by using Sentence Extractor. If there is a Paragraph extractor, your suggestion might work i guess. Do you have any other idea to work around this?

Hi @JinnyLe,

hmm… not sure what to try next. Maybe some RegEx cause text processing is not designed to split on format of text but rather on content. It would be best if you can share Word example (dummy data is ok as long as it represents format of the real one) and maybe someone comes up with an idea.

Br,
Ivan