Hi everyone,
I’m doing a text analysis project and i would like to ask if you have suggestion on how to differentiate Bold vs Normal text. The details is below:
Input files: i have multiple Word documents (transcript from interviews). The questions are formatted in BOLD text, and the answers are in normal text.
Output: i would want to have an excel output with Question in one column and corresponding answer for that question in 1 column.
I use the words parser and sentence extractor as first 2 nodes, and want write a rule to split Bold vs. normal text. But i’m not sure if this possible, as when i use the word parser note, i think the formatting is not taken into account.
MOD - abcefg. Tmshzmshr? -> All the question (can be a sentence or a paragraph) will start with MOD and in BOLD text, with question mark at the end. So based on your suggestion, i think the MOD can be used? But if i use the sentence extractor, i’m not sure how to use the cell splitter?
gjskkksghj. pmmhggn. -> The anser is in normal text. It can be a sentence or paragraph.
Hi Ivan,
I tried it too but still doesn’t work… The reason is My question is usually a paragraph with many sentences, and there is only 1 question mark. it ends up only the sentence with question mark is classified as Question, the rest doesn’t.
Before i can use Cell splitter, i need to Convert the word document into different sentences stored in a column by using Sentence Extractor. If there is a Paragraph extractor, your suggestion might work i guess. Do you have any other idea to work around this?
hmm… not sure what to try next. Maybe some RegEx cause text processing is not designed to split on format of text but rather on content. It would be best if you can share Word example (dummy data is ok as long as it represents format of the real one) and maybe someone comes up with an idea.