Replace strings within a text field

I am looking to search within an RTF-formatted text note for specific text snippets (specifically “immunocompromised” or similar). I have previously tried an RTF-to-text conversion with little success.

Here is the text note (RTF):

{\rtf1\ansi\ansicpg1252\uc1\deff0{\fonttbl
{\f0\fnil\fcharset0\fprq2 Arial;}
{\f1\fswiss\fcharset0\fprq2 Arial;}
{\f2\froman\fcharset2\fprq2 Symbol;}}
{\colortbl;\red0\green0\blue0;\red255\green255\blue255;}
{\stylesheet{\s0\itap0\nowidctlpar\f0\fs24 [Normal];}{*\cs10\additive Default Paragraph Font;}}
{*\generator TX_RTF32 15.1.531.502;}
\deftab1134\margl0\margt0\margr0\margb0\widowctrl\formshade\sectd
\headery720\footery720\pgwsxn12240\pghsxn15840\marglsxn1440\margtsxn1440\margrsxn1440\margbsxn1440\pard\itap0\nowidctlpar\plain\f1\fs18 CNC - Multiple calls from K last week regarding grocery shopping.\par\par Reports that she has no friends or family able to shop for her, and that she is unable to shop\par due to being immunocompromised.\par\par Advised that community based care services are stretched to the limit.\par Suggested online shopping.\par K has looked into this, but upset that some items are out of stock.\par\par Advised that this is the issue for the whole population…\par Advised applying for ‘Priority Online Shopping’.\par This is for vulnerable people. W Inc are attempting to provide these \par shoppers with their groceries.\par }

I want to remove all of the non-bold characters to leave the text entry as:

CNC - Multiple calls from K last week regarding grocery shopping. Reports that she has no friends or family able to shop for her, and that she is unable to shop due to being immunocompromised. Advised that community based care services are stretched to the limit. Suggested online shopping. K has looked into this, but upset that some items are out of stock. Advised that this is the issue for the whole population… Advised applying for ‘Priority Online Shopping’. This is for vulnerable people. W Inc are attempting to provide these shoppers with their groceries.

I have tried String Replace (Dictionary) but that only replaces whole entries.
I have tried Column Expressions, but either it doesn’t work or I have the wrong code (see picture).

So I am left with using String Replacer for each row and variant, which is cumbersome and very lengthy (I have 323,350 notes to search just for 01.01.2020 until yesterday!)

Can anyone help me with a more efficient way to remove a range of strings from within a text note?

As always, thanks in advance.

A

Hello @AAM,

Wow, this seems like a proper text-cleaning task you’ve got there. Any progress?

So all this text (one note) is in one cell? And from each cell you are supposed to extract anything related to “immunocompromised” or a similar term? But how do you identify where to start and where to end? Is CNC a trigger? Can there be multiple parts from a single note that you wish to extract?

It would be best if you could share a couple of notes that represent your data set and the possible scenarios, so someone might construct a good enough regex for it. Or maybe someone will have another idea :wink:
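To illustrate the kind of regex meant here, below is a minimal Python sketch that flags notes mentioning “immunocompromised” or a similar term. It assumes the notes are already plain text. The pattern, the list of variants, and the names IMMUNO_PATTERN and find_immuno_terms are hypothetical and would need adjusting to the real data.

```python
import re

# Hypothetical pattern: the list of variants is an assumption,
# extend or trim it to match the wording used in your notes.
IMMUNO_PATTERN = re.compile(
    r"\bimmuno[\s-]?(?:compromis|suppress|deficien)\w*",
    re.IGNORECASE,
)


def find_immuno_terms(note: str) -> list:
    """Return every matching term found in a plain-text note."""
    return IMMUNO_PATTERN.findall(note)


# Made-up example notes, used only for illustration.
notes = [
    "Unable to shop due to being immunocompromised.",
    "Suggested online shopping.",
    "Patient is on immuno-suppressive therapy.",
]

# Keep only the notes that contain at least one matching term.
flagged = [note for note in notes if find_immuno_terms(note)]
print(flagged)
```

Compiling the pattern once and reusing it keeps the per-note cost low, which may matter with several hundred thousand notes.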

Br,
Ivan

@AAM

I’ve done some searching and found some hints on converting from RTF to plain text. There are Python modules and Java packages that can assist with this part, even if not directly with the remainder of your requirement.

The attached workflow contains an example of both a Java and a Python node, and the output from converting your original sample note.

The Python script takes a column “rtf note” and appends a new column “plain text”. It requires Python 3 and the additional module “striprtf”, which can be installed with pip by running pip install striprtf at the command line. If you use conda there will presumably be an equivalent installation command, but I don’t have conda so I don’t know what it is…

The Java snippet, on the other hand, should work straight out of the box as it uses native Java packages.

There is a small difference in the output between the two, but hopefully either is an improvement over anything you would be doing manually:

[Screenshots: output of the Python Script node and the Java Snippet node]

rtf to plain text.knwf (11.0 KB)

Nice one @takbb !! Well done!

IT WORKS!

Many thanks. You the man!

A
