Regex split?

iiiaaa · February 24, 2016, 10:41am

Hi everybody,

I have hundred of thousand of rows with free text. I would like to extract all words 'the' and the next and previous words. I don't kwow how many time the word the is present in the text and I would not like to extract words that contain 'the' (e.g. 'There')

E.g.

1) I like the Afro school. I want to buy the book related to the CC. There is something interesting inside

2) The pen is on the table. THE, is the best film.

3) I like the sun today

I would like to get:

I like the Afro school. I want to buy the book related to the CC. There is something interesting inside	like the Afro	buy the book	to the CC
The pen is on the table. THE, is the best film.	The pen	. THE,	is the best
I like the sun today	like the sun

Please could you help me?

Thanks in advance

Geo · February 24, 2016, 10:20pm

How about String Manipulation with regexReplace ? The idea would be to search for ( )+(the|The|THE)+( )+, i.e. "the" preceded and followed by a space. Replace the space this by "_the_". Now you could use Cell Splitter with space as delimiter. Unpivot the newly created columns by using wildcard for selecting the variables to unpivot, after that filter all the rows that do not contain "the".

iiiaaa · February 28, 2016, 3:57pm

Many Thanks!