Extract words preceding a search term

Hi

I want to extract a search term together with its two preceding words, e.g. "mouse" as a search term should extract "mini mickey mouse" if found in a PubMed abstract. A typical corpus consists of 2,000 to 20,000 PubMed abstracts.

 

Any ideas how to solve the problem?

 

Frank

Hi Frank,

In Textprocessing 2.8, which comes with KNIME 2.8, there will be a Wildcard Tagger node that provides wildcard and regex tagging. This node makes it easy to find such word constellations.

In the meantime you need a workaround to find such terms. Starting with your data table containing the documents (with the word mouse), first extract the sentences of the documents as strings using the Sentence Extractor node. Next, use the "Regex Split" node, which searches for and extracts substrings that match a specified regex. Use ".*(\s+[a-z]+\s+[a-z]+\s+mouse).*" as the pattern in the node dialog. The output table then contains an additional column with the substrings that match this regex. An example workflow is attached.
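If you want to try the idea outside KNIME first, here is a minimal Python sketch of the same approach. It is not part of the attached workflow; the function name `preceding_words` and its parameters are made up for illustration. Two things differ from the pattern above: it ignores case (the `[a-z]` classes only match lowercase words), and it returns every occurrence in a sentence, whereas the greedy leading `.*` in the node pattern keeps only the last one. Note that `\w` does not match hyphens, so a hyphenated word counts as two words here.

```python
import re


def preceding_words(sentence, term="mouse", n=2):
    """Return every occurrence of `term` together with its n preceding words.

    Hypothetical helper for illustration only; not a KNIME API.
    """
    # (?:\w+\s+){n} matches exactly n words, each followed by whitespace;
    # re.escape keeps special characters in the search term literal.
    pattern = re.compile(
        r"\b((?:\w+\s+){%d}%s)\b" % (n, re.escape(term)),
        re.IGNORECASE,
    )
    return pattern.findall(sentence)


if __name__ == "__main__":
    print(preceding_words("We studied the mini mickey mouse in detail."))
```

Applied sentence by sentence (as the Sentence Extractor step does), this avoids picking up words from the previous sentence.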

Cheers, Kilian

Thank you for the workaround. The Sentence Extractor node is the key to using the functionality that already exists for strings.

 

Frank

Hello

 

I'm trying to use the regex as well, but I'm not that familiar with these patterns.

What I'm trying to extract is the following string:

xxxx-xxxx  (x can be a number 1-9 or a character a-z)

What should the matching regex pattern look like?

 

Many thanks in advance

 

greets

 

Hello again,

 

So I tried this:

([A-Za-z0-9]{4})([\-])([A-Za-z0-9]{4})$

and I get split_1 = "-" and split_2 = the second part of my desired string.

Furthermore, I have the problem that if the string looks like this

xxxx-xxxx(  (or has another character or a comma instead of the parenthesis), the whole part is ignored.

How can I simply extract xxxx-xxxx and discard everything that comes after it?

 

greets

 

And
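
Regarding the xxxx-xxxx question above, here is a rough sketch of one possible approach, written in Python rather than as a Regex Split configuration. The trailing `$` in the pattern requires the string to end directly after the code, which is why `xxxx-xxxx(` is not matched at all, and each pair of parentheses produces its own split column, which is why the dash and the second half come out separately. The sketch drops the `$` and matches the whole code at once; the name `extract_code` is made up for illustration.

```python
import re

# Four letters/digits, a dash, four letters/digits. The lookarounds make sure
# the code is not part of a longer alphanumeric run (e.g. "abcde-fghij").
CODE = re.compile(r"(?<![A-Za-z0-9])[A-Za-z0-9]{4}-[A-Za-z0-9]{4}(?![A-Za-z0-9])")


def extract_code(text):
    """Return the first xxxx-xxxx code in text, or None if there is none."""
    match = CODE.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Anything after the code, such as "(" or ",", is simply ignored.
    print(extract_code("ab12-cd34("))
    print(extract_code("ref ab12-cd34, more text"))
```

In the Regex Split node itself, something like `.*?([A-Za-z0-9]{4}-[A-Za-z0-9]{4}).*` with a single group might give the same result in one column, assuming the node matches the pattern against the whole string; I have not tested that in the node.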