Mining text using non-linear and compound expressions for proximity searches

Cadu · April 19, 2014, 6:22pm

Hi there,

I am mining text for a scientific literature review. I have arrived at a regex, but I would like to know whether regex can do a bit more. Any advice is much appreciated.

I need to find definitions/concepts of terms/themes. The following expression provides a linear, partial solution:

word1(?:\s+\w+){0,N}\s+word2
definition(?:\s+\w+){0,2}\s+institution

a) However, I am researching two languages (English/Portuguese) at the same time. An expression I tried, which did not work, was:

concept|conceito|definition|definição(?:\s+\w+){0,2}\s+institution|instituição

b) Besides, it would be ideal if there were a non-linear expression, I mean one where the order doesn't matter. Something like: using just one expression with a proximity of 2 words, the output would be "concept of institution", "institution as concept", and so on.

Maybe I am asking for the impossible, but any workaround or approximate way to handle these searches would be enough.

Many thanks in advance,

Cadu

kilian.thiel · April 23, 2014, 1:37pm

Hi,

About b): You could keep using the Wildcard Tagger as you do now, but break the expression up into pieces in order to get the two words to the left of, e.g., institution and the two words to the right of it, where one of the words has to be the word concept.

The regexes would look like this:

([a-zA-Z]+\s+)concept\s+institution
concept\s+([a-zA-Z]+\s+)institution
institution\s+concept(\s+[a-zA-Z]+)
institution(\s+[a-zA-Z]+)\s+concept

This is a bit unwieldy, I have to admit, but it would work. Attached you will find an example workflow.
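For reference, the order-independent search from b) can also be folded into a single expression by putting both orders into one alternation. Below is a minimal sketch in Python rather than in the Wildcard Tagger; the helper name proximity_pattern and its parameters are made up for illustration, and the Tagger's regex flavour may differ in details from Python's re module.

```python
import re

def proximity_pattern(term_a, term_b, max_gap=2):
    """Hypothetical helper: match term_a and term_b in either order,
    separated by at most max_gap intervening words."""
    # Zero to max_gap whole words, then the whitespace before the second term.
    gap = r"(?:\s+\w+){0,%d}\s+" % max_gap
    # One alternative per word order: a ... b, or b ... a.
    return re.compile(
        r"\b(?:{a}{gap}{b}|{b}{gap}{a})\b".format(a=term_a, b=term_b, gap=gap),
        re.IGNORECASE,
    )

pattern = proximity_pattern("concept", "institution")

forward = pattern.search("the concept of institution is debated")
backward = pattern.search("we treat institution as concept here")
too_far = pattern.search("concept one two three institution")
```

Here forward and backward both find a match ("concept of institution" and "institution as concept"), while too_far is None because three words separate the two terms. The terms are inserted into the pattern as-is, so anything containing regex metacharacters would need re.escape first.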

About a): Can you separate your documents into an English set and a Portuguese set? Then you could specify two dictionaries. Maybe the regex does not work due to special characters?
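Another possible cause, independent of special characters, is operator precedence: in a regex, | binds more loosely than concatenation, so the expression from a) is read as five separate alternatives (concept, conceito, definition, definição followed by the gap and institution, and instituição). Wrapping each list of synonyms in a non-capturing group avoids that. The following is a minimal sketch in Python, assuming Unicode-aware \w as in Python 3; the sample sentence is invented for illustration and the Wildcard Tagger's regex flavour may behave differently.

```python
import re

# Expression as posted: the alternation splits the whole pattern into
# five independent alternatives.
ungrouped = re.compile(
    r"concept|conceito|definition|definição(?:\s+\w+){0,2}\s+institution|instituição"
)

# Same idea, with each synonym list wrapped in a non-capturing group.
grouped = re.compile(
    r"\b(?:concept|conceito|definition|definição)"
    r"(?:\s+\w+){0,2}\s+"
    r"(?:institution|instituição)\b"
)

# Invented sample sentence mixing a proximity hit and an isolated term.
text = "Uma definição clara de instituição e um conceito isolado."

ungrouped_hits = ungrouped.findall(text)
grouped_hits = grouped.findall(text)
```

With this sample, ungrouped_hits contains the isolated words "instituição" and "conceito", whereas grouped_hits contains only the phrase "definição clara de instituição".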

Cheers, Kilian

tagging.zip
