Text patterns


I am looking for a way to split a text into patterns based on a tagging result, e.g.

two terms to the left of a tagged term [e.g. Location] and four terms to the right as one pattern,

or, even better, all terms to the left and right of a tagged term [e.g. Location] up to the next tagged term [e.g. Organization]

(analogous to the sentence extractor, which, as I understand it, uses punctuation).

I would like to split the text into parts that logically belong together.

Any ideas are appreciated.



One or more terms around a tagged term can be found. To do so, you first need to identify all terms that can be tagged, e.g. using a tagger node. Convert the tagged documents to a bag of words (BoW), then convert the terms to strings and group by those strings. You end up with a column of unique strings.

Use these strings as a dictionary for the Wildcard Tagger. Before doing this, wrap each string in regular expressions so that the terms around it are matched as well, e.g. "([a-zA-Z]+ )*" + <myString> + "( [a-zA-Z]+)*". You can do this with, e.g., a Java Snippet node. Then use the resulting regex strings as the dictionary. In the Wildcard Tagger, select "sentence based" and "regular expression".
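
To make the regex step more concrete, here is a minimal sketch in plain Java of what the Java Snippet could compute. It is not taken from the attached workflow: the class and method names (ContextPattern, wrap, wrapWindow, findAll) are made up for illustration, and java.util.regex only stands in for the Wildcard Tagger so the pattern can be tried outside KNIME. Two things go beyond the expression above and are my own assumptions: Pattern.quote, which escapes the term in case it contains regex metacharacters, and the bounded {0,n} variant, which would correspond to "two terms left, four terms right" from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContextPattern {

    // Same shape as in the post: any number of words before and after the term.
    // Pattern.quote is an addition (assumption) so the term is matched literally.
    static String wrap(String term) {
        return "([a-zA-Z]+ )*" + Pattern.quote(term) + "( [a-zA-Z]+)*";
    }

    // Hypothetical bounded variant: at most `left` words before
    // and at most `right` words after the term.
    static String wrapWindow(String term, int left, int right) {
        return "([a-zA-Z]+ ){0," + left + "}"
                + Pattern.quote(term)
                + "( [a-zA-Z]+){0," + right + "}";
    }

    // Stand-in for the tagger: collect every match of the regex in one sentence.
    static List<String> findAll(String regex, String sentence) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(sentence);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sentence = "We opened a new office in Berlin last year with Siemens.";
        // Unbounded context: runs until a non-letter character (here the full stop).
        System.out.println(findAll(wrap("Berlin"), sentence));
        // Two words to the left, four to the right.
        System.out.println(findAll(wrapWindow("Berlin", 2, 4), sentence));
    }
}
```

Since the character class only contains letters and single spaces, a match stops at punctuation or digits. Inside a Java Snippet node you would only need the string concatenation from wrap or wrapWindow; the matching itself is then done by the Wildcard Tagger.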

Attached you will find an example workflow. Hope this helps.

Cheers, Kilian

Just great. Thanks!