French language and KNIME text preprocessing

For almost a year, I have been using R for text mining (tm package), remote-controlled via KNIME's R nodes. However, I have recently been confronted with serious text encoding issues, probably due to the communication between the two platforms. Therefore, I'd like to switch entirely to KNIME for this task.

Along the way, I've come across some difficulties in the context of the French language, which I have illustrated in the attached workflow:

- the Dictionary Tagger takes the apostrophe into account when matching named entities (e.g. UE). In French, this creates problems because of the stopword l' (e.g. l'UE). Yes, one can use fuzzy matching, but that creates its own problems, e.g. the whole word ends up protected (i.e. l'UE instead of UE) ;

- the French stopword l' is not recognized by the Stop word Filter, which creates challenges in interaction with Punctuation Erasure: the latter merges any word with the preceding l' -> e.g. l'UE becomes lUE ;

- Replacer (which allows one to replace punctuation or any regex instead of removing it) appears to cause unexpected behaviour in some non-regex preprocessing nodes that follow after Replacer (cf. Number Filter and N Chars Filter). In the attached example, I have replaced punctuation by $ just to show what is going on. The same happens if one replaces punctuation by a space character - in addition, replacing by a space character creates the trouble of having to remove multiple space characters (see next point) ;

- how to remove excessive whitespace between words ? (not shown in the attached workflow) The regex \b([ ]{2,})\b is not recognized ...

- what precisely is the regex flavor understood by KNIME ? e.g. \s is apparently not recognized as whitespace ;

All in all, I end up using regex a lot, just because of the unrecognized French stopword ( l' ) and because of the odd behaviour after the Replacer node.

Any thoughts or ideas ?

Hi Geo,

thank you for the questions and hints. Let me start by answering a few of them.

The basic problem is due to the tokenization model used (OpenNLP). l'UE should be tokenized as two words, but it actually ends up as a single word. This makes stop word filtering etc. very difficult.

The tokenization cannot really be seen when looking at document cells in the data table. To see how the terms are tokenized, use the Bag of Words Creator node.

About your points in the post and in the workflow:

* Punctuation Erasure: l'UE will be turned into lUE since these are not two tokens and all punctuation marks are removed. The regex used to match punctuation marks is "[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+" (see the small sketch after this list).

* Number Filter: only if the whole term (token) is a number will it be filtered out. The regex used here to match is "^[-+]?[\\d.,]+".

* N Chars Filter: the l is not removed in the first row because it is part of the token l$UE. The minimum number N is set to 2 in the dialog, meaning that l$UE (4 chars) is not filtered.

* RegEx Filter: why UE (part of the token l$UE) is removed is a good question. I have to check this in detail and will reply soon.
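
To make the two regexes above a bit more tangible, here is a small plain-Java sketch (not the node code itself, just the same expressions run through java.util.regex; the class name and sample values are only for illustration):

    import java.util.regex.Pattern;

    public class PunctuationAndNumberSketch {
        public static void main(String[] args) {
            // Punctuation Erasure pattern (same Java string literal as above):
            // the apostrophe is part of the character class, so it is simply deleted.
            String punct = "[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+";
            System.out.println("l'UE".replaceAll(punct, ""));       // lUE

            // Number Filter pattern: per the rule above, a token is only filtered
            // when it is entirely numeric.
            Pattern num = Pattern.compile("^[-+]?[\\d.,]+");
            System.out.println(num.matcher("3,14").matches());      // true  -> would be filtered
            System.out.println(num.matcher("l'UE").matches());      // false -> would be kept
        }
    }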

Cheers, Kilian

About the RegEx Filter: your regex is "\b([a-zA-Z]){1}\b", which matches the "l" of the token "l$UE" using Matcher#find(). Thus the whole token will be filtered out.

The regex flavor understood by KNIME is the basic Java flavor. The string specified in the dialog of the RegEx Filter node is compiled to a pattern using Pattern#compile().
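
A minimal plain-Java illustration of that behaviour (the class name is just for the example):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegExFilterSketch {
        public static void main(String[] args) {
            Matcher m = Pattern.compile("\\b([a-zA-Z]){1}\\b").matcher("l$UE");
            // find() looks for the pattern anywhere in the token; \b matches
            // between "l" and "$", so the single letter "l" is found ...
            System.out.println(m.find());    // true
            System.out.println(m.group());   // l
            // ... and because a match was found, the whole token "l$UE" is filtered out.
        }
    }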

Cheers, Kilian

Thank you Kilian for your detailed feedback. So, what I'm beginning to understand is that tokenization plays a huge role in all of this, doesn't it ? Does this mean that in a piece of text such as "this is the end.This is", the part "end.This" forms one single token no matter whether I replace the . by whitespace or any other character ?

Regarding the stop word matter, the R Snowball package used together with the tm package behaves the same way with respect to the stopword l'. However, in R I can replace any punctuation by whitespace and afterwards easily strip the extra whitespace between words. Why would one want to erase punctuation anyway instead of replacing it by whitespace ? Removing it appears to merge words together unexpectedly.

So here is my reformulated question: when trying to do the above with Replacer, even when not connecting the Dictionary Tagger to tag named entities:

- if I search for punctuation using either \p{P} or your aforementioned punctuation regex and replace it by $, then l$UE is a single token and trying to remove the $ afterwards removes UE as well (which seems logical to me) ;

- if I search with the same regex but replace it with whitespace this time, the Regex Filter "remove single character words: \b([a-zA-Z]){1}\b" still removes both l and UE. Why are they still one token instead of two ? Shouldn't the whitespace divide the token into two tokens ?

The interesting thing is that this happens with ANY punctuation mark that separates two words (e.g. end.This) and that is replaced by ANY character using Replacer. Is this still expected behaviour ?

As of now, the only practical workaround I can think of is to use regexReplace($text$, "\\p{P}", " ") or even regexReplace($text$, "\\p{P}", "\\$") in a String Manipulation node prior to Strings to Document, i.e. to replace any punctuation by whitespace, a $ or whatever, before the document is created. This lets me proceed with the remaining text mining nodes without weird things happening due to tokenization and punctuation.
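
For what it's worth, in plain Java terms (which, as far as I understand, is what regexReplace boils down to) the workaround looks roughly like this - the sample sentence and class name are made up, and the second replaceAll also answers my earlier question about collapsing excessive whitespace:

    public class PreTokenizationCleanup {
        public static void main(String[] args) {
            String text = "C'est la fin de l'UE.Voilà !";
            // Replace any punctuation by a space *before* Strings to Document tokenizes ...
            String cleaned = text.replaceAll("\\p{P}", " ");
            // ... then collapse runs of whitespace into a single space.
            cleaned = cleaned.replaceAll("\\s{2,}", " ").trim();
            System.out.println(cleaned);   // C est la fin de l UE Voilà
        }
    }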

I find all this apparent (!) misbehaviour very confusing, but then again I'm not an IT specialist.

Yes, tokenization plays an important role. Punctuation erasure is useful if you have, for example, a token "and.": this would not be found and filtered by the stop word filter (exact match, "and" != "and."). Removing punctuation marks turns the token "and." into "and", which can then be filtered by the stop word filter.
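
Conceptually (not the actual node code, and the word list here is just an example) the exact match works like this:

    import java.util.Set;

    public class StopWordSketch {
        public static void main(String[] args) {
            Set<String> stopWords = Set.of("and", "the", "is");
            System.out.println(stopWords.contains("and."));  // false -> the token survives
            // After punctuation erasure the token is "and" and the exact match succeeds:
            System.out.println(stopWords.contains("and.".replaceAll("\\.", "")));  // true
        }
    }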

Tokenization is applied only by the Strings to Document node, the parser nodes, and the tagger nodes. Other nodes, such as the filters etc., do not change the tokenization. Replacing characters in tokens will not change the tokenization either, no matter whether there are whitespaces in the token or not.

What you could do to affect the tokenization via replacements is to do the replacements on the string columns (replace l'UE with l UE) and then apply the Strings to Document node, which performs the tokenization. Then l and UE will be considered two tokens instead of l'UE being one token.
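
As a rough sketch of that idea in plain Java (the sample text is made up; in the workflow you would do the equivalent replacement on the string column, e.g. with String Manipulation, before Strings to Document):

    public class PreSplitSketch {
        public static void main(String[] args) {
            // Turn the elided article into a separate word *before* tokenization,
            // so that "l" and "UE" end up as two tokens.
            System.out.println("l'UE est grande".replaceAll("\\bl'", "l "));  // l UE est grande
        }
    }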

I hope this helps.

Cheers, Kilian

Thank you again, Kilian.

I can live with the workaround described in my previous post.

At least, I have now managed to migrate text preprocessing from R to KNIME. The text encoding trouble with R remote-controlled via KNIME was really a limiting factor. The only thing I currently lose in the migration is stem completion. On the bright side, KNIME text preprocessing appears to perform much better and it is easier to debug - well, that is, except for this tokenization story :-)

I want to separate French elided articles, e.g. l'avant -> la avant, j'ai -> je ai, etc. After reading Geo's post, I tried but came up with nothing. Any idea/solution would be appreciated.
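
Something like this is what I mean - just a rough plain-Java sketch of the intended mapping (the expansions and names are my own choice), not a KNIME solution:

    import java.util.Map;

    public class ArticleExpansionSketch {
        public static void main(String[] args) {
            // Intended expansions for the elided forms (my own choice per article)
            Map<String, String> expansions = Map.of(
                "l'", "la ",   // l'avant -> la avant
                "j'", "je "    // j'ai    -> je ai
            );
            String text = "l'avant et j'ai";
            for (Map.Entry<String, String> e : expansions.entrySet()) {
                text = text.replace(e.getKey(), e.getValue());
            }
            System.out.println(text);   // la avant et je ai
        }
    }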

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.