Document data extractor adding spaces to terms containing "_In"

I have encountered what I can only assume is a bug in the Document data extractor node.

My corpus contains named entities which I have tagged and converted into single strings by replacing the spaces with underscores. So "Department of Industry" becomes "Department_of_Industry". In order to strip away the tags (as discussed in this post), I have used the Document data extractor to convert the documents into strings, and then used the Strings to document node to convert the strings back into documents.
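For illustration, here is a minimal sketch of that substitution step (the entity list and the replaceEntities helper are made up for this example; the actual tagging step from the post is not shown):

    import java.util.List;

    public class JoinEntities {
        // Replace spaces inside known entity names with underscores so each
        // entity survives later processing as a single term.
        static String replaceEntities(String text, List<String> entities) {
            for (String entity : entities) {
                text = text.replace(entity, entity.replace(' ', '_'));
            }
            return text;
        }

        public static void main(String[] args) {
            String doc = "A report from the Department of Industry was released.";
            System.out.println(replaceEntities(doc, List.of("Department of Industry")));
            // -> A report from the Department_of_Industry was released.
        }
    }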

All that works fine, except I noticed that some of the names that I had tagged are getting broken up in the recreated documents. So I am seeing "Department_of_" and "Industry" as two separate terms. When I inspect the output of the Document data extractor (by making a CSV file), I can see that "Department_of_Industry" has become "Department_of_ Industry" -- that is, with an extra space inserted between the last underscore and the word "Industry". As far as I can tell, this has only happened to terms that include a word starting with "In". For example, the same thing happened to "Department_of_Trade_and_Investment". No other terms seem to have been modified.

So in other words, the Document data extractor seems to be converting the string "_In" to "_ In".

Weird, huh. Can anyone explain this?


At first glance, this does not look like a bug but rather a side effect of the tokenizer in the Strings To Document node. Maybe you can share an example workflow with dummy data to reproduce the observed issue?

I created a test workflow (attached), but I've not managed to reproduce the issue exactly as I described it. However, at least some of the unusual behaviour can be observed.

In this case, I can only observe the changes in the Bag of words output. They don't show up in the Document viewer or the CSV output. I'm quite sure that I previously saw the terms change in the actual documents (not just the bag of words), but for the moment haven't been able to replicate this.

Anyway, run the attached workflow and note what happens to the various terms in the test document once they go through the Bag of Words creator. As I observed previously, 'Department_of_Industry' gets split after the second underscore, while some similar terms remain intact. Interestingly, the term 'Department_of_Underscores' was also split.

Also, the string I_I_I got split after the second underscore, but the same string ending in a full stop did not. I tried the same thing with another letter (A_A_A), and this string didn't get split with or without the full stop.

Without knowing how these nodes actually work, I can't make any sense of this. But from my perspective, this behaviour is a real inconvenience.

As I understand the text mining features of KNIME, the bag of words does nothing more than split out the terms based on the tokenization that has happened beforehand. This means that even though you can't see the split in the Document viewer, it is already there.

Maybe what would be helpful in the Document viewer would be to visualize each separate token, e.g. through an alternating color scheme. Absent that feature, the BoW node lets you visualize the tokens. So what you are seeing in the BoW output is what is already there before the BoW node.

Ok... so is there any other way to spin this than to say there is a bug in the tokenizer?

I'm afraid I don't understand. You say that 'space is used for tokenization', but my problem here is that:

  1. Spaces or divisions are being inserted where in the input text there are none; and
  2. The way in which this is happening is not consistent - e.g. it tends to happen when an underscore precedes some letters (such as I) but not most others.

This impacts my workflow because it means I can't use underscores as substitutes for spaces in n-grams that I don't want separated. Fine, maybe I can use another character instead (which is unfortunate, because the underscore is by far the most logical and preferred option). But how can I see this as anything other than working around a shortcoming in the tokenizer?

I carefully typed the input text in my previously attached workflow from scratch. So I don't know what the tokenizer could be detecting that would make it discriminate between strings that are otherwise identical except for a single substituted letter.

What am I missing here? How is it acceptable or normal for the tokenizer to work inconsistently in the way I am describing?

You're right, it is weird.

I've tested String Manipulation before applying Strings To Document. What I've found is that when you replace "_" with a non-punctuation ASCII character (i.e. any alphanumeric character without accents) such as "c" or "7", the BoW will look as desired - that's the only workaround I've found.

However, replace "_" with anything else, even a doubled "_" ("__"), and the BoW output will again be full of problems.
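A minimal sketch of this workaround, assuming you swap the separator before Strings To Document and swap it back on the extracted terms afterwards (the helper names are illustrative; note that restoring "_" this way would also affect any genuine "7" characters in your text):

    public class SeparatorWorkaround {
        // Swap the underscore for a plain alphanumeric character before tokenization.
        static String protect(String text) {
            return text.replace("_", "7");   // "Department_of_Industry" -> "Department7of7Industry"
        }

        // Undo the substitution on the extracted terms afterwards.
        static String restore(String term) {
            return term.replace("7", "_");
        }

        public static void main(String[] args) {
            String original = "Department_of_Industry";
            String protectedForm = protect(original);
            System.out.println(protectedForm);          // Department7of7Industry
            System.out.println(restore(protectedForm)); // Department_of_Industry
        }
    }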

Here's Kilian's answer to a quite similar question I asked not long ago: https://tech.knime.org/forum/knime-textprocessing/french-language-and-knime-text-preprocessing#comment-41694

Thanks, good to know I'm not the only one having this sort of issue. For the life of me though, I can't see how the tokenizer is performing 'correctly' if it behaves so inconsistently.

I suppose for the time being I'll just have to work around this problem and use a different separator instead of the underscore ... and hope that works better!

(I say 'for the time being' in the hope that maybe, someday, this issue will be fixed...)

As mentioned, this is due to the underlying tokenizer. Here the OpenNLP tokenization model for the English language is used. This tokenization is based on a statistical model and is not a simple whitespace tokenizer. Characters other than whitespace can indicate the end of a token too, and the preceding characters are also taken into account. You can find the model here: http://opennlp.sourceforge.net/models-1.5 (the word tokenizer for the English language).
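For anyone who wants to check this behaviour outside of KNIME, here is a small sketch using the OpenNLP API directly (en-token.bin refers to the English word-tokenizer model from the link above; the file location is an assumption):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Arrays;

    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TokenizerCheck {
        public static void main(String[] args) throws Exception {
            // Load the statistical English word-tokenizer model (en-token.bin).
            try (InputStream in = new FileInputStream("en-token.bin")) {
                TokenizerModel model = new TokenizerModel(in);
                Tokenizer tokenizer = new TokenizerME(model);

                // Print how the model tokenizes the strings discussed in this thread.
                for (String s : new String[]{"Department_of_Industry", "I_I_I", "A_A_A"}) {
                    System.out.println(s + " -> " + Arrays.toString(tokenizer.tokenize(s)));
                }
            }
        }
    }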

The only solution to this is to use different characters, as already mentioned.

Cheers, Kilian


Hi Kilian,

Thank you for pointing to the source. Still, I wonder: wouldn't a whitespace (or any single-character-based) tokenizer be enough for most use cases? So far, in order to avoid difficulties with the implemented tokenizer for any non-English language, a lot of "cleaning" is required beforehand, and the result remains quite unpredictable.

In any case, the information that you've provided here would be very useful in the documentation of the Strings To Document node.

There's no bug; it has to do with how your text looks before you transform it into documents. Space is used for tokenization in the Strings To Document node [EDIT: actually not correct, see Kilian's post in this thread]. With this in mind, try to analyze where and how this impacts your workflow.

Hi Geo,

you are right that having more tokenizers to choose from (whitespace, ...) would make a lot of sense. I will put it on the list but can make no promises about the timeline.

Cheers, Kilian

Hi Kilian,

Thank you a lot. That would greatly enhance KNIME's already amazing text mining features. 

Until such tokenizers are built in, one can always implement a "manual" tokenization using KNIME's "regular" nodes:

  • Cell Splitter (to implement e.g. whitespace as separator of tokens),
  • Pivot, Unpivot, GroupBy (to perform a BoW-type transformation and back),
  • and String Manipulation (for regex stuff).

So the idea is to preprocess the data (punctuation, numbers, non-ASCII characters, etc.) using regular nodes and then to switch to the text mining nodes (via Strings To Document or String To Term) only for the language-relevant methods (stop words, stemming, tagging, modelling, etc.). Absent punctuation and non-ASCII characters, the OpenNLP tokenizer appears to behave in a predictable manner. Obviously, such a workflow probably works better for simpler documents (à la spam/ham classification) than for analysing full-fledged corpora. After all, there has to be a reason for the OpenNLP tokenization.
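Purely as an illustration of what this "regular node" pipeline amounts to (not KNIME code), a plain whitespace split followed by a term/frequency count might look like this:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class WhitespaceBoW {
        // Strip punctuation (but keep underscores), split on whitespace only,
        // and count term frequencies -- roughly what Cell Splitter + GroupBy do.
        static Map<String, Integer> bagOfWords(String text) {
            String cleaned = text.replaceAll("[\\p{Punct}&&[^_]]", " ");
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String token : cleaned.trim().split("\\s+")) {
                counts.merge(token, 1, Integer::sum);
            }
            return counts;
        }

        public static void main(String[] args) {
            System.out.println(bagOfWords(
                "The Department_of_Industry met the Department_of_Industry staff."));
            // {The=1, Department_of_Industry=2, met=1, the=1, staff=1}
        }
    }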

P.S.: Thanks a lot to both of you for this great discussion!

Thanks from me, too.

I really appreciate the capabilities that KNIME's text processing provides, and the niche KNIME fills by allowing it to be done via a graphical interface. I am now starting to learn other tools such as R and regex to complement my workflow, but if I hadn't discovered KNIME first, I wouldn't have even started on this path, since the learning curve would have been too steep.

By the way, I'm progressing by using "7" as my separator instead of "_", and of course it works fine, even if the results aren't quite as readable.