Tokenizer doesn't recognize hyphenated word as one token

Hey everyone,

first of all: I’m not totally new to KNIME, but far away from expert level. A few weeks/months ago I started text mining/preprocessing a text corpus of 306 texts of varying length (approx. 1.6 million words) for a research project. The texts are now all stored locally as PDFs and were generated from scans, e-books, e-papers, etc. The language is German. My main goal for now is getting some good LDA topics with an adaptable preprocessing workflow. At first I used the standard workflows provided by the KNIME courses and forum, which helped me, and I adapted them to my needs. But step by step I noticed many errors in my data, caused either by the data itself (OCR errors, conversion failures, etc.) or by mistakes/bugs in the preprocessing workflow.

To cut a long story short: could anyone help me with my main problem, which I hopefully can solve with a workaround?

I read all my files with the PDF Parser node, but with none of the tokenizers was I able to get words hyphenated at the end of a line in a PDF tokenized as one word. In some cases the hyphenated words are displayed “correctly” (i.e. joined) in the Document Viewer node, but afterwards I still find them separated, e.g. in the Bag-of-Words node. Could the solution lie in some data preparation (like converting the PDF file into a readable format with a charset the PDF Parser node can handle), or is there any workflow/node to put these hyphenated words “back together” afterwards so they get tokenized as one word?

If possible, could you provide a sample file as an attachment? I think this will help the community to better assist you here.
br


Hey Daniel,

first of all: thanks for the quick response, and sorry for the missing sample file.
I snipped out a part of the workflow and took 2 sample files. Both are texts that were published electronically (so not scanned), and both show the same error.

The first text (2011 Mendl) contains, for example, the word (see screenshot):

“Mangel‑ erscheinungen”, which is tokenized into 3 tokens (with the tokenizer used here, but no other tokenizer ever produced one token either). In the table of the Bag-of-Words node you can find it in rows 605–607. In my opinion these are all “soft/conditional hyphens”.

In the second text (Steffen 2017, see screenshot), besides these errors, something else happens (at least in the Document Viewer, but probably with consequences, e.g. for co-occurrence calculation) that I would describe as “ignoring the reading order of the page”:

The PDF Parser node extracts the text by reading from left to right, ignoring the columns of the published text layout.

And here is the “micro-workflow”: Tokenizer-fail-hyphens.knwf (12.7 KB)

Thx everyone for the help

best regards

fb

@f_brustkern , could you try the Tika Parser node to see if it works better for you?

@izaychik63 Thanks for your advice, but I had already tried this in different combinations. Nevertheless, I tried it once again: first I read the files with the Tika Parser node, converted the “content” string cells/columns into documents, and used the same preprocessing workflow as before …

But sadly, the same results (see attached workflow): Tokenizer-fail-hyphens_wTika.knwf (17.9 KB)

best regards

fb

Maybe not a very relevant suggestion, but you could try the Corpus Creator from Palladian.

I don’t know if you meant exactly that, but I just realized that NodePit offers way more nodes than I knew! The Corpus Creator you mentioned is now first on my list! I’ll be back with news. Thanks!


So, back again, but with no satisfying results. I was able to install all necessary extensions, but the node didn’t really work for me: when I tried to configure it, it always gave me […] No column in spec compatible to “CollectionDataValue”.

But anyway: would there be a workaround for my “incorrect corpus”, which I was able to generate at first? For example, could I rejoin all the hyphenated words, e.g. with a regex replacer, and tokenize the recombined words as one token afterwards?

(I tried things with the Regex Replacer node before, but mostly ended up with an even more “destroyed corpus”: I did replace the right things [punctuation], but replaced them with a normal whitespace, which then showed up as a “tokenized whitespace”, e.g. in my LDA topics!)

thanks everyone for the help

best regards

fb

So if I understand correctly, it would not be an option to remove the “-” completely and create single-word combinations like Mangelerscheinungen. Is that what you tried with the regex?
br


Actually that would be a good solution, but I don’t want every “-” to be erased, because some words like “DAV-Konditionen” should be tokenized as one token and not turned by that regex into “DAVKonditionen”. And, second, I couldn’t find a way to tokenize “Mangelerscheinungen” as one token afterwards, because the LDA node and so on still recognize 2 tokens here, “Mangel” and “erscheinungen” (and I couldn’t find a tokenizer node that worked either).

I hope that describes my problems.

best regards

fb

hi @f_brustkern , you just need to figure out the rules you want to apply. For example, remove the hyphen only if it’s followed by a space? So instead of looking for “-”, look for “- ” instead.

You probably just need to do a replace() via String Manipulation, no need for regular expression if it’s just to look for "- ".

It’s a matter of coming up with the rules first. If you provide the rules, then we can help come up with the solution. Or at least give samples of different cases and show what the expected results for these cases are so we can see patterns and help come up with some rules.

These rules would apply except when the hyphenated term should itself stay one token, if I understood correctly.


Hi @f_brustkern, I can’t quite tell if your issue only relates to unwanted spaces in the input text, or if you are also having trouble keeping correctly formed hyphenated terms together once they are tokenised. If the latter is a problem, you can try tagging these terms with the Wildcard Tagger or the Dictionary Tagger. This will ensure that they stay together in the bag of words, even if the tokeniser originally separated them.

Good luck!


Hej everyone,

sorry for the delay. I needed some time to correct/manipulate the “original texts/data”.
So now I can hopefully describe my problem better, so that you can help me.

I’ve read in my PDF files with the PDF Parser node and checked the texts right afterwards with a Document Viewer node. And almost every text shows the same errors:

Due to hyphenation at the end of a line in the PDFs, a word gets split into “abc- def”, and is then of course tokenized into “abc”, “-” and “def”. But I want the hyphenated words put back together as one (tokenized) word, like “abcdef”.

Some examples would be:

  • “und extrovertierte Typen, phlegmatische und cholerische Men-­ schen”

  • "(IV) und „Änderung der studentischen Konzep- tionen/intellektuelle Entwicklung“ "

  • "Zur ersten, „vermittlungs- orientiert/inhaltsorientierten“ Orientierung gehören die Kategorien „Informatio­- nen abgeben“ "

So as you can see, most errors are characterized by a “-” (in my opinion all conditional hyphens) followed by “  ” (2 whitespaces), and some by “   ” (3 whitespaces).

Hope anyone can help.

Thank you very much, and best regards.

PS: ideally the workaround would handle the problem right after the PDF Parser node, without going back to strings and so on …

Hi @f_brustkern , we can’t really see your data or run your workflow as it is pointing to a file on your desktop.

Based on what I’ve read, it looks like if you have a hyphen between 2 words, it should be written as is. For example: DAV-Konditionen should stay DAV-Konditionen

However, if there is a hyphen followed by 1, 2, 3, or however many whitespaces, it means that a word was split, and it should therefore be concatenated.
For example:
Men- schen should be Menschen
vermittlungs- orientiert/inhaltsorientierten should be vermittlungsorientiert/inhaltsorientierten
Informatio- nen should be Informationen

I’ve put something together that modifies the texts based on this rule.

Results:

I added a few samples that take it to the extreme, as you can see:

test-something
test- something
test-  something
test-    something
test-     something

As expected, based on the rules, only test-something will remain as is. The other ones are converted to testsomething according to the rules.

Basically, this is what I did:
I first replace all multiple whitespaces to only 1 whitespace:
regexReplace($column1$, "\\s\\s+", " ")

At this point, any hyphen that was followed by 1 or more whitespaces will be followed by only 1 whitespace, and any hyphen that was followed by no whitespace remains unchanged.

I then just need to remove any hyphen that’s followed by a whitespace ("- ")
replace($new column$, "- ", "")

Run all these in 1 operation:
replace(regexReplace($column1$, "\\s\\s+", " "), "- ", "")
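For anyone following along outside KNIME, the same two-step logic can be sketched in plain Python (the function name `join_hyphenated` is just an illustrative stand-in, not a KNIME API):

```python
import re

def join_hyphenated(text: str) -> str:
    """Mimic the two String Manipulation steps described above."""
    # Step 1: collapse every run of 2+ whitespace characters to a single
    # space, so "test-    something" becomes "test- something".
    text = re.sub(r"\s\s+", " ", text)
    # Step 2: drop every "hyphen + space" pair, rejoining the split word:
    # "Men- schen" -> "Menschen", while "DAV-Konditionen" is left untouched.
    return text.replace("- ", "")
```

As in the nodes, only a hyphen directly followed by whitespace is treated as a line-break artifact; a hyphen between two letters stays as-is.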

I’ve run the Bag-Of-Words on this sample text, and it looks like the terms are properly created:

I highlighted the words that were concatenated in red, and the ones that were to remain as is with a hyphen in green.

Is it the behaviour you were looking for? (Sorry, I’m not too familiar with the Bag-of-words node - never used it before, nor have I ever used “documents” before, so I’m not too familiar with tokenization, but I’m taking a guess at what it means)

The workflow looks like this:
image

And here’s the workflow: Tokenization with hyphen.knwf (10.3 KB)


Hej @bruno29a ,

thank you very much for your help and effort!
Actually, that comes quite close to what I’m looking for.
With the PDF Parser node it didn’t work at all, and also not with the Replacer node (regex-based). But when I switched to the Tika Parser node to read my PDF files and used your String Manipulation node, it at least worked to use the regex

regexReplace($Content$, "\\s\\s+", " ")

to erase all runs of more than one whitespace. But afterwards I was not able to concatenate the hyphenated words. I’m not really familiar with regex, but could it be necessary to put something like a dot “.” before and after “- ”, so that any character except a line break gets matched?

And hopefully now you can work with my attached example workflow and an example text.

Thank you so much.

best regards

StringManip.-Workaround 1.knwf (79.7 KB)


A little add-on: I was just experimenting with some of my text fragments and a regex generator. When I pasted some of my text, it showed me right away that my hyphens are “soft hyphens”, displaying the Unicode code point U+00AD. Perhaps that also has something to do with my errors/problems.
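A quick way to double-check this outside KNIME is a short Python snippet; the sample strings below are hypothetical, built with an explicit `\u00ad` escape:

```python
# U+00AD (SOFT HYPHEN) renders like "-" (U+002D, HYPHEN-MINUS) but is a
# different character, so a literal "-" in a replace pattern never matches it.
SOFT_HYPHEN = "\u00ad"

def inspect_hyphens(text: str) -> list[str]:
    """Return the Unicode code point of every hyphen-like character found."""
    return [f"U+{ord(ch):04X}" for ch in text if ch in ("-", SOFT_HYPHEN)]

print(inspect_hyphens("Mangel\u00ad erscheinungen"))  # -> ['U+00AD']
print(inspect_hyphens("DAV-Konditionen"))             # -> ['U+002D']
```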

Hi @f_brustkern , I’m not sure I understand what I’m looking at from your workflow.

Firstly, I’m looking at what you did in the String Manipulation:
replace($Content$, "- ", " ")

This is not in line with what you said:

In your “abc- def” example, if you apply that replace statement, you will end up with “abc def”. I thought you wanted “abcdef” according to what you said, which is why my replace statement was:
replace($Content$, "- ", "")

So, what’s the expected result in the end?

Secondly, what is the issue you are trying to show in your workflow?

Is it about the cases where you have “-\n” (hyphen followed by a new line)? For example:
Ge­-
nerationen

asphaltier­-
ten

etc…

You can look for “-\n” and remove them like this:
replace($Content$, "-­\n", "")

Just be mindful though, this is NOT a hyphen. It displays as a hyphen, but it’s actually a character that splits a word at the end of the line. It’s hard to capture that character, so I suggest you take it from the workflow I’m going to attach (I basically copied it from the text itself).

So, it converted this:
image

to this:

You can take the replace expression from the node that I added:

Here’s the workflow: StringManip.-Workaround 1_bruno.knwf (89.2 KB)


Hej Everyone,

finally I solved the errors, and certainly with your help!
I used 2 separate String Manipulation nodes: the first to reduce the whitespace with the function

regexReplace($Content$, "\\s\\s+", " ")

and the second to concatenate the hyphenated words with the function

regexReplace($Content$, "(\\S)­\\s+", "$1").

And in the second one I needed to copy-paste the “-” from the text so that I got the soft (shy) hyphen (which occurs only at line breaks).
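For reference, the two-node fix can be sketched as a single Python function (assuming, as above, that the invisible character really is U+00AD; the function name is illustrative):

```python
import re

SOFT_HYPHEN = "\u00ad"  # the "shy" hyphen the PDF parser emits at line breaks

def fix_soft_hyphens(text: str) -> str:
    # Node 1: collapse runs of whitespace to a single space.
    text = re.sub(r"\s\s+", " ", text)
    # Node 2: a non-whitespace character, a soft hyphen, then whitespace marks
    # a line-break split; keep the character, drop the hyphen and the spaces.
    return re.sub(rf"(\S){SOFT_HYPHEN}\s+", r"\1", text)

# A regular hyphen (U+002D) never matches, so "DAV-Konditionen" survives intact.
```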

Hopefully that now works for all cases (or do you see any major problems with this solution?).

To sum up: thanks for all your help. I’ll probably need it again soon :wink:

best regards

fb


Hej @bruno29a,

there we have been writing simultaneously! And we arrived more or less at the “same solution” in different ways! And yours is certainly the more “solid and logical” one. I will have a look at your workflow right away and compare it to my solution. To sum up again: it’s just unbelievable how this community helps each other right away!

best regards

fb


Hi @f_brustkern ,

That’s exactly what I did :slight_smile:
