Inconsistencies in breaking up expressions starting with special characters

Pepita · June 16, 2014, 2:16pm

Hello :)

I already mentioned this in my massive post about all the problems I, as a non-IT newbie to KNIME and text mining, have with my university project. But since this is one important part that might actually be figured out quickly, I thought I'd ask separately.

I retrieve Twitter data via the Palladian websearcher, create documents via Strings to Document, and use a Wildcard Tagger to tag some targets (@...) and hashtags (#...) as named entities. When I apply the BoW afterwards, the # and @ are sometimes separated from the rest of the expression and sometimes not. This only affects the untagged expressions; the tagged ones stay intact because I set them to unmodifiable.

The same happens to URLs, which are almost always split into http and ://...

I'd understand if this happened all the time (and would prevent it by tagging them or something), but I don't understand why it is inconsistent. Plus, this makes it hard to replace all the untagged hashtags with the word only (so a sentiment dictionary can "find" them).

I'd appreciate any pointers :) Thank you!

Pepita · June 18, 2014, 3:31pm

Just realised this also happens for words like "won't". I really have no idea why KNIME breaks up some words and not others.

kilian.thiel · June 23, 2014, 5:41pm

Hi Pepita,

thank you for your post. We are using the OpenNLP tokenizer for word and sentence tokenization. This tokenizer is not a plain whitespace tokenizer, so it can split a term at special characters even when there is no whitespace. So far there is no option in the Textprocessing extension to change tokenizers.

The Wildcard Tagger can tag at term level or at sentence level. Term level means that the regexes are matched against individual terms. If the tokenizer has separated the # character from the subsequent word, a term like "#aterm" does not exist and cannot be found. In that case you can switch the tagging level to sentence-based, so that the regex is matched against the complete sentence. With sentence-based tagging it is also possible to tag more than one term.
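To illustrate the difference between the two tagging levels, here is a minimal Python sketch. It is not KNIME code and does not use the Wildcard Tagger's own expression syntax; the pattern, the function names, and both token lists are assumptions made up for the example.

```python
import re

# Hypothetical pattern for hashtags and @-mentions (an assumption,
# not the Wildcard Tagger's syntax).
PATTERN = re.compile(r"[#@]\w+")

def tag_term_level(tokens):
    """Match the pattern against each token on its own."""
    return [t for t in tokens if PATTERN.fullmatch(t)]

def tag_sentence_level(sentence):
    """Match the pattern against the complete sentence string."""
    return PATTERN.findall(sentence)

sentence = "Loving #knime thanks @Pepita"

# Assumed tokenizer output where the special characters were split off.
split_tokens = ["Loving", "#", "knime", "thanks", "@", "Pepita"]
# Assumed tokenizer output where they stayed attached.
joined_tokens = ["Loving", "#knime", "thanks", "@Pepita"]

print(tag_term_level(split_tokens))   # nothing found: no token contains "#knime"
print(tag_term_level(joined_tokens))  # both terms found
print(tag_sentence_level(sentence))   # both terms found, independent of token splits
```

Matching on the sentence does not depend on how the tokenizer happened to split a particular term, which is why it avoids the inconsistency.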

Attached you will find an example workflow showing how to find terms starting with # or @.
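For the related goal of replacing untagged hashtags with the bare word, so that a sentiment dictionary can match them, one possible approach is to strip the # from the raw text before it is tokenized. The Python sketch below only illustrates the idea and is not part of the attached workflow; the helper name and the pattern are hypothetical, and inside KNIME the same replacement would have to be done on the string column before the documents are created.

```python
import re

# Hypothetical pattern: a '#' that is not preceded by a word character
# or a slash (so URL fragments and things like "a#b" are left alone),
# followed by the hashtag word itself.
HASHTAG = re.compile(r"(?<![\w/])#(\w+)")

def strip_hashtag_marks(text):
    """Replace each hashtag with its bare word, e.g. '#happy' -> 'happy'."""
    return HASHTAG.sub(r"\1", text)

print(strip_hashtag_marks("so #happy about #knime"))
print(strip_hashtag_marks("see http://example.org/#top"))
```

Hashtags that should stay tagged as named entities would need to be excluded from this step, for example by handling them first.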

I also see the problem that the OpenNLP tokenizer splits words like "won't" into two words, and I will think about a solution.

Cheers, Kilian

P.S.: I will answer your massive post in the next few days.

wildcardtagging.zip

Pepita · June 24, 2014, 12:08pm

Thank you so much for your detailed answer, it is really helpful!

If you don't have much time, don't bother answering the massive post. By now I have found workarounds for a lot of things (not all), and I think I might be getting the hang of this ;) It just needs some creativity.

I keep running into smaller problems, but now I can describe them in more detail, which I guess/hope makes them easier to answer.

Thanks so much for your help, the support in this forum is impressive!
