Wildcard tagger (possible) bug

I'm trying to tag hashtags and mentions in tweets using the Wildcard Tagger with the regular expressions #\w+ and @\w+, respectively, but the first hashtag of a tweet ends up untagged. Here's the bag of words (BoW) output after applying the taggers:


#[]                                     "#Débriefing pont de l'Alma  #zouave  #pieddansleau  #crue1910"
Débriefing[]                            "#Débriefing pont de l'Alma  #zouave  #pieddansleau  #crue1910"
pont[]                                  "#Débriefing pont de l'Alma  #zouave  #pieddansleau  #crue1910"
de[]                                    "#Débriefing pont de l'Alma  #zouave  #pieddansleau  #crue1910"
l'Alma[]                                "#Débriefing pont de l'Alma  #zouave  #pieddansleau  #crue1910"
#zouave[EXAGGERATION(SENTIMENT)]        "#Débriefing pont de l'Alma  #zouave  #pieddansleau  #crue1910"
#pieddansleau[EXAGGERATION(SENTIMENT)]  "#Débriefing pont de l'Alma  #zouave  #pieddansleau  #crue1910"
#crue1910[EXAGGERATION(SENTIMENT)]      "#Débriefing pont de l'Alma  #zouave  #pieddansleau  #crue1910"

I would expect the first hashtag to be tagged as well, but it isn't: it is split into the two untagged terms "#" and "Débriefing". I can't pin down the pattern completely, but the problem seems to occur only on the first word of the document text, and only for hashtags (mentions seem to work fine).

Hi Simone,

Here's a neat trick: you can use the search field in the interactive document viewer to experiment with different regular expressions. Testing there reveals that the problem is not the tagger but the Strings To Document node, which turns the first '#' in your example into a term of its own. Some characters seem to be interpreted differently depending on their position, including '#' but also '%', for example. That's probably the result of punctuation handling, so I'm not sure whether it is a bug or just the result of tweet-incompatible assumptions.

As a workaround, you could replace every '#' with a different character using a String Replacer node before applying the Strings To Document node. An ordinary letter that does not occur in your target languages should do the trick, since it won't be treated as punctuation.
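Outside of KNIME, the substitution idea can be sketched in a few lines of Java. This is only an illustration of the replacement step, not of what the String Replacer node does internally: the class name, the method names, and the choice of 'ǂ' (U+01C2, a letter that is unlikely to appear in French or English tweets) are all assumptions made for the example.

```java
public class HashPlaceholderDemo {
    // Hypothetical placeholder: U+01C2 is classified as a letter, not punctuation.
    public static final char PLACEHOLDER = '\u01C2';

    // Replace every '#' with the placeholder before the text is tokenized.
    public static String maskHashes(String text) {
        return text.replace('#', PLACEHOLDER);
    }

    // Restore the original '#' characters, e.g. for display after tagging.
    public static String unmaskHashes(String text) {
        return text.replace(PLACEHOLDER, '#');
    }

    public static void main(String[] args) {
        String tweet = "#D\u00e9briefing pont de l'Alma  #zouave  #pieddansleau  #crue1910";
        String masked = maskHashes(tweet);
        System.out.println(masked);
        System.out.println(unmaskHashes(masked).equals(tweet));
    }
}
```

With this approach the tagger expression would have to look for the placeholder instead of '#'. Whether the tokenizer then keeps the placeholder attached to the word is something to verify in the document viewer.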

Hi Simone,

You can use "#\s*\w*\s+" as the regular expression instead of "#\w+" and select "Multi term" in the dialog of the Wildcard Tagger. Please be aware that \w+ will not match "Débriefing" because of the é: by default, \w only covers the ASCII characters a-z, A-Z, 0-9 and the underscore. "Debriefing", however, will match.

Attached is a workflow with an example text with "Débriefing" and "Debriefing" and the right regular expression.

Cheers, Kilian

Kilian,

I'm no expert when it comes to the text processing nodes, but with the "(?U)" flag, as in "(?U)#\s*\w*\s+", it should also match "Débriefing", right?

Hi Marlin, that's right, I just tried it. The regex "(?U)#\s*\w*\s+" matches "Débriefing". Thanks!
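For anyone who wants to check the effect of "(?U)" outside of KNIME, here is a minimal Java sketch. It assumes the expressions are evaluated by java.util.regex and that a term has to match the expression as a whole; the class and method names are made up for the example, and the input strings are simplified stand-ins for what the tagger actually sees.

```java
import java.util.regex.Pattern;

public class HashtagRegexDemo {
    // True if the whole term matches the regex (full match, not a substring search).
    public static boolean matchesTerm(String regex, String term) {
        return Pattern.compile(regex).matcher(term).matches();
    }

    public static void main(String[] args) {
        String accented = "#D\u00e9briefing"; // "#Débriefing"
        String plain = "#Debriefing";

        // Default \w is ASCII-only, so the accented term fails.
        System.out.println(matchesTerm("#\\w+", plain));          // true
        System.out.println(matchesTerm("#\\w+", accented));       // false

        // (?U) turns on Unicode character classes, so \w also covers 'é'.
        System.out.println(matchesTerm("(?U)#\\w+", accented));   // true

        // The multi-term variant from this thread, on "# Débriefing ".
        System.out.println(matchesTerm("(?U)#\\s*\\w*\\s+", "# D\u00e9briefing ")); // true
    }
}
```

Note that the multi-term expression ends in \s+, so it needs at least one whitespace character after the word in order to match.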