Text with missing spaces and Dictionary Tagger

#1

Hello,

I am working with a text corpus that has spaces missing between words, e.g.:

"The spaceship hovered over Reno NevadaUSA"
(there should have been a space between Nevada and USA)

I am using the Dictionary Tagger on the documents in this corpus. I have "Reno Nevada" in the dictionary, but because there is no space between "Nevada" and "USA" in the original text, "Reno Nevada" doesn’t get tagged.

Any ideas on how I can get entries like this tagged?

Thanks,
Saqib

#2

Hi @saqib,
this is tough, as NevadaUSA is treated as a single token. Could you do some preprocessing and simply separate letters where an uppercase letter follows a lowercase letter? This should not occur too often intentionally, right?
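
A minimal sketch of that preprocessing step, written in plain Python rather than as a KNIME node configuration. The function name `split_on_case_change` is made up for illustration; only the standard `re` module is used.

```python
import re

# Zero-width match: a position preceded by a lowercase letter
# and followed by an uppercase letter.
CASE_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def split_on_case_change(text: str) -> str:
    """Insert a space wherever an uppercase letter directly follows a lowercase one."""
    return CASE_BOUNDARY.sub(" ", text)


print(split_on_case_change("The spaceship hovered over Reno NevadaUSA"))
# The spaceship hovered over Reno Nevada USA
```

Note that this rule is purely case-based, so it would also split any name that legitimately contains a lowercase-to-uppercase change.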
Kind regards
Alexander

#3

Hi @AlexanderFillbrunn,

That’s what I had in mind earlier. But the issue with that approach is that "McAllen, Texas" becomes "Mc Allen, Texas", which is incorrect. :frowning_face:

Saqib

#4

Hi @saqib, maybe I can jump in as well.

Is the issue specific to locations, like your examples?
Perhaps you could use a table of state names and loop through them, replacing the string “Nevada” with “ Nevada ”, for example?

The only issue I can imagine with that is city names like “KansasCity”.
But I suppose you could then go back and change “Kansas City” to “KansasCity” again… :thinking:
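
Roughly, the loop could look like the sketch below. It is plain Python, not a KNIME workflow; `PLACE_NAMES` is a tiny hypothetical stand-in for a real table of names, and `pad_place_names` is a made-up helper name.

```python
import re

# Hypothetical stand-in for a full table of state names.
PLACE_NAMES = ["Nevada", "Texas", "Kansas"]


def pad_place_names(text: str, names=PLACE_NAMES) -> str:
    """Surround every known name with spaces, then collapse repeated whitespace."""
    for name in names:
        text = text.replace(name, f" {name} ")
    # Padding can leave double spaces or leading/trailing spaces; tidy them up.
    return re.sub(r"\s+", " ", text).strip()


print(pad_place_names("The spaceship hovered over Reno NevadaUSA"))
# The spaceship hovered over Reno Nevada USA
```

Because the match is a plain substring replacement, a name followed directly by punctuation ends up with a space before the punctuation mark, and any name that should stay fused would still need the repair pass mentioned above.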

#5

Hi @Corey,

I am only dealing with location data; however, it is global geolocation data, not just US state names…

#6

You could use the Dictionary Tagger to protect exceptions such as "Mc" names first, and then separate the rest wherever the case changes.
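
The same protect-then-split idea can be sketched outside KNIME in plain Python. This is an illustration under assumptions, not the Dictionary Tagger's actual behavior: `PROTECTED` is a hypothetical exception list, and the NUL-delimited placeholders assume the input text contains no NUL characters.

```python
import re

# Hypothetical exception list; in KNIME a dictionary would play this role.
PROTECTED = ["McAllen"]

# Split where a lowercase letter (or the end of a placeholder) is directly
# followed by an uppercase letter (or the start of a placeholder).
BOUNDARY = re.compile(r"(?<=[a-z\x00])(?=[A-Z\x00])")


def split_with_exceptions(text: str, protected=PROTECTED) -> str:
    """Split on case changes while leaving protected terms untouched."""
    # 1. Hide each protected term behind a letter-free placeholder.
    placeholders = {}
    for i, term in enumerate(protected):
        key = f"\x00{i}\x00"  # assumes the text itself contains no NUL characters
        placeholders[key] = term
        text = text.replace(term, key)
    # 2. Insert spaces at the remaining case changes.
    text = BOUNDARY.sub(" ", text)
    # 3. Put the protected terms back.
    for key, term in placeholders.items():
        text = text.replace(key, term)
    return text


print(split_with_exceptions("The spaceship hovered over McAllen, TexasUSA"))
# The spaceship hovered over McAllen, Texas USA
```

For global location data the exception list would have to cover every name with an internal capital, so its completeness decides how well this works.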
