I am working with a text corpus in which spaces are missing between some words, e.g.
"The spaceship hovered over Reno NevadaUSA"
(there should have been a space between Nevada and USA)
I am using the Dictionary Tagger on the documents in this corpus. I have "Reno Nevada" in the dictionary, but since there is no space between Nevada and USA in the original text, "Reno Nevada" doesn't get tagged.
Hi @saqib,
This is tough, as NevadaUSA is treated as a single token. Could you do some preprocessing and simply insert a space wherever an uppercase letter directly follows a lowercase letter? That pattern should rarely occur intentionally, right?
Kind regards
Alexander
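A minimal sketch of the preprocessing suggested above, written in plain Python rather than as a specific KNIME node. The function name split_fused_words is hypothetical, and the pattern only covers ASCII letters; it is meant as an illustration, not a tested solution for the whole corpus.

```python
import re

# Zero-width pattern: matches the position between a lowercase letter
# and an uppercase letter that directly follows it.
LOWER_UPPER_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")

def split_fused_words(text: str) -> str:
    """Insert a space at every lowercase-to-uppercase boundary (hypothetical helper)."""
    return LOWER_UPPER_BOUNDARY.sub(" ", text)

print(split_fused_words("The spaceship hovered over Reno NevadaUSA"))
```

Note that this would also split intentional mixed-case words such as "iPhone" or "McDonald", so it may be worth checking how often those appear before applying it everywhere.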
Is the issue specific to locations, like your examples?
Perhaps you could use a table of state names and loop through them, replacing each name with a space-padded version, for example "Nevada" with " Nevada "?
The only issue I can imagine is with city names like "KansasCity", which would get split as well.
But I suppose you could then go back and change “Kansas City” to “KansasCity” again…
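The idea in this reply could be sketched roughly as follows, again in plain Python. The two-entry state list, the PROTECTED mapping, and the function name pad_state_names are all made up for illustration; a real table of state names and exceptions would have to come from your own data.

```python
import re

# Hypothetical sample data: a real run would use a full table of state names.
STATE_NAMES = ["Nevada", "Kansas"]
# Names that should stay fused, keyed by the form they take after padding.
PROTECTED = {"Kansas City": "KansasCity"}

def pad_state_names(text, states=STATE_NAMES, protected=PROTECTED):
    """Pad each state name with spaces, then restore protected names (hypothetical helper)."""
    for state in states:
        text = text.replace(state, f" {state} ")
    # Padding can create double spaces; collapse them and trim the ends.
    text = re.sub(r" {2,}", " ", text).strip()
    # Go back and undo the split for names that were meant to be fused.
    for split_form, fused in protected.items():
        text = text.replace(split_form, fused)
    return text

print(pad_state_names("The spaceship hovered over Reno NevadaUSA"))
print(pad_state_names("We drove to KansasCity"))
```

One caveat with this sketch: a state name followed by punctuation, as in "Nevada,", ends up with a space before the comma, so some extra cleanup may be needed.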