[Palladian] Optimize Phone Number Formatter

Hi @qqilihq,

I’d like to propose a few optimizations for the Phone Number Formatter Node:

  1. Some area / city codes are recognized, others are not (I have a few German landline numbers that I can share in private)
  2. Ability to not assume a Default Region
  3. Ability to extract the country name / ISO code from the recognized country code, e.g. 0049, +49, (0049)
  4. Ability to check the validity of the recognized country and area / city code

The primary goal would be to identify parts of the data, such as the country and area / city, or whether it’s a landline or mobile number (or unknown?).
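
As a rough illustration of the country-code extraction idea, here is a minimal Python sketch. It is not how the node works: the function name guess_country and the tiny lookup table are made up for this example, and a real implementation would need the complete list of calling codes.

```python
import re

# Illustrative subset only (assumption): a real table would cover every
# calling code, which can be 1 to 3 digits long.
CALLING_CODE_TO_ISO = {
    "49": "DE",
    "43": "AT",
    "41": "CH",
    "33": "FR",
    "44": "GB",
}

# Accepts prefixes written as +49, 0049 or (0049).
PREFIX = re.compile(r"^\s*\(?(?:\+|00)(\d+)")


def guess_country(number):
    """Return (calling_code, iso_code), or None if no known prefix is found."""
    match = PREFIX.match(number)
    if match is None:
        return None
    digits = match.group(1)
    # Try the longest candidate first, since calling codes vary in length.
    for length in (3, 2, 1):
        code = digits[:length]
        if code in CALLING_CODE_TO_ISO:
            return code, CALLING_CODE_TO_ISO[code]
    return None


print(guess_country("(0049) 30 123456"))
print(guess_country("030 123456"))
```

A number without an international prefix, like the second one, returns None instead of falling back to a default region.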

I also implemented a few optimizations to tackle poor data quality which you might consider adding too:

  1. Replace the letter o by zero
  2. Remove HTML entities: &[^;]+; by an empty string
  3. Remove duplicated country codes: ^(\+\d+)\s?\1 to $1
  4. Remove duplicated country codes: ^\+(\d+)\s?00\1 to +$1
  5. Harmonize country codes (##) to +##: ^\((\d{2})\)\s? to +$1
  6. Harmonize country codes (00##) to +##: ^\(00(\d{2})\)\s? to +$1
  7. Harmonize country codes 00## to +##: ^00(\d{2})\s? to +$1
  8. Fix country codes that start with neither 0 nor +: ^([^0+]) to +$1
    Note: I am not sure this avoids false positives, since some area / city codes, e.g. in the US, do not start with a zero
  9. Remove (0): \s?\(0\) by an empty string
  10. Replace - and / by a space: [-/] by a single space
  11. Collapse multiple whitespace characters into one: \s{2,} by a single space
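
For reference, the rules above can be sketched in Python as follows. This is only an approximation of the workflow, not the workflow itself: the name clean_number is hypothetical, the pattern for rule 1 is an assumption (the list gives no regex for it), and Python writes back-references as \1 where the list uses $1.

```python
import re

# (pattern, replacement) pairs, applied in the order of the list above.
RULES = [
    (r"[oO]", "0"),                  # 1. letter o -> zero (pattern is an assumption)
    (r"&[^;]+;", ""),                # 2. drop HTML entities such as &nbsp;
    (r"^(\+\d+)\s?\1", r"\1"),       # 3. "+49 +49" -> "+49"
    (r"^\+(\d+)\s?00\1", r"+\1"),    # 4. "+49 0049" -> "+49"
    (r"^\((\d{2})\)\s?", r"+\1"),    # 5. "(49) 30" -> "+4930"
    (r"^\(00(\d{2})\)\s?", r"+\1"),  # 6. "(0049) 30" -> "+4930"
    (r"^00(\d{2})\s?", r"+\1"),      # 7. "0049 30" -> "+4930"
    (r"^([^0+])", r"+\1"),           # 8. prepend + unless the first char is 0 or +
    (r"\s?\(0\)", ""),               # 9. drop "(0)"
    (r"[-/]", " "),                  # 10. - and / -> space
    (r"\s{2,}", " "),                # 11. collapse runs of whitespace
]


def clean_number(raw):
    """Apply every cleanup rule in sequence and return the result."""
    cleaned = raw
    for pattern, replacement in RULES:
        cleaned = re.sub(pattern, replacement, cleaned)
    return cleaned


print(clean_number("+49 +49 (0) 30-123/456"))
print(clean_number("212 555 0100"))
```

The second call shows the concern from the note on rule 8: a US-style number without a country code comes out as +212 555 0100, so its area code is then read as a country code.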

If you like, I can send you that part of the workflow.

Best
Mike

Hey Mike - sorry I missed replying here. If you can share the details you mentioned, I will have a look at where we can optimize!

Thanks, Philipp