Outlier detection of geodata within Spark nodes

Hi,
Is there a way to detect/handle/remove outliers from geographical data (latitude/longitude)?

If I remember correctly, there was an outlier removal node that worked fine. But what if I want to do this within Spark?
Is there a workaround for that?

Thanks in advance.

Hi @gujodm

As a workaround, you can use the “Spark SQL” node and the percentile_approx Spark SQL function. For example, if you have a table with a numeric column “agep”, the following Spark SQL query will remove the rows whose “agep” value is an outlier:

SELECT * FROM
  #table# AS inTable,
  (SELECT (2.5*percentile_approx(agep, 0.25) - 1.5*percentile_approx(agep, 0.75)) AS lower,
          (2.5*percentile_approx(agep, 0.75) - 1.5*percentile_approx(agep, 0.25)) AS upper
   FROM #table#) AS bound
WHERE
  agep >= bound.lower
  AND
  agep <= bound.upper

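This uses the interquartile-range (IQR) definition of outliers: with Q1 and Q3 as the 25th and 75th percentiles and IQR = Q3 - Q1, any value below Q1 - 1.5*IQR (which equals 2.5*Q1 - 1.5*Q3) or above Q3 + 1.5*IQR (which equals 2.5*Q3 - 1.5*Q1) is treated as an outlier. The definition is described here: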
http://www.purplemath.com/modules/boxwhisk3.htm
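
If you want to sanity-check the bounds outside Spark, here is a minimal plain-Python sketch of the same idea applied to latitude/longitude pairs. It is not Spark code: the function names and the sample points are made up for illustration, it uses exact quartiles from the standard library instead of percentile_approx (so results may differ slightly from the query), and filtering each coordinate independently is a simplifying assumption.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Exact quartiles; Spark's percentile_approx only approximates these.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # With k = 1.5 these equal 2.5*Q1 - 1.5*Q3 and 2.5*Q3 - 1.5*Q1.
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(points, k=1.5):
    """Keep (lat, lon) pairs whose latitude AND longitude lie within the fences."""
    lat_lo, lat_hi = iqr_bounds([lat for lat, _ in points], k)
    lon_lo, lon_hi = iqr_bounds([lon for _, lon in points], k)
    return [
        (lat, lon)
        for lat, lon in points
        if lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi
    ]


# Hypothetical sample: five nearby points plus one stray point at (0.0, 0.0).
points = [(45.0, 9.0), (45.1, 9.1), (45.2, 9.2), (45.3, 9.3), (45.4, 9.4), (0.0, 0.0)]
kept = drop_outliers(points)  # the stray (0.0, 0.0) point is dropped
```

In the Spark SQL node the equivalent would presumably be to compute one pair of bounds per coordinate column in the subquery and add both range conditions to the WHERE clause.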

  • Björn

Thanks @bjoern.lohrmann,
this is an interesting approach… let me try to apply it.

~g
