Outlier detection of geodata within Spark nodes

gujodm · July 18, 2018, 9:26am

Hi,
Is there a way to detect/handle/remove outliers from geographical data (latitude/longitude)?

If I remember correctly, there was an outlier removal node that worked fine. But what if I want to do the same thing within Spark?
Is there a workaround for that?

Thanks in advance.

bjoern.lohrmann · July 23, 2018, 8:45am

Hi @gujodm

As a workaround, you can use the “Spark SQL” node and the percentile_approx Spark SQL function. For example, if you have a table with a numeric column “agep”, the following Spark SQL query will remove the rows whose “agep” value is an outlier:

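-- Keep only rows whose agep lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
-- inTable.* keeps the helper columns lower/upper out of the result.
SELECT inTable.*
FROM
  #table# AS inTable,
  (SELECT (2.5 * percentile_approx(agep, 0.25) - 1.5 * percentile_approx(agep, 0.75)) AS lower,
          (2.5 * percentile_approx(agep, 0.75) - 1.5 * percentile_approx(agep, 0.25)) AS upper
   FROM #table#) AS bound
WHERE
  inTable.agep >= bound.lower
  AND inTable.agep <= bound.upper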

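This uses the outlier definition based on the interquartile range (IQR) described here:
http://www.purplemath.com/modules/boxwhisk3.htm

With Q1 and Q3 as the 25th and 75th percentiles and IQR = Q3 - Q1, the bounds are lower = Q1 - 1.5*IQR = 2.5*Q1 - 1.5*Q3 and upper = Q3 + 1.5*IQR = 2.5*Q3 - 1.5*Q1, which is where the factors 2.5 and 1.5 in the query come from. Note that percentile_approx returns approximate percentiles, so the bounds are approximate as well.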
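
For latitude/longitude data, the same idea could be applied to both coordinates at once. The following is only a sketch: the column names latitude and longitude are assumptions (rename them to match your table), and it treats the two coordinates independently, so it keeps points inside a rectangular box of per-coordinate IQR bounds rather than measuring any real geographic distance. It also does not handle data that wraps around the ±180° meridian.

```sql
-- Sketch for the Spark SQL node; latitude/longitude are assumed column names.
-- A row is kept only if BOTH coordinates fall within their own IQR bounds.
SELECT inTable.*
FROM #table# AS inTable
CROSS JOIN (
  SELECT
    2.5 * percentile_approx(latitude, 0.25)  - 1.5 * percentile_approx(latitude, 0.75)  AS lat_lower,
    2.5 * percentile_approx(latitude, 0.75)  - 1.5 * percentile_approx(latitude, 0.25)  AS lat_upper,
    2.5 * percentile_approx(longitude, 0.25) - 1.5 * percentile_approx(longitude, 0.75) AS lon_lower,
    2.5 * percentile_approx(longitude, 0.75) - 1.5 * percentile_approx(longitude, 0.25) AS lon_upper
  FROM #table#
) AS bound
WHERE inTable.latitude  BETWEEN bound.lat_lower AND bound.lat_upper
  AND inTable.longitude BETWEEN bound.lon_lower AND bound.lon_upper
```

The inner query returns a single row of bounds, so the cross join just attaches those four values to every input row before filtering. If the points form several separate clusters (for example, different cities), per-coordinate bounds may be too coarse, and a clustering-based approach might be a better fit.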

Björn

gujodm · July 24, 2018, 2:50pm

Thanks @bjoern.lohrmann,
this is an interesting approach… let me try to apply it.

~g
