Hi,
Is there a way to detect/handle/remove outliers from geographical data (latitude/longitude)?
If I remember correctly, there used to be an Outlier Removal node that worked fine. But what if I want to do this within Spark?
Is there a workaround for that?
As a workaround you can use the “Spark SQL” node together with the percentile_approx Spark SQL function. For example, if your table has a numeric column “agep”, the following Spark SQL query removes outliers outside the standard IQR fences (lower fence Q1 - 1.5*IQR = 2.5*Q1 - 1.5*Q3, upper fence Q3 + 1.5*IQR = 2.5*Q3 - 1.5*Q1):
SELECT * FROM
  #table# AS inTable,
  (SELECT (2.5 * percentile_approx(agep, 0.25) - 1.5 * percentile_approx(agep, 0.75)) AS lower,
          (2.5 * percentile_approx(agep, 0.75) - 1.5 * percentile_approx(agep, 0.25)) AS upper
   FROM #table#) AS bound
WHERE
  agep >= bound.lower
  AND agep <= bound.upper
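For reference, the same IQR fence logic can be sketched in plain Python (the helper names and sample data are hypothetical; also note that Spark's percentile_approx is an approximation, while this sketch uses exact interpolated percentiles, so results can differ slightly on real data):

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already-sorted list."""
    idx = q * (len(sorted_vals) - 1)
    lo = int(idx)
    frac = idx - lo
    if lo + 1 < len(sorted_vals):
        return sorted_vals[lo] + frac * (sorted_vals[lo + 1] - sorted_vals[lo])
    return sorted_vals[lo]

def remove_outliers(values):
    """Keep only values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    s = sorted(values)
    q1 = percentile(s, 0.25)
    q3 = percentile(s, 0.75)
    lower = 2.5 * q1 - 1.5 * q3   # = Q1 - 1.5 * (Q3 - Q1)
    upper = 2.5 * q3 - 1.5 * q1   # = Q3 + 1.5 * (Q3 - Q1)
    return [v for v in values if lower <= v <= upper]

# Hypothetical sample with one obvious outlier (100)
data = list(range(1, 12)) + [100]
print(remove_outliers(data))   # the outlier 100 is dropped
```

For geographical data you would apply the same fences to the latitude and longitude columns separately (each is just a numeric column to the query).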