How to find agglomeration areas through DBScan?

Fatih97 · March 28, 2022, 5:23am

Hi,

I have a dataset of customers with their names, postal codes, lat and lon.
I want to find agglomeration areas.
These customers are in Germany.
With DBSCAN, I want to say that if at least 5 customers exist within a radius of 10 km, then it is an agglomeration area.
My main problem is that the customer locations are given as lat and lon, so I can't set the radius epsilon correctly.

How can I solve this problem?

armingrudd · March 29, 2022, 8:58am

Hi @Fatih97,

You can use the Latitude/Longitude to Coordinate node from the Palladian extension to convert the longitude and latitude values to coordinates. Then, using Geo Distances, you can calculate the distance values to pass to the DBSCAN node.
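
For readers who want to see what such a distance calculation does, here is a minimal Python sketch of a pairwise haversine distance matrix in kilometers. It is not the Palladian implementation; the function names, the Earth radius constant, and the sample coordinates are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_matrix(points):
    """Full symmetric matrix of pairwise distances for (lat, lon) tuples."""
    return [[haversine_km(*p, *q) for q in points] for p in points]


# Hypothetical customers: two in Berlin, one in Munich.
customers = [(52.5200, 13.4050), (52.5300, 13.4100), (48.1351, 11.5820)]
matrix = distance_matrix(customers)
```

Because the matrix is already in kilometers, a radius such as 10 can be compared against its entries directly.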

With DBSCAN, a cluster may contain data points that are farther apart than epsilon (10 km here), because points are linked to each other through chains of neighbors. You may need to split such clusters afterwards.
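
The chaining effect can be reproduced with a small, self-contained DBSCAN sketch over a precomputed distance matrix. This is a textbook-style version written for illustration, not the KNIME node's code; the positions and the min_pts value of 3 are made up, and counting a point as its own neighbor is an assumption.

```python
def dbscan(dist, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1

    def neighbors(i):
        # Assumption: a point counts as its own neighbor.
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:
                queue.extend(nb)  # j is a core point, so keep expanding
    return labels


# Hypothetical customers on a straight line, 8 km apart.
positions_km = [0, 8, 16, 24, 32]
dist = [[abs(a - b) for b in positions_km] for a in positions_km]
labels = dbscan(dist, eps=10, min_pts=3)
print(labels)  # [0, 0, 0, 0, 0]
```

All five points end up in one cluster even though the first and last are 32 km apart, which is why a cluster can be wider than epsilon.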

I suggest using Hierarchical Clustering (DistMatrix) and Hierarchical Cluster Assigner instead. In this approach you need to calculate the normalized value of your desired radius based on the max distance.
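
As a rough sketch of that normalization, assuming it means dividing the desired radius by the largest distance in the matrix (so that distances are scaled to the range 0 to 1), the threshold could be computed as below. The function name and the sample matrix are hypothetical; check the node description for the exact definition it expects.

```python
def normalized_threshold(dist, radius_km):
    """Scale a radius in km to the 0..1 range of a max-normalized matrix."""
    max_dist = max(max(row) for row in dist)
    if max_dist == 0:
        raise ValueError("all points coincide; cannot normalize")
    return radius_km / max_dist


# Hypothetical distance matrix in km; the largest distance is 500 km.
dist = [
    [0, 4, 500],
    [4, 0, 498],
    [500, 498, 0],
]
threshold = normalized_threshold(dist, radius_km=10)  # 10 / 500
```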

Here is an example workflow:
41060.knwf (16.7 KB)

Fatih97 · March 29, 2022, 9:33am

Thank you, but the nodes are not free…

armingrudd · March 29, 2022, 9:34am

Palladian nodes are free on KNIME Analytics Platform.

Fatih97 · March 29, 2022, 9:34am

I can't install them. How can I install them for free?

armingrudd · March 29, 2022, 9:36am

Follow the instructions:

Fatih97 · March 29, 2022, 10:16am

Thank you, it works.
How would you recommend plotting the dataset in a scatter plot?
How should I choose the axes?
I want to show the agglomeration areas of the customers.

armingrudd · March 29, 2022, 10:26am

You can pick a base long and lat (e.g. your store coordinates) to serve as 0,0. Then you can calculate the distances (x,y) from that base for each customer and display them in a scatter plot.
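
One way to get such x,y values is an equirectangular approximation around the base point, which is reasonable over distances like those within Germany. The sketch below is an illustration under that assumption; the function name, Earth radius constant, and coordinates are not from the workflow.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def to_local_km(lat, lon, base_lat, base_lon):
    """Approximate east (x) and north (y) offsets in km from a base point."""
    x = (EARTH_RADIUS_KM * math.radians(lon - base_lon)
         * math.cos(math.radians(base_lat)))
    y = EARTH_RADIUS_KM * math.radians(lat - base_lat)
    return x, y


# Hypothetical base location (e.g. a store) and two customers.
base = (52.52, 13.405)
north = to_local_km(52.62, 13.405, *base)  # 0.1 degrees north of the base
east = to_local_km(52.52, 13.505, *base)   # 0.1 degrees east of the base
```

Plotting x against y then gives a scatter plot whose axes are both in kilometers.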

Fatih97 · March 29, 2022, 10:29am

OK, I will try it and reply to you tomorrow.
I think it will take until tomorrow.
Thank you very much.

armingrudd · March 29, 2022, 10:40am

For a better representation, I suggest using Map Viewer or OSM Map View. If you use a Color Manager node before these nodes, you can color the points based on their clusters.

Fatih97 · March 29, 2022, 10:44am

Yes, I did. I used the OSM Map View.
But I also wanted a scatter plot.
Wouldn't it be better to put the “number of values in the clusters” on the X-axis and the coordinates on the Y-axis to show agglomerations?

armingrudd · March 29, 2022, 10:51am

If you want to represent the clusters and the number of instances in each, you can group the rows by cluster, count them, and use the output for Bubble Chart (Plotly). You can use the mean value of lat and long as x and y for each cluster (or the distance between the mean and your base location).

41060_extra.knwf (25.5 KB)
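
Outside KNIME, the same aggregation can be sketched in a few lines of Python: group by cluster label, count the members, and average lat and lon. The row layout and names here are assumptions for illustration, not the contents of the attached workflow.

```python
from collections import defaultdict


def summarize_clusters(rows):
    """rows: iterable of (cluster, lat, lon). Returns per-cluster count and means."""
    groups = defaultdict(list)
    for cluster, lat, lon in rows:
        groups[cluster].append((lat, lon))
    summary = {}
    for cluster, pts in groups.items():
        n = len(pts)
        summary[cluster] = {
            "count": n,  # bubble size
            "mean_lat": sum(p[0] for p in pts) / n,  # bubble y
            "mean_lon": sum(p[1] for p in pts) / n,  # bubble x
        }
    return summary


# Hypothetical clustered customers.
rows = [(0, 52.0, 13.0), (0, 52.2, 13.4), (1, 48.1, 11.6)]
summary = summarize_clusters(rows)
```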

Fatih97 · March 29, 2022, 11:05am

I didn't use hierarchical clustering. Do I still need the “Group By” node, as in your 41060_extra workflow?

armingrudd · March 29, 2022, 11:07am

Yes, if you want to count the number of cluster members, you can set the cluster column as the grouping column and use “Count” on any column.

Fatih97 · March 29, 2022, 11:15am

Where should I connect this node?
I tried after the DBSCAN node and after the Scatter Plot.
Which would be correct?

armingrudd · March 29, 2022, 12:25pm

Anywhere after the Color Manager node would do. You just need the colors to be the same.

Fatih97 · March 29, 2022, 12:28pm

Yes I did. It works.
Do you know which unit epsilon in DBSCAN uses with the haversine formula?
E.g. does epsilon = 2 mean 2 kilometers, 200 meters, or 2 meters?

armingrudd · March 29, 2022, 12:30pm

The Geo Distance node produces distances in kilometers, as described in the node description. So 2 means 2 km.

Fatih97 · March 29, 2022, 12:32pm

You are the best. Thank you very much
