I have a dataset from customers with their names, postal codes, lat and lon.
I want to find agglomeration areas.
These customers are in Germany.
With DBSCAN, I want to say: if 5 customers exist within a radius of 10 km, then it is an agglomeration area.
My main problem is that the customers are given with lat and lon, so I can't set the correct radius epsilon.
How can I solve this problem?
You can use the Latitude/Longitude to Coordinate node from the Palladian extension to convert the longitude and latitude values to coordinates. Then, using Geo Distances, you can calculate the distance values to pass to the DBSCAN node.
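For anyone who wants to check the distances outside KNIME, here is a minimal Python sketch of the haversine formula that builds a pairwise distance matrix in kilometers. The Earth radius constant, the function names, and the sample coordinates are my own assumptions for illustration, not taken from the Palladian nodes.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not a Palladian value


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_matrix(points):
    """Symmetric matrix of pairwise distances (km) for a list of (lat, lon) tuples."""
    n = len(points)
    return [[haversine_km(*points[i], *points[j]) for j in range(n)] for i in range(n)]


# Hypothetical customers: roughly Berlin, Potsdam, Munich
customers = [(52.5200, 13.4050), (52.3906, 13.0645), (48.1351, 11.5820)]
dist = distance_matrix(customers)
```

Because the matrix is already in kilometers, an epsilon of 10 applied to it would mean 10 km.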
With DBSCAN, you may get clusters containing data points that are farther apart than epsilon (10 km here). You may need to divide the clusters afterwards.
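To illustrate why this happens, here is a small textbook-style DBSCAN in plain Python working on a precomputed distance matrix. It is only a sketch and not the implementation of the KNIME node. The customer positions and the min_pts value of 3 are made up for the example.

```python
def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN on a precomputed distance matrix.

    Returns one label per point: 0, 1, ... for clusters, -1 for noise.
    """
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual formulation.
        return [j for j in range(n) if dist[i][j] <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:
                queue.extend(nb)  # j is a core point, so keep expanding


    return labels


# Hypothetical customers along a straight road, positions in km.
positions_km = [0, 8, 16, 24, 32]
dist = [[abs(a - b) for b in positions_km] for a in positions_km]

labels = dbscan(dist, eps=10, min_pts=3)
widest = max(max(row) for row in dist)
print(labels)  # all five points end up in the same cluster
print(widest)  # 32 km between the two ends, well above eps
```

Each customer is within 10 km of the next one, so the cluster grows along the chain even though its two ends are 32 km apart.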
I suggest using Hierarchical Clustering (DistMatrix) and Hierarchical Cluster Assigner instead. In this approach you need to calculate the normalized value of your desired radius based on the max distance.
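As a rough sketch of that normalization: I am assuming here that "normalized" means dividing the desired radius by the largest pairwise distance, so that the threshold lies between 0 and 1. The example workflow may do it differently. The function name and the sample matrix are hypothetical.

```python
def normalized_threshold(dist, radius_km):
    """Express a radius in km as a fraction of the largest pairwise distance."""
    max_dist = max(max(row) for row in dist)
    if max_dist == 0:
        raise ValueError("all points coincide, so there is nothing to normalize")
    return radius_km / max_dist


# Hypothetical distance matrix in km for three customers.
dist = [
    [0, 10, 40],
    [10, 0, 30],
    [40, 30, 0],
]
threshold = normalized_threshold(dist, radius_km=10)
print(threshold)  # 0.25, because 10 km is a quarter of the 40 km maximum
```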
Here is an example workflow:
41060.knwf (16.7 KB)
Thank you, but the nodes are not free…
Palladian nodes are free on KNIME Analytics Platform.
I can't install them. How can I install them for free?
Thank you, it works.
How would you suggest plotting the dataset in a scatter plot?
How should I choose the axes?
I want to show the agglomeration areas of the customers.
You can pick a base longitude and latitude (e.g. your store coordinates) to serve as (0, 0). Then you can calculate the distances (x, y) for each customer and display them in a scatter plot.
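One way to get such x and y values is a simple flat-Earth (equirectangular) approximation around the base point, which should be reasonable at the scale of a region or a country like Germany. This is only a sketch. The function name, the Earth radius, and the coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def to_local_xy_km(lat, lon, base_lat, base_lon):
    """Approximate east (x) and north (y) offsets in km from a base point.

    Uses an equirectangular approximation, so the error grows with
    distance from the base point.
    """
    x = EARTH_RADIUS_KM * math.radians(lon - base_lon) * math.cos(math.radians(base_lat))
    y = EARTH_RADIUS_KM * math.radians(lat - base_lat)
    return x, y


# Hypothetical base (store) near Berlin and one customer near Potsdam.
base = (52.5200, 13.4050)
customer = (52.3906, 13.0645)

x, y = to_local_xy_km(customer[0], customer[1], base[0], base[1])
print(round(x, 1), round(y, 1))  # negative values: customer lies south-west of the base
```

The x values go on the horizontal axis and the y values on the vertical axis of the scatter plot, both in kilometers.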
Ok, I will try and reply to you tomorrow
I think it will take time until tomorrow
thank you very much
For better representation I suggest using Map Viewer or OSM Map View. If you use a Color Manager node before these nodes, you can colorize the points based on clusters.
Yes, I did. I used the OSM Map View.
But I also wanted a scatter plot.
Wouldn't it be better to put the “number of values in the clusters” on the X-axis and the coordinates on the Y-axis to show agglomerations?
If you want to represent the clusters and the number of instances, you can aggregate the values, count them per cluster, and use the output for a Bubble Chart (Plotly). You can use the mean value of lat and lon as x and y for each cluster (or the distance between the mean and your base location).
41060_extra.knwf (25.5 KB)
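The aggregation described above can also be sketched in plain Python: count the members of each cluster and take the mean lat and lon as the bubble position. The row layout (cluster label, lat, lon), the function name, and the sample values are hypothetical.

```python
from collections import defaultdict


def summarize_clusters(rows):
    """Count members and average the coordinates per cluster.

    rows: iterable of (cluster_label, lat, lon) tuples.
    """
    groups = defaultdict(list)
    for label, lat, lon in rows:
        groups[label].append((lat, lon))

    summary = {}
    for label, pts in groups.items():
        n = len(pts)
        summary[label] = {
            "count": n,  # bubble size
            "mean_lat": sum(p[0] for p in pts) / n,  # bubble y
            "mean_lon": sum(p[1] for p in pts) / n,  # bubble x
        }
    return summary


# Hypothetical clustered customers.
rows = [
    ("cluster_0", 52.0, 13.0),
    ("cluster_0", 53.0, 14.0),
    ("cluster_1", 48.0, 11.0),
]
summary = summarize_clusters(rows)
print(summary["cluster_0"])  # {'count': 2, 'mean_lat': 52.5, 'mean_lon': 13.5}
```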
I didn't use hierarchical clustering. Do I need the “Group By” node like in your 41060_extra workflow?
Yes, if you want to count the number of cluster members, you can set the cluster column as the grouping column and use “Count” on any column.
Where should I connect this node?
I tried after the DBSCAN node and after the Scatter Plot.
Which would be correct?
Anywhere after the Color Manager node would do. You just need the colors to be the same.
Yes I did. It works.
Do you know which unit epsilon in DBSCAN uses with the haversine formula?
E.g. does epsilon=2 mean 2 kilometers, 200 meters, or 2 meters?
The Geo Distance node produces distances in kilometers, as described in the node description. So 2 means 2 km.
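As a side note, in case the distances ever come from a tool that reports haversine results as angles in radians on a unit sphere instead of kilometers, epsilon would have to be converted first. A small sketch of that conversion, assuming a mean Earth radius of 6371 km (the helper name is hypothetical):

```python
EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def km_to_radians(distance_km):
    """Convert a surface distance in km to the matching central angle in radians."""
    return distance_km / EARTH_RADIUS_KM


eps_km = 2.0
eps_rad = km_to_radians(eps_km)
print(eps_rad)  # roughly 0.000314
```

With the kilometer output of the Geo Distance node, no such conversion is needed.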
You are the best. Thank you very much