Optimizing DBSCAN Clustering for geolocated crime hotspots in Boston

Foxyellow · December 10, 2024, 8:27am

Hi everyone,

I’m analyzing crime data in Boston, where each crime is geolocated with GPS coordinates. Using the DBSCAN node in KNIME, I’ve created clusters with an epsilon of 0.4. While this produces clusters, I’m facing some challenges with their size and coherence.

Objective:
The aim is to identify geographically compact crime hotspots that provide meaningful insights for cross-referencing with additional datasets.

Problem:

Some clusters stretch over large distances because of the “chaining effect” in DBSCAN, where points are indirectly linked via intermediate points. This results in clusters that are too large to be useful for identifying specific hotspots.
When I reduce the epsilon to create smaller, more compact clusters, many points are left unclustered even though they are close to others. This leaves gaps in the analysis, as many valid clusters are no longer formed.
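
To make the two symptoms above concrete, here is a minimal pure-Python sketch of DBSCAN run on synthetic points. It is not the KNIME node's implementation; the `dbscan` function, the `min_pts` value, and the evenly spaced test points are assumptions chosen only to illustrate the trade-off.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN on 2-D points; returns one label per point (-1 = noise)."""
    def neighbors(i):
        # Neighborhood includes the point itself, as in the usual definition.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:   # only core points keep expanding the cluster
                queue.extend(nb)
    return labels

# Ten synthetic points in a line, each 0.3 units from the next (hypothetical data).
chain = [(0.3 * i, 0.0) for i in range(10)]

labels_wide = dbscan(chain, eps=0.4, min_pts=3)     # neighbors link up end to end
labels_narrow = dbscan(chain, eps=0.25, min_pts=3)  # no point reaches its neighbor

span = max(x for x, _ in chain) - min(x for x, _ in chain)
print("eps=0.40:", labels_wide, "cluster span:", round(span, 2))
print("eps=0.25:", labels_narrow)
```

With the larger epsilon, every point lands in one cluster whose span is several times epsilon, because each point is only required to be close to its neighbor, not to the whole cluster. With the smaller epsilon, the same points all become noise.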

Included:
I’ve attached screenshots of the clusters to illustrate the issue. The images show how clusters extend over several kilometers.

Question:
Is there a way to refine these clusters or post-process the results in KNIME to achieve geographically compact clusters while retaining as many clustered points as possible? Any advice or example workflows to address this balance would be greatly appreciated.

Thank you for your help!

Boston_Crimes.knwf (17.3 KB)

thor_landstrom · December 17, 2024, 4:18pm

Hello @Foxyellow,

I downloaded the attached workflow, but it seems to be incomplete. One idea is to group by cluster and calculate a shape ratio for each one. For example, take the max and min of longitude and latitude within each cluster, subtract them to get the cluster's extent in each direction, then divide the larger extent by the smaller. In theory, a stretched cluster should have a much higher ratio than a compact one, so you could filter on this metric and flag any cluster whose ratio is greater than some threshold. Alternatively, you could do a density calculation per cluster and go off of that. You could then run something like k-means on only the stretched clusters to see if that fixes the issue.
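
A rough sketch of that per-cluster metric in plain Python, which could be adapted to a KNIME Python Script node or rebuilt with GroupBy and Math Formula nodes. The function names, the thresholds (`max_aspect`, `max_side_km`), the degree-to-kilometre approximation, and the sample rows are all assumptions for illustration, not values taken from the attached workflow.

```python
import math
from collections import defaultdict

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude in km
MIN_SIDE_KM = 0.01       # floor so single-point or perfectly flat clusters do not divide by zero

def cluster_extents(rows):
    """rows: iterable of (cluster_id, lat, lon). Returns {cluster_id: stats dict}."""
    groups = defaultdict(list)
    for cid, lat, lon in rows:
        groups[cid].append((lat, lon))

    stats = {}
    for cid, pts in groups.items():
        lats = [p[0] for p in pts]
        lons = [p[1] for p in pts]
        mean_lat = sum(lats) / len(lats)
        # Convert degree spans to km; a degree of longitude shrinks with cos(latitude).
        height_km = (max(lats) - min(lats)) * KM_PER_DEG_LAT
        width_km = (max(lons) - min(lons)) * KM_PER_DEG_LAT * math.cos(math.radians(mean_lat))
        long_side = max(height_km, width_km, MIN_SIDE_KM)
        short_side = max(min(height_km, width_km), MIN_SIDE_KM)
        stats[cid] = {
            "n": len(pts),
            "width_km": width_km,
            "height_km": height_km,
            "aspect": long_side / short_side,                # larger extent over smaller
            "density": len(pts) / (long_side * short_side),  # points per km^2 of bounding box
        }
    return stats

def stretched_clusters(stats, max_aspect=3.0, max_side_km=1.0):
    """Return ids of clusters that are elongated or simply too large (assumed thresholds)."""
    return sorted(
        cid for cid, s in stats.items()
        if s["aspect"] > max_aspect or max(s["width_km"], s["height_km"]) > max_side_km
    )

# Hypothetical sample: cluster 0 is compact, cluster 1 runs east-west for several km.
sample = [
    (0, 42.350, -71.060), (0, 42.351, -71.061), (0, 42.352, -71.059),
    (1, 42.340, -71.100), (1, 42.341, -71.080), (1, 42.340, -71.060), (1, 42.341, -71.040),
]

stats = cluster_extents(sample)
flagged = stretched_clusters(stats)
for cid, s in sorted(stats.items()):
    print(cid, {k: round(v, 2) for k, v in s.items()})
print("clusters to re-split:", flagged)
```

Noise rows would need to be filtered out before grouping. The flagged ids are the clusters you would pass to a second step such as k-means. A bounding-box ratio is a coarse test: it can miss a cluster that is stretched diagonally, so the absolute size check is included alongside it.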

These are just some suggestions you can explore,
TL
