A note on DBSCAN for 2-dimensional data

# A note on DBSCAN for 2-dimensional data

This is a short note on the parameters of DBSCAN for 2-dimensional data, e.g. FBC spectrum scans.

The data is a vector of amplitudes (in dB) of length 8704. DBSCAN needs to find neighbors for each point and thus needs a distance metric between two vectors.

The default metric used by scipy's DBSCAN is euclidean, i.e. each point qi of vector1 is compared to point pi of vector2 in the following manner:

\sqrt{\sum_{i = 1}^{n}(q_i - p_i)^2}

The eps parameter of DBSCAN is exactly that value above (when using the euclidean metric). Now, to find a sensible eps, a simple rough calculation shows, given that 5 dB is the maximum distance allowed between two points to be neighbors: (8704 * 5**2) ** .5 ~= 466. This is a first candidate for a good eps value.

Empirically it can be seen that any value between 100 and 2000 works fine for vectors of length 8704.

As for the second important parameter, min_samples, the following can be said. First, a core point needs to be defined. This is a point which has more than min_samples neighbors in its range (restricted by eps). The important thing to remember here is that only the neighbors of core points are considered in each iteration as candidates for expansion of a cluster. So if min_samples is huge, the clusters will be smaller and more numerous and vice versa. In conclusion, this parameter is the one to tune in terms of the resulting number of clusters. It can be compared to the parameter k of kmeans.

Published on 2019-11-21