## # SUBCLU by example

SUBCLU1 is a subspace clustering algorithm based on DBSCAN2 and A-priori3.

Assume the following small exemplary data set with three dimensions. DBSCAN is parameterized with eps=3 and minpts=2 (including the core point).

``````2 10 2
1 20 1
9 30 9
8 40 9
``````

The algorithm is bottom-up, so let's start with singular dimensions 1 to 3. D(x,y) is the euclidean distance between two points.

Dimension 1 (points: 2, 1, 9, 8): D(2,1) = 1 < eps. Since we have two points, 2 is a core point. Same can be said about 1. So, the first cluster consists of 2 and 1. Same can be applied to 9 and 8. So dimension 1 contains two clusters.

Dimension 2 (points: 10, 20, 30, 40): All points are too far away from each other (distance is at least sqrt(10) between each pair), so no cluster can be found here.

Dimension 3 (points: 2, 1, 9, 9): Analogous to dimension 1. Two clusters.

Next, we generate the k+1 subspace candidates (where k=1, since we just did singular dimensions). Keep A-priori in mind: The k+1 candidates need to have k-1 attributes in common. Not relevant for 2 dimensions, but will be relevant for 3 dimensions later.

So here we have the following combinations of dimensions: 1,2; 1,3; 2,3. We may now prune some of those candidates using the A-priori principle, i.e. if k dimensions of a given candidate do not contain a cluster, that candidate will also not contain a cluster in k+1 dimensions. For proofs, refer to the original paper1. In this case, with k=1, dimension 2 has no clusters and thus both the candidates 1,2 and 2,3 can be pruned.

Next, we apply DBSCAN on subspace 1,3 (points: (2,2), (1,1), (9,9), (8,9)). D((2,2), (1,1)) = D((1,1), (2,2)) = sqrt(2) < eps. That's a cluster with points (2,2) and (1,1) (since we have 2 points and minpts=2). D((9,9), (8,9)) = D((8,9), (9,9)) = 1 < eps. Again, another cluster.

To generate three dimensions, we would need more candidates. For example, with 1,2 and 1,3 one could generate 1,2,3 (since both share the first dimension: 1). However, currently the only 2-dimensional candidate with clusters is 1,3. Thus, we stop here.

Result: The two dimensions 1 and 3 contain two subspace clusters

``````2 2
1 1
``````

and

``````9 9
8 9
``````

References:

1. Density-Connected Subspace Clustering for High-Dimensional Data, Kailing et al., 2004

2. A Density-Based Algorithm for Discovering Clusters, Ester et al., 1996

3. Fast Algorithms for Mining Association Rules, Agrawal et al., 1994

Published on