About the author:
Daniel is CTO at rhome GmbH, and Co-Founder at Aqarios GmbH. He holds a M.Sc. in Computer Science from LMU Munich, and has published papers in reinforcement learning and quantum computing. He writes about technical topics in quantum computing and startups.
jrnl · home about list ventures publications LinkedIn Join my Slack

# PreDeCon - Density based Projected Subspace Clustering

PreDeCon1 is a projected subspace clustering method based on DBSCAN. This will be a short post, mostly to illustrate the concept with an example, since I believe code is worth a thousand words. But the gist is to weigh the distances based on the variance of their neighbors. I.e., if your neighbors are all distributed in a line, you assign a very small weight to distances along the line and very huge weight to the perpendicular dimension. So, if a point is slightly outside of the line, it's already not part of your cluster (since the distance is weighted heavily). Often, those weights can be up to 1 to 100! So going outside of the line penalizes (=weighs more) the distance by a factor of 100.

See for yourself in the following picture of an example data set. PreDeCon will find two clusters, both the lines in the picture.

The way this works, is that the red point is NOT a core point due to our weighted clustering, even though it would be with naive DBSCAN. Thus, DBSCAN stops here, instead of continueing and adding all points to one single cluster. And why is that not a core point? Well, the high variance of the right neighbor of our red point lies actually on the vertical line, and not the horizontal line as with the other neighbor! Thus, the weight given to the horizontal distance between the red point and the right neighbor will be multiplied by 100. That's because the red point looks to the neighbor like an outlier, since it's obviously not on the vertical line. The weight given to the vertical distance would've been simply 1. But the red point and the right neighbor have a vertical distance of 0 and a horizontal distance of 1.. Thus, the weighted distance here will be 0 * 1 + 1 * 100. There's more to it, but that gets the method across.

Finally, here's Python code explaining the above concept, including all details (such as correct euclidean distance calculation and not the simplification we used).

# PreDeCon - Projected Clustering
# Paper: Density Connected Clustering with Local Subspace Preferences, Böhm et al., 2004
# This code is not optimized.
import numpy as np


def get_neighbor(X, candidate):
    """Return the eps-neighbors of candidate.
    """
    neighbors = []
    for pt in X:
        if ((pt - candidate) ** 2).sum() ** .5 <= eps:
            neighbors.append(pt)
    return neighbors


def get_weights(X, candidate):
    dists_x = []
    dists_y = []
    for neighbor in get_neighbor(X, candidate):
        dists_x.append((neighbor[0] - candidate[0]) ** 2)
        dists_y.append((neighbor[1] - candidate[1]) ** 2)

    var_x = sum(dists_x) / len(dists_x)
    var_y = sum(dists_y) / len(dists_y)

    weight_x = 1 if var_x > delta else K
    weight_y = 1 if var_y > delta else K

    return weight_x, weight_y


def pref_weighted_dist(X, neighbor, candidate):
    weights = get_weights(X, neighbor)
    dist = 0
    for i in range(2):
        dist += weights[i] * (neighbor[i] - candidate[i]) ** 2
    return dist ** .5


def is_core(X, candidate):
    good_ones = []
    for neighbor in get_neighbor(X, candidate):
        dist = max(
            pref_weighted_dist(X, neighbor, candidate),
            pref_weighted_dist(X, candidate, neighbor)
        )
        if dist <= eps:
            good_ones.append(dist)
    return len(good_ones) >= minpts


X = np.array([
    [1, 5],
    [2, 5],
    [3, 5],  # p3
    [4, 5],
    [5, 5],
    [6, 5],  # p6, red point
    [7, 5],
    [7, 6],
    [7, 7],
    [7, 4],
    [7, 3],
    [7, 2]
])

minpts = 3
eps = 1
delta = 0.25
K = 100

print('p3', is_core(X, X[2]))
print('p6', is_core(X, X[5]))

Can also be found here.

References:


  1. Density Connected Clustering with Local Subspace Preferences, Böhm et al., 2004 

Published on