Let's first review a few basic concepts behind DBSCAN, a density-based clustering algorithm. The following figure summarizes them.
Now let's create a sample 2D dataset to cluster with DBSCAN. The following figure shows what the dataset looks like.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# 14 sample points in 2D, one (x, y) pair per row
X_train = np.array([[60,36], [100,36], [100,70], [60,70],
    [140,55], [135,90], [180,65], [240,40],
    [160,140], [190,140], [220,130], [280,150],
    [200,170], [185, 170]])
plt.scatter(X_train[:,0], X_train[:,1], s=200)
plt.show()

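Now let's use scikit-learn's implementation of DBSCAN to cluster the dataset:
eps = 45         # neighborhood radius
min_samples = 4  # minimum neighborhood size (the point itself included) for a core point
db = DBSCAN(eps=eps, min_samples=min_samples).fit(X_train)
labels = db.labels_  # cluster label of each training point, -1 for noise
labels
# [ 0,  0,  0,  0,  0,  0,  0, -1,  1,  1,  1, -1,  1,  1]
db.core_sample_indices_  # indices of the core points in X_train
# [ 1,  2,  4,  9, 12, 13]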
Notice from the above results that:
- the algorithm found 6 core points
- it found 2 clusters (labeled 0 and 1) and a couple of outliers (noise points, labeled -1).
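The core-point count can be checked by hand. The following is a minimal sketch, not scikit-learn's actual implementation: it assumes Euclidean distance, counts each point as its own neighbor, and treats points at distance at most eps as neighbors. The helper name find_core_points is hypothetical.

```python
import numpy as np

# Same sample dataset as above
X_train = np.array([[60, 36], [100, 36], [100, 70], [60, 70],
                    [140, 55], [135, 90], [180, 65], [240, 40],
                    [160, 140], [190, 140], [220, 130], [280, 150],
                    [200, 170], [185, 170]])

def find_core_points(X, eps, min_samples):
    """Hypothetical helper: indices of points having at least
    min_samples points (themselves included) within distance eps."""
    # Pairwise Euclidean distances via broadcasting, shape (n, n)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # The diagonal is 0, so each point counts itself as a neighbor
    neighbor_counts = (dists <= eps).sum(axis=1)
    return np.flatnonzero(neighbor_counts >= min_samples)

core_idx = find_core_points(X_train, eps=45, min_samples=4)
print(core_idx)
```

Under these assumptions, the printed indices should match the six values reported by db.core_sample_indices_ above.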
Let's visualize the clusters using the following code snippet:
def dist(a, b):
    # Euclidean distance between two points
    return np.sqrt(np.sum((a - b)**2))

# noise points have label -1, so they pick up colors[-1], i.e. 'k' (black)
colors = ['r', 'g', 'b', 'k']
for i in range(len(X_train)):
    # draw core points as stars and all other points as circles
    plt.scatter(X_train[i,0], X_train[i,1],
                s=300, color=colors[labels[i]],
                marker=('*' if i in db.core_sample_indices_ else 'o'))

    # connect every pair of points lying within eps of each other
    for j in range(i+1, len(X_train)):
        if dist(X_train[i], X_train[j]) <= eps:
            plt.plot([X_train[i,0], X_train[j,0]], [X_train[i,1], X_train[j,1]], '-', color=colors[labels[i]])

plt.title('Clustering with DBSCAN', size=15)
plt.show()
In the resulting plot:
- points in cluster 0 are colored red
- points in cluster 1 are colored green
- outlier points are colored black
- core points are marked with '*'s
- two points are connected by an edge if they lie within each other's ϵ-neighborhood.
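The plot distinguishes core points from the rest, but the non-core points fall into two groups: cluster members that are not core points (commonly called border points) and noise. The sketch below separates the three groups. To stay self-contained, it hard-codes the labels and core indices copied from the output shown earlier instead of refitting the model; the variable names are illustrative.

```python
import numpy as np

# Values copied from the db.labels_ and db.core_sample_indices_ output above
labels = np.array([0, 0, 0, 0, 0, 0, 0, -1, 1, 1, 1, -1, 1, 1])
core_indices = np.array([1, 2, 4, 9, 12, 13])

# Boolean mask per point type
is_core = np.zeros(len(labels), dtype=bool)
is_core[core_indices] = True
is_noise = labels == -1
is_border = ~is_core & ~is_noise  # in a cluster, but not a core point

point_types = np.where(is_core, 'core', np.where(is_noise, 'noise', 'border'))
for i, kind in enumerate(point_types):
    print('point {}: label {}, type {}'.format(i, labels[i], kind))
```

With these labels, the six cluster members that are not core points come out as border points, and the two points labeled -1 come out as noise.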

Finally, let's implement a predict() function that predicts the cluster of a new data point. The implementation is based on the following:
- For a new point x to belong to a cluster, it must be directly density-reachable from a core point in that cluster.
- We find the core point nearest to x. If it lies within distance ϵ of x, we return the label of that core point; otherwise x is declared a noise point (outlier).
- Notice that this differs from the training algorithm, since we no longer allow any point to become a new core point (i.e., the set of core points is fixed).
The next code snippet implements the predict() function based on the above idea:
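def predict(db, x):
    # Euclidean distance from x to every core point found during fitting
    dists = np.sqrt(np.sum((db.components_ - x)**2, axis=1))
    i = np.argmin(dists)  # position of the nearest core point
    # inherit that core point's label if it is within eps, else mark x as noise
    return db.labels_[db.core_sample_indices_[i]] if dists[i] <= db.eps else -1

X_test = np.array([[100, 100], [160, 160], [60, 130]])
for i in range(len(X_test)):
    print('test point: {}, predicted label: {}'.format(X_test[i],
                                                       predict(db, X_test[i])))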
# test point: [100 100], predicted label: 0
# test point: [160 160], predicted label: 1
# test point: [ 60 130], predicted label: -1
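When many test points need labels, the same nearest-core-point rule can be applied to all of them at once. The following is a rough sketch under stated assumptions: predict_batch is a hypothetical helper, it takes the core points and their labels as plain arrays (with a fitted model these would be db.components_ and db.labels_[db.core_sample_indices_]), and it treats a point at exactly eps from a core point as within range. The core points below are hard-coded from the fit above so that the block runs on its own.

```python
import numpy as np

def predict_batch(core_points, core_labels, X_new, eps):
    """Hypothetical helper: label each row of X_new with the label of its
    nearest core point if that point is within eps, otherwise -1 (noise)."""
    X_new = np.atleast_2d(X_new)
    # Distances between every new point and every core point, shape (m, k)
    dists = np.linalg.norm(X_new[:, None, :] - core_points[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(X_new)), nearest]
    return np.where(nearest_dist <= eps, core_labels[nearest], -1)

# Core points (training indices 1, 2, 4, 9, 12, 13) and their cluster labels,
# copied from the fitted model above
core_points = np.array([[100, 36], [100, 70], [140, 55],
                        [190, 140], [200, 170], [185, 170]])
core_labels = np.array([0, 0, 0, 1, 1, 1])

X_test = np.array([[100, 100], [160, 160], [60, 130]])
pred = predict_batch(core_points, core_labels, X_test, eps=45)
print(pred)
```

For the three test points above, this should give the same labels as calling predict() on each point in turn.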
 
The next animation shows how a few new test points are labeled using the predict() function defined above.
