Class: SameSame::DbscanAlgorithm

Inherits:
Object
  • Object
show all
Defined in:
lib/same_same/dbscan_algorithm.rb

Overview

Implementation of DBSCAN clustering algorithm.

Algorithm parameters:

* Eps - threshold value to determine point neighbors. Two points are 
  neighbors if the distance between them does not exceed this threshold value. 
* MinPts - minimum number of points in any cluster.

Choice of parameter values depends on the data.

Point types:

* Core point - point that belongs to the core of the cluster. It has at least
  MinPts neighboring points.
* Border point - is a neighbor to at least one core point but it doesn't 
  have enough neighbors to be a core point.
* Noise point - is a point that doesn't belong to any cluster because it is
  not close to any of the core points.

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(attrs = {}) ⇒ DbscanAlgorithm

Initializes algorithm with all data that it needs.

* points            - points to cluster
* eps               - distance threshold value
* min_points        - number of neighbors for point to be considered a
                      core point.
* distance          - distance measure to use (defaults to Cosine)
* vector_calculator - calculates the vectors to use for distance comparison.
                      defaults to DbscanNumericVectors which compares just
                      the numeric attributes of the datapoint.
                      Alternatively use DbscanTermFrequency.


52
53
54
55
56
57
58
59
60
61
62
63
64
65
# File 'lib/same_same/dbscan_algorithm.rb', line 52

def initialize(attrs = {})
  self.points       = attrs.fetch(:points)
  self.min_points   = attrs.fetch(:min_points)
  distance          = attrs[:distance] || CosineDistance.new
  vector_calculator = attrs[:vector_calculator] || DbscanNumericVectors.new
  
  self.neighborhood = DbscanNeighborhood.new( distance: distance,
                                              eps: attrs.fetch(:eps),
                                              points: points,
                                              vector_calculator: vector_calculator )

  # all points start as unclassifed
  self.dbscan_clusters     = DbscanClusters.new( points )
end

Instance Attribute Details

#dbscan_clustersObject

Sets of points. Initially all points will be assigned into Unclassified points set.



30
31
32
# File 'lib/same_same/dbscan_algorithm.rb', line 30

def dbscan_clusters
  @dbscan_clusters
end

#min_pointsObject

Number of points that should exist in the neighborhood for a point to be a core point.

Best value for this parameter depends on the data set.



36
37
38
# File 'lib/same_same/dbscan_algorithm.rb', line 36

def min_points
  @min_points
end

#neighborhoodObject

Returns the value of attribute neighborhood.



38
39
40
# File 'lib/same_same/dbscan_algorithm.rb', line 38

def neighborhood
  @neighborhood
end

#pointsObject

Returns the value of attribute points.



26
27
28
# File 'lib/same_same/dbscan_algorithm.rb', line 26

def points
  @points
end

Instance Method Details

#clusterObject



67
68
69
70
71
72
73
74
75
76
77
78
# File 'lib/same_same/dbscan_algorithm.rb', line 67

def cluster
  cluster_id = dbscan_clusters.get_next_cluster_id
  points.each do |p|
    if dbscan_clusters.unclassified?(p)
      if create_cluster(p, cluster_id)
        cluster_id = dbscan_clusters.get_next_cluster_id
      end
    end
  end
  
  dbscan_clusters.to_clusters
end

#create_cluster(p, cluster_id) ⇒ Object



81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
# File 'lib/same_same/dbscan_algorithm.rb', line 81

def create_cluster( p, cluster_id)
  neighbors = neighborhood.neighbors_of p
  if neighbors.size < min_points
    # Assign point into "Noise" group. 
    # It will have a chance to become a border point later on.
    dbscan_clusters.assign_to_noise(p)
    # return false to indicate that we didn't create any cluster
    return false
  end

  # All points are reachable from the core point...
  dbscan_clusters.assign_points(neighbors, cluster_id)
  
  # Remove point itself.
  neighbors.delete(p)
  
  # Process the rest of the neighbors...
  while !neighbors.empty?
    # pick the first neighbor
    neighbor = neighbors.first
    
    # process neighbor
    neighbors_neighbors = neighborhood.neighbors_of neighbor
            
    if neighbors_neighbors.size < min_points
      # do nothing. The neighbor is just a border point.
    else      
      # neighbor is another core point.  
      neighbors_neighbors.each do |neighbors_neighbor|
      
        if dbscan_clusters.noise?(neighbors_neighbor)
          # It's a border point. We know that it doesn't have 
          # enough neighbors to be a core point. Just add it 
          # to the cluster.
          dbscan_clusters.assign_point(neighbors_neighbor, cluster_id)
        elsif dbscan_clusters.unclassified?(neighbors_neighbor)
          
          # We don't know if this point has enough neighbors
          # to be a core point... add it to the list of points
          # to be checked.
          neighbors.add(neighbors_neighbor)
            
          # And assign it to the cluster
          dbscan_clusters.assign_point(neighbors_neighbor, cluster_id)
        end
      end
    end
    
    neighbors.delete neighbor
  end
  true
end