Class: SameSame::DbscanAlgorithm
- Inherits:
-
Object
- Object
- SameSame::DbscanAlgorithm
- Defined in:
- lib/same_same/dbscan_algorithm.rb
Overview
Implementation of DBSCAN clustering algorithm.
Algorithm parameters:
* Eps - threshold value to determine point neighbors. Two points are
neighbors if the distance between them does not exceed this threshold value.
* MinPts - minimum number of points in any cluster.
Choice of parameter values depends on the data.
Point types:
* Core point - point that belongs to the core of the cluster. It has at least
MinPts neighboring points.
* Border point - is a neighbor to at least one core point but it doesn't
have enough neighbors to be a core point.
* Noise point - is a point that doesn't belong to any cluster because it is
not close to any of the core points.
Instance Attribute Summary collapse
-
#dbscan_clusters ⇒ Object
Sets of points.
-
#min_points ⇒ Object
Number of points that should exist in the neighborhood for a point to be a core point.
-
#neighborhood ⇒ Object
Returns the value of attribute neighborhood.
-
#points ⇒ Object
Returns the value of attribute points.
Instance Method Summary collapse
- #cluster ⇒ Object
- #create_cluster(p, cluster_id) ⇒ Object
-
#initialize(attrs = {}) ⇒ DbscanAlgorithm
constructor
Initializes algorithm with all data that it needs.
Constructor Details
#initialize(attrs = {}) ⇒ DbscanAlgorithm
Initializes algorithm with all data that it needs.
* points - points to cluster
* eps - distance threshold value
* min_points - number of neighbors for point to be considered a
core point.
* distance - distance measure to use (defaults to Cosine)
* vector_calculator - calculates the vectors to use for distance comparison.
defaults to DbscanNumericVectors which compares just
the numeric attributes of the datapoint.
Alternatively use DbscanTermFrequency.
52 53 54 55 56 57 58 59 60 61 62 63 64 65 |
# File 'lib/same_same/dbscan_algorithm.rb', line 52 def initialize(attrs = {}) self.points = attrs.fetch(:points) self.min_points = attrs.fetch(:min_points) distance = attrs[:distance] || CosineDistance.new vector_calculator = attrs[:vector_calculator] || DbscanNumericVectors.new self.neighborhood = DbscanNeighborhood.new( distance: distance, eps: attrs.fetch(:eps), points: points, vector_calculator: vector_calculator ) # all points start as unclassifed self.dbscan_clusters = DbscanClusters.new( points ) end |
Instance Attribute Details
#dbscan_clusters ⇒ Object
Sets of points. Initially all points will be assigned into Unclassified points set.
30 31 32 |
# File 'lib/same_same/dbscan_algorithm.rb', line 30 def dbscan_clusters @dbscan_clusters end |
#min_points ⇒ Object
Number of points that should exist in the neighborhood for a point to be a core point.
Best value for this parameter depends on the data set.
36 37 38 |
# File 'lib/same_same/dbscan_algorithm.rb', line 36 def min_points @min_points end |
#neighborhood ⇒ Object
Returns the value of attribute neighborhood.
38 39 40 |
# File 'lib/same_same/dbscan_algorithm.rb', line 38 def neighborhood @neighborhood end |
#points ⇒ Object
Returns the value of attribute points.
26 27 28 |
# File 'lib/same_same/dbscan_algorithm.rb', line 26 def points @points end |
Instance Method Details
#cluster ⇒ Object
67 68 69 70 71 72 73 74 75 76 77 78 |
# File 'lib/same_same/dbscan_algorithm.rb', line 67 def cluster cluster_id = dbscan_clusters.get_next_cluster_id points.each do |p| if dbscan_clusters.unclassified?(p) if create_cluster(p, cluster_id) cluster_id = dbscan_clusters.get_next_cluster_id end end end dbscan_clusters.to_clusters end |
#create_cluster(p, cluster_id) ⇒ Object
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 |
# File 'lib/same_same/dbscan_algorithm.rb', line 81 def create_cluster( p, cluster_id) neighbors = neighborhood.neighbors_of p if neighbors.size < min_points # Assign point into "Noise" group. # It will have a chance to become a border point later on. dbscan_clusters.assign_to_noise(p) # return false to indicate that we didn't create any cluster return false end # All points are reachable from the core point... dbscan_clusters.assign_points(neighbors, cluster_id) # Remove point itself. neighbors.delete(p) # Process the rest of the neighbors... while !neighbors.empty? # pick the first neighbor neighbor = neighbors.first # process neighbor neighbors_neighbors = neighborhood.neighbors_of neighbor if neighbors_neighbors.size < min_points # do nothing. The neighbor is just a border point. else # neighbor is another core point. neighbors_neighbors.each do |neighbors_neighbor| if dbscan_clusters.noise?(neighbors_neighbor) # It's a border point. We know that it doesn't have # enough neighbors to be a core point. Just add it # to the cluster. dbscan_clusters.assign_point(neighbors_neighbor, cluster_id) elsif dbscan_clusters.unclassified?(neighbors_neighbor) # We don't know if this point has enough neighbors # to be a core point... add it to the list of points # to be checked. neighbors.add(neighbors_neighbor) # And assign it to the cluster dbscan_clusters.assign_point(neighbors_neighbor, cluster_id) end end end neighbors.delete neighbor end true end |