Class: KMeansPP
- Inherits:
-
Object
- Object
- KMeansPP
- Defined in:
- lib/k_means_pp.rb,
lib/k_means_pp/point.rb,
lib/k_means_pp/cluster.rb,
lib/k_means_pp/version.rb
Overview
Cluster data with the k-means++, k-means and Lloyd algorithm.
Defined Under Namespace
Classes: BasePoint, Centroid, Cluster, Point
Constant Summary collapse
- VERSION =
Version number, happy now?
'0.0.4'
Instance Attribute Summary collapse
-
#centroids ⇒ Array<Centroid>
Centroid points.
-
#points ⇒ Array<Point>
Source data set of points.
Class Method Summary collapse
-
.cluster_for_centroid(centroid, points, &block) ⇒ Cluster
Computed points are a flat structure so this nests each point in an array.
-
.clusters(points, clusters_count, &block) ⇒ Array<Cluster>
Take an array of things and group them into K clusters.
-
.find_nearest_centroid(point, centroids) ⇒ Centroid
Find nearest centroid for a given point in given centroids.
-
.find_nearest_centroid_and_distance(point, centroids) ⇒ Array
Find the nearest centroid in given centroids.
-
.find_nearest_centroid_distance(point, centroids) ⇒ Float
Find distance to the nearest centroid for a given point in given centroids.
Instance Method Summary collapse
-
#calculate_new_centroids ⇒ Object
protected
For each cell calculate its center.
-
#define_initial_clusters ⇒ Object
protected
K-means++ algorithm.
-
#fine_tune_clusters ⇒ Object
protected
This is Lloyd’s algorithm en.wikipedia.org/wiki/Lloyd%27s_algorithm.
-
#group_points ⇒ Object
Group points into clusters.
-
#initialize(points, clusters_count) ⇒ KMeansPP
constructor
Take an array of things and group them into K clusters.
-
#reassign_points ⇒ Fixnum
protected
Loop through all the points and find their nearest centroid.
Constructor Details
#initialize(points, clusters_count) ⇒ KMeansPP
Take an array of things and group them into K clusters.
If no block was given, an array of arrays (of two numbers) is expected. Internally we map them with Point
objects.
If a block was given, the points
is likely an array of other things like hashes or objects. In this case we will keep the original object in a property and once we are done, we will swap those objects. The block is expected to retun an array of two numbers.
115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 |
# File 'lib/k_means_pp.rb', line 115 def initialize(points, clusters_count) # Do not mutate original structure points = points.dup if block_given? points.map! do |point_obj| point_ary = yield(point_obj) point = Point.new(point_ary[0], point_ary[1]) point.original = point_obj point end else points.map! do |point_ary| Point.new(point_ary[0], point_ary[1]) end end self.points = points self.centroids = Array.new(clusters_count) end |
Instance Attribute Details
#centroids ⇒ Array<Centroid>
Centroid points
15 16 17 |
# File 'lib/k_means_pp.rb', line 15 def centroids @centroids end |
#points ⇒ Array<Point>
Source data set of points.
10 11 12 |
# File 'lib/k_means_pp.rb', line 10 def points @points end |
Class Method Details
.cluster_for_centroid(centroid, points, &block) ⇒ Cluster
Computed points are a flat structure so this nests each point in an array.
47 48 49 50 51 52 53 54 55 56 57 |
# File 'lib/k_means_pp.rb', line 47 def self.cluster_for_centroid(centroid, points, &block) cluster_points = points.select { |p| p.group == centroid } if block cluster_points.map!(&:original) else cluster_points.map! { |p| [p.x, p.y] } end Cluster.new(centroid, cluster_points) end |
.clusters(points, clusters_count, &block) ⇒ Array<Cluster>
Take an array of things and group them into K clusters.
If no block was given, an array of arrays (of two numbers) is expected. At the end an array of Clusters is returned, each wrapping an array or arrays (of two numbers).
If a block was given, the points
is likely an array of other things like hashes or objects. The block is expected to return an array of two numbers. At the end an array of Clusters is returned, each wrapping an array or original objects.
33 34 35 36 37 38 39 |
# File 'lib/k_means_pp.rb', line 33 def self.clusters(points, clusters_count, &block) instance = new(points, clusters_count, &block) instance.group_points instance.centroids.map do |centroid| cluster_for_centroid(centroid, instance.points, &block) end end |
.find_nearest_centroid(point, centroids) ⇒ Centroid
Find nearest centroid for a given point in given centroids.
65 66 67 |
# File 'lib/k_means_pp.rb', line 65 def self.find_nearest_centroid(point, centroids) find_nearest_centroid_and_distance(point, centroids)[0] end |
.find_nearest_centroid_and_distance(point, centroids) ⇒ Array
Find the nearest centroid in given centroids.
85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 |
# File 'lib/k_means_pp.rb', line 85 def self.find_nearest_centroid_and_distance(point, centroids) # Assume the current centroid is the closest. nearest_centroid = point.group nearest_distance = Float::INFINITY centroids.each do |centroid| distance = centroid.squared_distance_to(point) next if distance >= nearest_distance nearest_distance = distance nearest_centroid = centroid end [nearest_centroid, nearest_distance] end |
.find_nearest_centroid_distance(point, centroids) ⇒ Float
Find distance to the nearest centroid for a given point in given centroids.
75 76 77 |
# File 'lib/k_means_pp.rb', line 75 def self.find_nearest_centroid_distance(point, centroids) find_nearest_centroid_and_distance(point, centroids)[1] end |
Instance Method Details
#calculate_new_centroids ⇒ Object (protected)
For each cell calculate its center. This is done by averaging X and Y coordinates.
213 214 215 216 217 218 219 220 221 222 223 224 225 |
# File 'lib/k_means_pp.rb', line 213 def calculate_new_centroids # Clear centroids. centroids.each(&:reset) # Sum all X and Y coords into each point's centroid. points.each do |point| centroid = point.group centroid.add(point) end # And then average it to find a center. centroids.each(&:average) end |
#define_initial_clusters ⇒ Object (protected)
K-means++ algorithm.
Find initial centroids and assign points to their nearest centroid, forming cells.
148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 |
# File 'lib/k_means_pp.rb', line 148 def define_initial_clusters # Randomly choose a point as the first centroid. centroids[0] = Centroid.new(points.sample) # Initialize an array of distances of every point. distances = points.size.times.map { 0.0 } centroids.each_with_index do |_, centroid_i| # Skip the first centroid as it's already picked but keep the index. next if centroid_i == 0 # Sum points' distances to their nearest centroid distances_sum = 0.0 points.each_with_index do |point, point_i| distance = self.class.find_nearest_centroid_distance( point, centroids[0...centroid_i] ) distances[point_i] = distance distances_sum += distance end # Randomly cut it. distances_sum *= rand # Keep subtracting those distances until we hit a zero (or lower) # in which case we found a new centroid. distances.each_with_index do |distance, point_i| distances_sum -= distance next if distances_sum > 0 centroids[centroid_i] = Centroid.new(points[point_i]) break end end # Assign each point its nearest centroid. points.each do |point| point.group = self.class.find_nearest_centroid(point, centroids) end end |
#fine_tune_clusters ⇒ Object (protected)
This is Lloyd’s algorithm en.wikipedia.org/wiki/Lloyd%27s_algorithm
At this point we have our points already assigned into cells.
-
We calculate a new center for each cell.
-
For each point find its nearest center and re-assign it if it changed.
-
Repeat until a threshold has been reached.
198 199 200 201 202 203 204 205 206 207 208 209 |
# File 'lib/k_means_pp.rb', line 198 def fine_tune_clusters # When a number of changed points reaches this number, we are done. changed_threshold = points.size >> 10 loop do calculate_new_centroids changed = reassign_points # Stop when 99.9% of points are good break if changed <= changed_threshold end end |
#group_points ⇒ Object
Group points into clusters.
137 138 139 140 |
# File 'lib/k_means_pp.rb', line 137 def group_points define_initial_clusters fine_tune_clusters end |
#reassign_points ⇒ Fixnum (protected)
Loop through all the points and find their nearest centroid. If it’s a different one than current, change it ande take a note.
231 232 233 234 235 236 237 238 239 240 241 242 |
# File 'lib/k_means_pp.rb', line 231 def reassign_points changed = 0 points.each do |point| centroid = self.class.find_nearest_centroid(point, centroids) next if centroid == point.group changed += 1 point.group = centroid end changed end |