Class: KMeansPP

Inherits:
Object
  • Object
show all
Defined in:
lib/k_means_pp.rb,
lib/k_means_pp/point.rb,
lib/k_means_pp/cluster.rb,
lib/k_means_pp/version.rb

Overview

Cluster data with the k-means++, k-means and Lloyd algorithm.

Defined Under Namespace

Classes: BasePoint, Centroid, Cluster, Point

Constant Summary collapse

VERSION =

Version number, happy now?

'0.0.1'

Instance Attribute Summary collapse

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(points, clusters_count) ⇒ KMeansPP

Take an array of things and group them into K clusters.

If no block was given, an array of arrays (of two numbers) is expected. Internally we map them with Point objects.

If a block was given, the points is likely an array of other things like hashes or objects. In this case we will keep the original object in a property and once we are done, we will swap those objects. The block is expected to retun an array of two numbers.

Parameters:

  • points (Array)

    Source data set of points.

  • clusters_count (Fixnum)

    Number of clusters (“k”).

Yield Returns:

  • (Array<Numeric>)


115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
# File 'lib/k_means_pp.rb', line 115

def initialize(points, clusters_count)
  if block_given?
    points.map! do |point_obj|
      point_ary      = yield(point_obj)
      point          = Point.new(point_ary[0], point_ary[1])
      point.original = point_obj
      point
    end
  else
    points.map! do |point_ary|
      Point.new(point_ary[0], point_ary[1])
    end
  end

  self.points    = points
  self.centroids = Array.new(clusters_count)
end

Instance Attribute Details

#centroidsArray<Centroid>

Centroid points

Returns:



15
16
17
# File 'lib/k_means_pp.rb', line 15

def centroids
  @centroids
end

#pointsArray<Point>

Source data set of points.

Returns:



10
11
12
# File 'lib/k_means_pp.rb', line 10

def points
  @points
end

Class Method Details

.cluster_for_centroid(centroid, points, &block) ⇒ Cluster

Computed points are a flat structure so this nests each point in an array.

Parameters:

  • centroid (Centroid)

    Centroid of the cluster.

Returns:



47
48
49
50
51
52
53
54
55
56
57
# File 'lib/k_means_pp.rb', line 47

def self.cluster_for_centroid(centroid, points, &block)
  cluster_points = points.select { |p| p.group == centroid }

  if block
    cluster_points.map!(&:original)
  else
    cluster_points.map! { |p| [p.x, p.y] }
  end

  Cluster.new(centroid, cluster_points)
end

.clusters(points, clusters_count, &block) ⇒ Array<Cluster>

Take an array of things and group them into K clusters.

If no block was given, an array of arrays (of two numbers) is expected. At the end an array of Clusters is returned, each wrapping an array or arrays (of two numbers).

If a block was given, the points is likely an array of other things like hashes or objects. The block is expected to return an array of two numbers. At the end an array of Clusters is returned, each wrapping an array or original objects.

Parameters:

  • points (Array)

    Source data set of points.

  • clusters_count (Fixnum)

    Number of clusters (“k”).

Yield Returns:

  • (Array<Numeric>)

Returns:



33
34
35
36
37
38
39
# File 'lib/k_means_pp.rb', line 33

def self.clusters(points, clusters_count, &block)
  instance = new(points, clusters_count, &block)
  instance.group_points
  instance.centroids.map do |centroid|
    cluster_for_centroid(centroid, points, &block)
  end
end

.find_nearest_centroid(point, centroids) ⇒ Centroid

Find nearest centroid for a given point in given centroids.

Parameters:

  • point (Point)

    Measure distance of this point

  • centroids (Array<Centroid>)

    to those cluster centers

Returns:



65
66
67
# File 'lib/k_means_pp.rb', line 65

def self.find_nearest_centroid(point, centroids)
  find_nearest_centroid_and_distance(point, centroids)[0]
end

.find_nearest_centroid_and_distance(point, centroids) ⇒ Array

Find the nearest centroid in given centroids.

Parameters:

  • point (Point)

    Measure distance of this point

  • centroids (Array<Centroid>)

    to those cluster centers

Returns:

  • (Array)


85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
# File 'lib/k_means_pp.rb', line 85

def self.find_nearest_centroid_and_distance(point, centroids)
  # Assume the current centroid is the closest.
  nearest_centroid = point.group
  nearest_distance = Float::INFINITY

  centroids.each do |centroid|
    distance = centroid.squared_distance_to(point)

    next if distance >= nearest_distance

    nearest_distance = distance
    nearest_centroid = centroid
  end

  [nearest_centroid, nearest_distance]
end

.find_nearest_centroid_distance(point, centroids) ⇒ Float

Find distance to the nearest centroid for a given point in given centroids.

Parameters:

  • point (Point)

    Measure distance of this point

  • centroids (Array<Centroid>)

    to those cluster centers

Returns:

  • (Float)


75
76
77
# File 'lib/k_means_pp.rb', line 75

def self.find_nearest_centroid_distance(point, centroids)
  find_nearest_centroid_and_distance(point, centroids)[1]
end

Instance Method Details

#calculate_new_centroidsObject (protected)

For each cell calculate its center. This is done by averaging X and Y coordinates.



210
211
212
213
214
215
216
217
218
219
220
221
222
# File 'lib/k_means_pp.rb', line 210

def calculate_new_centroids
  # Clear centroids.
  centroids.each(&:reset)

  # Sum all X and Y coords into each point's centroid.
  points.each do |point|
    centroid = point.group
    centroid.add(point)
  end

  # And then average it to find a center.
  centroids.each(&:average)
end

#define_initial_clustersObject (protected)

K-means++ algorithm.

Find initial centroids and assign points to their nearest centroid, forming cells.



145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
# File 'lib/k_means_pp.rb', line 145

def define_initial_clusters
  # Randomly choose a point as the first centroid.
  centroids[0] = Centroid.new(points.sample)

  # Initialize an array of distances of every point.
  distances = points.size.times.map { 0.0 }

  centroids.each_with_index do |_, centroid_i|
    # Skip the first centroid as it's already picked but keep the index.
    next if centroid_i == 0

    # Sum points' distances to their nearest centroid
    distances_sum = 0.0

    points.each_with_index do |point, point_i|
      distance = self.class.find_nearest_centroid_distance(
        point,
        centroids[0...centroid_i]
      )
      distances[point_i] = distance
      distances_sum     += distance
    end

    # Randomly cut it.
    distances_sum *= rand

    # Keep subtracting those distances until we hit a zero (or lower)
    # in which case we found a new centroid.
    distances.each_with_index do |distance, point_i|
      distances_sum -= distance
      next if distances_sum > 0
      centroids[centroid_i] = Centroid.new(points[point_i])
      break
    end
  end

  # Assign each point its nearest centroid.
  points.each do |point|
    point.group = self.class.find_nearest_centroid(point, centroids)
  end
end

#fine_tune_clustersObject (protected)

This is Lloyd’s algorithm en.wikipedia.org/wiki/Lloyd%27s_algorithm

At this point we have our points already assigned into cells.

  1. We calculate a new center for each cell.

  2. For each point find its nearest center and re-assign it if it changed.

  3. Repeat until a threshold has been reached.



195
196
197
198
199
200
201
202
203
204
205
206
# File 'lib/k_means_pp.rb', line 195

def fine_tune_clusters
  # When a number of changed points reaches this number, we are done.
  changed_threshold = points.size >> 10

  loop do
    calculate_new_centroids
    changed = reassign_points

    # Stop when 99.9% of points are good
    break if changed <= changed_threshold
  end
end

#group_pointsObject

Group points into clusters.



134
135
136
137
# File 'lib/k_means_pp.rb', line 134

def group_points
  define_initial_clusters
  fine_tune_clusters
end

#reassign_pointsFixnum (protected)

Loop through all the points and find their nearest centroid. If it’s a different one than current, change it ande take a note.

Returns:

  • (Fixnum)

    Number of changed points.



228
229
230
231
232
233
234
235
236
237
238
239
# File 'lib/k_means_pp.rb', line 228

def reassign_points
  changed = 0

  points.each do |point|
    centroid = self.class.find_nearest_centroid(point, centroids)
    next if centroid == point.group
    changed += 1
    point.group = centroid
  end

  changed
end