KMeans
Attempting to build a fast, memory efficient K-Means program.
Install
gem sources -a http://rubygems.org
sudo gem install k_means
How To Use
require 'rubygems'
require 'k_means'
data = [[1,1], [1,2], [1,1], [1000, 1000], [500, 500]]
kmeans = KMeans.new(data, :centroids => 2)
kmeans.inspect # Use kmeans.view to get hold of the un-inspected array
=> [[3, 4], [0, 1, 2]]
Custom Centroids
require 'rubygems'
require 'k_means'
# Your custom centroid needs to have #position and #reposition methods
class CustomCentroid
attr_accessor :position
def initialize(position); @position = position; end
def reposition(nodes, centroid_positions); end
end
data = [[1,1], [1,2], [1,1], [1000, 1000], [500, 500]]
kmeans = KMeans.new(data, :custom_centroids => @specified_centroids)
Distance Measurements
KMeans uses the Distance Measures Gem (github.com/reddavis/Distance-Measures) so we get quite a range of distance measurements.
The measurements currently available are:
euclidean_distance
cosine_similarity
jaccard_index
jaccard_distance
binary_jaccard_index
binary_jaccard_distance
tanimoto_coefficient
To specify a particular one to use in the KMeans algorithm, just provide it as an option:
KMeans.new(@data, :distance_measure => :jaccard_index)
KMeans.new(@data, :distance_measure => :cosine_similarity)
KMeans.new(@data, :distance_measure => :tanimoto_coefficient)
You get the idea…
Benchmarks
# 1000 records with 50 dimensions
data = Array.new(1000) {Array.new(50) {rand(10)}}
ai4r_data = Ai4r::Data::DataSet.new(:data_items=> data)
# Clustering can happen in magical ways
# so lets do it over multiple times
n = 5
Benchmark.bm do |x|
x.report('KMeans') do
n.times { KMeans.new(data) }
end
x.report("Ai4R") do
n.times do
b = Ai4r::Clusterers::KMeans.new
b.build(ai4r_data, 4)
end
end
end
user system total real
KMeans 15.960000 0.030000 15.990000 ( 16.062639)
Ai4R 70.230000 0.180000 70.410000 ( 70.704843)
Thanks
-
David Richards - For his code reviews and all round helpfulness. - github.com/davidrichards
Copyright
Copyright © 2009 Red Davis. See LICENSE for details.