Class: Rumale::Clustering::KMedoids

Inherits:
Object
Includes:
Base::BaseEstimator, Base::ClusterAnalyzer
Defined in:
lib/rumale/clustering/k_medoids.rb

Overview

KMedoids is a class that implements K-Medoids cluster analysis.

Reference

  • D. Arthur and S. Vassilvitskii, “k-means++: the advantages of careful seeding,” Proc. SODA’07, pp. 1027–1035, 2007.

Examples:

analyzer = Rumale::Clustering::KMedoids.new(n_clusters: 10, max_iter: 50)
cluster_labels = analyzer.fit_predict(samples)

Instance Attribute Summary

Attributes included from Base::BaseEstimator

#params

Instance Method Summary

Methods included from Base::ClusterAnalyzer

#score

Constructor Details

#initialize(n_clusters: 8, metric: 'euclidean', init: 'k-means++', max_iter: 50, tol: 1.0e-4, random_seed: nil) ⇒ KMedoids

Create a new cluster analyzer that uses the K-Medoids method.

Parameters:

  • n_clusters (Integer) (defaults to: 8)

    The number of clusters.

  • metric (String) (defaults to: 'euclidean')

    The metric used to calculate distances in the original space. If metric is ‘euclidean’, the Euclidean distance between samples is calculated. If metric is ‘precomputed’, the fit and fit_predict methods expect to be given a distance matrix.

  • init (String) (defaults to: 'k-means++')

    The initialization method for the medoids (‘random’ or ‘k-means++’).

  • max_iter (Integer) (defaults to: 50)

    The maximum number of iterations.

  • tol (Float) (defaults to: 1.0e-4)

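    The tolerance of the termination criterion. Iteration stops early when the mean sample-to-medoid distance changes by no more than tol between two consecutive iterations.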

  • random_seed (Integer) (defaults to: nil)

    The seed value used to initialize the random generator.
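The ‘k-means++’ option is named after the seeding scheme in the reference above. As a rough illustration of that scheme applied to a distance matrix, the following pure-Ruby sketch draws each new seed with probability proportional to its squared distance from the nearest seed chosen so far. The helper name kmeanspp_medoid_ids and the nested-array representation are assumptions made for this example; it is not Rumale's internal code.

```ruby
# Hypothetical helper: pick n_clusters seed indices from a square distance
# matrix (array of arrays) in the style of k-means++ seeding.
def kmeanspp_medoid_ids(dist, n_clusters, rng)
  n_samples = dist.size
  ids = [rng.rand(n_samples)] # first seed: uniform at random
  while ids.size < n_clusters
    # Squared distance from every sample to its nearest chosen seed.
    weights = Array.new(n_samples) { |i| ids.map { |m| dist[i][m] }.min**2 }
    total = weights.sum
    break if total.zero? # every sample coincides with a chosen seed

    # Draw an index with probability proportional to its weight.
    threshold = rng.rand * total
    cumulative = 0.0
    chosen = weights.each_index.find { |i| (cumulative += weights[i]) > threshold }
    # Guard against floating-point round-off leaving nothing selected.
    ids << (chosen || weights.each_index.max_by { |i| weights[i] })
  end
  ids
end

points = [0.0, 0.1, 5.0, 5.1, 10.0]
dist = points.map { |a| points.map { |b| (a - b).abs } }
seed_ids = kmeanspp_medoid_ids(dist, 3, Random.new(1))
```

Already-chosen seeds have weight zero, so the sketch never selects the same index twice.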



# File 'lib/rumale/clustering/k_medoids.rb', line 39

def initialize(n_clusters: 8, metric: 'euclidean', init: 'k-means++', max_iter: 50, tol: 1.0e-4, random_seed: nil)
  check_params_integer(n_clusters: n_clusters, max_iter: max_iter)
  check_params_float(tol: tol)
  check_params_string(metric: metric, init: init)
  check_params_type_or_nil(Integer, random_seed: random_seed)
  check_params_positive(n_clusters: n_clusters, max_iter: max_iter)
  @params = {}
  @params[:n_clusters] = n_clusters
  @params[:metric] = metric == 'precomputed' ? 'precomputed' : 'euclidean'
  @params[:init] = init == 'random' ? 'random' : 'k-means++'
  @params[:max_iter] = max_iter
  @params[:tol] = tol
  @params[:random_seed] = random_seed
  @params[:random_seed] ||= srand
  @medoid_ids = nil
  @cluster_centers = nil
  @rng = Random.new(@params[:random_seed])
end
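
Two details of the constructor listing above are easy to miss: metric and init are only checked to be strings, so any unrecognized value silently falls back to ‘euclidean’ or ‘k-means++’, and a nil random_seed is replaced by the value returned from srand. The sketch below copies that logic into a stand-alone method so it can be tried without the library; the name normalize_options is hypothetical.

```ruby
# Hypothetical stand-alone copy of the option handling in the listing above.
def normalize_options(metric: 'euclidean', init: 'k-means++', random_seed: nil)
  {
    metric: metric == 'precomputed' ? 'precomputed' : 'euclidean',
    init: init == 'random' ? 'random' : 'k-means++',
    # Kernel#srand returns the previous global seed; storing it lets the
    # same generator be rebuilt later.
    random_seed: random_seed || srand
  }
end

opts = normalize_options(metric: 'manhattan', init: 'farthest', random_seed: 42)
# Unrecognized strings fall back to the defaults without raising.
fallback_metric = opts[:metric]
fallback_init = opts[:init]

# Two generators built from the same seed produce the same draws.
same_draws = Random.new(opts[:random_seed]).rand(100) == Random.new(opts[:random_seed]).rand(100)

# With no seed given, an integer seed is filled in automatically.
auto_opts = normalize_options
auto_seed = auto_opts[:random_seed]
```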

Instance Attribute Details

#medoid_ids ⇒ Numo::Int32 (readonly)

Return the indices of medoids.

Returns:

  • (Numo::Int32)

    (shape: [n_clusters])



# File 'lib/rumale/clustering/k_medoids.rb', line 23

def medoid_ids
  @medoid_ids
end

#rng ⇒ Random (readonly)

Return the random generator.

Returns:

  • (Random)


# File 'lib/rumale/clustering/k_medoids.rb', line 27

def rng
  @rng
end

Instance Method Details

#fit(x) ⇒ KMedoids

Analyze clusters with the given training data.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The training data to be used for fitting the model. If the metric is ‘precomputed’, x must be a square distance matrix (shape: [n_samples, n_samples]).

Returns:

  • (KMedoids)

    The learned cluster analyzer itself.

Raises:

  • (ArgumentError)

    If the metric is ‘precomputed’ and x is not a square matrix.


# File 'lib/rumale/clustering/k_medoids.rb', line 65

def fit(x, _not_used = nil)
  check_sample_array(x)
  raise ArgumentError, 'Expect the input distance matrix to be square.' if @params[:metric] == 'precomputed' && x.shape[0] != x.shape[1]
  # initialize some variables.
  distance_mat = @params[:metric] == 'precomputed' ? x : Rumale::PairwiseMetric.euclidean_distance(x)
  init_cluster_centers(distance_mat)
  error = distance_mat[true, @medoid_ids].mean
  @params[:max_iter].times do |_t|
    cluster_labels = assign_cluster(distance_mat[true, @medoid_ids])
    @params[:n_clusters].times do |n|
      assigned_ids = cluster_labels.eq(n).where
      @medoid_ids[n] = assigned_ids[distance_mat[assigned_ids, assigned_ids].sum(axis: 1).min_index]
    end
    new_error = distance_mat[true, @medoid_ids].mean
    break if (error - new_error).abs <= @params[:tol]
    error = new_error
  end
  @cluster_centers = x[@medoid_ids, true].dup if @params[:metric] == 'euclidean'
  self
end
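
The loop in the listing above alternates two steps: assign every sample to its nearest medoid, then move each medoid to the cluster member with the smallest summed distance to the other members. The following pure-Ruby sketch mirrors that procedure on nested arrays, assuming the initial medoid indices are passed in explicitly. The method name k_medoids and the guard for empty clusters are assumptions of this example, not part of the library.

```ruby
# Hypothetical pure-Ruby sketch of the alternating procedure, operating on a
# square distance matrix given as an array of arrays.
def k_medoids(dist, medoid_ids, max_iter: 50, tol: 1.0e-4)
  medoid_ids = medoid_ids.dup
  # Mean of all sample-to-medoid distances, used as the convergence measure.
  mean_dist = ->(ids) { dist.sum { |row| row.values_at(*ids).sum } / (dist.size * ids.size).to_f }
  error = mean_dist.call(medoid_ids)
  max_iter.times do
    # Assignment step: each sample joins its nearest medoid.
    labels = dist.map { |row| medoid_ids.each_index.min_by { |k| row[medoid_ids[k]] } }
    # Update step: the new medoid minimizes the summed distance to its cluster.
    medoid_ids.each_index do |k|
      members = labels.each_index.select { |i| labels[i] == k }
      next if members.empty? # assumption: keep the old medoid for an empty cluster

      medoid_ids[k] = members.min_by { |i| members.sum { |j| dist[i][j] } }
    end
    new_error = mean_dist.call(medoid_ids)
    break if (error - new_error).abs <= tol

    error = new_error
  end
  medoid_ids
end

points = [0.0, 0.2, 0.4, 9.0, 9.5, 10.0]
dist = points.map { |a| points.map { |b| (a - b).abs } }
final_ids = k_medoids(dist, [0, 1])
```

Starting from the two leftmost points, the medoids settle on the middle point of each group (indices 1 and 4) after a few iterations.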

#fit_predict(x) ⇒ Numo::Int32

Analyze clusters and assign samples to clusters.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The training data to be used for cluster analysis. If the metric is ‘precomputed’, x must be a square distance matrix (shape: [n_samples, n_samples]).

Returns:

  • (Numo::Int32)

    (shape: [n_samples]) Predicted cluster label per sample.



# File 'lib/rumale/clustering/k_medoids.rb', line 105

def fit_predict(x)
  check_sample_array(x)
  fit(x)
  if @params[:metric] == 'precomputed'
    predict(x[true, @medoid_ids])
  else
    predict(x)
  end
end
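
With the ‘precomputed’ metric, the listing above passes only the medoid columns of the square distance matrix on to predict. A minimal plain-Ruby sketch of that slicing and the subsequent nearest-medoid labeling is shown below; the data and the medoid_ids value are made up for illustration.

```ruby
points = [0.0, 0.2, 9.0, 9.5]
dist = points.map { |a| points.map { |b| (a - b).abs } }
medoid_ids = [0, 3] # assumed result of fitting; made up for this example

# Keep only the columns that belong to the medoids: n_samples x n_clusters.
to_medoids = dist.map { |row| row.values_at(*medoid_ids) }

# Label each sample with the position of its nearest medoid.
labels = to_medoids.map { |row| row.each_index.min_by { |k| row[k] } }
```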

#marshal_dump ⇒ Hash

Dump marshal data.

Returns:

  • (Hash)

    The marshal data.



# File 'lib/rumale/clustering/k_medoids.rb', line 117

def marshal_dump
  { params: @params,
    medoid_ids: @medoid_ids,
    cluster_centers: @cluster_centers,
    rng: @rng }
end

#marshal_load(obj) ⇒ nil

Load marshal data.

Returns:

  • (nil)


# File 'lib/rumale/clustering/k_medoids.rb', line 126

def marshal_load(obj)
  @params = obj[:params]
  @medoid_ids = obj[:medoid_ids]
  @cluster_centers = obj[:cluster_centers]
  @rng = obj[:rng]
  nil
end
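
Ruby's Marshal calls marshal_dump when serializing an object and marshal_load on a freshly allocated object when deserializing it, so the pair above is what makes a fitted analyzer storable. The sketch below shows the round trip with a small stand-in class that follows the same pattern; TinyAnalyzer and its contents are hypothetical and exist only for this example.

```ruby
# Hypothetical stand-in class that follows the same marshal_dump /
# marshal_load pattern as the listings above.
class TinyAnalyzer
  attr_reader :params, :medoid_ids, :rng

  def initialize(seed)
    @params = { n_clusters: 2, random_seed: seed }
    @medoid_ids = [1, 4]
    @rng = Random.new(seed)
  end

  def marshal_dump
    { params: @params, medoid_ids: @medoid_ids, rng: @rng }
  end

  def marshal_load(obj)
    @params = obj[:params]
    @medoid_ids = obj[:medoid_ids]
    @rng = obj[:rng]
    nil
  end
end

original = TinyAnalyzer.new(7)
# Serialize to a byte string and rebuild an independent copy from it.
restored = Marshal.load(Marshal.dump(original))
```

Because the random generator is part of the dumped state, the restored copy continues the same random sequence as the original.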

#predict(x) ⇒ Numo::Int32

Predict cluster labels for samples.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The samples for which to predict cluster labels. If the metric is ‘precomputed’, x must hold the distances between samples and medoids (shape: [n_samples, n_clusters]).

Returns:

  • (Numo::Int32)

    (shape: [n_samples]) Predicted cluster label per sample.



# File 'lib/rumale/clustering/k_medoids.rb', line 91

def predict(x)
  check_sample_array(x)
  distance_mat = @params[:metric] == 'precomputed' ? x : Rumale::PairwiseMetric.euclidean_distance(x, @cluster_centers)
  if @params[:metric] == 'precomputed' && distance_mat.shape[1] != @medoid_ids.size
    raise ArgumentError, 'Expect the size input matrix to be n_samples-by-n_clusters.'
  end
  assign_cluster(distance_mat)
end
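
For the ‘euclidean’ metric, the listing above measures the distance from each sample to the stored cluster centers and assigns the nearest one. A plain-Ruby sketch of that rule, assuming samples and centers are arrays of coordinates, might look as follows; nearest_center_labels is a hypothetical helper, not a Rumale method.

```ruby
# Hypothetical helper: label each sample with the index of its nearest center
# under Euclidean distance.
def nearest_center_labels(samples, centers)
  samples.map do |s|
    centers.each_index.min_by do |k|
      Math.sqrt(s.zip(centers[k]).sum { |a, b| (a - b)**2 })
    end
  end
end

centers = [[0.0, 0.0], [10.0, 10.0]]
samples = [[1.0, -1.0], [9.0, 11.0], [4.0, 4.0]]
labels = nearest_center_labels(samples, centers)
```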