Class: Spark::Mllib::NaiveBayes

Inherits:
Object
Defined in:
lib/spark/mllib/classification/naive_bayes.rb

Class Method Summary

Class Method Details

.train(rdd, lambda = 1.0) ⇒ Object

Trains a Naive Bayes model given an RDD of (label, features) pairs.

This is Multinomial NB (tinyurl.com/lsdw6p), which can handle all kinds of discrete data. For example, after converting documents into TF-IDF vectors, it can be used for document classification. With every vector reduced to a 0-1 vector, it can also be used as Bernoulli NB (tinyurl.com/p7c96j6). The input feature values must be nonnegative.
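As a rough illustration of the 0-1 conversion mentioned above, the following plain-Ruby sketch maps raw counts to indicator values before they would be wrapped in LabeledPoint objects. The `binarize` helper is hypothetical and not part of the library.

```ruby
# Hypothetical helper (not a library method): turn nonnegative
# feature values into 0/1 indicators for Bernoulli-style use.
def binarize(features)
  features.map { |value| value > 0 ? 1.0 : 0.0 }
end

counts = [[3.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
binary = counts.map { |row| binarize(row) }
# => [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```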

Arguments:

rdd

RDD of LabeledPoint.

lambda

The additive smoothing parameter. Defaults to 1.0.
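To make the role of the smoothing parameter concrete, here is a minimal plain-Ruby sketch of the textbook multinomial Naive Bayes estimates with additive smoothing. It is not the library's implementation, which delegates training to the JVM side; `train_sketch`, `predict_sketch`, and the `smoothing` argument are hypothetical names, and the formulas are assumed to mirror the usual smoothed log-prior and log-likelihood estimates.

```ruby
# Sketch only: points is an array of [label, features] pairs with
# nonnegative feature values. Returns [labels, pi, theta], where pi holds
# log priors and theta holds log conditional probabilities per label.
def train_sketch(points, smoothing = 1.0)
  num_features = points.first[1].size
  grouped = points.group_by { |label, _| label }
  labels = grouped.keys.sort

  # Smoothed log prior for each label
  pi = labels.map do |label|
    Math.log((grouped[label].size + smoothing) /
             (points.size + labels.size * smoothing))
  end

  # Smoothed log probability of each feature given the label
  theta = labels.map do |label|
    sums = Array.new(num_features, 0.0)
    grouped[label].each do |_, features|
      features.each_with_index { |value, j| sums[j] += value }
    end
    total = sums.sum + num_features * smoothing
    sums.map { |s| Math.log((s + smoothing) / total) }
  end

  [labels, pi, theta]
end

# Pick the label with the highest log score: pi + theta . features
def predict_sketch(model, features)
  labels, pi, theta = model
  scores = labels.each_index.map do |i|
    pi[i] + theta[i].zip(features).sum { |t, x| t * x }
  end
  labels[scores.each_with_index.max[1]]
end

points = [
  [0.0, [0.0, 0.0]],
  [0.0, [0.0, 1.0]],
  [1.0, [1.0, 0.0]]
]
model = train_sketch(points)

predict_sketch(model, [0.0, 1.0]) # => 0.0
predict_sketch(model, [1.0, 0.0]) # => 1.0
```

With `smoothing` at 1.0, a feature never seen with a label still receives a nonzero probability, so its log value stays finite.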



# File 'lib/spark/mllib/classification/naive_bayes.rb', line 82

def self.train(rdd, lambda=1.0)
  # Validation
  first = rdd.first
  unless first.is_a?(LabeledPoint)
    raise Spark::MllibError, "RDD should contains LabeledPoint, got #{first.class}"
  end

  labels, pi, theta = Spark.jb.call(RubyMLLibAPI.new, 'trainNaiveBayesModel', rdd, lambda)
  theta = Spark::Mllib::Matrices.dense(theta.size, theta.first.size, theta)

  NaiveBayesModel.new(labels, pi, theta)
end