Class: Selector::BiNormalSeperation

Inherits:
Simple
  • Object
Includes:
BNS
Defined in:
lib/svm_helper/selectors/bi_normal_seperation.rb

Overview

Feature selector based on Bi-Normal Separation (BNS). See "Feature Selection for Text Classification" (HP Labs): http://www.google.com/patents/US20040059697
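
For orientation, here is a minimal sketch of the BNS score as it is commonly defined: the absolute difference between the inverse standard normal CDF of a word's true-positive rate and that of its false-positive rate. This is not the code in Selector::BNS. The helper names (normal_cdf, normal_cdf_inverse, bns_score), the bisection-based inverse, and the clamping bounds 0.0005 and 0.9995 are assumptions made for illustration.

```ruby
# Illustrative sketch only; not the Selector::BNS implementation.

# Standard normal CDF, expressed through the error function.
def normal_cdf(x)
  0.5 * (1.0 + Math.erf(x / Math.sqrt(2.0)))
end

# Inverse of normal_cdf, found by bisection (slow but dependency-free).
def normal_cdf_inverse(p, lo = -10.0, hi = 10.0)
  60.times do
    mid = (lo + hi) / 2.0
    if normal_cdf(mid) < p
      lo = mid
    else
      hi = mid
    end
  end
  (lo + hi) / 2.0
end

# Hypothetical helper: BNS score from document counts.
#   pos, neg           - number of positive / negative documents
#   pos_with, neg_with - how many of each contain the word
# Rates are clamped (assumed bounds) so the inverse CDF stays finite.
def bns_score(pos, neg, pos_with, neg_with, p_low = 0.0005, p_high = 0.9995)
  tpr = (pos_with.to_f / pos).clamp(p_low, p_high)
  fpr = (neg_with.to_f / neg).clamp(p_low, p_high)
  (normal_cdf_inverse(tpr) - normal_cdf_inverse(fpr)).abs
end

puts bns_score(100, 100, 80, 10) # about 2.12: the word separates the classes well
puts bns_score(100, 100, 50, 50) # 0.0: the word is equally common in both classes
```

A word that is equally frequent in both classes scores zero, while a word concentrated in one class scores high regardless of which class it favours, which is why the absolute value is taken.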

Direct Known Subclasses

BNS_IG, InformationGain

Constant Summary

Constants included from BNS

Selector::BNS::A, Selector::BNS::B, Selector::BNS::C, Selector::BNS::D, Selector::BNS::P_HIGH, Selector::BNS::P_LOW, Selector::BNS::SQR2, Selector::BNS::SQR2PI

Constants inherited from Simple

Simple::DEFAULT_DICTIONARY_SIZE

Constants included from ParallelHelper

ParallelHelper::THREAD_COUNT

Instance Attribute Summary

Attributes inherited from Simple

#classification_encoding, #global_dictionary, #gram_size, #word_selection

Instance Method Summary

Methods included from BNS

#bi_normal_seperation, #cdf, #cdf_inverse

Methods inherited from Simple

#extract_words_from_data, #generate_vector, #reset

Methods included from ParallelHelper

#p_map, #p_map_with_index, #parallel?

Constructor Details

#initialize(classification, args = {}) ⇒ BiNormalSeperation

Returns a new instance of BiNormalSeperation.



# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 14

def initialize classification, args={}
  super
  @word_selection = args.fetch(:word_selection){ :grams1_2 }
end

Instance Method Details

#build_dictionary(data_set, dictionary_size = DEFAULT_DICTIONARY_SIZE) ⇒ Object



# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 70

def build_dictionary data_set, dictionary_size=DEFAULT_DICTIONARY_SIZE
  words_per_data = extract_words data_set, true
  generate_global_dictionary words_per_data, dictionary_size
end

#extract_words(data_set, keep_label = false) ⇒ Array<OpenStruct<Array<String>,Boolean>>

Extracts the words of every provided data entry.

Parameters:

  • data_set (Array<PreprocessedData>)

    list of preprocessed data

  • keep_label (Boolean) (defaults to: false)

    passed through to #extract_words_from_data for each entry

Returns:

  • (Array<OpenStruct<Array<String>,Boolean>>)

    list of words per data entry



# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 80

def extract_words data_set, keep_label=false
  data_set.map do |data|
    extract_words_from_data data, keep_label
  end
end

#generate_global_dictionary(all_words, size = DEFAULT_DICTIONARY_SIZE) ⇒ Array<String>

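Generates the list of words used as the dictionary. Each word is counted once per data entry and class, scored with #bi_normal_seperation, and the size words with the highest absolute score are kept. Words that appear in only one class are skipped. Does nothing if a global dictionary already exists.

Parameters:

  • all_words

    list of word bags, each responding to features and label, as produced by #extract_words with keep_label set

  • size (defaults to: DEFAULT_DICTIONARY_SIZE)

    dictionary size

Returns:

  • (Array<String>)

    list of words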



# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 42

def generate_global_dictionary all_words, size=DEFAULT_DICTIONARY_SIZE
  return unless global_dictionary.empty?

  label_counts = [0,0]
  features = all_words.reduce(Hash.new { |h, k| h[k] = [0,0] }) do |accumulator, bag|
    label = bag.label ? 1 : 0
    label_counts[label] += 1
    # only count a feature once per bag
    bag.features.uniq.each do |word|
      unless accumulator.has_key?(word)
        accumulator[word] = [0,0]
      end
      accumulator[word][label] += 1
    end
    accumulator
  end
  neg, pos = label_counts
  words = p_map(features) do |word, counts|
            next if counts.any? { |e| e==0 } # skip words only appearing in one class
            bns = bi_normal_seperation(pos, neg, *counts)
            [word, bns.abs]
          end
  @global_dictionary = words.compact
                            .sort_by{|e| e[1]}
                            .last(size)
                            .map{|e| e[0] }
end
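
The counting and selection steps in the listing above can be tried in isolation. The sketch below is a simplified, sequential restatement on toy data, not the library code: the Bag struct stands in for the OpenStruct entries, the helper names document_frequencies and top_words are hypothetical, and the scores passed to top_words are made-up numbers rather than real BNS values.

```ruby
# Illustrative sketch only; sequential, no p_map, toy data.
Bag = Struct.new(:features, :label) # stand-in for the OpenStruct entries

# Counts, per word, how many negative and positive bags contain it.
# Returns [label_counts, doc_freq], both indexed as [negative, positive].
def document_frequencies(bags)
  label_counts = [0, 0]
  doc_freq = Hash.new { |hash, word| hash[word] = [0, 0] }
  bags.each do |bag|
    label = bag.label ? 1 : 0
    label_counts[label] += 1
    # uniq: a word counts once per bag, however often it repeats
    bag.features.uniq.each { |word| doc_freq[word][label] += 1 }
  end
  [label_counts, doc_freq]
end

# Keeps the `size` highest-scoring words from [word, score] pairs.
def top_words(scored, size)
  scored.sort_by { |_word, score| score }.last(size).map(&:first)
end

bags = [
  Bag.new(%w[cheap pills cheap], true),
  Bag.new(%w[cheap meeting],     true),
  Bag.new(%w[meeting agenda],    false)
]

label_counts, doc_freq = document_frequencies(bags)
p label_counts # => [1, 2]

# Words with a zero count in either class are dropped before scoring.
candidates = doc_freq.reject { |_word, counts| counts.include?(0) }
p candidates   # => {"meeting"=>[1, 1]}

p top_words([["a", 0.3], ["b", 1.7], ["c", 0.9]], 2) # => ["c", "b"]
```

Because sort_by orders ascending and last(size) takes the tail, the selected words come out from lower to higher score.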

#generate_vectors(data_set, dictionary_size = DEFAULT_DICTIONARY_SIZE) ⇒ Array<FeatureVector>

Generates a list of feature vectors and their labels from preprocessed data.

Parameters:

  • data_set (Array<PreprocessedData>)

    list of preprocessed data

  • classification (Symbol)

    in :industry, :function, :career_level

  • dictionary_size (Integer) (defaults to: DEFAULT_DICTIONARY_SIZE)

    size of the dictionary to create if none exists

Returns:

  • (Array<FeatureVector>)

    list of feature vectors and their labels



# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 25

def generate_vectors data_set, dictionary_size=DEFAULT_DICTIONARY_SIZE
  words_and_label_per_data = extract_words data_set, true
  generate_global_dictionary words_and_label_per_data, dictionary_size

  words_per_data = words_and_label_per_data.map(&:features)
  p_map_with_index(words_per_data) do |words,index|
    word_set = words.uniq
    make_vector word_set, data_set[index]
  end
end
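
The listing above passes each entry's unique words to make_vector, which is not shown on this page. As a rough illustration of what encoding a word set against a dictionary can look like, the sketch below builds a binary presence vector. The helper presence_vector is hypothetical, and the binary encoding is an assumption; the library's make_vector may behave differently.

```ruby
# Illustrative sketch only; presence_vector is a hypothetical stand-in,
# not the library's make_vector.

# Returns one entry per dictionary word: 1 if the word occurs, else 0.
def presence_vector(dictionary, words)
  present = words.uniq
  dictionary.map { |entry| present.include?(entry) ? 1 : 0 }
end

dictionary = %w[cheap meeting pills]
p presence_vector(dictionary, %w[cheap cheap pills]) # => [1, 0, 1]
```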

#label ⇒ Object



# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 10

def label
  "BiNormalSeperation"
end