Class: Selector::BiNormalSeperation
- Includes: BNS
- Defined in: lib/svm_helper/selectors/bi_normal_seperation.rb
Overview
Selects features for text classification using the Bi-Normal Separation (BNS) metric. Reference: Feature Selection for Text Classification - HP Labs, http://www.google.com/patents/US20040059697
Direct Known Subclasses
Constant Summary
Constants included from BNS
Selector::BNS::A, Selector::BNS::B, Selector::BNS::C, Selector::BNS::D, Selector::BNS::P_HIGH, Selector::BNS::P_LOW, Selector::BNS::SQR2, Selector::BNS::SQR2PI
Constants inherited from Simple
Simple::DEFAULT_DICTIONARY_SIZE
Constants included from ParallelHelper
Instance Attribute Summary
Attributes inherited from Simple
#classification_encoding, #global_dictionary, #gram_size, #word_selection
Instance Method Summary
- #build_dictionary(data_set, dictionary_size = DEFAULT_DICTIONARY_SIZE) ⇒ Object
- #extract_words(data_set, keep_label = false) ⇒ Array<OpenStruct<Array<String>,Boolean>>
  Extracts the words of all provided data entries.
- #generate_global_dictionary(all_words, size = DEFAULT_DICTIONARY_SIZE) ⇒ Array<String>
  Generates a list of words used as dictionary.
- #generate_vectors(data_set, dictionary_size = DEFAULT_DICTIONARY_SIZE) ⇒ Array<FeatureVector>
  Generates a list of feature vectors and their labels from preprocessed data.
- #initialize(classification, args = {}) ⇒ BiNormalSeperation (constructor)
  A new instance of BiNormalSeperation.
- #label ⇒ Object
Methods included from BNS
#bi_normal_seperation, #cdf, #cdf_inverse
Methods inherited from Simple
#extract_words_from_data, #generate_vector, #reset
Methods included from ParallelHelper
#p_map, #p_map_with_index, #parallel?
Constructor Details
#initialize(classification, args = {}) ⇒ BiNormalSeperation
Returns a new instance of BiNormalSeperation.
# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 14
def initialize classification, args={}
  super
  @word_selection = args.fetch(:word_selection){ :grams1_2 }
end
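The constructor relies on `Hash#fetch` with a block to supply the default word selection. The sketch below isolates that idiom; `pick_word_selection` and the `:single` value are hypothetical names used only for illustration, while `:word_selection` and `:grams1_2` come from the listing above.

```ruby
# Hypothetical helper that mirrors the default handling in #initialize.
def pick_word_selection(args = {})
  # The block runs only when the :word_selection key is absent.
  args.fetch(:word_selection) { :grams1_2 }
end

p pick_word_selection                           # key absent, block supplies the default
p pick_word_selection(word_selection: :single)  # key present, caller's value is used
p pick_word_selection(word_selection: nil)      # an explicit nil counts as present
```

Note that an explicit `nil` is not replaced by the default, which differs from the `args[:word_selection] || :grams1_2` idiom.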
Instance Method Details
#build_dictionary(data_set, dictionary_size = DEFAULT_DICTIONARY_SIZE) ⇒ Object
# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 70
def build_dictionary data_set, dictionary_size=DEFAULT_DICTIONARY_SIZE
  words_per_data = extract_words data_set, true
  generate_global_dictionary words_per_data, dictionary_size
end
#extract_words(data_set, keep_label = false) ⇒ Array<OpenStruct<Array<String>,Boolean>>
Extracts the words of all provided data entries.
# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 80
def extract_words data_set, keep_label=false
  data_set.map do |data|
    extract_words_from_data data, keep_label
  end
end
#generate_global_dictionary(all_words, size = DEFAULT_DICTIONARY_SIZE) ⇒ Array<String>
Generates the list of words used as the dictionary. Does nothing if the global dictionary is already populated. Words that appear in only one class are skipped; the remaining words are ranked by absolute BNS score and the top `size` words are kept.
# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 42
def generate_global_dictionary all_words, size=DEFAULT_DICTIONARY_SIZE
  return unless global_dictionary.empty?

  label_counts = [0,0]
  features = all_words.reduce(Hash.new { |h, k| h[k] = [0,0] }) do |accumulator, bag|
    label = bag.label ? 1 : 0
    label_counts[label] += 1
    # only count a feature once per bag
    bag.features.uniq.each do |word|
      unless accumulator.has_key?(word)
        accumulator[word] = [0,0]
      end
      accumulator[word][label] += 1
    end
    accumulator
  end
  neg, pos = label_counts
  words = p_map(features) do |word, counts|
    next if counts.any? { |e| e==0 } # skip words only appearing in one class
    bns = bi_normal_seperation(pos, neg, *counts)
    [word, bns.abs]
  end
  @global_dictionary = words.compact
                            .sort_by{|e| e[1]}
                            .last(size)
                            .map{|e| e[0] }
end
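The ranking performed by the method above can be sketched without the rest of the library. This is a minimal stand-alone sketch, not the library's implementation: the `Bag` struct stands in for the OpenStruct entries, `normal_cdf_inverse` uses plain bisection because the BNS module's own `#cdf_inverse` is not shown on this page, and the clamping bound of 0.0005 is an assumption based on the usual description of BNS. All names other than `label` and `features` are hypothetical.

```ruby
# Hypothetical stand-in for the OpenStruct entries returned by #extract_words.
Bag = Struct.new(:features, :label)

# Standard normal CDF, via the error function from Ruby's Math module.
def normal_cdf(x)
  0.5 * (1.0 + Math.erf(x / Math.sqrt(2.0)))
end

# Inverse CDF by bisection (assumption: the library likely uses a
# closed-form approximation instead; bisection is slower but simple).
def normal_cdf_inverse(p)
  lo = -10.0
  hi = 10.0
  60.times do
    mid = (lo + hi) / 2.0
    if normal_cdf(mid) < p
      lo = mid
    else
      hi = mid
    end
  end
  (lo + hi) / 2.0
end

# BNS score |F^-1(tpr) - F^-1(fpr)|. Rates are clamped away from 0 and 1
# so the inverse CDF stays finite (the 0.0005 bound is an assumption).
def bns_score(pos, neg, tp, fp, eps = 0.0005)
  tpr = (tp.to_f / pos).clamp(eps, 1.0 - eps)
  fpr = (fp.to_f / neg).clamp(eps, 1.0 - eps)
  (normal_cdf_inverse(tpr) - normal_cdf_inverse(fpr)).abs
end

# Counts per-class document frequencies, drops words seen in only one
# class, and keeps the `size` words with the highest BNS score.
def top_words_by_bns(bags, size)
  label_counts = [0, 0] # [negative documents, positive documents]
  doc_freq = Hash.new { |h, k| h[k] = [0, 0] }
  bags.each do |bag|
    label = bag.label ? 1 : 0
    label_counts[label] += 1
    # Count each word at most once per document.
    bag.features.uniq.each { |word| doc_freq[word][label] += 1 }
  end
  neg, pos = label_counts
  scored = doc_freq.reject { |_, counts| counts.include?(0) }
                   .map { |word, (fp, tp)| [word, bns_score(pos, neg, tp, fp)] }
  # Ascending sort, so the best-scoring word comes last.
  scored.sort_by { |_, score| score }.last(size).map(&:first)
end

sample_bags = [
  Bag.new(%w[cheap offer pills], true),
  Bag.new(%w[cheap offer], true),
  Bag.new(%w[cheap offer meeting], true),
  Bag.new(%w[cheap pills], true),
  Bag.new(%w[meeting agenda cheap], false),
  Bag.new(%w[meeting offer], false),
  Bag.new(%w[meeting offer agenda], false),
  Bag.new(%w[agenda], false)
]
p top_words_by_bns(sample_bags, 2)
```

As in the listing, the result is ordered from lowest to highest score, and words such as "pills" or "agenda" that occur in only one class never enter the dictionary.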
#generate_vectors(data_set, dictionary_size = DEFAULT_DICTIONARY_SIZE) ⇒ Array<FeatureVector>
Generates a list of feature vectors and their labels from preprocessed data.
# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 25
def generate_vectors data_set, dictionary_size=DEFAULT_DICTIONARY_SIZE
  words_and_label_per_data = extract_words data_set, true
  generate_global_dictionary words_and_label_per_data, dictionary_size

  words_per_data = words_and_label_per_data.map(&:features)
  p_map_with_index(words_per_data) do |words,index|
    word_set = words.uniq
    make_vector word_set, data_set[index]
  end
end
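The listing above passes each document's unique words to `make_vector`, which is not shown on this page. One plausible reading, given the call to `uniq`, is a binary presence encoding against the global dictionary. The sketch below illustrates that idea only; `presence_vector` is a hypothetical helper, and the real `make_vector` may add further components such as the label or a classification encoding.

```ruby
require 'set'

# Hypothetical helper: one slot per dictionary word, 1 if the document
# contains that word and 0 otherwise. Repeated words have no extra effect.
def presence_vector(dictionary, words)
  word_set = words.to_set
  dictionary.map { |entry| word_set.include?(entry) ? 1 : 0 }
end

dictionary = %w[meeting cheap]
p presence_vector(dictionary, %w[cheap pills cheap])
p presence_vector(dictionary, %w[agenda])
```

Words outside the dictionary are ignored, so every vector has the same length as the dictionary regardless of document size.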
#label ⇒ Object
# File 'lib/svm_helper/selectors/bi_normal_seperation.rb', line 10
def label
  "BiNormalSeperation"
end