Module: Consistency

Included in:
Discretizer, FSelector::INTERACT, FSelector::LasVegasFilter, FSelector::LasVegasIncremental
Defined in:
lib/fselector/consistency.rb

Overview

functions for measuring data consistency

Instance Method Summary

Instance Method Details

#get_instance_count(my_data = nil) ⇒ Hash

Note:

intended for multiple calculations: checking the data inconsistency rate against the resulting Hash table is efficient because it avoids rebuilding the data structure and recounting instances. For instance, to evaluate a subset of features you only rebuild the Hash keys and merge the relevant counts

get the counts of each (unique) instance (without class label)
for each class; the resulting Hash table, as suggested by Zheng Zhao and Huan Liu, looks like:

{
  'f1:v1|f2:v2|...|fn:vn|' => {k1=>c1, k2=>c2, ..., kn=>cn},
  ...
}

where the (sorted) features and their values are used to construct
the key for the Hash table, i.e., v_i is the value of feature f_i.
Note that the symbol : separates a feature from its value, and the
symbol | terminates each feature-value pair, so neither symbol
may appear in any feature name or value. If they do, please
replace them with other symbols in advance. c_i is the
instance count for class k_i

ref: Searching for Interacting Features
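
A minimal standalone sketch of the key construction described above. The toy data and the helper name count_instances are hypothetical; the data layout (class label => Array of samples, each sample a Hash of feature => value) is inferred from how the method iterates over my_data.

```ruby
# Hypothetical toy data: class label => array of samples,
# each sample being a Hash of feature => value.
toy_data = {
  :pos => [{ :f1 => 1, :f2 => 0 }, { :f2 => 0, :f1 => 1 }],
  :neg => [{ :f1 => 1, :f2 => 0 }, { :f1 => 0, :f2 => 1 }]
}

# Standalone helper (hypothetical name) mirroring the counting logic
def count_instances(data)
  inst_cnt = {}
  data.each do |label, samples|
    samples.each do |s|
      # sorted feature names make the key independent of insertion order
      key = s.keys.sort.collect { |f| "#{f}:#{s[f]}|" }.join
      inst_cnt[key] ||= Hash.new(0)
      inst_cnt[key][label] += 1
    end
  end
  inst_cnt
end

toy_counts = count_instances(toy_data)
# toy_counts == {
#   "f1:1|f2:0|" => { :pos => 2, :neg => 1 },
#   "f1:0|f2:1|" => { :neg => 1 }
# }
```

The second :pos sample lists its features in a different order but still lands on the key "f1:1|f2:0|", which is why the features are sorted before the key is built.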

Parameters:

  • my_data (Hash) (defaults to: nil)

    data of interest, use internal data by default

Returns:

  • (Hash)

    counts of each (unique) instance for each class



# File 'lib/fselector/consistency.rb', line 32

def get_instance_count(my_data=nil)
  my_data ||= get_data # use internal data by default
  inst_cnt = {}
  
  my_data.each do |k, ss|
    ss.each do |s|
      # sorting the feature names ensures identical instances get the same key
      # : separates a feature and its value
      # | separates a feature-value pair
      key = s.keys.sort.collect { |f| "#{f}:#{s[f]}|"}.join
      inst_cnt[key] ||= Hash.new(0)
      inst_cnt[key][k] += 1 # for key in class k
    end
  end
  
  inst_cnt
end

#get_IR(my_data = nil) ⇒ Float

get data inconsistency rate, suitable for a one-off calculation

Parameters:

  • my_data (Hash) (defaults to: nil)

    data of interest, use internal data by default

Returns:

  • (Float)

    data inconsistency rate



# File 'lib/fselector/consistency.rb', line 108

def get_IR(my_data=nil)
  my_data ||= get_data # use internal data by default
  inst_cnt = get_instance_count(my_data)
  ir = get_IR_by_count(inst_cnt)
  
  # inconsistency rate
  ir
end

#get_IR_by_count(inst_cnt) ⇒ Float

get data inconsistency rate based on the instance count in Hash table

Parameters:

  • inst_cnt (Hash)

    the counts of each (unique) instance (without class label) for each class

Returns:

  • (Float)

    data inconsistency rate



# File 'lib/fselector/consistency.rb', line 58

def get_IR_by_count(inst_cnt)    
  incon, sample_size = 0.0, 0.0
  
  inst_cnt.values.each do |hcnt|
    cnt = hcnt.values
    incon += cnt.sum-cnt.max
    sample_size += cnt.sum
  end
  
  # inconsistency rate
  (sample_size.zero?) ? 0.0 : incon/sample_size
end
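
As a worked illustration of the calculation above: for each unique instance, every occurrence outside its majority class counts as inconsistent, and the rate is the total of those divided by the total number of instances. The sketch below is standalone; the count table and the helper name inconsistency_rate are hypothetical.

```ruby
# Hypothetical instance-count table in the format produced by get_instance_count
sample_counts = {
  "f1:1|f2:0|" => { :pos => 2, :neg => 1 }, # 3 instances, majority 2 => 1 inconsistent
  "f1:0|f2:1|" => { :neg => 1 }             # 1 instance,  majority 1 => 0 inconsistent
}

# Standalone helper (hypothetical name) following the same steps
def inconsistency_rate(inst_cnt)
  incon = 0.0
  total = 0.0
  inst_cnt.each_value do |hcnt|
    cnt = hcnt.values
    n = cnt.inject(0) { |a, b| a + b } # instances sharing this key
    incon += n - cnt.max               # everything outside the majority class
    total += n
  end
  total.zero? ? 0.0 : incon / total
end

rate = inconsistency_rate(sample_counts) # (1 + 0) / (3 + 1) = 0.25
```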

#get_IR_by_feature(inst_cnt, feats) ⇒ Float

get data inconsistency rate for given features

Parameters:

  • inst_cnt (Hash)

    source Hash table of instance count

  • feats (Array)

    consider only these features

Returns:

  • (Float)

    data inconsistency rate



# File 'lib/fselector/consistency.rb', line 79

def get_IR_by_feature(inst_cnt, feats)
  return 0.0 if feats.empty?
  
  # build new inst_count for feats
  inst_cnt_new = {}
  
  inst_cnt.each do |key, hcnt|
    key_new = feats.sort.collect { |f|
      match_data = key.match(/#{f}:.*?\|/)
      match_data[0] if match_data
    }.compact.join # remove nil entry and join
    next if key_new.empty?
    
    hcnt_new = inst_cnt_new[key_new] || Hash.new(0)
    # merge cnts
    inst_cnt_new[key_new] = hcnt_new.merge(hcnt) { |kk, v1, v2| v1+v2 }
  end
  
  # inconsistency rate
  get_IR_by_count(inst_cnt_new)
end
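
A sketch of the key-rebuilding step, assuming a hypothetical count table and a hypothetical helper name project_counts. It is a variation rather than the library's code: the pattern here escapes the feature name and anchors it to the start of the key or to a preceding |, so that a feature such as f1 cannot match inside a longer name ending in f1.

```ruby
# Hypothetical count table over two features; fully consistent as given
full_counts = {
  "f1:1|f2:0|" => { :pos => 2 },
  "f1:1|f2:1|" => { :neg => 1 },
  "f1:0|f2:1|" => { :neg => 1 }
}

# Standalone helper (hypothetical name): keep only the pairs for feats
# and merge the class counts of keys that become identical.
def project_counts(inst_cnt, feats)
  projected = {}
  inst_cnt.each do |key, hcnt|
    key_new = feats.sort.collect { |f|
      m = key.match(/(?:\A|\|)(#{Regexp.escape(f.to_s)}:.*?\|)/)
      m[1] if m
    }.compact.join
    next if key_new.empty?

    projected[key_new] ||= Hash.new(0)
    hcnt.each { |label, c| projected[key_new][label] += c }
  end
  projected
end

by_f1 = project_counts(full_counts, [:f1])
# by_f1 == { "f1:1|" => { :pos => 2, :neg => 1 }, "f1:0|" => { :neg => 1 } }
```

With both features every key belongs to a single class, but after projecting onto f1 alone the key "f1:1|" mixes :pos and :neg, so one of the four instances becomes inconsistent. No pass over the original data is needed, only the existing count table.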