Class: Ai4r::Data::DataSet
- Inherits:
- Object
- Defined in:
- lib/ai4r/data/data_set.rb
Overview
A data set is a collection of N data items. Each data item is described by a set of attributes, represented as an array. Optionally, you can assign a label to the attributes, using the data_labels property.
Constant Summary
- @@number_regex = /(((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b)/
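The pattern carries no start or end anchors, so it appears to match any cell that contains a number-like token rather than only cells that are entirely numeric. A minimal sketch of that behaviour, with the regex copied into a local constant (`NUMBER_REGEX` and `numeric_like?` are names chosen here for illustration, not part of ai4r):

```ruby
# Local copy of the pattern shown above; NUMBER_REGEX is an illustrative name.
NUMBER_REGEX = /(((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b)/

# Hypothetical helper: true when the cell contains a number-like token.
def numeric_like?(cell)
  !cell.match(NUMBER_REGEX).nil?
end

numeric_like?("3.14")     # plain decimal
numeric_like?("1e5")      # exponent notation
numeric_like?("Chicago")  # no digits, so no match
numeric_like?("<30")      # matches the embedded "30"
```

Cells such as "<30" therefore count as numeric for this pattern, which matters for methods that use it to decide whether to convert a value.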
Instance Attribute Summary
- #data_items ⇒ Object (readonly)
  Returns the value of attribute data_items.
- #data_labels ⇒ Object (readonly)
  Returns the value of attribute data_labels.
Instance Method Summary
- #<<(data_item) ⇒ Object
  Add a data item to the data set.
- #[](index) ⇒ Object
  Retrieve a new DataSet, with the item(s) selected by the provided index.
- #build_domain(attr) ⇒ Object
  Returns the domain of an attribute: a Set instance containing all possible values for a nominal attribute, or a [min, max] array for a numeric one. The parameter can be an attribute label or index (0-based).
- #build_domains ⇒ Object
  Returns an array with the domain of each attribute: a Set instance containing all possible values for nominal attributes, and an array with min and max values for numeric attributes (i.e. [min, max]).
- #check_not_empty ⇒ Object
  Raise an exception if there is no data item.
- #get_index(attr) ⇒ Object
  Returns the index of a given attribute (0-based).
- #get_mean_or_mode ⇒ Object
  Returns an array with the mean value of numeric attributes, and the most frequent value of non-numeric attributes.
- #initialize(options = {}) ⇒ DataSet (constructor)
  Create a new DataSet.
- #load_csv(filepath) ⇒ Object
  Load data items from a CSV file.
- #load_csv_with_labels(filepath) ⇒ Object
  Load data items from a CSV file. The first row is used as data labels.
- #num_attributes ⇒ Object
  Returns the number of attributes, including the class attribute.
- #open_csv_file(filepath, &block) ⇒ Object
  Opens a CSV file and reads it line by line. For each line, the block is called with the row as its argument. Works on both Ruby 1.8 and 1.9.
- #parse_csv(filepath) ⇒ Object
  Same as load_csv, but it will try to convert cell contents to numbers.
- #set_data_items(items) ⇒ Object
  Set the data items.
- #set_data_labels(labels) ⇒ Object
  Set data labels.
Constructor Details
#initialize(options = {}) ⇒ DataSet
Create a new DataSet. By default, it is empty. Optionally, you can provide the initial data items and data labels.
e.g. DataSet.new(:data_items => data_items, :data_labels => labels)
If you provide data items but no data labels, the data set will use the default data label values (see set_data_labels).
# File 'lib/ai4r/data/data_set.rb', line 34
def initialize(options = {})
  @data_labels = []
  @data_items = options[:data_items] || []
  set_data_labels(options[:data_labels]) if options[:data_labels]
  set_data_items(options[:data_items]) if options[:data_items]
end
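The order of the two calls in the constructor matters: labels are applied before items, so explicit labels are never replaced by generated defaults. The sketch below reproduces that ordering in a stand-in class so it runs without the gem. `SketchDataSet` and `default_labels` are hypothetical names, and the default naming rule is an assumption inferred from the default labels documented under set_data_labels.

```ruby
# SketchDataSet is a hypothetical stand-in, not the ai4r class.
class SketchDataSet
  attr_reader :data_items, :data_labels

  def initialize(options = {})
    @data_labels = []
    @data_items = []
    # Labels first, so explicit labels are never replaced by defaults.
    set_data_labels(options[:data_labels]) if options[:data_labels]
    set_data_items(options[:data_items]) if options[:data_items]
  end

  def set_data_labels(labels)
    @data_labels = labels
    self
  end

  def set_data_items(items)
    @data_labels = default_labels(items) if @data_labels.empty?
    @data_items = items
    self
  end

  private

  # Assumed naming rule: attribute_1..attribute_(N-1), then class_value.
  def default_labels(items)
    count = items.first.size
    (1...count).map { |i| "attribute_#{i}" } << "class_value"
  end
end

items = [["New York", "<30", "M", "Y"], ["Chicago", ">80", "F", "Y"]]
unlabeled = SketchDataSet.new(:data_items => items)
labeled = SketchDataSet.new(
  :data_items => items,
  :data_labels => ["city", "age_range", "gender", "marketing_target"]
)
```

Here `unlabeled.data_labels` holds the generated names, while `labeled.data_labels` keeps the four labels that were passed in.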
Instance Attribute Details
#data_items ⇒ Object (readonly)
Returns the value of attribute data_items.
# File 'lib/ai4r/data/data_set.rb', line 25
def data_items
  @data_items
end
#data_labels ⇒ Object (readonly)
Returns the value of attribute data_labels.
# File 'lib/ai4r/data/data_set.rb', line 25
def data_labels
  @data_labels
end
Instance Method Details
#<<(data_item) ⇒ Object
Add a data item to the data set.
# File 'lib/ai4r/data/data_set.rb', line 196
def << data_item
  if data_item.nil? || !data_item.is_a?(Enumerable) || data_item.empty?
    raise ArgumentError, "Data must not be an non empty array."
  elsif @data_items.empty?
    set_data_items([data_item])
  elsif data_item.length != num_attributes
    raise ArgumentError, "Number of attributes do not match. " +
          "#{data_item.length} attributes provided, " +
          "#{num_attributes} attributes expected."
  else
    @data_items << data_item
  end
end
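The checks in the listing can be summarised as: reject anything that is not a non-empty enumerable, let the first item fix the row width, and reject later items of a different width. A minimal sketch on a plain array, assuming a hypothetical helper `append_item` (not part of ai4r) with shortened error messages:

```ruby
# append_item is a hypothetical helper mirroring the checks in the listing.
def append_item(items, item)
  if item.nil? || !item.is_a?(Enumerable) || item.empty?
    raise ArgumentError, "Data item must be a non-empty array."
  elsif !items.empty? && item.length != items.first.size
    raise ArgumentError,
          "#{item.length} attributes provided, #{items.first.size} expected."
  end
  items << item
end

items = []
append_item(items, ["New York", "<30", "M", "Y"]) # first item fixes the width

rejected = nil
begin
  append_item(items, ["Chicago", "<30"])          # wrong width, raises
rescue ArgumentError => e
  rejected = e.message
end
```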
#[](index) ⇒ Object
Retrieve a new DataSet, with the item(s) selected by the provided index. You can specify an index range, too.
# File 'lib/ai4r/data/data_set.rb', line 43
def [](index)
  selected_items = (index.is_a?(Fixnum)) ?
          [@data_items[index]] : @data_items[index]
  return DataSet.new(:data_items => selected_items,
                     :data_labels => @data_labels)
end
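The listing wraps a single indexed row in an array so that an integer index and a range both produce a list of rows. The sketch below shows that distinction on a plain array. `select_items` is a hypothetical helper, and it tests against `Integer` because `Fixnum` was merged into `Integer` and no longer exists on current Ruby versions.

```ruby
# select_items is a hypothetical helper, not part of ai4r.
def select_items(items, index)
  # Integer index: wrap the single row so the result is still a list of rows.
  index.is_a?(Integer) ? [items[index]] : items[index]
end

rows = [["New York", "M"], ["Chicago", "F"], ["Chicago", "M"]]
single = select_items(rows, 1)    # one row, wrapped in an array
slice  = select_items(rows, 0..1) # the first two rows
```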
#build_domain(attr) ⇒ Object
Returns the domain of an attribute. The parameter can be an attribute label or index (0-based). The result is one of:
- Set instance containing all possible values for nominal attributes
- Array with min and max values for numeric attributes (i.e. [min, max])
build_domain("city")
=> #<Set: {"New York", "Chicago"}>
build_domain("age")
=> [5, 85]
build_domain(2) # In this example, the third attribute is gender
=> #<Set: {"M", "F"}>
# File 'lib/ai4r/data/data_set.rb', line 166
def build_domain(attr)
  index = get_index(attr)
  if @data_items.first[index].is_a?(Numeric)
    return [Statistics.min(self, index), Statistics.max(self, index)]
  else
    return @data_items.inject(Set.new){|domain, x| domain << x[index]}
  end
end
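As the listing shows, the type of the value in the first row decides whether an attribute is treated as numeric. A minimal sketch of the same rule on plain arrays, assuming a hypothetical `domain_of` helper; it uses `Array#minmax` in place of the library's `Statistics.min` and `Statistics.max`:

```ruby
require 'set'

# domain_of is a hypothetical stand-in for build_domain on plain arrays.
def domain_of(items, index)
  column = items.map { |row| row[index] }
  # The first row decides whether the attribute is treated as numeric.
  if column.first.is_a?(Numeric)
    column.minmax
  else
    Set.new(column)
  end
end

people = [["New York", 5, "M"], ["Chicago", 85, "F"], ["Chicago", 40, "F"]]
city_domain = domain_of(people, 0) # nominal attribute
age_domain  = domain_of(people, 1) # numeric attribute
```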
#build_domains ⇒ Object
Returns an array with the domain of each attribute:
- Set instance containing all possible values for nominal attributes
- Array with min and max values for numeric attributes (i.e. [min, max])
Return example:
=> [#<Set: {"New York", "Chicago"}>,
#<Set: {"<30", "[30-50)", "[50-80]", ">80"}>,
#<Set: {"M", "F"}>,
[5, 85],
#<Set: {"Y", "N"}>]
# File 'lib/ai4r/data/data_set.rb', line 149
def build_domains
  @data_labels.collect {|attr_label| build_domain(attr_label) }
end
#check_not_empty ⇒ Object
Raise an exception if there is no data item.
# File 'lib/ai4r/data/data_set.rb', line 189
def check_not_empty
  if @data_items.empty?
    raise ArgumentError, "Examples data set must not be empty."
  end
end
#get_index(attr) ⇒ Object
Returns the index of a given attribute (0-based). For example, if “gender” is the third attribute, then:
get_index("gender")
=> 2
# File 'lib/ai4r/data/data_set.rb', line 184
def get_index(attr)
  return (attr.is_a?(Fixnum) || attr.is_a?(Range)) ? attr : @data_labels.index(attr)
end
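Integers and ranges pass through unchanged, and anything else is looked up in the label list. Since `Array#index` returns nil when nothing matches, an unknown label appears to yield nil rather than an error. A sketch with a hypothetical `attribute_index` helper, using `Integer` where the listing uses `Fixnum`:

```ruby
# attribute_index is a hypothetical helper mirroring get_index.
def attribute_index(labels, attr)
  (attr.is_a?(Integer) || attr.is_a?(Range)) ? attr : labels.index(attr)
end

labels = ["city", "age_range", "gender", "marketing_target"]
by_label = attribute_index(labels, "gender")  # looked up in the label list
by_index = attribute_index(labels, 2)         # passed through unchanged
missing  = attribute_index(labels, "unknown") # no such label
```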
#get_mean_or_mode ⇒ Object
Returns an array with the mean value of numeric attributes, and the most frequent value of non-numeric attributes.
# File 'lib/ai4r/data/data_set.rb', line 212
def get_mean_or_mode
  mean = []
  num_attributes.times do |i|
    mean[i] =
      if @data_items.first[i].is_a?(Numeric)
        Statistics.mean(self, i)
      else
        Statistics.mode(self, i)
      end
  end
  return mean
end
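The per-column rule can be sketched without the `Statistics` module. The `mean_or_mode` helper below is hypothetical, and its tie-breaking for the mode (first value to reach the highest count) is an assumption that may differ from the library's behaviour.

```ruby
# mean_or_mode is a hypothetical stand-in working on plain arrays.
def mean_or_mode(items)
  items.first.each_index.map do |i|
    column = items.map { |row| row[i] }
    if column.first.is_a?(Numeric)
      column.sum.to_f / column.size                       # arithmetic mean
    else
      column.tally.max_by { |_value, count| count }.first # most frequent value
    end
  end
end

rows = [["Chicago", 10], ["New York", 20], ["Chicago", 30]]
summary = mean_or_mode(rows)
```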
#load_csv(filepath) ⇒ Object
Load data items from a CSV file.
# File 'lib/ai4r/data/data_set.rb', line 51
def load_csv(filepath)
  items = []
  open_csv_file(filepath) do |entry|
    items << entry
  end
  set_data_items(items)
end
#load_csv_with_labels(filepath) ⇒ Object
Load data items from a CSV file. The first row is used as data labels.
# File 'lib/ai4r/data/data_set.rb', line 75
def load_csv_with_labels(filepath)
  load_csv(filepath)
  @data_labels = @data_items.shift
  return self
end
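The header handling relies on `Array#shift`, which removes the first row and returns it. A minimal sketch using Ruby's standard `csv` library on an in-memory string instead of a file (the sample text is made up for illustration):

```ruby
require 'csv'

# Illustrative CSV content; the first line plays the role of the label row.
csv_text = "city,age_range,gender\nNew York,<30,M\nChicago,>80,F\n"

rows = CSV.parse(csv_text) # array of rows, each an array of strings
labels = rows.shift        # removes and returns the header row
```

After the shift, `rows` holds only the two data rows and `labels` holds the three column names.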
#num_attributes ⇒ Object
Returns the number of attributes, including the class attribute.
# File 'lib/ai4r/data/data_set.rb', line 176
def num_attributes
  return (@data_items.empty?) ? 0 : @data_items.first.size
end
#open_csv_file(filepath, &block) ⇒ Object
Opens a CSV file and reads it line by line. For each line, the block is called with the row as its argument. Works on both Ruby 1.8 and 1.9.
# File 'lib/ai4r/data/data_set.rb', line 62
def open_csv_file(filepath, &block)
  if CSV.const_defined? :Reader
    CSV::Reader.parse(File.open(filepath, 'r')) do |row|
      block.call row
    end
  else
    CSV.parse(File.open(filepath, 'r')) do |row|
      block.call row
    end
  end
end
#parse_csv(filepath) ⇒ Object
Same as load_csv, but it will try to convert cell contents to numbers.
# File 'lib/ai4r/data/data_set.rb', line 82
def parse_csv(filepath)
  items = []
  open_csv_file(filepath) do |row|
    items << row.collect{|x| (x.match(@@number_regex)) ? x.to_f : x.data }
  end
  set_data_items(items)
end
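The conversion rule in the listing is: if the cell matches the number pattern, call `to_f` on it, otherwise keep it. The sketch below applies that rule to a single row. `NUMBER_REGEX` and `convert_row` are illustrative names, and the sketch returns a non-matching cell as-is rather than calling `data`, which plain strings do not respond to.

```ruby
# Local copy of the class-level pattern; NUMBER_REGEX is an illustrative name.
NUMBER_REGEX = /(((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b)/

# convert_row is a hypothetical helper applying the listing's conversion rule.
def convert_row(row)
  row.collect { |cell| cell.match(NUMBER_REGEX) ? cell.to_f : cell }
end

converted = convert_row(["Chicago", "42", "3.5e2", "<30"])
# "<30" contains a number-like token, so it is converted too, and
# String#to_f gives 0.0 because the string does not start with a number.
```

If range-style labels such as "<30" need to stay as strings, load_csv may be the safer choice under these assumptions.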
#set_data_items(items) ⇒ Object
Set the data items. M data items with N attributes must have the following format:
[ [ATT1_VAL1, ATT2_VAL1, ATT3_VAL1, ... , ATTN_VAL1, CLASS_VAL1],
[ATT1_VAL2, ATT2_VAL2, ATT3_VAL2, ... , ATTN_VAL2, CLASS_VAL2],
...
[ATT1_VALM, ATT2_VALM, ATT3_VALM, ... , ATTN_VALM, CLASS_VALM],
]
e.g.
[ ['New York', '<30', 'M', 'Y'],
['Chicago', '<30', 'M', 'Y'],
['Chicago', '<30', 'F', 'Y'],
['New York', '<30', 'M', 'Y'],
['New York', '<30', 'M', 'Y'],
['Chicago', '[30-50)', 'M', 'Y'],
['New York', '[30-50)', 'F', 'N'],
['Chicago', '[30-50)', 'F', 'Y'],
['New York', '[30-50)', 'F', 'N'],
['Chicago', '[50-80]', 'M', 'N'],
['New York', '[50-80]', 'F', 'N'],
['New York', '[50-80]', 'M', 'N'],
['Chicago', '[50-80]', 'M', 'N'],
['New York', '[50-80]', 'F', 'N'],
['Chicago', '>80', 'F', 'Y']
]
This method returns the data set (self), allowing method chaining.
# File 'lib/ai4r/data/data_set.rb', line 132
def set_data_items(items)
  check_data_items(items)
  @data_labels = default_data_labels(items) if @data_labels.empty?
  @data_items = items
  return self
end
#set_data_labels(labels) ⇒ Object
Set data labels. Data labels must have the following format:
[ 'city', 'age_range', 'gender', 'marketing_target' ]
If you do not provide labels for your data, the following labels will be created by default:
[ 'attribute_1', 'attribute_2', 'attribute_3', 'class_value' ]
# File 'lib/ai4r/data/data_set.rb', line 97
def set_data_labels(labels)
  check_data_labels(labels)
  @data_labels = labels
  return self
end