Class: Ai4r::Data::DataSet
- Inherits:
- Object
  - Object
  - Ai4r::Data::DataSet
- Defined in:
- lib/ai4r/data/data_set.rb
Overview
A data set is a collection of N data items. Each data item is described by a set of attributes, represented as an array. Optionally, you can assign a label to the attributes, using the data_labels property.
Instance Attribute Summary collapse
-
#data_items ⇒ Object
readonly
Returns the value of attribute data_items.
-
#data_labels ⇒ Object
readonly
Returns the value of attribute data_labels.
Instance Method Summary collapse
-
#<<(data_item) ⇒ Object
Add a data item to the data set.
-
#[](index) ⇒ Object
Retrieve a new DataSet, with the item(s) selected by the provided index.
-
#build_domain(attr) ⇒ Object
Returns the domain of an attribute: a Set instance containing all possible values for a nominal attribute, or an array with min and max values for a numeric one. The parameter can be an attribute label or index (0-based).
-
#build_domains ⇒ Object
Returns an array with the domain of each attribute: a Set instance containing all possible values for nominal attributes, or an array with min and max values for numeric attributes (i.e. [min, max]).
-
#check_not_empty ⇒ Object
Raise an exception if there is no data item.
-
#get_index(attr) ⇒ Object
Returns the index of a given attribute (0-based).
-
#get_mean_or_mode ⇒ Object
Returns an array with the mean value of numeric attributes and the most frequent value of non-numeric attributes.
-
#initialize(options = {}) ⇒ DataSet
constructor
Create a new DataSet.
-
#load_csv(filepath) ⇒ Object
Load data items from a CSV file.
-
#load_csv_with_labels(filepath) ⇒ Object
Load data items from a CSV file, using the first row as data labels.
-
#num_attributes ⇒ Object
Returns the number of attributes, including the class attribute.
-
#open_csv_file(filepath, &block) ⇒ Object
Opens a CSV file and reads it line by line. For each line, the block is called with the row as its argument. Safe for Ruby 1.8 and 1.9.
-
#parse_csv(filepath) ⇒ Object
Same as load_csv, but it tries to convert cell contents to numbers.
-
#parse_csv_with_labels(filepath) ⇒ Object
Same as load_csv_with_labels, but it tries to convert cell contents to numbers.
-
#set_data_items(items) ⇒ Object
Set the data items.
-
#set_data_labels(labels) ⇒ Object
Set data labels.
Constructor Details
#initialize(options = {}) ⇒ DataSet
Create a new DataSet. By default, it is empty. Optionally, you can provide the initial data items and data labels.
e.g. DataSet.new(:data_items => data_items, :data_labels => labels)
If you provide data items, but no data labels, the data set will use the default data label values (see set_data_labels)
# File 'lib/ai4r/data/data_set.rb', line 32
def initialize(options = {})
  @data_labels = []
  @data_items = options[:data_items] || []
  set_data_labels(options[:data_labels]) if options[:data_labels]
  set_data_items(options[:data_items]) if options[:data_items]
end
Instance Attribute Details
#data_items ⇒ Object (readonly)
Returns the value of attribute data_items.
# File 'lib/ai4r/data/data_set.rb', line 23
def data_items
  @data_items
end
#data_labels ⇒ Object (readonly)
Returns the value of attribute data_labels.
# File 'lib/ai4r/data/data_set.rb', line 23
def data_labels
  @data_labels
end
Instance Method Details
#<<(data_item) ⇒ Object
Add a data item to the data set.
# File 'lib/ai4r/data/data_set.rb', line 201
def << data_item
  if data_item.nil? || !data_item.is_a?(Enumerable) || data_item.empty?
    raise ArgumentError, "Data must not be an non empty array."
  elsif @data_items.empty?
    set_data_items([data_item])
  elsif data_item.length != num_attributes
    raise ArgumentError, "Number of attributes do not match. " +
          "#{data_item.length} attributes provided, " +
          "#{num_attributes} attributes expected."
  else
    @data_items << data_item
  end
end
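The validation performed by << can be illustrated outside the class. The following is a minimal sketch, not the library's code: validate_item! is a hypothetical helper, its error messages are reworded for this example, and expected_size stands in for num_attributes.

```ruby
# Hypothetical stand-alone check mirroring the validation in DataSet#<<.
# expected_size plays the role of num_attributes (0 means "no items yet").
def validate_item!(data_item, expected_size)
  if data_item.nil? || !data_item.is_a?(Enumerable) || data_item.empty?
    raise ArgumentError, 'Data must be a non-empty array.'
  elsif expected_size > 0 && data_item.length != expected_size
    raise ArgumentError,
          "#{data_item.length} attributes provided, #{expected_size} expected."
  end
  data_item
end

# An item with the expected number of attributes is accepted.
accepted = validate_item!(['Chicago', '<30'], 2)

# An item with a different number of attributes is rejected.
rejected = begin
  validate_item!(['Chicago'], 2)
  false
rescue ArgumentError
  true
end
```

When the data set is still empty, the first item fixes the expected number of attributes, which is why the sketch skips the length check for an expected_size of 0.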
#[](index) ⇒ Object
Retrieve a new DataSet, with the item(s) selected by the provided index. You can specify an index range, too.
# File 'lib/ai4r/data/data_set.rb', line 41
def [](index)
  selected_items = (index.is_a?(Fixnum)) ?
    [@data_items[index]] : @data_items[index]
  return DataSet.new(:data_items => selected_items,
                     :data_labels => @data_labels)
end
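The listing above tests the index against Fixnum, a constant that was deprecated in Ruby 2.4 and removed in Ruby 3.2. As a minimal sketch of the same selection logic using Integer instead, the single-index and range cases might look like this. The helper select_items is hypothetical and works on a plain array of rows rather than on a DataSet.

```ruby
# Hypothetical stand-alone helper mirroring the selection logic of DataSet#[].
# It checks for Integer rather than Fixnum (an assumption for newer Rubies).
def select_items(data_items, index)
  # A single integer index is wrapped so the result is always an array of items.
  index.is_a?(Integer) ? [data_items[index]] : data_items[index]
end

items = [['New York', '<30'], ['Chicago', '<30'], ['Chicago', '>80']]
single = select_items(items, 1)     # one item, wrapped in an array
range  = select_items(items, 0..1)  # a slice of items
```

Wrapping the single item keeps both cases in the same shape, so the result can always be passed on as :data_items.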
#build_domain(attr) ⇒ Object
Returns the domain of an attribute. The parameter can be an attribute label or index (0-based). The result is one of:
-
Set instance containing all possible values for nominal attributes
-
Array with min and max values for numeric attributes (i.e. [min, max])
build_domain("city")
=> #<Set: {"New York", "Chicago"}>
build_domain("age")
=> [5, 85]
build_domain(2) # In this example, the third attribute is gender
=> #<Set: {"M", "F"}>
# File 'lib/ai4r/data/data_set.rb', line 171
def build_domain(attr)
  index = get_index(attr)
  if @data_items.first[index].is_a?(Numeric)
    return [Statistics.min(self, index), Statistics.max(self, index)]
  else
    return @data_items.inject(Set.new){|domain, x| domain << x[index]}
  end
end
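The same numeric-versus-nominal branching can be sketched with the standard library alone. In the sketch below, domain_for is a hypothetical helper operating on a plain array of rows, and Array#minmax stands in for the library's Statistics.min and Statistics.max.

```ruby
require 'set'

# Hypothetical helper mirroring build_domain for a plain array of rows.
# Array#minmax is a stand-in for Statistics.min / Statistics.max.
def domain_for(data_items, index)
  column = data_items.map { |item| item[index] }
  if column.first.is_a?(Numeric)
    column.minmax      # numeric attribute: [min, max]
  else
    Set.new(column)    # nominal attribute: set of distinct values
  end
end

items = [['New York', 5, 'M'], ['Chicago', 85, 'F'], ['Chicago', 30, 'M']]
city_domain = domain_for(items, 0)
age_domain  = domain_for(items, 1)
```

As in the listing above, only the first item decides whether an attribute is treated as numeric, so a column with mixed types follows whatever type appears in the first row.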
#build_domains ⇒ Object
Returns an array with the domain of each attribute:
-
Set instance containing all possible values for nominal attributes
-
Array with min and max values for numeric attributes (i.e. [min, max])
Return example:
=> [#<Set: {"New York", "Chicago"}>,
#<Set: {"<30", "[30-50)", "[50-80]", ">80"}>,
#<Set: {"M", "F"}>,
[5, 85],
#<Set: {"Y", "N"}>]
# File 'lib/ai4r/data/data_set.rb', line 154
def build_domains
  @data_labels.collect {|attr_label| build_domain(attr_label) }
end
#check_not_empty ⇒ Object
Raise an exception if there is no data item.
# File 'lib/ai4r/data/data_set.rb', line 194
def check_not_empty
  if @data_items.empty?
    raise ArgumentError, "Examples data set must not be empty."
  end
end
#get_index(attr) ⇒ Object
Returns the index of a given attribute (0-based). For example, if “gender” is the third attribute, then:
get_index("gender")
=> 2
# File 'lib/ai4r/data/data_set.rb', line 189
def get_index(attr)
  return (attr.is_a?(Fixnum) || attr.is_a?(Range)) ? attr : @data_labels.index(attr)
end
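A small stand-alone sketch may help show the three cases the lookup handles: an index is passed through unchanged, a known label is resolved to its position, and an unknown label yields nil because Array#index returns nil when nothing matches. The helper index_of is hypothetical, and it checks for Integer where the listing uses Fixnum.

```ruby
# Hypothetical stand-alone version of the lookup done by get_index.
# Integer replaces Fixnum here (an assumption for newer Ruby versions).
def index_of(data_labels, attr)
  return attr if attr.is_a?(Integer) || attr.is_a?(Range)
  data_labels.index(attr)
end

labels = ['city', 'age_range', 'gender', 'marketing_target']
by_label = index_of(labels, 'gender')   # label resolved to its position
by_index = index_of(labels, 1)          # index passed through unchanged
unknown  = index_of(labels, 'income')   # label that is not present
```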
#get_mean_or_mode ⇒ Object
Returns an array with the mean value of numeric attributes and the most frequent value of non-numeric attributes.
# File 'lib/ai4r/data/data_set.rb', line 217
def get_mean_or_mode
  mean = []
  num_attributes.times do |i|
    mean[i] =
      if @data_items.first[i].is_a?(Numeric)
        Statistics.mean(self, i)
      else
        Statistics.mode(self, i)
      end
  end
  return mean
end
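The per-attribute choice between mean and mode can be sketched without the library. Here mean_or_mode is a hypothetical helper over a plain array of rows, standing in for Statistics.mean and Statistics.mode. How the library breaks ties between equally frequent values is not shown on this page, so the tie-breaking in this sketch is an assumption.

```ruby
# Hypothetical helper: mean for numeric columns, most frequent value otherwise.
# Tie-breaking for the mode (whatever max_by picks) is an assumption.
def mean_or_mode(data_items)
  return [] if data_items.empty?
  data_items.first.each_index.map do |i|
    column = data_items.map { |item| item[i] }
    if column.first.is_a?(Numeric)
      column.sum.to_f / column.size
    else
      # Group equal values together and keep the largest group's value.
      column.group_by { |value| value }.max_by { |_value, group| group.size }.first
    end
  end
end

items = [['Chicago', 20], ['New York', 30], ['Chicago', 40]]
summary = mean_or_mode(items)
```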
#load_csv(filepath) ⇒ Object
Load data items from a CSV file.
# File 'lib/ai4r/data/data_set.rb', line 49
def load_csv(filepath)
  items = []
  open_csv_file(filepath) do |entry|
    items << entry
  end
  set_data_items(items)
end
#load_csv_with_labels(filepath) ⇒ Object
Load data items from a CSV file. The first row is used as data labels.
# File 'lib/ai4r/data/data_set.rb', line 73
def load_csv_with_labels(filepath)
  load_csv(filepath)
  @data_labels = @data_items.shift
  return self
end
#num_attributes ⇒ Object
Returns the number of attributes, including the class attribute. An empty data set has 0 attributes.
# File 'lib/ai4r/data/data_set.rb', line 181
def num_attributes
  return (@data_items.empty?) ? 0 : @data_items.first.size
end
#open_csv_file(filepath, &block) ⇒ Object
Opens a CSV file and reads it line by line. For each line, the block is called with the row as its argument. Safe for Ruby 1.8 and 1.9.
# File 'lib/ai4r/data/data_set.rb', line 60
def open_csv_file(filepath, &block)
  if CSV.const_defined? :Reader
    CSV::Reader.parse(File.open(filepath, 'r')) do |row|
      block.call row
    end
  else
    CSV.parse(File.open(filepath, 'r')) do |row|
      block.call row
    end
  end
end
#parse_csv(filepath) ⇒ Object
Same as load_csv, but it tries to convert cell contents to numbers.
# File 'lib/ai4r/data/data_set.rb', line 80
def parse_csv(filepath)
  items = []
  open_csv_file(filepath) do |row|
    items << row.collect{|x| is_number?(x) ? Float(x) : x }
  end
  set_data_items(items)
end
#parse_csv_with_labels(filepath) ⇒ Object
Same as load_csv_with_labels, but it tries to convert cell contents to numbers.
# File 'lib/ai4r/data/data_set.rb', line 89
def parse_csv_with_labels(filepath)
  parse_csv(filepath)
  @data_labels = @data_items.shift
  return self
end
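The parse-then-shift flow can be sketched on an in-memory CSV string. The library's is_number? helper is not shown on this page, so to_number_or_self below is an assumed stand-in that treats a cell as numeric whenever Float() accepts it; the sample text and variable names are likewise hypothetical.

```ruby
require 'csv'

# Assumed stand-in for the library's is_number? check (not shown on this page):
# a cell is converted when Float() accepts it and left untouched otherwise.
def to_number_or_self(cell)
  Float(cell)
rescue ArgumentError, TypeError
  cell
end

csv_text = "city,age\nNew York,5\nChicago,85\n"

# Convert every cell of every row, as parse_csv does.
rows = CSV.parse(csv_text).map do |row|
  row.map { |cell| to_number_or_self(cell) }
end

# The first row becomes the labels, as in parse_csv_with_labels.
labels = rows.shift
```

Because the header row goes through the same conversion before it is shifted off, a header cell that looks like a number would end up as a Float label.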
#set_data_items(items) ⇒ Object
Set the data items. M data items with N attributes must have the following format:
[ [ATT1_VAL1, ATT2_VAL1, ATT3_VAL1, ... , ATTN_VAL1, CLASS_VAL1],
[ATT1_VAL2, ATT2_VAL2, ATT3_VAL2, ... , ATTN_VAL2, CLASS_VAL2],
...
[ATT1_VALM, ATT2_VALM, ATT3_VALM, ... , ATTN_VALM, CLASS_VALM],
]
e.g.
[ ['New York', '<30', 'M', 'Y'],
['Chicago', '<30', 'M', 'Y'],
['Chicago', '<30', 'F', 'Y'],
['New York', '<30', 'M', 'Y'],
['New York', '<30', 'M', 'Y'],
['Chicago', '[30-50)', 'M', 'Y'],
['New York', '[30-50)', 'F', 'N'],
['Chicago', '[30-50)', 'F', 'Y'],
['New York', '[30-50)', 'F', 'N'],
['Chicago', '[50-80]', 'M', 'N'],
['New York', '[50-80]', 'F', 'N'],
['New York', '[50-80]', 'M', 'N'],
['Chicago', '[50-80]', 'M', 'N'],
['New York', '[50-80]', 'F', 'N'],
['Chicago', '>80', 'F', 'Y']
]
This method returns the data set (self), allowing method chaining.
# File 'lib/ai4r/data/data_set.rb', line 137
def set_data_items(items)
  check_data_items(items)
  @data_labels = default_data_labels(items) if @data_labels.empty?
  @data_items = items
  return self
end
#set_data_labels(labels) ⇒ Object
Set data labels. Data labels must have the following format:
[ 'city', 'age_range', 'gender', 'marketing_target' ]
If you do not provide labels for your data, the following labels will be created by default:
[ 'attribute_1', 'attribute_2', 'attribute_3', 'class_value' ]
# File 'lib/ai4r/data/data_set.rb', line 102
def set_data_labels(labels)
  check_data_labels(labels)
  @data_labels = labels
  return self
end
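The default labels described above follow a simple pattern: one attribute_N label per attribute, with the last column named class_value. The sketch below reproduces that pattern from the documented example only. The library's own helper is not shown on this page, so default_labels is a hypothetical reconstruction and may differ from the actual implementation.

```ruby
# Hypothetical reconstruction of the default labels, inferred only from the
# documented example output; not the library's own helper.
def default_labels(data_items)
  return [] if data_items.empty?
  size = data_items.first.size
  # Every column except the last gets a numbered label; the last is the class.
  (1...size).map { |i| "attribute_#{i}" } << 'class_value'
end

labels = default_labels([['New York', '<30', 'M', 'Y']])
```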