Class: Ai4r::Data::DataSet

Inherits:
Object
Defined in:
lib/ai4r/data/data_set.rb

Overview

A data set is a collection of N data items. Each data item is described by a set of attributes, represented as an array. Optionally, you can assign a label to the attributes, using the data_labels property.

Constant Summary

@@number_regex =
/(((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b)/

Constructor Details

#initialize(options = {}) ⇒ DataSet

Creates a new DataSet, empty by default. Optionally, you can provide the initial data items and data labels.

e.g. DataSet.new(:data_items => data_items, :data_labels => labels)

If you provide data items but no data labels, the data set uses the default data label values (see set_data_labels).



# File 'lib/ai4r/data/data_set.rb', line 34

def initialize(options = {})
  @data_labels = []
  @data_items = options[:data_items] || []
  set_data_labels(options[:data_labels]) if options[:data_labels]
  set_data_items(options[:data_items]) if options[:data_items]
end

Instance Attribute Details

#data_items ⇒ Object (readonly)

Returns the value of attribute data_items.



# File 'lib/ai4r/data/data_set.rb', line 25

def data_items
  @data_items
end

#data_labels ⇒ Object (readonly)

Returns the value of attribute data_labels.



# File 'lib/ai4r/data/data_set.rb', line 25

def data_labels
  @data_labels
end

Instance Method Details

#<<(data_item) ⇒ Object

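Adds a data item to the data set. Raises an ArgumentError if the item is nil, empty, or not Enumerable, or if its number of attributes does not match the existing items.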



# File 'lib/ai4r/data/data_set.rb', line 196

def << data_item
  if data_item.nil? || !data_item.is_a?(Enumerable) || data_item.empty?
    raise ArgumentError, "Data must be a non-empty array."
  elsif @data_items.empty?
    set_data_items([data_item])
  elsif data_item.length != num_attributes
    raise ArgumentError, "Number of attributes do not match. " +
            "#{data_item.length} attributes provided, " +
            "#{num_attributes} attributes expected."
  else
    @data_items << data_item
  end
end
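
The checks performed by `<<` can be sketched on plain arrays, outside the library. This is a minimal sketch, not the library's code: `append_item` is a hypothetical helper, and for simplicity it accepts only Arrays rather than any Enumerable.

```ruby
# Hypothetical stand-alone helper mirroring the checks in DataSet#<<.
# `items` is a plain array of rows and is modified in place.
def append_item(items, item)
  unless item.is_a?(Array) && !item.empty?
    raise ArgumentError, 'Data item must be a non-empty array.'
  end
  # The first stored row fixes the expected number of attributes.
  if !items.empty? && item.length != items.first.length
    raise ArgumentError,
          "#{item.length} attributes provided, #{items.first.length} expected."
  end
  items << item
end

rows = []
append_item(rows, ['New York', '<30', 'M', 'Y'])
append_item(rows, ['Chicago', '>80', 'F', 'Y'])

begin
  append_item(rows, ['Chicago', 'F']) # wrong number of attributes
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

The first item appended to an empty collection sets the attribute count that every later item must match.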

#[](index) ⇒ Object

Retrieve a new DataSet, with the item(s) selected by the provided index. You can specify an index range, too.



# File 'lib/ai4r/data/data_set.rb', line 43

def [](index)
  # Integer is used instead of Fixnum, which was removed in Ruby 3.2.
  selected_items = index.is_a?(Integer) ?
          [@data_items[index]] : @data_items[index]
  return DataSet.new(:data_items => selected_items,
                     :data_labels => @data_labels)
end
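
The reason for the branch on the index type is that a single index returns one row, while a range returns a list of rows. A minimal sketch on plain arrays, where `select_items` is a hypothetical helper and not part of the library:

```ruby
# Hypothetical helper mirroring the selection logic of DataSet#[].
# A single row is wrapped in an array so the result is always a list of rows.
def select_items(items, index)
  index.is_a?(Integer) ? [items[index]] : items[index]
end

items = [
  ['New York', '<30',     'M', 'Y'],
  ['Chicago',  '<30',     'F', 'Y'],
  ['Chicago',  '[30-50)', 'M', 'N']
]

p select_items(items, 1)    # one row, wrapped in an array
p select_items(items, 0..1) # the first two rows
```

Either way the result has the same shape, so it can be passed back as `:data_items` to build a new data set.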

#build_domain(attr) ⇒ Object

Returns the domain of an attribute. The parameter can be an attribute label or index (0-based).

  • Set instance containing all possible values for nominal attributes

  • Array with min and max values for numeric attributes (i.e. [min, max])

    build_domain("city")

    => #<Set: {"New York", "Chicago"}>

    build_domain("age")

    => [5, 85]

    build_domain(2) # In this example, the third attribute is gender

    => #<Set: {"M", "F"}>



# File 'lib/ai4r/data/data_set.rb', line 166

def build_domain(attr)
  index = get_index(attr)
  if @data_items.first[index].is_a?(Numeric)
    return [Statistics.min(self, index), Statistics.max(self, index)]
  else
    return @data_items.inject(Set.new){|domain, x| domain << x[index]}
  end
end
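
The nominal/numeric distinction can be illustrated without the library. The following is a minimal sketch on plain arrays; `column_domain` is a hypothetical helper, and `Array#minmax` stands in for the `Statistics.min` and `Statistics.max` calls above.

```ruby
require 'set'

# Hypothetical helper: the domain of one column in an array of rows.
# As in build_domain, only the first row decides whether the column is numeric.
def column_domain(rows, index)
  values = rows.map { |row| row[index] }
  if values.first.is_a?(Numeric)
    values.minmax   # [min, max] for numeric columns
  else
    Set.new(values) # distinct values for nominal columns
  end
end

rows = [
  ['New York', 25, 'M'],
  ['Chicago',   5, 'F'],
  ['Chicago',  85, 'M']
]

p column_domain(rows, 0)
p column_domain(rows, 1)
```

Because the type check looks only at the first row, a column that mixes numbers and strings is classified by whatever its first value happens to be.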

#build_domains ⇒ Object

Returns an array with the domain of each attribute:

  • Set instance containing all possible values for nominal attributes

  • Array with min and max values for numeric attributes (i.e. [min, max])

Return example:

=> [#<Set: {"New York", "Chicago"}>,
#<Set: {"<30", "[30-50)", "[50-80]", ">80"}>,
#<Set: {"M", "F"}>,
[5, 85],
#<Set: {"Y", "N"}>]


# File 'lib/ai4r/data/data_set.rb', line 149

def build_domains
  @data_labels.collect {|attr_label| build_domain(attr_label) }
end

#check_not_empty ⇒ Object

Raises an ArgumentError if the data set has no data items.



# File 'lib/ai4r/data/data_set.rb', line 189

def check_not_empty
  if @data_items.empty?
    raise ArgumentError, "Examples data set must not be empty."
  end
end

#get_index(attr) ⇒ Object

Returns the index of a given attribute (0-based). For example, if “gender” is the third attribute, then:

get_index("gender") 
=> 2


# File 'lib/ai4r/data/data_set.rb', line 184

def get_index(attr)
  # Integer is used instead of Fixnum, which was removed in Ruby 3.2.
  return (attr.is_a?(Integer) || attr.is_a?(Range)) ? attr : @data_labels.index(attr)
end

#get_mean_or_mode ⇒ Object

Returns an array with the mean value of each numeric attribute and the most frequent value of each non-numeric attribute.



# File 'lib/ai4r/data/data_set.rb', line 212

def get_mean_or_mode
  mean = []
  num_attributes.times do |i|
    mean[i] =
            if @data_items.first[i].is_a?(Numeric)
              Statistics.mean(self, i)
            else
              Statistics.mode(self, i)
            end
  end
  return mean
end
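
The per-column choice between mean and mode can be sketched on plain arrays. This is a minimal sketch under stated assumptions: `column_mean`, `column_mode`, and `mean_or_mode` are hypothetical helpers standing in for the library's `Statistics.mean` and `Statistics.mode`, whose tie-breaking behavior is not shown here.

```ruby
# Hypothetical stand-ins for Statistics.mean and Statistics.mode.
def column_mean(values)
  values.sum.to_f / values.size
end

def column_mode(values)
  # Most frequent value; on a tie, the value that appears first wins.
  values.tally.max_by { |_value, count| count }.first
end

# One summary value per column: mean if the first row's value is numeric, mode otherwise.
def mean_or_mode(rows)
  return [] if rows.empty?

  rows.first.each_index.map do |i|
    column = rows.map { |row| row[i] }
    column.first.is_a?(Numeric) ? column_mean(column) : column_mode(column)
  end
end

rows = [
  ['Chicago',  20, 'M'],
  ['New York', 30, 'F'],
  ['Chicago',  40, 'F']
]

p mean_or_mode(rows)
```

The result is a single row that summarizes the data set, with one entry per attribute.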

#load_csv(filepath) ⇒ Object

Loads data items from a CSV file.



# File 'lib/ai4r/data/data_set.rb', line 51

def load_csv(filepath)
  items = []
  open_csv_file(filepath) do |entry|
    items << entry
  end
  set_data_items(items)
end

#load_csv_with_labels(filepath) ⇒ Object

Loads data items from a CSV file. The first row is used as the data labels.



# File 'lib/ai4r/data/data_set.rb', line 75

def load_csv_with_labels(filepath)
  load_csv(filepath)
  @data_labels = @data_items.shift
  return self
end
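
The header handling amounts to parsing every row and then shifting the first one off as labels. A minimal sketch using Ruby's standard `csv` library on an in-memory string; `split_labels` and the sample text are hypothetical and only illustrate the idea:

```ruby
require 'csv'

# Hypothetical helper: parse CSV text and separate the header row from the data rows.
def split_labels(csv_text)
  rows = CSV.parse(csv_text)
  labels = rows.shift # the first row becomes the labels
  [labels, rows]
end

csv_text = <<~CSV
  city,age_range,gender
  New York,<30,M
  Chicago,[30-50),F
CSV

labels, items = split_labels(csv_text)
p labels
p items
```

All cells remain strings in this approach; no numeric conversion takes place.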

#num_attributes ⇒ Object

Returns the number of attributes, including the class attribute. Returns 0 if the data set is empty.



# File 'lib/ai4r/data/data_set.rb', line 176

def num_attributes
  return (@data_items.empty?) ? 0 : @data_items.first.size
end

#open_csv_file(filepath, &block) ⇒ Object

Opens a CSV file and reads it row by row, calling the given block with each row. Works on both Ruby 1.8 (CSV::Reader) and Ruby 1.9 (CSV.parse).



# File 'lib/ai4r/data/data_set.rb', line 62

def open_csv_file(filepath, &block)
  if CSV.const_defined? :Reader
    CSV::Reader.parse(File.open(filepath, 'r')) do |row|
      block.call row
    end
  else
    CSV.parse(File.open(filepath, 'r')) do |row|
      block.call row
    end
  end
end

#parse_csv(filepath) ⇒ Object

Same as load_csv, but it tries to convert cell contents to numbers.



# File 'lib/ai4r/data/data_set.rb', line 82

def parse_csv(filepath)
  items = []
  open_csv_file(filepath) do |row|
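    # Cells are plain Strings (or nil) on Ruby 1.9+, so the cell itself is kept when it is not numeric.
    items << row.collect{|x| (x && x.match(@@number_regex)) ? x.to_f : x }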
  end
  set_data_items(items)
end
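
The conversion rule can be tried in isolation. The sketch below copies the `@@number_regex` pattern from the Constant Summary into a local constant; `convert_cell` and `convert_cell_strict` are hypothetical helpers, and the strict variant is an alternative shown for comparison, not something the library does.

```ruby
# Copy of the @@number_regex pattern shown in the Constant Summary.
NUMBER_REGEX = /(((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b)/

# Hypothetical helper: the same match-then-to_f rule used by parse_csv.
def convert_cell(cell)
  cell.to_s.match(NUMBER_REGEX) ? cell.to_f : cell
end

# Hypothetical stricter variant: convert only when the whole cell is a number.
def convert_cell_strict(cell)
  return cell if cell.nil?

  Float(cell, exception: false) || cell
end

['3.14', '1e3', 'Chicago', '<30'].each do |cell|
  puts "#{cell.inspect}: #{convert_cell(cell).inspect} / #{convert_cell_strict(cell).inspect}"
end
```

The pattern is not anchored, so a cell that merely contains a number, such as `'<30'`, also matches and is passed to `to_f`, which yields `0.0`. Nominal columns with values like these are better loaded with load_csv.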

#set_data_items(items) ⇒ Object

Set the data items. M data items with N attributes must have the following format:

[   [ATT1_VAL1, ATT2_VAL1, ATT3_VAL1, ... , ATTN_VAL1, CLASS_VAL1],
    [ATT1_VAL2, ATT2_VAL2, ATT3_VAL2, ... , ATTN_VAL2, CLASS_VAL2],
    ...
    [ATT1_VALM, ATT2_VALM, ATT3_VALM, ... , ATTN_VALM, CLASS_VALM]
]

e.g.

[   ['New York', '<30',     'M', 'Y'],
    ['Chicago',  '<30',     'M', 'Y'],
    ['Chicago',  '<30',     'F', 'Y'],
    ['New York', '<30',     'M', 'Y'],
    ['New York', '<30',     'M', 'Y'],
    ['Chicago',  '[30-50)', 'M', 'Y'],
    ['New York', '[30-50)', 'F', 'N'],
    ['Chicago',  '[30-50)', 'F', 'Y'],
    ['New York', '[30-50)', 'F', 'N'],
    ['Chicago',  '[50-80]', 'M', 'N'],
    ['New York', '[50-80]', 'F', 'N'],
    ['New York', '[50-80]', 'M', 'N'],
    ['Chicago',  '[50-80]', 'M', 'N'],
    ['New York', '[50-80]', 'F', 'N'],
    ['Chicago',  '>80',     'F', 'Y']
]

This method returns the data set (self), allowing method chaining.



# File 'lib/ai4r/data/data_set.rb', line 132

def set_data_items(items)
  check_data_items(items)
  @data_labels = default_data_labels(items) if @data_labels.empty?
  @data_items = items
  return self
end

#set_data_labels(labels) ⇒ Object

Set data labels. Data labels must have the following format:

[ 'city', 'age_range', 'gender', 'marketing_target'  ]

If you do not provide labels for your data, the following labels will be created by default:

[ 'attribute_1', 'attribute_2', 'attribute_3', 'class_value'  ]


# File 'lib/ai4r/data/data_set.rb', line 97

def set_data_labels(labels)
  check_data_labels(labels)
  @data_labels = labels
  return self
end
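
The default label scheme described above can be sketched as follows. The library's own `default_data_labels` method is not shown on this page, so `default_labels` is a hypothetical helper inferred only from the documented default output.

```ruby
# Hypothetical helper: numbered labels for every attribute except the last,
# which is treated as the class attribute.
def default_labels(num_attributes)
  return [] if num_attributes.zero?

  (1...num_attributes).map { |i| "attribute_#{i}" } + ['class_value']
end

p default_labels(4)
```

For four attributes this produces the same four labels listed in the description above.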