Class: IMW::Metadata::Schema

Inherits:
Array
  • Object
Defined in:
lib/imw/metadata/schema.rb

Overview

A class to describe the schema of a resource.

A Schema is built on top of an Array because it is often important to have an ordering for a record’s fields.

For fields with no such ordering, an Array also works because each of its elements is a field whose name can be used to look up the corresponding field.

A Schema is instantiated with a basic Ruby data structure.

Tabular Data

Tabular data formats (CSV, TSV, &c.) contain flat records consisting of repeating rows with the same fields in the same position. A sample of delimited data looks like

ID,Name,Genus,Species            
001,Gray-bellied Night Monkey,Aotus,lemurinus      
002,Panamanian Night Monkey,Aotus,zonalis      
003,Hern

The schema of these records is summarized as a Ruby data structure in the following way

[
  { :name => :id,      :type => :integer                         },
  { :name => :name,    :type => :string, :title => "Common Name" },
  { :name => :genus,   :type => :string, :title => "Genus"       },
  { :name => :species, :type => :string, :title => "Species"     }
]

The outer-most Array represents each row and each Hash in the Array represents one of the fields in a row. A Schema initialized with the above Ruby code can be thought of and played with as an Array of Hashes even though it really is a Schema object of Field objects.
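To make the positional pairing concrete, here is a minimal sketch that applies the same specification to the two complete sample rows using plain Ruby Hashes rather than IMW objects. The helpers `cast` and `apply_schema` are hypothetical names introduced for illustration and are not part of IMW; the naive `split(',')` only suits this sample, which has no quoted commas.

```ruby
# The schema from the text as a plain Array of Hashes (no IMW objects involved).
TABULAR_SCHEMA = [
  { :name => :id,      :type => :integer                         },
  { :name => :name,    :type => :string, :title => "Common Name" },
  { :name => :genus,   :type => :string, :title => "Genus"       },
  { :name => :species, :type => :string, :title => "Species"     }
]

# Hypothetical helper: convert a raw string according to a field's :type.
def cast(value, type)
  case type
  when :integer then Integer(value, 10) # base 10, so "001" is not parsed as octal
  else value.to_s.strip
  end
end

# Hypothetical helper: pair each positional value in a row with its field.
def apply_schema(schema, row)
  schema.zip(row).each_with_object({}) do |(field, value), record|
    record[field[:name]] = cast(value, field[:type])
  end
end

DATA = <<~CSV
  ID,Name,Genus,Species
  001,Gray-bellied Night Monkey,Aotus,lemurinus
  002,Panamanian Night Monkey,Aotus,zonalis
CSV

# Drop the header line, then split each remaining line into positional values.
ROWS    = DATA.lines.drop(1).map { |line| line.chomp.split(',') }
RECORDS = ROWS.map { |row| apply_schema(TABULAR_SCHEMA, row) }
```

Each element of `RECORDS` is a Hash keyed by field name, which shows why field order in the schema matters for delimited data.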

Hierarchical Data

Hierarchical data formats (JSON, YAML, XML, &c.) can have arbitrarily complex records with fields within fields and so on. A sample of hierarchical XML data looks like

<genera>              
  <genus>             
    <name>Mandrillus</name>         
    <species>             
      <species id="113">          
        <name>sphinx</name>         
        <common_name>Mandrill</common_name>   
      </species>            
      <species id="114">          
        <name>leucophaeus</name>        
        <common_name>Drill</common_name>      
      </species>            
    </species>              
  </genus>              
  <genus>             
    <name>Rungwecebus</name>          
    <species>             
      <species id="100">          
        <name>kipunji</name>          
        <common_name>Kipunji</common_name>    
      </species>            
    </species>              
  </genus>
</genera>

These records are described by the following Ruby data structure

[
  { :name     => :genera,
    :has_many => [
      { :name => 'name',    :type => :string, :title => "Genus" },
      { :name => 'species',
        :has_many => [
          { :name => :id,          :type => :integer                         },
          { :name => :name,        :type => :string, :title => "Species"     },
          { :name => :common_name, :type => :string, :title => "Common Name" }
        ]
      }
    ]
  }
]

By IMW convention, the outer-most element of the Schema is still an Array describing a collection of identical records even though XML data must have a single root node, limiting the collection to a single record.

The first field of the Schema is named genera and it uses the special field property has_many to denote that the field points to a collection of sub-records.

Each of these sub-records has its own sub-schema defined by the Array that the has_many property keys to. In this case, the two fields are name and species. name is a simple String value while species itself points at another collection of objects.

This second-level nested record (a particular species) is itself composed of the three (flat) fields id, name, and common_name. Note that the Schema doesn’t know (or care) that the id field is contained in an XML attribute while the name and common_name fields are contained as text within daughter nodes.
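The nesting described above can be explored with a short recursive walk over the specification. This is only a sketch over plain Hashes; `field_paths` is a hypothetical helper, not an IMW method.

```ruby
# The hierarchical specification from the text, as plain Ruby data.
GENERA_SCHEMA = [
  { :name     => :genera,
    :has_many => [
      { :name => 'name',    :type => :string, :title => "Genus" },
      { :name => 'species',
        :has_many => [
          { :name => :id,          :type => :integer                         },
          { :name => :name,        :type => :string, :title => "Species"     },
          { :name => :common_name, :type => :string, :title => "Common Name" }
        ]
      }
    ]
  }
]

# Hypothetical helper: list every leaf field as a dotted path by descending
# through :has_many (or :has_one) sub-schemas.
def field_paths(fields, prefix = nil)
  fields.flat_map do |field|
    path     = [prefix, field[:name]].compact.join('.')
    children = field[:has_many] || field[:has_one]
    children ? field_paths(children, path) : [path]
  end
end

PATHS = field_paths(GENERA_SCHEMA)
```

The resulting paths contain `genera.species.id` alongside `genera.species.name`, with nothing to distinguish the XML attribute from the child elements, matching the note above.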

A different way of structuring the same information, this time expressed in YAML:

---                      
Mandrillus:              
- :species: sphinx       
  :name: Mandrill        
  :id: "113"             
- :species: leucophaeus  
  :name: Drill           
  :id: "114"             
Rungwecebus:             
- :species: kipunji      
  :name: Kipunji         
  :id: "100"

This layout would lead to a different Schema

[
  { :name => :genus, :title => "Genus",
    :has_many => [
      { :name => :id,          :type => :integer                         },
      { :name => :name,        :type => :string, :title => "Common Name" },
      { :name => :species,     :type => :string, :title => "Species"     }
    ]
  }
]

Here the unnecessary outer wrapper field genera has been dispensed with.

In addition to has_many relationships, a record can have a has_one relationship. The above data might instead be expressed as

---                      
Mandrillus:
  - species: sphinx       
    name: Mandrill        
    id: "113"
    discoverer:
      name: Dr. Monkeypants
      year: 1838
  - species: leucophaeus  
    name: Drill           
    id: "114"
    discoverer:
      name: Ms. Cecelia Apefingers
      year: 1921

This would result in the following Schema:

[
  { :name => :genus, :title => "Genus",
    :has_many => [
      { :name => :id,         :type => :integer                         },
      { :name => :name,       :type => :string, :title => "Common Name" },
      { :name => :species,    :type => :string                          },
      { :name => :discoverer,
        :has_one => [
          { :name => 'name', :type => :string  },
          { :name => 'year', :type => :integer }
        ]
      }
    ]
  }
]

The discoverer field is marked as has_one, which means the name and year fields in the corresponding Array will be interpreted as fields of a single attached sub-record.
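One way to picture the difference between the two relationships is a shape check: a has_one field should hold a single Hash, while a has_many field should hold an Array of Hashes. The sketch below assumes records parsed into plain Hashes with String keys; `shape_ok?` is a hypothetical helper and does not reflect how IMW itself validates data.

```ruby
# Sub-schema for one species record, taken from the specification above.
SPECIES_FIELDS = [
  { :name => :id,         :type => :integer                         },
  { :name => :name,       :type => :string, :title => "Common Name" },
  { :name => :species,    :type => :string                          },
  { :name => :discoverer,
    :has_one => [
      { :name => 'name', :type => :string  },
      { :name => 'year', :type => :integer }
    ]
  }
]

# Hypothetical helper: true when the record has the nesting the fields call for.
def shape_ok?(fields, record)
  fields.all? do |field|
    key   = field[:name].to_s
    value = record[key]
    if field[:has_one]
      # A single attached sub-record.
      value.is_a?(Hash) && shape_ok?(field[:has_one], value)
    elsif field[:has_many]
      # A collection of sub-records.
      value.is_a?(Array) &&
        value.all? { |item| item.is_a?(Hash) && shape_ok?(field[:has_many], item) }
    else
      record.key?(key)
    end
  end
end

SPHINX = {
  'species'    => 'sphinx',
  'name'       => 'Mandrill',
  'id'         => '113',
  'discoverer' => { 'name' => 'Dr. Monkeypants', 'year' => 1838 }
}

# The same record with discoverer wrongly wrapped in an Array.
MISSHAPEN = SPHINX.merge('discoverer' => [SPHINX['discoverer']])
```

Here `shape_ok?(SPECIES_FIELDS, SPHINX)` holds, while the `MISSHAPEN` record fails because its discoverer is a collection rather than a single sub-record.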

Compact Schemas

The internal hashes in a Schema specification are really Field objects and the initializer will promote Strings and Symbols to Field objects automatically. This means that the above Schema specification could be replaced by

[
  { :name => :genus,
    :has_many => [
      :id,
      :name,
      :species,
      { :name => :discoverer,
        :has_one => [
          :name,
          :year
        ]
      }
    ]
  }
]

though there is an accompanying loss of metadata about each field.
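The promotion step can be pictured as follows. This is a sketch under stated assumptions: `promote` is a hypothetical stand-in that returns plain Hashes, and whether the real Field initializer recurses into nested specifications in exactly this way is not confirmed by the text above.

```ruby
# Hypothetical stand-in for promotion: a bare String or Symbol becomes a
# field with only a :name; Hash specs are kept, with any nested :has_many or
# :has_one lists promoted recursively.
def promote(spec)
  return { :name => spec } if spec.is_a?(String) || spec.is_a?(Symbol)

  spec.each_with_object({}) do |(key, value), field|
    nested     = [:has_many, :has_one].include?(key)
    field[key] = nested ? value.map { |child| promote(child) } : value
  end
end

# The compact specification from the text.
COMPACT = [
  { :name => :genus,
    :has_many => [
      :id,
      :name,
      :species,
      { :name => :discoverer,
        :has_one => [
          :name,
          :year
        ]
      }
    ]
  }
]

EXPANDED = COMPACT.map { |spec| promote(spec) }
```

Every promoted field carries only its `:name`, which is the loss of metadata (types and titles) mentioned above.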

Methods inherited from Array

#emit, #in_groups_of, #merge_hashes, #most, #quote_items_with, #random_element, #rest, #terminals, #to_openstruct, #unzip

Constructor Details

#initialize(input = nil) ⇒ Schema

Returns a new instance of Schema.



# File 'lib/imw/metadata/schema.rb', line 212

def initialize input=nil
  super()
  # Wrap each element of the input specification in a Field object.
  concat(input.map { |field| IMW::Metadata::Field.new(field) }) if input
end

Class Method Details

.load(resource) ⇒ Object



# File 'lib/imw/metadata/schema.rb', line 217

def self.load resource
  new(IMW.open(resource).load)
end

Instance Method Details

#[](index) ⇒ Object



# File 'lib/imw/metadata/schema.rb', line 221

def [] index
  [Integer, Range].include?(index.class) ? super(index) : detect { |field| field[:name].to_s == index.to_s }
end
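To try the lookup behavior without loading IMW, the sketch below reproduces the constructor and #[] logic shown above on a stand-in class. `StubField` and `MiniSchema` are hypothetical names; the real Field class may do considerably more than this Hash subclass.

```ruby
# Stand-in for IMW's Field class (an assumption): a Hash that accepts either
# a full Hash spec or a bare String/Symbol name.
class StubField < Hash
  def initialize(spec)
    super()
    update(spec.is_a?(Hash) ? spec : { :name => spec })
  end
end

# Same constructor and #[] logic as the source above, with StubField swapped
# in for the real Field class.
class MiniSchema < Array
  def initialize(input = nil)
    super()
    concat(input.map { |field| StubField.new(field) }) if input
  end

  # Integers and Ranges index by position; anything else is matched against
  # each field's :name after converting both sides to Strings.
  def [](index)
    if [Integer, Range].include?(index.class)
      super(index)
    else
      detect { |field| field[:name].to_s == index.to_s }
    end
  end
end

LOOKUP_SCHEMA = MiniSchema.new([
  { :name => :id,      :type => :integer                     },
  { :name => :genus,   :type => :string, :title => "Genus"   },
  { :name => :species, :type => :string, :title => "Species" }
])

BY_POSITION = LOOKUP_SCHEMA[0]         # positional lookup
BY_SYMBOL   = LOOKUP_SCHEMA[:genus]    # lookup by Symbol name
BY_STRING   = LOOKUP_SCHEMA['species'] # lookup by String name
```

Because both sides are compared as Strings, a field declared with a Symbol name can be fetched with a String and vice versa; a name that matches no field yields nil.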