Class: Modsulator

Inherits:
Object
Defined in:
lib/modsulator.rb

Overview

The main class for the MODSulator API, which lets you work with metadata spreadsheets and MODS XML.

Constant Summary

NAMESPACE =

We define our own namespace for <xmlDocs>

"http://library.stanford.edu/xmlDocs"


Constructor Details

#initialize(file, filename, options = {}) ⇒ Modsulator

Both a file and a filename are required because, in the API that uses this class, the two exist separately. If both :template_string and :template_file are given, :template_string takes precedence. If neither is specified, the gem's built-in XML template is used.

Parameters:

  • file (File)

    Input spreadsheet file.

  • filename (String)

    The filename for the input spreadsheet.

  • options (Hash) (defaults to: {})

Options Hash (options):

  • :template_file (String)

    The full path to the desired XML template file.

  • :template_string (String)

    The template contents as a string.



# File 'lib/modsulator.rb', line 31

def initialize file, filename, options = {}
  @file = file
  @filename = filename

  @rows = ModsulatorSheet.new(@file, @filename).rows
  
  if options[:template_string]
    @template_xml = options[:template_string]
  elsif options[:template_file]
    @template_xml = File.read(options[:template_file])
  else
    @template_xml = File.read(File.expand_path("../modsulator/modsulator_template.xml", __FILE__))
  end
end
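
The template precedence in the constructor can be tried in isolation. The following is a minimal sketch using only the standard library; `resolve_template` and its `builtin` argument are hypothetical names introduced for illustration and are not part of the Modsulator API.

```ruby
require 'tempfile'

# Hypothetical helper mirroring the constructor's precedence:
# :template_string first, then :template_file, then the built-in template.
def resolve_template(options = {}, builtin = "<mods/>")
  if options[:template_string]
    options[:template_string]
  elsif options[:template_file]
    File.read(options[:template_file])
  else
    builtin  # stands in for the gem's bundled modsulator_template.xml
  end
end

Tempfile.create(['template', '.xml']) do |f|
  f.write("<mods>from file</mods>")
  f.flush
  # Both options given: the string wins and the file is never read.
  puts resolve_template(template_string: "<mods>inline</mods>", template_file: f.path)
  # Only the file given: its contents are used.
  puts resolve_template(template_file: f.path)
  # Nothing given: fall back to the built-in template.
  puts resolve_template
end
```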

Instance Attribute Details

#file ⇒ Object (readonly)

Returns the value of attribute file.



# File 'lib/modsulator.rb', line 20

def file
  @file
end

#rows ⇒ Object (readonly)

Returns the value of attribute rows.



# File 'lib/modsulator.rb', line 20

def rows
  @rows
end

#template_xml ⇒ Object (readonly)

Returns the value of attribute template_xml.



# File 'lib/modsulator.rb', line 20

def template_xml
  @template_xml
end

Instance Method Details

#convert_rows ⇒ String

Generates an XML document with one <mods> entry per input row. Example output:

<xmlDocs datetime="2015-03-23 09:22:11AM" sourceFile="FitchMLK-v1.xlsx">
     <xmlDoc id="descMetadata" objectId="druid:aa111aa1111">
         <mods ... >
             :
         </mods>
     </xmlDoc>
     <xmlDoc id="descMetadata" objectId="druid:aa222aa2222">
         <mods ... >
             :
         </mods>
     </xmlDoc>
</xmlDocs>

Returns:

  • (String)

    An XML string containing all the <mods> documents within a nested structure as shown in the example.



# File 'lib/modsulator.rb', line 62

def convert_rows()
  time_stamp = Time.now.strftime("%Y-%m-%d %I:%M:%S%p")
  header = "<xmlDocs xmlns=\"#{NAMESPACE}\" datetime=\"#{time_stamp}\" sourceFile=\"#{@filename}\">"
  full_doc = Nokogiri::XML(header)
  root = full_doc.root

  @rows.each do |row|
    mods_xml_doc = row_to_xml(row)

    sub_doc = full_doc.create_element('xmlDoc', :id => 'descMetadata', :objectId => "#{row['druid']}")
    sub_doc.add_child(mods_xml_doc.root)
    root.add_child(sub_doc)
  end

  full_doc.to_s
end
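
The `datetime` attribute uses a 12-hour clock with an AM/PM suffix. The sketch below builds only the opening `<xmlDocs>` tag as a plain string so the format can be checked without Nokogiri. `xml_docs_header` is a hypothetical helper, and escaping the filename with `CGI.escapeHTML` is an assumption added here; the method above interpolates `@filename` directly.

```ruby
require 'cgi'

# Hypothetical helper: builds the opening <xmlDocs> tag only.
def xml_docs_header(filename, now = Time.now, namespace = "http://library.stanford.edu/xmlDocs")
  # %I is the zero-padded 12-hour clock, %p the AM/PM marker.
  time_stamp = now.strftime("%Y-%m-%d %I:%M:%S%p")
  # Assumption: escape the filename so quotes or ampersands cannot break the attribute.
  %(<xmlDocs xmlns="#{namespace}" datetime="#{time_stamp}" sourceFile="#{CGI.escapeHTML(filename)}">)
end

puts xml_docs_header("FitchMLK-v1.xlsx", Time.new(2015, 3, 23, 9, 22, 11))
```

Because the hour is written in 12-hour form, the AM/PM suffix is needed to tell morning from evening runs when sorting or comparing these timestamps as strings.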

#generate_normalized_mods(output_directory) ⇒ Void

Generates normalized (Stanford) MODS XML, writing output to files.

Parameters:

  • output_directory (String)

    The directory where output files should be stored.

Returns:

  • (Void)


# File 'lib/modsulator.rb', line 115

def generate_normalized_mods(output_directory)
  # Write one XML file per data row in the input spreadsheet
  rows.each do |row|
    sourceid = row['sourceId']
    output_filename = output_directory + "/" + sourceid + ".xml"

    mods_doc = row_to_xml(row)
    File.open(output_filename, 'w') { |fh| fh.puts(mods_doc.root.to_s) }
  end
end
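
Each output file is named after the row's sourceId. In the listing above, a row without a sourceId makes the string concatenation raise a TypeError. The sketch below shows the same one-file-per-row pattern with an explicit check; `write_rows` is a hypothetical helper, and the block stands in for `row_to_xml(row).root.to_s`.

```ruby
require 'tmpdir'

# Hypothetical helper: writes one file per row, named "<sourceId>.xml".
# The block receives a row and returns the XML string to write.
def write_rows(rows, output_directory)
  rows.map do |row|
    source_id = row['sourceId'].to_s
    # Assumption: fail early with a descriptive error instead of a TypeError.
    raise ArgumentError, "row has no sourceId" if source_id.empty?

    path = File.join(output_directory, "#{source_id}.xml")
    File.write(path, yield(row) + "\n")  # trailing newline, as File#puts would add
    path
  end
end

Dir.mktmpdir do |dir|
  rows = [{ 'sourceId' => 'item-001' }, { 'sourceId' => 'item-002' }]
  paths = write_rows(rows, dir) { |row| "<mods><!-- #{row['sourceId']} --></mods>" }
  puts paths.map { |p| File.basename(p) }
end
```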

#generate_xml(metadata_row) ⇒ String

Generates an XML string for a given row in a spreadsheet.

Parameters:

  • metadata_row (Hash)

    A single row in a MODS metadata spreadsheet, as provided by the ModsulatorSheet#rows method.

Returns:

  • (String)

    XML template, with data from the row substituted in.



# File 'lib/modsulator.rb', line 84

def generate_xml(metadata_row)
  manifest_row = metadata_row

  # XML escape all of the entries in the manifest row so they won't break the XML
  manifest_row.each {|k,v| manifest_row[k]=Nokogiri::XML::Text.new(v.to_s,Nokogiri::XML('')).to_s if v }

  # Enable access with symbol or string keys
  manifest_row = manifest_row.with_indifferent_access

  # Run the XML template through ERB. This creates a new ERB object from the template XML,
  # NOT creating a separate thread, and omitting newlines for lines ending with '%>'
  template     = ERB.new(template_xml, nil, '>')

  # ERB.result() actually computes the template. This just passes the top level binding.
  mods_xml = template.result(binding)

  # The manifest_row is a hash, with column names as the key.
  # In the template, as a convenience we allow users to put specific column placeholders inside
  # double brackets: "blah [[column_name]] blah".
  # Here we replace those placeholders with the corresponding value
  # from the manifest row.
  manifest_row.each { |k,v| mods_xml.gsub! "[[#{k}]]", v.to_s.strip }

  mods_xml
end
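
The method fills the template in two passes: ERB evaluation first, then `[[column_name]]` placeholder replacement. The following standard-library sketch illustrates that flow under stated assumptions: `fill_template` is a hypothetical name, `CGI.escapeHTML` is swapped in for the Nokogiri text-node escaping, the keyword form `trim_mode: '>'` replaces the positional ERB arguments, and the column names in the sample are made up.

```ruby
require 'erb'
require 'cgi'

# Hypothetical stand-in for the two-pass substitution, stdlib only.
def fill_template(template_xml, row)
  # Escape non-nil values so characters such as & and < cannot break the XML.
  manifest_row = row.transform_values { |v| v.nil? ? nil : CGI.escapeHTML(v.to_s) }

  # Pass 1: evaluate ERB tags. The template can read manifest_row through the binding.
  xml = ERB.new(template_xml, trim_mode: '>').result(binding)

  # Pass 2: replace each [[column_name]] placeholder with the stripped cell value.
  manifest_row.each { |k, v| xml = xml.gsub("[[#{k}]]", v.to_s.strip) }
  xml
end

template = <<~XML
  <mods>
  <% if manifest_row['ti1:title'] %>
    <title>[[ti1:title]]</title>
  <% end %>
    <note>[[no1:note]]</note>
  </mods>
XML

puts fill_template(template, 'ti1:title' => ' Fish & Chips ', 'no1:note' => nil)
```

A nil cell leaves an empty element behind rather than removing it. Placeholders for columns that are absent from the row are left untouched by this pass.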

#row_to_xml(row) ⇒ Object

Converts a single data row into a normalized MODS XML document.

Parameters:

  • row

    A single row in a MODS metadata spreadsheet, as provided by the ModsulatorSheet#rows method.

Returns:

  • An instance of Nokogiri::XML::Document that holds a normalized MODS XML instance.



# File 'lib/modsulator.rb', line 143

def row_to_xml(row)

  # Generate an XML string, then remove any text carried over from the template
  mods_xml = generate_xml(row)
  mods_xml.gsub!(/\[\[[^\]]+\]\]/, "")

  # Remove empty tags from when e.g. <[[sn1:p2:type]]> does not get filled in when [[sn1:p2:type]] has no value in the source spreadsheet
  mods_xml.gsub!(/<\s[^>]+><\/>/, "")

  mods_xml_doc = Nokogiri::XML(mods_xml)
  normalizer = Normalizer.new
  normalizer.normalize_document(mods_xml_doc.root)

  return mods_xml_doc
end
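
The two cleanup patterns can be exercised on a plain string without Nokogiri or the Normalizer. This is a sketch only; `strip_template_residue` is a hypothetical helper and the sample column names are invented for illustration.

```ruby
# Hypothetical helper applying the same two patterns as row_to_xml.
def strip_template_residue(xml)
  # Drop any [[placeholder]] that was not filled in from the row.
  cleaned = xml.gsub(/\[\[[^\]]+\]\]/, "")
  # Drop elements whose tag name was itself a placeholder, e.g. '< usage="x"></>'.
  cleaned.gsub(/<\s[^>]+><\/>/, "")
end

sample = '<name><[[na1:type]] usage="x"></[[na1:type]]>' \
         '<namePart>[[na1:namePart]]</namePart></name>'
puts strip_template_residue(sample)
```

As written, the second pattern requires whitespace and at least one more character after the opening `<`, so it matches a nameless tag that still carries attributes. A bare `<></>` pair is not matched by it.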

#validate_headers(spreadsheet_headers) ⇒ Array<String>

Checks that all the headers in the spreadsheet have a corresponding entry in the XML template.

Parameters:

  • spreadsheet_headers (Array<String>)

    A list of all the headers in the spreadsheet.

Returns:

  • (Array<String>)

    A list of spreadsheet headers that did not appear in the XML template. This list will be empty if all the headers were present.



# File 'lib/modsulator.rb', line 132

def validate_headers(spreadsheet_headers)
  spreadsheet_headers.reject do |header|
    header.nil? || header == "sourceId" || template_xml.include?(header)
  end
end
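
The filtering rule can be tried on its own. In this minimal sketch, `unmatched_headers` is a hypothetical free function that takes the template as an argument instead of reading `template_xml` from the instance; the header names are sample values.

```ruby
# Hypothetical free-standing version of the header check.
def unmatched_headers(spreadsheet_headers, template_xml)
  spreadsheet_headers.reject do |header|
    # nil headers and sourceId are never reported; everything else must
    # appear somewhere in the template text.
    header.nil? || header == "sourceId" || template_xml.include?(header)
  end
end

template = "<mods><title>[[ti1:title]]</title><id>[[druid]]</id></mods>"
p unmatched_headers(["druid", "sourceId", nil, "ti1:title", "bogus"], template)
# => ["bogus"]
```

The check is a plain substring match, so a header such as `title` would also pass against this template because it occurs inside `[[ti1:title]]`.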