Class: Normalizer

Inherits:
Object
  • Object
Defined in:
lib/modsulator/normalizer.rb

Overview

This class provides methods to normalize MODS XML according to the Stanford guidelines.

Constant Summary

LINEFEED =

Linefeed character entity reference

'&#10;'

Instance Method Details

#clean_date_attributes(root) ⇒ Void

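Removes the point attribute from <dateCreated> and <dateIssued> elements that stand alone, that is, the first such element in an <originInfo> when it is not immediately followed by a sibling element of the same name.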

Parameters:

  • root (Nokogiri::XML::Element)

    The root of a MODS XML document.

Returns:

  • (Void)

    The given document is modified in place.



# File 'lib/modsulator/normalizer.rb', line 169

def clean_date_attributes(root)
  
  # Find all the <dateCreated> and <dateIssued> elements that are NOT immediately followed by another element with the same name
  root.xpath('//mods:originInfo/mods:dateCreated[1][not(following-sibling::*[1][self::mods:dateCreated])] | //mods:originInfo/mods:dateIssued[1][not(following-sibling::*[1][self::mods:dateIssued])]', 'mods' => 'http://www.loc.gov/mods/v3').each do |current_element|
    attributes = current_element.attributes
    if(attributes.has_key?('point'))
      current_element.remove_attribute('point')
    end
  end
end

#clean_date_values(root) ⇒ Void

Date fields sometimes carry spurious decimal digits, such as 1920.0. This method removes a trailing decimal point, together with the digits that follow it, from the content of <dateCreated> and <dateIssued>.

Parameters:

  • root (Nokogiri::XML::Element)

    The root of a MODS XML document.

Returns:

  • (Void)

    The given document is modified in place.



# File 'lib/modsulator/normalizer.rb', line 186

def clean_date_values(root)
  root.xpath('//mods:dateCreated | //mods:dateIssued', 'mods' => 'http://www.loc.gov/mods/v3').each do |current_node|
    current_node.content = current_node.content.sub(/(.*)\.\d+$/, '\1')
  end
end
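
To see what the substitution does to individual values, the same regular expression can be applied to plain strings. This is only a sketch: strip_trailing_decimal is a hypothetical helper introduced for illustration and is not part of the Normalizer class.

```ruby
# Hypothetical helper (not part of Normalizer): applies the regular expression
# used by clean_date_values to a plain string instead of an XML node.
def strip_trailing_decimal(value)
  value.sub(/(.*)\.\d+$/, '\1')
end

strip_trailing_decimal("1920.0")    # => "1920"      (dot followed only by digits is dropped)
strip_trailing_decimal("1920")      # => "1920"      (nothing to strip)
strip_trailing_decimal("ca. 1920")  # => "ca. 1920"  (the dot is followed by a space, so no match)
```

Only a dot that is directly followed by digits up to the end of the line is affected, so abbreviations such as "ca." are left alone.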

#clean_linefeeds(node) ⇒ Void

Given the root of an XML document, replaces linefeeds inside <tableOfContents>, <abstract> and <note> nodes with &#10;:

  • \n, \r and <br/> are each replaced by a single &#10;.

  • <p> is replaced by two &#10;.

  • </p> is removed.

  • \r\n is replaced by a single &#10;.

Any tags not listed above are removed, while the text inside them is kept. MODS 3.5 does not allow anything other than text inside these three nodes.

Parameters:

  • node (Nokogiri::XML::Element)

    The root node of an XML document

Returns:

  • (Void)

    This method doesn’t return anything, but introduces UTF-8 linefeed characters in place, as described above.



# File 'lib/modsulator/normalizer.rb', line 77

def clean_linefeeds(node)
  node.xpath("//abstract | //tableOfContents | //note").each do |current_node|
    new_text = substitute_linefeeds(current_node)
    current_node.children.remove
    current_node.content = new_text
  end
end

#clean_text(s) ⇒ String

Cleans up the text of a node:

  • Removes extra whitespace at the beginning and end.

  • Collapses each run of consecutive whitespace within the string into a single space.

Parameters:

  • s (String)

    The text of an XML node.

Returns:

  • (String)

    The cleaned string, as described. Returns nil if the input is nil, or if the input is an empty string.



# File 'lib/modsulator/normalizer.rb', line 93

def clean_text(s)
  return nil if s.nil? || s.empty?
  # Use the non-bang methods: gsub! and strip! return nil when they change nothing.
  s.gsub(/\s+/, " ").strip
end
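
The cleanup described above can be sketched with the non-mutating String methods. Ruby's bang variants, gsub! and strip!, return nil when they make no change, so chaining them is fragile. The name clean_text_sketch is hypothetical and is used here only for illustration.

```ruby
# Hypothetical variant (not part of Normalizer) built on non-mutating methods.
def clean_text_sketch(s)
  return nil if s.nil? || s.empty?
  s.gsub(/\s+/, " ").strip
end

clean_text_sketch("  Stanford \n  University ")  # => "Stanford University"
clean_text_sketch("abc")                         # => "abc"
clean_text_sketch("")                            # => nil

# The bang forms return nil when nothing changes:
"abc".dup.gsub!(/\s+/, " ")  # => nil (no whitespace to replace)
"a b".dup.strip!             # => nil (nothing to strip)
```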

#exceptional?(node) ⇒ Boolean

Checks whether a node has attributes that we make exceptions for. There are two such exceptions.

  • A “collection” attribute with the value “yes” on a typeOfResource tag.

  • A “manuscript” attribute with the value “yes” on a typeOfResource tag.

Nodes that fall under any of these exceptions should not be deleted, even if they have no content.

Parameters:

  • node (Nokogiri::XML::Element)

    An XML node.

Returns:

  • (Boolean)

    true if the node contains any of the exceptional attributes, false otherwise.



# File 'lib/modsulator/normalizer.rb', line 20

def exceptional?(node)
  return false unless node != nil
  
  tag = node.name
  attributes = node.attributes

  if(attributes.empty?)
    return false
  end

  for key, value in attributes do
    if(tag == "typeOfResource")  # Note that according to the MODS schema, any other value than 'yes' for these attributes is invalid
      if((key == "collection" && value.to_s.downcase == "yes") ||
         (key == "manuscript" && value.to_s.downcase == "yes"))
        return true
      end
    end
  end
  return false
end
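
The rule can be illustrated without an XML parser by passing a tag name and a Hash of attributes. This is a sketch only: exceptional_attributes? is a hypothetical stand-in that takes plain Ruby values rather than a Nokogiri node.

```ruby
# Hypothetical stand-in for exceptional? (not part of Normalizer): takes a tag
# name and a Hash of attribute names to values instead of a Nokogiri node.
def exceptional_attributes?(tag, attributes)
  return false unless tag == "typeOfResource"
  %w[collection manuscript].any? do |key|
    attributes[key].to_s.downcase == "yes"
  end
end

exceptional_attributes?("typeOfResource", { "collection" => "yes" })  # => true
exceptional_attributes?("typeOfResource", { "manuscript" => "Yes" })  # => true
exceptional_attributes?("typeOfResource", { "collection" => "no" })   # => false
exceptional_attributes?("note", { "collection" => "yes" })            # => false
```

As in the method above, the comparison ignores case, and the attribute only counts on a typeOfResource tag.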

#normalize_document(root) ⇒ Void

Normalizes the given XML document according to the Stanford guidelines.

Parameters:

  • root (Nokogiri::XML::Element)

    The root of a MODS XML document.

Returns:

  • (Void)

    The given document is modified in place.



# File 'lib/modsulator/normalizer.rb', line 197

def normalize_document(root)
  remove_empty_attributes(root)
  remove_empty_nodes(root)
  trim_text(root)
  clean_linefeeds(root)
  clean_date_attributes(root)
  clean_date_values(root)
end

#normalize_xml_string(xml_string) ⇒ String

Normalizes the given XML document string according to the Stanford guidelines.

Parameters:

  • xml_string (String)

    An XML document

Returns:

  • (String)

    The XML string, with normalizations applied.



# File 'lib/modsulator/normalizer.rb', line 211

def normalize_xml_string(xml_string)
  doc = Nokogiri::XML(xml_string)
  normalize_document(doc.root)
  doc.to_s
end

#remove_empty_attributes(node) ⇒ Void

Removes empty attributes (those whose value is empty or only whitespace) from the given node and, recursively, from all of its descendants.

Parameters:

  • node (Nokogiri::XML::Element)

    An XML node.

Returns:

  • (Void)

    This method doesn’t return anything, but modifies the XML tree starting at the given node.



# File 'lib/modsulator/normalizer.rb', line 104

def remove_empty_attributes(node)
  children = node.children
  attributes = node.attributes
  
  for key, value in attributes do
    if(value.to_s.strip.empty?)
      node.remove_attribute(key)
    end
  end

  children.each do |c|
    remove_empty_attributes(c)
  end
end
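
The recursion can be sketched on a small hand-built tree. AttrNode and remove_empty_attributes_sketch are hypothetical names introduced for this illustration; the real method operates on Nokogiri nodes.

```ruby
# Hypothetical tree node (not a Nokogiri class): a name, a Hash of attributes
# and an Array of child nodes.
AttrNode = Struct.new(:name, :attributes, :children)

# Deletes attributes whose value is empty or whitespace-only, then recurses.
def remove_empty_attributes_sketch(node)
  node.attributes.delete_if { |_key, value| value.to_s.strip.empty? }
  node.children.each { |child| remove_empty_attributes_sketch(child) }
  nil
end

title = AttrNode.new("title", { "lang" => "  ", "type" => "main" }, [])
info  = AttrNode.new("titleInfo", { "usage" => "", "displayLabel" => "Title" }, [title])
remove_empty_attributes_sketch(info)

info.attributes   # => {"displayLabel"=>"Title"}
title.attributes  # => {"type"=>"main"}
```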

#remove_empty_nodes(node) ⇒ Void

Removes empty nodes from an XML tree. See #exceptional? for nodes that are kept even if empty.

Parameters:

  • node (Nokogiri::XML::Element)

    An XML node.

Returns:

  • (Void)

    This method doesn’t return anything, but modifies the XML tree starting at the given node.



# File 'lib/modsulator/normalizer.rb', line 125

def remove_empty_nodes(node)
  children = node.children

  if(node.text?)
    if(node.to_s.strip.empty?)
      node.remove
    else
      return
    end
  elsif(children.length > 0)
    children.each do |c|
      remove_empty_nodes(c)
    end
  end

  if(!exceptional?(node) && (node.children.length == 0))
    node.remove
  end
end
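
The pruning is bottom-up: children are emptied first, so a parent that is left without children is removed in turn. A minimal sketch on a hand-built tree follows. PruneNode, keep_when_empty? and prune! are hypothetical names, and the sketch reports whether a node should be kept instead of removing it from a Nokogiri document.

```ruby
# Hypothetical node (not a Nokogiri class): text is nil for elements,
# children is empty for text nodes.
PruneNode = Struct.new(:name, :attributes, :text, :children)

# Mirrors the exception rule: typeOfResource with collection or manuscript "yes".
def keep_when_empty?(node)
  node.name == "typeOfResource" &&
    %w[collection manuscript].any? { |key| node.attributes[key].to_s.downcase == "yes" }
end

# Returns true if the node should be kept; drops children that should not.
def prune!(node)
  return !node.text.strip.empty? unless node.text.nil?
  node.children.select! { |child| prune!(child) }
  !node.children.empty? || keep_when_empty?(node)
end

blank      = PruneNode.new("text", {}, "   ", [])
title_info = PruneNode.new("titleInfo", {}, nil, [PruneNode.new("title", {}, nil, [blank])])
resource   = PruneNode.new("typeOfResource", { "collection" => "yes" }, nil, [])
note       = PruneNode.new("note", {}, nil, [PruneNode.new("text", {}, "kept", [])])
root       = PruneNode.new("mods", {}, nil, [title_info, resource, note])

prune!(root)
root.children.map(&:name)  # => ["typeOfResource", "note"]
```

Here titleInfo disappears because its only descendant is whitespace, while the empty typeOfResource survives because of its collection attribute.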

#substitute_linefeeds(node) ⇒ String

Recursive helper method for #clean_linefeeds to do string substitution.

Parameters:

  • node (Nokogiri::XML::Element)

    An XML node

Returns:

  • (String)

    A string composed of the entire contents of the given node, with substitutions made as described for #clean_linefeeds.



# File 'lib/modsulator/normalizer.rb', line 46

def substitute_linefeeds(node)
  new_text = String.new

  # If we substitute in '&#10;' by itself, Nokogiri interprets that and then prints '&amp;#10;' when printing the document later. This
  # is an ugly way to add linefeed characters in a way that we at least get well-formatted output in the end.
  if(node.text?)
    new_text = node.content.gsub(/\r\n/, Nokogiri::HTML(LINEFEED).text).gsub(/\n/, Nokogiri::HTML(LINEFEED).text).gsub(/\r/, Nokogiri::HTML(LINEFEED).text) 
  else
    if(node.node_name == "br")
      new_text += Nokogiri::HTML(LINEFEED).text
    elsif(node.node_name == "p")
      new_text += Nokogiri::HTML(LINEFEED).text + Nokogiri::HTML(LINEFEED).text
    end
    
    node.children.each do |c|
      new_text += substitute_linefeeds(c)
    end
  end
  return new_text
end
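
The recursive flattening can be sketched on a hand-built tree, using a literal "\n" where the method above inserts the decoded linefeed character. MarkupNode and flatten_with_linefeeds are hypothetical names used only for this illustration.

```ruby
# Hypothetical node (not a Nokogiri class): text nodes have the name "text".
MarkupNode = Struct.new(:name, :text, :children)

def flatten_with_linefeeds(node)
  # Text: normalize \r\n, \r and \n to a single linefeed each.
  return node.text.gsub(/\r\n|\r|\n/, "\n") if node.name == "text"

  # Elements: <br> contributes one linefeed, <p> two, any other tag nothing.
  prefix = case node.name
           when "br" then "\n"
           when "p"  then "\n\n"
           else ""
           end
  prefix + node.children.map { |child| flatten_with_linefeeds(child) }.join
end

text = ->(s) { MarkupNode.new("text", s, []) }
abstract = MarkupNode.new("abstract", nil, [
  text.call("Line one\r\nLine two"),
  MarkupNode.new("br", nil, []),
  text.call("three"),
  MarkupNode.new("p", nil, [text.call("para")]),
  MarkupNode.new("b", nil, [text.call("bold")])
])

flatten_with_linefeeds(abstract)  # => "Line one\nLine two\nthree\n\nparabold"
```

The <b> tag is dropped but its text survives, which matches the rule that unlisted tags are removed while their content is kept.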

#trim_text(node) ⇒ Void

Removes leading and trailing whitespace from all text under a node.

Parameters:

  • node (Nokogiri::XML::Element)

    An XML node.

Returns:

  • (Void)

    This method doesn’t return anything, but modifies the entire XML tree starting at the given node, removing leading and trailing whitespace from all text. If the input is nil, an exception will be raised.



# File 'lib/modsulator/normalizer.rb', line 152

def trim_text(node)
  children = node.children

  if(node.text?)
    node.parent.content = node.text.strip
  else
    children.each do |c|
      trim_text(c)
    end
  end
end