Class: Normalizer
- Inherits:
- Object (hierarchy: Object, Normalizer)
- Defined in:
- lib/modsulator/normalizer.rb
Overview
This class provides methods to normalize MODS XML according to the Stanford guidelines.
Constant Summary
- LINEFEED =
Linefeed character entity reference
'&#10;'
Instance Method Summary
-
#clean_date_attributes(root) ⇒ Void
Removes the point attribute from single <dateCreated> and <dateIssued> elements.
-
#clean_date_values(root) ⇒ Void
Sometimes there are spurious decimal digits within the date fields.
-
#clean_linefeeds(node) ⇒ Void
Given the root of an XML document, replaces linefeed characters inside <tableOfContents>, <abstract> and <note> XML nodes with &#10;. \n, \r, <br> and <br/> are each replaced by a single &#10;; <p> is replaced by two &#10;; </p> is removed; \r\n is replaced by a single &#10;. Any tags not listed above are removed.
-
#clean_text(s) ⇒ String
Cleans up the text of a node.
-
#exceptional?(node) ⇒ Boolean
Checks if a node has attributes that we make exceptions for.
-
#normalize_document(root) ⇒ Void
Normalizes the given XML document according to the Stanford guidelines.
-
#normalize_xml_string(xml_string) ⇒ String
Normalizes the given XML document string according to the Stanford guidelines.
-
#remove_empty_attributes(node) ⇒ Void
Removes empty attributes from a given node.
-
#remove_empty_nodes(node) ⇒ Void
Removes empty nodes from an XML tree.
-
#substitute_linefeeds(node) ⇒ String
Recursive helper method for #clean_linefeeds to do string substitution.
-
#trim_text(node) ⇒ Void
Removes leading and trailing spaces from a node.
Instance Method Details
#clean_date_attributes(root) ⇒ Void
Removes the point attribute from single <dateCreated> and <dateIssued> elements.
# File 'lib/modsulator/normalizer.rb', line 169
def clean_date_attributes(root)
  # Find all the <dateCreated> and <dateIssued> elements that are NOT immediately followed by another element with the same name
  root.xpath('//mods:originInfo/mods:dateCreated[1][not(following-sibling::*[1][self::mods:dateCreated])] | //mods:originInfo/mods:dateIssued[1][not(following-sibling::*[1][self::mods:dateIssued])]', 'mods' => 'http://www.loc.gov/mods/v3').each do |current_element|
    attributes = current_element.attributes
    if(attributes.has_key?('point'))
      current_element.remove_attribute('point')
    end
  end
end
#clean_date_values(root) ⇒ Void
Sometimes there are spurious decimal digits within the date fields. This method removes a trailing decimal point, together with the digits that follow it, from <dateCreated> and <dateIssued> values.
# File 'lib/modsulator/normalizer.rb', line 186
def clean_date_values(root)
  root.xpath('//mods:dateCreated | //mods:dateIssued', 'mods' => 'http://www.loc.gov/mods/v3').each do |current_node|
    current_node.content = current_node.content.sub(/(.*)\.\d+$/, '\1')
  end
end
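The substitution at the core of this method can be tried on plain strings, without parsing any XML. The following is a minimal sketch; `strip_trailing_decimals` is a hypothetical helper name used only for illustration and is not part of Normalizer.

```ruby
# Hypothetical helper that applies the same regular expression as
# #clean_date_values to a bare string. Not part of the Normalizer class.
def strip_trailing_decimals(value)
  # Drop a final "." plus the digits after it, if the string ends that way.
  value.sub(/(.*)\.\d+$/, '\1')
end

puts strip_trailing_decimals('1984.0')      # spreadsheet-style year
puts strip_trailing_decimals('1984')        # no decimal part, left unchanged
puts strip_trailing_decimals('1984-05-17')  # no decimal point, left unchanged
```

Because the pattern only looks at the end of the string, a dot-separated value such as `1984.05.17` would also lose its last component, which is worth keeping in mind if such dates can occur in the input.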
#clean_linefeeds(node) ⇒ Void
Given the root of an XML document, replaces linefeed characters inside <tableOfContents>, <abstract> and <note> XML nodes with &#10;:
-
\n, \r, <br> and <br/> are each replaced by a single &#10;.
-
<p> is replaced by two &#10;.
-
</p> is removed.
-
\r\n is replaced by a single &#10;.
Any tags not listed above are removed. MODS 3.5 does not allow for anything other than text inside these three nodes.
# File 'lib/modsulator/normalizer.rb', line 77
def clean_linefeeds(node)
  node.xpath("//abstract | //tableOfContents | //note").each do |current_node|
    new_text = substitute_linefeeds(current_node)
    current_node.children.remove
    current_node.content = new_text
  end
end
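The replacement rules can be approximated on a plain string to see what text ends up in the node. This is only a sketch under simplifying assumptions: the library walks a parsed Nokogiri tree, whereas the hypothetical `flatten_markup` helper below uses regular expressions on raw markup and is not part of Normalizer.

```ruby
# Hypothetical string-based approximation of the linefeed rules.
# The real implementation recurses over Nokogiri nodes instead.
def flatten_markup(text)
  out = text.gsub(/\r\n|\r|\n/, "\n")  # \r\n, \r and \n each become one linefeed
  out = out.gsub(/<br\b[^>]*>/, "\n")  # <br> and <br/> become one linefeed
  out = out.gsub(/<p\b[^>]*>/, "\n\n") # an opening <p> becomes two linefeeds
  out = out.gsub(%r{</p>}, '')         # a closing </p> is dropped
  out.gsub(/<[^>]+>/, '')              # any other tag is removed, its text kept
end

p flatten_markup("<p>First line<br/>second line</p><i>Note</i>")
p flatten_markup("one\r\ntwo\rthree")
```

The first call yields two leading linefeeds for the `<p>`, one linefeed for the `<br/>`, and keeps the text of the `<i>` element while discarding the tag itself.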
#clean_text(s) ⇒ String
Cleans up the text of a node:
-
Removes extra whitespace at the beginning and end.
-
Collapses any run of consecutive whitespace within the string into a single space.
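# File 'lib/modsulator/normalizer.rb', line 93
def clean_text(s)
  return nil unless s != nil && s != ""
  # Non-bang gsub and strip are used here: gsub! and strip! return nil when
  # they change nothing, which would yield nil (or raise) for clean input.
  return s.gsub(/\s+/, " ").strip
end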
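Ruby's `gsub!` and `strip!` return nil when they make no change, so a clean-up like this one is safest when written with the non-bang forms. The sketch below shows that behavior and a stand-alone version of the clean-up; `squeeze_whitespace` is a hypothetical name, not a method of Normalizer.

```ruby
# Bang methods signal "nothing changed" by returning nil.
p 'abc'.dup.gsub!(/\s+/, ' ')  # no whitespace to replace
p 'abc'.dup.strip!             # nothing to strip

# Hypothetical stand-alone clean-up using non-bang methods, so the result
# is always a String for non-empty input and the argument is not mutated.
def squeeze_whitespace(s)
  return nil if s.nil? || s.empty?
  s.gsub(/\s+/, ' ').strip
end

p squeeze_whitespace("  A  tale\n of\ttwo   cities ")
p squeeze_whitespace('already clean')
p squeeze_whitespace('')
```

The first two lines print nil, which is why chaining `strip!` onto `gsub!` is fragile; the helper returns the cleaned string in both non-empty cases and nil only for empty input.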
#exceptional?(node) ⇒ Boolean
Checks if a node has attributes that we make exceptions for. There are two such exceptions.
-
A “collection” attribute with the value “yes” on a typeOfResource tag.
-
A “manuscript” attribute with the value “yes” on a typeOfResource tag.
Nodes that fall under either of these exceptions should not be deleted, even if they have no content.
# File 'lib/modsulator/normalizer.rb', line 20
def exceptional?(node)
  return false unless node != nil

  tag = node.name
  attributes = node.attributes

  if(attributes.empty?)
    return false
  end

  for key, value in attributes do
    if(tag == "typeOfResource")
      # Note that according to the MODS schema, any other value than 'yes' for these attributes is invalid
      if((key == "collection" && value.to_s.downcase == "yes") ||
         (key == "manuscript" && value.to_s.downcase == "yes"))
        return true
      end
    end
  end
  return false
end
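The decision can be illustrated without Nokogiri by modelling a node as a name plus an attribute hash. Both `FakeNode` and `keep_when_empty?` below are hypothetical stand-ins introduced for this sketch; the real method receives a Nokogiri node.

```ruby
# Hypothetical stand-in for a parsed element: just a tag name and a hash
# of attribute names to values. Not a Nokogiri class.
FakeNode = Struct.new(:name, :attributes)

# Mirrors the rule described above: only <typeOfResource> qualifies, and
# only when collection="yes" or manuscript="yes" (case-insensitive).
def keep_when_empty?(node)
  return false if node.nil? || node.name != 'typeOfResource'
  %w[collection manuscript].any? do |key|
    node.attributes[key].to_s.downcase == 'yes'
  end
end

puts keep_when_empty?(FakeNode.new('typeOfResource', { 'collection' => 'yes' }))
puts keep_when_empty?(FakeNode.new('typeOfResource', { 'manuscript' => 'no' }))
puts keep_when_empty?(FakeNode.new('note', { 'collection' => 'yes' }))
```

Only the first call prints true: the second has a non-"yes" value, and the third carries the attribute on a tag other than typeOfResource.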
#normalize_document(root) ⇒ Void
Normalizes the given XML document according to the Stanford guidelines.
# File 'lib/modsulator/normalizer.rb', line 197
def normalize_document(root)
  remove_empty_attributes(root)
  remove_empty_nodes(root)
  trim_text(root)
  clean_linefeeds(root)
  clean_date_attributes(root)
  clean_date_values(root)
end
#normalize_xml_string(xml_string) ⇒ String
Normalizes the given XML document string according to the Stanford guidelines.
# File 'lib/modsulator/normalizer.rb', line 211
def normalize_xml_string(xml_string)
  doc = Nokogiri::XML(xml_string)
  normalize_document(doc.root)
  doc.to_s
end
#remove_empty_attributes(node) ⇒ Void
Removes empty attributes from a given node.
# File 'lib/modsulator/normalizer.rb', line 104
def remove_empty_attributes(node)
  children = node.children
  attributes = node.attributes

  for key, value in attributes do
    if(value.to_s.strip.empty?)
      node.remove_attribute(key)
    end
  end

  children.each do |c|
    remove_empty_attributes(c)
  end
end
#remove_empty_nodes(node) ⇒ Void
Removes empty nodes from an XML tree. See #exceptional? for nodes that are kept even if empty.
# File 'lib/modsulator/normalizer.rb', line 125
def remove_empty_nodes(node)
  children = node.children

  if(node.text?)
    if(node.to_s.strip.empty?)
      node.remove
    else
      return
    end
  elsif(children.length > 0)
    children.each do |c|
      remove_empty_nodes(c)
    end
  end

  if(!exceptional?(node) && (node.children.length == 0))
    node.remove
  end
end
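The method prunes bottom-up: children are cleaned first, and an element is removed only if nothing is left beneath it. The sketch below shows the same post-order idea on a miniature tree built from plain arrays. The tree shape and the `prune` helper are assumptions made for illustration, and the sketch leaves out the #exceptional? check.

```ruby
# Hypothetical miniature tree: an element is [name, children] and a text
# node is a String. This is not how Nokogiri represents documents.
def prune(node)
  if node.is_a?(String)
    # Whitespace-only text counts as empty; other text is kept untouched.
    return node.strip.empty? ? nil : node
  end

  name, children = node
  # Prune the children first, then decide about this element.
  kept = children.map { |child| prune(child) }.compact
  kept.empty? ? nil : [name, kept]
end

tree = ['mods', [
  ['note', ['  ']],                        # whitespace only, removed
  ['titleInfo', [['title', ['Hamlet']]]],  # has real text, kept
  ['subject', [['topic', []]]]             # empty child makes the parent empty
]]

p prune(tree)
```

The `subject` element disappears even though it had a child, because that child was itself empty and was removed first.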
#substitute_linefeeds(node) ⇒ String
Recursive helper method for #clean_linefeeds to do string substitution.
# File 'lib/modsulator/normalizer.rb', line 46
def substitute_linefeeds(node)
  new_text = String.new

  # If we substitute in '&#10;' by itself, Nokogiri interprets that and then prints '&amp;#10;' when printing the document later. This
  # is an ugly way to add linefeed characters in a way that we at least get well-formatted output in the end.
  if(node.text?)
    new_text = node.content.gsub(/\r\n/, Nokogiri::HTML(LINEFEED).text).gsub(/\n/, Nokogiri::HTML(LINEFEED).text).gsub(/\r/, Nokogiri::HTML(LINEFEED).text)
  else
    if(node.node_name == "br")
      new_text += Nokogiri::HTML(LINEFEED).text
    elsif(node.node_name == "p")
      new_text += Nokogiri::HTML(LINEFEED).text + Nokogiri::HTML(LINEFEED).text;
    end

    node.children.each do |c|
      new_text += substitute_linefeeds(c)
    end
  end
  return new_text
end
#trim_text(node) ⇒ Void
Removes leading and trailing spaces from a node.
# File 'lib/modsulator/normalizer.rb', line 152
def trim_text(node)
  children = node.children

  if(node.text?)
    node.parent.content = node.text.strip
  else
    children.each do |c|
      trim_text(c)
    end
  end
end