Class: Stanford::Mods::Normalizer
- Inherits: Object
- Defined in: lib/stanford/mods/normalizer.rb, lib/stanford/mods/normalizer/version.rb
Constant Summary
- LINEFEED =
Linefeed character entity reference
'&#10;'.freeze
- LONE_DATE_XPATH =
Select all single <dateCreated> and <dateIssued> fields
'//mods:originInfo/mods:dateCreated[1][not(following-sibling::*[1][self::mods:dateCreated])]' \
' | //mods:originInfo/mods:dateIssued[1][not(following-sibling::*[1][self::mods:dateIssued])]'.freeze
- DATE_CREATED_ISSUED_XPATH =
Select all <dateCreated> and <dateIssued> fields
'//mods:dateCreated | //mods:dateIssued'.freeze
- MODS_NAMESPACE =
The official MODS namespace, courtesy of the Library of Congress
'http://www.loc.gov/mods/v3'.freeze
- LINEFEED_XPATH =
Selects <abstract>, <tableOfContents> and <note> when no namespace is present
'//abstract | //tableOfContents | //note'.freeze
- LINEFEED_XPATH_NAMESPACED =
Selects <abstract>, <tableOfContents> and <note> when a namespace is present
'//ns:abstract | //ns:tableOfContents | //ns:note'.freeze
- VERSION =
'0.1.0'.freeze
Instance Method Summary
-
#clean_date_values(nodes) ⇒ Void
Sometimes there are spurious decimal digits within the date fields.
-
#clean_linefeeds(node_list) ⇒ Void
Given the root of an XML document, replaces linefeed characters inside <tableOfContents>, <abstract> and <note> XML nodes with a linefeed character (&#10;): \n, \r, \r\n, <br> and <br/> each become a single &#10;, <p> becomes two &#10;, </p> is removed, and any tags not listed here are removed.
-
#clean_text(s) ⇒ String
Cleans up the text of a node:.
-
#exceptional?(node) ⇒ Boolean
Checks if a node has attributes that we make exceptions for.
-
#normalize_document(root) ⇒ Void
Deprecated. Use normalize_mods_document instead.
-
#normalize_mods_document(root) ⇒ Void
Normalizes the given MODS XML document according to the Stanford guidelines.
-
#normalize_xml_string(xml_string) ⇒ String
Normalizes the given XML document string according to the Stanford guidelines.
-
#remove_empty_attributes(node) ⇒ Void
Removes empty attributes from a given node.
-
#remove_empty_nodes(node) ⇒ Void
Removes empty nodes from an XML tree.
-
#substitute_linefeeds(node) ⇒ String
Recursive helper method for #clean_linefeeds to do string substitution.
-
#trim_text(node) ⇒ Void
Removes leading and trailing spaces from a node.
Instance Method Details
#clean_date_values(nodes) ⇒ Void
Sometimes there are spurious decimal digits within the date fields. This method removes a trailing decimal point, together with the digits that follow it, from <dateCreated> and <dateIssued> values (for example, 1966.0 becomes 1966).
    # File 'lib/stanford/mods/normalizer.rb', line 173
    def clean_date_values(nodes)
      nodes.each do |current_node|
        current_node.content = current_node.content.sub(/(.*)\.\d+$/, '\1')
      end
    end
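The effect of the regular expression is easiest to see on plain strings. The following is a minimal sketch that applies the same pattern outside of any XML document; the helper name `strip_trailing_decimals` is hypothetical and not part of the library.

```ruby
# Hypothetical helper (not part of the library): applies the same regular
# expression as clean_date_values to a plain string instead of a node.
def strip_trailing_decimals(value)
  value.sub(/(.*)\.\d+$/, '\1')
end

puts strip_trailing_decimals('1966.0')     # the trailing ".0" is dropped
puts strip_trailing_decimals('1966')       # no decimal part, left unchanged
puts strip_trailing_decimals('1966-02-27') # hyphenated dates are left unchanged
```

Only a decimal part at the very end of the value is affected, because the pattern is anchored with `$`.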
#clean_linefeeds(node_list) ⇒ Void
Given the root of an XML document, replaces linefeed characters inside <tableOfContents>, <abstract> and <note> XML nodes with a linefeed character (&#10;):
-
\n, \r, <br> and <br/> are all replaced by a single &#10;
-
<p> is replaced by two &#10;
-
</p> is removed
-
\r\n is replaced by &#10;
Any tags not listed above are removed. MODS 3.5 does not allow for anything other than text inside these three nodes.
    # File 'lib/stanford/mods/normalizer.rb', line 94
    def clean_linefeeds(node_list)
      node_list.each do |current_node|
        new_text = substitute_linefeeds(current_node)
        current_node.children.remove
        current_node.content = new_text
      end
    end
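The replacement rules can be approximated on a plain string to show the intended outcome. This is only a sketch under simplifying assumptions: the library walks the Nokogiri node tree through #substitute_linefeeds, whereas the hypothetical helper `approximate_linefeeds` below uses regular expressions and is only meant for simple, well-formed fragments.

```ruby
# Hypothetical string-based approximation (not the library's implementation,
# which recurses over Nokogiri nodes). Each step mirrors one documented rule.
def approximate_linefeeds(fragment)
  fragment
    .gsub(/\r\n|\n|\r|\\n/, "\n")  # line endings and a literal backslash-n become one linefeed
    .gsub(%r{<br\s*/?>}, "\n")     # <br> and <br/> become one linefeed
    .gsub(/<p(\s[^>]*)?>/, "\n\n") # <p> becomes two linefeeds
    .gsub(/<[^>]+>/, '')           # </p> and every other tag are removed
end

p approximate_linefeeds('One<br/>Two<p>Three</p><b>Four</b>')
```

In this input the `<b>` element is not one of the listed tags, so its markup disappears while its text is kept.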
#clean_text(s) ⇒ String
Cleans up the text of a node:
-
Removes extra whitespace at the beginning and end.
-
Removes any consecutive whitespace within the string.
    # File 'lib/stanford/mods/normalizer.rb', line 109
    def clean_text(s)
      return nil unless !s.nil? && s != ''
      s.gsub(/\s+/, ' ').strip
    end
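As a quick illustration, the method body shown above can be run as a free-standing function. The sample inputs are made up for this sketch.

```ruby
# Free-standing copy of the method body shown above, for experimentation.
def clean_text(s)
  return nil unless !s.nil? && s != ''
  s.gsub(/\s+/, ' ').strip
end

p clean_text("  A   title\n\twith  gaps ") # => "A title with gaps"
p clean_text('')                           # => nil
p clean_text(nil)                          # => nil
p clean_text('   ')                        # => ""
```

Note that a whitespace-only string yields an empty string rather than nil, since only nil and the empty string are caught by the guard clause.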
#exceptional?(node) ⇒ Boolean
Checks if a node has attributes that we make exceptions for. There are two such exceptions.
-
A “collection” attribute with the value “yes” on a typeOfResource tag.
-
A “manuscript” attribute with the value “yes” on a typeOfResource tag.
Nodes that fall under any of these exceptions should not be deleted, even if they have no content.
    # File 'lib/stanford/mods/normalizer.rb', line 39
    def exceptional?(node)
      return false if node.nil?

      tag = node.name
      attributes = node.attributes
      return false if attributes.empty?

      attributes.each do |key, value|
        next unless tag == 'typeOfResource'
        # Note that according to the MODS schema, any other value than 'yes' for these attributes is invalid
        if (key == 'collection' && value.to_s.casecmp('yes').zero?) ||
           (key == 'manuscript' && value.to_s.casecmp('yes').zero?)
          return true
        end
      end
      false
    end
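The decision rule can be tried without an XML parser. The sketch below uses a hypothetical `FakeNode` struct as a stand-in for a Nokogiri node, and a hypothetical method `exceptional_node?` that restates the same two exceptions; neither name exists in the library.

```ruby
# Hypothetical stand-in for an XML node: a tag name plus a Hash of attributes.
# The library itself receives Nokogiri nodes.
FakeNode = Struct.new(:name, :attributes)

# Restates the two exceptions: collection="yes" or manuscript="yes" on a
# typeOfResource tag, compared case-insensitively.
def exceptional_node?(node)
  return false if node.nil? || node.name != 'typeOfResource'

  %w[collection manuscript].any? do |key|
    node.attributes[key].to_s.casecmp('yes').zero?
  end
end

p exceptional_node?(FakeNode.new('typeOfResource', { 'collection' => 'YES' }))
p exceptional_node?(FakeNode.new('typeOfResource', {}))
p exceptional_node?(FakeNode.new('note', { 'collection' => 'yes' }))
```

Only the first call describes a node that would be kept even when it has no content.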
#normalize_document(root) ⇒ Void
Deprecated. Use normalize_mods_document instead.
Normalizes the given MODS XML document according to the Stanford guidelines.
    # File 'lib/stanford/mods/normalizer.rb', line 202
    def normalize_document(root)
      normalize_mods_document(root)
    end
#normalize_mods_document(root) ⇒ Void
Normalizes the given MODS XML document according to the Stanford guidelines.
    # File 'lib/stanford/mods/normalizer.rb', line 183
    def normalize_mods_document(root)
      node_list = if root.namespace.nil?
                    root.xpath(LINEFEED_XPATH)
                  else
                    root.xpath(LINEFEED_XPATH_NAMESPACED, 'ns' => root.namespace.href)
                  end
      clean_linefeeds(node_list) # Do this before deleting <br> and <p> with remove_empty_nodes()

      remove_empty_attributes(root)
      remove_empty_nodes(root)
      trim_text(root)
      clean_date_values(root.xpath(DATE_CREATED_ISSUED_XPATH, 'mods' => MODS_NAMESPACE))
    end
#normalize_xml_string(xml_string) ⇒ String
Normalizes the given XML document string according to the Stanford guidelines.
    # File 'lib/stanford/mods/normalizer.rb', line 210
    def normalize_xml_string(xml_string)
      doc = Nokogiri::XML(xml_string)
      normalize_document(doc.root)
      doc.to_s
    end
#remove_empty_attributes(node) ⇒ Void
Removes empty attributes from a given node.
    # File 'lib/stanford/mods/normalizer.rb', line 118
    def remove_empty_attributes(node)
      children = node.children
      attributes = node.attributes

      attributes.each do |key, value|
        node.remove_attribute(key) if value.to_s.strip.empty?
      end

      children.each do |c|
        remove_empty_attributes(c)
      end
    end
#remove_empty_nodes(node) ⇒ Void
Removes empty nodes from an XML tree. See #exceptional? for nodes that are kept even if empty.
    # File 'lib/stanford/mods/normalizer.rb', line 135
    def remove_empty_nodes(node)
      children = node.children

      if node.text?
        return node.remove if node.to_s.strip.empty?
        return
      elsif !children.empty?
        children.each do |c|
          remove_empty_nodes(c)
        end
      end

      node.remove if !exceptional?(node) && node.children.empty?
    end
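Because children are visited before their parent is tested, an element whose only descendants are empty is removed as well. The sketch below shows that bottom-up order on a hypothetical `SketchNode` struct; it is a simplification that stores text directly on the node and ignores the #exceptional? rule, so it is not the library's implementation.

```ruby
# Hypothetical tree node for illustration: a tag name, optional text, and
# child nodes. Nokogiri models text as separate child nodes instead.
SketchNode = Struct.new(:name, :text, :children)

# Prunes children first (post-order), then drops every child that ended up
# with neither child nodes nor non-blank text.
def prune_empty(node)
  node.children.each { |child| prune_empty(child) }
  node.children.reject! { |child| child.children.empty? && child.text.to_s.strip.empty? }
  node
end

title      = SketchNode.new('title', '   ', [])
title_info = SketchNode.new('titleInfo', nil, [title])
note       = SketchNode.new('note', 'keep me', [])
root       = SketchNode.new('mods', nil, [title_info, note])

p prune_empty(root).children.map(&:name)
```

Here `titleInfo` disappears along with its blank `title`, while `note` survives because it carries text.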
#substitute_linefeeds(node) ⇒ String
Recursive helper method for #clean_linefeeds to do string substitution.
    # File 'lib/stanford/mods/normalizer.rb', line 63
    def substitute_linefeeds(node)
      new_text = ''

      # If we substitute in '&#10;' by itself, Nokogiri interprets that and then prints '&amp;#10;' when printing the document later. This
      # is an ugly way to add linefeed characters in a way that we at least get well-formatted output in the end.
      if node.text?
        new_text = node.content.gsub(/(\r\n|\n|\r|\\n)/, Nokogiri::HTML(LINEFEED).text)
      else
        if node.node_name == 'br'
          new_text += Nokogiri::HTML(LINEFEED).text
        elsif node.node_name == 'p'
          new_text += Nokogiri::HTML(LINEFEED).text + Nokogiri::HTML(LINEFEED).text
        end

        node.children.each do |c|
          new_text += substitute_linefeeds(c)
        end
      end
      new_text
    end
#trim_text(node) ⇒ Void
Removes leading and trailing spaces from a node.
    # File 'lib/stanford/mods/normalizer.rb', line 156
    def trim_text(node)
      children = node.children

      if node.text?
        node.parent.content = node.text.strip
      else
        children.each do |c|
          trim_text(c)
        end
      end
    end