Class: Stanford::Mods::Normalizer

Inherits: Object

Defined in: lib/stanford/mods/normalizer.rb,
lib/stanford/mods/normalizer/version.rb

Constant Summary

LINEFEED = Linefeed character entity reference

'&#10;'.freeze

LONE_DATE_XPATH = Select all single <dateCreated> and <dateIssued> fields

'//mods:originInfo/mods:dateCreated[1][not(following-sibling::*[1][self::mods:dateCreated])]' \
' | //mods:originInfo/mods:dateIssued[1][not(following-sibling::*[1][self::mods:dateIssued])]'.freeze

DATE_CREATED_ISSUED_XPATH = Select all <dateCreated> and <dateIssued> fields

'//mods:dateCreated | //mods:dateIssued'.freeze

MODS_NAMESPACE = The official MODS namespace, courtesy of the Library of Congress

'http://www.loc.gov/mods/v3'.freeze

LINEFEED_XPATH = Selects <abstract>, <tableOfContents> and <note> when no namespace is present

'//abstract | //tableOfContents | //note'.freeze

LINEFEED_XPATH_NAMESPACED = Selects <abstract>, <tableOfContents> and <note> when a namespace is present

'//ns:abstract | //ns:tableOfContents | //ns:note'.freeze

VERSION =

'0.1.0'.freeze

Instance Method Summary

#clean_date_values(nodes) ⇒ Void

Sometimes there are spurious decimal digits within the date fields.
#clean_linefeeds(node_list) ⇒ Void

Given a set of <tableOfContents>, <abstract> and <note> nodes, replaces linefeed characters and <br> and <p> tags inside them with &#10; and removes any other tags.
#clean_text(s) ⇒ String

Cleans up the text of a node.
#exceptional?(node) ⇒ Boolean

Checks if a node has attributes that we make exceptions for.
#normalize_document(root) ⇒ Void (deprecated)

Use normalize_mods_document instead.
#normalize_mods_document(root) ⇒ Void

Normalizes the given MODS XML document according to the Stanford guidelines.
#normalize_xml_string(xml_string) ⇒ String

Normalizes the given XML document string according to the Stanford guidelines.
#remove_empty_attributes(node) ⇒ Void

Removes empty attributes from a given node.
#remove_empty_nodes(node) ⇒ Void

Removes empty nodes from an XML tree.
#substitute_linefeeds(node) ⇒ String

Recursive helper method for #clean_linefeeds to do string substitution.
#trim_text(node) ⇒ Void

Removes leading and trailing spaces from a node.

Instance Method Details

#clean_date_values(nodes) ⇒ `Void`

Sometimes there are spurious decimal digits within the date fields. This method removes a trailing decimal fraction (a decimal point followed by one or more digits) from the end of each <dateCreated> and <dateIssued> value.

Parameters:

nodes (Nokogiri::XML::NodeSet) —

A set of all affected <dateCreated> and <dateIssued> elements.

Returns:

(Void) —

The given document is modified in place.

# File 'lib/stanford/mods/normalizer.rb', line 173

def clean_date_values(nodes)
  nodes.each do |current_node|
    current_node.content = current_node.content.sub(/(.*)\.\d+$/, '\1')
  end
end
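
To make the effect of the substitution concrete, the sketch below applies the same regular expression to plain strings instead of Nokogiri nodes. The helper name `strip_decimal_fraction` is hypothetical and exists only for illustration; it is not part of the library.

```ruby
# Hypothetical helper, not part of the library: applies the same substitution
# that clean_date_values performs on each node's content.
def strip_decimal_fraction(text)
  text.sub(/(.*)\.\d+$/, '\1')
end

p strip_decimal_fraction('1850.0')    # => "1850"     (trailing ".0" is dropped)
p strip_decimal_fraction('1850')      # => "1850"     (no decimal part, unchanged)
p strip_decimal_fraction('ca. 1850')  # => "ca. 1850" (dot is not directly followed by digits)
```

Only a dot that is immediately followed by digits at the end of the value is affected, so abbreviations such as "ca." are left alone.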

#clean_linefeeds(node_list) ⇒ `Void`

Given a set of <tableOfContents>, <abstract> and <note> nodes, replaces linefeed characters inside them with &#10;, as follows:

\n, \r, <br> and <br/> are all replaced by a single &#10;
<p> is replaced by two &#10;
</p> is removed
\r\n is replaced by &#10;
Any tags not listed above are removed; their text content is kept.

MODS 3.5 does not allow for anything other than text inside these three nodes.

Parameters:

node_list (Nokogiri::XML::NodeSet) —

All <tableOfContents>, <abstract> and <note> elements.

Returns:

(Void) —

This method doesn’t return anything, but introduces UTF-8 linefeed characters in place, as described above.

# File 'lib/stanford/mods/normalizer.rb', line 94

def clean_linefeeds(node_list)
  node_list.each do |current_node|
    new_text = substitute_linefeeds(current_node)
    current_node.children.remove
    current_node.content = new_text
  end
end

#clean_text(s) ⇒ `String`

Cleans up the text of a node:

Removes extra whitespace at the beginning and end.
Collapses any run of consecutive whitespace within the string into a single space.

Parameters:

s (String) —

The text of an XML node.

Returns:

(String) —

The cleaned string, as described. Returns nil if the input is nil, or if the input is an empty string.

# File 'lib/stanford/mods/normalizer.rb', line 109

def clean_text(s)
  return nil unless !s.nil? && s != ''
  s.gsub(/\s+/, ' ').strip
end
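
The following sketch exercises the behavior described above, including the edge cases. The method is restated as a standalone copy (with an equivalent guard clause) so the snippet runs on its own, without the library loaded.

```ruby
# Standalone copy of clean_text so that this snippet is self-contained.
def clean_text(s)
  return nil if s.nil? || s == ''
  s.gsub(/\s+/, ' ').strip
end

p clean_text("  Stanford \n\t University  ")  # => "Stanford University"
p clean_text('')                              # => nil
p clean_text(nil)                             # => nil
p clean_text('   ')                           # => ""
```

Note the last case: a whitespace-only string is not empty, so it passes the guard and comes back as an empty string rather than nil.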

#exceptional?(node) ⇒ `Boolean`

Checks if a node has attributes that we make exceptions for. There are two such exceptions.

A “collection” attribute with the value “yes” on a typeOfResource tag.
A “manuscript” attribute with the value “yes” on a typeOfResource tag.

Nodes that fall under any of these exceptions should not be deleted, even if they have no content.

Parameters:

node (Nokogiri::XML::Element) —

An XML node.

Returns:

(Boolean) —

true if the node contains any of the exceptional attributes, false otherwise.

# File 'lib/stanford/mods/normalizer.rb', line 39

def exceptional?(node)
  return false if node.nil?

  tag = node.name
  attributes = node.attributes

  return false if attributes.empty?

  attributes.each do |key, value|
    next unless tag == 'typeOfResource'
    # Note that according to the MODS schema, any other value than 'yes' for these attributes is invalid
    if (key == 'collection' && value.to_s.casecmp('yes').zero?) ||
       (key == 'manuscript' && value.to_s.casecmp('yes').zero?)
      return true
    end
  end
  false
end
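
The rule can also be stated as a pure predicate over a tag name and an attribute hash, which may be easier to reason about than the loop above. This is only a sketch: `exceptional_attributes?` is a hypothetical helper that takes a plain Hash of strings instead of a Nokogiri element.

```ruby
# Hypothetical stand-in for exceptional?: works on a tag name and a plain Hash
# of attribute names to values rather than on a Nokogiri element.
def exceptional_attributes?(tag, attributes)
  return false unless tag == 'typeOfResource'

  # Either attribute qualifies when its value is "yes", compared case-insensitively.
  %w[collection manuscript].any? do |key|
    attributes[key].to_s.casecmp('yes').zero?
  end
end

p exceptional_attributes?('typeOfResource', { 'collection' => 'yes' })  # => true
p exceptional_attributes?('typeOfResource', { 'manuscript' => 'YES' })  # => true
p exceptional_attributes?('typeOfResource', { 'collection' => 'no' })   # => false
p exceptional_attributes?('note', { 'collection' => 'yes' })            # => false
```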

#normalize_document(root) ⇒ `Void`

Deprecated.

Use normalize_mods_document instead.

Normalizes the given MODS XML document according to the Stanford guidelines.

Parameters:

root (Nokogiri::XML::Element) —

The root of a MODS XML document.

Returns:

(Void) —

The given document is modified in place.



# File 'lib/stanford/mods/normalizer.rb', line 202

def normalize_document(root)
  normalize_mods_document(root)
end

#normalize_mods_document(root) ⇒ `Void`

Normalizes the given MODS XML document according to the Stanford guidelines.

Parameters:

root (Nokogiri::XML::Element) —

The root of a MODS XML document.

Returns:

(Void) —

The given document is modified in place.

# File 'lib/stanford/mods/normalizer.rb', line 183

def normalize_mods_document(root)
  node_list = if root.namespace.nil?
                root.xpath(LINEFEED_XPATH)
              else
                root.xpath(LINEFEED_XPATH_NAMESPACED, 'ns' => root.namespace.href)
              end
  clean_linefeeds(node_list) # Do this before deleting <br> and <p> with remove_empty_nodes()

  remove_empty_attributes(root)
  remove_empty_nodes(root)
  trim_text(root)
  clean_date_values(root.xpath(DATE_CREATED_ISSUED_XPATH, 'mods' => MODS_NAMESPACE))
end

#normalize_xml_string(xml_string) ⇒ `String`

Normalizes the given XML document string according to the Stanford guidelines.

Parameters:

xml_string (String) —

An XML document

Returns:

(String) —

The XML string, with normalizations applied.

# File 'lib/stanford/mods/normalizer.rb', line 210

def normalize_xml_string(xml_string)
  doc = Nokogiri::XML(xml_string)
  normalize_document(doc.root)
  doc.to_s
end

#remove_empty_attributes(node) ⇒ `Void`

Removes empty attributes from a given node.

Parameters:

node (Nokogiri::XML::Element) —

An XML node.

Returns:

(Void) —

This method doesn’t return anything, but modifies the XML tree starting at the given node.

# File 'lib/stanford/mods/normalizer.rb', line 118

def remove_empty_attributes(node)
  children = node.children
  attributes = node.attributes

  attributes.each do |key, value|
    node.remove_attribute(key) if value.to_s.strip.empty?
  end

  children.each do |c|
    remove_empty_attributes(c)
  end
end

#remove_empty_nodes(node) ⇒ `Void`

Removes empty nodes from an XML tree. See #exceptional? for nodes that are kept even if empty.

Parameters:

node (Nokogiri::XML::Element) —

An XML node.

Returns:

(Void) —

This method doesn’t return anything, but modifies the XML tree starting at the given node.

# File 'lib/stanford/mods/normalizer.rb', line 135

def remove_empty_nodes(node)
  children = node.children

  if node.text?
    return node.remove if node.to_s.strip.empty?
    return
  elsif !children.empty?
    children.each do |c|
      remove_empty_nodes(c)
    end
  end

  node.remove if !exceptional?(node) && node.children.empty?
end

#substitute_linefeeds(node) ⇒ `String`

Recursive helper method for #clean_linefeeds to do string substitution.

Parameters:

node (Nokogiri::XML::Element) —

An XML node

Returns:

(String) —

A string composed of the entire contents of the given node, with substitutions made as described for #clean_linefeeds.

# File 'lib/stanford/mods/normalizer.rb', line 63

def substitute_linefeeds(node)
  new_text = ''

  # If we substitute in '&#10;' by itself, Nokogiri interprets that and then prints '&amp;#10;' when printing the document later. This
  # is an ugly way to add linefeed characters in a way that we at least get well-formatted output in the end.
  if node.text?
    new_text = node.content.gsub(/(\r\n|\n|\r|\\n)/, Nokogiri::HTML(LINEFEED).text)
  else
    if node.node_name == 'br'
      new_text += Nokogiri::HTML(LINEFEED).text
    elsif node.node_name == 'p'
      new_text += Nokogiri::HTML(LINEFEED).text + Nokogiri::HTML(LINEFEED).text
    end

    node.children.each do |c|
      new_text += substitute_linefeeds(c)
    end
  end
  new_text
end
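
The text-node branch can be tried in isolation on a plain string. The sketch below assumes that `Nokogiri::HTML(LINEFEED).text` evaluates to a single linefeed character, as the description of #clean_linefeeds implies; `normalize_line_endings` is a hypothetical name used only here.

```ruby
# Hypothetical helper mirroring the text-node branch of substitute_linefeeds.
# Assumption: the replacement string is a plain "\n".
def normalize_line_endings(text)
  text.gsub(/(\r\n|\n|\r|\\n)/, "\n")
end

# The input mixes a Windows line ending, a bare carriage return and a literal
# backslash followed by "n" (two characters).
p normalize_line_endings("one\r\ntwo\rthree\\nfour")  # => "one\ntwo\nthree\nfour"
```

Because `\r\n` is listed first in the alternation, a Windows line ending yields one linefeed, not two.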

#trim_text(node) ⇒ `Void`

Removes leading and trailing spaces from a node.

Parameters:

node (Nokogiri::XML::Element) —

An XML node.

Returns:

(Void) —

This method doesn’t return anything, but modifies the entire XML tree starting at the given node, removing leading and trailing spaces from all text. If the input is nil, an exception will be raised.

# File 'lib/stanford/mods/normalizer.rb', line 156

def trim_text(node)
  children = node.children

  if node.text?
    node.parent.content = node.text.strip
  else
    children.each do |c|
      trim_text(c)
    end
  end
end
