Module: DS::Extractor::DsMetsXmlExtractor::ClassMethods

Included in:
DS::Extractor::DsMetsXmlExtractor
Defined in:
lib/ds/extractor/ds_mets_xml_extractor.rb

Constant Summary

NS =
{
  mods: 'http://www.loc.gov/mods/v3',
  mets: 'http://www.loc.gov/METS/',
}
DATE_START_XPATH =
'mods:mods/mods:originInfo/mods:dateCreated[@point="start"]'
DATE_END_XPATH =
'mods:mods/mods:originInfo/mods:dateCreated[@point="end"]'

Instance Method Summary

Instance Method Details

#dated_by_scribe?(xml) ⇒ Boolean

Determines if the XML document is dated by a scribe.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML document to check

Returns:

  • (Boolean)

    true if the document is dated by a scribe, false otherwise



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 529

def dated_by_scribe? xml
  parts = find_parts xml
  # mods:mods/mods:note
  xpath = 'mods:mods/mods:note[@type="date"]'
  parts.any? { |part|
    part.xpath(xpath).text.upcase == 'Y'
  }
end

#extract_acknowledgments(xml) ⇒ Array<String>

Extracts acknowledgments from the given XML document.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML document to extract acknowledgments from

Returns:

  • (Array<String>)

    the extracted acknowledgments



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 645

def extract_acknowledgments xml
  notes = []
  notes += find_ms(xml).flat_map { |ms| note_by_type ms, 'admin' }

  notes += find_parts(xml).flat_map { |part|
    extent = extract_extent part
    note_by_type part, 'admin', tag: extent
  }

  notes += find_texts(xml).flat_map { |text|
    extent = extract_extent text
    note_by_type text, 'admin', tag: extent
  }

  notes += find_pages(xml).flat_map { |page|
    extent = extract_extent page
    note_by_type page, 'admin', tag: extent
  }

  clean_notes notes
end

#extract_all_subjects(record) ⇒ Array<DS::Extractor::Subject>

Note:

This method returns the result of #extract_subjects to fulfill the DS::Extractor contract.

Extracts all subjects from the given record.

Parameters:

  • record (Object)

    the record to extract subjects from

Returns:

  • (Array<DS::Extractor::Subject>)



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 921

def extract_all_subjects record
  extract_subjects record
end

#extract_all_subjects_as_recorded(xml) ⇒ Array<String>

Extract all subjects as recorded from the given XML.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML to extract subjects from

Returns:

  • (Array<String>)

    the extracted subjects as recorded



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 510

def extract_all_subjects_as_recorded xml
  extract_subjects_as_recorded xml
end

#extract_artists(record) ⇒ Array<DS::Extractor::Name>

Extracts artists from the given record, matching the roles artist, [artist], and illuminator.

Parameters:

  • record (Object)

    the record to extract artists from

Returns:

  • (Array<DS::Extractor::Name>)



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 292

def extract_artists record
  DS::Extractor::DsMetsXmlExtractor.extract_name record, *%w{ artist [artist] illuminator }
end

#extract_artists_as_recorded(record) ⇒ Object

Extracts artists as recorded from the given record.

Parameters:

  • record (Object)

    the record to extract artists from



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 284

def extract_artists_as_recorded record
  extract_artists(record).map &:as_recorded
end

#extract_assigned_date(part) ⇒ String

Return the date found in the mods:dateOther element, reformatting it as needed. These examples are taken from several METS files.

<mods:dateOther>[ca. 1410]</mods:dateOther>
<mods:dateOther>[between 1100 and 1200]</mods:dateOther>
<mods:dateOther>[between 1450 and 1460]</mods:dateOther>
<mods:dateOther>[between 1450 and 1500]</mods:dateOther>
<mods:dateOther>s. XV#^3/4#</mods:dateOther>
<mods:dateOther>s. XV</mods:dateOther>
<mods:dateOther>s. XVI#^4/4#</mods:dateOther>
<mods:dateOther>s. XVIII#^2/4#</mods:dateOther>
<mods:dateOther>s. XV#^in#</mods:dateOther>

Most dateOther values have the format:

s. XVII#^2#

The notation #^<VAL># encodes a portion of the string that was presented as superscript on the Berkeley DS site. DS 2.0 doesn’t use the superscripts; thus, when it occurs, this portion of the string is reformatted as (<VAL>):

s. XVII#^2#   =>    s. XVII(2)
s. XV#^ex#    =>    s. XV(ex)
s. XVI#^in#   =>    s. XVI(in)
s. X#^med#    =>    s. X(med)
s. XII#^med#  =>    s. XII(med)

Parameters:

  • part (Nokogiri::XML::Node)

    a part-level node

Returns:

  • (String)

    the date string reformatted as described above



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 635

def extract_assigned_date part
  xpath = 'mods:mods/mods:originInfo/mods:dateOther'
  part.xpath(xpath).text.gsub %r{#\^?([\w/]+)(\^|#)}, '(\1)'
end
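
The superscript rewrite described above can be tried in isolation. The following is a minimal sketch that works on plain strings rather than on a Nokogiri node; the helper name reformat_superscripts is hypothetical, while the regular expression is the one shown in the method body.

```ruby
# Hypothetical helper (not part of the module): applies the same substitution
# as extract_assigned_date to a plain string instead of the text of a
# mods:dateOther element.
def reformat_superscripts(date_other)
  # '#^2#' becomes '(2)'; strings without the notation pass through unchanged
  date_other.gsub(%r{#\^?([\w/]+)(\^|#)}, '(\1)')
end

puts reformat_superscripts('s. XVII#^2#')
puts reformat_superscripts('s. XV#^3/4#')
puts reformat_superscripts('[ca. 1410]')
```

Values such as [ca. 1410] contain no #^...# marker, so they are returned as they were recorded.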

#extract_associated_agents(record) ⇒ Array<String>

Extract other names from the given record.

Parameters:

  • record (Object)

    the record to extract other names from

Returns:

  • (Array<String>)

    the extracted other names



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 324

def extract_associated_agents record
  DS::Extractor::DsMetsXmlExtractor.extract_name record, 'other'
end

#extract_authors(record) ⇒ Array<DS::Extractor::Name>

Extracts authors from the given record.

Parameters:

  • record (Object)

    the record to extract authors from

Returns:

  • (Array<DS::Extractor::Name>)



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 269

def extract_authors record
  DS::Extractor::DsMetsXmlExtractor.extract_name record, *%w{ author [author] }
end

#extract_authors_as_recorded(record) ⇒ Array<String>

Extracts authors as recorded from the given record.

Parameters:

  • record (Object)

    the record to extract authors from

Returns:

  • (Array<String>)

    the extracted authors as recorded



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 277

def extract_authors_as_recorded record
  extract_authors(record).map &:as_recorded
end

#extract_cataloging_convention(xml) ⇒ Object



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 17

def extract_cataloging_convention xml
  'ds-mets'
end

#extract_date_created(part) ⇒ String

Return any date not found in mods:dateOther or in a dateCreated date range (see #extract_date_range); thus:

<mods:dateCreated>1537</mods:dateCreated>
<mods:dateCreated>1531</mods:dateCreated>
<mods:dateCreated>14??, October 21</mods:dateCreated>
<mods:dateCreated>1462, July 23</mods:dateCreated>
<mods:dateCreated>1549, November</mods:dateCreated>

These values commonly give the date for “dated” manuscripts

Parameters:

  • part (Nokogiri::XML::Node)

    a part-level node

Returns:

  • (String)

    the content of any dateCreated without ‘@point’ defined



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 599

def extract_date_created part
  xpath = 'mods:mods/mods:originInfo/mods:dateCreated[not(@point)]'
  part.xpath(xpath).map(&:text).join ', '
end

#extract_date_range(xml, range_sep:) ⇒ Array<String>

Extract ranges from mods:dateCreated elements where a @point is defined, thus:

<mods:dateCreated point="start" encoding="iso8601">1300</mods:dateCreated>
<mods:dateCreated point="end" encoding="iso8601">1399</mods:dateCreated>

Parameters:

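  • xml (Nokogiri::XML::Node)

    the parsed METS XML document

  • range_sep (String)

    the separator placed between the start and end dates

Returns:

  • (Array<String>)

    one string per part, with the start and end dates joined by range_sep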



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 563

def extract_date_range xml, range_sep:
  find_parts(xml).map { |part|
    extract_date_range_for_part(part).join range_sep
  }
end

#extract_date_range_for_part(part) ⇒ Array<Integer>

Extract the date range from the mods:dateCreated elements whose @point is "start" or "end".

Parameters:

  • part (Nokogiri::XML::Node)

    a part-level node

Returns:

  • (Array<Integer>)

    the start and end dates as an array of integers



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 578

def extract_date_range_for_part part
  start_date = part.xpath(DATE_START_XPATH).text
  end_date   = part.xpath(DATE_END_XPATH).text
  [start_date, end_date].reject(&:empty?).map(&:to_i)
end
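
The handling of missing endpoints can be sketched without any XML. The helper below is hypothetical: it takes the text of the start and end elements directly, whereas the real method reads them from the part node with DATE_START_XPATH and DATE_END_XPATH.

```ruby
# Hypothetical stand-in for extract_date_range_for_part: takes the text of
# the start and end mods:dateCreated elements instead of an XML node.
def date_range_from_text(start_text, end_text)
  # drop missing endpoints, then convert what remains to integers
  [start_text, end_text].reject(&:empty?).map(&:to_i)
end

# Joining with a separator mirrors what extract_date_range does per part.
puts date_range_from_text('1300', '1399').join('-')
puts date_range_from_text('1300', '').inspect
```

A part with only one endpoint yields a one-element array, and a part with neither yields an empty array.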

#extract_docket(xml) ⇒ Array<String>

DS METS can have mods:abstract elements with @displayLabel="docket". Extract these values and return them as an array.

Parameters:

  • xml (Nokogiri::XML::Node)

    the document xml

Returns:

  • (Array<String>)

    the note values



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 889

def extract_docket xml
  xpath = %q{//mods:abstract[@displayLabel = 'docket']/text()}
  xml.xpath(xpath, NS).map { |docket|
    "Docket: #{docket.text}"
  }
end

#extract_explicit(node, tag:) ⇒ Array<String>

Extracts explicit information from the given node based on the provided tag.

Parameters:

  • node (Nokogiri::XML::Node)

    the XML node to extract information from

  • tag (String)

    the tag to prepend to each extracted information

Returns:

  • (Array<String>)

    an array of extracted information



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 790

def extract_explicit node, tag:
  node.xpath('mods:mods/mods:abstract/text()').map { |n|
    "#{tag}: #{n.text}"
  }
end

#extract_extent(node) ⇒ String

Extracts the extent from the given node.

Parameters:

  • node (Nokogiri::XML::Node)

    the XML node to extract extent from

Returns:

  • (String)

    the extracted extent



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 211

def extract_extent node
  xpath = 'mods:mods/mods:physicalDescription/mods:extent'
  node.xpath(xpath).flat_map { |extent|
    extent.text.split(%r{;;}).first
  }.join ', '
end
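
As a rough illustration of the ;; handling, the sketch below applies the same split to plain strings. The helper name and the sample extent values are invented for the example; only the split-and-join logic comes from the method body.

```ruby
# Hypothetical helper: keeps the portion of each extent value before the
# first ';;' and joins the results, as extract_extent does for the
# mods:extent elements of a node.
def first_extent_values(extent_texts)
  extent_texts.flat_map { |text| text.split(%r{;;}).first }.join(', ')
end

# Sample values are made up for illustration.
puts first_extent_values(['ff. 1-10;;more', 'ff. 11-20'])
```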

#extract_filenames(page) ⇒ Array<String>

Extract the filenames for page. This will be one of:

* the values for mods:identifier with @type='filename';

* the filenames pointed to by the linked mets:fptr in the
     mets:fileGrp with @USE='image/master'; or

* an array containing ['NO_FILE'], if no files are associated with
     the page

There will almost always be one file, but at least one manuscript has a page with two associated images. Thus, we return an array.

Parameters:

  • page (Nokogiri::XML::Node)

    the mets:dmdSec node for the page

Returns:

  • (Array<String>)

    array of all the filenames for page



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 683

def extract_filenames page
  # mods:mods/mods:identifier[@type='filename']
  xpath     = 'mods:mods/mods:identifier[@type="filename"]'
  filenames = page.xpath(xpath).map(&:text)
  return filenames unless filenames.empty?

  # no filename; find the ARK URL for the master image for this page
  extract_master_mets_file page
end

#extract_folio_num(page) ⇒ String

Extracts the folio number from the given page node.

Parameters:

  • page (Nokogiri::XML::Node)

    the XML node representing the page

Returns:

  • (String)

    the extracted folio number



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 697

def extract_folio_num page
  # mods:mods/mods:physicalDescription/mods:extent
  xpath = 'mods:mods/mods:physicalDescription/mods:extent'
  page.xpath(xpath).map(&:text).join '|'
end

#extract_former_owners(record) ⇒ Array<DS::Extractor::Name>

Extracts former owners from the given record.

Parameters:

  • record (Nokogiri::XML::Node)

    the XML node representing the record

Returns:

  • (Array<DS::Extractor::Name>)



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 253

def extract_former_owners record
  xpath = "./descendant::mods:note[@type='ownership']/text()"
  notes = clean_notes(record.xpath(xpath).flat_map(&:text))

  notes.flat_map { |n|
    splits = Recon::Type::Splits._lookup_single(n, from_column: 'authorized_label')
    splits.present? ? splits.split('|') : n
  }.map { |n|
    DS::Extractor::Name.new as_recorded: DS.mark_long(n), role: 'former owner'
  }
end

#extract_former_owners_as_recorded(xml, lookup_split: true) ⇒ Array<String>

Extracts former owners as recorded from the given XML.

Parameters:

  • xml (Nokogiri::XML::NodeSet)

    the parsed XML to extract former owners from

  • lookup_split (Boolean) (defaults to: true)

    whether to lookup split information or not

Returns:

  • (Array<String>)

    the extracted former owners as recorded



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 245

def extract_former_owners_as_recorded xml, lookup_split: true
  extract_former_owners(xml).map &:as_recorded
end

#extract_genres(xml) ⇒ Object



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 462

def extract_genres xml
  []
end

#extract_incipit_explicit(xml) ⇒ Object

If the mods:mods element has a <mods:titleInfo type="alternative"> element and a <mods:abstract[not(@displayLabel)]>, then the content of the <mods:abstract[not(@displayLabel)]> is an incipit; XPath:

//mods:mods[./mods:titleInfo[@type="alternative"] and ./mods:abstract[not(@displayLabel)]]

//mods:mods[./mods:titleInfo[@type="alternative"]]/mods:abstract[not(@displayLabel)]/text()

If the mods:mods element has a <mods:titleInfo type="alternative"> element and a <mods:note type="content">, then the content of the <mods:note type="content"> is an explicit; XPath:

//mods:mods[./mods:titleInfo[@type="alternative"] and ./mods:note[@type="content"]]

//mods:mods[./mods:titleInfo[@type="alternative"]]/mods:note[@type="content"]/text()


# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 864

def extract_incipit_explicit xml
  # ./descendant::mods:physicalDescription
  # mods:mods/mods:originInfo/mods:place/mods:placeTerm
  # find any mod:mods containing an incipit or explicit
  xpath = %q{//mods:mods[./mods:titleInfo[@type="alternative"] and
        (./mods:abstract[not(@displayLabel)] or
        ./mods:note[@type="content"])]}

  find_texts(xml).flat_map { |node|
    # return an array for formatted incipits and explicits for this manuscript
    extent = node.xpath('./descendant::mods:physicalDescription/mods:extent/text()', NS).text
    node.xpath('./descendant::mods:abstract[not(@displayLabel)]/text()').map { |inc|
      "Incipit, #{extent}: #{inc}"
    } + node.xpath('./descendant::mods:note[@type="content"]/text()').map { |exp|
      "Explicit, #{extent}: #{exp}"
    }
  }
end

#extract_institution_name(xml) ⇒ String

Extracts the institution name from the given XML document.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML document to extract the institution name from

Returns:

  • (String)

    the extracted institution name



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 25

def extract_institution_name xml
  extract_mets_creator(xml).first
end

#extract_languages(record) ⇒ Array<DS::Extractor::Language>

Extract languages from the given record.

Parameters:

  • record (Object)

    the record to extract languages from

Returns:

  • (Array<DS::Extractor::Language>)



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 342

def extract_languages record
  # /mets:mets/mets:dmdSec/mets:mdWrap/mets:xmlData/mods:mods/mods:note
  # Can be Lang: or lang: or ???, so down case the text with translate()
  xpath = './descendant::mods:note[starts-with(translate(text(), "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "lang:")]'
  find_texts(record).flat_map { |text|
    text.xpath(xpath).map { |note| note.text.sub(%r{^lang:\s*}i, '') }
  }.uniq.map { |as_recorded|
    DS::Extractor::Language.new as_recorded: as_recorded
  }
end
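
The prefix stripping and de-duplication can be sketched on plain strings. The helper below is hypothetical and filters in Ruby, whereas the real method selects the notes with the XPath translate() expression shown above; the sample note values are invented.

```ruby
# Hypothetical helper: given the text of several mods:note elements, keep
# those that start with 'lang:' (any case), strip the prefix, and remove
# duplicates.
def languages_from_notes(note_texts)
  note_texts
    .select { |text| text =~ %r{\Alang:}i }
    .map    { |text| text.sub(%r{^lang:\s*}i, '') }
    .uniq
end

puts languages_from_notes(['lang: Latin', 'Lang: Latin', 'LANG:French', 'Binding: calf']).inspect
```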

#extract_languages_as_recorded(record) ⇒ String

Return the unique languages from the text-level <mods:note>s that start with “lang:” (case-insensitive); so, “Latin”, rather than “Latin|Latin|Latin”, etc.

Returns:

  • (Array<String>)


# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 334

def extract_languages_as_recorded record
  extract_languages(record).map &:as_recorded
end

#extract_link_to_inst_record(xml) ⇒ String

Extracts the link to the institution record from the given XML.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML to extract the link from

Returns:

  • (String)

    the extracted link to the institution record



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 518

def extract_link_to_inst_record xml
  ms = find_ms xml
  # xpath mods:mods/mods:relatedItem/mods:location/mods:url
  xpath = "mods:mods/mods:relatedItem/mods:location/mods:url"
  ms.xpath(xpath).map(&:text).join '|'
end

#extract_master_mets_file(page) ⇒ Array<String>

In some METS files each page has a list of mets:fptr elements. We need the @FILEID for the master image, but we don’t know which one is for the master, so we get all the @FILEIDs.

<mets:structMap>
  <mets:div TYPE="text" LABEL="[No Title for Display]" ADMID="RMD1" DMDID="DM1">
    <mets:div TYPE="item" LABEL="[No Title for Display]" DMDID="DM2">
      <mets:div TYPE="item" LABEL="[No Title for Display]" DMDID="DM3">
        <mets:div TYPE="item" LABEL="Music extending into right margin, upper right column." DMDID="DM4">
          <mets:fptr FILEID="FID1"/>
          <mets:fptr FILEID="FID3"/>
          <mets:fptr FILEID="FID5"/>
          <mets:fptr FILEID="FID7"/>
          <mets:fptr FILEID="FID9"/>
        </mets:div>
        <!-- snip -->
      </mets:div>
    </mets:div>
  </mets:div>
</mets:structMap>

Using the FILEIDs, find the corresponding mets:file in the mets:fileGrp with @USE=‘image/master’.

<mets:fileGrp USE="image/master">
  <mets:file ID="FID1" MIMETYPE="image/tiff" SEQ="1" CREATED="2010-11-08T10:26:20.3" ADMID="ADM1 ADM4" GROUPID="GID1">
    <mets:FLocat xlink:href="http://nma.berkeley.edu/ark:/28722/bk0008v1k7q" LOCTYPE="URL"/>
  </mets:file>
  <mets:file ID="FID2" MIMETYPE="image/tiff" SEQ="2" CREATED="2010-11-08T10:26:20.393" ADMID="ADM1 ADM5" GROUPID="GID2">
    <mets:FLocat xlink:href="http://nma.berkeley.edu/ark:/28722/bk0008v1k88" LOCTYPE="URL"/>
  </mets:file>
</mets:fileGrp>

We then follow the xlink:href to get the filename from the ‘location’ HTTP header.

Parameters:

  • page (Nokogiri::XML::Node)

    the mets:dmdSec node for the page

Returns:

  • (Array<String>)

    array of all the filenames for page



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 742

def extract_master_mets_file page
  dmdid = page['ID']
  # all the mets:fptr @FILEIDs for this page
  xpath = %Q{//mets:structMap/descendant::mets:div[@DMDID='#{dmdid}']/mets:fptr/@FILEID}

  # create an OR query because we don't know which FILEID is for the
  # master mets:file:
  #     "@ID = 'FID1' or @ID = 'FID3' or @ID = 'FID5' ... etc."
  id_query = page.xpath(xpath).map(&:text).map { |id| "@ID='#{id}'" }.join ' or '
  return ['NO_FILE'] if id_query.strip.empty? # there is no associated mets:fptr

  # the @xlink:href is the Berkeley ARK address; e.g., http://nma.berkeley.edu/ark:/28722/bk0008v1k88
  xpath          = "//mets:fileGrp[@USE='image/master']/mets:file[#{id_query}]/mets:FLocat/@xlink:href"
  fptr_addresses = page.xpath(xpath).map &:text
  return ['NO_FILE'] if fptr_addresses.empty? # I don't know if this happens, but just in case...

  # for each ARK address, find the TIFF filename
  fptr_addresses.map { |address| locate_filename address }
end
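
The OR query built from the FILEIDs can be illustrated on its own. This is a sketch only: both helper names are hypothetical, and returning nil for an empty list stands in for the ['NO_FILE'] early return in the method above.

```ruby
# Hypothetical helper: builds the "@ID='FID1' or @ID='FID3' ..." predicate
# from a list of FILEID values.
def build_id_query(file_ids)
  file_ids.map { |id| "@ID='#{id}'" }.join(' or ')
end

# Hypothetical helper: embeds the predicate in the XPath used to find the
# master image addresses; returns nil when there are no FILEIDs.
def master_file_xpath(file_ids)
  id_query = build_id_query(file_ids)
  return nil if id_query.strip.empty?

  "//mets:fileGrp[@USE='image/master']/mets:file[#{id_query}]/mets:FLocat/@xlink:href"
end

puts build_id_query(%w[FID1 FID3 FID5])
puts master_file_xpath(%w[FID1])
```

Because only files inside the image/master group are searched, FILEIDs that belong to other file groups simply fail to match.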

#extract_material_as_recorded(record) ⇒ String

Extracts the material as recorded from the given record.

Parameters:

  • record (Nokogiri::XML::Node)

    the record to extract material from

Returns:

  • (String)

    the extracted material as recorded



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 222

def extract_material_as_recorded record
  extract_materials(record).map(&:as_recorded).join '|'
end

#extract_materials(record) ⇒ Array<DS::Extractor::Material>

Extracts materials from the given record.

Parameters:

  • record (Object)

    the record to extract materials from

Returns:

  • (Array<DS::Extractor::Material>)



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 230

def extract_materials record
  find_parts(record).flat_map { |part|
    physdesc_note part, 'support'
  }.map { |s|
    s.downcase.chomp('.').strip
  }.uniq.map { |as_recorded|
    DS::Extractor::Material.new as_recorded: as_recorded
  }
end
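
The normalization applied to the support notes can be sketched on plain strings. The helper name and sample values below are invented; the chain of downcase, chomp('.'), strip, and uniq is taken from the method body.

```ruby
# Hypothetical helper: normalizes support values the way extract_materials
# does before wrapping them in DS::Extractor::Material objects.
def normalize_materials(values)
  values.map { |s| s.downcase.chomp('.').strip }.uniq
end

puts normalize_materials(['Parchment.', 'parchment', ' Paper']).inspect
```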

#extract_mets_creator(xml) ⇒ Array<String>

Extracts the creator information from the METS XML document.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML document containing METS data

Returns:

  • (Array<String>)

    an array of creator information



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 33

def extract_mets_creator xml
  creator = xml.xpath('/mets:mets/mets:metsHdr/mets:agent[@ROLE="CREATOR" and @TYPE="ORGANIZATION"]/mets:name', NS).text
  creator.split %r{;;}
end

#extract_ms_note(xml) ⇒ Array<String>

Extracts the manuscript note from the given XML.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML node to extract manuscript note from

Returns:

  • (Array<String>)

    an array of manuscript notes



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 766

def extract_ms_note xml
  notes = []
  ms    = find_ms xml
  notes += note_by_type ms, :none, tag: 'Manuscript note'
  notes += note_by_type ms, 'bibliography', tag: 'Bibliography'
  notes
end

#extract_ms_phys_desc(xml) ⇒ Object



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 88

def extract_ms_phys_desc xml
  ms = find_ms xml
  physdesc_note ms, 'presentation', tag: 'Binding'
end

#extract_name(node, *roles) ⇒ Array<DS::Extractor::Name>

Extract name from the given node based on the provided roles.

Parameters:

  • node (Object)

    the node to extract name from

  • roles (Array<String>)

    the roles to search for

Returns:

  • (Array<DS::Extractor::Name>)



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 358

def extract_name node, *roles
  # Roles have different cases: Author, author, etc.
  # Xpath 1.0 has no lower-case function, so use translate()
  translate = "translate(./mods:role/mods:roleTerm/text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')"
  props     = roles.map { |r| "#{translate} = '#{r}'" }.join ' or '
  xpath     = "./descendant::mods:name[#{props}]"
  node.xpath(xpath).flat_map { |name|
    name.xpath('mods:namePart').text.split %r{\s*;\s*}
  }.uniq.map { |as_recorded|
    DS::Extractor::Name.new as_recorded: as_recorded, role: roles.first
  }
end
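
To see what the generated XPath predicate looks like, the string-building steps can be run without an XML document. Both helper names below are hypothetical and the sample names are invented; the translate() expression and the semicolon split are copied from the method body.

```ruby
# Hypothetical helper: builds the case-insensitive role predicate that
# extract_name places inside mods:name[...].
def role_predicate(*roles)
  translate = "translate(./mods:role/mods:roleTerm/text(), " \
              "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')"
  roles.map { |r| "#{translate} = '#{r}'" }.join(' or ')
end

# Hypothetical helper: splits a mods:namePart value holding several names.
def split_name_part(name_part)
  name_part.split(%r{\s*;\s*})
end

puts "./descendant::mods:name[#{role_predicate('author', '[author]')}]"
puts split_name_part('Augustine ; Jerome').inspect
```

Note that every extracted name is assigned roles.first, so a name matched through [author] is still recorded with the role author.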

#extract_notes(xml) ⇒ Array<String>

Extract the notes at all levels from the XML and return them as an array of strings.

Parameters:

  • xml (Nokogiri::XML::Node)

    the document’s xml

Returns:

  • (Array<String>)

    the note values



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 833

def extract_notes xml
  notes = []
  # get all notes that don't have @type
  xpath = %q{//mods:note[not(@type)]/text()}
  notes += extract_ms_note xml
  notes += extract_part_note xml
  notes += extract_text_note xml
  notes += extract_docket xml
  notes += extract_page_note xml

  clean_notes notes
end

#extract_other_names_as_recorded(record) ⇒ Array<String>

Extract other names as recorded from the given record.

Parameters:

  • record (Object)

    the record to extract other names from

Returns:

  • (Array<String>)

    the extracted other names as recorded



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 316

def extract_other_names_as_recorded record
  extract_associated_agents(record).map &:as_recorded
end

#extract_page_note(xml) ⇒ Array<String>

Extracts notes for each page in the given XML.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML node to extract notes from

Returns:

  • (Array<String>)

    an array of extracted notes



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 816

def extract_page_note xml
  find_pages(xml).flat_map { |page|
    extent = extract_extent page
    notes  = []
    notes  += note_by_type page, :none, tag: extent
    notes  += note_by_type page, 'content', tag: "Incipit, #{extent}"
    notes  += extract_explicit page, tag: "Explicit, #{extent}"
    notes
  }
end

#extract_part_note(xml) ⇒ Array<String>

Extracts notes for each part in the given XML.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML node to extract notes from

Returns:

  • (Array<String>)

    an array of extracted notes



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 778

def extract_part_note xml
  find_parts(xml).flat_map { |part|
    extent = extract_extent part
    note_by_type part, :none, tag: extent
  }
end

#extract_part_phys_desc(xml) ⇒ Array<String>

Extracts physical description notes for each part in the XML.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML node to extract parts from

Returns:

  • (Array<String>)

    an array of extracted physical description notes



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 119

def extract_part_phys_desc xml
  parts = find_parts xml
  parts.flat_map { |part|
    extent = extract_extent part
    notes  = []

    tag   = "Figurative details, #{extent}"
    notes += physdesc_note part, 'physical details', tag: tag
    notes += extract_pd_note part
    tag   = "Script, #{extent}"
    notes += physdesc_note part, 'script', tag: tag
    tag   = "Music, #{extent}"
    notes += physdesc_note part, 'medium', tag: tag
    tag   = "Layout, #{extent}"
    notes += physdesc_note part, 'technique', tag: tag
    tag   = "Watermarks, #{extent}"
    notes += physdesc_note part, 'marks', tag: tag
    notes
  }
end

#extract_pd_note(part) ⇒ Array<String>

Extracts physical description notes from the given part object.

Parameters:

  • part (Nokogiri::XML::Node)

    the XML node representing the part

Returns:

  • (Array<String>)

    an array of extracted physical description notes



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 97

def extract_pd_note part
  extent = extract_extent part

  xpath = %q{mods:mods/mods:physicalDescription/mods:note[@type = 'physical description']/text()}
  part.xpath(xpath).flat_map { |node|
    text  = node.text
    notes = []
    if text =~ %r{;;}
      other_deco, num_scribes = text.split %r{;;+}
      notes << "Other decoration, #{extent}: #{other_deco}" unless other_deco.blank?
      notes << "Number of scribes, #{extent}: #{num_scribes}" unless num_scribes.blank?
    else
      notes << "Other decoration, #{extent}: #{text}" unless text.empty?
    end
    notes
  }
end
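
The ;; split in this method can be sketched on a plain string. The helper below is hypothetical and its sample values are invented. It replaces blank? (which comes from ActiveSupport) with a plain-Ruby check, and it also applies that check in the else branch where the method itself uses empty?.

```ruby
# Hypothetical helper: formats one 'physical description' note value. The
# part before ';;' is treated as other decoration and the part after it as
# the number of scribes, following the method body.
def physical_description_notes(text, extent)
  present = ->(s) { !s.to_s.strip.empty? } # plain-Ruby stand-in for !blank?
  notes   = []
  if text.include?(';;')
    other_deco, num_scribes = text.split(%r{;;+})
    notes << "Other decoration, #{extent}: #{other_deco}" if present.call(other_deco)
    notes << "Number of scribes, #{extent}: #{num_scribes}" if present.call(num_scribes)
  elsif present.call(text)
    notes << "Other decoration, #{extent}: #{text}"
  end
  notes
end

puts physical_description_notes('Penwork initials;;2', 'ff. 1-10').inspect
puts physical_description_notes(';;2', 'ff. 1-10').inspect
```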

#extract_physical_description(xml) ⇒ Array

Extract and format all the physical description values for the manuscript and each part.

Manuscript-level physical description:

  • presentation -> 'Binding'

Part-level physical description:

  • support -> accounted for as support
  • marks -> 'Watermarks'
  • medium -> 'Music'
  • physical description -> 'Other decoration'
  • physical details -> 'Figurative details'
  • script -> 'Script'
  • technique -> 'Layout'

Parameters:

  • xml (Nokogiri::XML::Node)

    the document’s xml

Returns:

  • (Array)

    the physical description values



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 59

def extract_physical_description xml
  physdesc = []
  physdesc += extract_ms_phys_desc xml
  physdesc += extract_part_phys_desc xml
  physdesc.flatten!

  clean_notes physdesc
end

#extract_places(record) ⇒ Array<DS::Extractor::Place>

Extracts places from the given record.

Parameters:

  • record (Object)

    the record to extract places from

Returns:

  • (Array<DS::Extractor::Place>)



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 904

def extract_places record
  parts = find_parts record
  xpath = 'mods:mods/mods:originInfo/mods:place/mods:placeTerm'
  parts.flat_map { |node|
    node.xpath(xpath).map { |place|
      DS::Extractor::Place.new as_recorded: place.text.split(%r{;;}).join(', ')
    }
  }
end
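
The ;; separator in mods:placeTerm values is rewritten as a comma-separated string. A minimal sketch on a plain string follows; the helper name and the sample value are invented for illustration.

```ruby
# Hypothetical helper: reformats a placeTerm value the way extract_places
# does before wrapping it in a DS::Extractor::Place.
def format_place(place_term)
  place_term.split(%r{;;}).join(', ')
end

puts format_place('France;;Paris')
puts format_place('Austria')
```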

#extract_production_date_as_recorded(xml) ⇒ Array<String>

Return the date values for the manuscript, one string per part. Each string is a concatenation of the values returned by #extract_date_created, #extract_assigned_date, and #extract_date_range_for_part.

Parameters:

  • xml (Nokogiri::XML::Node)

    the parsed METS xml document

Returns:

  • (Array<String>)

    the concatenated date values



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 545

def extract_production_date_as_recorded xml
  find_parts(xml).map { |part|
    date_created = extract_date_created part
    assigned     = extract_assigned_date part
    range        = extract_date_range_for_part(part).join '-'
    [date_created, assigned, range].uniq.reject(&:empty?).join '; '
  }.reject { |date| date.to_s.strip.empty? }
end
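
How the three date values are combined for one part can be sketched with plain values. The helper below is hypothetical: it takes the three values as arguments, whereas the real method extracts them from each part node. The sample dates are invented.

```ruby
# Hypothetical helper: combines the created date, the assigned date, and the
# date range for a single part, skipping duplicates and empty values.
def production_date(date_created, assigned, range)
  [date_created, assigned, range.join('-')].uniq.reject(&:empty?).join('; ')
end

puts production_date('', 's. XV(3/4)', [1450, 1475])
puts production_date('1462, July 23', '', [])
```

A part with no dates at all produces an empty string, which the final reject in the method removes from the result.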

#extract_production_places_as_recorded(xml) ⇒ Array<String>

Extract production places as recorded from the given XML.

Parameters:

  • xml (Object)

    the XML to extract production places from

Returns:

  • (Array<String>)

    the extracted production places as recorded



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 398

def extract_production_places_as_recorded xml
  extract_places(xml).map &:as_recorded
end

#extract_recon_names(xml) ⇒ Array<Array>

Extract reconciliation names from the given XML.

Parameters:

  • xml (Nokogiri::XML::Node)

    a <METS_XML> node

Returns:

  • (Array<Array>)

    an array of arrays of names for reconciliation



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 430

def extract_recon_names xml
  data = extract_authors(xml).map &:to_a
  data += extract_artists(xml).map &:to_a
  data += extract_scribes(xml).map &:to_a
  data += extract_former_owners(xml).map &:to_a
  data += extract_associated_agents(xml).map &:to_a
  data
end

#extract_recon_places(xml) ⇒ Array<Array>

Extract the places of production for reconciliation CSV output.

Returns a two-dimensional array in which each row is a place and each row has one column, the place name; for example:

[["Austria"],
 ["Germany"],
 ["France (?)"]]

Parameters:

  • xml (Nokogiri::XML::Node)

    a <METS_XML> node

Returns:

  • (Array<Array>)

    an array of arrays of values



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 414

def extract_recon_places xml
  extract_places(xml).map &:to_a
end

#extract_recon_splits(xml) ⇒ Object

Extract former owners as recorded and keep only the values of 400 or more characters; return them as an array of single-element arrays that can be treated as rows by Recon::Type::Splits.



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 444

def extract_recon_splits xml
  data = []
  data += DS::Extractor::DsMetsXmlExtractor.extract_former_owners_as_recorded xml, lookup_split: false
  data.flatten.select { |d| d.to_s.size >= 400 }.map { |d| [d.strip] }
end
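
The length filter in the method body can be sketched on plain strings. The helper name and its min_length keyword are hypothetical additions for illustration; the threshold of 400 characters is the one used above.

```ruby
# Hypothetical helper: keeps values of at least min_length characters and
# wraps each in a single-element array, one row per value.
def long_entries(values, min_length: 400)
  values.flatten.select { |v| v.to_s.size >= min_length }.map { |v| [v.strip] }
end

rows = long_entries(['short note', 'x' * 400])
puts rows.size
```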

#extract_recon_subjects(xml) ⇒ Array<String,String>

See the note for [Recon::Type::Subjects]: Each source subject extraction method should return a two dimensional array:

[["Islamic law--Early works to 1800", ""],
  ["Malikites--Early works to 1800", ""],
  ["Islamic law", ""],
  ["Malikites", ""],
  ["Arabic language--Grammar--Early works to 1800", ""],
  ["Arabic language--Grammar", ""],
  ...
  ]

The second value is for those cases where the source provides an authority URI. The METS records don’t give a URI so this method always returns the empty string for the second value.

Parameters:

  • xml (Nokogiri::XML::Node)

    a <METS_XML> node

Returns:

  • (Array<String,String>)

    a two-dimensional array of subject and URI



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 485

def extract_recon_subjects xml
  extract_subjects(xml).map &:to_a
end

#extract_recon_titles(xml) ⇒ Array<String>

Extract reconciliation titles from the given XML.

Parameters:

  • xml (Nokogiri::XML::Node)

    a <METS_XML> node

Returns:

  • (Array<String>)

    an array of titles for reconciliation



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 422

def extract_recon_titles xml
  extract_titles(xml).to_a
end

#extract_scribes(record) ⇒ Array<String>

Extract scribes from the given record.

Parameters:

  • record (Object)

    the record to extract scribes from

Returns:

  • (Array<String>)

    the extracted scribes



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 308

def extract_scribes record
  DS::Extractor::DsMetsXmlExtractor.extract_name record, *%w{ scribe [scribe] }
end

#extract_scribes_as_recorded(record) ⇒ Array<String>

Extracts scribes as recorded from the given record.

Parameters:

  • record (Object)

    the record to extract scribes from

Returns:

  • (Array<String>)

    the extracted scribes as recorded



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 300

def extract_scribes_as_recorded record
  extract_scribes(record).map &:as_recorded
end

#extract_shelfmark(xml) ⇒ String

In legacy DS METS, the value of mods:identifier is the shelfmark. If a record has other ID types, we can’t distinguish them from shelfmarks.

Parameters:

  • xml (Nokogiri::XML::Node)

    a <METS_XML> node

Returns:

  • (String)

    the shelfmark



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 457

def extract_shelfmark xml
  ms = find_ms xml
  ms.xpath('mods:mods/mods:identifier[@type="local"]/text()').text
end

#extract_subjects(record) ⇒ Array<DS::Extractor::Subject>

Extracts subjects from the given record.

Parameters:

  • record (Object)

    the record to extract subjects from

Returns:

  • (Array<DS::Extractor::Subject>)

    the extracted subjects



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 929

def extract_subjects record
  xpath = '//mods:originInfo/mods:edition'
  find_texts(record).flat_map { |text|
    text.xpath(xpath).map { |subj|
      as_recorded = subj.text.strip.gsub(/\s+/, ' ')
      DS::Extractor::Subject.new as_recorded: as_recorded, vocab: 'ds-subject'
    }
  }
end
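
A short sketch of the whitespace normalization applied to each subject in the listing above. `normalize_space` is a hypothetical helper name and the sample input is invented; the `strip.gsub(/\s+/, ' ')` expression is the one used in the method.

```ruby
# Hypothetical helper wrapping the normalization used in extract_subjects:
# trim both ends, then collapse every run of whitespace to a single space.
def normalize_space(value)
  value.strip.gsub(/\s+/, ' ')
end

# Invented sample: element text broken across lines and indented in the XML.
raw = "  Manuscripts, Medieval--Connecticut\n      --New Haven.  "

puts normalize_space(raw)
```

Collapsing whitespace this way keeps subjects that differ only in line wrapping from being treated as distinct values during reconciliation.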

#extract_subjects_as_recorded(xml) ⇒ Array<String>

Extract subjects, the mods:originInfo/mods:edition values for each text. For example,

<mods:originInfo>
  <mods:edition>Alexander, de Villa Dei.</mods:edition>
  <mods:edition>Latin language--Grammar.</mods:edition>
  <mods:edition>Latin poetry, Medieval and modern.</mods:edition>
  <mods:edition>Manuscripts, Medieval--Connecticut--New Haven.</mods:edition>
</mods:originInfo>

Parameters:

  • xml (Nokogiri::XML::Node)

    a <METS_XML> node

Returns:

  • (Array<String>)

    an array of subjects



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 502

def extract_subjects_as_recorded xml
  extract_subjects(xml).map(&:as_recorded)
end

#extract_text_note(xml) ⇒ Array<String>

Extracts text notes from the given XML document.

Parameters:

  • xml (Nokogiri::XML::Node)

    the XML document to extract text notes from

Returns:

  • (Array<String>)

    the extracted text notes



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 800

def extract_text_note xml
  find_texts(xml).flat_map { |text|
    extent = extract_extent text
    notes  = []
    notes  += note_by_type text, :none, tag: extent
    notes  += note_by_type text, 'condition', tag: "Status of text, #{extent}"
    notes  += note_by_type text, 'content', tag: "Incipit, #{extent}"
    notes  += extract_explicit text, tag: "Explicit, #{extent}"
    notes
  }
end

#extract_titles(record) ⇒ Array<DS::Extractor::Title>

Extract titles from the given record.

Parameters:

  • record (Object)

    the record to extract titles from

Returns:

  • (Array<DS::Extractor::Title>)

    the extracted titles



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 383

def extract_titles record
  xpath = 'mods:mods/mods:titleInfo/mods:title'
  find_texts(record).flat_map { |text|
    text.xpath(xpath).map(&:text)
  }.reject {
    |t| t == '[Title not supplied]'
  }.map { |as_recorded|
    DS::Extractor::Title.new as_recorded: as_recorded
  }
end
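
The reject-then-wrap pipeline in the listing above can be tried without any XML. In this sketch `TitleStub` stands in for DS::Extractor::Title and `build_titles` is a hypothetical helper; both are assumptions made for illustration, and the sample titles are invented.

```ruby
# Placeholder string that the method above filters out.
PLACEHOLDER_TITLE = '[Title not supplied]'

# Hypothetical stand-in for DS::Extractor::Title; only as_recorded is modeled.
TitleStub = Struct.new(:as_recorded, keyword_init: true)

# Hypothetical helper: drop placeholder titles, then wrap the rest.
def build_titles(raw_titles)
  raw_titles
    .reject { |t| t == PLACEHOLDER_TITLE }
    .map { |as_recorded| TitleStub.new(as_recorded: as_recorded) }
end

titles = build_titles(['Doctrinale', '[Title not supplied]', 'Psalter'])
puts titles.map(&:as_recorded).inspect
```

The comparison is an exact string match, so a placeholder with different casing or extra whitespace would pass through unfiltered.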

#extract_titles_as_recorded(record) ⇒ Array<String>

Extract titles as recorded from the given record.

Parameters:

  • record (Object)

    the record to extract titles from

Returns:

  • (Array<String>)

    the extracted titles as recorded



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 375

def extract_titles_as_recorded record
  extract_titles(record).map &:as_recorded
end

#find_ms(xml) ⇒ Object

METS structMap extraction

Extract mods:mods elements by catalog description level: manuscript, manuscript part, text, page, image



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 946

def find_ms xml
  # the manuscript is one div deep in the structMap
  # /mets:mets/mets:structMap/mets:div/@DMDID
  xpath = '/mets:mets/mets:structMap/mets:div/@DMDID'
  id    = xml.xpath(xpath).first.text
  xml.xpath "/mets:mets/mets:dmdSec[@ID='#{id}']/mets:mdWrap/mets:xmlData"
end

#find_pages(xml) ⇒ Array<Nokogiri::XML::Node>

Returns an array of the page-level mets:dmdSec nodes.

Parameters:

  • xml (Nokogiri::XML::Node)

    parsed XML of the METS document

Returns:

  • (Array<Nokogiri::XML::Node>)

    array of the page-level mets:dmdSec nodes



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 992

def find_pages xml
  # /mets:mets/mets:structMap/mets:div/mets:div/mets:div/mets:div/@DMDID
  # the pages are four divs deep in the structMap
  # We need the IDs in order
  xpath = '/mets:mets/mets:structMap/mets:div/mets:div/mets:div/mets:div/@DMDID'
  ids   = xml.xpath(xpath).map &:text
  # collect dmdSec's for all the page IDs
  ids.flat_map { |id|
    xml.xpath "/mets:mets/mets:dmdSec[@ID='#{id}']/mets:mdWrap/mets:xmlData"
  }
end

#find_parts(xml) ⇒ Array<Nokogiri::XML::Node>

Find the manuscript parts in the XML document.

Parameters:

  • xml (Nokogiri::XML::Node)

    the parsed XML document

Returns:

  • (Array<Nokogiri::XML::Node>)

    an array of manuscript parts in the correct order



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 958

def find_parts xml
  # /mets:mets/mets:structMap/mets:div/mets:div/@DMDID
  # manuscript parts are two divs deep in the structMap
  # We need to get the IDs in order
  xpath = '/mets:mets/mets:structMap/mets:div/mets:div/@DMDID'
  ids   = xml.xpath(xpath).map &:text
  # We can't count on the order or the numbering of the mets:dmdSec
  # elements outside of the structMap. Thus, we have to return an
  # array with the parts mets:dmdSec in the correct order.
  ids.map { |id|
    xml.xpath "/mets:mets/mets:dmdSec[@ID='#{id}']/mets:mdWrap/mets:xmlData"
  }
end
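
The ordering concern noted in the comments above can be shown with plain data. In this sketch a hash stands in for the mets:dmdSec elements and an array for the DMDID values read from the structMap; the IDs, the payloads, and the `in_struct_map_order` helper are all hypothetical.

```ruby
# Hypothetical helper: return the dmdSec payloads in structMap order
# by looking each ID up in turn, as find_parts does with XPath.
def in_struct_map_order(struct_map_ids, dmd_secs)
  struct_map_ids.map { |id| dmd_secs.fetch(id) }
end

# dmdSec payloads keyed by ID, deliberately out of order (invented data).
dmd_secs = { 'DM3' => 'part B', 'DM7' => 'part C', 'DM2' => 'part A' }

# DMDID values in the order they appear in the structMap (invented data).
struct_map_ids = %w[DM2 DM3 DM7]

puts in_struct_map_order(struct_map_ids, dmd_secs).inspect
```

Driving the lookup from the structMap IDs means the result does not depend on where each dmdSec sits in the file or how it is numbered.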

#find_texts(xml) ⇒ Array<Nokogiri::XML::Node>

Find the texts in the XML document.

Parameters:

  • xml (Nokogiri::XML::Node)

    the parsed XML document

Returns:

  • (Array<Nokogiri::XML::Node>)

    an array of text nodes in the correct order



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 977

def find_texts xml
  # /mets:mets/mets:structMap/mets:div/mets:div/mets:div/@DMDID
  # texts are three divs deep in the structMap
  # We need to get the IDs in order
  xpath = '/mets:mets/mets:structMap/mets:div/mets:div/mets:div/@DMDID'
  ids   = xml.xpath(xpath).map &:text
  ids.map { |id|
    xml.xpath "/mets:mets/mets:dmdSec[@ID='#{id}']/mets:mdWrap/mets:xmlData"
  }
end

#note_by_type(node, note_type, tag: nil) ⇒ Object

DS 1.0 METS note types:

# MS Note types:

Accounted for
- ownership -- accounted for, former owner
- action -- skip; administrative note: "Inputter ...."
- admin -- acknowledgments
- untyped -- 'Manuscript Note'
- bibliography -- 'Bibliography'
- source note -- skip; not present on DS legacy pages

# MS Note Phys desc

- presentation -> Binding

# Part note types:

- date - already accounted for
- content - skip
- admin - Acknowledgments

- untyped

# MS Part phys description

 - support -- accounted for as support

 - marks - 'Watermarks'
 - medium -> 'Music'
 - physical description -> 'Other decoration'
 - physical details -> 'Figurative details'
 - script -> 'Script'
 - technique -> 'Layout'

# Text note types

 Accounted for
 - admin - acknowledgments

 - condition -> 'Status of text'
 - content -> handled as Text Incipit
 - untyped -> 'Text note'

# Page note types

 Accounted for
   None

 - content -> Folio Incipit
 - date -- skip
 - untyped -> 'Folio note'


# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 195

def note_by_type node, note_type, tag: nil
  if note_type == :none
    xpath = %q{mods:mods/mods:note[not(@type)]/text()}
  else
    xpath = %Q{mods:mods/mods:note[@type = '#{note_type}']/text()}
  end

  node.xpath(xpath).map { |x|
    tag.nil? ? x.text : "#{tag}: #{x.text}"
  }
end
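
The two decisions made in the listing above, which XPath to use and whether to prefix a tag, can be sketched as pure string functions. `note_xpath` and `tag_note` are hypothetical helper names introduced here; the XPath strings and the "tag: text" format are copied from the method.

```ruby
# Hypothetical helper: choose the XPath for untyped notes (:none)
# or for notes with a specific @type value.
def note_xpath(note_type)
  if note_type == :none
    'mods:mods/mods:note[not(@type)]/text()'
  else
    "mods:mods/mods:note[@type = '#{note_type}']/text()"
  end
end

# Hypothetical helper: prefix the note with its tag when one is given.
def tag_note(text, tag = nil)
  tag.nil? ? text : "#{tag}: #{text}"
end

puts note_xpath(:none)
puts note_xpath('admin')
puts tag_note('Gift of the Friends of the Library', 'ff. 1r-10v')
```

Since the note type is interpolated directly into the XPath, a value containing a single quote would produce an invalid expression; the fixed set of types listed above avoids that case.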

#physdesc_note(node, note_type, tag: nil) ⇒ Array<String>

Extracts the physical description notes from the given node based on the note type and optional tag.

Parameters:

  • node (Nokogiri::XML::Node)

    the XML node to extract notes from

  • note_type (Symbol)

    the type of note to extract

  • tag (String) (defaults to: nil)

    an optional tag to prepend to each extracted note

Returns:

  • (Array<String>)

    an array of extracted notes



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 74

def physdesc_note node, note_type, tag: nil
  if note_type == :none
    xpath = %q{mods:mods/mods:physicalDescription/mods:note[not(@type)]}
  else
    xpath = %Q{mods:mods/mods:physicalDescription/mods:note[@type = '#{note_type}']}
  end

  node.xpath(xpath).map { |x|
    tag.nil? ? x.text : "#{tag}: #{x.text}"
  }
end

#source_modified ⇒ String

Returns the date when the source was last modified. For DS METS we have chosen the fixed date 2021-10-01.

Returns:

  • (String)

    “2021-10-01”



# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 1007

def source_modified
  "2021-10-01"
end