Module: DS::Extractor::DsMetsXmlExtractor::ClassMethods
- Included in:
- DS::Extractor::DsMetsXmlExtractor
- Defined in:
- lib/ds/extractor/ds_mets_xml_extractor.rb
Constant Summary
- NS =
{ mods: 'http://www.loc.gov/mods/v3', mets: 'http://www.loc.gov/METS/', }
- DATE_START_XPATH =
'mods:mods/mods:originInfo/mods:dateCreated[@point="start"]'
- DATE_END_XPATH =
'mods:mods/mods:originInfo/mods:dateCreated[@point="end"]'
Instance Method Summary
-
#dated_by_scribe?(xml) ⇒ Boolean
Determines if the XML document is dated by a scribe.
-
#extract_acknowledgments(xml) ⇒ Array<String>
Extracts acknowledgments from the given XML document.
-
#extract_all_subjects(record) ⇒ Array<DS::Extractor::Subject>
Extracts all subjects from the given record.
-
#extract_all_subjects_as_recorded(xml) ⇒ Array<String>
Extract all subjects as recorded from the given XML.
-
#extract_artists(record) ⇒ Array<DS::Extractor::Name>
Extracts artists from the given record using the specified type and role.
-
#extract_artists_as_recorded(record) ⇒ Object
Extracts artists as recorded from the given record.
-
#extract_assigned_date(part) ⇒ Array<Integer>
Return dates found in the dateOther element, reformatting them as needed.
-
#extract_associated_agents(record) ⇒ Array<String>
Extract other names from the given record.
-
#extract_authors(record) ⇒ Array<DS::Extractor::Name>
Extracts authors from the given record.
-
#extract_authors_as_recorded(record) ⇒ Array<String>
Extracts authors as recorded from the given record.
-
#extract_cataloging_convention(xml) ⇒ Object
-
#extract_date_created(part) ⇒ Array<Integer>
Return any date not found in the dateOther element or in a dateCreated date range (see #extract_date_range).
-
#extract_date_range(xml, range_sep:) ⇒ Array<String>
Extract ranges from mods:dateCreated elements where a @point is defined.
-
#extract_date_range_for_part(part) ⇒ Array<Integer>
Extract ranges from mods:dateCreated elements whose @point is start or end.
-
#extract_docket(xml) ⇒ Array<String>
DS METS can have mods:abstract elements with @displayLabel=“docket”.
-
#extract_explicit(node, tag:) ⇒ Array<String>
Extracts explicit information from the given node based on the provided tag.
-
#extract_extent(node) ⇒ String
Extracts the extent from the given node.
-
#extract_filenames(page) ⇒ Array<String>
Extract the filename for page.
-
#extract_folio_num(page) ⇒ String
Extracts the folio number from the given page node.
-
#extract_former_owners(record) ⇒ Array<DS::Extractor::Name>
Extracts former owners from the given record.
-
#extract_former_owners_as_recorded(xml, lookup_split: true) ⇒ Array<String>
Extracts former owners as recorded from the given XML.
-
#extract_genres(xml) ⇒ Object
-
#extract_incipit_explicit(xml) ⇒ Object
If the mods:mods element has a <mods:titleInfo type="alternative"> element and a <mods:abstract[not(@displayLabel)]>, then the content of the <mods:abstract[not(@displayLabel)]> is an incipit.
-
#extract_institution_name(xml) ⇒ String
Extracts the institution name from the given XML document.
-
#extract_languages(record) ⇒ Array<DS::Extractor::Language>
Extract languages from the given record.
-
#extract_languages_as_recorded(record) ⇒ String
Return a list of unique languages from the text-level <mods:note>s that start with “lang:” (case-insensitive), joined with a separator; so, “Latin” rather than “Latin|Latin|Latin”, etc.
-
#extract_link_to_inst_record(xml) ⇒ String
Extract link to institution record from the given XML.
-
#extract_master_mets_file(page) ⇒ Array<String>
In some METS files each page has a list of mets:fptr elements. We need the @FILEID for the master image, but we don’t know which one is for the master.
-
#extract_material_as_recorded(record) ⇒ String
Extracts the material as recorded from the given record.
-
#extract_materials(record) ⇒ Array<DS::Extractor::Material>
Extracts materials from the given record.
-
#extract_mets_creator(xml) ⇒ Array<String>
Extracts the creator information from the METS XML document.
-
#extract_ms_note(xml) ⇒ Array<String>
Extracts the manuscript note from the given XML.
-
#extract_ms_phys_desc(xml) ⇒ Object
-
#extract_name(node, *roles) ⇒ Array<DS::Extractor::Name>
Extract name from the given node based on the provided roles.
-
#extract_notes(xml) ⇒ Array<String>
Extract the notes at all levels from the xml, and return an array of strings.
-
#extract_other_names_as_recorded(record) ⇒ Array<String>
Extract other names as recorded from the given record.
-
#extract_page_note(xml) ⇒ Array<String>
Extracts notes for each page in the given XML.
-
#extract_part_note(xml) ⇒ Array<String>
Extracts notes for each part in the given XML.
-
#extract_part_phys_desc(xml) ⇒ Array<String>
Extracts physical description notes for each part in the XML.
-
#extract_pd_note(part) ⇒ Array<String>
Extracts physical description notes from the given part object.
-
#extract_physical_description(xml) ⇒ Array
Extract and format all the physical description values for the manuscript and each part.
-
#extract_places(record) ⇒ Array<DS::Extractor::Place>
Extracts places from the given record.
-
#extract_production_date_as_recorded(xml) ⇒ Array<String>
Return as a single string all the date values for the manuscript.
-
#extract_production_places_as_recorded(xml) ⇒ Array<String>
Extract production places as recorded from the given XML.
-
#extract_recon_names(xml) ⇒ Array<Array>
Extract reconciliation names from the given XML.
-
#extract_recon_places(xml) ⇒ Array<Array>
Extract the places of production for reconciliation CSV output.
-
#extract_recon_splits(xml) ⇒ Object
Extract acknowledgments, notes, physical descriptions, and former owners; return all strings that start with SPLIT:, remove ‘SPLIT: ’ and return an array of arrays that can be treated as rows by Recon::Type::Splits.
-
#extract_recon_subjects(xml) ⇒ Array<String,String>
See the note for [Recon::Type::Subjects]: each source subject extraction method should return a two-dimensional array.
-
#extract_recon_titles(xml) ⇒ Array<String>
Extract reconciliation titles from the given XML.
-
#extract_scribes(record) ⇒ Array<String>
Extract scribes from the given record.
-
#extract_scribes_as_recorded(record) ⇒ Array<String>
Extracts scribes as recorded from the given record.
-
#extract_shelfmark(xml) ⇒ String
For legacy DS METS, the value of mods:identifier is the shelfmark.
-
#extract_subjects(record) ⇒ Array<DS::Extractor::Subject>
Extracts subjects from the given record.
-
#extract_subjects_as_recorded(xml) ⇒ Array<String>
Extract subjects, the mods:originInfo/mods:edition values for each text.
-
#extract_text_note(xml) ⇒ Array<String>
Extracts text notes from the given XML document.
-
#extract_titles(record) ⇒ Array<DS::Extractor::Title>
Extract titles from the given record.
-
#extract_titles_as_recorded(record) ⇒ Array<String>
Extract titles as recorded from the given record.
-
#find_ms(xml) ⇒ Object
METS structMap extraction.
-
#find_pages(xml) ⇒ Array<Nokogiri::XML::Node>
Array of the page-level mets:dmdSec nodes.
-
#find_parts(xml) ⇒ Array<Nokogiri::XML::Node>
Find the manuscript parts in the XML document.
-
#find_texts(xml) ⇒ Array<Nokogiri::XML::Node>
Find the texts in the XML document.
-
#note_by_type(node, note_type, tag: nil) ⇒ Object
DS 1.0 METS note types.
-
#physdesc_note(node, note_type, tag: nil) ⇒ Array<String>
Extracts the physical description notes from the given node based on the note type and optional tag.
-
#source_modified ⇒ String
A method to return the date when the source was last modified.
Instance Method Details
#dated_by_scribe?(xml) ⇒ Boolean
Determines if the XML document is dated by a scribe.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 529

def dated_by_scribe? xml
  parts = find_parts xml
  # mods:mods/mods:note
  xpath = 'mods:mods/mods:note[@type="date"]'
  parts.any? { |part| part.xpath(xpath).text.upcase == 'Y' }
end
#extract_acknowledgments(xml) ⇒ Array<String>
Extracts acknowledgments from the given XML document.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 645

def extract_acknowledgments xml
  notes = []

  notes += find_ms(xml).flat_map { |ms| note_by_type ms, 'admin' }

  notes += find_parts(xml).flat_map { |part|
    extent = extract_extent part
    note_by_type part, 'admin', tag: extent
  }

  notes += find_texts(xml).flat_map { |text|
    extent = extract_extent text
    note_by_type text, 'admin', tag: extent
  }

  notes += find_pages(xml).flat_map { |page|
    extent = extract_extent page
    note_by_type page, 'admin', tag: extent
  }

  clean_notes notes
end
#extract_all_subjects(record) ⇒ Array<DS::Extractor::Subject>
Extracts all subjects from the given record.
This method delegates to #extract_subjects to fulfill the DS::Extractor contract.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 921

def extract_all_subjects record
  extract_subjects record
end
#extract_all_subjects_as_recorded(xml) ⇒ Array<String>
Extract all subjects as recorded from the given XML.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 510

def extract_all_subjects_as_recorded xml
  extract_subjects_as_recorded xml
end
#extract_artists(record) ⇒ Array<DS::Extractor::Name>
Extracts artists from the given record using the specified type and role.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 292

def extract_artists record
  DS::Extractor::DsMetsXmlExtractor.extract_name record, *%w{ artist [artist] illuminator }
end
#extract_artists_as_recorded(record) ⇒ Object
Extracts artists as recorded from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 284

def extract_artists_as_recorded record
  extract_artists(record).map &:as_recorded
end
#extract_assigned_date(part) ⇒ Array<Integer>
Return dates found in the dateOther element, reformatting them as needed. These examples are taken from several METS files:
<mods:dateOther>[ca. 1410]</mods:dateOther>
<mods:dateOther>[between 1100 and 1200]</mods:dateOther>
<mods:dateOther>[between 1450 and 1460]</mods:dateOther>
<mods:dateOther>[between 1450 and 1500]</mods:dateOther>
<mods:dateOther>s. XV#^3/4#</mods:dateOther>
<mods:dateOther>s. XV</mods:dateOther>
<mods:dateOther>s. XVI#^4/4#</mods:dateOther>
<mods:dateOther>s. XVIII#^2/4#</mods:dateOther>
<mods:dateOther>s. XV#^in#</mods:dateOther>
Most dateOther values have the format:
s. XVII#^2#
The notation #^<VAL># encodes a portion of the string that was presented as superscript on the Berkeley DS site. DS 2.0 doesn’t use superscripts; thus, when this notation occurs, that portion of the string is reformatted as (<VAL>):
s. XVII#^2# => s. XVII(2)
s. XV#^ex# => s. XV(ex)
s. XVI#^in# => s. XVI(in)
s. X#^med# => s. X(med)
s. XII#^med# => s. XII(med)
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 635

def extract_assigned_date part
  xpath = 'mods:mods/mods:originInfo/mods:dateOther'
  part.xpath(xpath).text.gsub %r{#\^?([\w/]+)(\^|#)}, '(\1)'
end
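The superscript rewrite can be tried in isolation. The following is a minimal sketch that applies the same regular expression to plain strings; the helper name `reformat_superscript` is hypothetical and not part of the extractor, which reads the text from the dateOther node instead.

```ruby
# Hypothetical helper (not part of the extractor): applies the regular
# expression used by extract_assigned_date to a plain string.
def reformat_superscript(date_other)
  # Rewrites '#^<VAL>#' as '(<VAL>)'; strings without the notation are unchanged.
  date_other.gsub(%r{#\^?([\w/]+)(\^|#)}, '(\1)')
end

# Sample values taken from the dateOther examples above.
['s. XVII#^2#', 's. XV#^3/4#', 's. X#^med#', '[ca. 1410]'].each do |value|
  puts "#{value} => #{reformat_superscript(value)}"
end
```

Values such as `[ca. 1410]` contain no `#^...#` marker, so they pass through untouched.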
#extract_associated_agents(record) ⇒ Array<String>
Extract other names from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 324

def extract_associated_agents record
  DS::Extractor::DsMetsXmlExtractor.extract_name record, 'other'
end
#extract_authors(record) ⇒ Array<DS::Extractor::Name>
Extracts authors from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 269

def extract_authors record
  DS::Extractor::DsMetsXmlExtractor.extract_name record, *%w{ author [author] }
end
#extract_authors_as_recorded(record) ⇒ Array<String>
Extracts authors as recorded from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 277

def extract_authors_as_recorded record
  extract_authors(record).map &:as_recorded
end
#extract_cataloging_convention(xml) ⇒ Object
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 17

def extract_cataloging_convention xml
  'ds-mets'
end
#extract_date_created(part) ⇒ Array<Integer>
Return any date not found in the dateOther element or in a dateCreated date range (see #extract_date_range); thus:
<mods:dateCreated>1537</mods:dateCreated>
<mods:dateCreated>1531</mods:dateCreated>
<mods:dateCreated>14??, October 21</mods:dateCreated>
<mods:dateCreated>1462, July 23</mods:dateCreated>
<mods:dateCreated>1549, November</mods:dateCreated>
These values commonly give the date for “dated” manuscripts.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 599

def extract_date_created part
  xpath = 'mods:mods/mods:originInfo/mods:dateCreated[not(@point)]'
  part.xpath(xpath).map(&:text).join ', '
end
#extract_date_range(xml, range_sep:) ⇒ Array<String>
Extract ranges from mods:dateCreated elements where a @point is defined, thus:
<mods:dateCreated point="start" encoding="iso8601">1300</mods:dateCreated>
<mods:dateCreated point="end" encoding="iso8601">1399</mods:dateCreated>
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 563

def extract_date_range xml, range_sep:
  find_parts(xml).map { |part|
    extract_date_range_for_part(part).join range_sep
  }
end
#extract_date_range_for_part(part) ⇒ Array<Integer>
Extract ranges from mods:dateCreated elements whose @point is start or end.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 578

def extract_date_range_for_part part
  start_date = part.xpath(DATE_START_XPATH).text
  end_date = part.xpath(DATE_END_XPATH).text
  [start_date, end_date].reject(&:empty?).map(&:to_i)
end
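To see how missing endpoints are handled, here is a small sketch of the same filtering step on plain strings. The method name `date_range` is hypothetical; it assumes the start and end text have already been read from the XML, as the method above does with DATE_START_XPATH and DATE_END_XPATH.

```ruby
# Hypothetical stand-in for extract_date_range_for_part: takes the text of
# the start and end dateCreated elements instead of an XML node.
def date_range(start_date, end_date)
  # Drop empty endpoints, then convert the remaining strings to integers.
  [start_date, end_date].reject(&:empty?).map(&:to_i)
end

puts date_range('1300', '1399').join('-') # both endpoints present
puts date_range('1300', '').inspect       # a missing end date is dropped
```

Because empty strings are rejected before the integer conversion, a part with no dated range yields an empty array rather than `[0, 0]`.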
#extract_docket(xml) ⇒ Array<String>
DS METS can have mods:abstract elements with @displayLabel=“docket”. Extract these values and return as an array.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 889

def extract_docket xml
  xpath = %q{//mods:abstract[@displayLabel = 'docket']/text()}
  xml.xpath(xpath, NS).map { |docket|
    "Docket: #{docket.text}"
  }
end
#extract_explicit(node, tag:) ⇒ Array<String>
Extracts explicit information from the given node based on the provided tag.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 790

def extract_explicit node, tag:
  node.xpath('mods:mods/mods:abstract/text()').map { |n|
    "#{tag}: #{n.text}"
  }
end
#extract_extent(node) ⇒ String
Extracts the extent from the given node.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 211

def extract_extent node
  xpath = 'mods:mods/mods:physicalDescription/mods:extent'
  node.xpath(xpath).flat_map { |extent|
    extent.text.split(%r{;;}).first
  }.join ', '
end
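The `;;` handling is easier to see on plain strings. This is only a sketch: `first_extent_segments` is a hypothetical name, and the sample extent values are invented for illustration rather than taken from a METS file.

```ruby
# Hypothetical stand-in for extract_extent: works on the text of each
# mods:extent element instead of querying an XML node.
def first_extent_segments(extent_texts)
  # Keep only the text before the first ';;' in each extent, then join.
  extent_texts.flat_map { |text| text.split(/;;/).first }.join(', ')
end

# Invented sample values, for illustration only.
puts first_extent_segments(['ff. 1-10;;paper', 'ff. 11-20'])
```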
#extract_filenames(page) ⇒ Array<String>
Extract the filename for page. This will be one of:
* the values for +mods:identifier+ with +@type='filename'+;
* the filenames pointed to by the linked +mets:fptr+ in the +mets:fileGrp+ with +@USE='image/master'+; or
* an array containing +['NO_FILE']+, if no files are associated with the page.
There will almost always be one file, but at least one manuscript has a page with two associated images. Thus, we return an array.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 683

def extract_filenames page
  # mods:mods/mods:identifier[@type='filename']
  xpath = 'mods:mods/mods:identifier[@type="filename"]'
  filenames = page.xpath(xpath).map(&:text)
  return filenames unless filenames.empty?

  # no filename; find the ARK URL for the master image for this page
  extract_master_mets_file page
end
#extract_folio_num(page) ⇒ String
Extracts the folio number from the given page node.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 697

def extract_folio_num page
  # mods:mods/mods:physicalDescription/mods:extent
  xpath = 'mods:mods/mods:physicalDescription/mods:extent'
  page.xpath(xpath).map(&:text).join '|'
end
#extract_former_owners(record) ⇒ Array<DS::Extractor::Name>
Extracts former owners from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 253

def extract_former_owners record
  xpath = "./descendant::mods:note[@type='ownership']/text()"
  notes = clean_notes(record.xpath(xpath).flat_map(&:text))
  notes.flat_map { |n|
    splits = Recon::Type::Splits._lookup_single(n, from_column: 'authorized_label')
    splits.present? ? splits.split('|') : n
  }.map { |n|
    DS::Extractor::Name.new as_recorded: DS.mark_long(n), role: 'former owner'
  }
end
#extract_former_owners_as_recorded(xml, lookup_split: true) ⇒ Array<String>
Extracts former owners as recorded from the given XML.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 245

def extract_former_owners_as_recorded xml, lookup_split: true
  extract_former_owners(xml).map &:as_recorded
end
#extract_genres(xml) ⇒ Object
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 462

def extract_genres xml
  []
end
#extract_incipit_explicit(xml) ⇒ Object
If the mods:mods element has a <mods:titleInfo type="alternative"> element and a <mods:abstract[not(@displayLabel)]>, then the content of the <mods:abstract[not(@displayLabel)]> is an incipit; XPath:
//mods:mods[./mods:titleInfo[@type="alternative"] and ./mods:abstract[not(@displayLabel)]]
//mods:mods[./mods:titleInfo[@type="alternative"]]/mods:abstract[not(@displayLabel)]/text()
If the mods:mods element has a <mods:titleInfo type="alternative"> element and a <mods:note type="content">, then the content of the <mods:note type="content"> is an explicit; XPath:
//mods:mods[./mods:titleInfo[@type="alternative"] and ./mods:note[@type="content"]]
//mods:mods[./mods:titleInfo[@type="alternative"]]/mods:note[@type="content"]/text()
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 864

def extract_incipit_explicit xml
  # ./descendant::mods:physicalDescription
  # mods:mods/mods:originInfo/mods:place/mods:placeTerm
  # find any mod:mods containing an incipit or explicit
  xpath = %q{//mods:mods[./mods:titleInfo[@type="alternative"] and (./mods:abstract[not(@displayLabel)] or ./mods:note[@type="content"])]}
  find_texts(xml).flat_map { |node|
    # return an array for formatted incipits and explicits for this manuscript
    extent = node.xpath('./descendant::mods:physicalDescription/mods:extent/text()', NS).text
    node.xpath('./descendant::mods:abstract[not(@displayLabel)]/text()').map { |inc|
      "Incipit, #{extent}: #{inc}"
    } + node.xpath('./descendant::mods:note[@type="content"]/text()').map { |exp|
      "Explicit, #{extent}: #{exp}"
    }
  }
end
#extract_institution_name(xml) ⇒ String
Extracts the institution name from the given XML document.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 25

def extract_institution_name xml
  extract_mets_creator(xml).first
end
#extract_languages(record) ⇒ Array<DS::Extractor::Language>
Extract languages from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 342

def extract_languages record
  # /mets:mets/mets:dmdSec/mets:mdWrap/mets:xmlData/mods:mods/mods:note
  # Can be Lang: or lang: or ???, so down case the text with translate()
  xpath = './descendant::mods:note[starts-with(translate(text(), "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"), "lang:")]'
  find_texts(record).flat_map { |text|
    text.xpath(xpath).map { |note| note.text.sub(%r{^lang:\s*}i, '') }
  }.uniq.map { |as_recorded|
    DS::Extractor::Language.new as_recorded: as_recorded
  }
end
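The prefix stripping and de-duplication can be sketched without any XML. The helper below is hypothetical: it assumes the note texts are already in hand, and it filters in Ruby where the method above filters with the XPath `translate()` expression.

```ruby
# Hypothetical stand-in for the string handling in extract_languages;
# the XPath filter is replaced by a plain Ruby prefix check.
def languages_as_recorded(note_texts)
  note_texts
    .select { |text| text.downcase.start_with?('lang:') } # case-insensitive match
    .map { |text| text.sub(/^lang:\s*/i, '') }            # strip the prefix
    .uniq                                                 # "Latin", not "Latin|Latin"
end

# Invented sample notes, for illustration only.
p languages_as_recorded(['lang: Latin', 'Lang: Latin', 'LANG:French', 'Binding: modern'])
```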
#extract_languages_as_recorded(record) ⇒ String
Return a list of unique languages from the text-level <mods:note>s that start with “lang:” (case-insensitive), joined with a separator; so, “Latin” rather than “Latin|Latin|Latin”, etc.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 334

def extract_languages_as_recorded record
  extract_languages(record).map &:as_recorded
end
#extract_link_to_inst_record(xml) ⇒ String
Extract link to institution record from the given XML.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 518

def extract_link_to_inst_record xml
  ms = find_ms xml
  # xpath mods:mods/mods:relatedItem/mods:location/mods:url
  xpath = "mods:mods/mods:relatedItem/mods:location/mods:url"
  ms.xpath(xpath).map(&:text).join '|'
end
#extract_master_mets_file(page) ⇒ Array<String>
In some METS files each page has a list of mets:fptr elements. We need the @FILEID for the master image, but we don’t know which one is for the master. Thus we get all the @FILEIDs.
<mets:structMap>
  <mets:div TYPE="text" LABEL="[No Title for Display]" ADMID="RMD1" DMDID="DM1">
    <mets:div TYPE="item" LABEL="[No Title for Display]" DMDID="DM2">
      <mets:div TYPE="item" LABEL="[No Title for Display]" DMDID="DM3">
        <mets:div TYPE="item" LABEL="Music extending into right margin, upper right column." DMDID="DM4">
          <mets:fptr FILEID="FID1"/>
          <mets:fptr FILEID="FID3"/>
          <mets:fptr FILEID="FID5"/>
          <mets:fptr FILEID="FID7"/>
          <mets:fptr FILEID="FID9"/>
        </mets:div>
        <!-- snip -->
      </mets:div>
    </mets:div>
  </mets:div>
</mets:structMap>
Using the FILEIDs, find the corresponding mets:file in the mets:fileGrp with @USE='image/master'.
<mets:fileGrp USE="image/master">
  <mets:file ID="FID1" MIMETYPE="image/tiff" SEQ="1" CREATED="2010-11-08T10:26:20.3" ADMID="ADM1 ADM4" GROUPID="GID1">
    <mets:FLocat xlink:href="http://nma.berkeley.edu/ark:/28722/bk0008v1k7q" LOCTYPE="URL"/>
  </mets:file>
  <mets:file ID="FID2" MIMETYPE="image/tiff" SEQ="2" CREATED="2010-11-08T10:26:20.393" ADMID="ADM1 ADM5" GROUPID="GID2">
    <mets:FLocat xlink:href="http://nma.berkeley.edu/ark:/28722/bk0008v1k88" LOCTYPE="URL"/>
  </mets:file>
</mets:fileGrp>
We then follow the xlink:href to get the filename from the ‘location’ HTTP header.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 742

def extract_master_mets_file page
  dmdid = page['ID']
  # all the mets:fptr @FILEIDs for this page
  xpath = %Q{//mets:structMap/descendant::mets:div[@DMDID='#{dmdid}']/mets:fptr/@FILEID}
  # create an OR query because we don't know which FILEID is for the
  # master mets:file:
  # "@ID = 'FID1' or @ID = 'FID3' or @ID = 'FID5' ... etc."
  id_query = page.xpath(xpath).map(&:text).map { |id| "@ID='#{id}'" }.join ' or '
  return ['NO_FILE'] if id_query.strip.empty? # there is no associated mets:fptr

  # the @xlink:href is the Berkeley ARK address; e.g., http://nma.berkeley.edu/ark:/28722/bk0008v1k88
  xpath = "//mets:fileGrp[@USE='image/master']/mets:file[#{id_query}]/mets:FLocat/@xlink:href"
  fptr_addresses = page.xpath(xpath).map &:text
  return ['NO_FILE'] if fptr_addresses.empty? # I don't know if this happens, but just in case...

  # for each ARK address, find the TIFF filename
  fptr_addresses.map { |address| locate_filename address }
end
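The OR-query construction is plain string building, so it can be sketched on its own. `master_file_xpath` is a hypothetical helper that assumes the FILEIDs have already been collected; unlike the method above it returns nil rather than `['NO_FILE']` when there are none, and it does not follow the resulting hrefs.

```ruby
# Hypothetical helper: builds the XPath that selects the master-image
# locations for a page, given the page's mets:fptr FILEIDs.
def master_file_xpath(file_ids)
  # "@ID='FID1' or @ID='FID3' ..." because we can't tell which ID is the master.
  id_query = file_ids.map { |id| "@ID='#{id}'" }.join(' or ')
  return nil if id_query.strip.empty? # no mets:fptr for this page

  "//mets:fileGrp[@USE='image/master']/mets:file[#{id_query}]/mets:FLocat/@xlink:href"
end

puts master_file_xpath(%w[FID1 FID3 FID5])
```

Only the mets:file elements inside the `image/master` group can match, so IDs that belong to other file groups are filtered out by the query itself.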
#extract_material_as_recorded(record) ⇒ String
Extracts the material as recorded from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 222

def extract_material_as_recorded record
  extract_materials(record).map(&:as_recorded).join '|'
end
#extract_materials(record) ⇒ Array<DS::Extractor::Material>
Extracts materials from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 230

def extract_materials record
  find_parts(record).flat_map { |part|
    physdesc_note part, 'support'
  }.map { |s|
    s.downcase.chomp('.').strip
  }.uniq.map { |as_recorded|
    DS::Extractor::Material.new as_recorded: as_recorded
  }
end
#extract_mets_creator(xml) ⇒ Array<String>
Extracts the creator information from the METS XML document.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 33

def extract_mets_creator xml
  creator = xml.xpath('/mets:mets/mets:metsHdr/mets:agent[@ROLE="CREATOR" and @TYPE="ORGANIZATION"]/mets:name', NS).text
  creator.split %r{;;}
end
#extract_ms_note(xml) ⇒ Array<String>
Extracts the manuscript note from the given XML.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 766

def extract_ms_note xml
  notes = []
  ms = find_ms xml
  notes += note_by_type ms, :none, tag: 'Manuscript note'
  notes += note_by_type ms, 'bibliography', tag: 'Bibliography'
  notes
end
#extract_ms_phys_desc(xml) ⇒ Object
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 88

def extract_ms_phys_desc xml
  ms = find_ms xml
  physdesc_note ms, 'presentation', tag: 'Binding'
end
#extract_name(node, *roles) ⇒ Array<DS::Extractor::Name>
Extract name from the given node based on the provided roles.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 358

def extract_name node, *roles
  # Roles have different cases: Author, author, etc.
  # Xpath 1.0 has no lower-case function, so use translate()
  translate = "translate(./mods:role/mods:roleTerm/text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')"
  props = roles.map { |r| "#{translate} = '#{r}'" }.join ' or '
  xpath = "./descendant::mods:name[#{props}]"
  node.xpath(xpath).flat_map { |name|
    name.xpath('mods:namePart').text.split %r{\s*;\s*}
  }.uniq.map { |as_recorded|
    DS::Extractor::Name.new as_recorded: as_recorded, role: roles.first
  }
end
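It can help to print the XPath that a given list of roles produces. The sketch below reproduces only the string-building step; `name_xpath` is a hypothetical name, and no XML is queried.

```ruby
# Hypothetical helper: builds the role-matching XPath used to select
# mods:name elements, without running it against a document.
def name_xpath(*roles)
  # XPath 1.0 has no lower-case(), so translate() maps A-Z to a-z.
  lowercase_role = "translate(./mods:role/mods:roleTerm/text(), " \
                   "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')"
  props = roles.map { |role| "#{lowercase_role} = '#{role}'" }.join(' or ')
  "./descendant::mods:name[#{props}]"
end

puts name_xpath('scribe', '[scribe]')
```

Each role adds one `translate(...) = 'role'` clause, so the roles passed in should already be lower case.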
#extract_notes(xml) ⇒ Array<String>
Extract the notes at all levels from the xml, and return an array of strings.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 833

def extract_notes xml
  notes = []
  # get all notes that don't have @type
  xpath = %q{//mods:note[not(@type)]/text()}
  notes += extract_ms_note xml
  notes += extract_part_note xml
  notes += extract_text_note xml
  notes += extract_docket xml
  notes += extract_page_note xml

  clean_notes notes
end
#extract_other_names_as_recorded(record) ⇒ Array<String>
Extract other names as recorded from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 316

def extract_other_names_as_recorded record
  extract_associated_agents(record).map &:as_recorded
end
#extract_page_note(xml) ⇒ Array<String>
Extracts notes for each page in the given XML.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 816

def extract_page_note xml
  find_pages(xml).flat_map { |page|
    extent = extract_extent page
    notes = []
    notes += note_by_type page, :none, tag: extent
    notes += note_by_type page, 'content', tag: "Incipit, #{extent}"
    notes += extract_explicit page, tag: "Explicit, #{extent}"
    notes
  }
end
#extract_part_note(xml) ⇒ Array<String>
Extracts notes for each part in the given XML.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 778

def extract_part_note xml
  find_parts(xml).flat_map { |part|
    extent = extract_extent part
    note_by_type part, :none, tag: extent
  }
end
#extract_part_phys_desc(xml) ⇒ Array<String>
Extracts physical description notes for each part in the XML.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 119

def extract_part_phys_desc xml
  parts = find_parts xml
  parts.flat_map { |part|
    extent = extract_extent part
    notes = []

    tag = "Figurative details, #{extent}"
    notes += physdesc_note part, 'physical details', tag: tag
    notes += extract_pd_note part
    tag = "Script, #{extent}"
    notes += physdesc_note part, 'script', tag: tag
    tag = "Music, #{extent}"
    notes += physdesc_note part, 'medium', tag: tag
    tag = "Layout, #{extent}"
    notes += physdesc_note part, 'technique', tag: tag
    tag = "Watermarks, #{extent}"
    notes += physdesc_note part, 'marks', tag: tag
    notes
  }
end
#extract_pd_note(part) ⇒ Array<String>
Extracts physical description notes from the given part object.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 97

def extract_pd_note part
  extent = extract_extent part
  xpath = %q{mods:mods/mods:physicalDescription/mods:note[@type = 'physical description']/text()}
  part.xpath(xpath).flat_map { |node|
    text = node.text
    notes = []
    if text =~ %r{;;}
      other_deco, num_scribes = text.split %r{;;+}
      notes << "Other decoration, #{extent}: #{other_deco}" unless other_deco.blank?
      notes << "Number of scribes, #{extent}: #{num_scribes}" unless num_scribes.blank?
    else
      notes << "Other decoration, #{extent}: #{text}" unless text.empty?
    end
    notes
  }
end
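The `;;` split is the part of this method most likely to surprise, so here is a sketch of it on plain strings. `pd_notes` is a hypothetical name, the sample values are invented, and `to_s.strip.empty?` stands in for ActiveSupport's `blank?` so the example runs with plain Ruby.

```ruby
# Hypothetical stand-in for the per-note logic in extract_pd_note; takes
# the note text and the extent string directly.
def pd_notes(text, extent)
  notes = []
  if text.include?(';;')
    # Text before ';;' is other decoration; text after is the number of scribes.
    other_deco, num_scribes = text.split(/;;+/)
    notes << "Other decoration, #{extent}: #{other_deco}" unless other_deco.to_s.strip.empty?
    notes << "Number of scribes, #{extent}: #{num_scribes}" unless num_scribes.to_s.strip.empty?
  else
    notes << "Other decoration, #{extent}: #{text}" unless text.empty?
  end
  notes
end

# Invented sample values, for illustration only.
p pd_notes('Penwork initials;;2', 'ff. 1-10')
p pd_notes(';;3', 'ff. 1-10')
```

A note that starts with `;;` carries only a scribe count, so the empty decoration half is skipped.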
#extract_physical_description(xml) ⇒ Array
Extract and format all the physical description values for the manuscript and each part.
# MS Note Phys desc
- presentation -> 'Binding'
# MS Part phys description
- support -- accounted for as support
- marks -> 'Watermarks'
- medium -> 'Music'
- physical description -> 'Other decoration'
- physical details -> 'Figurative details'
- script -> 'Script'
- technique -> 'Layout'
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 59

def extract_physical_description xml
  physdesc = []

  physdesc += extract_ms_phys_desc xml
  physdesc += extract_part_phys_desc xml
  physdesc.flatten!

  clean_notes physdesc
end
#extract_places(record) ⇒ Array<DS::Extractor::Place>
Extracts places from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 904

def extract_places record
  parts = find_parts record
  xpath = 'mods:mods/mods:originInfo/mods:place/mods:placeTerm'
  parts.flat_map { |node|
    node.xpath(xpath).map { |place|
      DS::Extractor::Place.new as_recorded: place.text.split(%r{;;}).join(', ')
    }
  }
end
#extract_production_date_as_recorded(xml) ⇒ Array<String>
Return as a single string all the date values for each part of the manuscript. Each string is a concatenation of the values returned by #extract_date_created, #extract_assigned_date, and #extract_date_range_for_part.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 545

def extract_production_date_as_recorded xml
  find_parts(xml).map { |part|
    date_created = extract_date_created part
    assigned = extract_assigned_date part
    range = extract_date_range_for_part(part).join '-'
    [date_created, assigned, range].uniq.reject(&:empty?).join '; '
  }.reject { |date| date.to_s.strip.empty? }
end
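The way the three date sources combine for a single part can be sketched with plain values. `production_date` is a hypothetical helper; it assumes the created date, the assigned date, and the integer range have already been extracted.

```ruby
# Hypothetical stand-in for the per-part step of
# extract_production_date_as_recorded.
def production_date(date_created, assigned, range)
  # Join the range with '-', drop duplicates and empty values, then combine.
  [date_created, assigned, range.join('-')].uniq.reject(&:empty?).join('; ')
end

puts production_date('', 's. XV(ex)', [1475, 1499])
puts production_date('1462, July 23', '', [])
```

Empty sources simply disappear from the result, so a part with only a dateCreated value yields that value alone.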
#extract_production_places_as_recorded(xml) ⇒ Array<String>
Extract production places as recorded from the given XML.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 398

def extract_production_places_as_recorded xml
  extract_places(xml).map &:as_recorded
end
#extract_recon_names(xml) ⇒ Array<Array>
Extract reconciliation names from the given XML.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 430

def extract_recon_names xml
  data = extract_authors(xml).map &:to_a
  data += extract_artists(xml).map &:to_a
  data += extract_scribes(xml).map &:to_a
  data += extract_former_owners(xml).map &:to_a
  data += extract_associated_agents(xml).map &:to_a
  data
end
#extract_recon_places(xml) ⇒ Array<Array>
Extract the places of production for reconciliation CSV output.
Returns a two-dimensional array, each row is a place; and each row has one column: place name; for example:
[["Austria"],
["Germany"],
["France (?)"]]
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 414

def extract_recon_places xml
  extract_places(xml).map &:to_a
end
#extract_recon_splits(xml) ⇒ Object
Extract acknowledgments, notes, physical descriptions, and former owners; return all strings that start with SPLIT:, remove ‘SPLIT: ’ and return an array of arrays that can be treated as rows by Recon::Type::Splits
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 444

def extract_recon_splits xml
  data = []
  data += DS::Extractor::DsMetsXmlExtractor.extract_former_owners_as_recorded xml, lookup_split: false
  data.flatten.select { |d| d.to_s.size >= 400 }.map { |d| [d.strip] }
end
#extract_recon_subjects(xml) ⇒ Array<String,String>
See the note for [Recon::Type::Subjects]: each source subject extraction method should return a two-dimensional array:
[["Islamic law--Early works to 1800", ""],
["Malikites--Early works to 1800", ""],
["Islamic law", ""],
["Malikites", ""],
["Arabic language--Grammar--Early works to 1800", ""],
["Arabic language--Grammar", ""],
...
]
The second value is for those cases where the source provides an authority URI. The METS records don’t give a URI so this method always returns the empty string for the second value.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 485
def extract_recon_subjects xml
  extract_subjects(xml).map &:to_a
end
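The two-column row convention can be sketched as follows. `SubjectSketch` is a hypothetical stand-in for `DS::Extractor::Subject`, and the `source_authority_uri` field name is an assumption made for illustration; the point is only that a missing URI becomes an empty string in the second column.

```ruby
# Hypothetical stand-in for DS::Extractor::Subject; only the two values
# needed for the row shape are modelled, and the field names are assumptions.
SubjectSketch = Struct.new(:as_recorded, :source_authority_uri) do
  # Column 1: subject as recorded. Column 2: authority URI, or '' if absent.
  def to_a
    [as_recorded, source_authority_uri.to_s]
  end
end

# METS records carry no authority URI, so every subject is built with nil.
def recon_subject_rows(terms)
  terms.map { |term| SubjectSketch.new(term, nil).to_a }
end

recon_subject_rows(['Islamic law', 'Malikites']).each { |row| puts row.inspect }
```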
#extract_recon_titles(xml) ⇒ Array<String>
Extract reconciliation titles from the given XML.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 422
def extract_recon_titles xml
  extract_titles(xml).to_a
end
#extract_scribes(record) ⇒ Array<DS::Extractor::Name>
Extract scribes from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 308
def extract_scribes record
  DS::Extractor::DsMetsXmlExtractor.extract_name record, *%w{ scribe [scribe] }
end
#extract_scribes_as_recorded(record) ⇒ Array<String>
Extracts scribes as recorded from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 300
def extract_scribes_as_recorded record
  extract_scribes(record).map &:as_recorded
end
#extract_shelfmark(xml) ⇒ String
For the legacy DS METS, the value of mods:identifier is the shelfmark. If there are other ID types, we can’t distinguish them from shelfmarks.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 457
def extract_shelfmark xml
  ms = find_ms xml
  ms.xpath('mods:mods/mods:identifier[@type="local"]/text()').text
end
#extract_subjects(record) ⇒ Array<DS::Extractor::Subject>
Extracts subjects from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 929
def extract_subjects record
  xpath = '//mods:originInfo/mods:edition'
  find_texts(record).flat_map { |text|
    text.xpath(xpath).map { |subj|
      as_recorded = subj.text.strip.gsub(/\s+/, ' ')
      DS::Extractor::Subject.new as_recorded: as_recorded, vocab: 'ds-subject'
    }
  }
end
#extract_subjects_as_recorded(xml) ⇒ Array<String>
Extract subjects, the mods:originInfo/mods:edition values for each text. For example,
<mods:originInfo>
<mods:edition>Alexander, de Villa Dei.</mods:edition>
<mods:edition>Latin language--Grammar.</mods:edition>
<mods:edition>Latin poetry, Medieval and modern.</mods:edition>
<mods:edition>Manuscripts, Medieval--Connecticut--New Haven.</mods:edition>
</mods:originInfo>
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 502
def extract_subjects_as_recorded xml
  extract_subjects(xml).map(&:as_recorded)
end
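The `mods:edition` values are whitespace-normalized before they become subjects (the `strip.gsub(/\s+/, ' ')` call in `extract_subjects`). A minimal sketch of that step on plain strings follows; `normalize_subject` is a hypothetical helper name, and the sample inputs are invented to show line-wrapped element text.

```ruby
# Trim leading/trailing whitespace and collapse internal runs (including
# newlines from pretty-printed XML) into single spaces.
def normalize_subject(text)
  text.strip.gsub(/\s+/, ' ')
end

samples = [
  "  Latin language--Grammar.\n  ",
  "Latin poetry,\n      Medieval and modern."
]
samples.each { |raw| puts normalize_subject(raw) }
```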
#extract_text_note(xml) ⇒ Array<String>
Extracts text notes from the given XML document.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 800
def extract_text_note xml
  find_texts(xml).flat_map { |text|
    extent = extract_extent text
    notes = []
    notes += note_by_type text, :none, tag: extent
    notes += note_by_type text, 'condition', tag: "Status of text, #{extent}"
    notes += note_by_type text, 'content', tag: "Incipit, #{extent}"
    notes += extract_explicit text, tag: "Explicit, #{extent}"
    notes
  }
end
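To make the tagging in the listing above concrete, the sketch below composes the same tag strings for a single text and prefixes them to note values. `text_note_tags` and `tag_note` are hypothetical helpers, and the extent `ff. 1r-10v` and the note text are invented sample values.

```ruby
# Tags used per note kind, built from the text's extent as in the listing.
def text_note_tags(extent)
  {
    untyped:   extent,
    condition: "Status of text, #{extent}",
    content:   "Incipit, #{extent}",
    explicit:  "Explicit, #{extent}"
  }
end

# Same prefixing rule as note_by_type: no tag means the bare note text.
def tag_note(text, tag = nil)
  tag.nil? ? text : "#{tag}: #{text}"
end

tags = text_note_tags('ff. 1r-10v')
puts tag_note('Arma virumque cano', tags[:content])
puts tag_note('Incomplete at end', tags[:condition])
```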
#extract_titles(record) ⇒ Array<DS::Extractor::Title>
Extract titles from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 383
def extract_titles record
  xpath = 'mods:mods/mods:titleInfo/mods:title'
  find_texts(record).flat_map { |text|
    text.xpath(xpath).map(&:text)
  }.reject { |t|
    t == '[Title not supplied]'
  }.map { |as_recorded|
    DS::Extractor::Title.new as_recorded: as_recorded
  }
end
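The listing above drops the cataloguer's placeholder before building title objects. A minimal sketch of that filter on plain strings, assuming hypothetical names `PLACEHOLDER_TITLE` and `usable_titles` (only the placeholder string itself comes from the source):

```ruby
# Placeholder string taken from the source listing.
PLACEHOLDER_TITLE = '[Title not supplied]'

# Exact-match rejection, as in extract_titles; near-variants are kept.
def usable_titles(raw_titles)
  raw_titles.reject { |title| title == PLACEHOLDER_TITLE }
end

puts usable_titles(['Doctrinale', '[Title not supplied]', 'Book of hours']).inspect
```

Because the comparison is an exact match, a value with different spacing or capitalization would pass through unchanged.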
#extract_titles_as_recorded(record) ⇒ Array<String>
Extract titles as recorded from the given record.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 375
def extract_titles_as_recorded record
  extract_titles(record).map &:as_recorded
end
#find_ms(xml) ⇒ Object
METS structMap extraction
Extract mods:mods elements by catalog description level: manuscript, manuscript part, text, page, image
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 946
def find_ms xml
  # the manuscript is one div deep in the structMap
  # /mets:mets/mets:structMap/mets:div/@DMDID
  xpath = '/mets:mets/mets:structMap/mets:div/@DMDID'
  id = xml.xpath(xpath).first.text
  xml.xpath "/mets:mets/mets:dmdSec[@ID='#{id}']/mets:mdWrap/mets:xmlData"
end
#find_pages(xml) ⇒ Array<Nokogiri::XML::Node>
Returns array of the page-level mets:dmdSec nodes.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 992
def find_pages xml
  # /mets:mets/mets:structMap/mets:div/mets:div/mets:div/mets:div/@DMDID
  # the pages are four divs deep in the structMap
  # We need the IDs in order
  xpath = '/mets:mets/mets:structMap/mets:div/mets:div/mets:div/mets:div/@DMDID'
  ids = xml.xpath(xpath).map &:text
  # collect dmdSec's for all the page IDs
  ids.flat_map { |id|
    xml.xpath "/mets:mets/mets:dmdSec[@ID='#{id}']/mets:mdWrap/mets:xmlData"
  }
end
#find_parts(xml) ⇒ Array<Nokogiri::XML::Node>
Find the manuscript parts in the XML document.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 958
def find_parts xml
  # /mets:mets/mets:structMap/mets:div/mets:div/@DMDID
  # manuscripts parts are two divs deep in the structMap
  # We need to get the IDs in order
  xpath = '/mets:mets/mets:structMap/mets:div/mets:div/@DMDID'
  ids = xml.xpath(xpath).map &:text
  # We can't count on the order or the numbering of the mets:dmdSec
  # elements outside of the structMap. Thus, we have to return an
  # array with the parts mets:dmdSec in the correct order.
  ids.map { |id|
    xml.xpath "/mets:mets/mets:dmdSec[@ID='#{id}']/mets:mdWrap/mets:xmlData"
  }
end
#find_texts(xml) ⇒ Array<Nokogiri::XML::Node>
Find the texts in the XML document.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 977
def find_texts xml
  # /mets:mets/mets:structMap/mets:div/mets:div/mets:div/@DMDID
  # texts are three divs deep in the structMap
  # We need to get the IDs in order
  xpath = '/mets:mets/mets:structMap/mets:div/mets:div/mets:div/@DMDID'
  ids = xml.xpath(xpath).map &:text
  ids.map { |id|
    xml.xpath "/mets:mets/mets:dmdSec[@ID='#{id}']/mets:mdWrap/mets:xmlData"
  }
end
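The `find_*` methods all rely on the same idea: the description level is given by how deeply a `mets:div` is nested in the structMap (manuscript 1, part 2, text 3, page 4), and the `DMDID` values are read in structMap order. The sketch below models that with nested hashes rather than XML, so it is not how the library does it (the real methods use XPath); `dmdids_at_depth`, `STRUCT_MAP_SKETCH`, and the `DM1`...`DM5` identifiers are all invented for illustration.

```ruby
# Collect DMDID values from divs exactly `depth` levels below the structMap,
# preserving document order.
def dmdids_at_depth(divs, depth)
  return divs.map { |div| div[:dmdid] }.compact if depth == 1

  divs.flat_map { |div| dmdids_at_depth(div.fetch(:divs, []), depth - 1) }
end

# Invented structMap: one manuscript, one part, one text, two pages.
STRUCT_MAP_SKETCH = [
  { dmdid: 'DM1', divs: [
    { dmdid: 'DM2', divs: [
      { dmdid: 'DM3', divs: [{ dmdid: 'DM4' }, { dmdid: 'DM5' }] }
    ] }
  ] }
].freeze

{ manuscript: 1, parts: 2, texts: 3, pages: 4 }.each do |level, depth|
  puts "#{level}: #{dmdids_at_depth(STRUCT_MAP_SKETCH, depth).inspect}"
end
```

Reading the IDs from the structMap first, and only then looking up each `mets:dmdSec`, is what keeps the results in the intended order.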
#note_by_type(node, note_type, tag: nil) ⇒ Object
DS 1.0 METS note types:
# MS Note types:
Accounted for
- ownership -- accounted for, former owner
- action -- skip; administrative note: "Inputter ...."
- admin -- acknowledgments
- untyped -- 'Manuscript Note'
- bibliography -- 'Bibliography'
- source note -- skip; not present on DS legacy pages
# MS Note Phys desc
- presentation -> Binding
# Part note types:
- date - already accounted for
- content - skip
- admin - Acknowledgments
- untyped
# MS Part phys description
- support -- accounted for as support
- marks - 'Watermarks'
- medium -> 'Music'
- physical description -> 'Other decoration'
- physical details -> 'Figurative details'
- script -> 'Script'
- technique -> 'Layout'
# Text note types
Accounted for
- admin - acknowledgments
- condition -> 'Status of text'
- content -> handled as Text Incipit
- untyped -> 'Text note'
# Page note types
Accounted for
None
- content -> Folio Incipit
- date -- skip
- untyped -> 'Folio note'
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 195
def note_by_type node, note_type, tag: nil
  if note_type == :none
    xpath = %q{mods:mods/mods:note[not(@type)]/text()}
  else
    xpath = %Q{mods:mods/mods:note[@type = '#{note_type}']/text()}
  end

  node.xpath(xpath).map { |x|
    tag.nil? ? x.text : "#{tag}: #{x.text}"
  }
end
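The selection rule in the listing above (untyped notes when `note_type` is `:none`, otherwise notes whose `@type` matches) can be isolated as a small string-building sketch. `note_xpath` is a hypothetical helper that is not part of the module; it only reproduces the two XPath expressions so they can be inspected without an XML document.

```ruby
# Build the XPath used to select notes: untyped notes for :none,
# otherwise notes whose @type attribute equals the given type.
def note_xpath(note_type)
  if note_type == :none
    'mods:mods/mods:note[not(@type)]/text()'
  else
    "mods:mods/mods:note[@type = '#{note_type}']/text()"
  end
end

[:none, 'condition', 'content'].each do |type|
  puts "#{type.inspect} => #{note_xpath(type)}"
end
```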
#physdesc_note(node, note_type, tag: nil) ⇒ Array<String>
Extracts the physical description notes from the given node based on the note type and optional tag.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 74
def physdesc_note node, note_type, tag: nil
  if note_type == :none
    xpath = %q{mods:mods/mods:physicalDescription/mods:note[not(@type)]}
  else
    xpath = %Q{mods:mods/mods:physicalDescription/mods:note[@type = '#{note_type}']}
  end

  node.xpath(xpath).map { |x|
    tag.nil? ? x.text : "#{tag}: #{x.text}"
  }
end
#source_modified ⇒ String
Returns the date the source was last modified. For DS METS, the fixed date 2021-10-01 is used.
# File 'lib/ds/extractor/ds_mets_xml_extractor.rb', line 1007
def source_modified
  "2021-10-01"
end