Module: MetadataHelper
- Includes: MarcHelper
- Included in: Amazon, Blacklight, CoverThing, DissertationCatch, GoogleBookSearch, Gpo, HathiTrust, Hip3Service, Illiad, InternetArchive, IsbnLink, Isi, Jcr, OpenLibrary, Pubmed, Referent, Scopus, Scopus2, SearchMethods::Sfx4, Worldcat, WorldcatIdentities
- Defined in: app/mixin_logic/metadata_helper.rb
Overview
Helper module to get keyword-searchable terms from OpenURL author and title.
OpenURLs have some commonly agreed-upon metadata elements. This module is meant to simplify things by sorting through the metadata and extracting what we need through a simpler interface. These values are specifically constructed from the citation to work well as keyword searches in other services.
Also includes some helpful methods for getting identifiers out in a form that is convenient to work with, regardless of the non-standard ways they may have been stored.
Class Method Summary
- .title_is_serial?(rft) ⇒ Boolean
  Look at weird bad OpenURLs, use heuristics to see if the ‘title’ probably represents a journal rather than a book.
Instance Method Summary
- #get_doi(rft) ⇒ Object
- #get_epage(rft) ⇒ Object
  Uses `epage` or tries to parse `pages`.
- #get_gpo_item_nums(rft) ⇒ Object
  Returns an array, possibly empty.
- #get_identifier(type, sub_scheme, referent, options = {}) ⇒ Object
  oclcnum and lccn are both supposed to be stored as identifiers with an info: uri.
- #get_isbn(rft) ⇒ Object
  Gets isbn, also removes any weird stuff on the end sometimes included as ‘isbn’, but not part of the isbn.
- #get_issn(rft) ⇒ Object
  Gets an ISSN, makes sure it’s a valid ISSN or else returns nil.
- #get_lccn(rft) ⇒ Object
  Finds and normalizes an LCCN.
- #get_month(rft) ⇒ Object
- #get_oclcnum(rft) ⇒ Object
- #get_pmid(rft) ⇒ Object
- #get_search_creator(rft) ⇒ Object
  Chooses the best available creator for the format.
- #get_search_terms(rft) ⇒ Object
  DEPRECATED, not flexible enough, you really need to custom fit for your given target.
- #get_search_title(rft, options = {}) ⇒ Object
  Chooses the best available title for the format, normalizes.
- #get_spage(rft) ⇒ Object
  Uses `spage` or tries to parse `pages`.
- #get_sudoc(rft) ⇒ Object
- #get_top_level_creator(rft) ⇒ Object
- #get_year(rft) ⇒ Object
- #normalize_lccn(lccn) ⇒ Object
  Some normalization.
- #normalize_title(arg_title, options = {}) ⇒ Object
  A utility method to ‘normalize’ a title, for use when trying to match a title from one place with records in another database.
- #raw_search_title(rft) ⇒ Object
  Picks the title out of the OpenURL referent from the best element available, no normalization.
Methods included from MarcHelper
#add_856_links, #edition_statement, #get_title, #get_years, #gmd_values, #service_type_for_856, #should_skip_856_link?, #strip_gmd
Class Method Details
.title_is_serial?(rft) ⇒ Boolean
Look at weird bad OpenURLs, use heuristics to see if the ‘title’ probably represents a journal rather than a book. A guess at best, based on the bad data we’ve seen, sigh.
    # File 'app/mixin_logic/metadata_helper.rb', line 344

    def title_is_serial?(rft)
      (rft.format != "book" && rft.format != "dissertation") &&
        (rft.metadata["btitle"].blank?) &&
        (%w{journal article}.include?(rft.metadata["genre"]) ||
         rft.metadata['jtitle'].present? ||
         (rft.metadata["genre"].blank? && rft.metadata["issn"].present?))
    end
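To see which inputs the heuristic accepts, here is a minimal plain-Ruby sketch of the same conditions. `FakeReferent`, `blank_value?` and `serial_title?` are hypothetical stand-ins introduced for illustration (the real method works on an Umlaut Referent and uses ActiveSupport's `blank?`/`present?`).

```ruby
# Hypothetical stand-in for an Umlaut Referent: just a format and a metadata hash.
FakeReferent = Struct.new(:format, :metadata)

# Plain-Ruby approximation of ActiveSupport's blank? for nil and strings.
def blank_value?(value)
  value.nil? || value.strip.empty?
end

# Same three conditions as title_is_serial?, written without Rails helpers.
def serial_title?(rft)
  m = rft.metadata
  !%w[book dissertation].include?(rft.format) &&
    blank_value?(m["btitle"]) &&
    (%w[journal article].include?(m["genre"]) ||
     !blank_value?(m["jtitle"]) ||
     (blank_value?(m["genre"]) && !blank_value?(m["issn"])))
end

# No format, no genre, but an ISSN: guessed to be a serial.
serial_title?(FakeReferent.new(nil, { "title" => "Nature", "issn" => "0028-0836" })) # => true
# An explicit book format always wins.
serial_title?(FakeReferent.new("book", { "jtitle" => "Nature" }))                    # => false
```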
Instance Method Details
#get_doi(rft) ⇒ Object
    # File 'app/mixin_logic/metadata_helper.rb', line 263

    def get_doi(rft)
      return get_identifier(:info, "doi", rft)
    end
#get_epage(rft) ⇒ Object
Uses `epage` or tries to parse `pages`.

    # File 'app/mixin_logic/metadata_helper.rb', line 329

    def get_epage(rft)
      if rft.metadata['epage'].present?
        return rft.metadata['epage']
      elsif rft.metadata['pages'] =~ /\A.*\- *(.*) *\Z/
        return $1
      elsif rft.metadata['pages'].present?
        return rft.metadata['pages']
      else
        return nil
      end
    end
#get_gpo_item_nums(rft) ⇒ Object
Returns an array, possibly empty.
    # File 'app/mixin_logic/metadata_helper.rb', line 272

    def get_gpo_item_nums(rft)
      # In a technically illegal but used by OCLC info:gpo uri
      ids = get_identifier(:info, "gpo", rft, :multiple => true)
      # Remove the uri part.
      return ids.collect {|id| id.sub(/^info:gpo\//, '') }
    end
#get_identifier(type, sub_scheme, referent, options = {}) ⇒ Object
oclcnum and lccn are both supposed to be stored as identifiers with an info: uri: info:oclcnum/#, info:lccn/#. But SFX sometimes stores them in the referent metadata instead: rft.lccn, rft.oclcnum.
On the other hand, isbn and issn can legitimately be included in referent metadata or as a urn.
This method will find you an identifier across multiple places.
- type: :urn or :info
- sub_scheme: “lccn”, “oclcnum”, “isbn”, “issn”, or anything else that could be found in a urn, an info uri, or referent metadata
- referent: an umlaut Referent object
- options: pass :multiple => true to get an array of all matching identifiers instead of just the first
Returns nil if no identifier is found, otherwise the bare identifier (not formatted into a urn/uri right now; an option should maybe be added?).
    # File 'app/mixin_logic/metadata_helper.rb', line 180

    def get_identifier(type, sub_scheme, referent, options = {} )
      options[:multiple] ||= false

      raise Exception.new("type must be :urn or :info") unless type == :urn or type == :info

      prefix = case type
               when :info then "info:#{sub_scheme}/"
               when :urn then "urn:#{sub_scheme}:"
               end

      identifiers = referent.identifiers.collect {|id| $1 if id =~ /^#{prefix}(.*)/}.compact

      if ( identifiers.blank? && ['lccn', 'oclcnum', 'isbn', 'issn', 'doi', 'pmid'].include?(sub_scheme) )
        # try the referent metadata
        from_rft = referent.metadata[sub_scheme]
        identifiers = [from_rft] unless from_rft.blank?
      end

      if ( options[:multiple])
        return identifiers
      elsif ( identifiers[0].blank? )
        return nil
      else
        return identifiers[0]
      end
    end
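The prefix matching is easier to follow on a bare list of identifier strings. The sketch below assumes nothing about Umlaut itself: `bare_identifiers` is a hypothetical helper that reproduces only the first step (building the prefix and stripping it), not the fallback to referent metadata.

```ruby
# Hypothetical helper, not part of MetadataHelper: returns the bare values of
# all identifiers that start with the info:/urn: prefix for the sub-scheme.
def bare_identifiers(identifiers, type, sub_scheme)
  prefix = case type
           when :info then "info:#{sub_scheme}/"
           when :urn  then "urn:#{sub_scheme}:"
           else raise ArgumentError, "type must be :urn or :info"
           end
  # String#[] with a regexp and a group index returns the capture, or nil.
  identifiers.map { |id| id[/\A#{Regexp.escape(prefix)}(.*)/, 1] }.compact
end

ids = ["info:oclcnum/12345", "urn:isbn:0140449132", "info:lccn/2001012345"]

bare_identifiers(ids, :info, "oclcnum") # => ["12345"]
bare_identifiers(ids, :urn, "isbn")     # => ["0140449132"]
bare_identifiers(ids, :info, "isbn")    # => []
```

An empty result in the last call is the case where the real method goes on to look in the referent metadata.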
#get_isbn(rft) ⇒ Object
Gets isbn, also removes any weird stuff on the end sometimes included as ‘isbn’, but not part of the isbn. Like (paperback) and such.
    # File 'app/mixin_logic/metadata_helper.rb', line 252

    def get_isbn(rft)
      isbn = get_identifier(:urn, "isbn", rft)
      isbn = isbn.gsub(/[^\dX]/, '') if isbn
      return nil if isbn.blank?
      return isbn
    end
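A quick look at what the cleanup pattern keeps and drops. `clean_isbn` is a hypothetical name for the `gsub` step alone; the sample values are made up for illustration.

```ruby
# Hypothetical helper isolating the gsub step of get_isbn:
# keep digits and uppercase X, drop everything else.
def clean_isbn(raw)
  raw.gsub(/[^\dX]/, "")
end

clean_isbn("0-14-044913-2 (paperback)") # => "0140449132"
clean_isbn("0-8044-2957-X")             # => "080442957X"
# A lowercase check character is dropped, since only uppercase X is kept.
clean_isbn("080442957x")                # => "080442957"
```

Because any uppercase X survives, a trailing note that happens to contain one would leak into the result, so callers may want to check the length of what comes back.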
#get_issn(rft) ⇒ Object
Gets an ISSN, makes sure it’s a valid ISSN or else returns nil. So will return a valid ISSN (NOT empty string) or nil.
    # File 'app/mixin_logic/metadata_helper.rb', line 221

    def get_issn(rft)
      issn = rft.metadata['issn']
      issn = nil unless issn =~ /\d{4}(-)?\d{3}(\d|X)/
      return issn
    end
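The check in the listing is a shape test rather than a checksum test, which the following sketch makes visible. `issn_or_nil` is a hypothetical helper applying the same pattern to a plain string.

```ruby
# Same pattern as in get_issn: four digits, optional hyphen, three digits,
# then a digit or uppercase X. It is not anchored and checks no checksum.
ISSN_SHAPE = /\d{4}(-)?\d{3}(\d|X)/

# Hypothetical helper: returns the value unchanged if it matches, else nil.
def issn_or_nil(value)
  value =~ ISSN_SHAPE ? value : nil
end

issn_or_nil("0028-0836")      # => "0028-0836"
issn_or_nil("00280836")       # => "00280836"
issn_or_nil("0028-083x")      # => nil (lowercase x does not match)
issn_or_nil("ISSN 0028-0836") # => "ISSN 0028-0836" (unanchored, returned as-is)
```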
#get_lccn(rft) ⇒ Object
finds and normalizes an LCCN. If multiple LCCNs are in the record, returns the first one.
    # File 'app/mixin_logic/metadata_helper.rb', line 211

    def get_lccn(rft)
      lccn = get_identifier(:info, "lccn", rft)

      lccn = normalize_lccn(lccn)

      return lccn
    end
#get_month(rft) ⇒ Object
    # File 'app/mixin_logic/metadata_helper.rb', line 304

    def get_month(rft)
      if rft.metadata['date'] =~ /\d\d\d\d\-(\d\d?)/
        return $1
      elsif rft.metadata['month']
        # some link generators use an illegal 'month' parameter
        return rft.metadata['month']
      else
        return nil
      end
    end
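A small sketch of the date branch, using a hypothetical `month_from_date` helper on plain strings. Note that the month comes back as a string, exactly as written in the date, with no zero-padding added or removed.

```ruby
# Hypothetical helper isolating the regexp used by get_month.
def month_from_date(date)
  date =~ /\d\d\d\d\-(\d\d?)/ ? Regexp.last_match(1) : nil
end

month_from_date("2009-03-15") # => "03"
month_from_date("2009-3")     # => "3"
month_from_date("2009")       # => nil
```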
#get_oclcnum(rft) ⇒ Object
    # File 'app/mixin_logic/metadata_helper.rb', line 259

    def get_oclcnum(rft)
      return get_identifier(:info, "oclcnum", rft)
    end
#get_pmid(rft) ⇒ Object
    # File 'app/mixin_logic/metadata_helper.rb', line 267

    def get_pmid(rft)
      return get_identifier(:info, "pmid", rft)
    end
#get_search_creator(rft) ⇒ Object
Chooses the best available creator for the format.

    # File 'app/mixin_logic/metadata_helper.rb', line 135

    def get_search_creator(rft)
      # Just make one call to create metadata hash
      metadata = rft.metadata

      # Identify dc.creator query. Prefer aulast alone if available.
      creator = nil
      creator = metadata['aulast'] unless metadata['aulast'].blank?
      creator = metadata['au'] if creator.blank?

      # FIXME if capital letters are next to each other should we insert a space?
      # Should we assume capitals next to each other are initials?
      # Maybe only if we use au?
      # Logic like this makes refactoring to use Referent.to_citation less useful.

      # FIXME strip out commas from creator if we use au?

      return nil if creator.blank?
      return creator
    end
#get_search_terms(rft) ⇒ Object
DEPRECATED: not flexible enough, you really need to custom fit for your given target. Accepts a referent and returns a hash of common metadata elements, choosing for each the element available for the format and best suited to searching. A wrapper around the other methods.

    # File 'app/mixin_logic/metadata_helper.rb', line 18

    def get_search_terms(rft)
      title = get_search_title(rft)
      creator = get_search_creator(rft)

      # returns a hash of values so that more keys can be added
      # and not break services that use this module
      return {:title => title, :creator => creator}
    end
#get_search_title(rft, options = {}) ⇒ Object
Chooses the best available title for the format, normalizes.

    # File 'app/mixin_logic/metadata_helper.rb', line 121

    def get_search_title(rft, options = {})
      #defaults = {:remove_all_parens => true,
      #            :subtitle_on_semicolon => true,
      #            :remove_subtitle => true,
      #            :remove_punctuation => true}.merge(options)

      title = raw_search_title(rft)

      return normalize_title(title, options)
    end
#get_spage(rft) ⇒ Object
Uses `spage` or tries to parse `pages`.

    # File 'app/mixin_logic/metadata_helper.rb', line 316

    def get_spage(rft)
      if rft.metadata['spage'].present?
        return rft.metadata['spage']
      elsif rft.metadata['pages'] =~ /\A *(.*?) *\-.*\Z/
        return $1
      elsif rft.metadata['pages'].present?
        return rft.metadata['pages']
      else
        return nil
      end
    end
    # File 'app/mixin_logic/metadata_helper.rb', line 279

    def get_sudoc(rft)
      # Don't forget to unescape the sudoc that was escaped to make it a uri!

      # Option 1: In a technically illegal but oh well info:sudoc uri

      sudoc = get_identifier(:info, "sudoc", rft)
      sudoc = CGI.unescape(sudoc) if sudoc

      # Option 2: rsinger's purl for sudoc. http://dilettantes.code4lib.org/2009/03/a-uri-scheme-for-sudocs/
      unless sudoc
        sudoc = rft.identifiers.collect {|id| $1 if id =~ /^http:\/\/purl.org\/NET\/sudoc\/(.*)$/}.compact.slice(0)
        sudoc = CGI.unescape(sudoc) if sudoc
      end

      return sudoc
    end
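SuDoc numbers contain spaces and slashes, so they arrive percent-escaped inside a URI. A minimal sketch of the purl branch, assuming a made-up SuDoc value; `sudoc_from_purl` is a hypothetical helper and its pattern is slightly stricter than the one in the listing (the dot in purl.org is escaped).

```ruby
require "cgi"

# Hypothetical helper mirroring "Option 2" of get_sudoc for one identifier.
def sudoc_from_purl(identifier)
  escaped = identifier[%r{\Ahttp://purl\.org/NET/sudoc/(.*)\z}, 1]
  # CGI.unescape turns %20 back into a space and %2F back into a slash.
  escaped && CGI.unescape(escaped)
end

sudoc_from_purl("http://purl.org/NET/sudoc/Y%204.F%2076%2F1:AF%208%2F12")
# => "Y 4.F 76/1:AF 8/12"
sudoc_from_purl("info:oclcnum/12345")
# => nil
```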
#get_top_level_creator(rft) ⇒ Object
    # File 'app/mixin_logic/metadata_helper.rb', line 155

    def get_top_level_creator(rft)
      # If it's a non-journal thing, add the author if we have an aulast (preferred) or au.
      # But wait--if it's a book _part_, don't include the author name, since
      # it _might_ just be the author of the part, not of the book.
      unless (rft.format == "journal" ||
              ( rft.format == "book" && ! rft.metadata['atitle'].blank?))
        return get_search_creator(rft)
      end
      return nil
    end
#get_year(rft) ⇒ Object
    # File 'app/mixin_logic/metadata_helper.rb', line 296

    def get_year(rft)
      # Some link generators use an illegal 'year' parameter
      if (date = (rft['date'] || rft['year']))
        return date[0,4]
      end
      return nil
    end
#normalize_lccn(lccn) ⇒ Object
Some normalization. See: info-uri.info/registry/OAIHandler?verb=GetRecord&metadataPrefix=reg&identifier=info:lccn/. Doesn’t validate right now, only normalizes. TBD: raise an exception if the string is invalid.

    # File 'app/mixin_logic/metadata_helper.rb', line 231

    def normalize_lccn(lccn)
      if ( lccn )
        # remove whitespace
        lccn = lccn.gsub(/\s/, '')
        # remove any forward slashes and anything after them
        lccn = lccn.sub(/\/.*$/, '')
        # pad anything after a hyphen before removing hyphen, if necessary
        lccn = lccn.sub(/-(.*)/) do |match_str|
          if $1.length < 6
            ("0" * (6 - $1.length)) + $1
          else
            $1
          end
        end
      end
      return lccn
    end
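A few worked inputs make the three steps concrete. The sketch below is a condensed plain-Ruby equivalent under a hypothetical name, `normalize_lccn_sketch`, using `rjust` for the zero-padding; the sample LCCNs are illustrative.

```ruby
# Condensed, hypothetical equivalent of normalize_lccn.
def normalize_lccn_sketch(lccn)
  return nil if lccn.nil?
  # 1. remove whitespace; 2. cut at the first forward slash
  lccn = lccn.gsub(/\s/, "").sub(%r{/.*$}, "")
  # 3. replace "-suffix" with the suffix left-padded with zeros to 6 characters
  lccn.sub(/-(.*)/) { Regexp.last_match(1).rjust(6, "0") }
end

normalize_lccn_sketch("n 78-890351")   # => "n78890351"
normalize_lccn_sketch("85-2")          # => "85000002"
normalize_lccn_sketch("75-425165//r75") # => "75425165"
normalize_lccn_sketch("2001000002")    # => "2001000002"
```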
#normalize_title(arg_title, options = {}) ⇒ Object
A utility method to ‘normalize’ a title, for use when trying to match a title from one place with records in another database. Lowercases and removes punctuation, and also strips out a bunch of other things that may result in false negatives. Exactly how to do this for best results depends on the particular data you are working with; you need to experiment to see. MANY options are offered, although the defaults are somewhat sensible. Much of this especially takes account of titles that may have been generated from MARC. Will never return the empty string; will sometimes return nil.

    # File 'app/mixin_logic/metadata_helper.rb', line 38

    def normalize_title(arg_title, options = {})
      # default options
      options[:rstrip_parens] ||= true
      options[:remove_all_parens] ||= true
      options[:strip_gmd] ||= true
      options[:subtitle_on_semicolon] ||= false
      options[:remove_subtitle] ||= false
      options[:normalize_ampersand] ||= true
      options[:remove_punctuation] ||= true
      # Even if you're removing other punctuation, keep the apostrophes?
      options[:keep_apostrophes] ||= false

      return nil if arg_title.nil?

      title = arg_title.clone

      return nil if title.blank?

      # Sometimes titles given in the OpenURL have some additional stuff
      # in parens at the end, that messes up the search and isn't really
      # part of the title. Eliminate!
      title.gsub!(/\([^)]*\)\s*$/, '') if options[:rstrip_parens]
      # Or, not even just at the end, but anywhere!
      title.gsub!(/\([^)]*\)/, '') if options[:remove_all_parens]

      # Remove things in brackets, part of an AACR2 GMD that's made it in.
      # replace with ':' so we can keep track of the fact that everything
      # that came afterwards was a sub-title like thing.
      title = strip_gmd(title) if options[:strip_gmd]

      # There seems to be some cataloging/metadata disagreement about when to
      # use ';' for a subtitle instead of ':'. Normalize to ':'.
      title.sub!(/[\;]/, ':') if options[:subtitle_on_semicolon]

      title.sub!(/\:(.*)$/, '') if options[:remove_subtitle]

      # Change ampersands to 'and' for consistency, we see it both ways.
      title.gsub!(/\&/, ' and ') if options[:normalize_ampersand]

      # remove non-alphanumeric, excluding apostrophe
      title.gsub!(/[^[[:alnum:]][[:space:]]\']/, ' ') if options[:remove_punctuation]

      # apostrophe not to space, just eat it.
      title.gsub!(/[\']/, '') if options[:remove_punctuation] && ! options[:keep_apostrophes]

      # compress whitespace
      title.strip!
      title.gsub!(/\s+/, ' ')

      title.downcase!

      title = nil if title.blank?

      return title
    end
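To show roughly what comes out with the default options, here is a simplified plain-Ruby sketch. `normalize_title_sketch` is a hypothetical name; it hard-codes the defaults and skips the `strip_gmd` step (which lives in MarcHelper and is not shown on this page), so it is an approximation rather than the module's behavior.

```ruby
# Simplified, hypothetical version of normalize_title with default options only.
def normalize_title_sketch(title)
  return nil if title.nil?
  t = title.dup
  t.gsub!(/\([^)]*\)/, "")                # drop parenthesized text anywhere
  t.gsub!(/&/, " and ")                   # ampersand to 'and'
  t.gsub!(/[^[:alnum:][:space:]']/, " ")  # other punctuation to a space
  t.gsub!(/'/, "")                        # apostrophes are removed outright
  t = t.strip.gsub(/\s+/, " ").downcase   # compress whitespace, lowercase
  t.empty? ? nil : t
end

normalize_title_sketch("The Origin of Species (Illustrated ed.)") # => "the origin of species"
normalize_title_sketch("Pride & Prejudice: A Novel")              # => "pride and prejudice a novel"
normalize_title_sketch("Don't Panic!")                            # => "dont panic"
normalize_title_sketch("(untitled)")                              # => nil
```

One thing to watch in the listing: the options defaulted with `||= true` cannot be switched off by passing `false`, because `false ||= true` assigns `true`. Only the options that default to `false` can actually be changed by the caller.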
#raw_search_title(rft) ⇒ Object
Picks the title out of the OpenURL referent from the best element available, no normalization.

    # File 'app/mixin_logic/metadata_helper.rb', line 95

    def raw_search_title(rft)
      # Just make one call to create metadata hash
      metadata = rft.metadata
      title = nil
      if rft.format == 'journal' && metadata['atitle']
        title = metadata['atitle']
      elsif rft.format == 'book'
        title = metadata['btitle'] unless metadata['btitle'].blank?
        title = metadata['title'] if title.blank?
      # Well, if we don't know the format and we do have a title use that.
      # This might happen if we only have an ISBN to start and then enhance.
      # So should services like Amazon also enhance with a format, should
      # we simplify this method to not worry about format so much, or do we
      # keep this as is?
      elsif metadata['btitle']
        title = metadata['btitle']
      elsif metadata['title']
        title = metadata['title']
      elsif metadata['jtitle']
        title = metadata['jtitle']
      end

      return title
    end