Class: BentoSearch::WorldcatSruDcEngine
- Inherits:
-
Object
- Object
- BentoSearch::WorldcatSruDcEngine
- Extended by:
- HTTPClientPatch::IncludeClient
- Includes:
- SearchEngine
- Defined in:
- app/search_engines/bento_search/worldcat_sru_dc_engine.rb
Overview
Attempt to search using the WorldCat Search SRU variant, asking API for results in DC format. We’ll see how far this takes us.
Does require an API key, and requires OCLC membership/FirstSearch subscription for access.
link is set to a worldcat.org link. Change the linking_base_url config to, say, link to a WorldCat Local instance.
Limitations
The WorldCat SRU API provides _very little_ usable data on format/type. We provide some limited heuristics to try and clean up what IS there, but the user-displayable format_str may be weird sometimes (and is frequently ‘Text’), and the machine-readable semantic #format often defaults to “Book”, which may not always be right.
WorldCat doesn’t let you paginate past start_record 9999. If the client asks for more, this engine will silently reset to 9999.
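The pagination cap can be illustrated with a minimal standalone sketch. It mirrors the clamping described above outside the engine; `MAX_START_RECORD` and `clamp_pagination` are hypothetical names used only for illustration, and the fallback page size of 10 is the one the engine itself uses.

```ruby
# Hypothetical stand-in for the engine's MaxStartRecord constant.
MAX_START_RECORD = 9999

# Clamp a 0-based :start offset so the 1-based startRecord sent to
# WorldCat never exceeds the cap. Mutates and returns the args hash.
def clamp_pagination(args)
  per_page = args[:per_page] || 10
  if args[:start] && args[:start] > MAX_START_RECORD - 1
    args[:start] = MAX_START_RECORD - 1
    args[:page]  = (args[:start] / per_page) + 1
  end
  args
end

args = clamp_pagination({ start: 20_000, per_page: 10 })
# args[:start] is now the 0-based offset 9998, which is sent as startRecord=9999
```

Because the hash is mutated in place, the caller sees the clamped values afterwards, which is why pagination appears to sit at the maximum.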
API Docs
-
oclc.org/developer/documentation/worldcat-search-api/using-api
-
oclc.org/developer/documentation/worldcat-search-api/parameters
-
oclc.org/developer/documentation/worldcat-search-api/service-levels
-
oclc.org/developer/documentation/worldcat-search-api/complete-list-indexes
Required configuration keys
-
api_key
Optional configuration keys
- frbrGrouping
-
default nil, use worldcat default (which is ‘on’). See oclc.org/developer/documentation/worldcat-search-api/parameters for meaning of frbrGrouping. set to true or false.
- auth
-
default false. Set to true to assume all users are authenticated and servicelevel=full for OCLC.
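The three states of frbrGrouping (nil, true, false) can be sketched as follows. This is an illustration only, assuming the setting is translated into an `on`/`off` URL parameter as the engine source does; `frbr_grouping_param` is a hypothetical helper name.

```ruby
# Hypothetical helper: nil means "send nothing, let WorldCat use its default".
def frbr_grouping_param(setting)
  return "" if setting.nil?
  "&frbrGrouping=#{setting ? 'on' : 'off'}"
end

frbr_grouping_param(nil)    # no parameter is added
frbr_grouping_param(false)  # grouping explicitly switched off
```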
Extra search args
- auth
-
default false. Set to true to specify current user is authenticated and servicelevel=full for OCLC. Overrides config ‘auth’ value.
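The precedence between the search arg and the config value can be sketched in isolation. The helper names below are hypothetical; the rule they illustrate (search arg wins unless it is nil) is the one stated above.

```ruby
# Hypothetical helper: a per-search :auth value wins; nil falls back to config.
def effective_auth(search_arg_auth, config_auth)
  search_arg_auth.nil? ? config_auth : search_arg_auth
end

# Hypothetical helper: only an authenticated search asks for the full service level.
def service_level_param(search_arg_auth, config_auth)
  effective_auth(search_arg_auth, config_auth) ? "&servicelevel=full" : ""
end

service_level_param(nil, true)    # config applies
service_level_param(false, true)  # explicit false overrides config
```

Note that an explicit `false` is distinct from `nil`, so a single search can opt out even when the engine is configured with auth set to true.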
Constant Summary
- MaxStartRecord =
at least as of Sep 2012, worldcat errors if you ask for pagination beyond this
9999
Constants included from SearchEngine
Class Method Summary
Instance Method Summary
-
#construct_cql_query(args) ⇒ Object
construct valid CQL for the API’s “query” param, from search args.
-
#construct_query_url(args) ⇒ Object
Note, if pagination start record is beyond what we think is worldcat’s max, it will silently reset to max, and mutate the args passed in so pagination appears to be at max too!
-
#fielded_cql_query(query, field = nil) ⇒ Object
construct valid CQL for the API’s “query” param, from search args.
- #first_text_if_present(node, xpath) ⇒ Object
-
#format_heuristics(record_xml) ⇒ Object
input is a nokogiri node for a recordData/oclcdcs representing a hit.
-
#get(id) ⇒ Object
get a single record, by its #unique_id (which is also an oclcnum), returns record, or raises BentoSearch::NotFound, BentoSearch::TooManyFound, or possibly something weird.
- #max_per_page ⇒ Object
- #multi_field_search? ⇒ Boolean
-
#search_field_definitions ⇒ Object
WorldCat offers more search fields than this, this is what we think is useful right now.
- #search_implementation(args) ⇒ Object
-
#sort_definitions ⇒ Object
date sort seems to work pretty terribly on worldcat.
-
#xpath_contains(node, xpath, text) ⇒ Object
if `node` has an `xpath` whose text() contains `text`.
Methods included from HTTPClientPatch::IncludeClient
Methods included from SearchEngine
#display_configuration, #engine_id, #fill_in_search_metadata_for, #initialize, #normalized_search_arguments, #public_settable_search_args, #search
Methods included from SearchEngine::Capabilities
#search_keys, #semantic_search_keys, #semantic_search_map, #sort_keys
Class Method Details
.default_configuration ⇒ Object
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 343

def self.default_configuration
  {
    :base_url => "http://www.worldcat.org/webservices/catalog/search/sru?",
    :linking_base_url => "http://worldcat.org/oclc/",
    :auth => false
  }
end
.required_configuration ⇒ Object
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 339

def self.required_configuration
  [:api_key]
end
Instance Method Details
#construct_cql_query(args) ⇒ Object
construct valid CQL for the API’s “query” param, from search args. Tricky because we need to split terms/phrases ourselves
returns CQL that is NOT uri escaped yet.
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 265

def construct_cql_query(args)
  if args[:query].kind_of?(Hash)
    # multi-field
    args[:query].collect {|field, query| fielded_cql_query(query, field)}.join(" AND ")
  else
    fielded_cql_query(args[:query], args[:search_field] || "srw.kw")
  end
end
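The multi-field branch can be sketched on its own. This is a simplified illustration: `clause` is a hypothetical stand-in that quotes the whole value as one term, whereas the engine's #fielded_cql_query also splits terms and phrases.

```ruby
# Hypothetical, simplified stand-in for the engine's per-field clause builder.
def clause(field, term)
  %Q{#{field} = "#{term}"}
end

# A Hash query maps field => terms and is ANDed together;
# a plain String query goes to the keyword index (srw.kw).
def multi_field_cql(query)
  if query.is_a?(Hash)
    query.map { |field, term| clause(field, term) }.join(" AND ")
  else
    clause("srw.kw", query)
  end
end

multi_field_cql({ "srw.ti" => "hamlet", "srw.au" => "shakespeare" })
```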
#construct_query_url(args) ⇒ Object
Note, if pagination start record is beyond what we think is worldcat’s max, it will silently reset to max, and mutate the args passed in so pagination appears to be at max too!
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 164

def construct_query_url(args)
  url = configuration.base_url
  url += "&wskey=#{CGI.escape configuration.api_key}"
  url += "&recordSchema=#{CGI.escape 'info:srw/schema/1/dc'}"

  url += "&maximumRecords=#{args[:per_page]}" if args[:per_page]

  # pagination, WorldCat 'start' is 1-based, ours is 0-based. Catch max.
  if args[:start] && args[:start] > (MaxStartRecord-1)
    args[:start] = MaxStartRecord - 1
    args[:page] = (args[:start] / (args[:per_page] || 10)) + 1
  end
  url += "&startRecord=#{args[:start] + 1}" if args[:start]

  url += "&query=#{CGI.escape construct_cql_query(args)}"

  if (args[:sort]) && (value = sort_definitions[args[:sort]].try {|h| h[:implementation]})
    url += "&sortKeys=#{CGI.escape value}"
  end

  unless configuration.frbrGrouping.nil?
    value = configuration.frbrGrouping ? "on" : "off"
    url += "&frbrGrouping=#{value}"
  end

  # service level? search arg over-rides config
  auth = args[:auth]
  auth = configuration.auth if auth.nil?
  if auth
    url += "&servicelevel=full"
  end

  return url
end
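To see what the escaped URL roughly looks like, here is a minimal sketch of the core parameters only. `sketch_query_url` is a hypothetical helper and `EXAMPLE_KEY` is a placeholder, not a real API key; sorting, frbrGrouping and service level are left out.

```ruby
require "cgi"

# Hypothetical helper covering only wskey, recordSchema, paging and query.
def sketch_query_url(base_url, api_key, cql, per_page: nil, start: nil)
  url = base_url.dup
  url << "&wskey=#{CGI.escape(api_key)}"
  url << "&recordSchema=#{CGI.escape('info:srw/schema/1/dc')}"
  url << "&maximumRecords=#{per_page}" if per_page
  url << "&startRecord=#{start + 1}" if start # WorldCat is 1-based, ours is 0-based
  url << "&query=#{CGI.escape(cql)}"
  url
end

url = sketch_query_url(
  "http://www.worldcat.org/webservices/catalog/search/sru?",
  "EXAMPLE_KEY",
  'srw.kw = "hamlet"',
  per_page: 10,
  start: 0
)
```

The CQL string is built unescaped and escaped exactly once here, which matches the note that #construct_cql_query returns CQL that is not yet URI-escaped.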
#fielded_cql_query(query, field = nil) ⇒ Object
construct valid CQL for the API’s “query” param, from search args. Tricky because we need to split terms/phrases ourselves
returns CQL that is NOT uri escaped yet.
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 278

def fielded_cql_query(query, field = nil)
  # default is srw.kw, Keyword anywhere.
  field ||= "srw.kw"

  # We need to split terms and phrases, so we can formulate
  # CQL with separate clauses for each, bah.
  tokens = query.split(%r{\s|("[^"]+")}).delete_if {|a| a.blank?}

  return tokens.collect do |token|
    quoted_token = nil
    if token =~ /^".*"$/
      # phrase
      quoted_token = token
    else
      # escape internal double quotes with single backslash. sorry ruby escaping
      # makes this crazy.
      token = token.gsub('"', %Q{\\"})
      quoted_token = %Q{"#{token}"}
    end

    "#{field} = #{quoted_token}"
  end.join(" AND ")
end
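The term/phrase splitting can be tried outside Rails with a small sketch. `fielded_cql_sketch` is a hypothetical name; it follows the same approach as the method above but uses plain Ruby (`strip.empty?`) in place of ActiveSupport's `blank?`.

```ruby
# Hypothetical plain-Ruby sketch of the tokenizing approach.
def fielded_cql_sketch(query, field = "srw.kw")
  # Split on whitespace, but keep double-quoted phrases as single tokens.
  tokens = query.split(/\s|("[^"]+")/).reject { |t| t.strip.empty? }

  tokens.map do |token|
    quoted =
      if token =~ /\A".*"\z/
        token # already a quoted phrase
      else
        # backslash-escape any stray double quote, then wrap in quotes
        %Q{"#{token.gsub('"') { '\\"' }}"}
      end
    "#{field} = #{quoted}"
  end.join(" AND ")
end

puts fielded_cql_sketch('cancer "lung disease" therapy')
```

Each bare word and each quoted phrase becomes its own clause on the same index, joined with AND.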
#first_text_if_present(node, xpath) ⇒ Object
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 250

def first_text_if_present(node, xpath)
  node.at_xpath(xpath).try {|n| n.text}
end
#format_heuristics(record_xml) ⇒ Object
input is a nokogiri node for a recordData/oclcdcs representing a hit. (with namespaces stripped).
output is [format, format_str], based on rough heuristics. OCLC does not provide particularly useful data here for either user-display passthrough OR semantics; this is inherently flawed but better than nothing.
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 207

def format_heuristics(record_xml)
  # default semantic format to "Book", it'll sometimes be wrong,
  # but right more often than it's wrong when we lack sufficient
  # info to know otherwise.
  format = "Book"
  # user display string, default to none, unless we come up with something.
  format_str = nil

  if xpath_contains(record_xml, "./subject", "--Periodicals")
    # if a subject includes "--Periodicals", we're going to guess it's
    # a serial/journal.
    format = :serial
    format_str = "Journal or Serial"
  elsif record_xml.xpath("./type[text()='Image']").length > 0
    # "Image" can mean video OR actual images, only thing we
    # can do really for user-presentable format is use the terrible "./format",
    # which will often tell the user more (along with a bunch of weird stuff).
    format_str = first_text_if_present(record_xml, "./format")
  elsif record_xml.xpath("./type[text()='Sound']").length > 0
    # No great thing to display to user to say what this really is,
    # but at least we know it's Sound.
    format_str = first_text_if_present(record_xml, "./format") || "Sound"
    format = "AudioObject"
  elsif record_xml.xpath("./description").find {|node| node.text =~ /^Thesis \([^)]+\)--/}
    # yeah, to tag it as a dissertation we've got to heuristically regex
    # a description value for looking like a thesis label.
    format = :dissertation
    format_str = "Dissertation/Thesis"
  elsif (type = first_text_if_present(record_xml, "./type"))
    # defaults,
    # If we have a type, titleize it to change things like MovingImage to
    # 'Moving Image'.
    format_str = type.titleize
  else
    # if we don't even have a 'type', use the 'format' if it's there,
    # even though it's gonna be weird.
    format_str = first_text_if_present(record_xml, "format")
  end

  return [format, format_str]
end
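Two of these heuristics are plain string tests and can be sketched without XML. The helpers below are hypothetical and take arrays of already-extracted text values; the sample strings are made up for illustration.

```ruby
# Same pattern the engine applies to description values.
THESIS_LABEL = /^Thesis \([^)]+\)--/

# Hypothetical helper: true if any description looks like a thesis label.
def looks_like_thesis?(descriptions)
  descriptions.any? { |text| text =~ THESIS_LABEL }
end

# Hypothetical helper: true if any subject carries the "--Periodicals" subdivision.
def looks_like_serial?(subjects)
  subjects.any? { |text| text.include?("--Periodicals") }
end

looks_like_thesis?(["Thesis (Ph. D.)--Example University, 2010."])
looks_like_serial?(["Chemistry--Periodicals."])
```

Both checks depend on cataloging conventions in the text, so they can miss records that are phrased differently.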
#get(id) ⇒ Object
get a single record, by its #unique_id (which is also an oclcnum), returns record, or raises BentoSearch::NotFound, BentoSearch::TooManyFound, or possibly something weird.
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 151

def get(id)
  results = search(id, :semantic_search_field => :oclcnum)

  raise (results.error[:exception] || Exception.new(results.error)) if results.failed?
  raise BentoSearch::NotFound.new("ID: #{id}") if results.total_items == 0
  # interpolate the local id; the constant ID is undefined here
  raise BentoSearch::TooManyFound.new("ID: #{id}") if results.total_items > 1

  return results.first
end
#max_per_page ⇒ Object
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 335

def max_per_page
  100
end
#multi_field_search? ⇒ Boolean
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 351

def multi_field_search?
  true
end
#search_field_definitions ⇒ Object
WorldCat offers more search fields than this, this is what we think is useful right now. Some WorldCat search fields are only available at ‘full’ service level, but we think all the ones we’re listing now are available even at ‘default’ service level.
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 320

def search_field_definitions
  {
    nil => {:semantic => :general},
    "srw.ti" => {:semantic => :title},
    "srw.au" => {:semantic => :author},
    "srw.su" => {:semantic => :subject},
    "srw.bn" => {:semantic => :isbn},
    "srw.in" => {:semantic => :issn},
    "srw.dn" => {:semantic => :lccn},
    # generic 'number', probably not useful
    "srw.sn" => {:semantic => :number},
    "srw.no" => {:semantic => :oclcnum}
  }
end
#search_implementation(args) ⇒ Object
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 55

def search_implementation(args)
  url = construct_query_url(args)

  results = BentoSearch::Results.new

  response = http_client.get(url)

  # check for http errors
  if response.status != 200
    results.error ||= {}
    results.error[:status] = response.status
    results.error[:info] = response.body
    results.error[:url] = url

    return results
  end

  xml = Nokogiri::XML(response.body)
  # namespaces only get in the way
  xml.remove_namespaces!

  results.total_items = xml.at_xpath("//numberOfRecords").try {|n| n.text.to_i }

  # check for SRU fatal errors, no results AND a diagnostic message
  # is a fatal error always, I think.
  if (results.total_items == 0 &&
      error_xml = xml.at_xpath("./searchRetrieveResponse/diagnostics/diagnostic"))
    results.error ||= {}
    results.error[:info] = error_xml.children.to_xml
  end

  (xml.xpath("/searchRetrieveResponse/records/record/recordData/oclcdcs") || []).each do |record|
    item = BentoSearch::ResultItem.new

    item.title = first_text_if_present record, "title"

    # May have one (or more?) 'creator' and one or more 'contributor'.
    # We'll use just creators if we got em, else contributors.
    authors = record.xpath("./creator")
    authors = record.xpath("./contributor") if authors.empty?
    authors.each do |auth_node|
      item.authors << BentoSearch::Author.new(:display => auth_node.text)
    end

    # date may have garbage in it, just take the first four digits
    item.year = record.at_xpath("date").try do |date_node|
      date_node.text =~ /(\d{4})/ ? $1 : nil
    end

    # weird garbled from MARC format, best we have
    (item.format, item.format_str) = format_heuristics(record)

    item.publisher = first_text_if_present record, "publisher"

    # OCLC DC format gives us a bunch of jumbled 'description' elements
    # with any Marc 5xx. Sigh. We'll just concat em all and call it an
    # abstract, best we can do.
    item.abstract = record.xpath("description").collect {|n| n.text}.join("... \n")

    # dc.identifier is a terrible smorgasbord of different identifiers,
    # with no way to tell for sure what's what other than pattern matching
    # of literals. sigh.
    if ( id = first_text_if_present(record, "identifier"))
      possible_isxn = id.scan(/\d|X/).join('')
      # we could test check digit validity, but we ain't
      if possible_isxn.length == 10 || possible_isxn.length == 13
        item.isbn = possible_isxn
      elsif possible_isxn.length == 8
        item.issn = possible_isxn
      end
    end

    # The recordIdentifier with no "xsi:type" attrib is an oclcnum. sigh.
    # lccn may also be in there if we wanted to keep it.
    item.oclcnum = first_text_if_present(record, "./recordIdentifier[not(@type)]")
    # oclcnum is our engine-specific unique id too.
    item.unique_id = item.oclcnum

    item.link = "#{configuration.linking_base_url}#{item.oclcnum}"

    item.language_code = first_text_if_present record, "./language[@type='http://purl.org/dc/terms/ISO639-2']"

    results << item
  end

  return results
end
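The identifier and year handling are pure string operations, so they can be sketched without an XML document. `classify_identifier` and `extract_year` are hypothetical helper names; like the engine, the sketch goes by length only and does not validate check digits.

```ruby
# Hypothetical helper: guess ISBN vs ISSN from a dc:identifier string.
def classify_identifier(id)
  digits = id.scan(/\d|X/).join
  case digits.length
  when 10, 13 then { isbn: digits }
  when 8      then { issn: digits }
  else {}
  end
end

# Hypothetical helper: first run of four digits in a messy date string, or nil.
def extract_year(date_text)
  date_text =~ /(\d{4})/ ? $1 : nil
end

classify_identifier("0-306-40615-2")
classify_identifier("ISSN 0028-0836")
extract_year("c1999.")
```

Any other identifier that happens to contain 8, 10 or 13 digits would be misclassified, which is the trade-off the source comments already acknowledge.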
#sort_definitions ⇒ Object
date sort seems to work pretty terribly on worldcat. Author, Title, and “Score” (don’t know what that is) also avail on worldcat, asc and desc, but we aren’t advertising here, cause, who needs em.
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 308

def sort_definitions
  {
    "relevance" => {:implementation => "relevance"},
    "date_desc" => {:implementation => "Date,,0"},
    "library_count_desc" => {:implementation => "Library Count,,0"}
  }
end
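How a sort key turns into the `sortKeys` URL parameter can be sketched as below. `SORT_DEFINITIONS` copies the values above into a hypothetical constant, and `sort_keys_param` is a hypothetical helper; unknown or missing sort keys simply add nothing.

```ruby
require "cgi"

# Copy of the engine's sort definitions, as a hypothetical constant.
SORT_DEFINITIONS = {
  "relevance"          => { implementation: "relevance" },
  "date_desc"          => { implementation: "Date,,0" },
  "library_count_desc" => { implementation: "Library Count,,0" }
}

# Hypothetical helper: escaped sortKeys parameter, or "" for unknown keys.
def sort_keys_param(sort)
  value = SORT_DEFINITIONS.dig(sort, :implementation)
  value ? "&sortKeys=#{CGI.escape(value)}" : ""
end

sort_keys_param("library_count_desc")
```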
#xpath_contains(node, xpath, text) ⇒ Object
if `node` has an `xpath` whose text() contains `text`.
uses some tricky xpath, may not work with unusual xpath passed in
# File 'app/search_engines/bento_search/worldcat_sru_dc_engine.rb', line 256

def xpath_contains(node, xpath, text)
  node.xpath(xpath).xpath("./text()[contains(.,'#{text}')]").length > 0
end