Class: BentoSearch::EbscoHostEngine

Inherits:
Object
  • Object
show all
Extended by:
HTTPClientPatch::IncludeClient
Includes:
SearchEngine
Defined in:
app/search_engines/bento_search/ebsco_host_engine.rb

Overview

Right now for EbscoHost API (Ebsco Integration Toolkit/EIT), may be expanded or refactored for EDS too.

Required Configuration

  • profile_id

  • profile_password

  • databases: ARRAY of ebsco shortcodes of what databases to include in search. If you specify one you don’t have access to, you get an error message from ebsco, alas.

Limits

While waiting for a future bento_box generalized limit api, this engine accepts custom search arguments to apply limits:

:peer_reviewed_only

Set to boolean true or string ‘true’, to restrict results to peer-reviewed only. (Or ask EBSCOHost api to do so, what we get is what we get).

:pubyear_start
:pubyear_end

Date range limiting, pass in custom search args, one or both of pubyear_start and pubyear_end #to_i will be called on it, so can be string. .search(:query => “foo”, :pubyear_start => 2000)

:databases

List of licensed EBSCO dbs to search, can override list set in config databases, just for this search.

Custom response data

Iff EBSCO API reports that fulltext is available for the hit, then result.custom_data will be non-nil, and will be an array of one or more of EBSCO’s internal codes (P=PDF, T=HTML, C=HTML+Images). If no fulltext is avail according to EBSCO API, result.custom_data will be nil.

#link_is_fulltext also set to true/false

You can use this to, for instance, hyperlink the displayed title directly to record on EBSCO if and only if there’s fulltext. By writing a custom decorator. See wiki on decorators.

Limitations

We do set language of ResultItems based on what ebsco tells us, but ebsoc seems to often leave out language or say ‘english’ for things that are not (maybe cause abstract is in English?). Config variable to tell us to ignore language?

Note on including databases

Need to specifically configure all databases your institution licenses from EBSCO that you want included in the search. You can’t just say “all of them” the api doesn’t support that, and also more than 30 or 40 starts getting horribly slow. If you include a db you do not have access to, EBSCO api fatal errors.

You may want to make sure all your licensed databases are included in your EIT profile. Log onto ebscoadmin, Customize Services, choose EIT profile, choose ‘databases’ tag.

Download databases from EBSCO api

We include a utility to download ALL activated databases for EIT profile and generate a file putting them in a ruby array. You may want to use this file as a starting point, and edit by hand:

First configure your EBSCO search engine with bento_search, say under key ‘ebscohost’.

Then run:

rails generate bento_search:pull_ebsco_dbs ebscohost

assuming ‘ebscohost’ is the key you registered the EBSCO search engine.

This will create a file at ./config/ebsco_dbs.rb. You may want to hand edit it. Then, in your bento search config, you can:

require "#{Rails.root}/config/ebsco_dbs.rb"
BentoSearch.register_engine("ebscohost") do |conf|
   # ....
   conf.databases = $ebsco_dbs
end

Vendor documentation

Vendor documentation is a bit scattered, main page:

Some other useful pages we discovered:

Hard to find docs page on embedding EBSCO limiters (like peer reviewed only "RV Y") in search query:
   http://support.epnet.com/knowledge_base/detail.php?id=5397

EBSCO searchable support portal has a section for the EIT api we use here:
   http://support.epnet.com/knowledge_base/search.php?keyword=&interface_id=1082&document_type=&page_function=search

Constant Summary collapse

HttpTimeout =

Can’t change http timeout in config, because we keep an http client at class-wide level, and config is not class-wide. Change this ‘constant’ if you want to change it, I guess.

In some tests we did, 5.2s was 95th percentile slowest, but in actual percentage 5.2s is still timing out way too many requests, let’s try 6.3, why not.

6.3

Constants included from SearchEngine

SearchEngine::DefaultPerPage

Class Method Summary collapse

Instance Method Summary collapse

Methods included from HTTPClientPatch::IncludeClient

include_http_client

Methods included from SearchEngine

#display_configuration, #engine_id, #fill_in_search_metadata_for, #initialize, #normalized_search_arguments, #search

Methods included from SearchEngine::Capabilities

#search_keys, #semantic_search_keys, #semantic_search_map, #sort_keys

Class Method Details

.default_configurationObject



584
585
586
587
588
589
590
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 584

def self.default_configuration
  {
    # /Search
    :base_url => "http://eit.ebscohost.com/Services/SearchService.asmx",
    :databases => []
  }
end

.required_configurationObject



580
581
582
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 580

def self.required_configuration
  ["profile_id", "profile_password"]
end

Instance Method Details

#ebsco_query_escape(txt) ⇒ Object

escape or replace special chars to ebsco



308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 308

def ebsco_query_escape(txt)
  # it's unclear if ebsco API actually allows escaping of special chars,
  # or what the special chars are. But we know parens are special, can't
  # escape em, we'll just remove em (should not effect search).

  # undocumented but question mark seems to cause a problem for ebsco,
  # even inside quoted phrases, not sure why. Square brackets too.
  txt = txt.gsub(/[)(\?\[\]]/, ' ')

  # 'and' and 'or' need to be in phrase quotes to avoid being
  # interpreted as boolean. For instance, when people just
  # paste in a title: << A strategy for decreasing anxiety of ICU transfer patients and their families >>
  # You'd think 'and' as boolean would still work there, but it resulted
  # in zero hits unless quoted, I dunno. lowercase and uppercase and/or/not
  # both cause observed weirdness.
  if ['and', 'or', 'not'].include?( txt.downcase )
    txt = %Q{"#{txt}"}
  end

  return txt
end

#ebsco_query_prepare(txt) ⇒ Object

Actually turn the user’s query into an EBSCO “AND” boolean query, seems only way to get decent results where terms can match cross-fields at the moment, for EIT. We’ll see for EDS.



333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 333

def ebsco_query_prepare(txt)
  # use string split with regex cleverly to split into space
  # seperated terms and phrases, keeping phrases as unit.
  terms = txt.split %r{[[:space:]]+|("[^"]+")}

  # Remove parens in non-phrase-quoted terms
  terms = terms.collect do |t|
    ebsco_query_escape(t)
  end


  # Remove empty strings. Remove terms that are solely punctuation
  # without any letters.
  terms.delete_if do |term|
    (
      term.blank? ||
      term =~ /\A[^[[:alnum:]]]+\Z/
    )
  end

  terms.join(" AND ")
end

#fielded_query(query, field = nil) ⇒ Object



403
404
405
406
407
408
409
410
411
412
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 403

def fielded_query(query, field = nil)
  output = ebsco_query_prepare(query)

  # wrap in (FI $query) if fielded search
  if field.present?
    output= "(#{field} #{query})"
  end

  return output
end

#fulltext_formats(record_xml) ⇒ Object

pass in nokogiri record xml for the records/rec node. Returns nil if NO fulltext is avail on ebsco platform, non-nil if fulltext is available. Non-nil value will actually be a non-empty ARRAY of internal EBSCO codes, P=PDF, T=HTML, C=HTML with images. support.epnet.com/knowledge_base/detail.php?topic=996&id=3778&page=1



212
213
214
215
216
217
218
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 212

def fulltext_formats(record_xml)
  fulltext_formats = record_xml.xpath("./header/controlInfo/artinfo/formats/fmt/@type").collect {|n| n.text }

  return nil if fulltext_formats.empty?

  return fulltext_formats
end

#get(id) ⇒ Object

Method to get a single record by “identifier” string, which is really a combined “db:id” string, same string that would be returned by an individual item.identifier

Returns an individual BentoSearch::Result, or raises an exception. Can raise BentoSearch::NotFound, BentoSearch::TooManyFound, or any other weird random exception caused by problems fetching (network error etc. Is it bad that we don’t wrap these in an expected single exception type? Should we?)

Raises:

  • (ArgumentError)


188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 188

def get(id)
  # split on first colon only.
  id =~ /^([^:]+)\:(.*)$/
  db = $1 ; an = $2

  raise ArgumentError.new("EbscoHostEngine#get requires an id with a colon, like `a9h:12345`. Instead, we got #{id}") unless db && an

  # "AN" search_field is not listed in our search_field_definitions,
  # but it is an internal EBSCOHost search index on 'accession number'

  results = search(an, :search_field => "AN", :databases => [db])

  raise (results.error[:exception] || Exception.new) if results.failed?
  raise BentoSearch::NotFound.new("For id: #{id}") if results.length == 0
  raise BentoSearch::TooManyFound.new("For id: #{id}") if results.length > 1

  return results.first
end

#get_infoObject

This method is not used for normal searching, but can be used by other code to retrieve the results of the EBSCO API Info command, using connection details configured in this engine. The Info command can tell you what databases your account is authorized to see. Returns the complete Nokogiri response, but WITH NAMESPACES REMOVED



529
530
531
532
533
534
535
536
537
538
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 529

def get_info
  url =
    "#{configuration.base_url}/Info?prof=#{configuration.profile_id}&pwd=#{configuration.profile_password}"

  noko = Nokogiri::XML( http_client.get( url ).body )

  noko.remove_namespaces!

  return noko
end

pass in <rec> nokogiri, will determine best link



302
303
304
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 302

def get_link(xml)
  text_if_present(xml.at_xpath("./pdfLink")) || text_if_present(xml.at_xpath("./plink") )
end

#item_from_xml(xml_rec) ⇒ Object

pass in a nokogiri representing an EBSCO <rec> result, we’ll turn it into a BentoSearch::ResultItem.



416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 416

def item_from_xml(xml_rec)
  info = xml_rec.at_xpath("./header/controlInfo")

  item = BentoSearch::ResultItem.new

  # Get unique id. Think we need both the database code and accession
  # number combined, accession numbers not neccesarily unique accross
  # dbs. We'll combine with a colon.
  db                  = text_if_present xml_rec.at_xpath("./header/@shortDbName")
  accession           = text_if_present xml_rec.at_xpath("./header/@uiTerm")
  item.unique_id             = "#{db}:#{accession}" if db && accession


  item.link           = get_link(xml_rec)

  # EBSCO is somewhat inconsistent with where it puts the ISSN
  item.issn           = text_if_present(info.at_xpath("./jinfo/issn")) || text_if_present(info.at_xpath("./jinfo/jid[@type='issn']"))

  # Dealing with titles is a bit crazy, while articles usually have atitles and
  # jtitles, sometimes they have a btitle instead. A book will usually have
  # both btitle and atitle, but sometimes just atitle. Book chapter, oh boy.

  jtitle        = text_if_present(info.at_xpath("./jinfo/jtl"))
  btitle        = text_if_present info.at_xpath("./bkinfo/btl")
  atitle        = text_if_present info.at_xpath("./artinfo/tig/atl")

  if jtitle && atitle
    item.title          = atitle
    item.source_title   = jtitle
  elsif btitle && atitle && atitle != btitle
    # for a book, sometimes there's an atitle block and a btitle block
    # when they're identical, this ain't a book section, it's a book.
    item.title          = atitle
    item.source_title   = btitle
  else
    item.title  = atitle || btitle
  end
  # EBSCO sometimes has crazy long titles, truncate em.
  if item.title.present?
    item.title        = text_helper.truncate(item.title, :length => 200, :separator => ' ', :omission => '')
  end



  item.publisher      = text_if_present info.at_xpath("./pubinfo/pub")
  # if no publisher, but a dissertation institution, use that
  # as publisher.
  unless item.publisher
    item.publisher    = text_if_present info.at_xpath("./dissinfo/dissinst")
  end


  # Might have multiple ISBN's in record, just take first for now
  item.isbn           = text_if_present info.at_xpath("./bkinfo/isbn")

  item.year           = text_if_present info.at_xpath("./pubinfo/dt/@year")
  # fill in complete publication_date too only if we've got it.
  if (item.year &&
      (month = text_if_present info.at_xpath("./pubinfo/dt/@month")) &&
      (day = text_if_present info.at_xpath("./pubinfo/dt/@day"))
    )
    if (item.year.to_i != 0 && month.to_i != 0 && day.to_i != 0)
      item.publication_date = Date.new(item.year.to_i, month.to_i, day.to_i)
    end
  end

  item.volume         = text_if_present info.at_xpath("./pubinfo/vid")
  item.issue          = text_if_present info.at_xpath("./pubinfo/iid")




  item.start_page     = text_if_present info.at_xpath("./artinfo/ppf")

  item.doi            = text_if_present info.at_xpath("./artinfo/ui[@type='doi']")

  item.pmid           = text_if_present info.at_xpath("./artinfo/ui[@type='pmid']")

  item.abstract       = text_if_present info.at_xpath("./artinfo/ab")
  # EBSCO abstracts have an annoying habit of beginning with "Abstract:"
  if item.abstract
    item.abstract.gsub!(/^Abstract\: /, "")
  end

  # authors, only get full display name from EBSCO.
  info.xpath("./artinfo/aug/au").each do |author|
    a = BentoSearch::Author.new(:display => author.text)
    item.authors << a
  end

  item.format          = sniff_format info
  item.format_str      = sniff_format_str info

  # Totally unreliable, seems to report english for everything? Maybe
  # because abstracts are in english? Nevertheless we include for now.
  item.language_code   = text_if_present info.at_xpath("./language/@code")
  # why does EBSCO return 'undetermined' sometimes? That might as well be
  # not there, bah.
  item.language_code = nil if item.language_code == "und"

  # array of custom ebsco codes (or nil) for fulltext formats avail.
  item.custom_data["fulltext_formats"] = fulltext_formats xml_rec
  # if any fulltext format, mark present
  item.link_is_fulltext = item.custom_data["fulltext_formats"].present?

  return item
end

#max_per_pageObject



574
575
576
577
578
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 574

def max_per_page
  # Actually only '50' if you ask for 'full' records, but I don't think
  # we need to do that ever, that's actually getting fulltext back!
  200
end

#multi_field_search?Boolean

Returns:

  • (Boolean)


570
571
572
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 570

def multi_field_search?
  true
end

#public_settable_search_argsObject



540
541
542
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 540

def public_settable_search_args
  super + [:peer_reviewed_only, :pubyear_start, :pubyear_end]
end

#query_url(args) ⇒ Object



356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 356

def query_url(args)

  url =
    "#{configuration.base_url}/Search?prof=#{configuration.profile_id}&pwd=#{configuration.profile_password}"

  query = if args[:query].kind_of?(Hash)
    # multi-field query
    args[:query].collect {|field, query| fielded_query(query, field)}.join(" AND ")
  else
    fielded_query(args[:query], args[:search_field])
  end

  # peer-reviewed only?
  if [true, "true"].include? args[:peer_reviewed_only]
    query += " AND (RV Y)"
  end

  if args[:pubyear_start] || args[:pubyear_end]
    from = args[:pubyear_start].to_i
    from = nil if from == 0

    to = args[:pubyear_end].to_i
    to = nil if to == 0

    query += " AND (DT #{from}-#{to})"
  end


  url += "&query=#{CGI.escape query}"

  # startrec is 1-based for ebsco, not 0-based like for us.
  url += "&startrec=#{args[:start] + 1}" if args[:start]
  url += "&numrec=#{args[:per_page]}" if args[:per_page]

  # Make relevance our default sort, rather than EBSCO's date.
  args[:sort] ||= "relevance"
  url += "&sort=#{ sort_definitions[args[:sort]][:implementation]}"

  # Contrary to docs, don't pass these comma-seperated, pass em in seperate
  # query params. args databases overrides config databases.
  (args[:databases] || configuration.databases).each do |db|
    url += "&db=#{db}"
  end

  return url
end

#search_field_definitionsObject



553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 553

def search_field_definitions
  {
    nil     => {:semantic => :general},
    "AU"    => {:semantic => :author},
    "TI"    => {:semantic => :title},
    "SU"    => {:semantic => :subject},
    "IS"    => {:semantic => :issn},
    "IB"    => {:semantic => :isbn},
    "SO"    => {:semantic => :source_title},
    # These may not be defined in all databases....
    "VI"    => {:semantic => :volume},
    "IP"    => {:semantic => :issue},
    "SP"    => {:semantic => :start_page},
    "AF"    => {:semantic => :author_affiliation}
  }
end

#search_implementation(args) ⇒ Object



130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 130

def search_implementation(args)
  url = query_url(args)

  Rails.logger.debug("EbscoHostEngine Search for: #{url}")

  results = BentoSearch::Results.new
  xml, response, exception = nil, nil, nil

  begin
    response = http_client.get(url)
    xml = Nokogiri::XML(response.body)
  rescue BentoSearch::RubyTimeoutClass, HTTPClient::ConfigurationError, HTTPClient::BadResponseError, Nokogiri::SyntaxError  => e
      exception = e
  end
  # error handle
  if ( response.nil? ||
       xml.nil? ||
       exception ||
       (! HTTP::Status.successful? response.status) ||
       (fault = xml.at_xpath("./Fault")))

       results.error ||= {}
       results.error[:api_url] = url
       results.error[:exception] = exception if exception
       results.error[:status] = response.status if response

       if fault
         results.error[:error_info] = text_if_present fault.at_xpath("./Message")
       end

       return results
  end



  # the namespaces they provide are weird and don't help and sometimes
  # not clearly even legal. Remove em!
  xml.remove_namespaces!

  results.total_items = xml.at_xpath("./searchResponse/Hits").text.to_i

  xml.xpath("./searchResponse/SearchResults/records/rec").each do |xml_rec|
    results << item_from_xml( xml_rec )
  end

  return results

end

#sniff_format(xml_node) ⇒ Object

Figure out proper controlled format for an ebsco item. EBSCOHost (not sure about EDS) publication/document type are totally unusable non-normalized vocabulary for controlled types, we’ll try to guess from other metadata features.



235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 235

def sniff_format(xml_node)
  return nil if xml_node.nil?

  if xml_node.at_xpath("./dissinfo/*")
    :dissertation
  elsif xml_node.at_xpath("./jinfo/*") && xml_node.at_xpath("./artinfo/*")
    "Article"
  elsif xml_node.at_xpath("./dissinfo/disstl")
    :dissertation
  elsif xml_node.at_xpath("./bkinfo") && xml_node.at_xpath("./chapinfo")
    :book_item
  elsif xml_node.at_xpath("./bkinfo/btl") && xml_node.at_xpath("./artinfo/tig/atl") &&
      (text_if_present(xml_node.at_xpath "./bkinfo/btl") != text_if_present(xml_node.at_xpath "./artinfo/tig/atl"))
    # pathological case of book_item, if it has a bkinfo and an artinfo
    # but the titles in both sections MATCH, it's just a book. If they're
    # differnet, it's a book section, bah@
    :book_item
  elsif xml_node.at_xpath("./bkinfo/*")
    "Book"
  elsif xml_node.at_xpath("./jinfo/*")
    :serial
  else
    nil
  end
end

#sniff_format_str(xml_node) ⇒ Object

Figure out uncontrolled literal string format to show to users. We’re going to try combining Ebsco Publication Type and Document Type, when both are present. Then a few hard-coded special transformations.



264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 264

def sniff_format_str(xml_node)
  pubtype = text_if_present( xml_node.at_xpath("./artinfo/pubtype") )
  doctype = text_if_present( xml_node.at_xpath("./artinfo/doctype") )

  components = []
  components.push pubtype
  components.push doctype unless doctype == pubtype

  components.compact!

  components = components.collect {|a| a.titlecase if a}
  components.uniq! # no need to have the same thing twice


  # some hard-coded cases for better user-displayable string, and other
  # normalization.
  if ["Academic Journal", "Journal"].include?(components.first) && ["Article", "Journal Article"].include?(components.last)
    return "Journal Article"
  elsif components.last == "Book: Monograph"
    return "Book" # Book: Monograph what??
  elsif components.first == "Book Article"
    return "Book Chapter"
  elsif components.first == "Periodical" && components.length > 1
    return components.last
  elsif components.size == 2 && components.first.include?(components.last)
    # last is strict substring, don't need it
    return components.first
  elsif components.size == 2 && components.last.include?(components.first)
    # first is strict substring, don't need it
    return components.last
  end



  return components.join(": ")
end

#sort_definitionsObject

David Walker says pretty much only relevance and date are realiable in EBSCOhost cross-search.



546
547
548
549
550
551
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 546

def sort_definitions
  {
    "relevance" => {:implementation => "relevance"},
    "date_desc" => {:implementation => "date"}
  }
end

#text_helperObject

Include some rails helpers, text_helper.trucate



122
123
124
125
126
127
128
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 122

def text_helper
  @@truncate ||= begin
    o = Object.new
    o.extend ActionView::Helpers::TextHelper
    o
  end
end

#text_if_present(node) ⇒ Object

Pass in a nokogiri node, return node.text, or nil if arg was nil or node.text was blank?



223
224
225
226
227
228
229
# File 'app/search_engines/bento_search/ebsco_host_engine.rb', line 223

def text_if_present(node)
  if node.nil? || node.text.blank?
    nil
  else
    node.text
  end
end