Module: Whitepaper::Engine::Google
Defined in: lib/whitepaper/engine/google.rb
Overview
This engine uses Google filetype:pdf and filetype:ps searches to find paper information.
Class Method Summary
- .find(url) ⇒ Object
Returns the first search result on the page at url as a hash with keys :url, :title, and :authors, or nil if the page has no results.
- .find_by_title(title) ⇒ Object
Builds a Whitepaper::Paper by searching for PDF and PostScript files that match the given title keywords.
- .score(title, keywords) ⇒ Object
Computes a preliminary relevance score for a result title against the search keywords. TODO: move into its own class for Whitepaper::Paper.
Class Method Details
.find(url) ⇒ Object
Returns the first search result on the page at url as a hash with keys :url, :title, and :authors, or nil if the page has no results.
    # File 'lib/whitepaper/engine/google.rb', line 11

    def find(url)
      @agent = Mechanize.new

      page = @agent.get url
      results = page.search '//h3[@class="r"]'
      urls = results.map do |r|
        a = r.search './a'

        # sanitize
        url = a.attribute "href"
        url = url.to_s.match(/\/url\?q=([^&]+)&/)[1]
        title = a.first.content

        authors = r.search '../div[@class="s"]/span[@class="f"]'
        authors = authors.map do |e|
          e.content.to_s
        end

        {:url => url, :title => title, :authors => authors}
      end

      if urls.length > 0
        urls.first
      else
        nil
      end
    end
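The "sanitize" step in find relies on each result link being a Google redirect of the form /url?q=<target>&..., and it raises if the pattern does not match. The following is a minimal sketch of that extraction in isolation. The helper name extract_target is hypothetical and not part of Whitepaper, and the percent-decoding step is an added assumption; the method shown above returns the captured text undecoded.

```ruby
require "uri"

# Hypothetical helper (not part of Whitepaper): pull the target URL out of a
# Google redirect href such as "/url?q=http://example.org/paper.pdf&sa=U".
def extract_target(href)
  # Same pattern as in find; note that it requires a trailing "&".
  match = href.to_s.match(/\/url\?q=([^&]+)&/)
  # Return nil instead of raising when the href is not a redirect link.
  return nil unless match

  # Assumption: the target may be percent-encoded, so decode it once.
  URI.decode_www_form_component(match[1])
end

puts extract_target("/url?q=http://example.org/paper.pdf&sa=U")
puts extract_target("/search?q=unrelated").inspect
```

Returning nil for non-matching links keeps one unexpected href from aborting the whole result list.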
.find_by_title(title) ⇒ Object
Builds a Whitepaper::Paper by searching for PDF and PostScript files that match the given title keywords.
    # File 'lib/whitepaper/engine/google.rb', line 45

    def find_by_title(title)
      pdf = find("https://www.google.com/search?q=#{URI::encode(title)}+filetype%3Apdf")
      ps = find("https://www.google.com/search?q=#{URI::encode(title)}+filetype%3Aps")

      pdf_urls = []
      ps_urls = []

      pdf_score = score(pdf[:title], title)
      ps_score = score(ps[:title], title)

      if pdf and pdf_score >= ps_score
        pdf_urls << pdf[:url]
      end

      if ps and ps_score >= pdf_score
        ps_urls << ps[:url]
      end

      Paper.new(pdf[:title], [], {:pdf_urls => pdf_urls, :ps_urls => ps_urls})
    end
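Two details of find_by_title are worth illustrating. URI::encode was removed in Ruby 3.0, and pdf[:title] is read before the nil checks, so a search with no results raises. The sketch below shows one possible way to handle both. The helper names search_url and pick_urls are hypothetical and not part of Whitepaper, and URI.encode_www_form_component is a swapped-in replacement for URI::encode, not what the library calls.

```ruby
require "uri"

# Hypothetical helper: build the search URL for one file type.
# URI.encode_www_form_component turns spaces into "+" and ":" into "%3A".
def search_url(title, filetype)
  query = URI.encode_www_form_component("#{title} filetype:#{filetype}")
  "https://www.google.com/search?q=#{query}"
end

# Hypothetical helper: choose which result URLs to keep. A result is kept
# when it exists and scores at least as high as the other format, so a tie
# keeps both. Callers are assumed to pass 0.0 as the score of a nil result.
def pick_urls(pdf, ps, pdf_score, ps_score)
  pdf_urls = (pdf && pdf_score >= ps_score) ? [pdf[:url]] : []
  ps_urls = (ps && ps_score >= pdf_score) ? [ps[:url]] : []
  { :pdf_urls => pdf_urls, :ps_urls => ps_urls }
end

puts search_url("deep learning", "pdf")
puts pick_urls({ :url => "http://example.org/a.pdf" }, nil, 100.0, 0.0).inspect
```

The selection rule is the same as in the method above; only the nil handling and the URL encoding differ.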
.score(title, keywords) ⇒ Object
Computes a preliminary relevance score for a result title against the search keywords. TODO: move into its own class for Whitepaper::Paper.
    # File 'lib/whitepaper/engine/google.rb', line 70

    def score(title, keywords)
      keywords = keywords.split(" ").map(&:strip).map(&:downcase)
      title_words = title.split(" ").map(&:strip).map(&:downcase)

      score = 1.0

      # found words are worth x10
      # not found words are worth /2

      keywords.each do |k|
        if title_words.include? k
          score *= 10.0
        end
      end

      title_words.each do |k|
        unless keywords.include? k
          score /= 2.0
        end
      end

      score
    end
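To make the scoring rule concrete, here is a standalone copy of the same arithmetic with a few worked inputs. It is a sketch for illustration, written as a top-level method rather than as part of the module. The example titles are made up.

```ruby
# Standalone copy of the scoring rule: start at 1.0, multiply by 10 for each
# keyword found in the title, divide by 2 for each title word that is not a
# keyword. Comparison is case-insensitive and splits on single spaces.
def score(title, keywords)
  keywords = keywords.split(" ").map(&:strip).map(&:downcase)
  title_words = title.split(" ").map(&:strip).map(&:downcase)

  score = 1.0
  keywords.each { |k| score *= 10.0 if title_words.include?(k) }
  title_words.each { |w| score /= 2.0 unless keywords.include?(w) }
  score
end

# Two keywords match (x100); "is", "all", "you" are extra words (/8).
puts score("Attention Is All You Need", "attention need") # prints 12.5
# Exact match regardless of case: two hits, no extra words.
puts score("deep learning", "Deep Learning")               # prints 100.0
# No keyword matches; two extra title words.
puts score("foo bar", "baz")                               # prints 0.25
```

Because every extra title word halves the score, long titles are penalized even when all keywords match.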