Module: Whitepaper::Engine::Google

Defined in:
lib/whitepaper/engine/google.rb

Overview

This engine finds paper information by running Google searches restricted with the filetype:pdf and filetype:ps operators.

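Class Method Summary

.find(url) ⇒ Object
Returns the first search result as a hash, or nil if there are none.

.find_by_title(title) ⇒ Object
Finds a Whitepaper::Paper by title keywords.

.score(title, keywords) ⇒ Object
Computes a rough relevance score for a title against the keywords.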

Class Method Details

.find(url) ⇒ Object

Fetches the Google results page at url and returns the first result as a hash with keys :url, :title, and :authors, or nil if the page contains no results.



# File 'lib/whitepaper/engine/google.rb', line 11

def find(url)
  @agent = Mechanize.new

  page = @agent.get url

  # The XPath selects the <h3 class="r"> heading of each result.
  results = page.search '//h3[@class="r"]'

  urls = results.map do |r|
    a = r.search './a'
    next if a.empty?

    # Result links have the form /url?q=<target>&...; keep only the target.
    match = a.attribute("href").to_s.match(/\/url\?q=([^&]+)&/)
    next unless match

    title = a.first.content

    author = r.search '../div[@class="s"]/span[@class="f"]'
    authors = author.map { |e| e.content.to_s }

    {:url => match[1], :title => title, :authors => authors}
  end

  # Skipped results are nil; return the first usable one, or nil if none.
  urls.compact.first
end
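
The link sanitizing step in find can be tried in isolation. The sketch below is illustrative only: extract_target is a hypothetical helper that is not part of Whitepaper, the sample hrefs are made up, and the percent-decoding step is an addition that the engine itself does not perform.

```ruby
require "uri"

# Hypothetical helper (not part of Whitepaper): pulls the target URL out of
# a redirect href of the form "/url?q=<target>&...".
def extract_target(href)
  match = href.to_s.match(/\/url\?q=([^&]+)&/)
  return nil unless match

  # Decoding is an assumption of this sketch; the engine keeps the raw value.
  URI.decode_www_form_component(match[1])
end

puts extract_target("/url?q=http://example.com/paper.pdf&sa=U&ved=0")
puts extract_target("/search?q=related").inspect
```

An href that does not follow the /url?q=... pattern yields nil instead of raising, which is why callers should check the result before using it.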

.find_by_title(title) ⇒ Object

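Finds a Whitepaper::Paper by searching for the given title keywords, once restricted to PDF files and once to PostScript files, and keeps the URL of whichever result scores higher (both on a tie).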



# File 'lib/whitepaper/engine/google.rb', line 45

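def find_by_title(title)
  # URI.encode was removed in Ruby 3.0; encode_www_form_component replaces it.
  query = URI.encode_www_form_component(title)
  pdf = find("https://www.google.com/search?q=#{query}+filetype%3Apdf")
  ps  = find("https://www.google.com/search?q=#{query}+filetype%3Aps")

  # Neither search returned a result.
  return nil unless pdf or ps

  # A missing result scores 0.0, below any real score, so the other wins.
  pdf_score = pdf ? score(pdf[:title], title) : 0.0
  ps_score  = ps  ? score(ps[:title],  title) : 0.0

  pdf_urls = []
  ps_urls = []

  pdf_urls << pdf[:url] if pdf and pdf_score >= ps_score
  ps_urls  << ps[:url]  if ps and ps_score >= pdf_score

  # Take the title from the winning result (PDF on a tie).
  best = (pdf and pdf_score >= ps_score) ? pdf : ps

  Paper.new(best[:title], [], {:pdf_urls => pdf_urls,
                               :ps_urls  => ps_urls})
end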
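
To make the query construction concrete, here is a minimal sketch of how a search URL for one file type could be assembled. The helper name search_url and the sample title are assumptions for illustration and do not appear in Whitepaper; the sketch encodes the whole query with URI.encode_www_form_component, which turns spaces into + and the colon into %3A.

```ruby
require "uri"

# Hypothetical helper (not part of Whitepaper): builds the search URL
# for a title restricted to one file type.
def search_url(title, filetype)
  query = URI.encode_www_form_component("#{title} filetype:#{filetype}")
  "https://www.google.com/search?q=#{query}"
end

puts search_url("Attention Is All You Need", "pdf")
# prints https://www.google.com/search?q=Attention+Is+All+You+Need+filetype%3Apdf
```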

.score(title, keywords) ⇒ Object

Computes a rough relevance score for a result title against the search keywords; higher is better. TODO: move into its own class for Whitepaper::Paper.



# File 'lib/whitepaper/engine/google.rb', line 70

def score(title, keywords)
  keywords = keywords.split(" ").map(&:strip).map(&:downcase)
  title_words = title.split(" ").map(&:strip).map(&:downcase)

  score = 1.0

  # each keyword found in the title multiplies the score by 10
  # each title word that is not a keyword halves the score

  keywords.each do |k|
    if title_words.include? k
      score *= 10.0
    end
  end

  title_words.each do |k|
    unless keywords.include? k
      score /= 2.0
    end
  end

  score
end
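
A few worked values may help show how the multipliers interact. The block below is a standalone sketch that restates the same arithmetic in compact form so it can run outside the module; the sample titles and keywords are made up.

```ruby
# Standalone restatement of the scoring rule, for experimentation only.
def score(title, keywords)
  keywords    = keywords.split(" ").map(&:downcase)
  title_words = title.split(" ").map(&:downcase)

  score = 1.0
  keywords.each    { |k| score *= 10.0 if title_words.include?(k) }
  title_words.each { |w| score /= 2.0 unless keywords.include?(w) }
  score
end

# Two keywords found (x100), three extra title words (/8).
puts score("Attention Is All You Need", "attention need")            # prints 12.5
# All five keywords found, no extra title words.
puts score("Attention Is All You Need", "attention is all you need") # prints 100000.0
# No keywords found, two extra title words (/4).
puts score("Unrelated Title", "attention need")                      # prints 0.25
```

Because the score starts at 1.0 and is only ever multiplied or halved, it is always positive, and matching is case-insensitive but sensitive to punctuation attached to a word.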