Class: Chomchom::Summary

Inherits:
Object
Defined in:
lib/chomchom/summary.rb

Class Method Summary

Instance Method Summary

Class Method Details

.best_sentences(text, topics, length = 400) ⇒ Object

Selects the highest-scoring sentence from each paragraph, then runs love_at_first_sight over the selected sentences.



# File 'lib/chomchom/summary.rb', line 74

def self.best_sentences(text, topics, length=400)
  paragraphs = text.split(/\n+/)
  best_sentences = []
  paragraphs.each do |p|
    sentences = p.split_sentences
    # track the first sentence with the strictly highest score in this paragraph
    best_score = 0
    index = 0
    sentences.each_with_index do |s, i|
      current_score = Chomchom::Summary.compute_score(s, topics)
      if best_score < current_score
        index = i
        best_score = current_score
      end
    end
    # keep the sentence only if it outscores the score of the last topic in the list
    best_sentences.push(sentences[index]) if sentences[index] && best_score > topics.last[1]
  end

  Chomchom::Summary.love_at_first_sight(best_sentences, topics, length)
end
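
The per-paragraph selection step can be sketched in isolation. This is a minimal illustration, not the library's code: naive_sentences, naive_score and best_per_paragraph are hypothetical names, the splitter is a simple regex stand-in for String#split_sentences, and the scorer only checks whether a topic is present, which is simpler than compute_score. It assumes topics is a non-empty array of [term, score] pairs ordered from most to least important.

```ruby
# Hypothetical stand-in for String#split_sentences: split after ., ! or ?
def naive_sentences(paragraph)
  paragraph.split(/(?<=[.!?])\s+/)
end

# Hypothetical presence-only scorer: add a topic's score if the sentence mentions it.
def naive_score(sentence, topics)
  topics.sum { |word, weight| sentence.match?(/\b#{Regexp.quote(word)}\b/) ? weight : 0 }
end

# Keep the best sentence of each paragraph, but only if it outscores the last topic.
def best_per_paragraph(text, topics)
  threshold = topics.last[1]
  text.split(/\n+/).map do |paragraph|
    best = naive_sentences(paragraph).max_by { |s| naive_score(s, topics) }
    best if best && naive_score(best, topics) > threshold
  end.compact
end

text = "Ruby is fun. Ruby gems are packaged with Bundler.\n" \
       "The weather was nice. Nothing else happened.\n" \
       "Bundler locks versions. It reads the Gemfile."
topics = [["Bundler", 5], ["Ruby", 3], ["Gemfile", 1]]

picked = best_per_paragraph(text, topics)
# => ["Ruby gems are packaged with Bundler.", "Bundler locks versions."]
```

The second paragraph mentions no topic, so its best score of 0 does not exceed the threshold and it contributes nothing.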

.compute_score(text, topics) ⇒ Object

Adds up the score of each topic that occurs in text, with diminishing returns for repeated occurrences.



# File 'lib/chomchom/summary.rb', line 95

def self.compute_score(text, topics)
  sum = 0
  # each topic t is a [term, score] pair; f counts the whole-word occurrences of the term
  # repeated mentions are rewarded regressively through a geometric series with r = 1/2:
  #   SUM(a*r^k), k:0..f = a*(1-r^(f+1))/(1-r)
  # stated intent: 1 occurrence = 1/2*score, 2 occurrences = (1/2+1/4)*score, ...
  # as written (a = score, k starting at 0) the code yields 1.5*score for one occurrence
  # and 1.75*score for two, approaching 2*score; the diminishing-returns shape is the same
  # this limits the reward for repetition: one mention of a topic shouldn't get all the score,
  # but a high-score (important) topic mentioned several times in the summary still earns more
  topics.each do |t|
    f = text.scan(/\b#{Regexp.quote(t[0])}\b/).size
    sum += t[1]*(1-(1/2.0)**(f+1))/(1-1/2.0) if f > 0
  end
  sum
end
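
A worked example may make the scoring formula easier to follow. The sketch below copies the arithmetic from the listing above into two hypothetical helpers (topic_contribution and score_text are illustrative names, not part of Chomchom) and assumes topics is an array of [term, score] pairs.

```ruby
# Hypothetical helper: contribution of one topic with the given score
# that occurs f times, using the same expression as the listing above.
def topic_contribution(score, f)
  return 0 if f.zero?
  score * (1 - (1 / 2.0)**(f + 1)) / (1 - 1 / 2.0)
end

# Hypothetical helper: total score of a text over all topics.
def score_text(text, topics)
  topics.sum do |word, score|
    f = text.scan(/\b#{Regexp.quote(word)}\b/).size
    topic_contribution(score, f)
  end
end

topic_contribution(4, 1) # => 6.0   (1.5 * score)
topic_contribution(4, 2) # => 7.0   (1.75 * score)
topic_contribution(4, 3) # => 7.5   (1.875 * score)

topics = [["ruby", 4], ["gem", 2]]
total = score_text("ruby gem ruby", topics) # => 10.0  (7.0 for "ruby" + 3.0 for "gem")
```

Each extra mention adds half as much as the previous one, so the contribution of a topic never reaches twice its score.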

.first_mentions(text, topics, length = 500) ⇒ Object

A variation of topic-sentence extraction. Starting with the most important topic, it extracts the first sentence that mentions it, then does the same for the next topic unless an already selected sentence mentions it, and continues until length is reached or all topics are covered.

Pros: reasonably coherent, with decent coverage.
Cons: a long, irrelevant intro that mentions the main topics will throw this off.



# File 'lib/chomchom/summary.rb', line 56

def self.first_mentions(text, topics, length=500)
  sentences = text.split_sentences
  Chomchom::Summary.love_at_first_sight(sentences, topics, length)
end

.love_at_first_sight(sentences, topics, length) ⇒ Object

For each topic, selects the first sentence that mentions the topic, unless the summary already covers it. The selected sentences are returned in their original order.



# File 'lib/chomchom/summary.rb', line 113

def self.love_at_first_sight(sentences, topics, length)
  separator = "\n"
  summary = ''
  t = 0
  points = [] # indices of the selected sentences
  # walk the topics in order until the summary is full or the topics run out
  while summary.size < length && t < topics.size
    if summary.match(/\b#{Regexp.quote(topics[t][0])}\b/)
      # topic already covered by a selected sentence: nothing to add
      # TODO: find the next sentence with this topic that is not already in the summary
      # (what if that sentence would be covered by a later topic?)
    else
      match_sentence = sentences.detect { |s| s.match(/\b#{Regexp.quote(topics[t][0])}\b/) }
      # add the sentence only if the summary stays under length
      if match_sentence && (new_summary = summary + match_sentence + separator).size < length
        summary = new_summary
        points.push(sentences.index(match_sentence)) # track sentence order
      end
    end
    t += 1
  end
  # TODO: have a strategy to include other sentences when the summary is less than half the length:
  # - keep a backups array of possible candidates, sorted by score
  # - loop over it and add to points while the summary is < length
  # - for low-topic articles like the reddit one (no candidates), just use the unused topic sentences
  # - or: unused = points.each { |i| sentences.delete_at(i) } (must delete from the highest index back),
  #   then rerun this first-sight search

  # restore the original sentence order, then flatten the result to a single line
  points.sort!
  points.map { |i| sentences[i] }.join(separator).gsub(/\n+/,"").gsub(/\s+/," ")
end
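
The selection loop can be illustrated with a condensed sketch. first_sight is a hypothetical name and a simplified stand-in for the listing above, not the library method: it follows the same greedy rule (skip topics the summary already covers, take the first sentence that mentions the topic, keep the summary under length), but joins the result with spaces. It assumes topics is an array of [term, score] pairs ordered by importance.

```ruby
# Simplified, hypothetical stand-in for the greedy first-mention selection.
def first_sight(sentences, topics, length)
  summary = ''
  picked = []
  topics.each do |word, _score|
    break if summary.size >= length
    pattern = /\b#{Regexp.quote(word)}\b/
    next if summary.match?(pattern)             # topic already covered
    index = sentences.index { |s| s.match?(pattern) }
    next unless index                           # no sentence mentions this topic
    candidate = summary + sentences[index] + "\n"
    next unless candidate.size < length         # sentence would overflow the summary
    summary = candidate
    picked << index
  end
  picked.sort.map { |i| sentences[i] }.join(' ') # restore original order
end

sentences = [
  "Ruby is a dynamic language.",
  "Gems package Ruby libraries.",
  "Bundler resolves gem versions."
]
topics = [["Bundler", 5], ["gem", 3], ["Ruby", 1]]

result = first_sight(sentences, topics, 200)
# => "Ruby is a dynamic language. Bundler resolves gem versions."
```

The topic "gem" adds nothing because the sentence chosen for "Bundler" already mentions it, and the two selected sentences come back in document order rather than topic order.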

.topic_sentences(text, topics, length = 400) ⇒ Object

Uses the first sentence of each paragraph as a candidate. The result is very similar to first_mentions, except that this method runs a greater risk of not reaching length. It is also slightly more computationally expensive.



# File 'lib/chomchom/summary.rb', line 63

def self.topic_sentences(text, topics, length=400)
  topic_sentences = []
  text.split(/\n+/).each do |p|
    sentences = p.split_sentences
    # the first sentence of a paragraph is a candidate if it outscores the last topic's score
    first = sentences[0]
    topic_sentences.push(first) if first && Chomchom::Summary.compute_score(first, topics) > topics.last[1]
  end

  Chomchom::Summary.love_at_first_sight(topic_sentences, topics, length)
end

Instance Method Details

#center_of_gravity(text, topics, length = 500) ⇒ Object

Selects the stretch of consecutive sentences with the highest score, which captures the center of gravity of the article.

Pros: very coherent and computationally feasible.
Cons: weak on coverage. It works well only when the captured passage is a summarizing intro or conclusion; otherwise it returns just the key paragraph, not the whole article.



# File 'lib/chomchom/summary.rb', line 8

def center_of_gravity(text, topics, length=500)    
  sentences = text.split_sentences
  summary = ''
  if sentences.size > 0
    start_index = 0
    stop_index = 0
    best_score = 0
    (0...sentences.size).each do |i|
      j = passage_last_index(sentences, i, length) #this returns the index of last sentence
      #avoid extracting passage from 2 different paragraphs
      #this usually lowers the score b/c less text means less match against topics
      #but if a short passage has higher score then more power to it
      passage = get_passage(sentences,i,j)

      #the following score computation doesn't account for diversity,
      #so it often gives passages where the main topics are repeated in every sentence
      #current_score = scores[i..j].inject { |sum, sc| sum + sc } 
  
      #this computation scores the passage as a whole, so repeated mentions of a topic earn diminishing returns
      current_score = Chomchom::Summary.compute_score(passage, topics)
      if best_score < current_score
        best_score = current_score
        start_index = i
        stop_index = j 
      end
    end

    #use intro if the score is too low
    if best_score < 3
      start_index = 0
      stop_index = passage_last_index(sentences, start_index, length)
      
      #the following avoids using intros that are too short (these are usually titles)
      #start_index = (0...sentences.size).detect { |i| get_passage(sentences,i,stop_index=passage_last_index(sentences, i, length)).split(' ').size > 5 }
    end
    
    #the .limit(length) guards against a single sentence that is longer than the allowable length
    #it selects a substring within the max length by cutting at the last occurrence of a punctuation mark
    summary = get_passage(sentences, start_index, stop_index).limit(length)
  end
  summary
end
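
The window search at the heart of this method can be sketched as follows. This is an illustration under stated assumptions, not the library's code: last_index_within is a hypothetical stand-in for passage_last_index (whose source is not shown here), passages are joined with single spaces, and the scorer only checks topic presence instead of calling compute_score. The low-score fallback to the intro and the final .limit(length) are left out.

```ruby
# Hypothetical helper: index of the last sentence that still fits within
# `length` characters when the passage starts at sentence i.
def last_index_within(sentences, i, length)
  j = i
  size = sentences[i].size
  while j + 1 < sentences.size && size + 1 + sentences[j + 1].size <= length
    j += 1
    size += 1 + sentences[j].size
  end
  j
end

# Slide a window over every start index and keep the first best-scoring passage.
def best_window(sentences, topics, length)
  best = { score: 0, range: nil }
  sentences.each_index do |i|
    j = last_index_within(sentences, i, length)
    passage = sentences[i..j].join(' ')
    score = topics.sum { |word, weight| passage.match?(/\b#{Regexp.quote(word)}\b/) ? weight : 0 }
    best = { score: score, range: i..j } if score > best[:score]
  end
  best
end

sentences = [
  "Intro text here.",
  "Ruby gems ship code.",
  "Bundler installs Ruby gems.",
  "The end."
]
topics = [["Bundler", 5], ["Ruby", 3]]

best = best_window(sentences, topics, 50)
# => { score: 8, range: 1..2 }
```

With a 50-character budget, the window starting at the second sentence is the first to cover both topics, so it wins over the later window that ties with it.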