Class: Chomchom::Summary

Inherits: Object
Defined in: lib/chomchom/summary.rb
Class Method Summary

- .best_sentences(text, topics, length = 400) ⇒ Object
  Select the highest-scoring sentence from each paragraph, then run love_at_first_sight.

- .compute_score(text, topics) ⇒ Object
  Sum up the score contributions of each topic that occurs in the text.

- .first_mentions(text, topics, length = 500) ⇒ Object
  A variation of topic-sentence extraction: starting with the most important topic, extract the first sentence mentioning it; do the same for each subsequent topic unless a previously selected sentence already mentions it, and continue until the length is reached or all topics are covered. Pros: reasonably coherent with decent coverage. Cons: a long, irrelevant intro that mentions the main topics will throw this off.

- .love_at_first_sight(sentences, topics, length) ⇒ Object
  For each topic, select the first sentence that contains the topic, unless the summary already covers it.

- .topic_sentences(text, topics, length = 400) ⇒ Object
  The result is nearly identical to first_mentions, except this runs a greater risk of not reaching the target length and is slightly more computationally expensive.
Instance Method Summary

- #center_of_gravity(text, topics, length = 500) ⇒ Object
  Select the stretch of sentences with the highest score, essentially capturing the center of gravity of the article. Pros: very coherent and computationally feasible. Cons: poor coverage; only works well when the captured passage is a summarizing intro/conclusion, otherwise it yields just the key paragraph rather than the whole article.
Class Method Details
.best_sentences(text, topics, length = 400) ⇒ Object
Select the highest-scoring sentence from each paragraph, then run love_at_first_sight.

# File 'lib/chomchom/summary.rb', line 74

def self.best_sentences(text, topics, length=400)
  paragraphs = text.split(/\n+/)
  best_sentences = []
  paragraphs.each do |p|
    sentences = p.split_sentences
    best_score = 0
    index = 0
    sentences.each_with_index do |s, i|
      current_score = Chomchom::Summary.compute_score(s, topics)
      if best_score < current_score
        index = i
        best_score = current_score
      end
    end
    best_sentences.push(sentences[index]) if sentences[index] and best_score > topics.last[1]
  end
  summary = Chomchom::Summary.love_at_first_sight(best_sentences, topics, length)
end
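The per-paragraph selection step can be sketched on its own. `best_per_paragraph` below is a hypothetical helper (not part of Chomchom): the sentence-splitting regex stands in for Chomchom's `split_sentences`, and the caller's block supplies whatever scoring function is wanted.

```ruby
# Hypothetical sketch: for each newline-separated paragraph, keep the
# sentence that the caller's scoring block rates highest.
def best_per_paragraph(text)
  text.split(/\n+/).map do |para|
    para.split(/(?<=[.!?])\s+/).max_by { |s| yield(s) }
  end
end

text = "cats are great. dogs drool.\nbirds sing. cats nap."
best_per_paragraph(text) { |s| s.scan(/cats/).size }
# => ["cats are great.", "cats nap."]
```

The real method additionally drops a paragraph's winner when its score does not beat the weakest topic's score (`topics.last[1]`), which this sketch omits for brevity.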
.compute_score(text, topics) ⇒ Object
Sum up the score contributions of each topic that occurs in the text.

# File 'lib/chomchom/summary.rb', line 95

def self.compute_score(text, topics)
  begin
    sum = 0
    #compute the geometric sum of occurrences (1 occurrence = 1/2*score, 2 occurrences = (1/2+1/4)*score)...
    #SUM(score*r^k) k:0..n = a*(1-r^(n+1))/(1-r), with a=score/2 and r=1/2
    #this limits too much diversity: a single mention of a topic shouldn't get the full score
    #if a topic has a high score it's important, and mentioning it several times in the summary is rewarded regressively
    topics.each do |t|
      f = text.scan(/\b#{Regexp.quote(t[0])}\b/).size
      sum += t[1]*(1-(1/2.0)**(f+1))/(1-1/2.0) if f > 0
    end
    sum
  #rescue
  #  0
  end
end
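The diminishing-returns weighting can be reproduced standalone. `geometric_score` below is a hypothetical helper (not part of Chomchom) that applies the same formula, assuming `topics` is an array of `[term, score]` pairs as elsewhere on this page:

```ruby
# Hypothetical standalone version of the geometric scoring formula:
# each extra occurrence of a topic contributes half as much as the
# previous one, so a topic's total approaches 2x its base score.
def geometric_score(text, topics)
  topics.sum do |term, score|
    f = text.scan(/\b#{Regexp.quote(term)}\b/).size
    f > 0 ? score * (1 - 0.5**(f + 1)) / 0.5 : 0
  end
end

geometric_score("cats love cats", [["cats", 1.0]])
# => 1.75  (two occurrences: 1.0 + 0.5 + 0.25)
```

Note that the code's actual values (1 occurrence → 1.5×score, 2 → 1.75×score) are slightly more generous than the "1/2*score" figure quoted in the inline comment; the formula in the code is what runs.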
.first_mentions(text, topics, length = 500) ⇒ Object
A variation of topic-sentence extraction: starting with the most important topic, extract the first sentence mentioning it; do the same for each subsequent topic unless a previously selected sentence already mentions it, and continue until the length is reached or all topics are covered. Pros: reasonably coherent with decent coverage. Cons: a long, irrelevant intro that mentions the main topics will throw this off.

# File 'lib/chomchom/summary.rb', line 56

def self.first_mentions(text, topics, length=500)
  sentences = text.split_sentences
  summary = Chomchom::Summary.love_at_first_sight(sentences, topics, length)
end
.love_at_first_sight(sentences, topics, length) ⇒ Object
For each topic, select the first sentence that contains the topic, unless the summary already covers it.

# File 'lib/chomchom/summary.rb', line 113

def self.love_at_first_sight(sentences, topics, length)
  separator = "\n"
  summary = ''
  t = 0
  points = []
  while summary.size < length and t < topics.size
    if summary.match(/\b#{Regexp.quote(topics[t][0])}\b/)
      #find the next occurrence sentence not already in the summary
      #what if this sentence will be covered by later topics?
    else
      match_sentence = sentences.detect { |s| s.match(/\b#{Regexp.quote(topics[t][0])}\b/) }
      if match_sentence and (new_summary = summary + match_sentence + separator).size < length
        summary = new_summary
        points.push(sentences.index(match_sentence)) #track sentence order
      end
    end
    t += 1
  end

  #TODO: have a strategy to include other sentences when the summary is less than half the length
  #keep a backups array of possible candidates, sorted by score,
  #and loop, adding to points while the summary is < length
  #for a low-topic article (no candidates) just use the unused topic sentences
  #or unused = points.each { |i| sentences.delete_at(i) } #must delete from the highest index back
  #then rerun this first_sight search

  #reorder the summary into original sentence order
  points.sort! { |a, b| a <=> b }
  summary = points.map { |i| sentences[i] }.join(separator).gsub(/\n+/, "").gsub(/\s+/, " ")
end
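The greedy loop above can be illustrated with a self-contained sketch. `first_sight` is a hypothetical simplification (not the real method): it keeps the covered-topic check, the length guard, and the reordering by original position, but drops the separator cleanup.

```ruby
# Hypothetical simplified version of the greedy selection: walk the topics
# in order of importance, grab the first sentence mentioning each topic
# that isn't already covered, then restore original sentence order.
def first_sight(sentences, topics, length)
  picked = []
  summary = ''
  topics.each do |term, _score|
    break if summary.size >= length
    next if summary.match?(/\b#{Regexp.quote(term)}\b/)   # already covered
    s = sentences.find { |x| x.match?(/\b#{Regexp.quote(term)}\b/) }
    next unless s
    candidate = summary + s + "\n"
    if candidate.size < length
      summary = candidate
      picked << sentences.index(s)   # remember position for reordering
    end
  end
  picked.sort.map { |i| sentences[i] }.join(' ')
end

sentences = ["ruby is fun.", "cats sleep a lot.", "dogs bark."]
first_sight(sentences, [["cats", 2], ["dogs", 1]], 100)
# => "cats sleep a lot. dogs bark."
```

Because the covered-topic check runs before the search, a single sentence mentioning several topics (e.g. "dogs and cats play.") satisfies all of them at once, which is what keeps the summary short.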
.topic_sentences(text, topics, length = 400) ⇒ Object
The result is nearly identical to first_mentions, except this runs a greater risk of not reaching the target length and is slightly more computationally expensive.

# File 'lib/chomchom/summary.rb', line 63

def self.topic_sentences(text, topics, length=400)
  topic_sentences = []
  paragraphs = text.split(/\n+/).each do |p|
    sentences = p.split_sentences
    topic_sentences.push(sentences[0]) if sentences[0] and Chomchom::Summary.compute_score(sentences[0], topics) > topics.last[1]
  end
  summary = Chomchom::Summary.love_at_first_sight(topic_sentences, topics, length)
end
Instance Method Details
#center_of_gravity(text, topics, length = 500) ⇒ Object
Select the stretch of sentences with the highest score, essentially capturing the center of gravity of the article. Pros: very coherent and computationally feasible. Cons: poor coverage; only works well when the captured passage is a summarizing intro/conclusion, otherwise it yields just the key paragraph rather than the whole article.

# File 'lib/chomchom/summary.rb', line 8

def center_of_gravity(text, topics, length=500)
  sentences = text.split_sentences
  summary = ''
  if sentences.size > 0
    start_index = 0
    stop_index = 0
    best_score = 0
    (0...sentences.size).each do |i|
      j = passage_last_index(sentences, i, length) #returns the index of the last sentence that fits
      #avoid extracting a passage from 2 different paragraphs
      #this usually lowers the score b/c less text means fewer matches against topics
      #but if a short passage has a higher score then more power to it
      passage = get_passage(sentences, i, j)

      #the following score computation doesn't account for diversity,
      #so it often yields passages where the main topics repeat in every sentence
      #current_score = scores[i..j].inject { |sum, sc| sum + sc }

      #this computation counts each topic once per passage
      current_score = Chomchom::Summary.compute_score(passage, topics)
      if best_score < current_score
        best_score = current_score
        start_index = i
        stop_index = j
      end
    end

    #fall back to the intro if the score is too low
    if best_score < 3
      start_index = 0
      stop_index = passage_last_index(sentences, start_index, length)
      #the following would avoid intros that are too short (usually titles)
      #start_index = (0...sentences.size).detect { |i| get_passage(sentences, i, stop_index = passage_last_index(sentences, i, length)).split(' ').size > 5 }
    end

    #.limit(length) guards against a single sentence longer than the allowable length:
    #it selects a substring within the max length by cutting at the last punctuation mark
    summary = get_passage(sentences, start_index, stop_index).limit(length)
  end
  summary
end
# File 'lib/chomchom/summary.rb', line 8 def center_of_gravity(text, topics, length=500) sentences = text.split_sentences summary = '' if sentences.size > 0 start_index = 0 stop_index = 0 best_score = 0 (0...sentences.size).each do |i| j = passage_last_index(sentences, i, length) #this returns the index of last sentence #avoid extracting passage from 2 different paragraphs #this usually lowers the score b/c less text means less match against topics #but if a short passage has higher score then more power to it passage = get_passage(sentences,i,j) #this following score computation doesn't account for diversity #so it often gives passages where the main topics are repeated in every sentences #current_score = scores[i..j].inject { |sum, sc| sum + sc } #this computation here count all topics once per passage current_score = Chomchom::Summary.compute_score(passage, topics) if best_score < current_score best_score = current_score start_index = i stop_index = j end end #use intro if the score is too low if best_score < 3 start_index = 0 stop_index = passage_last_index(sentences, start_index, length) #this following avoids using intro that are too short (usually are title) #start_index = (0...sentences.size).detect { |i| get_passage(sentences,i,stop_index=passage_last_index(sentences, i, length)).split(' ').size > 5 } end #the .limit(length) prevents single sentence that's longer than allowable length #select a substring within max length by removing the last occurrence of a punctuation summary = get_passage(sentences, start_index, stop_index).limit(length).limit(length) end summary end |