Class: Chomchom::Summary

Inherits: Object

Defined in: lib/chomchom/summary.rb

Class Method Summary

.best_sentences(text, topics, length = 400) ⇒ Object

Selects the highest-scoring sentence from each paragraph, then runs love_at_first_sight on the result.

.compute_score(text, topics) ⇒ Object

Sums the scores of every topic that occurs in text.

.first_mentions(text, topics, length = 500) ⇒ Object

A variation of topic-sentence extraction. Starting with the most important topic, it extracts the first sentence that mentions it, then does the same for each following topic unless an earlier sentence already mentions it, continuing until length is reached or all topics are covered. Pros: reasonably coherent, decent coverage. Cons: a long, irrelevant intro that mentions the main topics will throw it off.

.love_at_first_sight(sentences, topics, length) ⇒ Object

For each topic, selects the first sentence that contains the topic, unless the summary already covers it.

.topic_sentences(text, topics, length = 400) ⇒ Object

Produces a result very similar to first_mentions, but runs a greater risk of not reaching length. It is also slightly more computationally expensive.

Instance Method Summary

#center_of_gravity(text, topics, length = 500) ⇒ Object

Selects the stretch of consecutive sentences with the highest score, which captures the article's center of gravity. Pros: very coherent and computationally feasible. Cons: poor coverage; it works well only when the captured passage is a summarizing intro or conclusion, and otherwise returns just the key paragraph rather than the whole.

Class Method Details

.best_sentences(text, topics, length = 400) ⇒ `Object`

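Selects the highest-scoring sentence from each paragraph, then runs love_at_first_sight on those candidates. A paragraph's best sentence is kept only if its score exceeds the score of the last topic in topics.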

# File 'lib/chomchom/summary.rb', line 74

def self.best_sentences(text, topics, length=400)
  paragraphs = text.split(/\n+/)
  best_sentences = []
  paragraphs.each do |p|
    sentences = p.split_sentences
    best_score = 0
    index = 0
    sentences.each_with_index do |s, i|
      current_score = Chomchom::Summary.compute_score(s, topics)
      if best_score < current_score
        index = i
        best_score = current_score
      end
    end
    best_sentences.push(sentences[index]) if sentences[index] and best_score > topics.last[1]
  end
  
  summary = Chomchom::Summary.love_at_first_sight(best_sentences, topics, length)
end
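
The per-paragraph selection step can be sketched in isolation as follows. This is a minimal illustration, not the library's code: `split_sentences`, `score`, and `paragraph_candidates` are hypothetical stand-ins, the naive regex splitter is an assumption (the library's `String#split_sentences` is not shown on this page), and the sample text and topic weights are made up.

```ruby
# Hypothetical stand-in for the library's String#split_sentences (assumption).
def split_sentences(paragraph)
  paragraph.scan(/[^.!?]+[.!?]*/).map(&:strip).reject(&:empty?)
end

# Same scoring rule as the compute_score listing: topics are [term, weight] pairs.
def score(text, topics)
  topics.sum do |term, weight|
    f = text.scan(/\b#{Regexp.quote(term)}\b/).size
    f > 0 ? weight * (1 - 0.5**(f + 1)) / (1 - 0.5) : 0
  end
end

# Keep each paragraph's best sentence if it beats the last topic's weight.
def paragraph_candidates(text, topics)
  threshold = topics.last[1]
  text.split(/\n+/).map do |para|
    best = split_sentences(para).max_by { |s| score(s, topics) }
    best if best && score(best, topics) > threshold
  end.compact
end

text = "Ruby is great. Ruby gems extend Ruby.\nThe weather was nice.\nBundler resolves gems."
topics = [["Ruby", 3], ["gems", 2], ["Bundler", 1]] # made-up weights, most important first
p paragraph_candidates(text, topics)
# => ["Ruby gems extend Ruby.", "Bundler resolves gems."]
```

The middle paragraph mentions no topic, scores 0, and is dropped by the threshold check.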

.compute_score(text, topics) ⇒ `Object`

Sums the scores of every topic that occurs in text, with diminishing returns for repeated occurrences of the same topic.

# File 'lib/chomchom/summary.rb', line 95

def self.compute_score(text, topics)
  begin
    sum = 0
    # Geometric sum over the f occurrences of a topic:
    # SUM(a*r^k) for k in 0..f = a*(1-r^(f+1))/(1-r), with r = 1/2
    # Note: the line below uses a = score, so 1 occurrence yields 1.5*score, 2 yield 1.75*score,
    # approaching 2*score; with a = score/2 and k in 0..f-1 it would be 1/2*score, (1/2+1/4)*score, ...
    # This limits excessive diversity: a single mention of a topic shouldn't earn all of its score.
    # A high-scoring topic is important, so mentioning it several times in the summary is rewarded regressively.
    topics.each do |t|
      f = text.scan(/\b#{Regexp.quote(t[0])}\b/).size
      sum += t[1]*(1-(1/2.0)**(f+1))/(1-1/2.0) if f > 0
    end
    sum
  #rescue
  #  0
  end        
end
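
A quick way to see the diminishing returns is to evaluate the formula on a toy input. The sketch below restates the arithmetic of the listing above as a standalone function; the topic terms and weights are made up for illustration, and `topics` is assumed to be an array of `[term, weight]` pairs, as the `t[0]` and `t[1]` accesses suggest.

```ruby
# Standalone restatement of the scoring arithmetic (not loaded from the gem).
def compute_score(text, topics)
  topics.sum do |term, weight|
    f = text.scan(/\b#{Regexp.quote(term)}\b/).size # whole-word, case-sensitive matches
    f > 0 ? weight * (1 - 0.5**(f + 1)) / (1 - 0.5) : 0
  end
end

topics = [["ruby", 4], ["gem", 2]] # hypothetical topics

puts compute_score("ruby is fun", topics)        # one mention of "ruby": 4 * 1.5
# => 6.0
puts compute_score("ruby gem, ruby gem", topics) # two mentions each: 4 * 1.75 + 2 * 1.75
# => 10.5
```

A second mention of "ruby" adds only 1.0 to the 6.0 earned by the first, so repeating a topic helps less than covering a new one.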

.first_mentions(text, topics, length = 500) ⇒ `Object`

A variation of topic-sentence extraction. Starting with the most important topic, it extracts the first sentence that mentions it, then does the same for each following topic unless an earlier sentence already mentions it. It continues until length is reached or all topics are covered.

Pros: reasonably coherent, decent coverage.
Cons: a long, irrelevant intro that mentions the main topics will throw it off.

# File 'lib/chomchom/summary.rb', line 56

def self.first_mentions(text, topics, length=500)
  sentences = text.split_sentences
  summary = Chomchom::Summary.love_at_first_sight(sentences, topics, length)
end

.love_at_first_sight(sentences, topics, length) ⇒ `Object`

For each topic, selects the first sentence that contains the topic, unless the summary already covers it. The selected sentences are returned in their original order.

# File 'lib/chomchom/summary.rb', line 113

def self.love_at_first_sight(sentences, topics, length)
  separator = "\n"
  summary = ''
  t = 0
  points = []
  while summary.size < length and t < topics.size
    if summary.match(/\b#{Regexp.quote(topics[t][0])}\b/)    
      #find the next occurrence sentence not already in the summary
      #what if this sentence will be covered by next topics?
    else  
      match_sentence = sentences.detect { |s| s.match(/\b#{Regexp.quote(topics[t][0])}\b/) }
      if match_sentence and (new_summary = summary + match_sentence + separator).size < length
        summary = new_summary
        points.push(sentences.index(match_sentence)) #track sentence order
      end          
    end
    t += 1
  end
  #have a strategy to include other sentences when summary is less than half the length
  #backups array which stores possible candidates, sort by score
  #run a loop and add to points if summary is < length
  #for low topic article like the reddit one (no candidates) just use the unused topic sentences
  
  #or unused = points.each { |i| sentences.delete_at(i) } #must delete from highest index back
  #then rerun this first_sight search
  
  #reorder the summary
  points.sort! {|a,b| a <=> b}
  summary = points.map { |i| sentences[i] }.join(separator).gsub(/\n+/,"").gsub(/\s+/," ")
end
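
The selection loop is easier to follow on a small input. The following is a simplified sketch of the same idea, not the library's implementation: `first_sight` is a hypothetical name, the sentences and topic weights are made up, and the final join uses a single space rather than the listing's newline-stripping `gsub` chain.

```ruby
# Simplified sketch: pick the first sentence mentioning each topic, in topic order,
# skipping topics the summary already covers, then restore document order.
def first_sight(sentences, topics, length)
  summary = ""
  picked = []
  topics.each do |term, _weight|
    break if summary.size >= length
    pattern = /\b#{Regexp.quote(term)}\b/
    next if summary.match?(pattern)           # topic already covered
    i = sentences.index { |s| s.match?(pattern) }
    next unless i
    candidate = summary + sentences[i] + "\n"
    next unless candidate.size < length       # skip a sentence that would overflow
    summary = candidate
    picked << i
  end
  picked.sort.map { |i| sentences[i] }.join(" ")
end

sentences = ["Ruby is a language.", "Gems package Ruby code.", "Bundler installs gems."]
topics = [["gems", 3], ["Ruby", 2], ["Bundler", 1]] # most important first

puts first_sight(sentences, topics, 200)
# => Ruby is a language. Bundler installs gems.
```

"gems" selects the third sentence (matching is case-sensitive, so "Gems" does not count), "Ruby" selects the first, and "Bundler" is skipped because the summary already mentions it. Sorting the indices puts the sentences back in reading order.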

.topic_sentences(text, topics, length = 400) ⇒ `Object`

Considers only the first sentence of each paragraph. The result is very similar to first_mentions, but this runs a greater risk of not reaching length. It is also slightly more computationally expensive.

# File 'lib/chomchom/summary.rb', line 63

def self.topic_sentences(text, topics, length=400)
  topic_sentences = []
  text.split(/\n+/).each do |p|
    sentences = p.split_sentences
    topic_sentences.push(sentences[0]) if sentences[0] and Chomchom::Summary.compute_score(sentences[0], topics) > topics.last[1]
  end
  
  summary = Chomchom::Summary.love_at_first_sight(topic_sentences, topics, length)
end

Instance Method Details

#center_of_gravity(text, topics, length = 500) ⇒ `Object`

Selects the stretch of consecutive sentences with the highest score, which captures the article's center of gravity.

Pros: very coherent and computationally feasible.
Cons: poor coverage; it works well only when the captured passage is a summarizing intro or conclusion, and otherwise returns just the key paragraph rather than the whole.

# File 'lib/chomchom/summary.rb', line 8

def center_of_gravity(text, topics, length=500)    
  sentences = text.split_sentences
  summary = ''
  if sentences.size > 0
    start_index = 0
    stop_index = 0
    best_score = 0
    (0...sentences.size).each do |i|
      j = passage_last_index(sentences, i, length) #this returns the index of last sentence
      #avoid extracting passage from 2 different paragraphs
      #this usually lowers the score b/c less text means less match against topics
      #but if a short passage has higher score then more power to it
      passage = get_passage(sentences,i,j)

      #this following score computation doesn't account for diversity
      #so it often gives passages where the main topics are repeated in every sentences
      #current_score = scores[i..j].inject { |sum, sc| sum + sc } 
  
      #this computation here count all topics once per passage
      current_score = Chomchom::Summary.compute_score(passage, topics)
      if best_score < current_score
        best_score = current_score
        start_index = i
        stop_index = j 
      end
    end

    #use intro if the score is too low
    if best_score < 3
      start_index = 0
      stop_index = passage_last_index(sentences, start_index, length)
      
      #this following avoids using intro that are too short (usually are title)
      #start_index = (0...sentences.size).detect { |i| get_passage(sentences,i,stop_index=passage_last_index(sentences, i, length)).split(' ').size > 5 }
    end
    
    #the .limit(length) guards against a single sentence that is longer than the allowable length
    #it selects a substring within the max length by cutting at the last occurrence of punctuation
    summary = get_passage(sentences, start_index, stop_index).limit(length)
  end
  summary
end
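
The window search can be sketched on its own as below. The helpers `passage_last_index` and `get_passage` are not shown on this page, so the version of `passage_last_index` here is an assumption about their behavior (take as many consecutive sentences as fit in `length` characters, always at least one). `score` and `best_window` are hypothetical names, the sample data is made up, and the low-score fallback to the intro and the final `.limit(length)` are omitted.

```ruby
# Same scoring rule as the compute_score listing: topics are [term, weight] pairs.
def score(text, topics)
  topics.sum do |term, weight|
    f = text.scan(/\b#{Regexp.quote(term)}\b/).size
    f > 0 ? weight * (1 - 0.5**(f + 1)) / (1 - 0.5) : 0
  end
end

# Assumed behavior: index of the last sentence that fits in `length` characters
# when the passage starts at sentence i (sentence i itself is always included).
def passage_last_index(sentences, i, length)
  total = 0
  j = i
  (i...sentences.size).each do |k|
    total += sentences[k].size + 1
    break if total > length && k > i
    j = k
  end
  j
end

# Slide the window start over every sentence and keep the best-scoring passage.
def best_window(sentences, topics, length)
  best_score, start, stop = 0, 0, 0
  sentences.each_index do |i|
    j = passage_last_index(sentences, i, length)
    s = score(sentences[i..j].join(" "), topics)
    best_score, start, stop = s, i, j if s > best_score
  end
  sentences[start..stop].join(" ")
end

sentences = ["Intro text here.", "Ruby gems extend Ruby.",
             "Bundler resolves gems.", "Thanks for reading."]
topics = [["Ruby", 3], ["gems", 2], ["Bundler", 1]]

puts best_window(sentences, topics, 50)
# => Ruby gems extend Ruby. Bundler resolves gems.
```

Because the whole passage is scored as one text, each topic contributes once per window with diminishing returns, which is what favors the window covering all three topics over one that merely repeats "Ruby".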
