Class: TextRank::KeywordExtractor

Inherits:
  Object
Defined in:
lib/text_rank/keyword_extractor.rb

Overview

Primary class for keyword extraction, and the hub for the filters, tokenizers, and graph strategies that customize how the text is processed and how the TextRank algorithm is applied.
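
For orientation, a minimal usage sketch using only the methods documented below (the sample string is illustrative):

require 'text_rank'

extractor = TextRank::KeywordExtractor.basic
ranks = extractor.extract('The quick brown fox jumps over the lazy dog')
ranks.each { |keyword, score| puts "#{keyword}: #{score}" }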

See Also:

  • README

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(**options) ⇒ KeywordExtractor

Returns a new instance of KeywordExtractor.

Parameters:

  • options (Hash)

    a customizable set of options

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results



# File 'lib/text_rank/keyword_extractor.rb', line 42

def initialize(**options)
  @page_rank_options = {
    strategy:  options[:strategy] || :sparse,
    damping:   options[:damping],
    tolerance: options[:tolerance],
  }
  @char_filters = options[:char_filters] || []
  @tokenizers = options[:tokenizers] || [Tokenizer::Word]
  @token_filters = options[:token_filters] || []
  @rank_filters = options[:rank_filters] || []
  @graph_strategy = options[:graph_strategy] || GraphStrategy::Coocurrence
end
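
As a rough sketch of calling the constructor directly; the filter and tokenizer symbols are the same ones used by the presets, while the damping and tolerance values are merely illustrative, not defaults mandated by this class:

extractor = TextRank::KeywordExtractor.new(
  char_filters:  %i[AsciiFolding Lowercase],
  tokenizers:    %i[Word],
  token_filters: %i[Stopwords MinLength],
  strategy:      :sparse,
  damping:       0.85,
  tolerance:     0.0001
)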

Instance Attribute Details

#graph_strategy=(value) ⇒ Class, ... (writeonly)

Sets the graph strategy for producing a graph from tokens

Returns:

  • (Class, Symbol, #build_graph)


# File 'lib/text_rank/keyword_extractor.rb', line 75

def graph_strategy=(value)
  @graph_strategy = value
end
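
A brief sketch of the two accepted forms; MyStrategy is hypothetical, standing in for any object that responds to #build_graph(tokens, graph):

extractor = TextRank::KeywordExtractor.basic
extractor.graph_strategy = :Coocurrence        # resolved by name, as in the presets
# extractor.graph_strategy = MyStrategy.new    # any object responding to #build_graph(tokens, graph)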

Class Method Details

.advanced(**options) ⇒ KeywordExtractor

Creates an "advanced" keyword extractor with a larger set of default filters

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results

Returns:

  • (KeywordExtractor)

# File 'lib/text_rank/keyword_extractor.rb', line 26

def self.advanced(**options)
  new(**{
    char_filters:   %i[AsciiFolding Lowercase StripHtml StripEmail UndoContractions StripPossessive],
    tokenizers:     %i[Url Money Number Word Punctuation],
    token_filters:  %i[PartOfSpeech Stopwords MinLength],
    graph_strategy: :Coocurrence,
    rank_filters:   %i[CollapseAdjacent NormalizeUnitVector SortByValue],
  }.merge(options))
end
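
Because the caller's options are merged on top of the preset, individual settings can be overridden in the same call. A sketch (the :dense strategy and the ten-keyword cutoff are illustrative choices):

extractor = TextRank::KeywordExtractor.advanced(strategy: :dense)
ranks = extractor.extract('Some longer body of text to analyze ...')
ranks.first(10).each { |phrase, rank| puts format('%.4f  %s', rank, phrase) }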

.basic(**options) ⇒ KeywordExtractor

Creates a "basic" keyword extractor with default options

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results

Returns:

  • (KeywordExtractor)

# File 'lib/text_rank/keyword_extractor.rb', line 14

def self.basic(**options)
  new(**{
    char_filters:   %i[AsciiFolding Lowercase],
    tokenizers:     %i[Word],
    token_filters:  %i[Stopwords MinLength],
    graph_strategy: :Coocurrence,
  }.merge(options))
end
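
As with .advanced, the preset hash is merged with the caller's options, so a single key can be overridden without repeating the rest of the defaults, e.g.:

extractor = TextRank::KeywordExtractor.basic(token_filters: %i[Stopwords])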

Instance Method Details

#add_char_filter(filter, **options) ⇒ nil

Add a new CharFilter for processing text before tokenization

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process text before tokenization

Options Hash (**options):

  • :before (Class, Symbol, Object)

    The existing item to insert the new filter before

  • :at (Fixnum)

    The index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 59

def add_char_filter(filter, **options)
  add_into(@char_filters, filter, **options)
  nil
end
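
A small sketch; the filter names are the ones that appear in the presets above:

extractor = TextRank::KeywordExtractor.basic
extractor.add_char_filter(:StripHtml)                       # appended after the preset filters
extractor.add_char_filter(:StripEmail, before: :StripHtml)  # or positioned relative to an existing one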

#add_rank_filter(filter, **options) ⇒ nil

Add a new RankFilter for processing ranks after they are calculated

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process ranks

Options Hash (**options):

  • :before (Class, Symbol, Object)

    The existing item to insert the new filter before

  • :at (Fixnum)

    The index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 90

def add_rank_filter(filter, **options)
  add_into(@rank_filters, filter, **options)
  nil
end
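
For instance, appending one of the rank filters used by the advanced preset:

extractor = TextRank::KeywordExtractor.basic
extractor.add_rank_filter(:SortByValue)   # sort the final ranks by score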

#add_token_filter(filter, **options) ⇒ nil

Add a new TokenFilter for processing tokens after tokenization

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process tokens after tokenization

Options Hash (**options):

  • :before (Class, Symbol, Object)

    The existing item to insert the new filter before

  • :at (Fixnum)

    The index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 81

def add_token_filter(filter, **options)
  add_into(@token_filters, filter, **options)
  nil
end
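
A sketch using the :at option to run a filter ahead of the existing ones; the filter name comes from the advanced preset:

extractor = TextRank::KeywordExtractor.basic
extractor.add_token_filter(:PartOfSpeech, at: 0)   # run ahead of the preset token filters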

#add_tokenizer(tokenizer, **options) ⇒ nil

Add a tokenizer regular expression for producing tokens from filtered text

Parameters:

  • tokenizer (Symbol, Regexp, String)

    A tokenizer regular expression (or a symbol/string that resolves to one)

Options Hash (**options):

  • :before (Class, Symbol, Object)

    The existing item to insert the new tokenizer before

  • :at (Fixnum)

    The index at which to insert the new tokenizer

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 68

def add_tokenizer(tokenizer, **options)
  add_into(@tokenizers, tokenizer, **options)
  nil
end
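
Since plain regular expressions are accepted, a custom token pattern can be prepended so it takes precedence over the built-in ones; the hashtag pattern below is purely illustrative:

extractor = TextRank::KeywordExtractor.basic
extractor.add_tokenizer(/#\w+/, at: 0)   # e.g. treat hashtags as single tokens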

#extract(text, **options) ⇒ Hash<String, Float>

Filters and tokenizes the text, then builds the graph and returns the PageRank scores of the resulting keywords

Parameters:

  • text (String, Array<String>)

    unfiltered text to be processed

Returns:

  • (Hash<String, Float>)

    tokens and page ranks (in descending order)



# File 'lib/text_rank/keyword_extractor.rb', line 107

def extract(text, **options)
  text = Array(text)
  tokens_per_text = text.map do |t|
    tokenize(t)
  end
  graph = PageRank.new(**@page_rank_options)
  strategy = classify(@graph_strategy, context: GraphStrategy)
  tokens_per_text.each do |tokens|
    strategy.build_graph(tokens, graph)
  end
  ranks = graph.calculate(**options)
  tokens_per_text.each_with_index do |tokens, i|
    ranks = apply_rank_filters(ranks, tokens: tokens, original_text: text[i])
  end
  ranks
end
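
As the implementation above shows, passing an array of texts builds a single combined graph, so related documents can be ranked together. A minimal sketch (the two variables are hypothetical strings):

extractor = TextRank::KeywordExtractor.advanced
ranks = extractor.extract([chapter_one_text, chapter_two_text])  # one combined graph for both texts
ranks.first(5)                                                   # highest-ranked keywords first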

#tokenize(text) ⇒ Array<String>

Filters and tokenizes text

Parameters:

  • text (String)

    unfiltered text to be tokenized

Returns:

  • (Array<String>)

    tokens



# File 'lib/text_rank/keyword_extractor.rb', line 98

def tokenize(text)
  filtered_text = apply_char_filters(text)
  tokens = Tokenizer.tokenize(filtered_text, *tokenizer_regular_expressions)
  apply_token_filters(tokens)
end
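
A quick sketch; the exact output depends on the configured char and token filters, so no literal result is shown:

extractor = TextRank::KeywordExtractor.basic
extractor.tokenize('Keyword extraction turns raw text into ranked tokens.')
# => an Array<String> of filtered tokens (exact output depends on the configured filters)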