Class: TextRank::KeywordExtractor

Inherits:
  Object
Defined in:
lib/text_rank/keyword_extractor.rb

Overview

Primary class for keyword extraction, and the hub for the filters, tokenizers, and graph strategies that customize how the text is processed and how the TextRank algorithm is applied.
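
For orientation, a minimal usage sketch using only the methods documented below (the sample string is illustrative):

require 'text_rank'

extractor = TextRank::KeywordExtractor.basic
ranks = extractor.extract('The quick brown fox jumps over the lazy dog')
ranks.each { |keyword, score| puts "#{keyword}: #{score}" }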

See Also:

  • README

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(**options) ⇒ KeywordExtractor

Returns a new instance of KeywordExtractor.

Parameters:

  • options (Hash)

    a customizable set of options

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results



# File 'lib/text_rank/keyword_extractor.rb', line 42

def initialize(**options)
  @page_rank_options = {
    strategy:  options[:strategy] || :sparse,
    damping:   options[:damping],
    tolerance: options[:tolerance],
  }
  @char_filters = options[:char_filters] || []
  @tokenizers = options[:tokenizers] || [Tokenizer::Word]
  @token_filters = options[:token_filters] || []
  @rank_filters = options[:rank_filters] || []
  @graph_strategy = options[:graph_strategy] || GraphStrategy::Coocurrence
end
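
As a rough sketch of calling the constructor directly; the filter and tokenizer symbols are the same ones used by the presets, while the damping and tolerance values are merely illustrative, not defaults mandated by this class:

extractor = TextRank::KeywordExtractor.new(
  char_filters:  %i[AsciiFolding Lowercase],
  tokenizers:    %i[Word],
  token_filters: %i[Stopwords MinLength],
  strategy:      :sparse,
  damping:       0.85,
  tolerance:     0.0001
)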

Instance Attribute Details

#graph_strategy=(value) ⇒ Class, ... (writeonly)

Sets the graph strategy for producing a graph from tokens

Returns:

  • (Class, Symbol, #build_graph)


# File 'lib/text_rank/keyword_extractor.rb', line 75

def graph_strategy=(value)
  @graph_strategy = value
end
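
A brief sketch of the two accepted forms; MyStrategy is hypothetical, standing in for any object that responds to #build_graph(tokens, graph):

extractor = TextRank::KeywordExtractor.basic
extractor.graph_strategy = :Coocurrence        # resolved by name, as in the presets
# extractor.graph_strategy = MyStrategy.new    # any object responding to #build_graph(tokens, graph)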

Class Method Details

.advanced(**options) ⇒ KeywordExtractor

Creates an "advanced" keyword extractor with a larger set of default filters

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results

Returns:

  • (KeywordExtractor)

# File 'lib/text_rank/keyword_extractor.rb', line 26

def self.advanced(**options)
  new(**{
    char_filters:   %i[AsciiFolding Lowercase StripHtml StripEmail UndoContractions StripPossessive],
    tokenizers:     %i[Url Money Number Word Punctuation],
    token_filters:  %i[PartOfSpeech Stopwords MinLength],
    graph_strategy: :Coocurrence,
    rank_filters:   %i[CollapseAdjacent NormalizeUnitVector SortByValue],
  }.merge(options))
end
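
Because the caller's options are merged on top of the preset, individual settings can be overridden in the same call. A sketch (the :dense strategy and the ten-keyword cutoff are illustrative choices):

extractor = TextRank::KeywordExtractor.advanced(strategy: :dense)
ranks = extractor.extract('Some longer body of text to analyze ...')
ranks.first(10).each { |phrase, rank| puts format('%.4f  %s', rank, phrase) }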

.basic(**options) ⇒ KeywordExtractor

Creates a "basic" keyword extractor with default options

Options Hash (**options):

  • :char_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied prior to tokenization

  • :tokenizers (Array<Symbol, Regexp, String>)

    A list of tokenizer regular expressions to perform tokenization

  • :token_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to each token after tokenization

  • :graph_strategy (Class, Symbol, #build_graph)

    A class or strategy instance for producing a graph from tokens

  • :rank_filters (Array<Class, Symbol, #filter!>)

    A list of filters to be applied to the keyword ranks after keyword extraction

  • :strategy (Symbol)

    PageRank strategy to use (either :sparse or :dense)

  • :damping (Float)

    The probability of following the graph vs. randomly choosing a new node

  • :tolerance (Float)

    The desired accuracy of the results

Returns:

  • (KeywordExtractor)

# File 'lib/text_rank/keyword_extractor.rb', line 14

def self.basic(**options)
  new(**{
    char_filters:   %i[AsciiFolding Lowercase],
    tokenizers:     %i[Word],
    token_filters:  %i[Stopwords MinLength],
    graph_strategy: :Coocurrence,
  }.merge(options))
end
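
As with .advanced, the preset hash is merged with the caller's options, so a single key can be overridden without repeating the rest of the defaults, e.g.:

extractor = TextRank::KeywordExtractor.basic(token_filters: %i[Stopwords])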

Instance Method Details

#add_char_filter(filter, **options) ⇒ nil

Add a new CharFilter for processing text before tokenization

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process text before tokenization

Options Hash (**options):

  • :before (Class, Symbol, Object)

    The existing item to insert the new filter before

  • :at (Fixnum)

    The index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 59

def add_char_filter(filter, **options)
  add_into(@char_filters, filter, **options)
  nil
end
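
A small sketch; the filter names are the ones that appear in the presets above:

extractor = TextRank::KeywordExtractor.basic
extractor.add_char_filter(:StripHtml)                       # appended after the preset filters
extractor.add_char_filter(:StripEmail, before: :StripHtml)  # or positioned relative to an existing one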

#add_rank_filter(filter, **options) ⇒ nil

Add a new RankFilter for processing ranks after they are calculated

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process ranks

Options Hash (**options):

  • :before (Class, Symbol, Object)

    The existing item to insert the new filter before

  • :at (Fixnum)

    The index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 90

def add_rank_filter(filter, **options)
  add_into(@rank_filters, filter, **options)
  nil
end
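
For instance, appending one of the rank filters used by the advanced preset:

extractor = TextRank::KeywordExtractor.basic
extractor.add_rank_filter(:SortByValue)   # sort the final ranks by score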

#add_token_filter(filter, **options) ⇒ nil

Add a new TokenFilter for processing tokens after tokenization

Parameters:

  • filter (Class, Symbol, #filter!)

    A filter to process tokens after tokenization

Options Hash (**options):

  • :before (Class, Symbol, Object)

    The existing item to insert the new filter before

  • :at (Fixnum)

    The index at which to insert the new filter

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 81

def add_token_filter(filter, **options)
  add_into(@token_filters, filter, **options)
  nil
end
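
A sketch using the :at option to run a filter ahead of the existing ones; the filter name comes from the advanced preset:

extractor = TextRank::KeywordExtractor.basic
extractor.add_token_filter(:PartOfSpeech, at: 0)   # run ahead of the preset token filters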

#add_tokenizer(tokenizer, **options) ⇒ nil

Add a tokenizer regular expression for producing tokens from filtered text

Parameters:

  • tokenizer (Symbol, Regexp, String)

    A tokenizer regular expression (or a symbol/string that resolves to one)

Options Hash (**options):

  • :before (Class, Symbol, Object)

    The existing item to insert the new tokenizer before

  • :at (Fixnum)

    The index at which to insert the new tokenizer

Returns:

  • (nil)


# File 'lib/text_rank/keyword_extractor.rb', line 68

def add_tokenizer(tokenizer, **options)
  add_into(@tokenizers, tokenizer, **options)
  nil
end
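
Since plain regular expressions are accepted, a custom token pattern can be prepended so it takes precedence over the built-in ones; the hashtag pattern below is purely illustrative:

extractor = TextRank::KeywordExtractor.basic
extractor.add_tokenizer(/#\w+/, at: 0)   # e.g. treat hashtags as single tokens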

#extract(text, **options) ⇒ Hash<String, Float>

Filters and tokenizes the text, then builds the graph and returns the PageRank scores of the resulting keywords

Parameters:

  • text (String, Array<String>)

    unfiltered text to be processed

Returns:

  • (Hash<String, Float>)

    tokens and page ranks (in descending order)



# File 'lib/text_rank/keyword_extractor.rb', line 107

def extract(text, **options)
  text = Array(text)
  tokens_per_text = text.map do |t|
    tokenize(t)
  end
  graph = PageRank.new(**@page_rank_options)
  strategy = classify(@graph_strategy, context: GraphStrategy)
  tokens_per_text.each do |tokens|
    strategy.build_graph(tokens, graph)
  end
  ranks = graph.calculate(**options)
  tokens_per_text.each_with_index do |tokens, i|
    ranks = apply_rank_filters(ranks, tokens: tokens, original_text: text[i])
  end
  ranks
end
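
As the implementation above shows, passing an array of texts builds a single combined graph, so related documents can be ranked together. A minimal sketch (the two variables are hypothetical strings):

extractor = TextRank::KeywordExtractor.advanced
ranks = extractor.extract([chapter_one_text, chapter_two_text])  # one combined graph for both texts
ranks.first(5)                                                   # highest-ranked keywords first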

#tokenize(text) ⇒ Array<String>

Filters and tokenizes text

Parameters:

  • text (String)

    unfiltered text to be tokenized

Returns:

  • (Array<String>)

    tokens



# File 'lib/text_rank/keyword_extractor.rb', line 98

def tokenize(text)
  filtered_text = apply_char_filters(text)
  tokens = Tokenizer.tokenize(filtered_text, *tokenizer_regular_expressions)
  apply_token_filters(tokens)
end
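
A quick sketch; the exact output depends on the configured char and token filters, so no literal result is shown:

extractor = TextRank::KeywordExtractor.basic
extractor.tokenize('Keyword extraction turns raw text into ranked tokens.')
# => an Array<String> of filtered tokens (exact output depends on the configured filters)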