Class: TextRank::KeywordExtractor
- Inherits: Object
  - Object
  - TextRank::KeywordExtractor
- Defined in: lib/text_rank/keyword_extractor.rb
Overview
Primary class for keyword extraction and hub for filters, tokenizers, and graph strategies that customize how the text is processed and how the TextRank algorithm is applied.
Instance Attribute Summary
- #graph_strategy ⇒ Class, ... (writeonly)
  Sets the graph strategy for producing a graph from tokens.
Class Method Summary
- .advanced(**options) ⇒ KeywordExtractor
  Creates an "advanced" keyword extractor with a larger set of default filters.
- .basic(**options) ⇒ KeywordExtractor
  Creates a "basic" keyword extractor with default options.
Instance Method Summary
- #add_char_filter(filter, **options) ⇒ nil
  Add a new CharFilter for processing text before tokenization.
- #add_rank_filter(filter, **options) ⇒ nil
  Add a new RankFilter for processing ranks after they are calculated.
- #add_token_filter(filter, **options) ⇒ nil
  Add a new TokenFilter for processing tokens after tokenization.
- #add_tokenizer(tokenizer, **options) ⇒ nil
  Add a tokenizer regular expression for producing tokens from filtered text.
- #extract(text, **options) ⇒ Hash<String, Float>
  Filter and tokenize text, then return the resulting PageRank scores.
- #initialize(**options) ⇒ KeywordExtractor (constructor)
  Returns a new instance of KeywordExtractor.
- #tokenize(text) ⇒ Array<String>
  Filters and tokenizes text.
Constructor Details
#initialize(**options) ⇒ KeywordExtractor
Returns a new instance of KeywordExtractor.
    # File 'lib/text_rank/keyword_extractor.rb', line 42

    def initialize(**options)
      @page_rank_options = {
        strategy:  options[:strategy] || :sparse,
        damping:   options[:damping],
        tolerance: options[:tolerance],
      }
      @char_filters = options[:char_filters] || []
      @tokenizers = options[:tokenizers] || [Tokenizer::Word]
      @token_filters = options[:token_filters] || []
      @rank_filters = options[:rank_filters] || []
      @graph_strategy = options[:graph_strategy] || GraphStrategy::Coocurrence
    end
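The fallback behaviour of the constructor can be illustrated with a minimal sketch. `SketchExtractor` is a hypothetical stand-in written for this example, not part of the library, and it uses the symbol `:Word` where the real constructor uses `Tokenizer::Word`. The option values passed in are illustrative only.

```ruby
# Hypothetical stand-in class: it only mirrors the `options[:key] || default`
# fallback pattern used by the constructor above.
class SketchExtractor
  attr_reader :page_rank_options, :char_filters, :tokenizers

  def initialize(**options)
    @page_rank_options = {
      strategy:  options[:strategy] || :sparse, # used when the key is absent or nil
      damping:   options[:damping],             # stays nil unless the caller sets it
      tolerance: options[:tolerance],
    }
    @char_filters = options[:char_filters] || []
    @tokenizers   = options[:tokenizers] || [:Word] # assumption: symbol instead of Tokenizer::Word
  end
end

default = SketchExtractor.new
custom  = SketchExtractor.new(strategy: :dense, damping: 0.85, tokenizers: %i[Url Word])

p default.page_rank_options
p custom.page_rank_options
p custom.tokenizers
```

Because the fallback uses `||`, passing `nil` for a key has the same effect as omitting it.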
Instance Attribute Details
#graph_strategy=(value) ⇒ Class, ... (writeonly)
Sets the graph strategy for producing a graph from tokens
    # File 'lib/text_rank/keyword_extractor.rb', line 75

    def graph_strategy=(value)
      @graph_strategy = value
    end
Class Method Details
.advanced(**options) ⇒ KeywordExtractor
Creates an "advanced" keyword extractor with a larger set of default filters
    # File 'lib/text_rank/keyword_extractor.rb', line 26

    def self.advanced(**options)
      new(**{
        char_filters:   %i[AsciiFolding Lowercase StripHtml StripEmail UndoContractions StripPossessive],
        tokenizers:     %i[Url Money Number Word Punctuation],
        token_filters:  %i[PartOfSpeech Stopwords MinLength],
        graph_strategy: :Coocurrence,
        rank_filters:   %i[CollapseAdjacent NormalizeUnitVector SortByValue],
      }.merge(options))
    end
.basic(**options) ⇒ KeywordExtractor
Creates a "basic" keyword extractor with default options
    # File 'lib/text_rank/keyword_extractor.rb', line 14

    def self.basic(**options)
      new(**{
        char_filters:   %i[AsciiFolding Lowercase],
        tokenizers:     %i[Word],
        token_filters:  %i[Stopwords MinLength],
        graph_strategy: :Coocurrence,
      }.merge(options))
    end
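Both factory methods combine a hash of defaults with the caller's options through `Hash#merge`. The sketch below isolates that step; `basic_options` is a hypothetical helper invented for this example, and the override values are illustrative.

```ruby
# Hypothetical helper that reproduces only the defaults-plus-overrides step
# of .basic; it returns the merged hash instead of building an extractor.
def basic_options(**options)
  {
    char_filters:   %i[AsciiFolding Lowercase],
    tokenizers:     %i[Word],
    token_filters:  %i[Stopwords MinLength],
    graph_strategy: :Coocurrence,
  }.merge(options) # keys supplied by the caller win over the defaults
end

merged = basic_options(token_filters: %i[Stopwords], damping: 0.85)

p merged[:token_filters] # the caller's list replaces the default list
p merged[:char_filters]  # defaults the caller did not mention are kept
p merged[:damping]       # extra keys are passed through
```

Note that `merge` replaces a whole value rather than concatenating arrays, so overriding `token_filters` drops any default filter that is not listed again.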
Instance Method Details
#add_char_filter(filter, **options) ⇒ nil
Add a new CharFilter for processing text before tokenization
    # File 'lib/text_rank/keyword_extractor.rb', line 59

    def add_char_filter(filter, **options)
      add_into(@char_filters, filter, **options)
      nil
    end
#add_rank_filter(filter, **options) ⇒ nil
Add a new RankFilter for processing ranks after they are calculated

    # File 'lib/text_rank/keyword_extractor.rb', line 90

    def add_rank_filter(filter, **options)
      add_into(@rank_filters, filter, **options)
      nil
    end
#add_token_filter(filter, **options) ⇒ nil
Add a new TokenFilter for processing tokens after tokenization
    # File 'lib/text_rank/keyword_extractor.rb', line 81

    def add_token_filter(filter, **options)
      add_into(@token_filters, filter, **options)
      nil
    end
#add_tokenizer(tokenizer, **options) ⇒ nil
Add a tokenizer regular expression for producing tokens from filtered text
    # File 'lib/text_rank/keyword_extractor.rb', line 68

    def add_tokenizer(tokenizer, **options)
      add_into(@tokenizers, tokenizer, **options)
      nil
    end
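The four `add_*` methods share one shape: register a step in an internal list, then return `nil`. The following sketch shows that shape in isolation. `SketchPipeline` is hypothetical, it models filters as lambdas for brevity, and it assumes new steps are appended to the end of the list; the behaviour of the real `add_into` helper is not shown on this page.

```ruby
# Hypothetical miniature of the add_* methods: each call registers a step
# and returns nil, so calls cannot be chained.
class SketchPipeline
  def initialize
    @char_filters = []
  end

  def add_char_filter(filter)
    @char_filters << filter # assumption: steps are appended in call order
    nil
  end

  # Runs the text through every registered filter in order.
  def filter(text)
    @char_filters.reduce(text) { |current, step| step.call(current) }
  end
end

pipeline = SketchPipeline.new
result = pipeline.add_char_filter(->(t) { t.downcase })
pipeline.add_char_filter(->(t) { t.gsub(/<[^>]+>/, '') })

p result                                # nil
p pipeline.filter('<b>Hello</b> World') # "hello world"
```

Returning `nil` keeps the internal arrays private to the extractor, at the cost of requiring one statement per registration.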
#extract(text, **options) ⇒ Hash<String, Float>
Filter and tokenize text, then return the resulting PageRank scores

    # File 'lib/text_rank/keyword_extractor.rb', line 107

    def extract(text, **options)
      text = Array(text)
      tokens_per_text = text.map do |t|
        tokenize(t)
      end
      graph = PageRank.new(**@page_rank_options)
      strategy = classify(@graph_strategy, context: GraphStrategy)
      tokens_per_text.each do |tokens|
        strategy.build_graph(tokens, graph)
      end
      ranks = graph.calculate(**options)
      tokens_per_text.each_with_index do |tokens, i|
        ranks = apply_rank_filters(ranks, tokens: tokens, original_text: text[i])
      end
      ranks
    end
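The flow of `extract` (tokenize each text, add its tokens to one shared graph, then rank the graph) can be approximated with a self-contained sketch. This is not the library's implementation: `STOPWORDS`, `sketch_tokenize`, `sketch_build_graph`, `sketch_page_rank`, and `sketch_extract` are hypothetical names, the window size of 2 and the damping factor of 0.85 are assumed values, and rank filters are reduced to a final sort.

```ruby
# Simplified sketch of the extract flow. Not the gem's code.
STOPWORDS = %w[the a of and is in to for].freeze

# Lowercase, split into words, drop stopwords.
def sketch_tokenize(text)
  text.downcase.scan(/[a-z]+/).reject { |word| STOPWORDS.include?(word) }
end

# Link each token to the distinct tokens that follow it within `window` positions.
def sketch_build_graph(tokens, graph, window: 2)
  tokens.each_with_index do |token, i|
    tokens[(i + 1)..(i + window)].to_a.each do |other|
      next if other == token

      graph[token][other] += 1.0
      graph[other][token] += 1.0
    end
  end
end

# Weighted PageRank by power iteration over an undirected graph.
def sketch_page_rank(graph, damping: 0.85, tolerance: 1e-6, max_iterations: 100)
  nodes = graph.keys
  ranks = nodes.to_h { |node| [node, 1.0 / nodes.size] }
  max_iterations.times do
    updated = nodes.to_h do |node|
      incoming = graph[node].sum do |neighbor, weight|
        ranks[neighbor] * weight / graph[neighbor].values.sum
      end
      [node, (1 - damping) / nodes.size + damping * incoming]
    end
    delta = nodes.sum { |node| (updated[node] - ranks[node]).abs }
    ranks = updated
    break if delta < tolerance
  end
  ranks
end

# Accepts one text or an array of texts, like the Array(text) call above.
def sketch_extract(texts)
  graph = Hash.new { |hash, key| hash[key] = Hash.new(0.0) }
  Array(texts).each { |text| sketch_build_graph(sketch_tokenize(text), graph) }
  sketch_page_rank(graph).sort_by { |_, score| -score }.to_h
end

p sketch_extract('graph ranking of words in a graph of words')
```

In this sketch every text contributes edges to the same graph before ranking, which mirrors how the method above builds one `PageRank` instance for all texts and calls `calculate` once.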