Class: Wikipedia::VandalismDetection::WikitextExtractor

Inherits:
Object
Defined in:
lib/wikipedia/vandalism_detection/wikitext_extractor.rb

Overview

This class wraps the de.webis.sweble.WikitextExtractor Java class and provides methods to extract plain text from wiki markup, either preserving the original spacing or cleaned of line breaks and redundant whitespace.

Constant Summary

REDIRECT =
'#REDIRECT'

Class Method Details

.extract(wiki_text) ⇒ Object

Returns the text extracted from the given wiki markup, preserving spacing and adding section numbers. Raises a WikitextExtractionError if the extraction fails.



# File 'lib/wikipedia/vandalism_detection/wikitext_extractor.rb', line 32

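def self.extract(wiki_text)
  # Treat the input as raw bytes and drop every byte that has no UTF-8
  # mapping from binary (in practice, all non-ASCII bytes).
  wiki_text = wiki_text.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
  # Remove the redirect marker before handing the text to the extractor.
  wiki_text = wiki_text.gsub(REDIRECT, '')

  WikitextExtractor.new.extract(wiki_text)
rescue => exception
  # Re-raise any failure as a WikitextExtractionError with the caller's backtrace.
  raise WikitextExtractionError, "Wikitext extraction failed: \n#{exception.message}", caller
end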
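
The preprocessing that .extract applies before calling the Java extractor can be tried in isolation. The following is a minimal sketch in plain Ruby, not part of the library: the constant REDIRECT_MARKER and the method preprocess are hypothetical names used only for illustration. It shows that transcoding from 'binary' to 'UTF-8' with replace: '' strips every non-ASCII byte, and that the redirect marker is removed afterwards.

```ruby
# Hypothetical stand-in for the class constant REDIRECT.
REDIRECT_MARKER = '#REDIRECT'

# Mirrors the two preprocessing steps of .extract (illustrative helper, not library API).
def preprocess(wiki_text)
  # Interpret the string as raw bytes; bytes >= 0x80 have no mapping from
  # binary to UTF-8, so they are replaced with the empty string.
  ascii_only = wiki_text.encode('UTF-8', 'binary',
                                invalid: :replace, undef: :replace, replace: '')
  # Drop the redirect marker, keeping the rest of the markup.
  ascii_only.gsub(REDIRECT_MARKER, '')
end

puts preprocess("#REDIRECT [[Caf\u00E9]]").inspect
```

Note that this step is lossy for non-ASCII input: accented or non-Latin characters are dropped before extraction rather than preserved.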

.extract_clean(wiki_text) ⇒ Object

Returns the cleaned text extracted from the given wiki markup. Cleaned means a single string without line breaks, URIs, special signs, repeated spaces, or section numbers.



# File 'lib/wikipedia/vandalism_detection/wikitext_extractor.rb', line 45

def self.extract_clean(wiki_text)
  wiki_text = extract wiki_text

  # Apply the cleaning steps in order; spaces are collapsed last so that
  # gaps left by the earlier removals are also normalized.
  wiki_text = remove_section_numbering_from wiki_text
  wiki_text = remove_line_breaks_from wiki_text
  wiki_text = remove_uris_from wiki_text
  wiki_text = remove_special_signes_from wiki_text
  wiki_text = remove_multiple_spaces_from wiki_text
  wiki_text.strip
end
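
The helper methods used by .extract_clean are not shown on this page, so their exact behavior is unknown. As a rough sketch of what such a pipeline might look like, the following plain-Ruby method chains comparable steps. Every regular expression here is an assumption made for illustration (for example, that section numbers look like "1." or "2.1." at the start of a line, and that "special signs" means leftover markup characters); the method name clean_text is hypothetical as well.

```ruby
# Illustrative cleaning pipeline; all patterns are assumptions, not the library's.
def clean_text(text)
  # Assumed section numbering: "1. " or "2.1. " at the beginning of a line.
  text = text.gsub(/^[ \t]*\d+(?:\.\d+)*\.[ \t]+/, '')
  # Replace each run of line breaks with a single space.
  text = text.gsub(/\R+/, ' ')
  # Remove http, https and ftp URIs up to the next whitespace.
  text = text.gsub(%r{\b(?:https?|ftp)://\S+}, '')
  # Assumed "special signs": leftover markup characters.
  text = text.gsub(/[\[\]{}|=*#<>]/, '')
  # Collapse repeated spaces and tabs left behind by the removals above.
  text = text.gsub(/[ \t]{2,}/, ' ')
  text.strip
end

sample = "1. History\nSee http://example.com  for [[details]].\n\n1.1. Early years\nText."
puts clean_text(sample)
```

Collapsing spaces as the last step before strip matters in this sketch, because removing a URI or markup characters can leave several adjacent spaces behind.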