Class: Wikipedia::VandalismDetection::WikitextExtractor
- Inherits:
  - Object
    - Wikipedia::VandalismDetection::WikitextExtractor
- Defined in:
  - lib/wikipedia/vandalism_detection/wikitext_extractor.rb
Overview
This class wraps the de.webis.sweble.WikitextExtractor Java class and provides methods to extract plain text from wiki markup, either preserving spacing or cleaned of line breaks and surplus whitespace.
Constant Summary
- REDIRECT = '#REDIRECT'
Class Method Summary
- .extract(wiki_text) ⇒ Object
  Returns the text extracted from the given wiki markup, preserving spacing and adding section numbers.
- .extract_clean(wiki_text) ⇒ Object
  Returns the cleaned text extracted from the given wiki markup.
Class Method Details
.extract(wiki_text) ⇒ Object
Returns the text extracted from the given wiki markup, preserving spacing and adding section numbers.
    # File 'lib/wikipedia/vandalism_detection/wikitext_extractor.rb', line 32
    def self.extract(wiki_text)
      begin
        # Drop bytes that cannot be converted to UTF-8.
        wiki_text = wiki_text.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
        # Remove the redirect marker before handing the text to the Java extractor.
        wiki_text = wiki_text.gsub(REDIRECT, '')
        WikitextExtractor.new.extract(wiki_text)
      rescue => exception
        raise WikitextExtractionError, "Wikitext extraction failed: \n#{exception.message}", caller
      end
    end
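The sanitizing and error-wrapping steps of .extract can be tried without the Java backend. The following is a minimal sketch, not the library's code: `ExtractSketch`, `ExtractSketch::Error`, and the `backend` callable are hypothetical stand-ins for the class, for WikitextExtractionError, and for the wrapped Java extractor.

```ruby
# Hypothetical sketch; ExtractSketch, Error and `backend` are illustrative
# names and do not exist in the library.
module ExtractSketch
  class Error < StandardError; end

  REDIRECT = '#REDIRECT'

  module_function

  # Mirrors the two preprocessing lines of .extract.
  def sanitize(wiki_text)
    # Treating the input as binary means bytes outside the ASCII range have
    # no UTF-8 mapping, so they are replaced with the empty string.
    text = wiki_text.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
    # String patterns in gsub are matched literally and case-sensitively.
    text.gsub(REDIRECT, '')
  end

  # `backend` is any callable standing in for the Java extractor.
  def extract(wiki_text, backend)
    backend.call(sanitize(wiki_text))
  rescue => exception
    # Re-raise as a single error type, keeping the caller's backtrace.
    raise Error, "Wikitext extraction failed: \n#{exception.message}", caller
  end
end

identity = ->(text) { text }
puts ExtractSketch.extract('#REDIRECT [[Main Page]]', identity).inspect
```

Two consequences of this preprocessing are worth knowing. Because the source encoding is declared as binary, non-ASCII characters in the input are stripped rather than preserved. And because the marker is matched literally, a lowercase `#redirect` is left in the text.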
.extract_clean(wiki_text) ⇒ Object
Returns the cleaned text extracted from the given wiki markup. "Cleaned" means a single string with section numbers, line breaks, URIs, special signs, and repeated spaces removed.
    # File 'lib/wikipedia/vandalism_detection/wikitext_extractor.rb', line 45
    def self.extract_clean(wiki_text)
      wiki_text = extract wiki_text
      wiki_text = remove_section_numbering_from wiki_text
      wiki_text = remove_line_breaks_from wiki_text
      wiki_text = remove_uris_from wiki_text
      wiki_text = remove_special_signes_from wiki_text
      wiki_text = remove_multiple_spaces_from wiki_text
      wiki_text = wiki_text.strip
    end
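The cleaning steps above form a simple pipeline in which each helper takes a string and returns a new one. The helper bodies are not shown on this page, so the sketch below uses assumed regular expressions purely to illustrate the order of operations. `CleanSketch` and all patterns in it are hypothetical, and the special-signs step is left out because its definition is unknown.

```ruby
# Hypothetical sketch; the patterns are assumptions, not the library's
# actual implementation of its private helpers.
module CleanSketch
  module_function

  # Assumes section numbers look like "1." or "2.3." at the start of a line.
  def remove_section_numbering_from(text)
    text.gsub(/^\d+(\.\d+)*\.\s+/, '')
  end

  # Replaces each line break with a single space.
  def remove_line_breaks_from(text)
    text.gsub(/\r?\n/, ' ')
  end

  # Assumes a URI is a scheme followed by "://" and non-space characters.
  def remove_uris_from(text)
    text.gsub(%r{\b\w+://\S+}, '')
  end

  # Collapses runs of whitespace into one space.
  def remove_multiple_spaces_from(text)
    text.gsub(/\s{2,}/, ' ')
  end

  # Applies the steps in the same order as extract_clean, then strips.
  def clean(text)
    text = remove_section_numbering_from(text)
    text = remove_line_breaks_from(text)
    text = remove_uris_from(text)
    text = remove_multiple_spaces_from(text)
    text.strip
  end
end

puts CleanSketch.clean("1. Intro\nSee http://example.org  for details.\n")
```

The order matters in this sketch: section numbers are removed while line starts still exist, and spaces are collapsed last so that gaps left by removed URIs and line breaks are cleaned up in one pass.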