Class: WikiParser
- Inherits: Object
  - Object
    - WikiParser
- Defined in: lib/wikiParser.rb, lib/wikiParserPage.rb
Overview
Parses a Wikipedia dump and extracts internal links, content, and page type.
Defined Under Namespace
Classes: Page
Constant Summary
- LanguageNodePropertyName = "xml:lang"
Instance Attribute Summary
- #language ⇒ Object (readonly)
  Language of the dump (e.g. “en”, “fr”, “ru”).
- #path ⇒ Object (readonly)
  Path to the Wikipedia dump.
Instance Method Summary
- #close ⇒ Object
  Closes the file reader.
- #get_language ⇒ String
  Obtains the language by reading the ‘xml:lang’ attribute of the mediawiki element in the dump.
- #get_next_page(opts = {}) ⇒ WikiParser::Page, NilClass
  Reads forward to the next page element in the XML tree and returns it as a WikiParser::Page, or nil when the dump is exhausted.
- #initialize(opts = {}) ⇒ Enumerator<Nokogiri::XML::Node> (constructor)
  Opens the dump at opts[:path], prepares an enumerator over its XML nodes, and reads the dump language.
- #prepare_enumerator ⇒ Enumerator<Nokogiri::XML::Node>
  Opens the dump at the stored path and converts it to an enumerator of XML nodes.
- #skip ⇒ Object
  Skips a Page in the enumeration.
Constructor Details
#initialize(opts = {}) ⇒ Enumerator<Nokogiri::XML::Node>
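Opens the dump at opts[:path], prepares an enumerator over its XML nodes, and reads the dump language. Raises ArgumentError if the path does not exist or is a directory.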
    # File 'lib/wikiParser.rb', line 27
    def initialize (opts = {})
      @file, new_path = nil, opts[:path]
      if File.exists? new_path and !File.directory? new_path
        @path = new_path
        prepare_enumerator
        get_language
      else
        raise ArgumentError.new "Cannot open file. Check path please."
      end
    end
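The path check in the constructor can be tried in isolation. The sketch below is an illustration only: `validate_dump_path` is a hypothetical helper, not part of WikiParser. It uses `File.file?`, which is slightly stricter than the original test (it also rejects sockets and devices), and it avoids `File.exists?`, which was removed in Ruby 3.2.

```ruby
require "tempfile"
require "tmpdir"

# Hypothetical helper (not part of WikiParser): accepts only an existing regular file.
def validate_dump_path(path)
  # File.file? is false for missing paths and for directories.
  unless path.is_a?(String) && File.file?(path)
    raise ArgumentError, "Cannot open file. Check path please."
  end
  path
end

# Usage: a temporary file passes the check and is returned unchanged.
Tempfile.create("dump") do |file|
  puts validate_dump_path(file.path) == file.path
end

# A directory is rejected with the same message the constructor uses.
begin
  validate_dump_path(Dir.tmpdir)
rescue ArgumentError => e
  puts e.message
end
```

Guarding on `String` first means a missing `:path` option surfaces as the same ArgumentError instead of a TypeError from the file test.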
Instance Attribute Details
#language ⇒ Object (readonly)
Language of the dump (e.g. “en”, “fr”, “ru”).
    # File 'lib/wikiParser.rb', line 13
    def language
      @language
    end
#path ⇒ Object (readonly)
Path to the Wikipedia dump.
    # File 'lib/wikiParser.rb', line 11
    def path
      @path
    end
Instance Method Details
#close ⇒ Object
Closes the file reader.
    # File 'lib/wikiParser.rb', line 39
    def close; @xml_file.close if @xml_file; end
#get_language ⇒ String
Obtains the language by reading the ‘xml:lang’ attribute of the mediawiki element in the dump. Returns nil if the reader is exhausted before that element is found.
    # File 'lib/wikiParser.rb', line 56
    def get_language
      begin
        node = @reader.next
        if node.name == "mediawiki" and node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
          @language = node.attribute(LanguageNodePropertyName)
        else
          get_language
        end
      rescue StopIteration, NoMethodError
        nil
      end
    end
#get_next_page(opts = {}) ⇒ WikiParser::Page, NilClass
Reads the next node in the xml tree and returns it as a Page if it exists.
    # File 'lib/wikiParser.rb', line 74
    def get_next_page(opts={})
      begin
        node = @reader.next
        if node.name == "page" and node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
          xml = Nokogiri::XML::parse("<page>"+node.inner_xml+"</page>").first_element_child
          return WikiParser::Page.new({:node => xml, :language => @language}.merge(opts))
        else
          get_next_page(opts)
        end
      rescue StopIteration, NoMethodError
        nil
      end
    end
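Because get_next_page returns nil once the dump is exhausted, a caller can drive it from a plain loop. A minimal sketch, assuming only the behaviour shown above: `pages` is a hypothetical helper that is not part of WikiParser, and a stub object stands in for a real parser so the snippet runs without a dump file.

```ruby
# Hypothetical helper (not part of WikiParser): wraps repeated get_next_page
# calls in a lazy Enumerator that stops at the first nil.
def pages(parser, opts = {})
  Enumerator.new do |yielder|
    while (page = parser.get_next_page(opts))
      yielder << page
    end
  end
end

# Stand-in for a real parser; it hands out two placeholder pages, then nil.
remaining = ["first page", "second page"]
stub_parser = Object.new
stub_parser.define_singleton_method(:get_next_page) { |_opts = {}| remaining.shift }

pages(stub_parser).each { |page| puts page }
```

With a real parser, the same helper would yield WikiParser::Page objects, and close should still be called once the enumeration is finished.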
#prepare_enumerator ⇒ Enumerator<Nokogiri::XML::Node>
Opens the dump at the stored path and converts it to an enumerator of XML nodes. Paths matching .bz2 are read through Bzip2::Reader.
    # File 'lib/wikiParser.rb', line 17
    def prepare_enumerator
      @xml_file = File.open(@path)
      @file = Nokogiri::XML::Reader((@path.match(/.+\.bz2/) ? (require 'bzip2';Bzip2::Reader.open(@path)) : @xml_file), nil, 'utf-8', Nokogiri::XML::ParseOptions::NOERROR)
      @reader = @file.to_enum
    end
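The compression check in prepare_enumerator uses an unanchored pattern, so any path that contains `.bz2` after at least one character selects the Bzip2 reader. The sketch below compares that pattern with a stricter test on the final extension. `bzip2_dump?` is a hypothetical name and the file names are made up for illustration.

```ruby
# Hypothetical helper (not part of WikiParser): true only when the final
# extension is .bz2, ignoring case.
def bzip2_dump?(path)
  File.extname(path).casecmp?(".bz2")
end

original_pattern = /.+\.bz2/ # the pattern used by prepare_enumerator

["dump.xml.bz2", "dump.xml", "dump.bz2.xml"].each do |name|
  pattern_hit = !(name =~ original_pattern).nil?
  puts "#{name}: pattern=#{pattern_hit} extname=#{bzip2_dump?(name)}"
end
```

The two tests disagree only on names such as `dump.bz2.xml`, where `.bz2` appears in the middle of the file name.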
#skip ⇒ Object
Skips a Page in the enumeration: advances the reader to the next page element without building a Page object.
    # File 'lib/wikiParser.rb', line 42
    def skip
      begin
        node = @reader.next
        if node.name == "page" and node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
        else
          skip
        end
      rescue StopIteration
        nil
      end
    end
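skip, get_language and get_next_page each call themselves once for every node that does not match. Ruby does not eliminate tail calls by default, so a long run of non-matching nodes could in principle grow the stack. The sketch below shows the same scan written as a loop. `next_element_named`, `Node` and the numeric node types are illustrative stand-ins rather than library code; in WikiParser the nodes come from Nokogiri::XML::Reader.

```ruby
# Stand-in for a reader node; only the two fields the scan inspects.
Node = Struct.new(:name, :node_type)

# Stand-in node types, assumed here for illustration.
TYPE_ELEMENT = 1
TYPE_END_ELEMENT = 15

# Hypothetical iterative variant of the scan: loops instead of recursing.
def next_element_named(reader, name)
  loop do
    node = reader.next # raises StopIteration when the enumerator is exhausted
    return node if node.name == name && node.node_type == TYPE_ELEMENT
  end
  nil # Kernel#loop rescues StopIteration, so execution continues here at end of input
end

reader = [
  Node.new("mediawiki", TYPE_ELEMENT),
  Node.new("page", TYPE_ELEMENT),
  Node.new("page", TYPE_END_ELEMENT)
].each

first = next_element_named(reader, "page")
puts first.name
puts next_element_named(reader, "page").inspect
```

The loop keeps the stack depth constant regardless of how many nodes separate two page elements, at the cost of no longer mirroring the recursive structure of the original methods.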