Class: WikiParser

Inherits:
Object
Defined in:
lib/wikiParser.rb,
lib/wikiParserPage.rb

Overview

Parses a Wikipedia dump and extracts internal links, content, and page type.

Defined Under Namespace

Classes: Page

Constant Summary

LanguageNodePropertyName =
"xml:lang"

Constructor Details

#initialize(opts = {}) ⇒ Enumerator<Nokogiri::XML::Node>

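Opens the Wikipedia dump at the given path, prepares an enumerator over its XML nodes, and reads the dump's language. Raises ArgumentError if the path does not exist or is a directory.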

Parameters:

  • opts (Hash) (defaults to: {})

    the options used to open the Wikipedia dump.

Options Hash (opts):

  • :path (String)

    The path to the Wikipedia dump in .xml or .bz2 format.



# File 'lib/wikiParser.rb', line 27

def initialize(opts = {})
	@file, new_path = nil, opts[:path]
	# Accept only an existing path that is not a directory.
	if File.exist? new_path and !File.directory? new_path
		@path = new_path
		prepare_enumerator
		get_language
	else
		raise ArgumentError.new "Cannot open file. Check path please."
	end
end
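
The path check performed by the constructor can be tried in isolation. The following is a minimal sketch, not the library's code: `usable_dump_path?` is a hypothetical helper, and its guard against a missing `:path` value is an assumption about desirable behavior rather than something the constructor does.

```ruby
require 'tmpdir'

# Hypothetical helper (not part of WikiParser): true only when +path+ names
# an existing entry that is not a directory.
def usable_dump_path?(path)
  return false if path.nil? # assumption: treat a missing :path as unusable
  File.exist?(path) && !File.directory?(path)
end

Dir.mktmpdir do |dir|
  dump = File.join(dir, "sample.xml")
  File.write(dump, "<mediawiki/>")

  puts usable_dump_path?(dump)                      # an existing file
  puts usable_dump_path?(dir)                       # a directory
  puts usable_dump_path?(File.join(dir, "missing")) # nothing at this path
end
```

Only the first call should report a usable path; the constructor raises ArgumentError in the other two situations.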

Instance Attribute Details

#language ⇒ Object (readonly)

Language of the dump (e.g. “en”, “fr”, “ru”).



# File 'lib/wikiParser.rb', line 13

def language
  @language
end

#path ⇒ Object (readonly)

Path to the Wikipedia dump.



# File 'lib/wikiParser.rb', line 11

def path
  @path
end

Instance Method Details

#close ⇒ Object

Closes the dump file opened by the parser, if one is open.



# File 'lib/wikiParser.rb', line 39

def close; @xml_file.close if @xml_file; end

#get_language ⇒ String

Obtains the language by reading the ‘xml:lang’ attribute of the ‘mediawiki’ element in the XML of the dump.

Returns:

  • (String)

    the language of the dump.



# File 'lib/wikiParser.rb', line 56

def get_language
	begin
		node = @reader.next
		if node.name == "mediawiki" and node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
			@language = node.attribute(LanguageNodePropertyName)
		else
			get_language
		end
	rescue StopIteration, NoMethodError
		nil
	end
end

#get_next_page(opts = {}) ⇒ WikiParser::Page, NilClass

Reads the next node in the xml tree and returns it as a Page if it exists.

Parameters:

  • opts (Hash) (defaults to: {})

    the parameters to instantiate a page.

Options Hash (opts):

  • :until (String)

    The name of the node at which parsing of the page stops. Useful for checking a property without parsing the entire page.

Returns:

  • (WikiParser::Page, NilClass)

    the next page, or nil once no further page can be read from the dump.



# File 'lib/wikiParser.rb', line 74

def get_next_page(opts={})
	begin
		node = @reader.next
		if node.name == "page" and node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
			xml = Nokogiri::XML::parse("<page>"+node.inner_xml+"</page>").first_element_child
			return WikiParser::Page.new({:node => xml, :language => @language}.merge(opts))
		else
			get_next_page(opts)
		end
	rescue StopIteration, NoMethodError
		nil
	end
end
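
The options given to get_next_page are merged into the hash passed to Page.new, so an option such as :until is forwarded unchanged, and a caller-supplied :node or :language would replace the value the parser fills in. A minimal sketch of that merge, assuming a hypothetical helper `page_options` and placeholder string values in place of real Nokogiri nodes:

```ruby
# Hypothetical helper mirroring the merge in get_next_page; the arguments
# used below are placeholders, not real Nokogiri nodes.
def page_options(node, language, opts = {})
  { :node => node, :language => language }.merge(opts)
end

with_until = page_options("<page/>", "en", { :until => "title" })
overridden = page_options("<page/>", "en", { :language => "fr" })

p with_until # the parser's defaults plus the forwarded :until option
p overridden # the caller's :language replaces the parser's value
```

Because Hash#merge prefers the argument's values on key conflicts, passing :node or :language to get_next_page is best avoided unless overriding them is intended.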

#prepare_enumerator ⇒ Enumerator<Nokogiri::XML::Node>

Opens the dump at the stored path and wraps it in an enumerator of XML nodes. Dumps whose path ends in .bz2 are read through Bzip2::Reader.

Returns:

  • (Enumerator<Nokogiri::XML::Node>)

    the enumerator.



# File 'lib/wikiParser.rb', line 17

def prepare_enumerator
	@xml_file = File.open(@path)
	# Read .bz2 dumps through the bzip2 gem; plain XML is read from the open file.
	source = @path.match(/.+\.bz2/) ? (require 'bzip2'; Bzip2::Reader.open(@path)) : @xml_file
	@file = Nokogiri::XML::Reader(source, nil, 'utf-8', Nokogiri::XML::ParseOptions::NOERROR)
	@reader = @file.to_enum
end
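
prepare_enumerator selects the bzip2 reader when the path matches /.+\.bz2/. That pattern is not anchored to the end of the string, so it also matches paths that merely contain ".bz2". The sketch below compares it with a check on the final extension; `bz2_by_extension?` is a hypothetical alternative, not part of the library, and the sample paths are made up.

```ruby
# The pattern used in prepare_enumerator, copied as-is.
UNANCHORED_BZ2 = /.+\.bz2/

# Hypothetical stricter check: only the final extension counts.
def bz2_by_extension?(path)
  File.extname(path) == ".bz2"
end

# Made-up sample paths; none of them needs to exist on disk.
SAMPLE_PATHS = [
  "enwiki-pages.xml.bz2",
  "enwiki-pages.xml",
  "backup.bz2.d/pages.xml"
]

SAMPLE_PATHS.each do |path|
  by_pattern = !path.match(UNANCHORED_BZ2).nil?
  puts "#{path}: pattern=#{by_pattern} extension=#{bz2_by_extension?(path)}"
end
```

The two checks agree on the first two paths. They differ on the third, where a directory name contains ".bz2" but the file itself is plain XML.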

#skip ⇒ Object

Skips the next Page in the enumeration without building it.



# File 'lib/wikiParser.rb', line 42

def skip
	begin
		node = @reader.next
		# Keep advancing until the opening tag of the next page has been consumed.
		skip unless node.name == "page" and node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
	rescue StopIteration
		nil
	end
end
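
skip, get_next_page, and get_language share one pattern: pull nodes from an external enumerator with `next` until one matches, and treat StopIteration as the end of the dump. Below is a minimal sketch of that pattern, written as a loop instead of recursion. `FakeNode`, `ELEMENT`, `END_ELEMENT`, and `next_page_node` are hypothetical stand-ins so that the example runs without Nokogiri; the numeric type values are arbitrary.

```ruby
# Stand-in for the nodes yielded by the XML reader (assumption: only the two
# fields the parser inspects are modelled).
FakeNode = Struct.new(:name, :node_type)
ELEMENT = 1      # plays the role of Nokogiri::XML::Reader::TYPE_ELEMENT
END_ELEMENT = 2  # arbitrary value for "closing tag" in this sketch

# Hypothetical helper: advance +reader+ to the next opening "page" element.
# Returns that node, or nil once the enumerator is exhausted.
def next_page_node(reader)
  loop do
    node = reader.next
    return node if node.name == "page" && node.node_type == ELEMENT
  end
  nil # Kernel#loop rescues StopIteration, so the end of input lands here
end

nodes = [
  FakeNode.new("mediawiki", ELEMENT),
  FakeNode.new("siteinfo", ELEMENT),
  FakeNode.new("page", ELEMENT),
  FakeNode.new("title", ELEMENT),
  FakeNode.new("page", END_ELEMENT), # closing tag: same name, other type
  FakeNode.new("page", ELEMENT)
]

reader = nodes.to_enum
first  = next_page_node(reader) # third node in the list
second = next_page_node(reader) # last node; the closing tag is ignored
third  = next_page_node(reader) # nil: nothing left to read
```

The loop form keeps the call stack flat however many non-page nodes sit between two pages, whereas the recursive form adds one frame per skipped node.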