Class: WikiParser::Page

Inherits:
Object
  • Object
Defined in:
lib/wikiParserPage.rb

Overview

A Wikipedia article page object.

Constant Summary

Namespaces =

The Wikipedia namespaces that mark a page as special (see #special_page, #page_type).

%w(WP Aide Help Talk User Template Wikipedia File Book Portal Portail TimedText Module MediaWiki Special Spécial Media Category Catégorie [^:]+)
Disambiguation =
["disambiguation","homonymie", "значения", "disambigua", "peker", "ujednoznacznienie", "olika betydelser", "Begriffsklärung", "desambiguación"]
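
The entries in Namespaces read like regular-expression alternatives (note the trailing `[^:]+`), so one plausible use is to join them into a title-prefix pattern. The sketch below is an assumption about how such constants could be applied, not the library's actual code: `NAMESPACE_PATTERN`, `namespace_of` and `disambiguation_marker?` are hypothetical names, and the source does not show whether the library checks titles or article text.

```ruby
# Hypothetical sketch: the constants are copied from the summary above,
# but the helpers below are NOT part of WikiParser::Page.
NAMESPACES = %w(WP Aide Help Talk User Template Wikipedia File Book Portal Portail TimedText Module MediaWiki Special Spécial Media Category Catégorie [^:]+)
DISAMBIGUATION = ["disambiguation", "homonymie", "значения", "disambigua", "peker", "ujednoznacznienie", "olika betydelser", "Begriffsklärung", "desambiguación"]

# Assumed usage: treat each entry as a regex alternative for a "Prefix:" title.
NAMESPACE_PATTERN = Regexp.new("\\A(?<namespace>#{NAMESPACES.join('|')}):")

# Returns the namespace prefix of a title, or nil when there is none.
def namespace_of(title)
  match = NAMESPACE_PATTERN.match(title)
  match && match[:namespace]
end

# Returns true when the text contains one of the disambiguation markers.
def disambiguation_marker?(text)
  lowered = text.downcase
  DISAMBIGUATION.any? { |word| lowered.include?(word.downcase) }
end

namespace_of("Help:Contents")                      # => "Help"
namespace_of("Ruby (programming language)")        # => nil
disambiguation_marker?("Mercury (disambiguation)") # => true
```

Because of the `[^:]+` catch-all, any title containing a colon is reported as namespaced under this reading, including ordinary titles such as "Star Wars: Episode I".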

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(opts = {}) ⇒ Page

Create a new article page from an XML node.

Parameters:

  • opts (Hash) (defaults to: {})

    the parameters to instantiate a page.

Options Hash (opts):

  • :node (Nokogiri::XML::Node)

    the Nokogiri::XML::Node containing the article.

  • :from (Fixnum)

    the index from which to resume parsing among the nodes.

  • :until (String)

    A node-name stopping point for the parsing.

  • :language (String)

    The language of the dump this article was read from.



# File 'lib/wikiParserPage.rb', line 34

def initialize(opts = {})
	@language = opts[:language]
	@title    = @article      = @redirect_title      = ""
	@redirect = @special_page = @disambiguation_page = false
	@internal_links, @page_type = [], nil
	# Without a node there is nothing to parse; keep the defaults above.
	return if opts[:node].nil?
	process_node opts
	@internal_links = article_to_internal_links(@article)
end

Instance Attribute Details

#article ⇒ Object (readonly)

The content of the Wikipedia article.



# File 'lib/wikiParserPage.rb', line 16

def article
  @article
end

#disambiguation_page ⇒ Object (readonly)

Returns the value of attribute disambiguation_page.



# File 'lib/wikiParserPage.rb', line 25

def disambiguation_page
  @disambiguation_page
end

#id ⇒ Object (readonly)

The Wikipedia id of the article.



# File 'lib/wikiParserPage.rb', line 13

def id
  @id
end

#internal_links ⇒ Object (readonly)

Returns the value of attribute internal_links.



# File 'lib/wikiParserPage.rb', line 14

def internal_links
  @internal_links
end

#language ⇒ Object (readonly)

Returns the value of attribute language.



# File 'lib/wikiParserPage.rb', line 26

def language
  @language
end

#page_type ⇒ Object (readonly)

The Wikipedia namespace for this page.



# File 'lib/wikiParserPage.rb', line 22

def page_type
  @page_type
end

#redirect ⇒ Object (readonly)

Is this page a redirection page?



# File 'lib/wikiParserPage.rb', line 18

def redirect
  @redirect
end

#redirect_title ⇒ Object (readonly)

The title of the page this article redirects to.



# File 'lib/wikiParserPage.rb', line 20

def redirect_title
  @redirect_title
end

#special_page ⇒ Object (readonly)

Is this page `special`, i.e. does its title fall under one of the Namespaces?



# File 'lib/wikiParserPage.rb', line 24

def special_page
  @special_page
end

#title ⇒ Object (readonly)

Title of the Wikipedia article.



# File 'lib/wikiParserPage.rb', line 11

def title
  @title
end

Instance Method Details

#article_to_internal_links(article) ⇒ Array<Hash>

Extracts internal links from a Wikipedia article into an array of hashes, each holding a `uri` and a `title`.

Parameters:

  • article (String)

    the article content to extract links from.

Returns:

  • (Array<Hash>)

    the internal links in hash form.



# File 'lib/wikiParserPage.rb', line 88

def article_to_internal_links article
	links = []
	# Match [[Target]] and [[Target|Label]]; targets containing ':' never match.
	article.scan(/\[\[(?<name>[^\]\|:]+)(?<trigger>\|[^\]]+)?\]\]/).each do |match|
		# Separate the link target from an optional #section anchor.
		name_match = match[0].strip.match(/^(?<name>[^#]+)(?<hashtag>#.+)?/)
		next unless name_match
		# The label after the pipe, when present, supplies the title.
		link_match = match[1] ? match[1].strip.match(/^\|[\s\/]*(?<name>[^#]+)(?<hashtag>#.+)?/) : name_match
		uri   = name_match[:name].gsub('_', ' ')
		title = link_match ? link_match[:name] : uri
		links << {:uri => uri, :title => {@language => title}}
	end
	links
end
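
To make the shape of the returned hashes concrete, here is a standalone sketch that reuses the same regular expressions outside the class. `extract_internal_links` and its `language` parameter are hypothetical stand-ins for the instance method and `@language`; the sample wikitext is invented for illustration.

```ruby
# Standalone sketch of the extraction above; not part of WikiParser::Page.
LINK_PATTERN = /\[\[(?<name>[^\]\|:]+)(?<trigger>\|[^\]]+)?\]\]/

def extract_internal_links(article, language)
  article.scan(LINK_PATTERN).each_with_object([]) do |(target, label), links|
    # Drop an optional #section anchor from the link target.
    name_match = target.strip.match(/^(?<name>[^#]+)(?<hashtag>#.+)?/)
    next unless name_match
    # Use the text after the pipe as the title when a label is present.
    link_match = label ? label.strip.match(/^\|[\s\/]*(?<name>[^#]+)(?<hashtag>#.+)?/) : name_match
    uri   = name_match[:name].gsub('_', ' ')
    title = link_match ? link_match[:name] : uri
    links << {:uri => uri, :title => {language => title}}
  end
end

text = "[[Ruby_(programming_language)|Ruby]] was designed by [[Yukihiro Matsumoto]]. " \
       "See [[File:Ruby logo.svg]] and [[Perl#History|Perl's history]]."

links = extract_internal_links(text, "en")
# => [{:uri=>"Ruby (programming language)", :title=>{"en"=>"Ruby"}},
#     {:uri=>"Yukihiro Matsumoto", :title=>{"en"=>"Yukihiro Matsumoto"}},
#     {:uri=>"Perl", :title=>{"en"=>"Perl's history"}}]
```

The `File:` link is skipped because the target pattern excludes colons, and the `#History` anchor is dropped from the `uri`.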

#finish_processing ⇒ WikiParser::Page

Resumes parsing the node from the stopping point given to the parser earlier, then extracts internal links from the article into an array of `uri`s and `title`s.

Returns:

  • (WikiParser::Page)

    the page itself (self).



# File 'lib/wikiParserPage.rb', line 76

def finish_processing
	@stop_index ||= 0
	# Resume parsing where the earlier :until stop left off.
	process_node :node => @node, :from => @stop_index
	# Release the XML node once it has been read.
	@node = nil
	@internal_links = article_to_internal_links(@article)
	self
end
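
The `:until` / `:from` options together with #finish_processing suggest a two-phase parse: read the cheap fields first, stop at a named node, and resume only when the rest is needed. The sketch below illustrates that pattern under stated assumptions. `FakeNode` and `LazyPage` are hypothetical stand-ins (the real class reads Nokogiri nodes through `process_node`, whose body is not shown here), so this is not the library's implementation.

```ruby
# Hypothetical stand-in for an XML child node.
FakeNode = Struct.new(:name, :content)

# Hypothetical two-phase parser mirroring the :until / :from idea.
class LazyPage
  attr_reader :title, :article

  def initialize(children, stop_at: nil)
    @children   = children
    @title      = ""
    @article    = ""
    @stop_index = 0
    process(from: 0, stop_at: stop_at)
  end

  # Resume from the recorded stopping point, then release the nodes.
  def finish_processing
    process(from: @stop_index)
    @children = nil
    self
  end

  private

  def process(from:, stop_at: nil)
    return if @children.nil?
    @children.each_with_index do |child, index|
      next if index < from
      if stop_at && child.name == stop_at
        @stop_index = index # remember where to resume
        return
      end
      case child.name
      when "title" then @title   = child.content
      when "text"  then @article = child.content
      end
    end
    @stop_index = @children.length
  end
end

children = [
  FakeNode.new("title", "Ruby"),
  FakeNode.new("id", "42"),
  FakeNode.new("text", "[[Perl]] influenced Ruby.")
]

page = LazyPage.new(children, stop_at: "text")
early_article = page.article        # => "" (the text node has not been read yet)
page.finish_processing
late_article = page.article         # => "[[Perl]] influenced Ruby."
```

Returning `self` from `finish_processing`, as the documented method does, allows calls to be chained, for example `page.finish_processing.article`.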