Class: RDF::Microdata::Reader

Inherits:
Reader
  • Object
Defined in:
lib/rdf/microdata/reader.rb

Overview

A Microdata parser in Ruby.

Based on processing rules, amended with the following:

  • property generation from tokens now uses the associated @itemtype as the basis for generation
  • implicit triples are not generated, only those with @item*
  • @datetime values are scanned lexically to find the appropriate datatype
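The last amendment, picking a datatype from the lexical form of a @datetime value, can be illustrated with a minimal sketch. The helper name datatype_for, the constant names, and the simplified patterns are assumptions made for illustration and are not taken from the reader's source.

```ruby
# Hypothetical helper, not part of RDF::Microdata::Reader: picks an XSD
# datatype URI by matching the lexical form of a @datetime value.
XSD_NS = "http://www.w3.org/2001/XMLSchema#"

# Simplified, assumed patterns; the real XSD lexical forms allow more
# (negative years, for example).
DATETIME_PATTERNS = {
  "dateTime" => /\A\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?\z/,
  "date"     => /\A\d{4}-\d{2}-\d{2}(Z|[+-]\d{2}:\d{2})?\z/,
  "time"     => /\A\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?\z/,
  "duration" => /\AP(?=.)(\d+Y)?(\d+M)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?\z/
}.freeze

# Returns the datatype URI as a String, or nil when no pattern matches,
# leaving the fallback to the caller.
def datatype_for(value)
  name, _pattern = DATETIME_PATTERNS.find { |_, pattern| pattern.match?(value) }
  name ? XSD_NS + name : nil
end
```

A caller could then type the literal when datatype_for returns a URI and fall back to a plain literal otherwise; that fallback is also an assumption here.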

See Also:

Author:

Defined Under Namespace

Classes: CrawlFailure

Constant Summary

XHTML =
"http://www.w3.org/1999/xhtml"
URL_PROPERTY_ELEMENTS =
%w(a area audio embed iframe img link object source track video)
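URL_PROPERTY_ELEMENTS lists the elements whose Microdata property value comes from a URL-valued attribute rather than from text content. How the reader consumes the constant is not shown on this page, so the following is only a sketch: url_attribute_for is a hypothetical helper, and the attribute choices follow the HTML Microdata property-value rules (href for a, area and link; data for object; src for the rest).

```ruby
URL_PROPERTY_ELEMENTS = %w(a area audio embed iframe img link object source track video)

# Hypothetical helper: names the attribute that carries the URL for a given
# element, or returns nil when the element is not a URL property element.
def url_attribute_for(element_name)
  return nil unless URL_PROPERTY_ELEMENTS.include?(element_name)

  case element_name
  when "a", "area", "link" then "href"
  when "object"            then "data"
  else                          "src"
  end
end
```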

Instance Method Summary

Constructor Details

- (reader) initialize(input = $stdin, options = {}) {|reader| ... }

Initializes the Microdata reader instance.

Parameters:

  • input (Nokogiri::HTML::Document, Nokogiri::XML::Document, IO, File, String) (defaults to: $stdin)

    the input stream to read

  • options (Hash{Symbol => Object}) (defaults to: {})

    any additional options

Options Hash (options):

  • :encoding (Encoding) — default: Encoding::UTF_8

    the encoding of the input stream (Ruby 1.9+)

  • :validate (Boolean) — default: false

    whether to validate the parsed statements and values

  • :canonicalize (Boolean) — default: false

    whether to canonicalize parsed literals

  • :intern (Boolean) — default: true

    whether to intern all parsed URIs

  • :base_uri (#to_s) — default: nil

    the base URI to use when resolving relative URIs

  • :debug (Array)

    Array to place debug messages

Yields:

  • (reader)

    self

Yield Parameters:

  • reader (RDF::Reader)

Yield Returns:

  • (void)

    ignored

Raises:

  • (RDF::ReaderError)

    if the :validate option is set and the input has syntax errors or yields an empty document



# File 'lib/rdf/microdata/reader.rb', line 58

def initialize(input = $stdin, options = {}, &block)
  super do
    @debug = options[:debug]

    @doc = case input
    when Nokogiri::HTML::Document, Nokogiri::XML::Document
      input
    else
      # Try to detect charset from input
      options[:encoding] ||= input.charset if input.respond_to?(:charset)
      
      # Otherwise, default is utf-8
      options[:encoding] ||= 'utf-8'

      add_debug(nil, "base_uri: #{base_uri}")
      Nokogiri::HTML.parse(input, base_uri.to_s, options[:encoding])
    end
    
    errors = @doc.errors.reject {|e| e.to_s =~ /Tag (audio|source|track|video|time) invalid/}
    raise RDF::ReaderError, "Syntax errors:\n#{errors}" if !errors.empty? && validate?
    raise RDF::ReaderError, "Empty document" if (@doc.nil? || @doc.root.nil?) && validate?

    if block_given?
      case block.arity
        when 0 then instance_eval(&block)
        else block.call(self)
      end
    end
  end
end
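The block handling at the end of the constructor dispatches on arity: a block that declares no parameters is evaluated with self set to the reader, while a block that declares a parameter receives the reader as its argument. A minimal sketch of that pattern, using a hypothetical SketchReader class in place of the real reader:

```ruby
# Hypothetical stand-in for a reader; only the block handling mirrors the
# constructor shown above.
class SketchReader
  attr_reader :base_uri

  def initialize(base_uri = nil, &block)
    @base_uri = base_uri
    return unless block_given?

    case block.arity
    when 0 then instance_eval(&block)  # block runs with self set to the reader
    else block.call(self)              # block receives the reader as an argument
    end
  end
end

seen = []
# One declared parameter: the reader is passed in explicitly.
SketchReader.new("http://example.org/") { |reader| seen << reader.base_uri }
# No parameters: base_uri resolves against the reader itself.
SketchReader.new("http://example.org/doc") { seen << base_uri }
```

Both call styles reach the same object; the zero-arity form trades explicitness for brevity, since methods inside the block resolve against the reader instead of the caller.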

Instance Method Details

- (#to_s, nil) base_uri

Returns the base URI determined by this reader.

Examples:

reader.base_uri  #=> the value given as the :base_uri option

Returns:

  • (#to_s, nil)

Since:

  • 0.3.0



# File 'lib/rdf/microdata/reader.rb', line 30

def base_uri
  @options[:base_uri]
end

- each_statement {|statement| ... }

This method returns an undefined value.

Iterates the given block for each RDF statement in the input.

Yields:

  • (statement)

Yield Parameters:

  • statement (RDF::Statement)


# File 'lib/rdf/microdata/reader.rb', line 95

def each_statement(&block)
  @callback = block

  # parse
  parse_whole_document(@doc, base_uri)
end
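each_statement stores the block in @callback and then starts the parse, so statements reach the caller as they are produced. The sketch below shows that callback pattern in isolation; SketchCallbackReader and add_statement are hypothetical names, and a fixed list stands in for the document walk that parse_whole_document performs.

```ruby
# Hypothetical stand-in: the "document" is a plain list of items.
class SketchCallbackReader
  def initialize(items)
    @items = items
  end

  # Remember the caller's block, then run the parse.
  def each_statement(&block)
    @callback = block
    parse_whole_document
  end

  private

  # Stand-in for the real document traversal.
  def parse_whole_document
    @items.each { |item| add_statement(item) }
  end

  # Hands each produced item to the stored callback, if any.
  def add_statement(item)
    @callback.call(item) if @callback
  end
end

collected = []
SketchCallbackReader.new(%w(first second)).each_statement { |s| collected << s }
```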

- each_triple {|subject, predicate, object| ... }

This method returns an undefined value.

Iterates the given block for each RDF triple in the input.

Yields:

  • (subject, predicate, object)

Yield Parameters:

  • subject (RDF::Resource)
  • predicate (RDF::URI)
  • object (RDF::Value)


# File 'lib/rdf/microdata/reader.rb', line 110

def each_triple(&block)
  each_statement do |statement|
    block.call(*statement.to_triple)
  end
end
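each_triple adds no parsing of its own: it wraps each_statement and splats each statement's to_triple result into three block arguments. A self-contained sketch of that delegation, where SketchStatement and SketchSource are hypothetical stand-ins for RDF::Statement and the reader:

```ruby
# Hypothetical stand-in that mimics only the #to_triple method used above.
SketchStatement = Struct.new(:subject, :predicate, :object) do
  def to_triple
    [subject, predicate, object]
  end
end

# Hypothetical stand-in that replaces the Microdata parser with a fixed list.
class SketchSource
  def initialize(statements)
    @statements = statements
  end

  def each_statement(&block)
    @statements.each(&block)
  end

  # Same shape as each_triple above: unpack each statement into three arguments.
  def each_triple(&block)
    each_statement { |statement| block.call(*statement.to_triple) }
  end
end

source = SketchSource.new([SketchStatement.new("_:a", "http://schema.org/name", "Alice")])
triples = []
source.each_triple { |s, p, o| triples << [s, p, o] }
```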