Class: Arrow::HTMLTokenizer

Inherits: Object

Includes: Enumerable

Defined in: lib/arrow/htmltokenizer.rb

Overview

Arrow::HTMLTokenizer is a simple HTML parser that breaks an HTML source string down into tokens: comments, doctype declarations, processing instructions, tags, and text.

Some of the code and design were stolen from the excellent HTMLTokenizer library by Ben Giddings.

VCS Id

$Id$

Authors

Michael Granger

Please see the file LICENSE in the top-level directory for licensing details.

Constant Summary

SVNRev = %q$Rev$
SVN Revision

SVNId = %q$Id$
SVN Id

Instance Attribute Summary

#scanner ⇒ Object readonly
The StringScanner doing the tokenizing.
#source ⇒ Object readonly
The HTML source being tokenized.

Instance Method Summary

#each ⇒ Object
Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.
#initialize(source) ⇒ HTMLTokenizer constructor
Create a new Arrow::HTMLTokenizer object.

Methods inherited from Object

deprecate_class_method, deprecate_method, inherited

Constructor Details

#initialize(source) ⇒ `HTMLTokenizer`

Create a new Arrow::HTMLTokenizer object that tokenizes the given source string.

# File 'lib/arrow/htmltokenizer.rb', line 41

def initialize( source )
  @source = source
  @scanner = StringScanner.new( source )
end
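
The constructor wraps the source in a StringScanner from Ruby's standard strscan library. As a rough illustration of the scanner calls the tokenizer relies on, here is a minimal sketch; the sample input and variable names are made up for this example and do not come from the Arrow sources.

```ruby
require 'strscan'

# Hypothetical sample input, not taken from the Arrow sources.
scanner = StringScanner.new( "<p>Hi</p>" )

first_char = scanner.peek( 1 )          # look ahead one character without consuming it
open_tag   = scanner.scan_until( />/ )  # consume up to and including the next '>'
text       = scanner.scan( /[^<]+/ )    # consume a run of non-'<' characters
close_tag  = scanner.scan_until( />/ )  # consume the closing tag
finished   = scanner.eos?               # true once all input has been consumed

scanner.reset                           # rewind to the start of the string
rewound_pos = scanner.pos
```

Because reset rewinds the scan position, the same scanner can be walked over the source more than once.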

Instance Attribute Details

#scanner ⇒ `Object` (readonly)

The StringScanner doing the tokenizing.

# File 'lib/arrow/htmltokenizer.rb', line 55

def scanner
  @scanner
end

#source ⇒ `Object` (readonly)

The HTML source being tokenized.

# File 'lib/arrow/htmltokenizer.rb', line 52

def source
  @source
end

Instance Method Details

#each ⇒ `Object`

Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.

# File 'lib/arrow/htmltokenizer.rb', line 60

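def each
  # Rewind so that every iteration starts at the beginning of the source.
  @scanner.reset

  # StringScanner#eos? replaces the obsolete #empty?.
  until @scanner.eos?
    if @scanner.peek(1) == '<'
      # Consume through the next '>'. If none follows (unterminated tag),
      # take the rest of the input so the scanner always advances.
      tag = @scanner.scan_until( />/ ) || @scanner.scan( /.+/m )

      # Classify the markup by its opening characters.
      case tag
      when /^<!--/
        token = HTMLComment.new( tag )
      when /^<!/
        token = DocType.new( tag )
      when /^<\?/
        token = ProcessingInstruction.new( tag )
      else
        token = HTMLTag.new( tag )
      end
    else
      # Everything up to the next '<' is plain text.
      text = @scanner.scan( /[^<]+/ )
      token = HTMLText.new( text )
    end

    yield( token )
  end
end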
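
To show how the loop above combines with the Enumerable mixin, here is a standalone sketch that runs without Arrow. SketchToken and SketchTokenizer are hypothetical stand-ins invented for this example (Arrow's own token classes are not reproduced here), and returning an enumerator when no block is given is an addition of the sketch, not something the source method does.

```ruby
require 'strscan'

# Hypothetical stand-in for Arrow's token classes: a kind plus the raw source text.
SketchToken = Struct.new( :kind, :raw )

# Hypothetical standalone tokenizer; a sketch of the technique, not the Arrow code.
class SketchTokenizer
  include Enumerable

  def initialize( source )
    @scanner = StringScanner.new( source )
  end

  def each
    return enum_for( :each ) unless block_given?
    @scanner.reset

    until @scanner.eos?
      if @scanner.peek( 1 ) == '<'
        # Fall back to the rest of the input if no '>' follows.
        raw = @scanner.scan_until( />/ ) || @scanner.scan( /.+/m )
        kind = case raw
               when /\A<!--/ then :comment
               when /\A<!/   then :doctype
               when /\A<\?/  then :processing_instruction
               else               :tag
               end
      else
        raw  = @scanner.scan( /[^<]+/ )
        kind = :text
      end

      yield SketchToken.new( kind, raw )
    end
  end
end

tokens = SketchTokenizer.new( "<!DOCTYPE html><p>Hi<!--x--></p>" )
kinds  = tokens.map( &:kind )
texts  = tokens.select {|t| t.kind == :text }.map( &:raw )
```

Since each is the only iteration method defined, map and select here come from Enumerable. Note that scanning to the first '>' means a comment containing '>' in its body would be split early; the sketch shares that limitation with the loop it imitates.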