Class: Arrow::HTMLTokenizer
- Includes: Enumerable
- Defined in: lib/arrow/htmltokenizer.rb
Overview
Arrow::HTMLTokenizer is a simple HTML parser that breaks HTML source down into tokens.
Some of the code and design were stolen from the excellent HTMLTokenizer library by Ben Giddings.
VCS Id
$Id$
Authors
-
Michael Granger
Please see the file LICENSE in the top-level directory for licensing details.
Constant Summary
- SVNRev =
SVN Revision
%q$Rev$
- SVNId =
SVN Id
%q$Id$
Instance Attribute Summary
-
#scanner ⇒ Object
readonly
The StringScanner doing the tokenizing.
-
#source ⇒ Object
readonly
The HTML source being tokenized.
Instance Method Summary
-
#each ⇒ Object
Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.
-
#initialize(source) ⇒ HTMLTokenizer
constructor
Create a new Arrow::HTMLTokenizer object for the given HTML source.
Methods inherited from Object
deprecate_class_method, deprecate_method, inherited
Constructor Details
#initialize(source) ⇒ HTMLTokenizer
Create a new Arrow::HTMLTokenizer object for the given HTML source.
    # File 'lib/arrow/htmltokenizer.rb', line 41
    def initialize( source )
      @source = source
      @scanner = StringScanner.new( source )
    end
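The constructor only stores the source and wraps it in a StringScanner from Ruby's standard `strscan` library. As a rough illustration of the scanner calls the tokenizer relies on, here is a minimal sketch. The sample HTML string and the variable names are made up for this example and do not come from the Arrow sources.

```ruby
require 'strscan'

# Hypothetical sample input; not taken from the Arrow sources.
html    = '<p>Hello</p>'
scanner = StringScanner.new(html)

first_char = scanner.peek(1)          # look at the next character without consuming it
tag        = scanner.scan_until(/>/)  # consume up to and including the next '>'
text       = scanner.scan(/[^<]+/)    # consume a run of characters that are not '<'
rest       = scanner.rest             # whatever has not been consumed yet
```

Because `peek` does not advance the scan pointer, it can be used to decide which kind of token comes next before anything is consumed.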
Instance Attribute Details
#scanner ⇒ Object (readonly)
The StringScanner doing the tokenizing.
    # File 'lib/arrow/htmltokenizer.rb', line 55
    def scanner
      @scanner
    end
#source ⇒ Object (readonly)
The HTML source being tokenized.
    # File 'lib/arrow/htmltokenizer.rb', line 52
    def source
      @source
    end
Instance Method Details
#each ⇒ Object
Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.
    # File 'lib/arrow/htmltokenizer.rb', line 60
    def each
      @scanner.reset

      until @scanner.empty?
        if @scanner.peek(1) == '<'
          tag = @scanner.scan_until( />/ )

          case tag
          when /^<!--/
            token = HTMLComment.new( tag )
          when /^<!/
            token = DocType.new( tag )
          when /^<\?/
            token = ProcessingInstruction.new( tag )
          else
            token = HTMLTag.new( tag )
          end
        else
          text = @scanner.scan( /[^<]+/ )
          token = HTMLText.new( text )
        end

        yield( token )
      end
    end
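To try the same scanning loop without loading Arrow, something like the following self-contained sketch may help. It is not the Arrow implementation. `SketchTokenizer` is a hypothetical class that yields `[kind, text]` arrays instead of Arrow's token objects. It uses `eos?` in place of `empty?`, which the `strscan` documentation marks as obsolete. It also adds a fallback for an unterminated `<`, which is an assumption of this sketch and not part of the original method.

```ruby
require 'strscan'

# Hypothetical stand-in for Arrow::HTMLTokenizer; yields [kind, text] pairs
# instead of the Arrow token objects (HTMLTag, HTMLText, and so on).
class SketchTokenizer
  include Enumerable

  def initialize(source)
    @scanner = StringScanner.new(source)
  end

  def each
    return enum_for(:each) unless block_given?

    @scanner.reset
    until @scanner.eos?
      if @scanner.peek(1) == '<'
        # Assumption of this sketch: an unterminated '<' is swallowed to the
        # end of the input so that the loop always makes progress.
        tag = @scanner.scan_until(/>/) || @scanner.scan(/.+/m)
        kind = case tag
               when /\A<!--/ then :comment
               when /\A<!/   then :doctype
               when /\A<\?/  then :processing_instruction
               else               :tag
               end
        yield [kind, tag]
      else
        # Everything up to the next '<' is plain text.
        yield [:text, @scanner.scan(/[^<]+/)]
      end
    end
  end
end

tokens = SketchTokenizer.new('<!DOCTYPE html><p>Hi<!-- note --></p>').to_a
```

Since the class includes Enumerable and defines `each`, methods such as `to_a`, `map`, and `select` work on it directly. As in the original loop, a tag ends at the first `>`, so a comment whose body contains `>` would be cut short.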