Class: Oga::XML::Lexer
- Inherits: Object
- Defined in: lib/oga/xml/lexer.rb
Overview
Low level lexer that supports both XML and HTML (using an extra option).
To lex HTML input, set the :html option to true when creating an instance of the lexer:
lexer = Oga::XML::Lexer.new('...', :html => true)
This lexer can process both String and IO instances. IO instances are processed line by line, which can greatly reduce memory usage in exchange for a slightly slower runtime.
Thread Safety
Since this class keeps track of internal state, you cannot use the same instance from multiple threads at the same time. For example, the following will not work reliably:
# Don't do this!
lexer = Oga::XML::Lexer.new('....')
threads = []
2.times do
  threads << Thread.new do
    lexer.advance do |*args|
      p args
    end
  end
end
threads.each(&:join)
However, it is perfectly safe to use a different instance per thread. There is no global state used by this lexer.
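A minimal sketch of the safe pattern, in which every thread builds its own lexer. To keep it runnable without Oga installed, `StubLexer` is a hypothetical stand-in that only mimics the block-based `advance` interface, and `lex_concurrently` is a hypothetical helper. In real code the block passed to the helper would presumably return `Oga::XML::Lexer.new(input)` instead.

```ruby
# Hypothetical stand-in for a lexer. It only mimics the block-based
# `advance` interface and is not part of Oga.
class StubLexer
  def initialize(data)
    @data = data
  end

  # Yields one fake text token per whitespace-separated word.
  def advance
    @data.split.each { |word| yield :T_TEXT, word, 1 }
  end
end

# Hypothetical helper: lexes every input in its own thread. The supplied
# block builds a fresh lexer inside each thread, so no instance is shared.
def lex_concurrently(inputs)
  threads = inputs.map do |input|
    Thread.new do
      lexer  = yield(input) # one instance per thread
      tokens = []

      lexer.advance { |*args| tokens << args }

      tokens
    end
  end

  # Thread#value joins the thread and returns the block's result.
  threads.map(&:value)
end

results = lex_concurrently(['a b', 'c']) { |input| StubLexer.new(input) }
```

Because each thread owns its lexer and its token list, no synchronisation is needed; the results are only combined after every thread has finished.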
Strict Mode
By default the lexer is rather permissive regarding the input. For
example, missing closing tags are inserted by default. To disable this
behaviour the lexer can be run in "strict mode" by setting :strict to true:
lexer = Oga::XML::Lexer.new('...', :strict => true)
Strict mode only applies to XML documents.
Constant Summary
- HTML_SCRIPT =
These are all constant/frozen to remove the need for String allocations every time they are referenced in the lexer.
'script'.freeze
- HTML_STYLE =
'style'.freeze
- HTML_TABLE_ALLOWED =
Elements that are allowed directly in a <table> element.
Whitelist.new( %w{thead tbody tfoot tr caption colgroup col} )
- HTML_SCRIPT_ELEMENTS =
Whitelist.new(%w{script template})
- HTML_TABLE_ROW_ELEMENTS =
Whitelist.new(%w{tr}) + HTML_SCRIPT_ELEMENTS
- HTML_CLOSE_SELF =
Elements that should be closed automatically before a new opening tag is processed.
{
  'head' => Blacklist.new(%w{head body}),
  'body' => Blacklist.new(%w{head body}),
  'li' => Blacklist.new(%w{li}),
  'dt' => Blacklist.new(%w{dt dd}),
  'dd' => Blacklist.new(%w{dt dd}),
  'p' => Blacklist.new(%w{
    address article aside blockquote details div dl fieldset figcaption
    figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr main menu nav ol p
    pre section table ul
  }),
  'rb' => Blacklist.new(%w{rb rt rtc rp}),
  'rt' => Blacklist.new(%w{rb rt rtc rp}),
  'rtc' => Blacklist.new(%w{rb rtc}),
  'rp' => Blacklist.new(%w{rb rt rtc rp}),
  'optgroup' => Blacklist.new(%w{optgroup}),
  'option' => Blacklist.new(%w{optgroup option}),
  'colgroup' => Whitelist.new(%w{col template}),
  'caption' => HTML_TABLE_ALLOWED.to_blacklist,
  'table' => HTML_TABLE_ALLOWED + HTML_SCRIPT_ELEMENTS,
  'thead' => HTML_TABLE_ROW_ELEMENTS,
  'tbody' => HTML_TABLE_ROW_ELEMENTS,
  'tfoot' => HTML_TABLE_ROW_ELEMENTS,
  'tr' => Whitelist.new(%w{td th}) + HTML_SCRIPT_ELEMENTS,
  'td' => Blacklist.new(%w{td th}) + HTML_TABLE_ALLOWED,
  'th' => Blacklist.new(%w{td th}) + HTML_TABLE_ALLOWED
}
- LITERAL_HTML_ELEMENTS =
Names of HTML tags of which the content should be lexed as-is.
Whitelist.new([HTML_SCRIPT, HTML_STYLE])
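To illustrate how a table shaped like HTML_CLOSE_SELF could be consulted, here is a rough sketch. It assumes that a whitelist entry means "only these children may open, anything else closes the current element" and that a blacklist entry means "these children close the current element". The classes `AllowList` and `DenyList`, the constant `CLOSE_SELF`, and the method `close_before?` are hypothetical stand-ins, not Oga's own Whitelist, Blacklist, or lexer internals.

```ruby
require 'set'

# Hypothetical stand-in for an allow list of element names.
class AllowList
  def initialize(names)
    @names = Set.new(names)
  end

  def allow?(name)
    @names.include?(name)
  end
end

# Hypothetical stand-in for a deny list: everything is allowed except the
# given names.
class DenyList < AllowList
  def allow?(name)
    !super
  end
end

# A small excerpt shaped like HTML_CLOSE_SELF.
CLOSE_SELF = {
  'li' => DenyList.new(%w{li}),
  'tr' => AllowList.new(%w{td th script template})
}.freeze

# Returns true when the currently open element should be closed before
# the element `new_name` is opened (assumed lookup logic).
def close_before?(current, new_name)
  rule = CLOSE_SELF[current]

  !rule.nil? && !rule.allow?(new_name)
end

close_before?('li', 'li') # a new <li> closes the open <li>
close_before?('tr', 'td') # a <td> may open inside the <tr>
```

Elements without an entry in the table are never closed implicitly under this reading.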
Instance Method Summary
-
#advance {|type, value, line| ... } ⇒ Object
Advances through the input and generates the corresponding tokens.
- #html? ⇒ TrueClass|FalseClass
- #html_script? ⇒ TrueClass|FalseClass
- #html_style? ⇒ TrueClass|FalseClass
-
#initialize(data, options = {}) ⇒ Lexer
constructor
A new instance of Lexer.
-
#lex ⇒ Array
Gathers all the tokens for the input and returns them as an Array.
-
#read_data {|| ... } ⇒ String
Yields the data to lex to the supplied block.
-
#reset ⇒ Object
Resets the internal state of the lexer.
- #strict? ⇒ TrueClass|FalseClass
Constructor Details
#initialize(data, options = {}) ⇒ Lexer
Returns a new instance of Lexer.
# File 'lib/oga/xml/lexer.rb', line 117
def initialize(data, options = {})
  @data   = data
  @html   = options[:html]
  @strict = options[:strict] || false

  reset
end
Instance Method Details
#advance {|type, value, line| ... } ⇒ Object
Advances through the input and generates the corresponding tokens. Each token is yielded to the supplied block.
Each token consists of three values, yielded to the block in the following order:
TYPE, VALUE, LINE
The type is a Symbol, the value is either nil or a String, and the line is the lexer's current line number.
This method stores the supplied block in @block and resets it after the lexer loop has finished. This method does not reset the internal state of the lexer.
# File 'lib/oga/xml/lexer.rb', line 200
def advance(&block)
  @block = block

  read_data do |chunk|
    advance_native(chunk)
  end

  # Add any missing closing tags
  if !strict? and !@elements.empty?
    @elements.length.times { on_element_end }
  end
ensure
  @block = nil
end
#html? ⇒ TrueClass|FalseClass
# File 'lib/oga/xml/lexer.rb', line 218
def html?
  @html == true
end
#html_script? ⇒ TrueClass|FalseClass
# File 'lib/oga/xml/lexer.rb', line 232
def html_script?
  html? && current_element == HTML_SCRIPT
end
#html_style? ⇒ TrueClass|FalseClass
# File 'lib/oga/xml/lexer.rb', line 239
def html_style?
  html? && current_element == HTML_STYLE
end
#lex ⇒ Array
Gathers all the tokens for the input and returns them as an Array.
This method resets the internal state of the lexer after consuming the input.
# File 'lib/oga/xml/lexer.rb', line 169
def lex
  tokens = []

  advance do |type, value, line|
    tokens << [type, value, line]
  end

  reset

  tokens
end
#read_data {|| ... } ⇒ String
Yields the data to lex to the supplied block.
# File 'lib/oga/xml/lexer.rb', line 145
def read_data
  if @data.is_a?(String)
    yield @data

  # IO, StringIO, etc
  # THINK: read(N) would be nice, but currently this screws up the C code
  elsif @data.respond_to?(:each_line)
    @data.each_line { |line| yield line }

  # Enumerator, Array, etc
  elsif @data.respond_to?(:each)
    @data.each { |chunk| yield chunk }
  end
end
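The dispatch in #read_data determines how the input is split into chunks. The following standalone sketch mirrors that logic outside the lexer so the chunking can be observed directly; `chunks_for` is a hypothetical helper name and not part of Oga.

```ruby
require 'stringio'

# Hypothetical helper mirroring the type dispatch of #read_data. It
# collects the chunks into an Array instead of yielding them.
def chunks_for(data)
  chunks = []

  # String is checked first: it also responds to each_line, but should be
  # passed along as a single chunk.
  if data.is_a?(String)
    chunks << data
  # IO, StringIO, etc: one chunk per line
  elsif data.respond_to?(:each_line)
    data.each_line { |line| chunks << line }
  # Enumerator, Array, etc: one chunk per element
  elsif data.respond_to?(:each)
    data.each { |chunk| chunks << chunk }
  end

  chunks
end

string_chunks = chunks_for("<a>\n</a>")
io_chunks     = chunks_for(StringIO.new("<a>\n</a>"))
array_chunks  = chunks_for(['<a>', '</a>'])
```

A String therefore reaches the lexer in one piece, while an IO-like object is fed one line at a time, which is where the memory saving for large inputs comes from.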
#reset ⇒ Object
Resets the internal state of the lexer. Typically you don't need to call this method yourself as it's called by #lex after lexing a given String.
# File 'lib/oga/xml/lexer.rb', line 130
def reset
  @line     = 1
  @elements = []

  @data.rewind if @data.respond_to?(:rewind)

  reset_native
end
#strict? ⇒ TrueClass|FalseClass
# File 'lib/oga/xml/lexer.rb', line 225
def strict?
  @strict
end