Module: Nokogiri::HTML5::Inference
Defined in:
- lib/nokogiri/html5/inference.rb
- lib/nokogiri/html5/inference/version.rb
Overview
The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very
precise context-dependent parsing rules which can make it challenging to "just parse" a
fragment of HTML without knowing the parent node -- also called the "context node" -- in
which it will be inserted.
Most content in an HTML5 document can be parsed assuming the parser is in the
["in body" insertion
mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there
are some notable exceptions. Perhaps the most problematic for web developers are the
table-related tags, which will not be parsed properly unless the parser is in the ["in
table" insertion
mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
For example:
``` ruby
Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
# => "foo" # where did the tag go!?
```
In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed
here", and drop the tag. This particular fragment must be parsed "in the context" of a
table in order to parse properly.
Thankfully, libgumbo and Nokogiri allow us to set the context node:
``` ruby
Nokogiri::HTML5::DocumentFragment.new(
  Nokogiri::HTML5::Document.new,
  "<td>foo</td>",
  "table" # <--- this is the context node
).to_html
# => "<tbody><tr><td>foo</td></tr></tbody>"
```
This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action:
there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the
parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
To get back only the tags we provided, we can use the `<template>` tag as the
context node, which the HTML5 spec provides for exactly this purpose:
``` ruby
Nokogiri::HTML5::DocumentFragment.new(
  Nokogiri::HTML5::Document.new,
  "<td>foo</td>",
  "template" # <--- this is the context node
).to_html
# => "<td>foo</td>"
```
Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
``` ruby
Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
# => "<td>foo</td>"
```
Defined Under Namespace
Modules: ContextRegexp, ContextTags, PluckRegexp
Constant Summary
- VERSION = "0.3.0"
Class Method Summary
- .context(input) ⇒ Object
  call-seq: context(input) => (String | nil)
- .parse(input, pluck: true) ⇒ Object
  call-seq: parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::XML::NodeSet)
- .pluck_path(input) ⇒ Object
  call-seq: pluck_path(input) => (String | nil)
Class Method Details
.context(input) ⇒ Object
call-seq: context(input) => (String | nil)
Based on the start of the input HTML5 string, make a guess about whether it's a full
document or a document fragment; and if it's a fragment, whether we need to parse it
within a specific context node.
[Parameters]
- +input+ (String) The input HTML5 string, which may represent a document or a fragment.
[Returns]
The String name of the context node required to parse the fragment, or +nil+ if the
input represents a full document.
``` ruby
# File 'lib/nokogiri/html5/inference.rb', line 149

def context(input) # :nodoc:
  case input
  when ContextRegexp::DOCUMENT then nil
  when ContextRegexp::HTML then "html"
  else "template"
  end
end
```
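The definitions of `ContextRegexp::DOCUMENT` and `ContextRegexp::HTML` are not shown on this page, so the following is only a rough sketch of how the dispatch above might behave. The two `SKETCH_*` patterns and the `sketch_context` method are hypothetical stand-ins chosen for illustration (the assumption being that a doctype or `<html>` start signals a full document, and a `<head>` or `<body>` start needs the `html` context); the library's actual patterns may differ.

``` ruby
# Hypothetical stand-ins for ContextRegexp::DOCUMENT and ContextRegexp::HTML.
# These patterns are assumptions for illustration, not the library's definitions.
SKETCH_DOCUMENT = /\A\s*(<!doctype\s+html\b|<html\b)/i
SKETCH_HTML = /\A\s*<(head|body)\b/i

# Same shape as the case statement above: nil means "parse as a full document",
# and any String names the context node to use for fragment parsing.
def sketch_context(input)
  case input
  when SKETCH_DOCUMENT then nil
  when SKETCH_HTML then "html"
  else "template"
  end
end

sketch_context("<!DOCTYPE html><html></html>") # => nil
sketch_context("<body><p>hi</p></body>")       # => "html"
sketch_context("<td>foo</td>")                 # => "template"
```

Because `case`/`when` compares with `Regexp#===`, only the start of the input is inspected (each pattern is anchored with `\A`), which is why the method is described as making a guess.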
.parse(input, pluck: true) ⇒ Object
call-seq:
parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::XML::NodeSet)
Based on the start of the input HTML5 string, guess whether it's a full document or a
fragment and, using the fragment context node if necessary, parse it properly and
return the correct set of nodes.
The keyword parameter +pluck+ can be set to +false+ to disable the narrowing of a
parsed fragment to omit any intermediate parent nodes. This "plucking" is necessary,
for example, when the input fragment begins with "<td>", which the HTML5 spec requires
to be wrapped in <tt><tbody><tr>...</tr></tbody></tt> tags. By default, this method
will return only the children of <tt><tbody><tr></tt>, but setting this flag to +false+
will return the +tbody+ tag and its children.
[Parameters]
- +input+ (String) The input HTML5 string, which may represent a document or a fragment.
[Keyword Parameters]
- +pluck+ (Boolean) Default: +true+. Set to +false+ if you want the method to always
return what Nokogiri parsed, without attempting to remove any sibling or intermediate
parent nodes. This shouldn't be necessary if the library is working properly, but may
be useful to allow users to work around a bad guess.
[Returns]
- A +Nokogiri::HTML5::Document+ if the input appears to represent a full document.
- A +Nokogiri::XML::NodeSet+ if the input appears to be a fragment.
``` ruby
# File 'lib/nokogiri/html5/inference.rb', line 121

def parse(input, pluck: true)
  context = Nokogiri::HTML5::Inference.context(input)
  if context.nil?
    Nokogiri::HTML5::Document.parse(input)
  else
    fragment = Nokogiri::HTML5::DocumentFragment.new(Nokogiri::HTML5::Document.new, input, context)
    if pluck && (path = pluck_path(input))
      fragment.xpath(path)
    else
      fragment.children
    end
  end
end
```
.pluck_path(input) ⇒ Object
call-seq: pluck_path(input) => (String | nil)
Based on the start of the input HTML5 fragment string, determine whether the fragment
will need to be selected out of a parent node. This is necessary, for example, when the
fragment begins with "<td>", a tag which the HTML5 spec requires to be wrapped in
"<tbody><tr>...</tr></tbody>".
[Parameters]
- +input+ (String) The input HTML5 string, which should represent a fragment (not a full document).
[Returns]
The String XPath query used to select the fragment out of its parent node, or +nil+
if no plucking is necessary.
``` ruby
# File 'lib/nokogiri/html5/inference.rb', line 172

def pluck_path(input) # :nodoc:
  case input
  when PluckRegexp::BODY_OUTER then "body"
  end
end
```
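To illustrate the +nil+ return, here is a small sketch of the same `case` shape. `PluckRegexp::BODY_OUTER` is not defined on this page, so `SKETCH_BODY_OUTER` below is a hypothetical stand-in (assumed to match input whose first tag is `<body>`), and `sketch_pluck_path` is an illustrative name rather than part of the library.

``` ruby
# Hypothetical stand-in for PluckRegexp::BODY_OUTER (an assumption, not the
# library's definition): input whose first tag is <body>.
SKETCH_BODY_OUTER = /\A\s*<body\b/i

# A case expression with no matching branch and no else evaluates to nil,
# which is how "no plucking necessary" is signalled.
def sketch_pluck_path(input)
  case input
  when SKETCH_BODY_OUTER then "body"
  end
end

sketch_pluck_path("<body><p>hi</p></body>") # => "body"
sketch_pluck_path("<td>foo</td>")           # => nil
```

In `parse`, a non-nil result is passed to `fragment.xpath`, so the returned String is used as a relative XPath query against the parsed fragment.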