Class: BxBuilderChain::Processors::Html
- Defined in:
- lib/bx_builder_chain/processors/html.rb
Constant Summary
- EXTENSIONS = [".html", ".htm"]
- CONTENT_TYPES = ["text/html"]
- TEXT_CONTENT_TAGS = %w[h1 h2 h3 h4 h5 h6 p]
  We only look for headings and paragraphs.
Instance Method Summary
- #initialize ⇒ Html (constructor)
  A new instance of Html.
- #parse(data) ⇒ String
  Parse the document and return the text.
Methods included from DependencyHelper
Constructor Details
#initialize ⇒ Html
Returns a new instance of Html.
    # File 'lib/bx_builder_chain/processors/html.rb', line 12
    def initialize(*)
      depends_on "nokogiri"
      require "nokogiri"
    end
Instance Method Details
#parse(data) ⇒ String
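Parses the document read from data and returns the text of its headings and paragraphs, separated by blank lines. The data argument must respond to #read.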
    # File 'lib/bx_builder_chain/processors/html.rb', line 20
    def parse(data)
      Nokogiri::HTML(data.read)
        .css(TEXT_CONTENT_TAGS.join(","))
        .map(&:inner_text)
        .join("\n\n")
        .strip
    end