Class: Langchain::Processors::HTML
- Defined in: lib/langchain/processors/html.rb
Constant Summary
- EXTENSIONS = [".html", ".htm"]
- CONTENT_TYPES = ["text/html"]
- TEXT_CONTENT_TAGS = %w[h1 h2 h3 h4 h5 h6 p]
  We only look for headings and paragraphs.
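As a minimal sketch of how these constants can be put to work: the tag list joins into a single CSS selector group, and the extension list supports a simple file-type check. The `html_extension?` helper is hypothetical and not part of the library, and the case-insensitive comparison is an assumption made for illustration.

```ruby
# Constants copied from the summary above.
EXTENSIONS = [".html", ".htm"]
TEXT_CONTENT_TAGS = %w[h1 h2 h3 h4 h5 h6 p]

# Joining the tag names with commas yields a CSS selector group that
# matches any heading or paragraph element.
selector = TEXT_CONTENT_TAGS.join(",")

# Hypothetical helper (not part of the library): checks whether a path
# ends in one of the listed extensions. Downcasing is an assumption.
def html_extension?(path)
  EXTENSIONS.include?(File.extname(path).downcase)
end

puts selector                        # h1,h2,h3,h4,h5,h6,p
puts html_extension?("index.HTM")    # true
puts html_extension?("notes.md")     # false
```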
Instance Method Summary
- #initialize ⇒ HTML (constructor)
  A new instance of HTML.
- #parse(data) ⇒ String
  Parse the document and return the text.
Methods included from DependencyHelper
Constructor Details
#initialize ⇒ HTML
Returns a new instance of HTML.
    # File 'lib/langchain/processors/html.rb', line 12
    def initialize(*)
      depends_on "nokogiri"
    end
Instance Method Details
#parse(data) ⇒ String
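Parse the document and return the text of its headings and paragraphs (TEXT_CONTENT_TAGS), joined by blank lines. data must respond to #read, for example a File or StringIO.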
    # File 'lib/langchain/processors/html.rb', line 19
    def parse(data)
      Nokogiri::HTML(data.read)
        .css(TEXT_CONTENT_TAGS.join(","))
        .map(&:inner_text)
        .join("\n\n")
    end