Class: Langchain::Processors::HTML

Inherits:
Base
  • Object
Defined in:
lib/langchain/processors/html.rb

Constant Summary

EXTENSIONS =
[".html", ".htm"]
CONTENT_TYPES =
["text/html"]
TEXT_CONTENT_TAGS =

Only heading and paragraph tags are searched for text.

%w[h1 h2 h3 h4 h5 h6 p]

Instance Method Summary

Methods included from DependencyHelper

#depends_on

Constructor Details

#initialize ⇒ HTML

Returns a new instance of HTML.



# File 'lib/langchain/processors/html.rb', line 12

def initialize(*)
  depends_on "nokogiri"
end

Instance Method Details

#parse(data) ⇒ String

Parse the document and return the text of its headings and paragraphs, joined by blank lines.

Parameters:

  • data (File)

Returns:

  • (String)


# File 'lib/langchain/processors/html.rb', line 19

def parse(data)
  Nokogiri::HTML(data.read)
    .css(TEXT_CONTENT_TAGS.join(","))
    .map(&:inner_text)
    .join("\n\n")
end