Module: ReadabilityJs

Defined in: lib/readability_js.rb,
lib/custom_errors/error.rb,
lib/readability_js/nodo.rb,
lib/readability_js/version.rb,
lib/readability_js/extended.rb

Overview

ReadabilityJs

Constant Summary

VERSION = '0.0.3'.freeze

Class Method Summary

.is_probably_readerable(html, min_content_length: 140, min_score: 20, visibility_checker: nil) ⇒ Boolean

Decides whether a document is probably readerable without parsing the whole document.
.parse(html, url: nil, debug: false, max_elems_to_parse: 0, nb_top_candidates: 5, char_threshold: 500, classes_to_preserve: [], keep_classes: false, disable_json_ld: false, serializer: nil, allow_video_regex: nil, link_density_modifier: 0) ⇒ Hash

Parse an HTML document and extract its main content using Mozilla’s Readability library.
.parse_extended(html, url: nil, debug: false, max_elems_to_parse: 0, nb_top_candidates: 5, char_threshold: 500, classes_to_preserve: [], keep_classes: false, disable_json_ld: false, serializer: nil, allow_video_regex: nil, link_density_modifier: 0) ⇒ Hash

Like .parse, but with additional pre- and post-processing to enhance content extraction.
.probably_readerable?(html, min_content_length: 140, min_score: 20, visibility_checker: nil) ⇒ Boolean

Decides whether a document is probably readerable without parsing the whole document.

Class Method Details

.is_probably_readerable(html, min_content_length: 140, min_score: 20, visibility_checker: nil) ⇒ `Boolean`

Decides whether a document is probably readerable without parsing the whole document.

Only ‘html’ is required; all other parameters are optional.

html = "<html>…</html>"

# The checker is an anonymous JavaScript function passed as a string.
visibility_checker = <<~JS
  (node) => {
    const style = node.ownerDocument.defaultView.getComputedStyle(node);
    return (style && style.display !== 'none' && style.visibility !== 'hidden' && parseFloat(style.opacity) > 0);
  }
JS

ReadabilityJs.is_probably_readerable(html, min_content_length: 200, min_score: 25, visibility_checker: visibility_checker)

Parameters:

html (String) —

The HTML document as a string.
min_content_length (Integer) (defaults to: 140) —

Minimum content length to consider the document readerable
min_score (Integer) (defaults to: 20) —

Minimum score to consider the document readerable
visibility_checker (String) (defaults to: nil) —

An anonymous JavaScript function definition, passed as a string, that checks whether a node is visible. The default visibility checker is used if none is provided.

Returns:

(Boolean) —

true if the document is probably readerable, false otherwise.

Raises:

(ReadabilityJs::Error) —

if an error occurs during execution

# File 'lib/readability_js.rb', line 102

def self.is_probably_readerable(html, min_content_length: 140, min_score: 20, visibility_checker: nil)
  begin
    ReadabilityJs::Nodo.is_probably_readerable(html, min_content_length: min_content_length, min_score: min_score, visibility_checker: visibility_checker)
  rescue => e
    raise ReadabilityJs::Error.new e.message
  end
end
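
The listing above rescues any error from the backend and re-raises it as ReadabilityJs::Error, so callers only need to rescue one class. A minimal, self-contained sketch of that wrapping pattern follows. DemoExtractor, DemoExtractor::Error and DemoExtractor::Backend are hypothetical stand-ins invented for illustration, and the length check is not the real readerability heuristic.

```ruby
# Hypothetical stand-ins: none of these names are part of ReadabilityJs.
module DemoExtractor
  # The single error class that callers are expected to rescue.
  class Error < StandardError; end

  # Stand-in for a lower-level backend that may raise arbitrary errors.
  module Backend
    def self.readerable?(html)
      raise ArgumentError, 'html must be a String' unless html.is_a?(String)

      # Placeholder rule for the sketch only.
      html.length >= 140
    end
  end

  # Public entry point: delegates to the backend and re-raises every
  # StandardError as DemoExtractor::Error, keeping the original message.
  def self.readerable?(html)
    Backend.readerable?(html)
  rescue StandardError => e
    raise Error, e.message
  end
end

begin
  DemoExtractor.readerable?(nil)
rescue DemoExtractor::Error => e
  puts "wrapped: #{e.message}"
end
```

With this shape, a caller of the real module would presumably write `rescue ReadabilityJs::Error` and not need to know which lower-level error was raised.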

.parse(html, url: nil, debug: false, max_elems_to_parse: 0, nb_top_candidates: 5, char_threshold: 500, classes_to_preserve: [], keep_classes: false, disable_json_ld: false, serializer: nil, allow_video_regex: nil, link_density_modifier: 0) ⇒ `Hash`

Parse an HTML document and extract its main content using Mozilla’s Readability library.

Only ‘html’ is required; all other parameters are optional.

Parameters:

html (String) —

The HTML document as a string.
url (String, nil) (defaults to: nil) —

The URL of the document (optional, used for resolving relative links).
debug (Boolean) (defaults to: false) —

Enable debug mode (default: false).
max_elems_to_parse (Integer) (defaults to: 0) —

Maximum number of elements to parse (default: 0, meaning no limit).
nb_top_candidates (Integer) (defaults to: 5) —

Number of top candidates to consider (default: 5).
char_threshold (Integer) (defaults to: 500) —

Minimum number of characters for an element to be considered (default: 500).
classes_to_preserve (Array<String>) (defaults to: []) —

List of CSS classes to preserve in the output (default: []).
keep_classes (Boolean) (defaults to: false) —

Whether to keep the original classes in the output (default: false).
disable_json_ld (Boolean) (defaults to: false) —

Disable JSON-LD parsing (default: false).
serializer (String, nil) (defaults to: nil) —

Serializer to use for output (optional).
allow_video_regex (String, nil) (defaults to: nil) —

Regular expression to allow video URLs (optional).
link_density_modifier (Float) (defaults to: 0) —

Modifier for link density calculation (default: 0).

Returns:

(Hash) —

A hash containing the extracted content and metadata.

Raises:

(ReadabilityJs::Error) —

if an error occurs during execution

# File 'lib/readability_js.rb', line 40

def self.parse(html, url: nil, debug: false, max_elems_to_parse: 0, nb_top_candidates: 5, char_threshold: 500, classes_to_preserve: [], keep_classes: false, disable_json_ld: false, serializer: nil, allow_video_regex: nil, link_density_modifier: 0)
  begin
    result = ReadabilityJs::Nodo.parse(html, url: url, debug: debug, max_elems_to_parse: max_elems_to_parse, nb_top_candidates: nb_top_candidates, char_threshold: char_threshold, classes_to_preserve: classes_to_preserve, keep_classes: keep_classes, disable_json_ld: disable_json_ld, serializer: serializer, allow_video_regex: allow_video_regex, link_density_modifier: link_density_modifier)
    normalize_result(result)
  rescue => e
    raise ReadabilityJs::Error.new e.message
  end
end
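
Because every option is a keyword argument, a set of shared options can be kept in a Hash and splatted into each call. The sketch below shows only that call-site pattern. ParserStandIn, SHARED_OPTIONS and parse_page are hypothetical names, and the stand-in mirrors just a subset of the keyword signature and echoes its arguments back, so its return value says nothing about the real result Hash.

```ruby
# Hypothetical stand-in that accepts a subset of the keywords documented
# above and echoes them back. Swap in ReadabilityJs in real code.
module ParserStandIn
  def self.parse(html, url: nil, char_threshold: 500, keep_classes: false, classes_to_preserve: [])
    { html_bytes: html.bytesize, url: url, char_threshold: char_threshold,
      keep_classes: keep_classes, classes_to_preserve: classes_to_preserve }
  end
end

# Options reused for every page; anything not listed keeps its default.
SHARED_OPTIONS = { char_threshold: 250, classes_to_preserve: ['caption'] }.freeze

# Merges the shared options with the per-page URL via a double splat.
def parse_page(html, url, parser: ParserStandIn)
  parser.parse(html, **SHARED_OPTIONS, url: url)
end

result = parse_page('<html><body><p>Hi</p></body></html>', 'https://example.com/post')
puts result.inspect
```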

.parse_extended(html, url: nil, debug: false, max_elems_to_parse: 0, nb_top_candidates: 5, char_threshold: 500, classes_to_preserve: [], keep_classes: false, disable_json_ld: false, serializer: nil, allow_video_regex: nil, link_density_modifier: 0) ⇒ `Hash`

Like .parse, but with additional pre- and post-processing to enhance content extraction.

Only ‘html’ is required; all other parameters are optional.

Parameters:

html (String) —

The HTML document as a string.
url (String, nil) (defaults to: nil) —

The URL of the document (optional, used for resolving relative links).
debug (Boolean) (defaults to: false) —

Enable debug mode (default: false).
max_elems_to_parse (Integer) (defaults to: 0) —

Maximum number of elements to parse (default: 0, meaning no limit).
nb_top_candidates (Integer) (defaults to: 5) —

Number of top candidates to consider (default: 5).
char_threshold (Integer) (defaults to: 500) —

Minimum number of characters for an element to be considered (default: 500).
classes_to_preserve (Array<String>) (defaults to: []) —

List of CSS classes to preserve in the output (default: []).
keep_classes (Boolean) (defaults to: false) —

Whether to keep the original classes in the output (default: false).
disable_json_ld (Boolean) (defaults to: false) —

Disable JSON-LD parsing (default: false).
serializer (String, nil) (defaults to: nil) —

Serializer to use for output (optional).
allow_video_regex (String, nil) (defaults to: nil) —

Regular expression to allow video URLs (optional).
link_density_modifier (Float) (defaults to: 0) —

Modifier for link density calculation (default: 0).

Returns:

(Hash) —

A hash containing the extracted content and metadata.

Raises:

(ReadabilityJs::Error) —

if an error occurs during execution

# File 'lib/readability_js.rb', line 70

def self.parse_extended(html, url: nil, debug: false, max_elems_to_parse: 0, nb_top_candidates: 5, char_threshold: 500, classes_to_preserve: [], keep_classes: false, disable_json_ld: false, serializer: nil, allow_video_regex: nil, link_density_modifier: 0)
  result = Extended::before_cleanup html
  result = parse result, url: url, debug: debug, max_elems_to_parse: max_elems_to_parse, nb_top_candidates: nb_top_candidates, char_threshold: char_threshold, classes_to_preserve: classes_to_preserve, keep_classes: keep_classes, disable_json_ld: disable_json_ld, serializer: serializer, allow_video_regex: allow_video_regex, link_density_modifier: link_density_modifier
  Extended::after_cleanup result, html
end

.probably_readerable?(html, min_content_length: 140, min_score: 20, visibility_checker: nil) ⇒ `Boolean`

Decides whether a document is probably readerable without parsing the whole document.

Only ‘html’ is required; all other parameters are optional.

html = "<html>…</html>"

# The checker is an anonymous JavaScript function passed as a string.
visibility_checker = <<~JS
  (node) => {
    const style = node.ownerDocument.defaultView.getComputedStyle(node);
    return (style && style.display !== 'none' && style.visibility !== 'hidden' && parseFloat(style.opacity) > 0);
  }
JS

ReadabilityJs.probably_readerable?(html, min_content_length: 200, min_score: 25, visibility_checker: visibility_checker)

Parameters:

html (String) —

The HTML document as a string.
min_content_length (Integer) (defaults to: 140) —

Minimum content length to consider the document readerable
min_score (Integer) (defaults to: 20) —

Minimum score to consider the document readerable
visibility_checker (String) (defaults to: nil) —

An anonymous JavaScript function definition, passed as a string, that checks whether a node is visible. The default visibility checker is used if none is provided.

Returns:

(Boolean) —

true if the document is probably readerable, false otherwise.

Raises:

(ReadabilityJs::Error) —

if an error occurs during execution




# File 'lib/readability_js.rb', line 137

def self.probably_readerable?(html, min_content_length: 140, min_score: 20, visibility_checker: nil)
  self.is_probably_readerable(html, min_content_length: min_content_length, min_score: min_score, visibility_checker: visibility_checker)
end
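
Since the readerability check avoids parsing the whole document, one plausible use is as a cheap gate in front of a full parse. The following is a minimal sketch of that idea, not a pattern prescribed by the library. extract_if_readerable, its backend: parameter and FakeBackend are hypothetical; the fake backend exists only so the sketch runs without the gem installed, and its return values are made up.

```ruby
# Hypothetical helper: returns nil for documents that fail the quick check,
# otherwise the parse result. backend must respond to probably_readerable?
# and parse with the signatures documented above.
def extract_if_readerable(html, backend:, url: nil)
  return nil unless backend.probably_readerable?(html)

  backend.parse(html, url: url)
end

# Stand-in backend for the sketch; it does not imitate the real heuristics
# or the real result Hash.
module FakeBackend
  def self.probably_readerable?(html, min_content_length: 140, min_score: 20, visibility_checker: nil)
    html.length >= min_content_length
  end

  def self.parse(html, url: nil, **_options)
    { source_url: url, input_length: html.length }
  end
end

short = extract_if_readerable('<p>tiny</p>', backend: FakeBackend)
long  = extract_if_readerable('<p>' + 'word ' * 40 + '</p>', backend: FakeBackend, url: 'https://example.com/a')
puts short.inspect
puts long.inspect
```

In real code the call would presumably pass `backend: ReadabilityJs`, with a `rescue ReadabilityJs::Error` around it.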
