Distillery
Distillery extracts the "content" portion of an HTML document. It applies heuristics based on element type, location, class/id name, and other attributes to try to find the content part of the HTML document and return it.
The logic for Distillery was heavily influenced by Readability, whose authors were kind enough to make their logic open source. Readability and Distillery share nearly the same logic for locating the content HTML element on the page; however, Distillery does not aim to be a direct port of that logic (see iterationlabs/ruby-readability for that).
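To make the idea of scoring by element type and class/id name concrete, here is a minimal toy sketch in plain Ruby. It is not Distillery's actual implementation: the name patterns, the weights, and the `toy_score` helper are all assumptions invented for this illustration, and real scoring also takes location and other attributes into account.

```ruby
# Illustrative only: a toy scorer, NOT Distillery's actual implementation.
# The patterns and weights below are assumptions chosen for this example.
POSITIVE_NAMES = /article|body|content|entry|main|post|text/i
NEGATIVE_NAMES = /comment|footer|nav|share|sidebar|sponsor|widget/i

# Base score by element type (hypothetical weights).
TAG_SCORES = { "article" => 10, "div" => 5, "pre" => 3, "form" => -3, "ul" => -3 }.freeze

# Scores one element from its tag name and its class/id attribute values.
def toy_score(tag, class_name: "", id: "")
  score = TAG_SCORES.fetch(tag.downcase, 0)
  [class_name, id].each do |name|
    score += 25 if name =~ POSITIVE_NAMES   # name hints at real content
    score -= 25 if name =~ NEGATIVE_NAMES   # name hints at page chrome
  end
  score
end

# Hypothetical candidate elements, reduced to the attributes the scorer uses.
candidates = [
  { :tag => "div", :class_name => "sidebar",      :id => "" },
  { :tag => "div", :class_name => "post-content", :id => "main" },
  { :tag => "ul",  :class_name => "nav",          :id => "" }
]

# The highest-scoring candidate is treated as the content element.
best = candidates.max_by do |c|
  toy_score(c[:tag], :class_name => c[:class_name], :id => c[:id])
end
puts best[:class_name]
```

In this sketch the `post-content` element wins because both its class and its id match the positive pattern, while the sidebar and navigation list are penalized.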
Differences from Readability
Readability and Distillery differ in how they clean and return the found page content. Readability is focused on stripping the page content down to just paragraphs of text for distraction-free reading, and thus aggressively cleans and transforms the content element HTML. Mostly, this is the conversion of some <div> elements and newlines to <p> elements. Distillery does no transformation of the content element, and instead returns the content as originally seen in the HTML document.
Installation
gem install distillery
Usage
Usage is quite simple:
Distillery.distill(html_doc_as_a_string)
> "distilled content"
If you would like a more object-oriented syntax, Distillery offers a Distillery::Document API. Like the distill method above, its constructor takes a string that is the content of the HTML page you would like to distill:
doc = Distillery::Document.new(string_of_html)
Then you simply call #distill! on the document object to distill it and return the distilled content.
doc.distill!
> "distilled content"
Cleaning of the content
By default, both the Distillery::Document#distill! and Distillery.distill methods clean the content HTML, removing elements that are unlikely to be part of the actual content. Typically these are things like social media share buttons, widgets, and advertisements. If you do not want to clean the content, simply pass :clean => false to either method:
doc.distill!(:clean => false)
> "raw distilled content"
In its cleaning, Distillery will also remove all <img> tags from the content element. If you would like to preserve <img> tags, pass the :images => true option to the Distillery::Document#distill! and Distillery.distill methods. Please note that Distillery tries to preserve only those elements that contain "content images," so it is possible that some images that are part of the content will still be removed.
doc.distill!(:images => true)
> "raw distilled content with <img src=\"info.png\">"
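The options described above can be compared side by side with a small helper. This is only a sketch: `distill_variants` is a hypothetical name, not part of Distillery, and it assumes only what is stated above, namely that the distiller responds to `distill` with an HTML string and an optional options hash.

```ruby
# Hypothetical helper: `distill_variants` is not part of Distillery.
# `distiller` is any object that responds to distill(html, options),
# which is the shape of Distillery.distill described above.
def distill_variants(html, distiller)
  {
    :cleaned     => distiller.distill(html),                    # default behaviour
    :uncleaned   => distiller.distill(html, :clean => false),   # skip cleaning
    :with_images => distiller.distill(html, :images => true)    # keep <img> tags
  }
end

# With the gem installed (the file name here is a placeholder):
#   require 'distillery'
#   variants = distill_variants(File.read("page.html"), Distillery)
#   variants.each { |name, content| puts "#{name}: #{content.length} characters" }
```

Passing the distiller in as an argument keeps the helper easy to try out with a stand-in object before pointing it at Distillery itself.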
From the command line
Distillery also ships with an executable that allows you to distill documents at the command line:
Usage: distill [options] http://www.example.com/
options:
  -d, --dirty      Do not clean content HTML
  -i, --images     Keep images in the content HTML
  -v, --version    Print the version
  -h, --help       Print this help message
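As a rough illustration of how the -d and -i flags relate to the :clean and :images options, the following sketch parses them with Ruby's standard OptionParser. It is not the source of the shipped executable; `parse_distill_args` is a hypothetical helper, and the version and help flags are left out for brevity.

```ruby
require 'optparse'

# Illustrative sketch, not the source of the shipped `distill` executable.
# Maps the documented -d and -i flags onto an options hash of the kind
# that Distillery.distill accepts. Returns [url, options].
def parse_distill_args(argv)
  # Defaults mirror the library defaults: clean the content, drop images.
  options = { :clean => true, :images => false }

  parser = OptionParser.new do |opts|
    opts.banner = "Usage: distill [options] http://www.example.com/"
    opts.on("-d", "--dirty", "Do not clean content HTML") { options[:clean] = false }
    opts.on("-i", "--images", "Keep images in the content HTML") { options[:images] = true }
  end

  # parse returns the arguments that were not consumed as options.
  remaining = parser.parse(argv)
  [remaining.first, options]
end

url, options = parse_distill_args(["-d", "http://www.example.com/"])
puts url
puts options.inspect
```

The returned hash could then be handed to Distillery.distill together with the HTML fetched from the URL.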