Burly
A Ruby gem for extracting URLs from HTML, JSON, and plaintext documents.
Getting Started
Before installing and using Burly, you'll want to have Ruby 2.6 (or newer) installed. Using a Ruby version management tool like rbenv, chruby, or rvm is recommended.
Burly is developed using Ruby 3.4 and is tested against additional Ruby versions using Forgejo Actions.
Installation
Add Burly to your project's Gemfile and run bundle install:
source "https://rubygems.org"
gem "burly"
Usage
Using Burly to parse plaintext documents is as straightforward as:
Burly.parse(File.read("example.txt"))
Parsing JSON or HTML documents is only slightly more complicated:
Burly.parse(File.read("example.json"), mime_type: "application/json")
Burly.parse(File.read("example.html"), mime_type: "text/html")
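When documents of mixed types are read from disk, the mime_type argument in the calls above could be chosen from the file extension. The following is a minimal sketch of one way to do that; mime_type_for and MIME_TYPES_BY_EXTENSION are hypothetical names that are not part of Burly, and the sketch assumes that omitting mime_type selects plaintext parsing, as in the first example.

```ruby
# Hypothetical lookup table (not part of Burly): file extensions mapped to
# the MIME types used in the examples above.
MIME_TYPES_BY_EXTENSION = {
  ".json" => "application/json",
  ".html" => "text/html",
  ".htm"  => "text/html"
}.freeze

# Hypothetical helper (not part of Burly): return the MIME type for a path,
# or nil when the extension is unknown so the caller can omit mime_type.
def mime_type_for(path)
  MIME_TYPES_BY_EXTENSION[File.extname(path).downcase]
end

# Possible usage with Burly (requires the gem to be installed):
#   mime_type = mime_type_for(path)
#   if mime_type
#     Burly.parse(File.read(path), mime_type: mime_type)
#   else
#     Burly.parse(File.read(path))
#   end
```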
Burly uses slightly different parsing rules for each supported MIME type:
- In plaintext documents, Burly extracts absolute URLs (e.g. https://website.example) from the document.
- In JSON documents, Burly extracts string values that only contain absolute URLs (e.g. { "url": "https://website.example" } and { "urls": ["https://website.example", "https://another-website.example"] }).
- In HTML documents, Burly extracts absolute and relative URLs from URL attributes and srcset attributes.
In all cases, neither order nor uniqueness is guaranteed. You may also want to convert relative URLs extracted from HTML documents to absolute URLs using the document's source URL and/or the <base> element's href attribute value (Ruby's URI.join method is good for this!).
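As a rough illustration of the URI.join approach, the sketch below resolves a list of extracted URLs against a base URL and removes duplicates. The absolutize helper is hypothetical and not part of Burly, and the sketch assumes the extracted URLs are available as an array of strings. The base URL would be the document's source URL or the <base> element's href value.

```ruby
require "uri"

# Hypothetical helper (not part of Burly): resolve each URL against
# base_url, skip strings that cannot be parsed as URIs, and drop duplicates.
def absolutize(urls, base_url)
  resolved = urls.map do |url|
    begin
      # Absolute URLs pass through unchanged; relative ones are resolved.
      URI.join(base_url, url).to_s
    rescue URI::Error
      nil
    end
  end

  resolved.compact.uniq
end

extracted = ["/about", "../images/logo.png", "https://another-website.example/", "/about"]

absolutize(extracted, "https://website.example/articles/")
# => ["https://website.example/about",
#     "https://website.example/images/logo.png",
#     "https://another-website.example/"]
```

Because uniq keeps the first occurrence of each value, the result preserves the order in which URLs were passed in. Sort the result if a stable order across runs matters.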
License
Burly is freely available under the MIT License.