Module: PageHub::Markdown::Embedder
- Defined in: lib/pagehub-markdown/processors/embedder.rb
Overview
Downloads remote textual resources and extracts content from HTML pages so that it can be embedded neatly in another page.
Defined Under Namespace
Classes: EmbeddingError, GithubWikiProcessor, InvalidSizeError, InvalidTypeError, PageHubProcessor, Processor
Constant Summary
- AllowedTypes = [/text\/plain/, /text\/html/, /application\/html/]
  Resources whose content type does not match an entry in this list are rejected.
- MaximumLength = 1 * 1024 * 1024
  Resources larger than 1 MiB are rejected.
- FilteredHosts = []
  Resources served by any host in this list are rejected.
- Timeout = 5
  Open and read timeout for the HTTP connection, in seconds.
Class Method Summary
- .allowed?(ctype) ⇒ Boolean
- .get_resource(raw_uri, source = "", args = "") ⇒ Object
  Performs a HEAD request to validate the resource. If it passes the checks, the resource is downloaded and handed to the first registered Embedder::Processor that applies to it, if any.
- .register_processor(proc) ⇒ Object
Class Method Details
.allowed?(ctype) ⇒ Boolean
# File 'lib/pagehub-markdown/processors/embedder.rb', line 101
def allowed?(ctype)
  AllowedTypes.each { |t| return true if t.match ctype }
  false
end
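Because AllowedTypes holds regular expressions rather than literal strings, a Content-Type value that carries parameters such as a charset still passes the check. The following is a minimal, self-contained sketch of that behaviour. ALLOWED_TYPES and type_allowed? are hypothetical stand-ins that mirror the listing above; they are not the module's own names.

```ruby
# Hypothetical stand-in mirroring AllowedTypes from the constant summary.
ALLOWED_TYPES = [/text\/plain/, /text\/html/, /application\/html/].freeze

# Returns true when the content type matches at least one allowed pattern.
# Same effect as the each/return loop in allowed?, written with any?.
def type_allowed?(ctype)
  ALLOWED_TYPES.any? { |pattern| pattern.match?(ctype) }
end

puts type_allowed?("text/html; charset=utf-8") # parameters do not prevent a match
puts type_allowed?("application/json")         # no pattern matches
```

Note that the patterns are unanchored, so they match anywhere in the header value.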
.get_resource(raw_uri, source = "", args = "") ⇒ Object
Performs a HEAD request to validate the resource. If it passes the checks, the resource is downloaded and handed to the first registered Embedder::Processor that applies to it, if any.
Arguments:
- raw_uri: the full raw URI of the file to be embedded
- source: an optional identifier specifying the Processor that should be used to post-process the content
- args: options that may be meaningful to the Processor, if any
Returns: a string containing the extracted data, or an empty string (for example, when the host is listed in FilteredHosts)
# File 'lib/pagehub-markdown/processors/embedder.rb', line 50
def get_resource(raw_uri, source = "", args = "")
  begin
    uri = URI.parse(raw_uri)

    # reject if the host is banned
    return "" if FilteredHosts.include?(uri.host)

    Net::HTTP.start(uri.host, uri.port) do |http|
      http.open_timeout = Timeout
      http.read_timeout = Timeout

      # get the content type and length
      ctype = ""
      clength = 0
      http.head(uri.path).each { |k,v|
        # puts "#{k} => #{v}"
        ctype = v if k == "content-type"
        clength = v.to_i if k == "content-length"
      }

      raise InvalidTypeError.new ctype if !self.allowed?(ctype)
      raise InvalidSizeError.new clength if clength > MaximumLength

      open(raw_uri) { |f|
        content = f.read

        # invoke processors
        keys = []
        keys << source unless source.empty?
        keys << raw_uri

        @@processors.each { |p|
          if p.applies_to?(keys) then
            content = p.process(content, raw_uri, args)
            break
          end
        }

        return content
      }
    end
  rescue EmbeddingError => e
    # we want to escalate these errors
    raise e
  rescue Exception => e
    # mask as a generic EmbeddingError
    raise EmbeddingError.new e.message
  end

  ""
end
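The header checks that follow the HEAD request can be tried without any network access. The sketch below reproduces only that validation step against a plain Hash. The names check_headers!, ALLOWED and MAX_LENGTH are hypothetical, and the error class hierarchy shown is an assumption made for illustration; the real classes live under PageHub::Markdown::Embedder.

```ruby
# Assumed hierarchy, for illustration only.
class EmbeddingError < StandardError; end
class InvalidTypeError < EmbeddingError; end
class InvalidSizeError < EmbeddingError; end

ALLOWED = [/text\/plain/, /text\/html/, /application\/html/].freeze
MAX_LENGTH = 1 * 1024 * 1024

# Validates header values the way get_resource does after its HEAD request.
# `headers` is a Hash with lowercase keys standing in for the HTTP response.
def check_headers!(headers)
  ctype = headers.fetch("content-type", "")
  clength = headers.fetch("content-length", "0").to_i

  raise InvalidTypeError, ctype unless ALLOWED.any? { |t| t.match?(ctype) }
  raise InvalidSizeError, clength.to_s if clength > MAX_LENGTH
  true
end

begin
  check_headers!("content-type" => "text/html", "content-length" => "2097152")
rescue InvalidSizeError => e
  puts "rejected: #{e.message} bytes"
end
```

As in the listing above, a response without a Content-Length header passes the size check, because the length defaults to 0.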
.register_processor(proc) ⇒ Object
# File 'lib/pagehub-markdown/processors/embedder.rb', line 106
def register_processor(proc)
  @@processors ||= []
  @@processors << proc
end
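A registered processor only needs to respond to the two calls made in get_resource: applies_to?(keys) and process(content, raw_uri, args). The sketch below shows registration and first-match dispatch in isolation. UpcaseProcessor, EmbedderSketch and post_process are hypothetical names invented for this example, and the sketch keeps its list in a module-level instance variable rather than the class variable used above.

```ruby
# Hypothetical processor that applies when the source identifier is "upcase".
class UpcaseProcessor
  # keys holds the optional source identifier followed by the raw URI.
  def applies_to?(keys)
    keys.include?("upcase")
  end

  def process(content, raw_uri, args)
    content.upcase
  end
end

# Hypothetical stand-in for the registration and dispatch part of the module.
module EmbedderSketch
  @processors = []

  class << self
    def register_processor(processor)
      @processors << processor
    end

    # Runs the first processor that applies; returns content unchanged otherwise.
    def post_process(content, raw_uri, source = "", args = "")
      keys = []
      keys << source unless source.empty?
      keys << raw_uri
      processor = @processors.find { |p| p.applies_to?(keys) }
      processor ? processor.process(content, raw_uri, args) : content
    end
  end
end

EmbedderSketch.register_processor(UpcaseProcessor.new)
puts EmbedderSketch.post_process("hello", "http://example.com/a.txt", "upcase")
puts EmbedderSketch.post_process("hello", "http://example.com/a.txt")
```

Only the first matching processor runs, mirroring the break in the loop inside get_resource, so registration order matters when several processors could apply.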