Module: PageHub::Markdown::Embedder

Defined in:
lib/pagehub-markdown/processors/embedder.rb

Overview

Downloads remote textual resources and extracts content from HTML pages so that it can be embedded neatly in another page.

Defined Under Namespace

Classes: EmbeddingError, GithubWikiProcessor, InvalidSizeError, InvalidTypeError, PageHubProcessor, Processor

Constant Summary

AllowedTypes =

Resources whose content-type is not specified in this list will be rejected.

[/text\/plain/, /text\/html/, /application\/html/]

MaximumLength =

Resources larger than 1 MByte will be rejected.

1 * 1024 * 1024

FilteredHosts =

Resources served by any of the hosts specified in this list will be rejected.

[]

Timeout =

Open and read timeout for HTTP requests, in seconds.

5

Class Method Summary

Class Method Details

.allowed?(ctype) ⇒ Boolean

Returns:

  • (Boolean) true if ctype matches any pattern in AllowedTypes, false otherwise


# File 'lib/pagehub-markdown/processors/embedder.rb', line 101

def allowed?(ctype)
  # true as soon as any allowed pattern matches the content type
  AllowedTypes.any? { |t| t.match?(ctype) }
end
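
To illustrate the check, here is a minimal sketch that re-declares the pattern list locally. The names `ALLOWED_TYPES` and `type_allowed?` are illustrative and not part of the module. Because the patterns are unanchored, a header value that carries parameters, such as `text/html; charset=utf-8`, still matches.

```ruby
# Local copy of the patterns listed under AllowedTypes. The constant name
# ALLOWED_TYPES and the helper type_allowed? are hypothetical.
ALLOWED_TYPES = [/text\/plain/, /text\/html/, /application\/html/].freeze

# Returns true when the content type matches any allowed pattern.
def type_allowed?(ctype)
  ALLOWED_TYPES.any? { |pattern| pattern.match?(ctype.to_s) }
end

puts type_allowed?("text/html; charset=utf-8")
puts type_allowed?("image/png")
```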

.get_resource(raw_uri, source = "", args = "") ⇒ Object

Rejects filtered hosts, then performs a HEAD request to validate the content type and length of the resource. If the checks pass, the resource is downloaded and handed to the first registered Embedder::Processor that applies to it, if any.

Arguments:

  1. raw_uri: the full raw URI of the file to be embedded

  2. source: an optional identifier to specify the Processor that should be used to post-process the content

  3. args: options that can be meaningful to the Processor, if any

Returns: A string containing the extracted data, or an empty string if the host is listed in FilteredHosts
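
The validation order described above can be sketched without any network access by passing the HEAD response headers in as a plain hash. This is an illustration under assumptions, not the module's code: `validate`, `ALLOWED`, `MAX_LENGTH`, and `FILTERED_HOSTS` are hypothetical names, the error classes are local stand-ins, and the subclass relationship between them is assumed.

```ruby
require "uri"

# Hypothetical stand-ins for the module's constants and error classes.
ALLOWED = [/text\/plain/, /text\/html/, /application\/html/].freeze
MAX_LENGTH = 1 * 1024 * 1024
FILTERED_HOSTS = ["blocked.example"].freeze

class EmbeddingError < StandardError; end
class InvalidTypeError < EmbeddingError; end
class InvalidSizeError < EmbeddingError; end

# Returns true when the resource may be downloaded and false when the host
# is filtered. Raises when the content type or length is rejected.
def validate(raw_uri, headers)
  uri = URI.parse(raw_uri)
  return false if FILTERED_HOSTS.include?(uri.host)

  ctype = headers.fetch("content-type", "")
  clength = headers.fetch("content-length", "0").to_i

  raise InvalidTypeError, ctype unless ALLOWED.any? { |t| t.match?(ctype) }
  raise InvalidSizeError, clength.to_s if clength > MAX_LENGTH

  true
end

headers = { "content-type" => "text/plain", "content-length" => "120" }
puts validate("http://example.com/notes.txt", headers)
```

A missing content-length header counts as zero here, so such a resource passes the size check, which mirrors the default of `clength = 0` in the method source.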



# File 'lib/pagehub-markdown/processors/embedder.rb', line 50

def get_resource(raw_uri, source = "", args = "")
  begin
    uri = URI.parse(raw_uri)

    # reject if the host is banned
    return "" if FilteredHosts.include?(uri.host)

    # timeouts are passed to start so that open_timeout covers the connection
    Net::HTTP.start(uri.host, uri.port,
                    use_ssl: uri.scheme == "https",
                    open_timeout: Timeout,
                    read_timeout: Timeout) do |http|
      # get the content type and length from a HEAD request;
      # request_uri falls back to "/" when the URI has no path
      ctype = ""
      clength = 0
      http.head(uri.request_uri).each do |k, v|
        ctype = v if k == "content-type"
        clength = v.to_i if k == "content-length"
      end

      raise InvalidTypeError.new(ctype) unless allowed?(ctype)
      raise InvalidSizeError.new(clength) if clength > MaximumLength

      # URI.open is provided by open-uri
      URI.open(raw_uri) do |f|
        content = f.read

        # invoke the first processor that applies
        keys = []
        keys << source unless source.empty?
        keys << raw_uri
        @@processors.each do |p|
          if p.applies_to?(keys)
            content = p.process(content, raw_uri, args)
            break
          end
        end

        return content
      end
    end
  rescue EmbeddingError
    # we want to escalate these errors
    raise
  rescue StandardError => e
    # mask as a generic EmbeddingError
    raise EmbeddingError.new(e.message)
  end

  ""
end
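
The two rescue clauses form a small pattern: errors that are already embedding errors propagate unchanged, and anything else is wrapped. The sketch below isolates that pattern in a hypothetical helper named `guarded`. The error classes are local stand-ins, and it is an assumption that InvalidTypeError inherits from EmbeddingError. The sketch rescues StandardError, so interrupts and exit requests are not wrapped.

```ruby
# Local stand-ins; the inheritance shown here is assumed, not confirmed.
class EmbeddingError < StandardError; end
class InvalidTypeError < EmbeddingError; end

# Runs the block, letting embedding errors through and wrapping the rest.
def guarded
  yield
rescue EmbeddingError
  # Already the right kind of error: re-raise it untouched.
  raise
rescue StandardError => e
  # Anything else is masked as a generic EmbeddingError.
  raise EmbeddingError, e.message
end

begin
  guarded { raise ArgumentError, "bad URI" }
rescue EmbeddingError => e
  puts "#{e.class}: #{e.message}"
end
```

With this shape a caller only needs a single `rescue EmbeddingError` and can still rescue the more specific subclasses first when it wants to distinguish them.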

.register_processor(proc) ⇒ Object

Appends proc to the list of processors that get_resource consults. Processors are tried in registration order, and the first one whose applies_to? returns true processes the content.



# File 'lib/pagehub-markdown/processors/embedder.rb', line 106

def register_processor(proc)
  @@processors ||= []
  @@processors << proc
end
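
As a rough illustration of how a registered processor is selected, the sketch below builds a tiny stand-alone registry. `Registry`, `Registry.run`, and `UpcaseProcessor` are hypothetical names. The real Processor base class is not shown on this page, so the example class only mirrors the two methods that get_resource calls, `applies_to?(keys)` and `process(content, raw_uri, args)`.

```ruby
# Hypothetical processor that applies when the source key is "upcase".
class UpcaseProcessor
  def applies_to?(keys)
    keys.include?("upcase")
  end

  def process(content, _raw_uri, _args)
    content.upcase
  end
end

# Hypothetical registry mimicking the register-then-dispatch flow.
module Registry
  @processors = []

  class << self
    def register_processor(processor)
      @processors << processor
    end

    # Builds the key list the same way get_resource does, then lets the
    # first applicable processor transform the content.
    def run(content, raw_uri, source = "", args = "")
      keys = []
      keys << source unless source.empty?
      keys << raw_uri
      processor = @processors.find { |p| p.applies_to?(keys) }
      processor ? processor.process(content, raw_uri, args) : content
    end
  end
end

Registry.register_processor(UpcaseProcessor.new)
puts Registry.run("hello", "http://example.com/a.txt", "upcase")
puts Registry.run("hello", "http://example.com/a.txt")
```

The sketch initializes its list when the module is defined, so dispatch works even before anything is registered. That is a design choice of the sketch, not a statement about the module above.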