Class: LinkHeaders::Processor
- Inherits: Object
  - Object
    - LinkHeaders::Processor
- Defined in: lib/linkheaders/processor.rb, lib/linkheaders/processor/version.rb
Overview
A Link Header parser
Works for both HTML and HTTP links, and follows references to Linksets of either the JSON or the text type.
Constant Summary
- VERSION = "0.1.21"
Instance Attribute Summary
- #default_anchor ⇒ String
  The anchor assigned to links that do not declare their own.
- #factory ⇒ LinkHeaders::LinkFactory
  Get the parser factory that contains all the links.
Instance Method Summary
- #check_for_linkset(responsepart:) ⇒ Object
- #extract_and_parse(response: RestClient::Response.new) ⇒ Object
  Parses a RestClient::Response.
- #initialize(default_anchor: 'https://default.anchor.org/') ⇒ Processor (constructor)
  Create the Link Headers Parser and its Link factory.
- #parse_html_link_headers(body:, anchor: '') ⇒ Object
  Parses the link headers out of an HTML body, and adds links to the LinkHeaders::LinkFactory object.
- #parse_http_link_headers(headers) ⇒ Object
  Consumes the response headers (a Hash with a :link entry) and parses the Link header into individual links.
- #processJSONLinkset(href:) ⇒ Object
- #processTextLinkset(href:) ⇒ Object
- #split_http_link_headers_and_process(parts) ⇒ Object
Constructor Details
#initialize(default_anchor: 'https://default.anchor.org/') ⇒ Processor
Create the Link Headers Parser and its Link factory.

    # File 'lib/linkheaders/processor.rb', line 28
    def initialize(default_anchor: 'https://default.anchor.org/')
      @default_anchor = default_anchor
      @factory = LinkHeaders::LinkFactory.new(default_anchor: @default_anchor)
    end
Instance Attribute Details
#default_anchor ⇒ String
Returns the anchor assigned to links that do not declare their own.

    # File 'lib/linkheaders/processor.rb', line 21
    def default_anchor
      @default_anchor
    end
#factory ⇒ LinkHeaders::LinkFactory
Get the parser factory that contains all the links.

    # File 'lib/linkheaders/processor.rb', line 21
    def factory
      @factory
    end
Instance Method Details
#check_for_linkset(responsepart:) ⇒ Object

    # File 'lib/linkheaders/processor.rb', line 175
    def check_for_linkset(responsepart:)
      warn "looking for a linkset"
      newlinks = Array.new
      factory.linksets.each do |linkset|
        # warn "found #{linkset.methods- Object.new.methods}"
        # warn "inspect #{linkset.inspect}"
        next unless linkset.respond_to? 'type'
        # warn "responds #{linkset.type} "
        case linkset.type
        when 'application/linkset+json'
          # warn "found a json linkset"
          newlinks << processJSONLinkset(href: linkset.href)
        when 'application/linkset'
          # warn "found a text linkset"
          newlinks << processTextLinkset(href:linkset.href)
        else
          warn "the linkset #{linkset} was not typed as 'application/linkset+json' or 'application/linkset', and it should be! (found #{linkset.type}) Ignoring..."
        end
      end
      newlinks
    end
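
The dispatch on the linkset media type can be illustrated in isolation. This is a minimal sketch, not the gem's code: `LinksetRef` and `linkset_handler` are hypothetical names, and the two handlers are represented by symbols instead of the real fetch-and-parse methods.

```ruby
# Hypothetical stand-in for a link object that carries a media type.
LinksetRef = Struct.new(:href, :type)

# Map a linkset reference to the kind of handler that would process it.
# Returns nil for an untyped or unrecognised reference, which a caller can skip.
def linkset_handler(ref)
  case ref.type
  when 'application/linkset+json' then :json
  when 'application/linkset'      then :text
  end
end

refs = [
  LinksetRef.new('https://example.org/set.json', 'application/linkset+json'),
  LinksetRef.new('https://example.org/set.txt', 'application/linkset'),
  LinksetRef.new('https://example.org/other', 'text/html')
]
handlers = refs.map { |ref| linkset_handler(ref) }
```

Here the third reference yields `nil`, which corresponds to the "Ignoring..." branch of the listing.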
#extract_and_parse(response: RestClient::Response.new) ⇒ Object
Parses a RestClient::Response.
The HTTP headers are parsed for links, and if those links contain a Linkset, it is retrieved and parsed. If the response is of some HTML form, the body is also parsed for link headers and Linkset links. All discovered links end up in a LinkHeaders::LinkFactory object (self.factory).

    # File 'lib/linkheaders/processor.rb', line 52
    def extract_and_parse(response: RestClient::Response.new)
      head = response.headers
      body = response.body
      # warn "\n\n head #{head.inspect}\n\n"

      unless head
        warn "WARNING: This doesn't seem to be a RestClient response message.\nReturning blank"
        return [[], []]
      end

      _newlinks = parse_http_link_headers(head)
      # warn "HTTPlinks #{newlinks.inspect}"

      ['text/html', 'text/xhtml+xml', 'application/xhtml+xml'].each do |format|
        if head[:content_type] and head[:content_type].match(format)
          warn "found #{format} content - parsing"
          _htmllinks = parse_html_link_headers(body: body, anchor: default_anchor) # pass html body to find HTML link headers
          # warn "htmllinks #{htmllinks.inspect}"
        end
      end
    end
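
The content-type check that decides whether the body is scanned for HTML links can be sketched on its own. The helper name `html_content?` and the constant `HTML_TYPES` are assumptions made for this illustration. One detail worth noting: `String#match` compiles a String argument into a Regexp, so the `+` in `text/xhtml+xml` is treated as a quantifier, not as a literal character, and the XHTML entries in the listing would not match their own media type. The sketch sidesteps this by comparing literal strings.

```ruby
# Media types treated as HTML (same three values as in the listing above).
HTML_TYPES = ['text/html', 'text/xhtml+xml', 'application/xhtml+xml'].freeze

# True when the Content-Type value names one of the HTML media types.
# Parameters such as "; charset=UTF-8" are dropped before comparing.
def html_content?(content_type)
  return false if content_type.nil?

  media_type = content_type.split(';').first.to_s.strip.downcase
  HTML_TYPES.include?(media_type)
end

html_content?('text/html; charset=UTF-8')  # HTML, so the body would be parsed
html_content?('application/json')          # not HTML, so the body is skipped
```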
#parse_html_link_headers(body:, anchor: '') ⇒ Object
Parses the link headers out of an HTML body, and adds links to the LinkHeaders::LinkFactory object. Automatically retrieves and processes any Linkset references found.

    # File 'lib/linkheaders/processor.rb', line 149
    def parse_html_link_headers(body:, anchor: '')
      m = MetaInspector.new(anchor, document: body)
      # an array of elements that look like this: [{:rel=>"alternate", :type=>"application/ld+json", :href=>"http://scidata.vitk.lv/dataset/303.jsonld"}]
      newlinks = Array.new
      m.head_links.each do |l|
        next unless l[:href] and l[:rel] # required

        anchor = l[:anchor] || default_anchor
        l.delete(:anchor)
        relation = l[:rel]
        l.delete(:rel)
        href = l[:href]
        l.delete(:href)

        relations = relation.split(/\s+/) # handle the multiple relation case
        # warn "BODY RELATIONS #{relations}"

        relations.each do |rel|
          next unless rel.match?(/\w/)
          newlinks << factory.new_link(responsepart: :header, anchor: anchor, href: href, relation: rel, **l) # parsed['https://example.one.com'][:rel] = "preconnect"
        end
      end
      newlinks << check_for_linkset(responsepart: :body)
      newlinks
    end
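
The two conventions used in this method, falling back to the default anchor and expanding a space-separated `rel` value into one link per relation, can be shown without MetaInspector or the factory. The sketch below uses plain hashes; `expand_relations` and `DEFAULT_ANCHOR` are hypothetical names, and the returned hashes only stand in for the link objects the factory would create.

```ruby
DEFAULT_ANCHOR = 'https://default.anchor.org/'

# Expand one <link>-style attribute hash into one hash per relation type.
def expand_relations(attrs, default_anchor: DEFAULT_ANCHOR)
  attrs = attrs.dup                      # leave the caller's hash untouched
  relation = attrs.delete(:rel)
  href = attrs.delete(:href)
  return [] unless relation && href      # both are required

  anchor = attrs.delete(:anchor) || default_anchor
  relation.split(/\s+/).select { |rel| rel.match?(/\w/) }.map do |rel|
    # remaining attributes (type, title, ...) are carried over unchanged
    { anchor: anchor, href: href, relation: rel, **attrs }
  end
end

links = expand_relations(
  { rel: 'cite-as describedby', href: 'https://example.org/x', type: 'text/html' }
)
```

With the input shown, `links` holds two entries, one for `cite-as` and one for `describedby`, both anchored at the default anchor.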
#parse_http_link_headers(headers) ⇒ Object
Consumes the response headers (a Hash with a :link entry) and parses the Link header into individual links. Automatically retrieves and processes any Linkset references found. All LinkHeaders::Link objects end up in the LinkHeaders::LinkFactory object (self.factory).

    # File 'lib/linkheaders/processor.rb', line 80
    def parse_http_link_headers(headers)
      newlinks = Array.new
      # Link: <https://example.one.com>; rel="preconnect", <https://example.two.com>; rel="preconnect", <https://example.three.com>; rel="preconnect"
      links = headers[:link]
      return [] unless links

      # warn links.inspect
      parts = links.split(',') # ["<https://example.one.com>; rel='preconnect'", "<https://example.two.com>; rel="preconnect"".....]
      # warn parts

      # Parse each part into a named link
      newlinks << split_http_link_headers_and_process(parts) # creates links from the split headers and adds to factory.all_links
      newlinks << check_for_linkset(responsepart: :header) # all links are held in the Linkset::LinkFactory object (factory variable here). This scans the links for a linkset link to follow
      newlinks
    end
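
The comma split at the heart of this method is easiest to see on a concrete header value. The sketch below contrasts the plain `split(',')` from the listing with an alternative that only splits where a new `<target>` begins. The alternative is an assumption offered for comparison, not something the gem does, and it would still be fooled by a `,<` sequence inside a quoted value.

```ruby
# Example header value (made up for this sketch); note the comma inside title="...".
header = '<https://a.example/>; rel="item"; title="one, two", <https://b.example/>; rel="next"'

# Plain comma split, as in the listing: the comma inside the quoted title also splits.
naive = header.split(',')

# Assumed alternative: split only on commas that are followed by a new "<target>".
careful = header.split(/,\s*(?=<)/)
```

For this input `naive` has three parts while `careful` has two, one per link.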
#processJSONLinkset(href:) ⇒ Object

    # File 'lib/linkheaders/processor.rb', line 197
    def processJSONLinkset(href:)
      _headers, linkset = lhfetch(href, { 'Accept' => 'application/linkset+json' })
      # warn "Linkset body #{linkset.inspect}\n\nLinkset headers #{_headers}\n\n"
      newlinks = Array.new
      return nil unless linkset

      # warn "linkset #{linkset}"
      # linkset = '{ "linkset":
      #   [
      #     { "anchor": "http://example.net/bar",
      #       "item": [
      #         {"href": "http://example.com/foo1", "type": "text/html"},
      #         {"href": "http://example.com/foo2"}
      #       ],
      #       "next": [
      #         {"href": "http://the.next/"}
      #       ]
      #     }
      #   ]
      # }'

      linkset = JSON.parse(linkset)
      # warn "linkset #{linkset}"
      if linkset['data'] and linkset['data']['linkset']
        linkset['linkset'] = linkset['data']['linkset']
      end
      return nil unless linkset['linkset'].first
      linkset['linkset'].each do |ls|
        # warn ls.inspect, "\n"
        anchor = ls['anchor'] || @default_anchor
        ls.delete('anchor') if ls['anchor'] # we need to delete since almost all others have a list as a value
        attrhash = {}
        # warn ls.keys, "\n"

        ls.each_key do |relation| # relation = e.g. "item", "described-by". "cite"
          href = ""
          # warn relation
          # warn ls[reltype], "\n"
          ls[relation] = [ls[relation]] unless ls[relation].is_a? Array # force it to be a list, if it isn't
          ls[relation].each do |attrs| # attr = e.g. {"href": "http://example.com/foo1", "type": "text/html"}
            # warn "ATTR: #{attrs}"
            next unless attrs['href'] # this is a required attribute of a linkset relation
            href = attrs['href']
            attrs.delete("href")
            # now go through the other attributes of that relation
            attrs.each do |attr, val| # attr = e.g. "type"; val = "text/html"
              attrhash[attr.to_sym] = val
            end
          end

          relations = relation.split(/\s+/) # handle the multiple relation case
          relations.each do |rel|
            next unless rel.match?(/\w/)
            newlinks << factory.new_link(responsepart: :header, anchor: anchor, href: href, relation: rel, **attrhash) # parsed['https://example.one.com'][:rel] = "preconnect"
          end
        end
      end
      newlinks
    end
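
The structure being walked here (a `linkset` array of context objects, each holding an optional `anchor` plus one key per relation type) can be explored with the standard `json` library alone. The following sketch is hedged: `flatten_json_linkset` is a hypothetical helper, it skips the HTTP fetch, and it returns plain hashes instead of factory links. Unlike the listing, it emits one entry per target object and keeps attributes per target; whether that matches the gem's intended behaviour is an assumption.

```ruby
require 'json'

# Flatten a linkset+json document into an array of link hashes.
def flatten_json_linkset(json_text, default_anchor: 'https://default.anchor.org/')
  contexts = JSON.parse(json_text).fetch('linkset', [])
  contexts.flat_map do |context|
    anchor = context['anchor'] || default_anchor
    relations = context.reject { |key, _| key == 'anchor' }
    relations.flat_map do |relation, targets|
      targets = [targets] unless targets.is_a?(Array) # tolerate a single object
      targets.select { |target| target['href'] }.map do |target|
        # everything except href becomes an attribute of the link
        attributes = target.reject { |key, _| key == 'href' }.transform_keys(&:to_sym)
        { anchor: anchor, href: target['href'], relation: relation, **attributes }
      end
    end
  end
end

# Same example document as in the commented-out block of the listing.
example = <<~JSON
  { "linkset": [
      { "anchor": "http://example.net/bar",
        "item": [
          {"href": "http://example.com/foo1", "type": "text/html"},
          {"href": "http://example.com/foo2"}
        ],
        "next": [ {"href": "http://the.next/"} ]
      }
  ] }
JSON

links = flatten_json_linkset(example)
```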
#processTextLinkset(href:) ⇒ Object

    # File 'lib/linkheaders/processor.rb', line 257
    def processTextLinkset(href:)
      newlinks = Array.new
      headers, linkset = lhfetch(href, { 'Accept' => 'application/linkset' })
      # warn "linkset body #{linkset.inspect}"
      return {} unless linkset

      # links = linkset.scan(/(<.*?>[^<]+)/) # split on the open angle bracket, which indicates a new link
      links = linkset.split(/,\n*/) # split on the comma+newline
      # warn "Links found #{links}"

      links.each do |ls|
        # warn "working on link #{ls}"
        elements = ls.split(';').map {|element| element.strip!} # semicolon delimited fields
        # ["<https://w3id.org/a2a-fair-metrics/08-http-describedby-citeas-linkset-txt/>", "anchor=\"https://s11.no/2022/a2a-fair-metrics/08-http-describedby-citeas-linkset-txt/\"", "rel=\"cite-as\""]
        href = elements.shift # first element is always the link url
        # warn "working on link href #{href}"
        href = href.match(/<([^>]+)>/)[1]
        attrhash = {}
        elements.each do |e|
          key, val = e.split('=')
          val.delete_prefix!('"').delete_suffix!('"') # get rid of newlines and start/end quotes
          attrhash[key.to_sym] = val # split on key=val and make key a symbol
        end
        warn "No link relation type... this is bad! Skipping" unless attrhash[:rel]
        next unless attrhash[:rel]

        relation = attrhash[:rel]
        attrhash.delete(:rel)
        anchor = attrhash[:anchor] || @default_anchor
        attrhash.delete(:anchor)

        relations = relation.split(/\s+/) # handle the multiple relation case
        #$stderr.puts "RELATIONS #{relations}"

        relations.each do |rel|
          next unless rel.match?(/\w/)
          newlinks << factory.new_link(responsepart: :header, anchor: anchor, href: href, relation: rel, **attrhash) # parsed['https://example.one.com'][:rel] = "preconnect"
        end
      end
      newlinks
    end
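
A text linkset is a comma-separated list of `<target>; key="value"` entries, so the parsing steps can be tried on a literal string. The sketch below is a simplified stand-in with a hypothetical helper name, `parse_text_linkset`, and plain hashes as output. It uses the non-bang `strip`, `delete_prefix` and `delete_suffix`, because the bang variants used in the listing return `nil` when the string is left unchanged. It also splits entries only where a new `<target>` follows the comma, which is an assumption of this sketch.

```ruby
# Parse an application/linkset (text) body into an array of link hashes.
def parse_text_linkset(text, default_anchor: 'https://default.anchor.org/')
  text.split(/,\s*(?=<)/).each_with_object([]) do |entry, links|
    target, *params = entry.split(';').map(&:strip) # non-bang strip never returns nil
    match = target.to_s.match(/<([^>]+)>/)
    next unless match                               # no <target>, skip the entry

    attrs = params.each_with_object({}) do |param, hash|
      key, value = param.split('=', 2)              # limit 2 keeps "=" inside values
      next if value.nil?
      hash[key.strip.to_sym] = value.strip.delete_prefix('"').delete_suffix('"')
    end
    relation = attrs.delete(:rel)
    next unless relation                            # rel is required

    anchor = attrs.delete(:anchor) || default_anchor
    relation.split(/\s+/).each do |rel|
      links << { anchor: anchor, href: match[1], relation: rel, **attrs }
    end
  end
end

# Made-up linkset body with two entries, the second carrying two relations.
body = "<https://example.org/a>; rel=\"cite-as\"; anchor=\"https://example.org/anchor\",\n" \
       "<https://example.org/b>; rel=\"item describedby\"; type=\"text/html\""
links = parse_text_linkset(body)
```

Like the listing, this sketch splits parameters on every semicolon, so a quoted value that itself contains a semicolon would need the more careful handling used for HTTP headers.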
#split_http_link_headers_and_process(parts) ⇒ Object

    # File 'lib/linkheaders/processor.rb', line 96
    def split_http_link_headers_and_process(parts)
      newlinks = Array.new
      parts.each do |part, _index|
        warn "link is: #{part}"
        # <https://s11.no/2022/a2a-fair-metrics/70-rda-r1-01m-t2-type-charset/test-apple-data.csv>;rel="item";type="text/csv;charset=UTF-8"
        # this is crazy hard, because we can't rely on quotes!
        part = part + ";"
        href = part[/<([^\]]+)>;/, 1]
        next unless href
        part = part.gsub(/<([^\]]+)>\s?;\s?/, "") # now only ;rel="item";type="text/csv;charset=UTF-8"; pro=gram
        pieces = part.scan(/(\S+?\s?=\s?"[^"]+?"\s?);?/).flatten # ["rel=\"item\"", "type=\"text/csv;charset=UTF-8\""] and the non-quoted stuff is ignored
        pieces.each do |p|
          part = part.gsub(p, "") # now just ";;prog=gram;"
        end
        rest = part.split(";") # ["", "", "prog=gram"]

        sections = {}
        pieces.concat(rest).each do |s| # can be more than one link property "rel='preconnect'"
          s.strip!
          unless m = s.match(%r{(\w+?)\s?=\s?"?([\w\;\:\d\.\,\=\#\-\+\/\s]+)"?})
            warn " NO PATTERN MATCH ON #{s}"
            next
          end
          # can be rel="cite-as describedby" --> two relations in one! or "linkset+json"
          relation = m[1] # rel"
          value = m[2] # "preconnect"
          warn "section relation #{relation} value #{value}"
          sections[relation] = value # value could hold multiple relation types sections[:rel] = "preconnect"
        end

        next unless sections['rel'] # the relation is required!

        anchor = sections['anchor'] || default_anchor
        sections.delete('anchor')
        relation = sections['rel']
        sections.delete('rel')
        relations = relation.split(/\s+/) # handle the multiple relation case
        # warn "HEADERS RELATIONS #{relations}"

        relations.each do |rel|
          next unless rel.match?(/\w/)
          puts "LICENCE is #{href}\n\n" if rel == "license"
          newlinks << factory.new_link(responsepart: :header, anchor: anchor, href: href, relation: rel, **sections) # parsed['https://example.one.com'][:rel] = "preconnect"
        end
      end
      newlinks
    end
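
The difficulty named in the comments, parameters that may or may not be quoted and quoted values that may contain semicolons, can also be approached with a single scan that tries the quoted form first. This is a sketch of that alternative under stated assumptions, not the gem's algorithm: `PARAM` and `parse_link_value` are hypothetical names, and the result is a plain `[href, params]` pair.

```ruby
# One parameter: ";name=" followed by either a quoted value or a bare token.
PARAM = /;\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^;"]*))/

# Split one link-value such as
#   <https://example.org/data.csv>;rel="item";type="text/csv;charset=UTF-8"
# into its target URL and a hash of parameters.
def parse_link_value(part)
  match = part.match(/\A\s*<([^>]*)>/)
  return nil unless match                # no <target> at the start

  params = {}
  match.post_match.scan(PARAM) do |name, quoted, bare|
    # quoted values are kept verbatim, bare tokens are trimmed
    params[name.downcase] = quoted || bare.strip
  end
  [match[1], params]
end

href, params = parse_link_value(
  '<https://example.org/data.csv>;rel="item";type="text/csv;charset=UTF-8"; prog=gram'
)
```

Because the quoted alternative is tried first, the `;charset=UTF-8` inside the `type` value stays part of that value and is not read as a separate parameter.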