Class: Rfeedfinder

Inherits:
Object
Defined in:
lib/rfeedfinder/version.rb,
lib/rfeedfinder.rb

Overview

:nodoc:

Defined Under Namespace

Modules: VERSION

Constructor Details

#initialize(init_values = {}) ⇒ Rfeedfinder

Takes:

  • init_values (hash)

    • :proxy: (string) proxy information to use. Defaults to a blank string

    • :user_agent: (string) user agent to identify as. Defaults to Ruby/#{RUBY_VERSION} - Rfeedfinder VERSION

    • :from: (string) contact info for the responsible person. FIXME: Is this correct? Defaults to [email protected]

    • :keep_data: (boolean) whether the data downloaded for the feeds should be returned along with the URLs. Defaults to false

    • :use_google: (boolean) whether to try to find a URL using a Google “I’m feeling lucky” search. Defaults to false

    Example:

    Rfeedfinder.new(:proxy => "127.0.0.1:1234",
                    :user_agent => "MyApp",
                    :from => "[email protected]",
                    :referer => "http://domain.com")

Returns a new instance of Rfeedfinder



# File 'lib/rfeedfinder.rb', line 31

def initialize(init_values = {})
  @options = init_values
end

Class Method Details

.feed(uri, options = {}) ⇒ Object

Takes:

  • uri (string): The URI to check

  • options (hash)

    • :proxy: (string) proxy information to use. Defaults to a blank string

    • :user_agent: (string) user agent to identify as. Defaults to Ruby/#{RUBY_VERSION} - Rfeedfinder VERSION

    • :from: (string) contact info for the responsible person. FIXME: Is this correct? Defaults to [email protected]

    • :keep_data: (boolean) whether the data downloaded for the feeds should be returned along with the URLs. Defaults to false

    • :use_google: (boolean) whether to try to find a URL using a Google “I’m feeling lucky” search. Defaults to false

    Example:

    Rfeedfinder.feed("www.google.com", :proxy => "127.0.0.1:1234",
                     :user_agent => "MyApp",
                     :from => "[email protected]",
                     :referer => "http://domain.com")

Returns:

  • one URL as a string or nil

  • one hash if the :keep_data option is true. Example: {:url => "url1", :data => "some data"}

Raises:

  • ArgumentError if uri is not a valid URL, and :use_google => false

  • ArgumentError if :use_google => true but it’s not your lucky day



# File 'lib/rfeedfinder.rb', line 254

def self.feed(uri, options = {})
  options[:only_first] = true
  feedlist = Rfeedfinder.feeds(uri, options)
  unless feedlist.empty?
    return feedlist[0]
  else
    return nil
  end
end

.feeds(uri, options = {}) ⇒ Object

Takes:

  • uri (string): The URI to check

  • options (hash)

    • :proxy: (string) proxy information to use. Defaults to a blank string

    • :user_agent: (string) user agent to identify as. Defaults to Ruby/#{RUBY_VERSION} - Rfeedfinder VERSION

    • :from: (string) contact info for the responsible person. FIXME: Is this correct? Defaults to [email protected]

    • :keep_data: (boolean) whether the data downloaded for the feeds should be returned along with the URLs. Defaults to false

    • :use_google: (boolean) whether to try to find a URL using a Google “I’m feeling lucky” search. Defaults to false

    Example:

    Rfeedfinder.feeds("www.google.com", :proxy => "127.0.0.1:1234",
                      :user_agent => "MyApp",
                      :from => "[email protected]",
                      :referer => "http://domain.com")

Returns:

  • array of URLs

  • array of hashes if the :keep_data option is true. Example:

    [{:url => "url1", :data => "some data"}, {:url => "url2", :data => "feed data"}]

Raises:

  • ArgumentError if uri is not a valid URL, and :use_google => false

  • ArgumentError if :use_google => true but it’s not your lucky day
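
The two return shapes described under Returns can be illustrated with a minimal sketch. The helper name `shape_results` and the sample URLs and data are assumptions made for illustration; they are not part of Rfeedfinder.

```ruby
# Hypothetical helper (not part of Rfeedfinder): shapes a list of feed URLs
# the way the Returns section describes.
def shape_results(urls, data_by_url, keep_data)
  # Without :keep_data the caller simply gets the array of URLs.
  return urls unless keep_data

  # With :keep_data each URL is paired with the data downloaded for it.
  urls.map { |url| { :url => url, :data => data_by_url[url] } }
end

# Made-up sample input.
sample_urls = ["http://example.com/atom.xml", "http://example.com/rss.xml"]
sample_data = { sample_urls[0] => "<feed/>", sample_urls[1] => "<rss/>" }

p shape_results(sample_urls, sample_data, false)
p shape_results(sample_urls, sample_data, true)
```

Callers that toggle :keep_data should therefore be prepared to handle either strings or hashes in the returned array.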


# File 'lib/rfeedfinder.rb', line 86

def self.feeds(uri, options = {})
  
  # We have to create a hash for the data
  # if the user has asked us to keep the data
  options[:data] = {} if options[:keep_data]  

  options[:original_uri] = uri if !Rfeedfinder.isAValidURL?(uri) and options[:use_google]
  
  uri = URI.decode(uri)
  options[:recurs] = [uri] if options[:recurs].nil?
  fulluri = Rfeedfinder.makeFullURI(uri)

  raise ArgumentError, "#{fulluri} is not a valid URI." \
    if !Rfeedfinder.isAValidURL?(fulluri) and !options[:use_google]
  
  # Add youtube support
  if fulluri =~ /youtube\.com\/user\/(.*[^\/])/
    fulluri = "http://www.youtube.com/rss/user/#{$1}/videos.rss"
  end
  if fulluri =~ /youtube\.com\/tag\/(.*[^\/])/
    fulluri = "http://www.youtube.com/rss/tag/#{$1}/videos.rss"
  end
      
  data = Rfeedfinder.open_doc(fulluri, options)
  return [] if data.nil?

  # If we used the google link finder, then we should set the new URL
  fulluri = options[:google_link] if options[:google_link]

  # is this already a feed?
  if Rfeedfinder.isFeedData?(data)
    feedlist = [fulluri]
    Rfeedfinder.verifyRedirect(feedlist)
    return feedlist
  end
  
  #verify redirection
  newuri = Rfeedfinder.tryBrokenRedirect(data)
  if !newuri.nil? and !newuri.empty?
    options[:recurs] = [] unless options[:recurs]
    unless options[:recurs].include?(newuri)
      options[:recurs] << newuri
      return feeds(newuri, options)
    end
  end
   
  #verify frameset
  frames = Rfeedfinder.getFrameLinks(data, fulluri)
  frames.each {|newuri|
    if !newuri.nil? and !newuri.empty?
      options[:recurs] = [] unless options[:recurs]
      unless options[:recurs].include?(newuri)
        options[:recurs] << newuri
        return feeds(newuri, options)
      end
    end
  }
  
  # nope, it's a page, try LINK tags first
  outfeeds = Rfeedfinder.getLinks(data, fulluri).select {|link| Rfeedfinder.isFeed?(link, options)}
    
  #_debuglog('found %s feeds through LINK tags' % len(outfeeds))
  if outfeeds.empty?
    # no LINK tags, look for regular <A> links that point to feeds
    begin
      links = Rfeedfinder.getALinks(data, fulluri)
    rescue
      links = []
    end
    
    # Get local links
    links, locallinks = Rfeedfinder.getLocalLinks(links, fulluri)

    # TODO:
    # implement support for :only_first down here

    # look for obvious feed links on the same server
    selected_feeds = locallinks.select{|link| Rfeedfinder.isFeedLink?(link) and Rfeedfinder.isFeed?(link, options)}
    outfeeds << selected_feeds unless selected_feeds.empty?
    # outfeeds.each{|link| puts "1 #{link}"}
    
    # look harder for feed links on the same server
    selected_feeds = locallinks.select{|link| Rfeedfinder.isXMLRelatedLink?(link) and Rfeedfinder.isFeed?(link, options)} if outfeeds.empty?
    outfeeds << selected_feeds unless selected_feeds.empty?
    # outfeeds.each{|link| puts "2 #{link}"}

    # look for obvious feed links on another server
    selected_feeds = links.select {|link| Rfeedfinder.isFeedLink?(link) and Rfeedfinder.isFeed?(link, options)} if outfeeds.empty?
    outfeeds << selected_feeds unless selected_feeds.empty?
    # outfeeds.each{|link| puts "3 #{link}"}

    # look harder for feed links on another server
    selected_feeds = links.select {|link| Rfeedfinder.isXMLRelatedLink?(link) and Rfeedfinder.isFeed?(link, options)} if outfeeds.empty?
    outfeeds << selected_feeds unless selected_feeds.empty?
    # outfeeds.each{|link| puts "4 #{link}"}
  end
  
  if outfeeds.empty?
    # no A tags, guessing
    # filenames used by popular software:
    guesses = ['atom.xml', # blogger, TypePad
      'feed/', # wordpress
      'feeds/posts/default', # blogspot
      'feed/main/rss20', # fotolog
      'index.atom', # MT, apparently
      'index.rdf', # MT
      'rss.xml', # Dave Winer/Manila
      'index.xml', # MT
      'index.rss'] # Slash
      
    guesses.each { |guess|  
      uri = URI.join(fulluri, guess).to_s
      outfeeds << uri if Rfeedfinder.isFeed?(uri, options)
    }
  end
  
  # try with adding ending slash
  if outfeeds.empty? and fulluri !~ /\/$/
    outfeeds = Rfeedfinder.feeds(fulluri + "/", options)
  end
      
  # Verify redirection
  Rfeedfinder.verifyRedirect(outfeeds)
  
  # This has to be used until proper :only_first support has been built in
  outfeeds = outfeeds.first if options[:only_first] and outfeeds.size > 1
  
  if options[:keep_data]
    output = []
    outfeeds.each do |feed|
      output << {:url => feed, :data => options[:data][feed]}
    end
    return output
  else
    return outfeeds
  end
end
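
Two of the URL-rewriting steps in the listing above can be tried in isolation. The sketch below is a rough approximation, not the library's code: the helper names `youtube_feed_uri` and `guess_candidates` and the shortened guess list are assumptions, while the regular expressions, the YouTube RSS URL patterns, and the use of `URI.join` are taken from the listing.

```ruby
require "uri"

# Hypothetical helper: rewrites YouTube user/tag pages to their RSS URLs,
# using the same patterns as the listing.
def youtube_feed_uri(uri)
  case uri
  when /youtube\.com\/user\/(.*[^\/])/ then "http://www.youtube.com/rss/user/#{$1}/videos.rss"
  when /youtube\.com\/tag\/(.*[^\/])/  then "http://www.youtube.com/rss/tag/#{$1}/videos.rss"
  else uri
  end
end

# A shortened version of the guess list, for illustration only.
SAMPLE_GUESSES = ["atom.xml", "feed/", "index.rss"]

# Hypothetical helper: resolves each guessed filename against the base URI.
def guess_candidates(base, guesses = SAMPLE_GUESSES)
  guesses.map { |guess| URI.join(base, guess).to_s }
end

p youtube_feed_uri("http://www.youtube.com/user/foo/")
p guess_candidates("http://example.com/blog/")
p guess_candidates("http://example.com/blog")
```

`URI.join` follows relative-reference resolution, so a base without a trailing slash loses its last path segment: the guesses for `http://example.com/blog` land under `http://example.com/`, not under `/blog/`. That may be one reason the listing retries with a trailing slash when nothing is found.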

.isFeed?(uri, options) ⇒ Boolean

Takes:

  • uri (string)

Downloads the URI and checks the content with the isFeedData? class method

Returns:

  • true if the uri points to a feed

  • false if not


# File 'lib/rfeedfinder.rb', line 289

def self.isFeed?(uri, options)
  # We return false if the user only wants one result
  # and we have already found it, so that no
  # additional external calls are made
  return false if options[:only_first] and options[:already_found_one]
  
  uri.gsub!(/\/\/www\d\./, "//www.")
  begin
    protocol = URI.split(uri)
    # only the http and https schemes are accepted
    return false if !protocol[0].index(/^https?$/)
  rescue
    # URI error
    return false
  end
  
  data = Rfeedfinder.open_doc(uri, options)
  return false if data.nil?
  
  if Rfeedfinder.isFeedData?(data)
    options[:already_found_one] = true if options[:only_first]
    return true
  else
    return false
  end
end
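
The checks that run before any download (host normalization and an HTTP(S) scheme test) can be sketched on their own. This is an approximation under stated assumptions: `normalize_www` and `http_uri?` are hypothetical names, and the sketch uses `URI.parse` where the listing uses `URI.split`.

```ruby
require "uri"

# Hypothetical helper: maps hosts such as www2.example.com to www.example.com,
# using the same pattern as the listing. Unlike the listing's gsub!, this
# returns a new string and leaves the argument untouched.
def normalize_www(uri)
  uri.gsub(/\/\/www\d\./, "//www.")
end

# Hypothetical helper: true only for absolute http or https URIs.
def http_uri?(uri)
  scheme = URI.parse(uri).scheme
  !scheme.nil? && scheme.match?(/\Ahttps?\z/)
rescue URI::InvalidURIError
  # Strings that cannot be parsed as a URI are rejected.
  false
end

p normalize_www("http://www2.example.com/feed")
p http_uri?("https://example.com/feed")
p http_uri?("ftp://example.com/feed")
```

Anchoring the pattern at both ends matters here: a bracketed pattern such as `[http|https]` is a character class and would accept any scheme that merely starts with one of those letters.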

.isFeedData?(data) ⇒ Boolean

Takes:

  • data (string)

Returns:

  • true if the data has no html tag and has an rss, rdf or feed tag

  • false otherwise, in particular if the data has an html tag


# File 'lib/rfeedfinder.rb', line 272

def self.isFeedData?(data)
  # if no html tag and rss, rdf or feed tag, it's a feed
  # puts data
  return ((data/"html|HTML").empty? and (!(data/:rss).nil? or !(data/:rdf).nil? or !(data/:feed).nil?))
end
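
The rule in the comment above ("no html tag and rss, rdf or feed tag") can be approximated without a parser. The listing queries a parsed document; the sketch below swaps in regular expressions over raw text, so it is only a rough stand-in. The method name `looks_like_feed?` is an assumption.

```ruby
# Hypothetical stand-in for the feed test: works on raw text with regular
# expressions, not on a parsed document as the listing does.
def looks_like_feed?(text)
  has_html = text.match?(/<html[\s>]/i)
  has_feed_tag = text.match?(/<(rss|rdf:rdf|feed)[\s>]/i)
  # A feed has one of the feed root tags and no html tag.
  !has_html && has_feed_tag
end

p looks_like_feed?('<?xml version="1.0"?><rss version="2.0"><channel/></rss>')
p looks_like_feed?('<html><body><p>Not a feed</p></body></html>')
```

A regex check like this can be fooled by feed tags inside comments or CDATA sections, which is one reason to prefer a real parser in practice.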

Instance Method Details

#feed(uri) ⇒ Object

Takes:

  • uri (string)

Returns:

  • url (string), or nil if no feed is found



# File 'lib/rfeedfinder.rb', line 53

def feed(uri)
  Rfeedfinder.feed(uri, @options.dup)
end

#feeds(uri) ⇒ Object

Takes:

  • uri (string)

Returns:

  • array of URLs (or of hashes if the instance was created with :keep_data)



# File 'lib/rfeedfinder.rb', line 42

def feeds(uri)
  Rfeedfinder.feeds(uri, @options.dup)
end
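
Both instance methods pass `@options.dup`, not `@options` itself. The class methods write bookkeeping keys such as :only_first into the hash they receive, so a copy keeps the instance's stored options clean between calls. A minimal sketch of the effect, using a hypothetical stand-in method named `lookup`:

```ruby
# Hypothetical stand-in for a class method that, like Rfeedfinder.feed in the
# listing, writes a key into the options hash it is given.
def lookup(options = {})
  options[:only_first] = true
  options
end

stored = { :user_agent => "MyApp" }

lookup(stored.dup)
p stored.key?(:only_first)   # the copy absorbed the write

lookup(stored)
p stored.key?(:only_first)   # the stored hash itself was modified
```

Note that `dup` makes a shallow copy: top-level keys are protected, but any nested object stored as a value is still shared between the copy and the original.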