Class: CraigScrape::GeoListings

Inherits: Scraper < Object
Defined in: lib/geo_listings.rb

Overview

GeoListings represents a parsed Craigslist geo listing page (e.g. http://geo.craigslist.org/iso/us). These pages list all the craigslist sites in a given region.

Defined Under Namespace

Classes: BadGeoListingPath

Constant Summary

GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}
LOCATION_NAME       = /[ ]*\>[ ](.+)[ ]*/
PATH_SCANNER        = /(?:\\\/|[^\/])+/
URL_HOST_PART       = /^[^\:]+\:\/\/([^\/]+)[\/]?$/
SITE_PREFIX         = /^([^\.]+)/
FIND_SITES_PARTS    = /^[ ]*([\+|\-]?)[ ]*(.+)[ ]*/
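To make the roles of these patterns concrete, here is a minimal sketch of how URL_HOST_PART and SITE_PREFIX fit together (the sample URL is hypothetical, not fetched from craigslist):

```ruby
URL_HOST_PART = /^[^\:]+\:\/\/([^\/]+)[\/]?$/
SITE_PREFIX   = /^([^\.]+)/

# URL_HOST_PART pulls the host out of a site's href:
URL_HOST_PART.match 'http://miami.craigslist.org/'
host = $1    # => "miami.craigslist.org"

# SITE_PREFIX takes the first dns component of that host:
SITE_PREFIX.match host
prefix = $1  # => "miami"
```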

Constants inherited from Scraper

Scraper::HTML_ENCODING, Scraper::HTML_TAG, Scraper::HTTP_HEADERS, Scraper::URL_PARTS

Instance Attribute Summary

Attributes inherited from Scraper

#url

Class Method Summary

Instance Method Summary

Methods inherited from Scraper

#attributes, #downloaded?, #uri

Constructor Details

#initialize(init_via = nil) ⇒ GeoListings

The geolisting constructor works like all other Scraper objects, in that it accepts a string 'url'. See CraigScrape.find_sites for a more powerful way to find craigslist sites.



# File 'lib/geo_listings.rb', line 28

def initialize(init_via = nil)
  super(init_via)

  # Validate that required fields are present, at least - if we've downloaded it from a url
  parse_error! unless location
end

Class Method Details

.find_sites(specs, base_url = GEOLISTING_BASE_URL) ⇒ Object

find_sites takes a single array of strings as an argument. Each string is either a location path (see sites_in_path) or a full site in canonical form (e.g. "memphis.craigslist.org"). Each string may optionally carry a '+' or '-' prefix indicating whether its sites should be added to the result list or removed from it; if neither is specified, '+' is assumed. Strings are processed from left to right, which gives a high degree of control over the selection set. Examples:

  • find_sites "us/fl", "- miami.craigslist.org"

  • find_sites "us", "- us/nm"

  • find_sites "us", "- us/ny", "+ newyork.craigslist.org"

  • find_sites "us/ny", "us/id", "caribbean.craigslist.org"

There's a lot of flexibility here; you get the idea.



# File 'lib/geo_listings.rb', line 123

def self.find_sites(specs, base_url = GEOLISTING_BASE_URL)
  ret = []
  
  specs.each do |spec|
    (op,spec = $1,$2) if FIND_SITES_PARTS.match spec

    spec = (spec.include? '.')  ? [spec] : sites_in_path(spec, base_url) 

    (op == '-') ? ret -= spec : ret |= spec
  end
  
  ret
end
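The left-to-right add/remove merge described above can be sketched in isolation. In this simplified stand-in, the SAMPLE_PATHS hash and its hosts are hypothetical fixtures replacing the network lookup that sites_in_path performs:

```ruby
FIND_SITES_PARTS = /^[ ]*([\+|\-]?)[ ]*(.+)[ ]*/

# Hypothetical stand-in for sites_in_path's network lookup:
SAMPLE_PATHS = { 'us/fl' => ['miami.craigslist.org', 'orlando.craigslist.org'] }

def merge_specs(specs)
  ret = []
  specs.each do |spec|
    op, spec = $1, $2 if FIND_SITES_PARTS.match spec

    # A spec containing a '.' is a full site; otherwise expand the path:
    sites = spec.include?('.') ? [spec] : SAMPLE_PATHS.fetch(spec, [])

    # '-' removes from the running set; anything else unions into it:
    if op == '-' then ret -= sites else ret |= sites end
  end
  ret
end

merge_specs ['us/fl', '- miami.craigslist.org']
# => ["orlando.craigslist.org"]
```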

.sites_in_path(full_path, base_url = GEOLISTING_BASE_URL) ⇒ Object

This method will return an array of all possible sites that match the specified location path. Sample location paths:

  • us/ca

  • us/fl/miami

  • jp/fukuoka

  • mx

Here's how location paths work:

  • The components of the path are separated by '/' characters.

  • Up to (and optionally, not including) the last component, the path should correspond to a valid GeoLocation url under the prefix 'geo.craigslist.org/iso/'.

  • The last component can either be a site's 'prefix' on a GeoLocation page, or a GeoLocation page itself, in which case all the sites on that page are selected.

  • A site's prefix is the first dns record of a website listed on a GeoLocation page. (So, in the case of us/fl/miami, the final 'miami' corresponds to the 'south florida' link on http://geo.craigslist.org/iso/us/fl.)



# File 'lib/geo_listings.rb', line 70

def self.sites_in_path(full_path, base_url = GEOLISTING_BASE_URL)
  # the base_url parameter is mostly so we can test this method
  
  # Unfortunately - the easiest way to understand much of this is to see how craigslist returns 
  # these geolocations. Watch what happens when you request us/fl/non-existent/page/here.
  # I also made this a little forgiving, in a couple of ways not officially specified, per
  # the rules above.
  full_path_parts = full_path.scan PATH_SCANNER

  # We'll either find a single site in this loop and return that, or we'll find a whole listing
  # and set the geo_listing object to reflect that
  geo_listing = nil
  full_path_parts.each_with_index do |part, i|

    # Let's un-escape the path-part, if needed:
    part.gsub! "\\/", "/"        

    # If they're specifying a single site, this will catch and return it immediately
    site = geo_listing.sites.find{ |n,s| 
      (SITE_PREFIX.match s and $1 == part) or n == part
    } if geo_listing

    # This returns the site component of the found array
    return [site.last] if site 

    begin
      # The URI escape is mostly needed to translate the space characters
      l = GeoListings.new base_url+full_path_parts[0...i+1].collect{|p| URI.escape p}.join('/')
    rescue CraigScrape::Scraper::FetchError
      bad_geo_path! full_path
    end

    # This probably tells us the first part of the path was 'correct', but not the rest:
    bad_geo_path! full_path if geo_listing and geo_listing.location == l.location

    geo_listing = l
  end

  # We have a valid listing page we found, and we can just return all the sites on it:
  geo_listing.sites.collect{|n,s| s }
end
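The path tokenization these rules rely on can be seen by running PATH_SCANNER directly; a '\/' sequence escapes a literal slash inside a single component (the sample path is hypothetical):

```ruby
PATH_SCANNER = /(?:\\\/|[^\/])+/

# '\/' keeps an escaped slash inside one component:
parts = 'us/fl/south\/florida'.scan(PATH_SCANNER)
# => ["us", "fl", "south\\/florida"]

# sites_in_path then un-escapes each part before matching:
parts.each { |p| p.gsub! "\\/", "/" }
# => ["us", "fl", "south/florida"]
```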

Instance Method Details

#locationObject

Returns the GeoLocation's full name.



# File 'lib/geo_listings.rb', line 36

def location
  unless @location
    cursor = html % 'h3 > b > a:first-of-type'
    cursor = cursor.next if cursor       
    @location = $1 if cursor and LOCATION_NAME.match he_decode(cursor.to_s)
  end
  
  @location
end
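LOCATION_NAME expects the text node following the matched anchor to look like ' > {location name}' (the sample string below is hypothetical):

```ruby
LOCATION_NAME = /[ ]*\>[ ](.+)[ ]*/

LOCATION_NAME.match ' > united states'
location = $1  # => "united states"
```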

#sitesObject

Returns a hash mapping site names to their urls in the current listing.



# File 'lib/geo_listings.rb', line 47

def sites
  unless @sites
    @sites = {}
    (html / 'div#list > a').each do |el_a|
      site_name = he_decode strip_html(el_a.inner_html)
      @sites[site_name] = $1 if URL_HOST_PART.match el_a[:href]
    end
  end
  
  @sites
end