Class: CraigScrape::Listings
Overview
Listings represents a parsed Craigslist listing page and is generally returned by CraigScrape.scrape_listing.
Constant Summary
- LABEL =
/^(.+?)[ ]*[\-]?$/
- LOCATION =
/^[ ]*\((.*?)\)$/
- IMG_TYPE =
/^[ ]*(.+)[ ]*$/
- HEADER_DATE =
/^[ ]*(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)[ ]+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Nov|Dec)[ ]+([0-9]{1,2})[ ]*$/i
- SUMMARY_DATE =
/^[ ]([^ ]+)[ ]+([^ ]+)[ ]*[\-][ ]*$/
- NEXT_PAGE_LINK =
/^[ ]*(?:next [\d]+ postings|Next \>\>)[ ]*$/
- XPATH_POST_DATE =
"*[@class='itemdate']"
- XPATH_PAGENAV_LINKS =
"//*[@class='ban']//a"
Constants inherited from Scraper
Scraper::HTML_ENCODING, Scraper::HTML_TAG, Scraper::HTTP_HEADERS, Scraper::URL_PARTS
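As a rough illustration of what a few of the patterns listed above capture, the sketch below copies three of them verbatim and runs them against invented sample strings. The helper names (extract_label, extract_location, next_page_text?) are hypothetical and not part of the library.

```ruby
# Patterns copied verbatim from the constant summary above.
LABEL          = /^(.+?)[ ]*[\-]?$/
LOCATION       = /^[ ]*\((.*?)\)$/
NEXT_PAGE_LINK = /^[ ]*(?:next [\d]+ postings|Next \>\>)[ ]*$/

# Hypothetical helper: strips the trailing " -" separator from a title.
def extract_label(text)
  m = LABEL.match(text)
  m && m[1]
end

# Hypothetical helper: pulls the text out of a parenthesised location.
def extract_location(text)
  m = LOCATION.match(text)
  m && m[1]
end

# Hypothetical helper: true when the link text looks like a 'next page' link.
def next_page_text?(text)
  !NEXT_PAGE_LINK.match(text).nil?
end

# Sample strings are invented for illustration only.
puts extract_label('Oak dresser -')            # Oak dresser
puts extract_location(' (mission district)')   # mission district
puts next_page_text?('next 100 postings')      # true
puts next_page_text?('previous 100 postings')  # false
```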
Instance Attribute Summary
Attributes inherited from Scraper
Class Method Summary
-
.parse_summary(p_element, date = nil) ⇒ Object
Takes a paragraph element and returns a mostly-parsed Posting. This is kept separate from the rest of the parsing for readability and ease of testing.
Instance Method Summary
-
#next_page ⇒ Object
Returns a Listings object for the page at next_page_url, or nil if there is no next page.
-
#next_page_href ⇒ Object
String, URL Path href-fragment of the next page link.
-
#next_page_url ⇒ Object
String, Full URL Path of the ‘next page’ link.
-
#posts ⇒ Object
Array, Posting objects (one per post summary) found in the listing.
Methods inherited from Scraper
#downloaded?, #initialize, #uri
Constructor Details
This class inherits a constructor from CraigScrape::Scraper
Class Method Details
.parse_summary(p_element, date = nil) ⇒ Object
Takes a paragraph element and returns a mostly-parsed Posting. This is kept separate from the rest of the parsing for readability and ease of testing.
# File 'lib/listings.rb', line 119
def self.parse_summary(p_element, date = nil) #:nodoc:
  ret = {}

  title_anchor = nil
  section_anchor = nil

  # This loop got a little more complicated after Craigslist started inserting weird <span>s in
  # its list summary postings (see test_new_listing_span051710)
  p_element.search('a').each do |a_el|
    # We want the first a-tag that doesn't have spans in it to be the title anchor
    if title_anchor.nil?
      title_anchor = a_el if !a_el.at('span')
    # We want the next a-tag after the title_anchor to be the section anchor
    elsif section_anchor.nil?
      section_anchor = a_el
      # We have no need to traverse these further:
      break
    end
  end

  location_tag = p_element.at 'font'
  has_pic_tag = p_element.at 'span'

  href = nil

  location = he_decode p_element.at('font').inner_html if location_tag
  ret[:location] = $1 if location and LOCATION.match location

  ret[:img_types] = []
  if has_pic_tag
    img_type = he_decode has_pic_tag.inner_html
    img_type = $1.tr('^a-zA-Z0-9',' ') if IMG_TYPE.match img_type

    ret[:img_types] = img_type.split(' ').collect{|t| t.to_sym}
  end

  ret[:section] = he_decode(section_anchor.inner_html) if section_anchor

  ret[:post_date] = date
  if p_element.at_xpath(XPATH_POST_DATE)
    # Post 12/3
    if /\A([^ ]+) ([\d]+)\Z/.match p_element.at_xpath(XPATH_POST_DATE).content.strip
      ret[:post_date] = CraigScrape.most_recently_expired_time $1, $2.to_i
    end
  elsif SUMMARY_DATE.match he_decode(p_element.children[0])
    # Old style
    ret[:post_date] = CraigScrape.most_recently_expired_time $1, $2.to_i
  end

  if title_anchor
    label = he_decode title_anchor.inner_html
    ret[:label] = $1 if LABEL.match label

    ret[:href] = title_anchor[:href]
  end

  ret
end
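Two of the smaller steps in parse_summary can be tried without a parsed page. The sketch below is only an approximation: it works on plain, already-decoded strings rather than Nokogiri elements, the helper names (img_types_from, month_and_day) and the POST_DATE_TEXT constant name are hypothetical, and the sample inputs are invented.

```ruby
# Copied from the constant summary.
IMG_TYPE = /^[ ]*(.+)[ ]*$/
# Same inline pattern parse_summary applies to the 'itemdate' text;
# the constant name is made up for this sketch.
POST_DATE_TEXT = /\A([^ ]+) ([\d]+)\Z/

# Mirrors the img_types branch: non-alphanumerics become spaces,
# then each remaining word becomes a symbol.
def img_types_from(text)
  text = $1.tr('^a-zA-Z0-9', ' ') if IMG_TYPE.match(text)
  text.split(' ').collect { |t| t.to_sym }
end

# Mirrors the 'Post 12/3' date branch, but returns the raw month and day
# instead of calling CraigScrape.most_recently_expired_time.
def month_and_day(text)
  m = POST_DATE_TEXT.match(text.strip)
  m && [m[1], m[2].to_i]
end

p img_types_from(' pic img')   # [:pic, :img]
p month_and_day(' Jan 5 ')     # ["Jan", 5]
```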
Instance Method Details
#next_page ⇒ Object
Returns a Listings object for the page at next_page_url, or nil if there is no next page.
# File 'lib/listings.rb', line 113
def next_page
  CraigScrape::Listings.new URI.encode(next_page_url) if next_page_url
end
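Because next_page returns nil once there is no further page, a caller can walk a listing page by page. The following is a minimal sketch of that loop, assuming only the posts and next_page methods documented here. FakeListings is a hypothetical stand-in so the example runs without network access, and max_pages is a safety cap added for the sketch, not a library option.

```ruby
# Hypothetical stand-in for CraigScrape::Listings; it mimics only the
# two methods the loop relies on.
FakeListings = Struct.new(:posts, :next_page)

# Collects posts from a listing and every page reachable via next_page.
def collect_posts(listing, max_pages = 10)
  collected = []
  pages = 0
  while listing && pages < max_pages
    collected.concat(listing.posts)
    listing = listing.next_page   # nil on the last page ends the loop
    pages += 1
  end
  collected
end

last_page  = FakeListings.new(['post c'], nil)
first_page = FakeListings.new(['post a', 'post b'], last_page)
p collect_posts(first_page)   # ["post a", "post b", "post c"]
```

With the real class, each call to next_page would presumably trigger a download of the following page, so a cap of this kind may be worth keeping.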
#next_page_href ⇒ Object
String, URL Path href-fragment of the next page link
# File 'lib/listings.rb', line 69
def next_page_href
  unless @next_page_href

    if html.at_xpath(XPATH_PAGENAV_LINKS)
      # Post 12/3
      next_link = html.xpath(XPATH_PAGENAV_LINKS).find{|link| NEXT_PAGE_LINK.match link.content}
      @next_page_href = next_link[:href] if next_link
    else
      # Old style
      cursor = html.at 'p:last-of-type'

      cursor = cursor.at 'a' if cursor

      # Category Listings have their 'next 100 postings' link at the end of the doc in a p tag
      next_link = cursor if cursor and NEXT_PAGE_LINK.match cursor.inner_html

      # Search listings put their next page in a link towards the top
      next_link = (html / 'a').find{ |a| he_decode(a.inner_html) == '<b>Next>></b>' } unless next_link

      # Some search pages have a bug, whereby a 'next page' link isn't displayed,
      # even though we can see that there's another page listed in the page-number links block at the top
      # and bottom of the listing page
      unless next_link
        cursor = html % 'div.sh:first-of-type > b:last-of-type'

        # If there's no 'a' in the next sibling, we'll have just performed a nil assignment, otherwise
        # we're looking good.
        next_link = cursor.next_element if cursor and /^[\d]+$/.match cursor.inner_html
      end

      # We have an anchor tag - so - let's assign the href:
      @next_page_href = next_link[:href] if next_link
    end
  end

  @next_page_href
end
#next_page_url ⇒ Object
String, Full URL Path of the ‘next page’ link
# File 'lib/listings.rb', line 108
def next_page_url
  (next_page_href) ? url_from_href(next_page_href) : nil
end
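The url_from_href helper is inherited from Scraper and its source is not shown on this page. As an assumption only, the sketch below shows one plausible way an href fragment could be resolved against the page URL using Ruby's standard URI.join. The resolve_href name and the sample URLs are hypothetical, and the library's own helper may behave differently.

```ruby
require 'uri'

# Hypothetical stand-in for the inherited url_from_href; this is an
# assumption about its behaviour, not the library's implementation.
def resolve_href(page_url, href)
  return nil unless href   # same nil-passing behaviour as next_page_url
  URI.join(page_url, href).to_s
end

# A relative fragment resolves against the listing's directory:
puts resolve_href('http://sfbay.craigslist.org/sss/', 'index100.html')
# http://sfbay.craigslist.org/sss/index100.html

# An absolute path replaces the path of the page URL:
puts resolve_href('http://sfbay.craigslist.org/sss/', '/search/sss?s=100')
# http://sfbay.craigslist.org/search/sss?s=100
```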
#posts ⇒ Object
Array, Posting objects (one per post summary) found in the listing.
# File 'lib/listings.rb', line 22
def posts
  unless @posts
    current_date = nil
    @posts = []

    # All we care about are p and h4 tags. This seemed to be the only way I could do this on Nokogiri:
    post_tags = html.search('*').reject{|n| !/^(?:p|h4)$/i.match n.name }

    # The last p in the list is sometimes a 'next XXX pages' link. We don't want to include this in our PostSummary output:
    post_tags.pop if (
      post_tags.length > 0 and
      post_tags.last.at('a') and
      NEXT_PAGE_LINK.match post_tags.last.at('a').inner_html
    )

    # Now we iterate through the listings:
    post_tags.each do |el|
      case el.name
        when 'p'
          post_summary = self.class.parse_summary el, current_date

          # Validate that required fields are present:
          parse_error! unless [post_summary[:label],post_summary[:href]].all?{|f| f and f.length > 0}

          post_summary[:url] = url_from_href post_summary[:href]

          @posts << CraigScrape::Posting.new(post_summary)
        when 'h4'
          # Let's make sense of the h4 tag, and then read all the p tags below it
          if HEADER_DATE.match he_decode(el.inner_html)
            # Generally, the H4 tags contain valid dates. When they do - this is easy:
            current_date = CraigScrape.most_recently_expired_time $1, $2
          elsif html.at('h4:last-of-type') == el
            # There's a specific bug in craigslist, where these nonsense h4's just appear without anything relevant inside them.
            # They're safe to ignore if they're not the last h4 on the page. If they're the last h4 on the page,
            # we need to pull up the full post in order to accurately tell the date.
            # Setting this to nil will achieve the eager-load.
            current_date = nil
          end
      end
    end
  end

  @posts
end
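The date handling in posts amounts to carrying the most recent h4 date forward onto each following p element. The sketch below reproduces just that logic on a simplified input: an array of [tag_name, text] pairs stands in for the parsed nodes, dates are kept as raw [month, day] pairs rather than passed to CraigScrape.most_recently_expired_time, and the assign_dates name is hypothetical.

```ruby
# Copied verbatim from the constant summary.
HEADER_DATE = /^[ ]*(?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)[ ]+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Nov|Dec)[ ]+([0-9]{1,2})[ ]*$/i

# elements: array of [tag_name, text] pairs (a stand-in for Nokogiri nodes).
# Returns [text, date] pairs for every 'p' entry.
def assign_dates(elements)
  last_h4 = elements.rindex { |name, _| name == 'h4' }
  current_date = nil
  rows = []
  elements.each_with_index do |(name, text), i|
    case name
    when 'h4'
      if (m = HEADER_DATE.match(text))
        current_date = [m[1], m[2].to_i]
      elsif i == last_h4
        # An unparseable final header clears the date, as in posts.
        current_date = nil
      end
    when 'p'
      rows << [text, current_date]
    end
  end
  rows
end

page = [
  ['h4', 'Sat Nov 12'], ['p', 'Oak dresser'],
  ['h4', 'garbage'],    ['p', 'Bike'],
  ['h4', 'Fri Nov 11'], ['p', 'Couch']
]
p assign_dates(page)
# [["Oak dresser", ["Nov", 12]], ["Bike", ["Nov", 12]], ["Couch", ["Nov", 11]]]
```

Note that the month alternation in HEADER_DATE, as printed in the constant summary, has no Oct entry, so a header such as 'Sat Oct 12' would not match and would be handled by the second branch.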