Class: CraigScrape::Posting
Overview
Posting represents a fully downloaded and parsed Craigslist post. This class is generally returned by the listing scrape methods, and contains the post summaries for a specific search URL or a general listing category.
Constant Summary
- POST_DATE =
/Date:[^\d]*((?:[\d]{2}|[\d]{4})\-[\d]{1,2}\-[\d]{1,2}[^\d]+[\d]{1,2}\:[\d]{1,2}[ ]*[AP]M[^a-z]+[a-z]+)/i
- LOCATION =
/Location\:[ ]+(.+)/
- HEADER_LOCATION =
/\((.+)\)$/
- POSTING_ID =
/Posting[ ]?ID\:[ ]*([\d]+)/
- REPLY_TO =
/(.+)/
- PRICE =
/((?:^\$[\d]+(?:\.[\d]{2})?)|(?:\$[\d]+(?:\.[\d]{2})?$))/
- USERBODY_PARTS =
NOTE: the (?:) group is used to check the ‘old’ style format first, and then the ‘new’ style (as of the 12/03 parse changes).
/^(.+)\<div id\=\"userbody\">(.+)\<br[ ]*[\/]?\>\<br[ ]*[\/]?\>(.+)\<\/div\>(.+)$/m
- HTML_HEADER =
/^(.+)\<div id\=\"userbody\">/m
- IMAGE_SRC =
/\<im[a]?g[e]?[^\>]*src=(?:\'([^\']+)\'|\"([^\"]+)\"|([^ ]+))[^\>]*\>/
- REQUIRED_FIELDS =
This is used to determine if there’s a parse error
%w(contents posting_id post_time header title full_section)
- XPATH_USERBODY =
"//*[@id='userbody']"
- XPATH_POSTINGBODY =
"//*[@id='postingbody']"
- XPATH_BLURBS =
"//ul[@class='blurbs']"
- XPATH_PICS =
[
  "//*[@class='tn']/a/@href",
  # For some posts (the newest ones on 01/20/12) we find the images:
  "//*[@id='thumbs']/a/@href"
].join('|')
- XPATH_REPLY_TO =
[
  "//*[@class='dateReplyBar']/small/a",
  # For some posts (the newest ones on 01/20/12) we find the reply to this way:
  "//*[@class='dateReplyBar']/*[@id='replytext']/following-sibling::a"
].join('|')
- XPATH_POSTINGBLOCK =
"//*[@class='postingidtext' or @class='postinginfos']"
- XPATH_POSTED_DATE =
"//*[@class='postinginfos']/*[@class='postinginfo']/date"
Constants inherited from Scraper
Scraper::HTML_ENCODING, Scraper::HTML_TAG, Scraper::HTTP_HEADERS, Scraper::URL_PARTS
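To make a few of the simpler patterns above concrete, here is a minimal sketch that applies LOCATION, HEADER_LOCATION and POSTING_ID to sample strings. The patterns are copied from the constant list; the sample strings are hypothetical and are not real Craigslist markup.

```ruby
# Patterns copied from the Constant Summary.
LOCATION        = /Location\:[ ]+(.+)/
HEADER_LOCATION = /\((.+)\)$/
POSTING_ID      = /Posting[ ]?ID\:[ ]*([\d]+)/

# Hypothetical sample inputs, invented for illustration only.
blurb  = 'Location: Coral Gables'
header = 'Trek road bike - $250 (downtown)'
footer = 'PostingID: 2812345678'

# Each pattern exposes the value of interest as its first capture group.
location        = LOCATION.match(blurb)[1]
header_location = HEADER_LOCATION.match(header)[1]
posting_id      = POSTING_ID.match(footer)[1].to_i

puts location
puts header_location
puts posting_id
```

In each case the first capture group carries the parsed value, which is why the method listings below read `$1` after a successful match.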
Instance Attribute Summary
- #href ⇒ Object
readonly
This is really just for testing; in production use, uri.path is a better solution.
Attributes inherited from Scraper
Instance Method Summary
- #contents ⇒ Object
String, the full-HTML contents of the post.
- #contents_as_plain ⇒ Object
Returns the post contents with all HTML tags removed.
- #deleted_by_author? ⇒ Boolean
Returns true if this Post was parsed, and represents a ‘This posting has been deleted by its author.’ notice.
- #flagged_for_removal? ⇒ Boolean
Returns true if this Post was parsed, and is merely a ‘Flagged for Removal’ page.
- #full_section ⇒ Object
Array, hierarchical representation of the post’s section.
- #has_img? ⇒ Boolean
true if post summary has ‘img(s)’.
- #has_pic? ⇒ Boolean
true if post summary has ‘pic(s)’.
- #has_pic_or_img? ⇒ Boolean
true if post summary has either the img or pic label. This can always be pulled from the listing post-summary, and should never cause an additional page load.
- #header ⇒ Object
String, the contents of the item’s HTML body heading.
- #header_as_plain ⇒ Object
Returns the header with all HTML tags removed.
- #images ⇒ Object
Array, URLs of the post’s images that are not hosted on craigslist.
- #img_types ⇒ Object
Array, which image types are listed for the post.
- #initialize(*args) ⇒ Posting
constructor
Create a new Post via a URL (String), or supplied parameters (Hash).
- #is_active_post? ⇒ Boolean
This is mostly used to determine if the post should be checked for parse errors.
- #label ⇒ Object
Returns the post label.
- #location ⇒ Object
String, the location of the item, as best could be parsed.
- #pics ⇒ Object
Array, URLs of the post’s craigslist-hosted images.
- #post_date ⇒ Object
Reflects only the date portion of the posting.
- #post_time ⇒ Object
Time, reflects the full timestamp of the posting.
- #posting_has_expired? ⇒ Boolean
Returns true if this Post was parsed, and represents a ‘This posting has expired.’ notice.
- #posting_id ⇒ Object
Integer, Craigslist’s unique posting id.
- #price ⇒ Object
Returns the best guess of a price, judging by the label’s contents.
- #reply_to ⇒ Object
String, represents the post’s reply-to address, if listed.
- #section ⇒ Object
Retrieves the most relevant craigslist ‘section’ of the post.
- #system_post? ⇒ Boolean
Some posts (deleted_by_author, flagged_for_removal) are common template posts that craigslist puts up in lieu of an original. This returns true or false depending on whether that case applies.
- #title ⇒ Object
String, the item’s title.
Methods inherited from Scraper
#attributes, #downloaded?, #uri
Constructor Details
#initialize(*args) ⇒ Posting
Create a new Post via a url (String), or supplied parameters (Hash)
# File 'lib/posting.rb', line 50
def initialize(*args)
  super(*args)

  # Validate that required fields are present, at least - if we've downloaded it from a url
  if args.first.kind_of? String and is_active_post?
    unparsed_fields = REQUIRED_FIELDS.find_all{|f|
      val = send(f)
      val.nil? or (val.respond_to? :length and val.length == 0)
    }
    parse_error! unparsed_fields unless unparsed_fields.empty?
  end
end
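The required-field check in the constructor can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `SketchPost` struct and `unparsed_fields` helper in place of the real class (which derives these values from downloaded HTML); only the REQUIRED_FIELDS list and the nil-or-empty test are taken from the listing above.

```ruby
# Field list copied from the REQUIRED_FIELDS constant.
REQUIRED_FIELDS = %w(contents posting_id post_time header title full_section)

# Hypothetical stand-in for a parsed post; not part of the library.
SketchPost = Struct.new(*REQUIRED_FIELDS.map(&:to_sym), keyword_init: true)

# Returns the names of required fields that are nil or empty,
# mirroring the find_all block in the constructor.
def unparsed_fields(post)
  REQUIRED_FIELDS.find_all do |f|
    val = post.send(f)
    val.nil? || (val.respond_to?(:length) && val.length == 0)
  end
end

post = SketchPost.new(
  contents: '<p>Bike</p>', posting_id: 123, post_time: Time.now,
  header: 'Bike (downtown)', title: '', full_section: []
)

# The empty title and the empty section array both count as unparsed.
missing = unparsed_fields(post)
puts missing.inspect
```

In the real constructor a non-empty result is passed to `parse_error!`, and only for posts that were built from a URL string and are still active.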
Instance Attribute Details
#href ⇒ Object (readonly)
This is really just for testing; in production use, uri.path is a better solution.
# File 'lib/posting.rb', line 47
def href
  @href
end
Instance Method Details
#contents ⇒ Object
String, The full-html contents of the post
# File 'lib/posting.rb', line 155
def contents
  unless @contents
    @contents = if html.at_xpath(XPATH_POSTINGBODY)
      # For some posts (the newest ones on 01/20/12) craigslist made this really
      # easy for us.
      html.at_xpath(XPATH_POSTINGBODY).children.to_s
    elsif html_source
      # Otherwise we have to parse this in a convoluted way from the userbody
      # section:
      user_body
    end
    # This helps clean up the whitespace around the sides, in case we got any:
    @contents = he_decode(@contents).strip if @contents
  end

  @contents
end
#contents_as_plain ⇒ Object
Returns the post contents with all html tags removed
# File 'lib/posting.rb', line 345
def contents_as_plain
  strip_html contents
end
#deleted_by_author? ⇒ Boolean
Returns true if this Post was parsed, and represents a ‘This posting has been deleted by its author.’ notice
# File 'lib/posting.rb', line 259
def deleted_by_author?
  @deleted_by_author = (
    system_post? and header_as_plain == "This posting has been deleted by its author."
  ) if @deleted_by_author.nil?

  @deleted_by_author
end
#flagged_for_removal? ⇒ Boolean
Returns true if this Post was parsed, and is merely a ‘Flagged for Removal’ page.
# File 'lib/posting.rb', line 250
def flagged_for_removal?
  @flagged_for_removal = (
    system_post? and header_as_plain == "This posting has been flagged for removal"
  ) if @flagged_for_removal.nil?

  @flagged_for_removal
end
#full_section ⇒ Object
Array, hierarchical representation of the post’s section.
# File 'lib/posting.rb', line 87
def full_section
  unless @full_section
    @full_section = []

    (html_head / "*[@class='bchead']//a").each do |a|
      @full_section << he_decode(a.inner_html) unless a['id'] and a['id'] == 'ef'
    end if html_head

    # For some posts (the newest ones on 01/20/12) craigslist is pre-pending
    # a silly "CL" to the section. Let's strip that:
    @full_section.delete_at(0) if @full_section[0] == 'CL'
  end

  @full_section
end
#has_img? ⇒ Boolean
true if post summary has ‘img(s)’. ‘imgs’ are different from pics, in that the resource is not hosted on craigslist’s servers. This can always be pulled from the listing post-summary, and should never cause an additional page load.
# File 'lib/posting.rb', line 318
def has_img?
  img_types.include? :img
end
#has_pic? ⇒ Boolean
true if post summary has ‘pic(s)’. ‘pics’ are different from imgs, in that craigslist is hosting the resource on its own servers. This can always be pulled from the listing post-summary, and should never cause an additional page load.
# File 'lib/posting.rb', line 324
def has_pic?
  img_types.include? :pic
end
#has_pic_or_img? ⇒ Boolean
true if post summary has either the img or pic label. This can always be pulled from the listing post-summary, and should never cause an additional page load.
# File 'lib/posting.rb', line 330
def has_pic_or_img?
  img_types.length > 0
end
#header ⇒ Object
String, The contents of the item’s html body heading
# File 'lib/posting.rb', line 66
def header
  unless @header
    h2 = html_head.at 'h2' if html_head
    @header = he_decode h2.inner_html if h2
  end

  @header
end
#header_as_plain ⇒ Object
Returns the header with all html tags removed. Granted, the header should usually be plain, but in the case of a ‘system_post’ we may get tags in here
# File 'lib/posting.rb', line 351
def header_as_plain
  strip_html header
end
#images ⇒ Object
Array, urls of the post’s images that are not hosted on craigslist
# File 'lib/posting.rb', line 217
def images
  # Keep in mind that when users post html to craigslist, they're often not posting wonderful html...
  @images = (
    contents ?
      contents.scan(IMAGE_SRC).collect{ |a| a.find{|b| !b.nil? } } :
      []
  ) unless @images

  @images
end
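The scan-and-collect step in `images` is easier to follow on a concrete string. This is a minimal sketch, assuming a hypothetical `image_urls` helper and an invented HTML fragment; the IMAGE_SRC pattern is copied from the constant list.

```ruby
# Pattern copied from the IMAGE_SRC constant. It has three capture groups:
# single-quoted, double-quoted and unquoted src values.
IMAGE_SRC = /\<im[a]?g[e]?[^\>]*src=(?:\'([^\']+)\'|\"([^\"]+)\"|([^ ]+))[^\>]*\>/

# Hypothetical helper mirroring the body of #images.
def image_urls(contents)
  return [] unless contents
  # String#scan yields one array of captures per tag; exactly one capture
  # is non-nil, so take the first non-nil entry.
  contents.scan(IMAGE_SRC).collect { |captures| captures.find { |c| !c.nil? } }
end

# Invented sample markup, not taken from a real post.
html = %q{<p>Bike</p><img src="http://example.com/a.jpg"><img src='b.png' alt="x">}

urls = image_urls(html)
puts urls.inspect
```

Because each alternative in the pattern has its own capture group, every match comes back as a three-element array with two nils, which is why the original code needs the inner `find`.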
#img_types ⇒ Object
Array, which image types are listed for the post. This is always able to be pulled from the listing post-summary, and should never cause an additional page load
# File 'lib/posting.rb', line 301
def img_types
  @img_types || [ (images.length > 0) ? :img : nil, (pics.length > 0) ? :pic : nil ].compact
end
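As a small illustration of how the fallback branch of `img_types` builds its result, the sketch below takes the two URL arrays as explicit arguments. The free-standing `img_types_for` function is an assumption made for the example; in the class these values come from `images` and `pics`.

```ruby
# Hypothetical free-standing version of the fallback in #img_types.
def img_types_for(images, pics)
  [
    (images.length > 0) ? :img : nil, # externally hosted images present?
    (pics.length > 0) ? :pic : nil    # craigslist-hosted pictures present?
  ].compact                           # drop the nil placeholders
end

both = img_types_for(['http://example.com/a.jpg'], ['http://images.example.org/1.jpg'])
none = img_types_for([], [])
puts both.inspect
puts none.inspect
```

`has_img?`, `has_pic?` and `has_pic_or_img?` are then simple membership or length checks on this array.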
#is_active_post? ⇒ Boolean
This is mostly used to determine if the post should be checked for parse errors. Might be useful for someone else though
# File 'lib/posting.rb', line 363
def is_active_post?
  [flagged_for_removal?, posting_has_expired?, ].none?
end
#label ⇒ Object
Returns the post label. The label would appear at first glance to be identical to the header, but it’s not. The label is cited on the listings pages, and generally includes everything in the header with the exception of the location. Sometimes the header includes additional information (e.g. ‘(map)’ on rea listings) that isn’t listed in the label. This is also used as a bandwidth shortcut for the craigwatch program, and is a guaranteed identifier for the post that won’t result in a full page load from the post’s URL.
# File 'lib/posting.rb', line 289
def label
  unless @label or system_post?
    @label = header

    @label = $1 if location and /(.+?)[ ]*\(#{location}\).*?$/.match @label
  end

  @label
end
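The relationship between header, location and label can be shown with a short sketch. The `label_from` helper and the sample header are hypothetical. One deliberate difference from the listing above: the sketch passes the location through `Regexp.escape` before interpolating it, whereas the original interpolates it as-is.

```ruby
# Hypothetical helper: derive a label by cutting the trailing
# "(location)" suffix, and anything after it, off the header.
def label_from(header, location)
  return header unless location
  # Regexp.escape is an addition in this sketch; it keeps characters such
  # as "." or "+" in a location from being read as regex syntax.
  match = /(.+?)[ ]*\(#{Regexp.escape(location)}\).*?$/.match(header)
  match ? match[1] : header
end

label = label_from('Trek road bike - $250 (downtown)', 'downtown')
puts label
```

When the location is unknown, or does not appear in parentheses in the header, the header is returned unchanged, which matches the fallback behaviour of the original method.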
#location ⇒ Object
String, the location of the item, as best could be parsed
# File 'lib/posting.rb', line 175
def location
  if @location.nil? and html

    if html.at_xpath(XPATH_BLURBS)
      # This is the post-12/3/12 style:

      # Sometimes the Location is in the body :
      @location = $1 if html.xpath(XPATH_BLURBS).first.children.any?{|c| LOCATION.match c.content}

    elsif craigslist_body
      # Location (when explicitly defined):
      cursor = craigslist_body.at 'ul' unless @location

      # This is the legacy style:
      # Note: Apa section includes other things in the li's (cats/dogs ok fields)
      cursor.children.each do |li|
        if LOCATION.match li.inner_html
          @location = he_decode($1) and break
          break
        end
      end if cursor

      # Real estate listings can work a little different for location:
      unless @location
        cursor = craigslist_body.at 'small'
        cursor = cursor.previous until cursor.nil? or cursor.text?

        @location = he_decode(cursor.to_s.strip) if cursor
      end

    end

    # So, *sometimes* the location just ends up being in the header, I don't know why.
    # This happens on old-style and new-style posts:
    @location = $1 if @location.nil? and HEADER_LOCATION.match header
  end

  @location
end
#pics ⇒ Object
Array, urls of the post’s craigslist-hosted images
# File 'lib/posting.rb', line 229
def pics
  unless @pics
    @pics = []

    if html
      if html.at_xpath(XPATH_PICS)
        @pics = html.xpath(XPATH_PICS).collect(&:value)
      elsif craigslist_body
        # This is the pre-12/3/12 style:
        # Now let's find the craigslist hosted images:
        img_table = (craigslist_body / 'table').find{|e| e.name == 'table' and e[:summary] == 'craigslist hosted images'}

        @pics = (img_table / 'img').collect{|i| i[:src]} if img_table
      end
    end
  end

  @pics
end
#post_date ⇒ Object
Reflects only the date portion of the posting. Does not include hours/minutes. This is useful when reflecting the listing scrapes, and can be safely used if you wish to conserve bandwidth by not pulling an entire post from a listing scrape.
# File 'lib/posting.rb', line 278
def post_date
  @post_date = post_time.to_date unless @post_date or post_time.nil?

  @post_date
end
#post_time ⇒ Object
Time, reflects the full timestamp of the posting
# File 'lib/posting.rb', line 119
def post_time
  unless @post_time
    if html.at_xpath(XPATH_POSTED_DATE)
      # For some posts (the newest ones on 01/20/12) craigslist made this really
      # easy for us.
      @post_time = DateTime.parse(html.at_xpath(XPATH_POSTED_DATE))
    else
      # The bulk of the post time/dates are parsed via a simple regex:
      cursor = html_head.at 'hr' if html_head
      cursor = cursor.next until cursor.nil? or POST_DATE.match cursor.to_s
      @post_time = DateTime.parse($1) if $1
    end
  end

  @post_time
end
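The regex branch of `post_time` can be exercised on its own. The sketch below assumes a hypothetical `parse_post_time` helper and an invented date line in the older page style; the POST_DATE pattern is copied from the constant list, and `DateTime.parse` comes from Ruby's standard `date` library.

```ruby
require 'date'

# Pattern copied from the POST_DATE constant.
POST_DATE = /Date:[^\d]*((?:[\d]{2}|[\d]{4})\-[\d]{1,2}\-[\d]{1,2}[^\d]+[\d]{1,2}\:[\d]{1,2}[ ]*[AP]M[^a-z]+[a-z]+)/i

# Hypothetical helper: extract the timestamp text and hand it to DateTime.parse.
def parse_post_time(text)
  match = POST_DATE.match(text)
  match ? DateTime.parse(match[1]) : nil
end

# Invented sample line, shaped to fit the pattern.
stamp = parse_post_time('Date: 2012-01-20, 3:45PM PST')
puts stamp
```

The pattern only isolates the date, time and zone text; all interpretation of that text, including the AM/PM marker, is left to `DateTime.parse`.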
#posting_has_expired? ⇒ Boolean
Returns true if this Post was parsed, and represents a ‘This posting has expired.’ notice
# File 'lib/posting.rb', line 268
def posting_has_expired?
  @posting_has_expired = (
    system_post? and header_as_plain == "This posting has expired."
  ) if @posting_has_expired.nil?

  @posting_has_expired
end
#posting_id ⇒ Object
Integer, Craigslist’s unique posting id
# File 'lib/posting.rb', line 137
def posting_id
  if @posting_id

  elsif USERBODY_PARTS.match html_source
    # Old style:
    = $4
    cursor = Nokogiri::HTML , nil, HTML_ENCODING
    cursor = cursor.next until cursor.nil? or
    @posting_id = $1.to_i if POSTING_ID.match .to_s
  else
    # Post 12/3
    @posting_id = $1.to_i if POSTING_ID.match html.xpath(XPATH_POSTINGBLOCK).to_s
  end

  @posting_id
end
#price ⇒ Object
Returns the best guess of a price, judging by the label’s contents. Price is available when pulled from the listing summary, and can be safely used if you wish to conserve bandwidth by not pulling an entire post from a listing scrape.
# File 'lib/posting.rb', line 336
def price
  unless @price
    (header and PRICE.match label) ? @price = Money.new($1.tr('$','').to_i*100, 'USD') : nil
  end

  @price
end
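To see what the price guess amounts to, the sketch below applies the PRICE pattern to a few hypothetical labels and returns plain integer cents. The `price_cents` helper is an assumption for illustration, and it skips the `Money` object that the original constructs.

```ruby
# Pattern copied from the PRICE constant: a dollar amount at the very
# start or the very end of the label.
PRICE = /((?:^\$[\d]+(?:\.[\d]{2})?)|(?:\$[\d]+(?:\.[\d]{2})?$))/

# Hypothetical helper: same arithmetic as #price, without the Money gem.
def price_cents(label)
  match = PRICE.match(label.to_s)
  # String#to_i stops at the first non-digit, so "19.99" becomes 19;
  # the fractional part is dropped, as it is in the original expression.
  match ? match[1].tr('$', '').to_i * 100 : nil
end

puts price_cents('Trek road bike - $250').inspect
puts price_cents('$1200 2br apartment').inspect
puts price_cents('free couch').inspect
```

A dollar amount in the middle of a label is ignored, since the pattern only anchors to the beginning or end of a line.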
#reply_to ⇒ Object
String, represents the post’s reply-to address, if listed
# File 'lib/posting.rb', line 104
def reply_to
  unless @reply_to
    if html.at_xpath(XPATH_REPLY_TO)
      @reply_to = html.at_xpath(XPATH_REPLY_TO).content
    else
      cursor = html_head.at 'hr' if html_head
      cursor = cursor.next until cursor.nil? or cursor.name == 'a'
      @reply_to = $1 if cursor and REPLY_TO.match he_decode(cursor.inner_html)
    end
  end

  @reply_to
end
#section ⇒ Object
Retrieves the most-relevant craigslist ‘section’ of the post. This is generally the same as full_section.last. However, this (sometimes/rarely) conserves bandwidth by pulling this field from the listing post-summary
# File 'lib/posting.rb', line 308
def section
  unless @section
    @section = full_section.last if full_section
  end

  @section
end
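The relationship between `full_section` and `section` can be sketched on a plain array of breadcrumb strings. The `normalize_sections` helper and the sample breadcrumbs are hypothetical; the real method reads the breadcrumb links from the page head.

```ruby
# Hypothetical helper mirroring the clean-up step in #full_section:
# drop a leading "CL" entry if the page prepends one.
def normalize_sections(crumbs)
  sections = crumbs.dup # leave the caller's array untouched
  sections.delete_at(0) if sections[0] == 'CL'
  sections
end

crumbs       = ['CL', 'south florida', 'for sale', 'bikes']
full_section = normalize_sections(crumbs)
section      = full_section.last # most specific entry, as in #section

puts full_section.inspect
puts section
```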
#system_post? ⇒ Boolean
Some posts (deleted_by_author, flagged_for_removal) are common template posts that craigslist puts up in lieu of an original. This returns true or false depending on whether that case applies.
# File 'lib/posting.rb', line 357
def system_post?
  [contents,posting_id,post_time,title].all?{|f| f.nil?}
end
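Taken together, `system_post?` and the three notice predicates classify a page by its header text. The sketch below shows that idea with a hypothetical `SketchPage` struct and `page_status` helper. It assumes the header is already plain text, whereas the class compares against `header_as_plain`.

```ruby
# Hypothetical stand-in for a parsed page; not part of the library.
SketchPage = Struct.new(:contents, :posting_id, :post_time, :title, :header,
                        keyword_init: true)

# Same test as #system_post?: none of the core fields could be parsed.
def system_post?(page)
  [page.contents, page.posting_id, page.post_time, page.title].all?(&:nil?)
end

# Hypothetical helper that folds the three notice checks into one symbol.
# The header strings are the ones compared against in the listings above.
def page_status(page)
  return :active unless system_post?(page)

  case page.header
  when 'This posting has been deleted by its author.' then :deleted_by_author
  when 'This posting has been flagged for removal'    then :flagged_for_removal
  when 'This posting has expired.'                    then :expired
  else :unknown_system_post
  end
end

expired = SketchPage.new(header: 'This posting has expired.')
active  = SketchPage.new(title: 'Bike', header: 'Bike (downtown)')

puts page_status(expired)
puts page_status(active)
```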
#title ⇒ Object
String, the item’s title
# File 'lib/posting.rb', line 76
def title
  unless @title
    title_tag = html_head.at 'title' if html_head
    @title = he_decode title_tag.inner_html if title_tag
    @title = nil if @title and @title.length == 0
  end

  @title
end