Class: CraigScrape::Posting

Inherits:
Scraper
  • Object
Defined in:
lib/posting.rb

Overview

Posting represents a fully downloaded and parsed Craigslist post. Instances of this class are generally returned by the listing scrape methods, and contain the post summaries for a specific search url or a general listing category.

Constant Summary

POST_DATE =
/Date:[^\d]*((?:[\d]{2}|[\d]{4})\-[\d]{1,2}\-[\d]{1,2}[^\d]+[\d]{1,2}\:[\d]{1,2}[ ]*[AP]M[^a-z]+[a-z]+)/i
LOCATION =
/Location\:[ ]+(.+)/
HEADER_LOCATION =
/\((.+)\)$/
POSTING_ID =
/Posting[ ]?ID\:[ ]*([\d]+)/
REPLY_TO =
/(.+)/
PRICE =
/((?:^\$[\d]+(?:\.[\d]{2})?)|(?:\$[\d]+(?:\.[\d]{2})?$))/
USERBODY_PARTS =

NOTE: we use the (?:) to check the ‘old’ style format first, and then the ‘new’ style (as of the 12/03 parse changes).

/^(.+)\<div id\=\"userbody\">(.+)\<br[ ]*[\/]?\>\<br[ ]*[\/]?\>(.+)\<\/div\>(.+)$/m
HTML_HEADER =
/^(.+)\<div id\=\"userbody\">/m
IMAGE_SRC =
/\<im[a]?g[e]?[^\>]*src=(?:\'([^\']+)\'|\"([^\"]+)\"|([^ ]+))[^\>]*\>/
REQUIRED_FIELDS =

This is used to determine if there’s a parse error

%w(contents posting_id post_time header title full_section)
XPATH_USERBODY =
"//*[@id='userbody']"
XPATH_POSTINGBODY =
"//*[@id='postingbody']"
XPATH_BLURBS =
"//ul[@class='blurbs']"
XPATH_PICS =
["//*[@class='tn']/a/@href",
# For some posts (the newest ones on 01/20/12) we find the images:
"//*[@id='thumbs']/a/@href"
].join('|')
XPATH_REPLY_TO =
["//*[@class='dateReplyBar']/small/a",
# For some posts (the newest ones on 01/20/12) we find the reply to this way:
"//*[@class='dateReplyBar']/*[@id='replytext']/following-sibling::a" 
].join('|')
XPATH_POSTINGBLOCK =
"//*[@class='postingidtext' or @class='postinginfos']"
XPATH_POSTED_DATE =
"//*[@class='postinginfos']/*[@class='postinginfo']/date"

Constants inherited from Scraper

Scraper::HTML_ENCODING, Scraper::HTML_TAG, Scraper::HTTP_HEADERS, Scraper::URL_PARTS
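
As a rough illustration of how the patterns in the constant summary behave, the sketch below copies two of them into a standalone script and applies them to made-up strings. The `first_capture` helper and the sample strings are assumptions for illustration only; they are not part of the library.

```ruby
# Patterns copied verbatim from the constant summary.
LOCATION   = /Location\:[ ]+(.+)/
POSTING_ID = /Posting[ ]?ID\:[ ]*([\d]+)/

# Hypothetical helper: returns the first capture group, or nil when nothing matches.
def first_capture(pattern, text)
  match = pattern.match(text)
  match && match[1]
end

# Made-up sample strings, loosely modelled on Craigslist markup.
puts first_capture(LOCATION, 'Location: Fort Lauderdale')
puts first_capture(POSTING_ID, 'PostingID: 1234567890')
puts first_capture(POSTING_ID, 'Posting ID:987')
```

Because both patterns keep their single capture group, the calling code can read the result from `$1`, which is how the methods below use them.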

Instance Attribute Summary

Attributes inherited from Scraper

#url

Instance Method Summary

Methods inherited from Scraper

#attributes, #downloaded?, #uri

Constructor Details

#initialize(*args) ⇒ Posting

Creates a new Posting from a url (String), or from supplied parameters (Hash).



# File 'lib/posting.rb', line 50

def initialize(*args)
  super(*args)

  # Validate that required fields are present, at least - if we've downloaded it from a url
  if args.first.kind_of? String and is_active_post?
    unparsed_fields = REQUIRED_FIELDS.find_all{|f| 
      val = send(f)
      val.nil? or (val.respond_to? :length and val.length == 0)
    } 
    parse_error! unparsed_fields unless unparsed_fields.empty?
  end  

end
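
The required-field check in the constructor can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `FakePost` struct in place of a real downloaded post; only the `REQUIRED_FIELDS` list and the nil-or-empty test are taken from the source above.

```ruby
# Field names copied from REQUIRED_FIELDS above.
REQUIRED_FIELDS = %w(contents posting_id post_time header title full_section)

# Hypothetical stand-in for a parsed post: every required field is a plain accessor.
FakePost = Struct.new(*REQUIRED_FIELDS.map(&:to_sym), keyword_init: true)

# Returns the names of required fields that are nil or empty (strings and arrays alike).
def unparsed_fields(post)
  REQUIRED_FIELDS.find_all do |field|
    value = post.send(field)
    value.nil? || (value.respond_to?(:length) && value.length.zero?)
  end
end

post = FakePost.new(contents: 'For sale', posting_id: 42, header: '', full_section: [])
p unparsed_fields(post)
```

In the constructor, a non-empty result is what triggers `parse_error!`.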

Instance Attribute Details

#href ⇒ Object (readonly)

This is really just for testing. In production use, uri.path is a better solution.



# File 'lib/posting.rb', line 47

def href
  @href
end

Instance Method Details

#contents ⇒ Object

String, the full HTML contents of the post



# File 'lib/posting.rb', line 155

def contents
  unless @contents
    @contents = if html.at_xpath(XPATH_POSTINGBODY)
      # For some posts (the newest ones on 01/20/12) craigslist made this really 
      # easy for us. 
      html.at_xpath(XPATH_POSTINGBODY).children.to_s
    elsif html_source
      # Otherwise we have to parse this in a convoluted way from the userbody
      # section:
      user_body
    end
    
    # This helps clean up the whitespace around the sides, in case we got any: 
    @contents = he_decode(@contents).strip if @contents
  end
  
  @contents
end

#contents_as_plain ⇒ Object

Returns the post contents with all html tags removed



# File 'lib/posting.rb', line 345

def contents_as_plain
  strip_html contents
end

#deleted_by_author? ⇒ Boolean

Returns true if this Post was parsed, and represents a ‘This posting has been deleted by its author.’ notice

Returns:

  • (Boolean)


# File 'lib/posting.rb', line 259

def deleted_by_author?
  @deleted_by_author = (
    system_post? and header_as_plain == "This posting has been deleted by its author."
  ) if @deleted_by_author.nil?
  
  @deleted_by_author
end
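
The `if @deleted_by_author.nil?` guard matters because the cached answer is often `false`. The sketch below, built around a hypothetical `FlagCache` class, contrasts that guard with the more common `||=` idiom, which would repeat the check every time the result is `false`.

```ruby
# Hypothetical class contrasting two memoization styles for a boolean result.
class FlagCache
  attr_reader :calls

  def initialize
    @calls = 0
  end

  # Stand-in for an expensive check whose answer happens to be false.
  def expensive_check
    @calls += 1
    false
  end

  # Style used above: compute only while the cache is nil, so false is cached too.
  def nil_checked
    @nil_checked = expensive_check if @nil_checked.nil?
    @nil_checked
  end

  # Common alternative: ||= recomputes whenever the cached value is false or nil.
  def or_equals
    @or_equals ||= expensive_check
  end
end

cache = FlagCache.new
2.times { cache.nil_checked }
puts cache.calls # prints 1: the check ran once

2.times { cache.or_equals }
puts cache.calls # prints 3: ||= ran the check on both calls
```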

#flagged_for_removal? ⇒ Boolean

Returns true if this Post was parsed, and is merely a ‘Flagged for Removal’ page

Returns:

  • (Boolean)


# File 'lib/posting.rb', line 250

def flagged_for_removal?
  @flagged_for_removal = (
    system_post? and header_as_plain == "This posting has been flagged for removal"
  ) if @flagged_for_removal.nil?
  
  @flagged_for_removal
end

#full_section ⇒ Object

Array, hierarchical representation of the post’s section



# File 'lib/posting.rb', line 87

def full_section
  unless @full_section
    @full_section = []
    
    (html_head / "*[@class='bchead']//a").each do |a|
      @full_section << he_decode(a.inner_html) unless a['id'] and a['id'] == 'ef'
    end if html_head
    
    # For some posts (the newest ones on 01/20/12) craigslist is pre-pending
    # a silly "CL" to the section. Let's strip that:
    @full_section.delete_at(0) if @full_section[0] == 'CL'
  end

  @full_section
end

#has_img? ⇒ Boolean

true if the post summary has ‘img(s)’. ‘imgs’ are different than pics, in that the resource is not hosted on craigslist’s servers. This can always be read from the listing post-summary, and should never cause an additional page load.

Returns:

  • (Boolean)


# File 'lib/posting.rb', line 318

def has_img?
  img_types.include? :img
end

#has_pic? ⇒ Boolean

true if the post summary has ‘pic(s)’. ‘pics’ are different than imgs, in that craigslist is hosting the resource on its own servers. This can always be read from the listing post-summary, and should never cause an additional page load.

Returns:

  • (Boolean)


# File 'lib/posting.rb', line 324

def has_pic?
  img_types.include? :pic
end

#has_pic_or_img? ⇒ Boolean

true if the post summary has either the img or the pic label. This can always be read from the listing post-summary, and should never cause an additional page load.

Returns:

  • (Boolean)


# File 'lib/posting.rb', line 330

def has_pic_or_img?
  img_types.length > 0
end

#header ⇒ Object

String, the contents of the item’s HTML body heading



# File 'lib/posting.rb', line 66

def header
  unless @header
    h2 = html_head.at 'h2' if html_head
    @header = he_decode h2.inner_html if h2
  end
  
  @header
end

#header_as_plain ⇒ Object

Returns the header with all html tags removed. Granted, the header should usually be plain, but in the case of a ‘system_post’ we may get tags in here



# File 'lib/posting.rb', line 351

def header_as_plain
  strip_html header
end

#images ⇒ Object

Array, urls of the post’s images that are not hosted on craigslist



# File 'lib/posting.rb', line 217

def images
  # Keep in mind that when users post html to craigslist, they're often not posting wonderful html...
  @images = ( 
    contents ? 
      contents.scan(IMAGE_SRC).collect{ |a| a.find{|b| !b.nil? } } :
      [] 
  ) unless @images
  
  @images
end
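
To see what the `scan` and `collect` combination returns, here is a small sketch that applies `IMAGE_SRC` to a made-up HTML fragment. The `image_urls` helper and the example.com urls are assumptions for illustration; the regex is copied from the constant summary.

```ruby
# Pattern copied verbatim from the constant summary.
IMAGE_SRC = /\<im[a]?g[e]?[^\>]*src=(?:\'([^\']+)\'|\"([^\"]+)\"|([^ ]+))[^\>]*\>/

# Hypothetical helper mirroring #images: each scan result holds three capture
# groups (single-quoted, double-quoted, unquoted), of which exactly one is set.
def image_urls(html)
  html.scan(IMAGE_SRC).collect { |groups| groups.find { |g| !g.nil? } }
end

# Made-up fragment covering all three quoting styles.
html = %q(<img src="http://example.com/a.jpg" alt="x"> ) +
       %q(<img src='http://example.com/b.png'> ) +
       %q(<img src=http://example.com/c.gif>)

p image_urls(html)
```

The pattern has no case-insensitive flag, so an upper-case `<IMG>` tag would not be picked up by this sketch.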

#img_types ⇒ Object

Array, which image types are listed for the post. This can always be read from the listing post-summary, and should never cause an additional page load.



# File 'lib/posting.rb', line 301

def img_types
  @img_types || [ (images.length > 0) ? :img : nil, 
    (pics.length > 0) ? :pic : nil ].compact
end
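
The array built in `img_types` relies on `compact` to drop the `nil` placeholders. A minimal sketch of the same idea, using a hypothetical free-standing `img_types_for` method that takes the two url lists as arguments:

```ruby
# Hypothetical free-standing version of the img_types logic: each slot is a
# symbol when the matching list is non-empty, otherwise nil; compact drops nils.
def img_types_for(images, pics)
  [(:img unless images.empty?), (:pic unless pics.empty?)].compact
end

p img_types_for([], [])
p img_types_for(['http://example.com/a.jpg'], [])
p img_types_for(['http://example.com/a.jpg'], ['http://example.com/b.jpg'])
```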

#is_active_post? ⇒ Boolean

This is mostly used to determine if the post should be checked for parse errors. Might be useful for someone else though

Returns:

  • (Boolean)


# File 'lib/posting.rb', line 363

def is_active_post?
  [flagged_for_removal?, posting_has_expired?, deleted_by_author?].none?
end

#label ⇒ Object

Returns the post label. The label appears at first glance to be identical to the header, but it is not. The label is cited on the listings pages, and generally includes everything in the header with the exception of the location. Sometimes the header includes additional information (e.g. ‘(map)’ on rea (real estate) listings) that is not listed in the label. This is also used as a bandwidth shortcut for the craigwatch program, and is a guaranteed identifier for the post that won’t result in a full page load from the post’s url.



# File 'lib/posting.rb', line 289

def label
  unless @label or system_post?
    @label = header
    
    @label = $1 if location and /(.+?)[ ]*\(#{location}\).*?$/.match @label
  end
  
  @label
end
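
The location-stripping step can be sketched on its own. The `label_from` helper and the sample header below are hypothetical. Note one deliberate difference: the sketch passes the location through `Regexp.escape`, which the method above does not do, so that regex metacharacters in a location cannot change the pattern.

```ruby
# Hypothetical helper: removes "(location)" and anything after it from a header.
# Unlike the library method, the location is escaped before interpolation.
def label_from(header, location)
  return header if location.nil?

  match = /(.+?)[ ]*\(#{Regexp.escape(location)}\).*?$/.match(header)
  match ? match[1] : header
end

# Made-up header with a trailing "(map)" marker.
puts label_from('Nice couch - $40 (Miami Beach) (map)', 'Miami Beach')
puts label_from('Nice couch - $40', nil)
```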

#location ⇒ Object

String, the location of the item, as best could be parsed



# File 'lib/posting.rb', line 175

def location
  if @location.nil? and html
   
    if html.at_xpath(XPATH_BLURBS)
      # This is the post-12/3/12 style:

      # Sometimes the Location is in the body :
      @location = $1 if html.xpath(XPATH_BLURBS).first.children.any?{|c| 
        LOCATION.match c.content}

    elsif craigslist_body
      # Location (when explicitly defined):
      cursor = craigslist_body.at 'ul' unless @location

      # This is the legacy style:
      # Note: Apa section includes other things in the li's (cats/dogs ok fields)
      cursor.children.each do |li|
        if LOCATION.match li.inner_html
          @location = he_decode($1)
          break
        end
      end if cursor

      # Real estate listings can work a little different for location:
      unless @location
        cursor = craigslist_body.at 'small'
        cursor = cursor.previous until cursor.nil? or cursor.text?
        
        @location = he_decode(cursor.to_s.strip) if cursor
      end
      
    end
    
    # So, *sometimes* the location just ends up being in the header, I don't know why.
    # This happens on old-style and new-style posts:
    @location = $1 if @location.nil? and HEADER_LOCATION.match header
  end
  
  @location
end
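
The final fallback, reading the location from the header, can be tried in isolation. This is a sketch with a hypothetical `location_from_header` helper and made-up headers; the pattern is copied from the constant summary. The second call shows a limitation of the greedy `.+`: when a header ends in more than one parenthesised group, the capture spans all of them.

```ruby
# Pattern copied verbatim from the constant summary.
HEADER_LOCATION = /\((.+)\)$/

# Hypothetical helper mirroring the header fallback in #location.
def location_from_header(header)
  match = HEADER_LOCATION.match(header.to_s)
  match && match[1]
end

p location_from_header('Nice couch - $40 (Miami Beach)') # => "Miami Beach"
p location_from_header('2br condo (Brickell) (map)')     # => "Brickell) (map"
p location_from_header('No parentheses here')            # => nil
```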

#pics ⇒ Object

Array, urls of the post’s craigslist-hosted images



# File 'lib/posting.rb', line 229

def pics
  unless @pics
    @pics = []
    
    if html 
      if html.at_xpath(XPATH_PICS)
        @pics = html.xpath(XPATH_PICS).collect(&:value)
      elsif craigslist_body
        # This is the pre-12/3/12 style:
        # Now let's find the craigslist hosted images:
        img_table = (craigslist_body / 'table').find{|e| e.name == 'table' and e[:summary] == 'craigslist hosted images'}
      
        @pics = (img_table / 'img').collect{|i| i[:src]} if img_table
      end
    end
  end
  
  @pics
end

#post_date ⇒ Object

Reflects only the date portion of the posting. Does not include hours/minutes. This is useful when reflecting the listing scrapes, and can be safely used if you wish to conserve bandwidth by not pulling an entire post from a listing scrape.



# File 'lib/posting.rb', line 278

def post_date
  @post_date = post_time.to_date unless @post_date or post_time.nil?
  
  @post_date
end

#post_time ⇒ Object

Time, reflects the full timestamp of the posting



# File 'lib/posting.rb', line 119

def post_time
  unless @post_time
    if html.at_xpath(XPATH_POSTED_DATE)
      # For some posts (the newest ones on 01/20/12) craigslist made this really 
      # easy for us. 
      @post_time = DateTime.parse(html.at_xpath(XPATH_POSTED_DATE))
    else
      # The bulk of the post time/dates are parsed via a simple regex:
      cursor = html_head.at 'hr' if html_head
      cursor = cursor.next until cursor.nil? or POST_DATE.match cursor.to_s
      @post_time = DateTime.parse($1) if $1
    end
  end
  
  @post_time
end
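
For the legacy branch, the following sketch shows `POST_DATE` feeding `DateTime.parse`. The `parse_post_time` helper and the sample date line are assumptions for illustration; the pattern is copied from the constant summary.

```ruby
require 'date'

# Pattern copied verbatim from the constant summary.
POST_DATE = /Date:[^\d]*((?:[\d]{2}|[\d]{4})\-[\d]{1,2}\-[\d]{1,2}[^\d]+[\d]{1,2}\:[\d]{1,2}[ ]*[AP]M[^a-z]+[a-z]+)/i

# Hypothetical helper: returns a DateTime for the first date line found, or nil.
def parse_post_time(text)
  match = POST_DATE.match(text)
  match && DateTime.parse(match[1])
end

# Made-up date line in the legacy format.
stamp = parse_post_time('Date: 2009-05-08, 12:32PM EDT')
puts stamp.strftime('%Y-%m-%d %H:%M')
```

The capture group includes the trailing zone abbreviation, so the zone handling is left entirely to `DateTime.parse`.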

#posting_has_expired? ⇒ Boolean

Returns true if this Post was parsed, and represents a ‘This posting has expired.’ notice

Returns:

  • (Boolean)


# File 'lib/posting.rb', line 268

def posting_has_expired?
  @posting_has_expired = (
    system_post? and header_as_plain == "This posting has expired."
  ) if @posting_has_expired.nil?
  
  @posting_has_expired
end

#posting_id ⇒ Object

Integer, Craigslist’s unique posting id



# File 'lib/posting.rb', line 137

def posting_id
  if @posting_id 

  elsif USERBODY_PARTS.match html_source
    # Old style: the posting id sits in the footer that follows the userbody div.
    # The footer text is matched directly, so no Nokogiri cursor walk is needed.
    html_footer = $4
    @posting_id = $1.to_i if POSTING_ID.match html_footer.to_s
  else
    # Post 12/3
    @posting_id = $1.to_i if POSTING_ID.match html.xpath(XPATH_POSTINGBLOCK).to_s
  end

  @posting_id
end

#price ⇒ Object

Returns the best guess of a price, judging by the label’s contents. Price is available when pulled from the listing summary, and can be safely used if you wish to conserve bandwidth by not pulling an entire post from a listing scrape.



# File 'lib/posting.rb', line 336

def price
  unless @price
    (header and PRICE.match label) ? 
      @price = Money.new($1.tr('$','').to_i*100, 'USD') : nil
  end
  @price
end
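
The sketch below applies `PRICE` to a few made-up labels and converts the match to integer cents, leaving out the Money object so that it runs without extra gems. `price_cents_truncated` mirrors the `to_i * 100` conversion above, which drops any cents; `price_cents_exact` is a hypothetical variant, not part of the library, that keeps them by going through `String#to_r`.

```ruby
# Pattern copied verbatim from the constant summary: a price at the very
# start or the very end of the label.
PRICE = /((?:^\$[\d]+(?:\.[\d]{2})?)|(?:\$[\d]+(?:\.[\d]{2})?$))/

# Mirrors the conversion in #price: to_i discards the fractional part.
def price_cents_truncated(label)
  match = PRICE.match(label)
  match && match[1].tr('$', '').to_i * 100
end

# Hypothetical variant that keeps the cents by converting through Rational.
def price_cents_exact(label)
  match = PRICE.match(label)
  match && (match[1].tr('$', '').to_r * 100).to_i
end

p price_cents_truncated('$40 Nice couch')         # => 4000
p price_cents_truncated('Concert tickets $12.50') # => 1200
p price_cents_exact('Concert tickets $12.50')     # => 1250
p price_cents_exact('Couch, $40 or best offer')   # => nil
```

The last call returns nil because the pattern only accepts a price anchored at the start or end of the label.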

#reply_to ⇒ Object

String, represents the post’s reply-to address, if listed



# File 'lib/posting.rb', line 104

def reply_to
  unless @reply_to
    if html.at_xpath(XPATH_REPLY_TO)
      @reply_to = html.at_xpath(XPATH_REPLY_TO).content
    else
      cursor = html_head.at 'hr' if html_head
      cursor = cursor.next until cursor.nil? or cursor.name == 'a'
      @reply_to = $1 if cursor and REPLY_TO.match he_decode(cursor.inner_html)
    end
  end
  
  @reply_to
end

#section ⇒ Object

Retrieves the most-relevant craigslist ‘section’ of the post. This is generally the same as full_section.last. However, this (sometimes/rarely) conserves bandwidth by pulling this field from the listing post-summary.



# File 'lib/posting.rb', line 308

def section
  unless @section
    @section = full_section.last if full_section  
  end
  
  @section
end

#system_post? ⇒ Boolean

Some posts (deleted_by_author, flagged_for_removal) are common template posts that craigslist puts up in lieu of an original. This returns true or false depending on whether that case applies.

Returns:

  • (Boolean)


# File 'lib/posting.rb', line 357

def system_post?
  [contents,posting_id,post_time,title].all?{|f| f.nil?}
end

#title ⇒ Object

String, the item’s title



# File 'lib/posting.rb', line 76

def title
  unless @title
    title_tag = html_head.at 'title' if html_head
    @title = he_decode title_tag.inner_html if title_tag
    @title = nil if @title and @title.length == 0
  end

  @title
end