Module: BioBiostarsAnalytics
- Defined in: lib/bio-biostars-analytics/biostars-analytics.rb
Constant Summary
- @@CATEGORIES = 16
  Categories in Biostar:

  Type ID   Type
  1         Question
  2         Answer
  3         Comment
  4         Tutorial
  5         Blog
  6         Forum
  7         News
  8
  9         Tool
  10        FixMe
  11        Video
  12        Job
  13        Research Paper
  14        Tip
  15        Poll
  16        Ad
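The type IDs listed above reappear as the `api_type_id` column in the crawler output, so a lookup hash can make that column readable. The following is a minimal sketch, not part of the module: `CATEGORY_NAMES` and `category_name` are hypothetical names, and ID 8 is left without a name because the listing above does not give one.

```ruby
# Hypothetical lookup table built from the type IDs listed above.
# Not part of BioBiostarsAnalytics; ID 8 has no name in the source listing.
CATEGORY_NAMES = {
  1 => 'Question', 2 => 'Answer', 3 => 'Comment', 4 => 'Tutorial',
  5 => 'Blog', 6 => 'Forum', 7 => 'News', 8 => nil,
  9 => 'Tool', 10 => 'FixMe', 11 => 'Video', 12 => 'Job',
  13 => 'Research Paper', 14 => 'Tip', 15 => 'Poll', 16 => 'Ad'
}.freeze

# Returns the category name for a type ID, or a placeholder when unknown.
def category_name(type_id)
  CATEGORY_NAMES[type_id] || "Unknown (#{type_id})"
end

puts category_name(1)   # Question
puts category_name(8)   # Unknown (8)
```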
Class Method Summary
- .cli ⇒ Object
- .extract_date(datestring) ⇒ Object
  Extracts the date (day, month, year) from a date string as formatted in Biostar forum posts.
- .minecontent(log, id) ⇒ Object
  Extracts data from the rendered forum post as well as Biostar’s “post” API.
- .minehistory(log, age) ⇒ Object
  Extracts data from Biostar’s “stats” API.
Class Method Details
.cli ⇒ Object
# File 'lib/bio-biostars-analytics/biostars-analytics.rb', line 305

def self.cli
  if not ARGV.length.between?(2, 3) or
     not ARGV[0].match(/\d+/) or
     not ARGV[1].match(/\d+/) or
     (ARGV.length == 3 and not ARGV[2].match(/\d+/)) then
    puts 'Usage: biostars-analytics max_post_number months_look_back [min_post_number]'
    puts ''
    puts 'Required parameters:'
    puts '  max_post_number  : highest number (ID) of the post that should'
    puts '                     be mined for data; the crawler will go over'
    puts '                     posts min_post_number to max_post_number'
    puts '  months_look_back : how many months back should queries to the'
    puts '                     Biostar API go (1 month = 30 days); default'
    puts '                     value is 1'
    puts ''
    puts 'Optional parameters:'
    puts '  min_post_number  : lowest number (ID) of the post that should'
    puts '                     be mined for data'
    puts ''
    puts 'Output (date matches the script\'s invokation):'
    puts '  <date>_crawled.tsv : data mined from crawling over posts'
    puts '  <date>_api.tsv     : data extracted from the Biostar API'
    puts ''
    puts 'Example: mining Biostars in March 2014:'
    puts '  biostars-analytics 96000 54'
    exit 1
  end

  max_post_number = ARGV[0].to_i
  months_look_back = ARGV[1].to_i
  min_post_number = 1
  min_post_number = ARGV[2].to_i if ARGV.length == 3

  # Make sure not to buffer stdout, so that it is possible to
  # snoop around whilst the script is running.
  STDOUT.sync = true

  today = Time.now.strftime('%Y%m%d')
  crawler_log = File.open("#{today}_crawled.tsv", 'w')
  api_log = File.open("#{today}_api.tsv", 'w')

  (min_post_number..max_post_number).each { |i|
    minecontent(crawler_log, i)
  }

  @post_age = {}
  @user_age = {}
  (1..months_look_back*30).to_a.reverse.each { |i|
    minehistory(api_log, i)
  }

  crawler_log.close
  api_log.close
end
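The argument check in `.cli` uses the unanchored pattern `/\d+/`, so a value such as `12abc` passes validation and is then truncated to `12` by `to_i`. The sketch below shows one way the same two or three positional arguments could be validated with anchored patterns. `parse_args` is a hypothetical helper, not part of the module, and the returned hash layout is an assumption made for illustration.

```ruby
# Hypothetical helper, not part of the module: validates the same two or
# three positional arguments as .cli, but with anchored patterns so that
# only strings consisting entirely of digits are accepted.
def parse_args(argv)
  return nil unless argv.length.between?(2, 3)
  return nil unless argv.all? { |arg| arg.match?(/\A\d+\z/) }

  {
    max_post_number: argv[0].to_i,
    months_look_back: argv[1].to_i,
    # .cli defaults the lower bound to post 1 when it is omitted.
    min_post_number: argv.length == 3 ? argv[2].to_i : 1
  }
end

p parse_args(%w[96000 54])
p parse_args(%w[96000 54 90000])
p parse_args(%w[96000 5x])
```

A `nil` result corresponds to the case in which `.cli` prints its usage text and exits.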
.extract_date(datestring) ⇒ Object
Extracts the date (day, month, year) from a date string as formatted in Biostar forum posts.
# File 'lib/bio-biostars-analytics/biostars-analytics.rb', line 32

def self.extract_date(datestring)
  # Major headache: weird years like "3.4 years ago"
  if datestring.match(/\d+\.\d+ years ago/) then
    return Chronic.parse("#{(datestring.sub(/\d+\./, '').sub(/\s.*$/, '').to_i * 5.2).to_i} weeks ago", :now => Chronic.parse(datestring.sub(/\.\d+/, '')))
  else
    return Chronic.parse(datestring)
  end
end
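The fractional-year branch of `.extract_date` reads the digits after the decimal point as tenths of a year, converts them to weeks at 5.2 weeks per tenth, and applies that offset relative to the whole-year date. The sketch below reproduces only this string arithmetic, without calling Chronic. `split_fractional_years` is a hypothetical helper introduced for illustration.

```ruby
# Hypothetical helper mirroring the string handling in .extract_date.
# Returns the two strings that .extract_date hands to Chronic: the
# whole-year base and the week offset that is applied relative to it.
def split_fractional_years(datestring)
  return nil unless datestring.match(/\d+\.\d+ years ago/)

  # "3.4 years ago" -> "4 years ago" -> "4" -> 4
  tenths = datestring.sub(/\d+\./, '').sub(/\s.*$/, '').to_i
  weeks = (tenths * 5.2).to_i
  # "3.4 years ago" -> "3 years ago"
  base = datestring.sub(/\.\d+/, '')
  { base: base, offset: "#{weeks} weeks ago" }
end

p split_fractional_years('3.4 years ago')
# {:base=>"3 years ago", :offset=>"20 weeks ago"} (inspect format varies by Ruby version)
p split_fractional_years('2 days ago')
# nil
```

Because the digits after the point are scaled by 5.2 as a whole, a two-digit fraction such as `3.45 years ago` would be read as 45 tenths; the code appears to assume that Biostar prints a single decimal digit.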
.minecontent(log, id) ⇒ Object
Extracts data from the rendered forum post as well as Biostar’s “post” API.
Algorithm:
- mine data from the rendered forum post
- retrieve limited information from Biostar’s API
- check that gathered data matches up
- log it
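The third step treats the two comparisons differently: a mismatch in the number of answers is only reported and the scraped count replaces the API value, whereas a mismatch in the question score causes the post to be skipped. A stand-alone sketch of that rule follows. `reconcile` and its two hash arguments are hypothetical; only the key names `answer_count`, `score`, `answer_number`, and `question_vote` are taken from the source.

```ruby
# Hypothetical stand-alone version of the third step. `scraped` and `api`
# are plain hashes standing in for the values .minecontent collects.
def reconcile(scraped, api)
  merged = api.dup

  # Answer counts are known to disagree, so the scraped count wins.
  if merged['answer_count'] != scraped['answer_number']
    warn "number of answers differ (#{merged['answer_count']} vs. #{scraped['answer_number']})"
    merged['answer_count'] = scraped['answer_number']
  end

  # A question-score mismatch is treated as fatal: the post is skipped.
  return nil unless merged['score'] == scraped['question_vote']

  merged
end

p reconcile({ 'answer_number' => 3, 'question_vote' => 5 },
            { 'answer_count' => 4, 'score' => 5 })
p reconcile({ 'answer_number' => 3, 'question_vote' => 5 },
            { 'answer_count' => 3, 'score' => 6 })
```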
# File 'lib/bio-biostars-analytics/biostars-analytics.rb', line 49

def self.minecontent(log, id)
  # This hash aggregates information about a particular Biostar question and its answers/comments:
  post = { 'id' => id }

  #
  # First: mine data from the rendered forum post
  #

  url = "http://www.biostars.org/p/#{id}/"
  page = nil

  begin
    page = open(url)
  rescue
    return
  end

  if page.base_uri.to_s != url then
    # Answer URL.
    return
  end

  # Question URL that contains the question, its answers and edits.
  doc = Hpricot(page.read)

  # Bail out if this page does not explicitly mentions a question.
  return unless doc.search('doc.title') or doc.search('doc.title')[0].inner_html.match(/^Question:/)

  users = []

  # Extract user interactions: questions asked, answered and edits being made
  times = doc.search('span.relativetime|div.lastedit').map { |element|
    element.inner_html.sub(/^[^0-9]+/, '').sub(/by\s+$/, '').split("\n").first.strip
  }
  links = (doc/'a').delete_if { |link|
    if link.get_attribute('href') then
      not link.get_attribute('href').match(/^\/u\/\d+\//) # Has to be a relative link, or we catch Dropbox link-outs too...
    else
      true
    end
  }.map { |userlink| "#{userlink.get_attribute('href').gsub(/[\/u]+/, '')}\t#{userlink.inner_html}" }
  votes = doc.search('div.vote-count').map { |vote|
    if vote.inner_html.match(/^\d+$/) then
      vote.inner_html.to_i
    else
      nil
    end
  }
  tags = doc.search('a.tag').map { |link| link.inner_html }

  # Sanity check: times and users need to match up (i.e., both arrays need to be of the same length)
  unless times.length == links.length then
    $stderr.puts "Post ##{id}: recorded times and author links do not match up (#{times.length} vs. #{links.length})."
    return
  end

  # Sanity check: there cannot be more votes than times/links
  if votes.length > times.length then
    $stderr.puts "Post ##{id}: there are more votes than recorded user actions? (#{votes.length} vs. #{links.length})"
    return
  end

  # Question/answer specific stats regarding votes:
  question_vote = votes[0]
  answer_number = votes[1..-1].compact.length
  answer_min_vote = votes[1..-1].compact.sort[0]
  answer_max_vote = votes[1..-1].compact.sort[-1]
  answer_avg_vote = nil
  answer_avg_vote = (answer_min_vote + answer_max_vote).to_f / 2.0 if answer_min_vote and answer_max_vote

  # Helper variables to deal with the "votes" array, which is shorter than the times/links arrays.
  # These variables determine when the index counter for the "votes" array is incremented and when
  # said index is valid.
  vote_used = false
  vote_index = 0

  # Go through each time occurrence/author link pair (and also consider votes):
  post['records'] = times.length
  times.each_index { |index|
    # Sanity check: first time is not an update...
    if index == 0 and times[index].match(/updated/) then
      $stderr.puts "Post ##{id}: First recorded time is also an update?"
      return
    end
    # Sanity check: first time is also not a comment...
    if index == 0 and votes[index] == nil then
      $stderr.puts "Post ##{id}: First recorded time is a comment?"
      return
    end
    action = 'answered'
    action = 'asked' if index == 0
    if votes[vote_index] == nil and not vote_used then
      action = 'commented'
      vote_used = true
    end
    if times[index].match(/updated/) then
      action = 'edited'
    else
      vote_index += 1
      vote_used = false
    end
    times[index] = times[index].sub(/^[^0-9]+/, '')
    datetime = extract_date(times[index])
    post["#{index}"] = {
      'datestring' => times[index],
      'year' => datetime.year,
      'month' => datetime.month,
      'day' => datetime.day,
      'action' => action,
      'uid' => links[index],
      'question_vote' => question_vote,
      'answer_number' => answer_number,
      'answer_min_vote' => answer_min_vote,
      'answer_max_vote' => answer_max_vote,
      'answer_avg_vote' => answer_avg_vote,
      'tags' => tags
    }
  }

  page.close

  #
  # Second: retrieve limited information from Biostar's API
  #

  url = "http://www.biostars.org/api/post/#{id}/"

  begin
    doc = JSON.parse(open(url).read)
  rescue
    return
  end

  # Extract the limited information the API offers:
  post['api_creation_date'] = Chronic.parse(doc['creation_date'])
  post['api_answer_number'] = doc['answer_count']
  post['api_question_vote'] = doc['score']
  post['api_type'] = doc['type']
  post['api_type_id'] = doc['type_id']

  #
  # Third: check that gathered data matches up (API and data mined results are matching)
  #

  # Warning: number of answers matches
  #
  # Cannot be used as sanity check, because the Biostar implementation actually returns
  # a wrong number of answers. For example, http://www.biostars.org/p/7542/ (20 March 2014)
  # says "4 answers" even though there are clearly just three answers being displayed.
  # The same applies to underreporting of answers, such as in http://www.biostars.org/p/10927/
  # (20 March 2014), where 12 answers are shown on the web-page, but the summary on top
  # reports only 11 answers.
  unless post['api_answer_number'] == post['0']['answer_number'] then
    $stderr.puts "Post ##{id}: number of answers differ (#{post['api_answer_number']} vs. #{post['0']['answer_number']}). Resetting number returned by API; using actual count of answers visible to the user."
    post['api_answer_number'] = post['0']['answer_number']
  end

  # Sanity check: voting score for the question matches
  unless post['api_question_vote'] == post['0']['question_vote'] then
    $stderr.puts "Post ##{id}: mismatch between API's reported question vote and data mined voting score (#{post['api_question_vote']} vs. #{post['0']['question_vote']})."
    return
  end

  #
  # Fourth: log it
  #

  (0..post['records']-1).each { |index|
    record = post["#{index}"]
    log.puts "#{post['id']}\t#{record['datestring']}\t#{record['year']}\t#{record['month']}\t#{record['day']}\t#{record['action']}\t#{record['uid']}\t#{record['question_vote']}\t#{record['answer_number']}\t#{record['answer_min_vote']}\t#{record['answer_max_vote']}\t#{record['answer_avg_vote']}\t#{record['tags'].join(',')}\t#{post['api_type']}\t#{post['api_type_id']}"
  }
end
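One detail of the vote statistics in `.minecontent` is easy to misread: `answer_avg_vote` is computed as `(answer_min_vote + answer_max_vote) / 2.0`, which is the midpoint between the lowest and highest answer vote rather than the arithmetic mean over all answers. The sketch below isolates that calculation. `answer_vote_stats` is a hypothetical helper, the input array is made up, and the `Array(...)` guard for an empty input is an addition that the original code does not have.

```ruby
# Hypothetical helper isolating the vote statistics from .minecontent.
# votes[0] belongs to the question; nil entries stand for comments.
def answer_vote_stats(votes)
  # Array(nil) yields [], which guards against an empty votes array.
  answers = Array(votes[1..-1]).compact.sort
  min = answers.first
  max = answers.last
  {
    question_vote: votes[0],
    answer_number: answers.length,
    answer_min_vote: min,
    answer_max_vote: max,
    # Midpoint of min and max, as in the source; not the mean.
    answer_avg_vote: (min && max) ? (min + max).to_f / 2.0 : nil
  }
end

stats = answer_vote_stats([5, 3, nil, 1, 10])
puts stats[:answer_number]    # 3
puts stats[:answer_avg_vote]  # 5.5
```

For the made-up votes 3, 1 and 10 the midpoint is 5.5, while the arithmetic mean would be about 4.67, so the `answer_avg_vote` column should be interpreted accordingly.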
.minehistory(log, age) ⇒ Object
Extracts data from Biostar’s “stats” API.
# File 'lib/bio-biostars-analytics/biostars-analytics.rb', line 220

def self.minehistory(log, age)
  url = "http://www.biostars.org/api/stats/#{age}/"

  begin
    stats = JSON.parse(open(url).read)
  rescue
    return
  end

  # Extract the limited information the API offers:
  parseddate = Chronic.parse(stats['date'])
  stats['year'] = parseddate.year
  stats['month'] = parseddate.month
  stats['day'] = parseddate.day

  (1..@@CATEGORIES).each { |category|
    stats["new_posts_in_category_#{category}"] = 0
  }

  # Types of votes in Biostar:
  #   Accept
  #   Bookmark
  #   Downvote
  #   Upvote
  stats['new_votes_of_type_Accept'] = 0
  stats['new_votes_of_type_Bookmark'] = 0
  stats['new_votes_of_type_Downvote'] = 0
  stats['new_votes_of_type_Upvote'] = 0

  stats['posters'] = []
  stats['poster_ages'] = []
  stats['root_post_ages'] = []
  stats['vote_post_ages'] = []
  stats['biostarbabies'] = []

  if stats.has_key?('x_new_users') then
    stats['x_new_users'].each { |post|
      @user_age[post['id']] = age
      stats['biostarbabies'] = stats['biostarbabies'] + [ post['id'] ]
    }
    stats['new_users'] = stats['x_new_users'].length
  else
    stats['new_users'] = 0
  end

  if stats.has_key?('x_new_posts') then
    stats['x_new_posts'].each { |post|
      @post_age[post['id']] = age
      stats['posters'] = stats['posters'] + [ post['author_id'] ]
      stats['poster_ages'] = stats['poster_ages'] + [ @user_age[post['author_id']] ]
      stats['root_post_ages'] = stats['root_post_ages'] + [ @post_age[post['root_id']] ] if post['root_id'] != post['id']
      stats["new_posts_in_category_#{post['type_id']}"] = stats["new_posts_in_category_#{post['type_id']}"] + 1
    }
    stats['new_posts'] = stats['x_new_posts'].length
  else
    stats['new_posts'] = 0
  end

  # Poster age might not be applicable when having gone too far back in time...
  stats['poster_ages'].reject! { |i| i == nil }

  if stats.has_key?('x_new_votes') then
    stats['x_new_votes'].each { |vote|
      stats['vote_post_ages'] = stats['vote_post_ages'] + [ @post_age[vote['post_id']] ] if vote['type'] == 'Upvote' or vote['type'] == 'Downvote'
      stats["new_votes_of_type_#{vote['type']}"] = stats["new_votes_of_type_#{vote['type']}"] + 1
    }
    stats['new_votes'] = stats['x_new_votes'].length
  else
    stats['new_votes'] = 0
  end

  line = "#{age}\t#{stats['date']}\t#{stats['year']}\t#{stats['month']}\t#{stats['day']}\t"
  (1..@@CATEGORIES).each { |category|
    line << "#{stats["new_posts_in_category_#{category}"]}\t"
  }
  line << "#{stats['new_votes_of_type_Accept']}\t"
  line << "#{stats['new_votes_of_type_Bookmark']}\t"
  line << "#{stats['new_votes_of_type_Downvote']}\t"
  line << "#{stats['new_votes_of_type_Upvote']}\t"
  line << "#{stats['new_posts']}\t#{stats['new_votes']}\t#{stats['new_users']}\t"
  line << "#{stats['posters'].join(',')}\t#{stats['poster_ages'].join(',')}\t#{stats['root_post_ages'].join(',')}\t#{stats['vote_post_ages'].join(',')}\t#{stats['biostarbabies'].join(',')}\t"

  log.puts line
end
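The core of `.minehistory` is a pair of tallies: new posts per type ID and new votes per vote type, each defaulting to zero when the API response lacks the corresponding key. The sketch below shows that counting on its own, using `Hash.new(0)` instead of the pre-initialised string keys of the original. `tally` and the sample payload are hypothetical; the payload only imitates the fields that `.minehistory` reads and contains made-up values, not real API output.

```ruby
# Hypothetical sample shaped like the fields .minehistory reads; the values
# are made up for illustration and do not come from the Biostar API.
sample = {
  'x_new_posts' => [
    { 'id' => 101, 'root_id' => 101, 'author_id' => 7, 'type_id' => 1 },
    { 'id' => 102, 'root_id' => 101, 'author_id' => 9, 'type_id' => 2 }
  ],
  'x_new_votes' => [
    { 'post_id' => 101, 'type' => 'Upvote' },
    { 'post_id' => 102, 'type' => 'Upvote' },
    { 'post_id' => 101, 'type' => 'Bookmark' }
  ]
}

# Counts posts per type ID and votes per vote type. Using fetch with a
# default covers responses in which a key is missing.
def tally(stats)
  posts = Hash.new(0)
  votes = Hash.new(0)
  stats.fetch('x_new_posts', []).each { |post| posts[post['type_id']] += 1 }
  stats.fetch('x_new_votes', []).each { |vote| votes[vote['type']] += 1 }
  { posts: posts, votes: votes }
end

result = tally(sample)
p result[:posts]
p result[:votes]
```

Unlike the original, this variant reports only the categories that actually occur, so a caller that needs all 16 columns would still have to iterate over `1..16` when writing the output line.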