Module: ScraperUtils::DbUtils

Defined in:
lib/scraper_utils/db_utils.rb

Overview

Utilities for database operations in scrapers

Class Method Summary

  • .cleanup_old_records ⇒ Object
  • .collect_saves! ⇒ Object
  • .collected_saves ⇒ Array<Array>
  • .save_immediately! ⇒ Object
  • .save_record(record) ⇒ void

Class Method Details

.cleanup_old_records ⇒ Object

Delete records older than 30 days and, roughly once a month, VACUUM the database to reclaim space



# File 'lib/scraper_utils/db_utils.rb', line 61

def self.cleanup_old_records
  cutoff_date = (Date.today - 30).to_s
  vacuum_cutoff_date = (Date.today - 35).to_s

  stats = ScraperWiki.sqliteexecute(
    "SELECT COUNT(*) as count, MIN(date_scraped) as oldest FROM data WHERE date_scraped < ?",
    [cutoff_date]
  ).first

  deleted_count = stats["count"]
  oldest_date = stats["oldest"]

  return unless deleted_count.positive? || ENV["VACUUM"]

  LogUtils.log "Deleting #{deleted_count} applications scraped between #{oldest_date} and #{cutoff_date}"
  ScraperWiki.sqliteexecute("DELETE FROM data WHERE date_scraped < ?", [cutoff_date])

  return unless rand < 0.03 || (oldest_date && oldest_date < vacuum_cutoff_date) || ENV["VACUUM"]

  LogUtils.log "  Running VACUUM to reclaim space..."
  ScraperWiki.sqliteexecute("VACUUM")
end
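The cutoff arithmetic above can be checked in isolation. A minimal sketch (stdlib only, no ScraperWiki connection; variable names mirror the method) of why the textual comparison `date_scraped < ?` works:

```ruby
require "date"

# Mirror the cutoff calculation from cleanup_old_records: records whose
# date_scraped sorts below this ISO-8601 string are older than 30 days.
cutoff_date = (Date.today - 30).to_s         # e.g. "2024-05-01"
vacuum_cutoff_date = (Date.today - 35).to_s  # even older rows also trigger VACUUM

# ISO-8601 date strings sort the same way as the dates themselves,
# so SQLite's text comparison `date_scraped < ?` is date-correct.
old_row = (Date.today - 40).to_s
fresh_row = Date.today.to_s

puts old_row < cutoff_date    # true  -> would be deleted
puts fresh_row < cutoff_date  # false -> kept
```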

.collect_saves! ⇒ Object

Enable in-memory collection mode instead of saving to SQLite



# File 'lib/scraper_utils/db_utils.rb', line 9

def self.collect_saves!
  @collected_saves = []
end

.collected_saves ⇒ Array<Array>

Get all collected save calls



# File 'lib/scraper_utils/db_utils.rb', line 20

def self.collected_saves
  @collected_saves
end

.save_immediately! ⇒ Object

Save records straight to disk rather than collecting them in memory



# File 'lib/scraper_utils/db_utils.rb', line 14

def self.save_immediately!
  @collected_saves = nil
end
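Taken together, `collect_saves!`, `save_immediately!`, and `collected_saves` form a simple toggle between buffering and persisting. A self-contained sketch of the same pattern, using a hypothetical `CollectingSaver` stand-in rather than the gem's actual module (the real `save_record` delegates to `ScraperWiki.save_sqlite`):

```ruby
# Hypothetical stand-in illustrating the collect/flush toggle in DbUtils.
class CollectingSaver
  def self.collect_saves!
    @collected_saves = []   # enable in-memory collection
  end

  def self.save_immediately!
    @collected_saves = nil  # disable collection; persist directly
  end

  def self.collected_saves
    @collected_saves
  end

  def self.save_record(record)
    if @collected_saves
      @collected_saves << record  # buffer in memory
    else
      persist(record)             # the real module would hit SQLite here
    end
  end

  def self.persist(record)
    (@persisted ||= []) << record
  end
end

CollectingSaver.collect_saves!
CollectingSaver.save_record("council_reference" => "DA/2024/001")
p CollectingSaver.collected_saves.size  # 1

CollectingSaver.save_immediately!
p CollectingSaver.collected_saves       # nil
```

Collection mode is useful in tests, where you want to assert on what a scraper would have saved without touching the database.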

.save_record(record) ⇒ void

This method returns an undefined value.

Saves a record to the SQLite database with validation and logging

Raises:

  • (ScraperUtils::UnprocessableRecord) if the record fails validation



# File 'lib/scraper_utils/db_utils.rb', line 29

def self.save_record(record)
  # Validate required fields
  required_fields = %w[council_reference address description info_url date_scraped]
  required_fields.each do |field|
    if record[field].to_s.empty?
      raise ScraperUtils::UnprocessableRecord, "Missing required field: #{field}"
    end
  end

  # Validate date formats
  %w[date_scraped date_received on_notice_from on_notice_to].each do |date_field|
    Date.parse(record[date_field]) unless record[date_field].to_s.empty?
  rescue ArgumentError
    raise ScraperUtils::UnprocessableRecord,
          "Invalid date format for #{date_field}: #{record[date_field].inspect}"
  end

  # Determine primary key based on presence of authority_label
  primary_key = if record.key?("authority_label")
                  %w[authority_label council_reference]
                else
                  ["council_reference"]
                end
  if @collected_saves
    @collected_saves << record
  else
    ScraperWiki.save_sqlite(primary_key, record)
    ScraperUtils::DataQualityMonitor.log_saved_record(record)
  end
end
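The date validation step can be exercised on its own. A hedged sketch of the same check using only the stdlib; `validate_date_field` is an illustrative helper, not part of the gem, and the exception class is simplified to `ArgumentError` in place of `ScraperUtils::UnprocessableRecord`:

```ruby
require "date"

# Sketch of save_record's per-field date validation: blank values pass
# (optional dates), parseable dates pass, anything else raises with a
# field-specific message.
def validate_date_field(record, date_field)
  value = record[date_field]
  return if value.to_s.empty?  # blank optional dates are allowed

  Date.parse(value)
rescue ArgumentError
  raise ArgumentError, "Invalid date format for #{date_field}: #{value.inspect}"
end

record = {
  "date_scraped"   => "2024-06-01",
  "on_notice_from" => "",
  "on_notice_to"   => "not-a-date"
}

validate_date_field(record, "date_scraped")   # ok: valid ISO date
validate_date_field(record, "on_notice_from") # ok: blank is skipped
begin
  validate_date_field(record, "on_notice_to")
rescue ArgumentError => e
  puts e.message  # Invalid date format for on_notice_to: "not-a-date"
end
```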