Class: ScraperUtils::DataQualityMonitor

Inherits:
Object
Defined in:
lib/scraper_utils/data_quality_monitor.rb

Overview

Monitors data quality during scraping by tracking successful versus failed record processing. Automatically raises an exception if the error rate exceeds a threshold.

Class Attribute Details

.stats ⇒ Object (readonly)

Returns the value of attribute stats.



# File 'lib/scraper_utils/data_quality_monitor.rb', line 10

def stats
  @stats
end

Class Method Details

.extract_authority(record) ⇒ Object

Extracts the authority label and ensures stats are set up for the record



# File 'lib/scraper_utils/data_quality_monitor.rb', line 22

def self.extract_authority(record)
  authority_label = (record&.key?("authority_label") ? record["authority_label"] : "").to_sym
  @stats ||= {}
  @stats[authority_label] ||= { saved: 0, unprocessed: 0 }
  authority_label
end
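
The label derivation can be exercised in isolation. The following is a minimal sketch rather than the library's own code path: `label_for` is a hypothetical helper that copies only the first line of `extract_authority`, so it runs without loading `scraper_utils`.

```ruby
# Hypothetical stand-alone copy of the label expression used by extract_authority.
def label_for(record)
  (record&.key?("authority_label") ? record["authority_label"] : "").to_sym
end

label_for({ "authority_label" => "example_council" }) # a Symbol built from the stored label
label_for({ "address" => "1 Example St" })             # no label key, so the empty Symbol :""
label_for(nil)                                         # also :"" because of the safe navigation operator
label_for({ authority_label: "example_council" })      # :"" as well: only the String key is checked
```

Records without a String `"authority_label"` key therefore share a single stats bucket keyed by the empty Symbol.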

.log_saved_record(record) ⇒ void

This method returns an undefined value.

Logs a successfully saved record

Parameters:

  • record (Hash)

    The record that was saved



# File 'lib/scraper_utils/data_quality_monitor.rb', line 64

def self.log_saved_record(record)
  authority_label = extract_authority(record)
  @stats[authority_label][:saved] += 1
  ScraperUtils::LogUtils.log "Saving record #{authority_label&.empty? ? '' : "for #{authority_label}: "}#{record['council_reference']} - #{record['address']}"
end

.log_unprocessable_record(exception, record) ⇒ void

This method returns an undefined value.

Logs an unprocessable record and raises an exception if the error threshold is exceeded. By default the threshold is 5.01 plus 10% of saved records.

Parameters:

  • exception (Exception)

    The exception that caused the record to be unprocessable

  • record (Hash, nil)

    The record that couldn’t be processed

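Raises:

  • (ScraperUtils::UnprocessableSite)

    If the number of unprocessable records for the authority exceeds the threshold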


# File 'lib/scraper_utils/data_quality_monitor.rb', line 44

def self.log_unprocessable_record(exception, record)
  authority_label = extract_authority(record)
  @stats[authority_label][:unprocessed] += 1
  details = if record&.key?('council_reference') && record&.key?('address')
              "#{record['council_reference']} - #{record['address']}"
            else
              record.inspect
            end
  ScraperUtils::LogUtils.log "Erroneous record #{details}: #{exception}"
  return unless @stats[authority_label][:unprocessed] > threshold(authority_label)

  raise ScraperUtils::UnprocessableSite,
        "Too many unprocessable_records for #{authority_label}: " \
        "#{@stats[authority_label].inspect} - aborting processing of site!"
end
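
To illustrate when the abort fires, here is a small self-contained simulation. It is a sketch under stated assumptions: `SketchMonitor` and `TooManyErrors` are hypothetical stand-ins (the latter for `ScraperUtils::UnprocessableSite`), the default base and percentage are hard-coded, and logging and per-authority bookkeeping are left out.

```ruby
# Hypothetical stand-in for the monitor: only the counters and the threshold check,
# with the default base (5.01) and percentage (10%) hard-coded.
class SketchMonitor
  class TooManyErrors < StandardError; end # stands in for ScraperUtils::UnprocessableSite

  def initialize
    @saved = 0
    @unprocessed = 0
  end

  def saved!
    @saved += 1
  end

  def unprocessable!
    @unprocessed += 1
    return unless @unprocessed > 5.01 + (@saved * 0.10)

    raise TooManyErrors, "saved: #{@saved}, unprocessed: #{@unprocessed}"
  end
end

monitor = SketchMonitor.new
20.times { monitor.saved! }        # threshold is now 5.01 + 2.0 = 7.01
7.times { monitor.unprocessable! } # 7 is not greater than 7.01, so processing continues

aborted_on_eighth = begin
  monitor.unprocessable!           # 8 > 7.01, so this call raises
  false
rescue SketchMonitor::TooManyErrors
  true
end
```

Because the check runs after every failed record, the abort happens on the first failure that pushes the count past the threshold, not at the end of the run.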

.start_authority(authority_label) ⇒ Object

Notes the start of processing an authority and clears any previous stats

Parameters:

  • authority_label (Symbol)

    The authority we are processing



# File 'lib/scraper_utils/data_quality_monitor.rb', line 16

def self.start_authority(authority_label)
  @stats ||= {}
  @stats[authority_label] = { saved: 0, unprocessed: 0 }
end

.threshold(authority_label) ⇒ Object

Threshold for unprocessable records. Initial base of 5.01 (override using MORPH_UNPROCESSABLE_BASE). Initial percentage of 10% of saved records (override using MORPH_UNPROCESSABLE_PERCENTAGE). Returns nil if no stats exist for the authority.



# File 'lib/scraper_utils/data_quality_monitor.rb', line 32

def self.threshold(authority_label)
  ENV.fetch('MORPH_UNPROCESSABLE_BASE', 5.01).to_f +
    (@stats[authority_label][:saved].to_i * ENV.fetch('MORPH_UNPROCESSABLE_PERCENTAGE', 10.0).to_f / 100.0) if @stats&.fetch(authority_label, nil)
end
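
A worked example of the threshold arithmetic may help. This is a minimal sketch: `unprocessable_threshold` is a hypothetical stand-alone function that takes the saved count directly and accepts any Hash-like `env` in place of `ENV`, so the overrides can be tried without touching the real environment.

```ruby
# Hypothetical stand-alone version of the threshold arithmetic.
# `env` defaults to ENV so the same override variables apply.
def unprocessable_threshold(saved, env = ENV)
  base = env.fetch("MORPH_UNPROCESSABLE_BASE", 5.01).to_f
  percentage = env.fetch("MORPH_UNPROCESSABLE_PERCENTAGE", 10.0).to_f
  base + (saved.to_i * percentage / 100.0)
end

defaults = {} # an empty Hash stands in for an environment with no overrides
unprocessable_threshold(0, defaults)   # 5.01, so the sixth bad record exceeds it
unprocessable_threshold(100, defaults) # about 15.01, so the sixteenth bad record exceeds it

overrides = { "MORPH_UNPROCESSABLE_BASE" => "2", "MORPH_UNPROCESSABLE_PERCENTAGE" => "50" }
unprocessable_threshold(10, overrides) # 2.0 + 10 * 50 / 100 = 7.0
```

Environment values arrive as Strings, which is why both lookups are passed through `to_f`.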