errata

Define an errata in table format (CSV) and then apply it to an arbitrary source. Inspired by RFC Errata, it lets you keep your own errata in a transparent way.

Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

Real-world usage

We use errata for data science at Brighter Planet and in production.

The killer combination:

  1. active_record_inline_schema - define table structure
  2. remote_table - download data and parse it
  3. errata (this library!) - apply corrections in a transparent way
  4. data_miner - import data idempotently

Inspiration

There's a process for reporting errata on RFCs through the RFC Editor.

Example

Every errata has a table structure based on the IETF RFC Editor's "How to Report Errata".

| date | name | email | type | section | action | x | y | condition | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2011-03-22 | Ian Hough | [email protected] | meta | Intended use | | http://example.com/original-data-with-errors.xls | | | A hypothetical document that uses non-ISO country names |
| 2011-03-22 | Ian Hough | [email protected] | technical | Country Name | replace | /ANTIGUA & BARBUDA/ | ANTIGUA AND BARBUDA | | |
| 2011-03-22 | Ian Hough | [email protected] | technical | Country Name | replace | /BOLIVIA/ | BOLIVIA, PLURINATIONAL STATE OF | | |
| 2011-03-22 | Ian Hough | [email protected] | technical | Country Name | replace | /BOSNIA & HERZEGOVINA/ | BOSNIA AND HERZEGOVINA | | |
| 2011-03-22 | Ian Hough | [email protected] | technical | Country Name | replace | /BRITISH VIRGIN ISLANDS/ | VIRGIN ISLANDS, BRITISH | | |
| 2011-03-22 | Ian Hough | [email protected] | technical | Country Name | replace | /COTE D'IVOIRE/ | CÔTE D'IVOIRE | | |
| 2011-03-22 | Ian Hough | [email protected] | technical | Country Name | replace | /DEM\. PEOPLE'S REP\. OF KOREA/ | KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF | | |
| 2011-03-22 | Ian Hough | [email protected] | technical | Country Name | replace | /DEM\. REP\. OF THE CONGO/ | CONGO, THE DEMOCRATIC REPUBLIC OF THE | | |
| 2011-03-22 | Ian Hough | [email protected] | technical | Country Name | replace | /HONG KONG SAR/ | HONG KONG | | |
| 2011-03-22 | Ian Hough | [email protected] | technical | Country Name | replace | /IRAN \(ISLAMIC REPUBLIC OF\)/ | IRAN, ISLAMIC REPUBLIC OF | | |

Which would be saved as a CSV:

date,name,email,type,section,action,x,y,condition,notes
2011-03-22,Ian Hough,[email protected],meta,Intended use,,http://example.com/original-data-with-errors.xls,,,A hypothetical document that uses non-ISO country names
2011-03-22,Ian Hough,[email protected],technical,Country Name,replace,/ANTIGUA & BARBUDA/,ANTIGUA AND BARBUDA,,
2011-03-22,Ian Hough,[email protected],technical,Country Name,replace,/BOLIVIA/,"BOLIVIA, PLURINATIONAL STATE OF",,
2011-03-22,Ian Hough,[email protected],technical,Country Name,replace,/BOSNIA & HERZEGOVINA/,BOSNIA AND HERZEGOVINA,,
2011-03-22,Ian Hough,[email protected],technical,Country Name,replace,/BRITISH VIRGIN ISLANDS/,"VIRGIN ISLANDS, BRITISH",,
2011-03-22,Ian Hough,[email protected],technical,Country Name,replace,/COTE D'IVOIRE/,CÔTE D'IVOIRE,,
2011-03-22,Ian Hough,[email protected],technical,Country Name,replace,/DEM\.  PEOPLE'S REP\. OF KOREA/,"KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF",,
2011-03-22,Ian Hough,[email protected],technical,Country Name,replace,/DEM\. REP\. OF THE CONGO/,"CONGO, THE DEMOCRATIC REPUBLIC OF THE",,
2011-03-22,Ian Hough,[email protected],technical,Country Name,replace,/HONG KONG SAR/,HONG KONG,,
2011-03-22,Ian Hough,[email protected],technical,Country Name,replace,/IRAN \(ISLAMIC REPUBLIC OF\)/,"IRAN, ISLAMIC REPUBLIC OF",,

And then used like this:

require 'errata'
require 'remote_table'

errata = Errata.new(:url => 'http://example.com/errata.csv')
original = RemoteTable.new(:url => 'http://example.com/original-data-with-errors.xls')
original.each do |row|
  errata.correct! row # destructively correct each row in place
end
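
To make the mechanics concrete, here is a minimal sketch in plain Ruby (stdlib only) of how `replace` rows like the ones above could be applied to a row hash. It is not the library's implementation: `load_corrections`, `apply_corrections!`, the `ERRATA_CSV` constant, and the `ian@example.com` address are hypothetical names introduced for illustration. It also assumes that `meta` rows are skipped and that `x` always holds a regular expression wrapped in slashes.

```ruby
require 'csv'

# Hypothetical, shortened errata file; the email address is a placeholder.
ERRATA_CSV = <<'CSV'
date,name,email,type,section,action,x,y,condition,notes
2011-03-22,Ian Hough,ian@example.com,meta,Intended use,,http://example.com/original-data-with-errors.xls,,,A hypothetical document that uses non-ISO country names
2011-03-22,Ian Hough,ian@example.com,technical,Country Name,replace,/ANTIGUA & BARBUDA/,ANTIGUA AND BARBUDA,,
2011-03-22,Ian Hough,ian@example.com,technical,Country Name,replace,/BOLIVIA/,"BOLIVIA, PLURINATIONAL STATE OF",,
2011-03-22,Ian Hough,ian@example.com,technical,Country Name,replace,/IRAN \(ISLAMIC REPUBLIC OF\)/,"IRAN, ISLAMIC REPUBLIC OF",,
CSV

# Hypothetical helper: keep only technical "replace" rows and compile x into a Regexp.
def load_corrections(csv_text)
  corrections = []
  CSV.parse(csv_text, :headers => true).each do |entry|
    next unless entry['type'] == 'technical' && entry['action'] == 'replace'
    corrections << {
      :section     => entry['section'],
      :pattern     => Regexp.new(entry['x'][1..-2]), # strip the surrounding slashes
      :replacement => entry['y']
    }
  end
  corrections
end

# Hypothetical helper: rewrite matching values in the row hash in place.
def apply_corrections!(row, corrections)
  corrections.each do |correction|
    value = row[correction[:section]]
    next if value.nil?
    row[correction[:section]] = value.sub(correction[:pattern], correction[:replacement])
  end
  row
end

corrections = load_corrections(ERRATA_CSV)
row = { 'Country Name' => 'IRAN (ISLAMIC REPUBLIC OF)' }
apply_corrections!(row, corrections)
puts row['Country Name'] # prints "IRAN, ISLAMIC REPUBLIC OF"
```

This sketch substitutes only the matched text, so running it again over an already corrected value such as `BOLIVIA, PLURINATIONAL STATE OF` would match `/BOLIVIA/` a second time. Anchoring the pattern (for example `/^BOLIVIA$/`) is one way to avoid that.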

UTF-8

errata assumes all input strings are UTF-8. Otherwise, Ruby 1.9 can run into problems with Regexp::FIXEDENCODING: ASCII-8BIT regexps might be applied to UTF-8 strings (or vice versa), raising Encoding::CompatibilityError.
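
The failure mode can be reproduced in a few lines of plain Ruby. This is only a sketch: the constants and the `as_utf8` helper are hypothetical names, not part of errata, and retagging with `force_encoding` is appropriate only when the bytes really are UTF-8.

```ruby
# A regexp containing non-ASCII characters is pinned to its encoding
# (Regexp::FIXEDENCODING). "\u00D4" is "Ô", so this pattern is UTF-8.
UTF8_PATTERN = Regexp.new("C\u00D4TE D'IVOIRE")

# The same bytes tagged as ASCII-8BIT, as if read from a file opened in binary mode.
BINARY_VALUE = "C\u00D4TE D'IVOIRE".dup.force_encoding(Encoding::ASCII_8BIT)

# Hypothetical helper: retag a string as UTF-8 without changing its bytes.
def as_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8)
end

begin
  UTF8_PATTERN =~ BINARY_VALUE
rescue Encoding::CompatibilityError => e
  puts "mismatch: #{e.message}"
end

# Once both sides agree on UTF-8, the match succeeds at index 0.
puts(UTF8_PATTERN =~ as_utf8(BINARY_VALUE))
```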

More advanced usage

The earth library has dozens of real-life examples showing errata in action:

| Model | Reference | Errata file |
| --- | --- | --- |
| Country | data_miner.rb | wri_errata.csv |
| Aircraft | data_miner.rb | faa_errata.csv |
| Airports | data_miner.rb | openflights_errata.csv |
| Automobile model variants | data_miner.rb | feg_errata.csv |

Authors

Copyright (c) 2012 Brighter Planet. See LICENSE for details.