csvpack
work with tabular data packages using comma-separated values (CSV) datafiles in text with datapackage.json; download, read into and query comma-separated values (CSV) datafiles with your SQL database (e.g. SQLite, PostgreSQL, ...) of choice and much more
- home :: github.com/csv11/csvpack
- bugs :: github.com/csv11/csvpack/issues
- gem :: rubygems.org/gems/csvpack
- rdoc :: rubydoc.info/gems/csvpack
- forum :: [email protected]
Usage
What's a tabular data package?
Tabular Data Package is a simple structure for publishing and sharing tabular data with the following key features:
- Data is stored in CSV (comma separated values) files
- Metadata about the dataset both general (e.g. title, author) and the specific data files (e.g. schema) is stored in a single JSON file named
datapackage.json
which follows the Data Package format
Here's a minimal example of a tabular data package holding two files, that is, data.csv
and datapackage.json
:
data.csv
:
Brewery,City,Name,Abv
Andechser Klosterbrauerei,Andechs,Doppelbock Dunkel,7%
Augustiner Bräu München,München,Edelstoff,5.6%
Bayerische Staatsbrauerei Weihenstephan,Freising,Hefe Weissbier,5.4%
Brauerei Spezial,Bamberg,Rauchbier Märzen,5.1%
Hacker-Pschorr Bräu,München,Münchner Dunkel,5.0%
Staatliches Hofbräuhaus München,München,Hofbräu Oktoberfestbier,6.3%
...
datapackage.json
:
{
"name": "beer",
"resources": [
{
"path": "data.csv",
"schema": {
"fields": [{ "name": "Brewery", "type": "string" },
{ "name": "City", "type": "string" },
{ "name": "Name", "type": "string" },
{ "name": "Abv", "type": "number" }]
}
}
]
}
Where to find data packages?
For some real world examples see the Data Packages Listing (Sources) at the Data Hub.io • Frictionless Data Initiative website for a start. Tabular data packages include:
Name | Comments |
---|---|
country-codes |
Comprehensive country codes: ISO 3166, ITU, ISO 4217 currency codes and many more |
language-codes |
ISO Language Codes (639-1 and 693-2) |
currency-codes |
ISO 4217 Currency Codes |
gdb |
Country, Regional and World GDP (Gross Domestic Product) |
s-and-p-500-companies |
S&P 500 Companies with Financial Information |
un-locode |
UN-LOCODE Codelist |
gold-prices |
Gold Prices (Monthly in USD) |
bond-yields-uk-10y |
10 Year UK Government Bond Yields (Long-Term Interest Rate) |
and many more
Code, Code, Code - Script Your Data Workflow with Ruby
require 'csvpack'
CsvPack.import(
's-and-p-500-companies',
'gdb'
)
Using CsvPack.import
will:
1) download all data packages to the ./pack
folder
2) (auto-)add all tables to an in-memory SQLite database using SQL create_table
commands via ActiveRecord
migrations e.g.
create_table :constituents_financials do |t|
t.string :symbol # Symbol (string)
t.string :name # Name (string)
t.string :sector # Sector (string)
t.float :price # Price (number)
t.float :dividend_yield # Dividend Yield (number)
t.float :price_earnings # Price/Earnings (number)
t.float :earnings_share # Earnings/Share (number)
t.float :book_value # Book Value (number)
t.float :_52_week_low # 52 week low (number)
t.float :_52_week_high # 52 week high (number)
t.float :market_cap # Market Cap (number)
t.float :ebitda # EBITDA (number)
t.float :price_sales # Price/Sales (number)
t.float :price_book # Price/Book (number)
t.string :sec_filings # SEC Filings (string)
end
3) (auto-)import all datasets using SQL inserts e.g.
INSERT INTO constituents_financials
(symbol,
name,
sector,
price,
dividend_yield,
price_earnings,
earnings_share,
book_value,
_52_week_low,
_52_week_high,
market_cap,
ebitda,
price_sales,
price_book,
sec_filings)
VALUES
('MMM',
'3M Company',
'Industrials',
162.27,
2.11,
22.28,
7.284,
25.238,
123.61,
162.92,
104.0,
8.467,
3.28,
6.43,
'http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM')
4) (auto-)add ActiveRecord models for all tables.
So what? Now you can use all the "magic" of ActiveRecord to query the datasets. Example:
pp Constituent.count
# SELECT COUNT(*) FROM "constituents"
# => 496
pp Constituent.first
# SELECT "constituents".* FROM "constituents" ORDER BY "constituents"."id" ASC LIMIT 1
# => #<Constituent:0x9f8cb78
# id: 1,
# symbol: "MMM",
# name: "3M Company",
# sector: "Industrials">
pp Constituent.find_by!( symbol: 'MMM' )
# SELECT "constituents".*
# FROM "constituents"
# WHERE "constituents"."symbol" = "MMM"
# LIMIT 1
# => #<Constituent:0x9f8cb78
# id: 1,
# symbol: "MMM",
# name: "3M Company",
# sector: "Industrials">
pp Constituent.find_by!( name: '3M Company' )
# SELECT "constituents".*
# FROM "constituents"
# WHERE "constituents"."name" = "3M Company"
# LIMIT 1
# => #<Constituent:0x9f8cb78
# id: 1,
# symbol: "MMM",
# name: "3M Company",
# sector: "Industrials">
pp Constituent.where( sector: 'Industrials' ).count
# SELECT COUNT(*) FROM "constituents"
# WHERE "constituents"."sector" = "Industrials"
# => 63
pp Constituent.where( sector: 'Industrials' ).all
# SELECT "constituents".*
# FROM "constituents"
# WHERE "constituents"."sector" = "Industrials"
# => [#<Constituent:0x9f8cb78
# id: 1,
# symbol: "MMM",
# name: "3M Company",
# sector: "Industrials">,
# #<Constituent:0xa2a4180
# id: 8,
# symbol: "ADT",
# name: "ADT Corp (The)",
# sector: "Industrials">,...]
and so on
Frequently Asked Questions (F.A.Qs) and Answers
Q: How to dowload a data package ("by hand")?
Use the CsvPack::Downloader
class to download a data package
to your disk (by default data packages get stored in ./pack
).
dl = CsvPack::Downloader.new
dl.fetch( 'language-codes' )
dl.fetch( 's-and-p-500-companies' )
dl.fetch( 'un-locode')
Will result in:
-- pack
|-- language-codes
| |-- data
| | |-- ietf-language-tags.csv
| | |-- language-codes-3b2.csv
| | |-- language-codes-full.csv
| | `-- language-codes.csv
| `-- datapackage.json
|-- s-and-p-500-companies
| |-- data
| | `-- constituents.csv
| `-- datapackage.json
`-- un-locode
|-- data
| |-- code-list.csv
| |-- country-codes.csv
| |-- function-classifiers.csv
| |-- status-indicators.csv
| `-- subdivision-codes.csv
`-- datapackage.json
Q: How to add and import a data package ("by hand")?
Use the CsvPack::Pack
class to read-in a data package
and add and import into an SQL database.
pack = CsvPack::Pack.new( './pack/un-locode/datapackage.json' )
pack.tables.each do |table|
table.up! # (auto-) add table using SQL create_table via ActiveRecord migration
table.import! # import all records using SQL inserts
end
Q: How to connect to a different SQL database?
You can connect to any database supported by ActiveRecord. If you do NOT establish a connection in your script - the standard (default fallback) is using an in-memory SQLite3 database.
SQLite
For example, to create an SQLite3 database on disk - lets say mine.db
-
use in your script (before the CsvPack.import
statement):
ActiveRecord::Base.establish_connection( adapter: 'sqlite3',
database: './mine.db' )
PostgreSQL
For example, to connect to a PostgreSQL database use in your script
(before the CsvPack.import
statement):
require 'pg' ## pull-in PostgreSQL (pg) machinery
ActiveRecord::Base.establish_connection( adapter: 'postgresql'
username: 'ruby',
password: 'topsecret',
database: 'database' )
Install
Just install the gem:
$ gem install csvpack
Alternatives
See the "Tools and Plugins for working with Data Packages" page at the Frictionless Data Initiative.
License
The csvpack
scripts are dedicated to the public domain.
Use it as you please with no restrictions whatsoever.
Questions? Comments?
Send them along to the ruby-talk mailing list. Thanks!