CSV Humanitarian eXchange Language (HXL) Parser / Reader

csvhuman library / gem - read tabular data in the CSV Humanitarian eXchange Language (HXL) format, that is, comma-separated values (CSV) line-by-line records with a hashtag (meta data) line using the Humanitarian eXchange Language (HXL) rules

What's Humanitarian eXchange Language (HXL)?

Humanitarian eXchange Language (HXL) is a (meta data) convention for adding agreed on hashtags e.g. #org,#country,#sex+#targeted,#adm1 inline in a (single new line / row) between the last header row and the first data row for sharing tabular data across organisations (during a humanitarian crisis). Example:

What,,,Who,Where,For whom,
Record,Sector/Cluster,Subsector,Organisation,Country,Males,Females,Subregion
,#sector+en,#subsector,#org,#country,#sex+#targeted,#sex+#targeted,#adm1
001,WASH,Subsector 1,Org 1,Country 1,100,100,Region 1
002,Health,Subsector 2,Org 2,Country 2,,,Region 2
003,Education,Subsector 3,Org 3,Country 2,250,300,Region 3
004,WASH,Subsector 4,Org 1,Country 3,80,95,Region 4

Usage

Pass in an array of arrays (or a stream responding to #each with an array of strings). Example:

pp CsvHuman.parse( [["Organisation", "Cluster", "Province" ], ## or use HXL.parse
                    [ "#org", "#sector", "#adm1" ],
                    [ "Org A", "WASH", "Coastal Province" ],
                    [ "Org B", "Health", "Mountain Province" ],
                    [ "Org C", "Education", "Coastal Province" ],
                    [ "Org A", "WASH", "Plains Province" ]]

resulting in:

[{"org" => "Org A", "sector" => "WASH",      "adm1" => "Coastal Province"},
 {"org" => "Org B", "sector" => "Health",    "adm1" => "Mountain Province"},
 {"org" => "Org C", "sector" => "Education", "adm1" => "Coastal Province"},
 {"org" => "Org A", "sector" => "WASH",      "adm1" => "Plains Province"}]

Or pass in the text. Example:

pp CsvHuman.parse( <<TXT )      ## or use HXL.parse
  What,,,Who,Where,For whom,
  Record,Sector/Cluster,Subsector,Organisation,Country,Males,Females,Subregion
  ,#sector+en,#subsector,#org,#country,#sex+#targeted,#sex+#targeted,#adm1
  001,WASH,Subsector 1,Org 1,Country 1,100,100,Region 1
  002,Health,Subsector 2,Org 2,Country 2,,,Region 2
  003,Education,Subsector 3,Org 3,Country 2,250,300,Region 3
  004,WASH,Subsector 4,Org 1,Country 3,80,95,Region 4
TXT

resulting in:

[{"sector+en"    => "WASH",
  "subsector"    => "Subsector 1",
  "org"          => "Org 1",
  "country"      => "Country 1",
  "sex+targeted" => [100, 100],
  "adm1"         => "Region 1"},
 {"sector+en"    => "Health",
  "subsector"    => "Subsector 2",
  "org"          => "Org 2",
  "country"      => "Country 2",
  "sex+targeted" => [nil, nil],
  "adm1"         => "Region 2"},
 {"sector+en"    => "Education",
  "subsector"    => "Subsector 3",
  "org"          => "Org 3",
  "country"      => "Country 2",
  "sex+targeted" => [250, 300],
  "adm1"         => "Region 3"},
 {"sector+en"    => "WASH",
  "subsector"    => "Subsector 4",
  "org"          => "Org 1",
  "country"      => "Country 3",
  "sex+targeted" => [80, 95],
  "adm1"         => "Region 4"}]

What about Enumerable?

Yes, every reader includes Enumerable and runs on each. Use new or open without a block to get the enumerator (iterator). Example:

csv = CsvHuman.new( <<TXT )      ## or use HXL.new
  What,,,Who,Where,For whom,
  Record,Sector/Cluster,Subsector,Organisation,Country,Males,Females,Subregion
  ,#sector+en,#subsector,#org,#country,#sex+#targeted,#sex+#targeted,#adm1
  001,WASH,Subsector 1,Org 1,Country 1,100,100,Region 1
  002,Health,Subsector 2,Org 2,Country 2,,,Region 2
  003,Education,Subsector 3,Org 3,Country 2,250,300,Region 3
  004,WASH,Subsector 4,Org 1,Country 3,80,95,Region 4
TXT )
it  = csv.to_enum
pp it.next  
# => {"sector+en"    => "WASH",
#     "subsector"    => "Subsector 1",
#     "org"          => "Org 1",
#     "country"      => "Country 1",
#     "sex+targeted" => [100, 100],
#     "adm1"         => "Region 1"}


# -or-

csv = CsvHuman.open( "./test.csv" )     # or use HXL.open
it  = csv.to_enum
pp it.next
# => {"sector+en"    => "WASH",
#     "subsector"    => "Subsector 1",
#     "org"          => "Org 1",
#     "country"      => "Country 1",
#     "sex+targeted" => [100, 100],
#     "adm1"         => "Region 1"}
pp it.next
# => {"sector+en"    => "Health",
#     "subsector"    => "Subsector 2",
#     "org"          => "Org 2",
#     "country"      => "Country 2",
#     "sex+targeted" => [nil, nil],
#     "adm1"         => "Region 2"}

More Ways to Use

csv = CsvHuman.new( recs )
csv.each do |rec|
  pp rec
end


CsvHuman.parse( recs ).each do |rec|
  pp rec
end


pp CsvHuman.read( "./test.csv" )

CsvHuman.foreach( "./test.csv" ) do |rec|
  pp rec
end

#...

or use the HXL alias:

hxl = HXL.new( recs )
hxl.each do |rec|
  pp rec
end


HXL.parse( recs ).each do |rec|
  pp rec
end


pp HXL.read( "./test.csv" )

HXL.foreach( "./test.csv" ) do |rec|
  pp rec
end

#...

Note: More aliases for CsvHuman, HXL? Yes, you can use CsvHum, CSV_HXL, CSVHXL too.

What about symbol keys for hashes?

Yes, you can use the header_converter keyword option. Use :symbol for (auto-)converting header tags (strings) to symbols. Note: the symbol converter will remove all hashtags (#) and spaces and will change the plus (+) to underscore (_) and remove all non-alphanumeric (e.g. !?$%) chars.

Example:

txt =<<TXT
What,,,Who,Where,For whom,
Record,Sector/Cluster,Subsector,Organisation,Country,Males,Females,Subregion
,#sector+en,#subsector,#org,#country,#sex+#targeted,#sex+#targeted,#adm1
001,WASH,Subsector 1,Org 1,Country 1,100,100,Region 1
002,Health,Subsector 2,Org 2,Country 2,,,Region 2
003,Education,Subsector 3,Org 3,Country 2,250,300,Region 3
004,WASH,Subsector 4,Org 1,Country 3,80,95,Region 4
TXT

pp CsvHuman.parse( txt, :header_converter => :symbol )      ## or use HXL.parse

# -or-

options = { :header_converter => :symbol }
pp CsvHuman.parse( txt, options )  

resulting in:

[{:sector_en    => "WASH",
  :subsector    => "Subsector 1",
  :org          => "Org 1",
  :country      => "Country 1",
  :sex_targeted => [100, 100],
  :adm1         => "Region 1"},
 # ...
 {:sector_en    => "WASH",
  :subsector    => "Subsector 4",
  :org          => "Org 1",
  :country      => "Country 3",
  :sex_targeted => [80, 95],
  :adm1         => "Region 4"}]

Built-in header converters include:

Converter Comments
:none string key; uses "normalized" tag e.g. "#adm1 +code"
:default string key; strips hashtags and spaces e.g. "admin+code"
:symbol symbol key; strips hashtags and spaces and converts plus (+) to underscore (_) and removes all non-alphanumerics e.g. :admin_code

Or add your own converters. Example:

pp CsvHuman.parse( txt, header_converter: ->(h) { h.upcase } )

resulting in:

[{"#SECTOR +EN"    => "WASH",
  "#SUBSECTOR"     => "Subsector 1",
  "#ORG"           => "Org 1",
  "#COUNTRY"       => "Country 1",
  "#SEX +TARGETED" => [100, 100],
  "#ADM1"          => "Region 1"},
 # ...
]

A custom header converter is a method that gets the (normalized) header tag passed in (e.g. #sector +en) as a string and returns a string or symbol to use for the hash key in records.

Tag Helpers

Normalize. Use CsvHuman::Tag.normalize to pretty print or normalize a tag. All parts get downcased (lowercased), all attributes sorted by a-to-z, all extra or missing hashtags or pluses added or removed, all extra or missing spaces added or removed. Example:

HXL::Tag.normalize( "#sector+en" )
# => "#sector +en"
HXL::Tag.normalize( "#SECTOR EN" )
# => "#sector +en"
HXL::Tag.normalize( "# SECTOR  + #EN " )
# => "#sector +en"
HXL::Tag.normalize( "SECTOR EN" )
# => "#sector +en"
# ...

Split. Use CsvHuman::Tag.split to split (and normalize) a tag into its parts. Example:

HXL::Tag.split( "#sector+en" )
# => ["sector", "en"]
HXL::Tag.split( "#SECTOR EN" )
# => ["sector", "en"]
HXL::Tag.split( "# SECTOR  + #EN " )
# => ["sector", "en"]
HXL::Tag.split( "SECTOR EN" )
# => ["sector", "en"]

## sort attributes a-to-z
HXL::Tag.split( "#affected +f +children" )
# => ["affected", "children", "f"]
HXL::Tag.split( "#population +children +affected +m" )
# => ["population", "affected", "children", "m"]
HXL::Tag.split( "#population+children+affected+m" )
# => ["population", "affected", "children", "m"]
HXL::Tag.split( "#population+#children+#affected+#m" )
# => ["population", "affected", "children", "m"]
HXL::Tag.split( "#population #children #affected #m" )
# => ["population", "affected", "children", "m"]
HXL::Tag.split( "POPULATION CHILDREN AFFECTED M" )
# => ["population", "affected", "children", "m"]
#...

Frequently Asked Questions (FAQ) and Answers

Q: How to deal with un-tagged fields?

A: Un-tagged fields get skipped / ignored.

Q: How to deal with duplicate / repeated fields (e.g. #sex+#targeted,#sex+#targeted)?

A: Repeated fields (auto-magically) get turned into an array / list.

License

The csvhuman scripts are dedicated to the public domain. Use it as you please with no restrictions whatsoever.

Questions? Comments?

Send them along to the wwwmake forum. Thanks!