A Ruby DSL for building long, complex regular expressions that are used frequently across large data extraction projects.
Take the simplest possible example. Given the following text:
t = "$10.99 (555) 444-3322 Boston, MA"
Instead of writing a regular expression like the following:
data = t.match(/\$(\d+\.\d+)\s*(\(\d{3}\)\s\d{3}-\d{4})\s*(\w+),\s(\w+)/)
and then retrieving the data with match result indices:
data[1] => "10.99"
data[2] => "(555) 444-3322"
data[3] => "Boston"
data[4] => "MA"
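For reference, the plain-Ruby approach can be run as written. This is a small sketch using only the standard String#match and MatchData, with the pattern's escapes written out in full; it involves no Houdini code.

```ruby
t = "$10.99 (555) 444-3322 Boston, MA"

# One capture group per field, in order: amount, phone number, city, state.
pattern = /\$(\d+\.\d+)\s*(\(\d{3}\)\s\d{3}-\d{4})\s*(\w+),\s(\w+)/

data = t.match(pattern)

# Captures are only reachable by position, so the reader has to count
# parentheses to know which index holds which field.
amount, phone_number, city, state = data.captures
puts amount, phone_number, city, state
```

Adding or reordering a group shifts every index after it, which is the maintenance cost that named pieces are meant to avoid.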
We can use Houdini to build the regular expression in easy-to-read, reusable pieces.
Define a regular expression for use across your project:
Houdini.define(:word, /\w+/)
Call hmatch or hscan on a string and build your expression pieces with the r() method, either defining the expression inline or using a predefined expression. Then build your match results with the m() method, referring to each piece by name:
data = t.hmatch do
r(/\d+[,\d]*\.\d+/, "amount")
r(/\(\d{3}\)\s+\d{3}-\d{4}/, "phone_number")
r(:word, "city", "state")
m("\\$(amount) (phone_number) (city), (state)")
end
Now you can access the data as methods on the resulting object:
data.amount => "10.99"
data.phone_number => "(555) 444-3322"
data.city => "Boston"
data.state => "MA"
or access the original MatchData object:
data.match => #<MatchData "$10.99 (555) 444-3322 Boston, MA" 1:"10.99" 2:"(555) 444-3322" 3:"Boston" 4:"MA">
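One way to picture what the block above does is as template expansion: each "(name)" placeholder is swapped for a named capture group built from a reusable piece. The sketch below uses plain Ruby only and is not Houdini's implementation; build_pattern and the pieces hash are hypothetical names introduced purely for illustration.

```ruby
# Hypothetical helper, not part of Houdini: expands "(name)" placeholders
# in a template into named capture groups built from reusable pieces.
def build_pattern(template, pieces)
  source = template.gsub(/\((\w+)\)/) do
    name = Regexp.last_match(1)
    # fetch raises KeyError if the template names an undefined piece.
    "(?<#{name}>#{pieces.fetch(name).source})"
  end
  Regexp.new(source)
end

# A fragment shared by two fields, similar in spirit to a predefined :word.
WORD = /\w+/

pieces = {
  "amount"       => /\d+[,\d]*\.\d+/,
  "phone_number" => /\(\d{3}\)\s+\d{3}-\d{4}/,
  "city"         => WORD,
  "state"        => WORD
}

# Single quotes keep the backslash, so the dollar sign is matched literally.
pattern = build_pattern('\$(amount) (phone_number) (city), (state)', pieces)
data = "$10.99 (555) 444-3322 Boston, MA".match(pattern)

# Named captures are read by name instead of by position.
puts data[:amount], data[:phone_number], data[:city], data[:state]
```

Reading fields by name means a piece can be added or moved in the template without renumbering the results.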