Mida
Description
A Microdata parser and extractor library for Ruby. It is based on the latest published version of the Microdata Specification, dated 5th April 2011.
Installation
Mida keeps RubyGems up-to-date with its latest version, so installing is as easy as:
gem install mida
Requirements:
- Nokogiri
Command Line Usage
To use the command line tool, supply it with the URLs or filenames that you would like to be parsed (by default each item is output as YAML):
mida http://lawrencewoodman.github.io/mida/news/
If you want to search for specific types you can use the -t switch followed by a regular expression:
mida -t /person/i http://lawrencewoodman.github.io/mida/news/
For more information look at mida's help:
mida -h
Library Usage
The following examples assume that you have required mida and open-uri.
Extracting Microdata from a page
All the Microdata is extracted from a page when a new Mida::Document instance is created.
To extract all the Microdata from a webpage:
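url = 'http://example.com'
# Assign the block's result so that doc is still in scope afterwards.
# URI.open is provided by open-uri (Kernel#open no longer opens URLs on Ruby 3.0+).
doc = URI.open(url) {|f| Mida::Document.new(f, url)}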
The top-level Items will be held in an array accessible via doc.items.
To simply list all the top-level Items that have been found:
puts doc.items
Searching
If you want to search for an Item that has a specific itemtype/vocabulary this can be done with the search method.
To return all the Items that use one of Google’s Review vocabularies:
doc.search(%r{http://data-vocabulary\.org.*?review.*?}i)
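Before running a search against a real page, it can help to check which itemtype strings a pattern will pick up. The sketch below does this in plain Ruby, without Mida, assuming that search applies the pattern to each Item's itemtype. The itemtype URLs are illustrative samples, not output from a real page.

```ruby
# The pattern from the search example above.
review_pattern = %r{http://data-vocabulary\.org.*?review.*?}i

# Sample itemtype strings (illustrative only, not taken from a real page).
itemtypes = [
  'http://data-vocabulary.org/Review',
  'http://data-vocabulary.org/Review-aggregate',
  'http://data-vocabulary.org/Person',
  'http://schema.org/Review'
]

# Keep only the itemtypes that the pattern matches.
matching = itemtypes.select { |itemtype| itemtype.match?(review_pattern) }
puts matching
```

Because of the i flag the pattern is case-insensitive, so it matches Review as well as review, but only under data-vocabulary.org.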
Inspecting an Item
Each Item is a Mida::Item instance and has four main methods of interest: type, vocabulary, properties and id.
To find out the itemtype of the Item:
puts doc.items.first.type
To find out the itemid of the Item:
puts doc.items.first.id
Properties are returned as a hash of name/value pairs. Each value is an array of String or Mida::Item instances.
To see the properties of the Item:
puts doc.items.first.properties
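Since a property value can itself be an Item, printing nested data usually means walking the hash recursively. The following is a minimal sketch of that walk. To keep it runnable without the gem it uses StandInItem, a hypothetical Struct that mimics only the type, id and properties readers described above, and describe_item is a helper name invented for this example. The sample review data is made up.

```ruby
# Stand-in for Mida::Item so the sketch runs without the gem (hypothetical;
# it mirrors only the type, id and properties readers described above).
StandInItem = Struct.new(:type, :id, :properties)

# Return an array of indented lines describing an item and its properties,
# descending into any value that itself responds to #properties.
def describe_item(item, indent = 0)
  pad = '  ' * indent
  lines = ["#{pad}#{item.type}"]
  item.properties.each do |name, values|
    values.each do |value|
      if value.respond_to?(:properties)
        lines << "#{pad}  #{name}:"
        lines.concat(describe_item(value, indent + 2))
      else
        lines << "#{pad}  #{name}: #{value}"
      end
    end
  end
  lines
end

# Made-up sample data: a review whose rating property is a nested item.
rating = StandInItem.new('http://data-vocabulary.org/Rating', nil,
                         { 'value' => ['4.5'], 'best' => ['5'] })
review = StandInItem.new('http://data-vocabulary.org/Review', nil,
                         { 'itemreviewed' => ['Romeo Pizza'],
                           'rating' => [rating] })

puts describe_item(review)
```

The helper only relies on type and properties, so under the assumptions above it should work the same way when passed an Item from doc.items.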
Working with Vocabularies
Mida allows you to define vocabularies, so that input data can be constrained to match expected patterns. By default a generic vocabulary (Mida::GenericVocabulary) is registered which will match against any itemtype with any number of properties.
If you want to specify a vocabulary you create a class derived from Mida::Vocabulary. As an example the following describes a subset of Google’s Review vocabulary:
class Rating < Mida::Vocabulary
  itemtype %r{http://data-vocabulary.org/rating}i
  has_one 'best'
  has_one 'worst'
  has_one 'value'
end

class Review < Mida::Vocabulary
  itemtype %r{http://data-vocabulary.org/review}i
  has_one 'itemreviewed'
  has_one 'rating' do
    extract Rating, Mida::DataType::Text
  end
end
When you create a subclass of Mida::Vocabulary it automatically registers the Vocabulary.
Now if Mida is parsing some input and manages to match against the Review itemtype, it will only allow the specified properties and will reject any that don’t have the correct number. It will also set Item#vocabulary accordingly, e.g.
doc.items.first.vocabulary # => Review
If you want to include the properties of another vocabulary you can use include_vocabulary:
class Thing < Mida::Vocabulary
  itemtype %r{http://example.com/vocab/thing}i
  has_one 'name', 'description'
end

class Book < Mida::Vocabulary
  itemtype %r{http://example.com/vocab/book}i
  include_vocabulary Thing
  has_one 'title', 'author'
end

class Collection < Mida::Vocabulary
  itemtype %r{http://example.com/vocab/collection}i
  has_many 'item' do
    extract Thing
  end
end
In the above, if you gave a Book as an item of Collection it would be accepted, because Book includes the Thing vocabulary.
Bugs/Feature Requests
If you find a bug or want to make a feature request, please report it at the Mida project’s issue tracker on GitHub.
Licence
Copyright © 2011-2013 Lawrence Woodman. This software is licensed under the MIT Licence. Please see the file LICENCE.rdoc for details.