Tripleloop

A DSL for extracting data from hash-like objects into RDF statements (i.e. triples or quads).

Usage

Start by creating some extractor classes. Each extractor maps one or several document fragments to RDF statments.

class ArticleCoreExtractor < Tripleloop::Extractor
  bind(:doi) { |doc| RDF::DOI.send(doc[:doi]) }

  map(:title)          { |title|   [doi, RDF::DC11.title, title, RDF::NPGG.articles] }
  map(:published_date) { |date |   [doi, RDF::DC11.date, Date.parse(date), RDF::NPGG.articles] }
  map(:product)        { |product| [doi, RDF::NPG.product, RDF::NPGP.nature, RDF::NPGG.articles] }
end

class SubjectsExtractor < Tripleloop::Extractor
  bind(:doi) { |doc| RDF::DOI.send(doc[:doi]) }

  map(:subjects) { |subjects|
    subjects.map { |s|
      [doi, RDF::NPG.hasSubject, RDF::NPGS.send(s) ]
    }
  }
end

Once defined, extractors can be composed into a DocumentProcessor class.

class NPGProcessor < Tripleloop::DocumentProcessor
  extractors :article_core, :subjects
end

The processor can then be fed with a collection of hash like documents and return RDF data grouped by extractor name.

data = NPGProcessor.batch_process(documents)
=> { :article_core => [[<RDF::URI:0x00000002651ce0(http://dx.doi.org/10.1038/481241e)>, 
                        <RDF::URI:0x1b0c060(http://purl.org/dc/elements/1.1/title)>, 
                       "Developmental biology: Watching cells die in real time"],...], 
     :subjects => [...] }

Notice that the output retuned by the batch_process method is still a plain ruby data structure, and not an instance of RDF::Statement. The actual job of instantiating RDF statements and writing them to disc is in fact responsability of the Tripleloop::RDFWriter class, which can be used as follows:

Tripleloop::RDFWriter.new(data, :dataset_path => Pathname.new("my-datasets")).write

This will create the following two files:

  • my-dataset/article_core.nq
  • my-dataset/subjects.nq

When #write method is executed, RDFWriter will internally generate RDF triples, delegating the RDF serialisation job to RDF.rb's RDF::Writer. The only logic involved in the implementation of Tripleloop::RDFWriter#write concerns the assignment of the right RDF serialisation format and file extension. When all the RDF statements generated by an extractor do specify also a graph (as in the example above), the writer will use the RDF::NQuads::Writer, falling back to RDF::NTriples::Writer otherwise.