Infoboxer
Infoboxer is a pure-Ruby Wikipedia (and generic MediaWiki) client and parser, targeting information extraction (hence the name).
It can be useful in tasks like:
- get a plaintext abstract of an article (paragraphs before first heading);
- get structured data variables from a page's infobox;
- list a page's sections and count paragraphs, images and tables in them;
- convert some huge "comparison table" to data;
- and much, much more!
The whole idea is: you get any Wikipedia page as a parsed tree with an obvious structure, you can navigate that tree easily, and you have a bunch of high-level helper methods, so typical information extraction tasks should be super easy, one-liners in the best cases.
(For those already thinking "Why should you do this, we already have DBPedia?" -- please read the "Reasons" page in our wiki.)
Showcase
Infoboxer.wikipedia.
  get('Breaking Bad (season 1)').
  sections('Episodes').templates(name: 'Episode table').
  fetch('episodes').templates(name: /^Episode list/).
  fetch_hashes('EpisodeNumber', 'EpisodeNumber2', 'Title', 'ShortSummary')
# => [{"EpisodeNumber"=>#<Var(EpisodeNumber): 1>, "EpisodeNumber2"=>#<Var(EpisodeNumber2): 1>, "Title"=>#<Var(Title): Pilot>, "ShortSummary"=>#<Var(ShortSummary): Walter White, a 50-year old che...>},
# {"EpisodeNumber"=>#<Var(EpisodeNumber): 2>, "EpisodeNumber2"=>#<Var(EpisodeNumber2): 2>, "Title"=>#<Var(Title): Cat's in the Bag...>, "ShortSummary"=>#<Var(ShortSummary): Walt and Jesse try to dispose o...>},
# ...and so on
Do you feel it now?
You can also take a look at Showcase.
Usage
Install gem
Install it as usual: gem 'infoboxer' in your Gemfile, then bundle install.
Or just [sudo] gem install infoboxer if you prefer.
Grab the page
# From English Wikipedia
page = Infoboxer.wikipedia.get('Argentina')
# or
page = Infoboxer.wp.get('Argentina')
# From other language Wikipedia:
page = Infoboxer.wikipedia('fr').get('Argentina')
# From any wiki with the same engine:
page = Infoboxer.wiki('http://companywiki.com').get('Our Product')
See more examples and options at Retrieving pages
Play with page
Basically, a page is a tree of Nodes; you can think of it as a kind of DOM.
So, you can navigate it:
# Simple traversing and inspect
node = page.children.first.children.first
node.to_tree
node.to_text
# Various lookups
page.lookup(:Template, name: /^Infobox/)
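To make the DOM analogy concrete, here is a toy model of such a tree in plain Ruby. It is not Infoboxer's implementation: ToyNode, its fields and its lookup method are hypothetical names, used only to sketch how a type-plus-attribute lookup over a node tree can work.

```ruby
# Toy model of a parsed page. NOT Infoboxer's real classes:
# ToyNode and everything on it are hypothetical, for illustration only.
class ToyNode
  attr_reader :type, :attrs, :children

  def initialize(type, attrs = {}, children = [])
    @type = type
    @attrs = attrs
    @children = children
  end

  # Depth-first search: collect every descendant of the given type
  # whose attributes match all conditions (=== accepts String or Regexp).
  def lookup(type, **conditions)
    children.flat_map do |child|
      hit = child.type == type &&
            conditions.all? { |key, pattern| pattern === child.attrs[key] }
      (hit ? [child] : []) + child.lookup(type, **conditions)
    end
  end
end

page = ToyNode.new(:Page, { title: 'Argentina' }, [
  ToyNode.new(:Template, { name: 'Infobox country' }),
  ToyNode.new(:Paragraph, {}, [
    ToyNode.new(:Template, { name: 'lang-es' }),
    ToyNode.new(:Text, { value: 'Some text.' })
  ])
])

infoboxes = page.lookup(:Template, name: /^Infobox/)
puts infoboxes.map { |node| node.attrs[:name] }.inspect
```

Matching with === is what lets the same call take either an exact string or a regular expression as a condition.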
On top of the basic navigation, Infoboxer adds some useful shortcuts for convenience and brevity, which allow things like this:
page.section('Episodes').tables.first
To put it all in one piece, also take a look at Data extraction tips and tricks.
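The "paragraphs before first heading" idea from the task list, and section-based shortcuts like the one above, both come down to grouping a page's top-level nodes by the headings between them. A minimal plain-Ruby sketch of that grouping follows. It works on hypothetical [kind, text] pairs with a hypothetical split_sections helper, not on Infoboxer's real node classes.

```ruby
# Toy data: a flat list of [kind, text] pairs standing in for parsed nodes.
# The data shape and the helper name are hypothetical, not Infoboxer's API.
NODES = [
  [:paragraph, 'First intro paragraph.'],
  [:paragraph, 'Second intro paragraph.'],
  [:heading,   'Episodes'],
  [:table,     'Episode table'],
  [:heading,   'Reception'],
  [:paragraph, 'A paragraph about reception.']
].freeze

# Split nodes into an intro (everything before the first heading)
# and a hash of section title => nodes that follow that heading.
def split_sections(nodes)
  intro = nodes.take_while { |kind, _| kind != :heading }
  sections = {}
  current = nil
  nodes.drop(intro.size).each do |kind, text|
    if kind == :heading
      current = sections[text] = []
    else
      current << [kind, text]
    end
  end
  [intro, sections]
end

intro, sections = split_sections(NODES)

# Plaintext abstract: the intro paragraphs joined together.
puts intro.map { |_, text| text }.join("\n\n")

# Per-section counts of each node kind.
sections.each do |title, body|
  counts = body.group_by(&:first).transform_values(&:size)
  puts "#{title}: #{counts}"
end
```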
infoboxer executable
Just try the infoboxer command.
Without any options, it starts an IRB session with infoboxer required and included into the main namespace.
With the -w option, it provides a shortcut to the MediaWiki instance you want.
Like this:
$ infoboxer -w https://en.wikipedia.org/w/api.php
> get('Argentina')
=> #<Page(title: "Argentina", url: "https://en.wikipedia.org/wiki/Argentina"): ....
You can also use shortcuts like infoboxer -w wikipedia for common wikis (and, just for fun, infoboxer -wikipedia also).
Advanced topics
- Reasons for Infoboxer creation;
- Parsing quality (TL;DR: very good, but not ideal);
- Performance (TL;DR: 0.1-0.4 sec for parsing hugest pages);
- Localization (TL;DR: For now, you'll need some work to use Infoboxer's most advanced features with non-English or non-Wikimedia wikis; basic and mid-level features always work);
- If you plan to use Wikipedia or sister projects data in production, please consider Wikipedia terms and conditions.
Compatibility
As of now, Infoboxer is reported to be compatible with any MRI Ruby since 2.0.0 (1.9.3 was supported previously; support was dropped in Infoboxer 0.2.0). In Travis CI tests, JRuby fails due to a bug in old Java 7/Java 8 SSL certificate support (see here), and Rubinius fails 3 specs out of 500 for reasons that have not been investigated yet.
Therefore, those Ruby implementations are excluded from the Travis config, though they may still work for you.
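If you want your own script to flag an untested runtime, a small guard along these lines is one option. This is only a sketch: tested_runtime? and MINIMUM_MRI are hypothetical names and not part of Infoboxer, and the check simply encodes the compatibility statement above (MRI, version 2.0.0 or later).

```ruby
# Hypothetical guard, not part of Infoboxer: warn when running outside
# the configuration described as tested (MRI Ruby 2.0.0 or later).
MINIMUM_MRI = Gem::Version.new('2.0.0')

# RUBY_ENGINE is 'ruby' on MRI, 'jruby' on JRuby, 'rbx' on Rubinius.
def tested_runtime?(engine = RUBY_ENGINE, version = RUBY_VERSION)
  engine == 'ruby' && Gem::Version.new(version) >= MINIMUM_MRI
end

unless tested_runtime?
  warn 'Infoboxer is not CI-tested on this Ruby; it may still work.'
end
```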
Links
License
MIT.