LucidWorks-Ruby

Ruby bindings for the REST API of the LucidWorks family of search products.

LucidWorks products are search engines that combine the open source search technologies Lucene and Solr with open source crawlers, a management UI and a REST API. The REST API provides a programmatic way to manage collections, data-sources, scheduling and many of the other objects and tasks involved in running a search engine.

Information

You can view the LucidWorks-Ruby documentation in RDoc format here:

rubydoc.info/github/lucidimagination/lucidworks-ruby/master/frames

The LucidWorks REST API is documented here:

lucidworks.lucidimagination.com/display/LWEUG/Rest+API

Bug reports

Where should people file bugs? GitHub? That implies we have open sourced this already. An email address at Lucid?

Installation

Install the gem:

gem install lucid_works

Or add it to your Gemfile, then run bundle install:

gem "lucid_works"

Show Me the Money

This single statement (note the trailing periods, which continue it across lines) connects to a LucidWorks server running on the local machine, creates a collection called “News” and a data-source called “cnn” for the cnn.com website, then starts a crawl. Copy and paste it into irb:

require 'lucid_works'

LucidWorks::Server.new("http://localhost:8888").
  create_collection(:name => 'News').
  create_datasource(:name => 'cnn',
                    :crawler => 'lucid.aperture', :type => 'web',
                    :url => 'http://cnn.com', :crawl_depth => '1').
  start_crawl!
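
The chain works because each call returns the object that the next call is sent to, and a trailing period tells Ruby (and irb) that the statement continues on the next line. The sketch below illustrates the mechanism with stand-in classes. DemoServer, DemoCollection and DemoDatasource are hypothetical, are not part of the gem, and make no network calls; the return value of start_crawl! is also an assumption made for the sketch.

```ruby
# Hypothetical stand-ins for illustration only; they are not the gem's classes.
class DemoDatasource
  attr_reader :name, :crawling

  def initialize(name)
    @name = name
    @crawling = false
  end

  # Marks the datasource as crawling and returns self (an assumption made
  # here so that the chain has a useful final value).
  def start_crawl!
    @crawling = true
    self
  end
end

class DemoCollection
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Returns the new datasource, so the next call in a chain is sent to it.
  def create_datasource(attrs)
    DemoDatasource.new(attrs[:name])
  end
end

class DemoServer
  # Returns the new collection, so the next call in a chain is sent to it.
  def create_collection(attrs)
    DemoCollection.new(attrs[:name])
  end
end

# The trailing periods tell Ruby that the statement continues on the next line.
datasource = DemoServer.new.
  create_collection(:name => 'News').
  create_datasource(:name => 'cnn').
  start_crawl!

puts datasource.name      # prints "cnn"
puts datasource.crawling  # prints "true"
```

Without the trailing periods, irb would treat the first line as a complete statement and evaluate it on its own.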

The following sections explain how it works.

Object Model

The LucidWorks object model looks something like this:

Server -+- Collection -+- Datasource -+- Status
        |              |              +- History
        |              |              +- Schedule
        |              |              +- Index
        |              |              +- Crawldata
        |              |              +- Job
        |              +- Field
        |              +- Index
        |              +- Info
        |              +- Settings
        |              +- Activity -+- Status
        |                           +- History
        |
        +- Logs -+- Index -+- Summary
        |        +- Query -+- Summary
        |
        +- Crawlers
        +- Version

This is what has been modeled so far. The actual REST API is more extensive.
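
The same tree can also be written as a nested Ruby hash, which makes it easy to enumerate every modeled object. This is a sketch for orientation only: OBJECT_MODEL and leaf_paths are hypothetical names that are not part of the gem, and the strings printed are model names, not REST URLs.

```ruby
# The tree from the diagram above, written as nested hashes.
# Leaf nodes map to empty hashes.
OBJECT_MODEL = {
  'Server' => {
    'Collection' => {
      'Datasource' => {
        'Status' => {}, 'History' => {}, 'Schedule' => {},
        'Index' => {}, 'Crawldata' => {}, 'Job' => {}
      },
      'Field' => {}, 'Index' => {}, 'Info' => {}, 'Settings' => {},
      'Activity' => { 'Status' => {}, 'History' => {} }
    },
    'Logs' => {
      'Index' => { 'Summary' => {} },
      'Query' => { 'Summary' => {} }
    },
    'Crawlers' => {},
    'Version' => {}
  }
}

# Returns every root-to-leaf path in the tree as a string.
def leaf_paths(tree, prefix = [])
  tree.flat_map do |name, children|
    path = prefix + [name]
    children.empty? ? [path.join(' > ')] : leaf_paths(children, path)
  end
end

puts leaf_paths(OBJECT_MODEL)
```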

Usage

Server

The starting point for our communication with a LucidWorks server is a LucidWorks::Server object, e.g. for a LucidWorks server running on the local machine, on the standard port:

server = LucidWorks::Server.new("http://localhost:8888")

Collections

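Collections are modeled using the LucidWorks::Collection class. LucidWorks::Server has_many :collections, so a server object provides the methods shown below. The examples from here on assume the server object is stored in @server.

Retrieve collections:

@server.collections                   -> an array of LucidWorks::Collection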

puts @server.collections.map(&:name)

@server.collection("name")            -> a single LucidWorks::Collection

Create a collection:

collection = @server.build_collection(:name => "MY_STUFF")
collection.save

or

collection = @server.create_collection(:name => "MY_STUFF")

Delete a collection:

collection.destroy

Wipe all indexed data from a collection:

collection.empty!

Collection Info

The Collection::Info class holds data about the state of a collection, such as the number of indexed documents and the size of the index.

info = @server.collection('coll1').info -> a LucidWorks::Collection::Info

info.index_num_docs  ->  12345
info.index_size      ->  "44.3 MB"

Collection Settings

The Collection::Settings class contains indexing and querying settings for the collection.

settings = @server.collection('collection1').settings -> a LucidWorks::Collection::Settings

settings.query_parser    ->  "lucid"
settings.synonym_list    ->  ["Lawyer", "Attorney", "one", "1", ...]

Field

Collection has_many :fields. The Field class models data about a collection’s field.

field = @server.collection('collection1').field('body')  -> a LucidWorks::Field

field.field_type  ->  "text_en"
field.facet       ->  false

Datasources

Collection has_many :datasources. Datasources are modeled using the LucidWorks::Datasource class. They support all the standard ORM methods, e.g.

collection.datasources       -> an array of LucidWorks::Datasource

collection.datasource(123)   -> a single LucidWorks::Datasource

datasource = collection.create_datasource(
  :crawler => 'lucid.aperture',
  :type => 'web',
  :name => "example.com",
  :url => "http://example.com/",
  :crawl_depth => 1
)

Note that create_datasource does not start a crawl of the datasource.

To start crawling a datasource:

datasource.start_crawl!

To stop a datasource crawl:

datasource.stop_crawl!

To delete all the data crawled from a data-source:

datasource.empty!

The ORM

This library implements a simple ORM (object-relational mapper) on top of the LucidWorks REST API that behaves somewhat like ActiveResource/ActiveRecord. To learn why we didn’t just use ActiveResource, see the Rationale section.

Base

LucidWorks::Base is the ORM foundation of this library. It supports many ActiveRecord-style methods. For example, given a Thing model:

class Thing < LucidWorks::Base
end

Then Thing will have the following class methods:

thing = Thing.new(:attrib => value, :parent => parent)        -> unsaved Thing

Thing.create(:attr => value, ..., :parent => parent)          -> saved Thing

Thing.find(:all, :parent => parent)                           -> Array of Thing

Thing.find(id, :parent => parent)                             -> a Thing

The ‘parent’ must be another LucidWorks::Base model or a LucidWorks::Server; this is only required when the class is used stand-alone. If the model is created/retrieved from an association, this value is set for you automatically.

thing.save                                                    -> true/false
thing.destroy
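
To make the shape of this interface concrete, here is a minimal sketch that mimics it with an in-memory store. MiniBase, its store and its id counter are hypothetical and exist only for illustration; the real LucidWorks::Base talks to the REST API, and its behavior may differ in detail (for example, in what save returns when no parent is set).

```ruby
# Illustrative sketch only: MiniBase is a hypothetical, in-memory stand-in
# for the interface described above, not the gem's implementation.
class MiniBase
  attr_reader :id, :attributes, :parent

  # One in-memory store per subclass, keyed by [parent, id].
  def self.store
    @store ||= {}
  end

  def self.next_id
    @next_id = (@next_id || 0) + 1
  end

  # Builds and saves in one step; returns the record.
  def self.create(attrs = {})
    new(attrs).tap(&:save)
  end

  # find(:all, :parent => p) -> Array; find(id, :parent => p) -> record or nil
  def self.find(what, options = {})
    parent = options.fetch(:parent)
    if what == :all
      store.select { |(owner, _id), _record| owner.equal?(parent) }.values
    else
      store[[parent, what]]
    end
  end

  def initialize(attrs = {})
    attrs = attrs.dup
    @parent = attrs.delete(:parent)
    @id = attrs.delete(:id)
    @attributes = attrs
  end

  # Returns false when no parent is set, true once the record is stored.
  def save
    return false if parent.nil?
    @id ||= self.class.next_id
    self.class.store[[parent, @id]] = self
    true
  end

  def destroy
    self.class.store.delete([parent, id])
  end
end

class Thing < MiniBase
end

parent = Object.new  # stands in for a LucidWorks::Server or a parent model
thing  = Thing.new(:attr => 'value', :parent => parent)
thing.save
saved  = Thing.create(:attr => 'other', :parent => parent)

puts Thing.find(:all, :parent => parent).size               # prints 2
puts Thing.find(saved.id, :parent => parent).equal?(saved)  # prints true
saved.destroy
puts Thing.find(:all, :parent => parent).size               # prints 1
```

Passing :parent explicitly, as here, corresponds to the stand-alone use described above; association methods would fill it in for you.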

Has_many associations

The has_many association is used to associate a resource with another collection resource. Given:

class Thing < LucidWorks::Base
  has_many :others
end

Then

thing.others                            -> array of Other

thing.other(id)                         -> an Other

thing.build_other(:attr => val, ...)    -> an unsaved Other

thing.create_other(:attr => val, ...)   -> saved Other
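
A class macro like has_many can generate such methods with define_method. The following sketch shows one possible approach for three of them (the collection reader, the finder by id, and the creator). Everything inside HasManySketch is hypothetical: children are kept in memory rather than fetched over REST, and the singularization is deliberately naive.

```ruby
# Illustrative sketch only. HasManySketch and everything in it are
# hypothetical; this is not how the gem implements associations.
module HasManySketch
  class Model
    attr_reader :id, :parent

    def initialize(attrs = {})
      @id     = attrs[:id]
      @parent = attrs[:parent]
    end

    # Class macro: defines the reader, finder and creator methods.
    def self.has_many(resources)
      plural   = resources.to_s     # "others"
      singular = plural.chomp('s')  # "other" (naive singularization)

      # thing.others -> array of children created so far
      define_method(plural) do
        @children ||= Hash.new { |hash, key| hash[key] = [] }
        @children[plural]
      end

      # thing.other(id) -> the child with that id, or nil
      define_method(singular) do |id|
        public_send(plural).find { |child| child.id == id }
      end

      # thing.create_other(attrs) -> new child with its parent set to thing
      define_method("create_#{singular}") do |attrs = {}|
        child_class = HasManySketch.const_get(singular.capitalize)
        child = child_class.new(attrs.merge(:parent => self))
        public_send(plural) << child
        child
      end
    end
  end

  class Other < Model
  end

  class Thing < Model
    has_many :others
  end
end

thing = HasManySketch::Thing.new(:id => 1)
other = thing.create_other(:id => 'a')

puts thing.others.size               # prints 1
puts thing.other('a').equal?(other)  # prints true
puts other.parent.equal?(thing)      # prints true
```

Note how the child receives its parent automatically, which is why :parent only has to be passed when a model is used stand-alone.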

Has_one associations

The has_one association is used to associate a resource with another singleton resource that is transient, i.e. can be created and destroyed.

class Thing < LucidWorks::Base
  has_one :whatnot
end

class Whatnot < LucidWorks::Base
  self.singleton = true
  belongs_to :thing
end

Then

thing.whatnot                           -> a retrieved Whatnot

thing.build_whatnot                     -> an unsaved Whatnot

Belongs_to associations

The belongs_to association augments the model with a method to access its parent. Given:

class Whatnot < LucidWorks::Base
  self.singleton = true
  belongs_to :thing
end

Then:

whatnot.thing         -> A Thing
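
The has_one and belongs_to macros can be sketched in the same spirit. The code below is a hypothetical, in-memory illustration: SingletonSketch and its classes are not part of the gem, the singleton setting is not modeled, and returning nil from thing.whatnot until the child has been saved is an assumption of this sketch that stands in for retrieval from the server.

```ruby
# Illustrative sketch only. SingletonSketch and its classes are
# hypothetical and keep everything in memory.
module SingletonSketch
  class Model
    attr_reader :parent

    def initialize(attrs = {})
      @parent = attrs[:parent]
      @saved  = false
    end

    def save
      @saved = true
    end

    def saved?
      @saved
    end

    # belongs_to :thing defines whatnot.thing, which returns the parent.
    def self.belongs_to(name)
      define_method(name) { parent }
    end

    # has_one :whatnot defines thing.whatnot and thing.build_whatnot.
    def self.has_one(name)
      ivar = "@#{name}"
      # Resolved lazily, so the child class may be defined later.
      child_class = lambda { SingletonSketch.const_get(name.to_s.capitalize) }

      # thing.whatnot -> the child once it has been saved, otherwise nil
      define_method(name) do
        child = instance_variable_get(ivar)
        child if child && child.saved?
      end

      # thing.build_whatnot -> an unsaved child whose parent is thing
      define_method("build_#{name}") do
        instance_variable_set(ivar, child_class.call.new(:parent => self))
      end
    end
  end

  class Thing < Model
    has_one :whatnot
  end

  class Whatnot < Model
    belongs_to :thing
  end
end

thing   = SingletonSketch::Thing.new
whatnot = thing.build_whatnot
p thing.whatnot                     # prints nil, built but not yet saved
whatnot.save
puts thing.whatnot.equal?(whatnot)  # prints true
puts whatnot.thing.equal?(thing)    # prints true
```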

For more information on associations, see LucidWorks::Associations::ClassMethods.

Schema

A class may have a schema defined as follows:

class ThingWithSchema < LucidWorks::Base
  schema do
    attribute :string1,  :string
    attribute :bool1,    :boolean
    attribute :integer1, :integer
    attributes :string2, :string3, :string4
    attributes :bool2,   :bool3, :type => :boolean
    attributes :int2,    :int3,  :type => :integer
    attribute :string_with_values, :values => ['one', 'two']
    attribute :dontsendme, :omit_during_update => true
    attribute :sendnull,   :string, :nil_when_blank => true
  end
end

A class with a schema may have validations applied to its attributes. The default attribute type is :string. See LucidWorks::Schema for more details.
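
To show what a schema block can buy you, here is a small sketch of a schema DSL with type coercion and an allowed-values check. SchemaSketch is hypothetical and far simpler than LucidWorks::Schema: it handles only the type and :values options, and its coercion rules are assumptions made for the example.

```ruby
# Illustrative sketch only: a tiny schema DSL, not the gem's implementation.
module SchemaSketch
  class Schema
    attr_reader :definitions

    def initialize
      @definitions = {}
    end

    # attribute :name, :type, :option => value
    # The type defaults to :string.
    def attribute(name, type = :string, options = {})
      if type.is_a?(Hash)  # called as: attribute :name, :option => value
        options = type
        type = :string
      end
      @definitions[name] = options.merge(:type => type)
    end

    # attributes :a, :b, :type => :boolean
    def attributes(*names)
      options = names.last.is_a?(Hash) ? names.pop : {}
      names.each { |name| attribute(name, options.fetch(:type, :string)) }
    end
  end

  class Model
    # Evaluates the block against the class's Schema object.
    def self.schema(&block)
      @schema ||= Schema.new
      @schema.instance_eval(&block) if block
      @schema
    end

    def initialize(attrs = {})
      @values = {}
      attrs.each { |name, value| write(name, value) }
    end

    def [](name)
      @values[name]
    end

    # Stores a value after coercing it to the declared type.
    def write(name, value)
      definition = self.class.schema.definitions.fetch(name)
      @values[name] = coerce(value, definition[:type])
    end

    # Valid when every attribute with a :values list is unset or holds
    # one of the listed values.
    def valid?
      self.class.schema.definitions.all? do |name, definition|
        allowed = definition[:values]
        allowed.nil? || @values[name].nil? || allowed.include?(@values[name])
      end
    end

    private

    def coerce(value, type)
      case type
      when :integer then Integer(value)
      when :boolean then [true, 'true', '1', 1].include?(value)
      else value.to_s
      end
    end
  end

  class ThingWithSchema < Model
    schema do
      attribute  :integer1, :integer
      attributes :bool2, :bool3, :type => :boolean
      attribute  :string_with_values, :values => ['one', 'two']
    end
  end
end

thing = SchemaSketch::ThingWithSchema.new(
  :integer1 => '42', :bool2 => 'true', :string_with_values => 'three'
)
p thing[:integer1]  # prints 42
p thing[:bool2]     # prints true
p thing.valid?      # prints false, because 'three' is not an allowed value
```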

Rationale

Originally this library started out as a set of ActiveResource classes. This required a lot of hacking, because ActiveResource makes many assumptions about the way a REST API should work (it is essentially designed to talk to Rails applications), and many REST APIs, including this one, don’t conform to those rules. Among the changes required to ActiveResource were:

Don’t require attributes always be nested inside :resource => on create and update.
Allow client-side generation of a resource ID during create.
Support has_one and has_many associations.

Eventually, however, this strategy hit a wall. We also needed the following features:

The ability to talk to the same API on more than one server simultaneously.
Support for file uploads using multipart POST.

Given the design of ActiveResource, these would have been expensive to implement, and it became simpler to write a small ORM by combining ActiveModel and RestClient.

License

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this software except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.